# Xbox TrueAchievements Web Scraping

This notebook outlines the processes and methods used to scrape data from TrueAchievements.com. The libraries used include NumPy, pandas, and Selenium.

In [1]:
from src.py.manifest import *
from src.py.scrape_main import *
from src.py.rescrape.rescrape_gamer_achievements import *
from src.py.rescrape.rescrape_gamer_games import *
from src.py.leaderboard import *
from src.py.anonymize import *

#from src.py.sql.leaderboard import *
#import sqlite3

## Leaderboard Operations

First, we use a web scraper to extract data for all players on the leaderboard. Users can choose the start and end pages for extraction and specify how often a CSV file is saved during scraping. As the scraper progresses, data is saved to CSV files at the specified interval, so a long run leaves usable output behind along the way. Each CSV file contains the data collected up to that point. This method is primarily used to collect the user profile URLs that are scraped later, since analysis based on individual profiles will be more accurate.
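The checkpointing behavior described above can be sketched as follows. This is a minimal illustration, not the project's `scrape_save_pagerange` implementation: the function name `scrape_with_checkpoints`, the `scrape_page` callback, the output filename pattern, and the inclusive `endpage` are all assumptions, and the standard-library `csv` module is used to keep the sketch dependency-free.

```python
import csv
from pathlib import Path


def write_rows(path, rows):
    """Write a list of dict rows to a CSV file, using the first row's keys as the header."""
    fieldnames = list(rows[0]) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def scrape_with_checkpoints(scrape_page, startpage, endpage, saveafter, out_dir):
    """Scrape pages startpage..endpage (inclusive, by assumption).

    `scrape_page` is a hypothetical callback that takes a page number and
    returns a list of dict rows. A cumulative CSV is written every
    `saveafter` pages and once more at the final page.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    written = []
    for page in range(startpage, endpage + 1):
        rows.extend(scrape_page(page))
        pages_done = page - startpage + 1
        if pages_done % saveafter == 0 or page == endpage:
            # Hypothetical filename pattern: first page and last page covered.
            path = out / f"leaderboard_{startpage}_{page}.csv"
            write_rows(path, rows)  # rows are cumulative, never reset
            written.append(path)
    return written
```

Because each file holds everything collected so far, the most recent file alone is enough to resume analysis after an interrupted run.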

Additionally, the project provides a method to merge individual CSV files into a single comprehensive CSV, creating a unified dataset ready for further analysis or integration into data pipelines. 

In [10]:
#scrape_save_pagerange(startpage = 18200, endpage = 18740, saveafter = 100)
#combine_leaderboard_csv_files()
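As a rough sketch of the merge step, the function below reads every CSV in a folder and writes one combined file. It is not the project's `combine_leaderboard_csv_files`: the name `combine_csv_files`, its parameters, and the choice to drop exact duplicate rows (useful when the checkpoint files overlap) are assumptions, and it expects well-formed CSVs with a header row.

```python
import csv
from pathlib import Path


def combine_csv_files(in_dir, out_path, pattern="*.csv"):
    """Merge all CSV files in `in_dir` into `out_path`; return the row count written."""
    out_path = Path(out_path)
    fieldnames = []   # union of all headers, in first-seen order
    seen = set()      # fingerprints of rows already kept
    combined = []
    for path in sorted(Path(in_dir).glob(pattern)):
        if path.resolve() == out_path.resolve():
            continue  # do not read a previous combined output back in
        with open(path, newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                key = tuple(sorted(row.items()))
                if key not in seen:  # skip rows repeated across files
                    seen.add(key)
                    combined.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(combined)
    return len(combined)
```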

## Gamer Profile Operations

We have two methods to extract an individual gamer's information. The first lets users specify the sample size and choose whether to scrape achievement data or game history data.

The second method focuses on scraping the locations of all gamers from their profiles, whenever available. 

In [None]:
#scrape_random_gamers_data(sample_size = 1, data_type="achievements")  # Scrape random gamer achievements

In [None]:
#scrape_random_gamers_data(sample_size = 5, data_type="games")  # Scrape random gamer games

In [None]:
#scrape_all_gamer_locations()
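The sampling step behind the first method might look something like the sketch below, which draws a random sample of profile URLs without replacement. The function name `sample_profile_urls` and the optional `seed` parameter are hypothetical and only illustrate the idea; the project's `scrape_random_gamers_data` may select gamers differently.

```python
import random


def sample_profile_urls(urls, sample_size, seed=None):
    """Return up to `sample_size` distinct profile URLs chosen at random."""
    rng = random.Random(seed)    # a seed makes the sample reproducible
    unique = sorted(set(urls))   # deduplicate; sorting keeps seeded runs stable
    return rng.sample(unique, min(sample_size, len(unique)))
```

Sampling without replacement ensures no gamer is scraped twice in one run, and capping at the population size avoids an error when fewer profiles are available than requested.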

## Manifest Gamer Operations

This method updates the manifest CSV, which lists the profiles, their file directory references, and when they were last scraped.

In [None]:
update_gamer_manifest()
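To make the manifest idea concrete, here is a minimal sketch that builds one record per gamer from the files on disk. Everything about the layout is an assumption: files are taken to be named `<gamertag>_achievements.csv` and `<gamertag>_games.csv`, the column names are hypothetical, and the file modification time stands in for the "last scraped" timestamp. The real `update_gamer_manifest` may work differently.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(achievements_dir, games_dir):
    """Return a list of manifest rows (dicts), one per gamer, sorted by gamer."""
    manifest = {}
    sources = (("achievements", achievements_dir), ("games", games_dir))
    for kind, directory in sources:
        suffix = f"_{kind}.csv"
        for path in Path(directory).glob(f"*{suffix}"):
            gamer = path.name[: -len(suffix)]  # filename prefix, assumed to be the gamertag
            entry = manifest.setdefault(
                gamer,
                {"gamer": gamer, "achievements_file": "", "games_file": "", "last_scraped": ""},
            )
            entry[f"{kind}_file"] = str(path)
            # Assumption: file modification time approximates the scrape time.
            stamp = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            entry["last_scraped"] = max(entry["last_scraped"], stamp.isoformat(timespec="seconds"))
    return sorted(manifest.values(), key=lambda e: e["gamer"])
```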

## Incomplete Profile Operations

These methods scrape the profiles of those missing either games or achievements so that the manifest has a complete set of profiles.

In [None]:
scrape_incomplete_profiles_achievements()

In [None]:
scrape_incomplete_profiles_games()
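One way to identify which profiles need this treatment is to look for manifest rows with a missing file reference. The sketch below assumes the manifest has been loaded as a list of dicts with hypothetical `gamer`, `achievements_file`, and `games_file` columns; these names are illustrative and not taken from the project.

```python
def find_incomplete(manifest_rows):
    """Group gamers by the data type they are missing."""
    missing = {"achievements": [], "games": []}
    for row in manifest_rows:
        for kind in missing:
            # An empty or absent file reference means that data type was never scraped.
            if not row.get(f"{kind}_file"):
                missing[kind].append(row["gamer"])
    return missing
```

Each list can then be passed to the matching scraper so that only the missing data type is fetched.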

## Rescrape Profile Operations

These methods update a profile's achievements or games when the stored data is out of date, either in general or relative to the rest of the dataset.

In [None]:
#rescrape_profiles_achievements()

In [None]:
#rescrape_profiles_games()
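A simple way to decide which profiles are out of date is to compare each last-scraped timestamp with a cutoff. The sketch below assumes manifest rows are dicts with hypothetical `gamer` and `last_scraped` fields, the latter an ISO 8601 string with a UTC offset; the name `find_stale` and the age threshold are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def find_stale(manifest_rows, max_age_days, now=None):
    """Return gamers whose last scrape is older than `max_age_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        row["gamer"]
        for row in manifest_rows
        if datetime.fromisoformat(row["last_scraped"]) < cutoff
    ]
```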

## Manifest Achievement Operations

In [None]:
#rnd_achievements = read_random_gamers(200, "_achievements.csv")
#update_achievement_manifest(rnd_achievements)

## Check for Duplicates

In [None]:
check_duplicates_all_profiles("./data/gamer/achievements/", metric_type="achievements")
check_duplicates_all_profiles("./data/gamer/games/", metric_type="games")
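For illustration, a duplicate check over a folder of profile CSVs could be sketched as below. This is not the project's `check_duplicates_all_profiles`: the helper names are hypothetical, a "duplicate" is assumed to mean a data row that is identical in every column, and each file is assumed to start with a header row.

```python
import csv
from collections import Counter
from pathlib import Path


def find_duplicate_rows(csv_path):
    """Return {row_tuple: count} for data rows that appear more than once."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        counts = Counter(tuple(row) for row in reader)
    return {row: n for row, n in counts.items() if n > 1}


def check_directory(directory):
    """Return {filename: number of surplus rows} for files containing duplicates."""
    report = {}
    for path in sorted(Path(directory).glob("*.csv")):
        dupes = find_duplicate_rows(path)
        if dupes:
            report[path.name] = sum(n - 1 for n in dupes.values())
    return report
```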

## Anonymize Filenames in Raw Data Folder

Here we replace the filename prefix of each CSV file, which contains the gamertag, with a 6-character string of uppercase letters and numbers. This lets us track the data of gamers within the sample anonymously.

In [3]:
nonanon_directory = "./data/gamer/raw_data/non-anonymous/"
anon_directory = "./data/gamer/raw_data/anonymous/"
data_types = ["achievements", "games", "metrics"]

# Anonymize files for achievements, games, and metrics
#id_mapping = anonymize_filenames_in_directory(nonanon_directory, anon_directory, data_types)

# Process anonymized files for achievements, games, and metrics
#process_anonymized_filenames(anon_directory, data_types, id_mapping)
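The renaming step can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's `anonymize_filenames_in_directory`: it assumes each data type lives in its own subfolder, that files are named `<gamertag>_<data_type>.csv`, and that anonymized copies are written to a parallel folder. The helper names are hypothetical.

```python
import secrets
import shutil
import string
from pathlib import Path

ALPHABET = string.ascii_uppercase + string.digits


def new_anon_id(taken, length=6):
    """Generate a random ID of uppercase letters and digits not already in `taken`."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in taken:
            return candidate


def anonymize_filenames(src_dir, dst_dir, data_types):
    """Copy files to `dst_dir` with the gamertag prefix replaced; return {gamertag: id}."""
    id_mapping = {}
    for data_type in data_types:
        suffix = f"_{data_type}.csv"
        target = Path(dst_dir) / data_type
        target.mkdir(parents=True, exist_ok=True)
        for path in sorted((Path(src_dir) / data_type).glob(f"*{suffix}")):
            gamertag = path.name[: -len(suffix)]
            if gamertag not in id_mapping:
                # Reuse one ID per gamer so their files stay linked across data types.
                id_mapping[gamertag] = new_anon_id(set(id_mapping.values()))
            shutil.copyfile(path, target / f"{id_mapping[gamertag]}{suffix}")
    return id_mapping
```

Keeping a single mapping across all data types is what allows a gamer's achievements, games, and metrics to be joined later without exposing the gamertag. The mapping itself is sensitive and should be stored apart from the anonymized data.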

## Anonymize IDs within Files

Here we go through the newly anonymized files and remove all instances of the user ID attached to each one. This completes the process of making all data anonymous.

In [None]:
# Remove user IDs attached to achievement files
process_files_in_directory(f'{anon_directory}{data_types[0]}')

In [None]:
# Remove user IDs attached to game files
process_files_in_directory(f'{anon_directory}{data_types[1]}')
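As one possible shape for this cleanup, the sketch below removes a named column from every CSV in a folder and rewrites the files in place. How the user ID is actually stored in the scraped files is not shown in this notebook, so the column-based approach, the function name `drop_column_in_directory`, and the `column` parameter are all assumptions; the project's `process_files_in_directory` may do something different.

```python
import csv
from pathlib import Path


def drop_column_in_directory(directory, column):
    """Remove `column` from every CSV in `directory`; return how many files changed."""
    changed = 0
    for path in sorted(Path(directory).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            fieldnames = reader.fieldnames or []
            rows = list(reader)
        if column not in fieldnames:
            continue  # nothing to remove in this file
        kept = [name for name in fieldnames if name != column]
        with open(path, "w", newline="", encoding="utf-8") as fh:
            # extrasaction="ignore" silently drops the removed column's values.
            writer = csv.DictWriter(fh, fieldnames=kept, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        changed += 1
    return changed
```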