### Data Loading

The scraped data was saved as dictionaries and exported as .pkl files, which can be found in the data folder. Each file name indicates the platform and the feature stored in that file (for example, psn_Kills.pkl).

Goal: Load all the data files into one master data frame.

See https://stackoverflow.com/questions/30898038/python-run-script-on-multiple-files for how to run a script on multiple files.

In [1]:
#libraries
import pickle
import pandas as pd
from tqdm import tqdm

Note: the data was scraped from two different websites.
- files starting with 'all' were scraped from **cdst**
- files starting with 'psn', 'xbl', or 'battlenet' were scraped from **cdtr**

The two sources may need to be joined in different ways. Let's experiment with cdtr first.
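
As a rough illustration of the naming rule above, the sketch below maps a file name to its source site by prefix. The function name `source_site` and the file name `all_example.pkl` are hypothetical and used only for illustration; the prefixes are the ones listed in the note.

```python
from pathlib import Path

# Prefixes taken from the note above; the helper itself is an illustrative assumption.
CDTR_PREFIXES = ('psn', 'xbl', 'battlenet')


def source_site(filename):
    """Return 'cdst' or 'cdtr' based on the file-name prefix, or None if unrecognised."""
    stem = Path(filename).stem
    if stem.startswith('all'):
        return 'cdst'
    if stem.startswith(CDTR_PREFIXES):
        return 'cdtr'
    return None


# psn_Kills.pkl and psn_Wins.pkl exist in the data folder; all_example.pkl is made up.
for name in ['psn_Kills.pkl', 'psn_Wins.pkl', 'all_example.pkl']:
    print(name, '->', source_site(name))
```

Grouping files this way would let each source get its own join logic later on.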

In [3]:
#getting psn Kills from cdtr
with open('../data/psn_Kills.pkl', 'rb') as file:
    psn_Kills = pickle.load(file)

In [4]:
#how many players?
len(psn_Kills)

831249

In [6]:
#let's get psn Wins as well
with open('../data/psn_Wins.pkl', 'rb') as file:
    psn_Wins = pickle.load(file)

In [7]:
#how many players?
len(psn_Wins)

635033

psn_Wins has 196,216 fewer players than psn_Kills (635,033 vs 831,249), so at least that many players in psn_Kills have no entry in psn_Wins. Not ideal.
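
Comparing lengths only gives a lower bound on the mismatch. A minimal sketch of a more precise check, using set operations on the dictionary keys, is shown below. The helper name `compare_players` and the toy data are assumptions for illustration; in the notebook the arguments would be `psn_Kills` and `psn_Wins`.

```python
def compare_players(kills, wins):
    """Count players that appear in only one of the two dictionaries, or in both."""
    kills_players = set(kills)
    wins_players = set(wins)
    return {
        'only_in_kills': len(kills_players - wins_players),
        'only_in_wins': len(wins_players - kills_players),
        'in_both': len(kills_players & wins_players),
    }


# Toy data standing in for psn_Kills and psn_Wins (player names are made up).
toy_kills = {
    'player_a': {'platform': 'psn', 'Kills': '1,200'},
    'player_b': {'platform': 'psn', 'Kills': '45'},
    'player_c': {'platform': 'psn', 'Kills': '7'},
}
toy_wins = {
    'player_a': {'platform': 'psn', 'wins': 3, 'mp': 10},
    'player_d': {'platform': 'psn', 'wins': 1, 'mp': 4},
}

print(compare_players(toy_kills, toy_wins))
```

This also reveals players who have wins but no kills, which a simple difference in lengths would hide.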

Let's try to merge them.

In [8]:
#using player TTV-Scop3s_ as an example
psn_Wins['TTV-Scop3s_']

{'platform': 'psn', 'wins': 1225, 'mp': 2785}

In [9]:
psn_Kills['TTV-Scop3s_']

{'platform': 'psn', 'Kills': '26,580'}

In [10]:
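#get a copy of psn_Wins (copy each player's dict too, so adding kills later does not modify psn_Wins)
df_dict = {player: stats.copy() for player, stats in psn_Wins.items()}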

In [11]:
#need to add kills to wins

for player, stat in tqdm(psn_Kills.items()):
    #check that the player is already in df_dict and the platforms match
    #(only psn for now, but the platform check will be needed once other platforms are added)
    if player in df_dict and df_dict[player]['platform'] == stat['platform']:
        #add kills to the player's existing entry
        df_dict[player]['Kills'] = stat['Kills']
    #otherwise (new player, or same name on a different platform) create a kills-only entry
    #caution: with a different platform this overwrites the player's existing entry
    else:
        df_dict[player] = {'platform': stat['platform'], 'Kills': stat['Kills']}

100%|██████████████████████████████████████████████████████████████████████| 831249/831249 [00:01<00:00, 744169.87it/s]


In [12]:
#check if kills were added to our example player
df_dict['TTV-Scop3s_']

{'platform': 'psn', 'wins': 1225, 'mp': 2785, 'Kills': '26,580'}
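
Note that 'Kills' is stored as a string with a thousands separator, while 'wins' and 'mp' are integers. A minimal sketch of one way to normalise such values during data cleaning is shown below. The helper name `parse_count` is hypothetical, and returning `None` for unparseable values is an assumption about how missing data should be handled.

```python
def parse_count(value):
    """Convert a count such as '26,580' to an int; return None if it cannot be parsed."""
    if isinstance(value, int):
        return value
    try:
        return int(str(value).replace(',', ''))
    except ValueError:
        return None


# The example record shown above
record = {'platform': 'psn', 'wins': 1225, 'mp': 2785, 'Kills': '26,580'}
record['Kills'] = parse_count(record['Kills'])
print(record)
```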

##### Next Steps

- check how to add cdst data to df_dict
- build script to go through each pkl file and add it to a master dictionary
- load dictionary containing all the player data into a pandas dataframe for data cleaning
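
The second step could be sketched roughly as follows. This assumes every file has the same shape as the cdtr files above (`{player: {'platform': ..., <feature>: ...}}`), which has not yet been checked for cdst. The function name `merge_pickles`, the choice to key records by `(player, platform)`, and the toy data are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def merge_pickles(paths):
    """Merge per-feature pickle files into one dict keyed by (player, platform)."""
    master = {}
    for path in paths:
        with open(path, 'rb') as file:
            data = pickle.load(file)
        for player, stats in data.items():
            # Keying on (player, platform) keeps same-named players on
            # different platforms from overwriting each other.
            key = (player, stats['platform'])
            # setdefault creates the record on first sight; update adds new features.
            master.setdefault(key, {}).update(stats)
    return master


# Demo with toy files written to a temporary folder (player names are made up).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    toy_files = {
        'psn_Wins.pkl': {'player_a': {'platform': 'psn', 'wins': 3, 'mp': 10}},
        'psn_Kills.pkl': {
            'player_a': {'platform': 'psn', 'Kills': '1,200'},
            'player_b': {'platform': 'psn', 'Kills': '45'},
        },
    }
    for name, data in toy_files.items():
        with open(tmp / name, 'wb') as file:
            pickle.dump(data, file)
    master = merge_pickles(sorted(tmp.glob('psn_*.pkl')))

print(master)
```

For the third step, `pd.DataFrame.from_dict(master, orient='index')` would turn a dictionary like this into a data frame with one row per player record; features missing for a player would show up as NaN.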