### Data Loading

The scraped data was saved as dictionaries and exported as .pkl files, which can be found in the data folder. Each file name contains the platform and the feature stored in that file (for example, psn_Kills.pkl).

Goal: Load all the data files into one master data frame.

Refer to https://stackoverflow.com/questions/30898038/python-run-script-on-multiple-files for running a script on multiple files.

Thought that crossed my mind:

**How does pandas deal with missing attributes?**

In [8]:
#generating fake player data with missing values for players sneef and sreef
players = {'smeef':{'wins':22, 'mp':123, 'kd': 0.23},
          'sneef': {'wins': 34, 'mp': 129},
          'sreef': {'wins':44, 'kd':0.24},
          'schniegel': {'wins':43, 'mp':555, 'kd':1.23}
          }

In [3]:
import pandas as pd

In [10]:
pd.DataFrame.from_dict(players).T

|  | wins | mp | kd |
|---|---|---|---|
| smeef | 22.0 | 123.0 | 0.23 |
| sneef | 34.0 | 129.0 | NaN |
| sreef | 44.0 | NaN | 0.24 |
| schniegel | 43.0 | 555.0 | 1.23 |


OK, so the missing attributes are returned as NaNs. Fine, we can impute these values later. For data loading, this means we can simply merge all the data dictionaries without worrying about missing values up front; they will be dealt with once the data is loaded into a pandas DataFrame.
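To get a feel for how much imputation will be needed, the missing values can be counted per attribute. The snippet below is a small sketch that reuses the toy `players` data from above. It uses `orient='index'` instead of transposing with `.T`, which is an alternative and not what the cell above does.

```python
import pandas as pd

# same toy data as above: sneef has no 'kd', sreef has no 'mp'
players = {
    'smeef': {'wins': 22, 'mp': 123, 'kd': 0.23},
    'sneef': {'wins': 34, 'mp': 129},
    'sreef': {'wins': 44, 'kd': 0.24},
    'schniegel': {'wins': 43, 'mp': 555, 'kd': 1.23},
}

# orient='index' treats each outer key (player) as a row, so no transpose is needed
df = pd.DataFrame.from_dict(players, orient='index')

# count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Here `wins` has no missing values, while `mp` and `kd` are each missing for one player.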

#### Steps to group data into one dictionary (later to be loaded into a pandas DataFrame)

1. Initialize an empty dictionary (call it df_dict for example)
2. Load in a data file (which is in the form of a dict)
3. Add the player data loaded in step 2 into df_dict
    - if it is the first file, simply add all the players
    - if it is a subsequent file, append the new stats to players already in df_dict; if a player does not exist in df_dict, add them as a new player
4. Repeat steps 2-3 for all data files
5. Once all data is added to df_dict, load it into a pandas DataFrame using the from_dict() method
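Steps 1 to 4 can be sketched on toy data before touching the real files. This is only an illustration: `merge_players`, `file_a` and `file_b` are hypothetical names and values, and the sketch ignores the platform check that the real data needs.

```python
def merge_players(df_dict, new_data):
    """Merge one loaded file (a dict of player dicts) into df_dict in place."""
    for player, stats in new_data.items():
        # setdefault adds an empty entry for unseen players,
        # then update() appends the new stats to that entry
        df_dict.setdefault(player, {}).update(stats)
    return df_dict


# toy stand-ins for two loaded data files (hypothetical values)
file_a = {'smeef': {'platform': 'psn', 'kills': 10}}
file_b = {'smeef': {'platform': 'psn', 'wins': 2},
          'sneef': {'platform': 'psn', 'wins': 5}}

df_dict = {}                        # step 1
for loaded in (file_a, file_b):     # steps 2 and 4
    merge_players(df_dict, loaded)  # step 3

print(df_dict)
```

After the loop, `smeef` holds both `kills` and `wins`, and `sneef` appears as a new player with only `wins`.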

In [34]:
#libraries
import pickle
import pandas as pd
from tqdm import tqdm

Note: the data was scraped from 2 different websites.
- files starting with 'all' were scraped from **cdst**
- files starting with 'psn', 'xbl', or 'battlenet' were scraped from **cdtr**

The data has been standardized during scraping such that each player's data is saved in a dictionary within a dictionary. For clarity, the main dictionary is in the form of:

{'random_player': {'platform': 'psn', 'stat': 23}, 'next_player': {...}, ...}

In [38]:
#Step 1: initialize main dictionary
df_dict = {}

In [39]:
#Step 2: Load in data

#example using psn Kills from cdtr
with open('../data/psn_Kills.pkl', 'rb') as file:
    psn_Kills = pickle.load(file)

In [42]:
#how many players?
len(psn_Kills)

831249

In [46]:
#Step 3: add player data to df_dict

#although this is the first time we are adding data to df_dict, 
#let's pretend we are not sure if df_dict has data already

#we can see if df_dict is not empty using the bool() function
bool(df_dict)

False

In [47]:
#if df_dict is empty, add data
#note: .copy() is shallow, so the inner player dicts are shared with psn_Kills
if not df_dict:
    df_dict = psn_Kills.copy()
len(df_dict)

831249

In [48]:
#Step 4: Let's add data from a new file

#using psn Wins
with open('../data/psn_Wins.pkl', 'rb') as file:
    psn_Wins = pickle.load(file)

In [56]:
#we know df_dict has player data now so we need to match players by
#their name and platform

#for each player in the psn_Wins dict...
for player, stat in tqdm(psn_Wins.items()):
    #...if player is in df_dict and platforms are the same (only psn for now, but the check will be needed later)
    if player in df_dict and df_dict[player]['platform'] == stat['platform']:
        #add wins to df_dict
        df_dict[player]['wins'] = stat['wins']
    #...otherwise add them as a new player (this leads to missing values which need to be imputed later)
    #caution: this branch also runs when the name exists under a different platform, and then replaces that entry
    else:
        df_dict[player] = {'platform': stat['platform'], 'wins': stat['wins']}

100%|██████████████████████████████████████████████████████████████████████| 635033/635033 [00:00<00:00, 803784.70it/s]
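One thing to watch in the loop above: if a player name already exists in df_dict under a different platform, the else branch replaces that player's existing entry. A possible way around this, sketched here on toy data, is to key the master dictionary by a (player, platform) tuple. `add_stat` and `master` are hypothetical names, and this is one option rather than the approach used in this notebook.

```python
def add_stat(master, new_data, stat_name):
    """Add stat_name from new_data into master, keyed by (player, platform)."""
    for player, stat in new_data.items():
        key = (player, stat['platform'])
        # create the entry the first time this (player, platform) pair is seen
        entry = master.setdefault(key, {'platform': stat['platform']})
        entry[stat_name] = stat[stat_name]
    return master


# toy data: the same name appears on two platforms (hypothetical values)
kills = {'smeef': {'platform': 'psn', 'Kills': 10}}
wins = {'smeef': {'platform': 'xbl', 'wins': 3}}

master = {}
add_stat(master, kills, 'Kills')
add_stat(master, wins, 'wins')
print(master)
```

With tuple keys, the psn and xbl players named `smeef` are kept as two separate entries instead of one overwriting the other.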


In [57]:
#check that wins were added alongside kills for an example player
df_dict['TTV-Scop3s_']

{'platform': 'psn', 'Kills': '26,580', 'wins': 1225}

Nice! Both kills and wins are there.
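Note that in the output above `Kills` is stored as the string `'26,580'` while `wins` is an integer. If numeric values are wanted later, a small helper along these lines could convert such strings. `to_int` is a hypothetical name, and the sketch assumes commas only appear as thousands separators.

```python
def to_int(value):
    """Convert values such as '26,580' to int; real ints pass through unchanged."""
    if isinstance(value, int):
        return value
    # drop the thousands separators before converting
    return int(value.replace(',', ''))


print(to_int('26,580'))  # 26580
print(to_int(1225))      # 1225
```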

In [59]:
#we also had a lot of new players added
len(df_dict)

936521

##### Next Steps

- check how to add cdst data in df_dict
- build script to go through each pkl file and add it to a master dictionary (e.g. df_dict)
- load dictionary containing all the player data into a pandas dataframe
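A first rough sketch of the script from the second bullet could look like the following. `load_all` is a hypothetical name, and the sketch assumes every .pkl file holds a dict of player dicts in the standardized form described earlier. It does not handle the platform check, and the cdst files may need different treatment.

```python
import pickle
from pathlib import Path


def load_all(data_dir):
    """Load every .pkl file in data_dir and merge them into one dictionary."""
    df_dict = {}
    # sorted() only makes the processing order reproducible
    for path in sorted(Path(data_dir).glob('*.pkl')):
        with open(path, 'rb') as file:
            data = pickle.load(file)
        for player, stats in data.items():
            # add unseen players, append stats to players already present
            df_dict.setdefault(player, {}).update(stats)
    return df_dict
```

It would be called as `df_dict = load_all('../data')`, after which `pd.DataFrame.from_dict(df_dict, orient='index')` should give one row per player.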