Data Preprocessing
====
In this notebook, we show how we prepare the data for machine learning.

At a high level, after reading the CSV file containing the information on players and referees, we group the data by player and aggregate the referee information (e.g. `meanIAT` and `meanExp`) for each player. Finally, we convert the string fields into integer values via label encoding. A more detailed explanation of each phase can be found below.

The following snippet shows several libraries we import:

In [1]:
# basic imports
import pandas as pd
import numpy as np
import os
from sklearn import preprocessing

Then, we read the CSV file into a dataframe and remove the rows for which the `rater1` or `rater2` attribute is not available (i.e. is NA).

In [2]:
filename = os.path.join('data', 'CrowdstormingDataJuly1st.csv')
df = pd.read_csv(filename)
df = df.dropna(subset=['rater1', 'rater2'])
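
To illustrate the filtering step, here is a minimal sketch on a made-up toy dataframe (the values are hypothetical; only the column names come from the notebook). With `subset`, `dropna` removes a row as soon as any of the listed columns is NA.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    'playerShort': ['a', 'b', 'c', 'd'],
    'rater1': [0.25, np.nan, 1.0, np.nan],
    'rater2': [0.50, 0.75, np.nan, np.nan],
})

# A row is dropped if rater1 OR rater2 is missing (default how='any').
kept = toy.dropna(subset=['rater1', 'rater2'])
print(kept['playerShort'].tolist())
```

Only player `a` has both ratings, so it is the only row that survives.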

The following snippet adds a boolean column that specifies whether a player is black.
It is obtained by adding the values given by the two raters and checking whether the sum is
strictly greater than 1. *(This is an assumption we have made throughout this homework.)*

In [3]:
df['black'] = (df['rater1'] + df['rater2']) > 1
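
The effect of the threshold can be checked on a few hand-made rating pairs. This is only a sketch: the ratings below are invented, and we assume here that both raters score on a scale from 0 to 1.

```python
import pandas as pd

# Hypothetical rating pairs (assumed to lie between 0 and 1).
toy = pd.DataFrame({
    'rater1': [0.0, 0.5, 0.5, 1.0],
    'rater2': [0.25, 0.5, 0.75, 1.0],
})

# Strictly greater than 1: a sum of exactly 1.0 is labelled False.
toy['black'] = (toy['rater1'] + toy['rater2']) > 1
print(toy['black'].tolist())
```

Note that the second pair sums to exactly 1 and is therefore not flagged.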

Next we observe that the data frame contains about 3.6 million cells. Note that `df.size` counts individual values (rows times columns), not rows.

In [4]:
df.size

3614009
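
As a quick reminder of what the different size attributes report, the following sketch uses a tiny made-up dataframe (not the homework data):

```python
import pandas as pd

# Tiny illustrative dataframe: 3 rows, 2 columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_cells = toy.size            # rows * columns
n_rows, n_cols = toy.shape    # (rows, columns)
print(n_cells, n_rows, n_cols, len(toy))
```

Here `size` is 6 while `len(toy)` and `shape[0]` are both 3, so `shape` or `len` is the one to use when the number of rows is of interest.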

Then, we group this data frame based on the short version of the name of each player. 

Note that without setting `as_index` to false, the `playerShort` column would become the index, which is not what we want later on for this data frame. Hence, we set this parameter to false.

In [6]:
df_grouped = df.groupby('playerShort', as_index=False)
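
The difference made by `as_index` is easiest to see side by side. The sketch below uses a made-up three-row dataframe and a single summed column:

```python
import pandas as pd

# Invented toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    'playerShort': ['x', 'x', 'y'],
    'games': [1, 2, 3],
})

# Default: the grouping key becomes the index of the result.
indexed = toy.groupby('playerShort').agg({'games': 'sum'})

# as_index=False: the grouping key stays a regular column.
flat = toy.groupby('playerShort', as_index=False).agg({'games': 'sum'})

print(list(indexed.columns), indexed.index.name)
print(list(flat.columns))
```

In the second variant `playerShort` remains available as an ordinary column, which is convenient when it has to be encoded or written out later.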

The next code snippet is one of the core parts of the data preprocessing phase.
In this part, we aggregate the information of each group (i.e. of each player).
The aggregation function is different for each column.
Below are the aggregation functions we apply:
* `first`: This aggregation function returns the first observed value for that column. We use it for columns that have identical values for a particular player (e.g. `height`, `position`, `weight`, etc.). *(For some columns, such as `club` and `leagueCountry`, this is an assumption: we assume that the value remains the same for each player.)*
* `np.sum`: We use this aggregation function for the number of games (e.g. `victories`, `defeats`) and the number of cards (e.g. `yellowCards`, `redCards`).
* `np.mean`: We use the mean function for aggregating the two pieces of information we use from referees: `meanExp` and `meanIAT`. To obtain better precision, we could have made two additional improvements. First, we could have used a weighted average for aggregating these two fields. Second, we could have included the standard deviation. However, we found that neither of these improves the precision in machine learning. Hence, for the sake of simplicity, we excluded them.
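
For reference, the two alternatives mentioned in the last bullet could look roughly like the sketch below. It is not part of the pipeline; the data is made up, and weighting by `games` is our assumption here, since other weights would be possible.

```python
import numpy as np
import pandas as pd

# Invented toy data; one row per (player, referee) pair.
toy = pd.DataFrame({
    'playerShort': ['x', 'x', 'y'],
    'games': [1, 3, 2],
    'meanIAT': [0.30, 0.34, 0.36],
})

def weighted_mean_iat(group):
    """Average meanIAT, weighted by games played (assumed weighting)."""
    return np.average(group['meanIAT'], weights=group['games'])

# Weighted average per player.
weighted = {name: weighted_mean_iat(g) for name, g in toy.groupby('playerShort')}

# Standard deviation per player; NaN when a player has a single row.
spread = toy.groupby('playerShort')['meanIAT'].std()

print(weighted)
print(spread)
```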

The following code shows clearly the aggregation functions used for each attribute.

In [7]:
def first(x):
    """Returns the first element of a column (a pandas Series) within a group"""
    return x.iloc[0]

# One aggregation function per column: `first` for per-player constants,
# sums for counts, and means for the referee scores.
df_aggregated = df_grouped.agg({'games': np.sum, 'victories': np.sum,
         'club': first, 'leagueCountry': first,
         'birthday': first, 'height': first,
         'weight': first, 'position': first,
         'ties': np.sum, 'defeats': np.sum,
         'goals': np.sum, 'yellowCards': np.sum,
         'yellowReds': np.sum, 'redCards': np.sum,
         'meanIAT': np.mean, 'meanExp': np.mean,
         'black': first})

The dataframe created by the aggregation looks as follows (first ten rows shown).

In [8]:
df_aggregated

| | playerShort | yellowReds | goals | birthday | ties | leagueCountry | black | defeats | weight | victories | height | meanExp | yellowCards | games | redCards | position | meanIAT | club |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aaron-hughes | 0 | 9 | 08.11.1979 | 179 | England | False | 228 | 71.0 | 247 | 182.0 | 0.494575 | 19 | 654 | 0 | Center Back | 0.346459 | Fulham FC |
| 1 | aaron-hunt | 0 | 62 | 04.09.1986 | 73 | Germany | False | 122 | 73.0 | 141 | 183.0 | 0.449220 | 42 | 336 | 1 | Attacking Midfielder | 0.348818 | Werder Bremen |
| 2 | aaron-lennon | 0 | 31 | 16.04.1987 | 97 | England | False | 115 | 63.0 | 200 | 165.0 | 0.491482 | 11 | 412 | 0 | Right Midfielder | 0.345893 | Tottenham Hotspur |
| 3 | aaron-ramsey | 0 | 39 | 26.12.1990 | 42 | England | False | 68 | 76.0 | 150 | 178.0 | 0.514693 | 31 | 260 | 1 | Center Midfielder | 0.346821 | Arsenal FC |
| 4 | abdelhamid-el-kaoutari | 4 | 1 | 17.03.1990 | 40 | France | False | 43 | 73.0 | 41 | 180.0 | 0.335587 | 8 | 124 | 2 | Center Back | 0.331600 | Montpellier HSC |
| 5 | abdou-traore_2 | 1 | 3 | 17.01.1988 | 23 | France | True | 33 | 74.0 | 41 | 180.0 | 0.296562 | 11 | 97 | 0 | Right Midfielder | 0.320079 | Girondins Bordeaux |
| 6 | abdoulaye-diallo_2 | 0 | 0 | 30.03.1992 | 8 | France | True | 8 | 80.0 | 8 | 189.0 | 0.400818 | 0 | 24 | 0 | Goalkeeper | 0.341625 | Stade Rennes |
| 7 | abdoulaye-keita_2 | 0 | 0 | 19.08.1990 | 1 | France | True | 2 | 83.0 | 0 | 188.0 | 0.417225 | 0 | 3 | 0 | Goalkeeper | 0.355406 | Girondins Bordeaux |
| 8 | abdoulwhaid-sissoko | 0 | 3 | 20.03.1990 | 25 | France | True | 62 | 68.0 | 34 | 180.0 | 0.429630 | 21 | 121 | 2 | Defensive Midfielder | 0.348178 | Stade Brest |
| 9 | abdul-rahman-baba | 0 | 0 | 02.07.1994 | 8 | Germany | True | 25 | 70.0 | 17 | 179.0 | 0.361068 | 3 | 50 | 1 | Left Fullback | 0.342072 | SpVgg Greuther Fürth |


The random forest classifier provided by `sklearn` does not accept string values.
Hence, we need to convert the values of the string columns (`leagueCountry`, `club`, `playerShort` and `position`) into numeric values. We encode the boolean `black` column in the same way.

This is achieved with the `LabelEncoder` provided by `sklearn`. For each of these columns we create a separate label encoder. Then we replace the values of each column by fitting the encoder on the old values and transforming them into integers.

In [9]:
# one encoder per column, so that each mapping can be inverted independently
lelc = preprocessing.LabelEncoder()  # leagueCountry
lec = preprocessing.LabelEncoder()   # club
leps = preprocessing.LabelEncoder()  # playerShort
lep = preprocessing.LabelEncoder()   # position
leb = preprocessing.LabelEncoder()   # black

df_aggregated['leagueCountry'] = lelc.fit_transform(df_aggregated['leagueCountry'])
df_aggregated['club'] = lec.fit_transform(df_aggregated['club'])
df_aggregated['playerShort'] = leps.fit_transform(df_aggregated['playerShort'])
df_aggregated['black'] = leb.fit_transform(df_aggregated['black'])
# cast to str first, presumably so that any missing position becomes the
# string 'nan' and the column holds a single type
df_aggregated['position'] = df_aggregated['position'].astype(str)
df_aggregated['position'] = lep.fit_transform(df_aggregated['position'])
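
As a small illustration of what a label encoder does, the sketch below encodes a made-up list of positions and maps the codes back. Keeping the fitted encoder around is what makes the reverse mapping possible.

```python
import pandas as pd
from sklearn import preprocessing

# Invented sample values; the real column has many more categories.
positions = pd.Series(['Goalkeeper', 'Center Back', 'Goalkeeper'])

le = preprocessing.LabelEncoder()
codes = le.fit_transform(positions)

# classes_ holds the sorted unique labels; a code is an index into it.
print(list(le.classes_))
print(list(codes))

# The fitted encoder can translate the integers back to the labels.
original = le.inverse_transform(codes)
print(list(original))
```

The integer codes follow the sorted order of the labels, so they carry no meaningful ordering of their own.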

Finally, we write the transformed dataframe into a csv file so that it can be used for the random forest classifier.

In [None]:
df_aggregated.to_csv('data/clean_data.csv')
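
One detail to keep in mind when this file is read back: by default `to_csv` also writes the index as an unnamed first column. The sketch below shows the effect on a made-up two-row dataframe, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Invented toy data standing in for the aggregated dataframe.
toy = pd.DataFrame({'playerShort': [0, 1], 'games': [654, 336]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

# Reading naively turns the saved index into an 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores it as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Whoever loads `clean_data.csv` should either pass `index_col=0` or drop the extra column, so that it is not fed to the classifier as a feature.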