# 04 - Applied ML

## Deadline
Monday November 21, 2016 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution
you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code

## Background
In this homework we will gain experience with Applied Machine Learning by exploring an interesting dataset about soccer players and referees.
You can find all the data in the `CrowdstormingDataJuly1st.csv` file, and you can read a thorough [dataset description here](DATA.md).
Given that the focus of this homework is Machine Learning, I recommend that you first take a look at [this notebook](http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb),
which contains solid pre-processing and visualization work on the given dataset. You are *not* allowed to just copy/paste the pre-processing steps
performed by the notebook authors -- you are still expected to perform your own data analysis for the homework. Still, I'm confident that first consulting
the work done by expert data analysts will tangibly speed up your effort (i.e., they have already found many glitches in the data for you :)


## Assignment
1. Train a `sklearn.ensemble.RandomForestClassifier` that, given a soccer player description, outputs his skin color. Show how different parameters
passed to the Classifier affect the overfitting issue. Perform cross-validation to mitigate the overfitting of your model. Once you have assessed your model,
inspect the `feature_importances_` attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even
before feeding them to the classifier), can you obtain a substantially different `feature_importances_` attribute?
*BONUS*: plot the learning curves against at least 2 different sets of parameters passed to your Random Forest. To obtain smooth curves, partition
your data in at least 20 folds. Can you find a set of parameters that leads to high bias, and one which does not?

2. Aggregate the referee information by grouping on soccer player, and use an unsupervised learning technique to cluster the soccer players into 2 disjoint
clusters. Remove features iteratively, and at each step perform the clustering again and compute the silhouette score -- can you find a configuration of features with high silhouette
score where players with dark and light skin colors belong to different clusters? Discuss the obtained results.


--------------------------------------------------------------------------------------------------------------
# Data Description

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.
 
Player photos were available from the source for 1586 out of 2053 players. Players’ skin tone was coded by two independent raters blind to the research question who, based on their profile photo, categorized players on a 5-point scale ranging from “very light skin” to “very dark skin” with “neither dark nor light skin” as the center value. 

Additionally, implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both these measures were created by aggregating data from many online users in referee countries taking these tests on [Project Implicit](http://projectimplicit.net).

In all, the dataset has a total of 146028 dyads of players and referees. A detailed description of all variables in the dataset can be seen in the list below.

## Variables:

*playerShort* - short player ID

*player* - player name

*club* - player club

*leagueCountry* - country of player club (England, Germany, France, and Spain)

*birthday* - player birthday

*height* - player height (in cm)

*weight* - player weight (in kg)

*position* - detailed player position

*games* - number of games in the player-referee dyad

*victories* - victories in the player-referee dyad

*ties* - ties in the player-referee dyad

*defeats* - losses in the player-referee dyad

*goals* - goals scored by a player in the player-referee dyad

*yellowCards* - number of yellow cards player received from referee

*yellowReds* - number of yellow-red cards player received from referee

*redCards* - number of red cards player received from referee

*photoID* - ID of player photo (if available)

*rater1* - skin rating of photo by rater 1 (5-point scale ranging from “very light skin” to “very dark skin”)

*rater2* - skin rating of photo by rater 2 (5-point scale ranging from “very light skin” to “very dark skin”)

*refNum* - unique referee ID number (referee name removed for anonymizing purposes)

*refCountry* - unique referee country ID number (country name removed for anonymizing purposes)

*meanIAT* - mean implicit bias score (using the race IAT) for referee country, higher values correspond to faster white | good, black | bad associations

*nIAT* - sample size for race IAT in that particular country

*seIAT* - standard error for mean estimate of race IAT

*meanExp* - mean explicit bias score (using a racial thermometer task) for referee country, higher values correspond to greater feelings of warmth toward whites versus blacks

*nExp* - sample size for explicit bias in that particular country

*seExp* - standard error for mean estimate of explicit bias measure




--------------------------------------------------------------------------------------------------------------

# IMPLEMENTATION

In [29]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
import processing as preproc  # helper module (not shown) providing drop_columns, fill_na_columns, label_encode

In [30]:
data = 'CrowdstormingDataJuly1st.csv'
dyads = pd.read_csv(data)
nb_cols = len(dyads.columns)
print(nb_cols)


28


In [31]:
dyads.iloc[:10, :14]

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,ties,defeats,goals,yellowCards
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,0,1,0,0
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,0,1,0,1
2,abdon-prats,Abdón Prats,RCD Mallorca,Spain,17.12.1992,181.0,79.0,,1,0,1,0,0,1
3,pablo-mari,Pablo Marí,RCD Mallorca,Spain,31.08.1993,191.0,87.0,Center Back,1,1,0,0,0,0
4,ruben-pena,Rubén Peña,Real Valladolid,Spain,18.07.1991,172.0,70.0,Right Midfielder,1,1,0,0,0,0
5,aaron-hughes,Aaron Hughes,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,1,0,0,1,0,0
6,aleksandar-kolarov,Aleksandar Kolarov,Manchester City,England,10.11.1985,187.0,80.0,Left Fullback,1,1,0,0,0,0
7,alexander-tettey,Alexander Tettey,Norwich City,England,04.04.1986,180.0,68.0,Defensive Midfielder,1,0,0,1,0,0
8,anders-lindegaard,Anders Lindegaard,Manchester United,England,13.04.1984,193.0,80.0,Goalkeeper,1,0,1,0,0,0
9,andreas-beck,Andreas Beck,1899 Hoffenheim,Germany,13.03.1987,180.0,70.0,Right Fullback,1,1,0,0,0,0


In [32]:
dyads.iloc[:10, 14:28]

Unnamed: 0,yellowReds,redCards,photoID,rater1,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,0,0,95212.jpg,0.25,0.5,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,0,0,1663.jpg,0.75,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
2,0,0,,,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3,0,0,,,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
4,0,0,,,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
5,0,0,3868.jpg,0.25,0.0,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
6,0,0,47704.jpg,0.0,0.25,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
7,0,0,22356.jpg,1.0,1.0,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
8,0,0,16528.jpg,0.25,0.25,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
9,0,0,36499.jpg,0.0,0.0,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752


By visualizing the data and reading the IPython notebook that previously worked on the same dataset, we observed that each data entry is a dyad between a player and a referee. Every entry contains a player ID, a referee ID, their respective descriptors and some information about the dyad itself.

We define a dyad as the relationship between one player and one referee. A single dyad can cover multiple encounters of the pair (player <-> referee) across different matches, which is why columns such as 'games' hold counts.
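
To make the dyad structure concrete, here is a minimal sketch on made-up rows (the referee numbers and counts are invented for illustration, they are not taken from the dataset). It shows that one player appears in several dyads and how those rows could be collapsed to one row per player. The aggregation choices (`nunique`, `sum`) are assumptions, not the method used in this notebook.

```python
import pandas as pd

# Made-up dyads: one row per (player, referee) pair.
toy_dyads = pd.DataFrame({
    "playerShort": ["aaron-hughes", "aaron-hughes",
                    "john-utaka", "john-utaka", "john-utaka"],
    "refNum": [4, 7, 2, 4, 9],
    "games": [1, 3, 1, 2, 5],      # encounters covered by each dyad
    "redCards": [0, 1, 0, 0, 1],
})

# Collapse the dyads to one row per player.
per_player = toy_dyads.groupby("playerShort").agg(
    n_referees=("refNum", "nunique"),  # how many dyads the player is part of
    games=("games", "sum"),            # total encounters over all referees
    redCards=("redCards", "sum"),
)
print(per_player)
```

Grouping on `playerShort` like this is also the natural starting point for the clustering task, where each player must be described by a single row.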



Our goal is to determine whether a player has dark skin or light skin. We thus need to design a model and train it on a part of this data with predefined skin labels. We will use the columns 'rater1' and 'rater2' to define this label. Because the model needs a target (a valid skin label), we keep only the dyads for which at least one of these two ratings is present.

In [33]:
dyads = dyads[~(np.isnan(dyads['rater1']) & np.isnan(dyads['rater2']))]
dyads.reset_index(inplace=True)

Because the label is derived from two columns, we first check how closely the two raters agree with each other.

In [34]:
max_diff = (dyads['rater1'] - dyads['rater2']).abs().max()  # largest disagreement, regardless of sign
max_diff

0.25

In [35]:
dyads['rater1'].unique()

array([ 0.25,  0.75,  0.  ,  1.  ,  0.5 ])

The scale goes from 0 to 1 in steps of 0.25, and the maximum difference between the two raters is 0.25, i.e. a single step.
Therefore there is no case where the raters hold opposite opinions about a player (one 'black', the other 'white').
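
The agreement check depends on the largest disagreement in absolute value, since a signed difference can hide cases where rater 2 gives the higher score. A minimal sketch on made-up ratings (the values are invented and only share the 0 to 1 scale with the dataset) illustrates the difference between the two:

```python
import pandas as pd

# Made-up ratings on the same 0..1 scale with a step of 0.25.
ratings = pd.DataFrame({
    "rater1": [0.25, 0.75, 0.00, 1.00, 0.50],
    "rater2": [0.50, 0.75, 0.25, 1.00, 0.75],
})

diff = ratings["rater1"] - ratings["rater2"]

# Here rater2 is never lower than rater1, so the signed maximum is 0
# even though the raters disagree on three rows.
signed_max = diff.max()

# The absolute difference measures disagreement in both directions.
max_abs_diff = diff.abs().max()

# Share of rows on which both raters give exactly the same score.
agreement_rate = (diff == 0).mean()

print(signed_max, max_abs_diff, agreement_rate)
```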

We want a single label, so we need to combine the two ratings.

A first approach is to take the mean of the two ratings and output 0 ('white') if it is smaller than 0.5 and 1 ('black') if it is greater than 0.5.

This rule leaves a mean of exactly 0.5 undefined. Since the two ratings never differ by more than 0.25, the mean can only equal 0.5 when both raters gave 0.5, so we check how many such entries our dataset contains:

In [36]:
dyads[(dyads['rater1'] == 0.5) & (dyads['rater2'] == 0.5)]['playerShort'].count()

8989

We find 8989 dyads for which both raters gave a rating of 0.5. We decide to label these players 1 ('black').

In [37]:
dyads['y'] = (dyads['rater1'] + dyads['rater2']) / 2 

In [38]:
dyads.loc[dyads['y'] < 0.5, 'y'] = 0
dyads.loc[dyads['y'] >= 0.5, 'y'] = 1
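
As a sanity check of the labelling rule, the following sketch applies the same threshold to a few made-up ratings. The function name `binarize_skin_rating` is hypothetical and the values are invented. It also makes explicit what happens when one rating is missing: the mean is NaN, so the label is left undefined rather than silently set to 0 or 1.

```python
import numpy as np
import pandas as pd

def binarize_skin_rating(frame, threshold=0.5):
    """Hypothetical helper: 1.0 if the mean rating is >= threshold, else 0.0."""
    mean_rating = (frame["rater1"] + frame["rater2"]) / 2
    label = (mean_rating >= threshold).astype(float)
    # Comparisons with NaN are False, so restore NaN where the mean is undefined.
    return label.where(mean_rating.notna())

toy = pd.DataFrame({
    "rater1": [0.00, 0.25, 0.50, 0.75, 0.25],
    "rater2": [0.25, 0.50, 0.50, 1.00, np.nan],
})
labels_toy = binarize_skin_rating(toy)
print(labels_toy.tolist())
```

The third row (both ratings equal to 0.5) falls on the threshold and is mapped to 1, matching the decision taken for the 8989 ambiguous dyads.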

Now we drop the columns that look useless to us: 'photoID', and 'Alpha_3', which is completely represented by 'refCountry'.
The cell below also drops 'birthday'.

In [39]:
dyads = preproc.drop_columns(dyads, ['photoID', 'Alpha_3', 'birthday'])
dyads.head()

Unnamed: 0,index,playerShort,player,club,leagueCountry,height,weight,position,games,victories,...,rater2,refNum,refCountry,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,y
0,0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,177.0,72.0,Attacking Midfielder,1,0,...,0.5,1,1,0.326391,712.0,0.000564,0.396,750.0,0.002696,0.0
1,1,john-utaka,John Utaka,Montpellier HSC,France,179.0,82.0,Right Winger,1,0,...,0.75,2,2,0.203375,40.0,0.010875,-0.204082,49.0,0.061504,1.0
2,5,aaron-hughes,Aaron Hughes,Fulham FC,England,182.0,71.0,Center Back,1,0,...,0.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752,0.0
3,6,aleksandar-kolarov,Aleksandar Kolarov,Manchester City,England,187.0,80.0,Left Fullback,1,1,...,0.25,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752,0.0
4,7,alexander-tettey,Alexander Tettey,Norwich City,England,180.0,68.0,Defensive Midfielder,1,0,...,1.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752,1.0


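TODO: explain. Judging by its arguments, the cell below fills missing 'position' values with 'Unknown' and missing 'weight' and 'height' values with 0, then drops every row that still contains a missing value.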

In [40]:
dyads = preproc.fill_na_columns(dyads, ['position', 'weight', 'height'], ['Unknown', 0, 0])
dyads = dyads.dropna()
len(dyads)

124468
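
The `preproc` module is not shown in this notebook, so the following is only a guess at what `fill_na_columns` might do, written as a hypothetical stand-in with the same call signature. It pairs each column with its fill value, and the toy rows are invented for illustration.

```python
import numpy as np
import pandas as pd

def fill_na_columns(frame, columns, fill_values):
    """Hypothetical stand-in for preproc.fill_na_columns (real code not shown)."""
    filled = frame.copy()
    for column, value in zip(columns, fill_values):
        filled[column] = filled[column].fillna(value)
    return filled

# Made-up rows with gaps in different columns.
toy = pd.DataFrame({
    "position": ["Goalkeeper", np.nan, "Center Back"],
    "height": [193.0, np.nan, 191.0],
    "weight": [80.0, 79.0, np.nan],
    "meanIAT": [0.33, 0.37, np.nan],
})

filled = fill_na_columns(toy, ["position", "weight", "height"], ["Unknown", 0, 0])
cleaned = filled.dropna()  # drops the last row: its meanIAT is still missing
print(cleaned)
```

Using 0 for a missing height or weight acts as a sentinel value that a tree can split off. Imputing the median instead would be a common alternative, at the cost of hiding which values were originally missing.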

After creating our y (label) column, we drop the two rater columns from our dataset and separate y from the rest of the data to obtain our X frame.

In [41]:
data = dyads.drop(['rater1', 'rater2'], axis=1)
labels = data.pop('y')

We encode our categorical data columns:

In [42]:
to_encode = ['playerShort', 'player', 'club', 'leagueCountry','position']
encoded_data = preproc.label_encode(data, to_encode)
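
Since `preproc.label_encode` is not shown either, here is a hypothetical stand-in based on `sklearn.preprocessing.LabelEncoder`, which maps each distinct string to an integer code. Whether the real helper returns a copy or modifies the frame in place is unknown; this sketch returns a copy, and the toy rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode(frame, columns):
    """Hypothetical stand-in for preproc.label_encode; returns an encoded copy."""
    encoded = frame.copy()
    for column in columns:
        # LabelEncoder assigns codes following the sorted order of the classes.
        encoded[column] = LabelEncoder().fit_transform(encoded[column].astype(str))
    return encoded

toy = pd.DataFrame({
    "club": ["Fulham FC", "RCD Mallorca", "Fulham FC"],
    "position": ["Center Back", "Unknown", "Goalkeeper"],
    "height": [182.0, 181.0, 193.0],
})
encoded_toy = label_encode(toy, ["club", "position"])
print(encoded_toy)
```

Integer codes impose an arbitrary order on the categories, which tree-based models usually tolerate. Note also that encoding 'playerShort' and 'player' hands the classifier a per-player identifier, which matters when interpreting `feature_importances_`.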

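We now train a Random Forest classifier and evaluate it in two ways: accuracy on a single random hold-out split, and the mean accuracy of a 3-fold cross-validation.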

In [44]:
# Hold-out evaluation: random 80/20 split over dyad rows, then accuracy on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.20, random_state=0)
rfcl = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
score = rfcl.score(X_test, y_test)
print("Classic score")
print(score)

# 3-fold cross-validation on the full data (cv=3 was the default when this cell was run)
clf = RandomForestClassifier(n_estimators=10)
scores = cross_val_score(clf, data, labels, cv=3)
print("Cross-val score")
print(scores.mean())

Classic score
0.992809512332
Cross-val score
0.714728127513
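
The gap between the two scores deserves a closer look. One plausible contributor, not verified on the real data here, is that the same player appears in many dyads, so a random row-level split puts rows of the same player in both the training and the test set. The sketch below reproduces that effect on synthetic data where the label is random per player, so no honest model can beat chance. All names and numbers are made up, and `GroupKFold` is one possible remedy rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score, train_test_split

rng = np.random.RandomState(0)
n_players, dyads_per_player = 300, 8

# Synthetic players: the label is random, so the features carry no real signal.
players = pd.DataFrame({
    "player_id": np.arange(n_players),
    "height": rng.normal(182, 7, n_players).round(),
    "weight": rng.normal(76, 7, n_players).round(),
    "y": rng.randint(0, 2, n_players),
})

# One row per dyad: the player descriptors repeat on every row of a player.
toy_dyads = players.loc[players.index.repeat(dyads_per_player)].reset_index(drop=True)
X = toy_dyads[["player_id", "height", "weight"]]
y = toy_dyads["y"]

# Row-level hold-out: rows of the same player land on both sides of the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
row_level_score = forest.score(X_test, y_test)

# Group-aware cross-validation: every player is entirely in train or entirely in test.
grouped_scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X, y, groups=toy_dyads["player_id"], cv=GroupKFold(n_splits=3),
)

print("row-level hold-out:", row_level_score)
print("grouped 3-fold CV :", grouped_scores.mean())
```

On this synthetic data the row-level score is close to 1 because the forest memorises players, while the grouped score stays near chance. Applying the same idea to the real data would mean passing the encoded 'playerShort' column as `groups`.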
