# Homework 4: machine learning

---
### NOTE: Sometimes we refer to [the original work](http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb) 

---

In [None]:
%load_ext autoreload
%autoreload 2 
%matplotlib inline

You need `scikit-learn` v0.18 or later (it introduced `sklearn.model_selection`): `conda update scikit-learn`

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score, KFold

Read the data and do a minimal cleanup: for the first part we need the labels (i.e. the skin colour ratings), so we can't use the rows where they are missing.

Since we will later `aggregate` the rows per player, it is **important** to note that dropping these rows doesn't produce inconsistencies: `dyads` is constructed by a join between a `players` table and a `referees` table, so a rating that is missing is missing for all rows of that player.

In [None]:
dyads = pd.read_csv("CrowdstormingDataJuly1st.csv", index_col=0)
print(dyads.shape)

dyads.dropna(subset=['rater1'], inplace=True)
print(dyads.shape)

# since both values are missing at the same time, this should be 0:
print(dyads.rater2.isnull().sum())

# the groupby object used later on (one group per player)
group_players = dyads.groupby(level=0)

Let's see the numbers: 

<sub>Yes, the original analysis already computed these, but it's useful to have them at hand:</sub>

The original analysis also excluded some referees that had been _carried over_, which removes only ~3% of the data. Since we're not doing statistics on referees, we won't drop them. Every little bit of data helps :)

In [None]:
print("Number of players: " , dyads.index.unique().size)
print("Number of referees: ", dyads.refNum.unique().size)

The original analysis mentioned that "*the two raters disagree on 28742 or 19% of the time*". Since there are only 1585 players, this must have been computed on the rows of `dyads`, i.e. once per player-referee pair. _**WHY?**_ A rating belongs to a player, not to a dyad, so that doesn't make sense. Let's first check that the ratings for each player are consistent across rows:

For each player's group, we check that the number of distinct values in `raterX` is **exactly** one:

In [None]:
def build_player_consistency(player_df):
    """ Needs to return a Series of {col_name: col_value}. """
    # a rater is inconsistent for a player if the column holds more than one distinct value
    return pd.Series({col+"_INconsistent" : (player_df[col].unique().size != 1) 
                                             for col in ['rater1', 'rater2'] })
consistency = group_players.apply(build_player_consistency)
print("Rater1 has been inconsistent %d times" % consistency.rater1_INconsistent.sum())
print("Rater2 has been inconsistent %d times" % consistency.rater2_INconsistent.sum())

OK, so they _ARE_ consistent. This means that the original statistic counts each player once per row in `dyads`, so players with more referees weigh more and the number is skewed. Let's check again, this time on _unique_ players:

In [None]:
player_ratings = group_players.agg({'rater1':'first', 'rater2':'first'})
diffs = player_ratings.rater1 - player_ratings.rater2
# the .1% format multiplies the fraction by 100 and appends the percent sign
print("The raters disagree for {p:.1%} of the players".format(p=(diffs != 0).sum() / len(diffs) ))

print("Diffs std dev: ", diffs.std())

max_diff = diffs.abs().max()
num_occur = (diffs.abs() == max_diff).sum()
print("Max disagreement value {0}, occurring {1} times".format(max_diff * 4, num_occur)) # *4 converts the 0.25-step ratings to whole steps

So this means:
  1. that there is slightly more agreement between the raters for players who have more entries in `dyads` i.e. who played under more referees
  2. that if we use both labels, `accuracy` is not a very good measure of performance. Keep in mind that the disagreements can go in either direction, so the impact on the accuracy could be up to twice as big, i.e. the accuracy could be at most $1 - 2 \cdot \mathit{disagreementPercentage} = 1 - 2 \cdot 0.24 \approx 0.5$
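
Point 1 can be illustrated on made-up data. The sketch below is only a toy example (the two players and their ratings are invented): the raters agree on a player with four rows and disagree on a player with a single row, so the row-level disagreement rate is lower than the player-level one.

```python
import pandas as pd

# Toy data (invented): player "a" has 4 rows and the raters agree,
# player "b" has 1 row and the raters disagree.
toy = pd.DataFrame({
    "player": ["a", "a", "a", "a", "b"],
    "rater1": [0.25, 0.25, 0.25, 0.25, 0.50],
    "rater2": [0.25, 0.25, 0.25, 0.25, 0.75],
}).set_index("player")

# Row-level rate: every row counts, so "a" weighs four times as much as "b".
row_rate = (toy.rater1 != toy.rater2).mean()

# Player-level rate: keep one row per player before comparing.
per_player = toy.groupby(level=0).first()
player_rate = (per_player.rater1 != per_player.rater2).mean()

print("row-level disagreement:   ", row_rate)
print("player-level disagreement:", player_rate)
```

Here one disagreement out of five rows gives 20%, while one disagreement out of two players gives 50%.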

#### Curiosity
Who are the 'controversial' guys :) ?

In [None]:
diffs[diffs.abs() == max_diff]

<img style='float:left' alt='Kyle-walker' src='http://www.thefootballsocial.co.uk/images/players/Tottenham%20Hotspur/Kyle%20Walker.jpg' /> <img alt='Mario_Goetze' src='http://i0.web.de/image/176/31756176,pd=2/mario-goetze.jpg' width=300/>

## Feature selection

We aggregate the data for each player and keep the following variables:
- the height and weight of the player
- the total number of games played
- the total number of victories, ties and defeats
- the total number of goals scored
- the total number of red cards, yellow-reds and yellow cards received

In [None]:
players = group_players.agg({'height':'first', 'weight':'first', 'games':'sum', 
                             'victories':'sum','defeats':'sum', 'ties': 'sum', 'goals':'sum', 
                             'redCards':'sum', 'yellowReds': 'sum', 'yellowCards':'sum'})
print(players.shape)
players.head()

If the height or the weight is NaN, we replace it with the mean height or weight over all players:

In [None]:
av_height = players['height'].mean()
av_weight = players['weight'].mean()
players['height'] = players['height'].fillna(value=av_height)
players['weight'] = players['weight'].fillna(value=av_weight)

We create extra features by normalizing the counts by the number of games played (stored in `percentage_*` columns, as fractions between 0 and 1):
- the fraction of victories, ties and defeats
- the number of red cards, yellow-reds and yellow cards per game

In [None]:
# count columns to normalize by the number of games played
count_columns = ['victories', 'ties', 'defeats', 'redCards', 'yellowReds', 'yellowCards']
for name in count_columns:
    players['percentage_'+name] = players[name]/players['games']
players.head()

Compute extra features from the per-player correlation between the referees' mean IAT and mean Exp scores and the cards given:

In [None]:
c = group_players.corr()

In [None]:
for racism in ['meanIAT', 'meanExp']:
    for card in ['redCards', 'yellowCards', 'yellowReds']:
        a = c.loc[c.index.get_level_values(1)==racism, card].reset_index(level=1).fillna(value=0)
        players['cor_'+racism+card] = a[card]
players.head()
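
What the cell above extracts can be seen on a small made-up example. This is a sketch under assumptions: the two toy players and their values are invented, and `xs` is used here as a shorter way to pick one row per player out of the grouped correlation matrix.

```python
import pandas as pd

# Toy data (invented): three referee rows per player.
toy = pd.DataFrame({
    "player": ["a", "a", "a", "b", "b", "b"],
    "meanIAT": [0.1, 0.2, 0.3, 0.1, 0.2, 0.3],
    "redCards": [0, 1, 2, 2, 1, 0],
}).set_index("player")

# One correlation matrix per player; the result is indexed by (player, column).
corr = toy.groupby(level=0).corr()

# Keep the "meanIAT" row of each matrix and read its "redCards" entry:
# the per-player correlation between meanIAT and red cards.
cor_meanIAT_redCards = corr.xs("meanIAT", level=1)["redCards"]
print(cor_meanIAT_redCards)
```

Player `a` gets more red cards as `meanIAT` grows (correlation close to 1), player `b` fewer (close to -1). A player whose card count never varies has an undefined correlation, which is why the notebook fills NaN with 0.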

Transform the categorical data into integer codes (e.g. Spain = 3) so that it can be used by the random forest. We encode:
- club
- country of the league
- position

In [None]:
le = preprocessing.LabelEncoder()

categorical_values = ['club', 'leagueCountry', 'position']
for name in categorical_values:
    # one value per player: take the first entry of each group
    category = group_players.agg({name:'first'})
    # .values replaces as_matrix(), which was removed in pandas 1.0
    players[name] = le.fit_transform(category.values.flatten().tolist())

players.head()
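
As a minimal sketch of what `LabelEncoder` does, here it is on a made-up column (the four toy rows are invented). The codes follow the sorted order of the distinct values, so they carry no meaning beyond identity.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (invented) standing in for one categorical feature.
toy = pd.DataFrame({"leagueCountry": ["Spain", "England", "Spain", "Germany"]})

encoder = LabelEncoder()
toy["leagueCountry_code"] = encoder.fit_transform(toy["leagueCountry"])

# classes_ holds the sorted distinct values; the position in it is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Because the codes are ordered alphabetically, a tree split such as `code <= 1` groups countries arbitrarily; that is tolerable for a random forest but worth keeping in mind when reading the feature importances.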

In [None]:
skin_color = group_players.agg({'rater1' : 'first'})
skin_color.head()

## Assignment 1: predict player's skin color

We convert the pandas data frames to NumPy arrays in order to match the data format expected by scikit-learn. We also map the player's skin color rating from a float to an integer.

In [None]:
# .values replaces as_matrix(), which was removed in pandas 1.0
X = players.values
Y = skin_color.values.flatten()
# map 0.25 to 1 etc. and cast to int so the classes are integer labels
Y = np.rint(Y * 4).astype(int)

Train the random forest using cross-validation:

In [None]:
kf = KFold(n_splits=4)
clf = RandomForestClassifier(n_estimators=10, max_depth=5, max_features=None)

for train_index, test_index in kf.split(X):
    clf = clf.fit(X[train_index], Y[train_index])
    # evaluate on the held-out fold and, for comparison, on the training folds
    Y_predict_test = clf.predict(X[test_index])
    Y_predict_train = clf.predict(X[train_index])
    print("accuracy on test data: ", (Y_predict_test == Y[test_index]).mean())
    print("accuracy on training data: ", (Y_predict_train == Y[train_index]).mean())

In [None]:
cross_val_score(clf, X, Y, scoring='accuracy', cv=4)
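
One possible variant, shown here only as a sketch on synthetic data: `KFold` without shuffling splits the rows in their stored order, whereas a shuffled `StratifiedKFold` keeps the class proportions similar in every fold. The data set, `random_state` values and variable names below are assumptions made for the example, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class problem standing in for (X, Y).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_classes=3, random_state=0)

model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)

# Shuffled, stratified folds: each fold has roughly the same class mix.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, scoring="accuracy", cv=cv)

print("fold accuracies:", scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.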

### Feature importance

In [None]:
importances = clf.feature_importances_
# spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    # index the column names with the sorted order so name and value match
    print("%d. feature %s (%f)" % (f + 1,  players.columns[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
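
The ranking logic can be checked in isolation. The following is a minimal sketch on synthetic data (the data set and the feature names `f0` to `f4` are invented): the forest's importance is the mean over its trees, the error bar is the standard deviation over the trees, and the names are looked up through the same sorted index as the values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the player features.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=0)
names = np.array(["f0", "f1", "f2", "f3", "f4"])  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
forest.fit(X_toy, y_toy)

importances = forest.feature_importances_
# Per-tree importances give the spread used for the error bars.
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
order = np.argsort(importances)[::-1]  # most important first

for rank, idx in enumerate(order, start=1):
    print("%d. %s: %.3f (+/- %.3f)" % (rank, names[idx], importances[idx], std[idx]))
```

Passing `names[order]` instead of the raw indices to `plt.xticks` would label the bars with feature names.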