# Homework 4 - Applied ML

In [None]:
# pandas and NumPy
import numpy as np
import pandas as pd

# matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Our code: reload it automatically at every cell execution (useful during development)
%load_ext autoreload
%autoreload 1
%aimport utils

## 0 - Cleaning data

In [None]:
soccer_data = pd.read_csv('CrowdstormingDataJuly1st.csv', sep=',', parse_dates=['birthday'])
soccer_data.head()

First, we must understand the data we will use and clean it if needed.

A detailed description of each column is provided in the file `DATA.md`. We invite the reader to read these descriptions before continuing.

### Compute age

The first modification we propose is to compute each player's age from the given birthday. We will use this feature instead of the birthday column when needed, since a numeric age is easier to split on for the decision trees of a random forest than a raw date. To keep any future model usable with other data, we prefer to compute the age relative to the moment when the data was collected.

> Doc compute_age

In [None]:
soccer_data['age'] = soccer_data.apply(utils.compute_age, axis=1)
soccer_data.head()
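
`utils.compute_age` is not shown in this notebook. The following is only a minimal sketch of what such a row-wise function could look like. The name `compute_age_sketch` and the reference date `2013-01-01` are hypothetical and must be replaced by the real collection date.

```python
import pandas as pd

# Hypothetical reference date: replace it with the actual data collection date.
REFERENCE_DATE = pd.Timestamp('2013-01-01')

def compute_age_sketch(row, reference_date=REFERENCE_DATE):
    """Return the age in whole years of the player in `row` at `reference_date`."""
    birthday = row['birthday']
    if pd.isnull(birthday):
        return None
    age = reference_date.year - birthday.year
    # Subtract one year if the birthday has not occurred yet in the reference year
    if (reference_date.month, reference_date.day) < (birthday.month, birthday.day):
        age -= 1
    return age

toy = pd.DataFrame({'birthday': pd.to_datetime(['1983-08-31', '1990-01-01'])})
toy['age'] = toy.apply(compute_age_sketch, axis=1)
print(toy)
```

Using a fixed reference date rather than today's date keeps the feature reproducible when the notebook is run again later.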

### Merge raters' values

Here, as we want to determine the skin color of a player from the given data, it is important that at least one rater has given a rating.

In [None]:
soccer_data['rater1'].value_counts(dropna=False)

In [None]:
soccer_data['rater2'].value_counts(dropna=False)

As we can see, there are some players for whom there is no rating from rater 1 or rater 2 (in particular, it seems that when rater 1 is missing, rater 2 is missing too).

We decide to remove all dyads for which we have no rating at all.

Note: We could have chosen to drop all rows where there is no photo ID, but it is better to consider the raters directly since, in theory (!), nothing prevents a player from having a photo ID but no ratings.

In [None]:
soccer_data_clean = soccer_data[soccer_data['rater1'].notnull() | soccer_data['rater2'].notnull()].copy()

> According to the provided notebook, we take the mean --> Thus, we decide to combine these data into a single rating. Here, we suppose that the raters' votes are independent (no influence between the two raters) and that the raters were honest, since we have no way to check their accuracy. So, we use the mean to compute this single rating.

> We prefer to work with integers rather than floats, hence the multiplication by 100

In [None]:
soccer_data_clean['rater'] = np.floor(soccer_data_clean[['rater1', 'rater2']].mean(axis=1) * 100)
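
As a small illustration of the line above, here is a sketch on a toy frame. The three rows are made up for the example; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rater1': [0.25, 0.75, np.nan],
                    'rater2': [0.50, 0.75, 1.00]})
# mean(axis=1) skips NaN by default, so a single available rating is used as is
toy['rater'] = np.floor(toy[['rater1', 'rater2']].mean(axis=1) * 100)
print(toy['rater'].tolist())
```

Note that the result is still stored as floats (`37.0`, `75.0`, `100.0`); an explicit `.astype(int)` would be needed to get a true integer column.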

In [None]:
rater_distinct_values = soccer_data_clean['rater'].value_counts(dropna=False, sort=False).plot(kind='bar')
rater_distinct_values.set_ylabel('Number of rates')
rater_distinct_values.set_xlabel('Rate values')
rater_distinct_values.set_title('Number of rates by values')

### Manage null values

Now, let's check whether there are any null values in the data.

In [None]:
count_null_values = soccer_data_clean.isnull().sum()
count_null_values[count_null_values > 0]

For position, we do nothing at this stage. However, for height and weight, we decide to replace null values with the mean of the column.

In [None]:
soccer_data_clean[['height', 'weight']] = soccer_data_clean[['height', 'weight']].fillna(soccer_data_clean[['height', 'weight']].mean())

> Missing positions are filled with the placeholder category `'Unknown'`

In [None]:
soccer_data_clean['position'] = soccer_data_clean['position'].fillna('Unknown')

> Removal of rows with missing Alpha_3, meanIAT or meanExp

IAT and Explicit bias scores are very important, so we decide to drop any dyad where these values are missing.

In [None]:
soccer_data_clean = soccer_data_clean.dropna(axis=0, how='any', subset=['Alpha_3', 'meanIAT', 'meanExp'])

In [None]:
count_null_values = soccer_data_clean.isnull().sum()
print('Are there null values? ' + str((count_null_values > 0).any()))

## 1 - Processing data for machine learning

### Manage the dimension of dyad

Let's describe all data related to IAT and Explicit bias scores.

In [None]:
soccer_data_clean[['meanIAT', 'nIAT', 'seIAT', 'meanExp', 'nExp', 'seExp']].describe()

> Comment here and look at the plots

> As we can see, the minimum of nExp and nIAT is two. For such rows, the IAT and Explicit bias scores rely on very few respondents and are not representative of the entire population, so we can deduce very little from them.
We need to take this into account in the following weighting:

In [None]:
# The higher the standard error, the less we can trust the result of the tests.
# We reverse the values so that rows with a low standard error get the highest
# weight and rows with a high standard error get the lowest one.
reverse_seIAT = abs(soccer_data_clean['seIAT'] - soccer_data_clean['seIAT'].max())
reverse_seExp = abs(soccer_data_clean['seExp'] - soccer_data_clean['seExp'].max())
# So as not to penalize one test compared to the other, both weights are scaled to a maximum of 1
soccer_data_clean['reverse_seIAT'] = reverse_seIAT / reverse_seIAT.max()
soccer_data_clean['reverse_seExp'] = reverse_seExp / reverse_seExp.max()

# Compute the weighted score, taking into account the standard error and the value of
# the different tests.
soccer_data_clean['associationScore'] = (soccer_data_clean['meanIAT'] * soccer_data_clean['reverse_seIAT'] +
                                         soccer_data_clean['meanExp'] * soccer_data_clean['reverse_seExp']) / 2


# Plot the result to check that not too many values are equal to 0.
#soccer_data_clean['reverse_seIAT'].plot()
#soccer_data_clean['reverse_seExp'].plot()
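
To make the weighting above concrete, here is a sketch that applies the same formula to a toy frame. The numbers are invented for illustration and do not come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({'meanIAT': [0.4, 0.3, 0.5],
                    'seIAT':   [0.001, 0.003, 0.005],
                    'meanExp': [0.6, 0.2, 0.8],
                    'seExp':   [0.01, 0.02, 0.05]})

for test in ['IAT', 'Exp']:
    se = toy['se' + test]
    # 1 for the most reliable row (smallest standard error), 0 for the least reliable
    toy['reverse_se' + test] = (se.max() - se) / (se.max() - se).max()

toy['associationScore'] = (toy['meanIAT'] * toy['reverse_seIAT'] +
                           toy['meanExp'] * toy['reverse_seExp']) / 2
print(toy[['reverse_seIAT', 'reverse_seExp', 'associationScore']])
```

One consequence of this scheme is visible in the last row: the row with the largest standard error gets a weight of exactly 0, so its score is 0 whatever its mean values are.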

<p style="color:red;text-align:justify;">To schematize, there are four cases we may consider, for each player, regarding skin color and IAT and Explicit bias scores' influence.

1. Referee's country has a positive score (IAT or Explicit bias) and player is black (rate from 0.5 to 1).
2. Referee's country has a positive score (IAT or Explicit bias) and player is white (rate from 0 to 0.5).
3. Referee's country has a negative score (IAT or Explicit bias) and player is black (rate from 0.5 to 1).
4. Referee's country has a negative score (IAT or Explicit bias) and player is white (rate from 0 to 0.5).

Note: As a reminder, a positive score for IAT or Explicit bias corresponds to faster white | good, black | bad associations and to greater feelings of warmth toward whites versus blacks (respectively). The contrary is true if the score is negative.

Now, we must make some assumptions and important decisions.

The first case is the one we focus on the most. Indeed, we assume that there is some correlation between the number of red/yellow cards given to a player and the referee's country (which is basically why IAT and Explicit bias scores are provided here). Thus, in this case, we increase the number of yellow/red cards to take the bias into account.

The other cases are less interesting. For example, for the second and third cases, we assume that if a yellow/red card was given, the skin color of the player was not taken into account.

We do not deny that a referee may withhold a yellow/red card he should have given because the player's skin color is the one he "favours" (second and third cases), or that a referee may give more yellow/red cards to a white player because his "favourite" skin color is black (opposite of the first case). However, if we also adjusted the number of cards in those cases, it would be difficult to highlight racist behaviour and to fully exploit the number of red/yellow cards (adjusting the data in all four cases would simply shift the values).

Note: Our decision is subjective, but it reflects the most common problem in soccer today (racism targets black players more often than white players). Also, most referees come from countries where white people are the majority (Europe, North America).
</p>

We define a function that increases the number of yellow/red cards if and only if the player is black, and we apply it to each dyad.

> Doc pondered_number_of_cards

In [None]:
for column_name in ['yellowCards', 'yellowReds', 'redCards']:
    soccer_data_clean['regulated' + column_name[0].upper() + column_name[1:]] = soccer_data_clean.apply(func=utils.regulate_number_of_cards, args=(column_name,), axis=1)

In [None]:
# Keep only the dyads with at least one card of each kind
index = (soccer_data_clean[['yellowCards', 'yellowReds', 'redCards']] > 0).all(axis=1)
soccer_data_clean[['playerShort', 'yellowCards', 'regulatedYellowCards', 'yellowReds', 'regulatedYellowReds', 'redCards', 'regulatedRedCards']][index].head()

> If possible, plot the difference between our weighting and the initial values

In [None]:
soccer_data_clean.plot.scatter(x='yellowCards', y='regulatedYellowCards');
soccer_data_clean.plot.scatter(x='yellowReds', y='regulatedYellowReds');
soccer_data_clean.plot.scatter(x='redCards', y='regulatedRedCards');

### Aggregate by players

Then, we sum all the statistics as we want to have one row for each player.

> We simply sum these attributes

In [None]:
global_statistics = soccer_data_clean[['playerShort', 'games', 'victories', 'defeats', 'goals', 'regulatedYellowCards', 'regulatedYellowReds', 'regulatedRedCards']].groupby('playerShort').sum()
global_statistics.head()

Finally, we create our final DataFrame containing information about a player and some statistics for his career.

> We assume that a player's characteristics are the same in every row -> we take the first row

> At the end of this part, the size of the DataFrame is substantially reduced. However, we draw the reader's attention to the fact that we either created new features that include data from previous ones (as with the weighting of cards, which uses the IAT and Explicit bias scores) or dropped features that are not useful for what we plan to do (like `photoID` or `refNum`), so we can safely continue our analysis.

In [None]:
players = soccer_data_clean.groupby('playerShort').first()

# Contains all features
soccer_data_all_features = global_statistics.join(players[['age', 'height', 'weight', 'rater']]).copy()

for feature in ['club', 'leagueCountry', 'position']:
    soccer_data_all_features = soccer_data_all_features.merge(pd.get_dummies(players[feature]), left_index=True, right_index=True)
    
soccer_data_all_features.head()
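
The loop above one-hot encodes each categorical column and merges the result on the player index. A minimal sketch of the same idea on a toy frame follows. The player names are made up, and the `prefix` argument is an addition that the notebook does not use; it only keeps the origin of each indicator column visible.

```python
import pandas as pd

toy_stats = pd.DataFrame({'games': [10, 3, 7]}, index=['player-a', 'player-b', 'player-c'])
toy_players = pd.DataFrame({'position': ['Goalkeeper', 'Unknown', 'Goalkeeper']},
                           index=toy_stats.index)

# One indicator column per category; the prefix keeps the origin of each column visible
dummies = pd.get_dummies(toy_players['position'], prefix='position')
toy_features = toy_stats.merge(dummies, left_index=True, right_index=True)
print(toy_features)
```

Without a prefix, two categorical columns that share a category label would produce indicator columns with the same name, which is worth checking before merging several encodings.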

### Categorize the skin color

> We split the ratings into two categories

In [None]:
classes_column = 'raterBinarized'
soccer_data_all_features[classes_column] = pd.cut(soccer_data_all_features['rater'], [0, 51, 101], labels=['light skin', 'dark skin'], right=False)

In [None]:
colorSkin = pd.cut(soccer_data_all_features['rater'], [0, 26, 51, 76, 101], labels=['very light skin','light skin','dark skin','very dark skin'], right=False)
colorSkin.value_counts().plot(kind='pie', figsize=(6, 6))
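
The effect of `right=False` on the bin edges can be checked on a few hand-picked rating values (chosen here only for illustration):

```python
import pandas as pd

ratings = pd.Series([0, 25, 50, 62, 100])
# right=False makes the bins [0, 51) and [51, 101)
binary = pd.cut(ratings, [0, 51, 101], labels=['light skin', 'dark skin'], right=False)
print(binary.tolist())
```

With these edges, a rating of exactly 50 falls in the `light skin` class, and the upper edge of 101 is what keeps a rating of 100 inside the last bin.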

## 2 - Question 1 : from player description to skin color

**Train a sklearn.ensemble.RandomForestClassifier that given a soccer player description outputs his skin color. Show how different parameters passed to the Classifier affect the overfitting issue. Perform cross-validation to mitigate the overfitting of your model. Once you assessed your model, inspect the feature_importances_ attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even before feeding them to the classifier), can you obtain a substantially different feature_importances_ attribute?**

### Select features

In [None]:
all_features = [col for col in soccer_data_all_features.columns if col not in ['rater', classes_column]]

### RandomForest without cross validation

> Doc get_random_forest and run_once

In [None]:
cfs = utils.get_random_forests()

In [None]:
results_single = utils.run_once(cfs, soccer_data_all_features, all_features, classes_column)
results_single

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(results_single[0]['confusion_matrix'], classes=['dark skin', 'light skin'], title='Confusion matrix, without normalization')
plt.show()

### Cross-validation

> Doc run_cross_validation

In [None]:
cfs = utils.get_random_forests(nb_trees=[10,100], min_leaf=[2], min_split=[2])
results_cv = utils.run_cross_validation(cfs, soccer_data_all_features, all_features, classes_column)
results_cv
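
`utils.run_cross_validation` is not shown here. As a rough sketch of the underlying idea, the following compares training accuracy with cross-validated accuracy using scikit-learn directly. The synthetic data and the parameter values are assumptions chosen for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the player features: only the first column carries signal
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

scores = {}
for min_leaf in [1, 10]:
    clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=min_leaf, random_state=0)
    train_acc = clf.fit(X, y).score(X, y)             # accuracy on the data used for fitting
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # accuracy on held-out folds
    scores[min_leaf] = (train_acc, cv_acc)
    print(min_leaf, round(train_acc, 3), round(cv_acc, 3))
```

A large gap between the two numbers is the usual sign of overfitting; parameters such as `min_samples_leaf` constrain the trees and typically narrow that gap.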

### Feature importances

In [None]:
features_importance = utils.retriew_above_thresold(results_single[0]['classifier'], all_features, 7).sort_values(ascending=False)

features_importance_plot = features_importance.plot(kind='bar')
features_importance_plot.set_ylabel('Importance')
features_importance_plot.set_xlabel('Features')
features_importance_plot.set_title('Features\' importances')

> Clean up?

Useful links:

(Plot)
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

(Indexes and names)
http://stackoverflow.com/questions/22361781/how-does-sklearn-random-forest-index-feature-importances

http://blog.yhat.com/posts/random-forests-in-python.html
https://www.dataquest.io/blog/machine-learning-python/
https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests

http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest

### Try with important features only

In [None]:
soccer_data_selected_features = soccer_data_all_features[np.concatenate((features_importance.index.values, [classes_column]), axis=0)].copy()
selected_features = features_importance.index.values

In [None]:
cfs = utils.get_random_forests(nb_trees=[10,100], min_leaf=[2], min_split=[2])
results_cv = utils.run_cross_validation(cfs, soccer_data_selected_features, selected_features, classes_column)
results_cv

## 3 - Bonus : learning curves

In [None]:
# Learning curve, default parameters of the classifier: n_est = 10, min_leaf = 2, min_split = 2
cfs = utils.get_random_forests()
utils.plot_learning_curve(cfs[0], 'Learning curves', soccer_data_all_features, all_features, classes_column, ylim=(0.8, 0.95))
plt.show()

# Learning curve with n_est = 100, min_leaf = 10, min_split = 10
#cfs = utils.get_random_forests(nb_trees=[100], min_leaf=[10], min_split=[10])
#utils.plot_learning_curve(cfs[0], 'Learning curves', soccer_data_all_features, all_features, classes_column, ylim=(0.8, 0.95))
#plt.show()
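
`utils.plot_learning_curve` is not shown in this notebook. A minimal sketch of how the underlying scores can be obtained with scikit-learn's `learning_curve` is given below. The data is synthetic, and the classifier parameters only mirror the defaults mentioned in the comment above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the player features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=10, min_samples_leaf=2, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    clf, X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# One row per training size, one column per cross-validation fold
for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))
```

Plotting the two mean curves against `sizes` gives the learning curve: a training score that stays well above the validation score suggests overfitting.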

In [None]:
# Learning curve, default parameters of the classifier: n_est = 10, min_leaf = 2, min_split = 2
cfs = utils.get_random_forests()
plt = utils.learning_curve_mean_squared_error(cfs[0],soccer_data_all_features, all_features, classes_column,5,50,ylim=(0.8, 0.95))
plt.show()

# Learning curve with n_est = 100, min_leaf = 10, min_split = 10
cfs = utils.get_random_forests(nb_trees=[100], min_leaf=[10], min_split=[10])
plt = utils.learning_curve_mean_squared_error(cfs[0],soccer_data_all_features, all_features, classes_column,5,50,ylim=(0.8, 0.95))
plt.show()

## 4 - Question 2 : clustering the soccer players

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn import preprocessing

from ipywidgets import FloatProgress
from IPython.display import display

import itertools

> I assume the goal is not to give the `raterBinarized` attribute to the clustering, and then to compare the clusters it proposes with the skin color values we already know. So, as a first step, I retrieve y in binary form:

In [None]:
y = preprocessing.LabelEncoder().fit_transform(soccer_data_all_features[classes_column])

> As for how to remove features, I am not sure how to proceed. I chose to remove the least important feature at each iteration, according to the previous results, so I retrieve the importance of all features:

In [None]:
all_features_importance = utils.retriew_above_thresold(results_single[0]['classifier'], all_features, -1).sort_values(ascending=True)

> However, shouldn't we do it with all subsets? I hope not: as we have 124 features, the number of subsets is about 1.063382397×10^37 (2^123) ...

> Function that computes how well the players are separated with respect to skin color. Since we do not know whether the clustering assigns 1 to white players and 0 to black players or the opposite, we must handle both cases. To do so, we count the number of matches and mismatches between the two columns (estimated and real), then take the absolute value of their difference. We divide by the size of the dataset to get a proportion. Thus, if the two columns are very different or very similar, the value will be high, which indicates a strong separation between the two colors.

In [None]:
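def compute_skin_separation_percentage(estimation, real_classes):
    """Return |#matches - #mismatches| / n, a value between 0 and 1."""
    if len(estimation) != len(real_classes):
        print('The two arrays do not have the same size')
        return 0

    size = len(estimation)
    # Number of positions where the cluster label equals the real class
    equal = int(np.sum(np.asarray(estimation) == np.asarray(real_classes)))
    different = size - equal
    # Cluster labels are arbitrary (0 and 1 may be swapped), hence the absolute value
    return abs(equal - different) / size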
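
If `a` is the fraction of positions where the two columns agree, this measure equals `|2a - 1|`: it is 1 when the clusters match the classes perfectly (with or without swapped labels) and 0 when agreement is at chance level. The sketch below re-implements the same formula under the hypothetical name `separation` so that it can be run on its own with hand-made labels:

```python
import numpy as np

def separation(estimation, real_classes):
    """|#matches - #mismatches| / n, invariant to swapping the two cluster labels."""
    estimation = np.asarray(estimation)
    real_classes = np.asarray(real_classes)
    equal = int(np.sum(estimation == real_classes))
    return abs(equal - (len(estimation) - equal)) / len(estimation)

real = [0, 0, 1, 1]
print(separation([0, 0, 1, 1], real))  # perfect agreement
print(separation([1, 1, 0, 0], real))  # same partition, labels swapped
print(separation([0, 1, 0, 1], real))  # clusters unrelated to the classes
```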

> Test with K-Means++, since we already know the number of clusters. For each run we keep the skin color separation percentage, the estimator, the silhouette score and the features used:

In [None]:
results_kmeans = []

progress_bar = FloatProgress(min=0, max=len(all_features_importance))
display(progress_bar)

for i in range(0, len(all_features_importance)):
    
    X = soccer_data_all_features[all_features_importance.index[i:]]
    estimator = KMeans(init='k-means++', n_clusters=2, n_init=10).fit(X)
    sil_score = silhouette_score(X, estimator.labels_)
    skin_separation = compute_skin_separation_percentage(estimator.labels_, y)
    
    results_kmeans.append({
            'features': set(all_features_importance.index[i:]),
            'silhouette_score': sil_score,
            'estimator': estimator,
            'skin_separation': skin_separation
        })

    progress_bar.value += 1
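
To show what the silhouette score measures in the loop above, here is a sketch on two synthetic, well-separated blobs. The data and `random_state` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Two well-separated synthetic blobs standing in for the player features
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

estimator = KMeans(init='k-means++', n_clusters=2, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, estimator.labels_)
print(round(score, 3))
```

The score lies between -1 and 1, and values close to 1 indicate compact, well-separated clusters. Since K-Means relies on Euclidean distances, features on very different scales (for example a number of games next to 0/1 indicator columns) can dominate the result, so standardizing the columns first, for instance with `preprocessing.StandardScaler`, may be worth trying.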

> As said above, we could try all subsets of features. To avoid having too many sets, I only try the subsets of the features selected at the end of part 1:

In [None]:
def findsubsets(values,groupby):
    return set(itertools.combinations(values, groupby))

all_subsets_features = set()
for i in range(1, len(selected_features)):
    all_subsets_features |= findsubsets(selected_features, i)
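
The number of subsets produced by this loop can be checked on a short list. The four feature names below are used only for illustration.

```python
import itertools

def findsubsets(values, groupby):
    return set(itertools.combinations(values, groupby))

features = ['age', 'height', 'weight', 'games']
subsets = set()
for size in range(1, len(features)):
    subsets |= findsubsets(features, size)

# 2**4 - 2 = 14: every subset except the empty one and the full one
print(len(subsets))
```

Because the range stops at `len(features) - 1`, the full feature set is not part of the enumeration, and the count grows as `2**n - 2`, which is why the run gets long quickly.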

In [None]:
results_selected_features = []

total_iteration = len(all_subsets_features)
progress_bar = FloatProgress(min=0, max=total_iteration)
display(progress_bar)

print('Total number of iterations (very long): ' + str(total_iteration))

for features in all_subsets_features:

    X = soccer_data_all_features[list(features)]
    estimator = KMeans(init='k-means++', n_clusters=2, n_init=10).fit(X)
    sil_score = silhouette_score(X, estimator.labels_)
    skin_separation = compute_skin_separation_percentage(estimator.labels_, y)
    
    results_selected_features.append({
            'features': set(features),
            'silhouette_score': sil_score,
            'estimator': estimator,
            'skin_separation': skin_separation
        })

    progress_bar.value += 1