# Homework 4 - Applied ML

In [None]:
# NumPy and pandas
import numpy as np
import pandas as pd

# matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Utils
import collections
from pprint import pprint
from dateutil import relativedelta
from datetime import date
import itertools

# scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score, train_test_split

# 0 - Prepare data

In [None]:
soccer_data = pd.read_csv('CrowdstormingDataJuly1st.csv', sep=',', parse_dates=['birthday'])
soccer_data.head()

First, we must understand the data we will use and clean it if needed.

A detailed description of each column is provided in the file DATA.md. We invite the reader to read these descriptions before continuing.

A first modification we propose is to compute each player's age from the given birthday and to use this feature, if needed, instead of the birthday column. This makes sense for a random forest, where each decision tree splits the data on feature values. To keep any future model usable with other data, we prefer to compute the age relative to the moment the data was collected.

In [None]:
def compute_age(row):
    '''
    Given a dyad row, return the player's age in full years as of 2013-01-01.
    
    row: Row of the DataFrame, representing a dyad which contains a player.
    '''
    data_date = date(2013, 1, 1)
    delta = relativedelta.relativedelta(data_date, row['birthday'])
    return delta.years

soccer_data['age'] = soccer_data.apply(compute_age, axis=1)
soccer_data.head()
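
As a quick sanity check on the age computation, the sketch below reproduces the same calculation with the standard library only. The helper name `age_on` and the sample birthdays are made up for illustration; the reference date 2013-01-01 is the one used in `compute_age` above.

```python
from datetime import date

def age_on(birthday, reference=date(2013, 1, 1)):
    """Return the number of full years between birthday and reference."""
    years = reference.year - birthday.year
    # Subtract one year if the birthday has not yet occurred in the reference year
    if (reference.month, reference.day) < (birthday.month, birthday.day):
        years -= 1
    return years

print(age_on(date(1983, 8, 31)))  # birthday later in the year: 29
print(age_on(date(1990, 1, 1)))   # birthday on the reference day: 23
```

Both values should match what `relativedelta(...).years` returns for the same dates.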

Since we want to determine a player's skin color from the given data, at least one rater must have provided a rating.

In [None]:
soccer_data['rater1'].value_counts(dropna=False)

In [None]:
soccer_data['rater2'].value_counts(dropna=False)

As we can see, some players have no rating from rater 1 or rater 2 (in particular, it seems that whenever rater 1 is missing, rater 2 is missing too).

We decide to remove every dyad that has no rating at all.

Note: We could have chosen to drop all rows without a photo ID, but it is better to check the raters directly: in theory (!), nothing prevents a player from having a photo ID but no ratings.

In [None]:
soccer_data_clean = soccer_data[soccer_data['rater1'].notnull() | soccer_data['rater2'].notnull()].copy()
soccer_data_clean.head()

In [None]:
soccer_data_clean['rater2'].value_counts(dropna=False)

In [None]:
soccer_data_clean['rater1'].value_counts(dropna=False)

As we can see, the two raters quite often disagree. We therefore combine their ratings into a single value.

Here, we assume that the raters' votes are independent (neither rater influenced the other) and that the raters were honest, since we have no more exact reference to check them against. So we use the mean to compute this single rating.

In [None]:
soccer_data_clean['rater'] = np.floor(soccer_data_clean[['rater1', 'rater2']].mean(axis=1) * 100)

In [None]:
soccer_data_clean['rater'].value_counts(dropna=False)
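
To make the combination step concrete, here is a small sketch on made-up ratings (the values in `ratings` are hypothetical, not taken from the dataset). It shows that `mean(axis=1)` skips missing values by default, so a dyad rated by a single rater simply keeps that rater's value.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on the 0-1 scale; the last row has only one rater
ratings = pd.DataFrame({'rater1': [0.25, 0.75, np.nan],
                        'rater2': [0.50, 0.75, 1.00]})

# Same formula as above: row-wise mean, rescaled to 0-100 and floored
combined = np.floor(ratings[['rater1', 'rater2']].mean(axis=1) * 100)
print(combined.tolist())  # [37.0, 75.0, 100.0]
```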

Now, let's check whether the data contains any null values.

In [None]:
soccer_data_clean.isnull().sum()

For position, we do nothing at this stage. For height and weight, however, we replace null values with the column mean.

In [None]:
soccer_data_clean[['height', 'weight']] = soccer_data_clean[['height', 'weight']].fillna(soccer_data_clean[['height', 'weight']].mean())

IAT and Explicit bias scores are central to our analysis, so we drop any dyad where these values are missing.

In [None]:
soccer_data_clean = soccer_data_clean[soccer_data_clean['meanIAT'].notnull() & soccer_data_clean['meanExp'].notnull()].copy()

Let's describe all data related to IAT and Explicit bias scores.

In [None]:
soccer_data_clean[['meanIAT', 'nIAT', 'seIAT', 'meanExp', 'nExp', 'seExp']].describe()

We then average the two mean scores into a single `associationScore` per dyad.

In [None]:
soccer_data_clean['associationScore'] = (soccer_data_clean['meanIAT'] + soccer_data_clean['meanExp']) / 2
soccer_data_clean.head()

<p style="color:red;">To schematize, there are four cases we may consider for each player, depending on the player's skin color and on the sign of the IAT and Explicit bias scores.

1. The referee's country has a positive score (IAT or Explicit bias) and the player is black (rating from 0.5 to 1).
2. The referee's country has a positive score (IAT or Explicit bias) and the player is white (rating from 0 to 0.5).
3. The referee's country has a negative score (IAT or Explicit bias) and the player is black (rating from 0.5 to 1).
4. The referee's country has a negative score (IAT or Explicit bias) and the player is white (rating from 0 to 0.5).

Note: Recall that a positive IAT or Explicit bias score corresponds, respectively, to faster white | good, black | bad associations and to greater feelings of warmth toward whites versus blacks. The opposite holds if the score is negative.

Now, we must make some assumptions and important decisions.

The first case is the one we focus on the most. We assume that there is some correlation between the number of red/yellow cards given to a player and the referee's country (which is essentially why the IAT and Explicit bias scores are provided). In that case, we increase the number of yellow/red cards to account for the bias.

The other cases are less interesting. For example, in the second and third cases, we assume that if a yellow/red card was given, the player's skin color played no role.

We don't deny that a referee may withhold a yellow/red card he should have given because the player's skin color is the one he "favours" (second and third cases), or that a referee may give more yellow/red cards to a white player because his "favoured" skin color is black (the mirror of the first case). However, if we also adjusted the number of cards in those cases, it would be difficult to highlight racist behaviour and to make full use of the number of red/yellow cards: adjusting the data in all four cases would simply shift the values.

Note: Our decision is subjective, but it reflects the most common problem in soccer today (racism against black players is more common than against white players). Also, most referees come from countries where white people are the majority (Europe, North America):
</p>

In [None]:
soccer_data_clean[['refNum', 'Alpha_3']].drop_duplicates('refNum')['Alpha_3'].value_counts()

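We define a function that, for each dyad, weights the number of yellow/red cards by the player's skin color rating and the referee country's association score. With a positive score, the count increases in proportion to how dark the rating is. With a negative score, the code as written reduces the count in proportion to how light the rating is.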

In [None]:
def pondered_number_of_cards(row, cardsName):
    '''
    Given a dyad, weight the number of received cards by the player's skin color
    rating and the referee country's association score.
    
    row: Row of the DataFrame, representing a dyad which contains a player.
    cardsName: Name of the column holding the type of cards to weight.
    '''

    nbCards = row[cardsName]
    
    if row['associationScore'] > 0:
        coef = (row['rater'] / 100) * row['associationScore']
    elif row['associationScore'] < 0:
        coef = (1 - (row['rater'] / 100) ) * row['associationScore']
    else:
        coef = 0

    nbCards += nbCards * coef

    return nbCards

soccer_data_clean['ponderedYellowCards'] = soccer_data_clean.apply(func=pondered_number_of_cards, args=('yellowCards',), axis=1)
soccer_data_clean['ponderedYellowReds'] = soccer_data_clean.apply(func=pondered_number_of_cards, args=('yellowReds',), axis=1)
soccer_data_clean['ponderedRedCards'] = soccer_data_clean.apply(func=pondered_number_of_cards, args=('redCards',), axis=1)

In [None]:
soccer_data_clean[['yellowCards', 'yellowReds', 'redCards', 'ponderedYellowCards', 'ponderedYellowReds', 'ponderedRedCards']].describe()
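
To see what the weighting does to a single dyad, the sketch below mirrors the logic of `pondered_number_of_cards` on plain numbers. The helper name `weighted_cards` and the input values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def weighted_cards(nb_cards, rater, association_score):
    """Same formula as pondered_number_of_cards; rater is on the 0-100 scale."""
    if association_score > 0:
        coef = (rater / 100) * association_score
    elif association_score < 0:
        coef = (1 - rater / 100) * association_score
    else:
        coef = 0
    return nb_cards + nb_cards * coef

# Positive score, dark rating: 4 + 4 * (0.75 * 0.5) = 5.5
print(weighted_cards(4, 75, 0.5))
# Negative score, dark rating: 4 + 4 * (0.25 * -0.5) = 3.5
print(weighted_cards(4, 75, -0.5))
# Positive score, very light rating: the count is unchanged (4.0)
print(weighted_cards(4, 0, 0.5))
```

Note that a negative association score lowers the count, which is worth keeping in mind when reading the summary statistics above.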

Then, we sum all the statistics, as we want one row per player.

In [None]:
global_statistics = soccer_data_clean[['playerShort', 'games', 'victories', 'defeats', 'goals', 'ponderedYellowCards', 'ponderedYellowReds', 'ponderedRedCards']].groupby('playerShort').sum()
global_statistics.head()

Finally, we create our final DataFrame containing information about a player and some statistics for his career.

In [None]:
players = soccer_data_clean.groupby('playerShort').first()
soccer_data_final = global_statistics.join(players[['age', 'height', 'weight', 'rater']])

for feature in ['club', 'leagueCountry', 'position']:
    global_statistics = global_statistics.merge(pd.get_dummies(players[feature]), left_index=True, right_index=True)

soccer_data_final_all_features = global_statistics.join(players[['age', 'height', 'weight', 'rater']])
soccer_data_final_all_features.head()

#soccer_data_final = global_statistics.join(players[['age', 'height', 'weight', 'rater']])
#soccer_data_final.head()

Important note:

At the end of this part, the DataFrame's size has been substantially reduced. However, we draw the reader's attention to the fact that we either created new features that incorporate previous ones (for example, the weighted card counts, which use the IAT and Explicit bias scores) or dropped features that are not useful for what we plan to do (like photoID or refNum), so we can safely continue our analysis.

# 1 - From player description to skin color

**Train a sklearn.ensemble.RandomForestClassifier that given a soccer player description outputs his skin color. Show how different parameters passed to the Classifier affect the overfitting issue. Perform cross-validation to mitigate the overfitting of your model. Once you assessed your model, inspect the feature_importances_ attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even before feeding them to the classifier), can you obtain a substantially different feature_importances_ attribute?**

First, we convert the skin color rating into a categorical variable, as it is the output feature here.

In [None]:
soccer_data_final['rater'] = pd.cut(soccer_data_final['rater'], [0, 26, 51, 76, 101], labels=['very light skin','light skin','dark skin','very dark skin'], right=False)

In [None]:
soccer_data_final['rater'].value_counts(dropna=False)
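
The effect of the bin edges and of `right=False` is easiest to see on a few boundary values. The sketch below uses a made-up `scores` series; with `right=False` each bin includes its left edge and excludes its right edge, so 25 falls in the first bin and 26 in the second.

```python
import pandas as pd

# Hypothetical ratings on the 0-100 scale, chosen around the bin edges
scores = pd.Series([0, 25, 26, 50, 75, 100])
labels = ['very light skin', 'light skin', 'dark skin', 'very dark skin']

# Bins are [0, 26), [26, 51), [51, 76), [76, 101)
binned = pd.cut(scores, [0, 26, 51, 76, 101], labels=labels, right=False)
print(binned.tolist())
```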

(See useful links below for source.)

In [None]:
features = [col for col in soccer_data_final.columns if col not in ['rater']]
print(features)

In [None]:
X = soccer_data_final[features]
label_encoder = preprocessing.LabelEncoder()
y = label_encoder.fit_transform(soccer_data_final['rater'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

In [None]:
label_binarizer = preprocessing.LabelBinarizer()
y_test_binary = label_binarizer.fit_transform(y_test)

In [None]:
# ONE-TIME EXECUTION

# Function is defined in sklearn documentation and was slightly modified to fit with our needs and our situation
# See: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
y_probabilities = rfc.predict_proba(X_test)
cm = metrics.confusion_matrix(y_test, predictions)

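# Plot non-normalized confusion matrix
# label_encoder.classes_ follows the encoded class order used by confusion_matrix,
# whereas unique() returns labels in order of appearance
plt.figure()
plot_confusion_matrix(cm, classes=label_encoder.classes_, title='Confusion matrix, without normalization')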

plt.show()

In [None]:
# http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
# CHECK FOR CORRECT AVERAGE METHOD... 

### Single fit/predict execution

In [None]:
results = collections.defaultdict(list)

# We loop over different parameters' values to compute some metrics and find a good model
for n_est in [1, 10, 100, 1000]:
    for min_leaf in range(10,11):
        for min_split in range (10,11):
            clf = RandomForestClassifier(n_jobs=-1, n_estimators=n_est, min_samples_leaf=min_leaf, min_samples_split=min_split)
            clf.fit(X_train, y_train)
            predictions = clf.predict(X_test)
            y_probabilities = clf.predict_proba(X_test)
            accuracy = metrics.accuracy_score(y_test, predictions)
            precision = metrics.precision_score(y_test, predictions, average='macro')
            recall = metrics.recall_score(y_test, predictions, average='macro')
            f1 = metrics.f1_score(y_test, predictions, average='macro')
            roc_auc_score = metrics.roc_auc_score(y_test_binary, y_probabilities)
            
            results['min_leaf'].append(min_leaf)
            results['min_split'].append(min_split)
            results['n_est'].append(n_est)
            results['accuracy'].append(accuracy)
            results['precision'].append(precision)
            results['recall'].append(recall)
            results['f1'].append(f1)
            results['roc_auc'].append(roc_auc_score)
            
            print('min_leaf: ' + str(min_leaf) + ' min_split: '+ str(min_split) + ' n_est: ' + str(n_est))
            print('\t' + 'accuracy: ' + str(accuracy))
            print('\t' + 'precision: ' + str(precision))
            print('\t' + 'recall: ' + str(recall))
            print('\t' + 'f1: ' + str(f1))
            print('\t' + 'roc_auc: ' + str(roc_auc_score))
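
One common way to expose overfitting is to compare the accuracy on the training set with the accuracy on the held-out set: a large gap suggests the forest has memorized the training data. The sketch below illustrates this on synthetic data from `make_classification` (a stand-in, not the soccer dataset); the names `X_demo`, `y_demo` and `gaps` are introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the player features (illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=4,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

gaps = {}
for min_leaf in [1, 10, 50]:
    model = RandomForestClassifier(n_estimators=100, min_samples_leaf=min_leaf,
                                   random_state=0)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    # Train/test gap: the larger it is, the more the model overfits
    gaps[min_leaf] = train_acc - test_acc
    print('min_leaf:', min_leaf, 'train:', train_acc, 'test:', test_acc)
```

Larger values of `min_samples_leaf` constrain each tree more strongly, which usually narrows the gap at the cost of some training accuracy.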

### Cross-validation

In [None]:
y_binary = label_binarizer.fit_transform(soccer_data_final['rater'])

In [None]:
cv_results = collections.defaultdict(list)

# We repeat the iterations but, this time, we use cross-validation
for n_est in [1, 10, 100, 1000]:
    for min_leaf in range(10,11):
        for min_split in range (10,11):
            # cross validation using RandomForestClassifier
            clf = RandomForestClassifier(n_jobs=-1, n_estimators=n_est, min_samples_leaf=min_leaf, min_samples_split=min_split)
            cv_accuracy = cross_val_score(clf, soccer_data_final[features], soccer_data_final['rater'], cv=10, scoring='accuracy')
            cv_roc_auc = cross_val_score(clf, soccer_data_final[features], y_binary, cv=10, scoring='roc_auc')

            # adding result to the dic.
            cv_results['min_leaf'].append(min_leaf)
            cv_results['min_split'].append(min_split)
            cv_results['n_est'].append(n_est)
            cv_results['cv_accuracy_mean'].append(np.mean(cv_accuracy))
            cv_results['cv_roc_auc_mean'].append(np.mean(cv_roc_auc))
            
            # Print results
            print('min_leaf: ' + str(min_leaf) + ' min_split: '+ str(min_split) + ' n_est: ' + str(n_est))
            print('\tAccuracy (mean): ' + str(np.mean(cv_accuracy)))
            print('\tROC AUC (mean): ' + str(np.mean(cv_roc_auc)))
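
As a possible alternative to calling `cross_val_score` once per metric, `cross_validate` from `sklearn.model_selection` can compute several metrics in one pass and also return training scores. The sketch below runs it on synthetic data (an assumption for illustration; `X_demo`, `y_demo` and `cv_demo` are not part of the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic 4-class data standing in for the player features (illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=4,
                                     n_classes=4, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=50, min_samples_leaf=10, random_state=0)
cv_demo = cross_validate(clf_demo, X_demo, y_demo, cv=5,
                         scoring=['accuracy', 'f1_macro'],
                         return_train_score=True)

# Mean training vs. validation score for each metric
for metric in ['accuracy', 'f1_macro']:
    print(metric,
          'train:', cv_demo['train_' + metric].mean(),
          'validation:', cv_demo['test_' + metric].mean())
```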

### Feature importances

In [None]:
fi = rfc.feature_importances_

Useful links:

(Plot)
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

(Indexes and names)
http://stackoverflow.com/questions/22361781/how-does-sklearn-random-forest-index-feature-importances

In [None]:
fi
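
The raw array is hard to read on its own, because `feature_importances_` follows the column order of the training matrix. A minimal sketch of pairing importances with feature names and sorting them is given below; it uses synthetic data and placeholder column names (`f0` to `f4`), so the numbers say nothing about the soccer data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
demo_features = ['f0', 'f1', 'f2', 'f3', 'f4']

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Index the importances by feature name, most important first
importances = pd.Series(forest.feature_importances_, index=demo_features)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the same pattern would apply with `rfc.feature_importances_` and the `features` list, provided `rfc` was fitted on columns in that order.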

In [None]:
rfc = RandomForestClassifier()
soccer_data_splitted = soccer_data_final.copy()
soccer_data_splitted['trainingMode'] = np.random.uniform(0, 1, len(soccer_data_final)) <= .75
train = soccer_data_splitted[soccer_data_splitted['trainingMode']]
test = soccer_data_splitted[~soccer_data_splitted['trainingMode']]

y, _ = pd.factorize(train['rater'])
rfc.fit(train[features], y)

predictions = rfc.predict(test[features])
#scores_predictions = rfc.score(predictions, test['rater'])

#print(scores_predictions)

In [None]:
# Same code as the cell at the beginning of the "Cross-validation" section

n_est = 5
result = collections.defaultdict(list)

# We loop to find the best parameter for our classifier.

# --> the ranges of candidate values need to be changed.
for n_est in [1,10,100,1000,2000]:
    for min_leaf in range(10,11):
        for min_split in range (10,11):
            
            # cross validation using RandomForestClassifier
            clf = RandomForestClassifier(n_jobs=-1, n_estimators=n_est, min_samples_leaf=min_leaf, min_samples_split=min_split)
            scores = cross_val_score(clf, soccer_data_final[features], soccer_data_final['rater'] , cv=10, scoring='accuracy')
            
            # adding result to the dic.
            result['min_leaf'].append(min_leaf)
            result['min_split'].append(min_split)
            result['n_est'].append(n_est)
            result['scores_accuracy'].append(np.mean(scores))
            
            print('min_leaf: '+str(min_leaf) +
                  ' min_split: '+str(min_split) +
                  ' n_est: '+str(n_est))
            print(np.mean(scores))

In [None]:
resultDataFrame = pd.DataFrame.from_dict(result)
resultDataFrame.head()

In [None]:
indexed_df = resultDataFrame.set_index(['n_est', 'min_leaf','min_split'])
indexed_df.plot(kind='line')

Useful links:

http://blog.yhat.com/posts/random-forests-in-python.html
https://www.dataquest.io/blog/machine-learning-python/
https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests

http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest

In [None]:
def doCrossValidation(dataDataframe): 
    
    result = collections.defaultdict(list)

    # We loop to find the best parameter for our classifier.

    # --> the ranges of candidate values need to be changed.
    for n_est in [1,10,100,1000]:
        for min_leaf in range(10,11):
            for min_split in range (10,11):

                # cross validation using RandomForestClassifier
                clf = RandomForestClassifier(n_jobs=-1, n_estimators=n_est, min_samples_leaf=min_leaf, min_samples_split=min_split)
                scores = cross_val_score(clf, dataDataframe[features], dataDataframe['rater'] , cv=10, scoring='accuracy')

                # adding result to the dic.
                result['min_leaf'].append(min_leaf)
                result['min_split'].append(min_split)
                result['n_est'].append(n_est)
                result['scores_accuracy'].append(np.mean(scores))

                print('min_leaf: '+str(min_leaf) +
                      ' min_split: '+str(min_split) +
                      ' n_est: '+str(n_est))
                print(np.mean(scores))
                
    resultDataFrame = pd.DataFrame.from_dict(result)
    resultDataFrame.head()
    indexed_df = resultDataFrame.set_index(['n_est', 'min_leaf','min_split'])
    indexed_df.plot(kind='line')
    return indexed_df

In [None]:
doCrossValidation(soccer_data_final)

In [None]:
#soccer_data_clean.drop('birthday', axis=1, inplace=True)
#soccer_data_clean.drop('rater1', axis=1, inplace=True)
#soccer_data_clean.drop('rater2', axis=1, inplace=True)

> To do: check the ratings for skin color

In [None]:
#rfc = RandomForestClassifier()

x = soccer_data_clean[['games','victories','ties','defeats','goals','yellowCards','yellowReds','redCards','age']]
y = soccer_data_clean['rater']

#scores = cross_val_score(rfc, x, y, cv=10, scoring='accuracy')
n_est = 10
result = collections.defaultdict(list)

#rfc = RandomForestClassifier()

#x = soccer_data_clean[['club','leagueCountry','height','weight','position','games','victories','ties','defeats','goals','yellowCards','yellowReds','redCards','refNum','refCountry','Alpha_3','meanIAT','nIAT','seIAT','meanExp','nExp','seExp','age']]
#y = soccer_data_clean['rater']

#scores = cross_val_score(rfc, x, y, cv=10, scoring='accuracy')
#print(scores)

#rfc.fit(x, y)
#rfc.predict([23, 2, 1, 0])