<h1>Homework 04 - Applied ML</h1>

Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn import metrics
import matplotlib.pyplot as plt
from functools import reduce
import math
from collections import Counter

Loading the data

In [None]:
filename = os.path.join('data','CrowdstormingDataJuly1st.csv') 
df = pd.read_csv(filename)
df.head(2)

<h2>Data exploration</h2>

In [None]:
print('Number of dyads (rows in dataframe): ', len(df))
print('Total number of interactions between a referee and a player (nb of games): ', sum(df.games))
print('Mean number of games for a dyad: ', np.mean(df['games']))

## Data cleaning / setup

In [None]:
print("Number of rows in dataframe: ", len(df))

<b>Removing raters</b>: We decided to remove the rows where the two ratings differ significantly or where either rating is missing (NaN value).

In [None]:
cleandf = df.copy()

## Removing null values in raters
cleandf = cleandf[cleandf["rater1"].notnull() & cleandf["rater2"].notnull()]

## Removing all rows where the difference between the two raters is larger than 0.25
cleandf['difference'] = abs(cleandf.rater1 - cleandf.rater2)
cleandf = cleandf[cleandf['difference'] <= 0.25]
cleandf.drop('difference', axis =1, inplace=True)

print("Number of rows in the cleaned dataframe: ", len(cleandf))
print("Number of rows removed: ", (len(df)-len(cleandf)))

<b>Skin tone</b>: We then take the skin tone to be the mean of the two ratings. This is the value that will be predicted later.

In [None]:
cleandf["meanSkinTone"] = abs(cleandf["rater1"] + cleandf["rater2"] ) / 2
cleandf.head(2)

<b>Birthday date</b>: As the classifier cannot handle dates, we convert each birthday into a number of seconds since 1 January 1970. It seemed important to keep the birthday, as it could help predict skin color if more players from a certain demographic played during some years.

In [None]:
def time_to_seconds(t):
    seconds = (pd.to_datetime(t) - datetime.datetime(1970, 1,1)).total_seconds()
    return int(seconds)

cleandf.birthday = cleandf.birthday.apply(time_to_seconds)
cleandf.head(2)
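
As a possible alternative to the row-by-row `apply` above, the conversion can be done on the whole column at once. This is only a sketch: the sample values and the day.month.year format string are assumptions, not taken from the CSV.

```python
import pandas as pd

# Hypothetical sample; the "%d.%m.%Y" format is an assumption about how birthdays are stored.
birthdays = pd.Series(["01.01.1970", "02.01.1970", "31.08.1983"])

parsed = pd.to_datetime(birthdays, format="%d.%m.%Y")
# Whole seconds elapsed since the Unix epoch, computed for every row in one step.
seconds = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(seconds.tolist())
```

Passing an explicit format avoids any ambiguity between day-first and month-first parsing.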

<h3>Changing data attributes to numerals function</h3>

<b><i>Dummy variables</i></b>: We noticed that a lot of the columns could not be used in the Random forest as they are non-numerical. As most of these features can be seen as categorical variables, we decided to make dummy variables with them.

In [None]:
print("The number of different positions is", cleandf["position"].unique().size)
print("The number of different clubs is",cleandf["club"].unique().size)
print("The number of different league countries is",cleandf["leagueCountry"].unique().size)
print("The number of different referee countries is",cleandf["Alpha_3"].unique().size)

Of these columns, we decided to remove the referee country (Alpha_3) and to make dummy variables from the three others. We removed the referee country because it has many distinct values and seemed more likely to add error and overfitting to our classifier than to help it.
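
To illustrate what dummy variables look like, here is a minimal sketch on a made-up frame (the rows and position names are hypothetical, chosen only for the example):

```python
import pandas as pd

# Hypothetical toy data: one numerical column and one categorical column.
toy = pd.DataFrame({
    "games": [3, 1, 2],
    "position": ["Goalkeeper", "Center Back", "Goalkeeper"],
})

# Each distinct value of "position" becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["position"])
print(encoded.columns.tolist())
```

The categorical column disappears and is replaced by one indicator column per category, which is why the number of columns grows quickly for features such as the club.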

<b><i>Replacing NaN</i></b>: We also noticed that the dataframe still has some NaN values. We decided to replace every NaN with the mean of its column.

In [None]:
cleandf.isnull().values.any()

In [None]:
def changeDfAttributesToNumerals(cleandf):
    ## Making the dummy variables
    dummydf = pd.get_dummies(cleandf, prefix=None, prefix_sep='_', dummy_na=False, columns=["position"], sparse=False, drop_first=False)
    dummydf = pd.get_dummies(dummydf, prefix=None, prefix_sep='_', dummy_na=False, columns=["club"], sparse=False, drop_first=False)
    dummydf = pd.get_dummies(dummydf, prefix=None, prefix_sep='_', dummy_na=False, columns=["leagueCountry"], sparse=False, drop_first=False)
    dummydf.drop("Alpha_3", axis = 1, inplace = True)
    
    ## Replacing NaN values
    # For each column that contains NaN values, compute the column mean and use it to fill them.
    for column in dummydf.columns:
        if dummydf[column].isnull().values.any():
            # Assign the result back: an in-place fillna on a selected column may act on a copy.
            dummydf[column] = dummydf[column].fillna(dummydf[column].mean())
    
    return dummydf
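
The mean replacement can also be written without an explicit loop. The sketch below uses a made-up frame (column values are hypothetical) and relies on `DataFrame.fillna` accepting a Series of per-column fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "height": [180.0, np.nan, 190.0],
    "weight": [70.0, 80.0, np.nan],
})

# toy.mean() ignores NaN by default; fillna then fills each column with its own mean.
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```

Note that mean imputation only makes sense for numerical columns, which is why it is applied after the dummy variables are created.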

We can now apply the function to our dataset and observe the results:

In [None]:
dummydf = changeDfAttributesToNumerals(cleandf)
print("Number of columns in the previous dataframe: ",cleandf.columns.size)
print("Number of columns in the new dataframe with dummy variables: ",dummydf.columns.size)

In [None]:
dummydf.isnull().values.any()

<b>Removing useless columns</b>: Some columns make no sense as classifier inputs: the player name (and short name), the photo ID and the initial ratings (rater 1 and rater 2). We therefore drop them.

In [None]:
usedf = dummydf.drop(['playerShort', 'player', 'photoID', 'rater1', 'rater2']  , axis=1)

In [None]:
usedf.head(3)

## Classifier

Here we will show how we classify the data using random forest.

In [114]:

def executingRandomForest(X, y, Xpd, printingInfo, numb_trees, nb_features):
    # Creating kfolds
    once = True
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    sumAccurancy_tr = 0
    sumAccurancy_te = 0

    #Iterating for kfold validation
    for train_index, test_index in kf.split(X):
        if(once):
            print("Train/Test sample sizes:", train_index.size, " / ", test_index.size)

        ## Making indices
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        #Make classifier
        rf = RandomForestClassifier(n_estimators = numb_trees, max_features = nb_features, n_jobs = 4)

        #Train the classifier
        clf = rf.fit(X_train, y_train)

        # Make prediction for testing data with the classifier
        y_pred = clf.predict(X_test)

        #Computing error
        rf_train_score = metrics.accuracy_score(y_train, clf.predict(X_train))
        rf_test_score = metrics.accuracy_score(y_test, y_pred)
        if(printingInfo):
            print('Train/Test:  {:.4f} / {:.4f}'.format(rf_train_score, rf_test_score))
#             print('   {:.4f}'.format(rf_test_score))
        
        sumAccurancy_tr = sumAccurancy_tr + rf_train_score
        sumAccurancy_te = sumAccurancy_te + rf_test_score


        # Features importance score
        if(once):
            once = False
            importances = rf.feature_importances_
            std = np.std([tree.feature_importances_ for tree in rf.estimators_],
                         axis=0)
            indices = np.argsort(importances)[::-1]

    #Printing the final result score:
    print('=> Final Train/Test:  {:.4f} / {:.4f}'.format(sumAccurancy_tr / 5.0, sumAccurancy_te / 5.0))
    
    if(printingInfo):
        # Plot the feature importances of the forest
        fig = plt.figure()
        plt.title("Feature importances")
        plt.bar(range(X.shape[1]), importances[indices],
                   color="r", yerr=std[indices], align="center")
        # Label each bar with the index of its feature column rather than its importance value
        plt.xticks(range(X.shape[1]), indices)
        plt.xlim([-1, X.shape[1]])
        plt.show()

        # Print the feature ranking
        n_top = min(20, X.shape[1])
        print("Feature ranking (first %d):" % n_top)
        for f in range(n_top):
            # indices[f] is the column position of the f-th most important feature
            print("%d. feature %s (%f)" % (f + 1, Xpd.columns[indices[f]], importances[indices[f]]))
        
    # Returns the average accuracy for the training set and the testing set
    return (sumAccurancy_tr / 5.0, sumAccurancy_te / 5.0)
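
For comparison, the averaged train/test accuracy computed by the loop above can also be obtained with scikit-learn's `cross_validate`. This is a sketch on synthetic data (the `make_classification` dataset and the `_demo` names are assumptions for illustration, not the homework data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (an assumption; not the homework dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Same fold setup as in the function above: 5 shuffled folds with a fixed seed.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_validate(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_demo, y_demo, cv=cv, scoring="accuracy", return_train_score=True,
)

mean_train = scores["train_score"].mean()
mean_test = scores["test_score"].mean()
print("=> Final Train/Test:  {:.4f} / {:.4f}".format(mean_train, mean_test))
```

The hand-written loop remains useful when per-fold details such as feature importances are needed.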

We build X (the features used by the classifier) and y (the value to predict), then run the random forest algorithm on them.

In [None]:
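def makeMatrixForClassifier(df):
    # Cast the mean skin tone to short strings so the classifier treats each value as a class label
    y = np.asarray(df["meanSkinTone"].values, dtype="|S6")
    Xpd = df.drop("meanSkinTone", axis = 1)
    # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
    X = Xpd.values
    return (X, y, Xpd)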

In [112]:
numb_trees = 10
nb_features = "sqrt"  # "auto" meant sqrt for classifiers and was removed in scikit-learn 1.3

In [None]:
(X, y, Xpd) = makeMatrixForClassifier(usedf)
executingRandomForest(X, y, Xpd, True, numb_trees, nb_features)

We can observe that the results are unrealistic. Plotting the feature importances shows....

In [None]:
betterResultMaybedf = usedf.drop(['birthday', 'height', 'weight']  , axis=1)
(X, y, Xpd) = makeMatrixForClassifier(betterResultMaybedf)
executingRandomForest(X, y, Xpd, True, numb_trees, nb_features)

This still leads to a problem. The same player appears in both the training and the testing set, so the random forest can learn to identify the player; since a player always has the same skin color, this makes the prediction trivially easy. To avoid this, we need to group the rows by player.
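
The leakage described above can be made concrete with a small sketch. It uses scikit-learn's `GroupKFold`, which is a different technique from the aggregation used in this notebook and is shown only to illustrate the idea of keeping each player on one side of a split. The player labels are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical dyads: each row is one (player, referee) pair, identified by its player.
players = np.array(["a", "a", "a", "b", "b", "c", "c", "d", "e", "e"])
X_demo = np.arange(len(players)).reshape(-1, 1)
y_demo = np.zeros(len(players))

# GroupKFold never puts rows of the same group in both the train and the test part.
folds = list(GroupKFold(n_splits=5).split(X_demo, y_demo, groups=players))
for train_idx, test_idx in folds:
    shared = set(players[train_idx]) & set(players[test_idx])
    print(len(train_idx), len(test_idx), "shared players:", len(shared))
```

With a plain `KFold`, rows of the same player would usually end up on both sides, which is what inflates the test accuracy.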

## Aggregate the referee info by soccer player

In order to aggregate the referee info, we preprocessed the data in the following way:

<ul>
<li> One occurrence data - the data that doesn't change (constants): <b>const_columns</b></li>
<li> Accumulated data - e.g. victories, yellowCards, etc.: <b>acc_columns</b></li>
<li> Majority voting - most frequent data: <b>majority_vote</b></li>
<li> Removed data - insignificant columns: <b>remove_columns</b></li>
</ul>
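
The first three kinds of aggregation listed above can be sketched with a single pandas `groupby`. The toy rows and the `most_frequent` helper are hypothetical and only illustrate the idea; the notebook itself builds the aggregated rows by hand.

```python
import pandas as pd

# Hypothetical dyads for two players.
dyads = pd.DataFrame({
    "playerShort": ["p1", "p1", "p1", "p2"],
    "height": [180, 180, 180, 175],   # constant per player
    "games": [2, 1, 4, 3],            # accumulated
    "club": ["A", "B", "B", "C"],     # majority vote
})

def most_frequent(series):
    # Hypothetical helper: most frequent value of the series (first one on ties).
    return series.mode().iloc[0]

per_player = dyads.groupby("playerShort").agg(
    {"height": "first", "games": "sum", "club": most_frequent}
)
print(per_player)
```

Each player ends up with exactly one row, so the same player can no longer appear in both the training and the testing set.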

In [101]:
acc_columns = ['games', 'victories', 'ties', 'defeats', 'goals', 'yellowCards', 'yellowReds', 'redCards']
const_columns = ['playerShort', 'player', 'birthday', 'height', 'weight', 'meanSkinTone']
majority_vote = ['club', 'leagueCountry', 'position']

In [102]:
# we remove rater1 and rater2 because we already have calculated 'meanSkinTone' as the mean between those two raters
remove_columns = ['photoID', 'refNum', 'refCountry', 'Alpha_3', 'rater1', 'rater2']
referee_info_df = cleandf.drop(remove_columns, axis = 1)

#### After cleaning up, we group the data by player short name

In [103]:
by_group_player = list(referee_info_df.groupby('playerShort'))

In [104]:
# auxiliary function used in order to accumulate values for columns in 'acc_columns'

sum_func = lambda x, y: x+y
def accumulate(series):
    return reduce(sum_func, series)

## Direct aggregation of data

In [105]:
# Each player has several means to combine, so we average them with a weighted mean
# (the weight is the sample size)

def get_weighted_mean(data):
    
    # weighted mean calculation for meanIAT
    acc_niat = accumulate(data['nIAT'].tolist())
    acc_prod_iat = accumulate((data['meanIAT']*data['nIAT']).tolist())
    
    # weighted mean calculation for meanExp
    acc_nexp = accumulate(data['nExp'].tolist())
    acc_prod_exp = accumulate((data['meanExp']*data['nExp']).tolist())
    
    # squareroot of weighted mean of the square for seIAT
    acc_se_iat =accumulate((data['nIAT']).tolist())
    acc_prod_se_iat = accumulate((data['seIAT']*data['seIAT']*data['nIAT']).tolist())
    
    # squareroot of weighted mean of the square for seExp
    acc_se_exp =accumulate((data['nExp']).tolist())
    acc_prod_se_exp = accumulate((data['seExp']*data['seExp']*data['nExp']).tolist())
    
    return {'weighted_mean_iat' : acc_prod_iat/acc_niat,
              'weighted_mean_exp' : acc_prod_exp/acc_nexp,
              'sqrt_weighted_mean_iat' : math.sqrt(acc_prod_se_iat/acc_se_iat),
              'sqrt_weighted_mean_exp' : math.sqrt(acc_prod_se_exp/acc_se_exp)}
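
As a quick check of the two formulas used above (weighted mean, and square root of the weighted mean of the squared standard errors), here is a sketch with NumPy on made-up numbers. The values are hypothetical and the variable names are chosen only for this example.

```python
import numpy as np

# Hypothetical values for one player seen by referees from two countries.
mean_iat = np.array([0.30, 0.40])
se_iat = np.array([0.010, 0.020])
n_iat = np.array([100, 300])      # sample sizes, used as weights

# Weighted mean: sum(mean * n) / sum(n)
pooled_mean = np.average(mean_iat, weights=n_iat)
# Square root of the weighted mean of the squared standard errors
pooled_se = np.sqrt(np.average(se_iat ** 2, weights=n_iat))
print(pooled_mean, pooled_se)
```

`np.average` with `weights` computes the same ratio of sums as the `accumulate` calls in `get_weighted_mean`.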

In [106]:
# we aggregate data in the format of 'list[dict()]' in order to create DataFrame
unique_player_data = [] 

# iterating over the data grouped by player, we assemble all processed values into a dictionary,
# i.e. one row of the output DataFrame
for player_name, data in by_group_player:
    
    # constants - we keep just first value from every column
    one_occurrence = { column : data[column].tolist()[0] for column in const_columns }
    
    #accumulated values
    accumulated = {column : accumulate(data[column].tolist()) for column in acc_columns}
    
    # majority voting: most_common() returns (value, count) pairs sorted by descending count;
    # most_common()[0][0] - the first 0 selects the most frequent pair,
    #                       the second 0 selects its value
    # (stored under a new name so that the 'majority_vote' column list is not overwritten)
    voted = { column : Counter(data[column].tolist()).most_common()[0][0] for column in majority_vote}

    # weighted mean calculation
    wm = get_weighted_mean(data)
    
    # assemble just calculated data into one dictionary
    unique_player_data.append(
        dict(list(one_occurrence.items()) +
             list(accumulated.items()) +
             list(voted.items()) +
             list(wm.items())))

In [107]:
# create DataFrame from aggregated data
aggregated_df = pd.DataFrame(unique_player_data)

In [108]:
aggregated_df = pd.get_dummies(aggregated_df, prefix=None, prefix_sep='_', dummy_na=False, columns=["position"], sparse=False, drop_first=False)
aggregated_df = pd.get_dummies(aggregated_df, prefix=None, prefix_sep='_', dummy_na=False, columns=["club"], sparse=False, drop_first=False)
aggregated_df = pd.get_dummies(aggregated_df, prefix=None, prefix_sep='_', dummy_na=False, columns=["leagueCountry"], sparse=False, drop_first=False)


In [109]:
aggregated_df.isnull().values.any()
randomForestDF = aggregated_df.drop(['player', 'playerShort'], axis = 1)
randomForestDF.columns

Index(['birthday', 'defeats', 'games', 'goals', 'height', 'meanSkinTone',
       'redCards', 'sqrt_weighted_mean_exp', 'sqrt_weighted_mean_iat', 'ties',
       ...
       'club_Werder Bremen', 'club_West Bromwich Albion',
       'club_West Ham United', 'club_Wigan Athletic',
       'club_Wolverhampton Wanderers', 'club_Évian Thonon Gaillard',
       'leagueCountry_England', 'leagueCountry_France',
       'leagueCountry_Germany', 'leagueCountry_Spain'],
      dtype='object', length=129)

In [110]:
# For each column that contains NaN values, compute the column mean and use it to fill them.
for column in randomForestDF.columns:
    if randomForestDF[column].isnull().values.any():
        # Assign the result back: an in-place fillna on a selected column may act on a copy.
        randomForestDF[column] = randomForestDF[column].fillna(randomForestDF[column].mean())
        
randomForestDF.isnull().values.any()

False

In [115]:
(X, y, Xpd) = makeMatrixForClassifier(randomForestDF)
executingRandomForest(X, y, Xpd, False, numb_trees, nb_features)

Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9730 / 0.7442


(0.97299438536379179, 0.74415605159126297)

### Effect of the number of trees and features

In [123]:
nb_features = len(randomForestDF.columns)
nb_trees = 200

In [127]:
# Very long to run, BE CAREFUL
training = []
testing = []

for i in range(10, nb_trees, 10):
    tr, te = executingRandomForest(X, y, Xpd, False, i, "sqrt")
    training.append(tr)
    testing.append(te)

Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9727 / 0.7378
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9919 / 0.7549
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9986 / 0.7555
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9992 / 0.7524
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9998 / 0.7568
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9997 / 0.7593
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9998 / 0.7549
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  1.0000 / 0.7587
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  1.0000 / 0.7581
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9998 / 0.7574
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  1.0000 / 0.7574
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  1.0000 / 0.7580
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  1.0000 / 0.7568
Train/Test s

KeyboardInterrupt: 

In [126]:
del training[:]
del testing[:]

for i in range(1, nb_features, 10):
    tr, te = executingRandomForest(X, y, Xpd, False, 10, i)
    training.append(tr)
    testing.append(te)

Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9725 / 0.7486
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9722 / 0.7397
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9641 / 0.7423
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9692 / 0.7473
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9722 / 0.7341
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9725 / 0.7340
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9692 / 0.7328
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9747 / 0.7341
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9692 / 0.7340
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9738 / 0.7397
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9724 / 0.7410
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9659 / 0.7385
Train/Test sample sizes: 1266  /  317
=> Final Train/Test:  0.9656 / 0.7423


## Text to adapt....

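As observed above, increasing the number of trees in the random forest classifier pushes the training accuracy from about 0.97 (10 trees) to 1.00, while the testing accuracy stays around 0.75. The gap between training and testing accuracy therefore widens slightly, which suggests that adding trees lets the model fit the training set more closely without improving how well it generalizes.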

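By default, the number of features considered at each split is the square root of the number of feature columns (here sqrt(128) ≈ 11). In the sweep above, the testing accuracy stays between roughly 0.73 and 0.75 for every value tried, and the highest value (0.7486) is obtained with a single feature per split, so a lower number of features does not hurt the testing accuracy and may slightly reduce overfitting.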
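
To make the comparison for the number of trees easier to read, the averaged accuracies printed by the tree sweep can be tabulated together with the train/test gap. The numbers below are copied from the output above (13 completed runs, i.e. 10 to 130 trees, before the run was interrupted); the `sweep` frame and its column names are introduced only for this sketch.

```python
import pandas as pd

# Averaged accuracies copied from the tree sweep output above (10 to 130 trees).
train = [0.9727, 0.9919, 0.9986, 0.9992, 0.9998, 0.9997, 0.9998,
         1.0000, 1.0000, 0.9998, 1.0000, 1.0000, 1.0000]
test = [0.7378, 0.7549, 0.7555, 0.7524, 0.7568, 0.7593, 0.7549,
        0.7587, 0.7581, 0.7574, 0.7574, 0.7580, 0.7568]

sweep = pd.DataFrame({"train": train, "test": test},
                     index=pd.Index(range(10, 140, 10), name="n_trees"))
# Gap between training and testing accuracy for each forest size.
sweep["gap"] = sweep["train"] - sweep["test"]
best_n_trees = sweep["test"].idxmax()

print(sweep.round(4))
print("Highest test accuracy at n_trees =", best_n_trees)
```

The same table could be built directly from the `training` and `testing` lists collected in the loops.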