<h1>Homework 04 - Applied ML</h1>

Importation of libraries

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn import metrics
import matplotlib.pyplot as plt
from functools import reduce
import math
from collections import Counter

Importation of the datas

In [2]:
filename = os.path.join('data','CrowdstormingDataJuly1st.csv') 
df = pd.read_csv(filename)
df.head(2)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,...,0.5,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,...,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504


<h2>Data exploration</h2>

In [3]:
print('Number of dyads (rows in dataframe): ', len(df))
print('Total number of interactions between a referee and a player (nb of games): ', sum(df.games))
print('Mean number of games for a dyad: ', np.mean(df['games']))

Number of dyads (rows in dataframe):  146028
Total number of interactions between a referee and a player (nb of games):  426572
Mean number of games for a dyad:  2.921165803818446


## Data cleaning / setup

In [4]:
print("Number of rows in dataframe: ", len(df))

Number of rows in dataframe:  146028


<b>Removing raters</b>: We decided to remove row where the two rates were significantly different or if any of the rates were absent (Nan value).

In [5]:
cleandf = df.copy()

## Removing null values in raters
cleandf = cleandf[cleandf["rater1"].notnull() & cleandf["rater2"].notnull()]

## Removing all rows where the difference between the two raters is larger than 0.25
cleandf['difference'] = abs(cleandf.rater1 - cleandf.rater2)
cleandf = cleandf[cleandf['difference'] <= 0.25]
cleandf.drop('difference', axis =1, inplace=True)

print("Number of rows in the cleaned dataframe: ", len(cleandf))
print("Number of rows removed: ", (len(df)-len(cleandf)))

Number of rows in the cleaned dataframe:  124457
Number of rows removed:  21571


<b>Skin tone</b>: Then we decide to take the skin tone as the mean between the two raters. This is the value that will be predicted later.

In [6]:
cleandf["meanSkinTone"] = abs(cleandf["rater1"] + cleandf["rater2"] ) / 2
# cols = cleandf.columns.tolist()
# cols = cols[-1:] + cols[:-1]
# cleandf = cleandf[cols]
cleandf.head(2)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,meanSkinTone
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,...,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696,0.125
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,...,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504,0.0


<b>Birthday date</b>: As the classifier can not understand date, we decided to change birthday date in seconds. It seemed important for us to keep the birthday date, as it could help predict the color skin if there were more people from a certain demography that played during some years.

In [7]:
def time_to_seconds(t):
    seconds = (pd.to_datetime(t) - datetime.datetime(1970, 1,1)).total_seconds()
    return int(seconds)

cleandf.birthday = cleandf.birthday.apply(time_to_seconds)
cleandf.head(2)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,meanSkinTone
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,431136000,177.0,72.0,Attacking Midfielder,1,0,...,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696,0.125
1,john-utaka,John Utaka,Montpellier HSC,France,397008000,179.0,82.0,Right Winger,1,0,...,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504,0.0


<b>Dummy variables</b>: We noticed that a lot of the columns could not be used in the Random forest as they are non-numerical. As most of these features can be seen as categorical variables, we decided to make dummy variables with them.

In [8]:
print("The number of different positions is", cleandf["position"].unique().size)
print("The number of different clubs is",cleandf["club"].unique().size)
print("The number of different league countries is",cleandf["leagueCountry"].unique().size)
print("The number of different referee countries is",cleandf["Alpha_3"].unique().size)

The number of different positions is 13
The number of different clubs is 97
The number of different league countries is 4
The number of different referee countries is 160


Out of these datas, we decided to remove the "referee country" (Alpha_3) and to make dummy variables with the 3 other categories. We decided to remove the referee country because there were a lot of them and it seemed it would induce more error and overfitting to our classifier than it would help it.

In [9]:
dummydf = pd.get_dummies(cleandf, prefix=None, prefix_sep='_', dummy_na=False, columns=["position"], sparse=False, drop_first=False)
dummydf = pd.get_dummies(dummydf, prefix=None, prefix_sep='_', dummy_na=False, columns=["club"], sparse=False, drop_first=False)
dummydf = pd.get_dummies(dummydf, prefix=None, prefix_sep='_', dummy_na=False, columns=["leagueCountry"], sparse=False, drop_first=False)
dummydf.drop("Alpha_3", axis = 1, inplace = True)

In [10]:
print("Number of columns in the previous dataframe: ",cleandf.columns.size)
print("Number of columns in the new dataframe with dummy variables: ",dummydf.columns.size)

Number of columns in the previous dataframe:  29
Number of columns in the new dataframe with dummy variables:  138


<b>Replacing Nan</b>: We realize that the dataframe still have some NaN values. We decided to substitute every NaN with the mean of their column.

In [11]:
dummydf.isnull().values.any()

True

In [12]:
# For each column, if there is any NaN value, we compute the mean and replace the NaN values with it.
for i in range(len(dummydf.columns)):
    if (dummydf[dummydf.columns[i]].isnull().values.any()):
        mean = np.mean(dummydf[dummydf.columns[i]])
        dummydf[dummydf.columns[i]].fillna(mean, inplace = True)
        
dummydf.isnull().values.any()

False

<b>Removing useless columns</b>: For the classifier, there are some columns that it makes no sense to use. These columns are the player name (and short name), the photo and the initial ratings (of rater 1 and 2). Therefore we will drop them.

In [13]:
usedf = dummydf.drop(['playerShort', 'player', 'photoID', 'rater1', 'rater2']  , axis=1)

In [14]:
usedf.head(3)

Unnamed: 0,birthday,height,weight,games,victories,ties,defeats,goals,yellowCards,yellowReds,...,club_Werder Bremen,club_West Bromwich Albion,club_West Ham United,club_Wigan Athletic,club_Wolverhampton Wanderers,club_Évian Thonon Gaillard,leagueCountry_England,leagueCountry_France,leagueCountry_Germany,leagueCountry_Spain
0,431136000,177.0,72.0,1,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,397008000,179.0,82.0,1,0,0,1,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,303177600,182.0,71.0,1,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


## Classifier

Here we will show how we classify the data using random forest.

In [15]:
def executingRandomForest(X, y, Xpd):
    # Creating kfolds
    once = False
    kf = KFold(n_splits=5, shuffle=True, random_state=1)

    #Iterating for kfold validation
    for train_index, test_index in kf.split(X):
        print("TRAIN:", train_index.size, "TEST:", test_index.size)

        ## Making indices
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        #Make classifier
        rf = RandomForestClassifier(n_jobs = 4)

        #Train the classifier
        clf = rf.fit(X_train, y_train)

        # Make prediction for testing data with the classifier
        y_pred = clf.predict(X_test)

        #Computing error
        rf_train_score = metrics.accuracy_score(y_train, clf.predict(X_train))
        rf_test_score = metrics.accuracy_score(y_test, y_pred)
        print('   Training score: {:.5f}'.format(rf_train_score))
        print('   Testing score: {:.5f}'.format(rf_test_score))


        # Features importance score
        if(once != True):
            once = True
            importances = rf.feature_importances_
            std = np.std([tree.feature_importances_ for tree in rf.estimators_],
                         axis=0)
            indices = np.argsort(importances)[::-1]


    # Plot the feature importances of the forest
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
               color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), importances[indices])
    plt.xlim([-1, X.shape[1]])
    plt.show()

    # Print the feature ranking
    print("Feature ranking (first 20):")
    for f in range(20):
        print("%d. feature %s (%f)" % (f + 1, Xpd.columns[f], importances[indices[f]]))


Making X (the features to use in the classifier) and Y (the value to predict). Executing the random forest algorithm on them

In [16]:
def makeMatrixForClassifier(df):
    y = np.asarray(df["meanSkinTone"].values, dtype="|S6")
    Xpd = df.drop("meanSkinTone", axis = 1)
    X = Xpd.as_matrix()
    return (X, y, Xpd)

In [17]:
(X, y, Xpd) = makeMatrixForClassifier(usedf)
executingRandomForest(X, y, Xpd)

TRAIN: 99565 TEST: 24892
   Training score: 0.99985
   Testing score: 0.99381
TRAIN: 99565 TEST: 24892
   Training score: 0.99984
   Testing score: 0.99413
TRAIN: 99566 TEST: 24891
   Training score: 0.99982
   Testing score: 0.99349
TRAIN: 99566 TEST: 24891
   Training score: 0.99986
   Testing score: 0.99514
TRAIN: 99566 TEST: 24891
   Training score: 0.99983
   Testing score: 0.99578
Feature ranking (first 20):
1. feature birthday (0.163980)
2. feature height (0.128227)
3. feature weight (0.128050)
4. feature games (0.016392)
5. feature victories (0.015854)
6. feature ties (0.015657)
7. feature defeats (0.015196)
8. feature goals (0.014407)
9. feature yellowCards (0.014150)
10. feature yellowReds (0.012443)
11. feature redCards (0.011990)
12. feature refNum (0.011582)
13. feature refCountry (0.010838)
14. feature meanIAT (0.010677)
15. feature nIAT (0.010646)
16. feature seIAT (0.010210)
17. feature meanExp (0.009537)
18. feature nExp (0.009358)
19. feature seExp (0.009166)
20. feat

We can observe that we overfit greatly. Ploting the features importance show....

In [18]:
notOverfittingMaybedf = usedf.drop(['birthday', 'height', 'weight']  , axis=1)
(X, y, Xpd) = makeMatrixForClassifier(notOverfittingMaybedf)
executingRandomForest(X, y, Xpd)

TRAIN: 99565 TEST: 24892
   Training score: 0.98387
   Testing score: 0.86088
TRAIN: 99565 TEST: 24892
   Training score: 0.98419
   Testing score: 0.86000
TRAIN: 99566 TEST: 24891
   Training score: 0.98453
   Testing score: 0.86312
TRAIN: 99566 TEST: 24891
   Training score: 0.98410
   Testing score: 0.86614
TRAIN: 99566 TEST: 24891
   Training score: 0.98404
   Testing score: 0.86272
Feature ranking (first 20):
1. feature games (0.148519)
2. feature victories (0.047789)
3. feature ties (0.043618)
4. feature defeats (0.038631)
5. feature goals (0.032899)
6. feature yellowCards (0.028385)
7. feature yellowReds (0.026083)
8. feature redCards (0.025312)
9. feature refNum (0.024850)
10. feature refCountry (0.023614)
11. feature meanIAT (0.021483)
12. feature nIAT (0.019899)
13. feature seIAT (0.019138)
14. feature meanExp (0.018161)
15. feature nExp (0.017935)
16. feature seExp (0.017258)
17. feature position_Attacking Midfielder (0.016698)
18. feature position_Center Back (0.016555)
19.

In [19]:
notOverfittingMaybedf = usedf.drop(['birthday', 'height', 'weight', 'games', 'victories', 'ties', 'defeats', 'goals']  , axis=1)
(X, y, Xpd) = makeMatrixForClassifier(notOverfittingMaybedf)
executingRandomForest(X, y, Xpd)

TRAIN: 99565 TEST: 24892
   Training score: 0.95778
   Testing score: 0.84585
TRAIN: 99565 TEST: 24892
   Training score: 0.95819
   Testing score: 0.84019
TRAIN: 99566 TEST: 24891
   Training score: 0.95769
   Testing score: 0.84083
TRAIN: 99566 TEST: 24891
   Training score: 0.95662
   Testing score: 0.84452
TRAIN: 99566 TEST: 24891
   Training score: 0.95770
   Testing score: 0.84215
Feature ranking (first 20):
1. feature yellowCards (0.254330)
2. feature yellowReds (0.041678)
3. feature redCards (0.027983)
4. feature refNum (0.027090)
5. feature refCountry (0.025719)
6. feature meanIAT (0.023203)
7. feature nIAT (0.022423)
8. feature seIAT (0.021333)
9. feature meanExp (0.021189)
10. feature nExp (0.021063)
11. feature seExp (0.018389)
12. feature position_Attacking Midfielder (0.017984)
13. feature position_Center Back (0.017602)
14. feature position_Center Forward (0.016643)
15. feature position_Center Midfielder (0.016625)
16. feature position_Defensive Midfielder (0.016580)
17.

## Aggregate the referee info by socker player

In [20]:
acc_columns = ['games', 'victories', 'ties', 'defeats', 'goals', 'yellowCards', 'yellowReds', 'redCards']
const_columns = ['playerShort', 'player', 'birthday', 'height', 'weight']
majority_vote = ['club', 'leagueCountry', 'position']

In [21]:
remove_columns = ['photoID', 'refNum', 'refCountry', 'Alpha_3', 'rater1', 'rater2']
referee_info_df = cleandf.drop(remove_columns, axis = 1)

In [22]:
by_group_player = list(referee_info_df.groupby('playerShort'))

In [23]:
sum_func = lambda x, y: x+y
def accumulate(series):
    return reduce(sum_func, series)

In [24]:
unique_player_data = []
for player_name, data in by_group_player:
    one_occurrence = {column : data[column].tolist()[0] for column in const_columns}
    accumulated = {column : accumulate(data[column].tolist()) for column in acc_columns}
    #TODO: comment on: most_common()[0][0] - first 0 stands for max voted value ('name': count); 
    #                                        second 0 - gives a name from tuple 
    majority_vote = { column : Counter(data[column].tolist()).most_common()[0][0] for column in majority_vote}

    #TODO: add comment
    #################################################################################
    acc_niat = accumulate(data['nIAT'].tolist())
    acc_prod_iat = accumulate((data['meanIAT']*data['nIAT']).tolist())
    
    acc_nexp = accumulate(data['nExp'].tolist())
    acc_prod_exp = accumulate((data['meanExp']*data['nExp']).tolist())
    
    #TODO: add comment
    #################################################################################
    acc_se_iat =accumulate((data['nIAT']).tolist())
    acc_prod_se_iat = accumulate((data['seIAT']*data['seIAT']*data['nIAT']).tolist())
    
    acc_se_exp =accumulate((data['nExp']).tolist())
    acc_prod_se_exp = accumulate((data['seExp']*data['seExp']*data['nExp']).tolist())
    
    wm = {'weighted_mean_iat' : acc_prod_iat/acc_niat,
          'weighted_mean_exp' : acc_prod_exp/acc_nexp,
          'sqrt_weighted_mean_iat' : math.sqrt(acc_prod_se_iat/acc_se_iat),
          'sqrt_weighted_mean_exp' : math.sqrt(acc_prod_se_iat/acc_se_iat)}
    
    unique_player_data.append(
        dict(list(one_occurrence.items()) +
             list(accumulated.items()) +
             list(majority_vote.items()) +
             list(wm.items())))

In [25]:
aggregated_df = pd.DataFrame(unique_player_data)

In [26]:
aggregated_df.head()

Unnamed: 0,birthday,club,defeats,games,goals,height,leagueCountry,player,playerShort,position,redCards,sqrt_weighted_mean_exp,sqrt_weighted_mean_iat,ties,victories,weight,weighted_mean_exp,weighted_mean_iat,yellowCards,yellowReds
0,303177600,Fulham FC,228,654,9,182.0,England,Aaron Hughes,aaron-hughes,Center Back,0,0.000121,0.000121,179,247,71.0,0.367721,0.328409,19,0
1,513388800,Werder Bremen,122,336,62,183.0,Germany,Aaron Hunt,aaron-hunt,Attacking Midfielder,1,5.8e-05,5.8e-05,73,141,73.0,0.441615,0.329945,42,0
2,545529600,Tottenham Hotspur,115,412,31,165.0,England,Aaron Lennon,aaron-lennon,Right Midfielder,0,8.6e-05,8.6e-05,97,200,63.0,0.365628,0.32823,11,0
3,662169600,Arsenal FC,68,260,39,178.0,England,Aaron Ramsey,aaron-ramsey,Center Midfielder,1,0.000218,0.000218,42,150,76.0,0.412859,0.327775,31,0
4,637632000,Montpellier HSC,43,124,1,180.0,France,Abdelhamid El-Kaoutari,abdelhamid-el-kaoutari,Center Back,2,0.000478,0.000478,40,41,73.0,0.379497,0.338847,8,4
