<h1>Homework 04 - Applied ML</h1>

Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import datetime
from sklearn.ensemble import RandomForestClassifier

Importing the data

In [None]:
filename = os.path.join('data','CrowdstormingDataJuly1st.csv') 
df = pd.read_csv(filename)
df.head(2)

<h2>Data exploration</h2>

In [None]:
print('Number of dyads (rows in dataframe): ', len(df))
print('Total number of interactions between a referee and a player (nb of games): ', sum(df.games))
print('Mean number of games for a dyad: ', np.mean(df['games']))

## Data cleaning / setup

In [None]:
print("Number of rows in dataframe: ", len(df))

<b>Removing raters</b>: We removed every row where the two ratings differed by more than 0.25, or where either rating was missing (NaN value).

In [None]:
cleandf = df.copy()

## Removing null values in raters
cleandf = cleandf[cleandf["rater1"].notnull() & cleandf["rater2"].notnull()]

## Removing all rows where the difference between the two raters is larger than 0.25
cleandf['difference'] = abs(cleandf.rater1 - cleandf.rater2)
cleandf = cleandf[cleandf['difference'] <= 0.25]
cleandf.drop('difference', axis =1, inplace=True)

print("Number of rows in the cleaned dataframe: ", len(cleandf))
print("Number of rows removed: ", (len(df)-len(cleandf)))
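
The filtering rule can be illustrated on a toy table. This is only a sketch: the values below are made up, and `dropna` is used here as an equivalent way of expressing the null check.

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset.
toy = pd.DataFrame({
    "rater1": [0.0, 0.25, None, 0.5],
    "rater2": [0.25, 0.75, 0.5, 0.5],
})

# Drop rows where either rating is missing.
rated = toy.dropna(subset=["rater1", "rater2"])

# Keep rows where the two raters differ by at most 0.25.
agree = rated[(rated["rater1"] - rated["rater2"]).abs() <= 0.25]
print(agree)
```

Row 1 is dropped because the raters disagree by 0.5, and row 2 because one rating is missing.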

<b>Skin tone</b>: We then define the skin tone as the mean of the two raters' scores. This is the value that will be predicted later.

In [None]:
# Mean of the two ratings (sum divided by two, not the difference)
cleandf["meanSkinTone"] = (cleandf["rater1"] + cleandf["rater2"]) / 2
# cols = cleandf.columns.tolist()
# cols = cols[-1:] + cols[:-1]
# cleandf = cleandf[cols]
cleandf.head(2)
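
As a quick sanity check, the mean of two rating columns can be computed on toy data as follows. The values are hypothetical and only serve to show what the target column should look like.

```python
import pandas as pd

# Hypothetical ratings on the 0-1 scale, not taken from the dataset.
toy = pd.DataFrame({
    "rater1": [0.25, 0.50, 1.00],
    "rater2": [0.25, 0.75, 1.00],
})

# Row-wise mean of the two rater columns.
toy["meanSkinTone"] = toy[["rater1", "rater2"]].mean(axis=1)
print(toy)
```

When both raters agree, the mean equals their common rating, which is an easy property to verify by eye.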

<b>Birthday date</b>: As the classifier cannot handle dates, we converted each birthday to a number of seconds since 1 January 1970. It seemed important to keep the birthday, as it could help predict the skin tone if the demographic makeup of the players changed over the years.

In [None]:
def time_to_seconds(t):
    seconds = (pd.to_datetime(t) - datetime.datetime(1970, 1,1)).total_seconds()
    return int(seconds)

cleandf.birthday = cleandf.birthday.apply(time_to_seconds)
cleandf.head(2)
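
The same conversion can also be written without a row-by-row `apply`. The sketch below assumes the birthdays are strings in day.month.year form; that format is an assumption made for the example and should be checked against the actual column.

```python
import pandas as pd

# Hypothetical birthdays; the "%d.%m.%Y" format is an assumption.
birthdays = pd.Series(["31.08.1983", "01.01.1970", "02.01.1970"])

# Parse with an explicit format so day and month cannot be swapped.
parsed = pd.to_datetime(birthdays, format="%d.%m.%Y")

# Seconds elapsed since 1 January 1970.
epoch = pd.Timestamp("1970-01-01")
seconds = (parsed - epoch).dt.total_seconds().astype("int64")
print(seconds.tolist())
```

Giving the format explicitly avoids relying on pandas to guess whether a value such as 01.08.1983 means 1 August or 8 January.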

<b>Dummy variables</b>: We noticed that many of the columns cannot be used in the random forest because they are non-numerical. As most of these features are categorical variables, we decided to turn them into dummy variables.

In [None]:
print("The number of different positions is", cleandf["position"].unique().size)
print("The number of different clubs is",cleandf["club"].unique().size)
print("The number of different league countries is",cleandf["leagueCountry"].unique().size)
print("The number of different referee countries is",cleandf["Alpha_3"].unique().size)

Of these columns, we decided to remove the referee country (Alpha_3) and to make dummy variables from the three other categories. We removed the referee country because it has many distinct values, and it seemed more likely to add noise and overfitting to our classifier than to help it.

In [None]:
dummydf = pd.get_dummies(cleandf, prefix=None, prefix_sep='_', dummy_na=False, columns=["position"], sparse=False, drop_first=False)
dummydf = pd.get_dummies(dummydf, prefix=None, prefix_sep='_', dummy_na=False, columns=["club"], sparse=False, drop_first=False)
dummydf = pd.get_dummies(dummydf, prefix=None, prefix_sep='_', dummy_na=False, columns=["leagueCountry"], sparse=False, drop_first=False)
dummydf.drop("Alpha_3", axis = 1, inplace = True)
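
To show what dummy encoding does to a table, here is a minimal sketch on made-up rows. It passes several column names to a single `pd.get_dummies` call, which should be equivalent to encoding them one at a time.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "position": ["Goalkeeper", "Center Back", "Goalkeeper"],
    "leagueCountry": ["Spain", "France", "France"],
    "games": [3, 1, 2],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["position", "leagueCountry"])
print(encoded.columns.tolist())
```

Each categorical column with n distinct values becomes n indicator columns, which is why the column count grows so much for a feature like the club.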

In [None]:
print("Number of columns in the previous dataframe: ",cleandf.columns.size)
print("Number of columns in the new dataframe with dummy variables: ",dummydf.columns.size)

<b>Removing useless columns</b>: Some columns make no sense as inputs to the classifier: the player name (and short name), the photo ID and the initial ratings (of rater 1 and 2). Therefore we drop them.

In [None]:
usedf = dummydf.drop(['playerShort', 'player', 'photoID', 'rater1', 'rater2']  , axis=1)

<b>Replacing NaN</b>: We noticed that the dataframe still has some NaN values. We decided to replace every NaN with the mean of its column.

In [None]:
usedf.isnull().values.any()

In [None]:
# For each column that contains NaN values, compute the column mean and fill the gaps with it.
# The result is assigned back to the column, because an in-place fillna on a selected column
# may act on a copy and leave usedf unchanged.
for col in usedf.columns:
    if usedf[col].isnull().any():
        usedf[col] = usedf[col].fillna(usedf[col].mean())
        
usedf.isnull().values.any()
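
Mean imputation can also be expressed in one call, since `DataFrame.fillna` accepts a Series of per-column fill values. The sketch below uses made-up numbers and assumes every column is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one gap per column.
toy = pd.DataFrame({
    "height": [180.0, np.nan, 190.0],
    "weight": [70.0, 80.0, np.nan],
})

# toy.mean() ignores NaN values and returns one mean per column;
# fillna then matches those means to the columns by name.
filled = toy.fillna(toy.mean())
print(filled)
```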

In [None]:
usedf.head(3)

## Classifier

Here we show how we classify the data using a random forest.

In [None]:
def build_k_indices(y, k_fold, seed):
    """build k indices for k-fold."""
    num_row = len(y)
    interval = int(num_row / k_fold)
    np.random.seed(seed)
    indices = np.random.permutation(num_row)
    k_indices = [indices[k * interval: (k + 1) * interval] for k in range(k_fold)]
    return np.array(k_indices)
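
A small usage example helps to see what the fold builder returns. The function is repeated here only so that the snippet runs on its own; the toy array is hypothetical.

```python
import numpy as np

def build_k_indices(y, k_fold, seed):
    """Same logic as above, repeated so the example is standalone."""
    num_row = len(y)
    interval = int(num_row / k_fold)
    np.random.seed(seed)
    indices = np.random.permutation(num_row)
    return np.array([indices[k * interval:(k + 1) * interval] for k in range(k_fold)])

# Ten toy samples split into three folds.
y_toy = np.arange(10)
folds = build_k_indices(y_toy, k_fold=3, seed=1)
print(folds.shape)

# The folds are disjoint; one sample is left out because 10 is not divisible by 3.
used = np.sort(folds.reshape(-1))
print(len(used), len(np.unique(used)))
```

Because the fold size is rounded down, up to `k_fold - 1` rows are never used for either training or testing.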

In [None]:
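from sklearn import metrics
from sklearn.metrics import log_loss

# Running total of the feature importances over all folds.
# X is only built in a later cell, so the number of features is taken from usedf
# (every column except the target, meanSkinTone).
sum_feature_importances = np.zeros(usedf.shape[1] - 1)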

def cross_validation(y, x, k_indices, k):
    """Train a random forest on all folds but the k-th and return its accuracy on the k-th fold."""
    # Use the k-th subgroup as the test set and the others as the training set
    te_indice = k_indices[k]
    tr_indice = k_indices[~(np.arange(k_indices.shape[0]) == k)]
    tr_indice = tr_indice.reshape(-1)
    
    y_te = y[te_indice]
    y_tr = y[tr_indice]
    x_te = x[te_indice]
    x_tr = x[tr_indice]
    
    # Build the classifier
    rf = RandomForestClassifier(n_jobs=4)

    # Train the classifier and add its feature importances to the running total
    clf = rf.fit(x_tr, y_tr)
    global sum_feature_importances
    sum_feature_importances = np.add(sum_feature_importances, clf.feature_importances_)

    # Predict the labels of the test fold
    y_pred = clf.predict(x_te)

    # Compute the accuracy (log loss was also tried, but it expects class probabilities, not labels)
    #rf_score1 = log_loss(y_te, y_pred)
    rf_score2 = metrics.accuracy_score(y_te, y_pred)

    print('Fold', k, 'accuracy:', rf_score2)
    return rf_score2
    

Building X (the features used by the classifier) and Y (the value to predict).

In [None]:
Y = np.asarray(usedf["meanSkinTone"].values, dtype="|S6")
subDummydf = usedf.drop("meanSkinTone", axis = 1)
# .values is used because DataFrame.as_matrix() was removed from recent pandas versions
X = subDummydf.values

Running the k-fold cross-validation

In [None]:
k_fold = 8
k_indices = build_k_indices(Y, k_fold, 1)

for k in range(k_fold):
    cross_validation(Y, X, k_indices, k)
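
For comparison, scikit-learn ships its own cross-validation helper. The sketch below is not the method used in this notebook: it runs `cross_val_score` on synthetic data, and the sizes, seed and number of trees are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 80 samples, 5 features, label driven by the first feature.
rng = np.random.RandomState(0)
X_toy = rng.rand(80, 5)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

# 4-fold cross-validation; one accuracy score is returned per fold.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(clf, X_toy, y_toy, cv=4, scoring="accuracy")
print(scores.mean())
```

With a classifier and an integer `cv`, the helper uses stratified folds, so every fold keeps roughly the same class proportions.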

In [None]:
# Mean feature importance over the folds (divide by the number of folds, k_fold)
sum_feature_importances / k_fold

In [None]:
sum_feature_importances.argsort()
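
The sorted indices are easier to read once they are mapped back to column names. Here is a minimal sketch with made-up importances and hypothetical feature names; in the notebook the names would come from the columns of the feature dataframe.

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only.
feature_names = np.array(["games", "height", "weight", "birthday"])
mean_importances = np.array([0.10, 0.40, 0.20, 0.30])

# argsort is ascending, so reverse it to list the most important feature first.
order = np.argsort(mean_importances)[::-1]
ranked = list(zip(feature_names[order].tolist(), mean_importances[order].tolist()))
for name, importance in ranked:
    print(name, importance)
```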