#### Isolation forests: novelty detection algorithm test
 
Within a conduct risk domain, the isolation forest algorithm can be used to spot outlier behaviour. Is somebody's login activity unusual compared with their peer group? Is somebody's metadata unusual in terms of how often, or at what time of day, they communicate with their colleagues?  
 
The test below is not intended as a serious exercise. Here the algorithm is trained on data belonging to a group of footballers, and the isolation forest hyperparameters are optimised using a cross-validated grid search. The model is then shown data on Irish football players and asked to classify whether each Irish player belongs to the training group or is so different that they should be treated as an outlier.  
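
As a rough illustration of the cross-validated grid search mentioned above, the sketch below tunes an isolation forest with a custom accuracy scorer. The synthetic data, the all-inlier labels and the helper name `outlier_accuracy` are assumptions made for this example; they are not taken from the football data set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV


def outlier_accuracy(y_true, y_pred):
    """Accuracy for labels coded 0 (inlier) / 1 (outlier).

    IsolationForest.predict returns +1 for inliers and -1 for outliers,
    so the predictions are mapped to 0 / 1 before comparison.
    """
    y_pred01 = np.where(y_pred == -1, 1, 0)
    return accuracy_score(y_true, y_pred01)


# Hypothetical training group: 120 rows, 4 features, all labelled inliers (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = np.zeros(120, dtype=int)

param_grid = {
    "n_estimators": [20, 50],
    "contamination": [0.05, 0.1],
    "max_samples": [25, 50],
}

search = GridSearchCV(
    IsolationForest(random_state=0),
    param_grid=param_grid,
    scoring=make_scorer(outlier_accuracy),  # called as score_func(y_true, y_pred)
    cv=5,
)
search.fit(X, y)  # y is ignored by IsolationForest.fit but used by the scorer
print(search.best_params_)
```

When every training label is an inlier, as assumed here, accuracy simply rewards settings that flag fewer training points, so the search is only as informative as the labels supplied to it.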

 
#### Ireland senior team player analysis.
The algorithm is initially trained in an unsupervised manner: no subjective information was passed to it about whether a player was regarded as "good" or "bad". 

In this case the "norm/inliers" group for the algorithm comprises only players from the following international football teams: Brazil, France, Germany, Netherlands, Portugal and Spain. This group is labelled "best in class" (BIC).  

After training on the best in class group, the isolation forest is shown new, previously unseen data on the Irish players.
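
The train-then-score step described above can be sketched as follows. This is a minimal example on synthetic data: the array names, the feature count and the deliberately extreme last row are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical reference ("best in class") group: 200 players, 3 features.
reference = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

# Hypothetical unseen group: five typical players plus one extreme player.
typical = rng.normal(loc=0.0, scale=1.0, size=(5, 3))
extreme = np.array([[8.0, 8.0, 8.0]])
unseen = np.vstack([typical, extreme])

# Fit on the reference group only; the unseen group plays no part in training.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(reference)

labels = model.predict(unseen)            # +1 = inlier, -1 = outlier
scores = model.decision_function(unseen)  # lower values = more anomalous

# Row indices ordered from most normal to most anomalous.
ranking = np.argsort(scores)[::-1]
print(labels, ranking)
```

Sorting by `decision_function` in descending order puts the biggest outliers last, which is the same ordering used for the results table.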

The model appears to have penalised Irish players with low passing accuracy and few shots per game, and does not seem to treat winning balls in the air as particularly important. 

Results are provided in the table below.


#### Footnote: 
I normalised the continuous variables and excluded, upfront, several variables with a heavy zero count. Parameter selection was left to the grid search process. Samples were drawn without replacement during training (bootstrap=False).
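
For the normalisation step, a common convention is to fit the scaler on the training group only and reuse it on the new group, so that both groups share one scale. The sketch below shows that convention with `MinMaxScaler`; the two small data frames are made-up values, not rows from the player data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical continuous features for a training and a test group.
train = pd.DataFrame({"Height": [170.0, 180.0, 190.0], "Age": [20.0, 25.0, 30.0]})
test = pd.DataFrame({"Height": [175.0, 195.0], "Age": [22.0, 35.0]})

scaler = MinMaxScaler()

# Learn each column's min and max from the training group.
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training min/max; test values outside that range fall outside [0, 1].
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)
print(test_scaled)
```

A test value above the training maximum scales to more than 1, which keeps the information that the new player lies outside the range seen in training.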

The features/attributes used in the analysis were: Height (cm), Age, Weight (kg), Shots (per game), Pass success rate and Aerial duels won per game. Goalkeepers are excluded from both groups. Data for European teams is taken only from the recent Nations Cup.


##### Reference  
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on.  
https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf


#### Sept 2019

In [1]:
#Dependencies
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import MinMaxScaler

In [2]:
#Data source
#https://www.whoscored.com/Teams/331/Show/Ireland-Ireland

In [3]:
#Read in data, clean it, extract features and split into training and test data for the algorithm.
def clean_data():
    global train,label_train,test,label_test,df_2,df_train,df_test
    df=pd.read_csv('..ireland_football_senior.csv', header=0)
    #Clean up Player column and extract age and player position 
    df_1=pd.DataFrame(df.Player.str.split(n=3,expand=True))
    df_1.drop([0],axis=1,inplace=True) 
    df_1.columns=(['Name','Age','Position'])
    df_1['Age']=df_1.Age.str.replace(',', '')
    #Concatenate data above into new dataframe 
    df_2= pd.concat([df_1,df], axis=1)
    df_2.drop(columns='Player',inplace=True)
    df_2.index=df_2['Name']
    df_2.replace('-', 0,inplace=True)
    #exclude goalkeepers from all data
    df_2 = df_2[df_2.Position != "GK"]
    #df_train and df_test hold the unscaled features; they feed the results table and the descriptive statistics table further down.
    df_train=df_2.loc[:,['Height','Weight','Age','Pass_Success%','Shots(pg)','AerialsWon']][25:].astype(float)
    df_test=df_2.loc[:,['Height','Weight','Age','Pass_Success%','Shots(pg)','AerialsWon']][0:25].astype(float)
    #Split data into training and test. First split features to enable normalisation of the four continuous variables.
    #Training data
    #Min/max scaler to be used on the continuous variables in the feature space,
    #scaling them to between 0 and 1.
    scaler = MinMaxScaler()
    train_1=df_2.loc[:,['Shots(pg)','AerialsWon']][25:].astype(float) 
    train_2=df_2.loc[:,['Height','Weight','Age','Pass_Success%']][25:].astype(float)
    train_3= pd.DataFrame(scaler.fit_transform(train_2) , columns=('Height','Weight','Age','Pass_Success%'),index=train_2.index)
    train=pd.concat([train_1,train_3],axis=1)
    label_train=df_2.iloc[25:,-1].astype(float)      
    #Test data
    test_1=df_2.loc[:,['Shots(pg)','AerialsWon']][0:25].astype(float) 
    test_2=df_2.loc[:,['Height','Weight','Age','Pass_Success%']][0:25].astype(float)
    #Note: the scaler is refit on the test group here, so each group is scaled to its own min/max;
    #scaler.transform(test_2) would instead reuse the training-group scale.
    test_3= pd.DataFrame(scaler.fit_transform(test_2) , columns=('Height','Weight','Age','Pass_Success%'),index=test_2.index)
    test=pd.concat([test_1,test_3],axis=1)
    label_test=df_2.iloc[0:25,-1].astype(float) 

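#Function that produces the descriptive statistics table.
#Call after rank_players(), which creates df_train and df_test.
def compare_BIC_Irish():     
    #Note: astype(int) truncates the averages to whole numbers, so per-game rates below 1 display as 0.
    train_averages = df_train.mean().astype(int)
    test_averages = df_test.mean().astype(int)
    train_stdev= df_train.std()
    test_stdev= df_test.std()
    df_4= pd.DataFrame([train_averages,train_stdev,test_averages,test_stdev])
    #outlier_score is only present once rank_players() has added it to df_test.
    df_4.drop(columns='outlier_score',inplace=True,errors='ignore')
    index_1=["BIC_average","BIC_std_deviation","Irish_average","Irish_std_deviation"] 
    df_4.index=index_1    
    print("            Comparison of BIC and Irish descriptive stats")  
    return df_4 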

#Custom accuracy score function for Grid search.
#make_scorer calls the function as score_func(y_true, y_pred), so y_true must come first.
def my_scorer(y_true,y_pred):
    #Algorithm will score inlier(+1) and outlier(-1).
    #Convert algorithm scores to zero and one for inlier and outlier for accuracy metric
    y_pred=np.where(y_pred<0,1,0)
    acc = accuracy_score(y_true, y_pred, normalize=True)   
    return acc 

#Main function that ultimately produces an outlier score for senior Irish team players
def rank_players():
    clean_data()
    param_grid= {'n_estimators': (20,50,100,200),'contamination': (0.05,0.1,0.2),'max_features':(1.0,0.9,0.75),'max_samples':(25,50)}
    #Note: behaviour='new' and iid=True were removed in scikit-learn 0.24; drop both arguments when running on a newer release.
    clf_is = IsolationForest(bootstrap=False, n_jobs=-1, random_state=None, behaviour='new',verbose=0)      
    my_func = make_scorer(my_scorer, greater_is_better=True)
    #Running a grid search on the training data, using accuracy as the metric.
    gs_is= GridSearchCV(clf_is,iid=True, param_grid=param_grid, verbose=0,scoring=my_func,cv=5)
    gs_is.fit(train,label_train)
    print('               Algorithm parameters selected from Grid search')
    print(gs_is.best_params_) 
    print('')
    #The parameters below are hard-coded to match the grid search output (gs_is.best_params_).
    #bootstrap=False: samples for each tree are drawn without replacement.
    clf = IsolationForest(n_estimators=20, contamination=0.05, max_samples=25, 
                      max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, behaviour='new',verbose=0) 
    clf.fit(train)
    y_pred_ic=clf.predict(test)
    acc_is = accuracy_score(label_test,y_pred_ic,normalize=True) 
    df_test['outlier_score']= clf.decision_function(test) 
    print('')
    print('                    Irish Players ranked by outlier score')
    print('              Biggest outliers are at the bottom of the table')
    return df_test.sort_values(['outlier_score'],ascending=False)  

In [4]:
#Main function
rank_players()

               Algorithm parameters selected from Grid search
{'contamination': 0.05, 'max_features': 1.0, 'max_samples': 25, 'n_estimators': 20}


                    Irish Players ranked by outlier score
              Biggest outliers are at the bottom of the table


Name,Height,Weight,Age,Pass_Success%,Shots(pg),AerialsWon,outlier_score
ODowda,181.0,75.0,24.0,80.0,0.7,0.7,0.088999
Robinson,178.0,75.0,24.0,72.0,0.8,0.3,0.083357
Arter,176.0,70.0,29.0,84.8,0.5,1.5,0.051178
Williams,183.0,80.0,32.0,87.5,1.0,0.0,0.050079
Coleman,177.0,67.0,30.0,78.6,0.0,1.0,0.047153
Maguire,174.0,72.0,25.0,100.0,1.0,1.0,0.045075
Brady,176.0,71.0,27.0,64.3,2.0,0.0,0.043735
Stevens,183.0,78.0,29.0,74.4,0.0,0.3,0.041123
Hendrick,185.0,79.0,27.0,78.2,0.5,0.3,0.040802
O'Brien,183.0,72.0,25.0,68.6,0.3,2.3,0.038944


In [5]:
#Comparison of mean and standard deviations between the training and test data sets. 
#Training data set: BIC=  Brazil, France, Germany, Netherlands, Portugal and Spain. 
#Test data set: Irish= Senior Irish player statistics from the Nations Cup.
compare_BIC_Irish()

            Comparison of BIC and Irish descriptive stats


Statistic,Height,Weight,Age,Pass_Success%,Shots(pg),AerialsWon
BIC_average,180.0,75.0,26.0,84.0,1.0,0.0
BIC_std_deviation,5.97271,6.60255,3.827312,12.435953,1.062694,1.079415
Irish_average,181.0,75.0,27.0,68.0,0.0,1.0
Irish_std_deviation,5.441507,6.492817,3.650571,19.642317,0.533448,1.821062
