# Random Forest

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, f1_score, confusion_matrix, classification_report

In [69]:
df_football = pd.read_pickle('data/df_data_cleaned.csv')
df_football

Unnamed: 0,Name,Age,Overall,Potential,Club,Value,Wage,Special,Preferred Foot,International Reputation,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,L. Messi,31.0,94.0,94.0,FC Barcelona,1.105e+08,5.65e+07,2202.0,0.0,5.0,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,226500000.0
1,Cristiano Ronaldo,33.0,94.0,94.0,Juventus,7.7e+07,4.05e+07,2228.0,1.0,5.0,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,127100000.0
2,Neymar Jr,26.0,92.0,93.0,Paris Saint-Germain,1.185e+08,2.9e+07,2143.0,1.0,5.0,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,228100000.0
3,K. De Bruyne,27.0,91.0,92.0,Manchester City,1.02e+08,3.55e+07,2281.0,1.0,4.0,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,196400000.0
4,E. Hazard,27.0,91.0,91.0,Chelsea,9.3e+07,3.4e+07,2142.0,1.0,4.0,...,91.0,34.0,27.0,22.0,11.0,12.0,6.0,8.0,8.0,172100000.0
5,L. Modrić,32.0,91.0,91.0,Real Madrid,6.7e+07,4.2e+07,2280.0,1.0,4.0,...,84.0,60.0,76.0,73.0,13.0,9.0,7.0,14.0,9.0,137400000.0
6,L. Suárez,31.0,91.0,91.0,FC Barcelona,8e+07,4.55e+07,2346.0,1.0,5.0,...,85.0,62.0,45.0,38.0,27.0,25.0,31.0,33.0,37.0,164000000.0
7,Sergio Ramos,32.0,91.0,91.0,Real Madrid,5.1e+07,3.8e+07,2201.0,1.0,4.0,...,82.0,87.0,92.0,91.0,11.0,8.0,9.0,7.0,11.0,104600000.0
8,R. Lewandowski,29.0,90.0,90.0,FC Bayern München,7.7e+07,2.05e+07,2152.0,1.0,4.0,...,86.0,34.0,42.0,19.0,15.0,6.0,12.0,8.0,10.0,127100000.0
9,T. Kroos,28.0,90.0,90.0,Real Madrid,7.65e+07,3.55e+07,2190.0,1.0,4.0,...,85.0,72.0,79.0,69.0,10.0,11.0,13.0,7.0,10.0,156800000.0


In [76]:
train, test = train_test_split(df_football, train_size=0.75, test_size=0.25)
X_train = train.drop(columns=['Name', 'Club', 'Position', 'Preferred Foot', 'Value'])
y_train = train['Value']  # 1-D Series, so scikit-learn does not warn about a column-vector y
X_test = test.drop(columns=['Name', 'Club', 'Position', 'Preferred Foot', 'Value'])
y_test = test['Value']

## Random Forest
![Random forest image](https://miro.medium.com/max/936/1*58f1CZ8M4il0OZYg2oRN4w.png)
A Random Forest is almost the same as a Decision Tree, but instead of one tree we have "N" trees. It is one of the best-known ensemble learning methods. The idea is to combine multiple decision trees to determine the final output (averaging the trees' predictions in regression; in classification, choosing the class the trees favour overall), so the model captures more about our dataset than a single tree can (100 decision trees are usually better than 1). If you want to know more about Decision Trees, see their section in this repository.
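To make the "many trees, one output" idea concrete, here is a minimal sketch on synthetic data. The `make_regression` dataset and all variable names are assumptions for illustration, not part of this notebook. It checks that, for a regressor, the forest prediction is the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the football dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; collect their individual predictions
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest prediction
manual_average = tree_preds.mean(axis=0)
same = np.allclose(manual_average, forest.predict(X_demo))
print('Forest prediction equals the mean of its trees:', same)
```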

## Random Forest Regressor
### Params
Most of the params are the same as in the Decision Tree Regressor, so here we only look at the new hyperparameters:
* n_estimators: the number of decision trees in our Random Forest.
* bootstrap: True/False. If True, each tree is built on a bootstrap sample (drawn with replacement) of the training set, which helps reduce overfitting and usually gives very good results. If False, every tree is trained on the whole dataset. Keeping it set to True is strongly recommended.
* oob_score: True/False. A way to validate the Random Forest without a separate validation set. Each bootstrap sample leaves out some training rows (the "out-of-bag" samples), and the model is scored on those left-out rows. It requires bootstrap=True.
* warm_start: True/False. If True, calling fit again keeps the trees that are already trained and only adds the new ones, which speeds things up when you fit repeatedly, for example to add more trees to an existing Random Forest.
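The hyperparameters above can be sketched on a small synthetic dataset. Everything here (the `make_regression` data, the tree counts, the variable names) is an assumption chosen for illustration. The sketch reads the out-of-bag score after fitting and then uses `warm_start` to grow the same forest from 50 to 80 trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the football dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50,
    bootstrap=True,    # each tree sees a bootstrap sample of the rows
    oob_score=True,    # score the forest on the rows each tree did not see
    warm_start=True,   # keep already-fitted trees on the next call to fit
    random_state=0,
)
forest.fit(X_demo, y_demo)
n_before = len(forest.estimators_)
oob_r2 = forest.oob_score_  # R^2 estimated from the out-of-bag samples
print('Trees:', n_before, '| OOB R^2:', round(oob_r2, 3))

# Ask for more trees and fit again: only the 30 new trees are trained
forest.set_params(n_estimators=80)
forest.fit(X_demo, y_demo)
n_after = len(forest.estimators_)
print('Trees after warm start:', n_after)
```

For a regressor, `oob_score_` is an R² value, so it is not directly comparable with the MAE and RMSE figures printed in this notebook.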

In [79]:
rf_reg = RandomForestRegressor(random_state=2019)

rf_reg.fit(X_train, y_train)
predictions = rf_reg.predict(X_test)
print('MAE in train:', mean_absolute_error(y_train, rf_reg.predict(X_train)))
print('RMSE in train:', np.sqrt(mean_squared_error(y_train, rf_reg.predict(X_train))))
print('MAE in test:', mean_absolute_error(y_test, rf_reg.predict(X_test)))
print('RMSE in test:', np.sqrt(mean_squared_error(y_test, rf_reg.predict(X_test))))


MAE in train: 1687209.0078683188
RMSE in train: 4781322.293865469
MAE in test: 3992533.9120998373
RMSE in test: 10632635.1290914
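As a quick reminder of what these two metrics measure, here is a tiny worked example with made-up numbers (the three values are assumptions, not taken from the dataset). MAE averages the absolute errors, while RMSE is the square root of the averaged squared errors, so it penalises large errors more.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                        # square root of the MSE

print('MAE:', mae)
print('RMSE:', rmse)
```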


We can see that, without any parameter tuning, our Random Forest Regressor has about the same error as our Decision Tree Regressor with parameter tuning. Now let's tune some hyperparameters.

In [80]:
rf_reg2 = RandomForestRegressor(n_estimators=200, oob_score=True, max_depth=25,
                                random_state=2019, n_jobs=-1) # n_jobs=-1 uses all CPU cores; set n_jobs=-2 to leave one core free

rf_reg2.fit(X_train, y_train)
predictions = rf_reg2.predict(X_test)
print('MAE in train:', mean_absolute_error(y_train, rf_reg2.predict(X_train)))
print('RMSE in train:', np.sqrt(mean_squared_error(y_train, rf_reg2.predict(X_train))))
print('MAE in test:', mean_absolute_error(y_test, rf_reg2.predict(X_test)))
print('RMSE in test:', np.sqrt(mean_squared_error(y_test, rf_reg2.predict(X_test))))


MAE in train: 1478463.8207002212
RMSE in train: 3922112.655077254
MAE in test: 3709972.681636176
RMSE in test: 9789550.577624224


We have reduced our RMSE by about 1M compared with our Decision Tree Regressor. Let's look at the feature importances now.

In [81]:
pd.DataFrame(rf_reg2.feature_importances_, index=X_train.columns).sort_values(by=0, ascending=False)[:8]

| | 0 |
|---|---|
| Release Clause | 0.685678 |
| Overall | 0.099129 |
| Potential | 0.047168 |
| Wage | 0.01309 |
| Age | 0.012588 |
| StandingTackle | 0.005407 |
| SlidingTackle | 0.005174 |
| Finishing | 0.00464 |


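As with the Decision Tree, Release Clause is by far the most important feature (about 0.69 of the total importance). Since it is so closely tied to Value, it may be worth dropping this feature and checking whether we obtain a better result.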
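A minimal sketch of that experiment, on synthetic data rather than the football dataset: the column names (`skill`, `age`, `near_copy`, `target`) and the helper `fit_and_score` are hypothetical, invented here for illustration. One feature is built as a near copy of the target, the model is fitted with and without it, and the test MAE and importances are compared.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic frame (assumption): 'near_copy' is almost a rescaled copy of the target
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({'skill': rng.normal(size=n), 'age': rng.normal(size=n)})
demo['target'] = 3.0 * demo['skill'] - demo['age'] + rng.normal(scale=0.5, size=n)
demo['near_copy'] = 2.0 * demo['target'] + rng.normal(scale=0.1, size=n)

def fit_and_score(frame, feature_cols):
    """Fit a forest on the given columns; return test MAE and named importances."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        frame[feature_cols], frame['target'], test_size=0.25, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    importances = pd.Series(model.feature_importances_, index=feature_cols)
    return mean_absolute_error(y_te, model.predict(X_te)), importances

mae_all, imp_all = fit_and_score(demo, ['skill', 'age', 'near_copy'])
mae_reduced, imp_reduced = fit_and_score(demo, ['skill', 'age'])

print('MAE with near_copy:', mae_all)
print('MAE without near_copy:', mae_reduced)
print(imp_all.sort_values(ascending=False))
```

Whether dropping a dominant feature helps depends on the goal: the error may go up, but the remaining importances often say more about what drives the target.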

## Random Forest Classifier

In [91]:
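def age_group_creator(age):
    """Map an age in years to an age group: 0 (young), 1 (medium age) or 2 (old)."""
    if age < 24:
        return 0 #young
    elif age < 30:
        return 1 #medium age
    else:
        return 2 #old

# apply the function to the Age column only, instead of row by row over the whole frame
df_football['age_group'] = df_football['Age'].apply(age_group_creator)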
df_football

Unnamed: 0,Name,Age,Overall,Potential,Club,Value,Wage,Special,Preferred Foot,International Reputation,...,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause,age_group
0,L. Messi,31.0,94.0,94.0,FC Barcelona,1.105e+08,5.65e+07,2202.0,0.0,5.0,...,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,226500000.0,2
1,Cristiano Ronaldo,33.0,94.0,94.0,Juventus,7.7e+07,4.05e+07,2228.0,1.0,5.0,...,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,127100000.0,2
2,Neymar Jr,26.0,92.0,93.0,Paris Saint-Germain,1.185e+08,2.9e+07,2143.0,1.0,5.0,...,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,228100000.0,1
3,K. De Bruyne,27.0,91.0,92.0,Manchester City,1.02e+08,3.55e+07,2281.0,1.0,4.0,...,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,196400000.0,1
4,E. Hazard,27.0,91.0,91.0,Chelsea,9.3e+07,3.4e+07,2142.0,1.0,4.0,...,34.0,27.0,22.0,11.0,12.0,6.0,8.0,8.0,172100000.0,1
5,L. Modrić,32.0,91.0,91.0,Real Madrid,6.7e+07,4.2e+07,2280.0,1.0,4.0,...,60.0,76.0,73.0,13.0,9.0,7.0,14.0,9.0,137400000.0,2
6,L. Suárez,31.0,91.0,91.0,FC Barcelona,8e+07,4.55e+07,2346.0,1.0,5.0,...,62.0,45.0,38.0,27.0,25.0,31.0,33.0,37.0,164000000.0,2
7,Sergio Ramos,32.0,91.0,91.0,Real Madrid,5.1e+07,3.8e+07,2201.0,1.0,4.0,...,87.0,92.0,91.0,11.0,8.0,9.0,7.0,11.0,104600000.0,2
8,R. Lewandowski,29.0,90.0,90.0,FC Bayern München,7.7e+07,2.05e+07,2152.0,1.0,4.0,...,34.0,42.0,19.0,15.0,6.0,12.0,8.0,10.0,127100000.0,1
9,T. Kroos,28.0,90.0,90.0,Real Madrid,7.65e+07,3.55e+07,2190.0,1.0,4.0,...,72.0,79.0,69.0,10.0,11.0,13.0,7.0,10.0,156800000.0,1
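The same binning can also be sketched with `pd.cut`, which avoids a Python-level function call per row. The ages below are made-up values for illustration; the bin edges mirror the thresholds used above (under 24, 24 to 29, 30 and over).

```python
import numpy as np
import pandas as pd

# Made-up ages, for illustration only
ages = pd.Series([18, 23, 24, 29, 30, 35], name='Age')

# right=False makes the bins [-inf, 24), [24, 30), [30, inf)
age_group = pd.cut(
    ages,
    bins=[-np.inf, 24, 30, np.inf],
    right=False,
    labels=[0, 1, 2],   # 0 = young, 1 = medium age, 2 = old
).astype(int)

print(age_group.tolist())  # [0, 0, 1, 1, 2, 2]
```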


In [92]:
train, test = train_test_split(df_football, train_size=0.75, test_size=0.25)
X_train = train.drop(columns=['Name', 'Club', 'Position', 'Preferred Foot', 'Age', 'age_group'])
y_train = train['age_group']  # 1-D Series, so scikit-learn does not warn about a column-vector y
X_test = test.drop(columns=['Name', 'Club', 'Position', 'Preferred Foot', 'Age', 'age_group'])
y_test = test['age_group']

Without param tuning

In [93]:
#instance
rf_class = RandomForestClassifier()

#train
rf_class.fit(X_train, y_train)

#train preds
preds_train = rf_class.predict(X_train)

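#metrics classifications in train
#note: scikit-learn metrics expect (y_true, y_pred); with the predictions passed first, as here,
#the rows of the confusion matrix are predictions and precision/recall are swapped in the report
print('CLASSIFICATION IN TRAIN')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_train, y_train))
print()
print('F1 SCORE:\n', f1_score(preds_train, y_train, average='micro'))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_train, y_train))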


CLASSIFICATION IN TRAIN

CONFUSION MATRIX:
 [[4431   15    3]
 [   8 4497   50]
 [   0    4 2049]]

F1 SCORE:
 0.9927647644026408

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      4449
           1       1.00      0.99      0.99      4555
           2       0.97      1.00      0.99      2053

    accuracy                           0.99     11057
   macro avg       0.99      0.99      0.99     11057
weighted avg       0.99      0.99      0.99     11057



In [94]:
#Classification in test
preds_test = rf_class.predict(X_test)

#metrics classifications in test
print('CLASSIFICATION IN TEST')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_test, y_test))
print()
print('F1 SCORE:\n', f1_score(preds_test, y_test, average='micro'))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_test, y_test))

CLASSIFICATION IN TEST

CONFUSION MATRIX:
 [[1248  213    8]
 [ 197 1187  395]
 [   4  118  316]]

F1 SCORE:
 0.74633749321758

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       0.86      0.85      0.86      1469
           1       0.78      0.67      0.72      1779
           2       0.44      0.72      0.55       438

    accuracy                           0.75      3686
   macro avg       0.69      0.75      0.71      3686
weighted avg       0.77      0.75      0.75      3686
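When reading these reports, keep in mind that scikit-learn's metric functions are documented as taking `(y_true, y_pred)` in that order. The small sketch below uses made-up labels (an assumption, not data from this notebook) to show what changes when the two arguments are swapped, and why the micro-averaged F1 score printed above matches the accuracy.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for a 3-class problem, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

# Swapping the arguments transposes the confusion matrix
cm = confusion_matrix(y_true, y_pred)
cm_swapped = confusion_matrix(y_pred, y_true)
print('Transposed when swapped:', np.array_equal(cm_swapped, cm.T))

# Swapping the arguments exchanges per-class precision and recall
precision = precision_score(y_true, y_pred, average=None)
recall_swapped = recall_score(y_pred, y_true, average=None)
print('Precision equals swapped recall:', np.allclose(precision, recall_swapped))

# For single-label multiclass problems, micro-averaged F1 equals accuracy
micro_f1 = f1_score(y_true, y_pred, average='micro')
accuracy = accuracy_score(y_true, y_pred)
print('Micro F1:', micro_f1, '| accuracy:', accuracy)
```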



With param tuning

In [134]:
#instance
rf_class2 = RandomForestClassifier(criterion='entropy', n_estimators=400, max_features=25, class_weight='balanced_subsample',
                                   n_jobs=-1, oob_score=True)

#train
rf_class2.fit(X_train, y_train)

#train preds
preds_train = rf_class2.predict(X_train)

#metrics classifications in train
print('CLASSIFICATION IN TRAIN')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_train, y_train))
print()
print('F1 SCORE:\n', f1_score(preds_train, y_train, average='micro'))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_train, y_train))

  


CLASSIFICATION IN TRAIN

CONFUSION MATRIX:
 [[4439    0    0]
 [   0 4516    0]
 [   0    0 2102]]

F1 SCORE:
 1.0

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      4439
           1       1.00      1.00      1.00      4516
           2       1.00      1.00      1.00      2102

    accuracy                           1.00     11057
   macro avg       1.00      1.00      1.00     11057
weighted avg       1.00      1.00      1.00     11057



In [138]:
#Classification in test
preds_test = rf_class2.predict(X_test)

#metrics classifications in test
print('CLASSIFICATION IN TEST')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_test, y_test))
print()
print('F1 SCORE:\n', f1_score(preds_test, y_test, average='micro'))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_test, y_test))

CLASSIFICATION IN TEST

CONFUSION MATRIX:
 [[1318  103    0]
 [ 131 1368  248]
 [   0   47  471]]

F1 SCORE:
 0.8564839934888768

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       0.91      0.93      0.92      1421
           1       0.90      0.78      0.84      1747
           2       0.66      0.91      0.76       518

    accuracy                           0.86      3686
   macro avg       0.82      0.87      0.84      3686
weighted avg       0.87      0.86      0.86      3686



In [140]:
pd.DataFrame(rf_class2.feature_importances_, index=X_train.columns).sort_values(by=0, ascending=False)[:8]

| | 0 |
|---|---|
| Potential | 0.197758 |
| Overall | 0.141178 |
| Value | 0.060461 |
| Release Clause | 0.046315 |
| Reactions | 0.040503 |
| Composure | 0.038533 |
| SprintSpeed | 0.024189 |
| Special | 0.021489 |


In [146]:
#See the individual fitted trees of our forest (n_estimators=400 above)
rf_class2.estimators_

[DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                        max_features=25, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1035469788, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                        max_features=25, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1887871593, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                        max_features=25, max_leaf_nodes=None,
                    

Our Random Forest Classifier has about the same accuracy as our Decision Tree Classifier (0.86 on the test set here). This shows us that no model is always better than the others. It all depends on our dataset and the types of data we're dealing with.
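One way to compare two models more fairly than with a single train/test split is cross-validation. The sketch below runs on a synthetic 3-class dataset; the `make_classification` data and all settings are assumptions for illustration, so its scores say nothing about the football data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data, used only for illustration (not the football dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
          for name, model in models.items()}

for name, fold_scores in scores.items():
    print(name, 'mean accuracy:', fold_scores.mean(), '+/-', fold_scores.std())
```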