# Random Forest

One of the models we implemented for lung cancer classification was Random Forest, an ensemble learning method that builds many decision trees at training time and combines their predictions.  
This algorithm is well suited to classification problems, copes with a large number of features, and makes it easy to inspect the relative importance assigned to each feature.
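
To illustrate the feature-importance point, here is a minimal sketch on synthetic data (not our radiomics features). The dataset, its size and the `random_state` values are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 8 features, of which 3 are informative (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per feature, normalised to sum to 1.
importances = forest.feature_importances_

# Rank the features from most to least important.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, importance in ranked:
    print(f"feature_{index}: {importance:.3f}")
```

On real data the same attribute can be paired with the DataFrame's column names instead of positional indices.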

We start by importing the relevant libraries and dropping the unused `id` column from our CSV.

In [3]:
import pandas as pd
import numpy as np
from kfold_and_metrics import *

from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier

In [4]:
df = pd.read_csv("final.csv")
df = df.drop(columns=["id"])
df.head()

Unnamed: 0,patient_id,diagnostics_Image-original_Mean,diagnostics_Mask-original_VoxelNum,diagnostics_Mask-original_VolumeNum,diagnostics_Image-interpolated_Mean,diagnostics_Image-interpolated_Minimum,diagnostics_Image-interpolated_Maximum,diagnostics_Mask-interpolated_VoxelNum,diagnostics_Mask-interpolated_VolumeNum,diagnostics_Mask-interpolated_Maximum,...,diagnostics_Mask-interpolated_BoundingBox_2,diagnostics_Mask-interpolated_BoundingBox_3,diagnostics_Mask-interpolated_BoundingBox_4,diagnostics_Mask-interpolated_BoundingBox_5,diagnostics_Mask-interpolated_CenterOfMassIndex_0,diagnostics_Mask-interpolated_CenterOfMassIndex_1,diagnostics_Mask-interpolated_CenterOfMassIndex_2,diagnostics_Mask-interpolated_CenterOfMass_0,diagnostics_Mask-interpolated_CenterOfMass_1,diagnostics_Mask-interpolated_CenterOfMass_2
0,LIDC-IDRI-0001,-826.943929,5905,1,-417.494203,-990.291016,1038.270874,909,2,237.087921,...,0.0,13.0,11.0,10.0,17.041265,16.108666,4.184319,128.652843,34.787644,-229.881362
1,LIDC-IDRI-0001,-826.943929,4613,1,-405.581777,-982.456726,949.768005,699,1,221.953705,...,0.0,13.0,11.0,10.0,17.041265,16.108666,4.184319,128.652843,34.787644,-229.881362
2,LIDC-IDRI-0001,-826.943929,4955,1,-410.236759,-990.291016,1038.270874,772,1,237.087921,...,0.0,13.0,11.0,10.0,17.041265,16.108666,4.184319,128.652843,34.787644,-229.881362
3,LIDC-IDRI-0001,-826.943929,5498,1,-416.576321,-990.291016,1038.270874,841,2,237.087921,...,0.0,13.0,11.0,10.0,17.041265,16.108666,4.184319,128.652843,34.787644,-229.881362
4,LIDC-IDRI-0002,-826.943929,10351,1,-546.359139,-1007.657349,1020.174988,749,1,160.687653,...,0.0,13.0,11.0,10.0,17.041265,16.108666,4.184319,128.652843,34.787644,-229.881362


### Hyperparameter Tuning

To limit overfitting, we tuned the number of estimators, the maximum tree depth, and the minimum number of samples per leaf.  
We also tested which split criterion suited our data best.
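
For a sense of scale, this search can be written as a Cartesian product. The sketch below assumes the same value ranges as the grid-search cell that follows; it only counts and lists the combinations and does not train any model.

```python
from itertools import product

# Value ranges taken from the grid-search cell in this notebook.
criteria = ["gini", "entropy", "log_loss"]
n_estimators_values = range(25, 201, 25)   # 25, 50, ..., 200
max_depth_values = range(5, 56, 10)        # 5, 15, ..., 55
min_samples_leaf_values = range(5, 26, 5)  # 5, 10, ..., 25

# One dict per combination, in the same order as the nested loops.
grid = [
    {"n_estimators": n, "max_depth": d, "min_samples_leaf": leaf, "criterion": c}
    for c, n, d, leaf in product(
        criteria, n_estimators_values, max_depth_values, min_samples_leaf_values
    )
]

print(len(grid))  # 3 * 8 * 6 * 5 = 720 combinations
print(grid[0])
```

Each combination is evaluated with a full k-fold cross-validation, so the total number of model fits is 720 times the number of folds.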

In [3]:
best_auc = 0
best = {}

# Exhaustive grid search over split criterion, forest size, tree depth and leaf size.
for crit in ["gini", "entropy", "log_loss"]:
    for n_est in range(25, 201, 25):
        for m_depth in range(5, 56, 10):
            for m_samples_leaf in range(5, 26, 5):
                params = {'n_estimators': n_est, 'max_depth': m_depth, 'min_samples_leaf': m_samples_leaf, 'criterion': crit}
                print("Current parameter combination:")
                for parameter, value in params.items():
                    print(f"\t{parameter}: {value}")
                print()

                # Score this combination with our k-fold cross-validation helper.
                rf_model = RandomForestClassifier(**params)
                score = k_fold_cv(model=rf_model, df=df, metric_funcs=[roc_auc_score], pca_components=50)
                avg_auc, std = weighted_avg_and_std(np.array(score["roc_auc_score"]))

                # Keep the combination with the highest average AUC seen so far.
                if avg_auc > best_auc:
                    best_auc = avg_auc
                    best = params

Current parameter combination:
	n_estimators: 25
	max_depth: 5
	min_samples_leaf: 5
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 5
	min_samples_leaf: 10
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 5
	min_samples_leaf: 15
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 5
	min_samples_leaf: 20
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 5
	min_samples_leaf: 25
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 15
	min_samples_leaf: 5
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 15
	min_samples_leaf: 10
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 15
	min_samples_leaf: 15
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 15
	min_samples_leaf: 20
	criterion: gini

Current parameter combination:
	n_estimators: 25
	max_depth: 15
	min_sa

After running the search, we obtained the following values:

In [4]:
print("Results of the grid search hyperparameter tuning:")
for parameter, value in best.items():
    print(f"\t{parameter}: {value}")

Results of the grid search hyperparameter tuning:
	n_estimators: 100
	max_depth: 25
	min_samples_leaf: 25
	criterion: gini

We then evaluated a model with these parameters using k-fold cross-validation and collected the mean and standard deviation of each metric.


In [6]:
best = {'n_estimators': 100, 'max_depth': 25, 'min_samples_leaf': 25, 'criterion': "gini"}

best_rf = RandomForestClassifier(**best)
scores = k_fold_cv(model=best_rf, df=df, pca_components=50)

metrics_results = mean_std_results_k_fold_CV(scores)
metrics_results


Unnamed: 0,metric,mean,std
0,f1_score,0.31356,0.067326
1,accuracy_score,0.645838,0.042222
2,roc_auc_score,0.554891,0.035595
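
The table above reports one mean and one standard deviation per metric. As a rough sketch of how such a summary can be computed, assume the cross-validation helper returns a dict mapping each metric name to a list of per-fold values (as `score["roc_auc_score"]` suggests). The `summarize` function is hypothetical and uses the unweighted mean and sample standard deviation, which may differ from what `mean_std_results_k_fold_CV` does. The fold values are placeholders, not results from this notebook.

```python
from statistics import mean, stdev

# Placeholder per-fold values for illustration only; NOT results from this notebook.
example_scores = {
    "accuracy_score": [0.60, 0.65, 0.70],
    "roc_auc_score": [0.50, 0.55, 0.60],
}


def summarize(scores):
    """Return one (metric, mean, sample std) row per metric (hypothetical helper)."""
    return [(name, mean(values), stdev(values)) for name, values in scores.items()]


summary = summarize(example_scores)
for name, average, deviation in summary:
    print(f"{name}: mean={average:.3f}, std={deviation:.3f}")
```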
