## Find Best Random Forest Hyper-Parameters (with 5-fold cross-validation)

A Random Forest model has quite a few hyper-parameters to tune, so an exhaustive grid search can be too time-consuming. Instead, we tune the hyper-parameters in stages, from the more important ones to the less important ones, while the hyper-parameters not yet tuned keep their default values. Related hyper-parameters are tuned together.
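The staged idea can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `GridSearchCV` rather than the hand-written loops in this notebook, a synthetic dataset from `make_classification` as a stand-in for the real data, and deliberately tiny grids. The names `X_demo`, `y_demo`, `stage1` and `stage2` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 13 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=0
)

# Stage 1: tune n_estimators and criterion together; all other parameters stay at their defaults.
stage1 = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "criterion": ["gini", "entropy"]},
    cv=5,
)
stage1.fit(X_demo, y_demo)

# Stage 2: freeze the stage-1 winners, then tune max_features and max_depth together.
stage2 = GridSearchCV(
    RandomForestClassifier(random_state=0, **stage1.best_params_),
    param_grid={"max_features": ["sqrt", None], "max_depth": [3, 6, None]},
    cv=5,
)
stage2.fit(X_demo, y_demo)

best_params = {**stage1.best_params_, **stage2.best_params_}
print(best_params)

# Cross-validation fits needed (5 folds each), ignoring the final refit per search:
n_fits_staged = 5 * (2 * 2 + 2 * 3)   # two small grids evaluated one after the other
n_fits_full = 5 * (2 * 2 * 2 * 3)     # one exhaustive grid over all four parameters
print(n_fits_staged, n_fits_full)
```

The staged search needs fewer fits than the full grid, at the price of possibly missing interactions between parameters tuned in different stages.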

In [1]:
# import all models and utils 
import time 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import sklearn.metrics as metrics

import matplotlib.pyplot as plt 
from matplotlib.colors import ListedColormap
from sklearn.metrics import ConfusionMatrixDisplay 

from sklearn.model_selection import cross_val_score 

# Load Necessary Libs and Read Data 
import numpy as np 
import pandas as pd 

In [2]:
def get_all_data( train_csv, val_csv ):
    # Read both splits and stack them into one dataset for cross-validation.
    train_data = pd.read_csv( train_csv )
    val_data = pd.read_csv( val_csv ) 
    cols = train_data.columns   # avoid the name `vars`, which shadows a Python builtin
    feat_num = len(cols) - 3    # exclude the first 2 columns (row index, ID) and the last one (label)
    print(cols)
    print( train_data.iloc[0] )
    # omit the first two columns: they are identifiers, not predictive variables
    train_x, train_y = train_data[ cols[2:-1] ], train_data[ [cols[-1]] ]
    val_x, val_y = val_data[ cols[2:-1] ], val_data[ [cols[-1]] ] 
    xs, ys = pd.concat( [train_x, val_x], axis=0 ), pd.concat( [train_y, val_y], axis=0 )
    return xs, ys, feat_num


X, y, feat_num = get_all_data( "train.csv", "val.csv" ) 
# X, y = X.values, y.values
y = np.reshape( y.values, (-1) )

Index(['Unnamed: 0', 'ID', 'Gender', 'Age(y)', 'Diameter(mm)', 'Shape',
       'Margin', 'Cortex.size(mm)', 'Cortical.morphologic.features',
       'Nodal.echogenicity', 'Calcifications', 'Hilum', 'History.of.cancer',
       'Number.of.suspicious.axillary.lymphnodes',
       'Multiple.regions.of.suspicious.lymphnodes', 'Pathology'],
      dtype='object')
Unnamed: 0                                        415
ID                                           us713876
Gender                                              0
Age(y)                                             52
Diameter(mm)                                       29
Shape                                               0
Margin                                              0
Cortex.size(mm)                                    16
Cortical.morphologic.features                       2
Nodal.echogenicity                                  1
Calcifications                                      0
Hilum                                            

### 1. Random Forest 

The most important hyper-parameters of Random Forest include - 

1. n_estimators [default: 100]: the number of decision trees in the forest;
2. criterion [default: 'gini']: the function that measures the quality of a split: "gini" for the Gini impurity, "entropy" and "log_loss" both for the Shannon information gain. The "criterion" parameter is tree-specific;
3. max_features [default: 'sqrt']: the number of features considered when looking for the best split at each node;
4. max_depth [default: None]: the maximum depth of each decision tree;
5. max_leaf_nodes [default: None]: the maximum number of leaf nodes in each tree;
6. max_samples [default: None]: the number of samples drawn to train each decision tree (when bootstrapping is enabled);
7. min_samples_split [default: 2]: the minimum number of samples required to split an internal node;
...

In our experiments, adjusting parameters 5-7 did not noticeably change model performance. Thus, we only show the model selection procedure for parameters 1-4.
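The default values quoted in the list above can differ between scikit-learn releases, so it may be worth checking them against the installed version. A minimal sketch using `get_params()`, which every scikit-learn estimator provides:

```python
from sklearn.ensemble import RandomForestClassifier

# Defaults of an untouched estimator, as reported by the installed scikit-learn version.
defaults = RandomForestClassifier().get_params()

for name in ["n_estimators", "criterion", "max_features", "max_depth",
             "max_leaf_nodes", "max_samples", "min_samples_split"]:
    print(f"{name}: {defaults[name]!r}")
```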


#### A. n_estimators and criterion

These two parameters set the number of trees in the forest and the criterion used to measure the quality of a node split.

In [None]:
n_times = 4      # larger value, longer runtime    
# Repeat the 5-fold cross-validation n_times and average the scores to smooth out the randomness of forest training
criterions = [ 'gini', 'entropy', 'log_loss' ]   # 'log_loss' is accepted by scikit-learn 1.1 and later
n_estimators = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
best_pair = [ 0, None, None ]  
res = []

time_1 = time.time()
for criterion in criterions:
    print( criterion )
    tmp = []
    for ne in n_estimators:
        score = 0
        for i in range(n_times):
            rf = RandomForestClassifier( criterion=criterion, n_estimators=ne ) 
            score += np.mean( cross_val_score( rf, X, y, cv=5 ) )
        score /= n_times
        # print(score)
        if score > best_pair[0]:
            best_pair = [ score, criterion, ne ]
        tmp.append( score ) 
    res.append(tmp)
    print(tmp) 
time_2 = time.time() 
print( f"It takes {time_2 - time_1}s to complete cross validations." )

print( best_pair )

gini
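A note on the repetition in the loop above: with an integer `cv` and a classifier, `cross_val_score` uses unshuffled stratified folds, so the repeats vary only the random state of the forest and not the train/test splits. If varying the splits is also wanted, one option is to pass a `RepeatedStratifiedKFold` splitter. The sketch below is an assumption-laden illustration on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical), not the procedure used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 13 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=0
)

# 5 folds repeated 4 times, reshuffled on every repeat: 20 scores in total.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50), X_demo, y_demo, cv=rskf
)

# The spread across folds helps judge whether a gap between two settings is meaningful.
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f} over {len(scores)} folds")
```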


In [None]:
def multi_line_plot( xs, ys, ns, cs, y_range, names=("", "") ): 
    # xs: x values, ys: one list of scores per line, ns: line names, cs: line colors
    positions = list( range(len(xs)) )
    for i in range(len(ns)):
        # plot against the index so the unevenly spaced n_estimators values are spread evenly
        plt.plot( positions, ys[i], color=cs[i] ) 
    plt.xticks( positions, xs )   # label each position with its actual value
    plt.ylim( y_range )
    plt.xlabel( names[0] )
    plt.ylabel( "Accuracy" )
    plt.title( names[1] )
    plt.legend( ns, loc="lower right", handlelength=1.5, fontsize=12 )
    plt.show() 

ns, cs = [ 'gini', 'entropy', 'log_loss' ], [ "r","g","b" ]
multi_line_plot( n_estimators, res, ns, cs, [0.70,0.85], names=[ "n_estimators", "Random Forest" ] ) 

# conclusion - the 3 criteria perform similarly; keep gini/log_loss, which reach a higher peak 
# for computational efficiency, choose n_estimators in the range [100, 200] 
# repeat the same procedure on a finer grid ... 

In [None]:
# "cv" is the number of folds k in k-fold cross-validation
n_times = 4      # larger value, longer runtime    
# Repeat the 5-fold cross-validation n_times and average the scores to smooth out the randomness of forest training
criterions = [ 'gini', 'log_loss' ]
n_estimators = list( range( 80, 300, 20 ) )   # 80, 100, ..., 280
best_pair = [ 0, None, None ]  
res = []

time_1 = time.time()
for criterion in criterions:
    print( criterion )
    tmp = []
    for ne in n_estimators:
        score = 0
        for i in range(n_times):
            rf = RandomForestClassifier( criterion=criterion, n_estimators=ne ) 
            score += np.mean( cross_val_score( rf, X, y, cv=5 ) )
        score /= n_times
        # print(score)
        if score > best_pair[0]:
            best_pair = [ score, criterion, ne ]
        tmp.append( score ) 
    res.append(tmp)
    print(tmp) 
time_2 = time.time() 
print( f"It takes {time_2 - time_1}s to complete cross validations." )

print( best_pair )

In [None]:
def multi_line_plot( xs, ys, ns, cs, y_range, names=("", "") ): 
    # redefined to plot against the actual x values (the grid is evenly spaced here)
    for i in range(len(ns)):
        plt.plot( xs, ys[i], color=cs[i] ) 
    plt.xlim( (xs[0], xs[-1]) )
    plt.ylim( y_range )
    plt.xlabel( names[0] )
    plt.ylabel( "Accuracy" )
    plt.title( names[1] )
    plt.legend( ns, loc="lower right", handlelength=1.5, fontsize=12 )
    plt.show() 

ns, cs = [ 'gini', 'log_loss' ], [ "r","g" ]
multi_line_plot( n_estimators, res, ns, cs, [0.78,0.81], names=[ "n_estimators", "Random Forest" ] ) 

# gini is more stable; choose n_estimators=280 (or a value close to it)

#### B. max_features and max_depth 

max_features and max_depth constrain how decision trees grow during training. 
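Before comparing 'sqrt' and 'log2', it helps to know how many features each setting actually allows per split. The dataset here has 13 predictive columns (16 columns minus the two identifier columns and the label). To my understanding, scikit-learn truncates both `sqrt(13)` and `log2(13)` to 3, so the two settings should behave the same on this data. The sketch below checks this on synthetic stand-in data with 13 features (`X_demo`, `y_demo` are hypothetical) by reading the fitted trees' `max_features_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

n_features = 13  # predictive columns left after dropping the two ID columns and the label
print(int(np.sqrt(n_features)), int(np.log2(n_features)))

# Synthetic stand-in data (assumption) with the same number of features.
X_demo, y_demo = make_classification(n_samples=100, n_features=n_features, random_state=0)

resolved = {}
for setting in ["sqrt", "log2", None]:
    rf = RandomForestClassifier(n_estimators=5, max_features=setting, random_state=0)
    rf.fit(X_demo, y_demo)
    # max_features_ is the per-split feature count inferred by each fitted tree
    resolved[setting] = rf.estimators_[0].max_features_

print(resolved)
```

If both settings resolve to the same count, any accuracy gap between them in the experiments below comes from run-to-run randomness rather than from the setting itself.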

In [None]:
n_times = 4      # larger value, longer runtime    
# Repeat the 5-fold cross-validation n_times and average the scores to smooth out the randomness of forest training
max_feats = [ None, 'sqrt', 'log2' ]
max_depths = [ i*2 for i in range(1,11) ] 
max_depths.append( None )
best_pair = [ 0, None, None ]  
res = []

time_1 = time.time()
for max_feat in max_feats:
    print( max_feat )
    tmp = []
    for max_depth in max_depths: 
        score = 0
        for i in range(n_times):
            rf = RandomForestClassifier( max_features=max_feat, max_depth=max_depth, criterion='gini', n_estimators=280 ) 
            score += np.mean( cross_val_score( rf, X, y, cv=5 ) )
        score /= n_times
        # print(score)
        if score > best_pair[0]:
            best_pair = [ score, max_feat, max_depth ]
        tmp.append( score ) 
    res.append(tmp)
    print(tmp) 
time_2 = time.time() 
print( f"It takes {time_2 - time_1}s to complete cross validations." )

print( best_pair )

In [None]:
def multi_line_plot( xs, ys, ns, cs, y_range, names=("", "") ): 
    # xs: x values, ys: one list of scores per line, ns: line names, cs: line colors
    for i in range(len(ns)):
        plt.plot( xs, ys[i], color=cs[i] ) 
    # max_depth=None (unlimited depth) has no numeric position, so bound the x-axis by the numeric depths only
    numeric_xs = [ x for x in xs if x is not None ]
    plt.xlim( (numeric_xs[0], numeric_xs[-1]) )
    plt.ylim( y_range )
    plt.xlabel( names[0] )
    plt.ylabel( "Accuracy" )
    plt.title( names[1] )
    plt.legend( ns, loc="lower right", handlelength=1.5, fontsize=12 )
    plt.show() 

ns, cs = [ 'all feats', 'sqrt feats', 'log2 feats' ], [ "r","g", 'b' ]
multi_line_plot( max_depths, res, ns, cs, [0.75,0.85], names=[ "max_depth", "1st Random Forest (max_depth + max_features)" ] ) 

# 'sqrt' and 'log2' max_features perform better than using all features 
# best max_depth should be around 6

In [None]:
# "cv" is the number of folds k in k-fold cross-validation
n_times = 4      # larger value, longer runtime    
# Repeat the 5-fold cross-validation n_times and average the scores to smooth out the randomness of forest training
max_feats = [ 'sqrt', 'log2' ]
max_depths = [ 3,4,5,6,7,8, 20,25,30 ] 
max_depths.append( None )
best_pair = [ 0, None, None ]  
res = []

time_1 = time.time()
for max_feat in max_feats:
    print( max_feat )
    tmp = []
    for max_depth in max_depths: 
        score = 0
        for i in range(n_times):
            rf = RandomForestClassifier( max_features=max_feat, max_depth=max_depth, criterion='gini', n_estimators=280 ) 
            score += np.mean( cross_val_score( rf, X, y, cv=5 ) )
        score /= n_times
        # print(score)
        if score > best_pair[0]:
            best_pair = [ score, max_feat, max_depth ]
        tmp.append( score ) 
    res.append(tmp)
    print(tmp) 
time_2 = time.time() 
print( f"It takes {time_2 - time_1}s to complete cross validations." )

print( best_pair )

In [None]:
def multi_line_plot( xs, ys, ns, cs, y_range, names=("", "") ): 
    # xs: x values, ys: one list of scores per line, ns: line names, cs: line colors
    for i in range(len(ns)):
        plt.plot( xs, ys[i], color=cs[i] ) 
    # max_depth=None (unlimited depth) has no numeric position, so bound the x-axis by the numeric depths only
    numeric_xs = [ x for x in xs if x is not None ]
    plt.xlim( (numeric_xs[0], numeric_xs[-1]) )
    plt.ylim( y_range )
    plt.xlabel( names[0] )
    plt.ylabel( "Accuracy" )
    plt.title( names[1] )
    plt.legend( ns, loc="lower right", handlelength=1.5, fontsize=12 )
    plt.show() 

ns, cs = [ 'sqrt feats', 'log2 feats' ], [ "g", 'b' ]
multi_line_plot( max_depths, res, ns, cs, [0.78,0.82], names=[ "max_depth", "2nd Random Forest (max_depth + max_features)" ] ) 

# sqrt and log2 max_features perform similarly (log2 may be chosen; note that with 13 features
# both settings appear to resolve to 3 features per split, so the computation cost is about the same) 
# best max_depth should be around 6
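As a possible follow-up, the selected values (criterion='gini', n_estimators=280, max_features='log2', max_depth=6) could be checked once more with out-of-fold predictions and a confusion matrix. The sketch below is hypothetical: it runs on synthetic stand-in data (`X_demo`, `y_demo`), uses `cross_val_predict` instead of the loops above, and its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption): 13 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=0
)

# Hyper-parameters selected in sections A and B.
final_rf = RandomForestClassifier(
    n_estimators=280, criterion="gini", max_features="log2", max_depth=6, random_state=0
)

# Each sample is predicted by a model that never saw it during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(final_rf, X_demo, y_demo, cv=cv)

cm = confusion_matrix(y_demo, y_pred)   # rows: true class, columns: predicted class
print(cm)
print(f"out-of-fold accuracy = {accuracy_score(y_demo, y_pred):.3f}")
```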