1. Write simple (straightforward) definitions for the following parameters of RandomForestClassifier
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
and indicate how they correlate with the precision and recall of the basic
diabetes model we built in class. You will need to rerun the model multiple times to do
so.

| Parameter | Definition | Correlation with Precision | Correlation with Recall |
| :-:  | :- | :-  | :- |
| n_estimators | The number of decision trees in the forest. | Highest at 500 (0.71). Weak positive correlation; values stay within 0.66 to 0.71. | Highest (0.58) first reached at 200. Weak positive correlation; values stay within 0.53 to 0.58. |
| max_depth | The maximum depth of each tree. | Highest at 1 (0.88). Negative correlation. | Highest at 10 (0.58). Positive correlation up to a depth of 10, then flat at 0.54 from a depth of 20. |
| min_samples_split | The minimum number of samples required to split an internal node. | Highest at 75 (0.72). Weak positive correlation; values stay within 0.65 to 0.72. | Highest at 4 (0.58). Negative correlation. |
| min_samples_leaf | The minimum number of samples required to be at a leaf node. | Perfect score (1.0) at 125. Strong positive correlation. | Highest at 1 (0.54). Strong negative correlation. |
| min_weight_fraction_leaf | The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. | Perfect score (1.0) at 0.35. Strong positive correlation. | Highest at 0 (0.54). Strong negative correlation. |
| max_leaf_nodes | The maximum number of leaf nodes per tree; each tree is grown in best-first fashion until this limit is reached. | Highest at 50 (0.73). Weak positive correlation; values stay within 0.67 to 0.73. | Highest at 100 (0.58). Positive correlation up to 100, then flat at 0.57. |
| min_impurity_decrease | A node will be split if the split induces a decrease of the impurity greater than or equal to this value. | Highest at 0.0005 (0.71). Weak positive correlation; values stay within 0.65 to 0.71. | Highest at 0.0005 (0.57), as with precision. Mostly within 0.52 to 0.57, but falls to 0.46 at 0.01. |
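
Several rows in the table pair a very high precision with a very low recall. A minimal sketch of why that happens, using made-up labels (the `precision_recall` helper and both prediction lists are hypothetical and are not taken from the diabetes data): a model that predicts the positive class only when it is very sure makes few false positives but misses most true positives.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical test set: five positives followed by five negatives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# A heavily constrained model: one positive prediction, and it is correct.
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A less constrained model: six positive predictions, four of them correct.
liberal = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

print("cautious:", precision_recall(y_true, cautious))
print("liberal: ", precision_recall(y_true, liberal))
```

The cautious predictions get perfect precision because the only positive call is right, yet recall is low because four of the five positives are missed. This mirrors the pattern seen when min_samples_leaf or min_weight_fraction_leaf is set very high.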

In [37]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, recall_score

In [2]:
diabetes = pd.read_csv('diabetes.csv')
diabetes.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [3]:
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
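# Reuse the mean and variance learned from the training set;
# refitting the scaler on the test set would scale it differently.
X_test = sc.transform(X_test)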

In [45]:
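def run_model(rf, parameter):
    """Fit rf, then print its precision and recall on the test set."""
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    precision = round(precision_score(y_test, y_pred), 2)
    recall = round(recall_score(y_test, y_pred), 2)
    if isinstance(parameter, bool):
        # Bootstrap comparison: label the row with the flag and print the full report.
        print(parameter, "|", 'Precision:', precision, "| Recall:", recall)
        print(classification_report(y_test, y_pred))
    else:
        # Parameter sweep: read the swept value from the estimator itself
        # rather than relying on the global loop variable i.
        print(parameter, rf.get_params()[parameter], "|",
              'Precision:', precision, "| Recall:", recall)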

In [5]:
estimators = [10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,1500,2000]
for i in estimators:
    rf = RandomForestClassifier(n_estimators=i, random_state=42)
    run_model(rf,"n_estimators")

n_estimators 10 | Precision: 0.68 | Recall: 0.53
n_estimators 50 | Precision: 0.66 | Recall: 0.56
n_estimators 100 | Precision: 0.68 | Recall: 0.54
n_estimators 200 | Precision: 0.7 | Recall: 0.58
n_estimators 300 | Precision: 0.68 | Recall: 0.56
n_estimators 400 | Precision: 0.68 | Recall: 0.54
n_estimators 500 | Precision: 0.71 | Recall: 0.58
n_estimators 600 | Precision: 0.7 | Recall: 0.57
n_estimators 700 | Precision: 0.7 | Recall: 0.57
n_estimators 800 | Precision: 0.7 | Recall: 0.57
n_estimators 900 | Precision: 0.7 | Recall: 0.58
n_estimators 1000 | Precision: 0.7 | Recall: 0.58
n_estimators 1500 | Precision: 0.7 | Recall: 0.58
n_estimators 2000 | Precision: 0.69 | Recall: 0.58


In [6]:
depth = [1, 5, 10, 15, 20, 25, 50, 75, 100]
for i in depth:
    rf = RandomForestClassifier(max_depth=i, random_state=42)
    run_model(rf,"max_depth")

max_depth 1 | Precision: 0.88 | Recall: 0.19
max_depth 5 | Precision: 0.69 | Recall: 0.49
max_depth 10 | Precision: 0.69 | Recall: 0.58
max_depth 15 | Precision: 0.69 | Recall: 0.57
max_depth 20 | Precision: 0.68 | Recall: 0.54
max_depth 25 | Precision: 0.68 | Recall: 0.54
max_depth 50 | Precision: 0.68 | Recall: 0.54
max_depth 75 | Precision: 0.68 | Recall: 0.54
max_depth 100 | Precision: 0.68 | Recall: 0.54


In [7]:
min_samples = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25,50,75,100]
for i in min_samples:
    rf = RandomForestClassifier(min_samples_split=i, random_state=42)
    run_model(rf,"min_samples_split")

min_samples_split 2 | Precision: 0.68 | Recall: 0.54
min_samples_split 3 | Precision: 0.7 | Recall: 0.56
min_samples_split 4 | Precision: 0.68 | Recall: 0.58
min_samples_split 5 | Precision: 0.68 | Recall: 0.57
min_samples_split 6 | Precision: 0.7 | Recall: 0.57
min_samples_split 7 | Precision: 0.69 | Recall: 0.57
min_samples_split 8 | Precision: 0.68 | Recall: 0.53
min_samples_split 9 | Precision: 0.65 | Recall: 0.48
min_samples_split 10 | Precision: 0.68 | Recall: 0.54
min_samples_split 15 | Precision: 0.67 | Recall: 0.56
min_samples_split 25 | Precision: 0.68 | Recall: 0.56
min_samples_split 50 | Precision: 0.66 | Recall: 0.51
min_samples_split 75 | Precision: 0.72 | Recall: 0.52
min_samples_split 100 | Precision: 0.66 | Recall: 0.46


In [8]:
min_leaf = [1, 5, 10, 20, 30, 40, 50, 75, 100, 110, 125]
for i in min_leaf:
    rf = RandomForestClassifier(min_samples_leaf=i, random_state=42)
    run_model(rf,"min_samples_leaf")

min_samples_leaf 1 | Precision: 0.68 | Recall: 0.54
min_samples_leaf 5 | Precision: 0.7 | Recall: 0.54
min_samples_leaf 10 | Precision: 0.68 | Recall: 0.53
min_samples_leaf 20 | Precision: 0.69 | Recall: 0.49
min_samples_leaf 30 | Precision: 0.7 | Recall: 0.48
min_samples_leaf 40 | Precision: 0.7 | Recall: 0.46
min_samples_leaf 50 | Precision: 0.73 | Recall: 0.43
min_samples_leaf 75 | Precision: 0.73 | Recall: 0.43
min_samples_leaf 100 | Precision: 0.76 | Recall: 0.35
min_samples_leaf 110 | Precision: 0.8 | Recall: 0.2
min_samples_leaf 125 | Precision: 1.0 | Recall: 0.07
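
The "strong positive" and "strong negative" labels for min_samples_leaf can be checked numerically. A minimal sketch, using the values printed in the output above and a hand-written Spearman rank correlation (the helper names `average_ranks`, `pearson`, and `spearman` are illustrative and are not part of scikit-learn):

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Values copied from the min_samples_leaf output above.
leaf = [1, 5, 10, 20, 30, 40, 50, 75, 100, 110, 125]
precision = [0.68, 0.70, 0.68, 0.69, 0.70, 0.70, 0.73, 0.73, 0.76, 0.80, 1.0]
recall = [0.54, 0.54, 0.53, 0.49, 0.48, 0.46, 0.43, 0.43, 0.35, 0.20, 0.07]

rho_precision = spearman(leaf, precision)
rho_recall = spearman(leaf, recall)
print("precision:", round(rho_precision, 2), "| recall:", round(rho_recall, 2))
```

A coefficient near +1 or -1 supports a "strong" label, while one near 0 suggests the metric mostly stayed in the same range. The same check could be applied to the other sweeps, with the caveat that each sweep has only a handful of points from a single train/test split.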


In [12]:
min_weight = [0, 0.1, 0.2, 0.25, 0.30,0.35]
for i in min_weight:
    rf = RandomForestClassifier(min_weight_fraction_leaf=i, random_state=42)
    run_model(rf,"min_weight_fraction_leaf")

min_weight_fraction_leaf 0 | Precision: 0.68 | Recall: 0.54
min_weight_fraction_leaf 0.1 | Precision: 0.7 | Recall: 0.47
min_weight_fraction_leaf 0.2 | Precision: 0.7 | Recall: 0.47
min_weight_fraction_leaf 0.25 | Precision: 0.72 | Recall: 0.38
min_weight_fraction_leaf 0.3 | Precision: 0.76 | Recall: 0.35
min_weight_fraction_leaf 0.35 | Precision: 1.0 | Recall: 0.09


In [13]:
max_leaf = [None, 5, 10, 50, 100, 200, 300, 400, 500, 1000]
for i in max_leaf:
    rf = RandomForestClassifier(max_leaf_nodes=i, random_state=42)
    run_model(rf,"max_leaf_nodes")

max_leaf_nodes None | Precision: 0.68 | Recall: 0.54
max_leaf_nodes 5 | Precision: 0.72 | Recall: 0.44
max_leaf_nodes 10 | Precision: 0.67 | Recall: 0.49
max_leaf_nodes 50 | Precision: 0.73 | Recall: 0.57
max_leaf_nodes 100 | Precision: 0.7 | Recall: 0.58
max_leaf_nodes 200 | Precision: 0.7 | Recall: 0.57
max_leaf_nodes 300 | Precision: 0.7 | Recall: 0.57
max_leaf_nodes 400 | Precision: 0.7 | Recall: 0.57
max_leaf_nodes 500 | Precision: 0.7 | Recall: 0.57
max_leaf_nodes 1000 | Precision: 0.7 | Recall: 0.57


In [11]:
min_decrease = [.01, .005, .001, .0005, .0001, .00005, .00001]
for i in min_decrease:
    rf = RandomForestClassifier(min_impurity_decrease=i, random_state=42)
    run_model(rf,"min_impurity_decrease")

min_impurity_decrease 0.01 | Precision: 0.7 | Recall: 0.46
min_impurity_decrease 0.005 | Precision: 0.7 | Recall: 0.52
min_impurity_decrease 0.001 | Precision: 0.67 | Recall: 0.56
min_impurity_decrease 0.0005 | Precision: 0.71 | Recall: 0.57
min_impurity_decrease 0.0001 | Precision: 0.65 | Recall: 0.52
min_impurity_decrease 5e-05 | Precision: 0.68 | Recall: 0.54
min_impurity_decrease 1e-05 | Precision: 0.68 | Recall: 0.54


2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.

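In this example, the model performs slightly better with bootstrap=False: recall for the positive class rises from 0.54 to 0.58, while precision stays at 0.68. When bootstrap is False, the whole training set is used to build each decision tree, so no sampling takes place and the trees differ only through the random choice of features considered at each split.

When it is True, each tree is trained on a sample drawn with replacement from the training set, so some rows appear more than once and others are left out of that tree entirely. On a dataset this small, the rows left out may include positive cases the tree needed to see, which could explain the lower recall. The gap is small (about three of the 81 positive test cases) and comes from a single train/test split, so it may also be noise.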
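
How much of the training set a single bootstrapped tree actually sees can be estimated with a short simulation. This is a sketch under stated assumptions: `n_rows = 537` is an assumed training-set size inferred from the 231 test rows at test_size=0.3, and `unique_fraction` is a hypothetical helper, not a scikit-learn function. It draws as many rows as the training set contains, with replacement, and counts the distinct ones.

```python
import random


def unique_fraction(n_rows, seed=0):
    """Draw n_rows indices with replacement; return the share that is distinct."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n_rows) for _ in range(n_rows)]
    return len(set(drawn)) / n_rows


n_rows = 537  # assumed training-set size, not stated in the notebook

simulated = unique_fraction(n_rows, seed=42)

# Probability that a given row is drawn at least once in n_rows draws.
expected = 1 - (1 - 1 / n_rows) ** n_rows

print("simulated share of distinct rows:", round(simulated, 3))
print("expected share of distinct rows: ", round(expected, 3))
```

The expected share approaches 1 - 1/e, a little under two thirds, so with bootstrap=True each tree typically trains without roughly a third of the rows, whereas with bootstrap=False every tree sees all of them.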

In [50]:
rf = RandomForestClassifier(random_state=42)
run_model(rf,True)

True | Precision: 0.68 | Recall: 0.54
              precision    recall  f1-score   support

           0       0.78      0.86      0.82       150
           1       0.68      0.54      0.60        81

    accuracy                           0.75       231
   macro avg       0.73      0.70      0.71       231
weighted avg       0.74      0.75      0.74       231



In [51]:
rf = RandomForestClassifier(bootstrap=False, random_state=42)
run_model(rf,False)

False | Precision: 0.68 | Recall: 0.58
              precision    recall  f1-score   support

           0       0.79      0.85      0.82       150
           1       0.68      0.58      0.63        81

    accuracy                           0.76       231
   macro avg       0.74      0.72      0.72       231
weighted avg       0.75      0.76      0.75       231

