# In class exercise
### 01/19/2022

**1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do
so.**

The table below is a super basic summary. More details follow in the code.

Parameter|Definition |Correlation with Precision |Correlation with Recall
:-----|:-----|:-----|:----- 
estimators|Number of decision trees|Back and forth by 1|Back and forth by 1
max_depth|Maximum levels below the root node of your decision trees |Negative|Positive
min_samples_split|Minimum samples to take before a node can split|Slight positive|slight positive
min_samples_leaf|Guarantees minimum number of samples in a leaf node|Positive|Negative
min_weight_fraction_leaf|fraction of input samples required to be at a leaf node|Positive|Negative
max_leaf_nodes|The maximum number of leaf nodes in your trees|Negative|Positive
min_impurity_decrease|The node will be split if doing so will decrease the impurity|Positive|Negative

Set up the training/testing data from class

In [5]:
# Import modules
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# Load in data frame
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

Import the RandomForestClassifier from class. Then, fit the model to a variety of parameters

**n_estimators: Number of decision trees**

I tried estimators of 50, 100, 200, 300, 500, and 700. Surprisingly, the precision/recall barely changed at all. It wavered only 1 or 2 points for both. The best combination was at 500, which is displayed below.

In [70]:
# Fit the classifier
est_rf = est_rf.fit(X_train, y_train)

predictions = est_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.87      0.83       150
           1       0.71      0.58      0.64        81

    accuracy                           0.77       231
   macro avg       0.75      0.73      0.74       231
weighted avg       0.77      0.77      0.76       231



**Max_Depth: Maximum levels below the root node of your decision trees**

I tested 1-6, 10, 15, and 20. The lower the number, the better the precision. The higher the number, the better the recall (although, this changed once the depth got larger than 10). 10 was the overall best result.

In [28]:
md_rf = RandomForestClassifier(max_depth=10, random_state=42)

# Fit the classifier
md_rf = md_rf.fit(X_train, y_train)

predictions = md_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.86      0.82       150
           1       0.69      0.58      0.63        81

    accuracy                           0.76       231
   macro avg       0.74      0.72      0.73       231
weighted avg       0.76      0.76      0.76       231



**Min_samples_split: Minimum samples to take before a node can split**

Decision and recall increase/decrease by 1 or 2 points each. As I increased the min_samples, precision increased modestly. Recall increased by 1 with each increment until 4, which is where it peaked. The best overall value was 4 (below).

In [54]:
mss_rf = RandomForestClassifier(min_samples_split=4, random_state=42)

# Fit the classifier
mss_rf = mss_rf.fit(X_train, y_train)

predictions = mss_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       150
           1       0.68      0.58      0.63        81

    accuracy                           0.76       231
   macro avg       0.74      0.72      0.72       231
weighted avg       0.75      0.76      0.75       231



**Min_samples_leaf: Guarantees minimum number of samples in a leaf node**

Again, the values waver only slightly for all combinations. I tested 1-4, and then 10, 20, 50, 100, and 125. 125 gave me a precision of 100 and a recall of 7! In general, as I increased this parameter, the precision went up and recall went down (roughly). The best overall value I found was for min_samples_leaf=2, which is below.

In [66]:
msl_rf = RandomForestClassifier(min_samples_leaf=2, random_state=42)

# Fit the classifier
msl_rf = msl_rf.fit(X_train, y_train)

predictions = msl_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.78      0.87      0.83       150
           1       0.70      0.56      0.62        81

    accuracy                           0.76       231
   macro avg       0.74      0.71      0.72       231
weighted avg       0.76      0.76      0.75       231



**Min_weight_fraction_leaf: fraction of input samples required to be at a leaf node**

In general, precision increased and recall decreased as I increased this parameter. The default of 0 had the best recall (displayed below). Between 0 and 0.1, recall decreased from 54 to 47 and then had more modest decreases afterwards. However, precision increased by about 2 points per parameter change. The best overall was the default value of 0 (displayed below).  

In [109]:
mwf_rf = RandomForestClassifier(min_weight_fraction_leaf=0, random_state=42)

# Fit the classifier
mwf_rf = mwf_rf.fit(X_train, y_train)

predictions = mwf_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       150
           1       0.68      0.54      0.60        81

    accuracy                           0.75       231
   macro avg       0.73      0.70      0.71       231
weighted avg       0.74      0.75      0.74       231



**Max_leaf_nodes: The maximum number of leaf nodes in your trees**

Overall, the predicsion would decrease by 1 or 2 points as I increased the parameter. However, recall would increase by a few points. Despite this relationship, the best value was actually when I set the parameter to "None" (below). 

In [89]:
mln_rf = RandomForestClassifier(max_leaf_nodes=None, random_state=42)

# Fit the classifier
mln_rf = mln_rf.fit(X_train, y_train)

predictions = mln_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       150
           1       0.68      0.54      0.60        81

    accuracy                           0.75       231
   macro avg       0.73      0.70      0.71       231
weighted avg       0.74      0.75      0.74       231



**min_impurity_decrease: The node will be split if doing so will decrease the impurity**

This parameter worked best when I tested it in increments of 0.001. As I increased the parameter, the precision increased by 1-2 points, while the recall lowered by 1-2 points. The largest extreme was a value of 0.04, which resulted in a precision of 80 and a recall of 25. The best overall was a weight of 0.003, which is below.

In [99]:
mid_rf = RandomForestClassifier(min_impurity_decrease=0.003, random_state=42)

# Fit the classifier
mid_rf = mid_rf.fit(X_train, y_train)

predictions = mid_rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.87      0.83       150
           1       0.71      0.57      0.63        81

    accuracy                           0.77       231
   macro avg       0.75      0.72      0.73       231
weighted avg       0.76      0.77      0.76       231



**2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.**

Using the in-class example as a reference...you can see that when boot strap is set to true, the model is ever so slightly worse. When set to "false," the precision increases by 1 and the recall increases by 1 (however, the overall  accuracy stays the same). This is likely because instead of pooling different samples in our data set, the model is using the entire data set for each tree. This would mean each tree has more information to work with. Practically speaking though, setting bootstrap to "false," is less computationally efficient than if we were to keep it as true.

In [69]:
#estimator = number of decision trees
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=42)

# Fit the classifier
rf = rf.fit(X_train, y_train)
# Get the accuracy score
rf.score(X_test, y_test)

predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.87      0.83       150
           1       0.70      0.58      0.64        81

    accuracy                           0.77       231
   macro avg       0.75      0.72      0.73       231
weighted avg       0.76      0.77      0.76       231



In [68]:
#estimator = number of decision trees
rf = RandomForestClassifier(n_estimators=200, bootstrap=False, random_state=42)

# Fit the classifier
rf = rf.fit(X_train, y_train)
# Get the accuracy score
rf.score(X_test, y_test)

predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.80      0.87      0.83       150
           1       0.71      0.59      0.64        81

    accuracy                           0.77       231
   macro avg       0.75      0.73      0.74       231
weighted avg       0.77      0.77      0.77       231

