1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so.

n_estimators,
max_depth,
min_samples_split,
min_samples_leaf,
min_weight_fraction_leaf,
max_leaf_nodes,
min_impurity_decrease,
min_impurity_split

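n_estimators sets the number of estimators, i.e. the individual models in the ensemble - in this case, the individual decision trees (default = 100).
Reducing the number of estimators tends to reduce the precision and recall of the model, because fewer trees are averaged and the forest's vote becomes noisier.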
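
One way to check this trend is to refit the forest with a few values of n_estimators and compare the macro-averaged scores. The sketch below is illustrative only: it assumes a synthetic dataset from make_classification as a stand-in for diabetes.csv, so its numbers will not match the class model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for diabetes.csv (assumption): 768 rows, 8 features.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

results = {}
for n in (1, 10, 50, 200):
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    # Macro averaging gives both classes equal weight, as in the "macro avg" report row.
    results[n] = (
        precision_score(y_test, pred, average="macro", zero_division=0),
        recall_score(y_test, pred, average="macro", zero_division=0),
    )
    print(f"n_estimators={n:>3}  macro precision={results[n][0]:.2f}  "
          f"macro recall={results[n][1]:.2f}")
```

The same loop can be pointed at any other parameter by changing the keyword passed to RandomForestClassifier.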

max_depth caps the number of levels ("depth") each tree is permitted to have; the default of None lets a tree grow until its leaves are pure or too small to split.
The shallower the trees, the lower the precision and recall.  (Setting max_depth=3 drops our model's macro avg recall to 0.66.)
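
To confirm that the cap is applied to every tree, the fitted trees can be inspected through the forest's estimators_ list. This is a small sketch on synthetic stand-in data (an assumption, not the class dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)

def max_tree_depth(forest):
    """Return the depth of the deepest tree in a fitted forest."""
    return max(est.get_depth() for est in forest.estimators_)

capped = RandomForestClassifier(n_estimators=25, max_depth=3,
                                random_state=42).fit(X, y)
uncapped = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

capped_depth = max_tree_depth(capped)
uncapped_depth = max_tree_depth(uncapped)
print("deepest tree with max_depth=3:   ", capped_depth)
print("deepest tree with max_depth=None:", uncapped_depth)
```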

min_samples_split is the minimum number of samples an internal node must hold before it is allowed to split (i.e. have children branching off of it); its default value is 2.
Increasing this number from the default follows an inverted-U trend: precision and recall improve up to a point (for this model, around min_samples_split = 11), after which larger values reduce both.

min_samples_leaf is the counterpart of min_samples_split: a split is only allowed if it leaves at least this many samples in each of the two child nodes, so every leaf ends up holding at least min_samples_leaf samples (default = 1).
As with min_samples_split, raising this number improves precision and recall up to a point, after which both start to fall off.
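
The two limits can be checked against the node sizes stored in each fitted tree. The sketch below uses synthetic stand-in data and the hypothetical settings min_samples_split=11 and min_samples_leaf=4; tree_.n_node_samples and tree_.children_left are the low-level arrays scikit-learn exposes on a fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=25, min_samples_split=11,
                            min_samples_leaf=4, random_state=42).fit(X, y)

leaf_sizes, split_sizes = [], []
for est in rf.estimators_:
    t = est.tree_
    is_leaf = t.children_left == -1  # -1 marks a node with no children
    leaf_sizes.append(int(t.n_node_samples[is_leaf].min()))
    split_sizes.append(int(t.n_node_samples[~is_leaf].min()))

smallest_leaf = min(leaf_sizes)    # smallest leaf anywhere in the forest
smallest_split = min(split_sizes)  # smallest node that was allowed to split
print("smallest leaf:", smallest_leaf)
print("smallest split node:", smallest_split)
```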

min_weight_fraction_leaf: similar to min_samples_leaf, but expressed as the minimum fraction of the total sample weight that a leaf node must hold. By default (no sample_weight supplied), all samples are weighted equally.  The value can't go higher than 0.5, which makes sense: a split creates two children, and both can't hold more than half of the total weight.
The higher this number, the more generalized (coarser) your model and its outputs will be.  Overall, a smaller min_weight_fraction_leaf seems to yield better precision and recall, with a few exceptions - e.g. with min_weight_fraction_leaf set to 0.3, macro avg precision reaches 0.78. When you make this number really small (e.g. 0.0001, 0.00001), the scores level off; for this model, at macro averages of about 0.75 (precision) and 0.73 (recall).
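
Because every leaf must hold at least the given fraction of the total weight, a tree can have at most 1 / min_weight_fraction_leaf leaves, which is why large values generalize so heavily. A minimal sketch on synthetic stand-in data (an assumption) that counts the leaves:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)

max_leaves = {}
for frac in (0.3, 0.1, 0.01):
    rf = RandomForestClassifier(n_estimators=25, min_weight_fraction_leaf=frac,
                                random_state=42).fit(X, y)
    # Largest leaf count among the forest's trees for this setting.
    max_leaves[frac] = max(est.get_n_leaves() for est in rf.estimators_)
    print(f"min_weight_fraction_leaf={frac}: at most {max_leaves[frac]} leaves per tree")
```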

max_leaf_nodes: caps the number of leaves (terminal nodes) on each tree. Trees are grown best-first, so the splits that reduce impurity the most are made first.  Reducing this number (I started at 10000) has little effect on precision and recall until it falls below a certain threshold (for this model, around 100), after which both begin to diminish.
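
The cap only matters once it falls below the number of leaves a tree would otherwise grow, which is one way to read the threshold effect. The sketch below (synthetic stand-in data, hypothetical caps of 4 and 16) compares capped trees with unrestricted ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)

leaf_counts = {}
for cap in (4, 16, None):  # None is the default: no limit on leaves
    rf = RandomForestClassifier(n_estimators=25, max_leaf_nodes=cap,
                                random_state=42).fit(X, y)
    leaf_counts[cap] = max(est.get_n_leaves() for est in rf.estimators_)
    print(f"max_leaf_nodes={cap}: largest tree has {leaf_counts[cap]} leaves")
```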

min_impurity_decrease: a node is split only if the split induces a weighted decrease in impurity greater than or equal to this value (default = 0.0).  I see it as a way of making the forest more efficient: if splitting a node won't meaningfully improve purity, does it make sense to expend the computing power?  The smaller this number, the better both precision and recall.
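
The scikit-learn documentation defines the weighted impurity decrease as N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), where N is the total number of samples and N_t the number at the node being split. As a worked example with made-up class counts (an assumption for illustration) and Gini impurity:

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_impurity_decrease(n_total, parent, left, right):
    """Impurity decrease of one split, weighted by the node's share of all samples."""
    n_t, n_l, n_r = sum(parent), sum(left), sum(right)
    return (n_t / n_total) * (
        gini(parent) - (n_r / n_t) * gini(right) - (n_l / n_t) * gini(left)
    )

# Hypothetical node: 10 of 100 training samples, split 5/5 between the classes.
decrease = weighted_impurity_decrease(100, parent=(5, 5), left=(4, 0), right=(1, 5))
print(round(decrease, 4))  # 0.0333

# The split is kept at a threshold of 0.0001 but rejected at 0.05.
print(decrease >= 0.0001, decrease >= 0.05)  # True False
```

Because the decrease is scaled by N_t / N, splits deep in the tree (small N_t) are the first to be rejected as the threshold grows.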

min_impurity_split: another way of capping the number of nodes/leaves on a tree.  A node will split if its impurity is above the threshold; otherwise it is a leaf.  However, the documentation says that this parameter has been deprecated since version 0.19 (it was removed entirely in version 1.0), and that min_impurity_decrease should be used instead.

In [117]:
import pandas as pd
# Only classification_report is used below; plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import classification_report

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
         0            6      148             72             35        0  33.6                     0.627   50        1
         1            1       85             66             29        0  26.6                     0.351   31        0
         2            8      183             64              0        0  23.3                     0.672   32        1
         3            1       89             66             23       94  28.1                     0.167   21        0
         4            0      137             40             35      168  43.1                     2.288   33        1


In [118]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

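# Standardize: fit the scaler on the training set only, then reuse it on the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# The recorded outputs below were produced with fit_transform here, so a rerun may differ slightly
X_test = sc.transform(X_test)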

In [169]:
from sklearn.ensemble import RandomForestClassifier
# Each estimator is one decision tree, so n_estimators=200 builds a forest of 200 trees
rf = RandomForestClassifier(n_estimators=200, min_impurity_decrease=0.0001, random_state=42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7727272727272727

In [170]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.80      0.87      0.83       100
           1       0.71      0.59      0.65        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.74       154
weighted avg       0.77      0.77      0.77       154



2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.

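According to the documentation, this parameter indicates whether bootstrap samples are used when building the forest's trees.  With the default bootstrap=True, each tree is trained on a sample of the training rows drawn with replacement; when it is set to False, the whole training set is used to build each tree, and the trees differ only through the random subset of features considered at each split.  Using every row for every tree can cost somewhat more computation, and it makes the trees more alike, which can reduce the benefit of averaging them.  In the recorded runs below, though, nothing changes: accuracy is 0.766 both times, the classification reports match digit for digit, and the fit times (459 ms vs. 440 ms wall time) differ by no more than run-to-run noise.  Results this identical indicate that the setting never took effect.  The second run was made with the string 'False' rather than the Boolean False; a non-empty string is truthy in Python, so bootstrapping stayed switched on, and the model needs to be rerun with bootstrap=False to measure the real difference.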

In [171]:
%%time
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

CPU times: user 424 ms, sys: 16.2 ms, total: 440 ms
Wall time: 459 ms


0.7662337662337663

In [173]:
%%time
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       100
           1       0.70      0.59      0.64        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.73       154
weighted avg       0.76      0.77      0.76       154

CPU times: user 51.2 ms, sys: 3.59 ms, total: 54.8 ms
Wall time: 59.4 ms


In [174]:
%%time
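# bootstrap must be the Boolean False. The outputs recorded below came from a run that passed
# the string 'False', which is truthy, so they are identical to the bootstrap=True results above.
rf = RandomForestClassifier(n_estimators=200, bootstrap=False, random_state=42)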
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

CPU times: user 426 ms, sys: 8.24 ms, total: 434 ms
Wall time: 440 ms


0.7662337662337663

In [175]:
%%time
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       100
           1       0.70      0.59      0.64        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.73       154
weighted avg       0.76      0.77      0.76       154

CPU times: user 40 ms, sys: 3.24 ms, total: 43.2 ms
Wall time: 50.9 ms
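
A non-empty string such as 'False' is truthy in Python, so it is worth confirming which setting a forest actually used. One check is the number of distinct training rows that reach each tree's root: all of them when bootstrap=False, and only a portion when rows are drawn with replacement. This is a sketch on synthetic stand-in data (an assumption), so the scores are not those of the diabetes model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

print(bool("False"))  # True: any non-empty string is truthy, whatever it spells

# Synthetic stand-in for diabetes.csv (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

root_sizes, scores = {}, {}
for flag in (True, False):
    rf = RandomForestClassifier(n_estimators=50, bootstrap=flag, random_state=42)
    rf.fit(X_train, y_train)
    # n_node_samples[0] counts the distinct training rows seen at each tree's root.
    root_sizes[flag] = [int(est.tree_.n_node_samples[0]) for est in rf.estimators_]
    scores[flag] = rf.score(X_test, y_test)
    print(f"bootstrap={flag}: rows per tree {min(root_sizes[flag])}-{max(root_sizes[flag])} "
          f"of {len(X_train)}, accuracy {scores[flag]:.3f}")
```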
