# Lab | Random Forests

For this lab, you will be using the CSV files provided in the `files_for_lab` folder.

### Instructions

- Apply the Random Forest algorithm to predict the `TARGET_B`. Please note that this column suffers from class imbalance. Fix the class imbalance using upsampling.
- Discuss the model predictions and their impact in the business scenario. Is the cost of a false positive equal to the cost of a false negative? How much money will the company fail to earn because of misclassifications made by the model?
- Sklearn classification models are evaluated on accuracy by default (for example in `score` and in `GridSearchCV`). However, another error metric will be more relevant here. Which one? Please check out
[make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer) together with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) in order to tune the model to maximize the error metric of interest in this case.


In [1]:
import pandas as pd
import numpy as np

categorical = pd.read_csv('files_for_lab/categorical.csv')
numerical = pd.read_csv('files_for_lab/numerical.csv')
target = pd.read_csv('files_for_lab/target.csv')

In [2]:
display(categorical.head())
display(numerical.head())
display(target.head())

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B,...,DOB_YR,DOB_MM,MINRDATE_YR,MINRDATE_MM,MAXRDATE_YR,MAXRDATE_MM,LASTDATE_YR,LASTDATE_MM,FIRSTDATE_YR,FIRSTDATE_MM
0,IL,36,H,F,3,L,E,C,T,2,...,37,12,92,8,94,2,95,12,89,11
1,CA,14,H,M,3,L,G,A,S,1,...,52,2,93,10,95,12,95,12,93,10
2,NC,43,U,M,3,L,E,C,R,2,...,0,2,91,11,92,7,95,12,90,1
3,CA,44,U,F,3,L,E,C,R,2,...,28,1,87,11,94,11,95,12,87,2
4,FL,16,H,F,3,L,F,A,S,2,...,20,1,93,10,96,1,96,1,79,3


Unnamed: 0,TCODE,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,...,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2
0,0,60.0,5,9,0,0,39,34,18,10,...,14,5.0,12.0,10.0,4,7.741935,95515,0,4,39
1,1,46.0,6,9,16,0,15,55,11,6,...,1,10.0,25.0,25.0,18,15.666667,148535,0,2,1
2,1,61.611649,3,1,2,0,20,29,33,6,...,14,2.0,16.0,5.0,12,7.481481,15078,1,4,60
3,0,70.0,1,4,2,0,23,14,31,3,...,7,2.0,11.0,10.0,9,6.8125,172556,1,4,41
4,0,78.0,3,2,60,1,28,9,53,26,...,8,3.0,15.0,15.0,14,6.864865,7112,1,2,26


Unnamed: 0,TARGET_B,TARGET_D
0,0,0.0
1,0,0.0
2,0,0.0
3,0,0.0
4,0,0.0


In [3]:
display(categorical.shape)
display(numerical.shape)
display(target.shape)

(95412, 22)

(95412, 315)

(95412, 2)

In [5]:
target['TARGET_B'].value_counts()

TARGET_B
0    90569
1     4843
Name: count, dtype: int64

#### Dealing with Class Imbalance

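Choosing between RandomOverSampler and RandomUnderSampler:

RandomOverSampler:

- randomly duplicates minority-class rows until the classes are balanced. No data is discarded, but the training set grows and exact duplicates can encourage overfitting.

RandomUnderSampler:

- randomly discards majority-class rows until the classes are balanced. The dataset is very large (95,412 rows), so we would like to reduce the DataFrame anyway.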


In practice, the choice between RandomOverSampler and RandomUnderSampler depends on the specific characteristics of your dataset and the requirements of your machine learning task. It's often a good idea to experiment with both techniques and evaluate their impact on your model's performance using appropriate evaluation metrics. Additionally, you may also explore more advanced techniques like SMOTE (Synthetic Minority Over-sampling Technique) for oversampling or combinations of oversampling and undersampling to strike a balance between the two classes.
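
The lab instructions ask for upsampling, while the cells below use under-sampling. As a point of comparison, here is a minimal sketch of random over-sampling written with NumPy only. The helper `random_upsample` and the toy arrays are hypothetical and not part of the lab files; the sketch assumes a binary label where the minority class is the smaller one.

```python
import numpy as np

def random_upsample(X, y, minority_label=1, seed=0):
    """Randomly duplicate minority rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Sample minority rows with replacement to close the gap to the majority count.
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 95 negatives and 5 positives (roughly the 5% positive rate of TARGET_B).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

X_up, y_up = random_upsample(X_toy, y_toy)
print(np.bincount(y_up))  # both classes now have 95 rows
```

As with under-sampling, the resampling should be applied to the training split only, after `train_test_split`, so that duplicated rows cannot end up in both the train and the test set.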

In [7]:
# concat all dataframes
df = pd.concat([categorical, numerical, target], axis = 1)

Under-sampling only the train set:

- simulates deployment: the model is trained on a balanced dataset but has to generalize to imbalanced (test) data in the real world
- pro: more realistic scenario, because the model encounters imbalanced data during deployment
- con: the test set is still imbalanced, which affects evaluation metrics such as accuracy and precision

Under-sampling both the train and test set:
- both sets are under-sampled, so the test set has the same balanced class distribution as the resampled train set (it no longer reflects the original distribution)
- pro: train and test share one class distribution, which makes comparing model performance easier
- con: the test metrics no longer describe real-world performance, and the model might not fully capture the variability present in the original majority class during training (too aggressive under-sampling)
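
To see why the class balance of the test set matters, the sketch below computes the precision implied by a fixed classifier at two different shares of positives. The true and false positive rates are hypothetical numbers chosen for illustration, not results from this lab.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed true/false positive rates at a given share of positives."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: finds 80% of donors, wrongly flags 10% of non-donors.
tpr, fpr = 0.80, 0.10

balanced = precision_at_prevalence(tpr, fpr, prevalence=0.50)    # under-sampled test set
real_world = precision_at_prevalence(tpr, fpr, prevalence=0.05)  # roughly the TARGET_B rate

print(f"precision on a balanced test set:    {balanced:.3f}")
print(f"precision at 5% positive prevalence: {real_world:.3f}")
```

Recall depends only on the positives and is unaffected by the class balance, whereas precision drops sharply once the negatives dominate. This is one reason to keep the test set at its original distribution.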

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

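y = df['TARGET_B']
# NOTE: only TARGET_B is dropped here, so TARGET_D (the donation amount) and the
# 'Unnamed: 0' index columns stay in X. TARGET_D is tied directly to TARGET_B,
# so it leaks the label into the features.
X = df.drop(columns='TARGET_B')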

X_cat  = X.select_dtypes(exclude=np.number)

# We create a list in which every element is another list with the unique values of each categorical column,
# because the OneHotEncoder will need to know all the acceptable values.
unique_categ_values = [ list(X[col].unique()) for col in X_cat.columns ]
print(unique_categ_values)

# create the train and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X_train_num = X_train.select_dtypes(np.number)
X_test_num  = X_test.select_dtypes(np.number)

X_train_cat = X_train.select_dtypes(exclude=np.number)
X_test_cat  = X_test.select_dtypes(exclude=np.number)

# One-hot encode the categoricals: scikit-learn's RandomForest needs numeric input,
# and the same encoded dataset can be reused to perform a regression later.
# `categories` receives the unique values of each categorical column.
encoder = OneHotEncoder(drop='first', categories=unique_categ_values).fit(X_train_cat)

X_train_cat_encoded_np = encoder.transform(X_train_cat).toarray()
X_test_cat_encoded_np  = encoder.transform(X_test_cat).toarray()

X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded_np, columns=encoder.get_feature_names_out(), index=X_train.index)
X_test_cat_encoded_df  = pd.DataFrame(X_test_cat_encoded_np,  columns=encoder.get_feature_names_out(), index=X_test.index)

X_train = pd.concat([X_train_num, X_train_cat_encoded_df], axis = 1)
X_test  = pd.concat([X_test_num, X_test_cat_encoded_df], axis = 1)

[['IL', 'CA', 'NC', 'FL', 'other', 'IN', 'MI', 'MO', 'TX', 'WA', 'WI', 'GA'], ['H', 'U'], ['F', 'M', 'other'], ['L'], ['E', 'G', 'F', 'D'], ['C', 'A', 'D', 'B'], ['T', 'S', 'R', 'U', 'C']]
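
With `drop='first'`, each categorical column contributes one column fewer than it has categories. A quick check of how many encoded columns to expect, using the category lists printed above:

```python
# Category lists copied from the printed output above.
unique_categ_values = [
    ['IL', 'CA', 'NC', 'FL', 'other', 'IN', 'MI', 'MO', 'TX', 'WA', 'WI', 'GA'],
    ['H', 'U'],
    ['F', 'M', 'other'],
    ['L'],
    ['E', 'G', 'F', 'D'],
    ['C', 'A', 'D', 'B'],
    ['T', 'S', 'R', 'U', 'C'],
]

# drop='first' removes one indicator column per categorical feature.
n_encoded = sum(len(values) - 1 for values in unique_categ_values)
# A feature with a single category produces no column at all.
constant_features = [values for values in unique_categ_values if len(values) == 1]

print(n_encoded)          # 24
print(constant_features)  # [['L']]
```

The single-valued feature (`['L']`, presumably `RFA_2R` judging from the `head()` output) carries no information and disappears entirely after encoding.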


In [18]:
from imblearn.under_sampling import RandomUnderSampler

# resample the X_train and y_train data
rus = RandomUnderSampler(random_state=42)
# for now, rus is only applied on the train data
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

X_train_resampled.shape, y_train_resampled.shape

((7686, 355), (7686,))

In [19]:
y_train_resampled.value_counts()

TARGET_B
0    3843
1    3843
Name: count, dtype: int64

The training data is now balanced, so the RandomForestClassifier can be applied.
Next, we investigate which hyperparameters work best.


The hyperparameters used for the RandomForestClassifier are explained below in simple terms:

n_estimators:
- Definition: Number of trees in the forest.
- Explanation: This hyperparameter controls the number of decision trees that will be built in the Random Forest. A higher number of trees can lead to a more robust and stable model, but it also increases computation time.


max_depth:
- Definition: Maximum depth of the trees.
- Explanation: This hyperparameter determines the maximum depth or levels that an individual decision tree in the forest can grow. A tree with more levels can capture more complex patterns in the data, but it may also be prone to overfitting. Setting a maximum depth helps control the size of the trees.

min_samples_split:
- Definition: Minimum number of samples required to split an internal node.
- Explanation: This hyperparameter sets the minimum number of samples needed for a node in the tree to be split into child nodes. If a node has fewer samples than this threshold, it won't be split, which helps prevent the creation of small and potentially overfit trees.


min_samples_leaf:
- Definition: Minimum number of samples required to be at a leaf node.
- Explanation: This hyperparameter sets the minimum number of samples required for a leaf node. A leaf node is a terminal node in a decision tree. Setting this parameter helps control the size of the leaves and can prevent the creation of leaves with very few instances.

max_samples:
- Definition: Maximum number or proportion of samples used for tree induction.
- Explanation: This hyperparameter sets how many samples are randomly drawn from the training data to build each individual tree (it applies when `bootstrap=True`, the default). A float between 0 and 1 is interpreted as a proportion of the training rows; an integer is an absolute number of rows. Drawing fewer samples per tree adds randomness to the training process and can help prevent overfitting.

random_state:
- Definition: Seed for the random number generator.
- Explanation: This hyperparameter sets the seed for the random number generator. Setting a seed ensures reproducibility, meaning that if you run the same code with the same seed, you'll get the same results. It's useful for obtaining consistent results when you want to compare or reproduce experiments.

In [20]:
from sklearn.ensemble import RandomForestClassifier

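clf = RandomForestClassifier(n_estimators=100,      # number of trees in the forest (default=100)
                             max_depth=5,           # maximum number of tree levels; if None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples
                             min_samples_split=20,  # minimum number of samples needed to split an internal node
                             min_samples_leaf=20,   # minimum number of samples required in each leaf
                             max_samples=0.8,       # each tree is trained on a bootstrap sample of 80% of the rows
                             random_state=42)       # seed for reproducible results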

clf.fit(X_train_resampled, y_train_resampled)

print("The Accuracy for the Random Forest in the TRAIN set is {:.2f}".format(clf.score(X_train_resampled, y_train_resampled)))
print("The Accuracy for the Random Forest in the TEST  set is {:.2f}".format(clf.score(X_test, y_test)))

The Accuracy for the Random Forest in the TRAIN set is 1.00
The Accuracy for the Random Forest in the TEST  set is 1.00
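
An accuracy of 1.00 on both sets is a warning sign rather than a success. `X` was built by dropping only `TARGET_B`, so it still contains `TARGET_D` (the donation amount), which presumably is non-zero exactly when a person donated, plus the `Unnamed: 0` index columns from the CSV files. A minimal sketch of a leak-free feature selection on a toy frame (the values are made up for illustration and the variable names are hypothetical):

```python
import pandas as pd

# Toy frame mimicking the lab's column layout; the values are invented.
df_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "AGE": [60.0, 46.0, 61.6, 70.0],
    "TARGET_B": [0, 1, 0, 1],
    "TARGET_D": [0.0, 25.0, 0.0, 10.0],
})

y_toy = df_toy["TARGET_B"]

# Drop both targets and any leftover CSV index columns before modelling.
leak_columns = ["TARGET_B", "TARGET_D"] + [c for c in df_toy.columns if c.startswith("Unnamed")]
X_toy = df_toy.drop(columns=leak_columns)

# In this toy frame the donation amount alone reproduces the label exactly.
print(((df_toy["TARGET_D"] > 0).astype(int) == y_toy).all())
print(list(X_toy.columns))
```

If the same holds for the real data, all scores reported in this notebook mostly measure how well the forest recovers `TARGET_B` from `TARGET_D`, and they would need to be recomputed after removing that column.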


In [22]:
from sklearn.model_selection import cross_val_score

cross_val_scores = cross_val_score(clf, X_train_resampled, y_train_resampled, cv=10)
print("The mean Accuracy of the folds was {:.2f}".format(np.mean(cross_val_scores)))

The mean Accuracy of the folds was 0.99


In [27]:
cross_val_scores

array([0.98959688, 0.99479844, 0.99219766, 0.98049415, 0.98829649,
       0.98829649, 0.99609375, 0.98567708, 0.9921875 , 0.9921875 ])

In [None]:
from sklearn.metrics import make_scorer, recall_score

# Create a scorer for recall (sensitivity)
scorer = make_scorer(recall_score)

In [34]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

# Create a RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5],
    'min_samples_split': [10, 20],
    'min_samples_leaf': [5, 10],
    'max_samples': [0.8]
}

# Define recall as the scoring metric
recallscore = make_scorer(recall_score)  # the positive class defaults to label 1 (pos_label=1)

# Create a GridSearchCV object with recall as the scoring metric
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, scoring=recallscore, cv=5)

# Fit the model using GridSearchCV
grid_search.fit(X_train_resampled, y_train_resampled)

# Get the best parameters and the model refit with them
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Print the results
print("Best Parameters:", best_params)
print("Best Model:", best_model)

Best Parameters: {'max_depth': 5, 'max_samples': 0.8, 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 100}
Best Model: RandomForestClassifier(max_depth=5, max_samples=0.8, min_samples_leaf=5,
                       min_samples_split=10, random_state=42)
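
To get a feeling for the cost of this search, the number of candidate models and fits can be counted directly from the grid. A small sketch using only the standard library:

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5],
    'min_samples_split': [10, 20],
    'min_samples_leaf': [5, 10],
    'max_samples': [0.8],
}
cv_folds = 5

# Every combination of values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates))  # 24
print(n_fits)           # 120 cross-validation fits, plus one final refit on all rows
```

Adding a second value for `max_samples` would double both numbers, so it pays to widen the grid one parameter at a time.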


In [35]:
best_model

In [36]:
rfc = grid_search.best_estimator_
# No `scoring` argument is passed, so cross_val_score reports the default metric (accuracy)
cross_val_scores = cross_val_score(rfc, X_train_resampled, y_train_resampled, cv=10)

In [37]:
cross_val_scores

array([0.9869961 , 0.99609883, 0.98829649, 0.99479844, 0.9869961 ,
       0.99349805, 0.99479167, 0.98958333, 0.99088542, 0.98828125])

In [38]:
print("The mean cross-validated Accuracy for the recall-tuned model in the TRAIN set is {:.3f}".format(np.mean(cross_val_scores)))

The mean cross-validated Accuracy for the recall-tuned model in the TRAIN set is 0.991
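
Note that `cross_val_score` uses the estimator's default scorer, which is accuracy for classifiers, unless a `scoring` argument is passed. The sketch below shows the difference on synthetic data; `make_classification` and the small forest are stand-ins chosen for illustration, not the lab's data or tuned model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the resampled training data.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=42)

# Without `scoring`, a classifier is evaluated on accuracy.
acc_scores = cross_val_score(model, X_demo, y_demo, cv=5)
# With scoring="recall", each fold reports recall of the positive class (label 1).
recall_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="recall")

print("mean accuracy: {:.3f}".format(np.mean(acc_scores)))
print("mean recall:   {:.3f}".format(np.mean(recall_scores)))
```

The same applies to precision: passing `scoring="precision"` (or a scorer built with `make_scorer`) is what makes the reported number a precision score.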


In [39]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, precision_score

# Create a RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5],
    'min_samples_split': [10, 20],
    'min_samples_leaf': [5, 10],
    'max_samples': [0.8]
}

# Define precision as the scoring metric
precscore = make_scorer(precision_score)  # the positive class defaults to label 1 (pos_label=1)

# Create a GridSearchCV object with precision as the scoring metric
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, scoring=precscore, cv=5)

# Fit the model using GridSearchCV
grid_search.fit(X_train_resampled, y_train_resampled)

# Get the best parameters and the model refit with them
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Print the results
print("Best Parameters:", best_params)
print("Best Model:", best_model)

Best Parameters: {'max_depth': 5, 'max_samples': 0.8, 'min_samples_leaf': 5, 'min_samples_split': 20, 'n_estimators': 150}
Best Model: RandomForestClassifier(max_depth=5, max_samples=0.8, min_samples_leaf=5,
                       min_samples_split=20, n_estimators=150, random_state=42)


In [40]:
rfc2 = grid_search.best_estimator_
# No `scoring` argument is passed, so cross_val_score reports the default metric (accuracy)
cross_val_scores = cross_val_score(rfc2, X_train_resampled, y_train_resampled, cv=10)

In [42]:
print("The mean cross-validated Accuracy for the precision-tuned model in the TRAIN set is {:.3f}".format(np.mean(cross_val_scores)))

The mean cross-validated Accuracy for the precision-tuned model in the TRAIN set is 0.994


In [48]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Evaluate the precision-tuned model (best_model from the last grid search)
# on the untouched, still imbalanced test set
y_pred_test = best_model.predict(X_test)
test_recall = recall_score(y_test, y_pred_test)
test_precision = precision_score(y_test, y_pred_test)
test_f1 = f1_score(y_test, y_pred_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
conf_matrix = confusion_matrix(y_test, y_pred_test)

print("Test Recall:", test_recall)
print("Test Precision:", test_precision)
print("Test F1-score:", test_f1)
print("Test Accuracy:", test_accuracy)
print("Confusion Matrix:\n", conf_matrix)

Test Recall: 0.97
Test Precision: 0.9969167523124358
Test F1-score: 0.9832742017232641
Test Accuracy: 0.9982707121521773
Confusion Matrix:
 [[18080     3]
 [   30   970]]
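
The reported metrics can be reproduced by hand from the four cells of the confusion matrix, which makes it easier to see which error type drives which metric:

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 18080, 3
fn, tp = 30, 970

recall = tp / (tp + fn)        # share of actual donors that were found
precision = tp / (tp + fp)     # share of predicted donors that really donated
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall={recall:.4f} precision={precision:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# recall=0.9700 precision=0.9969 f1=0.9833 accuracy=0.9983
```

False negatives lower recall only, false positives lower precision only, and accuracy is dominated by the 18,080 correctly classified non-donors.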


3 people were predicted to donate but did not (false positives).
However, 30 people were predicted not to donate but did (false negatives).

If the two error counts were reversed, the impact on the company would be much worse:
it would miss 30 donations it had anticipated, which could leave it unable to pay bills, for example.
Fortunately, the model produces only 3 false positives, and the 30 unexpected donations from the false negatives more than compensate for them.
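
The two error types do not have to cost the same. If the model were used to decide who is contacted, a false positive would waste one mailing, while a false negative would forgo one donation. A minimal sketch of that calculation follows; the mailing cost and the average donation are hypothetical placeholders, not values taken from the lab data.

```python
# Counts from the test-set confusion matrix.
false_positives = 3    # predicted donor, did not donate
false_negatives = 30   # predicted non-donor, actually donated

# Hypothetical business figures; replace with real values (e.g. derived from TARGET_D).
cost_per_mailing = 0.68   # assumed cost of contacting one person
average_donation = 15.00  # assumed average gift from a donor

# A false positive costs one wasted mailing.
wasted_mailing_cost = false_positives * cost_per_mailing
# A false negative loses the donation, net of the mailing that was never sent.
missed_net_revenue = false_negatives * (average_donation - cost_per_mailing)

print(f"cost of false positives: {wasted_mailing_cost:.2f}")
print(f"revenue missed through false negatives: {missed_net_revenue:.2f}")
```

Under these assumed figures a single false negative is far more expensive than a single false positive, which is an argument for weighting recall more heavily than precision when tuning the model.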