In [1]:
import pandas as pd

df = pd.read_csv("/content/random_forest_poc_dataset_1000.csv")

print(df.head())
# dataset contains customer records with the number of interactions they had with the company and their decision to continue their subscription or not (churn)
# churn: 1 ("yes") means the customer canceled their subscription, 0 ("no") means they stayed subscribed after a support ticket was filed

   age  income  tenure  interactions  churn
0   56  135186      10             3      0
1   69   64674       6             2      0
2   46   65854       1             2      1
3   32   76271       6             4      0
4   60  103688      10             7      0


In [2]:
from sklearn.model_selection import train_test_split

# split into features (X) and target (y)
X = df[["age", "income", "tenure", "interactions"]]
y = df["churn"]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 🌲 How Random Forests Work (Plain-Language Explanation)

Random Forests are machine learning models that make predictions by combining the results of **many decision trees**. Instead of relying on a single tree, which can easily overfit, the forest averages or votes across many trees to produce a stronger, more stable prediction.

---

## 1️⃣ Create Many Random Decision Trees
A Random Forest builds **lots of decision trees** (often 100+).  
Each tree is trained with two sources of randomness:
- a **random sample** of the dataset's rows  
- a **random subset** of the features, redrawn at every split  

This randomness makes the trees different from each other and helps the model generalize better.

---

## 2️⃣ Bootstrapping (Training on Random Samples)
Each tree is trained on a bootstrap sample: a random sample of the data drawn **with replacement**.

- Some rows appear more than once  
- Some rows aren't used at all  

This creates natural variation between trees.
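
A minimal sketch of one bootstrap draw, using NumPy only. The row count of 1000 is an assumption taken from the dataset's file name, and the seed is arbitrary. For large datasets the share of rows drawn at least once approaches 1 - 1/e (about 63%), which leaves roughly 37% out of each tree's sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_rows = 1000                   # assumption: dataset size implied by the file name

# Draw row indices with replacement: this is one tree's bootstrap sample.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Rows that appear at least once are "in bag"; the rest are never seen by this tree.
in_bag_fraction = np.unique(bootstrap_idx).size / n_rows
out_of_bag_fraction = 1 - in_bag_fraction

print(f"rows drawn at least once: {in_bag_fraction:.1%}")
print(f"rows never drawn:         {out_of_bag_fraction:.1%}")
```

Repeating the draw with a different seed gives a different set of rows, which is where the variation between trees comes from.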

---

## 3️⃣ Random Subset of Features per Split
During tree construction, each split considers only a **random subset of features**, not all of them.

This prevents the trees from becoming identical and increases overall model robustness.
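
As a rough illustration, the sketch below computes how many features a split would consider under the `'sqrt'` and `'log2'` settings for this notebook's four feature columns, then draws candidate features for a few hypothetical splits. The sampling loop is an illustration of the idea, not scikit-learn's internal code. With only four features, both settings work out to 2 features per split.

```python
import math
import numpy as np

feature_names = ["age", "income", "tenure", "interactions"]  # the notebook's feature columns
n_features = len(feature_names)

# Number of candidate features per split for the two string settings.
n_sqrt = max(1, int(math.sqrt(n_features)))
n_log2 = max(1, int(math.log2(n_features)))
print(f"sqrt -> {n_sqrt} features per split, log2 -> {n_log2} features per split")

# Illustration only: draw the candidate features for three hypothetical splits.
rng = np.random.default_rng(0)  # arbitrary seed
for split in range(3):
    candidates = rng.choice(feature_names, size=n_sqrt, replace=False)
    print(f"split {split}: considers {sorted(candidates.tolist())}")
```

Because the two settings coincide here, a search over `'sqrt'` and `'log2'` on a four-feature dataset is effectively evaluating the same value twice.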

---

## 4️⃣ Grow Each Tree Fully
Each tree is usually allowed to grow deep and may overfit its bootstrap sample.

This is fine because the **forest averages out** the overfitting across many trees, creating a stable model.

---

## 5️⃣ Combine All Trees (Voting or Averaging)

- **Classification:** each tree votes → the forest picks the majority vote (scikit-learn averages the trees' predicted class probabilities and picks the most probable class)  
- **Regression:** each tree outputs a value → the forest averages them  

This ensemble approach smooths out noise and improves accuracy.
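
The combination step can be checked by hand. The sketch below uses synthetic data from `make_classification` (an assumption, not the churn dataset) and a small forest, averages the per-tree class probabilities manually, and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 features (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by every individual tree.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

probas_match = np.allclose(mean_proba, forest.predict_proba(X_demo))
preds_match = bool((manual_pred == forest.predict(X_demo)).all())
print("probabilities match:", probas_match)
print("predictions match:  ", preds_match)
```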

---

## 6️⃣ Out-of-Bag Validation (Optional)
Because trees train on bootstrapped samples, about one-third of the data is not used for each tree.

These unused rows (called **out-of-bag samples**) can be used to estimate accuracy without needing a separate validation set.
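
In scikit-learn this estimate is available by passing `oob_score=True`, after which the fitted model exposes `oob_score_` (accuracy for classifiers). A minimal sketch on synthetic data, which is an assumption and not the churn dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# oob_score=True scores each row using only the trees that did not train on it.
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X_demo, y_demo)

oob_accuracy = forest.oob_score_
print(f"OOB accuracy estimate: {oob_accuracy:.3f}")
```

The OOB estimate needs `bootstrap=True` (the default), since without bootstrapping every tree sees every row.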

---

# ⭐ Why Random Forests Are Effective
- ✔️ Reduce overfitting compared to single trees  
- ✔️ Work well with noisy or messy data  
- ✔️ Handle numeric and categorical features (scikit-learn needs categorical features encoded as numbers first)  
- ✔️ Easy to tune and difficult to break  
- ✔️ Provide feature importance scores  
- ✔️ Strong baseline performance with minimal configuration  
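
To make the feature importance point concrete, here is a small sketch that reads `feature_importances_` from a fitted forest. The column names are borrowed from this notebook, but the data is synthetic and deliberately built so that only the `tenure` column drives the label. The ranking it prints says nothing about the real churn data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["age", "income", "tenure", "interactions"]  # names borrowed from the notebook

# Synthetic stand-in data: only the third column ("tenure") influences the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 2] + 0.3 * rng.normal(size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances: one non-negative score per feature, summing to 1.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:>12}: {score:.3f}")
```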

In [21]:
# set up random forest model
from sklearn.ensemble import RandomForestClassifier

# set up the model with arbitrarily picked starting parameters, will use GridSearch to optimize later
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=1,
    random_state=1,
    max_features='sqrt',
    n_jobs=-1
)
# hyperparameter rundown:
# n_estimators: number of trees in the forest, more trees give more stable predictions (with diminishing returns) at the cost of longer training
# max_depth: how deep each tree can grow, smaller values reduce overfitting, None (no depth limit) can overfit
# min_samples_split: min number of samples needed before a node can be split into 2 branches, small values lead to more complexity & overfitting while larger values generalize more
# min_samples_leaf: min number of samples required in each leaf node, lower values lead to more detailed modeling but higher overfitting risk while larger values (2+) lead to smoother predictions with lower variance
# max_features: how many features each tree is allowed to consider at each split
  # auto: older alias for sqrt in classification, deprecated in scikit-learn 1.1 and removed in 1.3
  # sqrt: takes sqrt of num_features
  # log2: takes log2(num_features)
  # impact of max_features: controls randomness and diversity among trees!
    # more features -> stronger trees but more correlation, less ensemble benefit
    # fewer features -> weaker trees individually but more ensemble benefit

rf.fit(X_train, y_train)

In [22]:
# make predictions

y_pred = rf.predict(X_test)

In [23]:
# now let's calculate some metrics to see how the model did
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# accuracy: percentage of predictions that are correct
# sklearn metric functions take the true labels first, then the predictions
acc = accuracy_score(y_test, y_pred) * 100
print("Accuracy: ", acc, "%")

# confusion matrix: rows are actual classes, columns are predicted classes
# correctly labelled datapoints are top left and bottom right (true negatives, true positives)
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion matrix:\n", cm)

# classification report: shows precision, recall, f1-score, and support metrics
# precision: when my model predicts label 1, for example, how often is it right? TP / (TP + FP)
# recall: of all the actual label 1s, how many did my model correctly label as 1? TP / (TP + FN)
# f1-score: harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall); it is high only when both are high
# support: number of actual samples of each class in y_test
cr = classification_report(y_test, y_pred)
print("\nClassification report:\n", cr)

# ROC-AUC score
# ROC-AUC: measures how well the model separates the two classes across all thresholds, not just the default 0.5 threshold used in class predictions

# get predicted probabilities for the positive class (churn = 1, "yes")
y_proba = rf.predict_proba(X_test)[:,1] # [:,1] selects the probability column for churn = 1

roc_auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC score: ", roc_auc)
# ROC-AUC of 0.5 means random class separation and 1 means perfect class separation (values below 0.5 are worse than random)

Accuracy:  79.5 %

Confusion matrix:
 [[150  18]
 [ 23   9]]

Classification report:
               precision    recall  f1-score   support

           0       0.87      0.89      0.88       168
           1       0.33      0.28      0.31        32

    accuracy                           0.80       200
   macro avg       0.60      0.59      0.59       200
weighted avg       0.78      0.80      0.79       200

ROC-AUC score:  0.8063616071428572
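
The precision, recall, and F1 formulas in the comments above can be verified by hand from a confusion matrix. The sketch below uses a tiny made-up set of labels (an assumption, not the notebook's predictions) and checks the manual arithmetic against scikit-learn's own functions. Note that the true labels go first in every call.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the predicted 1s, how many were right
recall = tp / (tp + fn)     # of the actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Cross-check against scikit-learn (true labels first, predictions second).
sk_precision = precision_score(y_true_demo, y_pred_demo)
sk_recall = recall_score(y_true_demo, y_pred_demo)
sk_f1 = f1_score(y_true_demo, y_pred_demo)
```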


In [14]:
# now, let's begin hyperparameter tuning to see whether optimizing the parameters improves the model
# let's use grid search

from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=1, n_jobs=-1)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# setup grid search
grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5, # number of cross validation folds, more folds -> more stable but slower, 5 is standard
    scoring='f1', # use the f1-score of the positive class (churn = 1) to pick the best model, can also use 'accuracy', 'roc_auc', 'recall', etc
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

print("Best parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)

# get the best model
best_rf = grid.best_estimator_

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best parameters:  {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Best score:  0.4802456753130893
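
The candidate count in the log line above follows directly from the grid: 3 × 4 × 3 × 3 × 2 = 216 parameter combinations, each fitted once per fold. A quick way to check this before launching a long search is `ParameterGrid`, sketched here with the same grid as in the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search cell above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(param_grid))  # number of parameter combinations
n_folds = 5                                    # cv=5 in the grid search
total_fits = n_candidates * n_folds            # refitting the best model adds one more fit

print(n_candidates, total_fits)  # 216 1080
```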


In [15]:
# make predictions with best random forest model
y_pred = best_rf.predict(X_test)

In [17]:
# calculate metrics to see how the model did

# true labels go first, predictions second
acc = accuracy_score(y_test, y_pred) * 100
print("Accuracy: ", acc, "%")

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion matrix:\n", cm)

cr = classification_report(y_test, y_pred)
print("\nClassification report:\n", cr)

y_proba = best_rf.predict_proba(X_test)[:,1]
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC score: ", roc_auc)

Accuracy:  79.0 %

Confusion matrix:
 [[150  18]
 [ 24   8]]

Classification report:
               precision    recall  f1-score   support

           0       0.86      0.89      0.88       168
           1       0.31      0.25      0.28        32

    accuracy                           0.79       200
   macro avg       0.58      0.57      0.58       200
weighted avg       0.77      0.79      0.78       200

ROC-AUC score:  0.8033854166666666


# Possible Reasons for the Results of the Original Model vs. the Optimized Model
The tuned model did not beat the original one on the test set, which is a fairly common outcome. Grid search selects the parameters with the best cross-validated score on the training data, but that does not guarantee better performance on the held-out test set. A few things could be at play here:

- **Random variation / small differences**: Random forests have some inherent randomness, and the test set is small (200 rows). Accuracies of 79.5% and 79.0% differ by a single test prediction, so the gap may just be noise.

- **Overfitting to CV folds**: Grid search picks the best of 216 candidates based on cross-validation scores, which can overfit the CV folds slightly and lead to slightly worse performance on the test set. Here the best cross-validated F1 was about 0.48, well above the tuned model's F1 of 0.28 for the churn class on the test set.

- **Metric mismatch**: If grid search optimizes one metric (F1 in this notebook) but the models are compared on another (such as accuracy or ROC-AUC), the “optimal” parameters may not look optimal on the comparison metric.

- **Hyperparameter sensitivity**: Random forests are often robust to parameter choices. Sometimes the defaults are already very good. Tweaking parameters may not give a big boost, and occasionally a “non-optimal” combination just happens to work better on one specific split.

---

### Next Steps / Tips

- **Check multiple splits**: Evaluate your models on different train/test splits or with more cross-validation folds to see if the trend holds.  
- **Compare metrics consistently**: Make sure the metric you optimized in grid search matches what you care about for evaluation.  
- **Look at feature importance**: Sometimes a slightly worse overall score still gives more stable or interpretable feature insights.  
- **Consider randomness**: A fixed `random_state` (already set to 1 in this notebook) makes results reproducible. To see how much of the gap is noise, repeat the comparison with several different seeds.
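
One way to act on the first two tips is to score both configurations on the same repeated cross-validation folds and compare the spread of the scores with the gap between the means. The sketch below is a rough illustration under stated assumptions: the data is synthetic and imbalanced rather than the churn dataset, the tree count is reduced to 50 to keep it quick, and the labels `initial` and `tuned` only loosely mirror the two models above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.85, 0.15], random_state=0,
)

# Fixed seed: both models are scored on identical folds (5 folds x 3 repeats).
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

candidates = {
    "initial": RandomForestClassifier(
        n_estimators=50, max_depth=20, min_samples_split=5, random_state=1
    ),
    "tuned": RandomForestClassifier(
        n_estimators=50, max_depth=10, min_samples_leaf=2, random_state=1
    ),
}

results = {}
for name, model in candidates.items():
    # Same metric for both models: F1 of the positive class.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    results[name] = scores
    print(f"{name}: mean F1 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the difference between the two means is smaller than the standard deviations, the two configurations are hard to tell apart on this data.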
