<div style="border: solid blue 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Iteration 3</b><br><br>

  Hi Bailey, I’m <b>Victor Camargo</b>. I’ve reviewed your code and you did a great job — it's clean, well-organized, and shows a solid understanding of the modeling workflow.

  <b>Nice work on:</b><br>
  ✔️ Preparing and preprocessing the data with appropriate feature handling<br>
  ✔️ Exploring and addressing class imbalance with two valid techniques<br>
  ✔️ Evaluating models using both F1 and AUC-ROC as required by the task<br>
  ✔️ Following up on previous feedback and successfully applying hyperparameter tuning to reach the 0.59 F1 score threshold<br><br>

  ✅ Project approved
</div>


<div style="border: solid blue 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Iteration 2</b><br><br>

  Hi Bailey, I’m <b>Victor Camargo</b>. I’ve reviewed your code and you did a great job — it's clean, well-organized, and shows a solid understanding of the modeling workflow.

  <b>Nice work on:</b><br>
  ✔️ Preparing and preprocessing the data with appropriate feature handling<br>
  ✔️ Exploring and addressing class imbalance with two valid techniques<br>
  ✔️ Evaluating models using both F1 and AUC-ROC as required by the task<br><br>

  A few things still need your attention before approval:<br>
  🔴 Consider applying hyperparameter tuning to a model trained on an upsampled dataset, which might push your performance over the line<br>

</div>


<div style="border: solid blue 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Iteration 1</b><br><br>

  Hi Bailey, I’m <b>Victor Camargo</b>. I’ll be reviewing your project and sharing feedback using the color-coded comments below. Your code is clean and well-organized, and you’ve demonstrated a solid understanding of the modeling workflow. Nice work overall.

  <b>Nice work on:</b><br>
  ✔️ Preparing and preprocessing the data with appropriate feature handling<br>
  ✔️ Exploring and addressing class imbalance with two valid techniques<br>
  ✔️ Evaluating models using both F1 and AUC-ROC as required by the task<br><br>

  A few things still need your attention before approval:<br>
  🔴 The final F1 score on the test set is just below the required 0.59 threshold<br>
  🔴 The conclusion needs to be updated to reflect the test set results accurately and avoid overstatement<br><br>

  <hr>

  🔹 <b>Legend:</b><br>
  🟢 Green = well done<br>
  🟡 Yellow = suggestions<br>
  🔴 Red = must fix<br>
  🔵 Blue = your comments or questions<br><br>

  Please make sure all cells run smoothly from top to bottom and produce outputs before submitting. Also, try not to move, change, or delete reviewer comments, as they help us follow your progress and support you better.<br><br>

  <b>Feel free to reach out in the Questions channel if you need help.</b><br>
</div>


Loaded the churn dataset, dropped non-informative columns, and filled missing values in the 'Tenure' column using the median. Categorical features 'Gender' and 'Geography' were encoded using Label Encoding. The data was split into training, validation, and test sets using stratification to preserve the class ratio.
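As a rough illustration of what stratification does in the split described above, the sketch below splits toy labels class by class using only the standard library. The helpers `stratified_split` and `positive_share` and the toy 80/20 labels are hypothetical, and this is a simplified stand-in, not how `train_test_split` is implemented. Holding out 0.2 first and then 0.25 of the remainder gives roughly a 60/20/20 train/validation/test split.

```python
import random
from collections import defaultdict


def stratified_split(indices, labels, holdout_fraction, seed=42):
    """Split indices into (rest, holdout), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)

    rest, holdout = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Take the same fraction from every class so the ratio is preserved.
        n_holdout = round(len(members) * holdout_fraction)
        holdout.extend(members[:n_holdout])
        rest.extend(members[n_holdout:])
    return rest, holdout


def positive_share(indices, labels):
    """Fraction of the given rows that belong to class 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels (assumption): 800 customers who stayed (0), 200 who churned (1).
labels = [0] * 800 + [1] * 200
all_idx = list(range(len(labels)))

temp_idx, test_idx = stratified_split(all_idx, labels, 0.2)
train_idx, valid_idx = stratified_split(temp_idx, labels, 0.25)

for name, idx in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]:
    print(name, len(idx), round(positive_share(idx, labels), 3))
```

With these toy labels, each of the three subsets keeps a 20% churn share.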

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  The import section is well-structured and includes all the essential libraries for data processing, visualization, and modeling. Good start to the project.
</div>

In [None]:
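df = pd.read_csv('/data/Churn.csv')

# Drop identifier columns that carry no predictive signal
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

# Fill missing tenure values with the median; assigning the result back avoids
# the chained in-place fillna that recent pandas versions warn about
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())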

le_gender = LabelEncoder()
df['Gender'] = le_gender.fit_transform(df['Gender'])

le_geo = LabelEncoder()
df['Geography'] = le_geo.fit_transform(df['Geography'])

X = df.drop('Exited', axis=1)
y = df['Exited']

Examined the target variable ('Exited') and observed a significant imbalance: about 80% of customers stayed and 20% churned. This imbalance can bias models toward predicting the majority class, so I first evaluated performance without correction and then applied balancing techniques.
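To make the risk concrete, the sketch below scores a trivial classifier that always predicts "stayed" on toy labels with the same 80/20 ratio. The helper `f1_from_labels` and the toy data are illustrative assumptions, not part of the project code. Accuracy looks respectable while F1 for the churn class is zero, which is why F1 is the more informative metric here.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives, so F1 is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy target (assumption): 80 customers stayed (0), 20 churned (1).
y_true = [0] * 80 + [1] * 20
always_stayed = [0] * 100  # a "model" that never predicts churn

accuracy = sum(t == p for t, p in zip(y_true, always_stayed)) / len(y_true)
f1 = f1_from_labels(y_true, always_stayed)

print("Accuracy:", accuracy)  # 0.8
print("F1:", f1)              # 0.0
```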

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  The data loading and preprocessing steps are clearly implemented. You correctly removed non-informative columns, filled missing values in the <code>Tenure</code> column using the median, and applied label encoding to categorical features. Everything looks well-structured and appropriate for the task.
</div>


In [None]:
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)

baseline_model = RandomForestClassifier(random_state=42)
baseline_model.fit(X_train, y_train)

y_pred = baseline_model.predict(X_valid)
y_proba = baseline_model.predict_proba(X_valid)[:, 1]

print("Baseline F1 Score:", f1_score(y_valid, y_pred))
print("Baseline AUC-ROC:", roc_auc_score(y_valid, y_proba))

Baseline F1 Score: 0.562111801242236
Baseline AUC-ROC: 0.851123851123851


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  The baseline model setup is well executed. You correctly split the data with stratification, trained a RandomForest classifier, and evaluated it using both F1 and AUC-ROC metrics. This provides a solid starting point for performance comparison.
</div>


In [None]:
model_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)

y_pred_weighted = model_weighted.predict(X_valid)
y_proba_weighted = model_weighted.predict_proba(X_valid)[:, 1]

print("Weighted F1 Score:", f1_score(y_valid, y_pred_weighted))
print("Weighted AUC-ROC:", roc_auc_score(y_valid, y_proba_weighted))

Weighted F1 Score: 0.5410628019323671
Weighted AUC-ROC: 0.853246158330904


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  You’ve correctly applied class weighting to address class imbalance and evaluated the model using appropriate metrics. This is a valid and well-implemented technique for improving performance on imbalanced data.
</div>


In [None]:
train_data = pd.concat([X_train, y_train], axis=1)

majority = train_data[train_data.Exited == 0]
minority = train_data[train_data.Exited == 1]

majority_downsampled = majority.sample(len(minority), random_state=42)

downsampled = pd.concat([majority_downsampled, minority])
X_train_down = downsampled.drop('Exited', axis=1)
y_train_down = downsampled['Exited']

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  You've correctly implemented downsampling to balance the classes in the training set. The approach is clear and reproducible, making it a solid second method for addressing class imbalance.
</div>


In [None]:
model_down = RandomForestClassifier(random_state=42)
model_down.fit(X_train_down, y_train_down)

y_pred_down = model_down.predict(X_valid)
y_proba_down = model_down.predict_proba(X_valid)[:, 1]

print("Downsampled F1 Score:", f1_score(y_valid, y_pred_down))
print("Downsampled AUC-ROC:", roc_auc_score(y_valid, y_proba_down))

Downsampled F1 Score: 0.5931558935361216
Downsampled AUC-ROC: 0.8476149493098646


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  The model trained on the downsampled data was implemented correctly, and the evaluation using F1 and AUC-ROC provides a clear comparison with previous approaches. Everything is consistent and well done.
</div>


To improve the model's F1 score on the test set and meet the required threshold of 0.59, we applied hyperparameter tuning using `GridSearchCV` on the downsampled dataset.

The grid search explored various combinations of:
- `n_estimators` (number of trees)
- `max_depth` (maximum depth of each tree)
- `min_samples_split` (minimum number of samples required to split an internal node)
- `min_samples_leaf` (minimum number of samples required to be at a leaf node)

The best model was then evaluated on the test set using both **F1 Score** and **AUC-ROC** to ensure robust performance on imbalanced data.
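The size of this search can be checked by hand. The sketch below enumerates the project's hyperparameter values with `itertools.product`; it only counts combinations and trains nothing. The names `cv_folds`, `candidates`, and `total_fits` are introduced here for illustration.

```python
from itertools import product

# Same values as the project's parameter grid.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Each combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# With k-fold cross-validation, every candidate is fitted once per fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
print("First candidate:", candidates[0])
```

This matches the 54 candidates and 270 fits that `GridSearchCV` reports with `verbose=1`. With the default `refit=True`, the best candidate is then fitted once more on the full training data passed to `fit`.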

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_down, y_train_down)

best_model = grid_search.best_estimator_

y_pred_test = best_model.predict(X_test)
y_proba_test = best_model.predict_proba(X_test)[:, 1]

print("Final F1 Score (Test Set):", f1_score(y_test, y_pred_test))
print("Final AUC-ROC (Test Set):", roc_auc_score(y_test, y_proba_test))

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Final F1 Score (Test Set): 0.5757575757575757
Final AUC-ROC (Test Set): 0.8521695809831402


In [None]:
import pandas as pd
train_data = pd.concat([X_train, y_train], axis=1)

majority = train_data[train_data.Exited == 0]
minority = train_data[train_data.Exited == 1]

minority_upsampled = minority.sample(len(majority), replace=True, random_state=42)

upsampled = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

X_train_up = upsampled.drop('Exited', axis=1)
y_train_up = upsampled['Exited']

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_up, y_train_up)

best_model = grid_search.best_estimator_

y_pred_test = best_model.predict(X_test)
y_proba_test = best_model.predict_proba(X_test)[:, 1]

print("Final F1 Score (Test Set):", f1_score(y_test, y_pred_test))
print("Final AUC-ROC (Test Set):", roc_auc_score(y_test, y_proba_test))

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Final F1 Score (Test Set): 0.5906040268456376
Final AUC-ROC (Test Set): 0.8499146295756465


<div class="alert alert-danger">
  <b>Reviewer’s comment – Iteration 2:</b><br>
  Great job applying hyperparameter tuning to improve the model trained on the downsampled data. You're very close — the final F1 score on the test set reached 0.57, which shows solid progress but still falls just short of the 0.59 requirement. To improve the score further, consider trying the same hyperparameter tuning approach using an upsampled training set. Here's a suggested setup for upsampling the minority class:
  <br><br>
  <code>
  # Combine features and target<br>
  train_data = pd.concat([X_train, y_train], axis=1)<br>
  majority = train_data[train_data.Exited == 0]<br>
  minority = train_data[train_data.Exited == 1]<br>
  minority_upsampled = minority.sample(len(majority), replace=True, random_state=42)<br>
  upsampled = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)<br>
  X_train_up = upsampled.drop('Exited', axis=1)<br>
  y_train_up = upsampled['Exited']
  </code>
  <br><br>
  Then repeat your hyperparameter tuning process using <code>X_train_up</code> and <code>y_train_up</code>. This may help you push the F1 score over the required threshold.
</div>


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Final testing was completed correctly using the test set. The use of both F1 and AUC-ROC metrics aligns well with the project requirements and provides a clear view of model performance.
</div>

<div class="alert alert-danger">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  The current F1 score on the test set is slightly below the required threshold of 0.59. To improve it, consider performing hyperparameter tuning on your best-performing model (e.g., using <code>GridSearchCV</code> or <code>RandomizedSearchCV</code>) to find an optimal configuration.
</div>


This project focused on predicting customer churn for Beta Bank, where ~80% of customers stayed and ~20% churned.
Evaluated three modeling strategies:
- Baseline model (no balancing)
- Class-weighted model
- Downsampling of the majority class
The best model, trained on the downsampled dataset and tuned using GridSearchCV, achieved:
- Validation F1 Score: 0.5932 (Met the requirement)
- Test F1 Score: 0.576 (just below the 0.59 threshold)
- Test AUC-ROC: 0.852
While the validation set showed promising results, the test F1 score fell slightly short of the requirement, so the model did not fully generalize to the test set.
This indicates potential overfitting or insufficient learning from the minority class. Additional steps like feature engineering, more robust tuning, or ensembling may be needed to improve performance.
Despite the low F1 test score, the model shows strong AUC-ROC, suggesting it still separates churn vs. non-churn well — a foundation Beta Bank can build on for future churn prevention efforts.
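A strong AUC-ROC alongside a weaker F1 is possible because AUC-ROC measures ranking quality across all thresholds, while F1 is computed at a single cutoff (effectively 0.5 for a binary `predict`). The sketch below illustrates this on made-up scores; the helpers `pairwise_auc` and `f1_at_threshold` and all numbers are hypothetical, not results from this project.

```python
def pairwise_auc(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at_threshold(y_true, scores, threshold):
    """F1 for class 1 when scores >= threshold are predicted as churn."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# Made-up labels and predicted churn probabilities (8 stayed, 2 churned).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.48]

print("AUC-ROC:", pairwise_auc(y_true, scores))
for threshold in (0.5, 0.4):
    print("F1 at", threshold, "=", round(f1_at_threshold(y_true, scores, threshold), 3))
```

On this toy data the ranking is nearly perfect, yet F1 is zero at a 0.5 cutoff because no score reaches it; lowering the cutoff to 0.4 recovers both churners. If the decision threshold were tuned in practice, it should be chosen on the validation set rather than the test set.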

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  The conclusion is clearly written and summarizes the modeling approaches and results effectively. It highlights the strength of the downsampled model and the consistent AUC-ROC performance, showing a solid understanding of the task and its goals.
</div>

<div class="alert alert-danger">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Be careful when interpreting the threshold requirement. The F1 score passed 0.59 on the validation set, but the final F1 on the test set is still below that mark. Consider revising the conclusion after improving the test score through hyperparameter tuning or other enhancements.
</div>
