# Bank Customer Churn Prediction
## Notebook 6 — Final Model Training and Saving

**Goal:** Train the selected Random Forest model on the complete processed dataset and save it.

### Why retrain on 100% of the data?

In N5 we split the data 80/20 to get an honest performance estimate. Now that the model is selected, a common practice is to retrain on all available data before deployment, because more training data usually improves generalisation. The N5 test-set score remains our performance estimate: it cannot be re-measured on rows the final model has already seen.

> Analogy: Like a student who sets aside practice questions to test themselves, but studies every question before the real exam.
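
The evaluate-then-refit pattern described above can be sketched as follows. This is a minimal illustration, not the N5 code: it assumes synthetic data from `make_classification` as a stand-in for `data_processed.csv`, and the names `X_demo`, `eval_model` and `deploy_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed churn data (about 20% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# Step 1: hold out 20% to estimate performance on unseen rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
eval_model = RandomForestClassifier(n_estimators=100, random_state=42)
eval_model.fit(X_tr, y_tr)
holdout_recall = recall_score(y_te, eval_model.predict(X_te))
print(f'Hold-out recall: {holdout_recall:.3f}')

# Step 2: refit the same configuration on every row for deployment.
deploy_model = RandomForestClassifier(n_estimators=100, random_state=42)
deploy_model.fit(X_demo, y_demo)
print(f'Rows used for the deployed model: {len(X_demo)}')
```

The hold-out score from step 1 is the number to report. The model from step 2 has seen every row, so any score computed on those rows would be optimistic.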

In [None]:
import pandas as pd
import numpy as np
import pickle
import os
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('data_processed.csv')
print(f'Dataset loaded: {data.shape}')

## 1. Prepare Features and Target

In [None]:
X = data.drop('Exited', axis=1)
y = data['Exited']

print(f'X: {X.shape}')
print(f'y: {y.shape}')
print('Features:', X.columns.tolist())

## Should We Apply SMOTE When Training the Final Model?

Short answer: **No.** Here is the reasoning:

The purpose of SMOTE in N5 was to prevent the model from ignoring the minority class during *training*. But for the **final production model**, we want it calibrated to the **real-world class distribution** (≈20% churn, ≈80% stay).

If we SMOTE the final training data:
- The model learns a 50/50 world that doesn't exist.
- `predict_proba()` outputs will be poorly calibrated: the model will tend to overestimate churn probability across the customer base.
- Downstream decisions (retention budgets, alert thresholds) will be based on inflated probabilities.
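
The inflation described in the list above can be illustrated with a small experiment. This is a sketch under stated assumptions: the data is synthetic, and plain random oversampling (duplicating minority rows with `sklearn.utils.resample`) is used as a simple stand-in for SMOTE so that the example needs only scikit-learn. The exact numbers will differ on the real churn data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 20% positives, with some label noise).
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Duplicate minority rows until both classes are the same size.
# (Random oversampling, NOT SMOTE; used here only for illustration.)
minority = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=0)
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=y_tr.dtype)])

plain = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
balanced = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

# Compare average predicted positive probability on the untouched test set.
mean_plain = plain.predict_proba(X_te)[:, 1].mean()
mean_balanced = balanced.predict_proba(X_te)[:, 1].mean()
print(f'Actual positive rate in test set : {y_te.mean():.3f}')
print(f'Mean P(positive), original data  : {mean_plain:.3f}')
print(f'Mean P(positive), oversampled    : {mean_balanced:.3f}')
```

If the average predicted probability from the oversampled model sits well above the actual positive rate, that gap is the miscalibration the list refers to.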

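A better practice for the final model is to train on the **real, imbalanced data** and, if minority-class recall is too low, tune the decision threshold instead of applying SMOTE. `class_weight='balanced'` is another option, but like resampling it reweights the classes and therefore also shifts the predicted probabilities.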

> **What we do here:** Train on all 10,000 real customers without resampling.
> The predicted probabilities then reflect the true proportion of churners in the real world.
> Random Forest tolerates mild imbalance reasonably well, but the N5 model was trained with SMOTE, so minority-class recall at the default 0.5 threshold should be rechecked and not assumed.
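
One way to explore the decision threshold without a hold-out split is to use out-of-bag (OOB) probabilities, which scikit-learn exposes when `oob_score=True`. The sketch below is an illustration on synthetic data, not the notebook's method; the threshold of 0.3 is an arbitrary example value, and the names `rf` and `oob_churn_prob` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data standing in for the churn dataset.
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], flip_y=0.05, random_state=1,
)

# With oob_score=True, each row is scored only by trees whose bootstrap
# sample did not contain it, giving an approximate out-of-sample estimate.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
rf.fit(X_syn, y_syn)
oob_churn_prob = rf.oob_decision_function_[:, 1]

results = {}
for threshold in (0.5, 0.3):  # 0.3 is an arbitrary illustrative value
    flagged = (oob_churn_prob >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_syn, flagged),
        precision_score(y_syn, flagged),
    )
    recall, precision = results[threshold]
    print(f'threshold={threshold:.1f}  recall={recall:.3f}  precision={precision:.3f}')
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision. Where to set it depends on the relative cost of a missed churner versus an unnecessary retention offer.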

## 2. Train on the Full Dataset

Key Random Forest parameters:
- `n_estimators=100`: 100 decision trees in the ensemble
- `random_state=42`: reproducibility
- `max_features='sqrt'`: each split considers only a random subset of the features, which reduces correlation between trees (this is the scikit-learn default for classifiers)

In [None]:
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(X, y)

print('Model trained on full dataset.')
print(f'  Trees: {final_model.n_estimators}')
print(f'  Features: {final_model.n_features_in_}')
print(f'  Classes: {final_model.n_classes_}')

## 3. Save with Pickle

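Pickle serialises most Python objects, including fitted scikit-learn estimators, to disk. The model can then be reloaded in another script, such as the inference module in N7, provided a compatible scikit-learn version is installed there.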

```
Save  — open('file.pkl', 'wb')  then  pickle.dump(object, file)
Load  — open('file.pkl', 'rb')  then  pickle.load(file)
```

**Security note:** Only unpickle files from trusted sources, because loading a pickle can execute arbitrary code.
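
Because a pickled estimator is tied to the library version that produced it, one option is to store the version alongside the model. The following is a minimal sketch under assumptions: the bundle dictionary, its keys, and the file name `model_bundle.pkl` are hypothetical and are not what this notebook saves.

```python
import os
import pickle
import tempfile

import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model so the example runs on its own.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_syn, y_syn)

# Bundle the model with the library version it was trained under.
bundle = {'model': model, 'sklearn_version': sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, 'model_bundle.pkl')  # hypothetical name
    with open(bundle_path, 'wb') as f:
        pickle.dump(bundle, f)
    with open(bundle_path, 'rb') as f:
        restored = pickle.load(f)

# Warn at load time if the environment has changed since saving.
if restored['sklearn_version'] != sklearn.__version__:
    print('Warning: model was saved under a different scikit-learn version.')

same = bool((restored['model'].predict(X_syn) == model.predict(X_syn)).all())
print(f'Round-trip predictions identical: {same}')
```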

In [None]:
with open('model_file.pkl', 'wb') as f:
    pickle.dump(final_model, f)

# Confirm that both artifacts needed by N7 exist (the scaler was saved in N4).
for path in ['model_file.pkl', 'Scaler_file.pkl']:
    if os.path.exists(path):
        kb = os.path.getsize(path) / 1024
        print(f'  {path:<25s}  {kb:>8.1f} KB  found')
    else:
        print(f'  WARNING: {path} not found; run N4 first')

## 4. Verify the Saved Model

In [None]:
with open('model_file.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

sample = X.iloc[[0]]
pred   = loaded_model.predict(sample)
prob   = loaded_model.predict_proba(sample)

label = 'Churned' if pred[0] == 1 else 'Stayed'
print(f'Sample prediction : {pred[0]}  ({label})')
print(f'Stay probability  : {prob[0][0]:.3f}')
print(f'Churn probability : {prob[0][1]:.3f}')
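
# The reloaded model must reproduce the in-memory model's predictions exactly.
assert np.array_equal(loaded_model.predict(X), final_model.predict(X))
print('Model verified successfully.')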

---
## Summary

| Artifact | File | Notes |
|---|---|---|
| Production model | `model_file.pkl` | Random Forest on 100% of data |
| Scaler | `Scaler_file.pkl` | Fitted in N4; reuses the scaling parameters applied to the training data |

Both artifacts are required by the inference module in N7.
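
How the two artifacts might be combined at inference time can be sketched as below. This is an assumption-laden illustration, not the N7 module: it assumes the scaler is a `StandardScaler` applied to every feature column, uses synthetic data, and writes to a temporary directory so it does not touch the real files. The names `demo_scaler` and `demo_model` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-ins for the N4 scaler and the model trained above (synthetic data).
X_raw, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
scaler = StandardScaler().fit(X_raw)
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(scaler.transform(X_raw), y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, 'Scaler_file.pkl')
    model_path = os.path.join(tmp_dir, 'model_file.pkl')
    for obj, path in ((scaler, scaler_path), (model, model_path)):
        with open(path, 'wb') as f:
            pickle.dump(obj, f)

    # Inference side: load both artifacts before the directory is removed.
    with open(scaler_path, 'rb') as f:
        demo_scaler = pickle.load(f)
    with open(model_path, 'rb') as f:
        demo_model = pickle.load(f)

# New raw rows must be scaled with the saved scaler before prediction.
new_customers = X_raw[:3]  # pretend these are unseen, unscaled rows
churn_prob = demo_model.predict_proba(demo_scaler.transform(new_customers))[:, 1]
print(np.round(churn_prob, 3))
```

The key point is the order of operations: the saved scaler is only applied with `transform` and never refitted, so new inputs are scaled the same way as the training data.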

Continue to **N7_BankChurn_Module** to see the full inference pipeline in action.