# Model Building

## Setup & Load Data

In [7]:
import pandas as pd

X_train = pd.read_csv('../data/processed/X_train_processed.csv')
X_test = pd.read_csv('../data/processed/X_test_processed.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')
y_test = pd.read_csv('../data/processed/y_test.csv')

## Converting Targets from 2D to 1D

In [None]:
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
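
`pd.read_csv` returns a DataFrame, so a single-column target has shape `(n, 1)`, while scikit-learn estimators expect a 1D target of shape `(n,)`. A minimal sketch of the conversion on a small made-up frame (`y_demo` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for y_train as loaded from CSV: one column, three rows.
y_demo = pd.DataFrame({"churn": [0, 1, 0]})
print(y_demo.shape)   # 2D: three rows, one column

# .values exposes the underlying NumPy array; .ravel() flattens it to 1D.
y_flat = y_demo.values.ravel()
print(y_flat.shape)   # 1D: three elements
```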

## Model Definitions

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=42),
    "SVC": SVC(class_weight='balanced', random_state=42)
}

Positive class weight (for XGBoost/CatBoost): churn    3.907975
dtype: float64
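
The cell that produced the output above is not shown. Its format (a one-element Series indexed by `churn`) is consistent with dividing the number of non-churners by the number of churners on the target DataFrame, which is the usual way to set `scale_pos_weight` in XGBoost. A sketch under that assumption, using made-up counts rather than the real data:

```python
import pandas as pd

# Made-up target: 8 non-churners and 2 churners (not the real data).
y_demo = pd.DataFrame({"churn": [0] * 8 + [1] * 2})

# Negatives divided by positives, computed per column, so the result
# is a one-element Series indexed by the column name.
pos_weight = (y_demo == 0).sum() / (y_demo == 1).sum()
print("Positive class weight (for XGBoost/CatBoost):", pos_weight)
```

This only works while the target is still a DataFrame, so it would have to run before the `ravel()` step.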


LogisticRegression:
- Fast to train
- Interpretable coefficients
- Strong baseline

RandomForestClassifier:
- Handles nonlinear patterns
- Copes with class imbalance when combined with class weights
- Provides feature importances
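
As an illustration of the last point, a fitted random forest exposes `feature_importances_`. The sketch below uses synthetic data from `make_classification`; the `feature_i` column names are placeholders, not the churn features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the processed churn data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(5)])

rf = RandomForestClassifier(class_weight="balanced", random_state=42)
rf.fit(X_demo, y_demo)

# Impurity-based importances, one per column, normalized to sum to 1.
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```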

## Train & Evaluate Each Model

In [16]:
from sklearn.metrics import classification_report, roc_auc_score

for name, model in models.items():
    print(f"\n===== {name} =====")
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Probabilities for ROC-AUC
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
    else:  # for models like SVC without predict_proba unless probability=True
        y_prob = model.decision_function(X_test)
    
    # Classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # ROC-AUC score
    roc_auc = roc_auc_score(y_test, y_prob)
    print(f"ROC-AUC Score: {roc_auc:.4f}")



===== Logistic Regression =====
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.72      0.80      1593
           1       0.39      0.70      0.50       407

    accuracy                           0.71      2000
   macro avg       0.65      0.71      0.65      2000
weighted avg       0.80      0.71      0.74      2000

ROC-AUC Score: 0.7771

===== Random Forest =====
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1593
           1       0.77      0.44      0.56       407

    accuracy                           0.86      2000
   macro avg       0.82      0.70      0.74      2000
weighted avg       0.85      0.86      0.84      2000

ROC-AUC Score: 0.8511

===== SVC =====
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.80      0.86      1593
           1       0.49      0.74      0.59    

## Model Choosing

SVC is chosen as the final model. The churn dataset is **imbalanced** (most customers are non-churners).

In this scenario:

* Accuracy becomes misleading
* The focus shifts to the positive class (**churners**)
* Missing churners introduces **high business risk**
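
To see why accuracy misleads here, consider a hypothetical model that predicts "no churn" for every customer. With the test-set class counts from the reports above (1593 non-churners, 407 churners), it scores roughly 80% accuracy while catching no churners at all:

```python
from sklearn.metrics import accuracy_score, recall_score

# Test-set class counts taken from the classification reports above.
y_true = [0] * 1593 + [1] * 407

# Hypothetical baseline: always predict the majority class (no churn).
y_pred = [0] * len(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label=1)
print(f"Accuracy: {acc:.4f}")        # 1593 / 2000 = 0.7965
print(f"Churn recall: {rec:.4f}")    # 0 of 407 churners found = 0.0000
```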

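**SVC provided the best recall and F1-score for churners** among the three models: recall 0.74 (versus 0.70 for Logistic Regression and 0.44 for Random Forest) and F1 0.59 (versus 0.50 and 0.56). It therefore catches the largest share of actual churners.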

This aligns with the project objective: **prioritize detecting churners even if it introduces slightly more false positives.**

Thus, **SVC was selected as the recommended model**.
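
The comparison behind this choice can be made explicit by collecting the churn-class recall and F1 for each candidate into one table. A sketch on synthetic data (`make_classification` stands in for the processed churn features; the `candidates` dict and the column names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced data (about 20% positives), not the churn dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "SVC": SVC(class_weight="balanced", random_state=42),
}

rows = []
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        # Metrics for the positive (churn) class only.
        "recall_churn": recall_score(y_te, y_hat, pos_label=1),
        "f1_churn": f1_score(y_te, y_hat, pos_label=1),
    })

summary = pd.DataFrame(rows).sort_values("recall_churn", ascending=False)
print(summary)
```

Sorting by `recall_churn` puts the model that misses the fewest churners first, which matches the selection criterion stated above.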

In [17]:
final_model = SVC(class_weight='balanced', random_state=42)

final_model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| C | 1.0 |
| kernel | 'rbf' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | False |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | 'balanced' |


## Test on New Data Point

In [67]:
new_customer = {
    "credit_score": 300,
    "gender": 1,     # Male
    "age": 42,
    "tenure": 5,
    "balance": 75000,
    "products_number": 2,
    "credit_card": 1,
    "active_member": 0,
    "estimated_salary": 45000,
    "country_Germany": 1,
    "country_Spain": 0
}

In [68]:
new_df = pd.DataFrame([new_customer])

num_features = ['credit_score', 'age', 'tenure', 'balance',
                'products_number', 'estimated_salary']

In [69]:
import pickle

with open('../models/scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)



In [70]:
new_df[num_features] = loaded_scaler.transform(new_df[num_features])

In [73]:
prediction = final_model.predict(new_df)[0]

print("Prediction (0=no churn, 1=churn):", prediction)

Prediction (0=no churn, 1=churn): 1
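
The scaling and prediction steps above could be wrapped in one function so that every new customer goes through the same preprocessing. The sketch below is an assumption about how that might look: `predict_churn` is a hypothetical helper, and the toy training data, reduced feature list, and freshly fitted scaler stand in for `X_train` and `scaler.pkl`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy training data (made up); the notebook itself uses X_train and scaler.pkl.
feature_order = ["age", "balance", "active_member"]
num_features = ["age", "balance"]
train = pd.DataFrame({
    "age": [25.0, 40.0, 60.0, 35.0, 50.0, 30.0],
    "balance": [0.0, 50000.0, 120000.0, 80000.0, 10000.0, 95000.0],
    "active_member": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})
labels = [0, 1, 1, 0, 0, 1]

scaler = StandardScaler().fit(train[num_features])
train[num_features] = scaler.transform(train[num_features])
model = SVC(class_weight="balanced", random_state=42).fit(train[feature_order], labels)


def predict_churn(customer, scaler, model, feature_order, num_features):
    """Hypothetical helper: scale numeric fields, then predict 0 or 1."""
    # Build the row in the training column order, regardless of dict order.
    row = pd.DataFrame([customer], columns=feature_order).astype(float)
    row[num_features] = scaler.transform(row[num_features])
    return int(model.predict(row)[0])


pred = predict_churn(
    {"active_member": 0, "balance": 75000, "age": 42},
    scaler, model, feature_order, num_features,
)
print("Prediction (0=no churn, 1=churn):", pred)
```

Passing `columns=feature_order` when building the row keeps the column order aligned with what the model saw during training, which scikit-learn checks when it was fitted on a DataFrame.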
