# 📊 Customer Churn – Model Training & Evaluation

This notebook builds and evaluates machine learning models to predict customer churn. It includes:

- Importing the preprocessed dataset  
- Splitting into training and test sets  
- Training two models (Logistic Regression as the baseline, then XGBoost)
- Evaluating performance with accuracy, a confusion matrix, and a classification report
- Interpreting feature importance  


In [1]:
# 📥 Load cleaned dataset and split for training

import pandas as pd
from sklearn.model_selection import train_test_split

# Load preprocessed dataset
df = pd.read_csv("../data/customer_churn_cleaned.csv")

# Separate features and target
X = df.drop("Churn_Yes", axis=1)
y = df["Churn_Yes"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Confirm shape
print(f"Train set: {X_train.shape}, Test set: {X_test.shape}")


Train set: (86, 30), Test set: (22, 30)
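
As a quick check of what `stratify=y` does, the sketch below splits a synthetic 108-row dataset with the same `test_size` and `random_state` as the cell above and compares the positive-class rate in each split. The data, the 50/50 class balance, and the names `X_demo` and `y_demo` are assumptions made for illustration; they are not taken from `customer_churn_cleaned.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 108 rows with an assumed 50/50 class balance.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(108, 3))
y_demo = np.array([0] * 54 + [1] * 54)

# Same split settings as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, both splits keep the class proportions of y_demo.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Sizes: train={len(y_tr)}, test={len(y_te)}")
print(f"Positive rate: train={train_rate:.2f}, test={test_rate:.2f}")
```

Without `stratify`, a test set this small could by chance end up with noticeably more of one class than the other, which would make the per-class metrics harder to compare.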


## 🤖 Train Logistic Regression Model

In this section, we train a baseline Logistic Regression model on the processed customer churn dataset.

We evaluate the model's performance using:
- Accuracy
- Confusion matrix
- Classification report (Precision, Recall, F1-score)

This provides a benchmark for the XGBoost model trained in the next section.


In [2]:
# 🤖 Train Logistic Regression model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize and train model
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test)

# 📊 Evaluate performance
print("🔎 Logistic Regression Evaluation")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lr))


🔎 Logistic Regression Evaluation
Accuracy: 0.4090909090909091

Confusion Matrix:
 [[3 8]
 [5 6]]

Classification Report:
               precision    recall  f1-score   support

       False       0.38      0.27      0.32        11
        True       0.43      0.55      0.48        11

    accuracy                           0.41        22
   macro avg       0.40      0.41      0.40        22
weighted avg       0.40      0.41      0.40        22



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
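
The warning above suggests scaling the data. One way to do that, sketched here on synthetic data, is to put a `StandardScaler` in front of `LogisticRegression` in a scikit-learn pipeline. The dataset from `make_classification`, the inflated first column, and the name `scaled_lr` are assumptions for illustration; whether scaling changes the results on the churn data is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 30 features, standing in for X and y from the notebook.
X_demo, y_demo = make_classification(
    n_samples=108, n_features=30, n_informative=5, random_state=42
)
# Inflate one column to mimic an unscaled numeric feature (hypothetical).
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is fitted on the training split only and reused at predict time.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_tr, y_tr)

# n_iter_ reports how many solver iterations were actually used.
n_iter = int(scaled_lr[-1].n_iter_[0])
test_accuracy = scaled_lr.score(X_te, y_te)
print(f"Solver iterations used: {n_iter}")
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

Keeping the scaler inside the pipeline means the test split never influences the scaling statistics, which would not be guaranteed if the whole dataset were scaled before splitting.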


## ⚡ Train XGBoost Model

In this step, we train an XGBoost classifier to predict customer churn. XGBoost is a gradient boosting library that builds an ensemble of decision trees sequentially, with each new tree correcting the errors of the trees before it.

We’ll compare its results with our Logistic Regression baseline using:
- Accuracy
- Confusion matrix
- Classification report


In [3]:
# ⚡ Train and Evaluate XGBoost Model
from xgboost import XGBClassifier

# Initialize and train the model
# use_label_encoder is omitted; the XGBoost version used here reports it as unused
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on test set
y_pred_xgb = xgb_model.predict(X_test)

# 🧠 Evaluate performance
print("🔍 XGBoost Evaluation")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))


🔍 XGBoost Evaluation
Accuracy: 0.6363636363636364

Confusion Matrix:
 [[6 5]
 [3 8]]

Classification Report:
               precision    recall  f1-score   support

       False       0.67      0.55      0.60        11
        True       0.62      0.73      0.67        11

    accuracy                           0.64        22
   macro avg       0.64      0.64      0.63        22
weighted avg       0.64      0.64      0.63        22



Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


## ✅ Final Model Comparison Summary

We trained and evaluated two classification models on the preprocessed customer churn dataset:

| Model               | Accuracy | Precision (macro avg) | Recall (macro avg) | F1-score (macro avg) |
|---------------------|----------|-----------------------|--------------------|----------------------|
| Logistic Regression | 0.41     | 0.40                  | 0.41               | 0.40                 |
| XGBoost             | 0.64     | 0.64                  | 0.64               | 0.63                 |

🔍 **Conclusion**:  
XGBoost outperformed Logistic Regression on every evaluation metric on the 22-row test set. We recommend XGBoost for further hyperparameter tuning or production deployment.
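
The summary figures can be rechecked by hand. The sketch below recomputes accuracy and the macro-averaged precision, recall, and F1-score from the two confusion matrices printed earlier in the notebook. The helper name `macro_metrics` is hypothetical; the matrices use scikit-learn's layout, with rows as true labels and columns as predictions.

```python
def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, F1 for a square matrix.

    cm[i][j] counts samples whose true label is i and predicted label is j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    # Macro average: unweighted mean of the per-class scores.
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Confusion matrices copied from the notebook outputs.
lr_scores = macro_metrics([[3, 8], [5, 6]])
xgb_scores = macro_metrics([[6, 5], [3, 8]])

for name, scores in [("Logistic Regression", lr_scores), ("XGBoost", xgb_scores)]:
    rounded = {metric: round(value, 2) for metric, value in scores.items()}
    print(name, rounded)
```

Because both classes have the same support (11 rows each), the macro and weighted averages coincide here; on an imbalanced test set they would differ.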

In [4]:
# 📤 Export Feature Importances from XGBoost
import pandas as pd

# Create DataFrame with feature names and their importances
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': xgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Export to CSV inside the data folder
importance_df.to_csv('../data/feature_importance.csv', index=False)

# Display the top 10 features for verification
importance_df.head(10)


|    | Feature                              | Importance |
|----|--------------------------------------|------------|
| 17 | DeviceProtection_Yes                 | 0.098697   |
| 22 | StreamingMovies_No internet service  | 0.081416   |
| 20 | StreamingTV_No internet service      | 0.07774    |
| 16 | DeviceProtection_No internet service | 0.077368   |
| 15 | OnlineBackup_Yes                     | 0.070418   |
| 11 | InternetService_No                   | 0.05923    |
| 10 | InternetService_Fiber optic          | 0.054407   |
| 18 | TechSupport_No internet service      | 0.052591   |
| 8  | MultipleLines_No phone service       | 0.046674   |
| 23 | StreamingMovies_Yes                  | 0.039845   |
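
Several of the top rows are one-hot columns derived from the same source feature, so it can help to sum importances per source feature when interpreting them. The sketch below does this for the ten rows shown above. It assumes column names follow a `<feature>_<category>` pattern with no underscore inside the category, and the helper `base_feature` is hypothetical. The totals cover only these ten rows, not the full model.

```python
from collections import defaultdict

# Top-10 rows copied from the output above: (one-hot column, importance).
top_features = [
    ("DeviceProtection_Yes", 0.098697),
    ("StreamingMovies_No internet service", 0.081416),
    ("StreamingTV_No internet service", 0.07774),
    ("DeviceProtection_No internet service", 0.077368),
    ("OnlineBackup_Yes", 0.070418),
    ("InternetService_No", 0.05923),
    ("InternetService_Fiber optic", 0.054407),
    ("TechSupport_No internet service", 0.052591),
    ("MultipleLines_No phone service", 0.046674),
    ("StreamingMovies_Yes", 0.039845),
]


def base_feature(column):
    """Strip the category suffix, assuming '<feature>_<category>' names."""
    return column.rsplit("_", 1)[0]


# Sum the importances of all one-hot columns that share a source feature.
grouped = defaultdict(float)
for column, importance in top_features:
    grouped[base_feature(column)] += importance

ranked = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
for name, total in ranked:
    print(f"{name:<18}{total:.6f}")
```

To run the same grouping on the full `importance_df`, the `Feature` column could be mapped through `base_feature` and summed with a pandas `groupby`.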
