**Programmer: python_scripts (Abhijith Warrier)**

**PYTHON SCRIPT TO *PREDICT CUSTOMER CHURN USING RANDOM FOREST WITH IMBALANCED DATA HANDLING (SMOTE)*. 🧠📊🤖**

This script demonstrates a **churn prediction pipeline** on simulated data, where customer churn is treated as an **imbalanced classification problem**. We use **SMOTE** to balance the training set and a **RandomForestClassifier** to model non-linear relationships in customer behavior.

---

## **📦 Install Required Packages**

**Install ML, preprocessing, and imbalance-handling libraries.**

In [None]:
pip install pandas numpy scikit-learn imbalanced-learn matplotlib

---

## **🧩 Load and Inspect the Dataset**

**We simulate a churn-like dataset with `make_classification`: 5,000 customers, 10 numeric features, and roughly 25% churners.**

In [None]:
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.75, 0.25],   # churn imbalance: ~75% non-churn (0), ~25% churn (1)
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["churn"] = y

This mimics real churn datasets where **non-churn customers dominate**.
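The imbalance can be checked directly before going further. The cell below is a minimal sketch that rebuilds the same synthetic data so it runs on its own; the name `class_share` is introduced here only for illustration. Because `make_classification` randomly reassigns about 1% of labels by default (`flip_y=0.01`), the shares are close to, but not exactly, 75/25.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Rebuild the same synthetic dataset so this cell is self-contained.
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.75, 0.25],
    random_state=42,
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["churn"] = y

# Fraction of rows per class; class 1 (churn) should be the minority.
class_share = df["churn"].value_counts(normalize=True).sort_index()
print(class_share)
```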

---

## **✂️ Train/Test Split**

**Split the data while preserving class distribution.**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("churn", axis=1),
    df["churn"],
    test_size=0.3,
    stratify=df["churn"],
    random_state=42
)
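To confirm that `stratify` kept the churn rate the same in both splits, compare the positive-class fraction in each. This is a small self-contained sketch on the same synthetic data; `train_rate` and `test_rate` are illustrative names, not part of the original script.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic data and split settings as in the cells above.
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.75, 0.25],
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# With labels coded 0/1, the mean is the churn rate of each split.
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"churn rate  train: {train_rate:.3f}  test: {test_rate:.3f}")
```

Without `stratify`, the two rates could drift apart by chance, which makes test metrics harder to compare with training behavior.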

---

## **⚖️ Handle Class Imbalance with SMOTE**

**SMOTE generates synthetic minority samples by interpolating between existing minority samples and their nearest neighbors.**

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Resampling only the training split reduces the model's bias toward the majority class while leaving the test set at its original class distribution.
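To make the idea behind SMOTE concrete, here is a toy NumPy sketch of its core step: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the segment between them. The function `smote_sketch` and its parameters are hypothetical and written for illustration only; the `SMOTE` class in imbalanced-learn handles sampling strategy and neighbor search differently in detail.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)               # random minority points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]     # one neighbor per point
    gap = rng.random((n_new, 1))                                 # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 samples with 2 features.
rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=40)
print(X_new.shape)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies rather than duplicating rows exactly.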

---

## **🌲 Train a Random Forest Classifier**

**Random Forest handles non-linear feature interactions effectively.**

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)

model.fit(X_resampled, y_resampled)

---

## **📊 Evaluate the Model**

**Evaluate performance using metrics suitable for imbalanced data.**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Accuracy alone is not enough: focus on **recall and F1-score for churned users**.
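A small worked example shows why accuracy is misleading here. The labels below are hypothetical (100 customers with the same 75/25 split as the simulated data), and the "model" simply predicts that nobody churns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical test set: 75 non-churners (0) and 25 churners (1).
y_true = np.array([0] * 75 + [1] * 25)
y_naive = np.zeros_like(y_true)   # baseline that always predicts "no churn"

acc = accuracy_score(y_true, y_naive)
rec = recall_score(y_true, y_naive, pos_label=1, zero_division=0)
f1 = f1_score(y_true, y_naive, pos_label=1, zero_division=0)

print(f"accuracy={acc:.2f}  churn recall={rec:.2f}  churn F1={f1:.2f}")
# accuracy=0.75  churn recall=0.00  churn F1=0.00
```

The baseline scores 75% accuracy while catching no churners at all, so any churn model should be judged against that floor using the churn-class metrics.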

---

## **🔍 Feature Importance Analysis**

**Understand which features influence churn predictions.**

In [None]:
import pandas as pd

importances = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

print(importances.head())
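The `feature_importances_` attribute is impurity-based and computed on the training data, so it can be worth cross-checking against permutation importance on the held-out set. The sketch below is self-contained and makes two simplifying assumptions to stay short and fast: it uses a smaller forest (`n_estimators=50`) and skips the SMOTE step. The names `rf` and `perm_importances` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same synthetic data and split settings as the main pipeline.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=6, n_redundant=2,
    weights=[0.75, 0.25], random_state=42,
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumption: smaller forest and no SMOTE, only to keep this sketch quick.
rf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)
rf.fit(X_train, y_train)

# Drop in churn-class F1 on the test set when each feature is shuffled.
result = permutation_importance(
    rf, X_test, y_test, scoring="f1", n_repeats=5, random_state=42
)
perm_importances = pd.Series(
    result.importances_mean, index=X.columns
).sort_values(ascending=False)
print(perm_importances.head())
```

Features that rank high in both views are the more trustworthy drivers of the predictions; large disagreements are a cue to look more closely.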

---

## **🧪 Why This Matters in the Real World**

- Churn datasets are almost always imbalanced
- Predicting churn enables proactive retention strategies
- SMOTE can improve minority-class recall by giving the model more churn examples to learn from
- Random Forest captures complex patterns in customer behavior

---

## **Key Takeaways**

1. Customer churn prediction is an imbalanced classification problem.
2. SMOTE can help models learn minority class patterns; apply it to the training split only.
3. Random Forest performs well on tabular, non-linear data.
4. Evaluation should focus on recall and F1-score, not just accuracy.
5. Feature importance adds interpretability to churn models.

---