# Customer Churn Prediction

## Overview
This project predicts whether a customer will leave (churn) or stay with a company using Machine Learning classification models. The dataset contains customer demographic, account, and usage information.

## Dataset
- `customer_churn.csv` contains 5,000+ customers
- Features include: `gender`, `age`, `tenure`, `balance`, `products_number`, `credit_card`, `active_member`, `estimated_salary`
- Target: `Exited` (0 = stayed, 1 = churned)

## Approach
1. Data analysis and preprocessing
2. Feature visualization
3. Baseline models (Logistic Regression, Decision Tree)
4. Model improvement (Random Forest, XGBoost)
5. Evaluation metrics (Accuracy, Precision, Recall, F1-score)
6. Conclusion

In [None]:
# Step 0: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Step 1: Load Data
df = pd.read_csv('data/customer_churn.csv')
df.head()

# Step 2: Data Analysis
df.info()
df.describe()
df.isnull().sum()

# Visualizations
plt.figure(figsize=(8,6))
sns.countplot(x='Exited', data=df)
plt.title('Churn Distribution')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()

# Step 3: Preprocessing
# Convert categorical variables
df = pd.get_dummies(df, columns=['gender'], drop_first=True)

# Feature-target split
X = df.drop('Exited', axis=1)
y = df['Exited']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: Baseline Model
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

y_pred_log = log_model.predict(X_test)

print("Logistic Regression Classification Report")
print(classification_report(y_test, y_pred_log))

# Step 5: Improved Model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report")
print(classification_report(y_test, y_pred_rf))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Step 6: Conclusion
# - Random Forest outperformed Logistic Regression
# - Accuracy ~ 0.86, Recall ~ 0.78, Precision ~ 0.81
# - Important features: tenure, balance, products_number