# Customer Churn Prediction - Portfolio Project

This notebook uses the **Telco Customer Churn dataset** from Kaggle to demonstrate an end-to-end machine learning pipeline for predicting customer churn.

**Portfolio Flow:**
1. Intro / Problem Statement
2. Dataset Overview (table + stats)
3. EDA Visuals (plots)
4. Feature Engineering Summary
5. Model Comparison Table
6. Best Model Results (confusion matrix, ROC, F1)
7. Feature Importance Plot
8. Business Insights & Recommendations
9. Optional Interactive Demo (Streamlit)

## 1. Intro / Problem Statement

**Problem:** Predict which customers are likely to churn (cancel service) in a telecommunications company.

**Goal:** Build a machine learning pipeline to proactively identify high-risk customers and recommend retention strategies.

## 2. Dataset Overview

We use the Telco Customer Churn dataset, downloaded via KaggleHub. It contains demographic, account, service, and usage information for 7,043 customers.

In [ ]:
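import glob
import os

import kagglehub
import pandas as pd

# Download dataset
path = kagglehub.dataset_download('blastchar/telco-customer-churn')
print('Path to dataset files:', path)

# Load the CSV. Locate it by extension rather than hard-coding the file name,
# which may not match the dataset slug.
csv_path = glob.glob(os.path.join(path, '*.csv'))[0]
df = pd.read_csv(csv_path)
df.head()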

In [ ]:
# Convert TotalCharges to numeric; non-numeric entries become NaN and are filled with the median
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Dataset info
df.info()

## 3. Exploratory Data Analysis (EDA)

Visualize churn distribution and key feature relationships.
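
One way to start is with the churn rate per category. The sketch below is an assumption about how that could look: the helper name `churn_rate_by` and the toy frame are illustrative and not part of the original notebook. It assumes `Churn` still holds the raw `'Yes'`/`'No'` strings, since encoding only happens in section 4.

```python
import pandas as pd


def churn_rate_by(frame, column, target="Churn", positive="Yes"):
    """Share of churned customers within each value of `column`."""
    churned = frame[target].eq(positive)  # boolean Series: True where the customer churned
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)


# Toy data using Telco-style column names (the values are made up).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})
rates = churn_rate_by(toy, "Contract")
print(rates)
```

In the notebook, `churn_rate_by(df, 'Contract').plot(kind='bar')` followed by `plt.show()` (with `matplotlib.pyplot` imported as `plt`) would draw these rates as a bar chart, and `df['Churn'].value_counts()` gives the overall class balance.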

## 4. Feature Engineering

- Encode Yes/No columns as 1/0
- Bucket tenure into categories
- Flag whether the customer has internet service

In [ ]:
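# Encode Yes/No columns as 1/0.
# The internet add-on columns also contain 'No internet service'; map it to 0
# so the encoding does not leave NaN values that break model fitting.
yes_no_cols = ['Partner','Dependents','PhoneService','PaperlessBilling','Churn',
               'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
for col in yes_no_cols:
    df[col] = df[col].map({'Yes':1,'No':0,'No internet service':0})

# Tenure categories (include_lowest=True keeps customers with tenure 0 in the first bin)
df['tenure_category'] = pd.cut(df['tenure'], bins=[0,12,24,48,72], labels=['0-12','12-24','24-48','48-72'], include_lowest=True)

# Has internet service
df['has_internet'] = df['InternetService'].apply(lambda x: 0 if x=='No' else 1)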

df.head()

## 5. Model Training & Comparison

Train Logistic Regression, Random Forest, Gradient Boosting and compare metrics.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score, f1_score

# Features and target
X = df.drop(['customerID','Churn'], axis=1).select_dtypes(include=['int64','float64'])
y = df['Churn']

# Split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:,1]
    results[name] = {
        'model': model,
        'y_pred': y_pred,
        'y_proba': y_proba,
        'roc_auc': roc_auc_score(y_test, y_proba),
        'f1_score': f1_score(y_test, y_pred),
        'accuracy': model.score(X_test_scaled, y_test),
    }

results_df = pd.DataFrame([{**{'Model':k}, **v} for k,v in results.items()])[['Model','accuracy','roc_auc','f1_score']]
results_df

## 6. Best Model Results & Metrics

Confusion matrix, classification report, and ROC curve for the model with the highest ROC AUC.
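
For reference, the precision, recall, and F1 figures in a classification report all derive from the confusion-matrix counts. A small sketch with made-up counts (these are not results from this notebook, and the function name is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive (churned) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 60 churners caught, 20 false alarms, 40 churners missed.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

For churn, recall on the churned class is often the number to watch, because a missed churner usually costs more than an unnecessary retention offer.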

In [ ]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

best_model_name = results_df.sort_values('roc_auc', ascending=False)['Model'].iloc[0]
best_model = results[best_model_name]['model']
y_pred_best = results[best_model_name]['y_pred']
y_proba_best = results[best_model_name]['y_proba']

print(classification_report(y_test, y_pred_best, target_names=['Retained','Churned']))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_best, display_labels=['Retained','Churned'], cmap='Blues')
plt.show()
RocCurveDisplay.from_predictions(y_test, y_proba_best)
plt.show()

## 7. Feature Importance

Display the top features influencing churn prediction.

In [ ]:
import seaborn as sns

# Linear models expose coefficients; tree ensembles expose feature_importances_
if best_model_name == 'Logistic Regression':
    importances = best_model.coef_[0]
else:
    importances = best_model.feature_importances_

feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importance['Importance'] = feature_importance['Importance'].abs()
feature_importance = feature_importance.sort_values('Importance', ascending=False)

plt.figure(figsize=(8,6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(15))
plt.title('Top 15 Feature Importances')
plt.show()

## 8. Business Insights & Recommendations

Flag high-risk customers in the test set and estimate the net benefit of a retention campaign under an assumed save rate and cost per contact.

In [ ]:
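# Customers in the test set whose predicted churn probability is at least the threshold
high_risk_threshold = 0.7
high_risk_customers = (y_proba_best>=high_risk_threshold).sum()

# Mean TotalCharges serves as a rough proxy for the value of one customer
avg_customer_value = df['TotalCharges'].mean()

# Assumptions: share of contacted customers retained, and cost of one intervention
retention_rate = 0.25
retention_cost_per_customer = 50

potential_saves = high_risk_customers*retention_rate*avg_customer_value
intervention_cost = high_risk_customers*retention_cost_per_customer
net_benefit = potential_saves - intervention_cost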

print(f'High-Risk Customers: {high_risk_customers}')
print(f'Potential Saves: ${potential_saves:,.2f}')
print(f'Net Benefit: ${net_benefit:,.2f}')
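
The 0.7 cutoff is one choice among many. The sketch below shows how the same arithmetic could be wrapped in a function to compare thresholds. The function name, the probabilities, and the dollar figures are illustrative assumptions, not values from the notebook.

```python
def campaign_net_benefit(probabilities, threshold, avg_value, save_rate, cost_per_customer):
    """Return (customers targeted, net benefit) when contacting everyone scored at or above `threshold`."""
    targeted = sum(1 for p in probabilities if p >= threshold)
    expected_saves = targeted * save_rate * avg_value
    campaign_cost = targeted * cost_per_customer
    return targeted, expected_saves - campaign_cost


# Made-up churn probabilities for six customers.
scores = [0.95, 0.80, 0.72, 0.55, 0.30, 0.10]
for threshold in (0.5, 0.7, 0.9):
    targeted, net = campaign_net_benefit(
        scores, threshold, avg_value=2000, save_rate=0.25, cost_per_customer=50
    )
    print(f"threshold={threshold}: targeted={targeted}, net benefit=${net:,.2f}")
```

With the notebook's variables, `probabilities` would be `y_proba_best` and `avg_value` would be `df['TotalCharges'].mean()`. Like the cell above, this treats every targeted customer as someone who would otherwise churn, so it likely overstates the benefit at lower thresholds.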

## 9. Optional Interactive Demo

This notebook can be converted into a **Streamlit app** for live predictions and interactive visualizations.
Example: `streamlit run telco_churn_app.py`