# Predicting Customer Churn Using Machine Learning

This project uses two machine learning models, Logistic Regression and Random Forest, to predict customer churn from demographic and usage data. By building and evaluating both models, we aim to identify the key factors behind churn and compare how well each algorithm performs.

This analysis was completed as part of a **DataCamp project**, adapted and extended for personal skill development in **classification modeling, feature engineering, and model evaluation techniques** relevant to health-related predictive modeling.

**Key Skills Applied:**
- Data preprocessing and feature scaling
- One-hot encoding of categorical variables
- Classification modeling with Logistic Regression and Random Forest
- Model evaluation using classification reports, confusion matrices, and accuracy metrics

**Objective:** To develop a machine learning pipeline for predicting binary outcomes (churn vs. no churn), with workflows analogous to real-world health event prediction.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

## Data Loading and Merging
We load two datasets containing customer demographic and usage information, merge them on `customer_id` into a single DataFrame, and compute the overall churn rate.

In [None]:
# Load datasets
telecom_demographics = pd.read_csv('telecom_demographics.csv')  
telecom_usage = pd.read_csv('telecom_usage.csv')  

# Merge on 'customer_id'
churn_df = pd.merge(telecom_demographics, telecom_usage, on='customer_id')

# Churn proportion
churn_proportion = churn_df['churn'].mean()  
print(f"Proportion of customers who have churned: {churn_proportion:.4f}")
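
Before relying on the merged frame, it can help to confirm that each customer appears exactly once in both files. The sketch below uses two small made-up frames (the `age` and `calls_made` columns and all values are illustrative assumptions, not the real data) to show how the `validate` and `indicator` arguments of `pd.merge` surface duplicate or unmatched keys.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; values are illustrative only.
demographics = pd.DataFrame({"customer_id": [1, 2, 3, 4], "age": [34, 51, 27, 45]})
usage = pd.DataFrame(
    {"customer_id": [1, 2, 3, 5], "calls_made": [10, 3, 7, 12], "churn": [0, 1, 0, 1]}
)

# validate="one_to_one" raises MergeError if customer_id repeats on either side.
# how="outer" with indicator=True adds a `_merge` column showing each row's origin.
check = pd.merge(
    demographics, usage, on="customer_id", how="outer",
    validate="one_to_one", indicator=True,
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])

# The default inner join keeps only customers present in both frames.
merged = pd.merge(demographics, usage, on="customer_id", validate="one_to_one")
print(f"Rows kept: {len(merged)} of {len(check)} distinct customers")
print(f"Churn proportion: {merged['churn'].mean():.4f}")
```

If many rows turn out to be unmatched, the churn proportion computed on the inner join may not represent the full customer base.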

## Data Preprocessing
Categorical variables are one-hot encoded, and all features (including the encoded columns) are then standardized to zero mean and unit variance.

In [None]:
# Identify categorical columns
categorical_vars = churn_df.select_dtypes(include=['object']).columns.tolist()

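# One-hot encode, dropping the first level of each variable to avoid redundant columns
# (`sparse_output` replaces `sparse`, which was deprecated in scikit-learn 1.2 and removed in 1.4)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_vars = encoder.fit_transform(churn_df[categorical_vars])
# Reuse churn_df's index so the later concat aligns rows correctly
encoded_df = pd.DataFrame(encoded_vars, columns=encoder.get_feature_names_out(categorical_vars), index=churn_df.index)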

# Merge encoded and drop originals
churn_df = churn_df.drop(categorical_vars, axis=1)
churn_df = pd.concat([churn_df, encoded_df], axis=1)

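# Scale features (note: fitting the scaler on the full dataset before splitting
# lets test-set statistics influence the scaling of the training data)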
scaler = StandardScaler()
target = churn_df['churn']
features = churn_df.drop(['customer_id', 'churn'], axis=1)
features_scaled = scaler.fit_transform(features)
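
One common way to keep test rows out of the preprocessing statistics is to wrap the encoder and scaler in a scikit-learn `Pipeline` and fit it on the training split only. The following is a minimal sketch on synthetic data, not the notebook's method: the column names (`age`, `data_used`, `state`) and values are assumptions, and it uses `handle_unknown="ignore"` instead of `drop='first'` so that unseen categories in the test set do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names and values are illustrative assumptions.
rng = np.random.default_rng(42)
n = 200
demo_df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "data_used": rng.normal(5000, 1500, size=n),
    "state": rng.choice(["A", "B", "C"], size=n),
    "churn": rng.integers(0, 2, size=n),
})
numeric_cols = ["age", "data_used"]
categorical_cols = ["state"]

X = demo_df.drop(columns="churn")
y = demo_df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(random_state=42)),
])

# fit() learns scaling statistics and categories from the training rows only;
# score() applies those learned transforms to the held-out rows.
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.4f}")
```

Because the labels here are random, the printed accuracy carries no meaning; the point is only the order of operations.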

## Train-Test Split
We hold out 20% of the data as a test set, using a fixed random seed for reproducibility, and train on the remaining 80%.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
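
If churners are a minority, a purely random split can leave the test set with a different churn rate than the training set. A possible refinement, sketched here on synthetic labels (the 80/20 class balance is an assumption for illustration), is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 non-churners (0) and 20 churners (1).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the 20% churn rate of the full data.
print(f"Train churn rate: {y_tr.mean():.2f}")
print(f"Test churn rate:  {y_te.mean():.2f}")
```

In the notebook, the equivalent change would be passing `stratify=target` to the existing call.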

## Model Training
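Two classifiers, Logistic Regression and Random Forest, are trained with scikit-learn's default hyperparameters and a fixed random seed, then used to predict churn on the test set.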

In [None]:
# Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
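
A single train/test split gives one accuracy estimate per model, which can vary with the random seed. As a hedged illustration, the sketch below compares the same two model types with 5-fold cross-validation on synthetic data from `make_classification`; the dataset and the reduced `n_estimators=50` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the scaled churn features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, estimator in candidates.items():
    # Five folds: each fold serves once as the validation set.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```

The standard deviation across folds gives a rough sense of whether a gap between the two models is larger than fold-to-fold noise.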

## Model Evaluation
We assess model performance using classification reports, confusion matrices, and accuracy scores.

In [None]:
# Logistic Regression Evaluation
print("Logistic Regression Report:")
print(classification_report(y_test, logreg_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, logreg_pred))

# Random Forest Evaluation
print("\nRandom Forest Report:")
print(classification_report(y_test, rf_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, rf_pred))

# Accuracy comparison
logreg_accuracy = (logreg_pred == y_test).mean()
rf_accuracy = (rf_pred == y_test).mean()

print(f"\nLogistic Regression Accuracy: {logreg_accuracy:.4f}")
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")

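# Name the more accurate model, reporting a tie explicitly
if logreg_accuracy == rf_accuracy:
    higher_accuracy = "Tie"
else:
    higher_accuracy = "Logistic Regression" if logreg_accuracy > rf_accuracy else "Random Forest"
print("Model with higher accuracy:", higher_accuracy)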
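
To make the confusion matrix easier to read, the following sketch unpacks one into its four cells and recomputes accuracy, precision, and recall by hand. The labels are a small hand-made example (an assumption for illustration, not output from the models above), with 1 meaning churned.

```python
from sklearn.metrics import confusion_matrix

# Small hand-made example; 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predictions. For binary labels [0, 1],
# ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of customers predicted to churn, the share who did
recall = tp / (tp + fn)     # of customers who churned, the share the model caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# Prints:
# TN=5 FP=1 FN=1 TP=3
# accuracy=0.80 precision=0.75 recall=0.75
```

For churn, recall on the positive class is often the figure of interest, since a missed churner (a false negative) is a customer the business never gets the chance to retain.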


## Conclusion
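Both models predicted churn on the held-out test set with reasonable accuracy, and the better-performing model was identified from accuracy together with the per-class precision, recall, and F1 scores in the classification reports. Because accuracy alone can be misleading when churners are a minority class, the per-class metrics deserve the most weight.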

**Next Steps:**  
- Tune hyperparameters for improved performance.  
- Evaluate model fairness across subgroups.  
- Explore explainability tools for actionable insights.
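
As a starting point for the first of these steps, hyperparameter tuning could be sketched with `GridSearchCV`. The example below runs on synthetic data, and the grid values and the choice of F1 as the scoring metric are illustrative assumptions rather than recommendations for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the churn training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately small, illustrative grid; useful ranges depend on the real data.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",  # F1 balances precision and recall on the positive class
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

The search should be fit on the training split only, with the test set kept aside for a final check of the selected model.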