# <center><h1>*Customer Churn Prediction Report*</h1></center>
## 1. Executive Summary
### Objective:
- This analysis aims to predict customer churn using various classification models, including Logistic Regression, Random Forest, and Gradient Boosting. The goal is to help the business proactively identify at-risk customers and implement targeted retention strategies.

### Key Findings:
- Customers with month-to-month contracts and higher monthly charges are more likely to churn.
- The Random Forest Classifier provided the best balance between interpretability and accuracy.
- Feature importance analysis shows that tenure, contract type, and payment method significantly impact churn.

### Next Steps:
- Optimize the best-performing model through hyperparameter tuning.
- Collect additional data on customer service interactions for improved predictive power.
- Implement a retention strategy targeting customers at high risk of churn.


This report aims to analyze customer churn using multiple classification models and provide insights for business decisions.

In [2]:
# Import Libraries and Load the Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the Telco Customer Churn dataset (adjust the path as needed)
data_path = "/kaggle/input/wa-fnusec-telcocustomerchurn/WA_Fn-UseC_-Telco-Customer-Churn.csv"
df = pd.read_csv(data_path)

# Quick glance at the first few rows
df.head()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## 4. Data Exploration and Preprocessing
### Exploratory Data Analysis (EDA):
- The dataset contains 7,043 records and 21 columns (including the customerID identifier and the Churn target).
- Missing values were found in TotalCharges (which was converted to numeric and imputed).
- The target variable (Churn) is imbalanced (27% churn, 73% non-churn), which may affect model performance.
- Categorical features were encoded using one-hot encoding.
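
The checks behind these bullets can be sketched as follows. This is a minimal illustration on a tiny hand-made frame: the values in `sample` are invented for the example (only the column names mirror the dataset), and it assumes that missing `TotalCharges` entries arrive as blank strings and are imputed with the column mean.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
sample = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15", "1840.75"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
    "Churn": ["No", "No", "Yes", "No"],
})

# Convert TotalCharges to numeric; entries that cannot be parsed become NaN
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")
missing_per_column = sample.isna().sum()

# Impute the gaps with the mean of the observed values (an assumption)
fill_value = sample["TotalCharges"].mean()
sample["TotalCharges"] = sample["TotalCharges"].fillna(fill_value)

# Class balance of the target as proportions
churn_share = sample["Churn"].value_counts(normalize=True)

print(missing_per_column)
print(churn_share)
```

On the real data the same three steps would be applied to `df` directly.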


We will analyze **missing values, categorical distributions, numerical summaries, and correlations** to understand data patterns.

In [1]:
#  Exploratory Data Analysis (EDA)

print("DataFrame Info:")
df.info()

print("\nSummary Statistics:")
print(df.describe())

print("\nChurn Value Counts:")
print(df['Churn'].value_counts())

# Optional: visualize churn distribution
sns.countplot(x='Churn', data=df)
plt.title("Churn Distribution")
plt.show()



### Data Cleaning & Feature Engineering:
- Dropped customerID (not useful for prediction).
- Converted Churn to 0/1 for machine learning models.
- Created dummy variables for categorical features.
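
The three cleaning steps listed above might look like this in code. It is a sketch, not the notebook's actual cell: the helper name `preprocess` and the `demo` frame are hypothetical, and `drop_first=True` is an assumed choice to avoid redundant dummy columns.

```python
import pandas as pd

def preprocess(raw: pd.DataFrame):
    """Return (X, y): drop the ID, map Churn to 0/1, one-hot encode the rest."""
    cleaned = raw.drop(columns=["customerID"])  # new frame, raw stays untouched
    # Map the target to 0/1
    y = cleaned.pop("Churn").map({"No": 0, "Yes": 1})
    # One-hot encode the categorical columns
    X = pd.get_dummies(cleaned, drop_first=True)
    return X, y

# Hypothetical three-row frame, for illustration only
demo = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})
X_demo, y_demo = preprocess(demo)
print(X_demo.columns.tolist())
print(y_demo.tolist())
```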


In [6]:
#  Define Features and Target, then Split the Data

# Features (X) and Target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split into training and testing (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")


Training set shape: (4930, 20)
Test set shape: (2113, 20)


## 5. Model Training & Evaluation
### Models Used:
- Logistic Regression - Baseline model.
- Random Forest Classifier - Captures non-linear relationships.
- Gradient Boosting Classifier - Higher predictive accuracy.

We will compare **Logistic Regression, Random Forest, and Gradient Boosting models** using accuracy, recall, precision, and ROC-AUC metrics.
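
One way to line up the four metrics side by side is sketched below. It runs on synthetic data from `make_classification` with roughly the same class imbalance, not on the Telco dataset, so the numbers it prints say nothing about the real models; the names `candidates` and `comparison` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with about 27% positives
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_te, proba),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

With an imbalanced target, recall and ROC-AUC are usually more informative than accuracy alone, which is why all four columns are kept in one table.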

In [10]:
# Import the model classes
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Define the models
logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# Fit the models (this fails until the features are numeric and free of NaNs)
logreg.fit(X_train, y_train)
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)

print("Models have been trained successfully!")


ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [9]:
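import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Coercing every column with pd.to_numeric(errors='coerce') turns the categorical
# columns into all-NaN columns whose mean is also NaN, so fillna cannot repair them.
# Encode the features instead (run this on the freshly split X_train / X_test).
def prepare(features):
    features = features.drop(columns=['customerID'])
    # Blank TotalCharges strings become NaN
    features['TotalCharges'] = pd.to_numeric(features['TotalCharges'], errors='coerce')
    # One-hot encode the remaining categorical columns
    return pd.get_dummies(features, drop_first=True)

X_train = prepare(X_train)
# Align the test columns with the training columns
X_test = prepare(X_test).reindex(columns=X_train.columns, fill_value=0)

# Fill the remaining NaNs with the training-set mean
total_mean = X_train['TotalCharges'].mean()
X_train['TotalCharges'] = X_train['TotalCharges'].fillna(total_mean)
X_test['TotalCharges'] = X_test['TotalCharges'].fillna(total_mean)

# Define the models in this cell so it does not depend on execution order
logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# Fit the models
logreg.fit(X_train, y_train)
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)

print("Models have been trained successfully!")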

In [None]:
#  Evaluate the Models

models = {
    'Logistic Regression': logreg,
    'Random Forest': rf,
    'Gradient Boosting': gb
}

for name, model in models.items():
    print(f"\n--- {name} ---")
    y_pred = model.predict(X_test)
    
    # Classification report
    print(classification_report(y_test, y_pred))
    
    # ROC-AUC score using predicted probabilities
    y_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"ROC-AUC Score: {auc:.4f}")
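
The feature-importance ranking mentioned in the Key Findings could be obtained from a fitted Random Forest along these lines. This is a minimal sketch on synthetic data: the `feature_*` column names are placeholders rather than columns from the dataset, and impurity-based `feature_importances_` is only one of several ways to rank features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names and synthetic data, for illustration only
feature_names = [f"feature_{i}" for i in range(6)]
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_frame = pd.DataFrame(X_syn, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_frame, y_syn)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X_frame.columns
).sort_values(ascending=False)
print(importances)
```

On the encoded Telco features the same two lines would apply to `rf` and `X_train.columns`, once the models have been fitted.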
