# 📊 Churn Prediction - ChurnX Telecom

**Fictional Company:** ChurnX Telecom  
**Project Type:** Supervised Classification  
**Author:** S Jones  
**Tools Used:** Python, Pandas, Scikit-learn, Matplotlib, Seaborn  

---

## 🏢 About the Company:
ChurnX Telecom is a fictional telecom provider offering phone and internet services. The company noticed a steady decline in its active user base and wants to investigate the factors that lead to customer churn (i.e., customers leaving the service).

---

## 🎯 Problem Statement:
The goal is to build a predictive model using historical customer data that can accurately identify customers who are likely to **churn**. This will help ChurnX proactively take retention actions and reduce customer loss.

---

## 📁 Dataset Overview:
- Dataset Source: [Kaggle - Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)
- Format: CSV
- Rows: 7043 customers
- Target Column: `Churn` (Yes/No)

---

## 📌 Objectives:
1. Explore and clean the dataset  
2. Analyze churn distribution and feature relationships  
3. Encode categorical data  
4. Train a machine learning model to predict churn  
5. Evaluate the model's performance  
6. Summarize findings and recommendations

---


## 📥 Step 1: Load and Preview the Dataset

We begin by importing necessary libraries and loading the Telco Customer Churn dataset.  
This gives us an initial understanding of the structure and types of features available.


In [1]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv('/content/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Preview
df.head()


|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## 🧾 Step 2: Dataset Overview

Let's inspect column types, null values, and statistical summaries to understand what cleaning may be required.


In [2]:
df.info()
df.describe()
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


| Column | Null count |
|---|---|
| customerID | 0 |
| gender | 0 |
| SeniorCitizen | 0 |
| Partner | 0 |
| Dependents | 0 |
| tenure | 0 |
| PhoneService | 0 |
| MultipleLines | 0 |
| InternetService | 0 |
| OnlineSecurity | 0 |
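
Objective 2 mentions the churn distribution, which the cells above do not compute. The following is a minimal sketch of that check on a small hypothetical frame (`toy` and its values are invented for illustration); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; only the `Churn` column matters here.
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "No"]})

# Absolute counts and proportions per class.
churn_counts = toy["Churn"].value_counts()
churn_share = toy["Churn"].value_counts(normalize=True)

print(churn_counts)
print(churn_share.round(3))
```

Knowing the class proportions up front makes it easier to judge later whether an accuracy figure is actually better than always predicting the majority class.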


## 🧼 Step 3: Data Cleaning – Fixing Data Types

The `TotalCharges` column should be numeric, but it's stored as an object.  
We’ll convert it to float and handle any non-numeric (blank) values by replacing them with the median.


In [3]:
# Convert 'TotalCharges' to numeric (some values are blank strings)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check how many became NaN
print("Missing TotalCharges after conversion:", df['TotalCharges'].isnull().sum())

# Replace NaNs with the median. Assign the result back instead of calling
# fillna(..., inplace=True) on a column selection, which triggers a pandas
# FutureWarning and will stop working in pandas 3.0.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Confirm it's now float
df.dtypes


Missing TotalCharges after conversion: 11


| Column | Dtype |
|---|---|
| customerID | object |
| gender | object |
| SeniorCitizen | int64 |
| Partner | object |
| Dependents | object |
| tenure | int64 |
| PhoneService | object |
| MultipleLines | object |
| InternetService | object |
| OnlineSecurity | object |
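
To make the conversion step concrete, here is a small sketch on a hypothetical series (`raw` and its four values are made up, with one blank entry standing in for the 11 blanks found above). It shows how `errors='coerce'` produces `NaN` and how the median fill then behaves.

```python
import pandas as pd

# Hypothetical values mimicking the raw column: numbers stored as text, one blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors='coerce' turns anything unparseable (here the blank string) into NaN.
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Series.median() skips NaN by default, so the fill value comes from the valid entries only.
filled = numeric.fillna(numeric.median())

print("missing after conversion:", n_missing)
print(filled.tolist())
```

One design note: computing the median on the full dataset, as this notebook does, lets a little test-set information leak into training. With only 11 affected rows the effect is likely negligible, but a stricter setup would compute the fill value on the training split only.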


## 🔡 Step 4: Encoding Categorical Variables

To prepare the data for modeling, we need to convert all categorical (object) columns into numerical form.  
We'll use Label Encoding for simplicity, and skip `customerID` since it's not a feature.


In [5]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Encode all object-type columns except 'customerID'
for col in df.select_dtypes(include='object').columns:
    if col != 'customerID':
        df[col] = le.fit_transform(df[col])

# Confirm encoding
df.head()


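|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.85 | 29.85 | 0 |
| 1 | 5575-GNVDE | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 0 | 2 | ... | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 56.95 | 1889.5 | 0 |
| 2 | 3668-QPYBK | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 53.85 | 108.15 | 1 |
| 3 | 7795-CFOCW | 1 | 0 | 0 | 0 | 45 | 0 | 1 | 0 | 2 | ... | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 42.3 | 1840.75 | 0 |
| 4 | 9237-HQITU | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.7 | 151.65 | 1 |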
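
Label encoding assigns integers in alphabetical order of the category names (in the output above, `Bank transfer (automatic)` became 0, `Electronic check` 2 and `Mailed check` 3), and a linear model will read those integers as magnitudes. One-hot encoding is a common alternative for nominal columns. The sketch below is not what this notebook uses; it only illustrates the idea with `pd.get_dummies` on a hypothetical three-column frame (`toy` is invented for the example).

```python
import pandas as pd

# Hypothetical mini-frame with two nominal columns and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Electronic check", "Credit card (automatic)"],
    "tenure": [1, 34, 60, 2],
})

# One indicator column per category; the numeric column passes through unchanged.
encoded = pd.get_dummies(toy, columns=["Contract", "PaymentMethod"], dtype=int)

print(encoded.columns.tolist())
print(encoded.head())
```

The trade-off is more columns in exchange for not implying an order between categories; for tree-based models the integer codes are usually less of a concern.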


## 🔄 Step 5: Define Features and Target Variable

We’ll drop `customerID` as it’s not useful for prediction, and define our target variable `Churn` (0 = No, 1 = Yes).


In [6]:
# Define target and features
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn']

# Confirm shapes
print("Features shape:", X.shape)
print("Target shape:", y.shape)


Features shape: (7043, 19)
Target shape: (7043,)


## ✂️ Step 6: Train-Test Split and Model Training

We’ll split the data into training and test sets, then train a **Logistic Regression** model to predict churn.  
This will help us evaluate how well the model generalizes to unseen data.


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
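
The warning above indicates that the solver hit `max_iter` before converging, which often happens when features sit on very different scales (here, `TotalCharges` runs into the thousands while most encoded columns are 0 to 3). A possible remedy, sketched below under assumptions, is to standardize the features inside a pipeline. The data is synthetic (`make_classification`), and `X_demo`, `y_demo` and `scaled_model` are hypothetical names; the sketch also stratifies the split, which the cell above does not.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y, with roughly 26% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=19, n_informative=8,
    weights=[0.74, 0.26], random_state=42,
)
# Inflate one feature's scale to imitate a column like TotalCharges.
X_demo[:, 0] *= 1000

# stratify keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only and reused on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("solver iterations used:", n_iter)
print("test accuracy:", scaled_model.score(X_te, y_te))
```

On the real data the same pattern would be `make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)`; whether it changes the reported metrics has not been checked here.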


## 📊 Step 7: Model Evaluation

Now we’ll evaluate how well the model performs on the test set using classification metrics.  
These include accuracy, precision, recall, and F1-score. Recall on the churn class deserves particular attention, since every missed churner is a customer the retention team never contacts.


In [8]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nAccuracy Score:", accuracy_score(y_test, y_pred))


Confusion Matrix:
[[935 101]
 [158 215]]

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.90      0.88      1036
           1       0.68      0.58      0.62       373

    accuracy                           0.82      1409
   macro avg       0.77      0.74      0.75      1409
weighted avg       0.81      0.82      0.81      1409


Accuracy Score: 0.8161816891412349
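
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below uses the four counts printed above; the majority-class baseline is an added comparison, not something the notebook computes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes.
cm = np.array([[935, 101],
               [158, 215]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted churners, how many actually churned
recall = tp / (tp + fn)      # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting "no churn".
majority_baseline = (tn + fp) / cm.sum()

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(f"majority-class baseline={majority_baseline:.3f}")
```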


## 🏁 Step 8: Conclusion & Insights

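Our logistic regression model achieved an **accuracy of ~81.6%** on the test set, compared with ~73.5% for always predicting "no churn" (1036 of the 1409 test customers did not churn). That makes it a reasonable baseline for churn prediction.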

### 🔍 Key Observations:
- The model performs well in predicting customers who **will not churn** (class `0`), with precision 0.86 and recall 0.90.
- It struggles more with customers who **do churn** (class `1`): recall is ~58%, so 158 of the 373 churners in the test set are missed.
- This is a common issue in churn prediction due to **class imbalance**: more customers stay than leave (about 26% of the test set churned).

### 🛠️ Next Steps:
- Try **Random Forest or XGBoost** models to improve performance.
- Use **feature scaling** and **hyperparameter tuning** for logistic regression.
- Address **class imbalance** with techniques like SMOTE or class weights.
- Explore **SHAP or feature importance** to understand which features impact churn the most.
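
As a starting point for the class-imbalance item above, the sketch below compares churn-class recall with and without `class_weight='balanced'`. It runs on synthetic data, so the numbers say nothing about the Telco dataset; `X_demo`, `y_demo` and `recalls` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 26% positives.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.74, 0.26], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    # 'balanced' reweights classes inversely to their frequency in y_tr.
    clf = LogisticRegression(max_iter=1000, class_weight=weight)
    clf.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, clf.predict(X_te))

print(recalls)
```

Balanced weights typically raise recall on the minority class at the cost of precision, so the choice depends on how expensive a missed churner is relative to an unnecessary retention offer.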

---

This project demonstrates how data science can help telecom companies identify at-risk customers and take proactive actions to improve customer retention.
