## INTRODUCTION

Customer churn, the phenomenon where customers discontinue their relationship with a business, poses a significant challenge across industries such as telecommunications, e-commerce, and subscription-based services. Retaining existing customers is often more cost-effective than acquiring new ones, making the ability to predict and mitigate churn a critical business priority. The "Customer Churn Prediction Using Machine Learning" project aims to leverage advanced machine learning techniques to identify customers at risk of churning, enabling businesses to implement proactive retention strategies.

This project involves analyzing historical customer data, including demographics, usage patterns, and behavioral metrics, to uncover patterns indicative of churn. By employing machine learning algorithms such as logistic regression, decision trees, random forests, or neural networks, the project seeks to build a predictive model that accurately classifies customers as likely to churn or stay. The model will be trained and evaluated using real-world datasets, ensuring its robustness and reliability.

The primary objectives of this project are to:

- Understand the key factors driving customer churn.
- Develop an accurate and interpretable machine learning model for churn prediction.
- Provide actionable insights to businesses for improving customer retention.

The aim of this project is to demonstrate the power of machine learning in transforming raw customer data into strategic insights, ultimately helping businesses enhance customer satisfaction and reduce churn rates.

## BUSINESS UNDERSTANDING

Business Understanding for Customer Churn Prediction Using Machine Learning
Context
Customer churn, where customers stop using a company’s services, hurts revenue and growth in industries like telecom, e-commerce, and subscriptions. Retaining customers is cheaper than acquiring new ones, making churn prediction vital for business success.

Problem
Businesses struggle to identify customers at risk of leaving before they churn. Without predictive insights, retention efforts are reactive, costly, and less effective. This project aims to predict churn using machine learning to enable proactive retention.

* Objectives  
Predict customers likely to churn using historical data.  
Identify key churn drivers (e.g., usage patterns, demographics).  
Support targeted retention strategies to reduce churn.  
Boost customer lifetime value and lower acquisition costs.

* Stakeholders  
Executives: Focus on revenue and retention.  
Marketing: Designs retention campaigns.  
Customer Support: Engages high-risk customers.  
Data Scientists: Build and maintain the model.

* Success Criteria  
High model accuracy, especially minimizing missed churners.  
Measurable churn reduction via targeted actions.  
Clear, actionable insights for decision-making.  
Scalable model for business integration.
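"Minimizing missed churners" corresponds to maximizing recall on the churn class: the share of actual churners that the model flags. The sketch below shows how that figure could be computed from true and predicted labels. The function name `churn_recall` and the label lists are hypothetical and purely illustrative; they are not project data.

```python
def churn_recall(y_true, y_pred, positive=1):
    """Share of actual churners that the model flagged (recall)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    actual_pos = true_pos + false_neg
    # Guard against division by zero when there are no churners at all.
    return true_pos / actual_pos if actual_pos else 0.0

# Illustrative labels only (1 = churned, 0 = stayed).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

recall = churn_recall(y_true, y_pred)
print(f"Recall on churners: {recall:.2f}")
```

A model can score high on plain accuracy while still missing many churners when the classes are imbalanced, which is why recall is worth tracking separately.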

* Scope and Constraints
The project uses customer data (e.g., demographics, transactions, usage) to predict churn. Constraints include potential data quality issues, the need for interpretable models, and computational limits for real-time use.

* Expected Outcomes  
Lower churn rates through early identification.  
Better resource allocation for retention.  
Improved customer satisfaction with personalized strategies.  
Higher revenue via sustained loyalty.

This project will help the business retain customers and stay competitive by turning data into actionable retention strategies.

## Importing the necessary Libraries

In [20]:
# Import necessary libraries

# data loading
import glob
import os
import warnings
import sys

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Correlation
import phik

# Visualize missing values
import missingno as msno

# Hypothesis testing
import scipy.stats as stats

#Impute
from sklearn.impute import SimpleImputer

# Machine Learning
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Class Imbalance
from imblearn.over_sampling import SMOTE

# Ignore warnings (the warnings module is already imported above)
warnings.filterwarnings('ignore')

## DATA LOADING



In [30]:
# Set your directory
folder_path = r"D:\DS PROJECTS\CUSTOMER-CHURN-PREDICTION-USING-ML\Datasets"

# Get all CSV file paths
csv_files = glob.glob(os.path.join(folder_path, "*.csv"))

# Create a dictionary of DataFrames with file names as keys (without .csv)
dataframes = {
    os.path.splitext(os.path.basename(f))[0]: pd.read_csv(f)
    for f in csv_files
}

# Access like: dataframes["df_first_3000"]


In [31]:
# Let's inspect our DataFrames

def inspect_dataframe(df, name="DataFrame"):
    """
    Takes a DataFrame as input, and displays the head, info, and sum of all null values for each column in that DataFrame.
    """
    print(f"\n{name} Head:")
    print(df.head())
    print(f"\n{name} Info:")
    df.info()
    print(f"\n{name} Sum of Null Values for Each Column:")
    print(df.isna().sum())

def inspect_reviews_and_info(df_first_3000, df_last_2000, df_second_2000, df_LP2_Telco_churn_last_2000, df_Telco_, df_Telco_churn_second_2000):
    """
    Takes the six Telco DataFrames as input, and displays the head, info, and sum of all null values for each column in each of them.
    """
    inspect_dataframe(df_first_3000, "first_3000 DataFrame")
    inspect_dataframe(df_last_2000, "last_2000 DataFrame")
    inspect_dataframe(df_second_2000, "second_2000 DataFrame")
    inspect_dataframe(df_LP2_Telco_churn_last_2000, "LP2_Telco_churn_last_2000 DataFrame")
    inspect_dataframe(df_Telco_, "Telco_1 DataFrame")
    inspect_dataframe(df_Telco_churn_second_2000, "Telco_churn_second_2000 DataFrame")
# Inspect the datasets
for name, df in dataframes.items():
    inspect_dataframe(df, name)


df_first_3000 Head:
   customerID  gender  SeniorCitizen  Partner  Dependents  tenure  \
0  7590-VHVEG  Female          False     True       False       1   
1  5575-GNVDE    Male          False    False       False      34   
2  3668-QPYBK    Male          False    False       False       2   
3  7795-CFOCW    Male          False    False       False      45   
4  9237-HQITU  Female          False    False       False       2   

   PhoneService MultipleLines InternetService OnlineSecurity  ...  \
0         False           NaN             DSL          False  ...   
1          True         False             DSL           True  ...   
2          True         False             DSL           True  ...   
3         False           NaN             DSL           True  ...   
4          True         False     Fiber optic          False  ...   

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0            False       False       False           False  Month-to-mo

## Summary of Telco Churn Datasets

### df_first_3000.csv & Telco_1.csv (Identical)
- **Rows**: 3,000
- **Columns**: 21 (customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn)
- **Data Types**: bool (5), float64 (2), int64 (1), object (13)
- **Null Values**:
  - MultipleLines: 269
  - OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies: 651 each
  - TotalCharges: 5
  - Churn: 1
- **Memory Usage**: ~389.8 KB
- **Key Features**:
  - SeniorCitizen, Partner, Dependents, PhoneService, PaperlessBilling are boolean.
  - MonthlyCharges and TotalCharges are float64.
  - Tenure is int64.

### df_second_2000.csv & Telco-churn-second-2000.csv (Identical)
- **Rows**: 2,000
- **Columns**: 20 (same as above, but without the Churn column)
- **Data Types**: float64 (1), int64 (2), object (17)
- **Null Values**: None
- **Memory Usage**: ~312.6 KB
- **Key Features**:
  - SeniorCitizen, tenure are int64.
  - MonthlyCharges is float64.
  - TotalCharges is object (likely string, needs conversion).
  - Partner, Dependents, PhoneService, PaperlessBilling are object (not bool).

### df_last_2000.csv & LP2_Telco-churn-last-2000.csv (Identical)
- **Rows**: 2,043
- **Columns**: 21 (same as df_first_3000.csv)
- **Data Types**: float64 (1), int64 (2), object (18)
- **Null Values**: None
- **Memory Usage**: ~335.3 KB
- **Key Features**:
  - SeniorCitizen, tenure are int64.
  - MonthlyCharges is float64.
  - TotalCharges is object (likely string, needs conversion).
  - Partner, Dependents, PhoneService, PaperlessBilling are object (not bool).
  - Churn is object (likely "Yes"/"No").

### Observations
- **Duplicates**: df_first_3000.csv = Telco_1.csv; df_second_2000.csv = Telco-churn-second-2000.csv; df_last_2000.csv = LP2_Telco-churn-last-2000.csv.
- **Inconsistencies**:
  - df_first_3000.csv has null values; others have none.
  - Data types vary (e.g., bool vs. object for Partner, TotalCharges as float64 vs. object).
  - df_second_2000.csv has 20 columns instead of 21: it has no Churn column, so its rows are unlabeled.
- **Preprocessing Needs**:
  - Handle nulls in df_first_3000.csv/Telco_1.csv.
  - Convert TotalCharges to float64 in df_second_2000.csv and df_last_2000.csv.
  - Standardize boolean columns (Partner, Dependents, etc.) across datasets.
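The preprocessing needs listed above could be approached roughly as follows. This is a minimal sketch on a toy frame, not the project's actual cleaning code: the helper `to_yes_no` is hypothetical, and the blank string in `TotalCharges` is only an assumption about why that column might load as `object`.

```python
import pandas as pd

# Toy frame mimicking the mixed encodings described above (not project data).
toy = pd.DataFrame({
    "Partner": [True, "No", "Yes", False],
    "TotalCharges": ["29.85", " ", "1889.5", 108.15],
})

def to_yes_no(value):
    """Map boolean and string spellings onto a single 'Yes'/'No' encoding."""
    if pd.isna(value):
        return value  # leave missing values untouched
    return "Yes" if value in (True, "True", "Yes") else "No"

toy["Partner"] = toy["Partner"].map(to_yes_no)

# Coerce TotalCharges to float; unparseable entries (e.g. blanks) become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(toy)
print(toy.dtypes)
```

Coercing with `errors="coerce"` turns bad entries into NaN rather than raising, so it is worth counting the new NaN values afterwards to see how many rows were affected.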

In [32]:
# Show all columns in each dataset
for name, df in dataframes.items():
    print(f"\nColumns in '{name}':")
    print(df.columns.tolist())



Columns in 'df_first_3000':
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Columns in 'df_last_2000':
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Columns in 'df_second_2000':
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', '

In [33]:
import numpy as np
import pandas as pd

# Load your datasets
df_second_2000 = pd.read_csv(r"D:\DS PROJECTS\CUSTOMER-CHURN-PREDICTION-USING-ML\Datasets\df_second_2000.csv")
Telco_churn_second_2000 = pd.read_csv(r"D:\DS PROJECTS\CUSTOMER-CHURN-PREDICTION-USING-ML\Datasets\Telco_churn_second_2000.csv")


In [34]:
# Add 'Churn' column with placeholder value

df_second_2000['Churn'] = np.nan
Telco_churn_second_2000['Churn'] = np.nan

Since all our datasets now share the same number of columns, the same column names, and similar key features, we merge them into a single DataFrame to better understand the data before cleaning it.
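As a side note on the placeholder `Churn` column: `pd.concat` aligns frames by column name and fills any column missing from one frame with NaN. The sketch below illustrates this on toy frames (`labeled` and `unlabeled` are hypothetical names, not project data).

```python
import pandas as pd

# Toy frames (not project data): the second one lacks the Churn column.
labeled = pd.DataFrame({
    "customerID": ["A1", "A2"],
    "tenure": [1, 34],
    "Churn": ["No", "Yes"],
})
unlabeled = pd.DataFrame({"customerID": ["B1"], "tenure": [12]})

# pd.concat aligns on column names; Churn is NaN for rows from `unlabeled`.
merged = pd.concat([labeled, unlabeled], ignore_index=True)

print(merged)
print("Rows without a Churn label:", merged["Churn"].isna().sum())
```

Adding the placeholder explicitly is therefore not strictly required for the merge, but it makes the shared schema visible before concatenation.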


In [36]:
# Merge all dataframes in the dictionary into one
df_merged = pd.concat(list(dataframes.values()), ignore_index=True)


In [37]:
# Check the number of rows and columns
print(f"Combined dataset shape: {df_merged.shape}")

# Preview the top few rows
df_merged.head()


Combined dataset shape: (14086, 21)


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | True | False | 1 | False | NaN | DSL | False | ... | False | False | False | False | Month-to-month | True | Electronic check | 29.85 | 29.85 | False |
| 1 | 5575-GNVDE | Male | 0 | False | False | 34 | True | False | DSL | True | ... | True | False | False | False | One year | False | Mailed check | 56.950001 | 1889.5 | False |
| 2 | 3668-QPYBK | Male | 0 | False | False | 2 | True | False | DSL | True | ... | False | False | False | False | Month-to-month | True | Mailed check | 53.849998 | 108.150002 | True |
| 3 | 7795-CFOCW | Male | 0 | False | False | 45 | False | NaN | DSL | True | ... | True | True | False | False | One year | False | Bank transfer (automatic) | 42.299999 | 1840.75 | False |
| 4 | 9237-HQITU | Female | 0 | False | False | 2 | True | False | Fiber optic | False | ... | False | False | False | False | Month-to-month | True | Electronic check | 70.699997 | 151.649994 | True |
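The combined row count (14,086) is exactly twice the 7,043 rows of the three distinct files, which matches the earlier observation that each file has an identical copy. A hedged sketch of how such repeats could be counted and dropped is shown below on toy data; keying on `customerID` is an assumption that holds only if each customer should appear once.

```python
import pandas as pd

# Toy frames (not project data): `copy_a` duplicates `part_a` exactly.
part_a = pd.DataFrame({"customerID": ["7590-VHVEG", "5575-GNVDE"], "tenure": [1, 34]})
copy_a = part_a.copy()
part_b = pd.DataFrame({"customerID": ["3668-QPYBK"], "tenure": [2]})

combined = pd.concat([part_a, copy_a, part_b], ignore_index=True)

# Count rows whose customerID already appeared earlier in the frame.
n_repeats = int(combined.duplicated(subset="customerID").sum())

# Keep the first occurrence of each customer and rebuild the index.
deduped = (
    combined.drop_duplicates(subset="customerID", keep="first")
    .reset_index(drop=True)
)

print("Repeated rows:", n_repeats)
print("Shape after de-duplication:", deduped.shape)
```

Leaving exact copies in the data would let the same customer land in both the training and the test split, which tends to inflate evaluation scores.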


In [38]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14086 entries, 0 to 14085
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        14086 non-null  object 
 1   gender            14086 non-null  object 
 2   SeniorCitizen     14086 non-null  int64  
 3   Partner           14086 non-null  object 
 4   Dependents        14086 non-null  object 
 5   tenure            14086 non-null  int64  
 6   PhoneService      14086 non-null  object 
 7   MultipleLines     13548 non-null  object 
 8   InternetService   14086 non-null  object 
 9   OnlineSecurity    12784 non-null  object 
 10  OnlineBackup      12784 non-null  object 
 11  DeviceProtection  12784 non-null  object 
 12  TechSupport       12784 non-null  object 
 13  StreamingTV       12784 non-null  object 
 14  StreamingMovies   12784 non-null  object 
 15  Contract          14086 non-null  object 
 16  PaperlessBilling  14086 non-null  object

## DATA UNDERSTANDING

The dataset contains customer information from the Vodafone network service, including features like MonthlyCharges, Tenure, SeniorCitizen status, and various service subscriptions (e.g., OnlineSecurity, OnlineBackup). The target variable 'Churn' indicates whether a customer has churned ('Yes') or not ('No'). Before building the models, we need to preprocess the data, handle missing values, and address class imbalance.

| Feature          | Description                                                          |
|------------------|----------------------------------------------------------------------|
| gender           | Whether the customer is male or female                               |
| SeniorCitizen    | Whether the customer is a senior citizen or not                      |
| Partner          | Whether the customer has a partner or not (Yes, No)                  |
| Dependents       | Whether the customer has dependents or not (Yes, No)                 |
| tenure           | Number of months the customer has stayed with the company            |
| PhoneService     | Whether the customer has a phone service or not (Yes, No)            |
| MultipleLines    | Whether the customer has multiple lines or not                       |
| InternetService  | Customer's internet service type (DSL, Fiber optic, No)              |
| OnlineSecurity   | Whether the customer has online security or not (Yes, No, No internet service) |
| OnlineBackup     | Whether the customer has online backup or not (Yes, No, No internet service) |
| DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) |
| TechSupport      | Whether the customer has tech support or not (Yes, No, No internet service) |
| StreamingTV      | Whether the customer has streaming TV or not (Yes, No, No internet service) |
| StreamingMovies  | Whether the customer has streaming movies or not (Yes, No, No internet service) |
| Contract         | The contract term of the customer (Month-to-month, One year, Two year) |
| PaperlessBilling | Whether the customer has paperless billing or not (Yes, No)          |
| PaymentMethod    | The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
| MonthlyCharges   | The amount charged to the customer monthly                           |
| TotalCharges     | The total amount charged to the customer                             |
| Churn            | Whether the customer churned or not (Yes or No)                      |
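Before addressing class imbalance, it helps to measure it. The sketch below shows one way the share of churners could be computed once the target labels are normalized; the toy `Churn` series is illustrative only and mixes boolean, Yes/No, and missing labels as an assumption about the merged data.

```python
import numpy as np
import pandas as pd

# Toy target column (not project data) with mixed encodings and a missing label.
churn = pd.Series([False, True, "No", "Yes", "No", np.nan, False, "No"], name="Churn")

def normalize_label(value):
    """Map True/'Yes' to 'Yes', other labels to 'No'; keep missing values."""
    if pd.isna(value):
        return value
    return "Yes" if value in (True, "Yes") else "No"

normalized = churn.map(normalize_label)

# Share of each class among labeled rows only (value_counts skips NaN).
balance = normalized.value_counts(normalize=True)
print(balance)
```

If the minority class turns out to be small, resampling (for example with the SMOTE import above) or class weights are common options, ideally applied to the training split only.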


In [39]:
# Summary table for merged data

summary = pd.DataFrame({
    'Column': df_merged.columns,
    'Data Type': df_merged.dtypes.values,
    'Missing Values': df_merged.isnull().sum().values,
    'Unique Values': df_merged.nunique().values,
    'Example Value': [df_merged[col].dropna().iloc[0] if df_merged[col].notnull().any() else None for col in df_merged.columns]
})

summary.reset_index(drop=True, inplace=True)
summary

| | Column | Data Type | Missing Values | Unique Values | Example Value |
|---|---|---|---|---|---|
| 0 | customerID | object | 0 | 7043 | 7590-VHVEG |
| 1 | gender | object | 0 | 2 | Female |
| 2 | SeniorCitizen | int64 | 0 | 2 | 0 |
| 3 | Partner | object | 0 | 4 | True |
| 4 | Dependents | object | 0 | 4 | False |
| 5 | tenure | int64 | 0 | 73 | 1 |
| 6 | PhoneService | object | 0 | 4 | False |
| 7 | MultipleLines | object | 538 | 5 | False |
| 8 | InternetService | object | 0 | 3 | DSL |
| 9 | OnlineSecurity | object | 1302 | 5 | False |


The summary table provides an overview of the merged Telco churn dataset:

- **Columns & Data Types**: The dataset contains 21 columns, including customer demographics, service usage, and churn status. Most columns are of type 'object' (categorical), with a few numerical columns like 'SeniorCitizen', 'tenure', and 'MonthlyCharges'.

- **Missing Values**: Some columns have missing values, notably 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'TotalCharges', and 'Churn'. For example, 'Churn' has 4002 missing values, indicating that not all records are labeled.

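- **Unique Values**: The number of unique values varies by column. For instance, 'customerID' has 7,043 unique values across 14,086 rows, so customer IDs repeat, consistent with the duplicated source files noted earlier. Categorical columns like 'gender' and 'Contract' have only a few unique values, while 'Partner', 'Dependents', and 'PhoneService' show 4 unique values each, which suggests that boolean and Yes/No encodings are mixed.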

- **Example Values**: The table also shows a sample value from each column, helping to understand the kind of data stored (e.g., '7590-VHVEG' for customerID, 'Female' for gender, 'Month-to-month' for Contract).

Overall, the dataset is a mix of categorical and numerical features, with some missing data that will need to be addressed during preprocessing. The target variable 'Churn' indicates whether a customer has left the service, which is crucial for building predictive models.
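To make the missing-data picture concrete, the share of missing values per column can be computed, and rows without a `Churn` label can be set aside, since they cannot be used for supervised training. The following is a minimal sketch on a toy frame; the names `toy`, `labeled`, and `unlabeled` are hypothetical and the values are not project data.

```python
import numpy as np
import pandas as pd

# Toy merged frame (not project data) with gaps in a feature and in the target.
toy = pd.DataFrame({
    "customerID": ["A1", "A2", "A3", "A4"],
    "OnlineSecurity": ["Yes", np.nan, "No", "No"],
    "Churn": ["No", "Yes", np.nan, np.nan],
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean().mul(100).round(1)

# Split rows by whether the target label is present.
labeled = toy[toy["Churn"].notna()]
unlabeled = toy[toy["Churn"].isna()]

print(missing_pct)
print("Labeled rows:", len(labeled), "| Unlabeled rows:", len(unlabeled))
```

Missing feature values can be imputed, but a missing target is different: those rows are better held out for later scoring than filled in.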