## INTRODUCTION

Customer churn, the phenomenon where customers discontinue their relationship with a business, poses a significant challenge across industries such as telecommunications, e-commerce, and subscription-based services. Retaining existing customers is often more cost-effective than acquiring new ones, making the ability to predict and mitigate churn a critical business priority. The "Customer Churn Prediction Using Machine Learning" project aims to leverage advanced machine learning techniques to identify customers at risk of churning, enabling businesses to implement proactive retention strategies.

This project involves analyzing historical customer data, including demographics, usage patterns, and behavioral metrics, to uncover patterns indicative of churn. By employing machine learning algorithms such as logistic regression, decision trees, random forests, or neural networks, the project seeks to build a predictive model that accurately classifies customers as likely to churn or stay. The model will be trained and evaluated using real-world datasets, ensuring its robustness and reliability.

The primary objectives of this project are to:

- Understand the key factors driving customer churn.
- Develop an accurate and interpretable machine learning model for churn prediction.
- Provide actionable insights to businesses for improving customer retention.

This project aims to demonstrate the power of machine learning in transforming raw customer data into strategic insights, ultimately helping businesses enhance customer satisfaction and reduce churn rates.

## BUSINESS UNDERSTANDING

Business Understanding for Customer Churn Prediction Using Machine Learning

* Context  
Customer churn, where customers stop using a company’s services, hurts revenue and growth in industries like telecom, e-commerce, and subscriptions. Retaining customers is cheaper than acquiring new ones, making churn prediction vital for business success.

* Problem  
Businesses struggle to identify customers at risk of leaving before they churn. Without predictive insights, retention efforts are reactive, costly, and less effective. This project aims to predict churn using machine learning to enable proactive retention.

* Objectives  
Predict customers likely to churn using historical data.  
Identify key churn drivers (e.g., usage patterns, demographics).  
Support targeted retention strategies to reduce churn.  
Boost customer lifetime value and lower acquisition costs.

* Stakeholders  
Executives: Focus on revenue and retention.  
Marketing: Designs retention campaigns.  
Customer Support: Engages high-risk customers.  
Data Scientists: Build and maintain the model.

* Success Criteria  
High model accuracy, with particular emphasis on recall for the churn class (minimizing missed churners).  
Measurable churn reduction via targeted actions.  
Clear, actionable insights for decision-making.  
Scalable model for business integration.
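
The "minimizing missed churners" criterion corresponds to recall on the churn class. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative, not project results) to show how that figure can be read off a confusion matrix with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 1 = churned, 0 = stayed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = share of actual churners the model caught; fn counts the missed churners
churn_recall = tp / (tp + fn)

print(f"Missed churners: {fn}, churn recall: {churn_recall:.2f}")
print(f"recall_score agrees: {recall_score(y_true, y_pred):.2f}")
```

A model can score high overall accuracy while still missing many churners when the churn class is small, which is why recall is worth tracking separately.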

* Scope and Constraints  
The project uses customer data (e.g., demographics, transactions, usage) to predict churn. Constraints include potential data quality issues, the need for interpretable models, and computational limits for real-time use.

* Expected Outcomes  
Lower churn rates through early identification.  
Better resource allocation for retention.  
Improved customer satisfaction with personalized strategies.  
Higher revenue via sustained loyalty.

This project will help the business retain customers and stay competitive by turning data into actionable retention strategies.

## Importing the necessary Libraries

In [21]:
# Import necessary libraries

# data loading
import glob
import os
import warnings
import sys

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Correlation
import phik

# Visualize missing values
import missingno as msno

# Hypothesis testing
import scipy.stats as stats

# Imputation
from sklearn.impute import SimpleImputer

# Machine Learning
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Class Imbalance
from imblearn.over_sampling import SMOTE

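# Suppress all warnings to keep the notebook output readable
# (note: this also hides deprecation notices; `warnings` is imported above)
warnings.filterwarnings('ignore')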

## DATA LOADING



In [None]:


# Set your directory
folder_path = r"D:\DS PROJECTS\CUSTOMER-CHURN-PREDICTION-USING-ML\Datasets"

# Get all CSV file paths
csv_files = glob.glob(os.path.join(folder_path, "*.csv"))

# Create a dictionary of DataFrames with file names as keys (without .csv)
dataframes = {
    os.path.splitext(os.path.basename(f))[0]: pd.read_csv(f)
    for f in csv_files
}

# Access like: dataframes["df_first_3000"]


In [23]:
# Let's inspect our DataFrames

def inspect_dataframe(df, name="DataFrame"):
    """
    Takes a DataFrame as input, and displays the head, info, and sum of all null values for each column in that DataFrame.
    """
    print(f"\n{name} Head:")
    print(df.head())
    print(f"\n{name} Info:")
    df.info()
    print(f"\n{name} Sum of Null Values for Each Column:")
    print(df.isna().sum())

def inspect_reviews_and_info(df_first_3000, df_last_2000, df_second_2000, df_LP2_Telco_churn_last_2000, df_Telco_, df_Telco_churn_second_2000):
    """
    Takes the six Telco churn DataFrames as input, and displays the head, info, and sum of all null values for each column in each of them.
    """
    inspect_dataframe(df_first_3000, "first_3000 DataFrame")
    inspect_dataframe(df_last_2000, "last_2000 DataFrame")
    inspect_dataframe(df_second_2000, "second_2000 DataFrame")
    inspect_dataframe(df_LP2_Telco_churn_last_2000, "LP2_Telco_churn_last_2000 DataFrame")
    inspect_dataframe(df_Telco_, "Telco_1 DataFrame")
    inspect_dataframe(df_Telco_churn_second_2000, "Telco_churn_second_2000 DataFrame")
# Inspect the datasets
for name, df in dataframes.items():
    inspect_dataframe(df, name)


df_first_3000 Head:
   customerID  gender  SeniorCitizen  Partner  Dependents  tenure  \
0  7590-VHVEG  Female          False     True       False       1   
1  5575-GNVDE    Male          False    False       False      34   
2  3668-QPYBK    Male          False    False       False       2   
3  7795-CFOCW    Male          False    False       False      45   
4  9237-HQITU  Female          False    False       False       2   

   PhoneService MultipleLines InternetService OnlineSecurity  ...  \
0         False           NaN             DSL          False  ...   
1          True         False             DSL           True  ...   
2          True         False             DSL           True  ...   
3         False           NaN             DSL           True  ...   
4          True         False     Fiber optic          False  ...   

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0            False       False       False           False  Month-to-mo

## Summary of Telco Churn Datasets

### df_first_3000.csv & Telco_1.csv (Identical)
- **Rows**: 3,000
- **Columns**: 21 (customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn)
- **Data Types**: bool (5), float64 (2), int64 (1), object (13)
- **Null Values**:
  - MultipleLines: 269
  - OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies: 651 each
  - TotalCharges: 5
  - Churn: 1
- **Memory Usage**: ~389.8 KB
- **Key Features**:
  - SeniorCitizen, Partner, Dependents, PhoneService, PaperlessBilling are boolean.
  - MonthlyCharges and TotalCharges are float64.
  - tenure is int64.

### df_second_2000.csv & Telco-churn-second-2000.csv (Identical)
- **Rows**: 2,000
- **Columns**: 20 (same as above, but no Churn column in head output)
- **Data Types**: float64 (1), int64 (2), object (17)
- **Null Values**: None
- **Memory Usage**: ~312.6 KB
- **Key Features**:
  - SeniorCitizen, tenure are int64.
  - MonthlyCharges is float64.
  - TotalCharges is object (likely string, needs conversion).
  - Partner, Dependents, PhoneService, PaperlessBilling are object (not bool).

### df_last_2000.csv & LP2_Telco-churn-last-2000.csv (Identical)
- **Rows**: 2,043
- **Columns**: 21 (same as df_first_3000.csv)
- **Data Types**: float64 (1), int64 (2), object (18)
- **Null Values**: None
- **Memory Usage**: ~335.3 KB
- **Key Features**:
  - SeniorCitizen, tenure are int64.
  - MonthlyCharges is float64.
  - TotalCharges is object (likely string, needs conversion).
  - Partner, Dependents, PhoneService, PaperlessBilling are object (not bool).
  - Churn is object (likely "Yes"/"No").
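
The "(Identical)" labels in the headings above can be checked programmatically rather than by eye. A minimal sketch, using toy frames in place of the real CSVs (the helper name `same_content` is hypothetical), based on `DataFrame.equals`:

```python
import pandas as pd

def same_content(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True when both frames have the same shape, labels and values.

    DataFrame.equals treats NaNs in matching positions as equal.
    """
    return a.equals(b)

# Toy frames standing in for two CSV files (values are illustrative)
left = pd.DataFrame({"customerID": ["A", "B"], "tenure": [1, 34]})
right = left.copy()
changed = left.assign(tenure=[1, 35])

print(same_content(left, right))    # identical copy
print(same_content(left, changed))  # one value differs
```

In the notebook this could be applied to pairs from the `dataframes` dictionary, for example `dataframes["df_first_3000"]` against the frame loaded from Telco_1.csv (the exact key depends on the file name on disk).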

### Observations
- **Duplicates**: df_first_3000.csv = Telco_1.csv; df_second_2000.csv = Telco-churn-second-2000.csv; df_last_2000.csv = LP2_Telco-churn-last-2000.csv.
- **Inconsistencies**:
  - df_first_3000.csv has null values; others have none.
  - Data types vary (e.g., bool vs. object for Partner, TotalCharges as float64 vs. object).
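  - df_second_2000.csv reports 20 columns and its head output omits Churn, while the other files have 21; whether Churn is present there should be confirmed from the column listing.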
- **Preprocessing Needs**:
  - Handle nulls in df_first_3000.csv/Telco_1.csv.
  - Convert TotalCharges to float64 in df_second_2000.csv and df_last_2000.csv.
  - Standardize boolean columns (Partner, Dependents, etc.) across datasets.
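
The last two preprocessing needs can be sketched on a toy frame. This assumes the object-typed columns hold "Yes"/"No" strings and that unparseable TotalCharges entries are blank strings; neither is confirmed by the output shown so far, so check the actual values first:

```python
import pandas as pd

# Toy frame mimicking the object-typed columns described above (values are illustrative)
raw = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

clean = raw.copy()

# errors="coerce" turns strings that cannot be parsed (e.g. blanks) into NaN instead of raising
clean["TotalCharges"] = pd.to_numeric(clean["TotalCharges"], errors="coerce")

# Map Yes/No strings to booleans so the column matches the bool-typed files
clean["Partner"] = clean["Partner"].map({"Yes": True, "No": False})

print(clean.dtypes)
print(clean["TotalCharges"].isna().sum(), "value(s) could not be converted")
```

Coercion creates new nulls wherever parsing fails, so those rows then need the same null handling as the ones already present in df_first_3000.csv.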

Since the datasets share largely the same columns and similar key features, it makes sense to merge them into one DataFrame to better understand the data before cleaning it.
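
One way to do the merge is to stack the frames row-wise with `pd.concat`. The sketch below uses toy frames, and the `source` column is an illustrative addition rather than part of the original data; it records which file each row came from:

```python
import pandas as pd

# Toy frames standing in for the loaded CSVs (values are illustrative)
frames = {
    "df_first_3000": pd.DataFrame({"customerID": ["A", "B"], "Churn": [True, False]}),
    "df_last_2000": pd.DataFrame({"customerID": ["C"], "Churn": ["Yes"]}),
}

# Tag each frame with its origin, then stack them and rebuild a clean 0..n-1 index
merged = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)

print(merged)
print(merged.dtypes)
```

In the notebook, `frames` would be the `dataframes` dictionary, ideally with the duplicate files left out so rows are not counted twice. Mixed representations such as bool versus "Yes"/"No" leave the merged column as object dtype, so the standardization step is still needed afterwards.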


In [24]:
# Show all columns in each dataset
for name, df in dataframes.items():
    print(f"\nColumns in '{name}':")
    print(df.columns.tolist())



Columns in 'df_first_3000':
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Columns in 'df_last_2000':
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Columns in 'df_second_2000':
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', '