## Introduction

In this notebook, I explore the Telco Customer Churn dataset, break down the key trends and factors associated with churn, and finally train an XGBRFClassifier and a DecisionTreeClassifier to predict churn and compare their performance.

## 1️⃣ **Importing Libraries and Dataset**

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBRFClassifier
from sklearn.metrics import accuracy_score,classification_report

**1.1 Importing the dataset**

In [None]:
df=pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

## 2️⃣ **Data Understanding & Basic Checks**

**2.1 Checking shape**

In [None]:
df.shape

**2.2 Checking for NaN or blank values**

In [None]:
df.replace(' ', np.nan, inplace=True)
df.isna().sum()

**2.3 Fixing data types**

In [None]:
df['MonthlyCharges'] = df['MonthlyCharges'].astype('float')
df['TotalCharges'] = df['TotalCharges'].astype('float')
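
The blank-to-NaN replacement followed by `astype('float')` can also be done in one step. A minimal sketch on a made-up three-row column (the values are invented for illustration, not taken from the dataset), using `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus one blank entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# errors="coerce" turns anything that cannot be parsed (here the blank) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(toy["TotalCharges"].dtype)
print(int(toy["TotalCharges"].isna().sum()))
```

Either route should leave the column as floats with the blank entries marked as missing.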

## 3️⃣ **Data Exploration**  

**3.1 Defining plot helper functions**

In [None]:
def hist(i):
    sns.histplot(df[i], kde=True, color='skyblue')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    colm=df[i].mean()
    colmd=df[i].median()
    plt.axvline(colm, color="red", linestyle="--", label="Mean")
    plt.axvline(colmd, color="green", linestyle="-", label="Median")
    plt.legend()
    plt.show()

def box(i):
    sns.boxplot(df[i])
    plt.show()

def count(i):
    sns.countplot(x=df[i])
    plt.xlabel(i)
    plt.ylabel('count')
    plt.show()

**3.2 Checking histograms for a better understanding of the data**

In [None]:
for i in ['tenure','MonthlyCharges','TotalCharges']:
    hist(i)

**3.3 Checking box plots for Q1, Q2, Q3**

In [None]:
for i in ['tenure','MonthlyCharges','TotalCharges']:
    box(i)

**3.4 Checking category counts**

In [None]:
for i in ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
          'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
          'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']:
    count(i)

**3.5 Converting Yes/No columns to integers (0/1)**

In [None]:
cols_to_convert = ['Partner', 'Dependents', 'PhoneService', 'OnlineSecurity', 'OnlineBackup', 
                   'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling','Churn']

df[cols_to_convert] = (df[cols_to_convert] == 'Yes').astype(int)
df['gender'] = (df['gender'] == 'Male').astype(int)
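
One consequence of comparing against `'Yes'` is that every other label becomes 0. In this dataset the service columns such as OnlineSecurity also contain `No internet service`, which is therefore folded into the same 0 as `No`. A small sketch on a made-up series shows the effect, next to an explicit mapping (an alternative, not what the notebook uses) that leaves unexpected labels as NaN:

```python
import pandas as pd

# Toy column with the three labels a service column can take.
s = pd.Series(["Yes", "No", "No internet service", "Yes"])

# The comparison used above: anything that is not exactly "Yes" becomes 0.
as_int = (s == "Yes").astype(int)
print(as_int.tolist())

# An explicit mapping leaves unlisted labels as NaN, which makes them easy to spot.
mapped = s.map({"Yes": 1, "No": 0})
print(int(mapped.isna().sum()))
```

Whether merging the two "no" labels is acceptable depends on the analysis; InternetService still records which customers have no internet at all.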


**3.6 Checking correlation between features, to decide how to handle the NaN values in TotalCharges**

In [None]:
sns.heatmap(df[["tenure", "MonthlyCharges", "TotalCharges",'Churn']].corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
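
`DataFrame.corr()` excludes missing values pair by pair, so the heatmap can be computed before the NaN rows are removed. A minimal sketch on invented numbers:

```python
import numpy as np
import pandas as pd

# Toy frame: TotalCharges is perfectly linear in tenure, with one missing value.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "TotalCharges": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# corr() drops the incomplete pair itself ...
corr_with_nan = toy.corr().loc["tenure", "TotalCharges"]
# ... so the result matches dropping the row beforehand.
corr_dropped = toy.dropna().corr().loc["tenure", "TotalCharges"]

print(corr_with_nan, corr_dropped)
```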

## 4️⃣ **Data Preprocessing** 

**4.1 Checking shape**

In [None]:
df.shape

**4.2 Viewing all columns for better analysis**

In [None]:
pd.set_option('display.max_columns',None)
df.head()

**4.3 Checking unique values**

In [None]:
for i in df:
    n = df[i].nunique() 
    print(f" '{i}' unique values - {n} ")


**4.4 Using OneHotEncoder to encode the multi-category columns**

In [None]:
colen=['MultipleLines','InternetService','Contract','PaymentMethod']
def OHE(i):
    encoder = OneHotEncoder(sparse_output=False)
    encoded_array = encoder.fit_transform(df[[i]])
    encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out([i]))
    return encoded_df

In [None]:
for i in colen:
    encoded_df=OHE(i)
    df_concat = pd.concat([df, encoded_df], axis=1)
    df=df_concat.drop(columns=[i])
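
The concatenation in this loop relies on `df` still having its default 0..n-1 index, which holds here because the NaN rows are dropped only afterwards. The sketch below, on a made-up three-row frame, shows what would happen in the other order, and how `pd.get_dummies` (an alternative to the encoder used above) avoids the issue by keeping the input's index:

```python
import pandas as pd

# Toy frame; the column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "TotalCharges": [29.85, None, 108.15],
})

# Dropping the NaN row first leaves index labels 0 and 2.
cleaned = toy.dropna()

# get_dummies keeps the index of its input, so the pieces line up on concat.
dummies = pd.get_dummies(cleaned["Contract"], prefix="Contract", dtype=int)
aligned = pd.concat([cleaned.drop(columns=["Contract"]), dummies], axis=1)
print(len(aligned))

# A frame with a fresh 0..n-1 index is paired by label instead,
# and the mismatched rows are padded with NaN.
misaligned = pd.concat([cleaned, dummies.reset_index(drop=True)], axis=1)
print(len(misaligned))
```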

**4.5 Dropping NaN rows: TotalCharges is highly correlated with other features, so we keep the column and drop the rows with missing values instead**

In [None]:
df.dropna(inplace=True)

**4.6 Checking shape**

In [None]:
df.shape

**4.7 Dropping customerID as it's not related to churn**

In [None]:
df=df.drop(columns=['customerID'])
df.head()

**4.8 Separating dependent and independent features**

In [None]:
X=df.drop(columns=['Churn'])
y=df[['Churn']]


## 5️⃣ **Model Training** 

**5.1 Splitting data**

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
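
Without a fixed seed, each run of the split produces different train and test sets, so the scores reported later will vary from run to run. A minimal sketch on toy arrays (the data and the `random_state=42` value are assumptions for illustration) of a reproducible split that also preserves the class ratio through `stratify`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify keeps the positive rate equal in both parts; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```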

**5.2 Checking the class balance of the dependent variable**

In [None]:
print(sum(y_train['Churn']==1))
print(sum(y_train['Churn']==0))

In [None]:
sns.countplot(x=df['Churn'])
plt.show()

**5.3 Oversampling the training data to balance the classes of the dependent variable**

In [None]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
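
The core idea behind SMOTE is to create each synthetic minority sample on the line segment between a minority point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in plain NumPy, with two invented points; it is a simplified illustration, not imbalanced-learn's implementation, which also handles the neighbour search and sampling counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy minority-class points in a 2-D feature space.
x = np.array([1.0, 1.0])
neighbor = np.array([3.0, 2.0])

# A synthetic sample sits at a random position on the segment between them.
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbor - x)

print(synthetic)
```

Because the new points are interpolated, 0/1 and one-hot columns can end up with fractional values after resampling; imbalanced-learn provides `SMOTENC` for data that mixes numeric and categorical features.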

## 6️⃣ **Model Analysis** 

**6.1 DecisionTreeClassifier**

In [None]:
model1=DecisionTreeClassifier()
model1.fit(X_train_resampled,y_train_resampled)
y_pred=model1.predict(X_test)
print(classification_report(y_test,y_pred))  # true labels first, predictions second
scores = cross_val_score(model1, X, y, cv=10).mean()  # scored on the original, non-resampled data
print(f'cross-validation score : {scores}')
print(f'accuracy : {accuracy_score(y_test,y_pred)}')
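
scikit-learn's metric functions expect the true labels as the first argument and the predictions as the second. Accuracy is symmetric, so the order does not change it, but precision and recall trade places when the arguments are swapped. A small sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 3 actual positives, 2 predicted positives, 1 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 0, 0, 1, 0, 0, 0, 0]

# Correct order: (y_true, y_pred).
p_ok = precision_score(y_true, y_hat)
r_ok = recall_score(y_true, y_hat)

# Swapped order: precision and recall exchange their values.
p_swapped = precision_score(y_hat, y_true)
r_swapped = recall_score(y_hat, y_true)

print(p_ok, r_ok)
print(p_swapped, r_swapped)
print(accuracy_score(y_true, y_hat), accuracy_score(y_hat, y_true))
```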

**6.2 XGBRFClassifier**

In [None]:
model2=XGBRFClassifier()
model2.fit(X_train_resampled,y_train_resampled)
y_pred=model2.predict(X_test)
print(classification_report(y_test,y_pred))  # true labels first, predictions second
scores = cross_val_score(model2, X, y, cv=10).mean()  # scored on the original, non-resampled data
print(f'cross-validation score : {scores}')
print(f'accuracy : {accuracy_score(y_test,y_pred)}')
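
When the classes are imbalanced, accuracy is easier to interpret next to a trivial baseline that always predicts the majority class. A minimal sketch with scikit-learn's `DummyClassifier`; the 73/27 class split below is invented for illustration and should be replaced by the actual counts printed in section 5:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy target with a made-up 73/27 split; the features are irrelevant to this baseline.
y_toy = np.array([0] * 73 + [1] * 27)
X_toy = np.zeros((100, 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

print(accuracy_score(y_toy, pred))
print(recall_score(y_toy, pred))
```

A baseline like this reaches an accuracy equal to the majority share while finding none of the positive cases, which is why the per-class recall in the classification report is worth reading alongside the accuracy figure.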

## 7️⃣ **Conclusion**


In this project, we conducted an in-depth Exploratory Data Analysis (EDA) on the Telco Customer Churn dataset and built two predictive models to identify customers likely to churn.
We saw the following:

**Senior citizens** are more likely to churn.

**Gender** doesn’t seem to matter much.

Customers with a **tenure** of less than a year leave the most. Long-time customers usually stay loyal.

**Month-to-month** plans have the highest churn.

**One or two-year contracts** are much more stable.

**Fiber optic** customers leave more compared to **DSL** or **no-internet** users.

Customers paying **higher monthly charges** churn more, but those with **high total charges** (who have been around longer) usually stay.



Both models perform reasonably well, with 73.13% (DecisionTreeClassifier) and 78.03% (XGBRFClassifier) accuracy on the test set. The classification reports and cross-validation scores above give a more detailed picture of their performance.