## Telco Customer Churn: Data Analysis and Prediction

In [66]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

In [67]:
df=pd.read_csv(r"C:\Users\pc\Downloads\archive\WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.info()
print(df.describe())
print(df.shape)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


* I observe that no column reports missing values, but `TotalCharges`, which should be numeric, is stored as `object`, so it needs to be converted. Most of the other columns, including the target `Churn`, are also stored as `object`, so they will need to be encoded.
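
A quick way to see why a numeric column is read as text is to look for the values that fail numeric parsing. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) and assumes the offending entries are blank strings; the real file may differ.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# A single non-numeric entry is enough to keep the whole column non-numeric.
is_numeric = pd.api.types.is_numeric_dtype(toy["TotalCharges"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
parsed = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Show the raw values that could not be parsed.
unparseable = toy.loc[parsed.isna(), "TotalCharges"].tolist()

print(is_numeric)
print(unparseable)
```

Running the same check on `df['TotalCharges']` would show which raw values are responsible before any rows are dropped.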

In [68]:
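# Convert TotalCharges to numeric; values that cannot be parsed become NaN
df['TotalCharges']=pd.to_numeric(df['TotalCharges'],errors='coerce')
# Drop the rows where the conversion produced NaN
df=df.dropna()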
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   gender            7032 non-null   object 
 2   SeniorCitizen     7032 non-null   int64  
 3   Partner           7032 non-null   object 
 4   Dependents        7032 non-null   object 
 5   tenure            7032 non-null   int64  
 6   PhoneService      7032 non-null   object 
 7   MultipleLines     7032 non-null   object 
 8   InternetService   7032 non-null   object 
 9   OnlineSecurity    7032 non-null   object 
 10  OnlineBackup      7032 non-null   object 
 11  DeviceProtection  7032 non-null   object 
 12  TechSupport       7032 non-null   object 
 13  StreamingTV       7032 non-null   object 
 14  StreamingMovies   7032 non-null   object 
 15  Contract          7032 non-null   object 
 16  PaperlessBilling  7032 non-null   object 
 17  

* We removed the 11 rows (7,043 down to 7,032) whose `TotalCharges` value was missing after the numeric conversion. That is about 0.16% of the data, so dropping them should have a negligible effect.
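
To make the "small amount" claim checkable, the number of dropped rows can be computed rather than read off two `info()` outputs. This is a minimal sketch on a made-up frame (`toy`, `cleaned`, and `dropped` are hypothetical names); it restricts `dropna` to the converted column with `subset`, which is a slightly narrower choice than the plain `dropna()` used above.

```python
import pandas as pd

# Toy frame with one blank entry; the values are invented for illustration.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "TotalCharges": ["10.5", " ", "20", "30.25"],
})

before = len(toy)

# Coerce to numeric, then drop only rows where this column is NaN.
cleaned = toy.assign(
    TotalCharges=pd.to_numeric(toy["TotalCharges"], errors="coerce")
).dropna(subset=["TotalCharges"])

dropped = before - len(cleaned)
print(f"Dropped {dropped} of {before} rows ({dropped / before:.1%})")
```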

In [69]:
columns_to_encode = [
    'gender',
    'Partner',
    'Dependents',
    'PhoneService',
    'MultipleLines',
    'InternetService',
    'OnlineSecurity',
    'OnlineBackup',
    'DeviceProtection',
    'TechSupport',
    'StreamingTV',
    'StreamingMovies',
    'Contract',
    'PaperlessBilling',
    'PaymentMethod'
]
df_one_hot=pd.get_dummies(df,columns = columns_to_encode,drop_first=True)
df_one_hot.drop(columns=['customerID'],inplace=True)
df_one_hot['Churn'] = df_one_hot['Churn'].map({'Yes': 1, 'No': 0})
df_one_hot.info()


<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   SeniorCitizen                          7032 non-null   int64  
 1   tenure                                 7032 non-null   int64  
 2   MonthlyCharges                         7032 non-null   float64
 3   TotalCharges                           7032 non-null   float64
 4   Churn                                  7032 non-null   int64  
 5   gender_Male                            7032 non-null   bool   
 6   Partner_Yes                            7032 non-null   bool   
 7   Dependents_Yes                         7032 non-null   bool   
 8   PhoneService_Yes                       7032 non-null   bool   
 9   MultipleLines_No phone service         7032 non-null   bool   
 10  MultipleLines_Yes                      7032 non-null   bool   
 11  InternetS

In [70]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    roc_auc_score,
    confusion_matrix
)
from sklearn.model_selection import train_test_split

In [71]:
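# Separate the features from the target
X_one_hot = df_one_hot.drop('Churn', axis=1)
y_one_hot = df_one_hot['Churn']
# Hold out 20% for testing, stratified so both splits keep the same churn ratio
X_train_one_hot, X_test_one_hot, y_train_one_hot, y_test_one_hot = train_test_split(
    X_one_hot, y_one_hot, test_size=0.2, random_state=42, stratify=y_one_hot
)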

In [72]:
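# Random forest with default hyperparameters; no random_state is set,
# so the score can vary slightly between runs
rf=RandomForestClassifier()
rf.fit(X_train_one_hot,y_train_one_hot)
y_pred=rf.predict(X_test_one_hot)
accuracy=accuracy_score(y_test_one_hot,y_pred)

# Confusion matrix is computed here but not displayed
cm=confusion_matrix(y_test_one_hot,y_pred)
print(f"Our accuracy is {accuracy:.2f}")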

Our accuracy is 0.78
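
The notebook imports `classification_report`, `roc_auc_score`, and `confusion_matrix` but only prints accuracy. As a rough sketch of how those metrics could be reported, the example below trains on synthetic data from `make_classification` with an arbitrary class imbalance; the data, the variable names, and the fixed `random_state` are assumptions for illustration, not the notebook's actual setup or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for the churn data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fixing random_state makes the forest reproducible between runs.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Probability of the positive class, needed for ROC AUC.
y_proba = model.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

print(cm)
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {auc:.2f}")
```

With an imbalanced target, per-class precision and recall usually say more than accuracy alone, since a model can score well on accuracy while missing many of the minority class.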
