# Machine Learning Classification

## Telecom Customer Churn Prediction Project

*We use logistic regression to predict churn*

**Dataset** : <https://www.kaggle.com/datasets/blastchar/telco-customer-churn>



### Data Preparation

- Download the data and read it with pandas
- Make column names and string values uniform (lowercase, underscores instead of spaces)
- Convert the churn column to 0s and 1s

In [1]:
#importing libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [2]:
#read the data
df = pd.read_csv("./WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
df.head().T

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [4]:
df.columns = df.columns.str.lower().str.replace(' ','_')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerid        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   seniorcitizen     7043 non-null   int64  
 3   partner           7043 non-null   object 
 4   dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   phoneservice      7043 non-null   object 
 7   multiplelines     7043 non-null   object 
 8   internetservice   7043 non-null   object 
 9   onlinesecurity    7043 non-null   object 
 10  onlinebackup      7043 non-null   object 
 11  deviceprotection  7043 non-null   object 
 12  techsupport       7043 non-null   object 
 13  streamingtv       7043 non-null   object 
 14  streamingmovies   7043 non-null   object 
 15  contract          7043 non-null   object 
 16  paperlessbilling  7043 non-null   object 


In [6]:
#convert the totalcharges column to numeric
df['totalcharges'] = pd.to_numeric(df['totalcharges'],errors = 'coerce')

df['totalcharges'] = df['totalcharges'].fillna(0)

In [7]:
#clean string columns
string_columns = list(df.dtypes[df.dtypes==object].index)

for s in string_columns:
    df[s]=df[s].str.lower().str.replace(' ','_')

In [8]:
#clean churn column
df['churn']=(df['churn']=='yes').astype(int)
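
As a quick sanity check, the cleaning steps above can be replayed on a tiny hand-made frame. This is only a sketch: the three rows in `toy` are invented for illustration (they are not taken from the dataset), and it assumes that a non-numeric `TotalCharges` entry looks like a blank string, which `errors='coerce'` turns into `NaN` before `fillna(0)` replaces it.

```python
import pandas as pd

# Hypothetical rows, invented for illustration (not from the Telco dataset)
toy = pd.DataFrame({
    "Payment Method": ["Electronic check", "Mailed check", "Bank transfer (automatic)"],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "Yes", "No"],
})

# 1. uniform column names: lowercase, spaces replaced by underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# 2. strings that cannot be parsed as numbers become NaN, then 0
toy["totalcharges"] = pd.to_numeric(toy["totalcharges"], errors="coerce").fillna(0)

# 3. uniform string values (string columns listed explicitly for this toy frame)
for col in ["payment_method", "churn"]:
    toy[col] = toy[col].str.lower().str.replace(" ", "_")

# 4. target as 0/1
toy["churn"] = (toy["churn"] == "yes").astype(int)

print(toy)
```

Step 4 only works after step 3, because the comparison is against the lowercase string `'yes'`.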

### Data Validation

Split the data into training, validation and testing as 60%, 20% and 20% respectively.

Use sklearn.model_selection
```python
from sklearn.model_selection import train_test_split
```

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
df_full_train, df_test = train_test_split(df,test_size=0.2,random_state=1)

In [11]:
len(df_full_train),len(df_test)

(5634, 1409)

In [12]:
df_train,df_val = train_test_split(df_full_train,test_size=0.25,random_state=1)

In [13]:
len(df_train),len(df_val),len(df_test)

(4225, 1409, 1409)
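
The second call uses `test_size=0.25` because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2 of the original rows. The arithmetic can be sketched without scikit-learn, assuming the held-out size is rounded up to a whole row (an assumption about the rounding, chosen because it reproduces the lengths printed above).

```python
import math

n_total = 7043  # rows reported by df.info()

# first split: 20% held out for testing
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# second split: 25% of the remaining 80% is 20% of the original data
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```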

In [14]:
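# reset_index returns a new DataFrame, so assign the result back
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_test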

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,8879-zkjof,female,0,no,no,41,yes,no,dsl,yes,...,yes,yes,yes,yes,one_year,yes,bank_transfer_(automatic),79.85,3320.75,0
1,0201-mibol,female,1,no,no,66,yes,yes,fiber_optic,yes,...,no,no,yes,yes,two_year,yes,bank_transfer_(automatic),102.40,6471.85,0
2,1600-dilpe,female,0,no,no,12,yes,no,dsl,no,...,no,no,no,no,month-to-month,yes,bank_transfer_(automatic),45.00,524.35,0
3,8601-qacrs,female,0,no,no,5,yes,yes,dsl,no,...,no,no,no,no,month-to-month,yes,mailed_check,50.60,249.95,1
4,7919-zodzz,female,0,yes,yes,10,yes,no,dsl,no,...,yes,no,no,yes,one_year,yes,mailed_check,65.90,660.05,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1404,5130-iekqt,male,1,no,no,25,yes,yes,fiber_optic,no,...,yes,no,yes,yes,month-to-month,no,mailed_check,105.95,2655.25,1
1405,4452-rohmo,female,0,no,no,15,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.60,331.60,0
1406,6164-haqtx,male,0,no,no,71,no,no_phone_service,dsl,yes,...,yes,yes,yes,no,two_year,no,bank_transfer_(automatic),53.95,3888.65,0
1407,3982-dqlus,male,1,yes,yes,65,yes,yes,fiber_optic,yes,...,no,no,no,no,month-to-month,yes,electronic_check,85.75,5688.45,0


In [15]:
# prepare target variables
y_train = df_train['churn']
y_val = df_val['churn']
y_test = df_test['churn']

In [16]:
del df_train['churn']
del df_val['churn']
del df_test['churn']

### Exploratory Data Analysis

- Check for missing values
- Look at the distribution of the target variable (churn)

In [17]:
df_full_train = df_full_train.reset_index(drop=True)

In [18]:
df_full_train

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.70,258.35,0
1,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,...,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.90,3160.55,1
2,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,...,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
3,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
4,2364-ufrom,male,0,no,no,30,yes,no,dsl,yes,...,no,yes,yes,no,one_year,no,electronic_check,70.40,2044.75,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5629,0781-lkxbr,male,1,no,no,9,yes,yes,fiber_optic,no,...,yes,no,yes,yes,month-to-month,yes,electronic_check,100.50,918.60,1
5630,3507-gasnp,male,0,no,yes,60,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.95,1189.90,0
5631,8868-wozgu,male,0,no,no,28,yes,yes,fiber_optic,no,...,yes,no,yes,yes,month-to-month,yes,electronic_check,105.70,2979.50,1
5632,1251-krreg,male,0,no,no,2,yes,yes,dsl,no,...,no,no,no,no,month-to-month,yes,mailed_check,54.40,114.10,1


In [19]:
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [20]:
df_full_train['churn'].value_counts(normalize=True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

In [21]:
global_churn_rate = df_full_train['churn'].mean()
round(global_churn_rate,2)

np.float64(0.27)

In [22]:
numerical  = ['tenure','monthlycharges','totalcharges']

In [23]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice', 'multiplelines', 
               'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 
               'techsupport', 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod']

In [24]:
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## Feature importance: Churn rate and risk ratio

- Churn rate for each group

In [25]:
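#gender variable
#note: in this cell and the next, group rates are computed on the full df, while global_churn_rate comes from df_full_train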
churn_female = df[df['gender']=='female']['churn'].mean()
churn_male = df[df['gender']=='male']['churn'].mean()
print(f'global_churn_rate: {global_churn_rate}\nchurn_female: {churn_female}\nchurn_male: {churn_male}')

global_churn_rate: 0.26996805111821087
churn_female: 0.26920871559633025
churn_male: 0.2616033755274262


In [26]:
#partner variable
churn_no_partner = df[df['partner']=='no']['churn'].mean()
churn_yes_partner = df[df['partner']=='yes']['churn'].mean()
print(f'global_churn_rate: {global_churn_rate}\nchurn_no_partner: {churn_no_partner}\nchurn_yes_partner: {churn_yes_partner}')

global_churn_rate: 0.26996805111821087
churn_no_partner: 0.32957978577313923
churn_yes_partner: 0.1966490299823633


In [32]:
print(f'global_churn minus female churn rate is {global_churn_rate-churn_female:%}')
print(f'global_churn minus male churn rate is {global_churn_rate-churn_male:%}')

global_churn minus female churn rate is 0.075934%
global_churn minus male churn rate is 0.836468%


In [33]:
print(f'global_churn minus no partners churn rate is {global_churn_rate-churn_no_partner:%}')
print(f'global_churn minus having partners churn rate is {global_churn_rate-churn_yes_partner:%}')

global_churn minus no partners churn rate is -5.961173%
global_churn minus having partners churn rate is 7.331902%


### Comparing group churn rates with the global churn rate

- For gender, each group's churn rate is within 1 percentage point of the global rate.
- For partner, the gap is much larger: roughly 6 to 7 percentage points in either direction.
- If the global churn rate minus the group churn rate is positive, the group is less likely to churn than average; if it is negative, the group is more likely to churn.

Two key metrics are used to understand churn behavior across customer groups: **Difference** and **Risk Ratio**.

---

#### 1. Difference

**Formula:**
Difference = Global - Group

**Interpretation:**

| Condition | Meaning |
|------------|----------|
| `< 0` | Group is **more likely to churn** (higher churn rate than global average) |
| `> 0` | Group is **less likely to churn** (lower churn rate than global average) |

**Usage:**  
Helps identify whether a specific group performs better or worse than the overall customer base in terms of churn likelihood.
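
A worked example may make the sign convention concrete. The sketch below plugs in the rounded global churn rate and the `partner` group means that this notebook reports for `df_full_train`; the variable names are illustrative only.

```python
global_rate = 0.269968  # global churn rate on df_full_train, rounded
partner_rates = {"no": 0.329809, "yes": 0.205033}  # group means reported for partner

# Difference = Global - Group
diffs = {group: global_rate - rate for group, rate in partner_rates.items()}

for group, diff in diffs.items():
    verdict = "more likely to churn" if diff < 0 else "less likely to churn"
    print(f"partner={group}: diff={diff:+.4f} -> {verdict}")
```

Customers without a partner get a negative difference (they churn more than average), while customers with a partner get a positive one.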

---

#### 2. Risk Ratio

**Formula:**
Risk = Group / Global

**Interpretation:**

| Condition | Meaning |
|------------|----------|
| `> 1` | Group is **more likely to churn** |
| `< 1` | Group is **less likely to churn** |

**Usage:**  
Shows how much more (or less) likely a group is to churn compared to the global average. It is a *relative measure* of churn risk.
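
The same kind of worked example applies to the risk ratio. This sketch uses the rounded global churn rate and the `contract` group means that this notebook reports for `df_full_train`; the names are illustrative.

```python
global_rate = 0.269968  # global churn rate on df_full_train, rounded
contract_rates = {
    "month-to-month": 0.431701,
    "one_year": 0.120573,
    "two_year": 0.028274,
}

# Risk = Group / Global
risk = {group: rate / global_rate for group, rate in contract_rates.items()}

for group, r in risk.items():
    print(f"{group}: risk={r:.2f} ({r - 1:+.0%} relative to the global rate)")
```

A risk of about 1.6 for month-to-month contracts means that group churns roughly 60% more often than the average customer, while two-year contracts churn at about a tenth of the average rate.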

---

####  Summary

| Metric | Formula | When High | When Low |
|---------|----------|------------|-----------|
| **Difference** | Global - Group | Less likely to churn | More likely to churn |
| **Risk Ratio** | Group / Global | More likely to churn | Less likely to churn |

---

These metrics are useful for identifying customer segments that need attention in retention strategies or campaigns.
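
One way to turn these metrics into a shortlist of segments is to stack the per-feature tables and sort by risk. The following is a minimal sketch on invented toy data (the `toy` frame and the `feature=value` labels are assumptions for illustration, not part of the original analysis).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "contract": ["monthly", "monthly", "monthly", "yearly", "yearly", "yearly"],
    "partner":  ["no", "no", "yes", "yes", "no", "yes"],
    "churn":    [1, 1, 0, 0, 1, 0],
})
global_rate = toy["churn"].mean()

frames = []
for col in ["contract", "partner"]:
    g = toy.groupby(col)["churn"].agg(["mean", "count"])
    g["diff"] = global_rate - g["mean"]
    g["risk"] = g["mean"] / global_rate
    # label each row as "feature=value" so rows from different features can be stacked
    g.index = [f"{col}={value}" for value in g.index]
    frames.append(g)

# highest-risk segments first
segments = pd.concat(frames).sort_values("risk", ascending=False)
print(segments)
```

Keeping `count` next to `risk` helps avoid over-reading a high ratio that comes from a very small group.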


In [43]:
from IPython.display import display

In [47]:
# use groupby to compute churn rate, difference and risk ratio for every categorical feature
for c in categorical:
    print(c)
    df_group = df_full_train.groupby(c)['churn'].agg(['mean','count'])
    df_group['diff'] = global_churn_rate-df_group['mean']
    df_group['risk'] = df_group['mean']/global_churn_rate
    display(df_group)
    print()

gender


gender,mean,count,diff,risk
female,0.276824,2796,-0.006856,1.025396
male,0.263214,2838,0.006755,0.97498



seniorcitizen


seniorcitizen,mean,count,diff,risk
0,0.24227,4722,0.027698,0.897403
1,0.413377,912,-0.143409,1.531208



partner


partner,mean,count,diff,risk
no,0.329809,2932,-0.059841,1.221659
yes,0.205033,2702,0.064935,0.759472



dependents


dependents,mean,count,diff,risk
no,0.31376,3968,-0.043792,1.162212
yes,0.165666,1666,0.104302,0.613651



phoneservice


phoneservice,mean,count,diff,risk
no,0.241316,547,0.028652,0.89387
yes,0.273049,5087,-0.003081,1.011412



multiplelines


multiplelines,mean,count,diff,risk
no,0.257407,2700,0.012561,0.953474
no_phone_service,0.241316,547,0.028652,0.89387
yes,0.290742,2387,-0.020773,1.076948



internetservice


internetservice,mean,count,diff,risk
dsl,0.192347,1934,0.077621,0.712482
fiber_optic,0.425171,2479,-0.155203,1.574895
no,0.077805,1221,0.192163,0.288201



onlinesecurity


onlinesecurity,mean,count,diff,risk
no,0.420921,2801,-0.150953,1.559152
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.153226,1612,0.116742,0.56757



onlinebackup


onlinebackup,mean,count,diff,risk
no,0.404323,2498,-0.134355,1.497672
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.217232,1915,0.052736,0.80466



deviceprotection


deviceprotection,mean,count,diff,risk
no,0.395875,2473,-0.125907,1.466379
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.230412,1940,0.039556,0.85348



techsupport


techsupport,mean,count,diff,risk
no,0.418914,2781,-0.148946,1.551717
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.159926,1632,0.110042,0.59239



streamingtv


streamingtv,mean,count,diff,risk
no,0.342832,2246,-0.072864,1.269897
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.302723,2167,-0.032755,1.121328



streamingmovies


streamingmovies,mean,count,diff,risk
no,0.338906,2213,-0.068938,1.255358
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.307273,2200,-0.037305,1.138182



contract


contract,mean,count,diff,risk
month-to-month,0.431701,3104,-0.161733,1.599082
one_year,0.120573,1186,0.149395,0.446621
two_year,0.028274,1344,0.241694,0.10473



paperlessbilling


paperlessbilling,mean,count,diff,risk
no,0.172071,2313,0.097897,0.637375
yes,0.338151,3321,-0.068183,1.25256



paymentmethod


paymentmethod,mean,count,diff,risk
bank_transfer_(automatic),0.168171,1219,0.101797,0.622928
credit_card_(automatic),0.164339,1217,0.10563,0.608733
electronic_check,0.45589,1893,-0.185922,1.688682
mailed_check,0.19387,1305,0.076098,0.718121





## Feature importance: Mutual information

Mutual information is a concept from information theory, which measures how much we can learn about one variable if we know the value of another.

In [48]:
from sklearn.metrics import mutual_info_score

In [57]:
def mutual_info_score_churn(series):
    return mutual_info_score(series,df_full_train['churn'])

In [64]:
mi = df_full_train[categorical].apply(mutual_info_score_churn)

mi.sort_values(ascending=False)


contract            0.098320
onlinesecurity      0.063085
techsupport         0.061032
internetservice     0.055868
onlinebackup        0.046923
deviceprotection    0.043453
paymentmethod       0.043210
streamingtv         0.031853
streamingmovies     0.031581
paperlessbilling    0.017589
dependents          0.012346
partner             0.009968
seniorcitizen       0.009410
multiplelines       0.000857
phoneservice        0.000229
gender              0.000117
dtype: float64

#### What is Mutual Information?

**Mutual Information (MI)** measures how much knowing one variable reduces uncertainty about another.  
In churn analysis, it tells us **how informative a feature is** for predicting whether a customer will churn.

The higher the mutual information (MI), the stronger the relationship between that feature and churn.

MI is measured in bits when log base 2 is used. scikit-learn's `mutual_info_score` uses the natural logarithm, so the values above are in nats, not bits.
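
To make the definition concrete, MI can be computed by hand from the joint distribution of two discrete variables: it is the sum over all value pairs of p(x, y) · ln(p(x, y) / (p(x) · p(y))). The sketch below is a hand-rolled illustration on invented toy data, not scikit-learn's implementation; on the same inputs `mutual_info_score` should agree up to floating-point error. The function name `mutual_info_nats` is hypothetical.

```python
import numpy as np
import pandas as pd

def mutual_info_nats(x, y):
    """Mutual information of two discrete series, in nats (natural log)."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()  # p(x, y)
    px = joint.sum(axis=1, keepdims=True)                 # p(x), column vector
    py = joint.sum(axis=0, keepdims=True)                 # p(y), row vector
    independent = px @ py                                 # p(x) * p(y)
    nonzero = joint > 0                                   # 0 * log(0) is treated as 0
    return float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())

# Hypothetical toy data, invented for illustration
churn = pd.Series([1, 1, 0, 0])
perfect = pd.Series(["a", "a", "b", "b"])  # determines churn completely
useless = pd.Series(["a", "b", "a", "b"])  # independent of churn

mi_perfect = mutual_info_nats(perfect, churn)
mi_useless = mutual_info_nats(useless, churn)

print(mi_perfect, mi_perfect / np.log(2))  # nats, then bits
print(mi_useless)
```

Dividing a value in nats by ln 2 converts it to bits; a feature that fully determines a balanced binary target carries ln 2 nats, which is 1 bit.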

## Feature importance: Correlation

For numerical columns, we measure the Pearson correlation of each feature with the churn target.


In [70]:
df_full_train[numerical].corrwith(df_full_train['churn'])

tenure           -0.351885
monthlycharges    0.196805
totalcharges     -0.196353
dtype: float64
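
With a 0/1 target, a negative coefficient means that higher values of the feature go with fewer churners, and a positive one means the opposite. The sketch below illustrates this on invented toy data (the values in `toy` are hypothetical) and checks that `corrwith`, which uses Pearson correlation by default, matches `np.corrcoef`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: short-tenure, high-charge customers churn
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60],
    "monthlycharges": [90.0, 80.0, 85.0, 30.0, 40.0, 20.0],
    "churn": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each numerical column with the 0/1 target
corr = toy[["tenure", "monthlycharges"]].corrwith(toy["churn"])

# the same coefficient for tenure, computed directly with NumPy
manual = np.corrcoef(toy["tenure"], toy["churn"])[0, 1]

print(corr)
print(manual)
```

In the real data the pattern has the same signs: longer tenure goes with less churn, and higher monthly charges go with more churn.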