# Churn Prediction

- Binary classification

$$g(x_{i}) \approx y_{i}$$

$$y_{i} \in \{0, 1\}$$

- $1$: Churn
- $0$: No Churn

**Dataset**:

[telco-customer-churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)


Install packages


In [None]:
!uv pip install -q \
    python-dotenv==1.2.1 \
    pandas==2.3.2 \
    pandas-stubs==2.3.2.250827 \
    numpy==2.3.2 \
    matplotlib==3.10.6 \
    seaborn==0.13.2 \
    scikit-learn==1.7.1

Append notebooks directory to sys.path


In [None]:
import sys

sys.path.append("../../..")

Import packages


In [214]:
import os
import pathlib
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
from typing import Tuple
import numpy as np
import seaborn as sns
import datetime
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from notebooks.python.utils.data_extraction.data_extraction import (
    KaggleDataExtractor,
    KaggleExtractionConfig,
)

pd.set_option("display.max_columns", None)

sns.set_theme(style="darkgrid")

%matplotlib inline

load_dotenv()  # Root directory .env file

True

## Utility scripts

**KaggleDataExtractor**:

```py
--8<-- "docs/notebooks/python/utils/data_extraction/data_extraction.py"
```


Create data directory


In [None]:
DATA_DIR = pathlib.Path("data/predicting-customer-churn")

os.makedirs(DATA_DIR, exist_ok=True)

Download dataset from Kaggle


In [None]:
username = os.getenv("KAGGLE_USERNAME")
api_token = os.getenv("KAGGLE_API_TOKEN")
file_name = "WA_Fn-UseC_-Telco-Customer-Churn.csv"

extractor = KaggleDataExtractor(username=username, api_token=api_token)

config = KaggleExtractionConfig(
    dataset_slug="blastchar/telco-customer-churn",
    file_name=file_name,
    destination_path=DATA_DIR,
    output_file_name="churn.csv",
)

if not os.path.isfile(DATA_DIR / "churn.csv"):
    extractor.download_dataset(config)

## Data Preparation


Load dataset


In [None]:
df = pd.read_csv(DATA_DIR / "churn.csv")

df.head(n=2)

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |


Inspect all columns at once


In [None]:
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK |
| gender | Female | Male | Male |
| SeniorCitizen | 0 | 0 | 0 |
| Partner | Yes | No | No |
| Dependents | No | No | No |
| tenure | 1 | 34 | 2 |
| PhoneService | No | Yes | Yes |
| MultipleLines | No phone service | No | No |
| InternetService | DSL | DSL | DSL |
| OnlineSecurity | No | Yes | Yes |


In [None]:
df_summary = pd.DataFrame(
    {
        "column": df.columns,
        "dtype": [df[col].dtype for col in df.columns],
        "sample_unique": [df[col].unique()[:6] for col in df.columns],
        "n_unique": [df[col].nunique() for col in df.columns],
    }
)
df_summary

| | column | dtype | sample_unique | n_unique |
|---|---|---|---|---|
| 0 | customerID | object | [7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOC... | 7043 |
| 1 | gender | object | [Female, Male] | 2 |
| 2 | SeniorCitizen | int64 | [0, 1] | 2 |
| 3 | Partner | object | [Yes, No] | 2 |
| 4 | Dependents | object | [No, Yes] | 2 |
| 5 | tenure | int64 | [1, 34, 2, 45, 8, 22] | 73 |
| 6 | PhoneService | object | [No, Yes] | 2 |
| 7 | MultipleLines | object | [No phone service, No, Yes] | 3 |
| 8 | InternetService | object | [DSL, Fiber optic, No] | 3 |
| 9 | OnlineSecurity | object | [No, Yes, No internet service] | 3 |


Clean column names


In [None]:
df.columns = df.columns.str.lower().str.replace(" ", "_")

df.head(n=2)

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |


Select only object type columns


In [None]:
object_type_columns = list(df.dtypes[df.dtypes == "object"].index)
object_type_columns

['customerid',
 'gender',
 'partner',
 'dependents',
 'phoneservice',
 'multiplelines',
 'internetservice',
 'onlinesecurity',
 'onlinebackup',
 'deviceprotection',
 'techsupport',
 'streamingtv',
 'streamingmovies',
 'contract',
 'paperlessbilling',
 'paymentmethod',
 'totalcharges',
 'churn']

Clean the string values in the object type columns


In [None]:
object_type_columns = list(df.dtypes[df.dtypes == "object"].index)
for column in object_type_columns:
    df[column] = df[column].str.lower().str.replace(" ", "_")
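
To make the effect of this normalisation concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with mixed case and spaces.
toy = pd.DataFrame(
    {
        "Payment Method": ["Electronic check", "Mailed check"],
        "Total Charges": ["29.85", " "],
    }
)

# Normalise the column names first, then every string value.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")
for column in toy.columns:
    toy[column] = toy[column].str.lower().str.replace(" ", "_")

print(toy)
```

Note that a value consisting of a single space becomes `_`, so a blank entry is still a non-numeric string after this step.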

Inspect the values of total charges; the column should be numeric


In [None]:
df.totalcharges[:5]

0      29.85
1     1889.5
2     108.15
3    1840.75
4     151.65
Name: totalcharges, dtype: object

Cast total charges to numeric type


In [None]:
total_charges = pd.to_numeric(df.totalcharges, errors="coerce")
total_charges[:5]

0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
Name: totalcharges, dtype: float64

Check for null values


In [None]:
total_charges.loc[total_charges.isnull()][:5]

488    NaN
753    NaN
936    NaN
1082   NaN
1340   NaN
Name: totalcharges, dtype: float64

Fill the null values with 0


In [None]:
df.totalcharges = total_charges.fillna(0)
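
The coercion and fill steps can be illustrated on a toy series. This sketch assumes the unparseable entries are placeholder strings such as `_`; the values are made up.

```python
import pandas as pd

raw = pd.Series(["29.85", "_", "108.15"], name="totalcharges")

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising a ValueError.
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isnull().sum())

# Replace the NaN entries with 0, as in the cell above.
filled = numeric.fillna(0)

print(n_missing)
print(filled.tolist())
```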

Check churn field values


In [None]:
df.churn[:5]

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

Encode churn field to binary


In [None]:
(df.churn == "yes").astype(int)[:5]

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

Set original churn dataset column to binary


In [None]:
df.churn = (df.churn == "yes").astype(int)

## Validation Framework


Set split sizes

- Training dataset: 60%
- Validation dataset: 20%
- Test dataset: 20%


Split dataset into full train (train + validation) and test


In [None]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

Get the dataset sizes


In [None]:
len(df_full_train), len(df_test)

(5634, 1409)

Calculate what the train and validation dataset sizes should be


In [None]:
print(
    f"df_full_train size: {(100 - 20)/100.:.0%}\n"
    f"df_test size: {(20)/100.:.0%}\n"
    f"df_train size: 60% of 80% = {(60)/80.:.0%}\n"
    f"df_validation size: 20% of 80% = {(20)/80.:.0%}\n"
)

df_full_train size: 80%
df_test size: 20%
df_train size: 60% of 80% = 75%
df_validation size: 20% of 80% = 25%



Split full train dataset into train and validation datasets


In [None]:
df_train, df_validation = train_test_split(
    df_full_train, test_size=0.25, random_state=1
)

Get the size of each dataset


In [None]:
len(df_train), len(df_validation), len(df_test)

(4225, 1409, 1409)
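
The two-step split can also be wrapped in a small helper. This is a sketch rather than part of the notebook: the function name `split_train_validation_test` and its parameters are hypothetical, and it derives the second `test_size` from the target fractions instead of hard-coding 0.25.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_validation_test(
    df: pd.DataFrame,
    validation_size: float = 0.2,
    test_size: float = 0.2,
    random_state: int = 1,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split df into train, validation and test parts (hypothetical helper)."""
    df_full_train, df_test = train_test_split(
        df, test_size=test_size, random_state=random_state
    )
    # The validation share must be expressed relative to the full train
    # part: 0.2 of the whole dataset is 0.2 / 0.8 = 0.25 of what remains.
    relative_validation_size = validation_size / (1 - test_size)
    df_train, df_validation = train_test_split(
        df_full_train,
        test_size=relative_validation_size,
        random_state=random_state,
    )
    return df_train, df_validation, df_test


# Toy data: 100 rows, so the expected sizes are easy to read off.
toy = pd.DataFrame({"x": range(100)})
train, validation, test = split_train_validation_test(toy)
print(len(train), len(validation), len(test))
```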

Reset the dataset indexes


In [None]:
df_train.reset_index(drop=True, inplace=True)
df_validation.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

Get target variables


In [None]:
y_train = df_train["churn"]
y_validation = df_validation["churn"]
y_test = df_test["churn"]

Remove target variables from original datasets


In [None]:
df_train.drop(columns=["churn"], inplace=True)
df_validation.drop(columns=["churn"], inplace=True)
df_test.drop(columns=["churn"], inplace=True)
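
An alternative sketch of the same step that avoids in-place mutation. The helper name `split_features_target` is hypothetical and not part of the notebook; returning new objects keeps the input frame intact, which can be convenient when the frame is reused later.

```python
import pandas as pd


def split_features_target(
    df: pd.DataFrame, target: str = "churn"
) -> tuple[pd.DataFrame, pd.Series]:
    """Return (features, target) without modifying df (hypothetical helper)."""
    features = df.drop(columns=[target]).reset_index(drop=True)
    y = df[target].reset_index(drop=True)
    return features, y


# Toy frame with a shuffled index, as produced by a random split.
toy = pd.DataFrame({"tenure": [1, 34, 2], "churn": [0, 0, 1]}, index=[7, 3, 9])
features, y = split_features_target(toy)
print(list(features.columns), y.tolist())
```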

## Exploratory Data Analysis


Reset full train dataset index


In [None]:
df_full_train.reset_index(drop=True, inplace=True)

Inspect dataset


In [None]:
df_full_train.head(n=2)

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5442-pptjy | male | 0 | yes | yes | 12 | yes | no | no | no_internet_service | no_internet_service | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.7 | 258.35 | 0 |
| 1 | 6261-rcvns | female | 0 | no | no | 42 | yes | no | dsl | yes | yes | yes | yes | no | yes | one_year | no | credit_card_(automatic) | 73.9 | 3160.55 | 1 |


Check if null values are present


In [None]:
df.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

Inspect target variable churn


In [None]:
df_full_train.churn.value_counts()

churn
0    4113
1    1521
Name: count, dtype: int64

Get percent of total


In [None]:
df_full_train.churn.value_counts(normalize=True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

Get mean


In [None]:
df_full_train.churn.mean()  # number of ones divided by total

np.float64(0.26996805111821087)

The mean and the share of ones are the same for churn because the column is encoded as binary: both calculations are _the number of ones divided by the total_.


In [None]:
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)

np.float64(0.27)
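
A quick sketch of why the mean of a binary column equals the share of ones, using made-up labels rather than the dataset:

```python
import pandas as pd

churn = pd.Series([0, 1, 0, 0, 1], name="churn")  # made-up labels

# Three ways to get the same number: 2 ones out of 5 rows.
proportion_of_ones = churn.value_counts(normalize=True).loc[1]
mean_value = churn.mean()
manual = (churn == 1).sum() / len(churn)

print(proportion_of_ones, mean_value, manual)
```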

Inspect columns types


In [None]:
df_full_train.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

Set numerical columns


In [None]:
numerical_columns = ["tenure", "monthlycharges", "totalcharges"]

Set categorical columns


In [None]:
categorical_columns = [
    "gender",
    "seniorcitizen",
    "partner",
    "dependents",
    "phoneservice",
    "multiplelines",
    "internetservice",
    "onlinesecurity",
    "onlinebackup",
    "deviceprotection",
    "techsupport",
    "streamingtv",
    "streamingmovies",
    "contract",
    "paperlessbilling",
    "paymentmethod",
]

Inspect categorical columns


In [None]:
df_full_train[categorical_columns].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## Feature Importance

### Churn Rate

**Difference**:

- (group_churn_rate - global_churn_rate) > 0: More likely to churn
- (group_churn_rate - global_churn_rate) < 0: Less likely to churn

**Risk Ratio**:

- (group_churn_rate / global_churn_rate) > 1: More likely to churn
- (group_churn_rate / global_churn_rate) < 1: Less likely to churn


In [None]:
# The global churn rate is the same for every column, so compute it once.
global_churn_rate = df_full_train.churn.mean()
df_groups = []

for column in categorical_columns:
    df_group = (
        df_full_train[[column, "churn"]]
        .groupby(column)
        .churn.agg(["mean", "count"])
    )
    df_group["diff"] = df_group["mean"] - global_churn_rate
    df_group["risk"] = df_group["mean"] / global_churn_rate
    df_group = df_group.reset_index().rename(columns={column: "label"})
    df_group.insert(0, "column", column)
    df_groups.append(df_group)

result = pd.concat(df_groups, ignore_index=True)
result

| | column | label | mean | count | diff | risk |
|---|---|---|---|---|---|---|
| 0 | gender | female | 0.276824 | 2796 | 0.006856 | 1.025396 |
| 1 | gender | male | 0.263214 | 2838 | -0.006755 | 0.97498 |
| 2 | seniorcitizen | 0 | 0.24227 | 4722 | -0.027698 | 0.897403 |
| 3 | seniorcitizen | 1 | 0.413377 | 912 | 0.143409 | 1.531208 |
| 4 | partner | no | 0.329809 | 2932 | 0.059841 | 1.221659 |
| 5 | partner | yes | 0.205033 | 2702 | -0.064935 | 0.759472 |
| 6 | dependents | no | 0.31376 | 3968 | 0.043792 | 1.162212 |
| 7 | dependents | yes | 0.165666 | 1666 | -0.104302 | 0.613651 |
| 8 | phoneservice | no | 0.241316 | 547 | -0.028652 | 0.89387 |
| 9 | phoneservice | yes | 0.273049 | 5087 | 0.003081 | 1.011412 |
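
As a worked check of one row in the table, the sketch below recomputes `diff` and `risk` for senior citizens from the rounded values shown above. The helper name `compare_to_global` is hypothetical.

```python
def compare_to_global(
    group_rate: float, global_rate: float
) -> tuple[float, float, str]:
    """Return (diff, risk, verdict) for one group (hypothetical helper)."""
    diff = group_rate - global_rate
    risk = group_rate / global_rate
    if risk > 1:
        verdict = "more likely to churn"
    elif risk < 1:
        verdict = "less likely to churn"
    else:
        verdict = "same as the global rate"
    return diff, risk, verdict


# Rounded values: the seniorcitizen == 1 group mean and the global churn rate.
diff, risk, verdict = compare_to_global(group_rate=0.413377, global_rate=0.269968)
print(f"diff={diff:.3f} risk={risk:.2f} -> {verdict}")
```

A positive `diff` always goes together with a `risk` above 1, so the two measures agree on direction; `risk` additionally expresses the gap relative to the global rate.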


### Mutual information

How much we can learn about one variable if we know the value of another.
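
A minimal sketch of how this could be computed with scikit-learn's `mutual_info_score`, assuming the score is taken between each categorical column and the target. The toy frame below is made up, and applying the same idea to `df_full_train[categorical_columns]` is an assumption about how the analysis continues.

```python
import math

import pandas as pd
from sklearn.metrics import mutual_info_score

# Made-up data: "contract" determines churn exactly, "gender" is unrelated.
toy = pd.DataFrame(
    {
        "contract": ["monthly", "monthly", "yearly", "yearly"],
        "gender": ["female", "male", "female", "male"],
        "churn": [1, 1, 0, 0],
    }
)


def mutual_info_churn_score(series: pd.Series) -> float:
    """Mutual information (in nats) between one column and churn."""
    return mutual_info_score(series, toy["churn"])


# One score per column; higher means the column tells us more about churn.
mi = toy[["contract", "gender"]].apply(mutual_info_churn_score)
print(mi.sort_values(ascending=False))
```

In this toy case `contract` reaches the maximum possible score for a balanced binary target, ln 2, while `gender` scores 0 because knowing it tells us nothing about churn.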
