# Churn prediction project

Churn is when customers stop using the services of a company. So churn prediction is about identifying customers who are likely to cancel their contracts soon. If a company can do that, it can offer discounts on these services in an effort to keep the users.

We can use machine learning to detect churn. We can take past data about customers who churned and build a model from it that identifies current customers who are about to leave. This is a binary classification problem: the target variable that we want to predict is categorical and has only two possible outcomes, churn or not churn.

Logistic regression is one of the simplest supervised machine learning models that can be used for binary classification. It's fast and easy to understand, and its results are easy to interpret. It's also one of the most widely used models in the industry.

Imagine that we are working at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They are no longer using our services and are going to a different provider. We would like to prevent that from happening, so we develop a system for identifying these customers and offer them an incentive to stay. We want to target them with promotional messages and give them a discount. We would also like to understand why the model thinks our customers churn, and for that, we need to be able to interpret the model’s predictions.

We have collected a dataset where we’ve recorded some information about our customers: what type of services they used, how much they paid, and how long they stayed with us. We also know who canceled their contracts and stopped using our services (churned). We will use this information as the target variable in the machine learning model and predict it using all other available information.

The plan for the project is as follows:

1. Get a dataset and do initial preparation by renaming columns and changing values inside the columns to be consistent throughout the dataset.
2. Split the data into train, validation, and test sets so we can validate our models.
3. As part of the initial data analysis, look at feature importance to identify which features are important in our data.
4. Transform categorical variables into numeric variables so we can use them in the model.
5. Train a logistic regression model.

We will use Scikit-learn in this project for splitting the dataset into train, validation, and test sets, encoding categorical variables, and training logistic regression.

### 3.1.2. Initial data preparation

In [106]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [107]:
df = pd.read_csv("Telco-Customer-Churn.csv")

In [108]:
len(df)

7043

The dataset is not large, but should be enough to train a decent model.

In [109]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [110]:
len(df.columns)

21

This dataframe has 21 columns, so they don't all fit on the screen. We can transpose the dataframe using the T attribute so that columns become rows and we can see a lot more data.

In [111]:
df.head().T

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


When reading a CSV file, Pandas tries to automatically determine the proper type of each column, but it's not always correct.

In [112]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

TotalCharges is an object when it should be numeric. This is because some rows contain a space (" ") to represent a missing value.

We can force this column to be numeric by converting it to numbers using a special function in Pandas: to_numeric. Using the errors="coerce" option allows us to handle nonnumeric data such as spaces that would normally give an error. This way Pandas will replace all nonnumeric values with a NaN (not a number).

In [113]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(0)
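To see what errors="coerce" does in isolation, here is a minimal sketch on a made-up three-element series. The values are illustrative only and merely mimic the format of the TotalCharges column:

```python
import pandas as pd

# Toy column in the same format as TotalCharges: numbers stored as strings,
# with a single space standing in for a missing value.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isnull().sum())

# Replace the NaN with zero, as we do for the real column.
filled = converted.fillna(0)

print(n_missing)
print(filled.tolist())
```

Filling with zero is a simple choice that keeps every row; other strategies (such as filling with the mean) are possible, but we don't explore them here.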

Let's make the column names uniform by lowercasing them and replacing spaces with underscores. We do the same for the values of all string columns, which removes the inconsistencies in the data.

In [114]:
df.columns = df.columns.str.lower().str.replace(" ", "_")
string_columns = list(df.dtypes[df.dtypes == "object"].index)
for column in string_columns:
    df[column] = df[column].str.lower().str.replace(" ", "_")

Let's look at the target variable: churn. It's currently categorical, with two values, "yes" and "no". For binary classification, models typically expect a number: 0 for "no" and 1 for "yes". Let's convert it to numbers:

In [115]:
df.churn

0        no
1        no
2       yes
3        no
4       yes
       ... 
7038     no
7039     no
7040     no
7041    yes
7042     no
Name: churn, Length: 7043, dtype: object

In [116]:
df.churn = (df.churn == "yes").astype(int)

In [117]:
df.churn

0       0
1       0
2       1
3       0
4       1
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: churn, Length: 7043, dtype: int64

Scikit-learn is a Python library for machine learning and its module called model_selection can handle data splitting.

In [118]:
from sklearn.model_selection import train_test_split

The function train_test_split takes a dataframe df and creates two new dataframes: df_train_full and df_test. It shuffles the original dataset and then splits it in such a way that the test set contains 20% of the data and the train set contains the remaining 80%.

In [119]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

The function takes a few parameters:
1. The first parameter that is passed is the dataframe that we want to split: df.
2. The second parameter is test_size, which specifies the size of the dataset we want to set aside for testing, 20% in this case.
3. The third parameter is random_state. It's needed for ensuring that every time we run this code, the dataframe is split in the exact same way.

Shuffling of data is done using a random-number generator; it’s important to fix the random seed to ensure that every time we shuffle the data, the final arrangement of rows will be the same.
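The effect of fixing the seed can be checked on a small example. The following sketch uses a hypothetical ten-row dataframe (not our dataset) and calls train_test_split twice with the same random_state:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe with ten rows.
toy = pd.DataFrame({"x": range(10)})

# Two calls with the same seed produce identical splits.
a_train, a_test = train_test_split(toy, test_size=0.2, random_state=1)
b_train, b_test = train_test_split(toy, test_size=0.2, random_state=1)

same_split = a_test.index.tolist() == b_test.index.tolist()
print(len(a_train), len(a_test), same_split)
```

With a different random_state (or none at all), the rows that end up in the test set would generally differ between runs.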

In [241]:
df_train_full.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
1814,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.7,258.35,0
5946,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,...,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.9,3160.55,1
3881,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,...,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
2389,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
3676,2364-ufrom,male,0,no,no,30,yes,no,dsl,yes,...,no,yes,yes,no,one_year,no,electronic_check,70.4,2044.75,0


Let's take the df_train_full dataframe and split it one more time into train and validation:

In [121]:
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)
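Because the second split takes 33% of the remaining 80%, the final proportions are roughly 54% train, 26% validation, and 20% test. Here is a quick arithmetic check, assuming the held-out part of each split is rounded up (an assumption that reproduces the row counts shown elsewhere in this notebook):

```python
import math

n_total = 7043  # rows in the full dataset

# First split: 20% goes to the test set.
n_test = math.ceil(n_total * 0.2)
n_train_full = n_total - n_test

# Second split: 33% of the remainder goes to the validation set.
n_val = math.ceil(n_train_full * 0.33)
n_train = n_train_full - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```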

In [122]:
y_train = df_train.churn.values
y_val = df_val.churn.values

In [123]:
# del df_train["churn"]
# del df_val["churn"]

### 3.1.3. Exploratory data analysis

The more we know about the data and the problems inside, the better model we can build afterward.

We should always check for any missing values in the dataset because many machine learning models cannot easily deal with missing data.

Let's see if we need to perform any additional null handling:

In [124]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

It prints all zeros, so we have no missing values in the dataset and don't need to do anything extra.

Another thing we should do is check the distribution of values in the target variable. Let's see using the value_counts() method:

In [125]:
df_train_full["churn"].value_counts()

churn
0    4113
1    1521
Name: count, dtype: int64

The first column is the value of the target variable, and the second is the count. Most of the customers didn't churn.

The proportion of churned users, or the probability of churning, has a special name: churn rate.

There's another way to calculate the churn rate aside from doing 1521 / (4113 + 1521): the mean() method.

In [126]:
global_mean = df_train_full["churn"].mean()
round(global_mean, 3)

np.float64(0.27)

This churn dataset is an example of a so-called imbalanced dataset. There were almost three times as many people who didn't churn in our dataset as those who did churn (4113 versus 1521), and we say that the nonchurn class dominates the churn class. The churn rate in our data is 0.27, which is a strong indicator of class imbalance. The opposite of imbalanced is the balanced case, when positive and negative classes are equally distributed among all observations.
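The reason mean() gives the churn rate is that the target contains only zeros and ones, so its sum is the number of churned customers. A small sketch using the counts printed by value_counts() above:

```python
# Counts taken from the value_counts() output above.
n_not_churned = 4113
n_churned = 1521

# Churn rate as a plain proportion.
churn_rate = n_churned / (n_churned + n_not_churned)

# The same number as the mean of a list of 0/1 labels.
labels = [1] * n_churned + [0] * n_not_churned
mean_of_labels = sum(labels) / len(labels)

# How many non-churned customers there are per churned customer.
imbalance_ratio = n_not_churned / n_churned

print(round(churn_rate, 3), round(mean_of_labels, 3), round(imbalance_ratio, 2))
```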

Both the categorical and numerical variables in our dataset are important, but they are also different and need different treatment. For that, we want to look at them separately.

We will create two lists:
- categorical, which will contain the names of categorical variables
- numerical, which, likewise, will have the names of numerical variables

In [127]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']

First, we can see how many unique values each variable has. We already know we should have just a few for each column, but let’s verify it:

In [128]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

We see that most of the columns have two or three values and one (paymentmethod) has four. This is good. We don’t need to spend extra time preparing and cleaning the data; everything is already good to go.

Now we come to another important part of exploratory data analysis: understanding which features may be important for our model.

### 3.1.4. Feature importance

Knowing how other variables affect the target variable, churn, is the key to understanding the data and building a good model. This process is called feature importance analysis, and it's often done as part of exploratory data analysis to figure out which variables will be useful for the model. It also gives us more insight about the dataset and helps answer questions like "What makes customers churn?" and "What are the characteristics of people who churn?"

We have two different kinds of features: categorical and numerical. Each kind has different ways of measuring feature importance, so we will look at each separately.

Let's start by looking at categorical variables. The first thing we can do is look at the churn rate for each variable. We know that a categorical variable has a set of values it can take, and each value defines a group inside the dataset.

We can look at all the distinct values of a variable. Then, for each value, there’s a group of customers: all the customers who have this value. For each such group, we can compute the churn rate, which is the group churn rate. When we have it, we can compare it with the global churn rate, that is, the churn rate calculated for all the observations at once.

If the difference between the rates is small, the value is not important when predicting churn because this group of customers is not really different from the rest of the customers. On the other hand, if the difference is not small, something inside that group sets it apart from the rest. A machine learning algorithm should be able to pick this up and use it when making predictions.

Let’s check first for the gender variable. This gender variable can take two values, female and male. There are two groups of customers: ones that have gender == "female" and ones that have gender == "male". To compute the churn rate for all female customers, we first select only rows that correspond to gender == "female" and then compute the churn rate for them:

In [129]:
female_mean = df_train_full[df_train_full.gender == "female"].churn.mean()
print("gender == female:", round(female_mean, 3))

male_mean = df_train_full[df_train_full.gender == "male"].churn.mean()
print("gender == male:", round(male_mean, 3))

gender == female: 0.277
gender == male: 0.263


When we check the results, we see that the churn rate of female customers is 27.7% and that of male customers is 26.3%, whereas the global churn rate is 27%. The difference between each group rate and the global rate is quite small, which indicates that knowing the gender of the customer doesn’t help us identify whether they will churn.

Let's look at another variable: partner. It takes values of yes and no, so there are two groups of customers: the ones for which partner == "yes" and the ones for which partner == "no".

In [130]:
partner_yes = df_train_full[df_train_full.partner == "yes"].churn.mean()
print("partner == yes:", round(partner_yes, 3))

partner_no = df_train_full[df_train_full.partner == "no"].churn.mean()
print("partner == no :", round(partner_no, 3))

partner == yes: 0.205
partner == no : 0.33


Here we can see that clients with no partner are more likely to churn than the ones with a partner.

In addition to looking at the difference between the group rate and the global rate, it's interesting to look at the ratio between them. In statistics, the ratio between probabilities in different groups is called the risk ratio, where risk refers to the risk of having the effect. In this case, the effect is churn, so it's the risk of churning:

risk = group rate / global rate

For gender == female, for example, the risk of churning is about 1.03:

risk = 27.7% / 27% ≈ 1.03

Risk is a number between zero and infinity. It has a nice interpretation that tells you how likely the elements of the group are to have the effect (churn) compared with the entire population.

If the difference between the group rate and the global rate is small, the risk is close to 1: this group has the same level of risk as the rest of the population. Customers in the group are as likely to churn as anyone else. In other words, a group with a risk close to 1 is not risky at all.

If the risk is lower than 1, the group has lower risks: the churn rate in this group is smaller than the global churn. For example, the value 0.5 means that the clients in this group are half as likely to churn as clients in general.

On the other hand, if the value is higher than 1, the group is risky: there’s more churn in the group than in the population. So a risk of 2 means that customers from the group are two times more likely to churn.
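As a worked example of the formula, the sketch below applies it to the rounded rates for the partner variable printed earlier. The function name risk_ratio is our own, introduced only for illustration:

```python
def risk_ratio(group_rate, global_rate):
    """Ratio between the churn rate of a group and the global churn rate."""
    return group_rate / global_rate


global_rate = 0.27  # global churn rate, rounded

# Rounded group rates for the partner variable.
risk_partner_yes = risk_ratio(0.205, global_rate)
risk_partner_no = risk_ratio(0.33, global_rate)

# Below 1: less churn than average. Above 1: more churn than average.
print(round(risk_partner_yes, 2), round(risk_partner_no, 2))
```

Customers with a partner come out below 1 and customers without a partner above 1, which agrees with the group rates we saw.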

We can calculate the churn rate, the difference, and the risk for the gender variable using code:

In [131]:
global_mean = df_train_full.churn.mean()

df_group = df_train_full.groupby(by="gender").churn.agg(["mean"])
df_group["diff"] = df_group["mean"] - global_mean
df_group["risk"] = df_group["mean"] / global_mean

df_group

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


Let's get the churn rate for all categorical variables:

In [132]:
from IPython.display import display


for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group["diff"] = df_group["mean"] - global_mean
    df_group["risk"] = df_group["mean"] / global_mean
    display(df_group)

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


seniorcitizen,mean,diff,risk
0,0.24227,-0.027698,0.897403
1,0.413377,0.143409,1.531208


partner,mean,diff,risk
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


dependents,mean,diff,risk
no,0.31376,0.043792,1.162212
yes,0.165666,-0.104302,0.613651


phoneservice,mean,diff,risk
no,0.241316,-0.028652,0.89387
yes,0.273049,0.003081,1.011412


multiplelines,mean,diff,risk
no,0.257407,-0.012561,0.953474
no_phone_service,0.241316,-0.028652,0.89387
yes,0.290742,0.020773,1.076948


internetservice,mean,diff,risk
dsl,0.192347,-0.077621,0.712482
fiber_optic,0.425171,0.155203,1.574895
no,0.077805,-0.192163,0.288201


onlinesecurity,mean,diff,risk
no,0.420921,0.150953,1.559152
no_internet_service,0.077805,-0.192163,0.288201
yes,0.153226,-0.116742,0.56757


onlinebackup,mean,diff,risk
no,0.404323,0.134355,1.497672
no_internet_service,0.077805,-0.192163,0.288201
yes,0.217232,-0.052736,0.80466


deviceprotection,mean,diff,risk
no,0.395875,0.125907,1.466379
no_internet_service,0.077805,-0.192163,0.288201
yes,0.230412,-0.039556,0.85348


techsupport,mean,diff,risk
no,0.418914,0.148946,1.551717
no_internet_service,0.077805,-0.192163,0.288201
yes,0.159926,-0.110042,0.59239


streamingtv,mean,diff,risk
no,0.342832,0.072864,1.269897
no_internet_service,0.077805,-0.192163,0.288201
yes,0.302723,0.032755,1.121328


streamingmovies,mean,diff,risk
no,0.338906,0.068938,1.255358
no_internet_service,0.077805,-0.192163,0.288201
yes,0.307273,0.037305,1.138182


contract,mean,diff,risk
month-to-month,0.431701,0.161733,1.599082
one_year,0.120573,-0.149395,0.446621
two_year,0.028274,-0.241694,0.10473


paperlessbilling,mean,diff,risk
no,0.172071,-0.097897,0.637375
yes,0.338151,0.068183,1.25256


paymentmethod,mean,diff,risk
bank_transfer_(automatic),0.168171,-0.101797,0.622928
credit_card_(automatic),0.164339,-0.10563,0.608733
electronic_check,0.45589,0.185922,1.688682
mailed_check,0.19387,-0.076098,0.718121


Two things are different in this code. First, instead of manually specifying the column name, we iterate over all categorical variables.

The second difference is more subtle: we need to call the display function to render a dataframe inside the loop. The way we typically display a dataframe is to leave it as the last line in a Jupyter Notebook cell and then execute the cell. If we do it that way, the dataframe is displayed as the cell output. However, we cannot do this inside a loop. To still be able to see the content of the dataframe, we call the display function explicitly.

This way, just by looking at the differences and the risks, we can identify the most discriminative features: the features that are helpful for detecting churn. Thus, we expect that these features will be useful for our future models.

Customers with month-to-month contracts tend to churn a lot more than customers with other kinds of contracts. This is exactly the kind of relationship we want to find in our data. Without such relationships in data, machine learning models will not work: they will not be able to make predictions. The higher the degree of dependency, the more useful a feature is.

Mutual information tells how much information you learn about one variable if you learn the value of another variable. In machine learning, it's used to measure the mutual dependency between two variables.

Higher values of mutual information mean a higher degree of dependence: if the mutual information between a categorical variable and the target is high, this categorical variable will be quite useful for predicting the target. On the other hand, if the mutual information is close to zero, the categorical variable and the target are nearly independent, and thus the variable will not be useful for predicting the target.
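To build some intuition for the numbers, here is a minimal from-scratch sketch of mutual information for two categorical sequences. It uses the natural logarithm, which is also what Scikit-learn's mutual_info_score uses. The helper name mutual_information and the toy data are our own:

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# The feature fully determines the target: MI equals ln(2) here.
mi_dependent = mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1])

# The feature carries no information about the target: MI is zero.
mi_independent = mutual_information(["a", "a", "b", "b"], [0, 1, 0, 1])

print(round(mi_dependent, 4), round(mi_independent, 4))
```

Real features fall somewhere between these two extremes, so the scores are mostly useful for ranking features against each other.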

Mutual information is already implemented in Scikit-learn in the mutual_info_score function from the metrics package, so we can just use it:

In [133]:
from sklearn.metrics import mutual_info_score

def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name="MI")
df_mi

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923
deviceprotection,0.043453
paymentmethod,0.04321
streamingtv,0.031853
streamingmovies,0.031581
paperlessbilling,0.017589


As we see, contract, onlinesecurity, and techsupport are among the most important features. Indeed, we’ve already noted that contract and techsupport are quite informative. It’s also not surprising that gender is among the least important features, so we shouldn’t expect it to be useful for the model.

Mutual information is a way to quantify the degree of dependency between two categorical variables, but it doesn’t work when one of the features is numerical, so we cannot apply it to the three numerical variables that we have.

We can, however, measure the dependency between a binary target variable and a numerical variable. We can pretend that the binary variable is numerical (containing only the numbers zero and one) and then use the classical methods from statistics to check for any dependency between these variables.

One such method is the correlation coefficient (sometimes referred to as Pearson’s correlation coefficient). It is a value from -1 to 1:
- Positive correlation means that when one variable goes up, the other variable tends to go up as well. In the case of a binary target, when the values of the variable are high, we see ones more often than zeros. But when the values of the variable are low, zeros become more frequent than ones.
- Zero correlation means no linear relationship between two variables: knowing one does not help predict the other with a straight line, although they may still be related in a nonlinear way.
- Negative correlation occurs when one variable goes up and the other goes down. In the binary case, if the values are high, we see more zeros than ones in the target variable. When the values are low, we see more ones.
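The three cases above can be illustrated with a small from-scratch sketch of Pearson's correlation. The helper pearson and the toy numbers are hypothetical and chosen only to show a negative correlation with a binary target, plus a case where correlation is zero even though one variable is fully determined by the other:

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Binary target treated as numbers: short tenures churn, long ones stay.
tenure = [1, 2, 30, 60, 70]
churned = [1, 1, 0, 0, 0]
r_tenure = pearson(tenure, churned)

# y is fully determined by x, yet the linear correlation is zero.
xs = [-2, -1, 0, 1, 2]
squares = [x * x for x in xs]
r_squares = pearson(xs, squares)

print(round(r_tenure, 3), round(r_squares, 3))
```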

It’s very easy to calculate the correlation coefficient in Pandas:

In [134]:
df_train_full[numerical].corrwith(df_train_full.churn)

tenure           -0.351885
monthlycharges    0.196805
totalcharges     -0.196353
dtype: float64

In [139]:
# .to_frame("correlation") just makes it look nicer
df_train_full[numerical].corrwith(df_train_full.churn).to_frame("correlation")

Unnamed: 0,correlation
tenure,-0.351885
monthlycharges,0.196805
totalcharges,-0.196353


These are the correlations between the numerical variables and churn. tenure has the strongest correlation, and it is negative: as tenure grows, the churn rate goes down. monthlycharges has a positive correlation: the more customers pay per month, the more likely they are to churn. totalcharges has a negative correlation: the more customers have paid in total, the less likely they are to leave.

## 3.2. Feature engineering

Feature engineering is the process of selecting, transforming, and constructing features from raw data to create a set of relevant and informative features that are suitable for modeling. We will now transform all categorical variables to numeric features.

### 3.2.1. One-hot encoding for categorical variables

We cannot just take a categorical variable and put it into a machine learning model. The models can deal only with numbers in matrices. So, we need to convert our categorical data into a matrix form, that is, encode it.

One such encoding technique is one-hot encoding.

If a variable such as contract has three possible values (monthly, yearly, and two-year), we can represent a customer with the yearly contract as (0, 1, 0). In this case, the yearly value is active, or hot, so it gets 1, whereas the remaining values are not active, or cold, so they are 0.

To understand this better, let’s consider a case with two categorical variables and see how we create a matrix from them. These variables are:
- gender, with values female and male
- contract, with values monthly, yearly, and two-year

Because the gender variable has only two possible values, we create two columns for it in the resulting matrix. The contract variable has three possible values, so it gets three columns, and in total, our new matrix will have five columns:
- gender=female
- gender=male
- contract=monthly
- contract=yearly
- contract=two-year
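The five-column layout above can be sketched by hand in a few lines. The helper one_hot and the example customer are hypothetical, and the category values are the simplified ones used in this explanation rather than the exact strings in the dataset:

```python
def one_hot(record, categories):
    """Encode a record as a flat list of 0/1 flags, one per (feature, value)."""
    row = []
    for feature, values in categories.items():
        for value in values:
            row.append(1 if record[feature] == value else 0)
    return row


# Column order: gender=female, gender=male,
# contract=monthly, contract=yearly, contract=two-year.
categories = {
    "gender": ["female", "male"],
    "contract": ["monthly", "yearly", "two-year"],
}

customer = {"gender": "female", "contract": "yearly"}
encoded = one_hot(customer, categories)
print(encoded)  # [1, 0, 0, 1, 0]
```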

When the number of features grows, doing one-hot encoding by hand becomes tedious. Luckily, Scikit-learn can perform one-hot encoding in multiple ways. Here we will use DictVectorizer.

DictVectorizer takes in a dictionary and vectorizes it. It creates vectors from the dictionary. Then the vectors are put together as rows of one matrix. This matrix is used as input to a machine learning algorithm.

To use this method, we need to convert our dataframe to a list of dictionaries, which is simple to do in Pandas using the to_dict method with the orient="records" parameter:

In [157]:
train_dict = df_train[categorical + numerical].to_dict(orient="records")

In [158]:
train_dict[0]

{'gender': 'male',
 'seniorcitizen': 0,
 'partner': 'yes',
 'dependents': 'no',
 'phoneservice': 'yes',
 'multiplelines': 'no',
 'internetservice': 'dsl',
 'onlinesecurity': 'yes',
 'onlinebackup': 'yes',
 'deviceprotection': 'yes',
 'techsupport': 'yes',
 'streamingtv': 'yes',
 'streamingmovies': 'yes',
 'contract': 'two_year',
 'paperlessbilling': 'yes',
 'paymentmethod': 'bank_transfer_(automatic)',
 'tenure': 71,
 'monthlycharges': 86.1,
 'totalcharges': 6045.9}

Each column from the dataframe is the key in this dictionary, with values coming from the actual dataframe row values.

Now we can use DictVectorizer. We create it and then fit it to the list of dictionaries we created previously:

In [166]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

In this code we create a DictVectorizer instance, which we call dv and "train" it by invoking the fit method. The fit method looks at the content of these dictionaries and figures out the possible values for each variable and how to map them to the columns in the output matrix. If a feature is categorical, it applies the one-hot encoding scheme, but if a feature is numerical, it's left intact.

The DictVectorizer class can take in a set of parameters. We specify one of them: sparse=False. With this parameter, the vectorizer returns a regular dense NumPy array instead of a sparse matrix.

After we fit the vectorizer, we can use it for converting the dictionaries to a matrix by using the transform method:

In [167]:
X_train = dv.transform(train_dict)

In [168]:
X_train.shape

(3774, 45)

In [169]:
X_train[0]

array([0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
       1.0000e+00, 0.0000e+00, 0.0000e+00, 8.6100e+01, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 7.1000e+01, 6.0459e+03])

The elements are one-hot encoded categorical variables. Not all of them are ones and zeros, however. We see that three of them are other numbers. These are our numeric variables: monthlycharges, tenure, and totalcharges.

In [180]:
dv.get_feature_names_out()

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'dependents=no', 'dependents=yes',
       'deviceprotection=no', 'deviceprotection=no_internet_service',
       'deviceprotection=yes', 'gender=female', 'gender=male',
       'internetservice=dsl', 'internetservice=fiber_optic',
       'internetservice=no', 'monthlycharges', 'multiplelines=no',
       'multiplelines=no_phone_service', 'multiplelines=yes',
       'onlinebackup=no', 'onlinebackup=no_internet_service',
       'onlinebackup=yes', 'onlinesecurity=no',
       'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
       'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
       'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
       'paymentmethod=credit_card_(automatic)',
       'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
       'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
       'streamingmovies=no', 'streamingmovies=no_internet_service',

For each categorical feature, it creates one column for each of its distinct values. For contract, we have contract=month-to-month, contract=one_year, and contract=two_year, and for dependents, we have dependents=no and dependents=yes. Features such as tenure and totalcharges keep the original names because they are numerical; therefore, DictVectorizer doesn't change them.

## 3.3. Machine learning for classification

Once we have the feature matrix, we are ready to train the model.

### 3.3.1. Logistic regression

Logistic regression has a lot in common with linear regression. Logistic regression is also a linear model, but unlike linear regression, it's a classification model, not regression, even though the name might suggest that. It's a binary classification model, so the target variable is binary; the only values it can have are zero and one. Observations with the target variable = 1 are typically called positive examples: examples in which the effect we want to predict is present. Likewise, examples with the target variable = 0 are called negative examples: the effect we want to predict is absent. For this project, yi (target variable) = 1 means that the customer churned, and yi = 0 means the opposite: the customer stayed with us.

The output of logistic regression is a probability. In this case, it's the probability that a customer will churn.

To be able to treat the output as a probability, we need to make sure that the predictions of the model always stay between zero and one. We use a special mathematical function for this purpose called sigmoid.

Both linear and logistic regression are linear models, and both are based on the dot product operation. The only difference is that the output of linear regression is used as is, whereas the output of logistic regression is transformed by the sigmoid function.

Translating the logistic regression formula into code is almost identical to the linear regression case, except that at the end, we apply the sigmoid function:

In [181]:
import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))
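As a rough sketch of how the pieces fit together, the functions below compute the linear score (bias plus dot product) and then pass it through the sigmoid. The names linear_score and logistic_regression, and the weights and feature values, are assumptions made up for illustration; they are not taken from the trained model.

```python
import math


def sigmoid(score):
    return 1 / (1 + math.exp(-score))


def linear_score(xi, w0, w):
    # Same as linear regression: bias term plus the dot product of features and weights
    return w0 + sum(x * wj for x, wj in zip(xi, w))


def logistic_regression(xi, w0, w):
    # The only extra step compared with linear regression: apply the sigmoid
    return sigmoid(linear_score(xi, w0, w))


# Hypothetical bias, weights, and feature vector
w0 = -1.0
w = [0.5, -0.25]
xi = [2.0, 4.0]

score = linear_score(xi, w0, w)        # -1.0 + (1.0 - 1.0) = -1.0
prob = logistic_regression(xi, w0, w)  # a value strictly between 0 and 1
print(score, prob)
```

A negative score maps to a probability below 0.5, a positive score to a probability above 0.5, and a score of exactly zero maps to 0.5.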

### 3.3.2. Training logistic regression

In [187]:
from sklearn.linear_model import LogisticRegression

We train LogisticRegression using the fit method:

In [188]:
model = LogisticRegression(solver="liblinear", random_state=1)
model.fit(X_train, y_train)

The class LogisticRegression from Scikit-learn encapsulates the training logic behind this model. It’s configurable, and we can change quite a few parameters. In fact, we already specify two of them: solver and random_state. Both are needed for reproducibility:
- random_state. The seed number for the random-number generator. It shuffles the data when training the model; to make sure the shuffle is the same every time, we fix the seed.
- solver. The underlying optimization library. In the current version (at the moment of writing, v0.20.3), the default value for this parameter is liblinear, but according to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), it will change to a different one in version v0.22. To make sure our results are reproducible in the later versions, we also set this parameter.

When the training is done, the model is ready to make predictions. We can apply the model to our validation data to obtain the probability of churn for each customer in the validation dataset.

To do that, we need to apply the one-hot encoding scheme to all the categorical variables. First, we convert the dataframe to a list of dictionaries and then feed it to the DictVectorizer we fit previously:

In [189]:
val_dict = df_val[categorical + numerical].to_dict(orient="records")
X_val = dv.transform(val_dict)

As a result, we get X_val, a matrix with features from the validation dataset. Now we are ready to pass this matrix to the model. To get the probabilities, we use the predict_proba method of the model:

In [192]:
y_pred = model.predict_proba(X_val)

The result of predict_proba is a two-dimensional NumPy array, or a two-column matrix. The first column of the array contains the probability that the target is negative (no churn), and the second column contains the probability that the target is positive (churn).

These columns convey the same information. We know the probability of churn—it’s p—and the probability of not churning is always 1 – p, so we don’t need both columns.

In [194]:
y_pred = model.predict_proba(X_val)[:, 1]

The slicing operation [:, 1] allows us to select one column (the 2nd in this case) from a two-dimensional array in NumPy.

This syntax might be confusing, so let’s break it down. Two positions are inside the brackets, the first one for rows and the second one for columns.

When we use [:, 1], NumPy interprets it this way:

- : means select all the rows.
- 1 means select only the column at index 1, and because the indexing starts at 0, it’s the second column.

As a result, we get a one-dimensional NumPy array that contains the values from the second column only.
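The slicing can be tried on a small hand-made array. The numbers below are invented for illustration and only mimic the two-column shape that predict_proba returns:

```python
import numpy as np

# Made-up probabilities: column 0 = no churn, column 1 = churn
proba = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.6, 0.4],
])

churn_proba = proba[:, 1]   # all rows, column at index 1
print(churn_proba)          # one-dimensional array with three values
print(churn_proba.shape)

# The two columns carry the same information: each row sums to 1
print(proba[:, 0] + churn_proba)
```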

This output (probabilities) is often called soft predictions. These tell us the probability of churning as a number between zero and one. It's up to us to decide how to interpret this number and how to use it.

For this model, we want to retain customers by identifying those who are about to cancel their contract with the company and send them promotional messages, offering discounts and other benefits. We do this in the hope that after receiving the benefit, they will stay with the company. On the other hand, we don't want to give promotions to all our customers, because it will hurt us financially.

To make the actual decision about whether to send a promotional letter to our customers, using the probability alone is not enough. We need hard predictions: binary values of True (churn, so send the mail) or False (not churn, so don't send the mail).

To get the binary predictions, we take the probabilities and compare them with a certain threshold. If the probability for a customer is at or above this threshold, we predict churn; otherwise, we predict no churn. If we select 0.5 as the threshold, making the binary predictions is easy. We just use the ">=" operator.

Comparison operators in NumPy are applied element-wise: the comparison is performed for each element of the y_pred array, and the result is a new array that contains only Boolean values. We can see this on the first 20 predictions:

In [204]:
y_pred[:20]

array([0.2349096 , 0.26886196, 0.3194511 , 0.36519744, 0.04552507,
       0.44016819, 0.01825099, 0.11323841, 0.00606939, 0.19301077,
       0.61544702, 0.00552225, 0.35911808, 0.09060988, 0.03247032,
       0.31902383, 0.63281109, 0.11959054, 0.73403606, 0.54590058])

In [205]:
y_pred[:20] >= 0.5

array([False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False,  True, False,
        True,  True])

In [206]:
churn = y_pred >= 0.5
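The value 0.5 is a choice rather than a requirement. As a small illustration, the sketch below reuses the 20 probabilities printed above and counts how many of these customers would be flagged at a few thresholds. The alternative thresholds 0.3 and 0.7 are arbitrary values picked for this example.

```python
import numpy as np

# The first 20 soft predictions, copied from the output above
y_pred_sample = np.array([
    0.2349096, 0.26886196, 0.3194511, 0.36519744, 0.04552507,
    0.44016819, 0.01825099, 0.11323841, 0.00606939, 0.19301077,
    0.61544702, 0.00552225, 0.35911808, 0.09060988, 0.03247032,
    0.31902383, 0.63281109, 0.11959054, 0.73403606, 0.54590058,
])

# Summing a Boolean array counts the True values
counts = {t: int((y_pred_sample >= t).sum()) for t in (0.3, 0.5, 0.7)}
print(counts)
```

A lower threshold flags more customers for the promotion, and a higher threshold flags fewer.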

When we have these hard predictions made by our model, we would like to understand how good they are, so we are ready to move to the next step: evaluating the quality of these predictions. Let's do a simple check to make sure our model learned something useful.

The simplest thing to check is to take each prediction and compare it with the actual value. If we predict churn and the actual value is churn, or we predict non-churn and the actual value is non-churn, our model made the correct prediction. If the prediction and the actual value don't match, the prediction is wrong. If we calculate the fraction of predictions that match the actual values, we can use it for measuring the quality of our model.

This quality measure is called accuracy. It's very easy to calculate accuracy with NumPy:

In [207]:
(y_val == churn).mean()

np.float64(0.8016129032258065)

y_val contains only zeros and ones. It is our target variable: it gives us one if the customer churned and zero otherwise. churn contains Boolean predictions. In this case, True means we predict the customer will churn, and False means the customer will not churn.

Even though these two arrays have different types inside (integer and Boolean), it’s still possible to compare them. The Boolean array is cast to integer such that True values are turned to “1” and False values are turned to “0”. Then it’s possible for NumPy to perform the actual comparison.

If the true value in y_val matches our prediction in churn, the label is True, and if it doesn’t, the label is False. In other words, we have True if our prediction is correct and False if it’s not.

If we compute the mean of an array that contains only ones and zeros (True and False), the result is the fraction of ones (True) in that array, which we already used for calculating the churn rate.

We see 0.8 as the output. This means that the model predictions matched the actual value 80% of the time, or the model makes correct predictions in 80% of cases. This is what we call the accuracy of the model.
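The same calculation can be followed by hand on a tiny example. The five labels and predictions below are invented for illustration:

```python
import numpy as np

y_true = np.array([0, 1, 0, 0, 1])                     # made-up actual values
y_hard = np.array([False, True, True, False, False])   # made-up hard predictions

# Element-wise comparison between an integer array and a Boolean array
matches = (y_true == y_hard)
print(matches)

# The mean of a Boolean array is the fraction of True values
accuracy = matches.mean()
print(accuracy)
```

Three of the five predictions match the actual values, so the accuracy of this toy example is 60%.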

Now we know how to train a model and evaluate its accuracy, but it's still useful to understand how it makes the predictions.

### 3.3.3. Model interpretation

We know that the logistic regression model has two parameters that it learns from data:

- w0 is the bias term.
- w = (w1, w2, ..., wn) is the weights vector.

We can get the bias term from model.intercept_[0]. When we train our model on all features, the bias term is –0.12.

The rest of the weights are stored in model.coef_[0]. If we look inside, it’s just an array of numbers, which is hard to understand on its own.

To see which feature is associated with each weight, we can use the get_feature_names_out method of the DictVectorizer. We can zip the feature names together with the coefficients before looking at them:

In [209]:
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

{'contract=month-to-month': np.float64(0.563),
 'contract=one_year': np.float64(-0.086),
 'contract=two_year': np.float64(-0.599),
 'dependents=no': np.float64(-0.03),
 'dependents=yes': np.float64(-0.092),
 'deviceprotection=no': np.float64(0.1),
 'deviceprotection=no_internet_service': np.float64(-0.116),
 'deviceprotection=yes': np.float64(-0.106),
 'gender=female': np.float64(-0.027),
 'gender=male': np.float64(-0.095),
 'internetservice=dsl': np.float64(-0.323),
 'internetservice=fiber_optic': np.float64(0.317),
 'internetservice=no': np.float64(-0.116),
 'monthlycharges': np.float64(0.001),
 'multiplelines=no': np.float64(-0.168),
 'multiplelines=no_phone_service': np.float64(0.127),
 'multiplelines=yes': np.float64(-0.081),
 'onlinebackup=no': np.float64(0.136),
 'onlinebackup=no_internet_service': np.float64(-0.116),
 'onlinebackup=yes': np.float64(-0.142),
 'onlinesecurity=no': np.float64(0.258),
 'onlinesecurity=no_internet_service': np.float64(-0.116),
 'onlinesecurity=yes':

To understand how the model works, let’s consider what happens when we apply this model. To build the intuition, let’s train a simpler and smaller model that uses only three variables: contract, tenure, and totalcharges.

The variables tenure and totalcharges are numeric, so we don’t need to do any additional preprocessing; we can take them as is. On the other hand, contract is a categorical variable, so to be able to use it, we need to apply one-hot encoding.

Let’s redo the same steps we did for training, this time using a smaller set of features:

In [216]:
subset = ["contract", "tenure", "totalcharges"]
train_dict_small = df_train[subset].to_dict(orient="records")
dv_small = DictVectorizer(sparse=False)
dv_small.fit(train_dict_small)

X_small_train = dv_small.transform(train_dict_small)

dv_small.get_feature_names_out()

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'tenure', 'totalcharges'], dtype=object)

Let’s train the small model on this set of features:

In [217]:
model_small = LogisticRegression(solver="liblinear", random_state=1)
model_small.fit(X_small_train, y_train)

Let’s first check the bias term:

In [219]:
model_small.intercept_[0]

np.float64(-0.5772299097126418)

It outputs –0.577. Then we can check the other weights, using the same code as previously:

In [221]:
dict(zip(dv_small.get_feature_names_out(), model_small.coef_[0].round(3)))

{'contract=month-to-month': np.float64(0.866),
 'contract=one_year': np.float64(-0.327),
 'contract=two_year': np.float64(-1.117),
 'tenure': np.float64(-0.094),
 'totalcharges': np.float64(0.001)}

These weights are essentially w1, w2, w3, w4, and w5 for the weights vector.

In the case of linear regression, the bias term is the baseline prediction. It's the prediction we would make without knowing anything else about the observation. This baseline is corrected with other weights.

In the case of logistic regression, it's similar: the bias term is the baseline prediction, or the score we would make on average. Likewise, we later correct this score with the other weights. However, for logistic regression, interpretation is a bit trickier because we also need to apply the sigmoid function before we get the final output.

In this case, the bias term has the value of –0.577. This value is negative. If we look at the sigmoid function, we can see that for negative values, the output is lower than 0.5. For –0.577, the resulting probability of churning is 36%. This means that on average, a customer is more likely to stay with us than churn.

In [223]:
sigmoid(-.577)

0.3596231853677901

The reason why the sign before the bias term is negative is the class imbalance. There are a lot fewer churned users in the training data than non-churned ones, meaning the probability of churn on average is low, so this value for the bias term makes sense.
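To see how the other weights correct this baseline, we can compute a score by hand. The sketch below uses the rounded bias and weights of the small model printed above. The two customers are hypothetical, and because the weights are rounded to three decimals (the totalcharges weight in particular), the results are only approximations of what model_small would return.

```python
import math


def sigmoid(score):
    return 1 / (1 + math.exp(-score))


# Rounded values copied from the outputs of the small model above
bias = -0.577
weights = {
    "contract=month-to-month": 0.866,
    "contract=one_year": -0.327,
    "contract=two_year": -1.117,
    "tenure": -0.094,
    "totalcharges": 0.001,
}


def score_customer(features):
    # Start from the baseline and add the contribution of each feature
    return bias + sum(weights[name] * value for name, value in features.items())


# Hypothetical customers, for illustration only
monthly = {"contract=month-to-month": 1, "tenure": 12, "totalcharges": 1000}
two_year = {"contract=two_year": 1, "tenure": 24, "totalcharges": 2000}

score_monthly = score_customer(monthly)     # -0.577 + 0.866 - 1.128 + 1.0
score_two_year = score_customer(two_year)   # -0.577 - 1.117 - 2.256 + 2.0

print(score_monthly, sigmoid(score_monthly))
print(score_two_year, sigmoid(score_two_year))
```

With these rounded weights, the month-to-month contract pushes the score up, and the first customer ends slightly above 50%. The two-year contract and longer tenure push the score down, and the second customer ends well below 50%.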

### 3.3.4. Using the model

Now we can interpret what our model learned and understand how it makes the predictions.

We applied the model to the validation set, computed the probabilities of churning for every customer there, and concluded that the model is 80% accurate. Let's try to use the model we trained.

We take a customer we want to score and put all the variable values in a dictionary:

In [233]:
customer = {
    'customerid': '8879-zkjof',
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'no',
    'dependents': 'no',
    'tenure': 41,
    'phoneservice': 'yes',
    'multiplelines': 'no',
    'internetservice': 'dsl',
    'onlinesecurity': 'yes',
    'onlinebackup': 'no',
    'deviceprotection': 'yes',
    'techsupport': 'yes',
    'streamingtv': 'yes',
    'streamingmovies': 'yes',
    'contract': 'one_year',
    'paperlessbilling': 'yes',
    'paymentmethod': 'bank_transfer_(automatic)',
    'monthlycharges': 79.85,
    'totalcharges': 3320.75,
}

When we prepare items for prediction, they should undergo the same preprocessing steps we did for training the model. If we don’t do it in exactly the same way, the model might not get things it expects to see, and, in this case, the predictions could get really off. This is why in the previous example, in the customer dictionary, the field names and string values are lowercased and spaces are replaced with underscores.
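As a minimal sketch of that normalization for a single record, assuming the raw data arrives with mixed-case field names and values that contain spaces, a helper could look like this. The function names and the raw record are hypothetical:

```python
def normalize(text):
    # Same convention as in data preparation: lowercase, spaces to underscores
    return text.lower().replace(" ", "_")


def normalize_record(record):
    # Normalize every field name, and every value that is a string;
    # numeric values are left untouched
    return {
        normalize(key): normalize(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


# Hypothetical raw record, as it might look before preprocessing
raw = {
    "Contract": "One year",
    "PaymentMethod": "Bank transfer (automatic)",
    "tenure": 41,
}
clean = normalize_record(raw)
print(clean)
```

After this step, the keys and values follow the same convention as the feature names the DictVectorizer learned during training.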

Now we can use our model to see whether this customer is going to churn. Let’s do it.

First, we convert this dictionary to a matrix by using the DictVectorizer:

In [235]:
X_test = dv.transform([customer])

The input to the vectorizer is a list with one item: we want to score only one customer. The output is a matrix with features, and this matrix contains only one row, the features for this one customer:

In [236]:
X_test

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 0.00000e+00, 7.98500e+01, 1.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 1.00000e+00,
        1.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 4.10000e+01, 3.32075e+03]])

We see a bunch of one-hot encoded features (ones and zeros) as well as some numeric ones (monthlycharges, tenure, and totalcharges).

Now we take this matrix and put it into the trained model:

In [237]:
model.predict_proba(X_test)

array([[0.92667668, 0.07332332]])

The output is a matrix with predictions. For each customer, it outputs two numbers, which are the probability of staying with the company and the probability of churn. Because there’s only one customer, we get a tiny NumPy array with one row and two columns.

All we need from the matrix is the number at the first row and second column: the probability of churning for this customer. To select this number from the array, we use the brackets operator:

In [238]:
model.predict_proba(X_test)[0, 1]

np.float64(0.0733233241408035)

Earlier, we used this operator to select the second column from the array. This time there’s only one row, so we can explicitly ask NumPy to return the value from that row. Because indexes start from 0 in NumPy, [0, 1] means first row, second column.

When we execute this line, we see that the output is 0.073, so the probability that this customer will churn is only 7%. It’s less than 50%, so we will not send this customer a promotional mail.

We can try to score another client:

In [239]:
customer = {
    'gender': 'female',
    'seniorcitizen': 1,
    'partner': 'no',
    'dependents': 'no',
    'phoneservice': 'yes',
    'multiplelines': 'yes',
    'internetservice': 'fiber_optic',
    'onlinesecurity': 'no',
    'onlinebackup': 'no',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'yes',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 85.7,
    'totalcharges': 85.7
}

Let's make a prediction:

In [240]:
X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]

np.float64(0.8321646331247229)

The output of the model is 83% likelihood of churn, so we should send this client a promotional mail in the hope of retaining them.
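The decision rule used for both customers can be written as a small helper. This is only a sketch: the function name should_send_promo is made up, the threshold of 0.5 is the one chosen earlier, and the two probabilities are the rounded outputs from the cells above.

```python
def should_send_promo(churn_probability, threshold=0.5):
    # Hard prediction: True means "likely to churn, send the promotional mail"
    return churn_probability >= threshold


# Rounded probabilities from the two predictions above
scores = {"first customer": 0.0733, "second customer": 0.8322}

decisions = {name: should_send_promo(p) for name, p in scores.items()}
print(decisions)
```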

Answers:
- Exercise 3.1 = b
- Exercise 3.2 = a
- Exercise 3.3 = b