---
<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 3.0**
*by [sigmoidal.ai](https://sigmoidal.ai)*  

---

# Churn Prediction

A good business strategy for revenue growth is customer retention. One study found that **80%** of a business's future profits come from ***just* 20%** of the company's existing customers<sup>[1](#references)</sup>. In addition, retaining a customer is estimated to be 5-25 times cheaper than acquiring a new one<sup>[2](#references)</sup>. This is where churn and churn rates come in.

<p align=center>
<img src="img/money_loss.jpg" width="50%"><br>
<i><sup>Image credits: upklyak (<a href="https://br.freepik.com/vetores-gratis/queda-da-venda-da-recessao-economica-da-crise-financeira_29942020.htm">www.freepik.com</a>)</sup></i>
</p>

### What is Churn?

Churn is the number of clients a company loses, while the *churn rate* is the percentage of customers lost over a given period of time. That is:

**Churn Rate** = $\frac{C_B - C_E}{C_B} \times 100$, where $C_B$ is the number of customers at the beginning of the period and $C_E$ is the number of customers at the end of the period.
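
As a quick illustration of the formula, here is a minimal sketch in Python. The function name `churn_rate` and the customer counts are hypothetical, chosen only for the example; they do not come from the dataset analysed in this notebook.

```python
def churn_rate(customers_start: int, customers_end: int) -> float:
    """Return the churn rate, in percent, for one period.

    customers_start is C_B and customers_end is C_E in the formula above.
    """
    if customers_start <= 0:
        raise ValueError("customers_start must be a positive number")
    return (customers_start - customers_end) / customers_start * 100


# Hypothetical counts: 1000 customers at the start, 980 at the end
rate = churn_rate(1000, 980)
print(f"Churn rate: {rate:.1f}%")
```

Because the formula only compares the two headcounts, customers acquired during the period offset the ones that were lost, so in practice the counts are often restricted to customers who were already present at the start.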

For telecom companies, the churn rate has been estimated at 1.9-2.1% monthly and 10-67% annually<sup>[3](#references)</sup>. These high churn rates show why it is important for companies to identify potential *churners* early and try to prevent the resulting loss of revenue. In this notebook, we will analyse a churn dataset from such a company and employ machine learning strategies to identify potential churners.

# Getting the data

The data used in this project was originally published as part of [IBM Developer's learning platform](https://developer.ibm.com/tutorials/watson-studio-using-jupyter-notebook/) and is also available on [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn). It contains information about customers who left within the last month of the analysed period, along with other customer information (demographics, account information, and the services each customer signed up for).

In [43]:
# Importing libraries
import pandas as pd
import numpy as np

# Getting the data
df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Checking first entries of the dataset
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## Data variables

As mentioned previously, the dataset contains several columns providing information about the customers:

* `customerID` = Unique ID for each customer.
* `gender` = The customer's identified gender.
* `SeniorCitizen` = Whether the customer is a senior citizen (1) or not (0).
* `Partner` = Whether the customer has a partner (Yes) or not (No).
* `Dependents` = Whether the customer has dependents (Yes) or not (No).
* `tenure` = Number of months the customer has stayed with the company.
* `PhoneService` = Whether the customer has a phone service (Yes) or not (No).
* `MultipleLines` = Whether the customer has multiple lines (Yes) or not (No *or* No phone service).
* `InternetService` = Customer’s internet service provider (DSL, Fiber optic, No).
* `OnlineSecurity` = Whether the customer has online security (Yes) or not (No *or* No internet service).
* `OnlineBackup` = Whether the customer has online backup (Yes) or not (No *or* No internet service).
* `DeviceProtection` = Whether the customer has device protection (Yes) or not (No *or* No internet service).
* `TechSupport` = Whether the customer has tech support (Yes) or not (No *or* No internet service).
* `StreamingTV` = Whether the customer has streaming TV (Yes) or not (No *or* No internet service).
* `StreamingMovies` = Whether the customer has streaming movies (Yes) or not (No *or* No internet service).
* `Contract` = The contract term of the customer (Month-to-month, One year, Two year).
* `PaperlessBilling` = Whether the customer has paperless billing (Yes) or not (No).
* `PaymentMethod` = The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
* `MonthlyCharges` = The amount charged to the customer monthly.
* `TotalCharges` = The total amount charged to the customer.
* `Churn` = Whether the customer churned (Yes) or not (No).

# Data Preparation

## Exploratory analysis

First, we will begin by looking at the variables and their characteristics (type, missing values and descriptive statistics).

In [44]:
# Listing data types
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [45]:
# Checking null values
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [46]:
# Checking unique values
df.nunique()

customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64
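
The unique-value counts above hint at which columns are binary and which hold several categories. A minimal sketch of how that split could be made programmatically follows. The small `toy` frame and the names `binary_cols` and `multi_cols` are assumptions made for illustration; the same idea should carry over to `df`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Keep only the non-numeric columns and count their distinct values
counts = toy.select_dtypes(exclude="number").nunique()

# Two distinct values -> binary; more than two -> multi-category
binary_cols = counts[counts == 2].index.tolist()
multi_cols = counts[counts > 2].index.tolist()

print("Binary columns:", binary_cols)
print("Multi-category columns:", multi_cols)
```

On the real data, an identifier such as `customerID` would also land in the multi-category group, so it would need to be dropped before any encoding step.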

The checks above show that no column contains null values, but `TotalCharges`, which should be a float, is stored as an object. In addition, several columns are binary or hold a small set of categories, so they will need to be encoded before modelling. Trying to convert `TotalCharges` to a numeric type:

In [47]:
# Trying to convert column
#df['TotalCharges'].astype('float64')

We face an error:
> `ValueError: could not convert string to float: ''`

This means that missing values were stored as empty (or whitespace-only) strings rather than as nulls, which is why the conversion fails. We will therefore convert all such strings to null values and check the result:

In [50]:
# Replacing empty or whitespace-only strings with NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df.replace(r'^\s*$', np.nan, regex=True)
df.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
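
With the blank entries exposed as missing values, `TotalCharges` can be turned into a numeric column. The sketch below shows one possible way to do this with `pd.to_numeric` and `errors="coerce"`, which converts anything unparseable to `NaN` in a single step. The `toy` frame is a hypothetical stand-in for `df`, and what to do with the missing rows afterwards (drop or impute) is left open here.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 52.55, 56.95],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Parse the strings as numbers; entries that cannot be parsed become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Inspect the rows that ended up missing before deciding how to handle them
missing_rows = toy[toy["TotalCharges"].isna()]

print(toy["TotalCharges"].dtype)
print(missing_rows)
```

Looking at the affected rows first can help decide whether dropping them or filling them in is the more sensible choice.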

# References
1: https://smallbiztrends.com/2016/10/customer-retention-statistics.html

2: https://hbr.org/2014/10/the-value-of-keeping-the-right-customers

3: http://www.dbmarketing.com/telecom/churnreduction.html