# Reducing Customer Churn at Telco

---

## Project Goals

The goal of this project is to identify drivers of customer churn at Telco, produce a prediction model to identify 
which customers are at the highest risk of churning, and offer a recommendation for reducing customer churn.

---

## Project Description

Telco customers are churning at an unacceptably high rate, which is hurting the company's bottom line. Retaining
existing customers costs far less than signing new ones, so we want to reduce churn rather than rely on new
sign-ups to make up the difference. We will compare customers who have churned with those who have not to
determine which attributes are driving customers to churn. We will then produce a prediction model to identify
the customers at the highest risk of churning and provide a list of those customers (in predictions.csv).
Finally, we will recommend a course of action to help promote customer retention.

---

# Imports

Below are all the imports needed to reproduce this project. The project uses numpy, pandas, matplotlib, seaborn and sklearn. If you do not already have these installed, install them before proceeding.

In [23]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

We will also be using the random seed below throughout the notebook to ensure that reproduced results are consistent.

In [2]:
seed = 24

# Data Acquisition

---

## Accessing the Database

Here we will outline the process for acquiring the Telco customer dataset from the MySQL database. In order to access this data you will need login credentials. Assuming you have credentials, save them in an env.py file in the following form:

In [3]:
username = 'your_username'
password = 'your_password'
hostname = 'data.codeup.com'

Save this file in the notebooks directory and save a copy in the util directory for use with the final report notebook. We will import our credentials from env.py and use the following function to build the connection URL for acquiring our data.

In [4]:
from env import username, password, hostname

def get_db_url(database_name, username = username, password = password, hostname = hostname):
    return f'mysql+pymysql://{username}:{password}@{hostname}/{database_name}'
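
One caveat with building the URL by plain string formatting: a password containing characters such as `@`, `:` or `/` could be misread as a URL delimiter. A minimal sketch of one way to guard against this, assuming percent-encoding the credentials is acceptable for your setup (the function name `build_db_url` and the sample password are hypothetical and not part of the project code):

```python
from urllib.parse import quote_plus

def build_db_url(database_name, username, password, hostname):
    # Percent-encode the credentials so characters such as '@', ':' or '/'
    # cannot be mistaken for URL delimiters.
    return (
        f'mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}'
        f'@{hostname}/{database_name}'
    )

# Hypothetical credentials, for illustration only
print(build_db_url('telco_churn', 'your_username', 'p@ss/word', 'data.codeup.com'))
```

Credentials made up only of letters, digits and underscores pass through unchanged, so this behaves like the function above for simple passwords.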

---

## Acquiring the Data

We can gather all the data we need with a single SQL query that joins together all the tables in the telco_churn database, which are related by foreign keys. Here is a function that will give us the required SQL query:

In [5]:
def get_telco_sql():
    return '''
        SELECT *
        FROM customers
        JOIN payment_types USING (payment_type_id)
        JOIN internet_service_types USING (internet_service_type_id)
        JOIN contract_types USING (contract_type_id);
    '''

Next we want to read the dataset from the database and cache it locally for quicker access in the future. To do so, we check whether the cached file already exists before querying the database. We can achieve this using pandas and the os module from the Python standard library. We put all this code into a function for our convenience.

In [6]:
def get_telco_data(use_cache = True):
    # If the file is cached, read from the .csv file
    if os.path.exists('telco.csv') and use_cache:
        print('Using cache')
        return pd.read_csv('telco.csv')
    
    # Otherwise read from the mysql database
    else:
        print('Reading from database')
        df = pd.read_sql(get_telco_sql(), get_db_url('telco_churn'))
        df.to_csv('telco.csv', index = False)
        return df

In [7]:
# Now we can acquire the data needed to proceed
telco_customers = get_telco_data()
telco_customers.head(2)

Using cache


| | contract_type_id | internet_service_type_id | payment_type_id | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | ... | tech_support | streaming_tv | streaming_movies | paperless_billing | monthly_charges | total_charges | churn | payment_type | internet_service_type | contract_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 2 | 0002-ORFBO | Female | 0 | Yes | Yes | 9 | Yes | ... | Yes | Yes | No | Yes | 65.6 | 593.3 | No | Mailed check | DSL | One year |
| 1 | 1 | 1 | 2 | 0003-MKNFE | Male | 0 | No | No | 9 | Yes | ... | No | No | Yes | No | 59.9 | 542.4 | No | Mailed check | DSL | Month-to-month |


---



# Data Preparation

---

Here we will outline all the steps taken to prepare the data for exploratory analysis.

---

## Identify Missing Values

We start by checking if any null values exist in the dataframe.

In [8]:
telco_customers.isna().sum()

contract_type_id            0
internet_service_type_id    0
payment_type_id             0
customer_id                 0
gender                      0
senior_citizen              0
partner                     0
dependents                  0
tenure                      0
phone_service               0
multiple_lines              0
online_security             0
online_backup               0
device_protection           0
tech_support                0
streaming_tv                0
streaming_movies            0
paperless_billing           0
monthly_charges             0
total_charges               0
churn                       0
payment_type                0
internet_service_type       0
contract_type               0
dtype: int64

We do not have any null values so we can proceed.

---

## Remove Duplicates

Next we will remove any duplicates that may exist in the dataframe.

In [9]:
telco_customers.shape

(7043, 24)

In [10]:
telco_customers = telco_customers.drop_duplicates()
telco_customers.shape

(7043, 24)
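
The shape is unchanged, so no duplicate rows were present. As a side note, the number of duplicates can also be counted directly before dropping anything. A minimal sketch on a toy dataframe (the toy data is made up for illustration):

```python
import pandas as pd

# Toy data: the second and third rows are identical
toy = pd.DataFrame({
    'customer_id': ['a', 'b', 'b', 'c'],
    'tenure': [1, 5, 5, 9],
})

# duplicated() marks every repeat of an earlier row, so the sum is the
# number of rows that drop_duplicates() would remove.
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```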

---

## Identify Useless/Redundant Features

Now we will identify any useless or redundant features in our data. Useless features could be unique identifiers, which will not have any influence over customer churn. Redundant features could be columns that were used as foreign keys in the MySQL database.

The customer_id column is a unique identifier. We can see this by looking at a few rows and comparing the number of unique values to the total number of rows.

In [11]:
telco_customers['customer_id'].head(3)

0    0002-ORFBO
1    0003-MKNFE
2    0004-TLHLJ
Name: customer_id, dtype: object

In [12]:
telco_customers['customer_id'].nunique() == telco_customers.shape[0]

True

We will drop this column since it's of no use to us.

The contract_type_id, internet_service_type_id and payment_type_id columns are all foreign key columns. We can see this by looking at the unique values in each column alongside the columns contract_type, internet_service_type, and payment_type.

In [13]:
columns = [
    ['contract_type_id', 'contract_type'],
    ['internet_service_type_id', 'internet_service_type'],
    ['payment_type_id', 'payment_type']
]

for column in columns:
    print(telco_customers[column].value_counts(), end = '\n----------\n')

contract_type_id  contract_type 
1                 Month-to-month    3875
3                 Two year          1695
2                 One year          1473
dtype: int64
----------
internet_service_type_id  internet_service_type
2                         Fiber optic              3096
1                         DSL                      2421
3                         None                     1526
dtype: int64
----------
payment_type_id  payment_type             
1                Electronic check             2365
2                Mailed check                 1612
3                Bank transfer (automatic)    1544
4                Credit card (automatic)      1522
dtype: int64
----------


We can see here that each of these column pairs matches up perfectly: every id value appears with exactly one label. This confirms the id columns were foreign key columns in the MySQL database. We will drop the id columns and keep their counterparts since these have human-readable values.
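
The same check can be done programmatically rather than by eye. A minimal sketch, assuming a one-to-one relationship is what we want to confirm (the helper name `is_one_to_one` and the toy data are hypothetical):

```python
import pandas as pd

def is_one_to_one(df, id_col, label_col):
    # Each id must map to exactly one label, and each label to exactly one id.
    labels_per_id = df.groupby(id_col)[label_col].nunique()
    ids_per_label = df.groupby(label_col)[id_col].nunique()
    return bool((labels_per_id == 1).all() and (ids_per_label == 1).all())

# Toy data mirroring the contract columns
toy = pd.DataFrame({
    'contract_type_id': [1, 2, 3, 1],
    'contract_type': ['Month-to-month', 'One year', 'Two year', 'Month-to-month'],
})

print(is_one_to_one(toy, 'contract_type_id', 'contract_type'))
```

If an id ever appeared with two different labels, the function would return False and the id column could not be dropped without losing information.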

Now let's drop these columns.

In [14]:
columns_to_drop = [
    'customer_id',
    'contract_type_id',
    'internet_service_type_id',
    'payment_type_id'
]

telco_customers = telco_customers.drop(columns = columns_to_drop)

# Let's make sure it worked
telco_customers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   gender                 7043 non-null   object 
 1   senior_citizen         7043 non-null   int64  
 2   partner                7043 non-null   object 
 3   dependents             7043 non-null   object 
 4   tenure                 7043 non-null   int64  
 5   phone_service          7043 non-null   object 
 6   multiple_lines         7043 non-null   object 
 7   online_security        7043 non-null   object 
 8   online_backup          7043 non-null   object 
 9   device_protection      7043 non-null   object 
 10  tech_support           7043 non-null   object 
 11  streaming_tv           7043 non-null   object 
 12  streaming_movies       7043 non-null   object 
 13  paperless_billing      7043 non-null   object 
 14  monthly_charges        7043 non-null   float64
 15  tota

---

## Check Other Features for Unusual Values

Next we want to check our other columns for unusual values or redundant values.

If we take a look at the total_charges column we will see that there is a hidden group of missing values: 11 rows hold a blank instead of a number. These rows belong to customers who have only just signed on (their tenure is 0), so they have not been charged yet. We will remove these rows from the dataset since they are unlikely to affect our results.

In [15]:
columns = ['tenure', 'total_charges']

for column in columns:
    print(telco_customers[column].value_counts().sort_index(), end = '\n----------\n')

0      11
1     613
2     238
3     200
4     176
     ... 
68    100
69     95
70    119
71    170
72    362
Name: tenure, Length: 73, dtype: int64
----------
          11
100.2      1
100.25     1
100.35     1
100.4      1
          ..
997.75     1
998.1      1
999.45     1
999.8      1
999.9      1
Name: total_charges, Length: 6531, dtype: int64
----------


In [16]:
# Now we remove the rows
# This should remove 11 rows leaving us with 7032 rows

does_not_have_zero_tenure = telco_customers.tenure != 0
# Take an explicit copy so later column assignments do not raise a SettingWithCopyWarning
telco_customers = telco_customers[does_not_have_zero_tenure].copy()
telco_customers.shape[0]

7032

In [19]:
# We'll also cast the total_charges column to a float type

telco_customers['total_charges'] = telco_customers['total_charges'].astype('float')
telco_customers['total_charges'].dtype

dtype('float64')
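
Blank strings like these are not reported by isna(), which is why the earlier null check came back clean. One way to surface them is to coerce the column to numeric and look at what fails to parse. A minimal sketch on toy data, assuming the blanks are stored as whitespace strings:

```python
import pandas as pd

# Toy data: the first customer has no charges recorded yet
toy = pd.DataFrame({
    'tenure': [0, 1, 2],
    'total_charges': [' ', '29.85', '108.15'],
})

# isna() treats the blank string as a valid value
n_null = int(toy['total_charges'].isna().sum())

# Coercing to numeric turns anything unparseable into NaN
as_number = pd.to_numeric(toy['total_charges'], errors='coerce')
hidden_missing = toy[as_number.isna()]

print(n_null, len(hidden_missing))
```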

The columns multiple_lines, online_security, online_backup, device_protection, tech_support, streaming_tv, and streaming_movies each have two values that mean "No": a plain "No" and a "No phone service" or "No internet service" value. This can be seen below.

In [17]:
columns = [
    'multiple_lines',
    'online_security',
    'online_backup',
    'device_protection',
    'tech_support',
    'streaming_tv',
    'streaming_movies'
]

for column in columns:
    print(telco_customers[column].value_counts(), end = '\n----------\n')

No                  3385
Yes                 2967
No phone service     680
Name: multiple_lines, dtype: int64
----------
No                     3497
Yes                    2015
No internet service    1520
Name: online_security, dtype: int64
----------
No                     3087
Yes                    2425
No internet service    1520
Name: online_backup, dtype: int64
----------
No                     3094
Yes                    2418
No internet service    1520
Name: device_protection, dtype: int64
----------
No                     3472
Yes                    2040
No internet service    1520
Name: tech_support, dtype: int64
----------
No                     2809
Yes                    2703
No internet service    1520
Name: streaming_tv, dtype: int64
----------
No                     2781
Yes                    2731
No internet service    1520
Name: streaming_movies, dtype: int64
----------


We can combine these duplicate "No" values into a single "No" value.

In [18]:
columns = [
    'multiple_lines',
    'online_security',
    'online_backup',
    'device_protection',
    'tech_support',
    'streaming_tv',
    'streaming_movies'
]

for column in columns:
    telco_customers[column] = np.where(telco_customers[column] == 'Yes', 'Yes', 'No')
    
# Let's verify that it worked
for column in columns:
    print(telco_customers[column].value_counts(), end = '\n----------\n')

No     4065
Yes    2967
Name: multiple_lines, dtype: int64
----------
No     5017
Yes    2015
Name: online_security, dtype: int64
----------
No     4607
Yes    2425
Name: online_backup, dtype: int64
----------
No     4614
Yes    2418
Name: device_protection, dtype: int64
----------
No     4992
Yes    2040
Name: tech_support, dtype: int64
----------
No     4329
Yes    2703
Name: streaming_tv, dtype: int64
----------
No     4301
Yes    2731
Name: streaming_movies, dtype: int64
----------
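
The np.where approach maps everything that is not "Yes" to "No". If we would rather collapse only the known labels and leave any unexpected value visible, an explicit mapping is one alternative. A minimal sketch on toy data (the mapping name `no_service_labels` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'multiple_lines': ['Yes', 'No', 'No phone service'],
    'online_security': ['No internet service', 'Yes', 'No'],
})

# Only these two labels are rewritten; any other value is left untouched
no_service_labels = {'No phone service': 'No', 'No internet service': 'No'}
collapsed = toy.replace(no_service_labels)

print(collapsed)
```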


---

## Encode Non Numeric Features

Lastly, we need to encode all of our non-numeric features as numeric columns so that we can use them with our machine learning models. We will use pandas to help us with this.

In [20]:
# We only need non numeric categorical features
categorical_cols = telco_customers.dtypes[telco_customers.dtypes == 'object'].index

# pandas' get_dummies function will encode these features for us
dummy_df = pd.get_dummies(telco_customers[categorical_cols], dummy_na = False, drop_first = True)
telco_customers = pd.concat([telco_customers, dummy_df], axis = 1)

# We'll also clean up the column names by replacing spaces, lower casing everything, and removing parentheses
telco_customers.columns = telco_customers.columns.str.replace(' ', '_', regex = False).str.lower()
# A raw string keeps the backslashes in the pattern from being read as string escapes
telco_customers.columns = telco_customers.columns.str.replace(r'\(|\)', '', regex = True)

# Let's see if it worked
telco_customers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 0 to 7042
Data columns (total 40 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   gender                              7032 non-null   object 
 1   senior_citizen                      7032 non-null   int64  
 2   partner                             7032 non-null   object 
 3   dependents                          7032 non-null   object 
 4   tenure                              7032 non-null   int64  
 5   phone_service                       7032 non-null   object 
 6   multiple_lines                      7032 non-null   object 
 7   online_security                     7032 non-null   object 
 8   online_backup                       7032 non-null   object 
 9   device_protection                   7032 non-null   object 
 10  tech_support                        7032 non-null   object 
 11  streaming_tv                        7032 no

In util/prepare.py we have all these steps collected in a single function prep_telco_data() for our convenience.
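
To make the effect of drop_first concrete, here is a minimal sketch on a toy column (the toy data is made up for illustration). With three categories, get_dummies produces two indicator columns, and the dropped first category is implied when both are zero:

```python
import pandas as pd

toy = pd.DataFrame({'contract_type': ['Month-to-month', 'One year', 'Two year']})

# drop_first removes the indicator for the first category (Month-to-month here)
dummies = pd.get_dummies(toy, drop_first=True)

# Same column name clean-up as above: spaces to underscores, lower case
dummies.columns = dummies.columns.str.replace(' ', '_', regex=False).str.lower()

print(list(dummies.columns))
```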

---

## Split the Data

Now that we have our prepare function ready, we need a function to split the data into train, validate, and test datasets. We first make an 80/20 split to set aside a test dataset that is 20% of our original data. We then make a 70/30 split of the remaining 80%, so the train dataset is 70% of that remainder and the validate dataset is 30% (roughly 56% and 24% of the original data, respectively). Both splits are stratified on the target column so that each dataset keeps a similar churn rate. We can use sklearn to help with this. We will put everything in a function for our convenience.

In [21]:
def split_data(df, stratify, random_seed = 24):
    test_split = 0.2
    train_validate_split = 0.3

    train_validate, test = train_test_split(
        df,
        test_size = test_split,
        random_state = random_seed,
        stratify = df[stratify]
    )
    
    train, validate = train_test_split(
        train_validate,
        test_size = train_validate_split,
        random_state = random_seed,
        stratify = train_validate[stratify]
    )
    return train, validate, test

In [24]:
# Now we split our data, passing the notebook-wide seed explicitly
train, validate, test = split_data(telco_customers, 'churn', random_seed = seed)
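
As a sanity check on the proportions, the expected size of each dataset can be worked out by hand. A minimal sketch, assuming the held-out share of each split is rounded up to a whole row and the remainder is kept (this is our understanding of how train_test_split handles fractional sizes; the helper name `split_sizes` is hypothetical):

```python
import math

def split_sizes(n_rows, test_split=0.2, train_validate_split=0.3):
    # Assumption: the held-out share is rounded up and the remainder is kept
    n_test = math.ceil(test_split * n_rows)
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(train_validate_split * n_train_validate)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test

# 7032 is the number of rows left after data preparation
n_train, n_validate, n_test = split_sizes(7032)
for name, n in [('train', n_train), ('validate', n_validate), ('test', n_test)]:
    print(f'{name}: {n} rows ({n / 7032:.1%})')
```

Comparing these numbers with train.shape, validate.shape and test.shape is a quick way to confirm the split behaved as intended.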

**Important: from this point on we will only work with our train dataset. We will not use validate until we get to modeling, and we will not use test until we have selected our best model.**

---

# Exploratory Analysis

---