# Capstone 3 - Customer Churn Prediction for Telco

[Telco](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113), a fictional telecommunications company, needs to predict which customers are likely to "churn", that is, to stop using the company's services. As the data scientists tasked with this problem, we will build machine learning models that classify customers as likely to churn or likely to stay. We will measure success by how accurately the models identify the customers who leave, as well as by the overall prediction quality across all customers.
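
To make the two success measures concrete, here is a minimal sketch. It assumes that "accuracy of predicting the customers who will leave" corresponds to recall on the churn class, which this notebook does not state explicitly. The function name `churn_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def churn_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return recall on churners and overall accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of customers who actually churned that the model flagged
        "recall": tp / (tp + fn),
        # Share of all customers that the model classified correctly
        "accuracy": (tp + tn) / total,
    }


# Invented counts, not results from the Telco data
metrics = churn_metrics(tp=30, fp=10, fn=20, tn=140)
print(metrics)
```

A model can score well on overall accuracy while still missing many churners, which is why both numbers are worth tracking.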

First, we will look at the data, which contains 7043 customer records from California in the third quarter of a given year.

### Data Wrangling

In this notebook, we clean the data for our initial exploration. Below, we start by importing the necessary libraries and loading the data.

In [196]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



In [197]:
df_churn = pd.read_excel('data/Telco_customer_churn.xlsx')

df_churn.columns

Index(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
       'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen',
       'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
       'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
       'Churn Score', 'CLTV', 'Churn Reason'],
      dtype='object')

In [198]:
df_demographics = pd.read_excel('data/Telco_customer_churn_demographics.xlsx')
df_location = pd.read_excel('data/Telco_customer_churn_location.xlsx')
df_population = pd.read_excel('data/Telco_customer_churn_population.xlsx')
df_services = pd.read_excel('data/Telco_customer_churn_services.xlsx')
df_status = pd.read_excel('data/Telco_customer_churn_status.xlsx')

In [199]:
df_demographics.columns

Index(['Customer ID', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen',
       'Married', 'Dependents', 'Number of Dependents'],
      dtype='object')

In [200]:
df_location.columns

Index(['Customer ID', 'Count', 'Country', 'State', 'City', 'Zip Code',
       'Lat Long', 'Latitude', 'Longitude'],
      dtype='object')

In [201]:
df_population.columns

Index(['ID', 'Zip Code', 'Population'], dtype='object')

In [202]:
df_services.columns

Index(['Customer ID', 'Count', 'Quarter', 'Referred a Friend',
       'Number of Referrals', 'Tenure in Months', 'Offer', 'Phone Service',
       'Avg Monthly Long Distance Charges', 'Multiple Lines',
       'Internet Service', 'Internet Type', 'Avg Monthly GB Download',
       'Online Security', 'Online Backup', 'Device Protection Plan',
       'Premium Tech Support', 'Streaming TV', 'Streaming Movies',
       'Streaming Music', 'Unlimited Data', 'Contract', 'Paperless Billing',
       'Payment Method', 'Monthly Charge', 'Total Charges', 'Total Refunds',
       'Total Extra Data Charges', 'Total Long Distance Charges',
       'Total Revenue'],
      dtype='object')

In [203]:
df_status.columns

Index(['Customer ID', 'Count', 'Quarter', 'Satisfaction Score',
       'Customer Status', 'Churn Label', 'Churn Value', 'Churn Score', 'CLTV',
       'Churn Category', 'Churn Reason'],
      dtype='object')

We have stored each file into these DataFrames:

* **df_churn** - Contains info on the customers related to churn that combines the following data sets:
    * **df_demographics** - Basic info about each of the customers not related to Telco directly
    * **df_location** - Location info for each customer
    * **df_population** - Population for each zip code
    * **df_services** - Details for customers' various services and usages
    * **df_status** - Labels which customers churned with reasons if they did.
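
The notebook does not show how the combined table was built, so the following is only a sketch of how such component tables could be joined. It assumes that 'Customer ID' is the shared key, as the column listings above suggest (note that df_churn itself spells it 'CustomerID'). The toy frames and their values are invented.

```python
import pandas as pd

# Toy stand-ins for two of the component tables; the values are invented
demographics = pd.DataFrame({
    "Customer ID": ["A-1", "B-2", "C-3"],
    "Gender": ["Male", "Female", "Female"],
})
status = pd.DataFrame({
    "Customer ID": ["A-1", "B-2", "C-3"],
    "Churn Value": [1, 0, 0],
})

# validate="one_to_one" raises an error if a customer ID is duplicated on either side
combined = demographics.merge(
    status, on="Customer ID", how="inner", validate="one_to_one"
)
print(combined)
```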

We will use the df_churn DataFrame in order to explore the data and get a grip on what we are looking at. First, we will clean the data set and remove any columns that will not assist in our analysis.

In [204]:
df_churn.head()

| | CustomerID | Count | Country | State | City | Zip Code | Lat Long | Latitude | Longitude | Gender | ... | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Churn Label | Churn Value | Churn Score | CLTV | Churn Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3668-QPYBK | 1 | United States | California | Los Angeles | 90003 | 33.964131, -118.272783 | 33.964131 | -118.272783 | Male | ... | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 86 | 3239 | Competitor made better offer |
| 1 | 9237-HQITU | 1 | United States | California | Los Angeles | 90005 | 34.059281, -118.30742 | 34.059281 | -118.30742 | Female | ... | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 | 67 | 2701 | Moved |
| 2 | 9305-CDSKC | 1 | United States | California | Los Angeles | 90006 | 34.048013, -118.293953 | 34.048013 | -118.293953 | Female | ... | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes | 1 | 86 | 5372 | Moved |
| 3 | 7892-POOKP | 1 | United States | California | Los Angeles | 90010 | 34.062125, -118.315709 | 34.062125 | -118.315709 | Female | ... | Month-to-month | Yes | Electronic check | 104.8 | 3046.05 | Yes | 1 | 84 | 5003 | Moved |
| 4 | 0280-XJGEX | 1 | United States | California | Los Angeles | 90015 | 34.039224, -118.266293 | 34.039224 | -118.266293 | Male | ... | Month-to-month | Yes | Bank transfer (automatic) | 103.7 | 5036.3 | Yes | 1 | 89 | 5340 | Competitor had better devices |


In [205]:
df_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 

In [206]:
# This will print all unique entries per column for us to find out more about the data
for i in df_churn.columns:
    print(str(i) + ': ' + str(df_churn[i].unique()))

CustomerID: ['3668-QPYBK' '9237-HQITU' '9305-CDSKC' ... '2234-XADUH' '4801-JZAZL'
 '3186-AJIEK']
Count: [1]
Country: ['United States']
State: ['California']
City: ['Los Angeles' 'Beverly Hills' 'Huntington Park' ... 'Standish' 'Tulelake'
 'Olympic Valley']
Zip Code: [90003 90005 90006 ... 96128 96134 96146]
Lat Long: ['33.964131, -118.272783' '34.059281, -118.30742' '34.048013, -118.293953'
 ... '40.346634, -120.386422' '41.813521, -121.492666'
 '39.191797, -120.212401']
Latitude: [33.964131 34.059281 34.048013 ... 40.346634 41.813521 39.191797]
Longitude: [-118.272783 -118.30742  -118.293953 ... -120.386422 -121.492666
 -120.212401]
Gender: ['Male' 'Female']
Senior Citizen: ['No' 'Yes']
Partner: ['No' 'Yes']
Dependents: ['No' 'Yes']
Tenure Months: [ 2  8 28 49 10  1 47 17  5 34 11 15 18  9  7 12 25 68 55 37  3 27 20  4
 58 53 13  6 19 59 16 52 24 32 38 54 43 63 21 69 22 61 60 48 40 23 39 35
 56 65 33 30 45 46 62 70 50 44 71 26 14 41 66 64 29 42 67 51 31 57 36 72
  0]
Phone Service: ['
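
Printing every unique value works, but the output grows quickly for columns such as 'CustomerID'. As a complementary sketch, `DataFrame.nunique()` gives one count per column, which is enough to spot constant and two-valued columns. The toy frame below is invented and only mimics three of the columns above.

```python
import pandas as pd

# Invented rows that mimic the shape of three df_churn columns
toy = pd.DataFrame({
    "Count": [1, 1, 1, 1],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Multiple Lines": ["No", "Yes", "No phone service", "No"],
})

n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()  # carry no information
binary_cols = n_unique[n_unique == 2].index.tolist()    # candidates for 0/1 encoding
print(constant_cols, binary_cols)
```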

Some observations about the columns:
* 'Count', 'Country', and 'State' each hold a single value, so they carry no information for analysis and we will drop them.
* 'Lat Long' duplicates the information in 'Latitude' and 'Longitude', so we will drop it.
* 'City' is implied by 'Zip Code', so we can drop 'City'.
* 'Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Phone Service', and 'Paperless Billing' each take two values, so we can encode each as 0 or 1.
* 'Multiple Lines', 'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', and 'Streaming Movies' each take three values: 'Yes', 'No', and a third value meaning the underlying phone or internet service is not on the plan. We can simplify these to 1 for 'Yes' and 0 otherwise. For example, a customer without a phone plan is treated as 'No' for 'Multiple Lines'.
* 'Internet Service' also takes three values, but they are service types ('DSL', 'Fiber optic', 'No'), so we will one-hot encode it instead.
* 'Churn Label' and 'Churn Value' are synonymous, so we can drop 'Churn Label'.
* We will also drop 'Churn Score' and 'Churn Reason'. A reason is recorded only for customers who have already churned, so it cannot help predict churn.
* One-hot encoding will be needed for the remaining string-valued columns.
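
The binary and three-valued encodings described above can be sketched on a toy frame. This version uses `Series.map` with explicit dictionaries, which is one possible approach and not the one used in the cells below. Any value missing from a dictionary becomes NaN, which makes unexpected categories easy to spot. The rows are invented.

```python
import pandas as pd

# Invented rows with the same categories as the real columns
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Online Security": ["Yes", "No", "No internet service"],
})

# Two-valued column: plain Yes/No mapping
yes_no = {"Yes": 1, "No": 0}
toy["Partner"] = toy["Partner"].map(yes_no)

# Three-valued column: "service not on the plan" collapses to 0
three_level = {"Yes": 1, "No": 0, "No internet service": 0}
toy["Online Security"] = toy["Online Security"].map(three_level)

# map() leaves NaN for values absent from the dictionary, so count them
unmapped = int(toy.isna().sum().sum())
print(toy)
print("unmapped values:", unmapped)
```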

In [207]:
# Drop constant, redundant, and churn-derived columns
df_churn = df_churn.drop(columns=['Count', 'City', 'Lat Long', 'Country', 'State', 'Churn Label', 'Churn Score', 'Churn Reason'])

In [208]:
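# Assign the result back instead of calling replace(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df_churn
df_churn['Gender'] = df_churn['Gender'].map({'Male': 0, 'Female': 1})
df_churn['Gender'].unique()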

array([0, 1], dtype=int64)

In [209]:
cols = ['Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Paperless Billing', 'Multiple Lines', 'Online Security',
        'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies']

# Encode Yes/No as 1/0; any third category is left untouched for now and printed for inspection
for c in cols:
    df_churn[c] = df_churn[c].replace({'Yes': 1, 'No': 0})
    print(df_churn[c].unique())


[0 1]
[0 1]
[0 1]
[1 0]
[1 0]
[0 1 'No phone service']
[1 0 'No internet service']
[1 0 'No internet service']
[0 1 'No internet service']
[0 1 'No internet service']
[0 1 'No internet service']
[0 1 'No internet service']


In [210]:
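cols = ['Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies']

# Customers without phone service cannot have multiple lines
df_churn['Multiple Lines'] = df_churn['Multiple Lines'].replace({'No phone service': 0})

# Customers without internet service cannot have any internet add-on
for c in cols:
    df_churn[c] = df_churn[c].replace({'No internet service': 0})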

In [211]:
df_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Zip Code           7043 non-null   int64  
 2   Latitude           7043 non-null   float64
 3   Longitude          7043 non-null   float64
 4   Gender             7043 non-null   int64  
 5   Senior Citizen     7043 non-null   int64  
 6   Partner            7043 non-null   int64  
 7   Dependents         7043 non-null   int64  
 8   Tenure Months      7043 non-null   int64  
 9   Phone Service      7043 non-null   int64  
 10  Multiple Lines     7043 non-null   int64  
 11  Internet Service   7043 non-null   object 
 12  Online Security    7043 non-null   int64  
 13  Online Backup      7043 non-null   int64  
 14  Device Protection  7043 non-null   int64  
 15  Tech Support       7043 non-null   int64  
 16  Streaming TV       7043 

In [212]:
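# One-hot encode the remaining categorical columns; dtype=int keeps the 0/1 output shown below
df_dummy = pd.get_dummies(df_churn, columns=['Internet Service', 'Contract', 'Payment Method'], dtype=int)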

df_dummy.head()

| | CustomerID | Zip Code | Latitude | Longitude | Gender | Senior Citizen | Partner | Dependents | Tenure Months | Phone Service | ... | Internet Service_DSL | Internet Service_Fiber optic | Internet Service_No | Contract_Month-to-month | Contract_One year | Contract_Two year | Payment Method_Bank transfer (automatic) | Payment Method_Credit card (automatic) | Payment Method_Electronic check | Payment Method_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3668-QPYBK | 90003 | 33.964131 | -118.272783 | 0 | 0 | 0 | 0 | 2 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 9237-HQITU | 90005 | 34.059281 | -118.30742 | 1 | 0 | 0 | 1 | 2 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 9305-CDSKC | 90006 | 34.048013 | -118.293953 | 1 | 0 | 0 | 1 | 8 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 7892-POOKP | 90010 | 34.062125 | -118.315709 | 1 | 0 | 1 | 1 | 28 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0280-XJGEX | 90015 | 34.039224 | -118.266293 | 0 | 0 | 0 | 1 | 49 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
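
One-hot encoding, as produced by `pd.get_dummies` above, turns each category into its own indicator column. The sketch below shows this on an invented 'Contract' column. From pandas 2.0 onward the indicator columns are boolean by default, so `dtype=int` is passed here to get 0/1 values like those in the table above.

```python
import pandas as pd

# Invented rows using the three contract types seen in the data
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"]}
)

# Each category becomes a column named "<original column>_<category>"
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded)

# Exactly one indicator is set per row
row_sums = encoded.sum(axis=1).tolist()
```

Because the indicators in each group always sum to 1, one column per group is redundant. `pd.get_dummies` accepts `drop_first=True` to remove it, which may matter for linear models later on.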


In [213]:
df_dummy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 32 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   CustomerID                                7043 non-null   object 
 1   Zip Code                                  7043 non-null   int64  
 2   Latitude                                  7043 non-null   float64
 3   Longitude                                 7043 non-null   float64
 4   Gender                                    7043 non-null   int64  
 5   Senior Citizen                            7043 non-null   int64  
 6   Partner                                   7043 non-null   int64  
 7   Dependents                                7043 non-null   int64  
 8   Tenure Months                             7043 non-null   int64  
 9   Phone Service                             7043 non-null   int64  
 10  Multiple Lines                      

And there we have it! We have cleaned the data and kept the columns that we think will help with our analysis. The only remaining identifier column is 'CustomerID', which we set as the index. One issue remains: 'Total Charges' contains a blank string for a small number of customers with a tenure of 0 months, so we drop those rows, which leaves only numeric data. We finish the Data Wrangling notebook by saving the cleaned data set to 'clean_data.csv', which we will explore in the next notebook.

In [214]:
df_dummy.set_index('CustomerID', inplace=True)

df_dummy.head()

| CustomerID | Zip Code | Latitude | Longitude | Gender | Senior Citizen | Partner | Dependents | Tenure Months | Phone Service | Multiple Lines | ... | Internet Service_DSL | Internet Service_Fiber optic | Internet Service_No | Contract_Month-to-month | Contract_One year | Contract_Two year | Payment Method_Bank transfer (automatic) | Payment Method_Credit card (automatic) | Payment Method_Electronic check | Payment Method_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3668-QPYBK | 90003 | 33.964131 | -118.272783 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9237-HQITU | 90005 | 34.059281 | -118.30742 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9305-CDSKC | 90006 | 34.048013 | -118.293953 | 1 | 0 | 0 | 1 | 8 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7892-POOKP | 90010 | 34.062125 | -118.315709 | 1 | 0 | 1 | 1 | 28 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0280-XJGEX | 90015 | 34.039224 | -118.266293 | 0 | 0 | 0 | 1 | 49 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |


In [222]:
df_dummy[df_dummy['Total Charges'] == ' ']

| CustomerID | Zip Code | Latitude | Longitude | Gender | Senior Citizen | Partner | Dependents | Tenure Months | Phone Service | Multiple Lines | ... | Internet Service_DSL | Internet Service_Fiber optic | Internet Service_No | Contract_Month-to-month | Contract_One year | Contract_Two year | Payment Method_Bank transfer (automatic) | Payment Method_Credit card (automatic) | Payment Method_Electronic check | Payment Method_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4472-LVYGI | 92408 | 34.084909 | -117.258107 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3115-CZMZD | 93526 | 36.869584 | -118.189241 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5709-LVOEQ | 94401 | 37.590421 | -122.306467 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4367-NUYAO | 95014 | 37.306612 | -122.080621 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1371-DWPAZ | 95569 | 40.363446 | -123.835041 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 7644-OMVMY | 90029 | 34.089953 | -118.294824 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3213-VVOLG | 92585 | 33.739412 | -117.173334 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2520-SGTTA | 95005 | 37.078873 | -122.090386 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2923-ARZLG | 91750 | 34.144703 | -117.770299 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4075-WKNIU | 90201 | 33.970343 | -118.171368 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
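
The filter above works because the placeholder is known to be a single space. As a more general sketch, `pd.to_numeric` with `errors="coerce"` turns any entry that cannot be parsed as a number into NaN, so the offending rows can be found without knowing the placeholder in advance. The toy frame and its customer IDs are invented.

```python
import pandas as pd

# Invented rows; the middle one mimics a blank 'Total Charges' entry
toy = pd.DataFrame(
    {"Tenure Months": [2, 0, 8], "Total Charges": [108.15, " ", 820.5]},
    index=pd.Index(["A-1", "B-2", "C-3"], name="CustomerID"),
)

# Entries that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(toy["Total Charges"], errors="coerce")

# Rows that were present in the raw data but failed numeric conversion
bad_rows = toy[as_number.isna() & toy["Total Charges"].notna()]
print(bad_rows)
```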


In [226]:
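# Convert 'Total Charges' to numbers; blank strings become NaN.
# Assigning back avoids replace(inplace=True) on a column selection.
df_dummy['Total Charges'] = pd.to_numeric(df_dummy['Total Charges'], errors='coerce')

# Drop the rows where 'Total Charges' was blank
df_dummy.dropna(inplace=True)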

In [227]:
df_dummy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 3668-QPYBK to 3186-AJIEK
Data columns (total 31 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Zip Code                                  7032 non-null   int64  
 1   Latitude                                  7032 non-null   float64
 2   Longitude                                 7032 non-null   float64
 3   Gender                                    7032 non-null   int64  
 4   Senior Citizen                            7032 non-null   int64  
 5   Partner                                   7032 non-null   int64  
 6   Dependents                                7032 non-null   int64  
 7   Tenure Months                             7032 non-null   int64  
 8   Phone Service                             7032 non-null   int64  
 9   Multiple Lines                            7032 non-null   int64  
 10  Online Security           

In [228]:
df_dummy.to_csv('data/clean_data.csv')
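
When the next notebook reloads this file, 'CustomerID' comes back as an ordinary column unless it is named as the index. The sketch below shows that round trip on a two-row toy frame written to an in-memory buffer instead of 'data/clean_data.csv'. The two customer IDs are borrowed from the sample rows above, and the rest is an assumption about how the file will be read.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned frame, indexed by CustomerID
toy = pd.DataFrame(
    {"Zip Code": [90003, 90005], "Total Charges": [108.15, 151.65]},
    index=pd.Index(["3668-QPYBK", "9237-HQITU"], name="CustomerID"),
)

# Write to an in-memory buffer; the index is saved as the first CSV column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores CustomerID as the index instead of a data column
reloaded = pd.read_csv(buffer, index_col="CustomerID")
print(reloaded.dtypes)
```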