In [1]:
import pandas as pd
import numpy as np
from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

from telco_pipeline import get_data_from_sql, peekatdata, split, df_value_counts, percent_missing, clean_data

### Walkthroughs before project

-What are the profiles of the customers who are churning? In presentation

-How do you deal with the column that converts monthly tenure to annual?

-Don't just identify that people leave bc they can, but what are the drivers that make them leave?

-Answer quetions in exploration and analysis stages. Add comments and notes in notebook about why you're doing what you're doing and what your conclusions are.

-Don't deliver the data without your assessment of what is going on even if you add "I would need to assess this, but..."

-This "leader" wants to know all of the data behind our assessments. A diagram of the work flow and what you did is great.

-How likely is your model to accurately predict churn, return False positives and false negatives?


-Your .py files do NOT have to be exactly the ones and names of what is laid out in the project.

-If you transform data in a column, then you need to explain why you did that. ie. if you scale data, why did you do that?

-Add context (text) to numbers that make them meaningful.

-Include Google slide summarizing your model with links in your README.

-Project Planning from project outline would be good for the structure of your README file. ie. - data dictionary.

-Make a task list that lays out what you need to do at each stage of the pipeline. This could look like headers in your notebook.

-Exploration is when you would to do your t-tests, chi-squared tests.

-Train multiple models, different algorithms, but also change hyperparameters for the same model. ie. - k=2, k=3, k=4, diff features in same algorithm.

-You can decide what your cutoff point will be when deciding what the probability should be to predict churn or not churn at the end.

-Baseline #1 is proportion of churn to not churn. Find this rate first! It needs to be better than 60% accurate.

-Baseline #2 is doing minimal prep and running it through a model. 

-Model #3 is your MVP. Explore and answer the required questions. Prepare data to go into other algorithms (encoding or scaling as needed). Some automated feature selection here, too. Make predictions on this data

-Other Modeling - to gain more insight but beyond simple answers of basic questions. This would be where you include extra feature engineering. 

### Acquistion

- Here I use my function to bring in data using a sql query
    
- My query brought over everything from all of the tables together,
    so I could look at the data before deciding how to clean and process.

In [2]:
df = get_data_from_sql()

### Data Prep

- I created a function that returns important info on the dataframe.

In [3]:
peekatdata(df)

DataFrame Shape:

(7043, 24)

Info about:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 24 columns):
payment_type_id             7043 non-null int64
internet_service_type_id    7043 non-null int64
contract_type_id            7043 non-null int64
customer_id                 7043 non-null object
gender                      7043 non-null object
senior_citizen              7043 non-null int64
partner                     7043 non-null object
dependents                  7043 non-null object
tenure                      7043 non-null int64
phone_service               7043 non-null object
multiple_lines              7043 non-null object
online_security             7043 non-null object
online_backup               7043 non-null object
device_protection           7043 non-null object
tech_support                7043 non-null object
streaming_tv                7043 non-null object
streaming_movies            7043 non-null object
paperless_billing     

- I created and used a function to decide whether to bin data for the value counts.

- I'm not worried about customer_id's unique counts because each is unique, and I will end up dropping this column before running it through a model.

- That leaves tenure, monthly_charges, and total_charges with a large number of unique values which may benefit from binning.

In [4]:
valcount_df = df_value_counts(df)
valcount_df

Unnamed: 0,0
payment_type_id,4
internet_service_type_id,3
contract_type_id,3
customer_id,7043
gender,2
senior_citizen,2
partner,2
dependents,2
tenure,73
phone_service,2


- I want to decide if and which rows or columns should be dropped.

- Running .value_counts() on total charges column showed me it has 11 values that are blank space. These are most likely for customers who have not had the service long enough to have a total_charge.

- Considering these findings, I will create and run a function that replaces the blank values with NaN and returns the percent of missing values in each column in order to make my final decision about dropping rows that contain NaNs.

- I'm including in this function a line to drop customer_id as I will not be needing this column.

In [5]:
df["total_charges"].value_counts(dropna=False)

20.2       11
           11
19.75       9
20.05       8
19.65       8
19.9        8
19.55       7
45.3        7
20.15       6
19.45       6
20.25       6
20.3        5
20.45       5
69.65       4
70.6        4
44.4        4
50.15       4
69.9        4
19.95       4
74.7        4
69.6        4
20.35       4
19.4        4
19.85       4
49.9        4
20.5        4
19.2        4
69.95       4
19.5        4
20.4        4
19.3        4
75.3        4
44          4
45.7        3
305.55      3
20.1        3
20          3
45.85       3
80.55       3
55.7        3
69.25       3
2317.1      3
85          3
70.45       3
50.45       3
85.5        3
1284.2      3
74.9        3
20.55       3
25.25       3
70.1        3
35.9        3
470.2       3
24.8        3
69.55       3
220.45      3
79.55       3
69.1        3
19.25       3
70.3        3
19.1        3
70.15       3
75.35       3
84.5        3
50.75       3
383.65      3
44.75       3
86.05       3
74.3        3
24.4        3
74.35       3
74.6  

- This function reveals that only total_charges has missing values, and the percent of missing values is .16%.

- This confirms my decision to drop the rows with NaNs.


In [6]:
percent_missing(df)

payment_type_id             0.000000
internet_service_type_id    0.000000
contract_type_id            0.000000
customer_id                 0.000000
gender                      0.000000
senior_citizen              0.000000
partner                     0.000000
dependents                  0.000000
tenure                      0.000000
phone_service               0.000000
multiple_lines              0.000000
online_security             0.000000
online_backup               0.000000
device_protection           0.000000
tech_support                0.000000
streaming_tv                0.000000
streaming_movies            0.000000
paperless_billing           0.000000
monthly_charges             0.000000
total_charges               0.156183
churn                       0.000000
contract_type               0.000000
internet_service_type       0.000000
payment_type                0.000000
dtype: float64

- Here I am running my function clean_data to replace empty space values in total_charges with NaN, dropping the rows with Nan, dropping customer_id and confirming the drop with an .isnull().sum().

In [7]:
df = clean_data(df)
df.isnull().sum()

payment_type_id             0
internet_service_type_id    0
contract_type_id            0
gender                      0
senior_citizen              0
partner                     0
dependents                  0
tenure                      0
phone_service               0
multiple_lines              0
online_security             0
online_backup               0
device_protection           0
tech_support                0
streaming_tv                0
streaming_movies            0
paperless_billing           0
monthly_charges             0
total_charges               0
churn                       0
contract_type               0
internet_service_type       0
payment_type                0
dtype: int64

- Here I will transform dataframe values from "Yes" and "No"/"No phone service"/"No internet service" to True/False, so I can feed this as a target into my model.

- I will hardcode this by hand for and drop duplicate columns at this point.

In [8]:
df.replace(to_replace=['No', 'Yes'], value=[True, False], inplace=True)
df.replace(to_replace=['No phone service'], value=[False], inplace=True)
df.replace(to_replace=['Female', 'Male'], value=[True, False], inplace=True)
df.replace(to_replace=['No internet service'], value=[False], inplace=True)
df.dtypes

payment_type_id               int64
internet_service_type_id      int64
contract_type_id              int64
gender                         bool
senior_citizen                int64
partner                        bool
dependents                     bool
tenure                        int64
phone_service                  bool
multiple_lines                 bool
online_security                bool
online_backup                  bool
device_protection              bool
tech_support                   bool
streaming_tv                   bool
streaming_movies               bool
paperless_billing              bool
monthly_charges             float64
total_charges               float64
churn                          bool
contract_type                object
internet_service_type        object
payment_type                 object
dtype: object

- Here I compute a new feature comleted_years translating tenure from months to years. 

- I want to use this later to look at month-to-month and 1-year contract customers.


In [9]:
df["completed_years"] = round(df["tenure"] / 12)
df["completed_years"].value_counts().sort_index()

0.0    1470
1.0    1156
2.0    1004
3.0     715
4.0     868
5.0     786
6.0    1033
Name: completed_years, dtype: int64

- Here I created a new column phone_id that captures phone_service and multiple_lines into a single int variable.


In [10]:
#df["phone_id"] = df["multiple_lines"].map({"Yes": 1, "No": 0, "No phone service": 0})

- Here I add a new column "family" that combines "dependents" and "partner" and is True for either having dependents or family or False for having neither.

Data Dictionary:

- True - partner or dependents, False - neither partner or dependents

In [11]:
df['family'] = (df.partner == True) | (df.dependents == True)

- Here I added a new column streaming_services of dtype bool that combines streaming_movies and streaming_tv into one. If the customer has either of these, the value is True. 

- I decided to combine these based on the results of my heat map in exploration. These two variables are highly correlated with each other.

In [12]:
df["streaming_services"] = (df.streaming_movies == True) | (df.streaming_tv == True)

- Here I added a column online_security_backup of dtype int that combines
online_security and online_backup dependents and partner and is dtype int.

Data Dictionary: 
    
- (0 = none, 1 = online security, 2 = online backup, 3 = both)

In [16]:
df["online_services"] = df.online_security + df.online_backup + df.device_protection + df.tech_support
df["online_services"].value_counts()

True     5043
False    1989
Name: online_services, dtype: int64

- Here I am checking my data types to begin cleaning up my dataframe, transforming all categorical/object dtypes to numerical and dropping columns I used to create the new, merged columns above.

In [None]:
df = df.drop(columns=["phone_service", "multiple_lines", "contract_type", "internet_service_type", "payment_type", "online_security", "online_backup", "device_protection", "tech_support"])

In [None]:
df.dtypes

- Here I am ready to split the data 70/30 train/test using my split function.


In [None]:
train, test = split(df=df, target="churn", train_prop=.70, seed=123)

train.head()

In [None]:
df.dtypes

In [None]:
# 11. Scale monthly_charges and total_charges
# I will run my first model without scaling and return to this step for my MVP


### Data Exploration

In [None]:
# 1. Could the month they signed up influence churn? 
# (Plot the rate of churn on a line chart where x is the tenure 
# and y is the rate of churn (customers churned/total customers)).



In [None]:
# 2. Are there features that indicate higher propensity to churn?


In [None]:
# 3. Is there a price threshold for specific services where the liklihood
# of churn increases? What services and at what price point?


In [None]:
# 4. Looking at churn rate for month-to-month vs. 1-year contract customers
# afther their 12th month of service, is the rate of churn different?


In [None]:
# 5. Use a t-test to find out of the monthly charges of those who have
# churned is significantly higher thatn those who have not. Control for:
# (phone_id, internet_service_type_id, online_security_backup, device_protection, 
# tech_support, and contract_type_id)


In [None]:
# 6. Perform a correlation test, stating hypothesis and conclusion clearly
# that states if montly charges can be explained by internet_service_type


In [None]:
# 7. 


In [None]:
# 8. Create visualizations exploring interactions of variables (independent
# with independent and independent with dependent). The goal is to identify
# features that are related to churn, identify integrity issues, understand
# how the data works.


In [None]:
# 9. 


In [None]:
# 10. 


### Modeling

In [None]:
# 1. Feature selection: can you remove any features that provide limited
# to no additional info?


In [None]:
# 2. Train (fit, transform, evaluate) multiple models and select the best
# performing model.


In [None]:
# 3. Compare eval metrics across all the models and select best performing


In [None]:
# 4. Test the final model (tranform, evaluate) on your out-of-sample data.
# Summarize the performance, interpret results.
