Classification Project
Why are our customers churning?

Some questions I have include:

Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))
Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

### Deliverables:

I will also need a report (ipynb) answering the question, "Why are our customers churning?" I want to see the analysis you did to answer my questions and lead to your findings. Please clearly call out the questions and answers you are analyzing. E.g. If you find that month-to-month customers churn more, I won't be surprised, but I am not getting rid of that plan. The fact that they churn is not because they can, it's because they can and they are motivated to do so. I want some insight into why they are motivated to do so. I realize you will not be able to do a full causal experiment, but I hope to see some solid evidence of your conclusions.

I will need you to deliver to me a csv with the customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn). I would also like a single goolgle slide that illustrates how your model works, including the features being used, so that I can deliver this to the SLT when they come with questions about how these values were derived. Please make sure you include how likely your model is to give a high probability of churn when churn doesn't occur, to give a low probability of churn when churn occurs, and to accurately predict churn.

Finally, our development team will need a .py file that will take in a new dataset, (in the exact same form of the one you acquired from telco_churn.customers) and perform all the transformations necessary to run the model you have developed on this new dataset to provide probabilities and predictions.

Specification
Detailed instructions for each section are below.

In general, make sure you document your work. You don't need to explain what every line of code is doing, but you should explain what and why you are doing. For example, if you drop a feature from the dataset, you should explain why you decided to do so, or why that is a reasonable thing to do. If you transform the data in a column, you should explain why you are making that transformation.

In addition, you should not present numers in isolation. If your code outputs a number, be sure you give some context to the number.

### Specific Deliverables:

- a jupyter notebook where your work takes place
- a csv file that predicts churn for each customer
- a python script that prepares data such that it can be fed into your model
- a google slide summarizing your model
- a README.md file that contains a link to your google slides presentation, and instructions for how to use your python script(s)

# Acquisition
Get the data from the customers table from the telco_churn database on the codeup data science database server.

You may wish to join some tables as part of your query.
This data should end up in a pandas data frame.

In [12]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree

import graphviz
from graphviz import Graph

from telco_prepare import peekatdata

import env

In [2]:
def get_connection(db, user=env.user, host=env.host, password=env.password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

def get_telco_data():
    return pd.read_sql('SELECT c.*, ct.contract_type, ist.internet_service_type, pt.payment_type\
    FROM customers as c\
    JOIN contract_types as ct USING (contract_type_id)\
    JOIN internet_service_types as ist USING (internet_service_type_id)\
    JOIN payment_types as pt USING (payment_type_id);', get_connection('telco_churn'))

In [3]:
df = get_telco_data()
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,Yes,1,No,2,59.9,542.4,No,Month-to-month,DSL,Mailed check
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,Yes,1,Yes,4,69.4,571.45,No,Month-to-month,DSL,Credit card (automatic)
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,No,1,Yes,1,48.2,340.35,No,Month-to-month,DSL,Electronic check
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,No,1,Yes,1,25.1,25.1,Yes,Month-to-month,DSL,Electronic check
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,No,1,No,3,30.5,30.5,Yes,Month-to-month,DSL,Bank transfer (automatic)


Write a function, peekatdata(dataframe), that takes a dataframe as input and computes and returns the following:

- creates dataframe object head_df (df of the first 5 rows) and prints contents to screen
- creates dataframe object tail_df (df of the last 5 rows) and prints contents to screen
- creates tuple object shape_tuple (tuple of (nrows, ncols)) and prints tuple to screen
- creates dataframe object describe_df (summary statistics of all numeric variables) and prints contents to screen.
- prints to screen the information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [4]:
# def peekatdata(dataframe):
    
#     print('df head:')
#     print(dataframe.head())
    
#     print('df tail:')
#     print(dataframe.tail())
    
#     print('df shape:')
#     print(dataframe.shape)
    
#     print('df described:')
#     print(dataframe.describe())
    

# #     index_dtype = 
# #     return index_dtype

#     print('df types:')
#     print(dataframe.dtypes)
    
# peekatdata(df)

In [5]:
# def df_head(dataframe):
    
# #     print('df head:')
#     return dataframe.head()

# def df_tail(dataframe):    
# #     print('df tail:')
#     return dataframe.tail()
    
# def df_shape(dataframe):
# #     print('df shape:')
#     return dataframe.shape

# def df_describe(dataframe):
# #     print('df described:')
#     return dataframe.describe()
    

# # #     index_dtype = 
# # #     return index_dtype

# def df_types(dataframe):
# #     print('df types:')
#     return dataframe.dtypes

# def peekatdata(dataframe):
#     peekatda = df\
#         .pipe(df_head)\
#         .pipe(df_tail)\
#         .pipe(df_describe)\
#         .pipe(df_types)
# #         .pipe(df_shape)\
        
#     return peekatda

In [6]:
from telco_prepare import df_head

In [7]:
df_head(df)

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,Yes,1,No,2,59.9,542.4,No,Month-to-month,DSL,Mailed check
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,Yes,1,Yes,4,69.4,571.45,No,Month-to-month,DSL,Credit card (automatic)
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,No,1,Yes,1,48.2,340.35,No,Month-to-month,DSL,Electronic check
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,No,1,Yes,1,25.1,25.1,Yes,Month-to-month,DSL,Electronic check
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,No,1,No,3,30.5,30.5,Yes,Month-to-month,DSL,Bank transfer (automatic)


In [8]:
from telco_prepare import df_tail
df_tail(df)

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
7038,9950-MTGYX,Male,0,Yes,Yes,28,Yes,No,3,No internet service,...,No internet service,3,Yes,4,20.3,487.95,No,Two year,,Credit card (automatic)
7039,9953-ZMKSM,Male,0,No,No,63,Yes,Yes,3,No internet service,...,No internet service,3,No,2,25.25,1559.3,No,Two year,,Mailed check
7040,9964-WBQDJ,Female,0,Yes,No,71,Yes,Yes,3,No internet service,...,No internet service,3,Yes,4,24.4,1725.4,No,Two year,,Credit card (automatic)
7041,9972-EWRJS,Female,0,Yes,Yes,67,Yes,No,3,No internet service,...,No internet service,3,Yes,3,19.25,1372.9,No,Two year,,Bank transfer (automatic)
7042,9975-GPKZU,Male,0,Yes,Yes,46,Yes,No,3,No internet service,...,No internet service,3,No,4,19.75,856.5,No,Two year,,Credit card (automatic)


In [9]:
from telco_prepare import df_shape
df_shape(df)

(7043, 24)

In [10]:
from telco_prepare import df_describe
df_describe(df)

Unnamed: 0,senior_citizen,tenure,internet_service_type_id,contract_type_id,payment_type_id,monthly_charges
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,0.162147,32.371149,1.872923,1.690473,2.315633,64.761692
std,0.368612,24.559481,0.737796,0.833755,1.148907,30.090047
min,0.0,0.0,1.0,1.0,1.0,18.25
25%,0.0,9.0,1.0,1.0,1.0,35.5
50%,0.0,29.0,2.0,1.0,2.0,70.35
75%,0.0,55.0,2.0,2.0,3.0,89.85
max,1.0,72.0,3.0,3.0,4.0,118.75


In [14]:
from telco_prepare import df_types
df_types(df)

customer_id                  object
gender                       object
senior_citizen                int64
partner                      object
dependents                   object
tenure                        int64
phone_service                object
multiple_lines               object
internet_service_type_id      int64
online_security              object
online_backup                object
device_protection            object
tech_support                 object
streaming_tv                 object
streaming_movies             object
contract_type_id              int64
paperless_billing            object
payment_type_id               int64
monthly_charges             float64
total_charges                object
churn                        object
contract_type                object
internet_service_type        object
payment_type                 object
dtype: object

In [13]:
peekatdata(df)

Unnamed: 0,senior_citizen,tenure,internet_service_type_id,contract_type_id,payment_type_id,monthly_charges
count,5.0,5.0,5.0,5.0,5.0,5.0
mean,0.4,5.4,1.0,1.0,2.2,46.62
std,0.547723,4.09878,0.0,0.0,1.30384,18.846405
min,0.0,1.0,1.0,1.0,1.0,25.1
25%,0.0,1.0,1.0,1.0,1.0,30.5
50%,0.0,7.0,1.0,1.0,2.0,48.2
75%,1.0,9.0,1.0,1.0,3.0,59.9
max,1.0,9.0,1.0,1.0,4.0,69.4


# Data Prep

Write a function, df_value_counts(dataframe), that takes a dataframe as input and computes and returns the values by frequency for each column. The function should decide whether or not to bin the data for the value counts.

In [17]:
# Need to update with decision to bin.  Function below will give us the features with more than 10 different options, which will be the features we bin.
def df_value_counts(dataframe):
    df_cols = dataframe.columns
    for col in df_cols:
        print('-----%s-----' %col)
        print(df[col].value_counts())
        
df_value_counts(df)

-----customer_id-----
0623-EJQEG    1
0219-QAERP    1
2402-TAIRZ    1
4957-TREIR    1
8208-EUMTE    1
7766-CLTIC    1
1506-YJTYT    1
4719-UMSIY    1
4933-IKULF    1
2428-ZMCTB    1
1261-FWTTE    1
7696-AMHOD    1
6969-MVBAI    1
3400-ESFUW    1
5376-PCKNB    1
0133-BMFZO    1
0962-CQPWQ    1
3716-UVSPD    1
7009-LGECI    1
4897-QSUYC    1
5569-OUICF    1
4514-GFCFI    1
5493-SDRDQ    1
3398-FSHON    1
4129-LYCOI    1
4612-SSVHJ    1
8204-YJCLA    1
7137-RYLPP    1
7011-CVEUC    1
9851-QXEEQ    1
             ..
5382-TEMLV    1
7295-JOMMD    1
8128-YVJRG    1
6711-VTNRE    1
5227-JSCFE    1
5091-HFAZW    1
3976-NLDEZ    1
4806-HIPDW    1
6919-ELBGL    1
9885-CSMWE    1
6993-OHLXR    1
7785-RDVIG    1
4456-RHSNB    1
9992-UJOEL    1
6771-XWBDM    1
3926-CUQZX    1
8181-YHCMF    1
3428-MMGUB    1
3621-CHYVB    1
2121-JAFOM    1
1017-FBQMM    1
5553-AOINX    1
6229-UOLQL    1
3519-ZKXGG    1
4826-TZEVA    1
3704-IEAXF    1
1623-NLDOT    1
6620-JDYNW    1
6048-NJXHX    1
3082-VQXNH    1
Na

In [18]:
def df_value_counts_bin(dataframe):
    df_cols = dataframe.columns
    features_to_bin = []
    for col in df_cols:
#         print('-----%s-----' %col)
#         print(df[col].value_counts())
        if df[col].value_counts().count() > 10:
            features_to_bin.append(col)
        print(features_to_bin[-1:])
        
df_value_counts_bin(df)

['customer_id']
['customer_id']
['customer_id']
['customer_id']
['customer_id']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['monthly_charges']
['total_charges']
['total_charges']
['total_charges']
['total_charges']
['total_charges']
['tenure_year']


### Handle Missing Values

Explore the data and see if there are any missing values.

Write a function that accepts a dataframe and returns the names of the columns that have missing values, and the percent of missing values in each column that has missing values.

In [19]:
def missing_values(dataframe):
    df_cols = df.columns
    col_name = []
    null_values = []
    null_percents = []
    for col in df_cols:
        value = df[col].isnull().sum()
        null_percent = value / df[col].count()
        col_name.append(col)
        null_values.append(value)
        null_percents.append(null_percent)

    null_tuples = list(zip(col_name, null_values, null_percents))
    null_df = pd.DataFrame(null_tuples, columns = ['Feature', 'Null_Count', 'Null_Percent'])

    print(type(null_df))
    print(null_df)  
    
missing_values(df)

<class 'pandas.core.frame.DataFrame'>
                     Feature  Null_Count  Null_Percent
0                customer_id           0           0.0
1                     gender           0           0.0
2             senior_citizen           0           0.0
3                    partner           0           0.0
4                 dependents           0           0.0
5                     tenure           0           0.0
6              phone_service           0           0.0
7             multiple_lines           0           0.0
8   internet_service_type_id           0           0.0
9            online_security           0           0.0
10             online_backup           0           0.0
11         device_protection           0           0.0
12              tech_support           0           0.0
13              streaming_tv           0           0.0
14          streaming_movies           0           0.0
15          contract_type_id           0           0.0
16         paperless_billin

In [20]:
missing_values(df)

<class 'pandas.core.frame.DataFrame'>
                     Feature  Null_Count  Null_Percent
0                customer_id           0           0.0
1                     gender           0           0.0
2             senior_citizen           0           0.0
3                    partner           0           0.0
4                 dependents           0           0.0
5                     tenure           0           0.0
6              phone_service           0           0.0
7             multiple_lines           0           0.0
8   internet_service_type_id           0           0.0
9            online_security           0           0.0
10             online_backup           0           0.0
11         device_protection           0           0.0
12              tech_support           0           0.0
13              streaming_tv           0           0.0
14          streaming_movies           0           0.0
15          contract_type_id           0           0.0
16         paperless_billin

The function below will sort each column and output the head and tail for that column.  This will let us see if there's something fishy about any of data in each column.

In [21]:
def sort_col_val(dataframe):
    df_cols = dataframe.columns
    for col in df_cols:
        print('Sorted by ' + str(col) + ':')
        print('Head:')
        print(df[[col]].sort_values(by=[col]).head().T)
        print(' ')
        print('Tail: ')
        print(df[[col]].sort_values(by=[col]).tail().T)
        print('-----')

In [22]:
sort_col_val(df)

Sorted by customer_id:
Head:
                   1223        0           2421        2422        2423
customer_id  0002-ORFBO  0003-MKNFE  0004-TLHLJ  0011-IGKFF  0013-EXCHZ
 
Tail: 
                   1792        4548        1222        2419        2420
customer_id  9987-LUTYD  9992-RRAMN  9992-UJOEL  9993-LHIEB  9995-HOTOH
-----
Sorted by gender:
Head:
          3521    5577    2895    2897    2899
gender  Female  Female  Female  Female  Female
 
Tail: 
        3095  3094  3092  3122  7042
gender  Male  Male  Male  Male  Male
-----
Sorted by senior_citizen:
Head:
                0     4703  4702  4701  4700
senior_citizen     0     0     0     0     0
 
Tail: 
                4542  4540  4537  2259  5117
senior_citizen     1     1     1     1     1
-----
Sorted by partner:
Head:
        0    3776 3775 3773 3772
partner   No   No   No   No   No
 
Tail: 
        4104 1385 4107 4087 7042
partner  Yes  Yes  Yes  Yes  Yes
-----
Sorted by dependents:
Head:
           0    4065 4063 4062 406

Looking at the above output, something seems odd about the lower end of total_charges.  Below, I am checking for values that are whitespace, or ' '.  We have 11 rows without an actual amount in total_charges.

In [23]:
df['total_charges'].replace(' ', (df['monthly_charges'] * df['tenure']), inplace=True)

In [24]:
df.loc[(df['total_charges'] == ' ')].T

customer_id
gender
senior_citizen
partner
dependents
tenure
phone_service
multiple_lines
internet_service_type_id
online_security
online_backup


We are now showing no empty cells.  Let's do a value_count to if anything strange is in our list now.  They are showing as 0.0 now.  Let's convert the total_charges column to a float and then drop these rows.

In [25]:
df_value_counts(df)

-----customer_id-----
0623-EJQEG    1
0219-QAERP    1
2402-TAIRZ    1
4957-TREIR    1
8208-EUMTE    1
7766-CLTIC    1
1506-YJTYT    1
4719-UMSIY    1
4933-IKULF    1
2428-ZMCTB    1
1261-FWTTE    1
7696-AMHOD    1
6969-MVBAI    1
3400-ESFUW    1
5376-PCKNB    1
0133-BMFZO    1
0962-CQPWQ    1
3716-UVSPD    1
7009-LGECI    1
4897-QSUYC    1
5569-OUICF    1
4514-GFCFI    1
5493-SDRDQ    1
3398-FSHON    1
4129-LYCOI    1
4612-SSVHJ    1
8204-YJCLA    1
7137-RYLPP    1
7011-CVEUC    1
9851-QXEEQ    1
             ..
5382-TEMLV    1
7295-JOMMD    1
8128-YVJRG    1
6711-VTNRE    1
5227-JSCFE    1
5091-HFAZW    1
3976-NLDEZ    1
4806-HIPDW    1
6919-ELBGL    1
9885-CSMWE    1
6993-OHLXR    1
7785-RDVIG    1
4456-RHSNB    1
9992-UJOEL    1
6771-XWBDM    1
3926-CUQZX    1
8181-YHCMF    1
3428-MMGUB    1
3621-CHYVB    1
2121-JAFOM    1
1017-FBQMM    1
5553-AOINX    1
6229-UOLQL    1
3519-ZKXGG    1
4826-TZEVA    1
3704-IEAXF    1
1623-NLDOT    1
6620-JDYNW    1
6048-NJXHX    1
3082-VQXNH    1
Na

In [26]:
df['total_charges'] = df['total_charges'].convert_objects(convert_numeric=True)
df.dtypes

customer_id                  object
gender                       object
senior_citizen                int64
partner                      object
dependents                   object
tenure                        int64
phone_service                object
multiple_lines               object
internet_service_type_id      int64
online_security              object
online_backup                object
device_protection            object
tech_support                 object
streaming_tv                 object
streaming_movies             object
contract_type_id              int64
paperless_billing            object
payment_type_id               int64
monthly_charges             float64
total_charges               float64
churn                        object
contract_type                object
internet_service_type        object
payment_type                 object
tenure_year                 float64
dtype: object

Success!  Total_charges is now a float.  Let's drop the rows with 0.0.

In [27]:
df = df.drop(df[df.total_charges == 0].index)

In [28]:
df.sort_values(by=['total_charges']).head().T

Unnamed: 0,6145,6010,5989,6039,5589
customer_id,2967-MXRAV,9318-NKNFC,8992-CEUEN,9975-SKRNR,1423-BMPBQ
gender,Male,Male,Female,Male,Female
senior_citizen,0,0,0,0,0
partner,Yes,No,No,No,Yes
dependents,Yes,No,No,No,Yes
tenure,1,1,1,1,1
phone_service,Yes,Yes,Yes,Yes,Yes
multiple_lines,No,No,No,No,No
internet_service_type_id,3,3,3,3,3
online_security,No internet service,No internet service,No internet service,No internet service,No internet service


Let's create a feature that calculates estimated total_charges based on tenure * monthly_charges and gives us a percentage vs. the actual total_charges.  This will be used to check for data integrity issues.

In [29]:
df['percent_var_tc_from_act_tc'] = (df['monthly_charges'] * df['tenure']) / df['total_charges']

In [30]:
df.describe()

Unnamed: 0,senior_citizen,tenure,internet_service_type_id,contract_type_id,payment_type_id,monthly_charges,total_charges,tenure_year,percent_var_tc_from_act_tc
count,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0
mean,0.1624,32.421786,1.872582,1.688567,2.315557,64.798208,2283.300441,2.701816,1.002311
std,0.368844,24.54526,0.737271,0.832934,1.149523,30.085974,2266.771362,2.045438,0.051288
min,0.0,1.0,1.0,1.0,1.0,18.25,18.8,0.083333,0.635545
25%,0.0,9.0,1.0,1.0,1.0,35.5875,401.45,0.75,0.980813
50%,0.0,29.0,2.0,1.0,2.0,70.35,1397.475,2.416667,1.0
75%,0.0,55.0,2.0,2.0,3.0,89.8625,3794.7375,4.583333,1.020881
max,1.0,72.0,3.0,3.0,4.0,118.75,8684.8,6.0,1.450628


To make things a little clearer, let's reorganize the columns so the new columns created are closer to the columns they represent/interact with.

Document your takeaways. For each variable:

- should you remove the observations with a missing value for that variable?
- should you remove the variable altogether?
- is missing equivalent to 0 (or some other constant value) in the specific case of this variable?
- should you replace the missing values with a value it is most likely to represent (e.g. Are the missing values a - result of data integrity issues and should be replaced by the most likely value?)
- Handle the missing values in the way you recommended above.

Transform churn such that "yes" = 1 and "no" = 0

In [44]:
def encode_churn(df):
    encoder = LabelEncoder()
    encoder.fit(df.churn)
    return df.assign(churn_encoded = encoder.transform(df.churn))

In [47]:
df = encode_churn(df)

In [48]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,tenure_year,percent_var_tc_from_act_tc,churn_encoded
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,2,59.9,542.4,No,Month-to-month,DSL,Mailed check,0.75,0.993916,0
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,4,69.4,571.45,No,Month-to-month,DSL,Credit card (automatic),0.75,1.093009,0
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,1,48.2,340.35,No,Month-to-month,DSL,Electronic check,0.583333,0.991332,0
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,1,25.1,25.1,Yes,Month-to-month,DSL,Electronic check,0.083333,1.0,1
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,3,30.5,30.5,Yes,Month-to-month,DSL,Bank transfer (automatic),0.083333,1.0,1


Compute a new feature, tenure_year, that is a result of translating tenure from months to years.

In [49]:
def create_tenure_year(df):
    df[['tenure_year']] = df[['tenure']] / 12
    return df

In [50]:
create_tenure_year(df)

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,tenure_year,percent_var_tc_from_act_tc,churn_encoded
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,2,59.90,542.40,No,Month-to-month,DSL,Mailed check,0.750000,0.993916,0
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,4,69.40,571.45,No,Month-to-month,DSL,Credit card (automatic),0.750000,1.093009,0
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,1,48.20,340.35,No,Month-to-month,DSL,Electronic check,0.583333,0.991332,0
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,1,25.10,25.10,Yes,Month-to-month,DSL,Electronic check,0.083333,1.000000,1
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,3,30.50,30.50,Yes,Month-to-month,DSL,Bank transfer (automatic),0.083333,1.000000,1
5,0067-DKWBL,Male,1,No,No,2,Yes,No,1,Yes,...,1,49.25,91.10,Yes,Month-to-month,DSL,Electronic check,0.166667,1.081229,1
6,0076-LVEPS,Male,0,No,Yes,29,No,No phone service,1,Yes,...,2,45.00,1242.45,No,Month-to-month,DSL,Mailed check,2.416667,1.050344,0
7,0082-LDZUE,Male,0,No,No,1,Yes,No,1,No,...,2,44.30,44.30,No,Month-to-month,DSL,Mailed check,0.083333,1.000000,0
8,0096-BXERS,Female,0,Yes,No,6,Yes,Yes,1,No,...,1,50.35,314.55,No,Month-to-month,DSL,Electronic check,0.500000,0.960420,0
9,0096-FCPUF,Male,0,No,No,30,Yes,Yes,1,Yes,...,2,64.50,1888.45,No,Month-to-month,DSL,Mailed check,2.500000,1.024650,0


Figure out a way to capture the information contained in phone_service and multiple_lines into a single variable of dtype int. Write a function that will transform the data and place in a new column named phone_id.

Figure out a way to capture the information contained in dependents and partner into a single variable of dtype int. Transform the data and place in a new column household_type_id.

Figure out a way to capture the information contained in streaming_tv and streaming_movies into a single variable of dtype int. Transform the data and place in a new column streaming_services.

Figure out a way to capture the information contained in online_security and online_backup into a single variable of dtype int. Transform the data and place in a new column online_security_backup.

Split the data into train (70%) & test (30%) samples.

In [51]:
list(df)

['customer_id',
 'gender',
 'senior_citizen',
 'partner',
 'dependents',
 'tenure',
 'phone_service',
 'multiple_lines',
 'internet_service_type_id',
 'online_security',
 'online_backup',
 'device_protection',
 'tech_support',
 'streaming_tv',
 'streaming_movies',
 'contract_type_id',
 'paperless_billing',
 'payment_type_id',
 'monthly_charges',
 'total_charges',
 'churn',
 'contract_type',
 'internet_service_type',
 'payment_type',
 'tenure_year',
 'percent_var_tc_from_act_tc',
 'churn_encoded']

In [61]:
X = df.drop(['churn', 'churn_encoded'], axis=1)
y = df[['churn_encoded']]

In [62]:
train, test = train_test_split(df)

In [55]:
scaler = MinMaxScaler()

In [56]:
scaler.fit(X_train[['monthly_charges', 'total_charges']])

MinMaxScaler(copy=True, feature_range=(0, 1))

In [63]:
train[['monthly_charges', 'total_charges']] = scaler.transform(train[['monthly_charges', 'total_charges']])
test[['monthly_charges', 'total_charges']] = scaler.transform(test[['monthly_charges', 'total_charges']])
train.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,tenure_year,percent_var_tc_from_act_tc,churn_encoded
1403,3205-MXZRA,Male,0,No,No,26,Yes,No,1,No,...,4,0.40995,0.171974,No,One year,DSL,Credit card (automatic),2.166667,1.02568,0
593,4815-GBTCD,Female,0,Yes,No,4,No,No phone service,1,No,...,1,0.068657,0.009378,No,Month-to-month,DSL,Electronic check,0.333333,1.006503,0
4762,3865-YIOTT,Male,0,Yes,Yes,72,Yes,Yes,2,Yes,...,3,0.874129,0.904786,No,One year,Fiber optic,Bank transfer (automatic),6.0,0.973332,0
5837,5868-YTYKS,Male,0,No,Yes,1,Yes,No,3,No internet service,...,2,0.0199,0.000168,No,Month-to-month,,Mailed check,0.083333,1.0,0
1638,7163-OCEQI,Male,0,Yes,Yes,22,Yes,Yes,1,No,...,2,0.600995,0.190087,No,One year,DSL,Mailed check,1.833333,1.04,0


In [70]:
X_train = train.drop(['churn', 'churn_encoded'], axis=1)
X_test = test.drop(['churn', 'churn_encoded'], axis=1)
y_train = train[['churn_encoded']]
y_test = test[['churn_encoded']]

Variable Encoding: encode the values in each non-numeric feature such that they are numeric.

Numeric Scaling: scale the monthly_charges and total_charges data. Make sure that the parameters for scaling are learned from the training data set.

# Data Exploration
Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers)).

Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?

Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?

If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

Controlling for services (phone_id, internet_service_type_id, online_security_backup, device_protection, tech_support, and contract_type_id), is the mean monthly_charges of those who have churned significantly different from that of those who have not churned? (Use a t-test to answer this.)

How much of monthly_charges can be explained by internet_service_type? (hint: correlation test). State your hypotheses and your conclusion clearly.

How much of monthly_charges can be explained by internet_service_type + phone service type (0, 1, or multiple lines). State your hypotheses and your conclusion clearly.

Create visualizations exploring the interactions of variables (independent with independent and independent with dependent). The goal is to identify features that are related to churn, identify any data integrity issues, understand 'how the data works'. For example, we may find that all who have online services also have device protection. In that case, we don't need both of those. (The visualizations done in your analysis for questions 1-5 count towards the requirements below)

Each independent variable (except for customer_id) should be visualized in at least two plots, and at least 1 of those compares the independent variable with the dependent variable.

For each plot where x and y are independent variables, add a third dimension (where possible), of churn represented by color.

Use subplots when plotting the same type of chart but with different variables.

Adjust the axes as necessary to extract information from the visualizations (adjusting the x & y limits, setting the scale where needed, etc.)

Add annotations to at least 5 plots with a key takeaway from that plot.

Use plots from matplotlib, pandas and seaborn.

Use each of the following:

sns.heatmap
pd.crosstab (along with sns.heatmap)
pd.scatter_matrix
sns.barplot
sns.swarmplot
sns.pairplot
sns.jointplot
sns.relplot or plt.scatter
sns.distplot or plt.hist
sns.boxplot
plt.plot
Use at least one more type of plot that is not included in the list above.

What can you say about each variable's relationship to churn, based on your initial exploration? If there appears to be some sort of interaction or correlation, assume there is no causal relationship and brainstorm (and document) ideas on reasons there could be correlation.

Summarize your conclusions, provide clear answers to the specific questions, and summarize any takeaways/action plan from the work above.

# Modeling
Feature Selection: Are there any variables that seem to provide limited to no additional information? If so, remove them.

Train (fit, transform, evaluate) multiple different models, varying the model type and your meta-parameters.

In [60]:
list(df)

['customer_id',
 'gender',
 'senior_citizen',
 'partner',
 'dependents',
 'tenure',
 'phone_service',
 'multiple_lines',
 'internet_service_type_id',
 'online_security',
 'online_backup',
 'device_protection',
 'tech_support',
 'streaming_tv',
 'streaming_movies',
 'contract_type_id',
 'paperless_billing',
 'payment_type_id',
 'monthly_charges',
 'total_charges',
 'churn',
 'contract_type',
 'internet_service_type',
 'payment_type',
 'tenure_year',
 'percent_var_tc_from_act_tc',
 'churn_encoded']

# Models
## Random Forests

In [147]:
def analyze_rf_binomial(X_df_train, X_df_test, y_df_train, y_df_test, string_criterion, min_sample_leaf_input, max_depth_input):
    features = list(X_df_train)
    
    print('Results using ' + str(string_criterion) + ' as the measure of impurity and ' + str(max_depth_input) + ' as max depth level and ' +str(min_sample_leaf_input) + ' as the min_sample_leaf.')
    print('The features being used: ' + str(features))
    print('-----')

    rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion=string_criterion,
                            min_samples_leaf=min_sample_leaf_input,
                            n_estimators=100,
                            max_depth=max_depth_input, 
                            random_state=123)

    rf.fit(X_df_train, y_df_train)

    y_df_pred = rf.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = rf.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')

    print('Accuracy of rf classifier on training set: {:.8f}'.format(rf.score(X_df_train, y_df_train)))
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')
    
    y_df_pred_test = rf.predict(X_df_test)
    y_df_pred_proba_test = rf.predict_proba(X_df_test)
    print('Accuracy of RF classifier on train set: {:.6f}'
     .format(rf.score(X_df_train, y_df_train)))
    print('-----')
    

In [148]:
X_rf_1_train = X_train[['tenure_year', 'monthly_charges', 'internet_service_type_id']]
X_rf_1_test = X_test[['tenure_year', 'monthly_charges', 'internet_service_type_id']]

In [149]:
analyze_rf_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 5, 6)

Results using entropy as the measure of impurity and 6 as max depth level and 5 as the min_sample_leaf.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[0.88085044 0.11914956]
 [0.59824183 0.40175817]
 [0.88918637 0.11081363]
 [0.64532968 0.35467032]
 [0.81974776 0.18025224]]
-----
Accuracy of rf classifier on training set: 0.80034130
-----
          Pred -  Pred +
Actual -    3533     336
Actual +     717     688
-----
              precision    recall  f1-score   support

           0     0.8313    0.9132    0.8703      3869
           1     0.6719    0.4897    0.5665      1405

   micro avg     0.8003    0.8003    0.8003      5274
   macro avg     0.7516    0.7014    0.7184      5274
weighted avg     0.7888    0.8003    0.7894      5274

-----
Accuracy of RF classifier on train set: 0.800341
-----


## Best performing model on train, also did well on test.

In [150]:
analyze_rf_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 4, 8)

Results using entropy as the measure of impurity and 8 as max depth level and 4 as the min_sample_leaf.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[0.91027759 0.08972241]
 [0.61640215 0.38359785]
 [0.94593335 0.05406665]
 [0.63872806 0.36127194]
 [0.81577279 0.18422721]]
-----
Accuracy of rf classifier on training set: 0.81949185
-----
          Pred -  Pred +
Actual -    3595     274
Actual +     678     727
-----
              precision    recall  f1-score   support

           0     0.8413    0.9292    0.8831      3869
           1     0.7263    0.5174    0.6043      1405

   micro avg     0.8195    0.8195    0.8195      5274
   macro avg     0.7838    0.7233    0.7437      5274
weighted avg     0.8107    0.8195    0.8088      5274

-----
Accuracy of RF classifier on train set: 0.819492
-----


## Test on RF model that performed best on train set.

In [161]:
def test_rf_binomial(X_df_train, X_df_test, y_df_train, y_df_test, string_criterion, min_sample_leaf_input, max_depth_input):
    features = list(X_df_train)
    
    print('Results using ' + str(string_criterion) + ' as the measure of impurity and ' + str(max_depth_input) + ' as max depth level and ' +str(min_sample_leaf_input) + ' as the min_sample_leaf.')
    print('The features being used: ' + str(features))
    print('-----')

    rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion=string_criterion,
                            min_samples_leaf=min_sample_leaf_input,
                            n_estimators=100,
                            max_depth=max_depth_input, 
                            random_state=123)

    rf.fit(X_df_train, y_df_train)

    y_df_pred = rf.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = rf.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')

    print('Accuracy of rf classifier on training set: {:.8f}'.format(rf.score(X_df_train, y_df_train)))
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')
    
    y_df_pred_test = rf.predict(X_df_test)
    y_df_pred_proba_test = rf.predict_proba(X_df_test)

    print('-----')
    
    print('The results of running the model on the test sample:')
    
    cm = pd.DataFrame(confusion_matrix(y_df_test, y_df_pred_test),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    print(cm)
    print('-----')
    
    print(classification_report(y_df_test, y_df_pred_test, digits=4))
    print('-----')
    
    y_df_pred_test = rf.predict(X_df_test)
    y_df_pred_proba_test = rf.predict_proba(X_df_test)
    print('-----')
    print('Head of probabilities on X_test:')
    print(y_df_pred_proba_test[:5])
    print('Accuracy of Logistic Regression classifier on test set: {:.6f}'
     .format(rf.score(X_df_test, y_df_test)))
    print('-----')
    Probabilities_on_X_test = y_df_pred_proba_test
    return Probabilities_on_X_test

In [162]:
test_rf_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 4, 8)

Results using entropy as the measure of impurity and 8 as max depth level and 4 as the min_sample_leaf.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[0.91027759 0.08972241]
 [0.61640215 0.38359785]
 [0.94593335 0.05406665]
 [0.63872806 0.36127194]
 [0.81577279 0.18422721]]
-----
Accuracy of rf classifier on training set: 0.81949185
-----
          Pred -  Pred +
Actual -    3595     274
Actual +     678     727
-----
              precision    recall  f1-score   support

           0     0.8413    0.9292    0.8831      3869
           1     0.7263    0.5174    0.6043      1405

   micro avg     0.8195    0.8195    0.8195      5274
   macro avg     0.7838    0.7233    0.7437      5274
weighted avg     0.8107    0.8195    0.8088      5274

-----
-----
The results of running the model on the test sample:
          Pred -  Pred +
Actual -    1183     111
Actua

array([[0.62608383, 0.37391617],
       [0.7451202 , 0.2548798 ],
       [0.99341178, 0.00658822],
       ...,
       [0.56856898, 0.43143102],
       [0.72751592, 0.27248408],
       [0.9020052 , 0.0979948 ]])

## Classification/Decision Tree

In [198]:
def analyze_classification_model(X_df_train, X_df_test, y_df_train, y_df_test, string_criterion, max_depth_input):   
    
    print('Results using ' + str(string_criterion) + ' as the measure of impurity and ' + str(max_depth_input) + ' as the depth.')
    print('-----')

    clf = DecisionTreeClassifier(criterion=string_criterion, max_depth=max_depth_input, random_state=123)

    clf.fit(X_df_train, y_df_train)

    y_df_pred = clf.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = clf.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')

    print('Accuracy of Decision Tree classifier on training set: {:.8f}'.format(clf.score(X_df_train, y_df_train)))
    print('-----')

In [199]:
analyze_classification_model(X_rf_1_train, X_rf_1_test, y_train, y_test, 'gini', 5)

Results using gini as the measure of impurity and 5 as the depth.
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[0.85441527 0.14558473]
 [0.73428571 0.26571429]
 [0.96       0.04      ]
 [0.66412214 0.33587786]
 [0.5        0.5       ]]
-----
          Pred -  Pred +
Actual -    3483     386
Actual +     695     710
-----
              precision    recall  f1-score   support

           0     0.8337    0.9002    0.8657      3869
           1     0.6478    0.5053    0.5678      1405

   micro avg     0.7950    0.7950    0.7950      5274
   macro avg     0.7407    0.7028    0.7167      5274
weighted avg     0.7841    0.7950    0.7863      5274

-----
Accuracy of Decision Tree classifier on training set: 0.79503223
-----


In [200]:
analyze_classification_model(X_rf_1_train, X_rf_1_test, y_train, y_test, 'gini', 7)

Results using gini as the measure of impurity and 7 as the depth.
-----
Head of predicted on X_train:
[0 0 0 1 1]
-----
Head of probabilities on X_train:
[[0.93846154 0.06153846]
 [0.52727273 0.47272727]
 [0.93617021 0.06382979]
 [0.35       0.65      ]
 [0.25       0.75      ]]
-----
          Pred -  Pred +
Actual -    3531     338
Actual +     671     734
-----
              precision    recall  f1-score   support

           0     0.8403    0.9126    0.8750      3869
           1     0.6847    0.5224    0.5927      1405

   micro avg     0.8087    0.8087    0.8087      5274
   macro avg     0.7625    0.7175    0.7338      5274
weighted avg     0.7989    0.8087    0.7998      5274

-----
Accuracy of Decision Tree classifier on training set: 0.80868411
-----


In [201]:
analyze_classification_model(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 5)

Results using entropy as the measure of impurity and 5 as the depth.
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[0.88647343 0.11352657]
 [0.60493827 0.39506173]
 [0.96       0.04      ]
 [0.66412214 0.33587786]
 [0.83673469 0.16326531]]
-----
          Pred -  Pred +
Actual -    3527     342
Actual +     745     660
-----
              precision    recall  f1-score   support

           0     0.8256    0.9116    0.8665      3869
           1     0.6587    0.4698    0.5484      1405

   micro avg     0.7939    0.7939    0.7939      5274
   macro avg     0.7421    0.6907    0.7074      5274
weighted avg     0.7811    0.7939    0.7817      5274

-----
Accuracy of Decision Tree classifier on training set: 0.79389458
-----


In [202]:
analyze_classification_model(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 7)

Results using entropy as the measure of impurity and 7 as the depth.
-----
Head of predicted on X_train:
[0 0 0 0 1]
-----
Head of probabilities on X_train:
[[0.86970684 0.13029316]
 [0.72       0.28      ]
 [0.94623656 0.05376344]
 [0.67521368 0.32478632]
 [0.4        0.6       ]]
-----
          Pred -  Pred +
Actual -    3519     350
Actual +     666     739
-----
              precision    recall  f1-score   support

           0     0.8409    0.9095    0.8739      3869
           1     0.6786    0.5260    0.5926      1405

   micro avg     0.8074    0.8074    0.8074      5274
   macro avg     0.7597    0.7178    0.7332      5274
weighted avg     0.7976    0.8074    0.7989      5274

-----
Accuracy of Decision Tree classifier on training set: 0.80735684
-----


## KNN

In [203]:
def analyze_knn_binomial(X_df_train, X_df_test, y_df_train, y_df_test, n_neighbor, weight):
    features = list(X_df_train)
    
    print('Results using ' + str(weight) + ' as the measure of impurity and ' + str(n_neighbor) + ' as the number of neighbors.')
    print('The features being used: ' + str(features))
    print('-----')

    knn = KNeighborsClassifier(n_neighbors=n_neighbor, weights=weight)

    knn.fit(X_df_train, y_df_train)

    y_df_pred = knn.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = knn.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')

    print('Accuracy of KNN classifier on training set: {:.8f}'.format(knn.score(X_df_train, y_df_train)))
    print('-----')

In [204]:
analyze_knn_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 5, 'uniform')

Results using uniform as the measure of impurity and 5 as the number of neighbors.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 1 0 0 0]
-----
Head of probabilities on X_train:
[[0.8 0.2]
 [0.4 0.6]
 [1.  0. ]
 [0.8 0.2]
 [0.6 0.4]]
-----
          Pred -  Pred +
Actual -    3556     313
Actual +     563     842
-----
              precision    recall  f1-score   support

           0     0.8633    0.9191    0.8903      3869
           1     0.7290    0.5993    0.6578      1405

   micro avg     0.8339    0.8339    0.8339      5274
   macro avg     0.7962    0.7592    0.7741      5274
weighted avg     0.8275    0.8339    0.8284      5274

-----
Accuracy of KNN classifier on training set: 0.83390216
-----


In [205]:
analyze_knn_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 6, 'uniform')

Results using uniform as the measure of impurity and 6 as the number of neighbors.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[0.83333333 0.16666667]
 [0.5        0.5       ]
 [1.         0.        ]
 [0.66666667 0.33333333]
 [0.66666667 0.33333333]]
-----
          Pred -  Pred +
Actual -    3679     190
Actual +     739     666
-----
              precision    recall  f1-score   support

           0     0.8327    0.9509    0.8879      3869
           1     0.7780    0.4740    0.5891      1405

   micro avg     0.8239    0.8239    0.8239      5274
   macro avg     0.8054    0.7125    0.7385      5274
weighted avg     0.8182    0.8239    0.8083      5274

-----
Accuracy of KNN classifier on training set: 0.82385286
-----


In [206]:
analyze_knn_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 5, 'distance')

Results using distance as the measure of impurity and 5 as the number of neighbors.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[1.  0. ]
 [1.  0. ]
 [1.  0. ]
 [0.8 0.2]
 [1.  0. ]]
-----
          Pred -  Pred +
Actual -    3858      11
Actual +      90    1315
-----
              precision    recall  f1-score   support

           0     0.9772    0.9972    0.9871      3869
           1     0.9917    0.9359    0.9630      1405

   micro avg     0.9808    0.9808    0.9808      5274
   macro avg     0.9845    0.9665    0.9750      5274
weighted avg     0.9811    0.9808    0.9807      5274

-----
Accuracy of KNN classifier on training set: 0.98084945
-----


In [207]:
analyze_knn_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 6, 'distance')

Results using distance as the measure of impurity and 6 as the number of neighbors.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[1.         0.        ]
 [1.         0.        ]
 [1.         0.        ]
 [0.66666667 0.33333333]
 [1.         0.        ]]
-----
          Pred -  Pred +
Actual -    3858      11
Actual +      90    1315
-----
              precision    recall  f1-score   support

           0     0.9772    0.9972    0.9871      3869
           1     0.9917    0.9359    0.9630      1405

   micro avg     0.9808    0.9808    0.9808      5274
   macro avg     0.9845    0.9665    0.9750      5274
weighted avg     0.9811    0.9808    0.9807      5274

-----
Accuracy of KNN classifier on training set: 0.98084945
-----


## Test KNN that performed best on train set.  These are high numbers, let's hope for the best...

In [186]:
def test_knn_binomial(X_df_train, X_df_test, y_df_train, y_df_test, n_neighbor, weight):
    features = list(X_df_train)
    
    print('Results using ' + str(weight) + ' as the measure of impurity and ' + str(n_neighbor) + ' as the number of neighbors.')
    print('The features being used: ' + str(features))
    print('-----')

    knn = KNeighborsClassifier(n_neighbors=n_neighbor, weights=weight)

    knn.fit(X_df_train, y_df_train)

    y_df_pred = knn.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = knn.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')

    print('Accuracy of KNN classifier on training set: {:.8f}'.format(knn.score(X_df_train, y_df_train)))
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')
    
    y_df_pred_test = knn.predict(X_df_test)
    y_df_pred_proba_test = knn.predict_proba(X_df_test)

    print('-----')
    
    print('The results of running the model on the test sample:')
    
    cm = pd.DataFrame(confusion_matrix(y_df_test, y_df_pred_test),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    print(cm)
    print('-----')
    
    print(classification_report(y_df_test, y_df_pred_test, digits=4))
    print('-----')
    
    y_df_pred_test = knn.predict(X_df_test)
    y_df_pred_proba_test = knn.predict_proba(X_df_test)
    print('-----')
    print('Head of probabilities on X_test:')
    print(y_df_pred_proba_test[:5])
    print('Accuracy of Logistic KNN classifier on test set: {:.6f}'
     .format(knn.score(X_df_test, y_df_test)))
    print('-----')
    Probabilities_on_X_test = y_df_pred_proba_test
    return Probabilities_on_X_test

### We had some great results on the train set, but running it on the test shows that this model overfit.

In [187]:
test_knn_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 5, 'distance')

Results using distance as the measure of impurity and 5 as the number of neighbors.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
Head of predicted on X_train:
[0 0 0 0 0]
-----
Head of probabilities on X_train:
[[1.  0. ]
 [1.  0. ]
 [1.  0. ]
 [0.8 0.2]
 [1.  0. ]]
-----
Accuracy of KNN classifier on training set: 0.98084945
-----
          Pred -  Pred +
Actual -    3858      11
Actual +      90    1315
-----
              precision    recall  f1-score   support

           0     0.9772    0.9972    0.9871      3869
           1     0.9917    0.9359    0.9630      1405

   micro avg     0.9808    0.9808    0.9808      5274
   macro avg     0.9845    0.9665    0.9750      5274
weighted avg     0.9811    0.9808    0.9807      5274

-----
-----
The results of running the model on the test sample:
          Pred -  Pred +
Actual -    1090     204
Actual +     252     212
-----
              precision    recall  f1-score   support

        

array([[0.85362244, 0.14637756],
       [0.91436181, 0.08563819],
       [1.        , 0.        ],
       ...,
       [0.85136791, 0.14863209],
       [1.        , 0.        ],
       [1.        , 0.        ]])

## Logistic Regression

In [193]:
def analyze_log_reg(X_df_train, X_df_test, y_df_train, y_df_test, solver_name):
    features = list(X_df_train)
    
    print('Results using ' + str(solver_name) + ' as the solver.')
    print('The features being used: ' + str(features))
    print('-----')

    logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver=solver_name)
    logit.fit(X_df_train, y_df_train)
    print('-----')
    
    print('Coefficient: \n', logit.coef_)
    print('Intercept: \n', logit.intercept_)
    print('-----')

    y_df_pred = logit.predict(X_df_train)
    y_df_pred_proba = logit.predict_proba(X_df_train)
    print('Accuracy of Logistic Regression classifier on training set: {:.6f}'
         .format(logit.score(X_df_train, y_df_train)))
    print('-----')

    print('The results of running the model on the train sample:')

    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
                 columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    print(cm)
    print('-----')

    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')
    
        

In [194]:
analyze_log_reg(X_rf_1_train, X_rf_1_test, y_train, y_test, 'saga')

Results using saga as the solver.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
-----
Coefficient: 
 [[-0.66377943  3.32675848 -0.00701596]]
Intercept: 
 [-0.49087548]
-----
Accuracy of Logistic Regression classifier on training set: 0.752370
-----
The results of running the model on the train sample:
          Pred -  Pred +
Actual -    3033     836
Actual +     470     935
-----
              precision    recall  f1-score   support

           0     0.8658    0.7839    0.8228      3869
           1     0.5280    0.6655    0.5888      1405

   micro avg     0.7524    0.7524    0.7524      5274
   macro avg     0.6969    0.7247    0.7058      5274
weighted avg     0.7758    0.7524    0.7605      5274

-----


In [195]:
analyze_log_reg(X_rf_1_train, X_rf_1_test, y_train, y_test, 'liblinear')

Results using liblinear as the solver.
The features being used: ['tenure_year', 'monthly_charges', 'internet_service_type_id']
-----
-----
Coefficient: 
 [[-0.66377843  3.32362738 -0.00888818]]
Intercept: 
 [-0.48543306]
-----
Accuracy of Logistic Regression classifier on training set: 0.752370
-----
The results of running the model on the train sample:
          Pred -  Pred +
Actual -    3033     836
Actual +     470     935
-----
              precision    recall  f1-score   support

           0     0.8658    0.7839    0.8228      3869
           1     0.5280    0.6655    0.5888      1405

   micro avg     0.7524    0.7524    0.7524      5274
   macro avg     0.6969    0.7247    0.7058      5274
weighted avg     0.7758    0.7524    0.7605      5274

-----


Compare evaluation metrics across all the models, and select the best performing model.

Test the final model (transform, evaluate) on your out-of-sample data (the testing data set). Summarize the performance. Interpret your results.