Classification Project
Why are our customers churning?

Some questions I have include:

Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers))
Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

### Deliverables:

I will also need a report (ipynb) answering the question, "Why are our customers churning?" I want to see the analysis you did to answer my questions and lead to your findings. Please clearly call out the questions and answers you are analyzing. E.g. If you find that month-to-month customers churn more, I won't be surprised, but I am not getting rid of that plan. The fact that they churn is not because they can, it's because they can and they are motivated to do so. I want some insight into why they are motivated to do so. I realize you will not be able to do a full causal experiment, but I hope to see some solid evidence of your conclusions.

I will need you to deliver to me a csv with the customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn). I would also like a single Google slide that illustrates how your model works, including the features being used, so that I can deliver this to the SLT when they come with questions about how these values were derived. Please make sure you include how likely your model is to give a high probability of churn when churn doesn't occur, to give a low probability of churn when churn occurs, and to accurately predict churn.

Finally, our development team will need a .py file that will take in a new dataset (in exactly the same form as the one you acquired from telco_churn.customers) and perform all the transformations necessary to run the model you have developed on this new dataset to provide probabilities and predictions.

Specification
Detailed instructions for each section are below.

In general, make sure you document your work. You don't need to explain what every line of code is doing, but you should explain what you are doing and why. For example, if you drop a feature from the dataset, you should explain why you decided to do so, or why that is a reasonable thing to do. If you transform the data in a column, you should explain why you are making that transformation.

In addition, you should not present numbers in isolation. If your code outputs a number, be sure you give some context to the number.

### Specific Deliverables:

- a jupyter notebook where your work takes place
- a csv file that predicts churn for each customer
- a python script that prepares data such that it can be fed into your model
- a google slide summarizing your model
- a README.md file that contains a link to your google slides presentation, and instructions for how to use your python script(s)

# Acquisition
Get the data from the customers table from the telco_churn database on the codeup data science database server.

You may wish to join some tables as part of your query.
This data should end up in a pandas data frame.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree

import graphviz
from graphviz import Graph

import env

In [2]:
def get_connection(db, user=env.user, host=env.host, password=env.password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

def get_telco_data():
    return pd.read_sql('SELECT c.*, ct.contract_type, ist.internet_service_type, pt.payment_type\
    FROM customers as c\
    JOIN contract_types as ct USING (contract_type_id)\
    JOIN internet_service_types as ist USING (internet_service_type_id)\
    JOIN payment_types as pt USING (payment_type_id);', get_connection('telco_churn'))

Write a function, peekatdata(dataframe), that takes a dataframe as input and computes and returns the following:

- creates dataframe object head_df (df of the first 5 rows) and prints contents to screen
- creates dataframe object tail_df (df of the last 5 rows) and prints contents to screen
- creates tuple object shape_tuple (tuple of (nrows, ncols)) and prints tuple to screen
- creates dataframe object describe_df (summary statistics of all numeric variables) and prints contents to screen.
- prints to screen the information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [3]:
def peekatdata(df):
    Head_df = df.head(5)
    print('Header:  \n')
    print(Head_df)
    print('Tail:  \n')
    Tail_df = df.tail(5)
    print(Tail_df)
    print('Shape:  \n')
    Shape_tuple = df.shape
    print(Shape_tuple)
    print('Describe:  \n')
    Describe_df = df.describe()
    print(Describe_df)
    print('Info:  \n')
    df.info()  # info() prints its own report and returns None, so no print() is needed
    return 

In [4]:
df = get_telco_data()
peekatdata(df)

Header:  

  customer_id  gender  senior_citizen partner dependents  tenure  \
0  0003-MKNFE    Male               0      No         No       9   
1  0013-MHZWF  Female               0      No        Yes       9   
2  0015-UOCOJ  Female               1      No         No       7   
3  0023-HGHWL    Male               1      No         No       1   
4  0032-PGELS  Female               0     Yes        Yes       1   

  phone_service    multiple_lines  internet_service_type_id online_security  \
0           Yes               Yes                         1              No   
1           Yes                No                         1              No   
2           Yes                No                         1             Yes   
3            No  No phone service                         1              No   
4            No  No phone service                         1             Yes   

             ...             streaming_movies contract_type_id  \
0            ...                       

# Data Prep

Write a function, df_value_counts(dataframe), that takes a dataframe as input and computes and returns the values by frequency for each column. The function should decide whether or not to bin the data for the value counts.

In [5]:
# Need to update with decision to bin.  Function below will give us the features with more than 10 different options, which will be the features we bin.
def df_value_counts(dataframe):
    df_cols = dataframe.columns
    for col in df_cols:
        print('-----%s-----' %col)
        print(dataframe[col].value_counts())  # use the argument, not the global df
        
df_value_counts(df)

-----customer_id-----
9451-LPGOO    1
8739-WWKDU    1
8805-JNRAZ    1
1832-PEUTS    1
4003-OCTMP    1
4378-MYPGO    1
4154-AQUGT    1
1414-YADCW    1
7120-RFMVS    1
3146-MSEGF    1
1206-EHBDD    1
2595-KIWPV    1
2898-MRKPI    1
5384-ZTTWP    1
8207-DMRVL    1
3571-RFHAR    1
5445-UTODQ    1
8878-RYUKI    1
4751-ERMAN    1
4704-ERYFC    1
9617-UDPEU    1
0655-RBDUG    1
4592-IWTJI    1
3223-WZWJM    1
0422-OHQHQ    1
8775-LHDJH    1
9348-ROUAI    1
2739-CACDQ    1
7797-EJMDP    1
2293-IJWPS    1
             ..
4415-IJZTP    1
3446-QDSZF    1
6728-WYQBC    1
0322-CHQRU    1
2754-VDLTR    1
3244-CQPHU    1
7379-POKDZ    1
3445-HXXGF    1
0380-NEAVX    1
6242-MBHPK    1
5095-ETBRJ    1
5956-YHHRX    1
1047-RNXZV    1
5696-JVVQY    1
2181-TIDSV    1
3703-KBKZP    1
3798-EPWRR    1
3199-XGZCY    1
0725-CXOTM    1
9824-BEMCV    1
4328-VUFWD    1
6898-RBTLU    1
3452-SRFEG    1
9560-ARGQJ    1
5394-SVGJV    1
1625-JAIIY    1
9281-CEDRU    1
7503-QQRVF    1
5973-EJGDP    1
4853-OITSN    1
Na

In [6]:
def df_value_counts_bin(dataframe):
    df_cols = dataframe.columns
    features_to_bin = []
    for col in df_cols:
#         print('-----%s-----' %col)
#         print(df[col].value_counts())
        if dataframe[col].nunique() > 10:  # more than 10 distinct values: candidate for binning
            features_to_bin.append(col)
        print(features_to_bin[-1:])
        
df_value_counts_bin(df)

['customer_id']
['customer_id']
['customer_id']
['customer_id']
['customer_id']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['tenure']
['monthly_charges']
['total_charges']
['total_charges']
['total_charges']
['total_charges']
['total_charges']
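
The comment in cell 5 notes that the decision to bin still needs to be built into the value-counts function. A minimal sketch of one way to do it, assuming that numeric columns with more than 10 distinct values are binned and every other column is counted as-is. The function name `value_counts_with_bins`, the threshold, and the number of bins are illustrative choices, and the demo frame is made up.

```python
import pandas as pd


def value_counts_with_bins(dataframe, max_unique=10, bins=5):
    """Return {column: value counts}, binning high-cardinality numeric columns."""
    results = {}
    for col in dataframe.columns:
        series = dataframe[col]
        is_numeric = pd.api.types.is_numeric_dtype(series)
        if is_numeric and series.nunique() > max_unique:
            # Too many distinct numbers to read one by one: count per interval.
            counts = series.value_counts(bins=bins, sort=False)
        else:
            counts = series.value_counts()
        results[col] = counts
    return results


# Tiny made-up frame, only to exercise both branches.
demo = pd.DataFrame({
    'tenure': list(range(1, 21)),
    'contract_type': ['Month-to-month', 'One year'] * 10,
})
counts = value_counts_with_bins(demo)
print(counts['tenure'])
print(counts['contract_type'])
```

A string column such as customer_id is still counted value by value under this rule, so it may be worth skipping identifier columns before calling the function.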


### Handle Missing Values

Explore the data and see if there are any missing values.

Write a function that accepts a dataframe and returns the names of the columns that have missing values, and the percent of missing values in each column that has missing values.

In [7]:
def missing_values(dataframe):
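    df_cols = dataframe.columns
    col_name = []
    null_values = []
    null_percents = []
    for col in df_cols:
        value = dataframe[col].isnull().sum()
        # Divide by the total row count; count() would leave the nulls out of the denominator.
        null_percent = value / len(dataframe)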
        col_name.append(col)
        null_values.append(value)
        null_percents.append(null_percent)

    null_tuples = list(zip(col_name, null_values, null_percents))
    null_df = pd.DataFrame(null_tuples, columns = ['Feature', 'Null_Count', 'Null_Percent'])

    print(type(null_df))
    print(null_df)  
    
missing_values(df)

<class 'pandas.core.frame.DataFrame'>
                     Feature  Null_Count  Null_Percent
0                customer_id           0           0.0
1                     gender           0           0.0
2             senior_citizen           0           0.0
3                    partner           0           0.0
4                 dependents           0           0.0
5                     tenure           0           0.0
6              phone_service           0           0.0
7             multiple_lines           0           0.0
8   internet_service_type_id           0           0.0
9            online_security           0           0.0
10             online_backup           0           0.0
11         device_protection           0           0.0
12              tech_support           0           0.0
13              streaming_tv           0           0.0
14          streaming_movies           0           0.0
15          contract_type_id           0           0.0
16         paperless_billin

The function below sorts each column and outputs the head and tail for that column. This lets us see whether anything looks fishy about the data in any column.

In [9]:
def sort_col_val(dataframe):
    df_cols = dataframe.columns
    for col in df_cols:
        print('Sorted by ' + str(col) + ':')
        print('Head:')
        print(dataframe[[col]].sort_values(by=[col]).head().T)
        print(' ')
        print('Tail: ')
        print(dataframe[[col]].sort_values(by=[col]).tail().T)
        print('-----')

In [10]:
sort_col_val(df)

Sorted by customer_id:
Head:
                   1223        0           2421        2422        2423
customer_id  0002-ORFBO  0003-MKNFE  0004-TLHLJ  0011-IGKFF  0013-EXCHZ
 
Tail: 
                   1792        4548        1222        2419        2420
customer_id  9987-LUTYD  9992-RRAMN  9992-UJOEL  9993-LHIEB  9995-HOTOH
-----
Sorted by gender:
Head:
          3521    5577    2895    2897    2899
gender  Female  Female  Female  Female  Female
 
Tail: 
        3095  3094  3092  3122  7042
gender  Male  Male  Male  Male  Male
-----
Sorted by senior_citizen:
Head:
                0     4703  4702  4701  4700
senior_citizen     0     0     0     0     0
 
Tail: 
                4542  4540  4537  2259  5117
senior_citizen     1     1     1     1     1
-----
Sorted by partner:
Head:
        0    3776 3775 3773 3772
partner   No   No   No   No   No
 
Tail: 
        4104 1385 4107 4087 7042
partner  Yes  Yes  Yes  Yes  Yes
-----
Sorted by dependents:
Head:
           0    4065 4063 4062 406

Looking at the above output, something seems odd about the lower end of total_charges.  Below, I am checking for values that are whitespace, or ' '.  We have 11 rows without an actual amount in total_charges.

In [11]:
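# The blank entries are the string ' '. Fill them with monthly_charges * tenure,
# which works out to 0.0 for these rows.
df['total_charges'] = df['total_charges'].mask(df['total_charges'] == ' ', df['monthly_charges'] * df['tenure'])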

In [12]:
df.loc[(df['total_charges'] == ' ')].T

customer_id
gender
senior_citizen
partner
dependents
tenure
phone_service
multiple_lines
internet_service_type_id
online_security
online_backup


We are now showing no empty cells. Let's run the value counts again to see if anything strange remains in our list. The formerly blank entries now show as 0.0. Let's convert the total_charges column to a float and then drop these rows.

In [13]:
df_value_counts(df)

-----customer_id-----
9451-LPGOO    1
8739-WWKDU    1
8805-JNRAZ    1
1832-PEUTS    1
4003-OCTMP    1
4378-MYPGO    1
4154-AQUGT    1
1414-YADCW    1
7120-RFMVS    1
3146-MSEGF    1
1206-EHBDD    1
2595-KIWPV    1
2898-MRKPI    1
5384-ZTTWP    1
8207-DMRVL    1
3571-RFHAR    1
5445-UTODQ    1
8878-RYUKI    1
4751-ERMAN    1
4704-ERYFC    1
9617-UDPEU    1
0655-RBDUG    1
4592-IWTJI    1
3223-WZWJM    1
0422-OHQHQ    1
8775-LHDJH    1
9348-ROUAI    1
2739-CACDQ    1
7797-EJMDP    1
2293-IJWPS    1
             ..
4415-IJZTP    1
3446-QDSZF    1
6728-WYQBC    1
0322-CHQRU    1
2754-VDLTR    1
3244-CQPHU    1
7379-POKDZ    1
3445-HXXGF    1
0380-NEAVX    1
6242-MBHPK    1
5095-ETBRJ    1
5956-YHHRX    1
1047-RNXZV    1
5696-JVVQY    1
2181-TIDSV    1
3703-KBKZP    1
3798-EPWRR    1
3199-XGZCY    1
0725-CXOTM    1
9824-BEMCV    1
4328-VUFWD    1
6898-RBTLU    1
3452-SRFEG    1
9560-ARGQJ    1
5394-SVGJV    1
1625-JAIIY    1
9281-CEDRU    1
7503-QQRVF    1
5973-EJGDP    1
4853-OITSN    1
Na

In [14]:
# convert_objects() has been removed from pandas; pd.to_numeric is its replacement.
df['total_charges'] = pd.to_numeric(df['total_charges'], errors='coerce')
df.dtypes

customer_id                  object
gender                       object
senior_citizen                int64
partner                      object
dependents                   object
tenure                        int64
phone_service                object
multiple_lines               object
internet_service_type_id      int64
online_security              object
online_backup                object
device_protection            object
tech_support                 object
streaming_tv                 object
streaming_movies             object
contract_type_id              int64
paperless_billing            object
payment_type_id               int64
monthly_charges             float64
total_charges               float64
churn                        object
contract_type                object
internet_service_type        object
payment_type                 object
dtype: object

Success!  Total_charges is now a float.  Let's drop the rows with 0.0.

In [15]:
df = df.drop(df[df.total_charges == 0].index)

In [16]:
df.sort_values(by=['total_charges']).head().T

Unnamed: 0,6145,6010,5989,6039,5589
customer_id,2967-MXRAV,9318-NKNFC,8992-CEUEN,9975-SKRNR,1423-BMPBQ
gender,Male,Male,Female,Male,Female
senior_citizen,0,0,0,0,0
partner,Yes,No,No,No,Yes
dependents,Yes,No,No,No,Yes
tenure,1,1,1,1,1
phone_service,Yes,Yes,Yes,Yes,Yes
multiple_lines,No,No,No,No,No
internet_service_type_id,3,3,3,3,3
online_security,No internet service,No internet service,No internet service,No internet service,No internet service


Let's create a feature that calculates estimated total_charges based on tenure * monthly_charges and gives us a percentage vs. the actual total_charges.  This will be used to check for data integrity issues.

In [17]:
df['percent_var_tc_from_act_tc'] = (df['monthly_charges'] * df['tenure']) / df['total_charges']

In [18]:
df.describe()

Unnamed: 0,senior_citizen,tenure,internet_service_type_id,contract_type_id,payment_type_id,monthly_charges,total_charges,percent_var_tc_from_act_tc
count,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0
mean,0.1624,32.421786,1.872582,1.688567,2.315557,64.798208,2283.300441,1.002311
std,0.368844,24.54526,0.737271,0.832934,1.149523,30.085974,2266.771362,0.051288
min,0.0,1.0,1.0,1.0,1.0,18.25,18.8,0.635545
25%,0.0,9.0,1.0,1.0,1.0,35.5875,401.45,0.980813
50%,0.0,29.0,2.0,1.0,2.0,70.35,1397.475,1.0
75%,0.0,55.0,2.0,2.0,3.0,89.8625,3794.7375,1.020881
max,1.0,72.0,3.0,3.0,4.0,118.75,8684.8,1.450628


To make things a little clearer, let's reorganize the columns so the new columns created are closer to the columns they represent/interact with.

Document your takeaways. For each variable:

- should you remove the observations with a missing value for that variable?
- should you remove the variable altogether?
- is missing equivalent to 0 (or some other constant value) in the specific case of this variable?
- should you replace the missing values with the value they most likely represent (e.g. are the missing values a result of data integrity issues that should be replaced by the most likely value)?
- Handle the missing values in the way you recommended above.

Transform churn such that "yes" = 1 and "no" = 0

In [19]:
def encode_churn(df):
    encoder = LabelEncoder()
    encoder.fit(df.churn)
    return df.assign(churn_encoded = encoder.transform(df.churn))

In [20]:
df = encode_churn(df)

In [21]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,paperless_billing,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,percent_var_tc_from_act_tc,churn_encoded
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,No,2,59.9,542.4,No,Month-to-month,DSL,Mailed check,0.993916,0
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,Yes,4,69.4,571.45,No,Month-to-month,DSL,Credit card (automatic),1.093009,0
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,Yes,1,48.2,340.35,No,Month-to-month,DSL,Electronic check,0.991332,0
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,Yes,1,25.1,25.1,Yes,Month-to-month,DSL,Electronic check,1.0,1
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,No,3,30.5,30.5,Yes,Month-to-month,DSL,Bank transfer (automatic),1.0,1


Compute a new feature, tenure_year, that is a result of translating tenure from months to years.

In [22]:
def create_tenure_year(df):
    df['tenure_year'] = df['tenure'] / 12  # tenure is in months
    return df

In [23]:
create_tenure_year(df)

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,percent_var_tc_from_act_tc,churn_encoded,tenure_year
0,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,2,59.90,542.40,No,Month-to-month,DSL,Mailed check,0.993916,0,0.750000
1,0013-MHZWF,Female,0,No,Yes,9,Yes,No,1,No,...,4,69.40,571.45,No,Month-to-month,DSL,Credit card (automatic),1.093009,0,0.750000
2,0015-UOCOJ,Female,1,No,No,7,Yes,No,1,Yes,...,1,48.20,340.35,No,Month-to-month,DSL,Electronic check,0.991332,0,0.583333
3,0023-HGHWL,Male,1,No,No,1,No,No phone service,1,No,...,1,25.10,25.10,Yes,Month-to-month,DSL,Electronic check,1.000000,1,0.083333
4,0032-PGELS,Female,0,Yes,Yes,1,No,No phone service,1,Yes,...,3,30.50,30.50,Yes,Month-to-month,DSL,Bank transfer (automatic),1.000000,1,0.083333
5,0067-DKWBL,Male,1,No,No,2,Yes,No,1,Yes,...,1,49.25,91.10,Yes,Month-to-month,DSL,Electronic check,1.081229,1,0.166667
6,0076-LVEPS,Male,0,No,Yes,29,No,No phone service,1,Yes,...,2,45.00,1242.45,No,Month-to-month,DSL,Mailed check,1.050344,0,2.416667
7,0082-LDZUE,Male,0,No,No,1,Yes,No,1,No,...,2,44.30,44.30,No,Month-to-month,DSL,Mailed check,1.000000,0,0.083333
8,0096-BXERS,Female,0,Yes,No,6,Yes,Yes,1,No,...,1,50.35,314.55,No,Month-to-month,DSL,Electronic check,0.960420,0,0.500000
9,0096-FCPUF,Male,0,No,No,30,Yes,Yes,1,Yes,...,2,64.50,1888.45,No,Month-to-month,DSL,Mailed check,1.024650,0,2.500000


Figure out a way to capture the information contained in phone_service and multiple_lines into a single variable of dtype int. Write a function that will transform the data and place in a new column named phone_id.

Figure out a way to capture the information contained in dependents and partner into a single variable of dtype int. Transform the data and place in a new column household_type_id.

Figure out a way to capture the information contained in streaming_tv and streaming_movies into a single variable of dtype int. Transform the data and place in a new column streaming_services.

Figure out a way to capture the information contained in online_security and online_backup into a single variable of dtype int. Transform the data and place in a new column online_security_backup.
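
The four combined features described above can be sketched as follows. This is one possible encoding, not the only one: each pair of Yes/No columns is turned into a small integer code, and "No phone service" / "No internet service" fall into code 0 together with "No". The function name `add_combined_features` and the specific code values are assumptions made for illustration, and the demo rows are made up.

```python
import pandas as pd


def add_combined_features(df):
    """Return a copy of df with four integer columns that merge paired features."""
    out = df.copy()

    # phone_id: 0 = no phone service, 1 = one line, 2 = multiple lines
    out['phone_id'] = (
        (out['phone_service'] == 'Yes').astype(int)
        + (out['multiple_lines'] == 'Yes').astype(int)
    )

    # household_type_id: 0 = neither, 1 = partner only,
    # 2 = dependents only, 3 = partner and dependents
    out['household_type_id'] = (
        (out['partner'] == 'Yes').astype(int)
        + 2 * (out['dependents'] == 'Yes').astype(int)
    )

    # streaming_services: 0 = none, 1 = TV only, 2 = movies only, 3 = both
    out['streaming_services'] = (
        (out['streaming_tv'] == 'Yes').astype(int)
        + 2 * (out['streaming_movies'] == 'Yes').astype(int)
    )

    # online_security_backup: 0 = none, 1 = security only, 2 = backup only, 3 = both
    out['online_security_backup'] = (
        (out['online_security'] == 'Yes').astype(int)
        + 2 * (out['online_backup'] == 'Yes').astype(int)
    )
    return out


# Made-up rows that use the same category labels as the telco data.
demo = pd.DataFrame({
    'phone_service': ['Yes', 'No', 'Yes'],
    'multiple_lines': ['Yes', 'No phone service', 'No'],
    'partner': ['No', 'Yes', 'Yes'],
    'dependents': ['No', 'Yes', 'No'],
    'streaming_tv': ['Yes', 'No internet service', 'Yes'],
    'streaming_movies': ['No', 'No internet service', 'Yes'],
    'online_security': ['No', 'No internet service', 'Yes'],
    'online_backup': ['Yes', 'No internet service', 'No'],
})
combined = add_combined_features(demo)
print(combined[['phone_id', 'household_type_id',
                'streaming_services', 'online_security_backup']])
```

If the difference between "No" and "No internet service" turns out to matter for churn, it can be kept separately through internet_service_type_id rather than in these codes.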

Split the data into train (70%) & test (30%) samples.

In [24]:
list(df)

['customer_id',
 'gender',
 'senior_citizen',
 'partner',
 'dependents',
 'tenure',
 'phone_service',
 'multiple_lines',
 'internet_service_type_id',
 'online_security',
 'online_backup',
 'device_protection',
 'tech_support',
 'streaming_tv',
 'streaming_movies',
 'contract_type_id',
 'paperless_billing',
 'payment_type_id',
 'monthly_charges',
 'total_charges',
 'churn',
 'contract_type',
 'internet_service_type',
 'payment_type',
 'percent_var_tc_from_act_tc',
 'churn_encoded',
 'tenure_year']

In [25]:
X = df.drop(['churn', 'churn_encoded'], axis=1)
y = df[['churn_encoded']]

In [26]:
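# The default split is 75/25; ask for 70/30 explicitly and fix the seed so it is reproducible.
train, test = train_test_split(df, train_size=0.70, random_state=123)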

In [27]:
scaler = MinMaxScaler()

In [28]:
# Learn the scaling parameters from the training data only.
scaler.fit(train[['monthly_charges', 'total_charges']])

In [None]:
train[['monthly_charges', 'total_charges']] = scaler.transform(train[['monthly_charges', 'total_charges']])
test[['monthly_charges', 'total_charges']] = scaler.transform(test[['monthly_charges', 'total_charges']])
train.head()

In [None]:
X_train = train.drop(['churn', 'churn_encoded'], axis=1)
X_test = test.drop(['churn', 'churn_encoded'], axis=1)
y_train = train[['churn_encoded']]
y_test = test[['churn_encoded']]

Variable Encoding: encode the values in each non-numeric feature such that they are numeric.

Numeric Scaling: scale the monthly_charges and total_charges data. Make sure that the parameters for scaling are learned from the training data set.
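
For the variable encoding step, a minimal sketch that fits one LabelEncoder per non-numeric column on the training data and applies it to both samples, mirroring how the scaler is fitted on train only. The function name `encode_features` and the `skip` argument are assumptions for illustration, and the demo frames are made up. Note that LabelEncoder raises an error if the test sample contains a category that never appeared in train.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_features(train, test, skip=('customer_id',)):
    """Encode every non-numeric column as integers, learning the labels from train."""
    train, test = train.copy(), test.copy()
    encoders = {}
    for col in train.columns:
        if col in skip or pd.api.types.is_numeric_dtype(train[col]):
            continue
        encoder = LabelEncoder().fit(train[col])
        train[col] = encoder.transform(train[col])
        test[col] = encoder.transform(test[col])
        encoders[col] = encoder  # kept so new data can be encoded the same way
    return train, test, encoders


# Made-up frames, only to show the call.
train_demo = pd.DataFrame({'customer_id': ['a', 'b', 'c'],
                           'gender': ['Male', 'Female', 'Male'],
                           'tenure': [1, 2, 3]})
test_demo = pd.DataFrame({'customer_id': ['d'],
                          'gender': ['Female'],
                          'tenure': [4]})
train_enc, test_enc, encoders = encode_features(train_demo, test_demo)
print(train_enc)
print(test_enc)
```

Returning the fitted encoders makes it possible to reuse them in the .py file that prepares new data for the model.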

# Data Exploration
Could the month in which they signed up influence churn? i.e. if a cohort is identified by tenure, is there a cohort or cohorts who have a higher rate of churn than other cohorts? (Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers)).
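
The line chart described in this question can be sketched as below, assuming the churn_encoded column created earlier (1 = churned, 0 = not churned). Because the column is 0/1, its mean within each tenure value equals customers churned / total customers. The function names and the demo data are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def churn_rate_by_tenure(df):
    """Churn rate (churned / total customers) for each tenure value."""
    return df.groupby('tenure')['churn_encoded'].mean()


def plot_churn_rate_by_tenure(df):
    """Line chart with tenure on the x axis and churn rate on the y axis."""
    rate = churn_rate_by_tenure(df)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(rate.index, rate.values)
    ax.set_xlabel('tenure (months)')
    ax.set_ylabel('churn rate')
    ax.set_title('Churn rate by tenure')
    return ax


# Made-up data, only to show the shape of the result.
demo = pd.DataFrame({
    'tenure': [1, 1, 1, 1, 2, 2, 12, 12],
    'churn_encoded': [1, 1, 0, 1, 0, 1, 0, 0],
})
rate = churn_rate_by_tenure(demo)
print(rate)
ax = plot_churn_rate_by_tenure(demo)
```

Tenure values with very few customers give noisy rates, so it may help to plot the customer count per tenure alongside the rate.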

Are there features that indicate a higher propensity to churn? like type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?

Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?

If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?
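
One way to set up this comparison is to keep only customers whose tenure is past the 12th month and then compute the churn rate per contract type. A minimal sketch, assuming the contract_type and churn_encoded columns from earlier; the function name is an assumption, and the 'One year' label in the made-up demo data is assumed rather than taken from the output shown above.

```python
import pandas as pd


def churn_rate_after_month(df, month=12):
    """Churn rate and customer count per contract type, for tenure > month."""
    past = df[df['tenure'] > month]
    return past.groupby('contract_type')['churn_encoded'].agg(
        churn_rate='mean', customers='count'
    )


# Made-up data, only to show the call.
demo = pd.DataFrame({
    'tenure': [13, 14, 20, 5, 13, 30],
    'contract_type': ['Month-to-month', 'Month-to-month', 'One year',
                      'Month-to-month', 'One year', 'One year'],
    'churn_encoded': [1, 0, 0, 1, 1, 0],
})
summary = churn_rate_after_month(demo)
print(summary)
```

Reporting the customer count next to each rate gives the number some context, since a rate based on a handful of customers is not comparable to one based on thousands.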

Controlling for services (phone_id, internet_service_type_id, online_security_backup, device_protection, tech_support, and contract_type_id), is the mean monthly_charges of those who have churned significantly different from that of those who have not churned? (Use a t-test to answer this.)
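
The t-test in this question can be sketched with scipy's two-sample test. The version below uses Welch's variant (equal_var=False), which is a choice made here rather than something the question specifies, and the function name, the alpha of 0.05, and the demo data are assumptions. The null hypothesis is that mean monthly_charges is the same for churned and non-churned customers.

```python
import pandas as pd
from scipy import stats


def compare_monthly_charges(df, alpha=0.05):
    """Two-sample t-test of monthly_charges: churned vs. not churned."""
    churned = df.loc[df['churn_encoded'] == 1, 'monthly_charges']
    stayed = df.loc[df['churn_encoded'] == 0, 'monthly_charges']
    # Welch's t-test: does not assume the two groups have equal variance.
    t_stat, p_value = stats.ttest_ind(churned, stayed, equal_var=False)
    return t_stat, p_value, bool(p_value < alpha)


# Made-up data with an obvious gap between the two groups.
demo = pd.DataFrame({
    'monthly_charges': [80, 85, 90, 95, 100, 20, 25, 30, 35, 40],
    'churn_encoded': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
t_stat, p_value, significant = compare_monthly_charges(demo)
print(t_stat, p_value, significant)
```

To control for services, the same function can be applied to each group produced by a groupby on the service columns, skipping groups that have too few customers in either class for the test to be meaningful.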

How much of monthly_charges can be explained by internet_service_type? (hint: correlation test). State your hypotheses and your conclusion clearly.

How much of monthly_charges can be explained by internet_service_type + phone service type (0, 1, or multiple lines). State your hypotheses and your conclusion clearly.
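
For the two "how much can be explained" questions, one simple measure is the share of the variance in monthly_charges that lies between groups: the between-group sum of squares divided by the total sum of squares. A sketch under stated assumptions: the function name is made up, the demo data is made up, and when several grouping columns are passed the groups are every combination of their values, which is slightly more flexible than a purely additive model.

```python
import pandas as pd


def variance_explained(df, group_cols, target='monthly_charges'):
    """Share of the variance in target explained by the group means (0 to 1)."""
    overall_mean = df[target].mean()
    # Replace every observation by the mean of its group.
    group_means = df.groupby(group_cols)[target].transform('mean')
    ss_total = ((df[target] - overall_mean) ** 2).sum()
    ss_between = ((group_means - overall_mean) ** 2).sum()
    return ss_between / ss_total


# Made-up data, only to show the calculation.
demo = pd.DataFrame({
    'internet_service_type': ['DSL', 'DSL', 'Fiber optic', 'Fiber optic'],
    'phone_id': [1, 2, 1, 2],
    'monthly_charges': [40.0, 60.0, 80.0, 100.0],
})
r2_internet = variance_explained(demo, ['internet_service_type'])
r2_both = variance_explained(demo, ['internet_service_type', 'phone_id'])
print(r2_internet, r2_both)
```

In the demo, the overall mean is 70 and the group means are 50 and 90, so the between-group sum of squares is 1600 out of a total of 2000, giving 0.8 for internet service type alone.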

Create visualizations exploring the interactions of variables (independent with independent and independent with dependent). The goal is to identify features that are related to churn, identify any data integrity issues, understand 'how the data works'. For example, we may find that all who have online services also have device protection. In that case, we don't need both of those. (The visualizations done in your analysis for questions 1-5 count towards the requirements below)

Each independent variable (except for customer_id) should be visualized in at least two plots, and at least 1 of those compares the independent variable with the dependent variable.

For each plot where x and y are independent variables, add a third dimension (where possible), of churn represented by color.

Use subplots when plotting the same type of chart but with different variables.

Adjust the axes as necessary to extract information from the visualizations (adjusting the x & y limits, setting the scale where needed, etc.)

Add annotations to at least 5 plots with a key takeaway from that plot.

Use plots from matplotlib, pandas and seaborn.

Use each of the following:

- sns.heatmap
- pd.crosstab (along with sns.heatmap)
- pd.plotting.scatter_matrix (formerly pd.scatter_matrix)
- sns.barplot
- sns.swarmplot
- sns.pairplot
- sns.jointplot
- sns.relplot or plt.scatter
- sns.distplot or plt.hist
- sns.boxplot
- plt.plot
Use at least one more type of plot that is not included in the list above.

What can you say about each variable's relationship to churn, based on your initial exploration? If there appears to be some sort of interaction or correlation, assume there is no causal relationship and brainstorm (and document) ideas on reasons there could be correlation.

Summarize your conclusions, provide clear answers to the specific questions, and summarize any takeaways/action plan from the work above.

# Modeling
Feature Selection: Are there any variables that seem to provide limited to no additional information? If so, remove them.

Train (fit, transform, evaluate) multiple different models, varying the model type and your hyperparameters.

In [None]:
list(df)

In [None]:
def create_model(model_number, X_df_train, X_df_test, y_df_train, y_df_test, string_criterion, min_sample_leaf_input, max_depth_input):
    # Build an unfitted random forest with the requested hyperparameters.
    # model_number and the four data arguments are not used yet; they stay in the
    # signature so that existing calls keep working while this function is finished.
    rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion=string_criterion,
                            min_samples_leaf=min_sample_leaf_input,
                            n_estimators=100,
                            max_depth=max_depth_input, 
                            random_state=123)
    return rf

In [None]:
create_model(3, X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 5, 6)

In [None]:
# TODO: finish create_model above as a proper function. Leave analyze_rf_binomial below alone.

In [None]:
def analyze_rf_binomial(X_df_train, X_df_test, y_df_train, y_df_test, string_criterion, min_sample_leaf_input, max_depth_input):
    features = list(X_df_train)
    
    print('Results using ' + str(string_criterion) + ' as the measure of impurity and ' + str(max_depth_input) + ' as max depth level and ' +str(min_sample_leaf_input) + ' as the min_sample_leaf.')
    print('The features being used: ' + str(features))
    print('-----')

    rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion=string_criterion,
                            min_samples_leaf=min_sample_leaf_input,
                            n_estimators=100,
                            max_depth=max_depth_input, 
                            random_state=123)

    rf.fit(X_df_train, y_df_train)

    y_df_pred = rf.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = rf.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')

    print('Accuracy of rf classifier on training set: {:.8f}'.format(rf.score(X_df_train, y_df_train)))
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')
    
    y_df_pred_test = rf.predict(X_df_test)
    y_df_pred_proba_test = rf.predict_proba(X_df_test)
    print('Accuracy of RF classifier on test set: {:.6f}'
     .format(rf.score(X_df_test, y_df_test)))
    print('-----')
    

In [None]:
X_rf_1_train = X_train[['tenure_year', 'monthly_charges', 'internet_service_type_id']]
X_rf_1_test = X_test[['tenure_year', 'monthly_charges', 'internet_service_type_id']]

In [None]:
analyze_rf_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 5, 6)

In [None]:
analyze_rf_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 4, 15)

In [None]:
def test_rf_binomial(X_df_train, X_df_test, y_df_train, y_df_test, string_criterion, min_sample_leaf_input, max_depth_input):
    features = list(X_df_train)
    
    print('Results using ' + str(string_criterion) + ' as the measure of impurity and ' + str(max_depth_input) + ' as max depth level and ' +str(min_sample_leaf_input) + ' as the min_sample_leaf.')
    print('The features being used: ' + str(features))
    print('-----')

    rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion=string_criterion,
                            min_samples_leaf=min_sample_leaf_input,
                            n_estimators=100,
                            max_depth=max_depth_input, 
                            random_state=123)

    rf.fit(X_df_train, y_df_train)

    y_df_pred = rf.predict(X_df_train)
    print('Head of predicted on X_train:')
    print(y_df_pred[0:5])
    print('-----')

    y_df_pred_proba = rf.predict_proba(X_df_train)
    print('Head of probabilities on X_train:')
    print(y_df_pred_proba[0:5])
    print('-----')

    print('Accuracy of rf classifier on training set: {:.8f}'.format(rf.score(X_df_train, y_df_train)))
    print('-----')
    
    cm = pd.DataFrame(confusion_matrix(y_df_train, y_df_pred),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    
    print(cm)
    print('-----')
    
    print(classification_report(y_df_train, y_df_pred, digits=4))
    print('-----')
    
    # Predict once on the test sample and evaluate against the test labels.
    y_df_pred_test = rf.predict(X_df_test)
    y_df_pred_proba_test = rf.predict_proba(X_df_test)

    print('-----')
    
    print('The results of running the model on the test sample:')
    
    cm = pd.DataFrame(confusion_matrix(y_df_test, y_df_pred_test),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
    print(cm)
    print('-----')
    
    print(classification_report(y_df_test, y_df_pred_test, digits=4))
    print('-----')
    
    print('Accuracy of RF classifier on test set: {:.6f}'
     .format(rf.score(X_df_test, y_df_test)))
    print('-----')
    

In [None]:
test_rf_binomial(X_rf_1_train, X_rf_1_test, y_train, y_test, 'entropy', 4, 7)

Compare evaluation metrics across all the models, and select the best performing model.
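
A minimal sketch of such a comparison: fit each candidate, score it on a held-out sample, and collect the metrics in one table. The function name `compare_models`, the three candidate models, and the synthetic data from make_classification are assumptions for illustration only. Scoring candidates on a validation split carved out of the training data, rather than on the test set, keeps the final test result untouched by model selection.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X_train, y_train, X_eval, y_eval):
    """Fit every model and return one row of metrics per model."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_eval)
        rows.append({
            'model': name,
            'accuracy': accuracy_score(y_eval, pred),
            'precision': precision_score(y_eval, pred, zero_division=0),
            'recall': recall_score(y_eval, pred, zero_division=0),
        })
    return pd.DataFrame(rows).set_index('model').sort_values('recall', ascending=False)


# Synthetic stand-in for the prepared churn features.
X, y = make_classification(n_samples=300, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.70, random_state=123)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(max_depth=4, random_state=123),
    'random_forest': RandomForestClassifier(n_estimators=100, max_depth=6,
                                            min_samples_leaf=5, random_state=123),
}
results = compare_models(models, X_tr, y_tr, X_val, y_val)
print(results)
```

The table is sorted by recall here because missing a customer who churns is assumed to be the costlier error; sort by whichever metric matters most for the decision at hand.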

Test the final model (transform, evaluate) on your out-of-sample data (the testing data set). Summarize the performance. Interpret your results.
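
The deliverables ask how likely the model is to predict churn when churn doesn't occur, to miss churn when it does occur, and to predict correctly overall. Those three numbers can be read off the confusion matrix, as sketched below. The function name `error_rates` and the demo labels are made up; in practice y_true and y_pred would be the test labels and the final model's test predictions.

```python
from sklearn.metrics import confusion_matrix


def error_rates(y_true, y_pred):
    """False positive rate, false negative rate and accuracy for 0/1 labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        # Predicted churn, but the customer stayed.
        'false_positive_rate': fp / (fp + tn),
        # Predicted no churn, but the customer left.
        'false_negative_rate': fn / (fn + tp),
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
    }


# Made-up labels, only to show the calculation.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0]
rates = error_rates(y_true_demo, y_pred_demo)
print(rates)
```

In the demo, 1 of the 5 customers who stayed is flagged as churn (false positive rate 0.2), 1 of the 3 who churned is missed (false negative rate of about 0.33), and 6 of 8 predictions are correct (accuracy 0.75).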