# Project Planning

- Describe the project and goals.

- Task out how you will work through the pipeline in as much detail as you need to keep on track.

- Include a data dictionary.

- Clearly state your starting hypotheses (and add the testing of these to your task list).

## Goals

- Find drivers for customer churn.

- Construct an ML classification model that accurately predicts customer churn.

- Create modules that make your process repeatable.

- Document your process well enough to be presented or read like a report.



## Audience

- The target audience for your notebook walkthrough is the Codeup Data Science team. Let this guide your language and level of explanation.

## Project Specifications

#### Why are our customers churning?

###### Some questions to think about include but are not limited to:

- Are there clear groupings where a customer is more likely to churn?

    - What if you consider contract type?
    - Is there a tenure at which month-to-month customers are most likely to churn? 1-year contract customers? 2-year contract customers?
    - Do you have any thoughts on what could be going on? State these thoughts as untested hypotheses, not as facts, unless you test them. Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned / total customers).

- Are there features that indicate a higher propensity to churn?

    - How influential are type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?

- Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point?

    - If so, what is that point and for which service(s)?

    - If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?
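The churn-rate line chart described above can be sketched as follows. This is a minimal illustration, not the project's analysis: the rows in `toy` are invented, the helper name `plot_churn_by_tenure` is hypothetical, and the code assumes `churn` holds the strings `'Yes'`/`'No'` as in the raw data.

```python
import pandas as pd

# Invented toy rows, used only to show the calculation.
toy = pd.DataFrame({
    "tenure": [1, 1, 1, 1, 2, 2, 12, 12, 12, 12],
    "churn":  ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes"],
})

# Rate of churn per tenure value = customers churned / total customers.
# The mean of a boolean Series is the share of True values.
churn_rate = toy["churn"].eq("Yes").groupby(toy["tenure"]).mean()


def plot_churn_by_tenure(rate):
    """Draw tenure on x and churn rate on y (hypothetical helper)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(rate.index, rate.values, marker="o")
    ax.set_xlabel("tenure (months)")
    ax.set_ylabel("churn rate")
    ax.set_title("Churn rate by tenure")
    return ax
```

Calling `plot_churn_by_tenure(churn_rate)` in a notebook cell draws the chart. To compare contract types, the same calculation could be repeated on a subset of rows for each contract type.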


### Acquisition

- Acquire data from the customers table in the telco_churn database on the Codeup data science database server.

- You will want to join some tables as part of your query.

- This data should end up in a pandas data frame.

- Summarize the data (.info(), .describe(), .value_counts(), ...).

- Plot distributions of individual variables.
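The join step above might look something like the sketch below. It runs against an in-memory SQLite database rather than the real server, so it only illustrates the shape of the query. The lookup table name `contract_types` and the two customer rows are assumptions made for illustration. Only the `customers` table name and the contract labels come from this notebook.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id TEXT, contract_type_id INTEGER, tenure INTEGER);
    CREATE TABLE contract_types (contract_type_id INTEGER, contract_type TEXT);
    INSERT INTO customers VALUES ('A', 1, 3), ('B', 3, 60);
    INSERT INTO contract_types VALUES (1, 'Month-to-month'), (2, '1 year'), (3, '2 year');
    """
)

# Join the id column to its human-readable label.
query = """
    SELECT c.customer_id, c.tenure, c.contract_type_id, ct.contract_type
    FROM customers AS c
    JOIN contract_types AS ct
      ON c.contract_type_id = ct.contract_type_id
    ORDER BY c.customer_id
"""

# read_sql returns the query result as a pandas DataFrame.
toy = pd.read_sql(query, conn)
conn.close()
```

The same pattern would extend to the internet service and payment type lookups, one `JOIN` per id column.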



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from acquire import get_telco_data

In [None]:
#grabbing telco data from SQL using function and storing it as a DataFrame
df = pd.DataFrame(get_telco_data())
df = df.set_index('customer_id')

In [None]:
df.head()

In [None]:
df.info()

__None of the columns have null values__

In [None]:
df.describe()

__There do not seem to be any outliers__
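One way to back up a reading of `.describe()` is an explicit rule. The sketch below uses the common 1.5 × IQR fence, which is an assumption here and not the method used in this notebook. The helper name `iqr_outliers` and the toy values are invented for illustration.

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of s lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]


# Invented values; 500.0 is planted so the rule has something to flag.
toy = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 500.0], name="monthly_charges")
flagged = iqr_outliers(toy)
```

An empty result for a column such as `monthly_charges` or `tenure` would support the observation above under this particular rule.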

In [None]:
df.payment_type_id.value_counts()

- 1: electronic check
- 2: mailed check
- 3: bank transfer (automatic)
- 4: credit card (automatic)

In [None]:
df.internet_service_type_id.value_counts()

- 1: DSL
- 2: Fiber Optic
- 3: None

In [None]:
df.contract_type_id.value_counts()

- 1: Month-to-month
- 2: 1 year
- 3: 2 year

In [None]:
df.senior_citizen.value_counts()

In [None]:
df.phone_service.value_counts()

- 0: Is not senior citizen
- 1: Is senior citizen

__Confirmed there are no null values in senior citizens, payment, internet and contract type id__

In [None]:
# distribution of columns whose data type is 'int64'
num_cols = df.select_dtypes(include='int64').columns
for col in num_cols:
    plt.hist(df[col])
    plt.title(col)
    plt.show()

__There is a normal distribution across payment type, internet service and tenure__

__There are a lot more customers who are not senior citizens than who are__

__There are more month-to-month customers than the 1 and 2 year subscriptions combined__

### Data Prep

- Change device_protection, tech_support and paperless_billing to 0/1

- Create a new feature that represents tenure in years.

- Create single variables for, or find other methods to merge, the information in the following columns:

    - phone_service and multiple_lines
    - dependents and partner
    - streaming_tv & streaming_movies
    - online_security & online_backup
    
- Split your data into train/validate/test.
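As one possible "other method" for merging a pair of Yes/No columns, the sketch below maps each answer to 0/1 and adds the two columns. It is an alternative to concatenating the strings and is shown on invented toy rows. It only fits columns whose values are strictly `'Yes'`/`'No'`, such as `partner` and `dependents`; columns that also contain `'No internet service'` would need an extra category.

```python
import pandas as pd

# Invented toy rows covering every Yes/No combination.
toy = pd.DataFrame({
    "partner":    ["Yes", "No", "Yes", "No"],
    "dependents": ["Yes", "Yes", "No", "No"],
})

yes_no = {"Yes": 1, "No": 0}

# 2 = partner and dependents, 1 = exactly one of them, 0 = neither.
toy["part_or_dep_values"] = (
    toy["partner"].map(yes_no) + toy["dependents"].map(yes_no)
)
```

Any value missing from the `yes_no` mapping becomes `NaN`, which makes unexpected categories easy to spot with `.isna().sum()`.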

In [9]:
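def online_checker(value):
    # Encode an internet add-on answer as an ordinal:
    # 2 = has it, 1 = does not, 0 = no internet service.
    # Any other value falls through and returns None.
    if value == 'Yes':
        return 2
    elif value == 'No':
        return 1
    elif value == "No internet service":
        return 0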

In [None]:
df.device_protection = df.device_protection.apply(online_checker)

In [None]:
df.tech_support = df.tech_support.apply(online_checker)

In [None]:
df['tenure_by_year'] = df.tenure / 12

In [10]:
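def phone_checker(value):
    # Encode multiple_lines as an ordinal:
    # 2 = multiple lines, 1 = single line, 0 = no phone service.
    # Any other value falls through and returns None.
    if value == 'Yes':
        return 2
    elif value == 'No':
        return 1
    elif value == "No phone service":
        return 0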

In [None]:
df['multiple_line_values'] = df.multiple_lines.apply(phone_checker)

In [11]:
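def family_checker(value):
    # Input is "<partner> <dependents>", e.g. 'Yes No'.
    # 2 = both, 1 = exactly one, 0 = neither.
    if value == 'Yes Yes':
        return 2
    elif value == 'Yes No' or value == 'No Yes':
        return 1
    elif value == 'No No':
        return 0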

In [None]:
df['part_or_dep_values'] = df['partner'].str.cat(df['dependents'], sep =" ") 

In [None]:
df.part_or_dep_values = df.part_or_dep_values.apply(family_checker)

In [12]:
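def stream_checker(value):
    # Input is two internet add-on answers joined by a space.
    # 3 = both, 2 = exactly one, 1 = neither, 0 = no internet service.
    if value == 'Yes Yes':
        return 3
    elif value == 'Yes No' or value == 'No Yes':
        return 2
    elif value == 'No No':
        return 1
    elif value == "No internet service No internet service":
        return 0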

In [None]:
df['streaming_tv_or_movie'] = df['streaming_tv'].str.cat(df['streaming_movies'], sep =" ") 

In [None]:
df.streaming_tv_or_movie = df.streaming_tv_or_movie.apply(stream_checker)

In [None]:
df['security_or_backup_values'] = df.online_security.str.cat(df.online_backup, sep=" ")

In [None]:
df.security_or_backup_values = df.security_or_backup_values.apply(stream_checker)

In [None]:
telco_dummies = pd.get_dummies(df[['gender', 'churn', 'paperless_billing']], drop_first=True)

In [None]:
df = pd.concat([df, telco_dummies], axis=1)

In [None]:
col_to_drop = ['gender', 'partner', 'dependents', 'phone_service', 'multiple_lines', 'online_security', 'online_backup', 'streaming_tv', 'streaming_movies', 'paperless_billing', 'churn', 'contract_type', 'internet_service_type', 'payment_type']

In [None]:
df = df.drop(columns = col_to_drop)

In [None]:
df

In [13]:
def telco_split(df):

    train_validate, test = train_test_split(df, test_size=.15, 
                                        random_state=123, 
                                        stratify=df.churn_Yes)
    train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.churn_Yes)
    return train, validate, test

In [None]:
train, validate, test = telco_split(df)

In [None]:
train.shape, validate.shape, test.shape
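The two-stage split in `telco_split` can be checked with a little arithmetic. The sketch below only uses the `test_size` values from that function: 15% is held out for test, then 30% of the remaining 85% goes to validate.

```python
# test_size values taken from the two train_test_split calls in telco_split.
first_test_size = 0.15   # share of all rows held out as test
second_test_size = 0.30  # share of the remainder held out as validate

test_frac = first_test_size
validate_frac = (1 - first_test_size) * second_test_size
train_frac = (1 - first_test_size) * (1 - second_test_size)

print(f"train {train_frac:.1%}, validate {validate_frac:.1%}, test {test_frac:.1%}")
```

So the shapes printed above should come out close to 59.5% / 25.5% / 15% of the full row count, with small differences from rounding to whole rows.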

In [None]:
train.head()

In [14]:
def prep_telco_data(cached = True):
    # NOTE: `cached` is accepted but not used anywhere in this function
    # use my acquire function to read the telco data into a df
    df = pd.DataFrame(get_telco_data())
    #set index as customer_id
    df = df.set_index('customer_id')
    # change device_protection into numeric values
    df.device_protection = df.device_protection.apply(online_checker)
    # change tech_support into numeric values
    df.tech_support = df.tech_support.apply(online_checker)
    #create a new column with tenure by the year
    df['tenure_by_year'] = df.tenure / 12
    #change multiple_lines into a numeric value
    df['multiple_line_values'] = df.multiple_lines.apply(phone_checker)
    # combine partner and dependents into one category
    df['part_or_dep_values'] = df['partner'].str.cat(df['dependents'], sep =" ") 
    #change the category into a numeric value
    df.part_or_dep_values = df.part_or_dep_values.apply(family_checker)
    #combine streaming tv and movies into one category
    df['streaming_tv_or_movie'] = df['streaming_tv'].str.cat(df['streaming_movies'], sep =" ") 
    #change the category into a numeric value
    df.streaming_tv_or_movie = df.streaming_tv_or_movie.apply(stream_checker)
    #combine online_security and online_backup into one category
    df['security_or_backup_values'] = df.online_security.str.cat(df.online_backup, sep=" ")
    #change the category into a numeric value
    df.security_or_backup_values = df.security_or_backup_values.apply(stream_checker)
    #create dummy values for gender, churn, and paperless_billing
    telco_dummies = pd.get_dummies(df[['gender', 'churn', 'paperless_billing']], drop_first=True)
    #add dummy values into the main dataframe
    df = pd.concat([df, telco_dummies], axis=1)
    #list the original columns that are now redundant
    col_to_drop = ['gender', 'partner', 'dependents', 'phone_service', 'multiple_lines', 'online_security', 'online_backup', 'streaming_tv', 'streaming_movies', 'paperless_billing', 'churn', 'contract_type', 'internet_service_type', 'payment_type']
    #drop the redundant columns
    df = df.drop(columns = col_to_drop)
    #split data into train, validate and test subsets
    train, validate, test = telco_split(df)
    return train, validate, test

In [15]:
train, validate, test = prep_telco_data()

In [16]:
train

| customer_id | payment_type_id | internet_service_type_id | contract_type_id | senior_citizen | tenure | device_protection | tech_support | monthly_charges | total_charges | tenure_by_year | multiple_line_values | part_or_dep_values | streaming_tv_or_movie | security_or_backup_values | gender_Male | churn_Yes | paperless_billing_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3961-SXAXY | 3 | 1 | 1 | 0 | 1 | 1 | 1 | 44.05 | 44.05 | 0.083333 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 8085-MSNLK | 2 | 2 | 3 | 0 | 62 | 2 | 2 | 113.95 | 6891.4 | 5.166667 | 2 | 1 | 3 | 3 | 0 | 0 | 0 |
| 3873-WOSBC | 4 | 3 | 3 | 0 | 67 | 0 | 0 | 25.60 | 1784.9 | 5.583333 | 2 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4544-RXFMG | 2 | 1 | 2 | 0 | 8 | 1 | 1 | 43.45 | 345.5 | 0.666667 | 1 | 2 | 1 | 1 | 1 | 0 | 1 |
| 8644-XLFBW | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 71.65 | 71.65 | 0.083333 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7317-GGVPB | 4 | 2 | 3 | 0 | 71 | 2 | 2 | 108.60 | 7690.9 | 5.916667 | 2 | 1 | 3 | 2 | 1 | 1 | 1 |
| 6332-FBZRI | 4 | 1 | 2 | 0 | 67 | 2 | 2 | 69.35 | 4653.25 | 5.583333 | 2 | 2 | 1 | 3 | 1 | 0 | 1 |
| 7554-NEWDD | 3 | 3 | 3 | 0 | 10 | 0 | 0 | 25.70 | 251.6 | 0.833333 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9127-FHJBZ | 1 | 2 | 3 | 0 | 72 | 2 | 2 | 114.00 | 8093.15 | 6.000000 | 2 | 2 | 3 | 3 | 1 | 0 | 1 |
