# Modeling

---

This notebook outlines the modeling process for producing a machine learning model to predict whether or not a customer will churn.

---

In [1]:
# Throughout the notebook we will use this random seed
seed = 24

## Acquire and Prepare Data

First let's acquire and prepare our data using the functions we previously created.

In [2]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from env import username, password, hostname
database_name = 'telco_churn'

def get_db_url(database_name, username = username, password = password, hostname = hostname):
    return f'mysql+pymysql://{username}:{password}@{hostname}/{database_name}'

def get_telco_sql():
    return '''
        SELECT *
        FROM customers
        JOIN payment_types USING (payment_type_id)
        JOIN internet_service_types USING (internet_service_type_id)
        JOIN contract_types USING (contract_type_id);
    '''

def get_telco_data(use_cache = True):
    # If the file is cached, read from the .csv file
    if os.path.exists('telco.csv') and use_cache:
        print('Using cache')
        return pd.read_csv('telco.csv')
    
    # Otherwise read from the mysql database
    else:
        print('Reading from database')
        df = pd.read_sql(get_telco_sql(), get_db_url('telco_churn'))
        df.to_csv('telco.csv', index = False)
        return df
    
def prep_telco_data(df):
    df = df.drop_duplicates()

    cols_to_drop = [
        'customer_id',
        'contract_type_id',
        'internet_service_type_id',
        'payment_type_id'
    ]
    df = df.drop(columns = cols_to_drop)

    # Copy the filtered frame so the assignments below do not trigger a SettingWithCopyWarning
    does_not_have_zero_tenure = df.tenure != 0
    df = df[does_not_have_zero_tenure].copy()
    df.total_charges = df.total_charges.astype('float')

    columns = [
        'multiple_lines',
        'online_security',
        'online_backup',
        'device_protection',
        'tech_support',
        'streaming_tv',
        'streaming_movies'
    ]

    for column in columns:
        df[column] = np.where(df[column] == 'Yes', 'Yes', 'No')

    categorical_cols = df.dtypes[df.dtypes == 'object'].index

    dummy_df = pd.get_dummies(df[categorical_cols], dummy_na = False, drop_first = True)
    df = pd.concat([df, dummy_df], axis = 1)

    df.columns = df.columns.str.replace(' ', '_', regex = False).str.lower()
    # Raw string so that \( and \) are passed to the regex engine unchanged
    df.columns = df.columns.str.replace(r'\(|\)', '', regex = True)

    return df

def split_data(df, stratify, random_seed = 24):
    test_split = 0.2
    train_validate_split = 0.3

    train_validate, test = train_test_split(
        df,
        test_size = test_split,
        random_state = random_seed,
        stratify = df[stratify]
    )
    
    train, validate = train_test_split(
        train_validate,
        test_size = train_validate_split,
        random_state = random_seed,
        stratify = train_validate[stratify]
    )
    return train, validate, test

In [3]:
telco_customers = get_telco_data()
telco_customers = prep_telco_data(telco_customers)
train, validate, test = split_data(telco_customers, 'churn')
train.shape, validate.shape, test.shape

Using cache


((3937, 40), (1688, 40), (1407, 40))
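
The shapes above follow from the two-stage split in `split_data`: 20% of the rows are held out for test, then 30% of the remainder goes to validate. Here is a quick check of that arithmetic, assuming the fractional sizes are rounded up (which is consistent with the shapes printed above):

```python
import math

# Total rows after prep, taken from the shapes printed above
n_rows = 3937 + 1688 + 1407

# First split: 20% of all rows go to test
n_test = math.ceil(n_rows * 0.2)

# Second split: 30% of the remaining rows go to validate
n_train_validate = n_rows - n_test
n_validate = math.ceil(n_train_validate * 0.3)
n_train = n_train_validate - n_validate

print(n_train, n_validate, n_test)

# Share of the full dataset in each split (roughly 56% / 24% / 20%)
shares = [round(n / n_rows, 2) for n in (n_train, n_validate, n_test)]
print(shares)
```

So although `train_validate_split` is 0.3, validate ends up at about 24% of the full dataset because it is taken from the 80% that remains after the test split.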

**We will only be using our train and validate datasets in this notebook.**

---

## Establishing a Baseline Model

Before we begin creating models we must first establish a baseline against which we can compare the performance of our models to determine if they meet at least the minimum standard.

In [4]:
# We will look at the unique values of churn along with the counts of those values
train.churn.value_counts(), train.churn.value_counts(normalize = True)

(No     2891
 Yes    1046
 Name: churn, dtype: int64,
 No     0.734315
 Yes    0.265685
 Name: churn, dtype: float64)

Since the most frequent value of churn is No we will establish our baseline model as one that always predicts that a customer will not churn. We can see above that such a model will have an accuracy of roughly 73%.

In [5]:
# Create a pandas series of all "No"s to serve as our baseline model
baseline = pd.Series(['No'] * train.shape[0])
baseline.value_counts()

No    3937
dtype: int64

We can turn this into a function that creates the baseline model for us. We simply need to determine the most common value of our target variable and then build a pandas Series of that value with the same size as our dataset.

In [6]:
# Let's find the most common value in the churn column
most_common_value = train.churn.mode()[0]
most_common_value

'No'

In [7]:
# Now let's create a series of the same size as train with only that most common value
pd.Series([most_common_value] * train.churn.size).value_counts()

No    3937
dtype: int64

In [8]:
# Now let's turn it into a function

def create_baseline_model(column):
    most_common_value = column.mode()[0]
    return pd.Series([most_common_value] * column.size)

create_baseline_model(train.churn).value_counts()

No    3937
dtype: int64
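
For comparison, scikit-learn ships a `DummyClassifier` that implements the same most-frequent-class idea. The sketch below uses a small made-up target (`toy_churn`, not the Telco data) purely to show that the two approaches agree. It is an alternative, not what this notebook uses.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy target, for illustration only
toy_churn = pd.Series(['No', 'No', 'Yes', 'No', 'Yes', 'No'])

# DummyClassifier ignores the features, so a placeholder column is enough
toy_X = pd.DataFrame({'placeholder': [0] * toy_churn.size})

dummy = DummyClassifier(strategy = 'most_frequent')
dummy.fit(toy_X, toy_churn)
dummy_predictions = pd.Series(dummy.predict(toy_X))

# The manual approach from this notebook: repeat the most common value
manual_predictions = pd.Series([toy_churn.mode()[0]] * toy_churn.size)

# True when both approaches produce the same predictions
print((dummy_predictions == manual_predictions).all())
```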

---

## Measure the Performance of the Baseline

Before we continue let's measure the performance of our baseline.

We also want to consider which metric we should be optimizing for. Here a positive means that a customer does churn, so a false negative is a churning customer that we failed to flag. In this problem a false negative is far more expensive than a false positive, because it costs far more to sign a new customer than to keep an existing one. Based on this we should optimize for recall first and accuracy second. With this in mind we will focus on the recall score of our baseline model, as well as our other models, as we go through the modeling process.
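
To make the terms concrete, here is a minimal sketch with made-up labels (not the Telco data) where 'Yes' is the positive class. Recall is the share of actual churners the model caught, so every false negative lowers it.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels, for illustration only
actual    = ['Yes', 'Yes', 'Yes', 'Yes', 'No',  'No', 'No', 'No', 'No', 'No']
predicted = ['Yes', 'Yes', 'No',  'No',  'Yes', 'No', 'No', 'No', 'No', 'No']

# With labels = ['No', 'Yes'] the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels = ['No', 'Yes']).ravel()

recall = tp / (tp + fn)       # share of actual churners we caught
precision = tp / (tp + fp)    # share of predicted churners who really churned

print(tn, fp, fn, tp)
print(recall, precision)

# sklearn's recall_score computes the same quantity
print(recall_score(actual, predicted, pos_label = 'Yes'))
```

In this toy example two of the four churners are missed, so recall is 0.5 even though eight of the ten predictions are correct.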

In [9]:
# We will need sklearn to measure the performance of our baseline

from sklearn.metrics import accuracy_score, precision_score, recall_score

In [10]:
accuracy_score(train.churn, baseline)

0.7343154686309372

In [11]:
recall_score(train.churn, baseline, pos_label = 'Yes')

0.0

In [12]:
precision_score(train.churn, baseline, pos_label = 'Yes', zero_division = 0)

0.0

The baseline model scores zero in our focus metric of recall because it only predicts 'No'. When comparing our models to the baseline we will therefore look only at accuracy.

In [13]:
# Let's turn these measurements into a function for our convenience

def measure_model_performance(y_true, *y_pred, positive_label = 1):
    scores = []
    
    for index, predictions in enumerate(y_pred):
        scores.append({
            'model' : index,
            'accuracy' : accuracy_score(y_true, predictions),
            'precision' : precision_score(y_true, predictions, pos_label = positive_label, zero_division = 0),
            'recall' : recall_score(y_true, predictions, pos_label = positive_label, zero_division = 0)
        })
        
    df = pd.DataFrame(scores)
    return df.set_index('model')
    
measure_model_performance(train.churn, baseline, positive_label = 'Yes')

| model | accuracy | precision | recall |
|---|---|---|---|
| 0 | 0.734315 | 0.0 | 0.0 |
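
Because the rows are indexed by position, it can be hard to remember which row is which model once several are compared. One possible variant takes keyword arguments so that each row is labeled. This is a sketch and not part of the original notebook: the name `measure_named_models` and the toy labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def measure_named_models(y_true, positive_label = 1, **named_predictions):
    # Hypothetical variant of measure_model_performance that labels each row by name
    scores = []

    for name, predictions in named_predictions.items():
        scores.append({
            'model' : name,
            'accuracy' : accuracy_score(y_true, predictions),
            'precision' : precision_score(y_true, predictions, pos_label = positive_label, zero_division = 0),
            'recall' : recall_score(y_true, predictions, pos_label = positive_label, zero_division = 0)
        })

    return pd.DataFrame(scores).set_index('model')

# Made-up labels, for illustration only
toy_true = pd.Series(['No', 'Yes', 'No', 'Yes'])
toy_baseline = pd.Series(['No'] * 4)
toy_model = pd.Series(['No', 'Yes', 'Yes', 'Yes'])

toy_scores = measure_named_models(
    toy_true,
    positive_label = 'Yes',
    baseline = toy_baseline,
    toy_model = toy_model
)
print(toy_scores)
```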


---

## Create and Test 3 Different Models

Now we will create three different machine learning models to use for predicting customer churn. We will use the following 3 types of models:

- A decision tree: a simple model that is easy to understand and may provide the results we need.
- A random forest: ideally a model that generalizes well and provides better results than the decision tree.
- K nearest neighbors: a distance-based model that takes a different approach and may also give good results.

Before we begin let's separate our train and validate sets into X and y, where X is the dataset with the features we identified as drivers of churn in our explore phase and y is the churn column.

### Split Data Into X and y

In [14]:
# We will need to use encoded variables for our features
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3937 entries, 5467 to 2212
Data columns (total 40 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   gender                              3937 non-null   object 
 1   senior_citizen                      3937 non-null   int64  
 2   partner                             3937 non-null   object 
 3   dependents                          3937 non-null   object 
 4   tenure                              3937 non-null   int64  
 5   phone_service                       3937 non-null   object 
 6   multiple_lines                      3937 non-null   object 
 7   online_security                     3937 non-null   object 
 8   online_backup                       3937 non-null   object 
 9   device_protection                   3937 non-null   object 
 10  tech_support                        3937 non-null   object 
 11  streaming_tv                        3937

In [15]:
# We will use the encoded columns for contract_type and tech_support
features = [
    'monthly_charges',
    'tenure',
    'contract_type_one_year',
    'contract_type_two_year',
    'tech_support_yes'
]

# X_train and y_train will be used to train our models
# X_validate and y_validate will be used to test the performance of our models

X_train, y_train = train[features], train.churn
X_validate, y_validate = validate[features], validate.churn

---

### Decision Tree

Let's start with a decision tree.

In [16]:
from sklearn.tree import DecisionTreeClassifier, export_text

In [17]:
# We'll try a max_depth of 5 so we can get good results without overfitting
model_1 = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, random_state = seed)
model_1.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=24)
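
One way to check whether a given `max_depth` overfits is to compare train and validate accuracy across several depths. The sketch below uses synthetic data from `make_classification` rather than the Telco data, so the numbers are illustrative only, and the list of depths is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples = 1000, n_features = 5, random_state = 24)
X_toy_train, X_toy_validate, y_toy_train, y_toy_validate = train_test_split(
    X_toy, y_toy, test_size = 0.3, random_state = 24, stratify = y_toy
)

depth_results = []
for depth in [1, 3, 5, 10, 20]:
    tree = DecisionTreeClassifier(criterion = 'entropy', max_depth = depth, random_state = 24)
    tree.fit(X_toy_train, y_toy_train)
    depth_results.append({
        'max_depth' : depth,
        'train_accuracy' : tree.score(X_toy_train, y_toy_train),
        'validate_accuracy' : tree.score(X_toy_validate, y_toy_validate)
    })

# A widening gap between train and validate accuracy suggests overfitting
for result in depth_results:
    print(result)
```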

In [18]:
# Let's see the predictions our model makes
y_pred_model_1 = model_1.predict(X_train)
pd.Series(y_pred_model_1).value_counts()

No     2874
Yes    1063
dtype: int64

In [19]:
measure_model_performance(y_train, baseline, y_pred_model_1, positive_label = 'Yes')

| model | accuracy | precision | recall |
|---|---|---|---|
| 0 | 0.734315 | 0.0 | 0.0 |
| 1 | 0.786894 | 0.597366 | 0.607075 |


---

### Random Forest

Now let's try a random forest.

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
# We'll try a max_depth of five
model_2 = RandomForestClassifier(max_depth = 5, random_state = seed)
model_2.fit(X_train, y_train)

RandomForestClassifier(max_depth=5, random_state=24)

In [22]:
# Let's see the predictions our model makes
y_pred_model_2 = model_2.predict(X_train)
pd.Series(y_pred_model_2).value_counts()

No     3292
Yes     645
dtype: int64

In [23]:
# Now let's measure the performance of this model
measure_model_performance(y_train, baseline, y_pred_model_1, y_pred_model_2, positive_label = 'Yes')

| model | accuracy | precision | recall |
|---|---|---|---|
| 0 | 0.734315 | 0.0 | 0.0 |
| 1 | 0.786894 | 0.597366 | 0.607075 |
| 2 | 0.79807 | 0.694574 | 0.428298 |


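Accuracy is similar to the decision tree (0.798 vs. 0.787) and precision is higher (0.695 vs. 0.597), but recall, our focus metric, is noticeably lower (0.428 vs. 0.607).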

---

### K Nearest Neighbors

Lastly let's try a k nearest neighbors model.

In [24]:
from sklearn.neighbors import KNeighborsClassifier

In [25]:
model_3 = KNeighborsClassifier(n_neighbors = 10, weights = 'uniform')
model_3.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=10)

In [26]:
y_pred_model_3 = model_3.predict(X_train)
pd.Series(y_pred_model_3).value_counts()

No     3266
Yes     671
dtype: int64

In [27]:
measure_model_performance(
    y_train,
    baseline,
    y_pred_model_1,
    y_pred_model_2,
    y_pred_model_3,
    positive_label = 'Yes'
)

| model | accuracy | precision | recall |
|---|---|---|---|
| 0 | 0.734315 | 0.0 | 0.0 |
| 1 | 0.786894 | 0.597366 | 0.607075 |
| 2 | 0.79807 | 0.694574 | 0.428298 |
| 3 | 0.807214 | 0.71386 | 0.457935 |


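The k nearest neighbors model has the highest accuracy (0.807) and precision (0.714) on train so far. Its recall (0.458) is slightly better than the random forest's (0.428) but still well below the decision tree's (0.607).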
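
One caveat worth keeping in mind: k nearest neighbors is distance-based, so unscaled features with large ranges (such as `monthly_charges` and `tenure`) can dominate the 0/1 dummy columns. A possible follow-up, not done in this notebook, is to scale the features inside a pipeline. The sketch below uses a small made-up frame with hypothetical values, and `MinMaxScaler` is just one reasonable choice of scaler.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up rows, for illustration only
toy_X = pd.DataFrame({
    'monthly_charges' : [20.0, 25.0, 70.0, 95.0, 100.0, 30.0, 85.0, 45.0],
    'tenure' : [60, 48, 2, 1, 5, 72, 3, 24],
    'contract_type_two_year' : [1, 1, 0, 0, 0, 1, 0, 0]
})
toy_y = pd.Series(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'])

# The pipeline fits the scaler on the training data, then feeds scaled values to KNN
scaled_knn = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors = 3, weights = 'uniform')
)
scaled_knn.fit(toy_X, toy_y)

toy_predictions = scaled_knn.predict(toy_X)
print(toy_predictions)
```

Keeping the scaler inside the pipeline means the same transformation learned on train is reused when predicting on validate or test.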

---

## Testing Our Models on Validate

Now let's see how each model performs on the out of sample validate set. The best performing model here will be the one we move forward with.

In [28]:
measure_model_performance(
    y_validate,
    create_baseline_model(y_validate),
    model_1.predict(X_validate),
    model_2.predict(X_validate),
    model_3.predict(X_validate),
    positive_label = 'Yes'
)

| model | accuracy | precision | recall |
|---|---|---|---|
| 0 | 0.734005 | 0.0 | 0.0 |
| 1 | 0.78436 | 0.595937 | 0.587973 |
| 2 | 0.797986 | 0.706107 | 0.412027 |
| 3 | 0.777844 | 0.637037 | 0.383073 |


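The decision tree model has the best recall on validate: it correctly identifies about 59% of churned customers (0.588), only a slight drop from the train set (0.607). The random forest has slightly higher accuracy (0.798 vs. 0.784), but since we are optimizing for recall first, the decision tree is the best performing model. The k nearest neighbors model drops the most from train to validate and identifies only about 38% of churned customers.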

---

## Looking Under the Hood

Before concluding let's look at how the decision tree is making its decisions.

In [29]:
# It's going to be difficult to read so scroll down for the summary
print(export_text(model_1, feature_names = X_train.columns.tolist()))

|--- contract_type_two_year <= 0.50
|   |--- contract_type_one_year <= 0.50
|   |   |--- monthly_charges <= 67.70
|   |   |   |--- tenure <= 3.50
|   |   |   |   |--- monthly_charges <= 24.55
|   |   |   |   |   |--- class: No
|   |   |   |   |--- monthly_charges >  24.55
|   |   |   |   |   |--- class: Yes
|   |   |   |--- tenure >  3.50
|   |   |   |   |--- monthly_charges <= 25.32
|   |   |   |   |   |--- class: No
|   |   |   |   |--- monthly_charges >  25.32
|   |   |   |   |   |--- class: No
|   |   |--- monthly_charges >  67.70
|   |   |   |--- tenure <= 6.50
|   |   |   |   |--- tenure <= 1.50
|   |   |   |   |   |--- class: Yes
|   |   |   |   |--- tenure >  1.50
|   |   |   |   |   |--- class: Yes
|   |   |   |--- tenure >  6.50
|   |   |   |   |--- tenure <= 29.50
|   |   |   |   |   |--- class: Yes
|   |   |   |   |--- tenure >  29.50
|   |   |   |   |   |--- class: No
|   |--- contract_type_one_year >  0.50
|   |   |--- monthly_charges <= 98.12
|   |   |   |--- monthly_cha

This is a lot to digest so let's break down some of the key takeaways:
- Most of the "Yes" predictions come from customers on a month-to-month contract (neither contract_type_one_year nor contract_type_two_year) with monthly_charges > 67.70
- Most of the remaining "Yes" predictions are driven by high monthly charges
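
Since the printed tree is long, a complementary view is the fitted tree's `feature_importances_` attribute, which summarizes how much each feature contributed to the splits. The sketch below fits a tree on made-up rows (not the Telco data), so the values are illustrative only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows, for illustration only
toy_X = pd.DataFrame({
    'monthly_charges' : [20.0, 25.0, 70.0, 95.0, 100.0, 30.0, 85.0, 45.0],
    'tenure' : [60, 48, 2, 1, 5, 72, 3, 24],
    'contract_type_two_year' : [1, 1, 0, 0, 0, 1, 0, 0]
})
toy_y = pd.Series(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'])

toy_tree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, random_state = 24)
toy_tree.fit(toy_X, toy_y)

# Importances are normalized to sum to 1; larger values mean the feature drove more of the splits
importances = pd.Series(
    toy_tree.feature_importances_,
    index = toy_X.columns
).sort_values(ascending = False)
print(importances)
```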

Likely the best recommendation here is that whenever a customer is identified as likely to churn, they should be offered a discounted rate, which should make them less likely to churn.

---

## Conclusion

We choose the decision tree as our best model. In the final report notebook we will run a final check on this model to see how it performs on the test dataset and create a csv file of predictions.