# Modelling Customer Churn 1 - Getting and Cleaning Data

## A. P. Young

### 2022-04-02

In [1]:
import time
START = time.time()
import datetime
from datetime import timedelta
import os
import pandas as pd
from collections import Counter

# Introduction

We want to model and understand the reasons behind customer churn. **Customer churn** occurs when customers voluntarily stop using a business's services within a given time window; it is usually expressed as a rate or a percentage. Acquiring new customers is usually more expensive than keeping existing ones. Therefore, measuring, modelling and understanding customer churn is important for businesses, especially in industries where the churn rate is high (e.g. mobile phones, insurance, internet service providers, etc.).

For more details, see https://www.investopedia.com/terms/c/churnrate.asp (last accessed 3/4/2022)

The dataset is from Kaggle - https://www.kaggle.com/datasets/blastchar/telco-customer-churn (last accessed 2/4/2022).

# Get Data

We have already downloaded the data from the above link into this directory and renamed the file.

In [2]:
data_filename = 'customer_churn.csv'

The file is a single CSV file of less than 1 MB. We load the table into memory.

In [3]:
data = pd.read_csv(data_filename)

This dataset has $n = 7043$ rows and $21$ columns.

In [4]:
data.shape

(7043, 21)

We can inspect the first five rows.

In [5]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Data Dictionary

We have a data dictionary from the Kaggle link above, which explains all columns.

The independent variables (features) are grouped into several categories:

  1. `customerID` is the customer ID.
  
We then have attributes of the customers:
  
  2. `gender` denotes whether the customer is male or female.
  
  3. `SeniorCitizen` denotes whether the customer is a senior citizen, $\in\{0,1\}$.
  
  4. `Partner` denotes whether the customer has a partner or not.
  
  5. `Dependents` denotes whether a customer has dependents or not.

We then have the customers' relationships with the business in terms of the services used:

  6. `tenure` denotes the number of months the customer has stayed with the company / used the business' services.
  
  7. `PhoneService` denotes whether the customer has a phone service or not.
  
  8. `MultipleLines` denotes whether the customer has multiple lines or not, given that the customer has phone service.
  
  9. `InternetService` denotes the customer's internet service provider.
  
  10. `OnlineSecurity` denotes whether the customer has online security, given that the customer has internet service.
  
  11. `OnlineBackup` denotes whether the customer has online backup, given that the customer has internet service.
  
  12. `DeviceProtection` denotes whether the customer has device protection, given that the customer has internet service.
  
  13. `TechSupport` denotes whether the customer has tech support, given that the customer has internet service.
  
  14. `StreamingTV` denotes whether the customer has streaming TV, given that the customer has internet service.
  
  15. `StreamingMovies` denotes whether the customer has streaming movies, given that the customer has internet service.
  
We then have financial information:
  
  16. `Contract` denotes whether the customer has a monthly contract, a one-year contract or a two-year contract.
  
  17. `PaperlessBilling` denotes whether the customer has paperless billing or not.
  
  18. `PaymentMethod` denotes how the customer is paying.
  
  19. `MonthlyCharges` is the amount charged to the customer monthly.
  
  20. `TotalCharges` is the total amount charged to the customer so far.

The dependent variable (target) is `Churn` which denotes whether the customer has churned or not; this is the last column.

# Data Science Questions

  1. How can we accurately model customer churn given this dataset?
  
  2. What are the most important features that influence customer churn?

# Clean Data

We clean the data by examining which features to keep and how all features can be represented numerically. The principles we apply are as follows:

  1. If a column's values are totally unique, i.e. the column has as many distinct values as it has rows, then it acts as an identifier with no predictive power and we delete it.
  
  2. If a column only has two values which are variations of `yes` and `no`, we replace these values respectively with $1$ and $0$.
  
  3. If a column is categorical, we make a one-hot representation of that column.
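
As a rough illustration of these three principles, the sketch below suggests an encoding for each column of a small toy dataframe. The function name `suggest_encoding`, the threshold `max_categories` and the toy data are assumptions made for this example and are not part of the notebook's pipeline. A numeric column whose values all happen to be distinct would also be flagged for dropping, so the suggestions still need checking by eye.

```python
import pandas as pd

def suggest_encoding(column: pd.Series, max_categories: int = 10) -> str:
    """Suggest how to treat a column (hypothetical helper for illustration)."""
    n_distinct = column.nunique(dropna=False)
    if n_distinct == len(column):
        # as many distinct values as rows: behaves like an identifier
        return "drop"
    if n_distinct == 2:
        # two values, e.g. Yes / No: map to 1 and 0
        return "binary"
    if n_distinct <= max_categories:
        # a handful of values: treat as categorical
        return "one-hot"
    # many distinct values: probably numerical
    return "numeric?"

# toy data, not the Kaggle dataset
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "flag": ["Yes", "No", "Yes", "No"],
    "plan": ["x", "y", "z", "x"],
})
suggestions = {name: suggest_encoding(toy[name]) for name in toy.columns}
print(suggestions)
```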
  
We first inspect these properties by eye: we iterate over each feature, note whether its values are totally unique and, if they are not, print the values and their counts.

In [6]:
for feature in list(data):
    print(feature)
    column = data[feature]
    count_of_values = Counter(column)
    # a column is totally unique when it has as many distinct values as rows
    if len(count_of_values) == len(column):
        print('Totally Unique values')
    # we use 10 as a rough heuristic
    # i.e. it is unlikely for this data that a categorical
    # variable has more than 10 categories
    elif len(count_of_values) > 10:
        print('Too many distinct values to list, perhaps this variable is numerical?')
    else:
        print(count_of_values)
    print()

customerID
Totally Unique values

gender
Counter({'Male': 3555, 'Female': 3488})

SeniorCitizen
Counter({0: 5901, 1: 1142})

Partner
Counter({'No': 3641, 'Yes': 3402})

Dependents
Counter({'No': 4933, 'Yes': 2110})

tenure
Too many distinct values to list, perhaps this variable is numerical?

PhoneService
Counter({'Yes': 6361, 'No': 682})

MultipleLines
Counter({'No': 3390, 'Yes': 2971, 'No phone service': 682})

InternetService
Counter({'Fiber optic': 3096, 'DSL': 2421, 'No': 1526})

OnlineSecurity
Counter({'No': 3498, 'Yes': 2019, 'No internet service': 1526})

OnlineBackup
Counter({'No': 3088, 'Yes': 2429, 'No internet service': 1526})

DeviceProtection
Counter({'No': 3095, 'Yes': 2422, 'No internet service': 1526})

TechSupport
Counter({'No': 3473, 'Yes': 2044, 'No internet service': 1526})

StreamingTV
Counter({'No': 2810, 'Yes': 2707, 'No internet service': 1526})

StreamingMovies
Counter({'No': 2785, 'Yes': 2732, 'No internet service': 1526})

Contract
Counter({'Month-to-month':

We can see that:

  1. `customerID` has unique values, so it has no predictive power and we remove it from our independent variables.
  
  2. `gender` is a categorical variable and we could split it using a one-hot representation into two variables `is_male` and `is_female`. However, as `gender` takes only two values in this data, these two variables would be perfectly anticorrelated. Therefore, we will translate `gender` to `is_female` ($1$ = `Female`, $0$ = `Male`).

  3. `SeniorCitizen`, `Partner`, `Dependents`, `PhoneService`, `PaperlessBilling` and `Churn` are already Boolean variables. `SeniorCitizen` is already coded as $0$ and $1$; for the others we will replace `No` with $0$ and `Yes` with $1$.
  
  4. We will rename `tenure` to `tenure/month` and keep the integer values.
  
  5. `MonthlyCharges` is already a float column, which we will keep. However, the values of `TotalCharges` are strings of floats, and some of them are blank (a single space) rather than a number. We investigate these blank values below and fix them.

These are not all the features - we will come to features such as `StreamingTV` and `PaymentMethod` in a moment.

We will perform the above transformations. First, we will drop `customerID`.

In [7]:
data = data.drop(columns = 'customerID')

We will change `gender` into `is_female` and make the appropriate numerical replacements. We write a helper function to do so:

In [8]:
def change_into_zero_one_column(mydf, col_name, pos, neg):
    """
    Input a dataframe, column name string (present in the dataframe),
    positive value string and negative value string
    Output same dataframe (does not mutate input)
    with positive values replaced with 1 and negative values
    replaced with 0
    Of course this is only useful if the column is Boolean
    """
    answer = mydf.copy()
    mycol = answer[col_name]
    mycol = mycol.replace(pos, 1)
    mycol = mycol.replace(neg, 0)
    answer[col_name] = mycol
    return answer
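
As an alternative sketch, the same replacement can be written with `Series.map`, which also lets us fail loudly if the column contains anything other than the two expected values. The function name `encode_binary` and the toy series are assumptions for illustration; the notebook itself uses the helper above.

```python
import pandas as pd

def encode_binary(column: pd.Series, pos, neg) -> pd.Series:
    """Map pos to 1 and neg to 0 (hypothetical helper for illustration)."""
    unexpected = set(column.unique()) - {pos, neg}
    if unexpected:
        # refuse to guess how to encode values we did not anticipate
        raise ValueError(f"unexpected values: {unexpected}")
    # the explicit cast keeps the result an integer column
    return column.map({pos: 1, neg: 0}).astype(int)

# toy data, not the Kaggle dataset
toy = pd.Series(["Yes", "No", "No", "Yes"])
encoded = encode_binary(toy, "Yes", "No")
print(encoded.tolist())
```

The explicit cast means the resulting dtype does not depend on how `replace` infers types, which is one reason to prefer this form.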

Check the frequency of values:

In [9]:
Counter(data['gender'])

Counter({'Female': 3488, 'Male': 3555})

We change the string values into integers.

In [10]:
data = change_into_zero_one_column(data, 'gender', 'Female', 'Male')

We rename the column accordingly, and we also change `tenure` to `tenure/month`.

In [11]:
data = data.rename(columns = {'gender' : 'is_female',
                              'tenure' : 'tenure/month'})

We check the resulting values:

In [12]:
Counter(data['is_female'])

Counter({1: 3488, 0: 3555})

Notice these counts match those of `gender`, so the replacement should be correct.

We now perform step 3 above for other Boolean features. `SeniorCitizen` already has an integer representation so we will not change it.

In [13]:
boolean_columns = ['Partner', 'Dependents', 'PhoneService',
                   'PaperlessBilling', 'Churn']

We verify that for all of these columns, we only have `No` and `Yes`.

In [14]:
for feature in boolean_columns:
    print(feature)
    print(Counter(data[feature]))
    print()

Partner
Counter({'No': 3641, 'Yes': 3402})

Dependents
Counter({'No': 4933, 'Yes': 2110})

PhoneService
Counter({'Yes': 6361, 'No': 682})

PaperlessBilling
Counter({'Yes': 4171, 'No': 2872})

Churn
Counter({'No': 5174, 'Yes': 1869})



We make the changes:

In [15]:
for feature in boolean_columns:
    data = change_into_zero_one_column(data, feature, 'Yes', 'No')

We check the columns:

In [16]:
for feature in boolean_columns:
    print(feature)
    print(Counter(data[feature]))
    print()

Partner
Counter({0: 3641, 1: 3402})

Dependents
Counter({0: 4933, 1: 2110})

PhoneService
Counter({1: 6361, 0: 682})

PaperlessBilling
Counter({1: 4171, 0: 2872})

Churn
Counter({0: 5174, 1: 1869})



The counts match exactly so we assume the transformation is correct.
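
Matching counts are a necessary check but not a sufficient one: a replacement that swapped the two values in a balanced column would leave the counts unchanged. A stricter row-by-row comparison can be sketched as below, assuming we still hold a copy of the column from before the change; the toy series are made up for this example.

```python
import pandas as pd

mapping = {"Yes": 1, "No": 0}

# toy data, not the Kaggle dataset
before = pd.Series(["Yes", "No", "No", "Yes"])
after_good = pd.Series([1, 0, 0, 1])
after_swapped = pd.Series([0, 1, 1, 0])  # same counts, wrong rows

def rowwise_match(before, after, mapping):
    # compare each row of the mapped original against the transformed column
    return bool((before.map(mapping) == after).all())

good_ok = rowwise_match(before, after_good, mapping)
swapped_ok = rowwise_match(before, after_swapped, mapping)
# the value counts alone cannot tell the two results apart
same_counts = (after_good.value_counts().to_dict()
               == after_swapped.value_counts().to_dict())
print(good_ok, swapped_ok, same_counts)
```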

We now check that all values of `MonthlyCharges` are floats:

In [17]:
Counter([type(item) for item in data['MonthlyCharges']])

Counter({float: 7043})

We know that `TotalCharges` contains strings of floats, but it also contains strings that cannot be parsed as floats. They are:

In [18]:
total_charges_non_float_strings = []
for item in data['TotalCharges']:
    try:
        float(item)
    except ValueError:
        # collect only the strings that cannot be parsed as floats
        total_charges_non_float_strings.append(item)

In [19]:
Counter(total_charges_non_float_strings)

Counter({' ': 11})

So the only non-float string is a single space ` `. What does this mean? Consider the sub-dataframe:

In [20]:
data[data['TotalCharges'] == ' ']

Unnamed: 0,is_female,SeniorCitizen,Partner,Dependents,tenure/month,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,1,0,1,1,0,0,No phone service,DSL,Yes,No,Yes,Yes,Yes,No,Two year,1,Bank transfer (automatic),52.55,,0
753,0,0,0,1,0,1,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,20.25,,0
936,1,0,1,1,0,1,No,DSL,Yes,Yes,Yes,No,Yes,Yes,Two year,0,Mailed check,80.85,,0
1082,0,0,1,1,0,1,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,25.75,,0
1340,1,0,1,1,0,0,No phone service,DSL,Yes,Yes,Yes,Yes,Yes,No,Two year,0,Credit card (automatic),56.05,,0
3331,0,0,1,1,0,1,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,19.85,,0
3826,0,0,1,1,0,1,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,25.35,,0
4380,1,0,1,1,0,1,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,20.0,,0
5218,0,0,1,1,0,1,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,1,Mailed check,19.7,,0
6670,1,0,1,1,0,1,Yes,DSL,No,Yes,Yes,Yes,Yes,No,Two year,0,Mailed check,73.35,,0


Notice that none of these customers have churned, and each row shown has a tenure of $0$ months. As churn is measured across a period of time, it is *plausible* that these customers joined less than one month before the data was collected, so they have not been charged yet despite having a valid value in `MonthlyCharges`.

**N.B.** This is just speculation.

We will replace `TotalCharges` with floats, where ` ` is $0$ as these customers have not yet been charged (this is an assumption).

In [21]:
new_total_charges = []
for item in data['TotalCharges']:
    if item == ' ':
        item2 = 0.0  # assumption: blank means not yet charged
    else:
        item2 = float(item)
    new_total_charges.append(item2)

In [22]:
data['TotalCharges'] = new_total_charges

We now check the type of the column - everything should be a float.

In [23]:
Counter([type(item) for item in data['TotalCharges']])

Counter({float: 7043})
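
The same conversion can be sketched with `pd.to_numeric`, which turns unparseable strings into `NaN` when called with `errors="coerce"`. The toy series below is an assumption for illustration. Because coercion silently hides any malformed string, not only blanks, it is worth counting the resulting `NaN` values before filling them.

```python
import pandas as pd

# toy data, not the Kaggle dataset
raw = pd.Series(["29.85", " ", "1889.5", " "])

# unparseable entries (here, the single spaces) become NaN
as_numbers = pd.to_numeric(raw, errors="coerce")
n_blank = int(as_numbers.isna().sum())

# assumption carried over from the text: blank means not yet charged
cleaned = as_numbers.fillna(0.0)
print(n_blank, cleaned.tolist())
```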

We now return to the other features.

Notice that `MultipleLines`, specifically its value `No phone service`, is determined by `PhoneService`:

In [24]:
Counter(zip(data['PhoneService'], data['MultipleLines']))

Counter({(0, 'No phone service'): 682, (1, 'No'): 3390, (1, 'Yes'): 2971})
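
The same joint counts can also be laid out as a table with `pd.crosstab`, which makes empty combinations visible as zeros. This is a sketch on toy data, not the notebook's dataframe.

```python
import pandas as pd

# toy data, not the Kaggle dataset
toy = pd.DataFrame({
    "PhoneService": [0, 1, 1, 1],
    "MultipleLines": ["No phone service", "No", "Yes", "No"],
})

# rows: PhoneService values; columns: MultipleLines values
table = pd.crosstab(toy["PhoneService"], toy["MultipleLines"])
print(table)
```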

If `PhoneService` is $0$, the customer cannot have multiple lines, so `MultipleLines` should also be $0$. If `PhoneService` is $1$, then `MultipleLines` can be either $0$ or $1$. We therefore replace all `Yes` with $1$ and all other values with $0$ via the helper function below:

In [25]:
def change_column_value_into_one_and_rest_zero(mydf, col_name, pos):
    """
    Input a dataframe, column name string (present)
    and positive value string
    Output same dataframe (does not mutate input)
    with positive values replaced with 1 and ALL OTHER VALUES
    replaced with 0
    """
    answer = mydf.copy()
    # compare every entry against pos in one vectorised step:
    # True becomes 1 and everything else becomes 0
    answer[col_name] = (answer[col_name] == pos).astype(int)
    return answer

In [26]:
data = change_column_value_into_one_and_rest_zero(data, 'MultipleLines', 'Yes')

The counts are the same, so this replacement is correct:

In [27]:
Counter(zip(data['PhoneService'], data['MultipleLines']))

Counter({(0, 0): 682, (1, 0): 3390, (1, 1): 2971})

We now group together all internet service features:

In [28]:
internet_service_features = ['OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies']

We count how the values of these service features co-occur with the values of `InternetService`.

In [29]:
for feature in internet_service_features:
    print(feature)
    calculation = Counter(zip(data['InternetService'], data[feature]))
    for key, value in calculation.items():
        print(key, value)
    print()

OnlineSecurity
('DSL', 'No') 1241
('DSL', 'Yes') 1180
('Fiber optic', 'No') 2257
('No', 'No internet service') 1526
('Fiber optic', 'Yes') 839

OnlineBackup
('DSL', 'Yes') 1086
('DSL', 'No') 1335
('Fiber optic', 'No') 1753
('Fiber optic', 'Yes') 1343
('No', 'No internet service') 1526

DeviceProtection
('DSL', 'No') 1356
('DSL', 'Yes') 1065
('Fiber optic', 'No') 1739
('Fiber optic', 'Yes') 1357
('No', 'No internet service') 1526

TechSupport
('DSL', 'No') 1243
('DSL', 'Yes') 1178
('Fiber optic', 'No') 2230
('Fiber optic', 'Yes') 866
('No', 'No internet service') 1526

StreamingTV
('DSL', 'No') 1464
('Fiber optic', 'No') 1346
('Fiber optic', 'Yes') 1750
('No', 'No internet service') 1526
('DSL', 'Yes') 957

StreamingMovies
('DSL', 'No') 1440
('Fiber optic', 'No') 1345
('Fiber optic', 'Yes') 1751
('No', 'No internet service') 1526
('DSL', 'Yes') 981



By inspection, the value `No internet service` appears in these features exactly when the customer has no internet service. Therefore, for all of these features, we replace `Yes` with $1$, and both `No` and `No internet service` with $0$.

In [30]:
for feature in internet_service_features:
    data = change_column_value_into_one_and_rest_zero(data, feature, 'Yes')

We check the counts:

In [31]:
for feature in internet_service_features:
    print(feature)
    calculation = Counter(zip(data['InternetService'], data[feature]))
    for key, value in calculation.items():
        print(key, value)
    print()

OnlineSecurity
('DSL', 0) 1241
('DSL', 1) 1180
('Fiber optic', 0) 2257
('No', 0) 1526
('Fiber optic', 1) 839

OnlineBackup
('DSL', 1) 1086
('DSL', 0) 1335
('Fiber optic', 0) 1753
('Fiber optic', 1) 1343
('No', 0) 1526

DeviceProtection
('DSL', 0) 1356
('DSL', 1) 1065
('Fiber optic', 0) 1739
('Fiber optic', 1) 1357
('No', 0) 1526

TechSupport
('DSL', 0) 1243
('DSL', 1) 1178
('Fiber optic', 0) 2230
('Fiber optic', 1) 866
('No', 0) 1526

StreamingTV
('DSL', 0) 1464
('Fiber optic', 0) 1346
('Fiber optic', 1) 1750
('No', 0) 1526
('DSL', 1) 957

StreamingMovies
('DSL', 0) 1440
('Fiber optic', 0) 1345
('Fiber optic', 1) 1751
('No', 0) 1526
('DSL', 1) 981



This is fine as the counts match. We now change `InternetService` into the following representation:

  1. `internet_service_is_DSL` as a Boolean variable.
  
  2. `internet_service_is_fiber_optic` as a Boolean variable.
  
If both of these values are zero, the customer has no internet service (`No`).

In [32]:
Counter(data['InternetService'])

Counter({'DSL': 2421, 'Fiber optic': 3096, 'No': 1526})

In [33]:
internet_service_is_dsl = change_column_value_into_one_and_rest_zero(data, 'InternetService', 'DSL')['InternetService']
internet_service_is_fiber_optic = change_column_value_into_one_and_rest_zero(data, 'InternetService', 'Fiber optic')['InternetService']

We replace `InternetService` with both of these columns by deleting `InternetService` and inserting those columns at the same position.

In [34]:
list(data).index('InternetService')

7

In [35]:
data = data.drop(columns = 'InternetService')
data.insert(7, 'internet_service_is_fiber_optic', internet_service_is_fiber_optic)
data.insert(7, 'internet_service_is_dsl', internet_service_is_dsl)

Notice the counts are the same as above, so we preserve the same information.

In [36]:
Counter(zip(data['internet_service_is_dsl'], data['internet_service_is_fiber_optic']))

Counter({(1, 0): 2421, (0, 1): 3096, (0, 0): 1526})
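
To illustrate why two columns are enough for three categories, the sketch below decodes the pair of flags back into the original labels. The toy dataframe is an assumption for this example; the column names follow the ones used above.

```python
import pandas as pd

# toy data, not the Kaggle dataset
toy = pd.DataFrame({
    "internet_service_is_dsl": [1, 0, 0],
    "internet_service_is_fiber_optic": [0, 1, 0],
})

# at most one of the two flags may be set in any row
at_most_one = bool((toy.sum(axis=1) <= 1).all())

# start from the reference category and overwrite where a flag is set
decoded = pd.Series("No", index=toy.index)
decoded[toy["internet_service_is_dsl"] == 1] = "DSL"
decoded[toy["internet_service_is_fiber_optic"] == 1] = "Fiber optic"
print(decoded.tolist())
```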

We look at `Contract`:

In [37]:
Counter(data['Contract'])

Counter({'Month-to-month': 3875, 'One year': 1473, 'Two year': 1695})

This is a categorical variable, which we change into a one-hot representation with the following helper functions:

In [38]:
def change_categorical_column_to_one_hot_dataframe(mydf, column):
    """
    Input dataframe and column string that is a categorical variable
    Output dataframe of one-hot representation of that column
    """
    mycol = mydf[column]
    values_counts = Counter(mycol)
    answer = {}
    for value in values_counts.keys():
        value2 = value.replace(' ', '_').lower()
        column_as_value = column + "_is_" + value2
        new_column = change_column_value_into_one_and_rest_zero(mydf, column, value)[column]
        answer[column_as_value] = new_column
    return pd.DataFrame(answer)

In [39]:
def change_categorical_column_to_one_hot_dataframe_incorporated(mydf, column):
    """
    Repeat the preceding function
    But keep other columns intact
    """
    column_position = list(mydf).index(column)
    left_df = mydf[list(mydf)[:column_position]]
    middle = change_categorical_column_to_one_hot_dataframe(mydf, column)
    right_df = mydf[list(mydf)[column_position + 1:]]
    answer = pd.concat([left_df, middle, right_df], axis = 1)
    return answer

We change the `Contract` column of the dataframe accordingly:

In [40]:
data = change_categorical_column_to_one_hot_dataframe_incorporated(data, 'Contract')

We do the same to `PaymentMethod`:

In [41]:
Counter(data['PaymentMethod'])

Counter({'Electronic check': 2365,
         'Mailed check': 1612,
         'Bank transfer (automatic)': 1544,
         'Credit card (automatic)': 1522})

We get rid of `(automatic)` in the values.

In [42]:
# restrict the replacement to the PaymentMethod column
data['PaymentMethod'] = data['PaymentMethod'].replace({
    'Bank transfer (automatic)': 'Bank transfer',
    'Credit card (automatic)': 'Credit card'})

The counts are the same.

In [43]:
Counter(data['PaymentMethod'])

Counter({'Electronic check': 2365,
         'Mailed check': 1612,
         'Bank transfer': 1544,
         'Credit card': 1522})

We put the new categorical columns back into the data:

In [44]:
data = change_categorical_column_to_one_hot_dataframe_incorporated(data, 'PaymentMethod')

In [45]:
data

Unnamed: 0,is_female,SeniorCitizen,Partner,Dependents,tenure/month,PhoneService,MultipleLines,internet_service_is_dsl,internet_service_is_fiber_optic,OnlineSecurity,...,Contract_is_one_year,Contract_is_two_year,PaperlessBilling,PaymentMethod_is_electronic_check,PaymentMethod_is_mailed_check,PaymentMethod_is_bank_transfer,PaymentMethod_is_credit_card,MonthlyCharges,TotalCharges,Churn
0,1,0,1,0,1,0,0,1,0,0,...,0,0,1,1,0,0,0,29.85,29.85,0
1,0,0,0,0,34,1,0,1,0,1,...,1,0,0,0,1,0,0,56.95,1889.50,0
2,0,0,0,0,2,1,0,1,0,1,...,0,0,1,0,1,0,0,53.85,108.15,1
3,0,0,0,0,45,0,0,1,0,1,...,1,0,0,0,0,1,0,42.30,1840.75,0
4,1,0,0,0,2,1,0,0,1,0,...,0,0,1,1,0,0,0,70.70,151.65,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,0,1,1,24,1,1,1,0,1,...,1,0,1,0,1,0,0,84.80,1990.50,0
7039,1,0,1,1,72,1,1,0,1,0,...,1,0,1,0,0,0,1,103.20,7362.90,0
7040,1,0,1,1,11,0,0,1,0,1,...,0,0,1,1,0,0,0,29.60,346.45,0
7041,0,1,1,0,4,1,1,0,1,0,...,0,0,1,0,1,0,0,74.40,306.60,1


This cleans the data, where all values are appropriately represented by numbers. We do a final check with the data types:

In [46]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   is_female                          7043 non-null   int64  
 1   SeniorCitizen                      7043 non-null   int64  
 2   Partner                            7043 non-null   int64  
 3   Dependents                         7043 non-null   int64  
 4   tenure/month                       7043 non-null   int64  
 5   PhoneService                       7043 non-null   int64  
 6   MultipleLines                      7043 non-null   int64  
 7   internet_service_is_dsl            7043 non-null   int64  
 8   internet_service_is_fiber_optic    7043 non-null   int64  
 9   OnlineSecurity                     7043 non-null   int64  
 10  OnlineBackup                       7043 non-null   int64  
 11  DeviceProtection                   7043 non-null   int64

This confirms that there are no null values and that every column is numeric. We output this dataframe to be ingested in subsequent notebooks:

In [47]:
data.to_csv('customer_churn_cleaned.csv', index = False)
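
The visual check of `data.info()` can also be expressed as a small programmatic check, sketched below. The function name `check_fully_numeric` and the toy dataframe are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_fully_numeric(df: pd.DataFrame):
    """Return the non-numeric column names and the total number of nulls."""
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    n_null = int(df.isna().sum().sum())
    return non_numeric, n_null

# toy data, not the Kaggle dataset; 'bad' is deliberately non-numeric
toy = pd.DataFrame({
    "is_female": [1, 0],
    "TotalCharges": [29.85, 0.0],
    "bad": ["x", "y"],
})
result = check_fully_numeric(toy)
print(result)
```

A cleaned dataframe should give an empty list and a null count of zero.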

# Conclusion

We have obtained data about customer churn from Kaggle (see URL above), understood its features, the range of values each feature takes, and how they can be represented by numbers, e.g. $0$ and $1$ for Booleans, or one-hot representations for categorical variables. The result is a dataframe that has no null values and is fully numerical.

In the next notebook, we will visualise the distributions in this data and check for multicollinearity.

In [48]:
"Notebook done in " + str(datetime.timedelta(seconds=time.time() - START)) + "."

'Notebook done in 0:00:02.335729.'