## Telecom client churn forecast using Machine Learning

Interconnect telecom would like to forecast client churn. If a user is found to be planning to leave, they will be offered promotional codes and special plan options. Interconnect's marketing team has collected some of its clientele's personal data, including information about their plans and contracts.

### Interconnect's services

Interconnect mainly provides two types of services:

1. Landline communication. The telephone can be connected to several lines simultaneously.
2. Internet. The network can be set up via a telephone line (DSL, *digital subscriber line*) or through a fiber optic cable.

Some other services the company provides include:

- Internet security: antivirus software (*DeviceProtection*) and a malicious website blocker (*OnlineSecurity*)
- A dedicated technical support line (*TechSupport*)
- Cloud file storage and data backup (*OnlineBackup*)
- TV streaming (*StreamingTV*) and a movie directory (*StreamingMovies*)

The clients can choose either a monthly payment or sign a 1- or 2-year contract. They can use various payment methods and receive an electronic invoice after a transaction.

### Data Description

The data consists of files obtained from different sources:

- `contract.csv` — contract information
- `personal.csv` — the client's personal data
- `internet.csv` — information about Internet services
- `phone.csv` — information about telephone services

In each file, the column `customerID` contains a unique code assigned to each client.

The contract information is valid as of February 1, 2020.

### Objectives

The objectives of this project are to:
- Build a machine learning model to forecast Interconnect telecom's client churn
- Apply exploratory data analysis in determining whether special promotional services and plan options will discourage client churn
- Analyze the speed and quality of prediction, time required for training, etc.

<hr>

# Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#open_the_data">Open the data file and study the general information</a></li>
        <li><a href="#data_preprocessing">Data Preprocessing</a></li>
        <li><a href="#data_visualization">Exploratory Data Analysis</a></li>
        <li><a href="#model_training">Model Training</a></li>
        <li><a href="#model_testing">Model Testing</a></li>
        <li><a href="#model_analysis">Model Analysis</a></li>
        <li><a href="#overall_conclusion">Overall Conclusion</a></li>
    </ol>
</div>
<br>
<hr>

<div id="open_the_data">
    <h2>Open the data and study the general information</h2> 
</div>

We require the following libraries: *pandas* and *numpy* for data preprocessing and manipulation, *matplotlib* and *seaborn* for data visualization, and *scikit-learn*, *CatBoost*, *LightGBM* and *XGBoost* for building our machine learning models.

In [None]:
import numpy as np
import pandas as pd
import time
from datetime import datetime

# matplotlib for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# seaborn for statistical data visualization
import seaborn as sns

# import module for splitting and cross-validation using gridsearch
from sklearn.model_selection import train_test_split, GridSearchCV

# import modules for preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
pd.options.mode.chained_assignment = None # to avoid SettingWithCopyWarning after scaling

# import machine learning module from the sklearn library
from sklearn.dummy import DummyClassifier # import dummy classifier
from sklearn.tree import DecisionTreeClassifier # import decision tree classifier
from sklearn.linear_model import LogisticRegression # import logistic regression 
from sklearn.ensemble import RandomForestClassifier # import random forest algorithm
from catboost import CatBoostClassifier # import catboost classifier
from lightgbm import LGBMClassifier # import lightgbm classifier
from xgboost import XGBClassifier # import xgboost classifier

# import metrics for sanity check on model (explicit names instead of a wildcard import)
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

from IPython.display import display

# import warnings
import warnings
warnings.filterwarnings('ignore')

print('Project libraries have been successfully imported!')

In [None]:
# read the data
try:
    contract_data = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Final Project/final_provider/contract.csv')
    internet_data = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Final Project/final_provider/internet.csv')
    personal_data = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Final Project/final_provider/personal.csv')
    phone_data = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Final Project/final_provider/phone.csv')
except FileNotFoundError:  # fall back to the hosted copies when the local files are absent
    contract_data = pd.read_csv('https://code.s3.yandex.net/datasets/final_provider/contract.csv')
    internet_data = pd.read_csv('https://code.s3.yandex.net/datasets/final_provider/internet.csv')
    personal_data = pd.read_csv('https://code.s3.yandex.net/datasets/final_provider/personal.csv')
    phone_data = pd.read_csv('https://code.s3.yandex.net/datasets/final_provider/phone.csv')
print('Data has been read correctly!')

In [None]:
# function to determine if columns in file have null values
def get_percent_of_na(df, num):
    count = 0
    df = df.copy()
    s = (df.isna().sum() / df.shape[0])
    for column, percent in zip(s.index, s.values):
        num_of_nulls = df[column].isna().sum()
        if num_of_nulls == 0:
            continue
        else:
            count += 1
        print('Column {} has {:.{}%} nulls ({} null values)'.format(column, percent, num, num_of_nulls))
    if count != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(count) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")
        
# function to display general information about the dataset
def get_info(df):
    """
    This function uses the head(), info(), describe() and duplicated() methods
    and the shape attribute to display the general information about the dataset.
    """
    print("\033[1m" + '-'*100 + "\033[0m")
    print('Head:')
    print()
    display(df.head())
    print('-'*100)
    print('Info:')
    print()
    df.info()  # info() prints its report directly and returns None
    print('-'*100)
    print('Describe:')
    print()
    display(df.describe())
    print('-'*100)
    display(df.describe(include='object'))
    print()
    print('Columns with nulls:')
    get_percent_of_na(df, 4)  # prints its own report and returns None
    print('-'*100)
    print('Shape:')
    print(df.shape)
    print('-'*100)
    print('Duplicated:')
    print("\033[1m" + 'We have {} duplicated rows.\n'.format(df.duplicated().sum()) + "\033[0m")
    print()

In [None]:
# study the general information about the contract dataset 
print('General information about the contract dataset')
get_info(contract_data)

In [None]:
# study the general information about the internet dataset 
print('General information about the internet dataset')
get_info(internet_data)

In [None]:
# study the general information about the personal dataset 
print('General information about the personal dataset')
get_info(personal_data)

In [None]:
# study the general information about the phone dataset 
print('General information about the phone dataset')
get_info(phone_data)

#### Conclusion

By looking at the general information about the data, we find that:
 - `contract_data` has 7043 rows and 8 columns with no missing values and no duplicated values
 - `internet_data` has 5517 rows and 8 columns with no missing values and no duplicated values
 - `personal_data` has 7043 rows and 5 columns with no missing values and no duplicated values
 - `phone_data` has 6361 rows and 2 columns with no missing values and no duplicated values
 
We need to convert several columns to the right datatype. For instance, in `contract_data`, `BeginDate` and `EndDate` should be `datetime` and `TotalCharges` should be `float`. We also need to preprocess the data and generate new features for machine learning.

<div id="data_preprocessing">
    <h2>Data Preprocessing</h2> 
</div>

In this section, we wrangle the data: we merge the datasets, rename columns, change datatypes and perform feature engineering.

### Merge Datasets

Before we begin to preprocess the data, we can merge all the individual datasets into one dataframe using the `merge()` function in pandas.
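
One point to keep in mind when merging: `pd.merge()` defaults to an inner join, so a customer who is missing from `phone.csv` or `internet.csv` is dropped from the result. The sketch below uses small made-up frames (not the project data) to show how `how='left'` keeps every contract row. The `'NoService'` fill value is an assumption chosen purely for illustration.

```python
import pandas as pd

# toy stand-ins for the real files (hypothetical values)
contract = pd.DataFrame({'customerID': ['A', 'B', 'C'],
                         'Type': ['Month-to-month', 'One year', 'Two year']})
phone = pd.DataFrame({'customerID': ['A', 'B'], 'MultipleLines': ['No', 'Yes']})
internet = pd.DataFrame({'customerID': ['A', 'C'], 'InternetService': ['DSL', 'Fiber optic']})

# default inner joins keep only customers present in every file
inner = contract.merge(phone, on='customerID').merge(internet, on='customerID')

# left joins keep every contract row; missing services appear as NaN and are then filled
left = (contract.merge(phone, on='customerID', how='left')
                .merge(internet, on='customerID', how='left')
                .fillna('NoService'))

print(len(inner), len(left))
```

Whether dropping or keeping customers without a phone or Internet plan is the right choice depends on the goal of the analysis; comparing the merged row count with the 7043 rows of `contract_data` makes the effect visible.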

In [None]:
# joining datasets 
merged_df = pd.merge(contract_data, personal_data, on="customerID")
merged_df1 = pd.merge(merged_df, phone_data, on="customerID")
merged_df2 = pd.merge(merged_df1, internet_data, on="customerID")
merged_df2.sample(5)

In [None]:
# create copy of dataset
telecom_df = merged_df2.copy()
telecom_df.info()

### Replace column names

The next step in data preprocessing is to rename the columns in our dataset.

In [None]:
# rename columns
telecom_df = telecom_df.rename(columns={'customerID': 'customer_id', 'BeginDate': 'begin_date', 'EndDate': 'end_date', 'Type': 'type',
       'PaperlessBilling': 'paperless_billing', 'PaymentMethod': 'payment_method', 'MonthlyCharges': 'monthly_charges', 'TotalCharges': 'total_charges',
       'gender': 'gender', 'SeniorCitizen': 'senior_citizen', 'Partner': 'partner', 'Dependents': 'dependents', 'MultipleLines': 'multiple_lines',
       'InternetService': 'internet_service', 'OnlineSecurity': 'online_security', 'OnlineBackup': 'online_backup', 'DeviceProtection': 'device_protection',
       'TechSupport': 'tech_support', 'StreamingTV': 'streaming_tv', 'StreamingMovies': 'streaming_movies'})

We renamed the columns to snake case so that the names are uniform and easier to read. The `rename()` method in pandas makes these changes.
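
As an alternative to listing every column by hand, the conversion to snake case could be done programmatically. The helper below, `to_snake_case`, is a hypothetical name introduced for this sketch; it inserts an underscore at each lower-to-upper case boundary and then lowercases the result.

```python
import re

def to_snake_case(name):
    """Insert '_' at each lowercase/digit-to-uppercase boundary, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

columns = ['customerID', 'BeginDate', 'PaperlessBilling', 'StreamingTV', 'gender']
renamed = [to_snake_case(c) for c in columns]
print(renamed)
```

Because `DataFrame.rename` accepts a function for `columns`, a call such as `telecom_df.rename(columns=to_snake_case)` would apply the same rule to every column.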

### Change Datatypes 

Next, we change datatypes to the right format. For instance, `begin_date` and `end_date` will be changed to `Datetime`, `monthly_charges` and `total_charges` to `float32`, `senior_citizen` to `int32` datatypes.
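
As a hedged sketch of these conversions, the example below works on two made-up rows rather than the project data. It assumes, as stated in the data description, that the contract information is valid as of February 1, 2020, and uses that date for customers whose `end_date` is `'No'`. `SNAPSHOT_DATE` and `toy` are names introduced only for this illustration.

```python
import pandas as pd

SNAPSHOT_DATE = pd.Timestamp('2020-02-01')  # date the contract data is valid as of

toy = pd.DataFrame({
    'begin_date': ['2019-01-01', '2020-02-01'],
    'end_date': ['2019-12-01 00:00:00', 'No'],
    'total_charges': ['340.5', ' '],
})

# blank strings cannot be parsed, so they become NaN and are then set to 0
toy['total_charges'] = pd.to_numeric(toy['total_charges'], errors='coerce').fillna(0)

# 'No' (still active) cannot be parsed as a date, so it becomes NaT, then the snapshot date
toy['end_date_value'] = pd.to_datetime(toy['end_date'], errors='coerce').fillna(SNAPSHOT_DATE)
toy['begin_date'] = pd.to_datetime(toy['begin_date'])

print(toy.dtypes)
```

The `errors='coerce'` option turns unparseable values into missing values instead of raising, which makes the handling of special markers explicit.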

In [None]:
# function to change data to the right type
def change_datatype(df, cols, type_val):
    for col in cols:
        df[col] = df[col].astype(type_val)

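# create new end date feature
# active customers ('No') are dated at the snapshot date, since the contract
# information is valid as of February 1, 2020
snapshot_date = datetime(2020, 2, 1)
list_value = []
for value in telecom_df.end_date:
    if value != 'No':
        datetime_value = datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
        list_value.append(datetime_value)
    else:
        list_value.append(snapshot_date)
EndDate_value = pd.to_datetime(list_value)
telecom_df.insert(3, 'end_date_value', EndDate_value)

# prepare total_charges: replace blank strings with 0 so the column can be cast to float
telecom_df.loc[telecom_df['total_charges'].isin([' ']),'total_charges'] = 0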

# change datatypes
change_datatype(telecom_df, ['begin_date'], 'datetime64[ns]')
change_datatype(telecom_df, ['monthly_charges', 'total_charges'], 'float32')
change_datatype(telecom_df, ['senior_citizen'], 'int32')

In [None]:
# check data information
telecom_df.info()

### Feature engineering

Here, we create new features: `tenure`, the length of the customer relationship in years; `exited`, the target, set to 1 if the customer has churned and 0 otherwise; `service_count`, the number of optional services the customer currently uses; `has_crcard`, indicating whether the customer pays by credit card; and `year`, `month` and `dayofweek`, taken from the date the customer began using Interconnect's services.

In [None]:
# change date type to datetime and split into day, month and year
def new_date_features(df):
    columns = df.columns.tolist()
    idx = [columns.index(x) for x in columns if 'begin_date' in x][0]
    
    df[columns[idx]] = pd.to_datetime(df[columns[idx]])
    df['dayofweek'] = df[columns[idx]].dt.day_name()
    df['month'] = df[columns[idx]].dt.month_name()
    df['year'] = df[columns[idx]].dt.year
    return df

In [None]:
# add new features to data
new_date_features(telecom_df)
telecom_df['tenure'] = telecom_df['end_date_value'].dt.year - telecom_df['begin_date'].dt.year
telecom_df['has_crcard'] = [1 if x == 'Credit card (automatic)' else 0 for x in telecom_df['payment_method']]
telecom_df['exited'] = [1 if x != 'No' else 0 for x in telecom_df['end_date']]
telecom_df['service_count'] = [x.count('Yes') for x in zip(telecom_df['online_security'], telecom_df['online_backup'], telecom_df['device_protection'], 
                                                           telecom_df['tech_support'], telecom_df['streaming_tv'], telecom_df['streaming_movies'])]
change_datatype(telecom_df, ['year', 'tenure', 'has_crcard', 'exited', 'service_count'], 'int32') # reduce memory usage by changing datatypes

In [None]:
# check dataframe
telecom_df.head()

Using list comprehensions, we generated new features that are relevant to the dataset: `tenure`, `exited`, `service_count`, `has_crcard`, `year`, `month` and `dayofweek`. These features give the machine learning model additional signal about each customer. We also avoid adding too many features, since an overly complex model has high variance: it *overfits* the training data and does not generalize well to the test data.
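
Since `tenure` is computed as a difference of calendar years, two customers with very different lengths of service can receive similar values. The sketch below, on made-up dates, contrasts the year difference with a day-level difference; `tenure_years` and `tenure_days` are hypothetical column names used only here.

```python
import pandas as pd

toy = pd.DataFrame({
    'begin_date': pd.to_datetime(['2019-12-15', '2018-02-01']),
    'end_date_value': pd.to_datetime(['2020-01-15', '2020-02-01']),
})

# calendar-year difference: one month of service already counts as 1 "year"
toy['tenure_years'] = toy['end_date_value'].dt.year - toy['begin_date'].dt.year

# day-level difference keeps the actual length of the relationship
toy['tenure_days'] = (toy['end_date_value'] - toy['begin_date']).dt.days

print(toy[['tenure_years', 'tenure_days']])
```

A finer-grained tenure may or may not help the models; it is shown here only to make the granularity of the year-based feature concrete.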

In [None]:
# recheck dataframe information
telecom_df.info()

#### Conclusion

We carried out data preprocessing to merge the datasets, rename columns, change datatypes, and generate new features for machine learning. We used the SQL-style `merge()` in pandas to combine the datasets. We renamed columns for improved readability and changed datatypes to the right format, which also reduces memory requirements during computation. We performed feature engineering to generate new features that are helpful for exploring the data and useful for our machine learning process. Now the data is ready for further exploration.

<div id="data_visualization">
    <h2>Exploratory Data Analysis</h2> 
</div>

In exploring the data, we ask a series of questions whose answers help us understand it.

### Which payment types and payment methods do Interconnect's customers use?

In [None]:
# unique payment type: count and percentage share of each contract type
# rename_axis/reset_index(name=...) gives the same column names across pandas versions
unique_payment_type = telecom_df['type'].value_counts().rename_axis('type').reset_index(name='unique count')
unique_payment_type['percentage split (%)'] = (
    unique_payment_type['unique count'] / unique_payment_type['unique count'].sum() * 100
).map('{:.2f}'.format)
unique_payment_type

In [None]:
# unique payment method: count and percentage share of each payment method
unique_payment_method = (telecom_df['payment_method'].value_counts()
                         .rename_axis('payment method').reset_index(name='count'))
unique_payment_method['% payment split'] = (
    unique_payment_method['count'] / unique_payment_method['count'].sum() * 100
).map('{:.2f}'.format)
unique_payment_method

From the analysis above, we see that most Interconnect customers prefer month-to-month payment, which accounts for 61% of contracts. Also, electronic check was the most frequently used of the available payment methods.

### Can we deduce a relationship between payment method and total charges?

In [None]:
# total charges grouped by payment method
total_charges_grouped = telecom_df.groupby('payment_method', as_index=False).agg({'total_charges': 'sum'}).sort_values(
    by='total_charges', ascending=False, ignore_index=True)
total_charges_grouped

Customers paying by electronic check had the highest total charges. With this knowledge, the marketing team can direct more campaigns at this group to encourage them to use more services. Customers who mail in checks, on the other hand, had the lowest total charges. Here, the marketing team can devise new campaigns to encourage these customers to adopt either the bank transfer method or the electronic check method. If customers who mail in checks switched to electronic check, we might see higher total customer charges, which would translate to more revenue for Interconnect telecom.

### Can we deduce a relationship between payment type and total monthly charges?

In [None]:
# total monthly charges grouped by payment type
(telecom_df.groupby('type', as_index=False)
     .agg({'monthly_charges': 'sum', 'total_charges': 'sum'})
     .sort_values(by='total_charges', ascending=False, ignore_index=True)
)

We can see that customers on a two-year contract bring in more total revenue than customers on a one-year contract. The marketing team at Interconnect can promote two-year plans to entice more customers to sign up for them.

### Services count by contract type

In [None]:
# services count grouped by contract type
(telecom_df.groupby('type', as_index=False)
     .agg({'service_count': 'sum'})
     .sort_values(by='service_count', ascending=False, ignore_index=True)
)

We observe that, in total, customers on month-to-month contracts use more services than customers on one-year contracts. This knowledge can inform advertising campaigns and marketing efforts.
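
A group total depends on how many customers are in the group, so it can help to look at the per-customer average alongside the sum. The sketch below uses made-up numbers (not the project data) in which the larger group has the higher total but the lower average.

```python
import pandas as pd

toy = pd.DataFrame({
    'type': ['Month-to-month'] * 6 + ['Two year'] * 2,
    'service_count': [1, 1, 2, 2, 2, 2, 4, 5],
})

# customers per group, total services, and average services per customer
summary = (toy.groupby('type')['service_count']
              .agg(customers='count', total_services='sum', mean_services='mean'))
print(summary)
```

Reading the `sum` and `mean` columns together separates the effect of group size from actual usage per customer.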

### Which gender has the most total charges and service count?

In [None]:
# total charges grouped by gender
gender_charges = telecom_df.groupby('gender', as_index=False).agg({'total_charges': 'sum', 'service_count': 'sum'}).sort_values(by='total_charges', ascending=False, ignore_index=True)
change_datatype(gender_charges, ['total_charges'], 'int32')
gender_charges['percent_total_charges'] = gender_charges['total_charges'] / sum(gender_charges['total_charges']) * 100
gender_charges

From the above, we can see that female customers contributed almost as much as male customers to the total charges and to Interconnect's revenue. In addition, female customers used more services than male customers, even though this did not translate into higher total charges.

In [None]:
# function to plot seaborn barplot
def plot_snsbar(df, x, y, title):
    xlabel = x.replace('_', ' ').capitalize()
    ylabel = y.replace('_', ' ').capitalize()
    # total of y per group; for the 0/1 `exited` column this is the number of churned customers
    # (count() would return the number of customers in each group instead)
    data = df.groupby([x])[y].sum().sort_values(ascending=False).reset_index()
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.barplot(x=x, y=y, data=data, ax=ax)
    ax.set_title(title, fontdict={'size': 12})
    ax.set_ylabel(ylabel, fontsize=10)
    ax.set_xlabel(xlabel, fontsize=10)
    # rotate the existing tick labels rather than overwriting them, so labels always match bars
    ax.tick_params(axis='x', labelrotation=90)

### Check correlation in data

In [None]:
# correlation matrix of numeric features
plt.figure(figsize=(8, 6))
corrMatrix = telecom_df.corr(numeric_only=True)  # skip string and datetime columns
sns.heatmap(corrMatrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Plot for certain features')
plt.show()

In [None]:
# correlation of tenure and churn
telecom_df.plot(
    x='tenure', y='exited', title = 'Hexagonal binning plot for correlation of tenure and churn', 
    kind='hexbin', gridsize=20, figsize=(8, 6), sharex=False, grid=True
);

From the correlation plot, we can see that there is a strong negative correlation between `tenure` and `exited` (churn): customers with shorter tenure are more likely to churn than well-established customers, and the longer a customer stays with Interconnect telecom, the less likely they are to churn. To prevent churn, Interconnect telecom should introduce promotions and expand its service offering in order to keep customers for longer. Whether a customer subscribed on a month-to-month basis did not really affect churn.
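
Because `exited` takes only the values 0 and 1, a table of churn rate per tenure value can be easier to read than a hexbin plot. The sketch below uses made-up rows (not the project data) to show both the per-tenure churn rate and the Pearson correlation between the two columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'tenure': [0, 0, 0, 1, 1, 3, 3, 5],
    'exited': [1, 1, 0, 1, 0, 0, 0, 0],
})

# share of churned customers at each tenure value (mean of a 0/1 column)
churn_rate_by_tenure = toy.groupby('tenure')['exited'].mean()

# Pearson correlation between a numeric feature and the 0/1 target
tenure_corr = toy['tenure'].corr(toy['exited'])

print(churn_rate_by_tenure)
print(round(tenure_corr, 3))
```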

In [None]:
# correlation of dependents on customer churn
telecom_df['has_dependents'] = [1 if x != 'No' else 0 for x in telecom_df['dependents']]
telecom_df.plot(
    x='has_dependents', y='exited', title = 'Hexagonal binning plot for correlation of dependents and customer churn', 
    kind='hexbin', gridsize=20, figsize=(8, 6), sharex=False, grid=True
);

We can see that more customers without dependents stayed with Interconnect telecom than customers with dependents. This is reasonable because having dependents tends to increase average expenses. It would make sense for Interconnect to target customers with fewer dependents.

### Can contract type affect customer churn?

In [None]:
# effect of contract type on customer churn
contract_type_percent = (telecom_df.groupby('type', as_index=False)
     .agg(exited=('exited', 'sum'), customers=('exited', 'count'))
     .sort_values(by='exited', ascending=False, ignore_index=True)
)
# churn rate within each contract type: churned customers / all customers of that type
contract_type_percent['% exit percent'] = (
    contract_type_percent['exited'] / contract_type_percent['customers'] * 100
).round(2)
contract_type_percent

In [None]:
# plot of contract type on customer churn
plot_snsbar(telecom_df, 'type', 'exited', 'Plot of contract type on customer churn')

We visualized the contract type to see whether customers with shorter contracts churn faster than customers with year-long contracts. Our analysis shows that customers on two-year contracts tend to stay longer, while customers on month-to-month contracts churned faster.
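
One compact way to compare churn across contract types is a row-normalized cross-tabulation, which gives the share of customers who stayed or left within each type. The rows below are made up for illustration and do not come from the project data.

```python
import pandas as pd

toy = pd.DataFrame({
    'type': ['Month-to-month', 'Month-to-month', 'Month-to-month',
             'One year', 'One year', 'Two year'],
    'exited': [1, 1, 0, 1, 0, 0],
})

# each row sums to 1: within a contract type, share that stayed (0) vs left (1)
churn_by_type = pd.crosstab(toy['type'], toy['exited'], normalize='index')
print(churn_by_type)
```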

### What are the top 5 services offered?

In [None]:
# we create a copy of the dataframe to use for encoding
telecom_df_encode = telecom_df.copy()

# encode the six optional services as 1 ('Yes') / 0 ('No')
service_cols = ['online_security', 'online_backup', 'device_protection',
                'tech_support', 'streaming_tv', 'streaming_movies']
telecom_df_encode[service_cols] = (telecom_df_encode[service_cols] == 'Yes').astype('int32')

# keep only the service columns; `exited` is the target, not a service
telecom_services_data = telecom_df_encode[service_cols].transpose()
telecom_services_data

In [None]:
# getting dataframe showing services and percentage count 
telecom_services_data['count'] = telecom_services_data.sum(axis=1)
telecom_services_df = telecom_services_data.reset_index(inplace=False)
telecom_services_df = telecom_services_df[['index', 'count']].rename(columns={'index': 'services'})
telecom_services_df['% service offered']  = telecom_services_df['count'] / telecom_services_df['count'].sum() * 100
telecom_services = telecom_services_df.copy()
telecom_services.sort_values('% service offered', axis = 0, ascending = False, inplace = True, ignore_index=True)
telecom_services

In [None]:
# plot of top 5 Interconnect service by count 
telecom_services_pie = telecom_services.head(5)
(telecom_services_pie.set_index('services').plot(y='% service offered', kind='pie', 
                      title = 'Pie chart showing relative size of the five popular services', 
                      figsize=(8, 8), autopct='%1.1f%%', shadow=True)
);

From the plot of the top 5 services offered by Interconnect telecom, we can see that `streaming_tv` and `streaming_movies` are in high demand.

### Can number of services offered affect customer churn?

In [None]:
# plot of service count on customer churn
plot_snsbar(telecom_df, 'service_count', 'exited', 'Plot of service count on customer churn')

The service count has a very weak correlation with customer churn. From the plot, we see that customers using 5 or 6 services churned less. This suggests that encouraging customers to sign up for 5 or more services may help prevent churn.

### What about the day of the week effect on customer churn?

In [None]:
# plot to determine day of the week effect on customer churn
plot_snsbar(telecom_df, 'dayofweek', 'exited', 'Plot of day of the week effect on customer churn')

We can see that most churn is associated with the weekend. Because `dayofweek` is derived from `begin_date`, this refers to the day on which customers started their service. With this knowledge, Interconnect telecom can introduce incentives and weekend service bonuses to discourage customers from disconnecting their services.

### What months had the most churn and how can it be prevented?

In [None]:
# plot to determine months with the most churn
plot_snsbar(telecom_df, 'month', 'exited', 'Plot of months with the most customer churn')

From the plot above, the months of February, September, November and December had the most churn (`month`, like `dayofweek`, is derived from `begin_date`). With this understanding, Interconnect telecom can introduce bonuses, free service plans, free movie streaming services or discounted TV streaming services for the six months from September to February. This may help prevent customer churn during that period.

#### Conclusion

We can conclude the following from the exploratory data analysis done:
 - Most Interconnect customers prefer month-to-month payment, which accounts for 61% of contracts
 - Payments made by electronic check had the highest total charges and thus bring in the most revenue
 - Customers on a two-year contract have the highest total charges and bring in more total revenue than customers on a one-year contract
 - Customers on a two-year contract churn less than customers on other contract types
 - Customers with 5 or more services at a time churn less
 - Most churn occurred at weekends
 
 
**Action plan:**
 - The marketing team at Interconnect can promote two-year contract plans to entice more customers to sign up for a two-year contract.
 - Targeted marketing campaigns and promotional events should be run to promote the two-year plan to Interconnect's customers.
 - Customers should be encouraged to pay by electronic check, which may increase Interconnect's revenue.
 - Incentivize services with several promos to encourage customers to sign up for more services at a time.
 - Introduce several bonuses, free service plans, free streaming services or discounted TV streaming services for the six months from September to February to prevent churn.
 - Special promotional events and service options should be introduced towards the end of the week to discourage client churn.

<div id="model_training">
    <h2>Model Training</h2> 
</div>

Here, we train several models: a linear model, tree-based models and gradient-boosted models. The primary metric we chose to evaluate the models is AUC-ROC, the area under the receiver operating characteristic curve. The secondary metric is accuracy, which tells us how often the classifier is correct. The objective is to **maximize** both.
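
AUC-ROC measures how well a model ranks positives above negatives, so it is normally computed from predicted scores or probabilities rather than from hard 0/1 labels. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features (an assumption for illustration only) and prints accuracy together with the AUC computed both ways.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced binary data standing in for the churn features
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.75, 0.25], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels = clf.predict(X_test)               # hard 0/1 predictions
scores = clf.predict_proba(X_test)[:, 1]   # estimated probability of class 1

print('accuracy            :', accuracy_score(y_test, labels))
print('AUC (probabilities) :', roc_auc_score(y_test, scores))
print('AUC (hard labels)   :', roc_auc_score(y_test, labels))
```

When `roc_auc_score` receives hard labels, the ROC curve has a single operating point, so the value no longer reflects the model's ranking quality across thresholds.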

### Feature Engineering for Machine Learning

We perform feature engineering to encode all categorical features as numeric values, which makes them usable by machine learning algorithms. We apply ordinal encoding, one-hot encoding or no explicit encoding, depending on the machine learning algorithm. We train the following models:

| Model type | Model | Encoding type | Highlight | Cons |
|:--- |:----|:---:|:---:| :--- |
| Linear | Logistic regression | Ordinal encoding | Less prone to over-fitting | Can overfit in high dimensional datasets |
| Tree-based | Decision Tree | Ordinal encoding |    | Prone to errors |
|            | Random Forest | Ordinal encoding | Better than DT |     |
| Gradient-boosted | CatBoost | No encoding    | Fastest algo  |      |
| Gradient-boosted | XGBoost | One-hot encoding | Pretty fast |      |
| Gradient-boosted | LightGBM | One-hot encoding | Extremely fast |      |

For tree-based models such as decision tree and random forest, we use ordinal (label) encoding. For the XGBoost and LightGBM models, we use one-hot encoding. XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. The CatBoost classifier has its own implementation for encoding categorical features, so for it we create a separate dataset without any encoding and let CatBoost encode the categorical features internally.

In [None]:
# drop unimportant features
df = telecom_df.drop(['customer_id', 'begin_date', 'end_date', 'end_date_value'], axis=1)

# declare variables for target and features
y = df.exited
X = df.drop(['exited'], axis=1)

# split data into 75% training and 25% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12345)

In [None]:
# create copy of initial split feature dataset
features_train = X_train.copy()
features_test = X_test.copy()

# select numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'int32']]

# list of categorical variables
s = (features_train.dtypes == 'object')
object_cols = list(s[s].index)
print('Categorical variables:')
print(object_cols)

In [None]:
# Encoding features for machine learning

# Approach 1: Ordinal Encoding
# make a copy to avoid changing original data
label__X_train = features_train.copy()
label__X_test = features_test.copy()

# apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label__X_train[object_cols] = ordinal_encoder.fit_transform(features_train[object_cols])
label__X_test[object_cols] = ordinal_encoder.transform(features_test[object_cols])

# Approach 2: One-Hot Encoding
# one-hot encoding of categorical features
df_ohe = pd.get_dummies(df, drop_first=True)

# declare variables for target and features
y_ohe = df_ohe.exited
X_ohe = df_ohe.drop(['exited'], axis=1)

# split data into 75% training and 25% testing sets
X_train_ohe, X_test_ohe, y_train_ohe, y_test_ohe = train_test_split(X_ohe, y_ohe, test_size=0.25, random_state=12345)

# numerical features
numerical_cols = [cname for cname in X_train_ohe.columns if X_train_ohe[cname].dtype in ['float32', 'float64', 'int64', 'int32']]

# features scaling
scaler = StandardScaler()
scaler.fit(X_train_ohe[numerical_cols])
# transform the training set and the test set using transform()
X_train_ohe[numerical_cols] = scaler.transform(X_train_ohe[numerical_cols])
X_test_ohe[numerical_cols]  = scaler.transform(X_test_ohe[numerical_cols])

#### Conclusion

We split the data into 75% training and 25% testing sets. We applied both ordinal encoding and one-hot encoding to the features. We scaled the data after one-hot encoding using the standard scaler. Next, we examine class balance and improve the model quality if class imbalance exists.
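
An alternative way to organize encoding and scaling is to wrap them in a scikit-learn `Pipeline` with a `ColumnTransformer`, so that the encoder categories and scaler statistics are always learned from the rows passed to `fit()`. The sketch below is not the approach used in this notebook; it uses a few made-up rows and two of the project's column names purely for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy rows (hypothetical values) with one categorical and one numeric feature
toy = pd.DataFrame({
    'type': ['Month-to-month', 'Two year', 'One year',
             'Month-to-month', 'Two year', 'Month-to-month'],
    'monthly_charges': [70.0, 25.5, 55.0, 89.9, 20.0, 75.3],
    'exited': [1, 0, 0, 1, 0, 1],
})
X, y = toy.drop(columns='exited'), toy['exited']

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['type']),  # unseen categories become all zeros
    ('num', StandardScaler(), ['monthly_charges']),
])
pipe = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])

# encoder and scaler are fitted only on the data given to fit()
pipe.fit(X, y)
print(pipe.predict(X))
```

A pipeline of this kind can also be passed directly to `GridSearchCV`, in which case the preprocessing is refitted inside each cross-validation fold.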

### Examine the balance of class

In [None]:
# function to calculate model evaluation metrics
def print_model_evaluation(y_test, test_predictions):
    print("\033[1m" + 'F1 score: ' + "\033[0m", '{:.3f}'.format(f1_score(y_test, test_predictions)))
    print("\033[1m" + 'Accuracy Score: ' + "\033[0m", '{:.2%}'.format(accuracy_score(y_test, test_predictions)))
    print("\033[1m" + 'Precision: ' + "\033[0m", '{:.3f}'.format(precision_score(y_test, test_predictions)))
    print("\033[1m" + 'Recall: ' + "\033[0m", '{:.3f}'.format(recall_score(y_test, test_predictions)))
    print("\033[1m" + 'Balanced Accuracy Score: ' + "\033[0m", '{:.2%}'.format(balanced_accuracy_score(y_test, test_predictions)))
    print("\033[1m" + 'AUC-ROC Score: ' + "\033[0m", '{:.2%}'.format(roc_auc_score(y_test, test_predictions)))
    print()
    print("\033[1m" + 'Confusion Matrix' + "\033[0m")
    print('-'*50)
    print(confusion_matrix(y_test, test_predictions))
    print()
    print("\033[1m" + 'Classification report' + "\033[0m")
    print('-'*50)
    print(classification_report(y_test, test_predictions))
    print()

#### Baseline Model

In [None]:
# baseline model using a dummy classifier
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(features_train, y_train)
dummy_clf_test_predictions = dummy_clf.predict(features_test)

In [None]:
# evaluate baseline model
print_model_evaluation(y_test, dummy_clf_test_predictions)

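The baseline model always predicts the most frequent class, in this case "0". Looking at the baseline model report, the accuracy simply equals the share of the majority class, and the AUC-ROC score is 50%, which is no better than chance. Because the model never predicts churn, its recall for the churn class is zero. This reflects the class imbalance in the data.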

#### Sanity check with Logistic regression

In [None]:
# sanity check: class frequencies of the logistic regression predictions
model = LogisticRegression(random_state=12345, solver='liblinear')
# fit on the ordinal-encoded features; the unencoded features still contain strings
model.fit(label__X_train, y_train) # train the model 
test_predictions = pd.Series(model.predict(label__X_test))
class_frequency = test_predictions.value_counts(normalize=True)
print(class_frequency)
class_frequency.plot(kind='bar');

In the modeling stage, we train different models:
- We split the data into 75% training and 25% testing sets
- We use AUC-ROC as our primary metric and accuracy as the secondary metric to predict customer churn
- We apply encoding to categorical variables so they can be read by our machine learning models
- We scale the data by applying the standard scaler to the features
- We use a dummy classifier as our baseline model and logistic regression as a sanity check
- We train different tree-based models and gradient boosting models
- We apply hyperparameter tuning with cross-validation to tune the different models
- We choose the best performing models by cross-validated AUC-ROC and accuracy on the training data
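
The tuning step described above can be sketched with `GridSearchCV`, scoring by AUC-ROC. The data here is synthetic and the parameter grid contains illustrative values only; neither is taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the encoded training features
X, y = make_classification(n_samples=300, n_features=8, random_state=12345)

param_grid = {'n_estimators': [20, 50], 'max_depth': [3, 5]}  # illustrative values only
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring='roc_auc',  # primary metric
    cv=3,               # 3-fold cross-validation on the training data
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated AUC-ROC of the best parameter combination, which gives a less optimistic estimate than a score computed on the same rows the model was fitted on.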

<div id="model_testing">
    <h2>Model Testing</h2> 
</div>

- Using the best performing model, we evaluate on the test dataset
- The best performing model is the one with the best cross-validated score on the training set
- We plot a confusion matrix for the model's performance on the test set
- We determine the best model's accuracy and AUC-ROC on the test set
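
As a minimal sketch of the testing step, the example below times the training of a model and builds a confusion matrix on a held-out set. It uses synthetic data and a decision tree as placeholders; the actual best model and its results are not shown here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the project data
X, y = make_classification(n_samples=400, n_features=8, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345)

model = DecisionTreeClassifier(max_depth=4, random_state=12345)

# measure wall-clock training time
start = time.perf_counter()
model.fit(X_train, y_train)
train_seconds = time.perf_counter() - start

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, model.predict(X_test))

print(f'training time: {train_seconds:.4f} s')
print(cm)
```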

<div id="model_analysis">
    <h2>Model Analysis</h2> 
</div>

<div id="overall_conclusion">
    <h2>Overall Conclusion</h2> 
</div>

- Based on the chosen model, we can predict customer churn for Interconnect telecom