## Objectives

The main goal of the case study is to build ML models to predict churn. The predictive model built has the following purposes:

- It will be used to predict whether a high-value customer will churn in the near future (i.e., the churn phase). Knowing this, the company can take actions such as offering special plans or discounts on recharges.

- It will be used to identify important variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks. A model built on PCA-transformed features may predict churn with high accuracy, but because each principal component is a combination of the original variables, it does not readily show which variables matter for churn. Hence, an interpretable model such as a Decision Tree or Logistic Regression can be built alongside it to help interpret the important variables.

- Even though overall accuracy will be the primary evaluation metric, other metrics such as precision and recall should also be taken into account, depending on the business objective. For example, in this problem statement, one business goal can be to build an ML model that identifies customers who will churn more reliably than it identifies those who will not.

- Recommend strategies to manage customer churn based on your observations.
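To make the point about evaluation metrics concrete, here is a minimal sketch with made-up numbers (not taken from the dataset): 100 customers, 10 of whom churn. It compares a hypothetical model that predicts "no churn" for everyone with one that catches most churners at the cost of some false alarms. The helper `classification_metrics` is an illustrative name, not a library function.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical, imbalanced toy labels: 10 churners followed by 90 non-churners.
y_true = [1] * 10 + [0] * 90

# Model A: always predicts "no churn".
always_no_churn = [0] * 100

# Model B: catches 8 of the 10 churners but raises 12 false alarms.
catches_most = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 78

baseline = classification_metrics(y_true, always_no_churn)
model = classification_metrics(y_true, catches_most)
print("always no churn:", baseline)
print("catches most   :", model)
```

In this toy setup the always-"no churn" model scores higher on accuracy, yet its recall is zero: it never flags a customer the company could try to retain. This is why recall and precision on the churn class deserve attention alongside accuracy.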

#### Brief outline of different steps

Let's take a look at the high-level steps followed to satisfy the objectives of this project.

- __Data Understanding, Preparation, and Pre-Processing__ :
    1. Data understanding: identifying potentially useful and non-useful attributes, and estimating variable importance and impact
    2. Data preparation: data cleaning, missing-value imputation, outlier removal, and column-level standardization (e.g., of dates) into one format
\
<br>

- __Exploratory Data Analysis__ :
    1. Performing basic preliminary data analysis, including computing correlations between variables and drawing scatter plots to identify relationships between them
    2. Performing advanced data analysis, including plotting relevant heatmaps and histograms, and basic clustering to find patterns in the data
\
<br>

- __Feature Engineering and Variable Transformation__ :
    1. Feature engineering: applying one or more methods to existing attributes to create new, potentially useful variables (e.g., the day extracted from a date)
    2. Variable transformation: converting categorical variables into numerical data and scaling numerical variables
\
<br>

- __Model Selection, Model Building, and Prediction__ :
    1. Identifying the type of problem and making a shortlist of candidate models from all available choices
    2. Choosing a training mechanism (e.g., cross-validation) and tuning the hyperparameters of each model
    3. Testing each model on the respective model evaluation metric
    4. Choosing the best model based on how well it fits the data set and the output variable
    5. Using ensemble options to improve the efficacy based on the evaluation metric stated in the problem
\
<br>
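The preparation, feature-engineering, and transformation steps above can be sketched on a tiny, made-up frame. The column names below (`last_recharge_date`, `plan_type`, `monthly_usage`) are hypothetical stand-ins, not columns from the churn dataset, and median imputation and z-score scaling are only one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a date string column, a categorical column,
# and a numerical column containing a missing value.
toy = pd.DataFrame({
    "last_recharge_date": ["6/30/2014", "7/15/2014", None, "8/1/2014"],
    "plan_type": ["prepaid", "postpaid", "prepaid", "prepaid"],
    "monthly_usage": [120.0, np.nan, 80.0, 100.0],
})

# Column-level standardization: parse the date strings into one datetime format.
toy["last_recharge_date"] = pd.to_datetime(
    toy["last_recharge_date"], format="%m/%d/%Y"
)

# Feature engineering: derive the day of the month from the date.
toy["recharge_day"] = toy["last_recharge_date"].dt.day

# Missing-value imputation: fill numerical gaps with the column median.
toy["monthly_usage"] = toy["monthly_usage"].fillna(toy["monthly_usage"].median())

# Numerical transformation: z-score scaling (zero mean, unit variance).
usage = toy["monthly_usage"]
toy["monthly_usage_scaled"] = (usage - usage.mean()) / usage.std(ddof=0)

# Categorical transformation: one-hot encode the plan type.
toy = pd.get_dummies(toy, columns=["plan_type"], dtype=int)

print(toy)
```

In a real workflow the imputation and scaling statistics would be computed on the training split only and then reused on the test split, so that no information leaks from the test data.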

Alright, let's now proceed to perform the different steps outlined above.

#### 1. __Data Understanding, Preparation, and Pre-Processing__

In [1]:
# Import the necessary libraries
import sys
import os
import warnings

# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
# Load the training data and preview the first five rows
data_path = os.path.join(os.getcwd(), 'telecom-churn-data')
train_path = os.path.join(data_path, 'train.csv')
train_df = pd.read_csv(train_path)
train_df.head()


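| | id | circle_id | loc_og_t2o_mou | std_og_t2o_mou | loc_ic_t2o_mou | last_date_of_month_6 | last_date_of_month_7 | last_date_of_month_8 | arpu_6 | arpu_7 | ... | sachet_3g_7 | sachet_3g_8 | fb_user_6 | fb_user_7 | fb_user_8 | aon | aug_vbc_3g | jul_vbc_3g | jun_vbc_3g | churn_probability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 31.277 | 87.009 | ... | 0 | 0 | NaN | NaN | NaN | 1958 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 1 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 0.0 | 122.787 | ... | 0 | 0 | NaN | 1.0 | NaN | 710 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 2 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 60.806 | 103.176 | ... | 0 | 0 | NaN | NaN | NaN | 882 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 3 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 156.362 | 205.26 | ... | 0 | 0 | NaN | NaN | NaN | 982 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 4 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 240.708 | 128.191 | ... | 1 | 0 | 1.0 | 1.0 | 1.0 | 647 | 0.0 | 0.0 | 0.0 | 0 |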


We can see that there are 172 columns in our dataset. In the following steps, let's take a look at some more details: the datatypes of the columns, the number of rows, whether null values are present, the distribution of the numerical variables, and so on.

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69999 entries, 0 to 69998
Columns: 172 entries, id to churn_probability
dtypes: float64(135), int64(28), object(9)
memory usage: 91.9+ MB


The dataset has 69,999 rows and 172 columns. Of these, 163 are numerical columns (float64: 135, int64: 28) and 9 are non-numerical columns with the object datatype, which include date columns such as last_date_of_month_6 that are stored as strings.
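The same breakdown can be obtained programmatically instead of being read off `info()`. The sketch below uses a small stand-in frame named `sample` (an assumption, so that the snippet runs on its own) holding the first five rows of four columns copied from the `head()` output above; on the real data the same calls would be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Stand-in for train_df: five rows of four columns taken from the head() output.
sample = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "arpu_6": [31.277, 0.0, 60.806, 156.362, 240.708],
    "fb_user_6": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "last_date_of_month_6": ["6/30/2014"] * 5,
})

# Number of columns per datatype.
dtype_counts = sample.dtypes.value_counts()

# Split the column names into numerical and non-numerical groups.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Share of missing values per column, highest first.
missing_share = sample.isna().mean().sort_values(ascending=False)

print(dtype_counts)
print("numerical    :", numeric_cols)
print("non-numerical:", non_numeric_cols)
print(missing_share)
```

Sorting the missing-value share makes it easy to spot columns such as `fb_user_6`, which is empty in most of the rows shown above.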

In [10]:
# Summary statistics of the numerical columns (transposed: one row per column)
train_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | 69999.0 | 34999.000000 | 20207.115084 | 0.0 | 17499.5 | 34999.0 | 52498.5 | 69998.00 |
| circle_id | 69999.0 | 109.000000 | 0.000000 | 109.0 | 109.0 | 109.0 | 109.0 | 109.00 |
| loc_og_t2o_mou | 69297.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
| std_og_t2o_mou | 69297.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
| loc_ic_t2o_mou | 69297.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| aon | 69999.0 | 1220.639709 | 952.426321 | 180.0 | 468.0 | 868.0 | 1813.0 | 4337.00 |
| aug_vbc_3g | 69999.0 | 68.108597 | 269.328659 | 0.0 | 0.0 | 0.0 | 0.0 | 12916.22 |
| jul_vbc_3g | 69999.0 | 65.935830 | 267.899034 | 0.0 | 0.0 | 0.0 | 0.0 | 9165.60 |
| jun_vbc_3g | 69999.0 | 60.076740 | 257.226810 | 0.0 | 0.0 | 0.0 | 0.0 | 11166.21 |
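
The summary above shows that `circle_id` has a standard deviation of 0 (every value is 109) and that `loc_og_t2o_mou`, `std_og_t2o_mou`, and `loc_ic_t2o_mou` are 0 throughout, with a count of 69,297 against 69,999 rows indicating some missing values. Columns like these carry no information for separating churners from non-churners. A minimal sketch for flagging them follows; it assumes that dropping constant columns is acceptable for this analysis, which the text so far does not state, and it uses a small stand-in frame named `sample` rather than `train_df`.

```python
import numpy as np
import pandas as pd

# Stand-in for train_df with two constant columns and one informative column.
sample = pd.DataFrame({
    "circle_id": [109, 109, 109, 109],
    "loc_og_t2o_mou": [0.0, 0.0, np.nan, 0.0],
    "aon": [1958, 710, 882, 982],
})

# nunique() ignores NaN by default, so a column with a single distinct
# value plus some gaps is still treated as constant.
constant_cols = [col for col in sample.columns if sample[col].nunique() <= 1]

# Drop the constant columns and keep the rest.
reduced = sample.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", reduced.columns.tolist())
```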
