## Business Problem Overview
In the telecom industry, customers can choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%. Because it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition. 

For many incumbent operators, retaining highly profitable customers is the number one business goal. 

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

#### Definitions of Churn

<b>Usage-based churn</b>: Customers who have had no usage, either incoming or outgoing, in terms of calls, internet, etc., over a period of time. In this project, we will use the <b>usage-based definition</b> to define churn.

Only high-value customers are analysed in this case study. High-value customers are defined as follows:

<b>Filter high-value customers</b>: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).
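
To make the high-value rule concrete, here is a minimal sketch on hypothetical toy data. The frame `toy` and the column name `avg_rech_amt_6n7` are assumptions made for illustration only; they are not the real data set.

```python
import pandas as pd

# Hypothetical toy data: average recharge amount over months 6 and 7 for ten customers.
toy = pd.DataFrame(
    {"avg_rech_amt_6n7": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
)

# X is the 70th percentile of the good-phase average recharge amount.
x_threshold = toy["avg_rech_amt_6n7"].quantile(0.70)

# High-value customers are those whose average recharge is at or above X.
high_value = toy[toy["avg_rech_amt_6n7"] >= x_threshold]

print(x_threshold, len(high_value))
```

Because the filter uses `>=`, roughly the top 30% of customers by recharge amount are kept.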

The expectation is to build two models:
1. The first will be used to predict whether a high-value customer will churn or not in the near future (i.e. the churn phase).

2. The second will be used to identify important variables that are strong predictors of churn. 

In [None]:
import pandas as pd
import numpy as np
import sklearn

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# KNNImputer for imputing missing values in continuous features
from sklearn.impute import KNNImputer

# data splitting, cross-validation and hyperparameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold, cross_val_score

# models and pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# scalers
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# pca
from sklearn.decomposition import PCA, IncrementalPCA

# evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# for showing tree
from IPython.display import Image
from six import StringIO # use six instead of sklearn.externals.six
from sklearn.tree import export_graphviz
import pydotplus, graphviz

# display upto 3 decimals
pd.options.display.float_format = "{:,.3f}".format

import warnings
warnings.filterwarnings('ignore')
pd.set_option ('display.max_columns', None) 
pd.set_option ('display.max_rows', 999)

#### We will perform the following steps to address the problem at hand 
1. Data Load and Cleaning
2. EDA
3. Data Visualisation
4. Model Building 
5. Model Validation
6. Conclusion

### 1. Data Load and Cleaning

In [None]:
# Read csv
churn_df = pd.read_csv("telecom_churn_data.csv")
churn_df.head()

In [None]:
churn_df.shape

In [None]:
churn_df.describe()

In [None]:
# feature type summary
churn_df.info(verbose=1)

In [None]:
# get the percentage of missing values per column
missing = (churn_df.isna().sum()/churn_df.shape[0])*100

# create DataFrame
df_missing = pd.DataFrame(missing)

# reset index
df_missing = df_missing.reset_index()
# set column names
df_missing.columns = ['Name','MissingValuesPer']
df_missing.head()

In [None]:
df_missing[df_missing.MissingValuesPer >0]['Name'].values

In [None]:
# create column name list by types of columns
drop_columns = ['mobile_number', 'circle_id', 'last_date_of_month_6',
             'last_date_of_month_7',
             'last_date_of_month_8',
             'last_date_of_month_9',
             'date_of_last_rech_6',
             'date_of_last_rech_7',
             'date_of_last_rech_8',
             'date_of_last_rech_9',
             'date_of_last_rech_data_6',
             'date_of_last_rech_data_7',
             'date_of_last_rech_data_8',
             'date_of_last_rech_data_9']

# drop these columns as they would not be significant in the churn analysis, since we will be performing usage-based analysis.
churn_df = churn_df.drop(drop_columns, axis=1)


In [None]:
churn_df.head()

In [None]:
# List the categorical columns which will be imputed with -1 for missing values
cat_cols =  ['night_pck_user_6',
             'night_pck_user_7',
             'night_pck_user_8',
             'night_pck_user_9',
             'fb_user_6',
             'fb_user_7',
             'fb_user_8',
             'fb_user_9'
            ]

# Impute -1 for missing categorical variable values
churn_df[cat_cols] = churn_df[cat_cols].fillna(-1)

In [None]:
churn_df[cat_cols].head()

In [None]:
# List all the continuous variables 
cont_var = [column for column in churn_df.columns if column not in drop_columns +  cat_cols]

# print the number of columns in each list
print("#Dropped cols: %d\n#Continuous Variables:%d\n#Category cols:%d" % (len(drop_columns), len(cont_var), len(cat_cols)))

# check if we have missed any column or not
print(len(cont_var) + len(cat_cols) == churn_df.shape[1])
churn_df.shape[1]

In [None]:
# All values in these columns are zero or NaN. Dropping these columns as they have no variance 
churn_df[['loc_og_t2o_mou','std_og_t2o_mou','loc_ic_t2o_mou','std_og_t2c_mou_6',
                     'std_og_t2c_mou_7','std_og_t2c_mou_8','std_ic_t2o_mou_6','std_ic_t2o_mou_7',
                   'std_ic_t2o_mou_8']].value_counts()

churn_df = churn_df.drop(['loc_og_t2o_mou','std_og_t2o_mou','loc_ic_t2o_mou','std_og_t2c_mou_6',
                     'std_og_t2c_mou_7','std_og_t2c_mou_8','std_ic_t2o_mou_6','std_ic_t2o_mou_7',
                   'std_ic_t2o_mou_8'], axis=1)

churn_df.shape

In [None]:
# some recharge columns have minimum value of 1 while some don't
recharge_cols = ['total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 'total_rech_data_9',
                 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'count_rech_2g_9',
                 'count_rech_3g_6', 'count_rech_3g_7', 'count_rech_3g_8', 'count_rech_3g_9',
                 'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'max_rech_data_9',
                 'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'av_rech_amt_data_9',
                 ]

churn_df[recharge_cols].describe(include='all')

In [None]:
# impute these columns with zero
churn_df[recharge_cols] = churn_df[recharge_cols].fillna(0)
churn_df[recharge_cols].head()

In [None]:
(churn_df.isna().sum()/churn_df.shape[0])*100

In [None]:
# get the percentage of missing values per column
missing = (churn_df.isna().sum()/churn_df.shape[0])*100

# create DataFrame
df_missing = pd.DataFrame(missing)

# reset index
df_missing = df_missing.reset_index()
# set column names
df_missing.columns = ['Name','MissingValuesPer']
df_missing.head()

In [None]:
# drop columns with more than 70% empty values
churn_df = churn_df.drop( df_missing[df_missing.MissingValuesPer>70]['Name'].values, axis=1)
churn_df.shape

In [None]:
# Inspect a call-usage feature before deciding how to impute missing values in the call-related features.
churn_df['onnet_mou_6'].describe()

In [None]:
missing_values = df_missing[(df_missing.MissingValuesPer >0) & (df_missing.MissingValuesPer <70)]['Name'].values

In [None]:
import missingno as msno 

In [None]:
# Visualize missing values as a matrix 
msno.matrix(churn_df[missing_values])

In [None]:
df_missing.columns

In [None]:
# get list of all the columns with missing values for month 6,7 and 8
col_6_names = df_missing[(df_missing.MissingValuesPer>0) & (df_missing.MissingValuesPer <= 70) & df_missing.Name.str.contains('6$')]['Name'].values
col_7_names = df_missing[(df_missing.MissingValuesPer>0) & (df_missing.MissingValuesPer <= 70) & df_missing.Name.str.contains('7$')]['Name'].values
col_8_names = df_missing[(df_missing.MissingValuesPer>0) & (df_missing.MissingValuesPer <= 70) & df_missing.Name.str.contains('8$')]['Name'].values

In [None]:
def CustomeFillNa(row, col, total_rec_amt):
    # NaN never compares equal to anything (including itself), so test with pd.isna
    if pd.isna(row[col]) and (row[total_rec_amt] == 0):
        return 0
    else:
        return row[col]

In [None]:
# rows to impute with 0: months in which there was no recharge, i.e. total_rech_amt == 0
# (use isna(): a comparison with np.nan via == is always False)
churn_df[(churn_df.onnet_mou_6.isna()) & (churn_df.total_rech_amt_6 == 0)][col_6_names]

# as the total recharge amount is 0 and the values are missing for all such features,
# we assume the customer was inactive during this time,
# so all such columns can be imputed with 0

churn_df[(churn_df.onnet_mou_6.isna()) ][col_6_names]

In [None]:
churn_df[(churn_df.onnet_mou_6.isna()) & (churn_df.total_rech_amt_6 == 0)][col_6_names]

In [None]:
churn_df[col_6_names].isna().sum()

In [None]:
# Pending imputing missing values
missing_values

In [None]:
# taking the mode to impute the missing values in the columns
print(churn_df.onnet_mou_6.head())
churn_df[['onnet_mou_6']].fillna(churn_df.onnet_mou_6.mode()[0]).head()

## 2. EDA
#### Derive Churn
The 9th month is our churn phase. We use usage-based churn:
- Calculate total incoming and outgoing minutes of usage
- Calculate 2g and 3g data consumption
- Create churn variable: those who have not used either calls or internet in the month of September are customers who have churned
- Check churn percentage
- Delete columns that belong to the churn month
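
The churn rule can also be written as a vectorised comparison. The sketch below uses a hypothetical three-customer frame (`toy`) with the same September column names; it only illustrates the rule and is not the notebook's actual cell.

```python
import pandas as pd

# Hypothetical toy data for three customers in month 9 (September).
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],   # incoming minutes of usage
    "total_og_mou_9": [0.0, 3.0, 0.0],    # outgoing minutes of usage
    "vol_2g_mb_9":    [0.0, 0.0, 40.0],   # 2g data volume in MB
    "vol_3g_mb_9":    [0.0, 0.0, 0.0],    # 3g data volume in MB
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# A customer is churned (1) only when calls and data usage are all zero.
toy["churned"] = (toy[usage_cols].sum(axis=1) == 0).astype(int)

# Churn percentage on the toy data.
churn_pct = toy["churned"].mean() * 100
print(toy["churned"].tolist(), round(churn_pct, 2))
```

The third customer made no calls but used 2g data, so only the first customer is flagged as churned.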


In [None]:
#Check nulls
churn_df[['total_ic_mou_9','total_og_mou_9','vol_2g_mb_9','vol_3g_mb_9']].isnull().sum()

In [None]:
# Create total calls in 9th month
churn_df['total_calls_mou_9'] = churn_df.total_ic_mou_9 + churn_df.total_og_mou_9

#create total data usage derived column
churn_df['total_data_mb_9'] = churn_df.vol_2g_mb_9 + churn_df.vol_3g_mb_9
churn_df.head()

In [None]:
churn_df[['total_calls_mou_9','total_data_mb_9']].info()

In [None]:
#create churn column
# 1- churned 0- not Churned
churn_df['churned'] = churn_df.apply(lambda row: 1 if ((row['total_calls_mou_9'] + row['total_data_mb_9']) == 0) else 0, axis=1)
churn_df.churned =  churn_df.churned.astype('category')
churn_df.head()

In [None]:
# let's check the churn percentage on the total data
round(churn_df.churned.value_counts()/churn_df.shape[0] *100, 2)

### Define high-value customers and filter the data set.

We need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

- calculate total recharge amount

    call recharge amount( total_rech_amt ) + data recharge amount


In [None]:
#Check nulls
print(churn_df[['total_rech_amt_6','total_rech_amt_7','av_rech_amt_data_6','total_rech_data_6','av_rech_amt_data_7','total_rech_data_7']].isnull().sum())


In [None]:
# Derived Column Average Total Recharge Amount for 6th and 7th (first two) months
churn_df['total_avg_rech_amt_6n7'] = round((churn_df.total_rech_amt_6 + churn_df.total_rech_amt_7 + 
(churn_df.av_rech_amt_data_6 * churn_df.total_rech_data_6) + (churn_df.av_rech_amt_data_7 * churn_df.total_rech_data_7))/2, 2)

churn_df['total_avg_rech_amt_6n7'].head()

In [None]:
# 70th percentile of the average recharge amount over the first two months (records with a recharge amount of 0 are included)
Q70 = np.quantile(churn_df.total_avg_rech_amt_6n7, 0.70)
print(Q70)
churn_df[churn_df.total_avg_rech_amt_6n7 >= Q70].shape

In [None]:
#Use filter data set for further processing
telecom_df = churn_df[churn_df.total_avg_rech_amt_6n7 >= Q70]
telecom_df.head()

In [None]:
# let's check the churn percentage among the high-value customers
round(telecom_df.churned.value_counts()/telecom_df.shape[0] *100, 2)

In [None]:
# pie chart depicting churned customers as part of the full portfolio.
(telecom_df['churned'].value_counts(1)*100).plot(kind='pie')
plt.show()

In [None]:
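# delete columns from the 9th month (the churn phase)
# note: the '9$' pattern does not match 'sep_vbc_3g', which also describes September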
col_9_names = churn_df.filter(regex='9$', axis=1).columns
telecom_df = telecom_df.drop(col_9_names, axis=1)
telecom_df.shape

In [None]:
# drop derived column created for filtering high value customer.
telecom_df = telecom_df.drop('total_avg_rech_amt_6n7', axis=1)

### Missing values treatment using KNNImputer

Using KNN Imputer to impute missing data in the filtered dataset.
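
As a rough illustration of how KNN imputation borrows values from similar rows, here is a minimal sketch on hypothetical toy data (`toy`, `mou_6` and `mou_7` are made up). It assumes `n_neighbors=1` so that the result is easy to follow; the notebook itself uses 5 neighbours.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two low-usage and two high-usage customers,
# with month 7 missing for one customer in each group.
toy = pd.DataFrame({
    "mou_6": [10.0, 12.0, 100.0, 110.0],
    "mou_7": [11.0, np.nan, 105.0, np.nan],
})

# The imputer measures distance on the columns that are present (mou_6 here)
# and copies mou_7 from the nearest row that has it.
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

The neighbours are found from the other columns, so the imputer needs several columns at once; a single column offers no basis for measuring distance.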

In [None]:
# Checking missing values
ms_df = telecom_df.isnull().sum()/len(telecom_df.index)*100
ms_df.head()

In [None]:
# creating dataframe
ms_df = pd.DataFrame(ms_df, columns=['MissingVal'])
ms_df = ms_df.reset_index()
ms_df = ms_df.rename(columns={'index':'Column'})

# selecting only missing values
missing_values_columns = ms_df[ms_df.MissingVal > 0]['Column'].tolist()

# selecting only continuous variables for imputing
missing_values_columns = telecom_df[missing_values_columns].select_dtypes(include='float64').columns.tolist()

# checking the number of variables for imputing
len(missing_values_columns)
#len(churn_df.columns[churn_df.columns.str.contains('9$')])

In [None]:
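# define imputer
imputer = KNNImputer(n_neighbors= 5)

# fit on all continuous columns together: KNN needs the other features to find neighbours
# (imputing one column at a time falls back to filling with the column mean)
telecom_df[missing_values_columns] = imputer.fit_transform(telecom_df[missing_values_columns])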

### Calculate the difference between 8th and previous months.

The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.

The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.)

In our case, the 6th and 7th months are the good phase, whereas the 8th month is the action phase. Hence it is important to compare the service usage in the 8th month with the average usage during the 6th and 7th months.

In [None]:
# Calculate the difference between the 8th month and the average of the 6th and 7th months for continuous variables.
# (the "_var" suffix denotes this change in level, not a statistical variance)
diff_features = ['arpu', 'onnet_mou', 'offnet_mou', 'roam_ic_mou', 'roam_og_mou',
                 'loc_og_mou', 'std_og_mou', 'isd_og_mou', 'spl_og_mou', 'total_og_mou',
                 'loc_ic_mou', 'std_ic_mou', 'isd_ic_mou', 'spl_ic_mou', 'total_ic_mou',
                 'total_rech_num', 'total_rech_amt', 'max_rech_amt',
                 'total_rech_data', 'max_rech_data', 'av_rech_amt_data',
                 'vol_2g_mb', 'vol_3g_mb']

for feat in diff_features:
    telecom_df[feat + '_var'] = telecom_df[feat + '_8'] - ((telecom_df[feat + '_6'] + telecom_df[feat + '_7'])/2)

In [None]:
telecom_df.head()

## 3. Data Visualization

#### Univariate Data Analysis

In this section we will analyse and understand the various variables in the dataset. Since there are more than 100 variables/features in the dataset, we will plot a few which we think are important.


In [None]:
# common function to plot the data
def UnivariatePlot(feature, is_categorical, viewStats=True):
    plt.figure(figsize=(8,5))
    if(viewStats):        
        print(feature.describe())
    if(is_categorical):  
        plt.title('Count Plot')
        # pass the data by keyword; recent seaborn versions do not accept it positionally as x
        sns.countplot(x=feature)
    else:
        plt.title('Distribution Plot')
        sns.distplot(feature)
    plt.show()
    
    

In [None]:
telecom_df.head()

In [None]:
#Plotting the continuous variables 
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.total_rech_amt_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.total_rech_amt_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.total_rech_amt_8)

# the data shows the recharge amount is concentrated between 0 and 2500 for each month

In [None]:
#Plotting the number of recharges for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.total_rech_num_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.total_rech_num_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.total_rech_num_8)

# Again, no very clear trend, but the number of recharges seems to have fallen in months 7 and 8

In [None]:
#Plotting the total recharge data for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.total_rech_data_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.total_rech_data_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.total_rech_data_8)

# Not much variance between three months data.

In [None]:
#Plotting the average data recharge amount for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.av_rech_amt_data_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.av_rech_amt_data_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.av_rech_amt_data_8)
plt.show()
# Not much variance between three months data.

In [None]:
#Plotting the average revenue per user (arpu) for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.arpu_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.arpu_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.arpu_8)
plt.show()
# Not much variance between three months data. There seems to be a slight drop in August (month 8).

In [None]:
#UnivariatePlot(telecom_df.fb_user_6, True, True)
#Plotting the fb_user flag for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Count Plot')       
plt.subplot(2,2,1)
sns.countplot(x=telecom_df.fb_user_6)
plt.subplot(2,2,2)
sns.countplot(x=telecom_df.fb_user_7)
plt.subplot(2,2,3)
sns.countplot(x=telecom_df.fb_user_8)
plt.show()
#There seems to be a slight increase in the number of users not opting for this feature in August (month 8)

In [None]:
#UnivariatePlot(telecom_df.fb_user_6, True, True)
#Plotting the night_pck_user flag for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Count Plot')       
plt.subplot(2,2,1)
sns.countplot(x=telecom_df.night_pck_user_6)
plt.subplot(2,2,2)
sns.countplot(x=telecom_df.night_pck_user_7)
plt.subplot(2,2,3)
sns.countplot(x=telecom_df.night_pck_user_8)
plt.show()
#There seems to be a slight increase in the number of users not opting for this feature in August (month 8)

In [None]:

#UnivariatePlot(telecom_df.fb_user_6, True, True)
#Plotting the last day recharge for 6,7,8 months variables 
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.last_day_rch_amt_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.last_day_rch_amt_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.last_day_rch_amt_8)
plt.show()
#There seems to be little to no variance in the data.

In [None]:
# average recharge amount for data
plt.figure(figsize=(12,6))
plt.title('Distribution Plot')       
plt.subplot(2,2,1)
sns.distplot(telecom_df.av_rech_amt_data_6)
plt.subplot(2,2,2)
sns.distplot(telecom_df.av_rech_amt_data_7)
plt.subplot(2,2,3)
sns.distplot(telecom_df.av_rech_amt_data_8)
plt.show()
#There seems to be a fall in the average recharge amount in August (month 8)

### Bivariate Visualization of Data

In this section we will plot bivariate plots for both the continuous and categorical variables against the target variable.

In [None]:
# Customer tenure on the network (aon) against churn. 
plt.figure(figsize=(12,6))   
sns.boxplot(x=telecom_df.aon, y=telecom_df.churned)
plt.show()

In [None]:
# Number of data recharges in month 6 against churn. 
plt.figure(figsize=(12,6))   
sns.boxplot(data=telecom_df, x='total_rech_data_6',y=telecom_df.churned )
plt.show()

In [None]:
# Pair plot of tenure, recharge amounts and recharge counts, coloured by churn
sns.pairplot(data=telecom_df[['aon','total_rech_amt_6','total_rech_amt_7','total_rech_amt_8','total_rech_num_6','total_rech_num_7','total_rech_num_8','churned']],hue='churned' )
plt.show()

In [None]:
# Pair plot of the monthly arpu variables, coloured by churn
sns.pairplot(data=telecom_df[['arpu_6','arpu_7','arpu_8','churned']],hue='churned' )
plt.show()

In [None]:
# Pair plot of the month-8 change in revenue and recharge variables, coloured by churn
sns.pairplot(data=telecom_df[['arpu_var','total_rech_amt_var','av_rech_amt_data_var','churned']],hue='churned' )
plt.show()
# when the change from the good phase is zero or negative, the customer appears more likely to churn.

In [None]:
# Pair plot of the month-8 change in call and data usage, coloured by churn
sns.pairplot(data=telecom_df[['total_og_mou_var','total_ic_mou_var','vol_2g_mb_var','vol_3g_mb_var','churned']],hue='churned' )
plt.show()
# the plot suggests that customers whose call and data usage fell in the 8th month were more likely to churn

In [None]:
# Using pandas cross tab for the categorical values
pd.crosstab(telecom_df.churned, [telecom_df.night_pck_user_6,telecom_df.night_pck_user_7,telecom_df.night_pck_user_8], normalize='columns')*100

In [None]:
# Using pandas cross tab for the categorical values
pd.crosstab(telecom_df.churned, [telecom_df.fb_user_6,telecom_df.fb_user_7,telecom_df.fb_user_8], normalize='columns')*100

In [None]:
# Using pandas cross tab for the categorical values
pd.crosstab(telecom_df.churned, [telecom_df.monthly_2g_6, telecom_df.monthly_2g_7, telecom_df.monthly_2g_8], normalize='columns')*100

In [None]:
telecom_df.head()

In [None]:
#Create list for each models param and performance.
ModelName = []
Accuracy = []
Sensitivity = []
Specificity = []
ROC = []
AUC=[]
Param = []


## 4. Modeling 

In this section we will create a model for churn prediction and a model to explain the important variables.



###  Outlier treatment of the data

After checking the data, we can see many extreme values. Because they are so numerous, they are probably not true outliers but high-value customers with more usage. Removing them is not suitable: we would lose a lot of data, and it could hurt model accuracy. We therefore keep all rows (capping extreme values at the 98th percentile rather than removing them) and look at the model performance first.
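
Capping keeps every row but limits how far an extreme value can pull the model. Below is a minimal sketch on a hypothetical toy series (`usage` is made up), assuming an upper cap at the 98th percentile as used later in this notebook.

```python
import pandas as pd

# Hypothetical toy usage values with one extreme customer.
usage = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0], name="arpu_8")

# Upper cap at the 98th percentile of the column.
cap = usage.quantile(0.98)

# clip() replaces anything above the cap with the cap itself; no rows are dropped.
capped = usage.clip(upper=cap)

print(cap, capped.tolist())
```

Only the upper tail is capped here, which mirrors the one-sided treatment in this notebook.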

In [None]:
# checking the outliers by describing the data and looking at diff percentile values
telecom_df.describe(percentiles=[0.1,0.25,0.5,0.75,0.80,0.85,0.9,0.95,0.99,0.999])

In [None]:
# plotting boxplots
plt.figure(figsize=(10,8))
sns.boxplot(data = telecom_df[['arpu_6', 'arpu_7', 'arpu_8']])
plt.show()

In [None]:

# plotting boxplots
plt.figure(figsize=(10,8))
sns.boxplot(data = telecom_df[['onnet_mou_6', 'onnet_mou_7', 'onnet_mou_8']])
plt.show()

In [None]:
#columns for outlier treatment.
columns= ['arpu_6', 'arpu_7',	'arpu_8',	'onnet_mou_6',	'onnet_mou_7',	'onnet_mou_8',	'offnet_mou_6',	'offnet_mou_7',	'offnet_mou_8',	'roam_ic_mou_6',	'roam_ic_mou_7',	'roam_ic_mou_8',	'roam_og_mou_6',	'roam_og_mou_7',	'roam_og_mou_8',	'loc_og_t2t_mou_6',	'loc_og_t2t_mou_7',	'loc_og_t2t_mou_8',	'loc_og_t2m_mou_6',	'loc_og_t2m_mou_7',	'loc_og_t2m_mou_8',	'loc_og_t2f_mou_6',	'loc_og_t2f_mou_7',	'loc_og_t2f_mou_8',	'loc_og_t2c_mou_6',	'loc_og_t2c_mou_7',	'loc_og_t2c_mou_8',	'loc_og_mou_6',	'loc_og_mou_7',	'loc_og_mou_8',	'std_og_t2t_mou_6',	'std_og_t2t_mou_7',	'std_og_t2t_mou_8',	'std_og_t2m_mou_6',	'std_og_t2m_mou_7',	'std_og_t2m_mou_8',	'std_og_t2f_mou_6',	'std_og_t2f_mou_7',	'std_og_t2f_mou_8',	'std_og_mou_6',	'std_og_mou_7',	'std_og_mou_8',	'isd_og_mou_6',	'isd_og_mou_7',	'isd_og_mou_8',	'spl_og_mou_6',	'spl_og_mou_7',	'spl_og_mou_8',	'og_others_6',	'og_others_7',	'og_others_8',	'total_og_mou_6',	'total_og_mou_7',	'total_og_mou_8',	'loc_ic_t2t_mou_6',	'loc_ic_t2t_mou_7',	'loc_ic_t2t_mou_8',	'loc_ic_t2m_mou_6',	'loc_ic_t2m_mou_7',	'loc_ic_t2m_mou_8',	'loc_ic_t2f_mou_6',	'loc_ic_t2f_mou_7',	'loc_ic_t2f_mou_8',	'loc_ic_mou_6',	'loc_ic_mou_7',	'loc_ic_mou_8',	'std_ic_t2t_mou_6',	'std_ic_t2t_mou_7',	'std_ic_t2t_mou_8',	'std_ic_t2m_mou_6',	'std_ic_t2m_mou_7',	'std_ic_t2m_mou_8',	'std_ic_t2f_mou_6',	'std_ic_t2f_mou_7',	'std_ic_t2f_mou_8',	'std_ic_mou_6',	'std_ic_mou_7',	'std_ic_mou_8',	'total_ic_mou_6',	'total_ic_mou_7',	'total_ic_mou_8',	'spl_ic_mou_6',	'spl_ic_mou_7',	'spl_ic_mou_8',	'isd_ic_mou_6',	'isd_ic_mou_7',	'isd_ic_mou_8',	'ic_others_6',	'ic_others_7',	'ic_others_8',	'total_rech_num_6',	'total_rech_num_7',	'total_rech_num_8',	'total_rech_amt_6',	'total_rech_amt_7',	'total_rech_amt_8',	'max_rech_amt_6',	'max_rech_amt_7',	'max_rech_amt_8',	'last_day_rch_amt_6',	'last_day_rch_amt_7',	'last_day_rch_amt_8',	'total_rech_data_6',	'total_rech_data_7',	'total_rech_data_8',	'max_rech_data_6',	'max_rech_data_7',	'max_rech_data_8',	'count_rech_2g_6',	'count_rech_2g_7',	
'count_rech_2g_8',	'count_rech_3g_6',	'count_rech_3g_7',	'count_rech_3g_8',	'av_rech_amt_data_6',	'av_rech_amt_data_7',	'av_rech_amt_data_8',	'vol_2g_mb_6',	'vol_2g_mb_7',	'vol_2g_mb_8',	'vol_3g_mb_6',	'vol_3g_mb_7',	'vol_3g_mb_8',	'aon',	'aug_vbc_3g',	'jul_vbc_3g',	'jun_vbc_3g',	'sep_vbc_3g',	'arpu_var',	'onnet_mou_var',	'offnet_mou_var',	'roam_ic_mou_var',	'roam_og_mou_var',	'loc_og_mou_var',	'std_og_mou_var',	'isd_og_mou_var',	'spl_og_mou_var',	'total_og_mou_var',	'loc_ic_mou_var',	'std_ic_mou_var',	'isd_ic_mou_var',	'spl_ic_mou_var',	'total_ic_mou_var',	'total_rech_num_var',	'total_rech_amt_var',	'max_rech_amt_var',	'total_rech_data_var',	'max_rech_data_var',	'av_rech_amt_data_var',	'vol_2g_mb_var',	'vol_3g_mb_var'
]

In [None]:
for col in columns:
    Q98= telecom_df[col].quantile(0.98)
    print(col +' :', Q98,len(telecom_df[telecom_df[col] >Q98]))
    telecom_df[col] = telecom_df[col].apply(lambda x: x if x <Q98 else Q98)

#### Split data into training and testing dataset.

We will use sklearn's train_test_split function.

In [None]:
# stratify will ensure we will have same proportion of churned data in the split
X = telecom_df.drop("churned", axis = 1)
y = telecom_df.churned
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 45, stratify = y)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Random Forest 

In [None]:

# The class weight is used to handle class imbalance - it adjusts the cost function
forest = RandomForestClassifier(class_weight={0:0.1, 1: 0.9}, n_jobs = -1)

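# hyperparameter space
# ('sqrt' replaces the older 'auto' option, which meant sqrt for classifiers and was removed in scikit-learn 1.3)
params = {"criterion": ['gini', 'entropy'], "max_features": ['sqrt', 0.4]}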

# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)

# create gridsearch object
rf_model = GridSearchCV(estimator=forest, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
# fit the model 
rf_model.fit(X_train, y_train)

In [None]:
# Print model with best parameters 
print("Best AUC: ", rf_model.best_score_)
print("Best hyperparameters: ", rf_model.best_params_)

AUC.append(rf_model.best_score_)
Param.append(rf_model.best_params_)

In [None]:
# Make prediction and then calculate the accuracy of model
y_test_pred = rf_model.predict(X_test)

In [None]:
confusion_matrix(y_test, y_test_pred)

In [None]:
# Evaluate the model on predicted data
confusion_score = confusion_matrix(y_test, y_test_pred,  labels=[1,0])
print(confusion_score)

# check sensitivity and specificity
total=sum(sum(confusion_score))

accuracy=(confusion_score[0,0] + confusion_score[1,1])/total
print ('Accuracy : ', accuracy)

sensitivity = confusion_score[0,0]/(confusion_score[0,0] + confusion_score[0,1])
print('Sensitivity : ', sensitivity )

specificity = confusion_score[1,1]/(confusion_score[1,0] + confusion_score[1,1])
print('Specificity : ', specificity)

ModelName.append('Random Forest')
Accuracy.append(accuracy)
Sensitivity.append(sensitivity)
Specificity.append(specificity)


print('Specificity is very bad.')

In [None]:
#AUC
# check area under curve
y_pred_prob = rf_model.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

ROC.append(round(roc_auc_score(y_test, y_pred_prob),2))

#### Hyperparameter selection for the model
- max_features: the number of variables to consider when splitting each node. This hyperparameter controls the split-variable randomization feature of random forests. For classification problems we use the square root of the number of features.

- Number of estimators: the general practice is to have a large number of trees in the forest to stabilize the model. We will start with 5 times max_features.

In [None]:
# Build a model with the hyperparameters chosen above and evaluate it.
# run a random forest model on train data
max_features = int(round(np.sqrt(X_train.shape[1])))    # square root of the number of features
print(max_features)
random_forest_model = RandomForestClassifier(n_estimators=max_features*5, max_features=max_features, class_weight={0:0.1, 1: 0.9}, oob_score=True, random_state=45, verbose=1)

In [None]:
# fit model
random_forest_model.fit(X_train, y_train)

In [None]:
#Model evaluation
# OOB score
print('OOB score : ', random_forest_model.oob_score_)

# predict churn on test data
y_test_pred = random_forest_model.predict(X_test)

cm = confusion_matrix(y_test, y_test_pred,  labels=[1,0])
print(cm)

# check sensitivity and specificity
total=sum(sum(cm))

accuracy=(cm[0,0] + cm[1,1])/total  #(TP+TN)/Total Pred
print ('Accuracy : ', accuracy)

sensitivity = cm[0,0]/(cm[0,0] + cm[0,1]) 
print('Sensitivity : ', sensitivity )

specificity = cm[1,1]/(cm[1,0] + cm[1,1])
print('Specificity : ', specificity)

# check area under curve
y_pred_prob = random_forest_model.predict_proba(X_test)[:, 1]
print("ROC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

ModelName.append('Random Forest 2')
Accuracy.append(accuracy)
Sensitivity.append(sensitivity)
Specificity.append(specificity)
ROC.append(round(roc_auc_score(y_test, y_pred_prob),2))
# note: the OOB score is an accuracy estimate, not an AUC; it is stored here to keep the lists aligned
AUC.append(random_forest_model.oob_score_)
Param.append('NaN')

#### Feature Importance

Feature importance for the random forest model 

In [None]:
# predictors
features = X_train.columns

In [None]:
# feature_importance
importance = random_forest_model.feature_importances_

In [None]:
# Create data frame 
var_imp_df = pd.DataFrame({'variables':features, 'imp_per':importance*100})
var_imp_df.head()

In [None]:
var_imp_df = var_imp_df.sort_values('imp_per', ascending=False).reset_index(drop=True)
var_imp_df

In [None]:
# cumulative importance (%) of the top 60 features
var_imp_df['imp_per'].iloc[0:60].sum()

### Build logistic regression model using top 60 features.

The top 60 features account for about 89% of the total feature importance in the random forest model. We will use these top 60 features to build the logistic regression model.
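
The cut-off can also be found from the importances directly rather than by trial and error. The sketch below counts how many top-ranked features are needed to reach a target share of total importance. The helper `features_for_share` and the toy numbers are illustrative assumptions, not values from this notebook.

```python
def features_for_share(importances, target_share):
    """Count the top-ranked features needed to reach target_share of total importance."""
    ranked = sorted(importances, reverse=True)
    total = sum(ranked)
    running = 0.0
    for count, value in enumerate(ranked, start=1):
        running += value
        if running / total >= target_share:
            return count
    return len(ranked)

# made-up importances (in percent), for illustration only
toy_importances = [40.0, 5.0, 25.0, 10.0, 20.0]
print(features_for_share(toy_importances, 0.80))  # 3
```

With the data frame built above, the same idea could be applied to the `imp_per` column of `var_imp_df`.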

In [None]:
top_60_variables = var_imp_df.variables[0:60]  # slice end is exclusive, so 0:60 keeps 60 features

In [None]:
plt.rcParams["figure.figsize"] =(15,15)
corr = sns.diverging_palette(199, 359, s=99, center="light", as_cmap=True)
sns.heatmap(data=X_train[top_60_variables].corr(), center=0.0, cmap=corr)

In [None]:
# train and test dataset
X_train_top_60 = X_train[top_60_variables]
X_test_top_60 = X_test[top_60_variables]

print(X_train_top_60.shape)

<b>Pipeline</b> 
A pipeline sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be transforms, that is, they must implement `fit` and `transform` methods. The final estimator only needs to implement `fit`.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
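
To make the fit/transform contract concrete, here is a toy sketch in plain Python. It is not scikit-learn's implementation; the classes `MeanCenterer`, `SignClassifier` and `MiniPipeline` are invented for this illustration. The point it shows is that transforms learn their parameters during `fit` and reuse them unchanged at prediction time.

```python
class MeanCenterer:
    """Toy transform: subtracts the mean learned during fit."""
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy final estimator: predicts 1 for positive inputs, else 0."""
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class MiniPipeline:
    """Chains (name, step) pairs: transforms first, estimator last."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        # fit each transform on the training data, then pass its output on
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # reuse what was learned in fit; nothing is re-fitted on new data
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = MiniPipeline([("center", MeanCenterer()), ("sign", SignClassifier())])
pipe.fit([1.0, 2.0, 3.0, 6.0])        # training mean is 3.0
print(pipe.predict([0.0, 3.0, 5.0]))  # [0, 0, 1]
```

Because the whole chain is fitted inside each cross-validation fold, the scaler never sees the held-out fold, which is the main reason to tune the pipeline as a unit.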

In [None]:
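# logistic regression
# liblinear supports both the 'l1' and 'l2' penalties searched below
# (the default lbfgs solver does not support 'l1')
steps = [('scaler', StandardScaler()), 
         ("logistic", LogisticRegression(class_weight={0:0.1, 1:0.9}, solver='liblinear'))
        ]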

# initiate pipeline
log_regression = Pipeline(steps)

# hyperparameter space
params = {'logistic__C': [0.001, 0.01, 0.1, 0.5, 1, 2, 3, 4, 5, 10, 50, 100], 'logistic__penalty': ['l1', 'l2']}

# create 5 validation sets
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 45)

# create gridsearch object
ensamble = GridSearchCV(estimator=log_regression, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
# fit model
ensamble.fit(X_train_top_60, y_train)

In [None]:
print("Best AUC: ", ensamble.best_score_)
print("Best hyperparameters: ", ensamble.best_params_)

In [None]:
# predict churn on test data
y_pred_test = ensamble.predict(X_test_top_60)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred_test,  labels=[1,0])
print(cm)
# check sensitivity and specificity
total=sum(sum(cm))

accuracy=(cm[0,0] + cm[1,1])/total  # (TP+TN) / total predictions
print ('Accuracy : ', accuracy)

sensitivity = cm[0,0]/(cm[0,0] + cm[0,1]) 
print('Sensitivity : ', sensitivity )

specificity = cm[1,1]/(cm[1,0] + cm[1,1])
print('Specificity : ', specificity)

# check area under curve
y_pred_prob = ensamble.predict_proba(X_test_top_60)[:, 1]
print("ROC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

ModelName.append('Logistic Regression with Top 60 Param from RF')
Accuracy.append(accuracy)
Sensitivity.append(sensitivity)
Specificity.append(specificity)
ROC.append(round(roc_auc_score(y_test, y_pred_prob),2))
AUC.append(ensamble.best_score_)
Param.append(ensamble.best_params_)

print('Decent model with good accuracy and specificity. ROC score is also good.')

In [None]:
ensamble.best_estimator_

In [None]:
logistic_model = ensamble.best_estimator_.named_steps['logistic']
logistic_model.intercept_

In [None]:
logistic_model.coef_

In [None]:
log_df = pd.DataFrame(logistic_model.coef_, columns=list(X_test_top_60)).T.reset_index()
log_df.columns = ['Feature','Coefficient']
log_df.sort_values(by='Coefficient', ascending=False)

In [None]:
# alternative imputation approaches, left commented out because they took too long to run

## using fast knn for imputation
#from impyute.imputation.cs import fast_knn

#imputed_df = fast_knn(churn_df[missing_values_columns].values, k=2)

## 1st install impyute
#from impyute.imputation.cs import mice

#imputed_df = mice(churn_df.values)

### Splitting the data

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# check that stratification kept the target class proportions the same in train and test

print(y_train.value_counts(normalize=True))
print('\n')
print(y_test.value_counts(normalize=True))

In [None]:
# # reshaping the dependent variable df

# y_train = y_train.values.reshape(-1,1)
# y_test = y_test.values.reshape(-1,1)

# print(y_train.shape)
# print(y_test.shape)

###  Scaling the data for model with good prediction

In [None]:
#  Standardizing the data
scaler = StandardScaler()
#scaler = MinMaxScaler()

# transform train data
X_train_scaled = scaler.fit_transform(X_train)
print(X_train_scaled[:5,:10])
print('\n')
# transform test data
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled[:5,:10])

### 5. PCA

#### 1. Trying to fit and check the variance using PCA

In [None]:
# initialising the PCA
pca_1 = PCA(random_state=10)

# fitting the PCA
pca_1.fit(X_train_scaled)

pca_1.components_

In [None]:
# checking the variance ratio
pca_1.explained_variance_ratio_

In [None]:
# taking cumulative sum
variance_cumsum = np.cumsum(pca_1.explained_variance_ratio_)

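# plotting the cumulative explained variance against the number of components
plt.figure(figsize=(10,8))
plt.plot(range(1, len(variance_cumsum) + 1), variance_cumsum)
plt.title('Cumulative Explained Variance\n')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()

# it seems that 90-95% of the variance is explained by fewer than 90 components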

In [None]:
# look at the cumulative explained variance (%) of the PCA components
print(pd.Series(np.round(pca_1.explained_variance_ratio_.cumsum(), 4)*100))

In [None]:
print('90% variance is explained by 57 components.')
print('95% variance is explained by 73 components.')
print('98% variance is explained by 89 components.')
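
Rather than reading the counts off the printed series, the number of components for a given threshold can be computed. The sketch below does this with toy ratios; the helper `components_for_variance` and the ratio values are assumptions made for illustration, not output from this notebook.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    running = accumulate(explained_variance_ratio)
    for count, cumulative in enumerate(running, start=1):
        # small tolerance guards against floating-point rounding in the sum
        if cumulative >= threshold - 1e-12:
            return count
    return None  # threshold is never reached

# made-up ratios that sum to 1.0, for illustration only
toy_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
for threshold in (0.75, 0.9, 1.0):
    print(threshold, components_for_variance(toy_ratios, threshold))
```

In the notebook itself, `pca_1.explained_variance_ratio_` could be passed in place of `toy_ratios`.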

#### 2. Let's fit the data using Incremental PCA

In [None]:
# passing a float to PCA keeps just enough components to explain 95% of the variance

pca_2_unsup = PCA(0.95, random_state=10)

X_train_pca2 = pca_2_unsup.fit_transform(X_train_scaled)

X_train_pca2.shape

In [None]:
# using IncrementalPCA to fit and transform the data
pca_inc = IncrementalPCA(X_train_pca2.shape[1])

# fitting the pca on train data
X_train_pca = pca_inc.fit_transform(X_train_scaled)
print(X_train_pca.shape)

# transforming the test data
X_test_pca = pca_inc.transform(X_test_scaled)
print(X_test_pca.shape)

In [None]:
# creating correlation matrix
corr_mat = np.corrcoef(X_train_pca.transpose())

# we still have many components, but the heatmap shows that they are now uncorrelated
plt.figure(figsize=(18,10))
sns.heatmap(corr_mat)
plt.show()

###  Random Forest Modelling

In [None]:
# random forest classifier model
rf_model = RandomForestClassifier(random_state=10, n_jobs=-1, class_weight={0:1, 1: 15})

# parameter grid for hyperparameter tuning
# each list holds a single value, chosen after earlier runs over wider ranges,
# so that fitting does not take too long

params = {
    'max_depth' : [30],
    'min_samples_leaf' : [20],
    'min_samples_split' : [20],
    'max_features' : [35],
    'n_estimators' : [100]
}


# using GridSearchCV for hyper parameter tuning
grid_model = GridSearchCV(estimator=rf_model, param_grid=params, verbose=1, cv=4, n_jobs=-1, scoring="roc_auc")

In [None]:
# fitting the model
grid_model.fit(X_train_pca, y_train)

# Print model with best parameters 
print("Best AUC: ", grid_model.best_score_)
print("Best hyperparameters: ", grid_model.best_params_)

# best model
rf_best = grid_model.best_estimator_

rf_best

In [None]:
# rf_best = RandomForestClassifier(random_state=10, class_weight={0:1, 1: 15},
#                                  max_depth= 30,max_features= 35, min_samples_leaf= 20,min_samples_split= 20,
#                                  n_estimators= 100)
# rf_best.fit(X_train_pca, y_train)

In [None]:
# Make prediction on train data and then calculate the accuracy of model
y_train_pred = rf_best.predict(X_train_pca)

# Evaluate the model on predicted data
confusion_score = confusion_matrix(y_train, y_train_pred)
print('Train Confusion Matrix: \n', confusion_score)

# Confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()
print(f'Train Confusion metrics values: {tn, fp, fn, tp}')

# Area under curve, computed from predicted churn probabilities rather than hard labels
print(f'Train Area under the curve: {roc_auc_score(y_train, rf_best.predict_proba(X_train_pca)[:, 1])}')

# check sensitivity and specificity
total=sum(sum(confusion_score))

accuracy= accuracy_score(y_train, y_train_pred)
print ('Train Accuracy : ', accuracy)

sensitivity = tp/(tp+fn)
print('Train Sensitivity : ', sensitivity )

specificity = tn/(tn+fp)
print('Train Specificity : ', specificity)



In [None]:
# Make prediction on test data and then calculate the accuracy of model
y_test_pred = rf_best.predict(X_test_pca)

# Evaluate the model on predicted data
confusion_score = confusion_matrix(y_test, y_test_pred)
print('Test Confusion Matrix: \n', confusion_score)

# Confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
print(f'Test Confusion metrics values: {tn, fp, fn, tp}')

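# area under curve, computed from predicted churn probabilities rather than hard labels
print(f'Test Area under the curve: {roc_auc_score(y_test, rf_best.predict_proba(X_test_pca)[:, 1])}')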

# check sensitivity and specificity
total=sum(sum(confusion_score))

accuracy= accuracy_score(y_test, y_test_pred)
print ('Test Accuracy : ', accuracy)

sensitivity = tp/(tp+fn)
print('Test Sensitivity : ', sensitivity )

specificity = tn/(tn+fp)
print('Test Specificity : ', specificity)

ModelName.append('Random Forest with PCA')
Accuracy.append(accuracy)
Sensitivity.append(sensitivity)
Specificity.append(specificity)
ROC.append(round(roc_auc_score(y_test, rf_best.predict_proba(X_test_pca)[:, 1]),2))


In [None]:
AUC.append(np.nan)
Param.append(np.nan)

 Poor sensitivity score
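
For reference, the sketch below shows how the three reported rates follow from the four confusion-matrix counts, and why accuracy can look healthy while sensitivity is poor on imbalanced data. The helper `classification_rates` and the counts are made up for illustration and are not results from this notebook.

```python
def classification_rates(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,     # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),     # share of actual churners that are caught
        "specificity": tn / (tn + fp),     # share of non-churners correctly cleared
    }

# made-up counts: 950 non-churners, 50 churners
rates = classification_rates(tn=900, fp=50, fn=30, tp=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.92 although only 20 of the 50 churners are detected, so sensitivity is the number to watch when the goal is catching churn.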

In [None]:
# feature importance, but these are pca derived 
rf_best.feature_importances_

In [None]:
# showing sample tree
dot_data = StringIO()

export_graphviz(rf_best.estimators_[0] , out_file=dot_data, filled=True, rounded=True, 
               class_names=['Not Churn', 'Churn'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())

### Logistic Regression with the PCA

We will fit a logistic regression model on the principal components produced by PCA.

In [None]:
PCA_VARS = 75
steps = [
        ('scaler', StandardScaler()),
        ("pca", PCA(n_components=PCA_VARS)),
         ("logistic", LogisticRegression(class_weight='balanced'))
        ]
pipeline = Pipeline(steps)

# fit model
pipeline.fit(X_train, y_train)

# check score on train data
pipeline.score(X_train, y_train)

In [None]:
# predict churn on test data
y_pred = pipeline.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred,  labels=[1,0])
print(cm)
# check sensitivity and specificity
total=sum(sum(cm))

accuracy=(cm[0,0] + cm[1,1])/total  # (TP+TN) / total predictions
print ('Accuracy : ', accuracy)

sensitivity = cm[0,0]/(cm[0,0] + cm[0,1]) 
print('Sensitivity : ', sensitivity )

specificity = cm[1,1]/(cm[1,0] + cm[1,1])
print('Specificity : ', specificity)

# check area under curve
y_pred_prob = pipeline.predict_proba(X_test)[:, 1]
print("ROC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

ModelName.append('Logistic Regression with 75 PCA')
Accuracy.append(accuracy)
Sensitivity.append(sensitivity)
Specificity.append(specificity)
ROC.append(round(roc_auc_score(y_test, y_pred_prob),2))
AUC.append(np.nan)
Param.append(np.nan)

print('Decent model with good accuracy and specificity. ROC score is also good.')

### Hyperparameter tuning for the PCA with Logistic Regression

In [None]:
# PCA
pca = PCA()

# liblinear supports both the 'l1' and 'l2' penalties in the grid below
logistic = LogisticRegression(class_weight={0:0.1, 1: 0.9}, solver='liblinear')

# create pipeline
steps = [("scaler", StandardScaler()), 
         ("pca", pca),
         ("logistic", logistic)
        ]

# compile pipeline
pca_logistic = Pipeline(steps)

# hyperparameter space
params = {'pca__n_components': [55, 75, 85], 'logistic__C': [0.1, 0.5, 1, 2, 3, 4, 5, 10], 'logistic__penalty': ['l1', 'l2']}

# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 45)

# create gridsearch object
model = GridSearchCV(estimator=pca_logistic, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
model.fit(X_train, y_train)

In [None]:
# cross validation results
pd.DataFrame(model.cv_results_)

In [None]:
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)

In [None]:
# predict churn on test data
y_pred = model.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred,  labels=[1,0])
print(cm)
# check sensitivity and specificity
total=sum(sum(cm))

accuracy=(cm[0,0] + cm[1,1])/total  # (TP+TN) / total predictions
print ('Accuracy : ', accuracy)

sensitivity = cm[0,0]/(cm[0,0] + cm[0,1]) 
print('Sensitivity : ', sensitivity )

specificity = cm[1,1]/(cm[1,0] + cm[1,1])
print('Specificity : ', specificity)

# check area under curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

ModelName.append('Logistic Regression with 85 PCA')
Accuracy.append(accuracy)
Sensitivity.append(sensitivity)
Specificity.append(specificity)
ROC.append(round(roc_auc_score(y_test, y_pred_prob),2))
AUC.append(model.best_score_)
Param.append(model.best_params_)

## 5. Model Validation

<b> We built 5 models in this exercise. </b>
- 1. Random Forest for prediction as well as for feature importance.
- 2. Logistic Regression based on the important features from the random forest ensemble.
- 3. PCA for dimensionality reduction.
- 4. Random Forest with PCA for prediction.
- 5. Logistic Regression with PCA for prediction.

In [None]:
results = pd.DataFrame({'Model Name':ModelName, 'Accuracy':Accuracy,'Sensitivity':Sensitivity,'Specificity':Specificity,'ROC':ROC,'AUC':AUC,'Parameter':Param})
results

-  Logistic regression identifies churn well, with good accuracy, sensitivity and specificity scores.
     This is the algorithm that would be used for prediction.
-  PCA is a simple and quick way to condense the features into a smaller set of components. However, the important features extracted from the random forest ensemble also explain the feature selection well, with an importance weight for each feature.

## 6. Recommended strategies to manage customer churn 

1. The company should offer better roaming packages to customers.
2. Total incoming and outgoing calls have a strong influence on customer retention. A better calling experience and better packages would have an impact on the churn rate.
3. Timely redressal of grievances by the company would help in retaining customers.