# Telecom Churn - Case Study
##### By: Kirti Gupta & Debayan Talapatra

###### Business Problem Statement
In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal.
To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.


##### Defining Churn
There are two main models of payment in the telecom industry - **postpaid** (customers pay a monthly/annual bill after using the services) and **prepaid** (customers pay/recharge with a certain amount in advance and then use the services).

In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn.

However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again).

Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and southeast Asia, while postpaid is more common in Europe and North America.

###### Definitions of Churn
There are various ways to define churn, such as:

###### Revenue-based churn: 
Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’.

 
The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.

###### High-value Churn
In the Indian and the southeast Asian market, approximately 80% of revenue comes from the top 20% customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage.
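
The revenue-concentration claim above can be checked on any revenue column. The following is a minimal sketch on made-up numbers; `top_share` and `toy_revenue` are hypothetical names used only for illustration and are not part of the dataset or the notebook.

```python
import pandas as pd

def top_share(revenue: pd.Series, top_frac: float = 0.2) -> float:
    """Fraction of total revenue contributed by the top `top_frac` of customers."""
    ordered = revenue.sort_values(ascending=False)
    n_top = max(1, int(round(len(ordered) * top_frac)))
    return ordered.iloc[:n_top].sum() / ordered.sum()

# Toy revenue figures (invented): ten customers, one of them very large
toy_revenue = pd.Series([800, 50, 40, 30, 20, 20, 15, 10, 10, 5])
share = top_share(toy_revenue)  # share of revenue from the top 2 customers
print("Top 20% revenue share:", share)
```

A share close to 0.8 on real data would be consistent with the 80/20 pattern described above.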


###### Business Objective 
The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.

#### DataSet
The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively.

**Dataset name:** telecom_churn_data.csv



###### Understanding Customer Behaviour During Churn
Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of customer lifecycle :

The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.

The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.)

The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.

 

In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
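
The month-to-phase mapping described above can be written down explicitly. This is only a sketch: `PHASE_BY_MONTH` and `columns_for_phase` are hypothetical helpers introduced here, and the column names in `toy_columns` merely follow the `_6` to `_9` suffix convention of the dataset.

```python
# Month number -> lifecycle phase, as described in the text
PHASE_BY_MONTH = {6: "good", 7: "good", 8: "action", 9: "churn"}

def columns_for_phase(columns, phase):
    """Return the column names whose month suffix (_6.._9) belongs to `phase`."""
    suffixes = tuple(f"_{m}" for m, p in PHASE_BY_MONTH.items() if p == phase)
    return [c for c in columns if c.endswith(suffixes)]

toy_columns = ["arpu_6", "arpu_7", "arpu_8", "arpu_9", "aon"]
good_cols = columns_for_phase(toy_columns, "good")    # June and July columns
churn_cols = columns_for_phase(toy_columns, "churn")  # September columns
print(good_cols, churn_cols)
```

Columns returned for the `"churn"` phase are the ones that must be discarded after the target has been tagged.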

# Data Reading and Understanding

In [1]:
# Import required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# reading dataset
churn= pd.read_csv('telecom_churn_data.csv')
churn.head()

Unnamed: 0,mobile_number,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,last_date_of_month_9,arpu_6,...,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g
0,7000842753,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,197.385,...,0,1.0,1.0,1.0,,968,30.4,0.0,101.2,3.58
1,7001865778,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,34.047,...,0,,1.0,1.0,,1006,0.0,0.0,0.0,0.0
2,7001625959,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,167.69,...,0,,,,1.0,1103,0.0,0.0,4.17,0.0
3,7001204172,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,221.338,...,0,,,,,2491,0.0,0.0,0.0,0.0
4,7000142493,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,261.636,...,0,0.0,,,,1526,0.0,0.0,0.0,0.0


In [3]:
#Check Shape
churn.shape

(99999, 226)

In [None]:
#Check DataType
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Columns: 226 entries, mobile_number to sep_vbc_3g
dtypes: float64(179), int64(35), object(12)
memory usage: 172.4+ MB


In [None]:
#Standard Calculations
churn.describe()

In [None]:
print ("Total Features %d "% (churn.shape[1]))
print ("Unique customers: %d"%len(churn.mobile_number.unique()))

In [None]:
#columns name
pd.DataFrame(churn.columns)

##  Data Cleaning

##### Function to get missing/nan values

In [None]:
def get_nan_values(nanCutoff):
    # argument: nanCutoff:- % threshold for missing/nan values
    nan_values = round(100*(churn.isnull().sum()/churn.shape[0]))
    return nan_values.loc[nan_values > nanCutoff]
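
To see what this kind of check returns, here is a small sketch on a toy frame. Unlike the function above, the hypothetical `missing_pct` takes the DataFrame as an argument instead of reading the global `churn`; that difference is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def missing_pct(df, cutoff):
    """Percentage of missing values per column, keeping columns above `cutoff`."""
    pct = (100 * df.isnull().mean()).round()
    return pct[pct > cutoff]

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan],  # 75% missing
    "b": [1, 2, 3, np.nan],            # 25% missing
    "c": [1, 2, 3, 4],                 # complete
})
result = missing_pct(toy, 50)
print(result)
```

Only column `a` exceeds the 50% cutoff here, so it is the only one returned.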

##### Function to impute missing/nan values

In [None]:
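def impute_nan_values(data,imputedList=False,nan_list=False):
    # argument: imputedList, list of monthly feature prefixes for which nan is to be replaced with 0
    # argument: nan_list, list of full column names for which nan is to be replaced with 0
    if imputedList:
        for col in [y + s for s in ['_6','_7','_8','_9'] for y in imputedList]:
            # assign back instead of calling fillna(inplace=True) on a column selection
            data[col] = data[col].fillna(0)
    else:    
        for col in nan_list:
            data[col] = data[col].fillna(0)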

##### Handling missing values/Entries
##### check  missing values

In [None]:
# missing/nan values more than 50%
get_nan_values(50)

Out of these 40 features, some are required for the data analysis. We can impute these values for now.

In [None]:
# 'av_rech_amt_data', 'arpu_2g', 'arpu_3g', 'count_rech_2g', 'count_rech_3g',
# 'max_rech_data', 'total_rech_data', 'fb_user', 'night_pck_user'
# features are important for identifying the high-value customers,
# so impute their missing values with 0

impute_highValueCols = ['av_rech_amt_data', 'arpu_2g', 'arpu_3g', 'count_rech_2g', 'count_rech_3g',
             'max_rech_data', 'total_rech_data','fb_user','night_pck_user']
impute_nan_values(churn,impute_highValueCols)

In [None]:
get_nan_values(50)

In [None]:
# dropping rest of the columns having more than 50% missing values
nan_columns = list(get_nan_values(50).index)
churn.drop(nan_columns,axis=1,inplace=True)


In [None]:
churn.shape

In [None]:
# missing/nan values more than 30%
get_nan_values(30) 

In [None]:
# missing/nan values more than 10%
get_nan_values(10) 

In [None]:
# missing/nan values more than 5%
get_nan_values(5) 

From the output above: all features for September (month 9) have missing values.

In [None]:
# Columns/features which have more than 5% missing values
nan_columns = list(get_nan_values(5).index)
print(nan_columns)
churn[churn[nan_columns].isnull().all(axis=1)][nan_columns].head()

##### The above features can be imputed with 0

In [None]:
impute_nan_values(churn,nan_list=nan_columns)

In [None]:
churn=churn[~churn[nan_columns].isnull().all(axis=1)]
churn.shape

In [None]:
# missing/nan values more than 2%
get_nan_values(2)

In [None]:
# Column/Features which have more than 2% missing value
nan_columns = list(get_nan_values(2).index)
print (nan_columns)

churn[churn[nan_columns].isnull().all(axis=1)][nan_columns].head()

##### drop these customers

In [None]:
churn=churn[~churn[nan_columns].isnull().all(axis=1)]
churn.shape

In [None]:
# For other customers where we have missing values, impute them with 0. 

nan_columns.remove('date_of_last_rech_8')
nan_columns.remove('date_of_last_rech_9')
impute_nan_values(churn,nan_list=nan_columns)

In [None]:
# Column/Features which have more than 0% missing values
get_nan_values(0)

In [None]:
#Check above features
nan_columns = ['loc_og_t2o_mou','std_og_t2o_mou','loc_ic_t2o_mou','last_date_of_month_7','last_date_of_month_8','last_date_of_month_9']
for c in nan_columns:
    print(churn[c].value_counts())
    # assign back instead of calling fillna(inplace=True) on a column selection
    churn[c] = churn[c].fillna(churn[c].mode()[0])


##### The above features have a single value; their missing values are imputed with the mode


In [None]:
# Column/Features which have more than 0% missing value
get_nan_values(0)

In [None]:
#number of rows that has null values
nan_columns = list(get_nan_values(0).index)
len(churn[churn[nan_columns].isnull().all(axis=1)])

In [None]:
# Use .loc so the assignment modifies churn itself; chained indexing would only modify a temporary copy
churn.loc[churn['date_of_last_rech_6'].isnull(), 'date_of_last_rech_6'] = '6/30/2014'
churn.loc[churn['date_of_last_rech_7'].isnull(), 'date_of_last_rech_7'] = '7/31/2014'
churn.loc[churn['date_of_last_rech_8'].isnull(), 'date_of_last_rech_8'] = '8/31/2014'
churn.loc[churn['date_of_last_rech_9'].isnull(), 'date_of_last_rech_9'] = '9/30/2014'

<br><br>columns with 0 (as values).

In [None]:
zero_columns=churn.columns[(churn == 0).all()]
zero_columns

In [None]:
# drop columns which have single value '0'. 
churn.drop(zero_columns,axis=1,inplace=True)

In [None]:
# Percentage of data after Data cleaning.
print("after Data cleaning :% of data {}%".format(round(churn.shape[0]/99999 *100,2)))
churn.shape

##### check  data types 

In [None]:
churn.reset_index(inplace=True,drop=True)
# date columns filter
date_columns = list(churn.filter(regex='date').columns)
date_columns

In [None]:
# Converting dtype of date columns to datetime
for col in date_columns:
    churn[col] = pd.to_datetime(churn[col], format='%m/%d/%Y')

In [None]:
churn.info()

##### monthly features which are not in the standard naming (\_6,\_7,\_8,\_9)

In [None]:
# renaming columns,
#'jun_vbc_3g' : 'jun_vbc_3g_6'
#'jul_vbc_3g' : 'july_vbc_3g_7'
#'aug_vbc_3g' : 'aug_vbc_3g_8'
#'sep_vbc_3g' : 'sep_vbc_3g_9'
churn.rename(columns={'jun_vbc_3g' : 'jun_vbc_3g_6', 'jul_vbc_3g' : 'july_vbc_3g_7', 'aug_vbc_3g' : 'aug_vbc_3g_8',
                      'sep_vbc_3g' : 'sep_vbc_3g_9'}, inplace=True)

**Derived Variables for** 'vol_data_mb_6', 'vol_data_mb_7', 'vol_data_mb_8', 'vol_data_mb_9'

These will store the total data volume (= vol_2g_mb_* + vol_3g_mb_*) monthwise.

In [None]:
#Derived Variables for: 'vol_data_mb_6', 'vol_data_mb_7', 'vol_data_mb_8', 'vol_data_mb_9',
for i in range(6,10):
    churn['vol_data_mb_'+str(i)] = (churn['vol_2g_mb_'+str(i)]+churn['vol_3g_mb_'+str(i)]).astype(int)

###### Filter high-value customers
High-value customers are those whose average recharge amount in the first two months (the good phase) is at or above the 70th percentile.
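
The percentile filter can be sketched on toy data as follows. The recharge values are invented, and only the column names `total_month_rech_6` and `total_month_rech_7` mirror the derived columns built later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy monthly recharge totals for ten customers (invented values)
toy = pd.DataFrame({
    "total_month_rech_6": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "total_month_rech_7": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
})

# Average recharge over the good phase (June and July)
good_phase_avg = (toy["total_month_rech_6"] + toy["total_month_rech_7"]) / 2

# Customers at or above the 70th percentile are treated as high-value
cutoff = np.percentile(good_phase_avg, 70)
high_value = toy[good_phase_avg >= cutoff]
print(cutoff, len(high_value))
```

Because the comparison is `>=`, roughly the top 30% of customers by good-phase recharge are retained.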

In [None]:
recharge_col = churn.filter(regex=('count')).columns
churn[recharge_col].head()

**Derived Variables for** avg_rech_amt_6,avg_rech_amt_7,avg_rech_amt_8,avg_rech_amt_9
##### average recharge value month wise

In [None]:
# Derived Variables for: avg_rech_amt_6,avg_rech_amt_7,avg_rech_amt_8,avg_rech_amt_9
for i in range(6,10):
    # +1 in the denominator avoids division by zero when there were no recharges (assumed intent)
    churn['avg_rech_amt_'+str(i)] = round(churn['total_rech_amt_'+str(i)]/(churn['total_rech_num_'+str(i)]+1),2)

In [None]:
impute_nan_values(churn,nan_list=['avg_rech_amt_6','avg_rech_amt_7','avg_rech_amt_8','avg_rech_amt_9'])

**Derived Variables for** total_rech_num_data_6,total_rech_num_data_7,total_rech_num_data_8,total_rech_num_data_9

##### total number of data recharge month wise.

In [None]:
#Derived Variables for total_rech_num_data_6,total_rech_num_data_7,total_rech_num_data_8,total_rech_num_data_9
for i in range(6,10):
    churn['total_rech_num_data_'+str(i)] = (churn['count_rech_2g_'+str(i)]+churn['count_rech_3g_'+str(i)]).astype(int)

**Derived Variables for** total_rech_amt_data_6,total_rech_amt_data_7,total_rech_amt_data_8,total_rech_amt_data_9
##### total amount of data recharge month wise.

In [None]:
#Derived Variables for total_rech_amt_data_6,total_rech_amt_data_7,total_rech_amt_data_8,total_rech_amt_data_9
for i in range(6,10):
    churn['total_rech_amt_data_'+str(i)] = churn['total_rech_num_data_'+str(i)]*churn['av_rech_amt_data_'+str(i)]

**Derived Variables for** total_month_rech_6,total_month_rech_7,total_month_rech_8,total_month_rech_9

##### total recharge amount month wise.

In [None]:
#Derived Variables for total_month_rech_6,total_month_rech_7,total_month_rech_8,total_month_rech_9
for i in range(6,10):
    churn['total_month_rech_'+str(i)] = churn['total_rech_amt_'+str(i)]+churn['total_rech_amt_data_'+str(i)]
churn.filter(regex=('total_month_rech')).head()

In [None]:
# calculate average of first two months (good phase) total monthly recharge amount
good_phase =(churn.total_month_rech_6 + churn.total_month_rech_7)/2
# calculate cutoff which is the 70th percentile of the good phase average recharge amounts
highvalue_cutoff= np.percentile(good_phase,70)
# users who have avg. recharge amount >= the cutoff of 70th percentile.
# .copy() so that later column assignments do not act on a view of churn
highvalue_users = churn[good_phase >=  highvalue_cutoff].copy()
highvalue_users.reset_index(inplace=True,drop=True)

print("No. of High-Value Customers: %d\n"% len(highvalue_users))
print("% of High-value users : {}%".format(round(len(highvalue_users)/churn.shape[0]*100,2)))

###### Tagging Churners
Now tag the churned customers (churn=1, else 0) based on the fourth month as follows:

Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes we need to use to tag churners are:
- total_ic_mou_9
- total_og_mou_9
- vol_2g_mb_9
- vol_3g_mb_9
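
The tagging rule can be illustrated on a toy frame: a customer is a churner only if all four usage attributes are zero in month 9. The values below are invented; the column names are the four attributes listed above.

```python
import pandas as pd

# Toy churn-phase usage for three customers (invented values)
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 0.0],
    "vol_3g_mb_9":    [0.0, 0.0, 3.2],
})
usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# churn = 1 only when there were no calls AND no mobile data in the churn phase
toy["churn"] = (toy[usage_cols] == 0).all(axis=1).astype(int)
print(toy["churn"].tolist())
```

The second customer received calls and the third used 3G data, so only the first is tagged as a churner.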

In [None]:
#function to find out churn status

In [None]:
def churn_status(data,churn_phase=9):
    #argument: churn_phase,4th month number in which users churn (default= 9)
    churn_var= ['vol_2g_mb_','vol_3g_mb_','total_ic_mou_','total_og_mou_']
    isChurn = ~data[[s + str(churn_phase) for s in churn_var ]].any(axis=1)
    isChurn = isChurn.map({True:1, False:0})
    return isChurn

In [None]:
highvalue_users['churn'] = churn_status(highvalue_users,9)
print(" {} users tagged as churners out of {} High-Value Customers.".format(len(highvalue_users[highvalue_users.churn == 1]),highvalue_users.shape[0]))
print("High-value Customer Churn Percentage : {}%".format(round(len(highvalue_users[highvalue_users.churn == 1])/highvalue_users.shape[0] *100,2)))


The data set is **highly imbalanced**.

---
##  Data Analysis

---

##### function to plot histogram

In [None]:
# Function to plot the histogram with labels
def plot_hist(dataset,col,binsize):
    fig, ax = plt.subplots(figsize=(20,4))
    counts, bins, patches = ax.hist(dataset[col],bins=range(0,dataset[col].max(),round(binsize)), facecolor='yellow', edgecolor='red')
    
    # Set the ticks to be at the edges of the bins.
    ax.set_xticks(bins)
    bin_centers = 0.5 * np.diff(bins) + bins[:-1]
    for count, x in zip(counts, bin_centers):
        # Label
        percent = '%0.0f%%' % (100 * float(count) / counts.sum())
        ax.annotate(percent, xy=(x,0.2), xycoords=('data', 'axes fraction'),
        xytext=(0, -32), textcoords='offset points', va='top', ha='center')
    
    ax.set_xlabel(col.upper())
    ax.set_ylabel('Count')
    plt.show()
    

##### Function to calculate monthly avg calls and plot it

In [None]:
def plot_avgMonthlyCalls(pltType,data,calltype,colList):
    # style (matplotlib >= 3.6 names this style 'seaborn-v0_8-darkgrid')
    plt.style.use('seaborn-darkgrid')
    # create a color palette
    palette = plt.get_cmap('Set1')
    
    if pltType == 'multi':
        #Create dataframe after grouping on AON with colList features
        total_call_mou = pd.DataFrame(data.groupby('aon_bin',as_index=False)[colList].mean())
        total_call_mou['aon_bin']=pd.to_numeric(total_call_mou['aon_bin'])
        total_call_mou
        # multiple line plot
        num=0
        fig, ax = plt.subplots(figsize=(15,8))
        for column in total_call_mou.drop('aon_bin', axis=1):
            num+=1
            ax.plot(total_call_mou['aon_bin'] , total_call_mou[column], marker='', color=palette(num), linewidth=2, alpha=0.9, label=column)
         
        ## Add legend
        plt.legend(loc=2, ncol=2)
        ax.set_xticks(total_call_mou['aon_bin'])
        
        # Add titles
        plt.title("Avg.Monthly "+calltype+" MOU  V/S AON", loc='left', fontsize=12, fontweight=0, color='orange')
        plt.xlabel("Aon (years)")
        plt.ylabel("Avg. Monthly "+calltype+" MOU")
    elif pltType == 'single':
        fig, ax = plt.subplots(figsize=(8,4))
        ax.plot(data[colList].mean())
        ax.set_xticklabels(['Jun','Jul','Aug','Sep'])
        
        # Add titles
        plt.title("Avg. "+calltype+" MOU  V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
        plt.xlabel("Month")
        plt.ylabel("Avg. "+calltype+" MOU")
        
    plt.show()

##### Function to plot churn by mou

In [None]:
def plot_byChurnMou(colList,calltype):
    fig, ax = plt.subplots(figsize=(7,4))
    df=highvalue_users.groupby(['churn'])[colList].mean().T
    plt.plot(df)
    ax.set_xticklabels(['Jun','Jul','Aug','Sep'])
    ## Add legend
    plt.legend(['Non-Churn', 'Churn'])
    # Add titles
    plt.title("Avg. "+calltype+" MOU  V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
    plt.xlabel("Month")
    plt.ylabel("Avg. "+calltype+" MOU")

##### function to plot by churn 

In [None]:
def plot_byChurn(data,col):
    # per month churn vs Non-Churn
    fig, ax = plt.subplots(figsize=(7,4))
    colList=list(data.filter(regex=(col)).columns)
    colList = colList[:3]
    plt.plot(highvalue_users.groupby('churn')[colList].mean().T)
    ax.set_xticklabels(['Jun','Jul','Aug','Sep'])
    ## Add legend
    plt.legend(['Non-Churn', 'Churn'])
    # Add titles
    plt.title( str(col) +" V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
    plt.xlabel("Month")
    plt.ylabel(col)
    plt.show()
    # Numeric stats for per month churn vs Non-Churn
    return highvalue_users.groupby('churn')[colList].mean()

In [None]:
# Filtering the common monthly columns
common_cols = highvalue_users.filter(regex ='_6$').columns
# drop the trailing '_6'; str.strip('_6') would remove any leading/trailing '_' or '6' characters
monthly_cols = [item[:-2] for item in common_cols]
monthly_cols

In [None]:
# getting the number of monthly columns and profile columns
print ("columns:", highvalue_users.shape[1] )
print ("monthly columns : ",len(monthly_cols))
print ("Total monthly columns phase wise (%d*4): %d"%(len(monthly_cols), len(monthly_cols) * 4))
print ("Columns other than monthly columns :", highvalue_users.shape[1] - (len(monthly_cols) * 4))

In [None]:
#  remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names).
attr_9List = highvalue_users.filter(regex=('_9')).columns
highvalue_users.drop(attr_9List,axis=1,inplace=True)

In [None]:
# list of monthly columns 6,7,8
monthly_cols = [x + s for s in ['_6','_7','_8'] for x in monthly_cols]
monthly_cols

In [None]:
# columns which are not monthly columns
nonmonthly_cols = [col for col in highvalue_users.columns if col not in monthly_cols]
nonmonthly_cols

###### Feature: circle_id

In [None]:
# Getting  distinct circle_id's
highvalue_users.circle_id.value_counts()

We can drop this feature since it has only one value

In [None]:
highvalue_users.drop('circle_id',axis=1,inplace=True)

###### Feature: aon

In [None]:
# Customers distribution by age 
plot_hist(highvalue_users,'aon',365)

- **Minimum age** on network is 180 days.
- **Average age** on network for customers is 1200 days (3.2 years).
- 27% of the **high-value users are in their 2nd year** with the network.
- Almost 71% of users have an age on network of **less than 4 years.**
- 15% of users have been with the network for **over 7 years.**

In [None]:
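#Create Derived categorical variable
# bin the high-value users' own 'aon' so that rows align with the reset index; bin edges still span the full data
highvalue_users['aon_bin'] = pd.cut(highvalue_users['aon'], range(0,churn['aon'].max(),365), labels=range(0,int(round(churn['aon'].max()/365))-1))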
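
How `pd.cut` assigns yearly bins can be seen on a few toy values. This is a sketch with invented `aon` values; note that the intervals are right-closed by default, so 365 days still falls in bin 0.

```python
import pandas as pd

toy_aon = pd.Series([180, 365, 366, 1200, 2491])  # age on network in days (invented)

edges = list(range(0, 2920, 365))      # 0, 365, ..., 2555
labels = list(range(len(edges) - 1))   # one integer label per yearly interval
aon_bin = pd.cut(toy_aon, bins=edges, labels=labels)
print(aon_bin.astype(int).tolist())
```

Values above the last edge would receive `NaN`, which is worth checking before converting the bin to a numeric type.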

###### Incoming VS month VS AON

In [None]:
# Plotting Avg. total monthly incoming MOU vs AON
incoming_columns = highvalue_users.filter(regex ='total_ic_mou').columns
plot_avgMonthlyCalls('single',highvalue_users,calltype='incoming',colList=incoming_columns)
plot_avgMonthlyCalls('multi',highvalue_users,calltype='incoming',colList=incoming_columns)

The plots above show that:
- The longer a customer stays with the operator (AON), the higher the total monthly incoming MOU.
- Avg. total incoming MOU for Jul (_7) is higher than for Jun (_6) for customers in all AON bands.
- Avg. total incoming MOU for Aug (_8) ceases to increase; in fact, it shows a decline compared to Jul (_7).
- Avg. total incoming MOU for Sep (_9) is well below the first month's (Jun, _6) average.

###### Outgoing VS month VS AON

In [None]:
#  Avg. total monthly outgoing MOU vs AON
outgoing_columns = highvalue_users.filter(regex ='total_og_mou').columns
plot_avgMonthlyCalls('single',highvalue_users,calltype='outgoing',colList=outgoing_columns)
plot_avgMonthlyCalls('multi',highvalue_users,calltype='outgoing',colList=outgoing_columns)

The plots above show that:
- Avg. total outgoing MOU for Jul (_7) is higher than for Jun (_6) for customers in all AON bands, except in the AON band between 7 and 8 years, where it is almost the same.
- Avg. total outgoing MOU for Aug (_8) ceases to increase; in fact, it shows a significant decline compared to Jul (_7).
- Avg. total outgoing MOU for Sep (_9) is the lowest of all 4 months.
- The avg. outgoing usage reduces drastically for customers in the AON band between 7 and 8 years.

###### Incoming/Outgoing MOU VS Churn 

In [None]:
incoming_columns = ['total_ic_mou_6','total_ic_mou_7','total_ic_mou_8']
outgoing_columns = ['total_og_mou_6','total_og_mou_7','total_og_mou_8']
plot_byChurnMou(incoming_columns,'Incoming')
plot_byChurnMou(outgoing_columns,'Outgoing')

It can be observed that,
- Churners' avg. incoming/outgoing MOU **drops drastically after the 2nd month, Jul.**
- The non-churners' avg. MOU remains consistent and stable from month to month.
- Therefore, a user's MOU is a key feature for predicting churn.

In terms of statistics:

In [None]:
# Avg.Incoming MOU per month churn vs Non-Churn
highvalue_users.groupby(['churn'])[['total_ic_mou_6','total_ic_mou_7','total_ic_mou_8']].mean()

In [None]:
# Avg. Outgoing MOU per month churn vs Non-Churn
highvalue_users.groupby(['churn'])[['total_og_mou_6','total_og_mou_7','total_og_mou_8']].mean()

**Derived Variables:** og_to_ic_mou_6, og_to_ic_mou_7, og_to_ic_mou_8 (= `total_og_mou_* / total_ic_mou_*`)

In [None]:
# adding 1 to denominator to avoid dividing by 0 and getting nan values.
for i in range(6,9):
    highvalue_users['og_to_ic_mou_'+str(i)] = (highvalue_users['total_og_mou_'+str(i)])/(highvalue_users['total_ic_mou_'+str(i)]+1)

In [None]:
plot_byChurn(highvalue_users,'og_to_ic_mou')

- The outgoing-to-incoming MOU ratio drops significantly for churners from Jul (_7) to Aug (_8).
- While it remains almost consistent for the non-churners.

##### Derived Variables:
loc_og_to_ic_mou_6, loc_og_to_ic_mou_7, loc_og_to_ic_mou_8 (= `loc_og_mou_* / loc_ic_mou_*`) for each month. These features combine the incoming and outgoing local-call information and should be a **better predictor of churn**.

In [None]:
# adding 1 to denominator to avoid dividing by 0 and getting nan values.
for i in range(6,9):
    highvalue_users['loc_og_to_ic_mou_'+str(i)] = (highvalue_users['loc_og_mou_'+str(i)])/(highvalue_users['loc_ic_mou_'+str(i)]+1)

In [None]:
plot_byChurn(highvalue_users,'loc_og_to_ic_mou')

It can be observed that,
- The local outgoing-to-incoming MOU ratio is generally low for churners right from the beginning of the good phase.
- The local MOU pattern for the non-churners remains almost constant throughout the 3 months.
- The churners generally show a low local MOU ratio, and it drops dramatically after the 2nd month.
- This might suggest that people who are not making/receiving many local calls during their tenure are more likely to churn.

###### Total data volume VS Churn 

In [None]:
plot_byChurn(highvalue_users,'vol_data_mb')

- The volume of data (MB) used drops significantly for churners from Jul (_7) to Aug (_8).
- While it remains almost consistent for the non-churners.

###### Total monthly rech VS Churn 

In [None]:
plot_byChurn(highvalue_users,'total_month_rech')

- The total monthly recharge amount also drops significantly for churners from Jul (_7) to Aug (_8).
- While it remains almost consistent for the non-churners.

###### max_rech_amt VS Churn 

In [None]:
plot_byChurn(highvalue_users,'max_rech_amt')

- The maximum recharge amount also drops significantly for churners from Jul (_7) to Aug (_8).
- While it remains almost consistent for the non-churners.

###### arpu VS Churn 

In [None]:
plot_byChurn(highvalue_users,'arpu')

- Average revenue per user (ARPU) also drops significantly for churners from Jul (_7) to Aug (_8).
- While it remains almost consistent for the non-churners.

**Derived Variables:** Total_loc_mou_6, Total_loc_mou_7, Total_loc_mou_8<br>
**Total local MOU** (= loc_og_mou + loc_ic_mou) month wise

In [None]:
#Create new feature: Total_loc_mou_6,Total_loc_mou_7,Total_loc_mou_8
for i in range(6,9):
    highvalue_users['Total_loc_mou_'+str(i)] = (highvalue_users['loc_og_mou_'+str(i)])+(highvalue_users['loc_ic_mou_'+str(i)])

In [None]:
plot_byChurn(highvalue_users,'Total_loc_mou_')

It can be observed that,
- The total local call MOU is generally low for churners right from the beginning of the good phase.
- The local MOU pattern for the non-churners remains almost constant throughout the 3 months.
- The churners generally show a low total local MOU, and it drops dramatically after the 2nd month.
- This might suggest that people who are not making/receiving many local calls during their tenure are more likely to churn.

**Derived Variables:** Total_roam_mou_6,Total_roam_mou_7,Total_roam_mou_8<br>
**Total roaming MOU** (=roam_ic_mou+roam_og_mou) month wise

In [None]:
#Create new feature: Total_roam_mou_6,Total_roam_mou_7,Total_roam_mou_8
for i in range(6,9):
    highvalue_users['Total_roam_mou_'+str(i)] = (highvalue_users['roam_ic_mou_'+str(i)])+(highvalue_users['roam_og_mou_'+str(i)])

In [None]:
plot_byChurn(highvalue_users,'Total_roam_mou')

It can be observed that,
- Surprisingly, the roaming usage of churners is far higher than that of non-churners across all months.
- People who make/receive more roaming calls during their tenure are more likely to churn.
- This might suggest that the operator's roaming tariffs are higher than those offered by its competitors, forming one of the reasons for churn.

###### last_day_rch_amt VS Churn 

In [None]:
plot_byChurn(highvalue_users,'last_day_rch_amt')

- The avg. last-day recharge amount for churners is less than half that of the non-churners.
- This suggests that as a customer's recharge amount falls, their chance of churning increases.

## Modeling

In [None]:
import sklearn.preprocessing
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

##### Function to plot roc

In [None]:
def plot_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(6, 6))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds

##### Function to get Model Metrics Results

In [None]:
def getModelMetricsResults(actual_churn=False,pred_churn=False):

    confusion = metrics.confusion_matrix(actual_churn, pred_churn)

    TP = confusion[1,1] # true positive 
    TN = confusion[0,0] # true negatives
    FP = confusion[0,1] # false positives
    FN = confusion[1,0] # false negatives

    print("Roc_auc_score : {}".format(metrics.roc_auc_score(actual_churn,pred_churn)))
    # Sensitivity (recall) of the model
    print('Sensitivity/Recall : {}'.format(TP / float(TP+FN)))
    # Specificity
    print('Specificity: {}'.format(TN / float(TN+FP)))
    # False positive rate - predicting churn for a customer who has not churned
    print('False Positive Rate: {}'.format(FP/ float(TN+FP)))
    # positive predictive value 
    print('Positive predictive value: {}'.format(TP / float(TP+FP)))
    # Negative predictive value
    print('Negative Predictive value: {}'.format(TN / float(TN+ FN)))
    # sklearn precision score value 
    print('sklearn precision score value: {}'.format(metrics.precision_score(actual_churn, pred_churn )))
    
    

##### Function to predict churning with probability

In [None]:
def predictChurnUsingProbCutOff(model,X,y,prob):
    # Function to predict churn using the input probability cut-off
    # Input arguments: model instance, X and y to predict using the model, and the cut-off probability
    
    # predict
    pred_probs = model.predict_proba(X)[:,1]
    
    y_df= pd.DataFrame({'churn':y, 'churn_Prob':pred_probs})
    # Creating new column 'final_predicted' with 1 if churn_Prob > prob else 0
    y_df['final_predicted'] = y_df.churn_Prob.map( lambda x: 1 if x > prob else 0)
    # Print the model metrics for this cut-off
    getModelMetricsResults(y_df.churn,y_df.final_predicted)
    return y_df

In [None]:
def findOptimalCutoff(df):
    # Function to find the optimal cut-off for classifying as churn/non-churn
    # Create one column of 0/1 predictions per probability cut-off
    numbers = [float(x)/10 for x in range(10)]
    for i in numbers:
        df[i] = df.churn_Prob.map( lambda x: 1 if x > i else 0)
    
    # Calculate accuracy, sensitivity and specificity for each probability cut-off
    cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
    
    # Confusion matrix layout (rows = actual, columns = predicted):
    # TN = cm1[0,0]   FP = cm1[0,1]
    # FN = cm1[1,0]   TP = cm1[1,1]
    for i in numbers:
        cm1 = metrics.confusion_matrix(df.churn, df[i] )
        total1=sum(sum(cm1))
        accuracy = (cm1[0,0]+cm1[1,1])/total1
        
        speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
        sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
        cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
    print(cutoff_df)
    # Plot accuracy, sensitivity and specificity against the cut-off probability
    cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
    plt.show()
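
The cut-off search above can be illustrated on a tiny example. The sketch below uses hypothetical labels and probabilities (not taken from the dataset) and picks the cut-off at which sensitivity and specificity are closest, which is one common way of reading the "balance point" off the plot. The helper name `sensitivity_specificity` is an assumption introduced only for this illustration.

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Return (sensitivity, specificity) when predicting 1 for prob > cutoff."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p > cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p <= cutoff)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p <= cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and predicted churn probabilities (illustration only)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.35, 0.65, 0.45, 0.3, 0.2, 0.1]

cutoffs = [i / 10 for i in range(1, 10)]
results = {c: sensitivity_specificity(y_true, y_prob, c) for c in cutoffs}

# Cut-off where sensitivity and specificity are closest to each other
best_cutoff = min(cutoffs, key=lambda c: abs(results[c][0] - results[c][1]))
print(best_cutoff, results[best_cutoff])
```

Lowering the cut-off below this point raises sensitivity at the expense of specificity, which is the trade-off exploited later when a cut-off below 0.5 is chosen.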

In [None]:
def fit_model(alg, X_train, y_train, performCV=True, cv_folds=5):
    #Fit the algorithm on the data
    alg.fit(X_train, y_train)
        
    #Predict training set:
    dtrain_predictions = alg.predict(X_train)
    dtrain_predprob = alg.predict_proba(X_train)[:,1]
    
    #Perform cross-validation:
    if performCV:
        cv_score = cross_val_score(alg, X_train, y_train, cv=cv_folds, scoring='roc_auc')
    
    #Print model report:
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(y_train, dtrain_predictions))
    print ("Recall/Sensitivity : %.4g" % metrics.recall_score(y_train, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, dtrain_predprob))
    
    if performCV:
        print ("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score)))
        

In [None]:
# creating copy of the final hv_user dataframe
highvalue_users_PCA = highvalue_users.copy()
# removing the columns not required for modeling
highvalue_users_PCA.drop(['mobile_number', 'aon_bin'], axis=1, inplace=True)

In [None]:
# removing the datetime columns before PCA
dateTimeCols = list(highvalue_users_PCA.select_dtypes(include=['datetime64']).columns)
print(dateTimeCols)
highvalue_users_PCA.drop(dateTimeCols, axis=1, inplace=True)

In [None]:
from sklearn.model_selection import train_test_split

#putting features variables in X
X = highvalue_users_PCA.drop(['churn'], axis=1)

#putting response variables in Y
y = highvalue_users_PCA['churn']    

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100)

In [None]:
#Rescaling the features before PCA as it is sensitive to the scales of the features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# fitting and transforming the scaler on train
X_train = scaler.fit_transform(X_train)
# transforming the test set using the scaler already fitted on train
X_test = scaler.transform(X_test)

### Handling imbalance.

Standard classifiers such as Decision Trees and Logistic Regression are biased towards the class with the larger number of instances. They tend to predict the majority class, so the minority class (here, churners) is far more likely to be misclassified than the majority class.
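
A small sketch of why this matters, using a hypothetical sample of 92 non-churners and 8 churners (these counts are assumptions, not the dataset's): a degenerate model that always predicts the majority class scores high accuracy while finding no churners at all.

```python
# Hypothetical imbalanced sample: 92 non-churners (0) and 8 churners (1)
y_true = [0] * 92 + [1] * 8
# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / sum(y_true)

print(accuracy, recall)  # 0.92 0.0
```

This is why recall, rather than accuracy, is tracked throughout the modelling below.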

##### Synthetic Minority Over-sampling Technique

SMOTE takes samples from the minority class and creates new synthetic instances that are similar to them, by interpolating between a minority sample and its nearest minority-class neighbours. These synthetic instances are then added to the original dataset.

The resulting balanced dataset is used to train the classification models.
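
The interpolation idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_like_samples`, the toy points and `k=2` are all hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points in a 2-D feature space
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region already occupied by the minority class.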

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))
print("Before OverSampling, churn event rate : {}% \n".format(round(sum(y_train==1)/len(y_train)*100,2)))

In [None]:
from imblearn.over_sampling import SMOTE
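# sampling_strategy=1.0 oversamples the minority class to match the majority class
# (older imbalanced-learn releases used `ratio` and `fit_sample` instead)
sm = SMOTE(random_state=12, sampling_strategy=1.0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)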

In [None]:
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))
print("After OverSampling, churn event rate : {}% \n".format(round(sum(y_train_res==1)/len(y_train_res)*100,2)))

In [None]:
#Importing the PCA module
from sklearn.decomposition import PCA
pca = PCA(svd_solver='randomized', random_state=42)

In [None]:
#Doing the PCA on the train data
pca.fit(X_train_res)

We first fit PCA with all components and then use the cumulative explained variance to decide how many components to keep.

**Scree plot to assess the number of principal components needed**

In [None]:
pca.explained_variance_ratio_[:50]

In [None]:
#Making the screeplot - plotting the cumulative variance against the number of components
%matplotlib inline
fig = plt.figure(figsize = (12,8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

##### **50 components are enough to describe 95% of the variance in the dataset**
- We'll take 50 components for modeling
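
Reading the component count off the plot can also be done numerically. The sketch below uses hypothetical explained-variance ratios (not the values from this dataset) and finds the smallest number of components whose cumulative variance reaches a target. With a fitted scikit-learn `PCA`, the same logic would be applied to `pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
explained_variance_ratio = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.03, 0.01])

cumulative = np.cumsum(explained_variance_ratio)
target = 0.95

# Index of the first cumulative value that reaches the target, +1 to get a count
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components)  # 5
```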

In [None]:
#Using incremental PCA for efficiency - saves a lot of time on larger datasets
from sklearn.decomposition import IncrementalPCA
pca_final = IncrementalPCA(n_components=35)

In [None]:
X_train_pca = pca_final.fit_transform(X_train_res)
X_train_pca.shape

In [None]:
#creating correlation matrix for the principal components
corrmat = np.corrcoef(X_train_pca.transpose())
# 1s -> 0s in diagonals
corrmat_nodiag = corrmat - np.diagflat(corrmat.diagonal())
print("max corr:",corrmat_nodiag.max(), ", min corr: ", corrmat_nodiag.min(),)
# we see that correlations are indeed very close to 0

The correlation between any two principal components is close to zero. PCA has effectively removed the multicollinearity, so models built on these components should be more stable.
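
To see why the components come out uncorrelated, here is a minimal sketch with two strongly correlated hypothetical features. It uses plain PCA via SVD on centred data rather than `IncrementalPCA`, so it illustrates the principle only; the incremental variant approximates the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: the second feature is almost a copy of the first
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# PCA via SVD of the centred data; rows of vt are the principal axes
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ vt.T  # data projected onto the principal components

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(corr_before, corr_after)
```

The original features are highly correlated, while the projected scores have a correlation that is zero up to floating-point error.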

In [None]:
#Applying the fitted PCA (pca_final) to the test data
X_test_pca = pca_final.transform(X_test)
X_test_pca.shape

To predict churn, we will fit the following models:
1. Logistic Regression
2. Decision Tree
3. Random Forest

### 1. Logistic Regression

##### Applying Logistic Regression on  principal components

In [None]:
#Training the model on the train data
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr0 = LogisticRegression(class_weight='balanced')

In [None]:
fit_model(lr0, X_train_pca, y_train_res)

In [None]:
# Test  data Results:
pred_probs_test = lr0.predict(X_test_pca)
getModelMetricsResults(y_test,pred_probs_test)

In [None]:
print("Accuracy : {}".format(metrics.accuracy_score(y_test,pred_probs_test)))
print("Recall : {}".format(metrics.recall_score(y_test,pred_probs_test)))
print("Precision : {}".format(metrics.precision_score(y_test,pred_probs_test)))

In [None]:
#Predicted churn probabilities on the train data
pred_probs_train = lr0.predict_proba(X_train_pca)[:,1]
print("roc_auc_score(Train) {:2.2}".format(metrics.roc_auc_score(y_train_res, pred_probs_train)))

In [None]:
cut_off_prob=0.5
y_train_df = predictChurnUsingProbCutOff(lr0,X_train_pca,y_train_res,cut_off_prob)
y_train_df.head()

**ROC Curve :**
An ROC curve:
- shows the trade-off between sensitivity and specificity.
- indicates a more accurate test the closer it follows the left-hand border and then the top border of the ROC space.
- indicates a less accurate test the closer it comes to the 45-degree diagonal of the ROC space.
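
The points above can be made concrete with a small hand-rolled sketch: lower the threshold step by step, record the (false positive rate, true positive rate) pair each time, and integrate with the trapezoidal rule. The labels and scores are hypothetical, and `roc_points` and `auc` are illustrative helpers, not the notebook's `plot_roc` or scikit-learn's functions.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs obtained by lowering the threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, y_score)
roc_auc = auc(points)
print(round(roc_auc, 3))  # 0.889
```

A curve hugging the top-left corner gives an area close to 1, while the 45-degree diagonal gives 0.5.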

In [None]:
plot_roc(y_train_df.churn, y_train_df.final_predicted)

The ROC curve lies towards the top-left corner, which is a sign of a good fit.

In [None]:
#plot_roc(y_pred_final.Churn, y_pred_final.predicted)
print("roc_auc_score : {:2.2f}".format(metrics.roc_auc_score(y_train_df.churn, y_train_df.final_predicted)))

**Optimal Cutoff Point**<br>
Recall (sensitivity) is the most important metric for churn prediction, so the trade-off between sensitivity and specificity has to be considered. We will adjust the probability cut-off to obtain a higher sensitivity (recall).

In [None]:
# sensitivity vs specificity trade-off
findOptimalCutoff(y_train_df)

#### **From the curve above, 0.45 is the optimum point.**
A cut-off anywhere between 0.4 and 0.6 could be used, but we take 0.45 to keep the test sensitivity/recall high. At this point sensitivity, specificity and accuracy are balanced.

In [None]:
# predicting with the chosen cut-off on train
cut_off_prob = 0.45
predictChurnUsingProbCutOff(lr0,X_train_pca,y_train_res,cut_off_prob)

**Making prediction on test**

In [None]:
# predicting with the chosen cut-off on test
predictChurnUsingProbCutOff(lr0,X_test_pca,y_test,cut_off_prob)

The resulting model, after PCA and logistic regression (with the optimal cut-off), gives the following results on the train and test sets:
- **train sensitivity  :** 86.47%, **train roc auc score  :** 82.1%
- **test sensitivity   :** 84.40%, **test roc auc score  :** 81.21%

### 2. Decision Tree

##### Applying Decision Tree Classifier on our principal components with Hyperparameter tuning

In [None]:
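from sklearn.tree import DecisionTreeClassifier

# max_features='sqrt' is equivalent to the older 'auto' option for classifiers
dt0 = DecisionTreeClassifier(class_weight='balanced',
                             max_features='sqrt',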
                             min_samples_split=100,
                             min_samples_leaf=100,
                             max_depth=6,
                             random_state=10)
fit_model(dt0, X_train_pca, y_train_res)

In [None]:
# make predictions
pred_probs_test = dt0.predict(X_test_pca)
#Let's check the model metrics.
getModelMetricsResults(actual_churn=y_test,pred_churn=pred_probs_test)

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': range(5,15,3),
    'min_samples_leaf': range(100, 400, 50),
    'min_samples_split': range(100, 400, 100),
    'max_features': [8,10,15]
}
# Create a base model
dt = DecisionTreeClassifier(class_weight='balanced',random_state=10)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = dt, param_grid = param_grid, 
                          cv = 3, n_jobs = 4,verbose = 1,scoring="f1_weighted")
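
It helps to know how much work this grid search implies before running it. The sketch below simply enumerates the grid defined above with `itertools.product`; it does not call scikit-learn, and it ignores the final refit on the best parameters.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'max_depth': range(5, 15, 3),
    'min_samples_leaf': range(100, 400, 50),
    'min_samples_split': range(100, 400, 100),
    'max_features': [8, 10, 15],
}
cv_folds = 3

# Every combination of hyperparameter values is one candidate model
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)  # 216 648
```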

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train_pca, y_train_res)

In [None]:
# printing the best cross-validated score and hyperparameters
print('weighted F1 of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# model with the best hyperparameters
dt_final = DecisionTreeClassifier(class_weight='balanced',
                             max_depth=14,
                             min_samples_leaf=100, 
                             min_samples_split=100,
                             max_features=15,
                             random_state=10)

In [None]:
fit_model(dt_final,X_train_pca,y_train_res)

In [None]:
# make predictions
pred_probs_test = dt_final.predict(X_test_pca)
#Let's check the model metrics.
getModelMetricsResults(actual_churn=y_test,pred_churn=pred_probs_test)

In [None]:
# classification report
print(classification_report(y_test,pred_probs_test))

##### Improving the recall rate by choosing an optimal cut-off for the model to predict churn.

In [None]:
# predicting churn with default cut-off 0.5
cut_off_prob = 0.5
y_train_df = predictChurnUsingProbCutOff(dt_final,X_train_pca,y_train_res,cut_off_prob)
y_train_df.head()

In [None]:
# finding cut-off with the right balance of the metrics
findOptimalCutoff(y_train_df)

**From the curve above, let's choose 0.4 as the optimum point to get a high enough sensitivity.**

In [None]:
# predicting churn with cut-off 0.4
cut_off_prob=0.4
y_train_df = predictChurnUsingProbCutOff(dt_final,X_train_pca,y_train_res,cut_off_prob)
y_train_df.head()

- At the 0.4 cut-off probability there is a balance of sensitivity, specificity and accuracy.
<br>Let's see how it performs on test data.

In [None]:
#Test data Results
y_test_df= predictChurnUsingProbCutOff(dt_final,X_test_pca,y_test,cut_off_prob)
y_test_df.head()

- The Decision Tree, after selecting the optimal cut-off, results in a model with
<br>**Train Recall : 89.78%**  and  **Train Roc_auc_score : 82.40**
<br>**Test Recall : 78.13%**  and  **Test Roc_auc_score : 76.56**

The Decision Tree still seems overfitted to the data.

### 3. Random Forest

##### Applying Random Forest Classifier on our principal components with Hyperparameter tuning

In [None]:
def plot_Accuracy(score,param):
    # Plot the mean train and validation scores from GridSearchCV's cv_results_
    # against the values tried for the given hyperparameter
    scores = score
    plt.figure()
    plt.plot(scores["param_"+param],
             scores["mean_train_score"],
             label="training score")
    plt.plot(scores["param_"+param],
             scores["mean_test_score"],
             label="validation score")
    plt.xlabel(param)
    plt.ylabel("mean CV score")
    plt.legend()
    plt.show()

#### Tuning max_depth

In [None]:
parameters = {'max_depth': range(10, 30, 5)}
rf0 = RandomForestClassifier()
rfgs = GridSearchCV(rf0, parameters, 
                    cv=5, 
                   scoring="f1",return_train_score=True)
rfgs.fit(X_train_pca,y_train_res)

In [None]:
scores = rfgs.cv_results_
# plotting mean train and validation f1-scores against max_depth
plt.figure()
plt.plot(scores["param_max_depth"], 
         scores["mean_train_score"], 
         label="training f1")
plt.plot(scores["param_max_depth"], 
         scores["mean_test_score"], 
         label="validation f1")
plt.xlabel("max_depth")
plt.ylabel("f1")
plt.legend()
plt.show()

The test f1-score becomes almost constant after max_depth=20.

#### Tuning n_estimators

In [None]:
parameters = {'n_estimators': range(50, 150, 25)}
rf1 = RandomForestClassifier(max_depth=20,random_state=10)
rfgs = GridSearchCV(rf1, parameters, 
                    cv=3, 
                   scoring="recall",return_train_score=True)

In [None]:
rfgs.fit(X_train_pca,y_train_res)

In [None]:
plot_Accuracy(rfgs.cv_results_,'n_estimators')

Selecting n_estimators = 80

#### Tuning max_features

In [None]:
parameters = {'max_features': [4, 8, 14]}
rf3 = RandomForestClassifier(max_depth=20,n_estimators=80,random_state=10)
rfgs = GridSearchCV(rf3, parameters, 
                    cv=3, 
                   scoring="f1",return_train_score=True)

In [None]:
rfgs.fit(X_train_pca,y_train_res) 

In [None]:
plot_Accuracy(rfgs.cv_results_,'max_features')

Selecting max_features = 3

#### Tuning min_samples_leaf

In [None]:
parameters = {'min_samples_leaf': range(100, 400, 50)}
rf4 = RandomForestClassifier(max_depth=20,n_estimators=80,max_features=5,random_state=10)
rfgs = GridSearchCV(rf4, parameters, 
                    cv=3, 
                   scoring="f1",return_train_score=True)

In [None]:
rfgs.fit(X_train_pca,y_train_res)

In [None]:
plot_Accuracy(rfgs.cv_results_,'min_samples_leaf')

Selecting min_samples_leaf = 100

#### Tuning min_samples_split

In [None]:
parameters = {'min_samples_split': range(150, 300, 50)}
rf5 = RandomForestClassifier(max_depth=20,n_estimators=80,max_features=5,min_samples_leaf=100,random_state=10)
rfgs = GridSearchCV(rf5, parameters, 
                    cv=3, 
                   scoring="f1",return_train_score=True)

In [None]:
rfgs.fit(X_train_pca,y_train_res)
plot_Accuracy(rfgs.cv_results_,'min_samples_split')

Selecting min_samples_split = 150

#### Tuned Random Forest

In [None]:
rf_final = RandomForestClassifier(max_depth=20,
                                  n_estimators=80,
                                  max_features=3,
                                  min_samples_leaf=100,
                                  min_samples_split=50,
                                  random_state=10)

In [None]:
print("Train data Results:")
fit_model(rf_final,X_train_pca,y_train_res)

In [None]:
# predict on test data
predictions = rf_final.predict(X_test_pca)

In [None]:
print("Test data Results:")
getModelMetricsResults(y_test,predictions)

After hyperparameter tuning of the random forest, the test recall is 73.11%.

Let's see if we can achieve a better recall by choosing an optimal cut-off for the model to predict churn.

In [None]:
# predicting churn with default cut-off 0.5
cut_off_prob=0.5
y_train_df = predictChurnUsingProbCutOff(rf_final,X_train_pca,y_train_res,cut_off_prob)
y_train_df.head()

In [None]:
# finding cut-off with the right balance of the metrics
findOptimalCutoff(y_train_df)

**From the plot above, 0.45 is the optimal point with high enough sensitivity.**

In [None]:
cut_off_prob=0.45
predictChurnUsingProbCutOff(rf_final,X_train_pca,y_train_res,cut_off_prob)

**Making prediction on test**

In [None]:
y_test_df= predictChurnUsingProbCutOff(rf_final,X_test_pca,y_test,cut_off_prob)
y_test_df.head()

- The Random Forest, after selecting the optimal cut-off, results in a model with
<br>**Train Recall : 88.40%**  and  **Train Roc_auc_score : 85.17**
<br>**Test Recall : 77.57%**  and  **Test Roc_auc_score : 79.33**

---------------

## Model Selection
The company would like to identify as many customers at risk of churning as possible, even if this means that many customers are misclassified as churners. The cost to the company of a churned customer is much higher than the cost of a few false positives.

| Model                                 | Train  Results   | Test Results  |
|---------------------------------------|------------------|---------------|
| Logistic Regression ( cut-off = 0.45) |  ------------------------------  |
| Roc_auc_score                         | 82.11%           | 81.21%        |
| Sensitivity/Recall                    | 86.48%           | 84.40%        |
| Specificity                           | 77.75%           | 78.02%        |
| precision                             | 79.54%           | 25.04%        |
| DecisionTree ( cut-off = 0.4)         |  ------------------------------  |
| Roc_auc_score                         | 82.41%           | 76.57%        |
| Sensitivity/Recall                    | 89.79%           | 78.13%        |
| Specificity                           | 75.03%           | 75%           |
| precision                             | 78.24%           | 21.38%        |
| Random Forest (cut-off = 0.45)        |   -----------------------------  |
| Roc_auc_score                         | 85.60%           | 79.33%        |
| Sensitivity/Recall                    | 88.70%           | 77.57%        |
| Specificity                           | 82.50%           | 81.73%        |
| precision                             | 83.52%           | 26.97%        |
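
The ROC-AUC rows in this table appear consistent with scores computed from thresholded 0/1 predictions rather than from probabilities. For a classifier that outputs only hard labels, the ROC curve has a single operating point and its area equals the mean of sensitivity and specificity. The sketch below checks this against the Logistic Regression and Decision Tree test rows; `hard_label_auc` is a helper introduced only for this illustration.

```python
def hard_label_auc(sensitivity, specificity):
    """ROC AUC of a classifier that outputs only 0/1 labels.

    With a single operating point the ROC curve is two straight segments,
    and the area under it equals (sensitivity + specificity) / 2.
    """
    return (sensitivity + specificity) / 2

# Test-set sensitivity and specificity taken from the table above
lr_test_auc = hard_label_auc(0.8440, 0.7802)
dt_test_auc = hard_label_auc(0.7813, 0.75)
print(lr_test_auc, dt_test_auc)
```

An AUC computed from the predicted probabilities would generally differ from these figures, since it summarises all thresholds rather than one.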



Overall, the **Logistic Regression** model with probability cut-off = 0.45 performs best. It achieved the **best recall of 84.4%** on test data. The overall accuracy and specificity are also consistent between train and test data, which suggests the model is not overfitting. Precision is compromised in this effort, but the business objective of identifying churning customers is most accurately captured by this model.
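
The large gap between train and test precision is largely a consequence of class balance: the training set is balanced by SMOTE, while the test set keeps its original churn rate. A minimal sketch of the relationship follows. The 8% test churn rate is an assumption made for illustration, and the actual rate is the one printed by the "Before OverSampling" cell.

```python
def precision_from_rates(recall, specificity, prevalence):
    """Precision implied by recall, specificity and the share of positives."""
    true_pos = recall * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Train set after SMOTE is balanced, so the share of churners is 0.5
train_precision = precision_from_rates(0.8648, 0.7775, 0.5)

# Hypothetical churn rate of 8% for the unbalanced test set (an assumption)
test_precision = precision_from_rates(0.8440, 0.7802, 0.08)

print(round(train_precision, 4), round(test_precision, 4))
```

With similar recall and specificity on both sets, precision drops sharply simply because churners are a much smaller share of the test data.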

From the tree family, the Decision Tree overfitted the data slightly while obtaining 78.13% recall on test data.
The Random Forest avoided overfitting but obtained only 77.57% recall on test data.



## Identifying relevant churn features. 

We will use an instance of Random Forest classifier to identify the features most relevant to churn. 

### Random Forest for churn driver features 

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [8,10,12],
    'min_samples_leaf': range(100, 400, 200),
    'min_samples_split': range(200, 500, 200),
    'n_estimators': [100,200, 300], 
    'max_features': [12, 15, 20]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = 4,verbose = 1)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train_res, y_train_res)

In [None]:
# printing the best cross-validated accuracy and hyperparameters
print('accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
rf = RandomForestClassifier(max_depth=12,
                            max_features=20,
                            min_samples_leaf=100,
                            min_samples_split=200,
                            n_estimators=300,
                            random_state=10)

In [None]:
rf.fit(X_train_res, y_train_res)

In [None]:
plt.figure(figsize=(15,40))
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(len(X.columns)).sort_values().plot(kind='barh', align='center')

Some of the top predictors of churn are the monthly features for the action phase (3rd month, August).

The plot above shows that the top 25 features, ranked in order of importance by our Random Forest, all belong to month 8, i.e. the action month. Hence, what happens in the action phase clearly has a direct impact on the churn of high-value customers. The top-ranked features are:




1.	**total_ic_mou_8**		-- *Total incoming minutes of usage in month 8*
2.	**loc_ic_mou_8**		-- *local incoming minutes of usage in month 8*
3.	**total_month_rech_8**	-- *Total month recharge amount in month 8*	
4.	**total_roam_mou_8**	-- *Total incoming+outgoing roaming minutes of usage in month 8*
5.	**loc_ic_t2m_mou_8**	-- *local incoming calls to another operator minutes of usage in month 8*
6.	**roam_og_mou_8**		-- *outgoing roaming calls minutes of usage in month 8*
7.	**Total_loc_mou_8**		-- *Total local minutes of usage in month 8*
8.	**roam_ic_mou_8**		-- *incoming roaming calls minutes of usage in month 8*
9.	**total_rech_amt_8**	-- *total recharge amount in month 8*
10.	**loc_ic_t2t_mou_8**	-- *local incoming calls from same operator minutes of usage in month 8*
11.	**max_rech_amt_8**		-- *maximum recharge amount in month 8*
12.	**last_day_rch_amt_8**	-- *last (most recent) recharge amount in month 8*
13.	**arpu_8**				-- *average revenue per user in month 8*
14.	**loc_og_mou_8**		-- *local outgoing calls minutes of usage in month 8*
15.	**loc_og_t2n_mou_8**	-- *local outgoing calls minutes of usage to other operator mobile in month 8*
16.	**av_rech_amt_data_8**	-- *average recharge amount for mobile data in month 8*
17.	**total_rech_data_8**	-- *total data recharge (MB) in month 8*
18.	**total_og_t2t_mou_8**	-- *total outgoing calls from same operator minutes of usage in month 8*
19.	**total_rech_num_8**	-- *total number of recharges done in the month 8*
20.	**total_rech_amt_data_8**	-- *total recharge amount for data in month 8*
21.	**max_rech_data_8**		-- *maximum data recharge (MB) in month 8*
22.	**avg_rech_amt_8**		-- *average recharge amount in month 8*
23.	**fb_user_8**			-- *services of Facebook and similar social networking sites for month 8*
24.	**vol_data_mb_8**		-- *volume of data (MB) consumed for month 8*
25.	**count_rech_2g_8**		-- *Number of 2g data recharge in month 8*
26.	**loc_og_to_ic_mou_8**	-- *local outgoing to incoming mou ratio for month of 8*
27.	**spl_og_mou_7**		-- *Special outgoing call for the month of 7*

Local call minutes of usage (MoU), whether incoming or outgoing, play a very important role in churn prediction.

## Approach to reduce customer churn

Since it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition.
For many incumbent operators, retaining highly profitable customers is the number one business goal.

#### Monitoring Drop in usage

Customer churn can be predicted with good accuracy from usage-based signals.
The telecom company should pay close attention to month-over-month drops in MoU, ARPU and data usage (2G and 3G).
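
A monitoring rule of this kind could be sketched as follows. The customer data, the 30% threshold and the helper names `pct_change` and `flag_usage_drop` are all hypothetical; a real rule would need to be calibrated against the churn labels.

```python
def pct_change(previous, current):
    """Relative change from previous to current; None if previous is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous

def flag_usage_drop(monthly_mou, threshold=-0.30):
    """Flag a customer whose latest month-over-month change is at or below threshold."""
    change = pct_change(monthly_mou[-2], monthly_mou[-1])
    return change is not None and change <= threshold

# Hypothetical total minutes of usage for months 6, 7 and 8
customers = {
    'A': [520.0, 510.0, 180.0],
    'B': [300.0, 320.0, 310.0],
    'C': [0.0, 0.0, 0.0],
}

at_risk = [cid for cid, mou in customers.items() if flag_usage_drop(mou)]
print(at_risk)  # ['A']
```

The same check could be applied to ARPU and data volume, with customers flagged on any of the three being prioritised for retention offers.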



######  Outgoing services

In [None]:
# Outgoing Mou
plot_byChurnMou(outgoing_columns,'Outgoing')

-  Initially, churners' outgoing usage was higher than that of non-churners. Gradually they reduced their outgoing usage.

###### Roaming services

In [None]:
plot_byChurn(highvalue_users,'Total_roam_mou')

Strategy Approach:-
- Churners show higher roaming usage than non-churners.
- The network operator should further investigate its roaming tariffs and quality of service.
- The roaming tariffs offered may be less competitive than those of competitors.
- Offer discounted roaming rates during particular hours of the day.
- Offer free monthly roaming MoU depending on the user's past roaming usage.