# Business Problem Overview

In the telecom industry, customers can choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the industry experiences an average annual churn rate of 15-25%. Because acquiring a new customer costs 5-10 times more than retaining an existing one, customer retention has become even more important than customer acquisition.

 

For many incumbent operators, retaining highly profitable customers is the number one business goal.

 

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

 

In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

 
## Understanding and Defining Churn

There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services).

In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn.

However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again).

Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America.

This project is based on the Indian and Southeast Asian market.

## Definitions of Churn

There are various ways to define churn, such as:

#### Revenue-based churn: 
Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as _**‘customers who have generated less than INR 4 per month in total/average/median revenue’**_.

The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.

#### Usage-based churn: 
Customers **who have had no usage, either incoming or outgoing, in terms of calls, internet etc. over a period of time**.

A potential shortcoming of this definition is that by the time a customer has stopped using the services for a while, it may be too late to take any corrective action to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless, since by that time the customer would already have switched to another operator.

In this project, you will use the **usage-based definition** to define _churn_.
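To make the difference between the two definitions concrete, here is a minimal sketch on toy data. The column names (`revenue_inr`, `outgoing_mou`, `incoming_mou`, `data_mb`) and the values are illustrative assumptions and are not taken from the project dataset; only the INR 4 threshold comes from the text above.

```python
import pandas as pd

# Toy data: all column names and values are hypothetical.
toy = pd.DataFrame(
    {
        "customer": ["A", "B", "C"],
        "revenue_inr": [120.0, 0.0, 0.0],    # revenue generated in the period
        "outgoing_mou": [300.0, 0.0, 0.0],   # outgoing minutes of usage
        "incoming_mou": [150.0, 90.0, 0.0],  # incoming minutes of usage
        "data_mb": [500.0, 0.0, 0.0],        # mobile data volume
    }
).set_index("customer")

REVENUE_THRESHOLD_INR = 4  # threshold quoted in the revenue-based definition

# Revenue-based: churned if revenue falls below the threshold.
revenue_churn = (toy["revenue_inr"] < REVENUE_THRESHOLD_INR).astype(int)

# Usage-based: churned only if there is no usage at all, incoming or outgoing.
usage_cols = ["outgoing_mou", "incoming_mou", "data_mb"]
usage_churn = (toy[usage_cols].sum(axis=1) == 0).astype(int)

comparison = pd.DataFrame(
    {"revenue_based": revenue_churn, "usage_based": usage_churn}
)
print(comparison)
```

Customer B only receives calls, so the revenue-based rule tags them as churned while the usage-based rule does not. This is the shortcoming of the revenue-based definition described above.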

 
### High-value Churn

In the Indian and Southeast Asian market, approximately 80% of revenue comes from the top 20% of customers (called high-value customers).
  
Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage.

In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.
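As a rough illustration of the 80/20 claim, the sketch below computes the share of revenue contributed by the top 20% of customers. The revenue figures are made up so that the arithmetic is easy to follow; they are not drawn from the project data.

```python
import pandas as pd

# Hypothetical per-customer revenue for ten customers (total = 1000).
revenue = pd.Series([500, 300, 50, 40, 30, 25, 20, 15, 10, 10], dtype=float)

# Number of customers in the top 20% (2 of the 10 here).
top_n = int(len(revenue) * 0.2)

# Share of total revenue generated by those customers.
top_share = revenue.nlargest(top_n).sum() / revenue.sum()
print(f"Top {top_n} customers generate {top_share:.0%} of revenue")
```

Running the same calculation on a real revenue column would show how concentrated revenue actually is before any high-value cut-off is chosen.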

 
### Understanding the Business Objective and the Data

The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. 


The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.

 
### Understanding Customer Behaviour During Churn

Customers usually do not decide to switch to a competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:

1. The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.

2. The ‘action’ phase: The customer experience starts to sour in this phase, e.g. the customer gets a compelling offer from a competitor, faces unjust charges, or becomes unhappy with service quality. In this phase, the customer usually shows different behaviour than in the ‘good’ months. It is also crucial to identify high-churn-risk customers in this phase, since some corrective actions can still be taken at this point (such as matching the competitor’s offer or improving the service quality).

3. The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. It is important to note that at the time of prediction (i.e. the action months), this data is not available to you. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.

 

In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
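The mapping from months to phases can be sketched as follows. This is a minimal example on toy data: it assumes that month-specific columns end in `_6`, `_7`, `_8` or `_9`, and the helper `columns_by_phase` is a hypothetical name introduced here for illustration. Columns that encode the month differently (for example with a month-name prefix) would need extra handling.

```python
import pandas as pd

# Assumption: the month is encoded as the last "_"-separated token of the column name.
PHASE_BY_SUFFIX = {"6": "good", "7": "good", "8": "action", "9": "churn"}


def columns_by_phase(columns):
    """Group column names by lifecycle phase using their month suffix."""
    phases = {"good": [], "action": [], "churn": []}
    for col in columns:
        suffix = col.rsplit("_", 1)[-1]
        phase = PHASE_BY_SUFFIX.get(suffix)
        if phase is not None:  # columns without a month suffix are skipped
            phases[phase].append(col)
    return phases


# Toy frame with one usage KPI per month; values are illustrative.
toy = pd.DataFrame(
    {
        "mobile_number": [1, 2],
        "total_og_mou_6": [100.0, 80.0],
        "total_og_mou_7": [90.0, 85.0],
        "total_og_mou_8": [20.0, 82.0],
        "total_og_mou_9": [0.0, 79.0],
    }
)

phases = columns_by_phase(toy.columns)

# Tag churn from the churn phase, then discard the churn-phase columns.
toy["churn"] = (toy[phases["churn"]].sum(axis=1) == 0).astype(int)
features = toy.drop(columns=phases["churn"])
print(features)
```

Dropping the month-9 columns after tagging keeps the label from leaking into the features, which matches the requirement that churn-phase data is unavailable at prediction time.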

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load the data
Load the churn data for the telecom provider

In [13]:
churn_data = pd.read_csv('telecom_churn_data.csv')

In [14]:
total_records = churn_data.shape[0]

In [15]:
missing_value_percentage = round(100 * churn_data.isnull().sum().sort_values(ascending=False) / total_records, 2)


In [16]:
missing_value_percentage[missing_value_percentage > 50.0]

count_rech_2g_6             74.85
date_of_last_rech_data_6    74.85
count_rech_3g_6             74.85
av_rech_amt_data_6          74.85
max_rech_data_6             74.85
total_rech_data_6           74.85
arpu_3g_6                   74.85
arpu_2g_6                   74.85
night_pck_user_6            74.85
fb_user_6                   74.85
arpu_3g_7                   74.43
count_rech_2g_7             74.43
fb_user_7                   74.43
count_rech_3g_7             74.43
arpu_2g_7                   74.43
av_rech_amt_data_7          74.43
max_rech_data_7             74.43
night_pck_user_7            74.43
total_rech_data_7           74.43
date_of_last_rech_data_7    74.43
night_pck_user_9            74.08
date_of_last_rech_data_9    74.08
fb_user_9                   74.08
arpu_2g_9                   74.08
max_rech_data_9             74.08
arpu_3g_9                   74.08
total_rech_data_9           74.08
av_rech_amt_data_9          74.08
count_rech_3g_9             74.08
count_rech_2g_

### Functions

In [17]:
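def plot_triangular(corr):
    """Plot a correlation matrix as a heatmap, hiding the upper triangle and the diagonal."""
    # Use the built-in bool: the np.bool alias is not available in every NumPy release
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True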

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, annot=True, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})


In [18]:
def impute_if_field_present(df, driver, toImpute, defaultVal=1, valIfAbsent=0 ):
    
    return None
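The helper above is left as a stub in the notebook. One plausible reading of its parameters, offered only as an assumption, is that it mirrors the date-driven imputation used later in the data cleaning: where the `driver` column is missing, fill the dependent fields with `valIfAbsent`, and where it is present, fill them with `defaultVal`. The function name `impute_by_driver` and the toy data below are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_driver(df, driver, to_impute, default_val=1, val_if_absent=0):
    """Hypothetical sketch: fill NaNs in `to_impute` depending on whether `driver` is present."""
    out = df.copy()
    driver_missing = out[driver].isnull()
    # Driver absent: missing dependent values become val_if_absent
    out.loc[driver_missing, to_impute] = (
        out.loc[driver_missing, to_impute].fillna(val_if_absent)
    )
    # Driver present: remaining missing dependent values become default_val
    out.loc[~driver_missing, to_impute] = (
        out.loc[~driver_missing, to_impute].fillna(default_val)
    )
    return out


# Toy data; the date strings are illustrative only.
toy = pd.DataFrame(
    {
        "date_of_last_rech_data_6": ["6/21/2014", None, "6/4/2014"],
        "total_rech_data_6": [2.0, np.nan, np.nan],
    }
)
result = impute_by_driver(toy, "date_of_last_rech_data_6", ["total_rech_data_6"])
print(result)
```

The sketch returns a new frame rather than modifying its input, so it can be tried on a copy of the data without side effects.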

#### Data cleaning


In [19]:
missing_value_percentage.filter(like='rech')

count_rech_2g_6             74.85
date_of_last_rech_data_6    74.85
count_rech_3g_6             74.85
av_rech_amt_data_6          74.85
max_rech_data_6             74.85
total_rech_data_6           74.85
count_rech_2g_7             74.43
count_rech_3g_7             74.43
av_rech_amt_data_7          74.43
max_rech_data_7             74.43
total_rech_data_7           74.43
date_of_last_rech_data_7    74.43
date_of_last_rech_data_9    74.08
max_rech_data_9             74.08
total_rech_data_9           74.08
av_rech_amt_data_9          74.08
count_rech_3g_9             74.08
count_rech_2g_9             74.08
av_rech_amt_data_8          73.66
count_rech_3g_8             73.66
count_rech_2g_8             73.66
date_of_last_rech_data_8    73.66
total_rech_data_8           73.66
max_rech_data_8             73.66
date_of_last_rech_9          4.76
date_of_last_rech_8          3.62
date_of_last_rech_7          1.77
date_of_last_rech_6          1.61
total_rech_num_6             0.00
total_rech_num

In [20]:
non_kpis_to_be_imputed = ['max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'max_rech_data_9']
# These values do not drive the KPIs, so missing entries are replaced with zero as per the instructions
churn_data[non_kpis_to_be_imputed] = churn_data[non_kpis_to_be_imputed].fillna(0)

# Impute NaN in the categorical variables with -1
cat_variables_to_be_imputed = ['night_pck_user_6', 'night_pck_user_7',
                               'night_pck_user_8', 'night_pck_user_9',
                               'fb_user_6', 'fb_user_7',
                               'fb_user_8', 'fb_user_9']
churn_data[cat_variables_to_be_imputed] = churn_data[cat_variables_to_be_imputed].fillna(-1)

In [21]:
# Date-dependent missing values: each data-recharge date column drives the KPIs listed with it
kpis_imputed_wrt_date = dict( date_of_last_rech_data_6 = ['total_rech_data_6', 
                                                          'count_rech_2g_6', 
                                                          'count_rech_3g_6', 'av_rech_amt_data_6' ],
                             date_of_last_rech_data_7 = ['total_rech_data_7', 
                                                          'count_rech_2g_7', 
                                                          'count_rech_3g_7', 'av_rech_amt_data_7' ], 
                             date_of_last_rech_data_8 = ['total_rech_data_8', 
                                                          'count_rech_2g_8', 
                                                          'count_rech_3g_8', 'av_rech_amt_data_8'],
                             date_of_last_rech_data_9 = ['total_rech_data_9', 
                                                          'count_rech_2g_9', 
                                                          'count_rech_3g_9', 'av_rech_amt_data_9'])


In [22]:

# Work on a copy so the original frame is left untouched
churn_data_clone = churn_data.copy()
for date, fields in kpis_imputed_wrt_date.items():
    no_recharge = churn_data_clone[date].isnull()
    # No data-recharge date recorded for the month: treat missing KPIs as 0
    churn_data_clone.loc[no_recharge, fields] = churn_data_clone.loc[no_recharge, fields].fillna(0)
    # A recharge date exists: fill any remaining missing KPIs with 1
    churn_data_clone.loc[~no_recharge, fields] = churn_data_clone.loc[~no_recharge, fields].fillna(1)


#### Data visualization
Initial distribution

In [None]:
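churn_data_jun = churn_data.filter(regex='(_6$|^jun_)')
plt.figure(figsize=(20, 10))
# distplot takes a single 1-D array and has no `data` argument (it is also deprecated);
# histplot accepts a wide-form DataFrame and draws one histogram per column
sns.histplot(data=churn_data_jun.filter(like='arpu'), bins=50)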

In [None]:
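# Average recharge amount over the good phase (June and July); the June column name
# 'total_rech_amt_6' is assumed by analogy with 'total_rech_amt_7'
churn_data['avg_good_phase'] = churn_data[['total_rech_amt_6', 'total_rech_amt_7']].mean(axis=1)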

In [None]:
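# Keep customers at or above the 70th percentile of good-phase recharge;
# .copy() avoids SettingWithCopyWarning when new columns are added to the subset
churn_data_hv = churn_data[churn_data.avg_good_phase >= churn_data['avg_good_phase'].quantile(0.7)].copy()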

In [None]:
churn_data_hv.shape

In [None]:
churn_data_jun = churn_data.filter(regex='(_6$|^jun_)')
churn_data_jul = churn_data.filter(regex='(_7$|^jul_)')
churn_data_aug = churn_data.filter(regex='(_8$|^aug_)')
churn_data_sep = churn_data.filter(regex='(_9$|^sep_)')

In [None]:
churn_data_jun.

In [None]:
100*churn_data_jun.isnull().sum().sort_values(ascending=False) / total_records

In [None]:
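# Usage KPIs for the churn phase (month 9): incoming/outgoing minutes and 2G/3G data volume
churn_kpi = ['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']
# Tag churn = 1 when all four KPIs sum to zero, i.e. no usage at all in month 9
churn_data_hv['churn'] = (churn_data_hv[churn_kpi].sum(axis=1) == 0).astype(int)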

In [None]:
churn_data_hv['churn'].shape

In [None]:
churn_data_hv.churn.sum() / churn_data_hv['churn'].shape[0]