# <center>Reducing Customer Churn: Using Machine Learning to Predict Customer Retention at Syriatel Mobile Telecom</center>

## Introduction

Business growth and development remain central motivators in organizational decision-making and policy making. Although every business leader aspires to grow revenues, clientele, and profitability, they must also work to avoid losses as far as possible.

In recent years, such leaders, as well as business experts, have identified customer satisfaction as an important factor in ensuring such growth and development. Without customers, a business would not make any sales, record any revenue, or make any profits. This underscores the need for organizations to implement measures that retain existing customers.

Recent technological advancements have also intensified business rivalry, especially through the growing number of startups and new entrants. Such competition, coupled with increasingly saturated markets, has made it harder and more expensive for businesses in most sectors to acquire new clients, so they must shift their focus to cementing relationships with existing customers.

A 2014 article by Amy Gallo, [The Value of Keeping the Right Customers](https://hbr.org/2014/10/the-value-of-keeping-the-right-customers), stresses the importance of investing more in retaining existing customers (avoiding customer churn) than in acquiring new ones. Gallo notes that acquiring a new customer costs 5 to 25 times more than retaining an existing one, while increasing customer retention rates by 5% increases profits by 25% to 95%.

![Retaining Existing Customers vs Acquiring New Ones: Why a Business Should Avoid Customer Churning](images/customer-retention-vs-acquisition.png)
**Source:** [The Value of Keeping the Right Customers](https://www.netscribes.com/customer-retention-strategies/)

Through this project, we are building a prediction model that identifies patterns in customer churning, which can be helpful in developing mitigation strategies. The project is structured as follows: 

1. **Business Understanding**
2. **Data Understanding**
3. **Data Preparation**
4. **Exploratory Data Analysis**
5. **Modelling**
6. **Model Evaluation**
7. **Recommendations and Conclusions**

## Business Understanding 
With a growing mix of pressures in telecommunication markets, including competition, technological innovation, and globalization, **Syriatel Mobile Telecom** has stressed the need to improve customer satisfaction and preserve its clientele of 8 million customers. Through its [LinkedIn profile](https://sy.linkedin.com/company/syriatel), the Syrian telecommunication giant reiterates its commitment to maintaining its market position by establishing "*its reputation by focusing on customer satisfaction and social responsibility.*"

Although such efforts have been fruitful over the years, the company needs to increase its commitment to reducing customer churn rates, since high churn might threaten its market position, profitability, and overall growth. Retaining its 8 million customers will help the company reduce costs, avoid losses, and increase sales. Further, such actions would contribute to an increased ROI, reduced marketing costs, greater customer loyalty, and further client acquisition through referrals, as outlined by Amy Gallo.

Hence, this project will help **Syriatel Mobile Telecom** identify the customers with the highest probabilities of churning, which will be crucial for implementing new policies and business frameworks intended to ensure retention. As defined by Amy Gallo, *"Customer churn rate is a metric that measures the percentage of customers who end their relationship with a company in a particular period."* In this scenario, the emphasis is on identifying prospective churners among Syriatel's customer base and making the strategic business decisions needed to retain such clients.
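
The churn-rate definition quoted above reduces to a simple ratio. The snippet below is a minimal sketch of that calculation; the function name `churn_rate` and the customer counts are hypothetical and are not taken from Syriatel's data.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the percentage of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical figures, for illustration only.
rate = churn_rate(customers_at_start=1000, customers_lost=50)
print(f"Churn rate: {rate:.1f}%")
```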

**Primary stakeholder:** 
+ Syriatel Mobile Telecom

**Other Stakeholders:** 
+ Shareholders
+ Employees
+ Customers


As the principal stakeholder, the company stands to benefit from this model through a reduction in customer churn rates, which has the potential to increase revenues and profits, promote growth, and sustain or even strengthen its market position. Customers will also benefit from improved telecommunication services and better customer service. As the company continues to grow in revenues, profits, customer numbers, and market share, shareholders will earn higher returns on their investments (ROI), while employees benefit from better remuneration and bonuses.


The project aims to provide value to the different stakeholders by identifying predictable patterns related to customer churn, which can help Syriatel take proactive measures to retain customers and minimize revenue loss.

**Research Objectives**
1. To establish the significant predictors of customer churn, that is, which features determine whether a customer will churn.
2. To develop a classifier model that predicts whether or not a customer will churn.
3. To establish cost-effective strategies that Syriatel can put in place to retain customers.

+ To establish if customer service calls have an impact on customer churn
+ To establish if customers with international plans are less likely to churn
+ To establish if customers with high day charges are less likely to churn
+ To establish if customers with high night charges are less likely to churn
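
One way to probe the sub-objectives above is to compare churn rates across groups. The following is a rough sketch only: the toy rows are made up and merely reuse the dataset's column names, and the helper `churn_rate_by` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy rows for illustration only; not drawn from the real dataset.
toy = pd.DataFrame({
    "international plan": ["no", "yes", "no", "yes", "no", "no"],
    "customer service calls": [1, 4, 0, 5, 2, 4],
    "total day charge": [45.07, 27.47, 41.38, 50.90, 28.34, 37.98],
    "churn": [False, True, False, True, False, True],
})


def churn_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of churned customers within each value of `column`."""
    return df.groupby(column)["churn"].mean()


print(churn_rate_by(toy, "international plan"))
print(churn_rate_by(toy, "customer service calls"))

# For a continuous feature, compare its average between churners and non-churners.
print(toy.groupby("churn")["total day charge"].mean())
```

Group comparisons like these only suggest associations; a statistical test or the fitted model would be needed before drawing conclusions.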


**Research Questions**
1. Which features contribute to Customer churn?
2. 
3. 

**Objectives need further review....**

## Data Understanding
The Churn in Telecom’s dataset from Kaggle contains information about customer activity and whether or not each customer canceled their subscription with Orange Telecom. The goal of this dataset is to develop predictive models that can help the telecom business reduce the amount of money lost due to customers who don’t stick around for very long.

The dataset contains 3333 entries and 21 columns, including information about the state, account length, area code, phone number, international plan, voice mail plan, number of voice mail messages, total day minutes, total day calls, total day charge, total evening minutes, total evening calls, total evening charge, total night minutes, total night calls, total night charge, total international minutes, total international calls, total international charge, customer service calls and churn.

In this phase of the project, we will focus on getting familiar with the data and identifying any potential data quality issues. We will also perform some initial exploratory data analysis to discover first insights into the data.
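
A few quick checks usually cover the most common data quality issues: missing values, duplicated rows, and identifiers that should be unique. Below is a minimal sketch on a small toy frame that borrows the dataset's column names; the helper `quality_report` is a hypothetical name, not part of pandas.

```python
import pandas as pd

# Small illustrative frame; the real data would be loaded from the CSV file.
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
    "churn": [False, False, False, False],
})


def quality_report(df: pd.DataFrame, id_column: str) -> dict:
    """Summarize basic data quality indicators for a DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_column].duplicated().sum()),
    }


print(quality_report(toy, "phone number"))
```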

**More needed**..

## Data Preparation
In this section, we take several steps to prepare our data for exploratory data analysis and modelling. First, we import all the necessary libraries, load the dataset using the pandas library, preview the data (the number of features and records, as well as summary statistics), and conduct thorough data preprocessing (checking for and removing any missing values and transforming data).

Here, we import all the libraries we will use for this project and load the data into a pandas dataframe.

In [23]:
# Importing libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from matplotlib import pyplot as plt
%matplotlib inline

#Loading the data into a pandas dataframe
data = pd.read_csv('data/bigml_59c28831336c6604c800002a.csv')

Afterward, we examine the data to determine the number of features, understand whether we have any missing values, identify columns that need transformation for modelling, and gather any other insights we may need before proceeding to the next step.

In [24]:
#Checking the features/columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

As shown above, we have 3333 data records and 21 columns, with zero null values. However, we will need to review the data further to identify missing values, especially those in the form of placeholder values or unique characters. Four (4) of our columns are of the object type, eight (8) are integers, eight (8) are floats, and one (1) is boolean. Our target variable is the churn column, which means we will treat the rest of the columns as features.
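
The dtype tally and the feature/target split described above can be reproduced in a few lines. This is a sketch on a toy frame with only four of the columns, so the counts it prints will not match the full dataset; the variable names `X` and `y` are conventional choices, not something fixed by the project.

```python
import pandas as pd

# Toy frame with one column of each kind found in the data.
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "account length": [128, 107, 137],
    "total day minutes": [265.1, 161.6, 243.4],
    "churn": [False, False, False],
})

# Count how many columns fall under each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Separate the target column from the remaining feature columns.
X = toy.drop(columns=["churn"])
y = toy["churn"]
print(X.shape, y.shape)
```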

We also need to preview the top 10 and bottom 10 data records to get a glimpse of what we are dealing with.

In [25]:
#Checking the top 10 data records
data.head(10)

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
5,AL,118,510,391-8027,yes,no,0,223.4,98,37.98,...,101,18.75,203.9,118,9.18,6.3,6,1.7,0,False
6,MA,121,510,355-9993,no,yes,24,218.2,88,37.09,...,108,29.62,212.6,118,9.57,7.5,7,2.03,3,False
7,MO,147,415,329-9001,yes,no,0,157.0,79,26.69,...,94,8.76,211.8,96,9.53,7.1,6,1.92,0,False
8,LA,117,408,335-4719,no,no,0,184.5,97,31.37,...,80,29.89,215.8,90,9.71,8.7,4,2.35,1,False
9,WV,141,415,330-8173,yes,yes,37,258.6,84,43.96,...,111,18.87,326.4,97,14.69,11.2,5,3.02,0,False


In [26]:
#Checking the bottom 10 data records
data.tail(10)

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
3323,IN,117,415,362-5899,no,no,0,118.4,126,20.13,...,97,21.19,227.0,56,10.22,13.6,3,3.67,5,True
3324,WV,159,415,377-1164,no,no,0,169.8,114,28.87,...,105,16.8,193.7,82,8.72,11.6,4,3.13,1,False
3325,OH,78,408,368-8555,no,no,0,193.4,99,32.88,...,88,9.94,243.3,109,10.95,9.3,4,2.51,2,False
3326,OH,96,415,347-6812,no,no,0,106.6,128,18.12,...,87,24.21,178.9,92,8.05,14.9,7,4.02,1,False
3327,SC,79,415,348-3830,no,no,0,134.7,98,22.9,...,68,16.12,221.4,128,9.96,11.8,5,3.19,2,False
3328,AZ,192,415,414-4276,no,yes,36,156.2,77,26.55,...,126,18.32,279.1,83,12.56,9.9,6,2.67,2,False
3329,WV,68,415,370-3271,no,no,0,231.1,57,39.29,...,55,13.04,191.3,123,8.61,9.6,4,2.59,3,False
3330,RI,28,510,328-8230,no,no,0,180.8,109,30.74,...,58,24.55,191.9,91,8.64,14.1,6,3.81,2,False
3331,CT,184,510,364-6381,yes,no,0,213.8,105,36.35,...,84,13.57,139.2,137,6.26,5.0,10,1.35,2,False
3332,TN,74,415,400-4344,no,yes,25,234.4,113,39.85,...,82,22.6,241.4,77,10.86,13.7,4,3.7,0,False


In this step, we review our dataset to check for missing values we might have missed before. Remember, we previewed the data columns and records and found no null values. However, that does not mean that we have no missing values in this data. We need to dive deeper into the data to see if we have missing values in the form of placeholder values or unusual unique values.

In [27]:
#Checking whether we have missing values
all_columns = data.columns.tolist()
unique_vals = data[all_columns].apply(lambda x: x.unique())
unique_vals

state                     [KS, OH, NJ, OK, AL, MA, MO, LA, WV, IN, RI, I...
account length            [128, 107, 137, 84, 75, 118, 121, 147, 117, 14...
area code                                                   [415, 408, 510]
phone number              [382-4657, 371-7191, 358-1921, 375-9999, 330-6...
international plan                                                [no, yes]
voice mail plan                                                   [yes, no]
number vmail messages     [25, 26, 0, 24, 37, 27, 33, 39, 30, 41, 28, 34...
total day minutes         [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218...
total day calls           [110, 123, 114, 71, 113, 98, 88, 79, 97, 84, 1...
total day charge          [45.07, 27.47, 41.38, 50.9, 28.34, 37.98, 37.0...
total eve minutes         [197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348....
total eve calls           [99, 103, 110, 88, 122, 101, 108, 94, 80, 111,...
total eve charge          [16.78, 16.62, 10.3, 5.26, 12.61, 18.75, 29.62...
total night 
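
Placeholder values such as `?` or blank strings do not register as nulls, so scanning unique values by eye can be backed up with an explicit count. The sketch below assumes a hypothetical set of placeholder tokens (`PLACEHOLDERS`), since the dataset does not document any; the toy rows deliberately contain two made-up placeholders.

```python
import pandas as pd

# Hypothetical tokens that sometimes stand in for missing data.
PLACEHOLDERS = {"?", "", "na", "n/a", "none", "null", "unknown", "-"}


def count_placeholders(df: pd.DataFrame) -> pd.Series:
    """Count placeholder-like strings in each non-numeric column."""
    counts = {}
    for column in df.columns:
        values = df[column]
        # Numeric and boolean columns cannot hold string placeholders.
        if pd.api.types.is_numeric_dtype(values) or pd.api.types.is_bool_dtype(values):
            continue
        cleaned = values.astype(str).str.strip().str.lower()
        counts[column] = int(cleaned.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")


toy = pd.DataFrame({
    "state": ["KS", "?", "NJ"],                # one made-up placeholder
    "international plan": ["no", "yes", " "],  # one whitespace-only entry
    "total day calls": [110, 123, 114],
})
print(count_placeholders(toy))
```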

In [28]:
#Viewing the statistical details such as std, percentile, count, and the mean
data.describe()

Unnamed: 0,account length,area code,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0
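
By default, `describe()` summarizes only the numeric columns, so the yes/no plan columns and the boolean target are absent from the table above. A possible complement, sketched here on toy rows (the single `True` value is made up for illustration), is to look at value proportions for those columns:

```python
import pandas as pd

# Toy rows for illustration; the churn values are not from the real data.
toy = pd.DataFrame({
    "international plan": ["no", "no", "no", "yes", "yes"],
    "voice mail plan": ["yes", "yes", "no", "no", "no"],
    "churn": [False, False, False, False, True],
})

# Proportion of each category in the non-numeric columns.
for column in ["international plan", "voice mail plan", "churn"]:
    print(toy[column].value_counts(normalize=True), end="\n\n")

# The mean of a boolean column is the share of True values, i.e. the churn share.
churn_share = toy["churn"].mean()
print(f"Share of churners: {churn_share:.2f}")
```

Checking the share of churners early is useful because a heavily imbalanced target affects how classification metrics should be read later on.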
