### Customer Retention Prediction
• Introduction [10 marks]
o Background [1 paragraph of at least 6 lines]
o Research Problem [1 paragraph of at least 6 lines]
o Objectives [At least 3 SMART points]
o Hypothesis [At least 2 points]
• Methodology, Results and Discussion
o Data description [1 paragraph of at least 10 lines with variables described in a table] [10 marks]
✓ Source of data
✓ Period collected (year and month/day)
✓ How it was collected
✓ Under what conditions was it collected
o Exploratory data analytics [with visualizations and interpretations; new hindsights] [10 marks]
✓ Descriptive analytics
✓ Diagnostic analytics
o Data cleaning/pre-treatment for machine learning purposes [10 marks]
o Predictive data analytics [with data science and machine learning models; new foresights] [15 marks]
• Conclusion [5 marks]

### Introduction


### Background
The telecommunication industry has seen significant growth over the past few decades, becoming a cornerstone of modern society. With the rise of mobile and internet services, customer retention has become a critical challenge for telecom companies. High customer churn rates not only affect profitability but also reflect customer dissatisfaction. Understanding the factors leading to customer churn and developing strategies to mitigate it are crucial for sustaining growth and competitiveness.

### Research Problem
Despite various efforts by telecom companies to improve customer retention, high churn rates remain a pervasive issue. The challenge lies in identifying the key factors that influence a customer's decision to switch providers and predicting potential churners with high accuracy. This study aims to analyze customer data, identify significant churn predictors, and develop a predictive model to help telecom companies proactively address churn risks.

### Objectives

1. Specific: To identify the key factors that contribute to customer churn in the telecom industry.
2. Measurable: To develop a predictive model with an accuracy of at least 80% in identifying churners.
3. Achievable: To utilize available customer data and machine learning techniques to build and validate the model.
4. Relevant: To provide actionable insights that can be used by telecom companies to reduce churn rates.
5. Time-bound: To complete the analysis and model development within three months.
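
The 80% accuracy target in objective 2 is easier to interpret next to a majority-class baseline, because a model that always predicts "stayed" can already score well when churners are the minority. The sketch below shows one way to make that comparison. The labels and the helper names `accuracy` and `majority_baseline_accuracy` are illustrative assumptions, not values or functions from this study.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    y_true = np.asarray(y_true)
    churn_rate = float(np.mean(y_true == 1))
    return max(churn_rate, 1.0 - churn_rate)

# Hypothetical labels: 1 = churned, 0 = stayed (illustrative, not from the dataset)
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)

# The target is only meaningful if the model also beats the trivial baseline
meets_target = model_acc >= 0.80 and model_acc > baseline_acc
print(model_acc, baseline_acc, meets_target)
```

In practice the same check would be run on a held-out test split rather than on the data used to fit the model.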

### Hypothesis
1. Customers with lower satisfaction scores are more likely to churn.
2. High usage of support services (e.g. customer service calls) is positively correlated with churn.
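
A hypothesis of the second kind could be checked with a simple correlation between a usage count and the binary churn flag. The sketch below is only an illustration: the column name `Support Calls` and all values are assumptions made for the example and are not taken from the dataset used in this report.

```python
import pandas as pd

# Hypothetical data: 'Support Calls' is an assumed column name, values are illustrative
df = pd.DataFrame({
    "Support Calls": [0, 1, 0, 4, 5, 2, 6, 1],
    "Churn":         [0, 0, 0, 1, 1, 0, 1, 0],
})

# Pearson correlation against a 0/1 variable is the point-biserial correlation
r = df["Support Calls"].corr(df["Churn"])

# Average number of calls for customers who stayed (0) and churned (1)
mean_calls = df.groupby("Churn")["Support Calls"].mean()

print(round(r, 3))
print(mean_calls)
```

A positive `r` together with a higher mean for the churned group would be consistent with the hypothesis, although correlation alone does not establish a causal link.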

In [10]:
import pandas as pd
import numpy as np
import pylab as pl
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
# Read the customer churn dataset
churn = pd.read_csv('telecom_customer_churn.csv')
churn.head()

Unnamed: 0,Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,...,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0,0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,...,Credit Card,65.6,593.3,0.0,0,381.51,974.81,Stayed,,
1,0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,...,Credit Card,-4.0,542.4,38.33,10,96.21,610.28,Stayed,,
2,0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,...,Bank Withdrawal,73.9,280.85,0.0,0,134.6,415.45,Churned,Competitor,Competitor had better devices
3,0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,...,Bank Withdrawal,98.0,1237.85,0.0,0,361.66,1599.51,Churned,Dissatisfaction,Product dissatisfaction
4,0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,...,Credit Card,83.9,267.4,0.0,0,22.14,289.54,Churned,Dissatisfaction,Network reliability


In [3]:
# info is a method: it must be called with parentheses to print the column dtypes
# and non-null counts; without them the notebook only echoes the bound method
churn.info()

In [5]:
churn.describe()

Unnamed: 0,Age,Number of Dependents,Zip Code,Latitude,Longitude,Number of Referrals,Tenure in Months,Avg Monthly Long Distance Charges,Avg Monthly GB Download,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,6361.0,5517.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,46.509726,0.468692,93486.070567,36.197455,-119.756684,1.951867,32.386767,25.420517,26.189958,63.596131,2280.381264,1.962182,6.860713,749.099262,3034.379056
std,16.750352,0.962802,1856.767505,2.468929,2.154425,3.001199,24.542061,14.200374,19.586585,31.204743,2266.220462,7.902614,25.104978,846.660055,2865.204542
min,19.0,0.0,90001.0,32.555828,-124.301372,0.0,1.0,1.01,2.0,-10.0,18.8,0.0,0.0,0.0,21.36
25%,32.0,0.0,92101.0,33.990646,-121.78809,0.0,9.0,13.05,13.0,30.4,400.15,0.0,0.0,70.545,605.61
50%,46.0,0.0,93518.0,36.205465,-119.595293,0.0,29.0,25.69,21.0,70.05,1394.55,0.0,0.0,401.44,2108.64
75%,60.0,0.0,95329.0,38.161321,-117.969795,3.0,55.0,37.68,30.0,89.75,3786.6,0.0,0.0,1191.1,4801.145
max,80.0,9.0,96150.0,41.962127,-114.192901,11.0,72.0,49.99,85.0,118.75,8684.8,49.79,150.0,3564.72,11979.34
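
The summary above points to two data quality issues: the `count` row shows fewer than 7043 non-null values for `Avg Monthly Long Distance Charges` (6361) and `Avg Monthly GB Download` (5517), and the minimum `Monthly Charge` is negative (-10.0). The sketch below shows one possible way to surface both issues. The helper `quality_report` is a hypothetical name, and the sample frame mixes the `Monthly Charge` values from the first rows shown earlier with made-up values for the other columns.

```python
import numpy as np
import pandas as pd

def quality_report(df, charge_col="Monthly Charge"):
    """Return columns with missing values and the count of negative charges."""
    missing = df.isna().sum()
    missing = missing[missing > 0]          # keep only columns that have gaps
    n_negative = int((df[charge_col] < 0).sum())
    return missing, n_negative

# Small illustrative frame; the GB download and tenure values are assumptions
sample = pd.DataFrame({
    "Monthly Charge": [65.6, -4.0, 73.9, 98.0],
    "Avg Monthly GB Download": [16.0, np.nan, 30.0, np.nan],
    "Tenure in Months": [9, 9, 4, 13],
})

missing, n_negative = quality_report(sample)
print(missing)
print("Negative monthly charges:", n_negative)
```

On the full dataset the call would be `quality_report(churn)`. How the flagged rows are handled (dropping, imputing, or correcting them) is a separate decision that depends on why the values are missing or negative.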


In [6]:
churn.shape


(7043, 38)

In [19]:
churn.columns

Index(['Customer ID', 'Gender', 'Age', 'Married', 'Number of Dependents',
       'City', 'Zip Code', 'Latitude', 'Longitude', 'Number of Referrals',
       'Tenure in Months', 'Offer', 'Phone Service',
       'Avg Monthly Long Distance Charges', 'Multiple Lines',
       'Internet Service', 'Internet Type', 'Avg Monthly GB Download',
       'Online Security', 'Online Backup', 'Device Protection Plan',
       'Premium Tech Support', 'Streaming TV', 'Streaming Movies',
       'Streaming Music', 'Unlimited Data', 'Contract', 'Paperless Billing',
       'Payment Method', 'Monthly Charge', 'Total Charges', 'Total Refunds',
       'Total Extra Data Charges', 'Total Long Distance Charges',
       'Total Revenue', 'Customer Status', 'Churn Category', 'Churn Reason',
       'Churn'],
      dtype='object')

### Data pre-processing and selection


In [20]:
# Keep only the columns used in this analysis.
# .copy() avoids a SettingWithCopyWarning when columns are assigned later.
churn = churn[['Gender', 'Age', 'Tenure in Months', 'Offer', 'Monthly Charge',
               'Total Charges', 'Customer Status', 'Churn Category', 'Churn Reason']].copy()
churn['Customer Status'] = churn['Customer Status'].astype('str')
churn.head()
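
In the first rows of the dataset shown earlier, `Churn Category` and `Churn Reason` are empty for customers who stayed and filled in for customers who churned, so they appear to reveal the target directly. A minimal sketch of preparing model inputs, assuming those two columns are excluded from the features and the categorical columns are one-hot encoded, could look like this. The rows below are illustrative and not copied from the data.

```python
import pandas as pd

# Illustrative rows using the selected column names; values are made up
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [37, 46, 50, 75],
    "Tenure in Months": [9, 9, 4, 3],
    "Offer": ["None", "None", "Offer E", "Offer D"],
    "Monthly Charge": [65.6, 59.9, 73.9, 83.9],
    "Total Charges": [593.3, 542.4, 280.85, 267.4],
    "Customer Status": ["Stayed", "Stayed", "Churned", "Churned"],
    "Churn Category": [None, None, "Competitor", "Dissatisfaction"],
    "Churn Reason": [None, None, "Competitor had better devices", "Network reliability"],
})

# Target: 1 for churned customers, 0 otherwise
y = (df["Customer Status"] == "Churned").astype(int)

# Drop the target source and the columns that are only known after churn
X = df.drop(columns=["Customer Status", "Churn Category", "Churn Reason"])

# One-hot encode the categorical features so a model can consume them
X = pd.get_dummies(X, columns=["Gender", "Offer"])
print(X.columns.tolist())
```

The two churn columns remain useful for descriptive analysis of why customers leave; the assumption here is only that they should not be used as predictors.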

In [16]:
# Encode the target from 'Customer Status': 1 for Churned, 0 otherwise
churn['Churn'] = churn['Customer Status'].apply(lambda x: 1 if x == 'Churned' else 0)

# Gender distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=churn, x='Gender')  # data= tells seaborn where to find 'Gender'
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

In [18]:
# Monthly Charge vs. Total Charges, colored by churn
plt.figure(figsize=(10, 6))
sns.scatterplot(data=churn, x='Monthly Charge', y='Total Charges', hue='Churn')  # the keyword is data=, not churn=
plt.title('Monthly Charge vs. Total Charges by Churn')
plt.xlabel('Monthly Charge')
plt.ylabel('Total Charges')
plt.show()