## Business Understanding

Telecommunication services have expanded dramatically as consumer demand has grown and as more businesses rely on them to expand. As a result, telecommunications has become one of the most competitive and fastest-growing industries in the world. Although some policies aim to reduce rivalry among large corporations, competition among telecom companies has intensified in recent years.
In the US, the telecom market continues to see intense price competition. SyriaTel competes with leading telecom companies such as AT&T, Verizon, and Comcast, and it is looking for ways to retain as many customers as it can. However, it does not yet know which factors cause customers to leave. As a solution, we will build several predictive models that identify customers who are likely to churn, so that SyriaTel can take action before they leave. We evaluate the models by their recall score because we want to minimize false negatives, that is, churners the model fails to flag.
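To make the metric concrete: recall is the share of actual churners that a model flags, TP / (TP + FN). The following is a minimal sketch in plain Python on made-up labels; the `recall` helper and the `y_true` / `y_pred` lists are illustrative assumptions, not SyriaTel data or the models built later.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): share of actual churners the model catches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative labels only (True = churned); not taken from the data set.
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False]

example_recall = recall(y_true, y_pred)
print(example_recall)  # 3 of 4 churners caught -> 0.75
```

Note that the one false positive in `y_pred` does not affect recall; only the missed churner does.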

In [91]:
#importing relevant software libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier, DummyRegressor
import numpy as np

## Data Understanding

We are sourcing the data for this project from the SyriaTel dataset, which can be found at [Syria Tel Dataset Link](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset). In total, we reviewed 3,333 data points from this data set. Our visualizations can also be found separately in the images folder of this repository. We will analyze variables such as whether the customer has an international plan or a voice mail plan, the number of voicemail messages, total calling minutes, the state the customer lives in, and the charges for each time of day. Limitations include the risk of overfitting on a small data set, the age and narrow scope of the data, and possible outliers.

In [92]:
#importing our data set
df = pd.read_csv('../data/bigml_59c28831336c6604c800002a.csv')
df

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.70,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.70,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.30,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.90,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,AZ,192,415,414-4276,no,yes,36,156.2,77,26.55,...,126,18.32,279.1,83,12.56,9.9,6,2.67,2,False
3329,WV,68,415,370-3271,no,no,0,231.1,57,39.29,...,55,13.04,191.3,123,8.61,9.6,4,2.59,3,False
3330,RI,28,510,328-8230,no,no,0,180.8,109,30.74,...,58,24.55,191.9,91,8.64,14.1,6,3.81,2,False
3331,CT,184,510,364-6381,yes,no,0,213.8,105,36.35,...,84,13.57,139.2,137,6.26,5.0,10,1.35,2,False


We are dropping the phone number and area code columns because we believe they lack mathematical or statistical meaning.

In [93]:
##dropping irrelevant columns
df = df.drop(columns = ['phone number', 'area code'])

## Creating 5 new features: total minutes and average rate per minute for each timeframe

In [94]:
df['average_night_rate_by_min'] = df['total night charge'] / df['total night minutes']

In [95]:
df['average_day_rate_by_min'] = df['total day charge'] / df['total day minutes'] 

In [96]:
df['average_intl_rate_by_min'] = df['total intl charge'] / df['total intl minutes']

In [97]:
df['average_eve_rate_by_min'] = df['total eve charge'] / df['total eve minutes']

In [98]:
df['total_minutes'] = df['total night minutes'] + df['total day minutes'] + df['total intl minutes'] + df['total eve minutes']

In [99]:
#dropping rows with null values (these likely come from the rate columns, where 0 charge / 0 minutes gives NaN)
df.dropna(axis=0, inplace=True)
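The rate features divide a charge by the matching minutes, so a customer with zero minutes in a timeframe yields 0 / 0, which pandas stores as NaN; those are the rows `dropna` removes. A minimal sketch of that behavior follows. The `toy` frame borrows a few values from the preview above and adds one zero-minute row for illustration, and the loop over timeframes is an assumed alternative to the separate cells, not the notebook's own code.

```python
import pandas as pd

# Toy rows for illustration; the second customer made no international calls.
toy = pd.DataFrame({
    "total intl minutes": [10.0, 0.0, 13.7],
    "total intl charge": [2.70, 0.0, 3.70],
    "total night minutes": [244.7, 254.4, 162.6],
    "total night charge": [11.01, 11.45, 7.32],
})

# Same arithmetic as the cells above, written as a loop over timeframes.
for period in ["intl", "night"]:
    toy[f"average_{period}_rate_by_min"] = (
        toy[f"total {period} charge"] / toy[f"total {period} minutes"]
    )

rows_before = len(toy)
toy_clean = toy.dropna(axis=0)  # drops the row where 0 / 0 produced NaN
rows_dropped = rows_before - len(toy_clean)
print(rows_dropped)
```

Dropping these rows also discards otherwise valid customers, so it is worth checking how many rows are lost before deciding whether to drop them or fill the rate with a default.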

In [100]:
##defining our X and target variable before train test split
X = df.drop('churn', axis=1)
y = df['churn']

In [101]:
#Running our train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
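The call above relies on the default `test_size` of 0.25 and does not stratify. Because the target is imbalanced, one option (an alternative, not what this notebook does) is to pass `stratify=y` so that both partitions keep the same churn share. A minimal sketch on synthetic data, where `X_demo` and `y_demo` are made-up stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 16% churners.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"total_minutes": rng.normal(590, 90, size=1000)})
y_demo = pd.Series([True] * 160 + [False] * 840, name="churn")

# stratify=y_demo keeps the churn share the same in both partitions.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

print(len(Xd_train), len(Xd_test))
print(round(yd_train.mean(), 3), round(yd_test.mean(), 3))
```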

In [102]:
X_train

Unnamed: 0,state,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,...,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,average_night_rate_by_min,average_day_rate_by_min,average_intl_rate_by_min,average_eve_rate_by_min,total_minutes
228,VA,104,no,yes,23,280.2,136,47.63,220.5,92,...,6.16,13.3,3,3.59,4,0.044996,0.169986,0.269925,0.084989,650.9
367,MD,45,no,no,0,78.2,127,13.29,253.4,108,...,11.48,18.0,3,4.86,1,0.045020,0.169949,0.270000,0.085004,604.6
872,OK,149,no,yes,43,206.7,79,35.14,174.6,122,...,10.87,10.9,3,2.94,1,0.045010,0.170005,0.269725,0.084994,633.7
1266,IA,42,no,no,0,155.4,127,26.42,164.1,45,...,7.10,9.0,3,2.43,0,0.045022,0.170013,0.270000,0.085009,486.2
277,SD,144,no,yes,48,189.8,96,32.27,123.4,67,...,9.64,6.5,2,1.76,2,0.045005,0.170021,0.270769,0.085008,533.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1103,KS,52,no,no,0,165.5,78,28.14,205.5,89,...,9.61,12.2,6,3.29,0,0.044991,0.170030,0.269672,0.085012,596.8
1138,MA,46,no,no,0,139.4,81,23.70,223.7,113,...,7.79,13.6,6,3.67,1,0.045003,0.170014,0.269853,0.084980,549.8
1302,WA,171,no,no,0,270.5,69,45.99,230.0,112,...,6.12,9.6,5,2.59,1,0.045000,0.170018,0.269792,0.085000,646.1
865,MD,52,no,no,0,209.8,114,35.67,171.3,82,...,6.96,9.9,9,2.67,4,0.045019,0.170019,0.269697,0.084997,545.6


In [103]:
#the classes are imbalanced, so we will need to use SMOTE in the modeling notebook
y_train.value_counts()

False    2116
True      368
Name: churn, dtype: int64
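With roughly 15% churners, a model that always predicts "no churn" looks accurate while catching no churners at all, which is why recall is the yardstick here. The sketch below illustrates this with `DummyClassifier` on labels rebuilt from the counts above; the placeholder feature matrix and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the value counts above: 2116 stayed (0), 368 churned (1).
y_counts = np.array([0] * 2116 + [1] * 368)
# Placeholder features; DummyClassifier ignores them.
X_placeholder = np.zeros((len(y_counts), 1))

# Always predict the most frequent class, i.e. "no churn".
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_counts)
preds = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_counts, preds)
baseline_recall = recall_score(y_counts, preds)
print(round(baseline_accuracy, 3), baseline_recall)
```

Any model worth deploying has to beat this baseline on recall, not just on accuracy.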

In [104]:
###concatenating our independent variables with our target
training_data = pd.concat([X_train, y_train], axis=1)
testing_data = pd.concat([X_test, y_test], axis=1)

In [105]:
#saved training_data
training_data.to_csv('final_training_data.csv')

In [106]:
#saved  testing_data
testing_data.to_csv('final_testing_data.csv')

## EDA: Churn rates by international plan, voice mail plan, customer service calls, and total minutes

In [107]:
y_train.value_counts()

False    2116
True      368
Name: churn, dtype: int64

In [108]:
#training data churn rate
dataset_churn = (368/ 2484)
dataset_churn

0.14814814814814814

In [109]:
intplan_yes = training_data.loc[training_data['international plan'] =='yes']


In [110]:
intplan_yes['churn'].value_counts()

False    137
True      99
Name: churn, dtype: int64

In [111]:
#having an international plan churn rate
intplan_yes_churn = (99 /236)
intplan_yes_churn

0.4194915254237288

In [112]:
intplan_no = training_data.loc[training_data['international plan'] =='no']


In [113]:
intplan_no['churn'].value_counts()

False    1979
True      269
Name: churn, dtype: int64

In [114]:
#not having an international plan churn rate
intplan_no_churn = (269 / 2248)
intplan_no_churn

0.11966192170818506

In [115]:
voicemailplan_yes = training_data.loc[training_data['voice mail plan'] =='yes']

In [116]:
voicemailplan_yes['churn'].value_counts()

False    626
True      61
Name: churn, dtype: int64

In [117]:
#having a voicemail plan churn rate
voicemailplan_yes_churn = (61 / 687)
voicemailplan_yes_churn

0.08879184861717612

In [118]:
voicemailplan_no = training_data.loc[training_data['voice mail plan'] =='no']

In [119]:
voicemailplan_no['churn'].value_counts()

False    1490
True      307
Name: churn, dtype: int64

In [120]:
#not having a voicemail plan churn rate
voicemailplan_no_churn = (307/ 1797)
voicemailplan_no_churn

0.17084028937117418
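The filter-then-count pattern used above can be condensed: because `churn` is boolean, the mean of the column within each group is that group's churn rate. A minimal sketch on a small made-up frame (the rows are illustrative, only the column names match the notebook):

```python
import pandas as pd

# Made-up rows for illustration; column names match the notebook's.
demo = pd.DataFrame({
    "international plan": ["yes", "yes", "no", "no", "no", "no"],
    "voice mail plan": ["no", "yes", "no", "no", "yes", "yes"],
    "churn": [True, False, False, True, False, False],
})

# The mean of a boolean column is the share of True values.
intl_rates = demo.groupby("international plan")["churn"].mean()
vmail_rates = demo.groupby("voice mail plan")["churn"].mean()
print(intl_rates)
print(vmail_rates)
```

Applied to the notebook's data, `training_data.groupby('international plan')['churn'].mean()` should reproduce the two international plan rates computed by hand above.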

In [121]:
customerserv_ = training_data.loc[training_data['customer service calls'] > 1]

In [122]:
customerserv_['churn'].value_counts()

False    894
True     201
Name: churn, dtype: int64

In [123]:
#having 2 or more customer service calls churn rate
customerserv_yes_churn = (201 / 1095)
customerserv_yes_churn

0.18356164383561643

In [124]:
customerserv_no = training_data.loc[training_data['customer service calls'] <= 1]

In [125]:
customerserv_no['churn'].value_counts()

False    1222
True      167
Name: churn, dtype: int64

In [126]:
#having 0 or 1 customer service calls churn rate
customerserv_no_churn = (167 / 1389)
customerserv_no_churn

0.12023038156947444
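For numeric columns, the same comparison comes down to splitting on a threshold. The helper below is a hypothetical convenience function (`churn_rate_split` is not part of the original notebook), shown on made-up rows, that returns the churn rate above and at or below a cutoff.

```python
import pandas as pd

def churn_rate_split(frame, column, threshold):
    """Return (rate_above, rate_at_or_below) for a numeric threshold.

    Hypothetical helper, not part of the original notebook.
    """
    above = frame[column] > threshold
    rate_above = frame.loc[above, "churn"].mean()
    rate_at_or_below = frame.loc[~above, "churn"].mean()
    return rate_above, rate_at_or_below

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "customer service calls": [0, 1, 1, 2, 3, 4, 5, 0],
    "churn": [False, False, True, False, True, True, True, False],
})

high, low = churn_rate_split(demo, "customer service calls", 1)
print(high, low)
```

Using one mask and its negation guarantees that every row lands in exactly one of the two groups.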

In [127]:
training_data['total_minutes'].describe()

count    2484.000000
mean      591.556441
std        90.381246
min       284.300000
25%       530.675000
50%       593.150000
75%       652.100000
max       885.000000
Name: total_minutes, dtype: float64

In [128]:
#used mean total minutes, computed from the data instead of hard-coding 591.556441
totalminutes_ = training_data.loc[training_data['total_minutes'] > training_data['total_minutes'].mean()]

In [129]:
totalminutes_['churn'].value_counts()

False    1032
True      230
Name: churn, dtype: int64

In [130]:
#having total minutes above the mean churn rate
totalminutes_yes_churn = (230 / 1262)
totalminutes_yes_churn

0.18225039619651348

In [131]:
totalminutes_no = training_data.loc[training_data['total_minutes'] < training_data['total_minutes'].mean()]

In [132]:
totalminutes_no['churn'].value_counts()

False    1084
True      138
Name: churn, dtype: int64

In [133]:
#having total minutes below the mean churn rate
totalminutes_no_churn = (138 / 1222)
totalminutes_no_churn

0.11292962356792144
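To compare the segments side by side, the counts printed in the outputs above can be collected into one table. In the sketch below every count is copied from those outputs; the table layout and the `lift` column (segment rate divided by the overall training rate) are additions made for illustration.

```python
import pandas as pd

# (churned, total) counts copied from the value_counts outputs above.
counts = {
    "all training customers": (368, 2484),
    "international plan: yes": (99, 236),
    "international plan: no": (269, 2248),
    "voice mail plan: yes": (61, 687),
    "voice mail plan: no": (307, 1797),
    "customer service calls > 1": (201, 1095),
    "customer service calls <= 1": (167, 1389),
    "total minutes above mean": (230, 1262),
    "total minutes below mean": (138, 1222),
}

summary = pd.DataFrame.from_dict(
    counts, orient="index", columns=["churned", "total"]
)
summary["churn_rate"] = summary["churned"] / summary["total"]
# Ratio of each segment's churn rate to the overall training rate.
baseline_rate = summary.loc["all training customers", "churn_rate"]
summary["lift"] = summary["churn_rate"] / baseline_rate

print(summary.sort_values("churn_rate", ascending=False).round(3))
```

Customers with an international plan stand out: their churn rate is roughly 2.8 times the overall training rate, while the other splits differ from the baseline far less.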