# Introduction

The telecommunication industry is growing and has a large impact on everyone's daily life. It is also becoming very competitive. In this project we analyze an extensive consumer data set from a telecommunication company and build a machine learning model using logistic regression. The business is concerned that many customers are leaving its land-line service for cable competitors. The question we are trying to answer is "Who are the customers that are leaving, and why?" The business also believes it is easier and less costly to keep existing customers than to acquire new ones.

# Contents

1- About the Dataset

2- Data Collection and Understanding

3- Data Wrangling and Exploration

4- Model Selecting and Set Up

5- Model Development

6- Evaluation

7- Conclusion


# About the Dataset

The data provided by the business is a historical dataset covering all customers. Each row represents one customer, and we will use this dataset to predict customer churn.

We can analyze this dataset to identify which behaviors predict customer retention and to develop retention-focused programs.

The dataset variables that require explanation are as follows:

* Churn: Customers who left within the last month

* Phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies: The services that each customer signed up for

* Customer account information: How long the customer has been a member, contract, payment method, paperless billing, monthly charges, and total charges

* Demographic information about the customers: Gender, age, and whether they have partners or dependents



# Data Collection and Understanding

In [2]:
# import libraries 
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv('ChurnData.csv')

In [4]:
df.head(5)

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,4.4,...,1.0,0.0,1.0,1.0,0.0,1.482,3.033,4.913,4.0,1.0
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,9.45,...,0.0,0.0,0.0,0.0,0.0,2.246,3.24,3.497,1.0,1.0
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,6.3,...,0.0,0.0,0.0,1.0,0.0,1.841,3.24,3.401,3.0,0.0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,6.05,...,1.0,1.0,1.0,1.0,1.0,1.8,3.807,4.331,4.0,0.0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,7.1,...,0.0,0.0,1.0,1.0,0.0,1.96,3.091,4.382,3.0,0.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 28 columns):
tenure      200 non-null float64
age         200 non-null float64
address     200 non-null float64
income      200 non-null float64
ed          200 non-null float64
employ      200 non-null float64
equip       200 non-null float64
callcard    200 non-null float64
wireless    200 non-null float64
longmon     200 non-null float64
tollmon     200 non-null float64
equipmon    200 non-null float64
cardmon     200 non-null float64
wiremon     200 non-null float64
longten     200 non-null float64
tollten     200 non-null float64
cardten     200 non-null float64
voice       200 non-null float64
pager       200 non-null float64
internet    200 non-null float64
callwait    200 non-null float64
confer      200 non-null float64
ebill       200 non-null float64
loglong     200 non-null float64
logtoll     200 non-null float64
lninc       200 non-null float64
custcat     200 non-null float64
chur

In [13]:
# check to see if there are any missing data
missing_data=df.isnull()
missing_data.head(5)

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,churn
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False


In [14]:
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print('')

tenure
False    200
Name: tenure, dtype: int64

age
False    200
Name: age, dtype: int64

address
False    200
Name: address, dtype: int64

income
False    200
Name: income, dtype: int64

ed
False    200
Name: ed, dtype: int64

employ
False    200
Name: employ, dtype: int64

equip
False    200
Name: equip, dtype: int64

callcard
False    200
Name: callcard, dtype: int64

wireless
False    200
Name: wireless, dtype: int64

churn
False    200
Name: churn, dtype: int64



There is no missing data in our dataset.
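The per-column loop above can also be condensed into a single count per column. The following is a minimal sketch on a small toy frame (the values and the name `toy` are hypothetical, not taken from `ChurnData.csv`), showing how `isnull().sum()` reports the number of missing entries in each column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (hypothetical values).
toy = pd.DataFrame({
    "tenure": [11.0, 33.0, np.nan],
    "age": [33.0, 33.0, 30.0],
    "churn": [1.0, np.nan, np.nan],
})

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero on the real data would match the conclusion drawn from the loop output.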

In [7]:
df.shape

(200, 28)

There are 200 rows and 28 columns. We will not need all 28 columns to build our model. We can look at the correlation matrix to see the relationships between these variables.

# Data Wrangling and Exploration

In [8]:
df.corr()

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
tenure,1.0,0.431802,0.456328,0.109383,-0.070503,0.445755,-0.117102,0.42653,-0.07059,0.763134,...,0.018791,-0.164921,-0.009747,0.08065,-0.099128,0.864388,0.310045,0.246353,0.134237,-0.37686
age,0.431802,1.0,0.746566,0.211275,-0.071509,0.622553,-0.071357,0.170404,-0.065527,0.373547,...,0.006803,-0.078395,0.020002,0.030625,-0.048279,0.379413,0.0936,0.313359,0.041055,-0.287697
address,0.456328,0.746566,1.0,0.132807,-0.14555,0.520926,-0.148977,0.209204,-0.146478,0.421782,...,-0.105812,-0.191058,-0.019967,-0.030494,-0.172171,0.409357,0.018386,0.212929,-0.016841,-0.260659
income,0.109383,0.211275,0.132807,1.0,0.141241,0.345161,-0.010741,-0.019969,-0.029635,0.041808,...,0.056977,0.102809,0.081133,-0.031556,-0.041392,0.065595,-0.156498,0.680313,0.030725,-0.09079
ed,-0.070503,-0.071509,-0.14555,0.141241,1.0,-0.213886,0.488041,-0.071178,0.26767,-0.072735,...,0.258698,0.552996,-0.016247,-0.132215,0.427315,-0.054581,-0.007227,0.206718,0.013127,0.216112
employ,0.445755,0.622553,0.520926,0.345161,-0.213886,1.0,-0.17447,0.266612,-0.101187,0.363386,...,0.038381,-0.250044,0.119708,0.173247,-0.151965,0.377186,0.068718,0.540052,0.131292,-0.337969
equip,-0.117102,-0.071357,-0.148977,-0.010741,0.488041,-0.17447,1.0,-0.087051,0.386735,-0.097618,...,0.308633,0.623509,-0.034021,-0.103499,0.603133,-0.113065,-0.027882,0.083494,0.174955,0.275284
callcard,0.42653,0.170404,0.209204,-0.019969,-0.071178,0.266612,-0.087051,1.0,0.220118,0.322514,...,0.251069,-0.067146,0.370878,0.311056,-0.045058,0.35103,0.08006,0.15692,0.407553,-0.311451
wireless,-0.07059,-0.065527,-0.146478,-0.029635,0.26767,-0.101187,0.386735,0.220118,1.0,-0.073043,...,0.667535,0.343631,0.38967,0.382925,0.321433,-0.042637,0.178317,0.033558,0.598156,0.174356
longmon,0.763134,0.373547,0.421782,0.041808,-0.072735,0.363386,-0.097618,0.322514,-0.073043,1.0,...,-0.001372,-0.223929,0.032913,0.060614,-0.124605,0.901631,0.247302,0.12255,0.072519,-0.292026
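To make the correlation table easier to use for feature selection, one option is to rank the variables by the absolute value of their correlation with `churn`. The sketch below copies the `churn` column from the rows shown in the table above; the variable names `corr_with_churn` and `ranked` are illustrative.

```python
import pandas as pd

# Correlation of each variable with churn, copied from the last column
# of the correlation table shown above (only the rows that are displayed).
corr_with_churn = pd.Series({
    "tenure": -0.376860,
    "age": -0.287697,
    "address": -0.260659,
    "income": -0.090790,
    "ed": 0.216112,
    "employ": -0.337969,
    "equip": 0.275284,
    "callcard": -0.311451,
    "wireless": 0.174356,
    "longmon": -0.292026,
})

# Rank by strength of the linear relationship, ignoring its sign.
ranked = corr_with_churn.abs().sort_values(ascending=False)
print(ranked)
```

On the full frame, the same ranking could presumably be obtained with `df.corr()['churn'].drop('churn').abs().sort_values(ascending=False)`. Correlation only captures linear relationships, so this is a rough guide rather than a definitive selection rule.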


In [9]:
# select the features that we can use for our model development
# .copy() gives an independent frame, so the assignment below does not raise SettingWithCopyWarning
df = df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless', 'churn']].copy()

# churn is stored as float; convert it to integer for our logistic regression
df['churn'] = df['churn'].astype('int')
df.head(5)


Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,1
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,1
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,0


In [10]:
# let's see the number of rows and columns now
df.shape

(200, 10)

# Model Selecting and Set Up

We chose logistic regression as our machine learning method because linear regression is meant for continuous targets, such as predicting future house prices. Logistic regression is better suited to estimating the class of a data point: for each customer, it estimates the probability of each class, and we take the most probable one.
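As a rough illustration of how a probability becomes a class, the sketch below passes a linear score through the sigmoid function and applies a 0.5 threshold. The weights, bias, and feature values are hypothetical and are not taken from the fitted model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted class) for one data point."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(score)
    return p, int(p >= threshold)


# Hypothetical weights for two standardized features.
weights, bias = [-0.8, 0.5], -0.2

p_low, label_low = predict_class(weights, bias, [1.0, -1.0])
p_high, label_high = predict_class(weights, bias, [-1.5, 1.0])

print(f"p={p_low:.3f} -> class {label_low}")
print(f"p={p_high:.3f} -> class {label_high}")
```

A linear regression would return the raw score itself, which is unbounded; the sigmoid keeps the output between 0 and 1 so that it can be read as a class probability.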

In [16]:
# preprocessing: define X and y (features and target)
X=np.asarray(df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]

array([[ 11.,  33.,   7., 136.,   5.,   5.,   0.],
       [ 33.,  33.,  12.,  33.,   2.,   0.,   0.],
       [ 23.,  30.,   9.,  30.,   1.,   2.,   0.],
       [ 38.,  35.,   5.,  76.,   2.,  10.,   1.],
       [  7.,  35.,  14.,  80.,   2.,  15.,   0.]])

In [17]:
y=np.array(df['churn'])
y[0:5]

array([1, 1, 0, 0, 0])

In [18]:
# normalize and preprocess the dataset
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.13518441, -0.62595491, -0.4588971 ,  0.4751423 ,  1.6961288 ,
        -0.58477841, -0.85972695],
       [-0.11604313, -0.62595491,  0.03454064, -0.32886061, -0.6433592 ,
        -1.14437497, -0.85972695],
       [-0.57928917, -0.85594447, -0.261522  , -0.35227817, -1.42318853,
        -0.92053635, -0.85972695],
       [ 0.11557989, -0.47262854, -0.65627219,  0.00679109, -0.6433592 ,
        -0.02518185,  1.16316   ],
       [-1.32048283, -0.47262854,  0.23191574,  0.03801451, -0.6433592 ,
         0.53441472, -0.85972695]])

In [19]:
# split the data set for train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (160, 7) (160,)
Test set: (40, 7) (40,)
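The cells above standardize all 200 rows before splitting. A common alternative, sketched here under the assumption that one wants the test rows kept out of the scaling statistics, is to split first and let a pipeline fit the scaler on the training rows only. The data below is synthetic, and the names `X_demo`, `y_demo`, and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and target (hypothetical values).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=30.0, scale=10.0, size=(200, 7))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=200) < 30.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when it transforms the test rows.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver="liblinear"),
)
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
test_accuracy = model.score(X_te, y_te)
print("Train rows used for scaling:", X_tr.shape[0])
print("Test accuracy:", test_accuracy)
```

With only 200 rows the difference is likely small, but fitting the scaler inside the pipeline keeps the test set untouched until evaluation.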


# Model Development

In [20]:
# import required libraries and create the model with Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
# predict using the test set
yhat = LR.predict(X_test)
yhat

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

In [22]:
# predict probability estimates for all classes
# columns follow LR.classes_: the first column is the probability of class 0 and the second is the probability of class 1
yhat_prob = LR.predict_proba(X_test)
yhat_prob

array([[0.54132919, 0.45867081],
       [0.60593357, 0.39406643],
       [0.56277713, 0.43722287],
       [0.63432489, 0.36567511],
       [0.56431839, 0.43568161],
       [0.55386646, 0.44613354],
       [0.52237207, 0.47762793],
       [0.60514349, 0.39485651],
       [0.41069572, 0.58930428],
       [0.6333873 , 0.3666127 ],
       [0.58068791, 0.41931209],
       [0.62768628, 0.37231372],
       [0.47559883, 0.52440117],
       [0.4267593 , 0.5732407 ],
       [0.66172417, 0.33827583],
       [0.55092315, 0.44907685],
       [0.51749946, 0.48250054],
       [0.485743  , 0.514257  ],
       [0.49011451, 0.50988549],
       [0.52423349, 0.47576651],
       [0.61619519, 0.38380481],
       [0.52696302, 0.47303698],
       [0.63957168, 0.36042832],
       [0.52205164, 0.47794836],
       [0.50572852, 0.49427148],
       [0.70706202, 0.29293798],
       [0.55266286, 0.44733714],
       [0.52271594, 0.47728406],
       [0.51638863, 0.48361137],
       [0.71331391, 0.28668609],
       [0.
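The column order of `predict_proba` can be checked against the fitted model's `classes_` attribute. The sketch below uses synthetic data (the names `X_toy`, `y_toy`, and `clf` are illustrative) and shows that thresholding the class 1 column at 0.5 gives the same labels as `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (hypothetical values).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# classes_ gives the label that each probability column refers to.
print("Column order:", clf.classes_)

# Look up the column for class 1 instead of assuming its position.
churn_col = list(clf.classes_).index(1)
p_churn = proba[:, churn_col]
labels_from_proba = (p_churn > 0.5).astype(int)
print("Matches predict():", bool((labels_from_proba == clf.predict(X_toy)).all()))
```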

# Evaluation

We are going to use the Jaccard index to measure the accuracy of our model.

In [23]:
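# note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23;
# for binary labels it returned the same value as sklearn.metrics.accuracy_score
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)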



0.75
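A single score hides which kind of mistakes the model makes. As a minimal sketch, the block below compares plain accuracy, the positive-class Jaccard index (`jaccard_score`, available in newer scikit-learn releases), and the confusion matrix on a small set of hypothetical labels; `y_true` and `y_pred` are made up for illustration and are not the notebook's `y_test` and `yhat`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, jaccard_score

# Hypothetical true labels and predictions (1 = churned, 0 = stayed).
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

# Share of all predictions that are correct.
acc = accuracy_score(y_true, y_pred)

# Jaccard index of the positive class: TP / (TP + FP + FN).
jac = jaccard_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print("Accuracy:", acc)
print("Jaccard (class 1):", jac)
print(cm)
```

In this toy case accuracy is 0.75 while the class 1 Jaccard index is 0.5, which illustrates that the two measures can diverge when the classes are imbalanced.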

# Conclusion

Based on our analysis, tenure, age, education, address, income, employment, and equipment are the features that can influence whether a customer stays with the service or moves to another telecommunication provider. We built our model on these variables: we split the data set 80/20 into train and test sets, trained the model on the train set, and used the test set for prediction. Evaluated with the Jaccard index, our model scores 0.75 (75%) on the test set.