## Customer churn with Logistic Regression

This notebook builds a model for a telecommunications company that predicts which customers are likely to leave for a competitor, so that the company can act to retain them.

In [49]:
#Let's first import required libraries
import pandas as pd
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

## Dataset Overview:

This analysis focuses on a telecommunications dataset aimed at predicting customer churn. Each row in the dataset represents a single customer, providing a mix of service usage, account details, and demographic information. Understanding factors that influence customer retention is crucial, as retaining existing customers is often more cost-effective than acquiring new ones. By analyzing this data, we aim to identify key behaviors and characteristics that predict customer loyalty and inform targeted retention strategies.

Key Features:

- Churn: Indicates customers who have left the company within the last month.
- Services: Details on subscribed services (e.g., phone, internet, online security, streaming TV).
- Account Information: Includes tenure, contract type, payment method, billing arrangements, and charges.
- Demographics: Gender, age range, and family status (partners and dependents).

The goal is to leverage this information to predict which customers are likely to continue their services and to design effective customer retention programs based on these insights.

In [5]:
churn_df = pd.read_csv("logist_reg_churn_data.csv")
churn_df.head()

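| Unnamed: 0 | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.4 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.24 | 3.497 | 1.0 | 1.0 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.24 | 3.401 | 3.0 | 0.0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.8 | 3.807 | 4.331 | 4.0 | 0.0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.1 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.96 | 3.091 | 4.382 | 3.0 | 0.0 |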


## Data pre-processing and selection

In [9]:
X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]

array([[ 11.,  33.,   7., 136.,   5.,   5.,   0.],
       [ 33.,  33.,  12.,  33.,   2.,   0.,   0.],
       [ 23.,  30.,   9.,  30.,   1.,   2.,   0.],
       [ 38.,  35.,   5.,  76.,   2.,  10.,   1.],
       [  7.,  35.,  14.,  80.,   2.,  15.,   0.]])

In [10]:
y = np.asarray(churn_df['churn'])
y[0:5]

array([1., 1., 0., 0., 0.])

In [11]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.13518441, -0.62595491, -0.4588971 ,  0.4751423 ,  1.6961288 ,
        -0.58477841, -0.85972695],
       [-0.11604313, -0.62595491,  0.03454064, -0.32886061, -0.6433592 ,
        -1.14437497, -0.85972695],
       [-0.57928917, -0.85594447, -0.261522  , -0.35227817, -1.42318853,
        -0.92053635, -0.85972695],
       [ 0.11557989, -0.47262854, -0.65627219,  0.00679109, -0.6433592 ,
        -0.02518185,  1.16316   ],
       [-1.32048283, -0.47262854,  0.23191574,  0.03801451, -0.6433592 ,
         0.53441472, -0.85972695]])

## Train/Test dataset

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (160, 7) (160,)
Test set: (40, 7) (40,)
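
One thing to be aware of: in this notebook the features are standardized on all 200 rows before the split, so the scaling statistics also reflect the test rows. A common alternative is to estimate the mean and standard deviation from the training rows only and reuse them for the test rows. The sketch below shows that idea with NumPy on synthetic data; the array `X_raw`, the index-based split, and all variable names are assumptions for illustration and are not taken from the churn dataset.

```python
import numpy as np

# Synthetic stand-in data (NOT the churn dataset): 200 rows, 7 features.
rng = np.random.default_rng(4)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 7))

# Hypothetical 80/20 split by shuffled row index.
idx = rng.permutation(len(X_raw))
train_idx, test_idx = idx[:160], idx[160:]
X_train_raw, X_test_raw = X_raw[train_idx], X_raw[test_idx]

# Estimate mean and standard deviation from the training rows only.
mu = X_train_raw.mean(axis=0)
sigma = X_train_raw.std(axis=0)

# Apply the same training statistics to both splits.
X_train_std = (X_train_raw - mu) / sigma
X_test_std = (X_test_raw - mu) / sigma

print(X_train_std.mean(axis=0).round(3))  # approximately 0 for every feature
print(X_train_std.std(axis=0).round(3))   # approximately 1 for every feature
```

The standardized test split will generally not have exactly zero mean and unit variance, which is expected, since it is scaled with statistics it did not contribute to.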


In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

In [14]:
# predicting the test dataset
yhat = LR.predict(X_test)
yhat

array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 1., 0., 0., 0.])

In [15]:
# predicting probability of each class label
yhat_prob = LR.predict_proba(X_test)
yhat_prob

array([[0.54132919, 0.45867081],
       [0.60593357, 0.39406643],
       [0.56277713, 0.43722287],
       [0.63432489, 0.36567511],
       [0.56431839, 0.43568161],
       [0.55386646, 0.44613354],
       [0.52237207, 0.47762793],
       [0.60514349, 0.39485651],
       [0.41069572, 0.58930428],
       [0.6333873 , 0.3666127 ],
       [0.58068791, 0.41931209],
       [0.62768628, 0.37231372],
       [0.47559883, 0.52440117],
       [0.4267593 , 0.5732407 ],
       [0.66172417, 0.33827583],
       [0.55092315, 0.44907685],
       [0.51749946, 0.48250054],
       [0.485743  , 0.514257  ],
       [0.49011451, 0.50988549],
       [0.52423349, 0.47576651],
       [0.61619519, 0.38380481],
       [0.52696302, 0.47303698],
       [0.63957168, 0.36042832],
       [0.52205164, 0.47794836],
       [0.50572852, 0.49427148],
       [0.70706202, 0.29293798],
       [0.55266286, 0.44733714],
       [0.52271594, 0.47728406],
       [0.51638863, 0.48361137],
       [0.71331391, 0.28668609],
       [0.

In [16]:
cf = confusion_matrix(y_test , yhat)
cf

array([[24,  1],
       [ 9,  6]])

## Analyzing Model Performance with Confusion Matrix
The confusion matrix provides a detailed breakdown of the model's predictions compared to the actual values, particularly useful in binary classification tasks like predicting customer churn.

Churn Predictions (Churn = 1):

- Total Customers: Out of 40 customers in the test set, 15 actually churned.
- True Positives (TP): The model correctly predicted churn for 6 customers.
- False Negatives (FN): The model incorrectly predicted no churn for 9 customers. These are the cases where the model was overly optimistic about customer retention.

Retention Predictions (Churn = 0):

- Total Customers: The remaining 25 customers did not churn.
- True Negatives (TN): The model accurately identified 24 customers as retained.
- False Positives (FP): Only 1 customer was mistakenly predicted to churn.

The confusion matrix shows that the model is much better at recognizing retained customers (24 of 25 correct) than at catching customers at risk of churning (6 of 15 correct), so the main room for improvement is reducing false negatives.
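
One possible lever for reducing false negatives is the decision cutoff. For a two-class logistic regression, `predict` effectively labels a customer as churn when the class-1 probability exceeds 0.5. The sketch below reuses the first nine class-1 probabilities from the `predict_proba` output earlier in the notebook; the helper `flag_churners` and the 0.45 cutoff are hypothetical and chosen only for illustration.

```python
# Churn probabilities (class 1) for the first nine test customers,
# copied from the predict_proba output shown earlier.
p_churn = [0.45867081, 0.39406643, 0.43722287, 0.36567511, 0.43568161,
           0.44613354, 0.47762793, 0.39485651, 0.58930428]

def flag_churners(probs, threshold=0.5):
    """Return 1 where the churn probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

default_flags = flag_churners(p_churn)        # 0.5 cutoff
lenient_flags = flag_churners(p_churn, 0.45)  # hypothetical lower cutoff

print(default_flags)  # [0, 0, 0, 0, 0, 0, 0, 0, 1]
print(lenient_flags)  # [1, 0, 0, 0, 0, 0, 1, 0, 1]
```

The default flags match the first nine entries of `yhat`. A lower cutoff flags more customers, which can catch more true churners but usually also raises the number of false positives, so the right value depends on the relative cost of the two kinds of error.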

In [50]:
print (classification_report(y_test, yhat))

              precision    recall  f1-score   support

         0.0       0.73      0.96      0.83        25
         1.0       0.86      0.40      0.55        15

    accuracy                           0.75        40
   macro avg       0.79      0.68      0.69        40
weighted avg       0.78      0.75      0.72        40
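
The per-class rows of this report can be reproduced by hand from the confusion matrix. The following is a minimal sketch in plain Python; the helper name `prf` is an assumption for illustration, and the matrix values are the ones printed earlier.

```python
# Confusion matrix from earlier: rows are actual classes, columns are predicted.
cf = [[24, 1],
      [9, 6]]
tn, fp = cf[0]
fn, tp = cf[1]

def prf(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for one class treated as the positive class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

churn = prf(tp, fp, fn)     # class 1.0
retained = prf(tn, fn, fp)  # class 0.0: the roles of the cells swap

print("retained: precision=%.2f recall=%.2f f1=%.2f" % retained)
print("churn:    precision=%.2f recall=%.2f f1=%.2f" % churn)
```

The low recall for the churn class (6 of 15) is the same weakness the false negatives in the confusion matrix point to.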



Finally, the report gives an overall accuracy of 0.75 for this classifier. The average of the two per-class F1-scores is a separate summary: weighted by support it is 0.72, and unweighted (macro) it is 0.69.
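
To see how the summary rows of the report relate to one another, the sketch below recomputes the macro-average F1, the support-weighted F1, and the accuracy from the confusion matrix counts. It uses the identity F1 = 2·TP / (2·TP + FP + FN) for each class; the variable names are chosen for illustration.

```python
# Per-class F1 and support derived from the confusion matrix [[24, 1], [9, 6]].
f1 = {0: 2 * 24 / (2 * 24 + 9 + 1),   # class 0: TP=24, FP=9, FN=1
      1: 2 * 6 / (2 * 6 + 1 + 9)}     # class 1: TP=6,  FP=1, FN=9
support = {0: 25, 1: 15}
n = sum(support.values())

macro_f1 = sum(f1.values()) / len(f1)                   # plain mean over classes
weighted_f1 = sum(f1[c] * support[c] for c in f1) / n   # mean weighted by support
accuracy = (24 + 6) / n                                 # correct predictions / total

print(round(macro_f1, 2), round(weighted_f1, 2), round(accuracy, 2))  # 0.69 0.72 0.75
```

The three numbers answer different questions, so the 0.72 in the report is an averaged F1-score and not the accuracy.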