**Telecom Customer Retention Project  
Will Byrd, May 2024  
Intrduction**

In this notebook, I will create classification models to predict wether or not a customer kept their subscription to the phone service.  

**Data**

The data used in this project is from Kaggle's Churn in Telecom's Dataset.  This data is remarkably clean with no missing values and will allow me to focus on the principles of model building.  Each record in this dataset represents a customer in Telecom and has attributes such as state, length of subscription, type of plan, usage, and wether or not the churned.  a customer who has churned has cancelled their subscription, so in this case, we will be targeting customers who have not churned or have a value of false in the churn column.  The churn column is our target column.

**Goals**

Build various models to evaluate the data.


**Explaratory Data Analysis**  

Loading in tools for data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from seaborn import load_dataset
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import RandomOverSampler

Loading in the dataset.  Here, I will call it 'customer_df'

In [2]:
customer_df = pd.read_csv ('Data/telecom.csv')

In [3]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [4]:
customer_df.drop(columns=['phone number'], inplace=True)

In [5]:
customer_df['churn'] = customer_df['churn'].astype(float)

In [6]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   international plan      3333 non-null   object 
 4   voice mail plan         3333 non-null   object 
 5   number vmail messages   3333 non-null   int64  
 6   total day minutes       3333 non-null   float64
 7   total day calls         3333 non-null   int64  
 8   total day charge        3333 non-null   float64
 9   total eve minutes       3333 non-null   float64
 10  total eve calls         3333 non-null   int64  
 11  total eve charge        3333 non-null   float64
 12  total night minutes     3333 non-null   float64
 13  total night calls       3333 non-null   int64  
 14  total night charge      3333 non-null   

How lucky!  We don't have to impute data! We have a clean dataset!  This will make our process much easier going forward.  Now to figure out what all of these columns contain.

In [7]:
customer_df.head()

Unnamed: 0,state,account length,area code,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0.0
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0.0
2,NJ,137,415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0.0
3,OH,84,408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0.0
4,OK,75,415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0.0


In [8]:
customer_df.isna().sum()

state                     0
account length            0
area code                 0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

Looks like we have columns of data for various phone accounts.  We can see the state in which the person lives, the phone number, area code, and details of their plan.

Now we need to figure out consistent themes/patterns in accounts that churned versus accounts that renewed.

First thing I'm interested in is just how many accounts churned.  In subscription services, churn refers to the rate at a customer stops using a service.  So we can assume every True value is a customer cancelling their subscription.

We can also assume that the account length is how many months the account has been active.

In [9]:
churn_counts = customer_df['churn'].value_counts()
true_count = churn_counts[1]
false_count = churn_counts[0]

print("Number of times a customer churned:", true_count)
print("Number of times a customer renewed subscription:", false_count)


Number of times a customer churned: 483
Number of times a customer renewed subscription: 2850


Lot's of happy customers!

And we can see that 483 + 2850 = 3333, so we aren't missing any values.

This will likely cause problems are our data is imbalanced.  We will address this later.

For the sake of our model, it will be easier to convert our categorical variables into a numerical format.  To make it simple, 1 will be yes and 0 will be no for columns:

- International Plan
- Voicemail Plan

And for the Churn column, we will leave it alone, as it is our target column.


In [10]:
customer_df.replace({'no': 0, 'yes':1, 'false':0, 'true':1}, inplace=True)
customer_df.head()


Unnamed: 0,state,account length,area code,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0.0
1,OH,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0.0
2,NJ,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0.0
3,OH,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0.0
4,OK,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0.0


So now we can see all of our categorical variables have been changed via one-hot encoding and we can more easily train the data.


**Feature Identification**

We have our customer information in all columns except the churn column.  Most of this makes sense intuitively, but let's go through it.  

- state- is which state the customer lives in
- international plan- if the customer had an international plan or not
- voice mail plan- if the customer had a plan that allowed voicemails
- number vmail messages- how many voicemails did this customer have
- total day minutes- shows how active the customer was during the day
- total day calls- how many calls the customer maade during the day
- total day charge- how much they were charged for their day minutes
- total eve minutes- how active the customer was in the evening
- total eve calls- how many calls were made in the evening
- total eve charge- how much they were charged for their evening minutes
- total night minutes- how active the customer was in the nighttime
- total night calls- how many calls were made in the nighttime
- total night charge- how much they were charged for their nighttime minutes
- total intl minutes- how active the customer was internationally
- total intl calls- how many calls were made internationally
- total intl charge- how much they were charged for their international minutes
- custimer service calls- how often the customer called customer service
- churn - did the customer churn (True) or did they remain a customer (False)

**Creating Training and Testing Sets**

In [11]:
customer_df.isna().sum()

state                     0
account length            0
area code                 0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

In [12]:
X = customer_df.drop('churn', axis=1)
y = customer_df['churn']

In [13]:
# Separate categorical and numerical columns
categorical_columns = ['state']
numerical_columns = [col for col in X.columns if col not in categorical_columns]

# One-hot encode the categorical columns
ohe = OneHotEncoder(drop='first', sparse=False)
X_encoded = ohe.fit_transform(X[categorical_columns])
#X_test_encoded = ohe.fit_transform(X_test[categorical_columns])

# Create DataFrames from the encoded arrays
X_encoded_df = pd.DataFrame(X_encoded, columns=ohe.get_feature_names(input_features=categorical_columns))
#X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=ohe.get_feature_names(input_features=categorical_columns))

In [14]:
X_final = pd.concat([X_encoded_df, X[numerical_columns]], axis=1)

In [15]:
X_final.isna().sum()

state_AL                  0
state_AR                  0
state_AZ                  0
state_CA                  0
state_CO                  0
                         ..
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
Length: 68, dtype: int64

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=615, stratify=y)

In [17]:
#X = customer_df.drop('churn', axis=1)
#y = customer_df['churn']

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=615, stratify=y)

In [18]:
# Separate categorical and numerical columns
#categorical_columns = ['state']
#numerical_columns = [col for col in X_train.columns if col not in categorical_columns]

# One-hot encode the categorical columns
#ohe = OneHotEncoder(drop='first', sparse=False)
#X_train_encoded = ohe.fit_transform(X_train[categorical_columns])
#X_test_encoded = ohe.fit_transform(X_test[categorical_columns])

# Create DataFrames from the encoded arrays
#X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=ohe.get_feature_names(input_features=categorical_columns))
#X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=ohe.get_feature_names(input_features=categorical_columns))



# Now you have X_train_final, X_test_final, y_train, and y_test ready for model training


In [19]:
# Reset the indices of X_train_encoded_df and X_train[numerical_columns]
#X_train_encoded_df = X_train_encoded_df.reset_index()
#X_train_numerical = X_train[numerical_columns].reset_index()

# Check if indices match between encoded and numerical columns
#print("Do indices match between X_train_encoded_df and X_train[numerical_columns]?")
#print(X_train_encoded_df.index.equals(X_train_numerical.index))


# Concatenate the one-hot encoded columns with the original numerical columns
#X_train_final = pd.concat([X_train_numerical, X_train_encoded_df], axis=1)



In [20]:
# Reset the indices of X_train_encoded_df and X_train[numerical_columns]
#X_test_encoded_df = X_test_encoded_df.reset_index()
#X_test_numerical = X_test[numerical_columns].reset_index()

# Check if indices match between encoded and numerical columns
#print("Do indices match between X_train_encoded_df and X_train[numerical_columns]?")
#print(X_test_encoded_df.index.equals(X_test_numerical.index))

# Reindex X_train[numerical_columns] to align with X_train_encoded_df
#X_train[numerical_columns] = X_train[numerical_columns].reindex(X_train_encoded_df.index)

# Concatenate the one-hot encoded columns with the original numerical columns
#X_test_final = pd.concat([X_test_numerical, X_test_encoded_df], axis=1)

# Check again for missing indices
#missing_indices = X_train_encoded_df.index.difference(X_train_final.index)
#print("Missing indices in X_train_final:", missing_indices)

We create the test and train sets before one-hot encoding and feature scaling to reduce data leakage.

**Feature Scaling**

Since we have wide ranges of values for our various features, lets scale them. This will help our ML algorithm will be more accurate and will also save time in training.

In [21]:
standard = StandardScaler()
X_train_final = standard.fit_transform(X_train)

In [22]:
X_test_final = standard.transform(X_test)

In [23]:
X_train_final

array([[-0.15808349, -0.13396186, -0.14378391, ..., -0.2048563 ,
         0.14002274, -1.18300727],
       [-0.15808349, -0.13396186, -0.14378391, ..., -0.99982993,
        -0.50154802,  0.33914398],
       [-0.15808349, -0.13396186, -0.14378391, ..., -0.99982993,
         1.11574494, -0.42193165],
       ...,
       [-0.15808349, -0.13396186, -0.14378391, ..., -0.99982993,
         1.32960186,  1.86129522],
       [-0.15808349, -0.13396186, -0.14378391, ..., -0.2048563 ,
        -1.04955639,  1.1002196 ],
       [-0.15808349, -0.13396186, -0.14378391, ..., -0.60234311,
         0.83505773, -0.42193165]])

In [24]:
my_df1 = pd.DataFrame(X_train_final)
my_df1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,58,59,60,61,62,63,64,65,66,67
0,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-1.649231,0.688623,-1.649875,-0.975824,-0.469630,-0.975427,0.143422,-0.204856,0.140023,-1.183007
1,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-1.289372,0.588022,-1.288632,-2.416241,-0.264597,-2.415613,-0.506180,-0.999830,-0.501548,0.339144
2,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,0.034674,-0.116185,0.034388,-0.205231,0.094211,-0.207035,1.117824,-0.999830,1.115745,-0.421932
3,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,1.828101,-0.367688,1.829095,0.755047,1.016859,0.754552,0.287778,-0.999830,0.287049,1.861295
4,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.295849,0.235918,-0.296942,1.944528,2.195799,1.944462,-0.073112,0.192631,-0.073834,0.339144
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2661,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,0.376931,0.638322,0.377223,-0.468023,-0.264597,-0.466093,-0.145290,0.590117,-0.140664,0.339144
2662,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.673309,0.487421,-0.674291,-0.159785,1.990766,-0.158736,1.875692,-0.602343,1.877610,-0.421932
2663,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,0.146152,-0.971294,0.147133,-2.256195,-1.187246,-2.257544,1.334358,-0.999830,1.329602,1.861295
2664,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,6.376225,-0.124976,-0.135405,-0.146478,-0.126515,...,-1.232655,-0.870693,-1.233411,0.442858,-2.007378,0.442805,-1.047514,-0.204856,-1.049556,1.100220


In [25]:
my_df2 = pd.DataFrame(X_test_final)
my_df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,58,59,60,61,62,63,64,65,66,67
0,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,0.580330,-0.015584,0.579703,0.525845,0.453018,0.526230,0.287778,0.192631,0.287049,-1.183007
1,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,7.904188,...,1.118162,-0.417989,1.118115,-0.181520,-0.520888,-0.180690,-0.000934,-0.204856,-0.007004,1.100220
2,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,7.385233,-0.146478,-0.126515,...,-0.812168,-0.417989,-0.812345,1.241114,-0.315855,1.241932,-1.516671,0.192631,-1.517368,-0.421932
3,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.645929,0.588022,-0.646680,-0.252652,0.094211,-0.250943,-0.506180,-0.602343,-0.501548,1.100220
4,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.006397,0.034716,-0.007028,0.020020,1.273150,0.021287,-0.000934,2.577551,-0.007004,0.339144
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-1.989532,0.889825,-1.990409,0.197849,1.324409,0.196919,-0.903158,-0.999830,-0.902530,0.339144
663,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.829770,-0.920994,-0.830752,1.077116,-0.930954,1.075082,-0.325735,0.192631,-0.327789,-1.183007
664,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,0.779816,-1.927004,0.779882,1.051429,1.170634,1.053127,0.107333,-0.204856,0.113291,-1.183007
665,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,8.001524,-0.135405,-0.146478,-0.126515,...,-0.538362,1.040726,-0.538537,0.037803,0.401760,0.038850,0.504311,0.192631,0.500906,-0.421932


In [26]:
y_train.value_counts()

0.0    2280
1.0     386
Name: churn, dtype: int64

**Class Imbalance Issue**

As mentioned previosly, our dependent variable is imbalanced.  We will us Oversampling to correct this issue to improve all models we will subsequently build.

In [27]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=4)


In [28]:
X_train_final_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
y_train_resampled.value_counts()

1.0    2280
0.0    2280
Name: churn, dtype: int64

Success!! 

We have resampled our training sets so that they are no longer imbalanced.  Now we can start building our models.

**Logistic Regression**

Let's build now use the 'X_Train_numerical' and X_test_numerical' splits and build a baseline model for our logistic regression.

Training model on Numerical columns and then running test in bottom portion of cell.

In [29]:
# Instantiate the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the model on the training data
logistic_regression_model.fit(X_train_final, y_train)

# Make predictions on the test data
y_pred = logistic_regression_model.predict(X_test_final)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.856071964017991
Classification Report:
              precision    recall  f1-score   support

         0.0       0.89      0.95      0.92       570
         1.0       0.51      0.30      0.38        97

    accuracy                           0.86       667
   macro avg       0.70      0.62      0.65       667
weighted avg       0.83      0.86      0.84       667



Oversampling hurts our model here below.

In [30]:
# Instantiate the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the model on the training data
logistic_regression_model.fit(X_train_final_resampled, y_train_resampled)

# Make predictions on the test data
y_pred = logistic_regression_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#Print the evaluation metrics
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_rep)


Accuracy: 0.6371814092953523
Classification Report:
              precision    recall  f1-score   support

         0.0       0.94      0.62      0.74       570
         1.0       0.25      0.75      0.38        97

    accuracy                           0.64       667
   macro avg       0.59      0.69      0.56       667
weighted avg       0.84      0.64      0.69       667



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [31]:
customer_df['churn'].value_counts()

0.0    2850
1.0     483
Name: churn, dtype: int64

As we can see, the model is very innacurate.  We will need to balance this to improve our model.

In [33]:
# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

# Instantiate the logistic regression model with class weights
logistic_regression_model = LogisticRegression(class_weight={0: class_weights[0], 1: class_weights[1]})

# Fit the model on the training data
logistic_regression_model.fit(X_train, y_train)

# Evaluate the model
accuracy = logistic_regression_model.score(X_test, y_test)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.6626686656671664
Classification Report:
              precision    recall  f1-score   support

         0.0       0.94      0.62      0.74       570
         1.0       0.25      0.75      0.38        97

    accuracy                           0.64       667
   macro avg       0.59      0.69      0.56       667
weighted avg       0.84      0.64      0.69       667



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:

# Define class weights
class_weights = {0: 1, 1: 10}  # Adjust the weight for class 1 as needed

# Instantiate the logistic regression model with class weights
logistic_regression_model = LogisticRegression(class_weight=class_weights)

# Fit the model on the training data
logistic_regression_model.fit(X_train_final, y_train)

# Predict on the test data
y_pred = logistic_regression_model.predict(X_test_final)

# Evaluate the model
print('accuracy', accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
