**Telecom Customer Retention Project  
Will Byrd, May 2024  
Introduction**

In this notebook, I will create classification models to predict whether or not a customer kept their subscription to the phone service.  

**Data**

The data used in this project is from Kaggle's Churn in Telecom's dataset.  The data is remarkably clean, with no missing values, which will allow me to focus on the principles of model building.  Each record in this dataset represents a telecom customer and has attributes such as state, length of subscription, type of plan, usage, and whether or not they churned.  A customer who has churned has cancelled their subscription, so in this case we will be targeting customers who have not churned, that is, those with a value of False in the churn column.  The churn column is our target column.

**Goals**

Build and compare several classification models that predict churn from the customer attributes.


**Exploratory Data Analysis**  

Loading in the libraries used for data handling, plotting, and modeling.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from seaborn import load_dataset
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

Loading in the dataset.  Here, I will call it 'customer_df'

In [2]:
customer_df = pd.read_csv('Data/telecom.csv')

In [3]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [4]:
customer_df.drop(columns=['phone number'], inplace=True)

In [5]:
customer_df['churn'] = customer_df['churn'].astype(float)

In [6]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   international plan      3333 non-null   object 
 4   voice mail plan         3333 non-null   object 
 5   number vmail messages   3333 non-null   int64  
 6   total day minutes       3333 non-null   float64
 7   total day calls         3333 non-null   int64  
 8   total day charge        3333 non-null   float64
 9   total eve minutes       3333 non-null   float64
 10  total eve calls         3333 non-null   int64  
 11  total eve charge        3333 non-null   float64
 12  total night minutes     3333 non-null   float64
 13  total night calls       3333 non-null   int64  
 14  total night charge      3333 non-null   

How lucky!  We don't have to impute data! We have a clean dataset!  This will make our process much easier going forward.  Now to figure out what all of these columns contain.

In [7]:
customer_df.head()

Unnamed: 0,state,account length,area code,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0.0
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0.0
2,NJ,137,415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0.0
3,OH,84,408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0.0
4,OK,75,415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0.0


In [8]:
customer_df.isna().sum()

state                     0
account length            0
area code                 0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

Looks like we have columns of data for various phone accounts.  We can see the state in which the person lives, the area code, and details of their plan and usage.  The phone number column was dropped above.

Now we need to figure out consistent themes/patterns in accounts that churned versus accounts that renewed.

The first thing I'm interested in is how many accounts churned.  In subscription services, churn refers to the rate at which customers stop using a service.  Since the churn column was converted to a float above, we can assume every 1.0 (originally True) is a customer cancelling their subscription.

We can also assume that the account length is how many months the account has been active.

In [9]:
churn_counts = customer_df['churn'].value_counts()
true_count = churn_counts[1]
false_count = churn_counts[0]

print("Number of times a customer churned:", true_count)
print("Number of times a customer renewed subscription:", false_count)


Number of times a customer churned: 483
Number of times a customer renewed subscription: 2850


Lots of happy customers!

And we can see that 483 + 2850 = 3333, so we aren't missing any values.
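The same counts can be expressed as proportions, which makes the class imbalance easier to see. A minimal sketch, assuming a stand-in `churn` Series rebuilt from the two counts printed above (in the notebook this would be `customer_df['churn']`):

```python
import pandas as pd

# Rebuild a stand-in churn column from the counts printed above
# (483 churned, 2850 retained); in the notebook this would be customer_df['churn'].
churn = pd.Series([1.0] * 483 + [0.0] * 2850, name="churn")

# normalize=True returns proportions instead of raw counts
churn_share = churn.value_counts(normalize=True)
churn_rate = churn_share[1.0]

print(f"Churn rate: {churn_rate:.1%}")
print(f"Retention rate: {churn_share[0.0]:.1%}")
```

About 14.5% of customers churned, so the classes are imbalanced and plain accuracy should be read with that in mind.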

For the sake of our model, it will be easier to convert our categorical variables into a numerical format.  To make it simple, 1 will be yes and 0 will be no for columns:

- International Plan
- Voicemail Plan

And for the Churn column, we will leave it alone, as it is our target column.


In [10]:
customer_df.replace({'no': 0, 'yes':1, 'false':0, 'true':1}, inplace=True)
customer_df.head()


Unnamed: 0,state,account length,area code,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0.0
1,OH,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0.0
2,NJ,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0.0
3,OH,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0.0
4,OK,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0.0


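So now we can see that the yes/no columns have been mapped to 1 and 0, which makes them easier to train on.  Note that this is a simple binary mapping rather than one-hot encoding; the state column is still text and is one-hot encoded later.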
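A more targeted way to do the yes/no conversion is to map only the two plan columns, which avoids touching unrelated values elsewhere in the frame. A minimal sketch on a tiny hand-made frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Tiny illustrative frame with the same column names as the dataset
demo = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
})

# Map yes/no to 1/0 only in the binary plan columns
binary_columns = ["international plan", "voice mail plan"]
yes_no = {"no": 0, "yes": 1}
for col in binary_columns:
    demo[col] = demo[col].map(yes_no)

print(demo)
```

The `state` column is left as text here, since it has many categories and needs a different treatment.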


**Feature Identification**

We have our customer information in all columns except the churn column.  Most of this makes sense intuitively, but let's go through it.  

- state- is which state the customer lives in
- international plan- if the customer had an international plan or not
- voice mail plan- if the customer had a plan that allowed voicemails
- number vmail messages- how many voicemails did this customer have
- total day minutes- shows how active the customer was during the day
- total day calls- how many calls the customer made during the day
- total day charge- how much they were charged for their day minutes
- total eve minutes- how active the customer was in the evening
- total eve calls- how many calls were made in the evening
- total eve charge- how much they were charged for their evening minutes
- total night minutes- how active the customer was in the nighttime
- total night calls- how many calls were made in the nighttime
- total night charge- how much they were charged for their nighttime minutes
- total intl minutes- how active the customer was internationally
- total intl calls- how many calls were made internationally
- total intl charge- how much they were charged for their international minutes
- customer service calls- how often the customer called customer service
- churn- whether the customer churned (1.0, originally True) or remained a customer (0.0, originally False)
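The charge columns appear to be a fixed per-minute rate applied to the matching minutes columns. A quick check on the five rows shown by `head()` above (values copied from that output; whether the ratio holds for every row is an assumption that would need checking on the full dataframe):

```python
import pandas as pd

# First five rows of the day columns, copied from the head() output above
sample = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day charge": [45.07, 27.47, 41.38, 50.90, 28.34],
})

# Charge per minute for each row
rate = sample["total day charge"] / sample["total day minutes"]
print(rate.round(3).tolist())
```

Each ratio rounds to 0.17. If that holds across the dataset, each charge column carries essentially the same information as its minutes column, which is worth keeping in mind when interpreting models later.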

**Creating Training and Testing Sets**

In [11]:
X = customer_df.drop('churn', axis=1)
y = customer_df['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=615, stratify=y)

In [31]:
# Separate categorical and numerical columns
categorical_columns = ['state']
numerical_columns = [col for col in X_train.columns if col not in categorical_columns]

# One-hot encode the categorical columns: fit on the training set only,
# then reuse the fitted encoder on the test set
# (sparse_output and get_feature_names_out are the current scikit-learn
# names for the older sparse and get_feature_names)
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_encoded = ohe.fit_transform(X_train[categorical_columns])
X_test_encoded = ohe.transform(X_test[categorical_columns])

# Create DataFrames from the encoded arrays
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=ohe.get_feature_names_out(categorical_columns))
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=ohe.get_feature_names_out(categorical_columns))

# The encoded frames get a fresh 0..n-1 index here; the numerical columns are joined in the next cells


In [32]:
# Reset the index of X_train_encoded_df; the second call acts on a temporary copy,
# so X_train itself keeps its original (shuffled) index
X_train_encoded_df.reset_index(drop=True, inplace=True)
X_train[numerical_columns].reset_index(drop=True, inplace=True)

# Check if indices match between encoded and numerical columns
print("Do indices match between X_train_encoded_df and X_train[numerical_columns]?")
print(X_train_encoded_df.index.equals(X_train[numerical_columns].index))

# Reindex X_train[numerical_columns] to align with X_train_encoded_df
X_train[numerical_columns] = X_train[numerical_columns].reindex(X_train_encoded_df.index)

# Concatenate the one-hot encoded columns with the original numerical columns
X_train_final = pd.concat([X_train_encoded_df, X_train[numerical_columns]], axis=1)

# Check again for missing indices
missing_indices = X_train_encoded_df.index.difference(X_train_final.index)
print("Missing indices in X_train_final:", missing_indices)


Do indices match between X_train_encoded_df and X_train[numerical_columns]?
False
Missing indices in X_train_final: Int64Index([], dtype='int64')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [42]:
# Reset the index of X_test_encoded_df; the second call acts on a temporary copy,
# so X_test itself keeps its original (shuffled) index
X_test_encoded_df.reset_index(drop=True, inplace=True)
X_test[numerical_columns].reset_index(drop=True, inplace=True)

# Check if indices match between encoded and numerical columns
print("Do indices match between X_test_encoded_df and X_test[numerical_columns]?")
print(X_test_encoded_df.index.equals(X_test[numerical_columns].index))

# Reindex X_test[numerical_columns]; note that this uses X_train_encoded_df.index
# (the training frame) rather than X_test_encoded_df.index
X_test[numerical_columns] = X_test[numerical_columns].reindex(X_train_encoded_df.index)

# Concatenate the one-hot encoded columns with the original numerical columns
X_test_final = pd.concat([X_test_encoded_df, X_test[numerical_columns]], axis=1)

# Check again for missing indices
missing_indices = X_test_encoded_df.index.difference(X_test_final.index)
print("Missing indices in X_test_final:", missing_indices)

Do indices match between X_test_encoded_df and X_test[numerical_columns]?
False
Missing indices in X_test_final: Int64Index([], dtype='int64')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [43]:
print(X_train_encoded_df.shape)


(2666, 50)


In [44]:
print(X_train_final.shape)

(3194, 68)
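The jump from 2666 rows to 3194 is what `pd.concat(..., axis=1)` does when the two frames carry different index labels: it aligns rows by label and fills the gaps with NaN. A small sketch with made-up values (the frames below are illustrative stand-ins, not the notebook's data):

```python
import pandas as pd

# Stand-in for the numerical columns after a shuffled train/test split:
# the original row labels (7, 2, 9) are kept
numeric = pd.DataFrame({"account length": [128, 107, 137]}, index=[7, 2, 9])

# Stand-in for an encoded array wrapped in a DataFrame without an index,
# so it gets a fresh 0..n-1 index
encoded = pd.DataFrame({"state_OH": [0.0, 1.0, 0.0]})

# Labels differ, so concat takes the union of labels and fills with NaN
misaligned = pd.concat([encoded, numeric], axis=1)
print(misaligned.shape)  # more rows than either input

# Reusing the original index keeps the rows lined up
encoded_fixed = pd.DataFrame(encoded.to_numpy(), columns=encoded.columns, index=numeric.index)
aligned = pd.concat([encoded_fixed, numeric], axis=1)
print(aligned.shape, int(aligned.isna().sum().sum()))
```

In the output above, 3194 = 2666 + 528, which is consistent with 528 training rows carrying labels that do not appear in the fresh 0..2665 index of the encoded frame.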


In [45]:
# Concatenate the one-hot encoded columns with the original numerical columns
X_train_final = pd.concat([X_train_encoded_df, X_train[numerical_columns]], axis=1)
X_test_final = pd.concat([X_test_encoded_df, X_test[numerical_columns]], axis=1)

In [46]:
X_train_final.isna().sum()

state_AL                   528
state_AR                   528
state_AZ                   528
state_CA                   528
state_CO                   528
                          ... 
total night charge        1056
total intl minutes        1056
total intl calls          1056
total intl charge         1056
customer service calls    1056
Length: 68, dtype: int64

In [47]:
X_test_final.isna().sum()

state_AL                  535
state_AR                  535
state_AZ                  535
state_CA                  535
state_CO                  535
                         ... 
total night charge        674
total intl minutes        674
total intl calls          674
total intl charge         674
customer service calls    674
Length: 68, dtype: int64

We create the test and train sets before one-hot encoding and feature scaling to reduce data leakage.
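One way to keep the split, the encoding, and the scaling leak-free is to wrap the preprocessing in a `ColumnTransformer` inside a `Pipeline`, both of which are already imported at the top of the notebook. A hedged sketch on synthetic data (the small frame, the `handle_unknown='ignore'` choice, and the decision tree settings are assumptions for illustration, not the notebook's final model):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for customer_df with one categorical and two numeric features
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "state": rng.choice(["KS", "OH", "NJ", "OK"], size=n),
    "total day minutes": rng.normal(180, 50, size=n),
    "customer service calls": rng.integers(0, 6, size=n),
})
target = (demo["customer service calls"] >= 4).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.2, random_state=615, stratify=target
)

categorical = ["state"]
numerical = [c for c in demo.columns if c not in categorical]

# Each transformer is fitted on the training rows only when pipe.fit is called
preprocess = ColumnTransformer([
    ("state", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numerical),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=416)),
])

pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

Because the pipeline works on the original DataFrames, there is no manual concatenation step and no index bookkeeping between the encoded and numerical columns.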

**Feature Scaling**

Since we have wide ranges of values across our features, let's scale them.  Scaling helps many ML algorithms, especially distance-based and gradient-based ones, perform better, and it can also reduce training time.

In [48]:
standard = StandardScaler()
X_train_final = standard.fit_transform(X_train_final)

In [49]:
# Note: the name below is X_Test_final (capital T), so X_test_final itself is left unscaled
X_Test_final = standard.transform(X_test_final)
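A sketch of the same scaling step that keeps the column names and row labels, assuming the goal is to scale both sets using statistics learned from the training set only (the two small frames are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up train/test frames standing in for the numerical features
train = pd.DataFrame({"total day minutes": [100.0, 200.0, 300.0],
                      "customer service calls": [0.0, 2.0, 4.0]}, index=[7, 2, 9])
test = pd.DataFrame({"total day minutes": [200.0],
                     "customer service calls": [2.0]}, index=[5])

scaler = StandardScaler()

# Fit on the training data only, then apply the same mean and scale to the test data
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Wrapping the result back into a DataFrame is optional, but it keeps the feature names available for later inspection and plotting.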

In [50]:
X_train_final

array([[-0.15808349, -0.13396186, -0.14378391, ..., -0.60898329,
        -0.08353808, -0.41808548],
       [-0.15808349, -0.13396186, -0.14378391, ..., -0.60898329,
         1.25995839, -0.41808548],
       [-0.15808349, -0.13396186, -0.14378391, ...,  0.19996749,
         0.70912484, -1.18866027],
       ...,
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan]])

In [51]:
my_df1 = pd.DataFrame(X_train_final)
my_df1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,58,59,60,61,62,63,64,65,66,67
0,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.063716,-0.050930,-0.063542,0.851293,-0.463776,0.850582,-0.082819,-0.608983,-0.083538,-0.418085
1,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-0.100612,0.150436,-0.100095,1.042820,0.157206,1.043641,1.259415,-0.608983,1.259958,-0.418085
2,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-1.543417,0.502828,-1.543954,-0.769770,0.208954,-0.768476,0.715266,0.199967,0.709125,-1.188660
3,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-2.694942,-0.604689,-2.695385,-0.092517,-0.567273,-0.092772,-1.316222,1.008918,-1.319555,0.352489
4,-0.158083,-0.133962,-0.143784,-0.093286,-0.141042,-0.156833,-0.124976,-0.135405,-0.146478,-0.126515,...,-1.017172,1.106928,-1.016214,-0.289967,1.088678,-0.290218,-0.046542,-0.608983,-0.043233,1.123064
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3189,,,,,,,,,,,...,,,,,,,,,,
3190,,,,,,,,,,,...,,,,,,,,,,
3191,,,,,,,,,,,...,,,,,,,,,,
3192,,,,,,,,,,,...,,,,,,,,,,


In [52]:
my_df2 = pd.DataFrame(X_test_final)
my_df2

Unnamed: 0,state_AL,state_AR,state_AZ,state_CA,state_CO,state_CT,state_DC,state_DE,state_FL,state_GA,...,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,,,,,,,,,,
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,,,,,,,,,,
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3312,,,,,,,,,,,...,,,,,,,,,,
3315,,,,,,,,,,,...,,,,,,,,,,
3318,,,,,,,,,,,...,,,,,,,,,,
3324,,,,,,,,,,,...,,,,,,,,,,


In [None]:
X_train['state'].value_counts().sum()

In [None]:
X_train, X_test, y_train, y_test =\
    train_test_split(customer_df[['state', 'account length', 'area code', 'international plan', 'voice mail plan', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls']],
                    customer_df['churn'],
                    random_state=42)

In [None]:
ohe = OneHotEncoder()
columns_to_encode = ['state']
ohe.fit(X_train[columns_to_encode])

In [None]:
encoded = ohe.transform(X_train[columns_to_encode])
encoded

In [None]:
encoded.todense()

In [None]:
ohe.get_feature_names_out()

In [None]:
# Reuse X_train's index so the encoded rows stay aligned with the original rows
new_state_df = pd.DataFrame(encoded.todense(),
                            columns=ohe.get_feature_names_out(),
                            index=X_train.index)
new_state_df.head()

In [None]:
new_state_df.info()

In [None]:
df_train_concat = pd.concat([X_train, new_state_df],
                           axis=1).drop('state', axis=1)
df_train_concat.head()

In [None]:
df_train_concat.info()

In [None]:
lr = LinearRegression()
lr.fit(df_train_concat, y_train)

In [None]:
lr.score(df_train_concat, y_train)

In [None]:
test_encoded = ohe.transform(X_test[columns_to_encode])

In [None]:
# Reuse X_test's index so the encoded rows stay aligned with the original rows
new_test_df = pd.DataFrame(test_encoded.todense(),
                           columns=ohe.get_feature_names_out(),
                           index=X_test.index)
new_test_df.head()

In [None]:
df_test_concat = pd.concat([X_test, new_test_df],
                          axis=1).drop('state', axis=1)
df_test_concat.head()

In [None]:
lr.score(df_test_concat, y_test)

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Calculate the mean of the target variable
mean_target = np.mean(y_train)

# Create predictions based on the mean value
baseline_predictions = np.full_like(y_test, fill_value=mean_target)

# Calculate the RMSE (Root Mean Squared Error) to evaluate the baseline model
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_predictions))
print("Baseline RMSE:", baseline_rmse)
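Because churn is a binary label, a classification-style baseline is also useful alongside the RMSE above: always predict the most frequent class and see how accurate that is. A minimal sketch using scikit-learn's `DummyClassifier`, with a stand-in label vector rebuilt from the class counts reported earlier (483 churned, 2850 retained) and a placeholder feature column:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in labels built from the counts reported earlier in the notebook
y_demo = np.array([1.0] * 483 + [0.0] * 2850)
# DummyClassifier ignores the features, so a column of zeros is enough here
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained later needs to beat roughly 85.5% accuracy to be doing more than predicting "no churn" for everyone.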


In [None]:
tree_baseline = DecisionTreeClassifier(criterion='entropy', random_state=416)
tree_baseline.fit(df_train_concat, y_train)

In [None]:
fig, ax = plt.subplots(figsize=(15,25))
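# Plain lists are passed for the names, which plot_tree accepts across scikit-learn versions
tree.plot_tree(tree_baseline,
               feature_names=list(df_train_concat.columns),
               class_names=list(np.unique(y).astype('str')),
               filled=True,
               ax=ax)

plt.show()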

In [None]:
X_train_continuous = X_train[['account length', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls']]

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(nrows=1, ncols=15, figsize=(25,6))
for index, col in enumerate(X_train_continuous):
    axes[index].hist(X_train_continuous[col])
    axes[index].set_title(X_train_continuous.columns[index])
plt.tight_layout();

Looks like we have approximately normal distributions for almost all of the variables except:

- number vmail messages
- total intl calls
- customer service calls

This can be explained by the fact that:

- number vmail messages- not every customer has a voice mail plan
- total intl calls- international calls will likely be fewer, as they are expensive and typically infrequent
- customer service calls- most people will not make customer service calls
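Skewness gives a quick numeric check of the visual impression from the histograms. A sketch on synthetic data (the two columns are simulated stand-ins, one roughly normal and one count-like and right-skewed; on the notebook's data the call would be `X_train_continuous.skew()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated stand-ins: a bell-shaped usage column and a right-skewed count column
demo = pd.DataFrame({
    "total day minutes": rng.normal(loc=180, scale=50, size=2000),
    "customer service calls": rng.poisson(lam=1.5, size=2000),
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail
skewness = demo.skew()
print(skewness.round(2))
```

Columns with skewness well away from zero are the ones to look at more closely before using models that assume roughly symmetric inputs.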
