# SyriaTel Customer Churn

![Customer service](images/call_center.png)

## Overview

This project explores and analyses telecom data from SyriaTel to understand how international and voicemail plans, customer service calls, and call minutes relate to customer satisfaction and, ultimately, churn. SyriaTel can use this analysis to curb churn and to estimate the revenue lost when a customer leaves.

## Business Problem

In this project, we address the rate of customer churn at SyriaTel, a telecom company. Customer churn leads to significant revenue loss and increased costs for acquiring new customers. The goal is to identify the key factors that contribute to customer churn, such as international and voicemail plans, customer service interactions, and call usage patterns. By understanding these drivers, SyriaTel can develop targeted strategies to improve customer retention, enhance satisfaction, and reduce the financial impact of churn.

![Telephone](images/telephone-3594206_1280.jpg)

## Objectives

* What is the overall churn rate?
* Which package plans have customers at high risk of leaving?
* What factors contribute to a high rate of customer churn?


## Data Understanding

The telecom data used in this project is available on Kaggle [here](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset).


Import the necessary libraries

In [13]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [14]:
# preprocessing and evaluation metrics
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


In [None]:
# load dataset

df = pd.read_csv("telecom.csv", sep=",")
df.head()

In [None]:
# check the dataset
df.info()

* The dataset is clean and has no missing values
* Two of its categorical columns are binary yes/no flags (international plan, voice mail plan)

In [None]:
# count fully duplicated rows (0 means there are no duplicates)
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

* The dataset has 3333 entries with no duplicates

In [None]:
# view summary statistics
df.describe()

In [None]:
# visualize the distribution of churn and add percentage annotations
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='churn', data=df, palette='inferno')
total = len(df)
for p in ax.patches:
    count = int(p.get_height())
    percentage = 100 * count / total
    ax.annotate(f'{percentage:.2f}%', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=12, color='black', fontweight='bold')
plt.title('Distribution of Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.show()

In [None]:
# view churn distribution of values in the dataset
churn_distribution = df['churn'].value_counts()

print(churn_distribution)

The target is binary, so this is a classification problem.
We will build three models:
- Logistic regression as the baseline model
- Decision tree
- Random forest

In [None]:
# Calculate and display churn percentage
churn_percent = (df['churn'].sum() / len(df)) * 100
print(f"Churn Percentage: {churn_percent:.2f}%")
# check percentage of loyal customers
loyal_customers = (churn_distribution[False] / churn_distribution.sum()) * 100

print(f"Loyal Percentage: {loyal_customers:.2f}%")

* 85.51% of customers are loyal and 14.49% churned
* One class has far more records than the other, so the dataset is imbalanced.
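The imbalance matters because a model that never predicts churn already looks accurate. The sketch below illustrates this with class counts reconstructed from the figures quoted above (3333 rows, 14.49% churn); the counts are derived for illustration and are not read from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Counts reconstructed from the quoted figures (assumption, not loaded from the data)
n_total = 3333
n_churn = round(0.1449 * n_total)
n_loyal = n_total - n_churn

y_true = np.array([1] * n_churn + [0] * n_loyal)
y_always_loyal = np.zeros_like(y_true)  # a "model" that never predicts churn

baseline_accuracy = accuracy_score(y_true, y_always_loyal)
baseline_recall = recall_score(y_true, y_always_loyal)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall on churners: {baseline_recall:.4f}")
```

Any model evaluated later should be compared against this baseline rather than against 50%.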

In [None]:
# drop phone number (an identifier, not a predictor)
df = df.drop(columns=['phone number'])

Map the binary features `international plan` and `voice mail plan` to integers: yes = 1, no = 0.

In [None]:
# map international and voicemail plans to 1 and 0
df[['international plan', 'voice mail plan']] = df[['international plan', 'voice mail plan']].replace(['yes', 'no'], ['1', '0']).astype(int)

### Multivariate Analysis

This section checks how the numerical features relate to one another.

In [None]:
numeric_df = df.select_dtypes(include=['float64', 'int64'])

correlation_matrix = numeric_df.corr(method='spearman')
#mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))


plt.figure(figsize=(10, 8))

# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='viridis')
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

*Interpretation*

`Redundancy`
* The heatmap shows perfect correlations between minutes and charge for day, evening and night, which indicates multicollinearity: each charge column carries the same information as its minutes column
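Reading redundant pairs off a heatmap works for a handful of columns; the same check can be done in code. A minimal sketch on toy data, where `find_redundant_pairs`, the 0.99 threshold and the 0.17 per-minute rate are all illustrative assumptions rather than part of the original analysis:

```python
import numpy as np
import pandas as pd

def find_redundant_pairs(frame, threshold=0.99):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = frame.select_dtypes(include="number").corr(method="spearman")
    cols = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:          # upper triangle: each pair once
            r = corr.loc[col_a, col_b]
            if abs(r) >= threshold:
                pairs.append((col_a, col_b, round(float(r), 3)))
    return pairs

# Toy data: the charge is a fixed (hypothetical) rate times the minutes
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 350, size=200)
toy_usage = pd.DataFrame({
    "total day minutes": minutes,
    "total day charge": 0.17 * minutes,
    "total day calls": rng.integers(50, 150, size=200),
})

redundant = find_redundant_pairs(toy_usage)
print(redundant)
```

One column from each reported pair can then be dropped before modelling.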


In [None]:
# drop columns to avoid redundancy
df = df.drop(columns=['total day charge', 'total eve charge', 'total night charge', 'total intl charge'])
df.info()

In [None]:
# Calls and minutes aggregate
df['total calls'] = df['total day calls'] + df['total eve calls'] + df['total night calls'] + df['total intl calls'] + df['customer service calls']
df['total minutes'] = df['total day minutes'] + df['total eve minutes'] + df['total night minutes'] + df['total intl minutes']

print(df['total calls'].describe(), df['total minutes'].describe(), sep='\n\n')

Check average calls and minutes by state

In [None]:
# total calls and total minutes per state
calls_by_state = df.groupby('state')['total calls'].sum().sort_values(ascending=False)
minutes_by_state = df.groupby('state')['total minutes'].sum().sort_values(ascending=False)

avg_minutes_by_state = df.groupby('state')['total minutes'].mean().sort_values(ascending=False)
avg_calls_by_state = df.groupby('state')['total calls'].mean().sort_values(ascending=False)
records_by_state = df.groupby('state')['churn'].count().sort_values(ascending=False)

print("Top 5 states by average total minutes:\n", avg_minutes_by_state.head())

print("Top 5 states by average total calls:\n", avg_calls_by_state.head())

print("Top 5 states by number of records:\n", records_by_state.head())


* Among the top 5 states, Indiana (IN) has the highest average total minutes
* Georgia (GA) has the highest average total calls
* West Virginia (WV) has the most customer records (this counts every customer in the state, not only those who churned)
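A record count per state is not the same as a churn count. To see where churn is concentrated, the number of churned customers and the churn rate can be computed alongside the record count. A minimal sketch on a toy frame with made-up values (only the column names `state` and `churn` come from the dataset):

```python
import pandas as pd

# Toy records with hypothetical values
toy_states = pd.DataFrame({
    "state": ["WV", "WV", "WV", "WV", "CA", "CA", "NJ", "NJ"],
    "churn": [False, False, False, True, True, False, True, True],
})

state_summary = (
    toy_states.groupby("state")["churn"]
    .agg(records="count", churned="sum", churn_rate="mean")
    .sort_values("churn_rate", ascending=False)
)
print(state_summary)
```

In this toy example the state with the most records has the lowest churn rate, which is why the three columns are worth reading together.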


In [None]:
most_calls = calls_by_state.idxmax()
least_calls = calls_by_state.idxmin()
print('Most calls:', most_calls)
print('Least calls:', least_calls)

most_minutes = minutes_by_state.idxmax()
least_minutes = minutes_by_state.idxmin()
print('Most minutes:', most_minutes)
print('Least minutes:', least_minutes)

*Interpretation*
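* West Virginia has the highest total number of calls and minutes
* California has the fewest calls and minutes

*Insights*
* Heavy usage (many calls and minutes) might be linked to churn, but state totals largely reflect how many customers each state has, so this needs to be checked at customer level.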

**Visualization of churn by total calls in a histogram**

In [None]:
# visualize churn by total calls
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='total calls', hue='churn', bins=30, kde=False, multiple='stack')
plt.title('Histogram of Churn by Total Calls')
plt.xlabel('Total Calls')
plt.ylabel('Count')
plt.gca().get_legend().set_title('Churn')  # keep seaborn's own False/True entries instead of relabelling by position
plt.tight_layout()
plt.show()


*Insights*
* Customers with a very low number of total calls appear more likely to churn

### Bivariate Analysis

### Churn rate by voice mail plan and international plan

In [None]:
# print churn rate by voice mail plan and international plan
print(df.groupby(['voice mail plan', 'international plan'])['churn'].mean())


Visualization

In [None]:
# Calculate churn percentage
churn_percentage = df.groupby(['voice mail plan', 'international plan'])['churn'].mean() * 100

# Plot the churn percentage
churn_percentage.unstack().plot(kind='bar', figsize=(10, 6), color=['blue', 'maroon'])
plt.title('Churn Percentage by Voice Mail and International Plans')
plt.xlabel('Voice Mail Plan')
plt.ylabel('Churn Percentage')
plt.xticks(rotation=0)
plt.legend(title='International Plan')
plt.show()

*Interpretation*
* Customers without an international plan (blue bars) who have a voice mail plan show a low churn percentage.
* For customers with an international plan, having a voice mail plan has only a small impact on churn.


*Insights*
* Having an international plan is strongly associated with churn.
* For customers without an international plan, offering a voice mail plan may help retain them.

A similar visualization on different axes

In [None]:
# Visualize the effect of international plan and voice mail plan on customer satisfaction (churn)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# International plan vs churn
sns.barplot(x='international plan', y='churn', data=df, ax=axes[0], errorbar=None, palette='viridis')
axes[0].set_title('Churn Rate by International Plan')
axes[0].set_xlabel('International Plan (0=No, 1=Yes)')
axes[0].set_ylabel('Churn Rate')

# Voice mail plan vs churn
sns.barplot(x='voice mail plan', y='churn', data=df, ax=axes[1], errorbar=None, palette='plasma')
axes[1].set_title('Churn Rate by Voice Mail Plan')
axes[1].set_xlabel('Voice Mail Plan (0=No, 1=Yes)')
axes[1].set_ylabel('Churn Rate')

plt.tight_layout()
plt.show()

### Churn by Customer service calls

In [None]:
# check churn by customer service calls
df.groupby('churn')['customer service calls'].describe()

In [None]:
# Visualize customer service churn
fig, ax = plt.subplots()
avg_calls = df.groupby('churn')['customer service calls'].mean()
avg_calls.plot(kind='bar', ax=ax, label='')
ax.axhline(y=df['customer service calls'].mean(),c='black', label='dataset mean')
ax.legend()
ax.set_title('mean customer service calls on churn')
ax.set_ylabel('customer service calls (mean)', rotation = 90)

for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    ax.annotate('{:.2f}'.format(height), (p.get_x()+.35*width, p.get_y()+.5*height), color = 'white', weight = 'bold', size = 14)

plt.tight_layout()
plt.show()

*Interpretation*
* Customers who churned made more customer service calls than loyal customers
* The average number of customer service calls made by churned customers (2.23) is higher than the dataset mean (approximately 1.6)

*Insights*
* Dissatisfied customers tend to call customer service more often and may eventually churn

In [None]:
sns.barplot(y='number vmail messages', x='churn', data=df, errorbar=None, palette='inferno')
plt.title('Relationship Between Number of Voicemail Messages and Churn')
# x holds the churn classes and y the mean message count per class
plt.xlabel('Churn')
plt.ylabel('Average Number of Voicemail Messages')
plt.tight_layout()
plt.show()

*Interpretation*
* Loyal customers (not churned) have a higher average number of voicemail messages than customers who churned
* Customers who churned have fewer voicemail messages

*Insights*
* Customers who actively use voicemail are engaged, probably satisfied with the service, and less likely to churn
* Less engaged customers are more likely to be dissatisfied with the service, which raises the probability of churn

## Data Preparation

### Preprocessing
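- The data is encoded and scaled here, before it is split into training and test sets. Fitting the scaler on the full dataset lets test-set statistics influence training (data leakage); fitting it on the training split only avoids this.
- We use one-hot encoding to get dummies for the categorical columns and a standard scaler to bring the numeric features to zero mean and unit variance.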
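One common way to keep test-set statistics out of preprocessing is to wrap the encoder and scaler in a scikit-learn `Pipeline`, so they are fitted on the training split only. The following is a minimal sketch on a toy frame, not the notebook's own workflow; the values, the toy target rule and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical values; column names mirror the dataset's
rng = np.random.default_rng(42)
n_rows = 200
toy = pd.DataFrame({
    "state": rng.choice(["WV", "CA", "NJ"], size=n_rows),
    "total day minutes": rng.normal(180, 50, size=n_rows),
    "customer service calls": rng.integers(0, 8, size=n_rows),
})
toy_target = (toy["customer service calls"] >= 4).astype(int)  # made-up target rule

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)

numeric_cols = ["total day minutes", "customer service calls"]
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), ["state"]),
    ("scale", StandardScaler(), numeric_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler's mean and variance from the training rows only
model.fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))
```

Because the pipeline is a single estimator, the same object can also be passed to `GridSearchCV`, which then refits the preprocessing inside every fold.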

In [None]:
# define categorical (non-numeric) columns
categorical_columns = df.select_dtypes(exclude=['number'])

# view categorical columns
categorical_columns.info()

Use One-hot Encoding to get dummies

In [None]:
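# one-hot encode state and area code; keep only the dummy columns and the target
# so the raw numeric columns are not duplicated (and truncated to int) in the combined frame
state_dummy = pd.get_dummies(df[['state', 'area code', 'churn']], columns=["state", 'area code'], drop_first=True)

state_dummy = state_dummy.astype(int)
state_dummy.info()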

Scale the data to unit variance

In [None]:
# use standard scaler to put the numeric features on the same scale
# area code is a category that is one-hot encoded, so it is excluded here
numeric_features = df.select_dtypes(include=['number']).drop(columns=['area code'])

# instantiate standard scaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numeric_features)
scaled_features_df = pd.DataFrame(scaled_features, columns=numeric_features.columns)
print(scaled_features_df.head())

Combine the datasets

In [None]:
# combine state dummies with scaled features df
df_combined = pd.concat([state_dummy, scaled_features_df], axis=1)
df_combined.info()

### Split the data into training and testing

In [None]:
# name the predictor and the target variables
X = df_combined.drop(columns=['churn'])
y = df['churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
target_names = ['False', 'True']

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Use Logistic Regression to create base model

In [None]:
# use Logistic regression
lr = LogisticRegression(max_iter=6000)

# fit and train the model
lr.fit(X_train, y_train)

#prediction
y_pred_1 = lr.predict(X_test)

In [None]:
# print the report
print(classification_report(y_test, y_pred_1, target_names=target_names))

In [None]:
# Evaluate the model performance
acc_lr = accuracy_score(y_test, y_pred_1)
f1_lr = f1_score(y_test, y_pred_1)
recall_lr = recall_score(y_test, y_pred_1)
precision_lr = precision_score(y_test, y_pred_1)

print(f"F1 score: {f1_lr}, \n Recall: {recall_lr}, \n Precision: {precision_lr}, \n Accuracy: {acc_lr}")


*Interpretation*

* Precision - 54% of the customers the model predicted as churned actually churned.
* Recall - The model fails to identify many of the customers who actually churned.
* Accuracy - The model's accuracy is 85%, which is about what always predicting 'no churn' would score on this test set (566 of 667), so accuracy alone is misleading here; balancing the classes should help.

* Class imbalance - support of 'False' 566 and 'True' 101
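For reference, these metrics all come from the four cells of the confusion matrix. The counts below are illustrative only: they are NOT the notebook's actual output and were chosen simply to agree with the class supports quoted above (566 and 101).

```python
# Illustrative confusion-matrix counts (assumed, not taken from the model's output)
tn, fp = 549, 17   # actual "False" (loyal) customers: 549 + 17 = 566
fn, tp = 81, 20    # actual "True" (churned) customers: 81 + 20 = 101

precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of real churners, how many were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
```

With counts like these, accuracy stays high while recall is low, because the many correctly classified loyal customers dominate the accuracy calculation.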

In [None]:
# Create a confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_1)
plt.title("Confusion Matrix - Logistic Regression")
plt.show()



* The model is biased towards predicting the majority class and fails to effectively predict the minority class

### Apply SMOTE to solve class imbalance

In [None]:
# Class distribution of the training set before resampling
print('Original training class distribution: \n')
print(y_train.value_counts())
smote = SMOTE(random_state=42)  # fixed seed so the synthetic samples are reproducible
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Preview synthetic sample class distribution
print('-----------------------------------------')
print('Synthetic sample class distribution: \n')
print(pd.Series(y_train_resampled).value_counts())

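The training data is now balanced: SMOTE adds synthetic churn samples until both classes are the same size. The test set is left untouched.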
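SMOTE balances the classes by creating synthetic minority rows that lie between existing minority rows and their nearest minority neighbours. The function below is a simplified NumPy illustration of that interpolation idea, not imblearn's implementation; `smote_like_oversample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random minority row
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its k neighbours
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority-class points (hypothetical values)
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6], [2.4, 2.0]])
synthetic = smote_like_oversample(minority, n_new=10, k=3)
print(synthetic.shape)
```

Each synthetic row sits on the line segment between two real minority rows, so no value falls outside the range already seen in the minority class.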

In [None]:
# use the balanced data
logreg = LogisticRegression(max_iter=6000, C=1e12, fit_intercept=False, solver='liblinear')
logreg.fit(X_train_resampled, y_train_resampled)

In [None]:
# check the metrics
# decision scores (signed distance from the decision boundary) for each test point;
# the model was already fitted in the previous cell, so it is not refit here
y_score = logreg.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

In [None]:
# print the AUC
print('AUC: {}'.format(auc(fpr,tpr)))

Plot the ROC curve

In [None]:

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

The model's AUC of 0.82 means there is an 82% chance that it scores a randomly chosen churned customer higher than a randomly chosen loyal one.
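The ranking interpretation of AUC can be verified directly by comparing every (positive, negative) pair of scores. A small sketch with made-up labels and scores; `pairwise_auc` is a hypothetical helper written for this illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive gets the higher score."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    # ties count as half a win, the usual AUC convention
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy labels and scores (hypothetical values)
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.6, 0.9]

auc_by_pairs = pairwise_auc(toy_labels, toy_scores)
auc_by_sklearn = roc_auc_score(toy_labels, toy_scores)
print(auc_by_pairs, auc_by_sklearn)
```

Both numbers agree, which is why AUC is insensitive to the classification threshold: it depends only on how the scores are ordered.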

### Decision Trees 

In [None]:
# instantiate the model
tree = DecisionTreeClassifier(random_state=24)

# fit the model on resampled / balanced data
tree.fit(X_train_resampled, y_train_resampled)

# predict
y_pred_tree = tree.predict( X_test)

In [None]:
# Plot confusion matrix for Decision Tree on resampled data

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_tree)
plt.title("Confusion Matrix - Decision Tree")
plt.show()


In [None]:
# evaluate the model
f1_tree = f1_score(y_test, y_pred_tree)
recall_tree = recall_score(y_test, y_pred_tree)
precision_tree = precision_score(y_test, y_pred_tree)
acc_tree = accuracy_score(y_test, y_pred_tree)

print(f"F1 score: {f1_tree:.2f}, \n Recall: {recall_tree:.2f}, \n Precision: {precision_tree:.2f}, \n Accuracy: {acc_tree:.2f}")
print("Decision Tree (tree) Results:")
print(classification_report(y_test, y_pred_tree))

*Interpretation* 
 * The decision tree performed better than the logistic regression model
 * The F1 score of 95% is a good result
 * The overall metrics are good, with an accuracy of 91%
 

*Insights* 
- The F1 score (95%) is the chosen metric because it is the harmonic mean of precision and recall, so it balances both.
- In practice, a 95% F1 score implies that most customers flagged for follow-up would be genuine churn risks, so the customer support team could prioritise its effort with little waste.
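A classification report lists one F1 score per class plus averaged versions, and on imbalanced data these can differ a lot, so it helps to state which one is being quoted. A short sketch with made-up labels showing how `f1_score` exposes each variant:

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical): 1 = churned, 0 = stayed
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

f1_churn = f1_score(y_true_toy, y_pred_toy, pos_label=1)            # churn class only
f1_stay = f1_score(y_true_toy, y_pred_toy, pos_label=0)             # majority class only
f1_weighted = f1_score(y_true_toy, y_pred_toy, average="weighted")  # support-weighted mean

print(f"churn F1={f1_churn:.2f}, stay F1={f1_stay:.2f}, weighted F1={f1_weighted:.2f}")
```

For churn prediction, the F1 of the churn class is usually the most informative of the three, since the majority class inflates the others.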

In [None]:
# Tuning the model using GridSearchCV

"""
max_depth - maximum depth of the tree

min_samples_split - minimum number of samples required to split an internal node

criterion - split quality measure: entropy or gini

max_features - number of predictors considered at each split; higher may lead to overfitting, lower generalizes

splitter - best or random: the strategy used to choose the split (not tuned in the grid below)

min_samples_leaf - minimum number of samples required in a leaf node

"""
tree_1= DecisionTreeClassifier(random_state=42)

param_grid = {
    'max_depth': [None, 3, 5, 20],
    'min_samples_split': [5, 20,100, 1000],
    'min_samples_leaf': [5, 10, 15],
    'criterion':['gini', 'entropy'] ,
    'max_features': ['sqrt','log2']
    }

# instantiate
grid_search = GridSearchCV(estimator=tree_1, param_grid=param_grid, cv=2, scoring = 'accuracy' )

#training/fitting the  gridsearch
grid_search.fit(X_train_resampled, y_train_resampled)


In [None]:
# check the parameters 
params = grid_search.best_params_
score = grid_search.best_score_

print(f'Best Parameters: {params}')
print(f'Best Scores: {score:.2f}')

In [None]:
# Apply the tuning to a grid model
grid_model = grid_search.best_estimator_

#predicting using the best grid search model
y_preds_g = grid_model.predict(X_test)


f1_g = f1_score(y_test, y_preds_g)
acc_g = accuracy_score(y_test, y_preds_g)
precision_g = precision_score(y_test, y_preds_g)
recall_g = recall_score(y_test, y_preds_g)

print(
    f"Grid model has an: \n Accuracy: {acc_g:.2f}, \n Precision: {precision_g:.2f}, \n Recall: {recall_g:.2f}, \n F1 Score: {f1_g:.2f}"
)


In [None]:
# compare the two models performance
print("Decision Tree (tree) Results:")
print(classification_report(y_test, y_pred_tree))

print("\nTuned Decision Tree (grid_model) Results:")
print(classification_report(y_test, y_preds_g))

*Interpretation*
* The original Decision tree performed better than the tuned Decision tree
* Hyperparameter tuning did not improve minority class performance
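One possible reason tuning did not help the churn class is that the search optimised accuracy with only two folds. As a hedged alternative, the search can be scored on the minority-class F1 with stratified folds. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn data; the grid values and all `demo_*` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (roughly 15% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)

demo_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}
demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=demo_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",  # F1 of the positive (minority) class instead of accuracy
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(f"Best cross-validated F1: {demo_search.best_score_:.2f}")
```

Stratified folds keep the churn proportion the same in every split, which makes the per-fold scores more comparable when positives are scarce.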

In [None]:
# Count number of customers predicted to churn (True) by each model
n_churned_lr = np.sum(y_pred_1)
n_churned_tree = np.sum(y_pred_tree)

print(f"Logistic Regression predicted {n_churned_lr} customers to churn.")
print(f"Decision Tree predicted {n_churned_tree} customers to churn.")

### Random Forest Classifier

In [None]:
# instantiate Random Forest
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10,random_state=42)

# fit the model
classifier = classifier.fit(X_train_resampled, y_train_resampled)
print(classifier)

In [None]:
# predict
y_random_frst=classifier.predict(X_test)

# Evaluate
score = classifier.score(X_test, y_test)
print('Random Forest Classifier : ',score)
print('Accuracy Score',accuracy_score(y_test,y_random_frst))  
print(classification_report(y_test, y_random_frst, target_names=target_names))

In [None]:
# compare the models performance
print("Decision Tree (tree) Results:")
print(classification_report(y_test, y_pred_tree))

print("\nRandom Forest (Random Forest model) Results:")
print(classification_report(y_test, y_random_frst, target_names=target_names))

*Interpretation*
- The Random Forest classifier performs better than all the other models
- It predicts the minority (churn) class better

*Insights*
- The Random Forest classifier is the best model overall because it handles the minority class well.

## Analyze factors causing churn

In [None]:
# Analyze feature importances from the random forest (`classifier`) to identify key factors causing churn
importances = classifier.feature_importances_
feature_names = X_train.columns

# Get top features
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values(by='importance', ascending=False)

print("Top factors causing churn based on Random Forest importance:")
print(feature_importance_df.head(10))
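Because `state` and `area code` are one-hot encoded, their importance is spread over many dummy columns and each one can look negligible on its own. One option is to sum the dummies back onto their source feature. A minimal sketch, assuming the `state_` and `area code_` prefixes that `pd.get_dummies` produces; `group_dummy_importances` and the numbers are hypothetical.

```python
import pandas as pd

def group_dummy_importances(importance_df, prefixes=("state_", "area code_")):
    """Sum the importances of one-hot columns back onto their source feature."""
    def source_feature(name):
        for prefix in prefixes:
            if name.startswith(prefix):
                return prefix.rstrip("_")
        return name

    grouped = importance_df.assign(feature=importance_df["feature"].map(source_feature))
    return (
        grouped.groupby("feature", as_index=False)["importance"].sum()
        .sort_values("importance", ascending=False, ignore_index=True)
    )

# Toy importances (made-up numbers) in the same layout as feature_importance_df
toy_importance = pd.DataFrame({
    "feature": ["customer service calls", "state_NJ", "state_TX",
                "international plan", "area code_415"],
    "importance": [0.40, 0.05, 0.10, 0.35, 0.10],
})
grouped_importance = group_dummy_importances(toy_importance)
print(grouped_importance)
```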

### Features that contribute to churn

In [None]:
# Visualize top 10 features contributing to churn from the random forest
plt.figure(figsize=(10, 6))
sns.barplot(
    x='importance',
    y='feature',
    data=feature_importance_df.head(10),
    palette='viridis'
)
plt.title('Top 10 Features Contributing to Churn (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

## Conclusion
### *The analysis of this dataset reveals several key insights into the drivers of customer churn.*

* A high number of customer service calls signals dissatisfaction and potential churn.
* Customers with an international plan are more likely to churn.
* Customers who actively use voicemail services tend to be more loyal.
* Accuracy alone looks high because of the class imbalance; recall for the minority (churn) class improves after applying SMOTE.
* Feature importance analysis confirms that customer service calls, international plan, and usage patterns are significant predictors of churn.

In [None]:
# Visualize the conclusion

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Customer service calls vs churn
sns.barplot(x='churn', y='customer service calls', data=df, ax=axes[0], errorbar=None)
axes[0].set_title('Customer Service Calls by Churn')
axes[0].set_xlabel('Churn')
axes[0].set_ylabel('Avg Customer Service Calls')

# 2. International plan vs churn
sns.barplot(x='international plan', y='churn', data=df, ax=axes[1], errorbar=None)
axes[1].set_title('Churn Rate by International Plan')
axes[1].set_xlabel('International Plan (0=No, 1=Yes)')
axes[1].set_ylabel('Churn Rate')

# 3. Voicemail usage vs churn
sns.barplot(x='churn', y='number vmail messages', data=df, ax=axes[2], errorbar=None)
axes[2].set_title('Voicemail Messages by Churn')
axes[2].set_xlabel('Churn')
axes[2].set_ylabel('Avg Voicemail Messages')

plt.tight_layout()
plt.show()



## Recommendations
- Offering incentives or tailored packages to the high-risk segments identified above (international plan holders and customers who call customer service often), as well as promoting voicemail plan usage, may help increase satisfaction and retention.
- SyriaTel should focus on improving customer service quality, especially for customers who call for support. This would reduce churn and the resulting loss of revenue.
- Customers with international plans deserve special attention, as they have a high chance of leaving.

## Next steps
- Stay up to date with real-time industry data and new technology so the analysis can adapt to new ideas.
- Marketing team should focus on customer acquisition and retention methods like content marketing and loyalty rewards.
