# BACKGROUND

Churn refers to whether a customer has stopped using a service or cancelled their subscription. Customer churn is a critical issue for companies, especially in highly competitive industries such as telecommunications. Retaining existing customers is often more cost-effective than acquiring new ones. As a result, identifying the factors that lead to churn and predicting which customers are at risk of leaving can help businesses develop strategies to improve customer satisfaction and loyalty.

 In this project, we used a dataset containing various features related to customer usage and interactions with the service, including account length, usage metrics, service plans, and customer service calls. Our goal was to build predictive models to classify customers into churners and non-churners, helping the company take proactive measures to reduce churn rates.

# PROBLEM STATEMENT

 The telecommunications company aims to reduce its customer churn rate, which directly impacts revenue and profitability. Despite various retention efforts, there is a need for a more systematic and data-driven approach to identify high-risk customers and understand the key drivers behind churn.
 The main challenges include:

Identifying the most significant factors contributing to customer churn.

Developing accurate predictive models to classify customers as churners or non-churners.

Providing actionable insights and recommendations to the company based on the model's predictions.

# OBJECTIVES


# 1. Data Preparation and Exploration:

Clean and preprocess the dataset to handle missing values, encode categorical features, and scale numeric features.

Explore the data to understand distributions, correlations, and potential factors influencing churn.

# 2. Model Development:

Build a baseline model using logistic regression to provide an initial understanding of the predictive capabilities.

Develop an advanced model using a decision tree classifier with hyperparameter tuning to enhance predictive performance.

# 3. Model Evaluation and Comparison:

Evaluate the models using metrics such as accuracy, precision, recall, and F1 score.

Compare the performance of the logistic regression model and the decision tree model to identify the best-performing model.

# 4. Insights and Recommendations:

Analyze feature importance to understand which variables have the most significant impact on churn.

Provide actionable insights to stakeholders based on the model's predictions, including strategies for customer retention and areas for service improvement.

#### By achieving these objectives, the company can leverage data-driven insights to reduce churn rates, enhance customer satisfaction and improve overall business performance.

In [60]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [61]:
# Load the data set
df = pd.read_csv('Syria_Tel.csv')

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

### Note: From the columns above, we identify `churn` as our target variable and the remaining columns as features. In this dataset, churn indicates whether a customer has stopped using the service or cancelled their subscription. Analyzing and predicting churn is crucial for the business to understand the factors contributing to customer attrition and to develop strategies to retain customers.

In [63]:
# Separate features and target
x = df.drop(['churn', 'phone number'], axis=1)
y = df['churn']

### Note: We drop the phone number column because it does not provide meaningful information for predicting churn and should not be used as a training feature.

In [64]:
# Split the data into training and testing sets (50/50 split)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)
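
The split in this cell does not stratify on the target, so the share of churners in the two halves can differ by chance. The following is a minimal sketch of a stratified split, assuming a synthetic target with about 14% positives; the `demo_` names and the synthetic data are illustrative and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 14% positives (assumed share).
rng = np.random.default_rng(1)
demo_x = pd.DataFrame({"account length": rng.integers(1, 200, size=1000)})
demo_y = pd.Series([True] * 140 + [False] * 860, name="churn")

# stratify=demo_y keeps the class proportions the same in both halves.
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_x, demo_y, test_size=0.5, random_state=1, stratify=demo_y
)

# Share of positives in each half
print(dy_train.mean(), dy_test.mean())
```

Whether stratification changes the results on the real data is not something this sketch can show; it only illustrates the `stratify` argument.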

In [65]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


In [66]:
# Identify categorical features 
categorical_features = ['state', 'international plan', 'voice mail plan']

# One-hot encode the categorical features and pass the other columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough')


In [67]:
# Define a pipeline that encodes the categorical features and then scales every column
# (with_mean=False divides by the standard deviation without centering)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False))])

In [68]:
# Fit and transform the training data, and transform the test data 
X_train_processed = pipeline.fit_transform(x_train)
X_test_processed = pipeline.transform(x_test)
print(X_train_processed.shape) 
print(X_test_processed.shape)


(1666, 71)
(1667, 71)
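
The encoder above uses the default `handle_unknown='error'`, so `transform` would fail if the test split contained a category that never appeared in the training split. A minimal sketch of a more defensive variant follows, assuming tiny made-up frames; the `demo_` names and values are hypothetical, and scaling only the numeric column is a design choice shown for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frames; the test frame contains a state unseen during fitting.
demo_train = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "international plan": ["no", "yes", "no", "no"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
})
demo_test = pd.DataFrame({
    "state": ["TX"],
    "international plan": ["yes"],
    "total day minutes": [200.0],
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["state", "international plan"]),
        # scale only the numeric column, leaving the one-hot columns as 0/1
        ("num", StandardScaler(), ["total day minutes"]),
    ]
)

demo_train_processed = demo_preprocessor.fit_transform(demo_train)
demo_test_processed = demo_preprocessor.transform(demo_test)
print(demo_train_processed.shape, demo_test_processed.shape)
```

The unseen state `TX` produces zeros in all three state columns instead of raising an error.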


## Baseline Model: Logistic Regression

In [69]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score

# Create and fit the logistic regression model 
# max_iter is raised from the default of 100 to give the solver room to converge
logistic_model = LogisticRegression(random_state=1, max_iter=1000)
logistic_model.fit(X_train_processed,y_train)

In [70]:
#Predict and Evaluate
y_pred_logistic = logistic_model.predict(X_test_processed) 
logistic_acc = accuracy_score(y_test, y_pred_logistic)
print(f"Logistic Regression Accuracy: {logistic_acc}")

Logistic Regression Accuracy: 0.853629274145171


## Conclusion: 
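An accuracy of 0.8536 means the logistic regression model correctly classified about 85.36% of the customers in the test set. This figure needs context: the test set is imbalanced (1430 non-churners and 237 churners, as the confusion matrix below shows), so a model that always predicts "no churn" would already score 1430/1667 ≈ 85.78%. Accuracy alone therefore says little about how well the model separates the classes, so we proceed with additional metrics: precision, recall, and F1 score.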
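
One way to put this accuracy in context is to compare it with a majority-class baseline. The sketch below rebuilds a label vector from the class counts that appear in the confusion matrix printed later in this notebook (1430 non-churners, 237 churners); `y_demo`, `X_demo`, and `baseline` are illustrative names. In practice the baseline would be fitted on the training split and scored on the test split; fitting and scoring on the same labels here is only a shortcut to show the arithmetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the test-set class counts: 1430 non-churners, 237 churners.
y_demo = np.array([False] * 1430 + [True] * 237)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class ("no churn").
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")  # 0.8578
```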

In [74]:
#Calculate Precision,recall and F1 score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

logistic_precision = precision_score(y_test, y_pred_logistic, average='weighted') 
logistic_recall = recall_score(y_test, y_pred_logistic, average='weighted')
logistic_f1 = f1_score(y_test, y_pred_logistic, average='weighted') 
conf_matrix = confusion_matrix(y_test, y_pred_logistic)

print(f"Precision: {logistic_precision}") 
print(f"Recall: {logistic_recall}") 
print(f"F1 Score: {logistic_f1}")
print(f"Confusion Matrix: \n{conf_matrix}")

Precision: 0.8219382350991498
Recall: 0.853629274145171
F1 Score: 0.8296900361785249
Confusion Matrix: 
[[1372   58]
 [ 186   51]]


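Precision is the share of predicted positives that are correct, and recall is the share of actual positives that the model finds. The values printed above use `average='weighted'`, so each is an average over both classes weighted by class size and is dominated by the non-churn majority: 0.8219 is not the share of predicted churners who actually churned, and weighted recall (0.8536) always equals accuracy. For the churn class alone, the confusion matrix gives a precision of 51/(51+58) ≈ 46.8%, a recall of 51/(51+186) ≈ 21.5%, and an F1 score of about 0.29. The weighted F1 score of 0.8297 therefore overstates how well the model identifies churners.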

# The confusion matrix shows:

i) True negatives (1372): customers who did not churn and were correctly predicted not to churn.

ii) True positives (51): customers who churned and were correctly predicted to churn.

iii) False negatives (186): customers who churned but were predicted not to churn.

iv) False positives (58): customers who did not churn but were predicted to churn.

The model is reliable for non-churners, but it misses 186 of the 237 actual churners, so it performs poorly on the class that matters most for retention.
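
Because the metrics above were computed with `average='weighted'`, the churn-class values have to be derived separately. The sketch below rebuilds label vectors from the printed confusion matrix `[[1372, 58], [186, 51]]` and computes precision, recall, and F1 for the churn class only; the `_demo` names are illustrative, and on the real data the same calls could be applied to `y_test` and `y_pred_logistic` directly.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rebuild label vectors from the confusion matrix [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 1372, 58, 186, 51
y_true_demo = np.array([False] * (tn + fp) + [True] * (fn + tp))
y_pred_demo = np.array([False] * tn + [True] * fp + [False] * fn + [True] * tp)

# The default average='binary' reports the positive (churn) class only.
churn_precision = precision_score(y_true_demo, y_pred_demo, pos_label=True)
churn_recall = recall_score(y_true_demo, y_pred_demo, pos_label=True)
churn_f1 = f1_score(y_true_demo, y_pred_demo, pos_label=True)

print(f"Churn precision: {churn_precision:.4f}")  # 0.4679
print(f"Churn recall:    {churn_recall:.4f}")     # 0.2152
print(f"Churn F1:        {churn_f1:.4f}")         # 0.2948
```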

## Tuned Model: Decision Tree Classifier

In [79]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV

# Create a simple decision tree classifier 
tree_model = DecisionTreeClassifier(random_state=1)

# Define hyperparameters to tune 
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}


# Perform a grid search with 5-fold cross-validation
grid_search = GridSearchCV(tree_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_processed, y_train)
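
The grid search in this cell selects hyperparameters by accuracy. On an imbalanced target, selecting by the F1 score of the positive class is a common alternative. The sketch below shows that variant on synthetic data from `make_classification`; the data, the smaller grid, and the `demo_` names are assumptions for illustration and say nothing about which setting is best for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 15% positives); not the churn data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=1
)

demo_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    demo_grid,
    cv=5,
    scoring="f1",  # F1 of the positive class instead of overall accuracy
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```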



In [84]:
#Get the best model from grid search
best_tree_model = grid_search.best_estimator_ 
# Predict and evaluate 
y_pred_tree = best_tree_model.predict(X_test_processed) 
tree_acc = accuracy_score(y_test, y_pred_tree) 

print(f"Decision Tree Accuracy: {tree_acc}")

Decision Tree Accuracy: 0.937612477504499


In [85]:
#Calculate additional metrics 
tree_precision = precision_score(y_test, y_pred_tree, average='weighted')
tree_recall = recall_score(y_test, y_pred_tree, average='weighted') 
tree_f1 = f1_score(y_test, y_pred_tree, average='weighted') 
conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)

print(f"Decision Tree Accuracy: {tree_acc}") 
print(f"Precision: {tree_precision}")
print(f"Recall: {tree_recall}") 
print(f"F1 Score: {tree_f1}") 
print(f"Confusion Matrix: \n{conf_matrix_tree}")

Decision Tree Accuracy: 0.937612477504499
Precision: 0.9360998080701541
Recall: 0.937612477504499
F1 Score: 0.9366962788212593
Confusion Matrix: 
[[1386   44]
 [  60  177]]


## Accuracy: 0.9376
An accuracy of 0.9376 means that about 93.76% of the model's predictions on the test set were correct, well above the roughly 85.8% that always predicting "no churn" would achieve on this test set (1430 of 1667 customers did not churn).

## Precision: 0.9361
This is the class-weighted average precision, not the precision for churners. For the churn class alone, 177 of the 221 customers predicted to churn actually churned, a precision of about 80.1%.

## Recall: 0.9376
Weighted recall always equals accuracy. For the churn class alone, the model found 177 of the 237 actual churners, a recall of about 74.7%.

## F1 Score: 0.9367
The weighted F1 score indicates a good overall balance between precision and recall. For the churn class alone, the F1 score is about 0.77.

## Confusion Matrix
True Negatives (1386): The model correctly identified 1386 customers who did not churn.

False Positives (44): The model predicted churn for 44 customers who did not churn.

False Negatives (60): The model predicted no churn for 60 customers who actually churned.

True Positives (177): The model correctly identified 177 customers who churned.

These results show that the decision tree is considerably better at predicting churn than the logistic regression baseline. It finds 177 of the 237 churners in the test set, although it still misses 60 of them (about 25%).

In [89]:
# Comparing the performance of the two models using the metrics
comparison_df = pd.DataFrame({
     'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'], 
     'Logistic Regression': [logistic_acc, logistic_precision, logistic_recall, logistic_f1], 
     'Decision Tree': [tree_acc, tree_precision, tree_recall, tree_f1] 
     })

print(comparison_df)

      Metric  Logistic Regression  Decision Tree
0   Accuracy             0.853629       0.937612
1  Precision             0.821938       0.936100
2     Recall             0.853629       0.937612
3   F1 Score             0.829690       0.936696


# Conclusion
From the results above, the decision tree outperforms the logistic regression model across all key metrics. It has higher accuracy and a better balance of precision and recall. The confusion matrices show where the gain comes from: the tree correctly identifies 177 of the 237 churners in the test set, compared with 51 for logistic regression, while also producing fewer false positives (44 versus 58).
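
The comparison table above uses class-weighted averages. As a complement, the sketch below derives the churn-class precision, recall, and F1 for both models directly from the two printed confusion matrices. The helper `churn_metrics` and the name `churn_comparison` are illustrative and not part of the notebook.

```python
import pandas as pd


def churn_metrics(conf_matrix):
    """Precision, recall and F1 for the positive class of a 2x2 matrix
    laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention)."""
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return [precision, recall, f1]


# Confusion matrices copied from the printed output of the two models.
churn_comparison = pd.DataFrame({
    "Metric": ["Churn precision", "Churn recall", "Churn F1"],
    "Logistic Regression": churn_metrics([[1372, 58], [186, 51]]),
    "Decision Tree": churn_metrics([[1386, 44], [60, 177]]),
})

print(churn_comparison.round(4))
```

On these numbers the gap between the models is much larger for the churn class than the weighted averages suggest: churn recall is about 0.22 for logistic regression and about 0.75 for the decision tree.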

# Actionable Insights for Stakeholders:

# 1. Leverage the Decision Tree Model:

Customer Retention: Use the model to identify at-risk customers early and implement targeted retention strategies such as personalized offers or improved customer support.

Resource Allocation: Focus resources on the segments identified by the model as high-risk for churn, optimizing customer service efforts and retention campaigns.

# 2. Feature Analysis:

Influential Features: Analyze the feature importance scores from the decision tree model to understand which variables (e.g., customer service calls, total day minutes, international plan) have the most significant impact on churn. Use this insight to tailor services and address issues that drive customer dissatisfaction.

# 3. Regular Monitoring and Updates:

Model Monitoring: Continuously monitor the model’s performance and update it with new data to maintain its accuracy. Regularly retrain the model to adapt to changing customer behaviors and trends.

Feedback Loop: Create a feedback loop where insights from the model are used to refine business strategies and customer interactions, and the outcomes are fed back into the model for further improvement.

# 4. Business Strategy:

Personalized Marketing: Utilize model predictions to develop personalized marketing campaigns targeting customers likely to churn with relevant offers and incentives.

Product Development: Use insights from churn patterns to guide product enhancements and feature development, ensuring alignment with customer needs and reducing churn rates.

By implementing these actionable insights, stakeholders can use the decision tree model to enhance customer retention strategies, optimize resource allocation, and ultimately improve business outcomes.