# Akademi Education X Flatiron School - Data Science & AI
---

- **Student name :** Vilmarson JULES
- **Student pace :** Self-paced
- **Deadline Submission :** July 27, 2025 
- **Instructors' Name :** Wedter Jerome & Geovany Batista Polo Laguerre
- **GitHub Repository:** [Movie-Insights Project](https://github.com/VilmarsonJ/ds-2-movie-insights)  
- **LinkedIn:** [Vilmarson JULES](https://www.linkedin.com/in/jules-vilmarson-2a68a5294/)


## Project Overview

This project focuses on predicting customer churn for DIGICEL Haiti, a leading telecommunications provider. The goal is to build a machine learning classification model that identifies customers at high risk of leaving the company’s services. By leveraging historical customer data, including usage patterns, service plans, and billing information, the project aims to uncover patterns that indicate churn.

Problem: DIGICEL has noticed that a significant number of customers are leaving their services for competitors, which leads to decreased revenue and higher customer acquisition costs. However, the company currently does not have a reliable way to identify which customers are at risk of leaving before it happens. This makes it difficult to implement effective retention strategies proactively.

My Role / Objective: As a Data Analyst / Machine Learning Specialist at DIGICEL, I am tasked with developing a predictive model that can accurately identify customers at high risk of churn. These insights will enable DIGICEL’s marketing and retention teams to take targeted actions (such as personalized promotions, loyalty programs, or service improvements) to reduce churn, improve customer satisfaction, and protect the company’s revenue.

## The Data

The dataset for this project is publicly available on Kaggle: [Churn in Telecoms Dataset](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset). It is provided in CSV format (inside a ZIP file) and contains customer account details, usage patterns, and service plans for a telecommunications company.
This dataset will be used throughout Phase 3 to explore, preprocess, and model customer churn, forming the basis for all analyses and predictions in the project.

## Methods

In this project, we follow a predictive modeling approach focused on classification, using historical customer data to predict churn. Unlike Phases 1 and 2, which emphasized exploratory, diagnostic, and descriptive analysis, Phase 3 is fully focused on machine learning and predictive insights.

The main steps are:

1. **Data Preprocessing**
   - Encode categorical features.
   - Scale numerical features.
   - Split the data into training and testing sets.
2. **Modeling**
   - Build a baseline classification model (Logistic Regression or a single Decision Tree).
   - Explore additional models (e.g., a Decision Tree classifier).
   - Tune hyperparameters for improved performance.
3. **Model Evaluation**
   - Evaluate models using classification metrics.
   - Compare models iteratively to select the best one.
4. **Insights and Recommendations**
   - Identify key drivers of churn.
   - Provide actionable recommendations.
   - Discuss model limitations and reliability.

## Data Understanding

Before building predictive models, it is important to gain a clear understanding of the dataset. This involves exploring the structure, types of features, data quality, and potential relationships that may help predict customer churn. Understanding the data ensures that subsequent preprocessing and modeling steps are well-informed and effective.

In [3]:
import zipfile
import warnings

# Data handling
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plot aesthetics: use seaborn's own darkgrid theme (the "seaborn-darkgrid"
# matplotlib style name was deprecated in matplotlib 3.6 and later removed)
sns.set_theme(style="darkgrid")

# Preprocessing and feature engineering
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Model evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# Hyperparameter tuning (optional)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Suppress warnings to keep the notebook output readable
# (note that this can also hide useful convergence or deprecation warnings)
warnings.filterwarnings("ignore")


In [5]:

# Path to your zip file
zip_path = "archive.zip"

# Extract the csv from the zip
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall("data")  # extract contents to a folder named "data"
    print(z.namelist())   # show what files are inside

# Load the csv into pandas
bigml = pd.read_csv("data/bigml_59c28831336c6604c800002a.csv")

# Quick look at the data: dimensions first, then the first five rows
print(bigml.shape)
bigml.head()


['bigml_59c28831336c6604c800002a.csv']
(3333, 21)


|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [10]:
bigml.columns

Index(['state', 'account length', 'area code', 'phone number',
       'international plan', 'voice mail plan', 'number vmail messages',
       'total day minutes', 'total day calls', 'total day charge',
       'total eve minutes', 'total eve calls', 'total eve charge',
       'total night minutes', 'total night calls', 'total night charge',
       'total intl minutes', 'total intl calls', 'total intl charge',
       'customer service calls', 'churn'],
      dtype='object')

In [6]:
bigml.shape

(3333, 21)

In [8]:
bigml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [9]:
bigml.isna().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

The dataset contains 3,333 customer records from a telecommunications company in 21 columns: 20 describe customer account details, usage patterns, and service plans, and the last one, `churn`, is the target. No column has missing values. This data will be used to predict customer churn.
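Because the share of churners determines how accuracy should be read later on, a quick class-balance check is a natural companion to the summary above. The sketch below uses a small hypothetical frame `demo` as a stand-in for `bigml`; with the real data, the same calls would be made on `bigml["churn"]`.

```python
import pandas as pd

# Hypothetical stand-in for bigml; only the 'churn' column matters here.
demo = pd.DataFrame({"churn": [False] * 17 + [True] * 3})

# Absolute counts and relative shares of each class
counts = demo["churn"].value_counts()
shares = demo["churn"].value_counts(normalize=True)
print(counts)
print(shares.round(3))

# The mean of a boolean column is the fraction of True values
churn_rate = demo["churn"].mean()
print(f"Churn rate: {churn_rate:.1%}")
```

A strongly skewed result here is an early signal that recall and F1 for the churn class deserve more attention than overall accuracy.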

## Implications for analysis:

- The dataset has no missing values and is ready for modeling once categorical variables are encoded and numerical features are scaled where needed.
- Both customer behavior (usage and support calls) and service plans are likely to be informative for predicting churn.
- Early exploration suggests opportunities for feature engineering (e.g., combining call minutes and charges, or encoding plans).
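As a small illustration of the feature-engineering idea, the sketch below looks at how minutes and charges relate in the five preview rows shown earlier (evening columns are left out because part of them was elided in that preview). The column `partial_total_minutes` is a hypothetical engineered feature, and whether the pattern holds for all 3,333 rows would need to be checked on the full data.

```python
import pandas as pd

# Values copied from the five-row preview of the dataset
sample = pd.DataFrame({
    "total day minutes":   [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day charge":    [45.07, 27.47, 41.38, 50.90, 28.34],
    "total night minutes": [244.7, 254.4, 162.6, 196.9, 186.9],
    "total night charge":  [11.01, 11.45, 7.32, 8.86, 8.41],
    "total intl minutes":  [10.0, 13.7, 12.2, 6.6, 10.1],
    "total intl charge":   [2.70, 3.70, 3.29, 1.78, 2.73],
})

# Implied per-minute rate and minutes/charge correlation for each period
rates, corrs = {}, {}
for period in ["day", "night", "intl"]:
    minutes = sample[f"total {period} minutes"]
    charge = sample[f"total {period} charge"]
    rates[period] = (charge / minutes).mean()
    corrs[period] = minutes.corr(charge)
    print(f"{period:>5}: rate = {rates[period]:.3f} per minute, "
          f"correlation = {corrs[period]:.4f}")

# Hypothetical engineered feature: minutes summed over the available periods
minute_cols = [c for c in sample.columns if c.endswith("minutes")]
sample["partial_total_minutes"] = sample[minute_cols].sum(axis=1)
print(sample["partial_total_minutes"].round(1).tolist())
```

In these rows each charge is almost exactly proportional to the matching minutes column. If that holds across the dataset, keeping both columns adds little information and can make Logistic Regression coefficients harder to interpret.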

## Data Preprocessing

Before building predictive models, we need to prepare the data. This includes encoding categorical features, scaling numerical features, and splitting the dataset into training and testing sets. Proper preprocessing ensures that models can learn effectively and generalize well to unseen data.

In [12]:
# Import necessary packages for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd

# Separate features and target
X = bigml.drop(columns=['churn', 'phone number'])  # Drop target and irrelevant column
y = bigml['churn']

# Identify categorical and numerical features
categorical_features = ['state', 'area code', 'international plan', 'voice mail plan']
numerical_features = ['account length', 'number vmail messages', 'total day minutes', 
                      'total day calls', 'total day charge', 'total eve minutes', 
                      'total eve calls', 'total eve charge', 'total night minutes', 
                      'total night calls', 'total night charge', 'total intl minutes', 
                      'total intl calls', 'total intl charge', 'customer service calls']

# Preprocessing for categorical data: One-hot encoding
categorical_transformer = OneHotEncoder(drop='first')  # drop first to avoid dummy variable trap

# Preprocessing for numerical data: Scaling
numerical_transformer = StandardScaler()

# Combine transformations using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the preprocessor on the training data and transform both training and testing sets
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Quick check of shapes
print("Training features shape:", X_train_processed.shape)
print("Testing features shape:", X_test_processed.shape)


Training features shape: (2666, 69)
Testing features shape: (667, 69)
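An alternative to transforming the arrays by hand is to bundle the preprocessor and the model in a single `Pipeline`, so that cross-validation refits the scaler and encoder inside each fold. The sketch below is only an illustration on synthetic data: `X_demo`, `y_demo`, and the toy churn rule are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "total day minutes": rng.normal(180, 50, n),
    "customer service calls": rng.integers(0, 6, n),
    "international plan": rng.choice(["no", "yes"], n),
})
# Toy target: churn becomes more likely with many service calls
y_demo = ((X_demo["customer service calls"] + rng.normal(0, 1, n)) > 3.5).astype(int)

preprocessor_demo = ColumnTransformer([
    ("num", StandardScaler(), ["total day minutes", "customer service calls"]),
    ("cat", OneHotEncoder(drop="first"), ["international plan"]),
])

# Preprocessing and model are fitted together on each training fold
pipe = Pipeline([
    ("prep", preprocessor_demo),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="recall")
print("Recall per fold:", scores.round(2))
```

With the real data, the same pattern would take the unprocessed `X_train` and `y_train`, since the pipeline handles the transformation itself.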


## Modeling

This section focuses on building predictive models to classify which customers are likely to churn. We will start with a baseline Logistic Regression model for interpretability, then explore a Decision Tree model to capture non-linear relationships and feature interactions.

In [13]:
# Import models and evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report

# ------------------------
# Baseline Logistic Regression
# ------------------------
log_model = LogisticRegression(random_state=42, max_iter=1000)
log_model.fit(X_train_processed, y_train)

# Predictions
y_train_pred_log = log_model.predict(X_train_processed)
y_test_pred_log = log_model.predict(X_test_processed)

# Evaluation
print("Logistic Regression - Training Accuracy:", accuracy_score(y_train, y_train_pred_log))
print("Logistic Regression - Testing Accuracy:", accuracy_score(y_test, y_test_pred_log))
print("\nClassification Report (Test Set):\n", classification_report(y_test, y_test_pred_log))


Logistic Regression - Training Accuracy: 0.8705926481620405
Logistic Regression - Testing Accuracy: 0.8605697151424287

Classification Report (Test Set):
               precision    recall  f1-score   support

       False       0.88      0.96      0.92       570
        True       0.54      0.26      0.35        97

    accuracy                           0.86       667
   macro avg       0.71      0.61      0.64       667
weighted avg       0.83      0.86      0.84       667



### Interpretation of Logistic Regression Results

The baseline Logistic Regression model shows:

- Training accuracy: ~87%
- Testing accuracy: ~86%

The two values are within about one percentage point of each other, so the model shows no significant overfitting. Accuracy alone, however, says little about how well churners are detected, which is why the per-class metrics matter more here.
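One way to put these accuracy figures in context is to compare them with a trivial baseline that always predicts the majority class. The sketch below builds labels with the same class counts as the reported test set (570 non-churners, 97 churners) and scores scikit-learn's `DummyClassifier` on them; the arrays `y_demo` and `X_dummy` are placeholders, not the real features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the reported test set
y_demo = np.array([0] * 570 + [1] * 97)
X_dummy = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class ("no churn")
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_demo)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_recall = recall_score(y_demo, y_pred)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline churn recall: {baseline_recall:.3f}")
```

A model is only useful for retention work if it clearly beats this baseline on the churn class, not just on overall accuracy.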

### Detailed Metrics (Test Set)

False (non-churn) customers:

- Precision: 0.88 → when the model predicts a customer will stay, it is correct 88% of the time.
- Recall: 0.96 → the model correctly identifies 96% of customers who actually stay.

True (churn) customers:

- Precision: 0.54 → when the model predicts churn, it is correct only 54% of the time.
- Recall: 0.26 → the model captures only 26% of actual churners.
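To make these definitions concrete, the sketch below recomputes the churn-class metrics from confusion-matrix counts. The four counts are not notebook output: they are reconstructed from the reported support (570 and 97), the test accuracy (574 of 667 correct), and the rounded recall, so they should be read as an inference.

```python
# Confusion-matrix counts reconstructed from the reported test-set figures
# (an inference from support, accuracy, and rounded recall; not notebook output)
tp, fn = 25, 72    # actual churners: caught vs. missed (25 + 72 = 97)
tn, fp = 549, 21   # actual non-churners: kept vs. false alarms (549 + 21 = 570)

# Precision: of all predicted churners, how many really churned
precision_churn = tp / (tp + fp)
# Recall: of all real churners, how many the model caught
recall_churn = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)
# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_churn:.2f} recall={recall_churn:.2f} "
      f"f1={f1_churn:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.54 recall=0.26 f1=0.35 accuracy=0.86
```

Read this way, the model flags about 46 customers as churners and is right about 25 of them, while 72 real churners go undetected.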

### Business Implications

- The model is very good at identifying loyal customers but struggles to correctly predict churners.
- For DIGICEL, using the model as-is would miss roughly three out of four churners (recall of 0.26), which severely limits proactive retention efforts.
- To improve the retention strategy, the company may consider tuning the model, exploring alternative models (e.g., Decision Tree, Random Forest), or engineering additional features to better capture churn patterns.
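The tuning and alternative-model idea could be sketched as a grid search over a Decision Tree, scored on F1 rather than accuracy so that the churn class drives model selection. Everything below is an assumption for illustration: the data is synthetic (`make_classification` with roughly 15% positives) and the parameter grid is an arbitrary starting point, not a recommendation derived from the project data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the processed churn features
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Hypothetical grid: tree depth and minimum leaf size control overfitting
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",  # F1 of the positive (churn) class
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
print("Test F1:", round(search.score(X_te, y_te), 3))
```

Swapping `scoring="f1"` for `scoring="recall"` would favor catching more churners at the cost of more false alarms, which may suit a low-cost retention offer.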

### Scientific Perspective

- The high overall accuracy largely reflects class imbalance: 570 of the 667 test customers (about 85%) are non-churners, so a model that predicts "no churn" for everyone would score almost as well.
- The F1-score for churn (0.35) shows that performance on the minority class is weak, which is common in imbalanced classification tasks.
- Future steps could include resampling techniques, class weighting, or more complex models, while weighing any gain in churn detection against the loss of interpretability.
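Of these options, class weighting is the smallest change to the existing baseline. The sketch below compares a plain Logistic Regression with one using `class_weight="balanced"` on synthetic imbalanced data; the data and variable names are assumptions, and the recall values it prints say nothing about what the real churn data would yield.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 85% / 15%), standing in for the churn data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Same model twice; the second reweights errors inversely to class frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
precision_weighted = precision_score(y_te, weighted.predict(X_te))

print(f"Recall (plain):       {recall_plain:.2f}")
print(f"Recall (balanced):    {recall_weighted:.2f}")
print(f"Precision (balanced): {precision_weighted:.2f}")
```

Balanced weights typically raise minority-class recall and lower precision, so the right setting depends on how costly a missed churner is compared with an unnecessary retention offer.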