* **Institution Name:** Moringa School
* **Course Pursued:** Data Science
* **Phase/Level:** Three (3)
* **Project done by:** Boniface Kimondo Njeri
* **TM Name:** George Kamundia

# Early Detection of Imminent SyriaTel Churn Through Predictive Signal Analysis
----

# Introduction
This report details the end-to-end development of a machine learning classifier aimed at predicting customer churn for SyriaTel. The goal is to analyze usage patterns, service attributes, and customer behaviors to forecast whether a customer is likely to stop using SyriaTel services. Accurate predictions empower the business to intervene proactively—tailoring offers, support, or incentives to retain valuable customers before they exit.

# Background of the Project
In today’s highly competitive telecommunications market, customer retention plays a critical role in sustaining profitability. Commonly cited industry estimates put the cost of acquiring a new customer at five to ten times the cost of retaining an existing one. For companies like SyriaTel, minimizing customer churn (when users discontinue their service) is not just a technical challenge but a strategic imperative.

This project focuses on developing a machine learning-based churn prediction model to help SyriaTel identify customers at high risk of leaving. By flagging potential churners early, the company can implement targeted retention strategies, reduce revenue loss, and enhance customer satisfaction.

# Project Objectives
**General Objective**

To build a predictive system that identifies customers likely to churn, thereby supporting SyriaTel in making timely, data-driven retention decisions.

**Specific Objectives**

1. Develop a Binary Classification Model: Use customer data to classify users as likely to churn (True) or not (False).

2. Support Retention Planning: Leverage model outputs to inform and optimize proactive customer retention strategies.

3. Generate Business Insights: Analyze feature importance and model behavior to draw actionable conclusions that can influence service improvement and marketing approaches.

## Importing Required Libraries

In this section, we import the essential Python libraries required for data analysis, visualization, preprocessing, modeling, and evaluation.

### Core Libraries:
- **pandas**: Used for loading, manipulating, and analyzing structured data (e.g., DataFrames).
- **numpy**: Provides support for numerical operations, especially array-based computations.

### Visualization:
- **matplotlib.pyplot**: The foundational library for plotting charts and visualizations.
- **seaborn**: Built on top of matplotlib, provides more sophisticated and attractive statistical visualizations.

### Machine Learning & Preprocessing (scikit-learn):
- **train_test_split**: Splits the dataset into training and test subsets.
- **LabelEncoder**: Converts categorical labels into numerical values.
- **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance.
- **RandomForestClassifier**: An ensemble learning algorithm based on decision trees for classification tasks.
- **classification_report**, **confusion_matrix**, **accuracy_score**: Tools to evaluate the performance of classification models.
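
To make the `StandardScaler` description above concrete, here is a minimal sketch on a tiny made-up array (the values are illustrative assumptions, not taken from the SyriaTel data). It shows that, after fitting, each column ends up with a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two features measured on different scales
X = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [300.0, 5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn column means/variances, then transform

print(scaler.mean_)            # per-column means learned from X
print(X_scaled.mean(axis=0))   # approximately 0 for each column
print(X_scaled.std(axis=0))    # approximately 1 for each column
```

In a modeling workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that test data does not influence the learned statistics.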


In [13]:
# Importing essential libraries for data handling, visualization, and modeling

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and model selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Machine learning algorithm
from sklearn.ensemble import RandomForestClassifier

# Evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Configure matplotlib for inline plotting in notebooks
%matplotlib inline

# Set Seaborn style for better aesthetics
sns.set(style="whitegrid")

print("All libraries imported successfully.")

All libraries imported successfully.


## Loading the Dataset & Initial Inspection

In this step, we load the dataset into a **Pandas DataFrame** from a CSV file and preview the first few records.

### Why this step matters:
- It reveals the **structure** of the dataset (rows, columns).
- It surfaces **key features**, column names, and likely data types.
- It helps spot **obvious data quality issues** early, such as missing values or inconsistent formatting.

**File Used**: `bigml_59c28831336c6604c800002a.csv`


In [14]:
# Load the dataset into a DataFrame
df = pd.read_csv("bigml_59c28831336c6604c800002a.csv")

# Display the first 5 rows to examine the structure
df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
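
Beyond eyeballing the first rows, the quick quality checks mentioned above (missing values, inconsistent formatting) can be sketched as follows. This is only an illustration on a small made-up frame named `sample`; its values, including the deliberately messy `"Yes"` and `"no "` labels, are assumptions and do not come from the real file.

```python
import pandas as pd

# Toy rows that imitate the layout of the churn data (values are made up)
sample = pd.DataFrame({
    "state": ["KS", "OH", "OH"],
    "international plan": ["no", "Yes", "no "],
    "total day minutes": [265.1, None, 161.6],
})

# Count missing values per column and fully duplicated rows
missing_per_column = sample.isnull().sum()
duplicate_rows = int(sample.duplicated().sum())

# Distinct raw labels reveal inconsistent casing or stray whitespace
plan_labels = sorted(sample["international plan"].unique())

print(sample.shape)
print(missing_per_column)
print(duplicate_rows)
print(plan_labels)
```

On the actual dataset, the same calls would be made on `df` instead of `sample`.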


## Understanding the Dataset Structure & Statistical Summary

To better understand the dataset, I conducted two key inspections:

### Dataset Structure: `df.info()`
- Displays all column names, data types, and the number of non-null (non-missing) values.
- Helped identify:
  - Which features are **categorical vs numerical**.
  - Potential **missing data** that may need cleaning.
  - Overall **shape and memory footprint** of the DataFrame.

### Statistical Summary: `df.describe()`
- Generates descriptive statistics for **numeric columns**, such as:
  - **Mean**, **standard deviation**, **minimum**, **maximum**, and **quartile values**.
- This was useful for:
  - Detecting **outliers or anomalies**.
  - Understanding the **distribution** and **scale** of features.
  - Informing **scaling** or **normalization** decisions before model training.
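
As one way to act on the quartiles that `describe()` reports, the sketch below applies the common 1.5 × IQR rule to flag unusually large or small values. The series `calls` is made up for illustration, and the IQR rule is an assumed choice here, not necessarily the method used elsewhere in this project.

```python
import pandas as pd

# Made-up call counts; the last value is deliberately extreme
calls = pd.Series([1, 2, 1, 0, 3, 2, 1, 9], name="customer_service_calls")

summary = calls.describe()
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = calls[(calls < lower) | (calls > upper)]

print(summary)
print(outliers)
```

Flagged values are candidates for inspection rather than automatic removal, since heavy users may be exactly the customers a churn model needs to learn from.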



In [15]:
# Display dataset structure: column names, data types, and non-null counts
print("Dataset Structure info:\n")
df.info()

# Display statistical summary of numeric columns
# (print is needed because only the last expression in a cell is displayed automatically)
print("\nStatistical summary:")
print(df.describe())

# Display the number of null values in each column
print("\nNull values in each column:")
print(df.isnull().sum())

Dataset Structure info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night cal

### Output Review
The initial output shows 3,333 entries and 21 columns (20 features plus the target). There are no null values, so no imputation is needed. The target variable, `churn`, is of type bool (True/False). All feature names are clear, but the column names contain spaces and should be standardized for consistency.

## Data Cleaning & Preprocessing

A clean and well-prepared dataset is essential for building an accurate machine learning model. In this phase, I:

### 1. Standardized Column Names
- Converted column names to `snake_case` for consistency and easier access.

### 2. Dropped Irrelevant or Non-Predictive Columns
- Removed `phone_number`, `state`, and `area_code`, which are either unique identifiers or unlikely to contribute meaningfully to the model’s predictive power.

### 3. Converted Binary Categorical Variables
- Converted `"yes"`/`"no"` strings in the `international_plan` and `voice_mail_plan` columns into binary integers (`1` for "yes", `0` for "no").

### 4. Encoded the Target Variable
- Converted the target column `churn` from boolean to integer format for compatibility with machine learning algorithms.

### 5. Visualized Class Balance
- Used a count plot to check whether the dataset is balanced across the target classes (`churned` vs `not churned`).


In [16]:
# pandas, numpy, seaborn, and matplotlib are already imported in the first cell

# Standardize column names to snake_case
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Drop irrelevant columns
df.drop(['state', 'area_code', 'phone_number'], axis=1, inplace=True)

# Convert binary categorical features to numerical (0/1)
for col in ['international_plan', 'voice_mail_plan']:
    if df[col].dtype == 'object':
        df[col] = df[col].apply(lambda x: 1 if x.lower() == 'yes' else 0)

# Convert target variable to integer
df['churn'] = df['churn'].astype(int)

# Confirm transformations
print("Cleaned column names:", df.columns.tolist())
print("\n Data types after conversion:")
# info() prints its report directly and returns None, so it is not wrapped in print()
df[['international_plan', 'voice_mail_plan', 'churn']].info()


Cleaned column names: ['account_length', 'international_plan', 'voice_mail_plan', 'number_vmail_messages', 'total_day_minutes', 'total_day_calls', 'total_day_charge', 'total_eve_minutes', 'total_eve_calls', 'total_eve_charge', 'total_night_minutes', 'total_night_calls', 'total_night_charge', 'total_intl_minutes', 'total_intl_calls', 'total_intl_charge', 'customer_service_calls', 'churn']

 Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   international_plan  3333 non-null   int64
 1   voice_mail_plan     3333 non-null   int64
 2   churn               3333 non-null   int32
dtypes: int32(1), int64(2)
memory usage: 65.2 KB
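
The class-balance check described in step 5 is not part of the cell above, so here is a minimal sketch of how it might look. It runs on a small made-up frame named `toy`; in the notebook the same calls would presumably be made on the cleaned `df`, and the axis labels are illustrative choices.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up target column; in the notebook this would be df["churn"]
toy = pd.DataFrame({"churn": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class, useful alongside the plot
class_share = toy["churn"].value_counts(normalize=True)
print(class_share)

# Count plot of the target classes
ax = sns.countplot(x="churn", data=toy)
ax.set_title("Class balance of the target variable")
ax.set_xlabel("churn (0 = stayed, 1 = churned)")
ax.set_ylabel("number of customers")
plt.tight_layout()
```

If one class dominates, accuracy alone can be misleading, which is one reason to read the per-class precision and recall from `classification_report` as well.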
