<h1 style="text-align:center;">TELECOM CHURN PREDICTION (SYRIA TEL) </h1>


<h1 style="text-align:center;">BUSINESS UNDERSTANDING  </h1>


Churn is one of the biggest problems in the telecom industry: telecom companies aim to attract new customers while avoiding contract terminations, so that their revenue-generating base grows. Customers terminate their contracts for different reasons, for example better price offers, more interesting packages, bad service experiences, or changes in their personal situations. Prediction models can be used to predict churn for individual customers so that the company can take countermeasures such as discounts, special offers, or other incentives to keep them.

This project aims to build a predictive model that identifies SyriaTel customers who are likely to churn and to recommend actionable insights for retaining them. SyriaTel's stakeholders, especially the company's management and marketing team, stand to benefit most from the project. Other companies in the telecommunications industry may also learn and benefit from it.

<h1 style="text-align:center;">DATA UNDERSTANDING</h1>

The dataset originates from SyriaTel Telecommunication company and was obtained from Kaggle. This section loads the provided dataset and explores its structure, including data types, missing values, and basic statistics.
Key variables are then visualized to understand their distributions and relationships. Data preparation follows, to describe and understand the data better.

In [6]:
import pandas as pd

# Load the dataset to examine its contents
data = pd.read_csv("Syria_Telcommunication_Customer_Churn_Data.CSV")

# Display general information about the dataset
dataset_info = {
    "shape": data.shape,
    "columns": data.columns.tolist(),
    "missing_values": data.isnull().sum().to_dict(),
    "sample_data": data.head().to_dict(orient='records')
}

dataset_info

{'shape': (3333, 21),
 'columns': ['state',
  'account length',
  'area code',
  'phone number',
  'international plan',
  'voice mail plan',
  'number vmail messages',
  'total day minutes',
  'total day calls',
  'total day charge',
  'total eve minutes',
  'total eve calls',
  'total eve charge',
  'total night minutes',
  'total night calls',
  'total night charge',
  'total intl minutes',
  'total intl calls',
  'total intl charge',
  'customer service calls',
  'churn'],
 'missing_values': {'state': 0,
  'account length': 0,
  'area code': 0,
  'phone number': 0,
  'international plan': 0,
  'voice mail plan': 0,
  'number vmail messages': 0,
  'total day minutes': 0,
  'total day calls': 0,
  'total day charge': 0,
  'total eve minutes': 0,
  'total eve calls': 0,
  'total eve charge': 0,
  'total night minutes': 0,
  'total night calls': 0,
  'total night charge': 0,
  'total intl minutes': 0,
  'total intl calls': 0,
  'total intl charge': 0,
  'customer service calls': 0,
  '

The data comprises 21 columns and 3,333 rows. The columns hold attributes describing each customer's account, plans, usage, and churn status. Each row corresponds to one recorded customer. The dataset contains both continuous and categorical variables. The target variable is "churn", and the remaining variables serve as predictors. Correlation analysis will be carried out to assess how suitable each predictor is.
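
As a minimal sketch of how the target and the two kinds of predictors might be separated, the snippet below uses a small hand-built frame holding a few columns from the first rows of the dataset. The names `sample`, `X`, `y`, `categorical_cols` and `numeric_cols` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# A few columns from the first rows of the dataset, rebuilt by hand for illustration
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "international plan": ["no", "no", "no"],
    "total day minutes": [265.1, 161.6, 243.4],
    "customer service calls": [1, 1, 0],
    "churn": [False, False, False],
})

# Separate the target from the predictors
y = sample["churn"]
X = sample.drop(columns="churn")

# Group predictors by dtype: numeric columns versus everything else
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

On the full dataset the same two `select_dtypes` calls could be applied to `data.drop(columns="churn")`. Note that a numeric dtype does not guarantee a continuous variable: `area code` is stored as an integer but behaves like a category.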

<h1 style="text-align:center;">PREPARATION OF DATA</h1>

Exploratory data analysis (EDA) is performed on the SyriaTel dataset. It is a crucial step for finding patterns and useful insights for predicting churn, and it helps describe and understand the data better. It includes data visualization and correlation analysis. We also import all necessary libraries.

<h2 style="text-align:center;">Exploratory Data Analysis (EDA)</h2>

In [7]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import xgboost as xgb
from sklearn.metrics import roc_curve, auc

import warnings
warnings.filterwarnings('ignore')

In [14]:
# Display the first few rows of the dataset
data.head()


Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [15]:
# Check for duplicated rows; the result below shows there are none to deal with.
data.duplicated().sum()

0

Further data exploration

In [16]:
# Data Exploration
# Checking data types, unique values in categorical columns, and summary statistics for numeric columns
# Summary statistics for numeric columns
numeric_summary = data.describe()

# Unique values in categorical columns
categorical_columns = data.select_dtypes(include=['object', 'bool']).columns
categorical_summary = {col: data[col].value_counts().to_dict() for col in categorical_columns}

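# Correlation matrix for numeric columns (to identify relationships).
# numeric_only=True (pandas >= 1.5) skips string columns such as 'state',
# which pandas >= 2.0 would otherwise fail to convert.
correlation_matrix = data.corr(numeric_only=True)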

# Results
{
    "numeric_summary": numeric_summary,
    "categorical_summary": categorical_summary,
    "correlation_matrix": correlation_matrix
}


{'numeric_summary':        account length    area code  number vmail messages  total day minutes  \
 count     3333.000000  3333.000000            3333.000000        3333.000000   
 mean       101.064806   437.182418               8.099010         179.775098   
 std         39.822106    42.371290              13.688365          54.467389   
 min          1.000000   408.000000               0.000000           0.000000   
 25%         74.000000   408.000000               0.000000         143.700000   
 50%        101.000000   415.000000               0.000000         179.400000   
 75%        127.000000   510.000000              20.000000         216.400000   
 max        243.000000   510.000000              51.000000         350.800000   
 
        total day calls  total day charge  total eve minutes  total eve calls  \
 count      3333.000000       3333.000000        3333.000000      3333.000000   
 mean        100.435644         30.562307         200.980348       100.114311   
 std   
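
One thing a correlation matrix can reveal is a pair of predictors that carry the same information. The sketch below is an illustration on only the five rows displayed by `data.head()`, not a result from the full dataset. It checks the day minutes against the day charge. The names `sample`, `r` and `rate` are assumptions made for this example.

```python
import pandas as pd

# Day minutes and day charges copied from the five rows shown by data.head()
sample = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day charge": [45.07, 27.47, 41.38, 50.90, 28.34],
})

# Pearson correlation between the two columns
r = sample["total day minutes"].corr(sample["total day charge"])

# Charge per minute for each row, rounded to three decimals
rate = (sample["total day charge"] / sample["total day minutes"]).round(3)

print(f"correlation: {r:.4f}")
print("charge per minute:", rate.tolist())
```

In these rows the charge is the minutes multiplied by a constant rate, so the correlation is essentially 1. If the full matrix shows the same pattern for the day, evening, night and international pairs, keeping only one column of each pair may be worth considering before modeling.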

Missing values, if any, are handled through deletion, imputation, or other means.

In [11]:
# Double-check for missing values

data.isnull().sum()


state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

The output shows no missing data, as previously seen in the Data Understanding section.

To continue with EDA, we convert the target variable "churn" to an integer, replacing True with 1 and False with 0. We then check the distribution of the target variable. Churn is the dependent variable in this analysis and indicates whether a customer has terminated their contract with SyriaTel: True means they have terminated it, and False means they still have an existing account.
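
The conversion itself is not shown in the cells here, and the output that follows still carries boolean labels. The snippet below is therefore only a sketch of one way it might be done, using a toy `churn` series in place of the real column. The name `churn_int` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the boolean 'churn' column of the dataset
churn = pd.Series([False, True, False, False], name="churn")

# Cast booleans to integers: True -> 1, False -> 0
churn_int = churn.astype(int)

print(churn_int.tolist())  # [0, 1, 0, 0]

# The mean of a 0/1 column is the share of churned customers
print(churn_int.mean())    # 0.25
```

On the real data the equivalent call would be `data['churn'] = data['churn'].astype(int)`, assuming the column is still boolean at that point.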

In [22]:
# Check the distribution of the target variable 'Churn'
churn_counts = data['churn'].value_counts()

# Print the counts and percentages of churn
samples_total = len(data)
for churn_status, count in churn_counts.items():
    percentage = (count / samples_total) * 100
    print(f"Churn: {churn_status}, Count: {count}, Percentage: {percentage:.2f}%")

Churn: False, Count: 2850, Percentage: 85.51%
Churn: True, Count: 483, Percentage: 14.49%


Of the 3,333 customers in the dataset, 483 have terminated their contract with SyriaTel, a loss of 14.5% of customers.
The distribution of the two classes shows a class imbalance. This needs to be addressed before modeling, because an imbalanced target can bias a model toward the majority class and cause it to miss the customers who actually churn.
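
To see why the imbalance matters, consider a baseline that predicts "no churn" for every customer. The arithmetic below uses only the class counts reported above. The variable names are illustrative.

```python
# Class counts reported in the churn distribution output
stayed, churned = 2850, 483
total = stayed + churned

# A baseline that always predicts "no churn" is right for every customer who stayed
baseline_accuracy = stayed / total

# It never flags a churner, so it has zero true positives and zero recall on churn
true_positives = 0
baseline_recall = true_positives / churned

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline churn recall: {baseline_recall:.2%}")
```

Such a baseline reaches about 85.5% accuracy without identifying a single churner. Accuracy alone is therefore a weak yardstick here. The recall, precision, F1 and ROC AUC metrics imported earlier are more informative, and resampling with the imported `SMOTE` class is one possible way to rebalance the training data.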