
# CS 484 - Introduction to Machine Learning
### Project: **Customer Churn Analytics Framework for Telecommunications**



#### Team Members:
####  Denish Dalsukhbhai Asodariya (A20525465)
####  Prince Jayantbhai Rajodiya (A20536409)


## Environment Setup

In [None]:
!pip install tensorflow

import tensorflow as tf

def check_gpu():
  """Return True if TensorFlow can see at least one GPU."""
  # tf.config.list_physical_devices is the public API for device discovery
  return len(tf.config.list_physical_devices('GPU')) > 0

if check_gpu():
  print('Found GPU!')
else:
  print('No GPU found.')

In [5]:
runtime_type = 'GPU'

## Required Libraries

In [None]:
!pip install ipywidgets
!pip install keplergl
!pip install h3
!pip install h3pandas
!pip install branca
!pip install imbalanced-learn
!pip install seaborn
!pip install plotly
!pip install scikit-learn
!pip install xgboost
!pip install folium
!pip install gdown
!jupyter nbextension install --py --sys-prefix keplergl
!jupyter nbextension enable keplergl --py --sys-prefix

In [8]:
import os
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pl
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
import branca.colormap as cm
import folium
import h3
import imblearn
from IPython.display import Image
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve, auc
from xgboost import XGBClassifier


In [9]:
data_path = "C:/Users/asoda/Downloads/Group_22_Asodariya_Rajodiya/updated_Telco_customer_churn.csv"

In [10]:
data = pd.read_csv(data_path)

<a id="2"></a>
## Data Processing

In [None]:
data

The "Total Charges" column is currently of object data type, so it needs to be converted to a suitable numerical format.

In [14]:
data['Total Charges'] = pd.to_numeric(data['Total Charges'], errors='coerce')
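
As a minimal sketch of what `errors='coerce'` does, the snippet below converts a small, made-up series (the values are illustrative, not taken from the dataset). Any entry that cannot be parsed as a number, such as a blank string, becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample: one entry is a blank string rather than a number.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" replaces unparseable entries with NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)          # numeric dtype after conversion
print(converted.isna().sum())   # number of entries that could not be parsed
```

This is why the missing-value check that follows reports nulls in "Total Charges" even though the raw column had no empty cells.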

Examining the dataset for any missing values.

In [None]:
data.isnull().sum()

The columns "Total Charges" and "Churn Reason" contain missing values.
The "Churn Reason" column has a significant number of null entries, as not all customers in the dataset have churned.

In [None]:
data['CustomerID'].nunique()

In [None]:
data.groupby('Churn Label')['CustomerID'].nunique()

In [None]:
data[data['Total Charges'].isna()]

Based on the analysis, customers with missing values in the "Total Charges" column are under a contract, primarily in Two-Year contracts, with a few in One-Year contracts.

<a id="3"></a>
### Imputing Missing Values

1) Computing charges

In [24]:
data['calc_charges'] = data['Monthly Charges'] * data['Tenure Months']

2) Calculating charge discrepancy

In [26]:
data['diff_in_charges'] = data['Total Charges'] - data['calc_charges']

Let’s test our approach!

In [None]:
fig = px.histogram(data, x="diff_in_charges",color = 'Contract',marginal="box")
fig.show()

<div> The graph shows the distribution of the difference between the actual total charges of our customers and the calculated values, which are derived by multiplying the monthly charges by the number of months the customer has used the service.</div>

The quantiles below highlight the presence of outliers.

In [None]:
data.groupby('Contract')[['Total Charges','diff_in_charges']].quantile([.50,.80,.90,.95])

In [32]:
data['Total Charges'] = data['Total Charges'].fillna(data['calc_charges'])

In [33]:
data = data.drop(['calc_charges','diff_in_charges'], axis=1)
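
The imputation steps above can be traced on a toy frame. This is only a sketch with made-up numbers; the column names match the dataset, but the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two customers are missing "Total Charges".
toy = pd.DataFrame({
    "Monthly Charges": [20.0, 50.0, 80.0],
    "Tenure Months": [0, 10, 2],
    "Total Charges": [np.nan, 510.0, np.nan],
})

# Estimate total charges as monthly charges times months of tenure.
estimate = toy["Monthly Charges"] * toy["Tenure Months"]

# Fill only the missing entries; observed values are left untouched.
toy["Total Charges"] = toy["Total Charges"].fillna(estimate)

print(toy)
```

Note that the second row keeps its observed value of 510.0 even though the estimate would be 500.0, so the imputation never overwrites real data.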

<a id="4"></a>
# Exploratory Data Analysis

<a id="5"></a>
## 1. Overall Churn Rate

<div> To build a customer churn model, it’s essential to first understand the factors that most significantly influence customer churn.
Churned customers are those who have stopped using the service. </div>
<div> Let's introduce a key metric — churn rate, which represents the percentage of customers who have churned — and analyze this metric based on the customer characteristics in our dataset.</div>
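
As a small illustration of the metric, churn rate is the number of churned customers divided by the total number of customers. The sketch below computes it on a hypothetical five-customer frame; the IDs and labels are invented.

```python
import pandas as pd

# Hypothetical customers: two of the five have churned.
toy = pd.DataFrame({
    "CustomerID": ["a", "b", "c", "d", "e"],
    "Churn Label": ["Yes", "No", "No", "Yes", "No"],
})

# The mean of a boolean column is the share of True values.
churn_rate = (toy["Churn Label"] == "Yes").mean() * 100

print(f"Churn rate: {churn_rate:.1f}%")
```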

In [None]:
fig = px.pie(data.groupby('Churn Label')['CustomerID'].nunique().reset_index(),
             values='CustomerID',
             names='Churn Label')
fig.show()

In the dataset, **41%** of the customers have churned and stopped using the company's services.

<a id="6"></a>
## 2. Customer Geography

Let's begin by examining the geography of the customers. We have data on their city, postal code, and coordinates.

In [None]:
data.groupby(['Country','State'])['CustomerID'].count()

All of our clients are located in the United States, specifically in California.

In [None]:
fig = px.scatter_mapbox(data.groupby(['Latitude','Longitude'])['CustomerID'].count().reset_index(), lat="Latitude", lon="Longitude", hover_data= ['CustomerID'], zoom=4, height=300)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

The highest concentration of customers is found in the Los Angeles, San Diego, and San Francisco areas.

In [None]:
fig = px.bar(data.groupby(['City'])['CustomerID'].count().reset_index().sort_values('CustomerID',
                                                                                    ascending=False).head(50),
             x='City',
             y='CustomerID',
             color = 'CustomerID',
             text = 'CustomerID')
fig.show()

<a id="7"></a>
### Add Hexagonal Visualization

In [None]:
# Set the resolution level for the hexagons
hex_level = 5

# Assuming `data` is a DataFrame containing 'Latitude', 'Longitude', and 'CustomerID' columns
# Example DataFrame for reference
# data = pd.DataFrame({
#     'Latitude': [37.7749, 34.0522, 40.7128],
#     'Longitude': [-122.4194, -118.2437, -74.0060],
#     'CustomerID': [1, 2, 3]
# })

# Convert geographical coordinates (latitude and longitude) into H3 hex IDs
data['hex_id'] = data.apply(lambda x: h3.latlng_to_cell(x['Latitude'], x['Longitude'], hex_level), axis=1)

# Group the data by hex ID and count the number of customers per hex
hex_counts = data.groupby('hex_id')['CustomerID'].count().reset_index(name='total_clients')

# Calculate the geographic center of each hexagon
hex_counts['center'] = hex_counts['hex_id'].apply(lambda x: h3.cell_to_latlng(x))

# Set the range of values for coloring and create a colormap
color_range = [hex_counts['total_clients'].min(), hex_counts['total_clients'].max()]
colormap = cm.LinearColormap(
    colors=["purple", "red", "orange", "yellow", "green"],
    vmin=min(color_range),
    vmax=max(color_range)
)

# Determine the center of the map based on the average location of hexagon centers
mean_lat = hex_counts['center'].apply(lambda x: x[0]).mean()
mean_lon = hex_counts['center'].apply(lambda x: x[1]).mean()
map_center = [mean_lat, mean_lon]

# Initialize a Folium map with the 'Stamen Terrain' tiles
m = folium.Map(
    location=map_center,
    zoom_start=6,
    tiles='Stamen Terrain',
    attr="Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL."
)

# Add hexagonal regions to the map with color-coding based on customer counts
for _, row in hex_counts.iterrows():
    folium.Polygon(
        locations=h3.cell_to_boundary(row['hex_id']),  # Hexagon vertices as (lat, lng) pairs
        fill=True,
        fill_color=colormap(row['total_clients']),
        fill_opacity=0.7,
        stroke=False,
        tooltip=f"Number of clients: {row['total_clients']}"
    ).add_to(m)

# Add a color legend to the map
colormap.caption = 'Number of clients'
m.add_child(colormap)

# Display the map
m


Calculate churn rate per hexagon

In [49]:
churn = data.assign(churn_clients = np.where(data['Churn Label']=='Yes',data['CustomerID'],None)).groupby(['hex_id']).agg({'churn_clients':'count'}).reset_index()

In [50]:
clients = data.groupby(['hex_id'])['CustomerID'].count().reset_index()

In [51]:
churn_data = clients.join(churn.set_index(['hex_id']), on=['hex_id'])

In [52]:
churn_data['churn_rate'] = churn_data['churn_clients']/churn_data['CustomerID']
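
The same per-hexagon rate can also be computed in a single `groupby`, which avoids the intermediate join. This is a sketch on invented data: the `hex_id` values are placeholders rather than real H3 indexes, and `per_hex` and `churned` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical customers spread over two placeholder hexagons.
toy = pd.DataFrame({
    "hex_id": ["h1", "h1", "h1", "h2", "h2"],
    "CustomerID": ["a", "b", "c", "d", "e"],
    "Churn Label": ["Yes", "No", "No", "Yes", "Yes"],
})

per_hex = (
    toy.assign(churned=toy["Churn Label"].eq("Yes"))
       .groupby("hex_id")
       .agg(customers=("CustomerID", "count"),   # all customers in the hexagon
            churn_clients=("churned", "sum"))    # customers who churned
       .reset_index()
)
per_hex["churn_rate"] = per_hex["churn_clients"] / per_hex["customers"]

print(per_hex)
```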

In [None]:
churn_data

In [None]:
# Assuming churn_data is a DataFrame containing 'hex_id', 'churn_rate', and 'CustomerID' columns
# Example:
# churn_data = pd.DataFrame({
#     'hex_id': ['8928308280fffff', '8928308283fffff'],
#     'churn_rate': [0.1, 0.4],
#     'CustomerID': [10, 20]
# })

# Compute the geographic center of each hexagon
churn_data['center'] = churn_data['hex_id'].apply(lambda x: h3.cell_to_latlng(x))

# Set the color range based on churn rate and create a colormap
color_range = [churn_data['churn_rate'].min(), churn_data['churn_rate'].max()]
colormap = cm.LinearColormap(
    colors=["green", "orange", "red"],
    vmin=min(color_range),
    vmax=max(color_range)
)

# Determine the central location of the map by averaging the hexagon centers
mean_lat = churn_data['center'].apply(lambda x: x[0]).mean()
mean_lon = churn_data['center'].apply(lambda x: x[1]).mean()
map_center = [mean_lat, mean_lon]

# Create a Folium map centered around the calculated geographic center
m = folium.Map(
    location=map_center,
    zoom_start=6,
    width='100%',
    height='80%',
    tiles='Stamen Terrain',
    attr="Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL."
)

# Add hexagonal regions to the map, color-coded by churn rate
for _, row in churn_data.iterrows():
    folium.Polygon(
        locations=h3.cell_to_boundary(row['hex_id']),  # Hexagon vertices as (lat, lng) pairs
        fill=True,
        fill_color=colormap(row['churn_rate']),
        fill_opacity=0.7,
        stroke=False,
        tooltip=f"Churn rate: {row['churn_rate']}<br>Number of customers: {row['CustomerID']}"
    ).add_to(m)

# Attach a color legend to the map
colormap.caption = 'Churn rate'
m.add_child(colormap)

# Display the map
m

<a id="8"></a>
## 3. Customer Lifetime in Service

Before analyzing the services customers have used and their other characteristics, it’s crucial to consider how long customers have been using the service and at what point, in terms of months, they typically start to churn.

Let's analyze the number of months that customers who have churned used the service, and determine if there is a specific point at which the majority of customers stop using the service.

In [None]:
fig = px.histogram(data, x="Tenure Months", color="Churn Label",marginal="box" )
fig.show()

In [None]:
data.groupby('Churn Label')['Tenure Months'].quantile([.50,.75,.90,.95])

In [None]:
data.groupby('Churn Label')['Tenure Months'].mean()

Approximately 50% of the customers who churned did so within the first 10 months. The rate of churn declines sharply after 5 months of service.
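
A statement such as "about half of churned customers left within 10 months" can be checked directly by computing the share of churned customers at or below a tenure threshold. The sketch below uses invented tenures, so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Hypothetical tenures (in months) for six churned customers.
churned_tenure = pd.Series([1, 3, 8, 10, 24, 40], name="Tenure Months")

# Share of churned customers who left within the first 10 months.
share_within_10 = churned_tenure.le(10).mean()

# The median is the tenure by which half of them had already left.
median_tenure = churned_tenure.median()

print(f"Left within 10 months: {share_within_10:.1%}")
print(f"Median tenure at churn: {median_tenure} months")
```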

The client's lifetime before churn is crucial information. Typically, the first few months of service are the riskiest, as customers may have high expectations, which, if unmet, can lead to churn.

<a id="9"></a>
## What are the factors that lead to customer churn?

A total of 41% of customers have stopped using the service, with 50% of them having used the service for less than 10 months.

Now, we can begin analyzing the customer account data to identify which types of customers are more likely to churn and determine the actions we can take to address this.

Before diving into the analysis, we can also examine the responses in the "Churn Reason" column. While customer survey data can be biased due to its subjective nature, gathering customer feedback is essential for any business aiming to grow and improve. Additionally, we have plenty of other data to cross-check and validate these customer responses.

In [None]:
fig = px.bar(data.groupby(['Churn Reason'])['CustomerID'].count().reset_index().sort_values('CustomerID',
                                                                                    ascending=False),
             x='Churn Reason',
             y='CustomerID',
             color = 'CustomerID',
             text = 'CustomerID')
fig.show()

Among the 41% of customers who churned, the majority left because competitors offered more data, higher speeds, or better devices. The remaining churned customers cited poor customer service or dissatisfaction with the attitude of support specialists or the provider as their reasons.

There are also several reasons for churn that are beyond our control, such as the customer relocating or passing away. Since this data is sparse and irrelevant to understanding how we can retain customers, we remove it from the dataset.

In [70]:
data = data[~data['Churn Reason'].isin(['Moved', 'Deceased'])]

<a id="10"></a>
## Types of Contracts

Let’s examine the different types of contracts offered by the service and analyze how they impact the churn rate.

In [None]:
fig = px.histogram(data, x="Churn Label", color="Contract", barmode="group",
                   title="Number of customers by contract type")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

In [None]:
fig = px.pie(data.groupby(['Contract','Churn Label'])['CustomerID'].count().reset_index(),
             values='CustomerID',
            names='Contract',
            facet_col = 'Churn Label',
            title = 'Churn rate by contract type')

fig.show()

86% of the customers who churned had a month-to-month contract.

In [None]:
data.groupby(['Contract','Churn Label'])['Tenure Months'].mean()

<a id="11"></a>
## Total and monthly charges for clients

When examining the total and monthly charges of customers, the monthly charges may be more relevant for identifying the causes of churn. However, total charges can also provide valuable insights in understanding customer behavior.

**Total charges**

In [None]:
fig = px.histogram(data, x="Total Charges", color="Churn Label",
                   marginal="box"
                  )
fig.show()

<div> The median total charges of customers who have churned are less than half those of customers who continue using the service. </div>
<div> However, this doesn't necessarily imply that churned customers were less financially capable. As we’ve seen, many customers tend to leave the service within the first 5 months, which could explain the lower charges. </div>

Therefore, we should focus on the monthly charges of customers to gain better insights.

**Monthly Charges**

In [None]:
fig = px.histogram(data, x="Monthly Charges", color="Churn Label",
                   marginal="box"
                  )
fig.show()

In [None]:
data.groupby('Churn Label')['Monthly Charges'].quantile([.50,.75,.95,.99])

The median monthly charges of customers who have churned are higher than those of active customers. Since we did not identify a strong correlation between churn and customer location in previous steps, this difference may be related to specific services or other factors. We will investigate further to uncover the reasons!

<a id="12"></a>
## Services utilized by the Customer

Here is a list of services utilized by each customer:

- Phone Service  
- Internet Service  
- Online Security  
- Online Backup  
- Device Protection  
- Multiple Lines  
- Tech Support  
- Streaming TV  
- Streaming Movies


Given the large list of services, let’s prioritize by analyzing which variables correlate most strongly with the churn variable. This will help us focus on the most impactful factors.

In [90]:
corr_df = data.copy()

In [264]:
corr_df['Churn Label'] = corr_df['Churn Label'].replace({'Yes': 1, 'No': 0}).astype(int)

In [None]:
df_dummies = pd.get_dummies(corr_df[['Churn Label','Phone Service','Multiple Lines','Internet Service','Online Security',
                                 'Online Backup','Device Protection','Tech Support','Streaming TV',
                                 'Streaming Movies']])
df_dummies.head()

In [None]:
pl.figure(figsize=(9, 7))
sns.heatmap(df_dummies.corr(), annot=False, cmap='coolwarm')

pl.show()
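
A heatmap is easier to read alongside a ranked list. As a sketch, the snippet below one-hot encodes a toy frame and sorts each dummy column by its correlation with the churn label. The rows are invented, and the toy data is deliberately constructed so that "Tech Support" lines up perfectly with churn.

```python
import pandas as pd

# Hypothetical customers; churn is already encoded as 1/0.
toy = pd.DataFrame({
    "Churn Label": [1, 1, 0, 0, 1, 0],
    "Tech Support": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Internet Service": ["Fiber optic", "Fiber optic", "DSL", "No", "DSL", "DSL"],
})

# One-hot encode the categorical columns as 0/1 integers.
dummies = pd.get_dummies(toy, dtype=int)

# Correlation of every dummy column with churn, most negative first.
corr_with_churn = (
    dummies.corr()["Churn Label"]
           .drop("Churn Label")
           .sort_values()
)

print(corr_with_churn)
```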

<a id="13"></a>
### Internet services

First, let's examine the types of internet services that customers have, and then analyze how these different types affect the churn rate.

In [None]:
fig = px.bar(data.groupby('Internet Service')['CustomerID'].count().reset_index(),
             x='Internet Service',
             y='CustomerID',
             color = 'Internet Service',
             text = 'CustomerID')
fig.show()

The majority of customers are connected to DSL Internet, with a group of customers who do not use internet services (likely using only phone services). Interestingly, the churn rate among these non-internet users appears to be lower, according to the correlation graph, but we will verify this further.

Let's analyze which types of internet services were used by customers who churned. This will help identify if certain types of internet services are more associated with churn.

In [None]:
fig = px.pie(data.groupby(['Internet Service','Churn Label'])['CustomerID'].count().reset_index(),
             values='CustomerID',
             facet_col = 'Churn Label',
             names='Internet Service',
            title = 'What type of internet was associated with the customers who left the service?')
fig.show()

Approximately 45% of customers who churned were using Fiber Optic Internet services before leaving.

<a id="14"></a>
### Tech support and online security

**Tech Support**

In [None]:
fig = px.bar(data.groupby(['Internet Service',
                                                'Tech Support',
                                                'Churn Label'])['CustomerID'].count().reset_index(),
             x="Internet Service",
             y="CustomerID",
             color="Churn Label",
             text = 'CustomerID',
             barmode="group",
             facet_col="Tech Support"
            )
fig.show()

Customers who have the Tech Support option enabled churn less. We observe that even among customers with Fiber Optic Internet, the churn percentage is lower for those who have the Tech Support service enabled.

In [None]:
fig = px.pie(data.groupby(['Tech Support','Churn Label'])['CustomerID'].count().reset_index(),
             values='CustomerID',
             facet_col = 'Churn Label',
             hole = .5,
             names='Tech Support',
            title = 'Tech support option and churn')
fig.show()

70.1% of the customers who churned did not have the Tech Support option enabled.

<a id="15"></a>
## Customer's Payment Method

Let’s analyze the payment methods used by customers and examine how these methods influence the churn rate.

In [None]:
fig = px.bar(data.groupby(['Payment Method',
                                                'Churn Label'])['CustomerID'].count().reset_index(),
             x="CustomerID",
             y="Payment Method",
             color="Churn Label",
             text = 'CustomerID'
            )
fig.show()

Wow, it appears that for customers who use an electronic check as their payment method, the churn rate is around 50%.

In [None]:
fig = px.pie(data.groupby(['Payment Method','Churn Label'])['CustomerID'].count().reset_index(),
            values='CustomerID',
            names='Churn Label',
            facet_col = 'Payment Method',
            color = 'Churn Label',
            title = 'Churn rate by customer payment method')

fig.show()

We can observe that customers who use automatic payment methods, such as credit cards and bank transfers, generally have a lower churn rate compared to those using electronic or mailed checks.
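
The per-method churn rates shown in the pie charts can also be tabulated with `pd.crosstab`, normalizing each row so it sums to 100%. This is a sketch on made-up rows, not the dataset's actual counts.

```python
import pandas as pd

# Hypothetical customers: four per payment method.
toy = pd.DataFrame({
    "Payment Method": ["Electronic check"] * 4 + ["Credit card (automatic)"] * 4,
    "Churn Label": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# normalize="index" turns each row into shares of that payment method.
rates = pd.crosstab(toy["Payment Method"], toy["Churn Label"], normalize="index") * 100

print(rates.round(1))
```

The "Yes" column of `rates` is the churn rate for each payment method.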

Among the customers with Fiber Optic Internet, the majority used an electronic check as their payment method.

In [114]:
churn_pm = data.assign(churn_clients = np.where(data['Churn Label']== 'Yes',data['CustomerID'],None))\
   .groupby(['Payment Method','Internet Service']).agg({'churn_clients':'count'}).reset_index()

In [115]:
pm_clients = data.groupby(['Payment Method','Internet Service'])['CustomerID'].count().reset_index()

In [116]:
pm_data = pm_clients.join(churn_pm.set_index(['Payment Method','Internet Service']), on=['Payment Method','Internet Service'])

In [None]:
pm_data

In [118]:
pm_data['churn_rate,%'] = round(((pm_data['churn_clients']/pm_data['CustomerID']) * 100),2)

In [None]:
fig = px.bar(pm_data.sort_values('churn_rate,%'),
             x='churn_rate,%',
             y='Payment Method',
             facet_col = 'Internet Service',
             color = 'churn_rate,%',
             text = 'churn_rate,%')
fig.show()

<a id="16"></a>
## Gender and Age of Clients

**Customer's Gender**

In [None]:
fig = px.pie(data.groupby('Gender')['CustomerID'].count().reset_index(),
            values='CustomerID',
            names='Gender',
            color_discrete_sequence=px.colors.sequential.RdBu,
            title = 'Distribution of the clients by gender')

fig.show()

Let's analyze the churn rate by gender to identify any differences between men and women.

In [None]:
fig = px.bar(data.groupby(['Gender',
                                                'Churn Label'])['CustomerID'].count().reset_index(),
             x="CustomerID",
             y="Gender",
             color="Churn Label",
             text = 'CustomerID'
            )
fig.show()

**Senior Citizen or not**

In [None]:
fig = px.pie(data.groupby(['Senior Citizen','Churn Label'])['CustomerID'].count().reset_index(),
            values='CustomerID',
            names='Churn Label',
            facet_col = 'Senior Citizen',
            color = 'Churn Label',
            title = 'Churn rate by customer age')

fig.show()


The churn rate among senior citizens is nearly twice as high as that of non-senior citizens. However, the number of senior citizens in the dataset is much smaller.

In [None]:
data.groupby('Senior Citizen')['CustomerID'].count()

<a id="17"></a>
## Effect of having a partner or dependents on churn rate.

In [None]:
fig = px.bar(data.groupby(['Senior Citizen','Partner',
                                        'Dependents','Churn Label'])['CustomerID'].count().reset_index(),
             x="Senior Citizen",
             y="CustomerID",
             color="Churn Label",
             #barmode="group",
             facet_row="Partner",
             facet_col = 'Dependents'
            )
fig.show()

We observe that among senior citizens without a partner or dependents, the churn rate is nearly 50%.

Let's examine which services were used by senior citizens, and then proceed to summarize the findings and begin building a churn prediction model.

In [None]:
fig = px.bar(data.groupby(['Senior Citizen','Internet Service','Churn Label'])['CustomerID'].count().reset_index(),
             x="Internet Service",
             y="CustomerID",
             color="Churn Label",
             barmode="group",
             facet_col = 'Senior Citizen'
            )
fig.show()

Among senior citizens, a larger percentage were connected to Fiber Optic Internet, and these customers exhibit the highest churn rate.

Here’s a summary of what we learned from analyzing the data:

- The lowest churn rate is observed among customers without Internet services (though this group is smaller).
- 69.2% of churned customers were using Fiber Optic Internet.
- The lack of technical support and online security options is correlated with higher churn.
- When considering payment methods, customers using electronic checks have the highest churn rate, regardless of the type of Internet service.
- The churn rate for senior citizens is nearly twice as high as that for non-senior citizens.

<a id="18"></a>
# Feature Engineering

<a id="19"></a>
### 1) Removing the columns that are not needed.

In [138]:
data = data.drop(['Country','State','Count','Zip Code','Churn Reason','City','Churn Score','Churn Value','CLTV','CustomerID','Lat Long',
                  'Latitude','Longitude'], axis = 1)

In [None]:
data.info()

<a id="20"></a>
### 2) Converting categorical variables to numeric values.

In [141]:
data['Churn Label'] = data['Churn Label'].map({'Yes': 1, 'No': 0})

In [142]:
def encode_data(dataframe_series):
    if dataframe_series.dtype=='object':
        dataframe_series = LabelEncoder().fit_transform(dataframe_series)
    return dataframe_series

In [None]:
data = data.apply(encode_data)
data.head()
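
To make the effect of `LabelEncoder` concrete, the sketch below encodes a short, hypothetical list of contract types. The encoder assigns integer codes in sorted order of the class names, so the resulting numbers carry no meaning beyond identity, even though a model may treat them as ordered.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a categorical column.
contracts = ["Two year", "Month-to-month", "One year", "Month-to-month"]

encoder = LabelEncoder()
codes = encoder.fit_transform(contracts)

# classes_ lists the categories in the order of their integer codes.
print(list(encoder.classes_))
print(list(codes))
```

Because the codes imply an ordering, one-hot encoding (for example with `pd.get_dummies`) may be a safer choice for linear models such as logistic regression.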

Now, let's examine the correlation between all the selected features and the churn label.

In [None]:
fig = px.bar(data.corr()['Churn Label'].sort_values(ascending = False),
             color = 'value')
fig.show()

<a id="21"></a>
### 3) Balancing the Dataset

We observe that the data is imbalanced: more customers have stayed than have churned.

In [None]:
data.groupby('Churn Label')['Churn Label'].count()

In [149]:
over = SMOTE(sampling_strategy = 1)
x = data.drop("Churn Label", axis = 1).values
y = data['Churn Label'].values

In [150]:
x,y = over.fit_resample(x,y)
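
One common precaution is to split the data before oversampling, so that the test set contains only real customers and no synthetic points derived from them. The sketch below illustrates the ordering on random data. As an assumption for self-containment, it uses plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE; with imbalanced-learn installed, `SMOTE().fit_resample(X_train, y_train)` would take the place of the resampling step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 40 positives, 160 negatives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 40 + [0] * 160)

# 1) Split first, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2) Oversample the minority class in the training part only.
minority = X_train[y_train == 1]
majority = X_train[y_train == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)

X_train_bal = np.vstack([majority, minority_up])
y_train_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print("Train class counts:", np.bincount(y_train_bal))
print("Test class counts: ", np.bincount(y_test))
```

The test set keeps the original class imbalance, which gives a more realistic estimate of how a model would behave on new customers.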

<a id="22"></a>
# Modeling using machine learning approaches.

We are comparing three algorithms here:

1. **Logistic Regression**  
2. **Random Forest**  
3. **Extreme Gradient Boosting Classifier**

### Why these three algorithms for Telecommunications?

- **Logistic Regression:**
    - Simple and interpretable model, easy to understand and implement.
    - Effective for binary classification tasks like churn prediction.
    - Outputs probabilities of churn, which is useful for assessing risk.

- **Random Forest:**
    - An ensemble method that merges multiple decision trees for improved performance.
    - Less prone to overfitting than a single decision tree, and capable of handling complex, non-linear relationships.
    - Provides feature importance scores to identify key factors influencing churn.

- **Extreme Gradient Boosting Classifier (XGBoost):**
    - A gradient boosting method that combines multiple weak learners into a strong one.
    - Known for its high accuracy and efficiency, often outperforming other models in churn prediction.
    - Handles both linear and non-linear relationships, making it ideal for more complex churn datasets.

### Divide the data into training and test sets for the algorithms.

In [155]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state =2, test_size = 0.2)

In [156]:
# Additional imports for modeling (the remaining libraries are imported above)
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [157]:
def logistic_regression(x, y, params=None):
    if len(x) < 10 or len(y) < 10:
        print("Insufficient data.")
        return

    try:
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
        # Use the provided parameters, or default to 'liblinear' for binary classification
        if params is None:
            params = {'solver': 'liblinear'}
        model = LogisticRegression(**params)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        metrics = {
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1 Score": f1_score(y_test, y_pred),
            "Recall": recall_score(y_test, y_pred),
            "Precision": precision_score(y_test, y_pred),
            "AUC": roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
        }

        for metric, value in metrics.items():
            print(f"{metric}: {value}")
    except Exception as e:
        print(f"An error occurred: {e}")


In [158]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

def tune_logistic_regression(x, y):
    if len(x) < 10 or len(y) < 10:
        print("Insufficient data.")
        return

    param_distributions = {
        'C': [0.1, 1, 10],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'max_iter': [1000]
    }

    try:
        random_search = RandomizedSearchCV(LogisticRegression(), param_distributions, n_iter=10, cv=5, scoring='roc_auc', n_jobs=-1)
        random_search.fit(x, y)
        best_params = random_search.best_params_
        print(f"Best parameters: {best_params}")
        logistic_regression(x, y, params=best_params)  # Evaluate model with best parameters
    except Exception as e:
        print(f"An error occurred during tuning: {e}")


In [159]:
def random_forest(x, y, params=None):
    if len(x) < 10 or len(y) < 10:
        print("Insufficient data.")
        return

    try:
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
        # Use the provided parameters, or scikit-learn's defaults
        model = RandomForestClassifier(**(params or {}))
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        metrics = {
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1 Score": f1_score(y_test, y_pred),
            "Recall": recall_score(y_test, y_pred),
            "Precision": precision_score(y_test, y_pred),
            "AUC": roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
        }

        for metric, value in metrics.items():
            print(f"{metric}: {value}")
    except Exception as e:
        print(f"An error occurred: {e}")


In [160]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

def tune_random_forest(x, y):
    if len(x) < 10 or len(y) < 10:
        print("Insufficient data.")
        return

    param_distributions = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }

    try:
        random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions, n_iter=10, cv=5, scoring='roc_auc', n_jobs=-1)
        random_search.fit(x, y)
        best_params = random_search.best_params_
        print(f"Best parameters: {best_params}")
        random_forest(x, y, params=best_params)  # Evaluate model with best parameters
    except Exception as e:
        print(f"An error occurred during tuning: {e}")


In [161]:
def xgboost(x, y, params=None):
    # Check for insufficient data
    if len(x) < 10 or len(y) < 10:
        print("Insufficient data.")
        return

    try:
        # Split the data into training and testing sets
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

        # Use the provided parameters or default ones
        if params is None:
            params = {
                'eval_metric': 'logloss',
                'objective': 'binary:logistic',
                'random_state': 42
            }

        # Create the XGBoost classifier with the given parameters
        model = XGBClassifier(**params)

        # Fit the model on the training data
        model.fit(x_train, y_train)

        # Make predictions on the test data
        y_pred = model.predict(x_test)
        y_pred_proba = model.predict_proba(x_test)[:, 1]

        # Compute evaluation metrics
        metrics = {
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1 Score": f1_score(y_test, y_pred),
            "Recall": recall_score(y_test, y_pred),
            "Precision": precision_score(y_test, y_pred),
            "AUC": roc_auc_score(y_test, y_pred_proba)
        }

        # Print the evaluation metrics
        for metric, value in metrics.items():
            print(f"{metric}: {value}")

    except Exception as e:
        print(f"An error occurred: {e}")


In [162]:

def tune_xgboost(x, y):
    if len(x) < 10 or len(y) < 10:
        print("Insufficient data.")
        return

    # Parameter grid for RandomizedSearchCV
    param_distributions = {
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, 15],
        'learning_rate': [0.01, 0.1, 0.3],
        'subsample': [0.5, 0.7, 0.9],
        'colsample_bytree': [0.5, 0.7, 0.9]
    }

    try:
        # Randomized search for hyperparameter tuning
        random_search = RandomizedSearchCV(
            XGBClassifier(eval_metric='logloss', objective='binary:logistic', random_state=42),
            param_distributions,
            n_iter=10,
            cv=5,
            scoring='roc_auc',
            n_jobs=-1,
            random_state=42
        )
        random_search.fit(x, y)

        # Best parameters
        best_params = random_search.best_params_
        print(f"Best parameters: {best_params}")

        # Evaluate the model with the best parameters
        xgboost(x, y, params={**best_params, 'eval_metric': 'logloss', 'objective': 'binary:logistic', 'random_state': 42})

    except Exception as e:
        print(f"An error occurred during tuning: {e}")


In [None]:
logistic_regression(x, y)

In [None]:
random_forest(x, y)

In [None]:
xgboost(x, y)


In [None]:
tune_logistic_regression(x, y)

In [None]:
# Random forest Hyper Parameter Tuning
tune_random_forest(x, y)

In [None]:
# XGboost hyperparameter tuning
tune_xgboost(x, y)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Function to plot ROC curves for different models
def plot_roc_curves(models, x_test, y_test):
    plt.figure(figsize=(12, 8))  # Set figure size for the plot
    for model_name, model in models.items():
        # Get predicted probabilities for the positive class
        y_pred_proba = model.predict_proba(x_test)[:, 1]
        # Compute the false positive rate (FPR) and true positive rate (TPR)
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        # Calculate the AUC (Area Under the Curve)
        roc_auc = auc(fpr, tpr)
        # Plot ROC curve for the current model
        plt.plot(fpr, tpr, label=f"{model_name} (AUC = {roc_auc:.2f})")
    plt.plot([0, 1], [0, 1], 'r--')  # Plot diagonal line representing random chance
    plt.xlabel('False Positive Rate')  # Label for X-axis
    plt.ylabel('True Positive Rate')  # Label for Y-axis
    plt.title('ROC Curves for Churn Prediction Models')  # Title for the plot
    plt.legend()  # Show legend with model names and AUC values
    plt.show()  # Display the plot

# Initialize models for Logistic Regression, Random Forest, and XGBoost
models = {
    "Logistic Regression": LogisticRegression(solver='liblinear'),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(eval_metric='logloss')  # Removed use_label_encoder
}

# Train each model using the training data
for model_name, model in models.items():
    model.fit(x_train, y_train)

# Generate and display ROC curves for all models
plot_roc_curves(models, x_test, y_test)


In [None]:

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_auc_score

def model_prediction_visualization(x, y):
    # Split the dataset into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    # Initialize models for comparison
    models = {
        'Logistic Regression': LogisticRegression(solver='liblinear'),
        'Random Forest': RandomForestClassifier(),
        'XGBoost': XGBClassifier(eval_metric='logloss')  # Removed use_label_encoder
    }

    # Train the models on the training data
    for model_name, model in models.items():
        model.fit(x_train, y_train)

    # Make predictions and compute probabilities with each model on the test data
    predictions = {}
    probabilities = {}
    for model_name, model in models.items():
        predictions[model_name] = model.predict(x_test)  # Predicted labels
        if hasattr(model, "predict_proba"):  # Check if the model supports predict_proba
            probabilities[model_name] = model.predict_proba(x_test)[:, 1]  # Probabilities for positive class

    # Define evaluation metrics to assess model performance
    metrics = {
        'Accuracy': lambda y_true, y_pred: accuracy_score(y_true, y_pred),
        'F1 Score': lambda y_true, y_pred: f1_score(y_true, y_pred),
        'Recall': lambda y_true, y_pred: recall_score(y_true, y_pred),
        'Precision': lambda y_true, y_pred: precision_score(y_true, y_pred),
        'AUC': lambda y_true, y_proba: roc_auc_score(y_true, y_proba)
    }

    # Evaluate each model using the defined metrics
    results = {}
    for model_name in models.keys():
        results[model_name] = {}
        for metric_name, metric_function in metrics.items():
            if metric_name == 'AUC' and model_name in probabilities:
                results[model_name][metric_name] = metric_function(y_test, probabilities[model_name])
            elif metric_name != 'AUC':
                results[model_name][metric_name] = metric_function(y_test, predictions[model_name])

    # Visualize the results by plotting the metric scores for each model
    fig, ax = plt.subplots(figsize=(12, 8))
    plt.title('Comparison of Model Performance')
    plt.xlabel('Metric')
    plt.ylabel('Score')

    # Plot the scores for each model
    for model_name, metric_scores in results.items():
        plt.plot(list(metric_scores.keys()), list(metric_scores.values()), label=model_name, marker='o')

    plt.legend()
    plt.grid()
    plt.show()

# Call the function to visualize model predictions
model_prediction_visualization(x, y)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

# Function to display confusion matrices for multiple models
def plot_confusion_matrix(models, x_train, y_train, x_test, y_test):
    # Create a subplot for each model to visualize the confusion matrix
    fig, axs = plt.subplots(1, len(models), figsize=(20, 5))

    # For each model, generate and plot its confusion matrix
    for i, (model_name, model) in enumerate(models.items()):
        # Train the model
        model.fit(x_train, y_train)
        # Predict the labels on the test set
        y_pred = model.predict(x_test)
        # Compute the confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        # Visualize the confusion matrix using a heatmap
        sns.heatmap(cm, annot=True, fmt='d', ax=axs[i], cmap='Blues')
        axs[i].set_title(f'{model_name} Confusion Matrix')  # Set the title for the subplot
        axs[i].set_xlabel('Predicted Label')  # Label for X-axis
        axs[i].set_ylabel('True Label')  # Label for Y-axis

    # Adjust layout for better presentation
    plt.tight_layout()
    plt.show()  # Display the confusion matrix plots

# Define the models to evaluate
models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(eval_metric='logloss')  # Removed use_label_encoder
}

# Call the function to plot confusion matrices for the models
plot_confusion_matrix(models, x_train, y_train, x_test, y_test)



## Error Analysis  
Error analysis is the process of identifying and understanding the sources of errors in a model or system. It involves examining the model’s predictions and comparing them to the actual outcomes to identify discrepancies.

Here’s a step-by-step approach to error analysis:

1. **Collect Data:** Gather data on the model’s predictions and actual outcomes, including input features, predicted values, and true values.

2. **Analyze Errors:** Calculate error metrics suited to the task. For a classification problem such as churn prediction, these are accuracy, precision, recall, and F1 score; mean absolute error and root mean squared error apply to regression models instead.

3. **Identify Patterns:** Look for patterns in the errors. For example, are there specific features or conditions causing higher error rates?

4. **Investigate Causes:** Examine the underlying causes of the errors, such as model assumptions, data quality issues, or limitations in the model’s design or training process.

5. **Develop Mitigation Strategies:** Propose solutions to address the identified errors, such as improving data quality, adjusting model parameters, or testing alternative modeling techniques.
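
Steps 1 to 3 above can be sketched as a per-segment breakdown of misclassifications. This is a minimal illustration rather than the analysis performed in this notebook: the helper `error_rate_by_segment`, the `Contract` column, and the toy labels are all assumptions made for the example.

```python
import pandas as pd


def error_rate_by_segment(features, y_true, y_pred, segment_col):
    """Hypothetical helper: misclassification rate for each value of one feature."""
    frame = features.copy()
    frame["y_true"] = list(y_true)
    frame["y_pred"] = list(y_pred)
    # 1 where the prediction disagrees with the true label, 0 otherwise
    frame["error"] = (frame["y_true"] != frame["y_pred"]).astype(int)
    summary = (
        frame.groupby(segment_col)["error"]
        .agg(n_customers="count", n_errors="sum", error_rate="mean")
        .sort_values("error_rate", ascending=False)
    )
    return summary


# Toy data; the column name and values are illustrative, not taken from the churn dataset
toy_features = pd.DataFrame(
    {"Contract": ["Monthly", "Monthly", "Monthly", "Monthly",
                  "Yearly", "Yearly", "Yearly", "Yearly"]}
)
toy_true = [1, 1, 0, 1, 0, 0, 0, 1]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]

segment_summary = error_rate_by_segment(toy_features, toy_true, toy_pred, "Contract")
print(segment_summary)
```

On real data, `features` would be the test-set feature frame and `y_pred` the output of a fitted model's `predict`. Segments that combine a high `error_rate` with a reasonable `n_customers` are the natural candidates for step 4.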



In [None]:
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    recall_score,
    precision_score,
)
import matplotlib.pyplot as plt

# Function to perform error analysis on multiple classification models
def error_analysis(models, x_train, y_train, x_test, y_test):
    # Generate predictions for each model
    predictions = {}
    for model_name, model in models.items():
        model.fit(x_train, y_train)  # Train each model using the training data
        predictions[model_name] = model.predict(x_test)  # Get predictions for the test set

    # Define the evaluation metrics to be calculated
    metrics = {
        "Accuracy": accuracy_score,  # Accuracy of the model
        "F1 Score": f1_score,  # F1 Score
        "Recall": recall_score,  # Recall value
        "Precision": precision_score,  # Precision score
    }

    # Store the results of the error metrics
    results = {}
    for model_name, prediction in predictions.items():
        results[model_name] = {}
        for metric_name, metric_function in metrics.items():
            results[model_name][metric_name] = metric_function(y_test, prediction)

    # Print out the error analysis results
    print("Error Analysis Results:")
    for model_name, metric_scores in results.items():
        print(f"{model_name}:")
        for metric_name, metric_score in metric_scores.items():
            print(f"\t{metric_name}: {metric_score:.4f}")

    # Bar plot for comparison of models' metrics
    metric_names = list(metrics.keys())
    fig, axs = plt.subplots(1, len(metric_names), figsize=(20, 5), sharey=True)
    for i, metric_name in enumerate(metric_names):
        metric_values = [results[model_name][metric_name] for model_name in models.keys()]
        axs[i].bar(models.keys(), metric_values, color="skyblue")
        axs[i].set_title(f"{metric_name} Comparison")
        axs[i].set_xticks(range(len(models)))  # Explicitly set tick positions
        axs[i].set_xticklabels(models.keys(), rotation=45)  # Set tick labels
        axs[i].set_ylim([0, 1])  # Metrics are between 0 and 1
        axs[i].set_ylabel(metric_name)

    plt.tight_layout()
    plt.show()

# Define models for evaluation
models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),  # Removed use_label_encoder
}

# Perform error analysis on the models
error_analysis(models, x_train, y_train, x_test, y_test)



## Conclusion  
In this project, we explored telecommunications churn prediction, starting with dataset analysis, followed by feature engineering, and concluding with a comparison of three machine learning models to identify the best fit for the dataset in terms of accuracy.

The models compared were Logistic Regression, Random Forest, and XGBoost. XGBoost outperformed the other two, achieving an accuracy of 85.03%, an F1 score of 84.51%, a precision of 87.54%, and a recall of 81.68%. Its ROC curve also had the highest AUC, 0.93, indicating a strong ability to distinguish churning from non-churning customers.
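
As a quick consistency check on these figures, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The snippet below is only an arithmetic check on the numbers quoted in this section; the variable names are illustrative.

```python
# Precision and recall as reported for the best model in the conclusion
precision = 0.8754
recall = 0.8168

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from reported precision and recall: {f1:.4f}")
```

The result rounds to 0.8451, in line with the reported F1 score of 84.51%.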

Error analysis showed that all three models exhibited similar error patterns, suggesting that further tuning or additional features could improve accuracy, particularly for certain customer segments.

In conclusion, the XGBoost model delivered the best performance of the three models for churn prediction on this telecommunications dataset. With further optimization, it can provide valuable insights for customer retention strategies.