# Subscription Churn Prediction

# Table of Contents
* [1. Introduction](#section1)
* [2. Importing Required Libraries](#section2)
* [3. Data Exploration](#section3)
  - [Feature Description](#section31)
  - [Basic EDA](#section32)
  - [Profiling Report](#section33)
  - [Dropping Unrelated Columns](#section34)  
  - [Visualization](#section35)  
  - [Correlation](#section36)
* [4. Data Prep](#section4)  
  - [Category Features](#section41)
    - [LabelEncoder](#section411)
    - [OneHotEncoder](#section412)
  - [Scaler](#section42)
* [5. Modelling](#section5)
   - [RandomForestClassifier](#section51)
   - [XGBClassifier](#section52)
   - [LGBMClassifier](#section53)
   - [AdaBoostClassifier](#section54)
   - [LazyClassifier Automation](#section55)   
* [6. Picked Model Validation](#section6)
  

<a id="section1"></a>
# Introduction
This notebook uses the [Kaggle dataset](https://www.kaggle.com/datasets/safrin03/predictive-analytics-for-customer-churn-dataset). Download **data_descriptions.csv, train.csv, test.csv** before running it.

<a id="section2"></a>
# Importing Required Libraries

In [None]:
# !pip install pandas
# !pip install missingno
# !pip install plotly
# !brew install lightgbm 
# !pip install lightgbm
# !pip install scikit-learn
# !pip install xgboost
# !pip install -U lazypredict

In [None]:
#Importing all essential libraries
import numpy as np
import pandas as pd
import plotly.express as px
import missingno as msno

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,LabelEncoder

import lazypredict
from lazypredict.Supervised import LazyClassifier

import warnings
warnings.filterwarnings('ignore')

<a id="section3"></a>
# Data Exploration

<a id="section31"></a>
## Feature Description

In [None]:
data_descriptions = pd.read_csv("data_descriptions.csv")
data_descriptions

<a id="section32"></a>
## Basic EDA

In [None]:
# read the training dataset
train_df = pd.read_csv("train.csv")

In [None]:
# browse the data
train_df.head()

In [None]:
# check shape
train_df.shape

In [None]:
# Convert 'Yes'/'No' strings to 1/0
value_map = {'Yes': 1, 'No': 0}
yes_no_columns = ['PaperlessBilling', 'MultiDeviceAccess', 'ParentalControl', 'SubtitlesEnabled']
for column in yes_no_columns:
    train_df[column] = train_df[column].replace(value_map)

In [None]:
# check column types
train_df.dtypes

In [None]:
# descriptive stats
train_df.describe()

In [None]:
# check unique values for each column
train_df.nunique()

<a id="section321"></a>
### Null value check

In [None]:
# check missing values
train_df.isnull().sum()

In [None]:
# OR use missingno to check
msno.matrix(train_df)

The dataset has no missing values, so no imputation or row removal is needed before analysis and modelling. If it did contain missing values, they would have to be handled in a data cleaning step first.

CustomerID is randomly allocated to each customer and carries no information for the model, so we will drop it.
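
As a minimal sketch of what such a cleaning step could look like if missing values were present: numeric columns are filled with the column median and the remaining columns with their most frequent value. The toy frame and its values are invented for illustration; only the column names are borrowed from this dataset, and other strategies (dropping rows, model-based imputation) may suit real data better.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are invented for illustration)
toy_df = pd.DataFrame({
    "AccountAge": [12, np.nan, 30, 48],
    "PaperlessBilling": ["Yes", "No", None, "Yes"],
})

# Numeric columns: fill with the column median
num_cols = toy_df.select_dtypes(include="number").columns
toy_df[num_cols] = toy_df[num_cols].fillna(toy_df[num_cols].median())

# Remaining (categorical) columns: fill with the most frequent value
cat_cols = toy_df.columns.difference(num_cols)
for col in cat_cols:
    toy_df[col] = toy_df[col].fillna(toy_df[col].mode()[0])

# Total number of missing values left after filling
print(toy_df.isnull().sum().sum())
```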

<a id="section33"></a>
## Profiling Report

In [None]:
from IPython.display import HTML

HTML(filename="train_csv.html")

<a id="section34"></a>
## Dropping Features

In [None]:
# Drop the CustomerID feature
train_df.drop('CustomerID', axis=1, inplace=True)

# Drop MonthlyCharges because TotalCharges = AccountAge * MonthlyCharges
train_df.drop('MonthlyCharges',axis = 1,inplace = True)
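
The relationship TotalCharges = AccountAge * MonthlyCharges is what justifies dropping MonthlyCharges. The sketch below shows one way such a relationship could be checked; it uses a toy frame with invented values that satisfy the relationship by construction. On the real data the same check would have to run before the drop.

```python
import numpy as np
import pandas as pd

# Toy rows built so that TotalCharges = AccountAge * MonthlyCharges (invented values)
toy_df = pd.DataFrame({
    "AccountAge": [10, 24, 36],
    "MonthlyCharges": [9.99, 15.50, 12.00],
})
toy_df["TotalCharges"] = toy_df["AccountAge"] * toy_df["MonthlyCharges"]

# Check the relationship row by row, allowing for floating-point rounding
is_redundant = np.isclose(
    toy_df["TotalCharges"], toy_df["AccountAge"] * toy_df["MonthlyCharges"]
).all()
print(is_redundant)
```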

<a id="section35"></a>
## Visualization 

In [None]:
# category features
category_features = train_df.select_dtypes(include='object')
for feature in category_features:
    fig = px.histogram(train_df,x=feature, color = 'Churn',barmode = 'group', text_auto=True)
    fig.update_layout(width=400, height=200)
    fig.show()

In [None]:
# numeric features, excluding the target itself
numeric_features = train_df.select_dtypes(include='number').columns.drop('Churn')
for feature in numeric_features:
    # Calculate the mean value of the feature for each Churn class
    mean_by_target = train_df.groupby('Churn')[feature].mean().reset_index()
    mean_by_target['Churn'] = mean_by_target['Churn'].astype(str)
    # Plot one bar per Churn class showing the mean value of the feature
    fig = px.bar(mean_by_target, x='Churn', y=feature, color='Churn', text_auto=True)
    # Display the figure
    fig.show()

<a id="section351"></a>
## Outlier Check

In [None]:
# Numerical columns to inspect (boolean columns and Churn are excluded)
outlier_columns = ['AccountAge', 'TotalCharges', 
       'ViewingHoursPerWeek', 'AverageViewingDuration',
       'ContentDownloadsPerMonth', 'UserRating', 'SupportTicketsPerMonth',
       'WatchlistSize']

# Create a box plot for each numerical column to reveal outliers
for column in outlier_columns:
    fig = px.box(train_df, y=column)
    fig.update_layout(title=f'Outliers in {column}',
                      yaxis_title=column,
                      width=800, height=400)
    fig.show()
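
Box plots show outliers visually. As a complementary sketch, outliers can also be counted with the common 1.5 × IQR rule, which is the usual convention behind box-plot whiskers. The helper `count_iqr_outliers` and the toy values are hypothetical, and the count may differ slightly from what a plot shows because quartiles can be computed in more than one way.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Invented values: nine ordinary points and one extreme point
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(count_iqr_outliers(toy))  # 1
```

Applied to this notebook, the same helper could be called on each column in `outlier_columns`.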

<a id="section36"></a>
## Correlation 

In [None]:
# plotting correlation matrix to notice relationships or lack of it between variables
corr = train_df.select_dtypes(include=['number']).corr()

fig = px.imshow(corr, x=corr.columns, y=corr.index, color_continuous_scale='RdBu_r', text_auto=True)
# Make the figure size bigger
fig.update_layout(width=900, height=900)

fig.show()

In [None]:
# identify highly correlated feature pairs

# Keep correlation values greater than 0.8, ignoring each feature's correlation with itself
off_diagonal = ~np.eye(len(corr), dtype=bool)
corr_values = corr.where((corr > 0.8) & off_diagonal).unstack()

# Get the (feature, feature) pairs that exceed the threshold
correlated_features = corr_values.dropna().index.to_list()
print(correlated_features)

# If any pairs are listed, drop one feature of each pair from train_df

In [None]:
# sort correlation values of feature Churn
corr['Churn'].sort_values(ascending=False)

<a id="section4"></a>
# Data Prep

<a id="section41"></a>
## Category Features

<a id="section411"></a>
### LabelEncoder

In [None]:
# Converting categorical features into numerical features using LabelEncoder
train_df_copy = train_df.copy() # untouched copy, kept only for the OneHotEncoder demo
# category features
for feature in category_features:
    train_df[feature] = LabelEncoder().fit_transform(train_df[feature])

In [None]:
train_df.head()

<a id="section412"></a>
### OneHotEncoder

**In this example, testing showed a negligible difference between LabelEncoder and OneHotEncoder. This section only demonstrates the concept; the one-hot encoded data is not used further in the process.**

In [None]:
train_df_copy = pd.get_dummies(train_df_copy, columns=category_features.columns.tolist())
train_df_copy.head()
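
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical `Plan` column (the column name and its values are invented, not taken from the dataset). LabelEncoder produces a single integer column, while one-hot encoding produces one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column (name and values invented for illustration)
toy_df = pd.DataFrame({"Plan": ["Basic", "Premium", "Standard", "Basic"]})

# LabelEncoder: one integer per category, assigned in sorted order
label_encoded = LabelEncoder().fit_transform(toy_df["Plan"])
print(label_encoded.tolist())  # [0, 1, 2, 0]

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy_df, columns=["Plan"])
print(one_hot.columns.tolist())  # ['Plan_Basic', 'Plan_Premium', 'Plan_Standard']
```

Label encoding implies an arbitrary ordering of the categories, which tree-based models usually tolerate. One-hot encoding avoids that ordering at the cost of a wider table.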

<a id="section42"></a>
## StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df.drop(columns=["Churn"]), train_df["Churn"], test_size=0.25, random_state=42)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
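
The scaler is fitted on the training split only and then reused on the test split, so no statistics from the test data leak into preprocessing. A minimal sketch with invented single-feature data shows what that means numerically:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature data
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

scaler = StandardScaler()
# fit_transform learns the mean (2.0) and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train_toy)
# transform reuses the training mean and standard deviation on the test data
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_)           # mean learned from the training data
print(X_train_scaled.mean())  # approximately 0 after scaling
print(X_test_scaled)          # test value expressed in training standard deviations
```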

<a id="section5"></a>
# Modelling

<a id="section51"></a>
## RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, f1_score

rfc = RandomForestClassifier()
print(rfc.get_params())
rfc.fit(X_train, y_train)
y_pred_rfc = rfc.predict(X_test)
print(classification_report(y_test, y_pred_rfc))
print('Accuracy Score : ' + str(round(accuracy_score(y_test,y_pred_rfc),3)))
print('F1 Score : ' + str(round(f1_score(y_test,y_pred_rfc),3)))
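
Both accuracy and F1 are reported because accuracy alone can look good when one class dominates. The sketch below uses invented labels, not results from this dataset, to show a predictor that never predicts churn yet still reaches high accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented labels: 9 non-churners (0) and 1 churner (1)
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A "model" that always predicts the majority class
y_pred_toy = [0] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
# zero_division=0 returns 0 instead of warning when no positives are predicted
f1 = f1_score(y_true_toy, y_pred_toy, zero_division=0)
print(acc, f1)
```

Here accuracy is 0.9 while F1 for the churn class is 0, since no churner is ever identified.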

<a id="section52"></a>
## XGBClassifier

In [None]:
import xgboost as xgb

# Create an XGBClassifier object
clf = xgb.XGBClassifier()

# Fit the model to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the model performance
print('Accuracy Score : ' + str(round(accuracy_score(y_test,y_pred),3)))
print('F1 Score : ' + str(round(f1_score(y_test,y_pred),3)))

<a id="section53"></a>
## LGBMClassifier

In [None]:
import lightgbm as lgb

# Create an LGBMClassifier object
clf = lgb.LGBMClassifier()

# Fit the model to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the model performance
print('Accuracy Score : ' + str(round(accuracy_score(y_test,y_pred),3)))
print('F1 Score : ' + str(round(f1_score(y_test,y_pred),3)))

<a id="section54"></a>
## AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create an AdaBoostClassifier with a decision tree as the weak learner
# (`estimator` replaced `base_estimator` in scikit-learn 1.2)
abc = AdaBoostClassifier(estimator=DecisionTreeClassifier())

# Train the model
abc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = abc.predict(X_test)

# Evaluate the model performance
print('Accuracy Score : ' + str(round(accuracy_score(y_test,y_pred),3)))
print('F1 Score : ' + str(round(f1_score(y_test,y_pred),3)))

<a id="section55"></a>
## LazyClassifier Automation

This section is also for demonstration purposes only.

<a id="section551"></a>
### Check the Available Classifiers

In [None]:
lazypredict.Supervised.CLASSIFIERS

<a id="section552"></a>
### Choose the Necessary Classifiers

In [None]:
classifiers = [
  'XGBClassifier',
  'RandomForestClassifier',
  'LGBMClassifier',
  'DecisionTreeClassifier',
  'SGDClassifier'
]

In [None]:
lazypredict.Supervised.CLASSIFIERS = [tup for tup in lazypredict.Supervised.CLASSIFIERS if tup[0] in classifiers]
lazypredict.Supervised.CLASSIFIERS

In [None]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric = None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)

In [None]:
print(models)


<a id="section553"></a>
### Accuracy vs Model

In [None]:
models.sort_values(by = 'Accuracy',inplace = True,ascending = False)
line = px.line(data_frame= models ,y =["Accuracy"] , markers = True)
line.update_xaxes(title="Model",
              rangeslider_visible = False)
line.update_yaxes(title = "Accuracy")
line.update_traces(line_color="red")
line.update_layout(showlegend = True,
    title = {
        'text': 'Accuracy vs Model'})

line.show()

<a id="section554"></a>
### Time Taken vs Model

In [None]:
models.sort_values(by = 'Time Taken',inplace = True,ascending = False)
line = px.line(data_frame= models ,y =["Time Taken"] , markers = True)
line.update_xaxes(title="Model")
line.update_yaxes(title = "Time Taken")
line.update_traces(line_color="blue")
line.update_layout(showlegend = True,
    title = {
        'text': 'Time Taken Vs Model'})

line.show()