In [None]:
<a href="https://colab.research.google.com/github/eaedk/Machine-Learning-Tutorials/blob/main/ML_Step_By_Step_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Intro
## General
Machine learning lets you feed an algorithm a large amount of data and have it make data-driven recommendations and decisions based only on that input data.
In most situations we want a machine learning system to make **predictions**, so machine learning tasks fall into several categories depending on the type of prediction needed: **Classification, Regression, Clustering, Generation**, etc.

**Classification** is the task of predicting the label of the class to which the input belongs (e.g., classifying images into two classes: cats and dogs).
**Regression** is the task of predicting numerical value(s) related to the input (e.g., house rent prediction, estimated time of arrival).
**Generation** is the task of creating something new related to the input (e.g., text translation, audio beat generation, image denoising).
**Clustering** is the task of grouping a set of objects so that objects in the same group (called a **cluster**) are more similar (in some sense) to each other than to those in other **clusters** (e.g., client clustering).

In machine learning, learning paradigms are distinguished by one aspect of the dataset: **the presence of the label to be predicted**. **Supervised Learning** applies when the dataset contains the label variables to be predicted, known as `y` variables. **Unsupervised Learning** applies when the dataset does not contain the label variables to be predicted. **Self-supervised Learning** applies when part of the X dataset is considered as the label to be predicted (e.g., the dataset is made of texts and the model tries to predict the next word of each sentence).

## Notebook overview

This notebook is a guide to start practicing Machine Learning.

# Setup

## Installation
Here is the section to install all the packages/libraries that will be needed to tackle the challenge.

In [None]:
#%pip install pyodbc
#!pip install plotly

In [2]:
import pyodbc

In [3]:
#pip install python-dotenv




In [None]:
import dotenv  # module provided by the python-dotenv package

## Importation
Here is the section to import all the packages/libraries that will be used through this notebook.

In [None]:
# Data handling
import pandas as pd
import numpy as np
import pyodbc
import warnings

warnings.filterwarnings('ignore')

# Visualization (Matplotlib, Plotly, Seaborn, etc.)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# EDA (pandas-profiling, hypothesis testing etc. )
import scipy.stats as stats
from scipy.stats import chi2_contingency


# Feature Processing (Scikit-learn processing, etc. )
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...

# Other packages
import os, pickle


In [None]:
# Getting data from database
server = "dap-projects-database.database.windows.net"
database = "dapDB"
username = "dataAnalyst_LP2"
password = "A3g@3kR$2y"

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

# Connecting to the DB
connection = pyodbc.connect(connection_string)

# Selecting the Table
query = "Select * from dbo.LP2_Telco_churn_first_3000"
data = pd.read_sql(query, connection)

# Preview the data
data.head()
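
Credentials written directly in a notebook are visible to anyone who can read the file. A minimal sketch of one alternative, assuming the four values are stored in environment variables: the variable names (`DB_SERVER`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`) and the helper `build_connection_string` are hypothetical and not part of the original notebook.

```python
import os


def build_connection_string(env=os.environ):
    """Build the ODBC connection string from environment variables.

    The variable names are hypothetical; rename them to match your setup.
    """
    required = ["DB_SERVER", "DB_NAME", "DB_USER", "DB_PASSWORD"]
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {missing}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['DB_SERVER']};"
        f"DATABASE={env['DB_NAME']};"
        f"UID={env['DB_USER']};"
        f"PWD={env['DB_PASSWORD']}"
    )


# Demonstration with placeholder values (not real credentials)
demo_env = {
    "DB_SERVER": "example.database.windows.net",
    "DB_NAME": "exampleDB",
    "DB_USER": "example_user",
    "DB_PASSWORD": "example_password",
}
demo_connection_string = build_connection_string(demo_env)
print(demo_connection_string)
```

In the notebook, the values could be placed in a local `.env` file, loaded with `dotenv.load_dotenv()`, and the result of `build_connection_string()` passed to `pyodbc.connect`.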

In [None]:
# Getting the number of rows and columns
data.shape

In [None]:
data.to_csv("data1.csv", index=False)  # index=False avoids writing the row index as an extra column

# Data Loading
Here is the section to load the datasets (train, eval, test) and any additional files.

In [None]:
# Data loading
training_file = "C:/Users/hp/Documents/GitHub/Customer-Churn-ML-Prediction/Datasets/training_data.csv"
training_data = pd.read_csv(training_file)

#Preview Training Data
training_data.head()

# Exploratory Data Analysis: EDA
Here is the section to **inspect** the datasets in depth, **present** them, make **hypotheses** and **plan** the *cleaning, processing and feature creation*.

## Univariate Analysis
Univariate analysis examines individual variables in isolation. For customer churn data, it involves exploring the churn distribution, customer demographics and service usage.

In [None]:
# Churn distribution
# Calculate the churn rate (percentage of customers who churned)
churn_counts = training_data['Churn'].value_counts()
churn_percentages = churn_counts / churn_counts.sum() * 100
print("Churn Distribution:")
print(churn_percentages)

The churn distribution shows that approximately 26.5% of customers have churned, while around 73.5% have not churned. This indicates a class imbalance, with the churned class being the minority.

In [None]:
# Customer demographics - Gender
# Analyzing the gender distributions
sns.set(style="darkgrid")
plt.figure(figsize=(10, 6))
sns.countplot(x='gender', data=training_data)
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

The gender distribution shows the count of customers by gender. The two groups are balanced, with slightly more male than female customers.

In [None]:
# Customer demographics - Senior Citizens
# Analyzing the SeniorCitizens distributions
plt.figure(figsize=(10, 6))
training_data['SeniorCitizen'].value_counts().plot(kind='bar')
plt.title('Senior Citizens Distribution')
plt.xlabel('Senior Citizens')
plt.ylabel('Count')
plt.show()

The senior citizen distribution shows the count of customers by senior citizen status. There are far more non-senior citizens than senior citizens.

In [None]:
# Service usage - Monthly Charges
# Examining the distribution of monthly charges with a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='MonthlyCharges', data=training_data)
plt.title('Monthly Charges')
plt.xlabel('Charges')
plt.show()

The box plot for monthly charges gives an overview of the distribution of charges among customers. It shows the range, median, quartiles, and any potential outliers. The median is around 70, with about 20 being the lowest and about 120 the highest charge per month.

## Bivariate & Multivariate Analysis
Explore, analyze and visualize each variable in relation to the others. For customer churn data, bivariate analysis can help uncover potential correlations or dependencies between different variables and churn.

In [None]:
# Churn by gender
# Analyzing the Churn Rate by Gender to observe any patterns associated with churn
gender_churn = training_data.groupby('gender')['Churn'].value_counts(normalize=True).unstack().reset_index()
gender_churn.rename(columns={'No': 'No Churn', 'Yes': 'Churn'}, inplace=True)

fig = px.bar(gender_churn, x='gender', y=['No Churn', 'Churn'], barmode='stack', title='Churn by Gender')
fig.show()

The stacked bar chart shows the churn rates by gender. About 80% of both male and female customers did not churn and about 20% of each did, so gender shows no visible relationship with churn.
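
A visual comparison like this can be backed by a chi-square test of independence, using the `chi2_contingency` function already imported in this notebook. The sketch below runs on a small synthetic sample (the counts are made up for illustration and are not taken from the training data); with the real data, `sample` would be replaced by `training_data`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Small synthetic sample standing in for training_data (values are made up)
sample = pd.DataFrame({
    "gender": ["Male"] * 50 + ["Female"] * 50,
    "Churn": (["Yes"] * 13 + ["No"] * 37) + (["Yes"] * 14 + ["No"] * 36),
})

# Contingency table of counts: rows are genders, columns are churn labels
contingency = pd.crosstab(sample["gender"], sample["Churn"])

# Chi-square test of independence between gender and churn
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

A large p-value means the test finds no evidence that churn depends on gender, which would be consistent with the reading of the chart.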

In [None]:
# Churn by Payment Method
# Analyzing the Churn Rate by Payment Method to observe any patterns associated with churn
payment_churn = training_data.groupby('PaymentMethod')['Churn'].value_counts(normalize=True).unstack().reset_index()
payment_churn.rename(columns={'No': 'No Churn', 'Yes': 'Churn'}, inplace=True)

fig = px.bar(payment_churn, x='PaymentMethod', y=['No Churn', 'Churn'], barmode='stack', title='Churn by Payment Method')
fig.show()

The stacked bar chart shows the churn rates by payment method. Customers who pay by electronic check churn at the highest rate of all payment methods, which makes this group worth investigating further.

In [None]:
# Churn and service usage - Monthly Charges
# Analyzing the Churn Rate by Monthly Charges to observe any patterns associated with churn
fig = px.scatter(training_data, x='MonthlyCharges', y='Churn', color='Churn', title='Churn and Monthly Charges')
fig.show()

The scatter plot visualizes the relationship between churn and monthly charges, which helps check whether customers with higher monthly charges are more likely to churn. Customers paying between 70 and 110 per month appear frequently among both churned and non-churned customers, so the plot shows no noticeable pattern.

In [None]:
# Churn and contract information - Contract
# Analyzing the Churn Rate by Contract to observe any patterns associated with churn
contract_churn = training_data.groupby('Contract')['Churn'].value_counts(normalize=True).unstack().reset_index()
contract_churn.rename(columns={'No': 'No Churn', 'Yes': 'Churn'}, inplace=True)

fig = px.bar(contract_churn, x='Contract', y=['No Churn', 'Churn'], barmode='stack', title='Churn by Contract Type')
fig.show()

The stacked bar chart shows the churn rates for each contract type (Month-to-month, One year and Two years). A significantly higher percentage of customers on a Month-to-month contract churn.

In [None]:
# Churn and customer tenure
# Analyzing the Churn Rate by customer tenure to observe any patterns associated with churn
plt.figure(figsize=(10, 6))
sns.histplot(training_data, x='tenure', hue='Churn', multiple='stack', bins=20, palette='Set1')
plt.title('Churn and Customer Tenure')
plt.xlabel('Tenure (Months)')
plt.ylabel('Count')
plt.show()

The histogram displays the distribution of customer tenure, in months, for both churned and non-churned customers. It suggests that loyalty grows with tenure: customers with the longest tenure (around 70 months) have the highest No Churn count relative to their churned count.

# Feature Processing & Engineering
Here is the section to **clean**, **process** the dataset and **create new features**.

## Drop Duplicates

In [None]:
# Use pandas.DataFrame.drop_duplicates method
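
A minimal sketch of this step on a toy frame (the rows are made up for illustration; with the real data, `toy` would be `training_data`):

```python
import pandas as pd

# Toy frame with one exact duplicate row (made-up values)
toy = pd.DataFrame({
    "customerID": ["A1", "A2", "A2", "A3"],
    "tenure": [1, 24, 24, 60],
    "Churn": ["Yes", "No", "No", "No"],
})

# Count fully duplicated rows before removing them
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduplicated)} rows remain")
```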

## Dataset Splitting

In [None]:
# Use train_test_split with a random_state, and add stratify for Classification
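
A minimal sketch of a stratified split, assuming a binary `Churn` target. The frame below is synthetic, and the 80/20 ratio and `random_state=42` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame: 100 rows, 25% of them in the positive class
toy = pd.DataFrame({
    "MonthlyCharges": [20.0 + i for i in range(100)],
    "Churn": ["Yes"] * 25 + ["No"] * 75,
})
X = toy.drop(columns="Churn")
y = toy["Churn"]

# stratify=y keeps the class proportions identical in both subsets;
# random_state makes the split reproducible
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts(normalize=True))
print(y_eval.value_counts(normalize=True))
```

Stratifying matters here because the churned class is the minority: without it, a random split could leave the evaluation set with too few churned customers.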

## Impute Missing Values

In [None]:
# Use sklearn.impute.SimpleImputer
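
A minimal sketch of median imputation. The column name `TotalCharges` and the values are assumptions used only for illustration. The imputer is fitted on the training set and then reused on the evaluation set, so that no information from the evaluation data leaks into the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric column with missing values
train = pd.DataFrame({"TotalCharges": [100.0, np.nan, 300.0, 200.0]})
evalset = pd.DataFrame({"TotalCharges": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median from the training set only, then apply it to both sets
train_imputed = imputer.fit_transform(train)
eval_imputed = imputer.transform(evalset)

print(train_imputed.ravel())
print(eval_imputed.ravel())
```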

## New Features Creation

In [None]:
# Code here

## Features Encoding




In [None]:
# From sklearn.preprocessing use OneHotEncoder to encode the categorical features.

## Features Scaling


In [None]:
# From sklearn.preprocessing use StandardScaler, MinMaxScaler, etc.
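
A minimal sketch comparing the two scalers on made-up values. As with imputation and encoding, each scaler is fitted on the training set and only applied to the evaluation set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up numeric feature (one column)
train = np.array([[20.0], [70.0], [120.0]])
evalset = np.array([[95.0]])

# StandardScaler: subtract the training mean, divide by the training standard deviation
standard = StandardScaler().fit(train)
train_standardized = standard.transform(train)

# MinMaxScaler: map the training minimum to 0 and the training maximum to 1
minmax = MinMaxScaler().fit(train)
eval_minmax = minmax.transform(evalset)

print(train_standardized.ravel())
print(eval_minmax.ravel())
```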

## Optional: Train set Balancing (for Classification only)

In [None]:
# Use Over-sampling/Under-sampling methods, more details here: https://imbalanced-learn.org/stable/install.html
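
A minimal sketch of random over-sampling. To keep the example dependency-free it uses `sklearn.utils.resample` rather than imbalanced-learn, and the data is made up. Only the training set should be balanced; the evaluation set keeps its natural class proportions.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced training set: 3 churned vs 7 non-churned customers
train = pd.DataFrame({
    "tenure": list(range(10)),
    "Churn": ["Yes"] * 3 + ["No"] * 7,
})
majority = train[train["Churn"] == "No"]
minority = train[train["Churn"] == "Yes"]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle the rows
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

print(balanced["Churn"].value_counts())
```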

# Machine Learning Modeling 
Here is the section to **build**, **train**, **evaluate** and **compare** the models with each other.

## Simple Model #001

Please keep the following structure for every model you want to try.

### Create the Model

In [None]:
# Code here

### Train the Model

In [None]:
# Use the .fit method

### Evaluate the Model on the Evaluation dataset (Evalset)

In [None]:
# Compute the valid metrics for the use case # Optional: show the classification report 

### Predict on an unknown dataset (Testset)

In [None]:
# Use .predict method # .predict_proba is available just for classification
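
The four steps above (create, train, evaluate, predict) can be sketched end to end as follows. This is an illustration on synthetic data from `make_classification`, with `LogisticRegression` and the F1 score chosen arbitrarily as the model and metric. It is not the notebook's actual model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary classification data
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=42
)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create the model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Evaluate on the evaluation set
y_pred = model.predict(X_eval)
eval_f1 = f1_score(y_eval, y_pred)
print(classification_report(y_eval, y_pred))

# Predict class probabilities (classification only)
probabilities = model.predict_proba(X_eval)
```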

## Simple Model #002

### Create the Model

In [None]:
# Code here

### Train the Model

In [None]:
# Use the .fit method

### Evaluate the Model on the Evaluation dataset (Evalset)

In [None]:
# Compute the valid metrics for the use case # Optional: show the classification report 

### Predict on an unknown dataset (Testset)

In [None]:
# Use .predict method # .predict_proba is available just for classification

## Models comparison
Create a pandas dataframe that will allow you to compare your models.

A sample frame is shown below:

|     | Model_Name     | Metric (metric_name)    | Details  |
|:---:|:--------------:|:--------------:|:-----------------:|
| 0   |  -             |  -             | -                 |
| 1   |  -             |  -             | -                 |


You can use the pandas dataframe method `.sort_values()` to sort the dataframe by the metric.
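
A minimal sketch of such a comparison frame. The model names and scores below are placeholders, not real results.

```python
import pandas as pd

# Placeholder entries; replace them with the metrics computed for each model
results = pd.DataFrame([
    {"Model_Name": "model_001", "Metric (f1_score)": 0.61, "Details": "placeholder"},
    {"Model_Name": "model_002", "Metric (f1_score)": 0.68, "Details": "placeholder"},
])

# Best model first
ranked = results.sort_values(by="Metric (f1_score)", ascending=False).reset_index(drop=True)
print(ranked)
```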

## Hyperparameters tuning 

Fine-tune the Top-k models (3 < k < 5) using `GridSearchCV` (found in `sklearn.model_selection`) to find the best hyperparameters and achieve the maximum performance of each of the Top-k models, then compare them again to select the best one.

In [None]:
# Code here
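
A minimal sketch of a grid search on synthetic data. The estimator, the parameter grid, the `f1` scoring and the 5-fold cross-validation are illustrative assumptions and should be adapted to each of the selected models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Candidate values for the regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Evaluate every candidate with 5-fold cross-validation on the F1 score
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` holds the model refitted on all the data with the best parameters found.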

# Export key components
Here is the section to **export** the important ML objects that will be used to develop an app: *Encoder, Scaler, ColumnTransformer, Model, Pipeline, etc*.

In [None]:
# Use pickle : put all your key components in a python dictionary and save it as a file that will be loaded in an app
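
A minimal sketch of the export and reload round trip. The dictionary keys, the file name `ml_components.pkl` and the tiny fitted objects are assumptions used only for illustration. The file is written to the system temporary directory here; an app would use its own path.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset, just to obtain fitted objects
X = [[20.0], [45.0], [70.0], [120.0]]
y = [0, 0, 1, 1]
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Gather the key components in one dictionary
components = {"scaler": scaler, "model": model}

# Save the dictionary to a file
export_path = os.path.join(tempfile.gettempdir(), "ml_components.pkl")
with open(export_path, "wb") as file:
    pickle.dump(components, file)

# Reload it, as the app would do
with open(export_path, "rb") as file:
    loaded = pickle.load(file)

print(sorted(loaded))
```

Pickle files should only be loaded from trusted sources, and the app should use the same library versions as the notebook that produced the file.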