<a href="https://colab.research.google.com/github/eaedk/Machine-Learning-Tutorials/blob/main/ML_Step_By_Step_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Customer Churn Prediction Challenge For Azubian**

## **Project Statement of the Problem**:
This project focuses on customer churn prediction for an African telecommunications firm that provides customers with airtime and mobile data bundles. The company aims to build a machine learning model that forecasts the probability of each customer "churning", that is, becoming inactive and making no transactions for 90 days. Accurate churn prediction matters to the company because it enables proactive retention strategies and limits revenue loss.

## **Project Objective**:
The objective of this challenge is to develop a machine learning model to predict the likelihood of each customer “churning,” i.e. becoming inactive and not making any transactions for 90 days. This will help this telecom company to better serve their customers by understanding which customers are at risk of leaving.

## **Hypothesis**

**Null hypothesis (H0)**: There is no significant relationship between the customers' characteristics and the churn rate. In other words, the variables in the dataset have no impact on customer churn.

**Alternative hypothesis (H1)**: There is a significant relationship between the customers' characteristics and the churn rate. The variables in the dataset have an impact on customer churn.

## **Business Questions**

#### 1.  What is the overall churn rate of the company?

#### 2. Which region has the highest representation?

#### 3. Which tenure has the highest representation?

#### 4. Does the length of tenure (months) affect the churn rate of customers?

#### 5. Which region has the most clients churning?

#### 6. Does the number of times a client is active over 90 days (Regularity) affect whether they churn?

#### 7. What's the correlation between the various features?



# Data Understanding
- Data collated from the ***Customer Relationship Management*** team contains demographic and usage information for each customer, as well as whether or not they churned. The meaning of each variable in the dataset is given below.

The churn dataset includes 19 variables including 15 numeric variables and 4 categorical variables.
1. user_id - Unique identifier for each customer
2. REGION - the location of each client
3. TENURE - duration in the network
4. MONTANT - top-up amount
5. FREQUENCE_RECH - number of times the customer refilled
6. REVENUE - monthly income of each client
7. ARPU_SEGMENT - income over 90 days / 3
8. FREQUENCE - number of times the client has made an income
9. DATA_VOLUME - number of connections
10. ON_NET - inter expresso call
11. ORANGE - call to orange
12. TIGO - call to Tigo
13. ZONE1 - call to zones1
14. ZONE2 - call to zones2
15. MRG - a client who is going
16. REGULARITY - number of times the client is active for 90 days
17. TOP_PACK - the most active packs
18. FREQ_TOP_PACK - number of times the client has activated the top pack packages
19. CHURN - variable to predict - Target

# Setup

## Installation
Here is the section to install all the packages/libraries that will be needed to tackle the challenge.

In [1]:
# Installing relevant libraries
%pip install tabulate
%pip install plotly
%pip install statsmodels
%pip install imblearn
%pip install phik
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.



## Importation
Here is the section to import all the packages/libraries that will be used throughout this notebook.

In [2]:
# Data handling
import pandas as pd
import numpy as np

# Visualization (Matplotlib, Plotly, Seaborn, etc. )
...

# EDA (pandas-profiling, etc. )
...

# Feature Processing (Scikit-learn processing, etc. )
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...

# Other packages
import os, pickle


# Data Loading
Here is the section to load the datasets (train, eval, test) and the additional files

In [3]:
# Load the train and test datasets
train = pd.read_csv("data/Train.csv")
test = pd.read_csv("data/Test.csv")

# Data Understanding

### Preview Datasets

In [4]:
# previewing a section of the train dataset 
train.head()

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK,CHURN
0,7ee9e11e342e27c70455960acc80d3f91c1286d1,DAKAR,K > 24 month,20000.0,47.0,21602.0,7201.0,52.0,8835.0,3391.0,396.0,185.0,,,NO,62,On net 200F=Unlimited _call24H,30.0,0
1,50443f42bdc92b10388fc56e520e4421a5fa655c,,K > 24 month,,,,,,,,,,,,NO,3,,,0
2,da90b5c1a9b204c186079f89969aa01cb03c91b2,,K > 24 month,,,,,,,,,,,,NO,1,,,0
3,364ec1b424cdc64c25441a444a16930289a0051e,SAINT-LOUIS,K > 24 month,7900.0,19.0,7896.0,2632.0,25.0,9385.0,27.0,46.0,20.0,,2.0,NO,61,"Data:490F=1GB,7d",7.0,0
4,d5a5247005bc6d41d3d99f4ef312ebb5f640f2cb,DAKAR,K > 24 month,12350.0,21.0,12351.0,4117.0,29.0,9360.0,66.0,102.0,34.0,,,NO,56,All-net 500F=2000F;5d,11.0,0


In [26]:
# previewing a section of the test dataset
test.head()

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK
0,51fe4c3347db1f8571d18ac03f716c41acee30a4,MATAM,I 18-21 month,2500.0,5.0,2500.0,833.0,5.0,0.0,64.0,70.0,,,,NO,35,All-net 500F=2000F;5d,5.0
1,5ad5d67c175bce107cc97b98c4e37dcc38aa7f3e,,K > 24 month,,,,,,,,,,,,NO,2,,
2,5a4db591c953a8d8f373877fad37aaf4268899a1,,K > 24 month,,,,,,0.0,,,,,,NO,22,,
3,8bf9b4d8880aeba1c9a0da48be78f12e629be37c,,K > 24 month,,,,,,,,,,,,NO,6,,
4,c7cdf2af01e9fa95bf498b68c122aa4b9a8d10df,SAINT-LOUIS,K > 24 month,5100.0,7.0,5637.0,1879.0,15.0,7783.0,30.0,24.0,0.0,0.0,,NO,60,"Data:1000F=2GB,30d",4.0


In [27]:
train.shape, test.shape

((1077024, 19), (190063, 18))

In [30]:
train.columns.values, test.columns.values

(array(['user_id', 'REGION', 'TENURE', 'MONTANT', 'FREQUENCE_RECH',
        'REVENUE', 'ARPU_SEGMENT', 'FREQUENCE', 'DATA_VOLUME', 'ON_NET',
        'ORANGE', 'TIGO', 'ZONE1', 'ZONE2', 'MRG', 'REGULARITY',
        'TOP_PACK', 'FREQ_TOP_PACK', 'CHURN'], dtype=object),
 array(['user_id', 'REGION', 'TENURE', 'MONTANT', 'FREQUENCE_RECH',
        'REVENUE', 'ARPU_SEGMENT', 'FREQUENCE', 'DATA_VOLUME', 'ON_NET',
        'ORANGE', 'TIGO', 'ZONE1', 'ZONE2', 'MRG', 'REGULARITY',
        'TOP_PACK', 'FREQ_TOP_PACK'], dtype=object))

# Exploratory Data Analysis: EDA
Here is the section to **inspect** the datasets in depth, **present** them, make **hypotheses** and **think through** the *cleaning, processing and feature creation*.

## Univariate Analysis
Here is the section to explore, analyze, visualize each variable independently of the others.

In [7]:
# Code here
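
As a starting point for this cell, the sketch below looks at each variable on its own: the share of missing values, the overall churn rate, category frequencies and summary statistics. The `toy` frame is a hypothetical stand-in for `train` (same column names, invented values) so the snippet runs without the CSV files; swapping in `train` is the intended use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train` (invented values, real column names)
toy = pd.DataFrame({
    "REGION": ["DAKAR", None, "SAINT-LOUIS", "DAKAR", None, "MATAM"],
    "TENURE": ["K > 24 month"] * 5 + ["I 18-21 month"],
    "MONTANT": [20000.0, np.nan, 7900.0, 12350.0, np.nan, 2500.0],
    "REGULARITY": [62, 3, 61, 56, 1, 35],
    "CHURN": [0, 1, 0, 0, 1, 0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Overall churn rate: the mean of a 0/1 target is the share of churners
churn_rate = toy["CHURN"].mean()

# Frequency of each category (missing regions counted as their own group)
region_counts = toy["REGION"].value_counts(dropna=False)
tenure_counts = toy["TENURE"].value_counts()

# Summary statistics of one numeric variable
montant_summary = toy["MONTANT"].describe()

print(missing_share, churn_rate, region_counts, tenure_counts, montant_summary, sep="\n\n")
```

The same calls answer business questions 1 to 3 once they are run on the real `train` frame.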

## Bivariate & Multivariate Analysis
Here is the section to explore, analyze, visualize each variable in relation to the others.

In [8]:
# Code here
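
One possible way to relate variables to each other is a grouped churn rate plus a correlation matrix, sketched below. The `toy` frame is again a hypothetical stand-in for `train` with invented values, and Pearson correlation via `DataFrame.corr()` is only one option (the notebook also installs `phik`, which is not used here).

```python
import pandas as pd

# Hypothetical toy frame standing in for `train` (invented values)
toy = pd.DataFrame({
    "TENURE": ["K > 24 month", "K > 24 month", "I 18-21 month",
               "I 18-21 month", "K > 24 month", "I 18-21 month"],
    "REGULARITY": [62, 3, 61, 1, 56, 2],
    "FREQUENCE": [52.0, 1.0, 25.0, 1.0, 29.0, 2.0],
    "CHURN": [0, 1, 0, 1, 0, 1],
})

# Churn rate per tenure band: mean of the 0/1 target within each group
churn_by_tenure = toy.groupby("TENURE")["CHURN"].mean().sort_values(ascending=False)

# Pearson correlation between numeric features and the target
corr = toy[["REGULARITY", "FREQUENCE", "CHURN"]].corr()

print(churn_by_tenure)
print(corr)
```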

# Feature Processing & Engineering
Here is the section to **clean**, **process** the dataset and **create new features**.

## Drop Duplicates

In [9]:
# Use pandas.DataFrame.drop_duplicates method
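
A minimal sketch of this step, using a hypothetical toy frame in place of `train`: count the fully duplicated rows first, then drop them.

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row (invented values)
toy = pd.DataFrame({
    "user_id": ["a1", "b2", "b2", "c3"],
    "REGULARITY": [62, 3, 3, 61],
    "CHURN": [0, 1, 1, 0],
})

# Number of rows that repeat an earlier row exactly
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```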

## Dataset Splitting

In [10]:
# Use train_test_split with a random_state, and add stratify for Classification
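
The sketch below shows a stratified split on a hypothetical toy frame. The 25% evaluation share and `random_state=42` are illustrative choices, not values taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 16 non-churners and 4 churners (invented values)
toy = pd.DataFrame({
    "REGULARITY": list(range(20)),
    "CHURN": [0] * 16 + [1] * 4,
})

X = toy.drop(columns="CHURN")
y = toy["CHURN"]

# stratify=y keeps the churn proportion the same in both parts
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(y_train.mean(), y_eval.mean())
```

Because churn is usually the minority class, stratifying keeps the evaluation set from ending up with too few churners by chance.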

## Impute Missing Values

In [11]:
# Use sklearn.impute.SimpleImputer
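
A minimal sketch of imputation with `SimpleImputer`, assuming median imputation for numeric columns and most-frequent imputation for categorical ones (both strategies are illustrative choices). The imputers are fitted on the training part only and then applied to the evaluation part, so no information leaks from the evaluation data. The frames hold invented values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/eval parts with missing values (invented values)
X_train = pd.DataFrame({
    "MONTANT": [20000.0, np.nan, 7900.0, 12350.0],
    "REGION": pd.Series(["DAKAR", np.nan, "SAINT-LOUIS", "DAKAR"], dtype=object),
})
X_eval = pd.DataFrame({
    "MONTANT": [np.nan, 2500.0],
    "REGION": pd.Series([np.nan, "MATAM"], dtype=object),
})

num_cols = ["MONTANT"]
cat_cols = ["REGION"]

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Fit on the training part, then reuse the learned values on the evaluation part
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_eval[num_cols] = num_imputer.transform(X_eval[num_cols])
X_eval[cat_cols] = cat_imputer.transform(X_eval[cat_cols])

print(X_train)
print(X_eval)
```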

## New Features Creation

In [12]:
# Code here

## Features Encoding




In [13]:
# From sklearn.preprocessing use OneHotEncoder to encode the categorical features.

## Features Scaling


In [14]:
# From sklearn.preprocessing use StandardScaler, MinMaxScaler, etc.
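
A minimal sketch with `StandardScaler` on hypothetical toy frames: the mean and standard deviation are learned from the training part and reused unchanged on the evaluation part.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (invented values)
X_train = pd.DataFrame({
    "MONTANT": [2000.0, 4000.0, 6000.0],
    "REGULARITY": [10.0, 20.0, 30.0],
})
X_eval = pd.DataFrame({"MONTANT": [4000.0], "REGULARITY": [40.0]})

scaler = StandardScaler()

# Fit on training data, then apply the same shift and scale to evaluation data
train_scaled = scaler.fit_transform(X_train)
eval_scaled = scaler.transform(X_eval)

print(train_scaled)
print(eval_scaled)
```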

## Optional: Train set Balancing (for Classification only)

In [15]:
# Use Over-sampling/Under-sampling methods, more details here: https://imbalanced-learn.org/stable/install.html
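
As an illustration of random over-sampling, the sketch below uses `sklearn.utils.resample` as a stand-in for the imbalanced-learn samplers linked above; it is not the imbalanced-learn API. The `train_part` frame is hypothetical, and only the training part should ever be balanced, never the evaluation set.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training part: 6 non-churners, 2 churners (invented values)
train_part = pd.DataFrame({
    "REGULARITY": [62, 61, 56, 35, 60, 58, 3, 1],
    "CHURN": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train_part[train_part["CHURN"] == 0]
minority = train_part[train_part["CHURN"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle so the classes are not grouped together
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["CHURN"].value_counts())
```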

# Machine Learning Modeling 
Here is the section to **build**, **train**, **evaluate** and **compare** the models with each other.

## Simple Model #001

Please keep the following structure for every model you want to try.

### Create the Model

In [16]:
# Code here

### Train the Model

In [17]:
# Use the .fit method

### Evaluate the Model on the Evaluation dataset (Evalset)

In [18]:
# Compute the valid metrics for the use case # Optional: show the classification report 

### Predict on an unknown dataset (Testset)

In [19]:
# Use .predict method # .predict_proba is available just for classification
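
The four steps above (create, train, evaluate, predict) can be sketched end to end as follows. Everything here is illustrative: the data is synthetic, `LogisticRegression` is just one possible first model, and ROC AUC is an assumed metric rather than one prescribed by this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Synthetic stand-ins for the processed train, eval and test features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] < 0).astype(int)
X_eval = rng.normal(size=(50, 3))
y_eval = (X_eval[:, 0] < 0).astype(int)
X_test = rng.normal(size=(10, 3))  # no labels, like the real test set

# Create the model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Evaluate on the evaluation set using the predicted churn probability
eval_proba = model.predict_proba(X_eval)[:, 1]
eval_auc = roc_auc_score(y_eval, eval_proba)
print(f"Eval AUC: {eval_auc:.3f}")
print(classification_report(y_eval, model.predict(X_eval)))

# Predict churn probabilities on the unlabeled test set
test_proba = model.predict_proba(X_test)[:, 1]
```

Since the objective is the likelihood of churn, `predict_proba` is used here instead of the hard labels returned by `predict`.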

## Simple Model #002

### Create the Model

In [20]:
# Code here

### Train the Model

In [21]:
# Use the .fit method

### Evaluate the Model on the Evaluation dataset (Evalset)

In [22]:
# Compute the valid metrics for the use case # Optional: show the classification report 

### Predict on an unknown dataset (Testset)

In [23]:
# Use .predict method # .predict_proba is available just for classification

## Models comparison
Create a pandas dataframe that will allow you to compare your models.

A sample frame is shown below:

|     | Model_Name     | Metric (metric_name)    | Details  |
|:---:|:--------------:|:--------------:|:-----------------:|
| 0   |  -             |  -             | -                 |
| 1   |  -             |  -             | -                 |


You might use the pandas dataframe method `.sort_values()` to sort the dataframe by the metric.
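
A minimal sketch of such a comparison frame follows. The scores are invented placeholders, not results from this notebook, and AUC is an assumed metric name; replace both with the values computed for each model.

```python
import pandas as pd

# Placeholder rows: the metric values are invented for illustration only
results = pd.DataFrame([
    {"Model_Name": "Simple Model #001", "Metric (AUC)": 0.80, "Details": "placeholder"},
    {"Model_Name": "Simple Model #002", "Metric (AUC)": 0.85, "Details": "placeholder"},
])

# Best model first
results_sorted = results.sort_values("Metric (AUC)", ascending=False).reset_index(drop=True)

print(results_sorted)
```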

## Hyperparameters tuning 

Fine-tune the Top-k models (3 < k < 5) using `GridSearchCV` (from `sklearn.model_selection`) to find the best hyperparameters and get the best performance out of each of the Top-k models, then compare them again to select the best one.

In [24]:
# Code here
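
The sketch below shows the shape of a `GridSearchCV` run for one model. The synthetic data, the `LogisticRegression` estimator, the grid over `C`, the `roc_auc` scoring and the 5-fold setting are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the processed training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 3))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=120) < 0).astype(int)

# Candidate values for the regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # assumed metric
    cv=5,               # 5-fold cross-validation on the training data
)
search.fit(X_train, y_train)

# Refitted on the full training data with the best parameters found
best_model = search.best_estimator_

print(search.best_params_, search.best_score_)
```

Repeating this for each shortlisted model and adding the tuned scores to the comparison frame keeps the final selection on a like-for-like basis.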

# Export key components
Here is the section to **export** the important ML objects that will be used to develop an app: *Encoder, Scaler, ColumnTransformer, Model, Pipeline, etc*.

In [25]:
# Use pickle : put all your key components in a python dictionary and save it as a file that will be loaded in an app
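
A minimal sketch of the export step: the fitted objects go into one dictionary, which is pickled and then loaded back to check the round trip. The file name `ml_components.pkl`, the temporary directory and the tiny scaler and model are hypothetical; in the notebook the dictionary would hold the real fitted components.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny fitted objects standing in for the real components (invented data)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, 0, 0])
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Put every object the app needs into one dictionary
components = {"scaler": scaler, "model": model}

# Hypothetical file name; written to a temp directory for this example
export_path = os.path.join(tempfile.gettempdir(), "ml_components.pkl")
with open(export_path, "wb") as f:
    pickle.dump(components, f)

# Load the file back, as the app would
with open(export_path, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded))
```

Pickle files should only be loaded from trusted sources, and the app should run the same library versions that were used to create them.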