<center><img src="https://github.com/girishksahu/INSAID2021/blob/SMART_AI_Learning/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

<h1><center>Machine Learning Foundation Project - Customer Classification for Retail Bank</center></h1>

<center><img width=40% src="https://github.com/girishksahu/INSAID2021-ML-Foundation-Customer_Classification/blob/SMART_AI_Learning/bank-logo.png?raw=true"></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Importing Libraries**](#Section3)<br>
  - **3.1** [**Version Check**](#Section31)
  - **3.2** [**Importing Libraries**](#Section32)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-processing**](#Section5)<br>
  - **5.1** [**Data Pre-profiling**](#Section51)<br>
  - **5.2** [**Data Pre-Processing**](#Section52)<br>
  - **5.3** [**Data Post-profiling**](#Section53)<br>

**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Data Post-Processing**](#Section7)<br>
  - **7.1** [**Data Encoding**](#Section71)<br> 
  - **7.2** [**Data Preparation**](#Section72)<br>
  - **7.3** [**Data Scaling**](#Section73)<br>

**8.** [**Model Development & Evaluation**](#Section8)<br>
**9.** [**Summarization**](#Section9)<br>
  - **9.1** [**Conclusion**](#Section91)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- AE Corp is a retail banking institution.

    - They are going to float a stock trading facility for their existing customers.

    - The idea is to use data to classify whether a customer belongs to a high net worth or low net worth group.

    - They will have to incentivize their customers to adopt their offerings.

    - One way to incentivize is to offer discounts on the commission for trading transactions.


**<h3>Current Scenario:</h3>**

- The company rolled out this service to about 10,000 of its customers, observed their trading behavior for 6 months, and then labeled each customer as revenue grid 1 or 2.

---
<a name = Section2></a>
# **2. Problem Statement**
---

- **The current process suffers from the following problems:**
    - One issue is that only about 10% of the customers do enough trades for earnings after discounts to be profitable.

    - The company wants to identify which customers make up that 10% so that it can offer the discount selectively.


    - The marketing department has hired you as a data science consultant because they want to supplement their campaigns with a more proactive approach.


<a name = Section21></a>
### **Your Role**

- You are given datasets of past customers and their status (Revenue Grid 1 or 2).

- Your task is to build a classification model using the datasets.

     - Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. 
     - You need to build the best possible model.

<a name = Section21></a>
### **Project Deliverables**
- Deliverable: **Predict whether a customer belongs to a high net worth or low net worth group.**

- Machine Learning Task: **Classification.**

- Target Variable: **Status (high net worth (1) / low net worth (2))**
- Win Condition: **N/A (best possible model)**

<a name = Section21></a>
### **Evaluation Metric**

- The model will be evaluated on the F1-score.
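Since only about 10% of customers are in the high net worth class, the F1-score is most informative when computed for that class. The sketch below works the formula by hand; the `y_true` and `y_pred` lists are made-up labels, not project data, and treating class 1 as the positive class is an assumption.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: 1 = high net worth, 2 = low net worth
y_true = [1, 2, 2, 2, 1, 2, 2, 2, 2, 1]
y_pred = [1, 2, 2, 1, 2, 2, 2, 2, 2, 1]

precision, recall, f1 = f1_for_class(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

With scikit-learn, `f1_score(y_true, y_pred, pos_label=1)` should give the same value for these labels.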

<center><img src="https://github.com/girishksahu/INSAID2021-ML-Foundation-Customer_Classification/blob/SMART_AI_Learning/stocks-trade.jpg?raw=true"></center>

---
<a name = Section3></a>
# **3. Importing Libraries**
---

<a name = Section31></a>
### **3.1 Version Check**

In [None]:
from platform import python_version

# Printing version of Python to ensure correct version is used for this project
print("python version", python_version())
#!pip list
#!pip show


<a name = Section32></a>
### **3.2 Importing Libraries**

In [223]:
#------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
from scipy.stats import randint as sp_randint                       # For initializing random integer values
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
import sklearn.metrics
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler                    # To scale data to mean 0 and variance 1
from sklearn.model_selection import RandomizedSearchCV              # To find the best hyperparameter setting for the algorithm
from sklearn.metrics import classification_report                   # To generate classification report
from sklearn.metrics import plot_confusion_matrix                   # To plot confusion matrix
#import pydotplus                                                    # To generate pydot file
from IPython.display import Image                                   # To generate image using pydot file
#-------------------------------------------------------------------------------------------------------------------------------
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.metrics import accuracy_score                          # For calculating the accuracy for the model
from sklearn.metrics import precision_score                         # For calculating the Precision of the model
from sklearn.metrics import recall_score                            # For calculating the recall of the model
from sklearn.metrics import precision_recall_curve                  # For precision and recall metric estimation
from sklearn.metrics import confusion_matrix                        # For verifying model performance using confusion matrix
from sklearn.metrics import f1_score                                # For Checking the F1-Score of our model  
from sklearn.metrics import roc_curve                               # For Roc-Auc metric estimation
from sklearn.metrics import plot_roc_curve                          # To plot the ROC curve (removed in scikit-learn 1.2; this notebook uses 0.24.2)
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To split the data in training and testing part     
from sklearn.linear_model import LogisticRegression                 # To create the Logistic Regression Model
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

# Printing versions of a few key libraries to ensure the correct ones are used
print ("pandas version", pd.__version__)
print ("numpy version", np.__version__)
print ("seaborn version", sns.__version__)
print ("sklearn version", sklearn.__version__)

pandas version 1.1.3
numpy version 1.19.2
seaborn version 0.11.0
sklearn version 0.24.2


---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- The **dataset** contains information about **customers**, along with the column **Revenue_Grid**, which classifies them as high net worth (1) or low net worth (2).


| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 8124 | 32 | 1619 KB | 

<br>

| ID | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **REF_NO**   | Reference Number of the customer                          |
|02| **children**      | Number of children each customer has                 |
|03| **age_band**        | Age Group to which the customer belongs            |
|04| **status**          | Marital Status of the customer                     |
|05| **occupation**      | Job or profession of the customer                  |
|06| **occupation_partner**           | Job or profession of the customer's partner                                  |
|07| **home_status**     | Home Status of the customers |
|08| **family_income**     | Income Range of the customer's family|
|09| **self_employed**        | Whether self-employed or not                                         |
|10| **self_employed_partner**          | Whether the partner self-employed or not                                   |
|11| **year_last_moved**         | Moving Year from the last location of the customer  |
|12| **TVarea**     | Television Region of the customer                                   |
|13| **post_code**     | Postal Code of the customer                                  |
|14| **post_area**     | Postal Area of the customer                                  |
|15| **Average_Credit_Card_Transaction**     | Average Credit Card Transaction per year by the customer           |
|16| **Balance_Transfer**     | Transfer of the Balance in an account to another account by the customer           |
|17| **Term_Deposit**     | Cash Investment Help at Financial Institute provided to the customer                              |
|18| **Life_Insurance**     | Basic Life Insurance Coverage of the customer                                  |
|19| **Medical_Insurance**     | Medical Insurance Coverage of the customer                                  |
|20| **Average_A/C_Balance**     | Average Balance in the account of the customer                                  |
|21| **Personal_Loan**     | Amount of Personal Loan taken by the customer                                  |
|22| **Investment_in_Mutual_Fund**     | Amount Invested in Mutual Funds by the customer                                  |
|23| **Investment_Tax_Saving_Bond**     | Amount Invested in Tax Saving Bond by the customer                                  |
|24| **Home_Loan**     | Amount of Home Loan taken by the customer                                   |
|25| **Online_Purchase_Amount**     | Amount spent by the customer on online purchases                                   |
|26| **gender**     | Gender of the customer                                   |
|27| **region**     | Region of the customer                                   |
|28| **Investment_in_Commodity**     | Amount Invested in Commodity by the customer                                   |
|29| **Investment_in_Equity**     | Amount Invested in Equity by the customer                                   |
|30| **Investment_in_Derivative**     | Amount Invested in Derivatives by the customer                                   |
|31| **Portfolio_Balance**     | Balanced Investment Strategy of the customer                                   |
|32| **Revenue_Grid**     | Grid report of the customers                                   |

- Load AE Corp Retail Bank Customer Data to be used for Training and Validation

In [224]:
# REF_NO is unique ID for customer and can be used as label for index
cust_master_data = pd.read_csv("https://raw.githubusercontent.com/girishksahu/INSAID2021-ML-Foundation-Customer_Classification/SMART_AI_Learning/existing_base_train.csv", index_col='REF_NO')

# Get the dimensions of the data
print('Shape of the Training and Validation dataset:', cust_master_data.shape)

# Output first 5 data rows
cust_master_data.head()


Shape of the Training and Validation dataset: (8124, 31)


REF_NO,children,age_band,status,occupation,occupation_partner,home_status,family_income,self_employed,self_employed_partner,year_last_moved,TVarea,post_code,post_area,Average_Credit_Card_Transaction,Balance_Transfer,Term_Deposit,Life_Insurance,Medical_Insurance,Average_A/C_Balance,Personal_Loan,Investment_in_Mutual_Fund,Investment_Tax_Saving_Bond,Home_Loan,Online_Purchase_Amount,gender,region,Investment_in_Commudity,Investment_in_Equity,Investment_in_Derivative,Portfolio_Balance,Revenue_Grid
5466,2,31-35,Partner,Professional,Professional,Own Home,">=35,000",No,No,1981,Meridian,M51 0GU,M51,26.98,29.99,312.25,299.79,88.72,108.85,175.43,134.35,8.98,55.44,7.68,Female,North West,151.55,81.79,136.02,360.37,2
9091,Zero,45-50,Partner,Secretarial/Admin,Professional,Own Home,">=35,000",No,No,1997,Meridian,L40 2AG,L40,35.98,74.48,0.0,99.96,10.99,48.45,15.99,0.0,0.0,0.0,18.99,Female,North West,44.28,13.91,29.23,89.22,2
9744,1,36-40,Partner,Manual Worker,Manual Worker,Rent Privately,"<22,500, >=20,000",Yes,Yes,1996,HTV,TA19 9PT,TA19,0.0,24.46,0.0,18.44,0.0,0.0,0.02,10.46,0.0,0.0,0.0,Female,South West,8.58,1.75,4.82,14.5,2
10700,2,31-35,Partner,Manual Worker,Manual Worker,Own Home,"<25,000, >=22,500",No,No,1990,Scottish TV,FK2 9NG,FK2,44.99,0.0,0.0,0.0,29.99,0.0,0.0,0.0,0.0,0.0,0.0,Female,Scotland,15.0,0.0,5.0,68.98,2
1987,Zero,55-60,Partner,Housewife,Professional,Own Home,">=35,000",No,No,1989,Yorkshire,LS23 7DJ,LS23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.98,0.0,0.0,0.0,Female,Unknown,0.0,1.66,1.66,1.88,2


In [None]:
# The column name Investment_in_Commudity is misspelled, so rename it for clarity
cust_master_data=cust_master_data.rename(columns={'Investment_in_Commudity':'Investment_in_Commodity'})

- Load AE Corp Retail Bank Customer Test Data to be used for Prediction

In [225]:
# REF_NO is unique ID for customer and can be used as label for index
cust_test_data = pd.read_csv("https://raw.githubusercontent.com/girishksahu/INSAID2021-ML-Foundation-Customer_Classification/SMART_AI_Learning/existing_base_test.csv", index_col='REF_NO')

# Get the dimensions of the data
print('Shape of the Test dataset to be used for Prediction:', cust_test_data.shape)

# Output first 5 data rows
cust_test_data.head()

Shape of the Test dataset to be used for Prediction: (2031, 31)


REF_NO,children,age_band,status,occupation,occupation_partner,home_status,family_income,self_employed,self_employed_partner,year_last_moved,TVarea,post_code,post_area,Average_Credit_Card_Transaction,Balance_Transfer,Term_Deposit,Life_Insurance,Medical_Insurance,Average_A/C_Balance,Personal_Loan,Investment_in_Mutual_Fund,Investment_Tax_Saving_Bond,Home_Loan,Online_Purchase_Amount,gender,region,Investment_in_Commudity,Investment_in_Equity,Investment_in_Derivative,Portfolio_Balance,Revenue_Grid
697,Zero,71+,Partner,Retired,Housewife,Own Home,"<12,500, >=10,000",No,No,1973,Meridian,BH21 2JQ,BH21,41.98,55.47,24.99,29.98,49.98,44.47,0.0,45.97,0.0,0.0,0.0,Male,South West,40.48,15.07,28.4,83.05,2
7897,Zero,31-35,Partner,Unknown,Business Manager,Own Home,">=35,000",No,No,1996,Anglia,CM6 3QS,CM6,0.0,0.0,0.0,99.91,35.42,29.49,170.31,133.88,27.45,13.47,57.46,Male,South East,27.07,72.01,82.74,235.29,1
4729,Zero,71+,Partner,Housewife,Retired,Own Home,"<15,000, >=12,500",No,No,1958,HTV,BA12 9JW,BA12,0.0,154.47,0.0,67.47,0.0,87.83,0.0,107.88,0.0,0.0,0.0,Female,South West,44.39,32.62,43.86,98.38,1
6914,1,22-25,Partner,Other,Other,Own Home,"<17,500, >=15,000",No,No,1993,Grampian,AB22 8SP,AB22,18.98,60.98,0.0,17.99,9.99,0.0,102.43,44.96,0.0,0.0,0.0,Female,Scotland,21.59,24.57,29.23,86.32,1
2795,Zero,65-70,Widowed,Retired,Unknown,Own Home,"< 4,000",No,No,1976,Meridian,PO37 6AD,PO37,0.0,144.9,0.0,123.38,0.0,32.98,0.0,33.96,6.99,17.93,5.45,Female,South East,53.66,16.22,32.89,102.99,1


In [None]:
# The column name Investment_in_Commudity is misspelled, so rename it for clarity
cust_test_data=cust_test_data.rename(columns={'Investment_in_Commudity':'Investment_in_Commodity'})

In [None]:
# Check for any missing values
any(cust_master_data.isna().sum())

In [None]:
# Check whether any rows are duplicated
any(cust_master_data.duplicated())
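The two checks above return a single True/False. When a check does come back True, per-column counts make it easier to see where the problem is. The following is a minimal sketch on a hypothetical toy frame (`toy` stands in for `cust_master_data`; its values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cust_master_data
toy = pd.DataFrame({
    "children": ["Zero", "1", None, "Zero"],
    "Term_Deposit": [0.0, 12.5, np.nan, 0.0],
    "gender": ["Female", "Male", "Male", "Female"],
})

missing_per_column = toy.isna().sum()          # number of missing cells in each column
has_missing = bool(toy.isna().any().any())     # True if any cell anywhere is missing
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column[missing_per_column > 0])
print(has_missing, duplicate_rows)
```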

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [None]:
# year_last_moved is not meaningful for summary statistics, so only its count is checked
# Revenue_Grid is the target variable with two categories: 1 (High Net Worth) and 2 (Low Net Worth)
cust_master_data.describe(include=[np.int64]).filter(items=['count'], axis=0)

In [None]:
#Basic statistical details for numeric variables
cust_master_data.describe(include=[np.float64])

**Observations:**
- The count is 8124 for every numeric variable, which confirms that there are no missing values.
- Most variables have outliers, and their mean is higher than their median.
- Variables such as **Personal_Loan**, **Average_Credit_Card_Transaction**, **Term_Deposit** and **Online_Purchase_Amount** appear to be highly skewed.
- Variables such as **Investment_in_Commodity**, **Investment_in_Equity**, **Investment_in_Derivative** and **Portfolio_Balance** also do not appear to be normally distributed.
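The outlier observation can be made concrete with the common 1.5 x IQR rule. This is only a sketch: the helper name `iqr_outlier_count`, the factor `k=1.5` and the toy values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def iqr_outlier_count(series, k=1.5):
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# Invented values with one very large entry, mimicking a right-skewed amount column
toy = pd.Series([0.0, 0.0, 5.0, 10.0, 12.0, 15.0, 20.0, 400.0])

toy_mean, toy_median = toy.mean(), toy.median()
n_outliers = iqr_outlier_count(toy)
print(toy_mean, toy_median, n_outliers)
```

On the real data the same helper could be applied column by column, for example `cust_master_data.select_dtypes(include="number").apply(iqr_outlier_count)`.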

In [None]:
#Basic statistical details for categorical variables
# np.object was only an alias for the built-in object and is removed in newer NumPy releases
cust_master_data.describe(include=[object])

**Observations:**
- The count is 8124 for every categorical variable, which confirms that there are no missing values.
- The most common customer profile has **0 children**, falls in the **age_band of 45-50** and has **Partner as status**.
- The most common **occupation** is **Professional**, and the most common **home_status** is **Own Home**.
- The most common **family_income** category is **>=35,000**, and the most common **gender** is **Female**.
- The most common **TVarea** is **Central**, the most common **post_area** is **PR5**, and the most common **region** is **South East**.
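`describe` reports only the top category and its raw count. To see how dominant that category is, the share can be computed with `value_counts(normalize=True)`. The sketch below uses a hypothetical toy frame with invented rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for the categorical columns
toy = pd.DataFrame({
    "children": ["Zero", "Zero", "1", "2", "Zero"],
    "gender": ["Female", "Male", "Female", "Female", "Male"],
})

rows = []
for col in toy.columns:
    shares = toy[col].value_counts(normalize=True)   # sorted, most frequent first
    rows.append({
        "feature": col,
        "top_category": shares.index[0],
        "share": round(float(shares.iloc[0]), 3),
    })

summary = pd.DataFrame(rows).set_index("feature")
print(summary)
```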

In [None]:
# Check any duplicate rows
cust_master_data.duplicated().sum()

In [None]:
# Columns list can be handy and useful for further steps
cust_master_data.columns

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of the features**.

In [None]:
# Check column data types and any null values
cust_master_data.info()

**Observations:**

- The non-null count is 8124 for every column, which means there are no missing values.

- There are **14 categorical features** and **15 numerical (float) features**; 3 columns are of integer type because they contain whole-number values.

In [None]:
# Get list of categorical variables
s = (cust_master_data.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

In [None]:
# Get list of integer-typed numerical variables
s = (cust_master_data.dtypes == 'int64')
numeric_cols = list(s[s].index)

print("Numeric variables INT:")
print(numeric_cols)

In [None]:
# Get list of float-typed numerical variables (this overwrites numeric_cols from the previous cell)
s = (cust_master_data.dtypes == 'float64')
numeric_cols = list(s[s].index)

print("Numeric variables Float:")
print(numeric_cols)
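The three cells above filter `dtypes` by hand. As an alternative sketch, `DataFrame.select_dtypes` does the same split and keeps each list under its own name. The toy frame and the list names below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind
toy = pd.DataFrame({
    "age_band": ["31-35", "45-50"],
    "year_last_moved": [1981, 1997],
    "Portfolio_Balance": [360.37, 89.22],
    "Revenue_Grid": [2, 2],
})

categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()  # non-numeric columns
integer_cols = toy.select_dtypes(include=["int64"]).columns.tolist()     # whole-number columns
float_cols = toy.select_dtypes(include=["float64"]).columns.tolist()     # decimal columns

print(categorical_cols, integer_cols, float_cols)
```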

### **Numerical Data Distribution:**

- We shall plot all **numerical features to analyze the distribution** of the past data.

In [None]:
# randint is used to pick a random colour for each subplot
from random import randint

# The 16 numerical columns to plot, one per subplot
columns=['year_last_moved','Average_Credit_Card_Transaction', 'Balance_Transfer', 'Term_Deposit', 'Life_Insurance', 'Medical_Insurance', 'Average_A/C_Balance', 'Personal_Loan', 'Investment_in_Mutual_Fund', 'Investment_Tax_Saving_Bond', 'Home_Loan', 'Online_Purchase_Amount', 'Investment_in_Commodity', 'Investment_in_Equity', 'Investment_in_Derivative', 'Portfolio_Balance']

fig, axes = plt.subplots(nrows = 4, ncols = 4, sharex = False, figsize=(20, 15))

# One random hex colour per column
colors = ['#%06X' % randint(0, 0xFFFFFF) for _ in columns]

for ax, col, color in zip(axes.flat, columns, colors):
  # distplot is deprecated as of seaborn 0.11 but is still available in that release
  sns.distplot(a = cust_master_data[col], bins = 50, ax = ax, color = color)
  ax.set_title(col)
  ax.grid(False)

plt.setp(axes, yticks=[])   # Hide y-axis ticks on every subplot
plt.tight_layout()
plt.show()

**Observation:**

- **Positively Skewed Features: (Mean > Median)**
 - Average_Credit_Card_Transaction
 - Balance_Transfer
 - Term_Deposit
 - Life_Insurance
 - Medical_Insurance
 - Average_A/C_Balance
 - Personal_Loan
 - Investment_in_Mutual_Fund
 - Investment_Tax_Saving_Bond
 - Home_Loan
 - Online_Purchase_Amount
 - Investment_in_Commodity
 - Investment_in_Equity
 - Investment_in_Derivative
 - Portfolio_Balance
- **Negatively Skewed Features: (Mean < Median)**
 - year_last_moved
- **~ Normally Distributed Features: (Mean = Median = Mode)**
 - None
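The mean-versus-median comparison used above is a quick heuristic. The sample skewness from `DataFrame.skew()` measures the same thing more directly. Below is a small sketch on invented values; the two toy columns only borrow names from the dataset.

```python
import pandas as pd

# Invented values: one column with a long right tail, one with a long left tail
toy = pd.DataFrame({
    "Term_Deposit": [0.0, 0.0, 0.0, 5.0, 10.0, 300.0],
    "year_last_moved": [1950, 1985, 1990, 1994, 1996, 1998],
})

report = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),   # sample skewness per column
})
report["direction"] = ["positive" if s > 0 else "negative" for s in report["skew"]]
print(report)
```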

### **Categorical Data Distribution:**

- We shall plot all **categorical features to analyze the distribution** of the past data.


In [None]:
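# 14 categorical columns on a 4 x 4 grid, so the last two axes stay empty
fig, axes = plt.subplots(nrows = 4, ncols = 4, sharex = False, figsize=(20, 12))

# One random hex colour per categorical column (randint was imported in the previous plotting cell)
colors = ['#%06X' % randint(0, 0xFFFFFF) for _ in object_cols]

for ax, col, color in zip(axes.flat, object_cols, colors):
  counts = cust_master_data[col].value_counts()       # Frequency of each category
  ax.bar(x = counts.index, height = counts, color = color)
  ax.set_title(col)
  ax.set_xlabel(' ')
  ax.tick_params(axis = 'x', labelbottom = False)     # Hide category labels to avoid clutter
  ax.grid(True)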

**Observation:**

- **Normal Distributed Features: (Mean = Median = Mode)**
 - age_band
 - post_code
 - post_area
- **Positively Skewed Features: (Mean > Median)**
 - children
 - status
 - occupation
 - occupation_partner
 - home_status
 - family_income
 - self_employed
 - self_employed_partner
 - TVarea
 - gender
 - region
- **Negatively Skewed Features: (Mean < Median)**
 - NA

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Data Pre-Profiling**

- For **quick analysis** pandas profiling is very handy.

- Generates profile reports from a pandas DataFrame.

- For each column **statistics** are presented in an interactive HTML report.

In [None]:
# profile = ProfileReport(df=cust_master_data)
# profile.to_file(output_file='Customer-classification Pre Profiling Report.html')
# print('Accomplished!')

**Observations:**

- The report shows a **total** of **32 features**: **17** are **numerical**, **13** are **categorical**, and **2** are reported as **boolean**, although these are categorical too.

- Only **860** of the **8124** customers are High Net Worth.

- **Home ownership** is high: **7506** customers have Own Home as their home status.

- The number of **self-employed** customers is very low.

- **2014** customers are in the **high family income** group, while **3154** are in the **low family income** group.

- There are no missing values

- For detailed information, please check **Customer-classification Pre Profiling Report.html** file.

<a name = Section52></a>
### **5.2 Data Pre-Processing**

- There are no missing values and the data appears clean so far, but a few features need pre-processing before EDA and model evaluation.

In [None]:
# year_last_moved feature is not required for EDA and Model Evaluation
# post_area and post_code feature can be removed
cust_master_data.drop(columns=['year_last_moved','post_area','post_code'], inplace=True)

# year_last_moved, post_area and post_code feature need to be removed in Test dataset
cust_test_data.drop(columns=['year_last_moved','post_area','post_code'], inplace=True)
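Because the same columns must be removed from both datasets, it can help to keep the list in one place and confirm afterwards that the two frames still line up. This is a sketch only; `DROP_COLS`, `drop_unused` and the one-row toy frames are hypothetical names and data.

```python
import pandas as pd

# Columns the notebook removes before EDA and modelling
DROP_COLS = ["year_last_moved", "post_area", "post_code"]


def drop_unused(df, cols=DROP_COLS):
    """Return a copy of df without the listed columns."""
    return df.drop(columns=cols)


# Hypothetical one-row stand-ins for the training and test frames
train = pd.DataFrame({"year_last_moved": [1981], "post_area": ["M51"],
                      "post_code": ["M51 0GU"], "gender": ["Female"], "Revenue_Grid": [2]})
test = pd.DataFrame({"year_last_moved": [1973], "post_area": ["BH21"],
                     "post_code": ["BH21 2JQ"], "gender": ["Male"], "Revenue_Grid": [2]})

train, test = drop_unused(train), drop_unused(test)

# Both frames should expose identical columns in the same order
same_columns = list(train.columns) == list(test.columns)
print(list(train.columns), same_columns)
```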

In [None]:
cust_master_data.head(10)

In [None]:
cust_master_data.info()

<a name = Section53></a>
### **5.3 Data Post-Profiling**

- We can run the report to get latest information

In [None]:
# post_profile=ProfileReport(df=cust_master_data)
# post_profile.to_file(output_file='Customer-classification Post Profiling Report.html')
# print('Accomplished!')

**Observations:**

- year_last_moved, post_area and post_code are dropped because they will not be used for model evaluation.

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---

**NOTE**:  

- Exploratory Data Analysis will explore all the features and their relationships with other features.
- Both non-graphical and graphical methods will be used, as applicable to each feature.
- Both univariate and bivariate methods will be used, as applicable to each feature.

**Q: What is the breakdown of Customers having children?**

In [None]:
cust_master_data['children'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(x='children',data=cust_master_data,order=cust_master_data['children'].value_counts().index)

In [None]:
cust_master_data['children'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(10,10), cmap='inferno', legend=True)
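The pie charts in this section pass a hard-coded `explode` list whose length has to match the number of categories exactly, otherwise matplotlib raises an error. A small sketch of deriving the list from the data instead is shown below; `explode_offsets` is a hypothetical helper and the toy series is invented.

```python
import pandas as pd


def explode_offsets(counts, gap=0.05):
    """Return one equal offset per category so the list always matches the data."""
    return [gap] * len(counts)


# Invented values standing in for cust_master_data['children']
toy = pd.Series(["Zero", "Zero", "1", "2", "3", "4+", "1"], name="children")

counts = toy.value_counts()
explode = explode_offsets(counts)
print(len(counts), explode)

# The list can then be passed to the plot call, for example:
# counts.plot(kind='pie', explode=explode, autopct='%.2f%%', wedgeprops=dict(width=0.15))
```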

**Q: What is the breakdown of Customers across age band?**

In [None]:
cust_master_data['age_band'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(x='age_band',data=cust_master_data,order=cust_master_data['age_band'].value_counts().index)

In [None]:
cust_master_data['age_band'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(15,15), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across marital status?**

In [None]:
cust_master_data['status'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(x='status',data=cust_master_data,order=cust_master_data['status'].value_counts().index)

In [None]:
cust_master_data['status'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(10,10), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across Occupation?**

In [None]:
cust_master_data['occupation'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(x='occupation',data=cust_master_data,order=cust_master_data['occupation'].value_counts().index)

In [None]:
cust_master_data['occupation'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(10,10), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across Occupation Partner?**

In [None]:
cust_master_data['occupation_partner'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(x='occupation_partner',data=cust_master_data,order=cust_master_data['occupation_partner'].value_counts().index)

**Q: What is the breakdown of Customers across Home Status?**

In [None]:
cust_master_data['home_status'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(x='home_status',data=cust_master_data,order=cust_master_data['home_status'].value_counts().index)

In [None]:
cust_master_data['home_status'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(13,13), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across Family Income?**

In [None]:
cust_master_data['family_income'].value_counts()

In [None]:
plt.figure(figsize=(15,12))
sns.countplot(y='family_income',data=cust_master_data,order=cust_master_data['family_income'].value_counts().index)

In [None]:
cust_master_data['family_income'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(15,15), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across Self Employed?**

In [None]:
cust_master_data['self_employed'].value_counts()

In [None]:
plt.figure(figsize=(12,12))
sns.countplot(x='self_employed',data=cust_master_data,order=cust_master_data['self_employed'].value_counts().index)

**Q: What is the breakdown of Customers across Self Employed Partner?**

In [None]:
cust_master_data['self_employed_partner'].value_counts()

In [None]:
plt.figure(figsize=(12,12))
sns.countplot(x='self_employed_partner',data=cust_master_data,order=cust_master_data['self_employed_partner'].value_counts().index)

**Q: What is the breakdown of Customers across TV Area?**

In [None]:
cust_master_data['TVarea'].value_counts()

In [None]:
plt.figure(figsize=(12,12))
sns.countplot(y='TVarea',data=cust_master_data,order=cust_master_data['TVarea'].value_counts().index)

In [None]:
cust_master_data['TVarea'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(15,15), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across Region?**

In [None]:
cust_master_data['region'].value_counts()

In [None]:
plt.figure(figsize=(12,12))
sns.countplot(y='region',data=cust_master_data,order=cust_master_data['region'].value_counts().index)

In [None]:
cust_master_data['region'].value_counts().plot(kind='pie', explode=[0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05], fontsize=14, autopct='%.2f%%', wedgeprops=dict(width=0.15), 

                                       shadow=True, startangle=160, figsize=(15,15), cmap='inferno', legend=True)

**Q: What is the breakdown of Customers across Gender?**

In [None]:
cust_master_data['gender'].value_counts()

In [None]:
plt.figure(figsize=(12,12))
sns.countplot(x='gender',data=cust_master_data,order=cust_master_data['gender'].value_counts().index)

**Q: What is the breakdown of Customers across Revenue Grid?**

In [None]:
cust_master_data['Revenue_Grid'].value_counts()

In [None]:
plt.figure(figsize=(12,12))
sns.countplot(x='Revenue_Grid',data=cust_master_data,order=cust_master_data['Revenue_Grid'].value_counts().index)

**Q: What is the percentage breakdown of Customers across Revenue Grid?**

In [None]:
print('Customers who are High Net Worth:', cust_master_data['Revenue_Grid'].value_counts()[1])
print('Customers who are Low Net Worth:', cust_master_data['Revenue_Grid'].value_counts()[2])

space = np.ones(2)/10
cust_master_data['Revenue_Grid'].value_counts().plot(kind = 'pie', explode = space, fontsize = 14, autopct = '%3.1f%%', wedgeprops = dict(width=0.15), 
                                    shadow = True, startangle = 160, figsize = [13.66, 7.68], legend = True)
plt.legend(['Low Net Worth', 'High Net Worth'])
plt.ylabel('Revenue Grid', size = 14)
plt.title('Proportion of Net Worth customers', size = 16)
plt.show()
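
The same percentage breakdown can also be read off as a small table rather than from the pie chart. The sketch below is only illustrative: `revenue_grid` is a hypothetical stand-in for `cust_master_data['Revenue_Grid']`, and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for cust_master_data['Revenue_Grid']
# (1 = High Net Worth, 2 = Low Net Worth); the values are invented.
revenue_grid = pd.Series([2, 2, 2, 1, 2, 2, 2, 2, 1, 2], name="Revenue_Grid")

# Absolute counts per class
counts = revenue_grid.value_counts()

# Relative frequencies per class, expressed as percentages
shares = revenue_grid.value_counts(normalize=True).mul(100).round(1)

# Both Series are indexed by class label, so they align into one table
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

A table like this makes the class imbalance explicit, which is the reason F1 Score is more informative than accuracy for this problem.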

**The pair plots below show the distributions of the numeric features, split by Revenue_Grid, to give different views of the data**

In [None]:
# sns.pairplot is a figure-level function and creates its own figure, so no plt.figure() call is needed
sns.pairplot(cust_master_data[['Portfolio_Balance','Investment_in_Commodity','Investment_in_Equity','Investment_in_Derivative','Investment_in_Mutual_Fund', 'Investment_Tax_Saving_Bond','Revenue_Grid']],palette='rainbow',diag_kind='kde',hue="Revenue_Grid")

In [None]:
# sns.pairplot is a figure-level function and creates its own figure, so no plt.figure() call is needed
sns.pairplot(cust_master_data[['Term_Deposit','Life_Insurance','Medical_Insurance','Personal_Loan','Home_Loan','Revenue_Grid']],palette='rainbow',diag_kind='kde',hue="Revenue_Grid")

In [None]:
# sns.pairplot is a figure-level function and creates its own figure, so no plt.figure() call is needed
sns.pairplot(cust_master_data[['Online_Purchase_Amount','Balance_Transfer','Average_A/C_Balance','Average_Credit_Card_Transaction','Revenue_Grid']],palette='rainbow',diag_kind='kde',hue="Revenue_Grid")

**The plots below relate the numeric features to Revenue_Grid, broken down by categorical features, to give different views of the data**

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="region", style="region",
#     kind="scatter", size="region"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="TVarea", style="TVarea",
#     kind="scatter", size="TVarea"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="occupation", style="occupation",
#     kind="scatter", size="occupation"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="gender", style="gender",
#     kind="scatter", size="gender"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="status", style="status",
#     kind="scatter", size="status"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="age_band", style="age_band",
#     kind="scatter", size="age_band"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="home_status", style="home_status",
#     kind="scatter", size="home_status"
# )

In [None]:
# sns.relplot(
#     data=cust_master_data, x="Portfolio_Balance", y="Average_A/C_Balance",
#     col="Revenue_Grid", hue="family_income", style="family_income",
#     kind="scatter", size="family_income"
# )

**Q: What is the breakdown of Customers across Revenue Grid for various numerical features?**

In [None]:
# Plot the three means as grouped bars; separate .plot() calls in one cell draw on the same axes and hide each other
cust_master_data.groupby(['Revenue_Grid'])[['Portfolio_Balance','Balance_Transfer','Average_A/C_Balance']].mean().plot(kind='bar', figsize=(15, 7), color=['orange','red','black'])

In [None]:
cust_master_data.groupby(['Revenue_Grid'])[['Average_Credit_Card_Transaction','Online_Purchase_Amount','Home_Loan', 'Personal_Loan']].mean().plot(kind='bar', figsize=(15, 7), color=['orange','blue','red','yellow'])

In [None]:
cust_master_data.groupby(['Revenue_Grid'])[['Investment_in_Commodity','Investment_in_Equity','Investment_in_Derivative', 'Portfolio_Balance']].mean().plot(kind='bar', figsize=(15, 7), color=['orange','yellow','red','blue'])

In [None]:
# Check the linear correlation between the numerical features and the target variable.
# Correlation measures how strongly two random variables are related to each other.
# Pearson correlation ranges between -1 and +1.
# Restrict to numeric columns: recent pandas versions raise an error if categorical columns are passed to corr().
corr = cust_master_data.select_dtypes(include='number').corr(method='pearson')
plt.figure(figsize=(15,15))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between Numerical features and target variable')

**Observations:**

- Only **Personal_Loan** has a positive correlation with **Revenue_Grid**.
- **Personal_Loan** has a relatively high positive correlation with **Portfolio_Balance** and **Investment_in_Equity**.
- **Portfolio_Balance** has a strong positive correlation with **Investment_in_Derivative**, **Investment_in_Commodity**, **Investment_in_Equity** and **Life_Insurance**.
- **Life_Insurance** has a strong positive correlation with **Investment_in_Derivative** and **Investment_in_Commodity**.
- **Balance_Transfer** has a relatively high positive correlation with **Portfolio_Balance**, **Investment_in_Commodity** and **Investment_in_Derivative**.
- **Term_Deposit** has a relatively high positive correlation with **Investment_in_Commodity**.
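
Observations like these are easier to check when the target column of the correlation matrix is pulled out and ranked by absolute value. The sketch below uses a small synthetic frame named `toy` (an assumption for illustration, not the project data); only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for cust_master_data; all values are invented.
rng = np.random.default_rng(0)
n = 200
portfolio = rng.normal(100, 20, n)
toy = pd.DataFrame({
    "Portfolio_Balance": portfolio,
    "Investment_in_Equity": portfolio * 0.5 + rng.normal(0, 5, n),
    "Personal_Loan": rng.normal(20, 5, n),
    "gender": rng.choice(["Male", "Female"], n),  # non-numeric, must be excluded
})
# Toy target: 1 (High Net Worth) when the portfolio balance is large, else 2
toy["Revenue_Grid"] = np.where(toy["Portfolio_Balance"] > 110, 1, 2)

# Pearson correlation on numeric columns only
corr = toy.select_dtypes(include="number").corr(method="pearson")

# Correlation of every feature with the target, strongest first
target_corr = (
    corr["Revenue_Grid"]
    .drop("Revenue_Grid")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Because High Net Worth is coded as 1 and Low Net Worth as 2, a negative correlation with **Revenue_Grid** means that larger feature values go with the High Net Worth class.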

In [None]:
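# Covariance indicates the extent to which two random variables change in tandem.
# Covariance can vary between -∞ and +∞.
# Restrict to numeric columns, as covariance is undefined for categorical features.
cust_master_data.select_dtypes(include='number').cov()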

- Below are a few plots to examine how the numerical features are distributed with respect to the target variable

In [None]:
# getting y
y= cust_master_data.Revenue_Grid

In [None]:
# data = cust_master_data[numeric_cols]
# data_n_2 = (data - data.mean()) / (data.std())   
# data2 = pd.concat([y,data_n_2.iloc[:,0:5]],axis=1)
# data3 = pd.melt(data2,id_vars="Revenue_Grid",
#                     var_name="features",
#                     value_name='value')
# plt.figure(figsize=(20,10))
# sns.violinplot(x="features", y="value", hue="Revenue_Grid", data=data3, split=True, inner="quartile", scale="area", palette="Set2")
# plt.xticks(rotation=90)

In [None]:
# plt.figure(figsize=(20,10))
# sns.boxplot(x="features", y="value", hue="Revenue_Grid", data=data3)
# plt.xticks(rotation=90)

In [None]:
# data = cust_master_data[numeric_cols]
# data_n_2 = (data - data.mean()) / (data.std())   
# data2 = pd.concat([y,data_n_2.iloc[:,5:10]],axis=1)
# data3 = pd.melt(data2,id_vars="Revenue_Grid",
#                     var_name="features",
#                     value_name='value')
# plt.figure(figsize=(20,10))
# sns.violinplot(x="features", y="value", hue="Revenue_Grid", data=data3, split=True, inner="quartile", scale="area", palette="Set2")
# plt.xticks(rotation=90)

In [None]:
# plt.figure(figsize=(20,10))
# sns.boxplot(x="features", y="value", hue="Revenue_Grid", data=data3)
# plt.xticks(rotation=90)

In [None]:
# data = cust_master_data[numeric_cols]
# data_n_2 = (data - data.mean()) / (data.std())   
# data2 = pd.concat([y,data_n_2.iloc[:,10:15]],axis=1)
# data3 = pd.melt(data2,id_vars="Revenue_Grid",
#                     var_name="features",
#                     value_name='value')
# plt.figure(figsize=(20,10))
# sns.violinplot(x="features", y="value", hue="Revenue_Grid", data=data3, split=True, inner="quartile", scale="area", palette="Set2")
# plt.xticks(rotation=90)

In [None]:
# plt.figure(figsize=(20,10))
# sns.boxplot(x="features", y="value", hue="Revenue_Grid", data=data3)
# plt.xticks(rotation=90)

<a name = Section7></a>

---
# **7. Data Post-Processing**
---



<a name = Section71></a>
### **7.1 Data Encoding**

- In this section, we will one-hot encode the categorical features and combine them with the numerical features into a single feature matrix

In [None]:
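# Apply one-hot encoding to the training and validation dataset
# Note: get_feature_names was removed in scikit-learn 1.2; on newer versions use get_feature_names_out instead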
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
categorical = cust_master_data.loc[:, ['children','age_band', 'status', 'occupation', 'occupation_partner', 'home_status', 'family_income', 'self_employed', 'self_employed_partner', 'TVarea', 'gender','region']]
encoded_data = ohe.fit_transform(categorical)
cols = ohe.get_feature_names(['children','age_band', 'status', 'occupation', 'occupation_partner', 'home_status', 'family_income', 'self_employed', 'self_employed_partner', 'TVarea', 'gender','region'])
encoded_features = pd.DataFrame(encoded_data.todense(), columns=cols)

In [None]:
encoded_features.shape

In [None]:
encoded_features.head(10)

In [None]:
encoded_features.info()

In [None]:
# Apply the encoder already fitted on the training data to the test dataset (transform only, no refit)
test_categorical = cust_test_data.loc[:, ['children','age_band', 'status', 'occupation', 'occupation_partner', 'home_status', 'family_income', 'self_employed', 'self_employed_partner', 'TVarea', 'gender','region']]
encoded_test_data = ohe.transform(test_categorical)
test_cols = ohe.get_feature_names(['children','age_band', 'status', 'occupation', 'occupation_partner', 'home_status', 'family_income', 'self_employed', 'self_employed_partner', 'TVarea', 'gender','region'])
encoded_test_features = pd.DataFrame(encoded_test_data.todense(), columns=test_cols)

In [None]:
encoded_test_features.shape

In [None]:
encoded_test_features.info()

In [None]:
encoded_test_features.head(10)
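
Fitting the encoder once and only transforming the test data keeps the encoded columns identical even when a category is missing from, or new in, the test set. The sketch below illustrates this on made-up data; the frames `train_cat` and `test_cat` and the category values are hypothetical, and `handle_unknown="ignore"` is an optional setting that the notebook itself does not use.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test categories (invented values)
train_cat = pd.DataFrame({"region": ["North", "South", "Wales", "North"]})
test_cat = pd.DataFrame({"region": ["South", "Isle of Man"]})  # "Isle of Man" is unseen

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_cat)

# Columns follow the categories learned from the training data: North, South, Wales
encoded_test = ohe.transform(test_cat).toarray()
print(encoded_test)

# Quick diagnostics for category mismatches between the two datasets
missing_in_test = set(train_cat["region"]) - set(test_cat["region"])
unseen_in_test = set(test_cat["region"]) - set(train_cat["region"])
print("In train only:", missing_in_test)
print("In test only:", unseen_in_test)
```

With `pd.get_dummies` applied separately to each dataset, a category absent from one of them would produce a different number of columns, which is the mismatch this approach avoids.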

In [None]:
numerical_features = cust_master_data[['Average_Credit_Card_Transaction', 'Balance_Transfer', 'Term_Deposit', 'Life_Insurance', 'Medical_Insurance', 'Average_A/C_Balance', 'Personal_Loan', 'Investment_in_Mutual_Fund', 'Investment_Tax_Saving_Bond', 'Home_Loan', 'Online_Purchase_Amount', 'Investment_in_Commodity', 'Investment_in_Equity', 'Investment_in_Derivative', 'Portfolio_Balance']]
numerical_features.head()

In [None]:
numerical_features.shape

In [None]:
numerical_features_reset=numerical_features.reset_index()
numerical_features_reset.head()

In [None]:
test_numerical_features = cust_test_data[['Average_Credit_Card_Transaction', 'Balance_Transfer', 'Term_Deposit', 'Life_Insurance', 'Medical_Insurance', 'Average_A/C_Balance', 'Personal_Loan', 'Investment_in_Mutual_Fund', 'Investment_Tax_Saving_Bond', 'Home_Loan', 'Online_Purchase_Amount', 'Investment_in_Commodity', 'Investment_in_Equity', 'Investment_in_Derivative', 'Portfolio_Balance']]
test_numerical_features.head()

In [None]:
test_numerical_features.shape

In [None]:
test_numerical_features_reset=test_numerical_features.reset_index()
test_numerical_features_reset.head()

In [None]:
finalX = pd.merge(encoded_features,numerical_features_reset,left_index=True,right_index=True, how="inner")
finalX.head()

In [None]:
finalX.shape

In [None]:
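# set_index returns a new DataFrame, so assign it back; otherwise REF_NO stays in finalX as a feature
finalX = finalX.set_index('REF_NO')
finalX.head()

In [None]:
test_finalX = pd.merge(encoded_test_features,test_numerical_features_reset,left_index=True,right_index=True, how="inner")
test_finalX.head()

In [None]:
test_finalX.shape

In [None]:
# Same treatment for the test data so that both feature matrices have identical columns
test_finalX = test_finalX.set_index('REF_NO')
test_finalX.head()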
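
Note that `DataFrame.set_index` returns a new DataFrame rather than modifying the original, so its result has to be assigned for `REF_NO` to leave the feature columns. The sketch below shows the effect on a tiny made-up example; the frames `numeric` and `encoded` are hypothetical stand-ins for the numerical and one-hot encoded features.

```python
import pandas as pd

# Hypothetical numerical features, indexed by REF_NO (invented values)
numeric = pd.DataFrame(
    {"Portfolio_Balance": [10.0, 250.0, 30.0]},
    index=pd.Index([101, 102, 103], name="REF_NO"),
)

# Hypothetical encoder output: it carries a plain positional index 0..n-1
encoded = pd.DataFrame(
    {"gender_Female": [0.0, 1.0, 1.0], "gender_Male": [1.0, 0.0, 0.0]}
)

# reset_index turns REF_NO into a column so both frames share positional indices
numeric_reset = numeric.reset_index()
merged = pd.merge(encoded, numeric_reset, left_index=True, right_index=True, how="inner")

# Calling set_index without assignment leaves `merged` unchanged
merged.set_index("REF_NO")
still_has_ref_no = "REF_NO" in merged.columns

# Assigning the result moves REF_NO out of the feature columns
final = merged.set_index("REF_NO")
print(still_has_ref_no, list(final.columns))
```

If `REF_NO` is left in as a column, the model treats the customer reference number as a numeric feature, which carries no predictive meaning.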

<a name = Section72></a>
### **7.2 Data Preparation**

- Now we will **split** our **data** into **training** and **validation** sets using the holdout validation technique, with `finalX` holding the **independent** variables and `y` the **dependent** variable.

In [None]:
# Split the data into training and validation sets, holding out 25% for validation
X_train, X_test, y_train, y_test = train_test_split(finalX, y, test_size=0.25, random_state=42, stratify=y)

# Display the shape of training and testing data
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
X_train.info()
#X_train.columns
X_train.head()

<a name = Section73></a>
### **7.3 Data Scaling**

- Several scaling options (RobustScaler, StandardScaler and MinMaxScaler) can be tried here to find the best one
- After trying the different options, RobustScaler is used below, as it gave the best F1 Score on this dataset
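
One way to compare the scaling options is to fit the same model behind each scaler and record the F1 Score on the held-out data. The sketch below is a minimal illustration under stated assumptions: the data comes from `make_classification` rather than the bank dataset, the labels are remapped to 1 and 2 to mirror **Revenue_Grid**, and the `scalers` and `results` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic, imbalanced binary data (not the project data)
X, y01 = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1],
                             random_state=42)
y = np.where(y01 == 1, 1, 2)  # minority class -> 1, majority class -> 2

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scalers = {
    "none": None,
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
}

results = {}
for name, scaler in scalers.items():
    # The pipeline fits the scaler on the training split only,
    # then applies the same transformation to the validation split
    steps = [scaler] if scaler is not None else []
    model = make_pipeline(*steps, LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    results[name] = f1_score(y_val, model.predict(X_val), pos_label=1)

for name, score in results.items():
    print(f"{name:>8}: F1 = {score:.3f}")
```

Which scaler comes out ahead depends on the data, so the ranking on this synthetic set says nothing about the ranking on the bank dataset.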

In [None]:
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler_rbs = RobustScaler()
X_train_rbs = scaler_rbs.fit_transform(X_train)
X_test_rbs = scaler_rbs.transform(X_test)

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

- In this section we will **develop a Logistic Regression model**

- Then we will **analyze the results** obtained and **make our observations**.

- For **evaluation purposes** we will **focus** on the **F1 Score**, as required by this project.
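
As a reminder of what the metric measures, the F1 Score is the harmonic mean of precision and recall for one chosen positive class. The sketch below computes it by hand and with scikit-learn on a small invented example; `y_true` and `y_pred` are hypothetical labels that use the same 1/2 coding as **Revenue_Grid**.

```python
from sklearn.metrics import f1_score

# Invented labels: 1 = High Net Worth, 2 = Low Net Worth
y_true = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
y_pred = [1, 1, 2, 1, 2, 2, 2, 2, 2, 2]

# Counts with class 1 treated as the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly predicted 1
fp = sum(t == 2 and p == 1 for t, p in zip(y_true, y_pred))  # predicted 1, actually 2
fn = sum(t == 1 and p == 2 for t, p in zip(y_true, y_pred))  # predicted 2, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# scikit-learn scores the class given by pos_label (default 1)
f1_high = f1_score(y_true, y_pred, pos_label=1)
f1_low = f1_score(y_true, y_pred, pos_label=2)

print(f"manual F1 (class 1): {f1_manual:.3f}")
print(f"sklearn F1 (class 1): {f1_high:.3f}")
print(f"sklearn F1 (class 2): {f1_low:.3f}")
```

Accuracy on this example is 80%, yet the F1 Score for class 1 is only about 0.67, which shows why the choice of positive class matters on imbalanced data.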

<a name = Section81></a>
### **8.1 Baseline Model Development & Evaluation**

- Here we will develop a Logistic Regression classification model with its default settings.

In [None]:
# Instantiate a Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train_rbs,y_train)

# Predicting training and testing labels
y_train_pred_count = logreg.predict(X_train_rbs)
y_test_pred_count = logreg.predict(X_test_rbs)


In [None]:
#Print confusion matrix for Test Validation Data
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_test_pred_count,labels=logreg.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=logreg.classes_)
fig, ax = plt.subplots(figsize=(8,8))

disp.plot(cmap='cividis', values_format='d', ax=ax)
ax.set_xlabel('Predicted Value');ax.set_ylabel('Actual Value'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['High Net Worth(1)', 'Low Net Worth(2)']); ax.yaxis.set_ticklabels(['High Net Worth(1)', 'Low Net Worth(2)']);

In [None]:
print(' Accuracy score for test validation data is:', accuracy_score(y_test,y_test_pred_count),
      '\n','#########################################################','\n'
   ' Precision score for test validation data is :', precision_score(y_test,y_test_pred_count),'\n',
      'Recall score for test validation data is :', recall_score(y_test,y_test_pred_count),'\n',
      '#########################################################','\n',
      'F1 score for test validation data is :', f1_score(y_test,y_test_pred_count))

In [None]:
train_report = classification_report(y_train, y_train_pred_count)
test_report = classification_report(y_test, y_test_pred_count)
print('                    Training Data Report          ')
print(train_report)
print('                    Test Validation Data Report           ')
print(test_report)

In [None]:
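# The model was fitted on RobustScaler-transformed data, so the ROC curve must use the scaled validation set.
# Note: plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it there.
roc_disp = plot_roc_curve(logreg, X_test_rbs, y_test)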

- Here we will do prediction on Test Dataset (aka Real World Data) using Logistic Regression classification model.

In [None]:
import sys
np.set_printoptions(threshold=sys.maxsize)

# The model was trained on RobustScaler-transformed features, so apply the same fitted scaler first
y_test_pred = logreg.predict(scaler_rbs.transform(test_finalX))

# Count predictions per class: Revenue_Grid 1 = High Net Worth, 2 = Low Net Worth
highnetworthcount = int((y_test_pred == 1).sum())
lownetworthcount = int((y_test_pred == 2).sum())

# Print the totals once, after counting
print("Low Net worth Count", lownetworthcount)
print("High Net worth Count", highnetworthcount)

- Here we will write output of prediction to CSV file for submission.

In [None]:
output = pd.DataFrame({'REF_NO': cust_test_data.index,'Revenue_Grid': y_test_pred})
output.to_csv('customer-classification-submission3.csv', index=False, header=False)

<a name = Section9></a>

---
# **9. Summarization**
---

<a name = Section91></a>
### **9.1 Conclusion**

- A Logistic Regression model was used for evaluation and prediction, in keeping with the scope of this ML Foundation project.
- F1 Score was used as the evaluation metric, as required by this ML Foundation project.
- One-hot encoding with `OneHotEncoder` was used because `get_dummies` did not work for the Test Dataset: one class of the **region** column was not getting encoded. The same issue later appeared with `OneHotEncoder` on the Test Dataset, and it was confirmed that one class of the **region** feature does not exist there. The **region** column therefore had to be dropped altogether, since **TVarea** already captures similar information required by the Marketing Team to run campaigns.
- RobustScaler gave the best F1 Score for this dataset, as most of the numeric features are not normally distributed and contain outliers.