# **An Experimental Exploratory Data Analysis for a Classification Task**

### ***From Visualization to Statistical Analysis***

### ***From Feature Engineering to Feature Selection***

### ***From the Best Model Selection to Interpretability***



To start the exploration, set up the environment by importing the libraries, load the data set (it is stored in a GitHub repository), and split it into the target variable and the feature variables. No further setup is required when using Google Colab; see the guidelines: https://colab.research.google.com/notebooks/welcome.ipynb
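The load-and-split step can be sketched as follows. This is only a minimal illustration, assuming the data set is a CSV file read with pandas: the column names, the values, and the `TARGET` constant are hypothetical stand-ins, and an in-memory string replaces the file stored in the GitHub repository.

```python
import io

import pandas as pd

# Stand-in for the CSV stored in the GitHub repository
# (hypothetical columns and values, not the real data).
csv_text = io.StringIO(
    "age,job,balance,target\n"
    "34,admin,1200,0\n"
    "51,technician,300,1\n"
    "29,services,-50,0\n"
)

# With the real data, the raw file URL would be passed to pd.read_csv instead.
df = pd.read_csv(csv_text)

TARGET = "target"  # hypothetical name of the response column

# Split the data set into the target variable and the feature variables.
y = df[TARGET]
X = df.drop(columns=[TARGET])

print(X.shape, y.shape)
```

Keeping the target in its own variable from the start makes it harder to accidentally include it among the features during the later analysis and modeling steps.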

#### **Contents**

Exploratory Data Analysis (EDA) is usually the step between data cleaning and data modeling. Its goal is to understand patterns, detect mistakes, check assumptions, and check relationships between the variables of a data set with the help of charts and summary statistics.
The goal of this workshop is to extend the classical EDA journey, which some tools also provide in an automated way.
The exploration is grouped by type of variable: it proceeds from univariate analysis to bivariate analysis, then from feature engineering techniques (one-hot encoding), which make the machine learning task easier, to statistical evaluations that select the features best able to explain the response variable.
Along the way, some issues with the explanatory variables are handled, such as missing observations and outliers.
A deeper understanding of the data set is then gained by exploring several baseline models in two steps: first without handling the unbalanced data set, then handling it.  
The data set used comes from a classification competition.
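The difference between modeling on the unbalanced data set and on a rebalanced copy can be illustrated with random over-sampling. The sketch below is an assumption about the technique, not necessarily the one used later in the notebook: it duplicates randomly chosen minority-class rows until every class matches the majority count. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_size = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that belong to the current class.
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        # Draw with replacement; nothing is drawn for the majority class.
        extra = [rng.choice(pool) for _ in range(majority_size - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy unbalanced data (hypothetical): six negatives, two positives.
rows = [[i] for i in range(8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

A fixed seed keeps the resampled data set reproducible, so that baseline models trained before and after rebalancing can be compared on the same footing.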


# [1. Prepare Workspace](#scrollTo=Hs9lEhRM0Mne)

- [1.1. Upload libraries](#scrollTo=zJlI9RQG-A8w)

- [1.2. Upload data set](#scrollTo=ZLFLLtY2-I06)

- [1.3. Split data set](#scrollTo=Q2_fumrJ-NCl)

# [2. Summarize Data](#scrollTo=cIEjdLM1BGti)

# [3. Formatting Features](#scrollTo=UqhoHvdpPKse)

# [4. Handling Missing Values](#scrollTo=N1_xCHJo02PJ)

# [5. Target Variable Analysis](#scrollTo=1q8NeSmPwljg)

# [6. Categorical Features Analysis](#scrollTo=uqcTsDB4OkMK)

- [6.1. Analysis for categorical features (barplot, univariate analysis, bivariate analysis)](#scrollTo=KkBZluQqTqiO)

- [6.2. Feature Selection](#scrollTo=ICEBY5Jv7tsV)

- [6.3. Feature Engineering on categorical features: one-hot encoding](#scrollTo=aHJVRnRmT7-j)


# [7. Numerical Features Analysis](#scrollTo=jxwJmGWxcLux)

- [7.1. Analysis for numerical features (distribution, univariate analysis, bivariate analysis)](#scrollTo=UYsyCNwRzCtM)

- [7.2. Feature Selection](#scrollTo=AdINGfY5hNlc)

- [7.3. Handling Outliers](#scrollTo=HxOVN66PhX-4)

# [8. Feature Selection on the whole data set](#scrollTo=mPerrlyVl4l9)

# [9. Modeling Part](#scrollTo=qe16GLXqopQl)

- [9.1. Evaluation Metric and Confusion Matrix](#scrollTo=6S3Jt8sxLF2b)

# [10. Modeling Part I: without handling unbalanced data set](#scrollTo=YQHDO3oQLssf)

- [10.1. Pre-Processing: split data set](#scrollTo=6TRXf-0pLPsc)

- [10.2. Baseline Models](#scrollTo=msCzLqkgLtyx)

- [10.3. Scaled Baseline Models](#scrollTo=qzK7N9OiL237)

- [10.4. Feature Importance](#scrollTo=DBpjyQI1L9pF)

# [11. Modeling Part II: handling unbalanced data set](#scrollTo=-eOe6JuTKPBe)

- [11.1. Over-sampling](#scrollTo=5dHoPIJAMvt1)

- [11.2. Pre-Processing: split data set](#scrollTo=toYSVSUQNu6t)

- [11.3. Baseline Models](#scrollTo=4MYenOMcNBcp)

- [11.4. Scaled Baseline Models](#scrollTo=IyUMpIf7NI-Q)

# [12. Conclusions](#scrollTo=hbZnOUhQKtTJ)

# [13. References](#scrollTo=VhXdBXS0Kxj8)

### **Exploratory Data Analysis (EDA) Pipeline**

![](http://www.theleader.info/wp-content/uploads/2017/07/Mortgage-rates.jpg)