## 2️⃣ Exploratory Data Analysis (EDA)
**Designed by:** [datamover.ai](https://www.datamover.ai)

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

# Set random seed
np.random.seed(42)
```

**1. Load the train set and sample the dataset to a manageable size if necessary**
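A minimal sketch of this step, assuming the train set is stored as a CSV file. The in-memory CSV and the `MAX_ROWS` threshold are placeholders for illustration; in practice you would pass the path of your own train file to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the train file; replace with a real path.
csv_text = "id,feature,target\n" + "\n".join(
    f"{i},{i * 0.5},{i % 2}" for i in range(1000)
)
df_train = pd.read_csv(io.StringIO(csv_text))

# Assumed threshold for a "manageable" size; tune it to the available memory.
MAX_ROWS = 200

if len(df_train) > MAX_ROWS:
    # A fixed random_state keeps the sample reproducible across runs.
    df_eda = df_train.sample(n=MAX_ROWS, random_state=42)
else:
    df_eda = df_train.copy()

print(df_eda.shape)
```

Sampling without replacement (the default) keeps every selected row unique, so the sample cannot introduce duplicates of its own.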

**2. For supervised learning tasks, identify the target attribute(s)**
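As an illustration, the target can be stored in a constant so that later cells can separate it from the features. The toy data and the `churned` column name are hypothetical.

```python
import pandas as pd

# Toy data; "churned" is a hypothetical target column.
df = pd.DataFrame(
    {
        "tenure_months": [1, 24, 36, 5, 12, 48],
        "plan": ["basic", "pro", "pro", "basic", "basic", "pro"],
        "churned": [1, 0, 0, 1, 0, 0],
    }
)

TARGET = "churned"
features = [col for col in df.columns if col != TARGET]

# Share of each target class, useful for spotting class imbalance early.
target_share = df[TARGET].value_counts(normalize=True)

print(features)
print(target_share)
```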

**3. Study each attribute and its characteristics:**

**3 a. Name**

**3 b. For tabular data, define the data type of each variable, namely:**

- `Nominal`: Named categories with no order, e.g. `gender: ['Female', 'Male']`
- `Ordinal`: Categories with an implied order, e.g. `quality: ['Low', 'Medium', 'High']`
- `Discrete`: Only particular (countable) numbers, e.g. `age: {1, 2, 3, ..., 58, 59, 60}`
- `Continuous`: Any numerical value within a range, e.g. `weight: {38.9, 45.5, ...}`
    
📝 Nominal and ordinal data types are considered qualitative (**categorical**) features, whereas discrete and continuous data types are considered numerical (**quantitative**) features.
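One possible way to record these types in pandas is sketched below. The `variable_types` mapping is a hand-written assumption (pandas cannot infer it), and the toy columns mirror the examples above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Male"],
        "quality": ["Low", "High", "Medium", "Low"],
        "age": [23, 45, 31, 60],
        "weight": [38.9, 45.5, 72.3, 80.1],
    }
)

# Hand-written mapping of column -> variable type (an assumption, not inferred).
variable_types = {
    "gender": "nominal",
    "quality": "ordinal",
    "age": "discrete",
    "weight": "continuous",
}

# Nominal: unordered category.
df["gender"] = df["gender"].astype("category")

# Ordinal: ordered category, so comparisons and sorting respect the order.
quality_dtype = CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
df["quality"] = df["quality"].astype(quality_dtype)

categorical_cols = [c for c, t in variable_types.items() if t in ("nominal", "ordinal")]
numerical_cols = [c for c, t in variable_types.items() if t in ("discrete", "continuous")]

print(df.dtypes)
```

Keeping the two column lists around makes it easy to route each group to the right plot or summary later on.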

**3 c. Percentage of missing values, i.e.** `np.nan`
- [missingno](https://github.com/ResidentMario/missingno) can be a useful tool for visualization;
- Check whether missing values are encoded with placeholder values, e.g. `-1` or `"?"`;
- Inspect rows with missing values to assess whether a specific pattern exists.
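The checks above can be sketched as follows. The toy data and the `sentinels` mapping are assumptions; which placeholder values actually mean "missing" has to be confirmed for each dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25, -1, 40, 33, -1],
        "city": ["Rome", "?", "Oslo", None, "Lima"],
        "income": [52000.0, 61000.0, np.nan, 48000.0, 75000.0],
    }
)

# Assumed placeholder values per column; confirm them before replacing.
sentinels = {"age": [-1], "city": ["?"]}

df_clean = df.copy()
for col, values in sentinels.items():
    # mask() sets matching entries to NaN and leaves the rest untouched.
    df_clean[col] = df_clean[col].mask(df_clean[col].isin(values))

# Percentage of missing values per column.
missing_pct = df_clean.isna().mean().mul(100).round(1)

# Rows with at least one missing value, for manual inspection.
rows_with_missing = df_clean[df_clean.isna().any(axis=1)]

print(missing_pct)
print(rows_with_missing)
```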


**3 d. Check whether there are any duplicates and inspect them**
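A short sketch of a duplicate check, using toy data. Treating `customer_id` as a key column is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4, 4],
        "amount": [10.0, 20.0, 20.0, 30.0, 40.0, 45.0],
    }
)

# Number of rows that repeat an earlier row exactly.
n_exact = int(df.duplicated().sum())

# keep=False marks every member of a duplicate group, which helps inspection.
exact_dupes = df[df.duplicated(keep=False)].sort_values(list(df.columns))

# Rows sharing a key but possibly differing elsewhere (hypothetical key column).
key_dupes = df[df.duplicated(subset=["customer_id"], keep=False)]

print(n_exact)
print(exact_dupes)
print(key_dupes)
```

Key-level duplicates with conflicting values (here `customer_id` 4) often deserve a closer look than exact copies, since they cannot simply be dropped.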

**3 e. Noisiness and type of noise, e.g. stochastic noise, rounding errors, etc.** (might require business knowledge)


**3 f. The frequency of each group within categorical variables and the type of distribution for numerical variables** 

Refer to the NIST [gallery of distributions](https://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm) for common types of distributions. It is recommended to visualize each variable by using:
- a [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html) for categorical variables;
- a [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot) for numerical variables.

Visualize `nominal` variables
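A possible sketch for nominal columns, with one `countplot` per variable. The toy data and the `nominal_cols` list are assumptions; bars are sorted by frequency because nominal categories have no natural order.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
        "city": ["Rome", "Oslo", "Rome", "Lima", "Rome", "Oslo"],
    }
)
nominal_cols = ["gender", "city"]  # hypothetical column list

fig, axes = plt.subplots(
    1, len(nominal_cols), figsize=(5 * len(nominal_cols), 4), squeeze=False
)
for ax, col in zip(axes.ravel(), nominal_cols):
    # Most frequent category first.
    order = df[col].value_counts().index.tolist()
    sns.countplot(data=df, x=col, order=order, ax=ax)
    ax.set_title(col)
fig.tight_layout()
```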

Visualize `ordinal` variables
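For ordinal columns the same `countplot` can be used, but with the bars kept in their logical order rather than sorted by frequency. The toy data and `quality_order` are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"quality": ["Low", "High", "Medium", "Low", "Medium", "Low"]})

# Explicit category order (assumed); alphabetical order would be misleading here.
quality_order = ["Low", "Medium", "High"]

fig, ax = plt.subplots(figsize=(5, 4))
sns.countplot(data=df, x="quality", order=quality_order, ax=ax)
ax.set_title("quality")
fig.tight_layout()
```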

Visualize `discrete` and `continuous` variables together

**3 g. Examine possible outliers in numerical variables and check whether they make sense (might require business knowledge).**

For details on identifying outliers, refer to this [GeeksforGeeks guide](https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/).
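As one common option, the sketch below flags outliers with the interquartile range (IQR) rule. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy data are assumptions; other rules (e.g. z-scores) may suit a given variable better.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


df = pd.DataFrame(
    {"weight": [61.0, 64.5, 66.0, 68.2, 70.0, 71.3, 73.8, 75.0, 250.0]}
)

mask = iqr_outlier_mask(df["weight"])

# Flagged rows are candidates for review, not automatic removal.
print(df[mask])
```

Whether a flagged value such as 250.0 is a data-entry error or a genuine observation is exactly the question that business knowledge has to answer.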

**4. Annotate all information from EDA, such as:**
- the type of data;
- whether there are missing values and how to deal with them;
- summary statistics of both numerical and categorical variables;
- the type of distribution;
- promising transformations you may want to apply (e.g. a log-transformation for a highly skewed distribution, or cluster facets to mitigate group imbalance);
- additional data sources that would be useful;
- anything else that is noteworthy for model training.
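One way to keep these annotations together is a per-column summary table, sketched below. The toy data, the chosen columns of the table, and the free-text `notes` field are all assumptions about what is worth recording.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "plan": ["basic", "pro", "pro", None, "basic"],
        "age": [23, 45, 31, 60, 38],
        "income": [30000.0, 42000.0, np.nan, 38000.0, 250000.0],
    }
)

# One row per column of df; skew is only defined for numerical columns.
summary = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(1),
        "n_unique": df.nunique(),
        "skew": df.select_dtypes("number").skew().round(2),
    }
)

# Free-text notes filled in by hand after inspecting the plots.
summary["notes"] = ""
summary.loc["income", "notes"] = "right-skewed; consider a log transform"

print(summary)
```

`df.describe(include="all")` complements this table with summary statistics for both numerical and categorical variables.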