
---

**Exploratory Data Analysis (EDA), Data Cleaning, Feature Engineering & Preprocessing**

---

# ⭐ **SECTION 1 — Exploratory Data Analysis (EDA)**

### ✅ **Definition:**

**Exploratory Data Analysis (EDA)** is the process of visually and statistically exploring the dataset to understand:

* what the data contains
* hidden patterns
* anomalies or errors
* relationships between features
* what kind of preprocessing/modeling is needed

**Goal:** Build intuition before modeling.

---

## 1️⃣ **Why Do We Do EDA?**

* Understand data structure
* Identify missing values
* Detect outliers
* Find relationships between variables
* Spot data imbalance
* Understand distributions
* Detect possible data leakage
* Decide preprocessing strategies

---

## 2️⃣ **Types of EDA**

### **A. Univariate Analysis (one variable)**

**Definition:** Analyzing a single feature to understand its distribution.

For **numerical** features:

* Histogram (shape of data)
* Boxplot (outliers)
* Summary stats (mean, median, std)

For **categorical** features:

* Count plot
* Value counts
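
A minimal sketch of the non-graphical parts of univariate analysis, assuming pandas is available. The DataFrame and its column names (`age`, `city`) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [22, 25, 25, 31, 40, 58],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"],
})

# Numerical feature: count, mean, std, min, quartiles, max.
age_stats = df["age"].describe()
age_median = df["age"].median()

# Categorical feature: how often each category occurs.
city_counts = df["city"].value_counts()

print(age_stats)
print("median age:", age_median)
print(city_counts)
```

For the plots, `df["age"].plot.hist()` and `df["age"].plot.box()` are the usual pandas starting points (they need matplotlib installed).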

---

### **B. Bivariate Analysis (two variables)**

**Definition:** Studying the relationship between two variables.

Examples:

* Scatterplot (numerical vs numerical)
* Boxplot (numerical vs categorical)
* Bar chart (categorical vs categorical)
* Correlation heatmap
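
The numeric counterparts of these plots can be sketched with pandas alone. The data below is invented, and the column names (`hours_studied`, `score`, `gender`, `passed`) are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "score": [35, 45, 50, 62, 70, 81],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "passed": ["no", "yes", "yes", "yes", "yes", "yes"],
})

# Numerical vs numerical: Pearson correlation (what a scatterplot shows visually).
corr = df["hours_studied"].corr(df["score"])

# Numerical vs categorical: per-group summary (what a grouped boxplot shows).
score_by_gender = df.groupby("gender")["score"].mean()

# Categorical vs categorical: contingency table (what a grouped bar chart shows).
table = pd.crosstab(df["gender"], df["passed"])

print("correlation:", round(corr, 3))
print(score_by_gender)
print(table)
```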

---

### **C. Multivariate Analysis (more than two variables)**

**Definition:** Observing interactions between multiple features.

Examples:

* Pairplots
* Heatmaps
* PCA visualizations

---

## 3️⃣ **Important Checks During EDA**

### ✔ **1. Data Types**

**Definition:** The kind of data stored in a column (int, float, object, category, datetime).
Why important? → Determines which preprocessing method you must use.

---

### ✔ **2. Missing Values**

**Definition:** Values that are empty or not recorded.

Examples: `" "`, `NaN`, `"?"`
Why important? → Most models cannot handle missing values directly, so they must be imputed or dropped first.
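
Placeholder strings such as `"?"` are not counted as missing until they are converted to `NaN`. A small sketch, assuming pandas and NumPy; the DataFrame and the list of placeholders are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one real NaN plus two placeholder strings.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "workclass": ["Private", "?", " ", "State-gov"],
})

# Assumed placeholder values; adjust to what the dataset actually uses.
placeholders = ["?", " ", ""]
df["workclass"] = df["workclass"].replace(placeholders, np.nan)

missing_counts = df.isna().sum()   # missing values per column
missing_share = df.isna().mean()   # fraction missing per column

print(missing_counts)
print(missing_share)
```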

---

### ✔ **3. Outliers**

**Definition:** Data points that are far higher or lower than the rest of the data.

Example: A salary value of 5 crore in a dataset of middle-class salaries.
Why important? → Can distort the mean, regression lines, and distance calculations.
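
One common way to flag outliers is the 1.5 × IQR rule (the same rule a boxplot uses for its whiskers). The sketch below assumes pandas; the salary values are invented to mirror the example above.

```python
import pandas as pd

# Hypothetical salaries; the last value is 5 crore (50,000,000).
salaries = pd.Series([30_000, 35_000, 40_000, 42_000, 45_000, 50_000, 50_000_000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salaries[(salaries < lower) | (salaries > upper)]

print("mean:", salaries.mean())      # pulled far upward by the single outlier
print("median:", salaries.median())  # 42000.0, barely affected
print(outliers.tolist())             # [50000000]
```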

---

### ✔ **4. Distribution**

**Definition:** How data spreads across values (normal, skewed, uniform).
Why important? → Helps decide transformations (e.g., a log transform for a right-skewed feature).

---

### ✔ **5. Correlation**

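**Definition:** A numeric measure of the strength and direction of the linear relationship between two numerical features (Pearson correlation).
Range: -1 to +1
Why important? → Highly correlated features cause multicollinearity, which makes the coefficients of linear models unstable and hard to interpret.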
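
A sketch of how highly correlated feature pairs might be listed, assuming pandas and NumPy. The synthetic columns and the 0.9 threshold are arbitrary choices for the example, not a recommended cutoff.

```python
import numpy as np
import pandas as pd

# Synthetic data: height_in is the same information as height_cm in other units.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,
    "noise": rng.normal(0, 1, size=200),
})

corr = df.corr()  # Pearson correlation matrix (pandas default)

# Collect each feature pair once if |r| exceeds the (assumed) threshold.
threshold = 0.9
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print(high_pairs)
```

For a pair like this, dropping one of the two columns is a typical fix.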

---

### ✔ **6. Class Imbalance**

**Definition:** When one class has far fewer samples than the other (e.g., 98% no-fraud, 2% fraud).
Why important? → Accuracy becomes misleading: a model that always predicts "no-fraud" scores 98% accuracy while catching no fraud at all.
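
The effect can be checked with a few lines of pandas. The labels below are made up to match the 98% / 2% split in the definition.

```python
import pandas as pd

# Hypothetical labels: 98 no-fraud cases, 2 fraud cases.
y_true = pd.Series(["no_fraud"] * 98 + ["fraud"] * 2)

# Share of each class in the data.
class_share = y_true.value_counts(normalize=True)

# A "model" that always predicts the majority class.
majority_class = class_share.idxmax()
y_pred = pd.Series([majority_class] * len(y_true))

accuracy = (y_pred == y_true).mean()
fraud_caught = ((y_pred == "fraud") & (y_true == "fraud")).sum()
fraud_recall = fraud_caught / (y_true == "fraud").sum()

print(class_share)
print("accuracy:", accuracy)          # 0.98
print("fraud recall:", fraud_recall)  # 0.0
```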

---

# ⭐ **SECTION 2 — Data Cleaning**

### ✅ **Definition:**

Data cleaning is the process of fixing or removing incorrect, corrupted, missing, or badly formatted parts of the data.

Goal: Make the dataset consistent, complete, and usable for modeling.

---

## 1️⃣ **Handling Missing Values**

### **Techniques for Numerical Data**

* **Mean imputation:** Replace missing values with the column mean.
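* **Median imputation:** Replace missing values with the column median; more robust than the mean when outliers exist.
* **Interpolation:** Estimate values from neighboring points (mainly for ordered data such as time series).
* **KNN imputer:** Fill values using the most similar rows (nearest neighbors).
* **Dropping rows/columns:** Remove them when too many of their values are missing.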
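
A minimal sketch of the first four techniques, assuming pandas, NumPy, and scikit-learn. The values are invented, and `n_neighbors=2` is an arbitrary choice for this tiny example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical column with one missing value and one outlier (1000).
s = pd.Series([10.0, np.nan, 30.0, 40.0, 1000.0])

mean_filled = s.fillna(s.mean())      # fills with 270.0 (pulled up by the outlier)
median_filled = s.fillna(s.median())  # fills with 35.0
interpolated = s.interpolate()        # linear: fills with 20.0 (between 10 and 30)

# KNN imputation works on a 2-D feature matrix.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # second feature is missing here
    [3.0, 6.0],
    [8.0, 16.0],
])
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled.iloc[1], median_filled.iloc[1], interpolated.iloc[1])
print(knn_filled[1])  # missing entry replaced by the average of its 2 nearest rows
```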

---

### **Techniques for Categorical Data**

* **Mode imputation:** Replace with most common category.
* **“Unknown/Missing” category:** Keeps the fact that a value was missing available to the model as a signal.
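
Both options are one-liners in pandas. The `color` values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
color = pd.Series(["red", "blue", np.nan, "red", np.nan])

# Mode imputation: fill with the most frequent category.
mode_filled = color.fillna(color.mode()[0])

# Explicit "Unknown" category: missingness stays visible to the model.
flagged = color.fillna("Unknown")

print(mode_filled.tolist())
print(flagged.tolist())
```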

---

## 2️⃣ **Handling Outliers**

Outliers can be:

* **Valid:** Rare but real values
* **Errors:** Typing mistakes or measurement errors

**Ways to handle:**

* **Remove** (if clearly wrong)
* **Cap values** (winsorization)
* **Log transform** skewed features
* **Use RobustScaler**, which scales with the median and IQR so that outliers have little influence on the scaling
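
Capping and log-transforming can be sketched as follows, assuming pandas and NumPy. The 5th/95th percentile limits are an assumption for the example; the right limits depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

# Winsorization: cap values at chosen percentiles (5th and 95th assumed here).
low, high = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=low, upper=high)

# Log transform: log1p(x) = log(1 + x), which is also defined at x = 0.
logged = np.log1p(s)

print("max before:", s.max(), "| after capping:", capped.max())
print("max after log1p:", round(logged.max(), 3))
```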

---

## 3️⃣ **Fixing Data Types**

Examples:

* Convert `"2020-01-01"` → datetime
* Convert `"5,00,000"` → integer
* Convert `"Male"` / `"M"` → standard categories

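Why important? → Numbers and dates stored as strings cannot be used in arithmetic or date operations, and inconsistent labels are treated as separate categories.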
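
The three conversions above might look like this in pandas. The column names and the `"M"`/`"F"` mapping are assumptions for the example.

```python
import pandas as pd

# Hypothetical raw data: everything arrives as strings.
df = pd.DataFrame({
    "joined": ["2020-01-01", "2021-06-15"],
    "salary": ["5,00,000", "12,50,000"],
    "gender": ["Male", "M"],
})

# String -> datetime.
df["joined"] = pd.to_datetime(df["joined"])

# Strip thousands separators, then convert to integer.
df["salary"] = df["salary"].str.replace(",", "", regex=False).astype(int)

# Map label variants to one standard spelling, then store as a category.
df["gender"] = df["gender"].replace({"M": "Male", "F": "Female"}).astype("category")

print(df.dtypes)
print(df)
```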

---

## 4️⃣ **Removing Duplicates**

**Definition:** Rows that are repeated exactly.

Why do we remove them?

* They bias the model toward the repeated records
* They make training slower
* They distort statistical analysis
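
A short sketch of counting and dropping exact duplicates with pandas; the data is made up.

```python
import pandas as pd

# Hypothetical data: the row (2, 250) appears twice.
df = pd.DataFrame({"id": [1, 2, 2, 3], "amount": [100, 250, 250, 75]})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = df.duplicated().sum()

# Keep the first occurrence of each row and rebuild the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_duplicates)  # 1
print(deduped)
```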

---

# ⭐ **SECTION 3 — Feature Engineering**

### ✅ **Definition:**

Feature engineering is the process of creating new useful features or transforming existing ones to improve model performance.

This step greatly impacts ML results, sometimes more than the choice of model.

---

## 1️⃣ **Feature Creation**

Examples:

* **Ratios:** (loan_amount / income)
* **Interaction features:** (feature1 × feature2)
* **Aggregations:** Total expenses, total transactions
* **Date-based:** Month, weekday, hour, year
* **Domain-specific:** BMI = weight (kg) / height (m)²

Why important? → Good features simplify patterns for the model.
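
A sketch of a ratio, a domain-specific feature, and date-based features in pandas. All column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw features.
df = pd.DataFrame({
    "loan_amount": [200_000, 50_000],
    "income": [800_000, 500_000],
    "weight_kg": [70.0, 90.0],
    "height_m": [1.75, 1.80],
    "txn_time": pd.to_datetime(["2024-03-15 14:30", "2024-12-01 09:05"]),
})

# Ratio feature.
df["loan_to_income"] = df["loan_amount"] / df["income"]

# Domain-specific feature: BMI = weight / height^2.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Date-based features via the .dt accessor.
df["txn_month"] = df["txn_time"].dt.month
df["txn_weekday"] = df["txn_time"].dt.dayofweek  # Monday = 0, Sunday = 6
df["txn_hour"] = df["txn_time"].dt.hour

print(df[["loan_to_income", "bmi", "txn_month", "txn_weekday", "txn_hour"]])
```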

---

## 2️⃣ **Feature Transformation**

### **Log Transform**

Definition: Apply the logarithm to compress large values and reduce their influence.
Useful for right-skewed distributions. Values must be positive; use `log(1 + x)` when zeros are present.

### **Binning**

Definition: Converting continuous values into intervals (e.g., age groups).

### **Polynomial Features**

Definition: Add squared, cubic, or interaction terms.
Useful when relationships are non-linear.
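
The three transformations can be sketched as below, assuming pandas, NumPy, and scikit-learn. The bin edges and labels are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Log transform: log1p handles the zero without error.
income = pd.Series([0, 1_000, 10_000, 1_000_000])
log_income = np.log1p(income)

# Binning: right=False makes the bins [0, 18), [18, 65), [65, 120).
age = pd.Series([5, 17, 30, 64, 80])
age_group = pd.cut(
    age,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
    right=False,
)

# Polynomial features of degree 2 for one row (x0=2, x1=3).
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x0, x1, x0^2, x0*x1, x1^2

print(log_income.round(2).tolist())
print(age_group.tolist())
print(X_poly)  # [[2. 3. 4. 6. 9.]]
```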

---

## 3️⃣ **Categorical Encoding**

### **One-Hot Encoding**

Definition: Convert categorical values into multiple binary columns.
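Good for: Linear/logistic regression and nominal features with few categories. Tree models accept it too, but high-cardinality features produce many sparse columns.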

### **Ordinal Encoding**

Definition: Assign integer values based on category order.
Good for: Ordered columns (e.g., low < medium < high).

### **Target Encoding**

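Definition: Replace each category with the mean of the target variable for that category.
Useful for high-cardinality categorical features. Compute the means on training data only (ideally out-of-fold) to avoid target leakage.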
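
A sketch of the three encodings with plain pandas. The columns (`city`, `risk`, `target`) are hypothetical. The target encoding shown is the naive version computed on the full data; for real modeling the means should come from training folds only.

```python
import pandas as pd

# Hypothetical data with a nominal and an ordinal feature.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "risk": ["low", "high", "medium", "low"],
    "target": [1, 0, 1, 0],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal encoding: integers that respect low < medium < high.
order = {"low": 0, "medium": 1, "high": 2}
df["risk_ordinal"] = df["risk"].map(order)

# Naive target encoding: category -> mean of the target for that category.
city_means = df.groupby("city")["target"].mean()
df["city_target_enc"] = df["city"].map(city_means)

print(one_hot)
print(df[["risk_ordinal", "city_target_enc"]])
```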

---

## 4️⃣ **Feature Selection**

### **Definition:**

Choosing the best subset of features that helps the model perform well.

Methods:

* **Correlation threshold**
* **Tree-based feature importance**
* **SelectKBest**
* **Recursive Feature Elimination (RFE)**

Why important? → Reduces overfitting, improves speed.
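
As one example, `SelectKBest` from scikit-learn scores each feature against the target and keeps the top `k`. The synthetic data, the ANOVA F-test scorer (`f_classif`), and `k=1` are assumptions chosen to keep the sketch small.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: feature 0 depends on the label, features 1 and 2 are pure noise.
rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(0, 0.3, size=n),  # informative
    rng.normal(0, 1, size=n),        # noise
    rng.normal(0, 1, size=n),        # noise
])

# Keep the single best feature according to the F-test score.
selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)
selected_mask = selector.get_support()

print("scores:", selector.scores_.round(1))
print("kept features:", selected_mask)
```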

---

# ⭐ **SECTION 4 — Data Preprocessing**

### ✅ **Definition:**

Preprocessing prepares the data for machine learning by scaling, encoding, imputing, and transforming features.

---

## 1️⃣ **Scaling & Normalization**

### **Why scaling is required?**

Some algorithms depend on distances (e.g., KNN, SVM) or on gradient-based optimization (e.g., neural networks).
If features are on different scales (e.g., age 0–80 vs income 0–10,00,000), the features with larger values dominate.

### **Types of Scaling**

---

### **1. Standardization (Z-score scaling)**

Formula:

$$
z = \frac{x - \text{mean}}{\text{std}}
$$

After scaling:

* Mean = 0
* Std = 1

Best for:

* Linear Regression
* Logistic Regression
* Neural Networks
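
A sketch comparing the formula applied by hand with scikit-learn's `StandardScaler`. The age/income values are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: [age, income].
X = np.array([
    [20.0, 200_000.0],
    [40.0, 500_000.0],
    [60.0, 1_000_000.0],
    [80.0, 300_000.0],
])

# Manual z-score per column (np.std uses the population std, ddof=0, by default).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same computation with scikit-learn.
z_sklearn = StandardScaler().fit_transform(X)

print(z_sklearn.round(3))
print("column means:", z_sklearn.mean(axis=0).round(6))
print("column stds:", z_sklearn.std(axis=0).round(6))
```

In a real workflow the scaler is usually fit on the training split only and then applied to the test split with `transform`.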

---

### **2. Min-Max Scaling**

Maps values to the range 0–1.

Best for:

* KNN
* SVM
* Neural Networks
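
Min-max scaling computes `(x - min) / (max - min)` per feature. A minimal sketch with NumPy, checked against scikit-learn's `MinMaxScaler`; the ages are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages.
x = np.array([18.0, 30.0, 45.0, 80.0])

# Manual min-max scaling to the range 0-1.
x_manual = (x - x.min()) / (x.max() - x.min())

# scikit-learn expects a 2-D array, so reshape to one column and flatten back.
x_sklearn = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual.round(3))
```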

---

### **3. Robust Scaling**

Uses median and IQR.
Best for data with **outliers**.
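
A sketch of robust scaling, `(x - median) / IQR`, done by hand and with scikit-learn's `RobustScaler` (default settings, which use the 25th and 75th percentiles). The data is made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: center on the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_manual = (x - median) / (q3 - q1)

# Same computation with scikit-learn.
x_robust = RobustScaler().fit_transform(x)

print(x_robust.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

The outlier is still present after scaling, but it did not affect how the other four values were centered and scaled.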

---
