# 📊 Exploratory Data Analysis (EDA) in Machine Learning

**Exploratory Data Analysis (EDA)** is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It helps in understanding the structure, patterns, and relationships within the data before applying machine learning models.

---

## 🎯 Objectives of EDA

- Understand the **distribution** of features
- Identify **missing values**, **outliers**, and **duplicates**
- Analyze **relationships** between variables
- Detect **data imbalances**
- Generate **hypotheses** for further analysis

---

## 📦 1. Load and Inspect the Dataset

```python
import pandas as pd

df = pd.read_csv('your_dataset.csv')
df.head()         # View top rows
df.info()         # Summary: data types and missing values
df.describe()     # Statistical summary
```
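
The objectives above list duplicates, which `info()` and `describe()` do not report. A minimal sketch of a shape and duplicate-row check follows; the `demo` DataFrame and its values are made up for illustration and stand in for your own data.

```python
import pandas as pd

# Hypothetical toy data standing in for your_dataset.csv
demo = pd.DataFrame({
    'age':    [25, 32, 32, 47],
    'salary': [40000, 52000, 52000, 61000],
})

n_rows, n_cols = demo.shape                    # Number of rows and columns
n_duplicates = int(demo.duplicated().sum())    # Rows identical to an earlier row
deduped = demo.drop_duplicates()               # Keep the first occurrence of each row

print(n_rows, n_cols, n_duplicates, len(deduped))
```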
---

## 🔍 2. Understand Data Types
Knowing each column's data type helps you choose the right analysis and visualization.

- Numerical (continuous/discrete) (e.g., age, salary)
- Categorical (nominal/ordinal) (e.g., gender, rating)
- Datetime (timestamp/time series)
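```python
df.dtypes                                                   # Data type of each column
df.select_dtypes(include=['object', 'category']).columns    # Categorical
df.select_dtypes(include='number').columns                  # Numerical
df.select_dtypes(include='datetime').columns                # Datetime
```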
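
Dates read from a CSV usually arrive as plain text, so they show up as `object` rather than datetime. The sketch below shows one way to convert them before selecting by dtype; the column names `signup_date` and `plan` and their values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only
demo = pd.DataFrame({
    'signup_date': ['2024-01-15', '2024-02-01', 'not a date'],
    'plan': ['basic', 'pro', 'basic'],
})

# errors='coerce' turns unparseable strings into NaT instead of raising
demo['signup_date'] = pd.to_datetime(demo['signup_date'], errors='coerce')
# Low-cardinality text columns can be stored as the 'category' dtype
demo['plan'] = demo['plan'].astype('category')

datetime_cols = list(demo.select_dtypes(include='datetime').columns)
category_cols = list(demo.select_dtypes(include='category').columns)
n_unparsed = int(demo['signup_date'].isna().sum())   # Values that failed to parse

print(datetime_cols, category_cols, n_unparsed)
```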
---

## 3️⃣ Univariate Analysis
🔹 Definition:
Study of one variable at a time to understand its distribution and properties.

#### For Numerical Features:

```python
import seaborn as sns

sns.histplot(df['age'], kde=True)   # Histogram with a KDE curve overlaid
```

- Histogram: Frequency distribution
- Boxplot: Spread and outliers
- KDE Plot: Smoothed probability density

#### For Categorical Features:
```python
sns.countplot(x='gender', data=df)
```
- Bar charts show frequency of each category.
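
It can help to look at the numbers behind these plots. The sketch below computes the bin counts a histogram would draw and the category frequencies a count plot would draw; the values and the choice of 4 bins are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
ages = pd.Series([22, 25, 27, 31, 35, 38, 41, 44, 58, 63], name='age')
genders = pd.Series(['F', 'M', 'F', 'F', 'M', 'M', 'F', 'F'], name='gender')

# Bin counts: the bar heights of a histogram with 4 equal-width bins
counts, bin_edges = np.histogram(ages, bins=4)

# Category frequencies: the bar heights of a count plot
gender_counts = genders.value_counts()

print(counts, bin_edges)
print(gender_counts)
```

The bin count is a free parameter: too few bins hide structure, too many make the plot noisy, so it is worth trying more than one value.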

---


## 4️⃣ Bivariate Analysis
🔹 Definition:
Study of two variables together to understand their relationship.

#### 🔸 Numerical vs Numerical:
```python
sns.scatterplot(x='height', y='weight', data=df)
df.corr(numeric_only=True)                            # Pairwise correlation of numeric columns only
sns.heatmap(df.corr(numeric_only=True), annot=True)
```
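Correlation coefficient:

`corr()` computes Pearson's r by default. Values range from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship); values near 0 indicate no linear relationship.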
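
To make the coefficient concrete, the sketch below computes Pearson's r by hand (covariance of the centered values divided by the product of their spreads) and compares it with the pandas result. The `height` and `weight` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [52, 58, 69, 77, 88],
})

# Center each variable around its mean
x = demo['height'] - demo['height'].mean()
y = demo['weight'] - demo['weight'].mean()

# Pearson's r: sum of cross-products over the product of the root sums of squares
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = demo['height'].corr(demo['weight'])

print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson's r only captures linear association, so a strong curved relationship can still score close to 0. That is one reason to keep the scatterplot alongside the number.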

#### 🔸 Categorical vs Numerical:
```python
sns.boxplot(x='gender', y='income', data=df)
```
Helps compare distributions across categories.
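
A grouped summary gives the numeric counterpart of the boxplot. This is a minimal sketch with made-up `gender` and `income` values; which statistics to aggregate is a judgment call.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'income': [40000, 50000, 60000, 45000, 55000, 95000],
})

# One row per category, one column per summary statistic
income_by_gender = demo.groupby('gender')['income'].agg(['count', 'median', 'mean'])

print(income_by_gender)
```

Here the mean for `M` is pulled well above its median by a single large value, which is the kind of difference a boxplot makes visible.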

#### 🔸 Categorical vs Categorical:
```python
pd.crosstab(df['gender'], df['purchase'], normalize='index')
sns.countplot(x='purchase', hue='gender', data=df)
```
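
With `normalize='index'`, each row of the crosstab is divided by its row total, so the cells read as within-group proportions. A small sketch with made-up `gender` and `purchase` values shows the raw counts next to the normalized table.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    'gender':   ['F', 'F', 'F', 'F', 'M', 'M'],
    'purchase': ['yes', 'yes', 'yes', 'no', 'yes', 'no'],
})

counts = pd.crosstab(demo['gender'], demo['purchase'])                     # Raw counts
rates = pd.crosstab(demo['gender'], demo['purchase'], normalize='index')   # Row proportions

print(counts)
print(rates)
```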
---



## 5️⃣ Missing Values Analysis
Definition: Values that are absent from the dataset.

```python
df.isnull().sum()
sns.heatmap(df.isnull(), cbar=False)
```
Why is this important?

Missing data can bias summary statistics and model estimates, and many algorithms cannot accept NaN values at all, so it must be handled deliberately (imputation or removal).
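
The two handling options mentioned above can be sketched as follows. The `demo` data is made up, and filling numeric columns with the median and text columns with the mode is only one common default, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    'age':    [25, np.nan, 40, 31],
    'income': [50000, 62000, np.nan, np.nan],
    'city':   ['Pune', 'Delhi', None, 'Pune'],
})

# Share of missing values per column, in percent
missing_pct = demo.isnull().mean() * 100

# Option 1: removal (drops every row with at least one missing value)
dropped = demo.dropna()

# Option 2: imputation (median for numeric columns, mode for text columns)
imputed = demo.copy()
imputed['age'] = imputed['age'].fillna(imputed['age'].median())
imputed['income'] = imputed['income'].fillna(imputed['income'].median())
imputed['city'] = imputed['city'].fillna(imputed['city'].mode()[0])

print(missing_pct)
print(len(dropped), int(imputed.isnull().sum().sum()))
```

Removal keeps only one of the four rows here, which shows how quickly row-wise deletion can shrink a dataset when gaps are spread across columns.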

---

## 6️⃣ Outlier Detection
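Definition: An outlier is a data point that lies far from the bulk of the other values.

Methods:
- Z-score: flag points more than a chosen number of standard deviations (commonly 3) from the mean
- IQR (Interquartile Range): flag points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR; these are the points a boxplot draws beyond its whiskers
```python
sns.boxplot(x='salary', data=df)
```
**Impact:**
Outliers can distort means and variances and can reduce model accuracy.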
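
Both methods can be sketched in a few lines of pandas. The `salary` values are made up, and the cutoffs (3 standard deviations, 1.5 × IQR) are conventional defaults rather than fixed rules.

```python
import pandas as pd

# Made-up salaries in thousands; 250 is the planted extreme value
salary = pd.Series([42, 45, 47, 48, 50, 51, 52, 53, 55, 58, 60, 250], name='salary')

# Z-score method: distance from the mean in units of standard deviation
z = (salary - salary.mean()) / salary.std()
z_outliers = salary[z.abs() > 3]

# IQR method: fences at 1.5 * IQR beyond the first and third quartiles
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = salary[(salary < lower) | (salary > upper)]

print(z_outliers.tolist(), iqr_outliers.tolist())
```

The Z-score relies on the mean and standard deviation, which the outlier itself inflates, so in small samples it can miss extreme points that the IQR fences catch.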

---





## 7️⃣ Skewness & Kurtosis
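**Skewness:** Measures the asymmetry of a distribution (a perfectly symmetric distribution has skewness 0)
```python
df['income'].skew()
```
- `> 0` → right skew (longer right tail)
- `< 0` → left skew (longer left tail)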

**Kurtosis:** Measures the "tailedness" of the distribution. pandas returns excess kurtosis (Fisher's definition), so a normal distribution scores 0.

```python
df['income'].kurtosis()
```
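
The sign conventions are easiest to see on tiny series. The values below are made up so that one series is symmetric, one has a long right tail, and one has a long left tail.

```python
import pandas as pd

# Made-up values for illustration
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])    # One large value stretches the right tail
left_skewed = pd.Series([-10, -3, -2, -2, -1, -1])

skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()
skew_left = left_skewed.skew()

# Excess kurtosis: negative means lighter tails than a normal distribution
kurt_flat = symmetric.kurtosis()

print(skew_symmetric, skew_right, skew_left, kurt_flat)
```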
---

## 8️⃣ Correlation Matrix
Definition: A matrix showing correlation coefficients between variables.
```python
import matplotlib.pyplot as plt

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
```
Use:

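- Identify multicollinearity (features that are highly correlated with each other)
- Guide feature selection for linear models, which are sensitive to correlated predictors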
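
On wide datasets a heatmap gets hard to read, so it can help to list the strongly correlated pairs directly. This is a minimal sketch: the column names and values are made up, and the 0.9 threshold is an assumed cutoff, not a standard.

```python
import numpy as np
import pandas as pd

# Made-up values; height_in carries nearly the same information as height_cm
demo = pd.DataFrame({
    'height_cm': [150, 160, 170, 180, 190],
    'height_in': [59.1, 63.0, 66.9, 70.9, 74.8],
    'age':       [30, 22, 41, 35, 27],
})

corr = demo.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9   # Assumed cutoff for "highly correlated"
high_pairs = [
    (row, col, round(float(val), 3))
    for (row, col), val in upper.stack().items()
    if val > threshold
]

print(high_pairs)
```

For each flagged pair, a common follow-up is to keep only one of the two features or combine them, depending on what the model needs.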

---


## 9️⃣ Pair Plot
```python
sns.pairplot(df, hue='target')
```
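- Useful for spotting pairwise relationships across many features at once
- Shows scatterplots off the diagonal and per-feature distributions on the diagonal (histograms, or KDE curves when `hue` is set)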
---

## 🔟 Target Variable Analysis (Class Imbalance)
```python
df['target'].value_counts(normalize=True)
sns.countplot(x='target', data=df)
```
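Why is this important?
- Imbalanced classes bias the model toward the majority class, and accuracy alone becomes a misleading metric.
- Mitigate with resampling (random oversampling, undersampling, SMOTE) or by adjusting class weights.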
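
As a rough illustration of the class-weight option, the sketch below measures the imbalance and derives weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The 90/10 split is made up for illustration.

```python
import pandas as pd

# Made-up binary target: 90 negatives, 10 positives
target = pd.Series([0] * 90 + [1] * 10, name='target')

counts = target.value_counts()
proportions = target.value_counts(normalize=True)
imbalance_ratio = counts.max() / counts.min()      # Majority size relative to minority size

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = (len(target) / (counts.size * counts)).to_dict()

print(proportions.to_dict(), imbalance_ratio, class_weights)
```

A dictionary like this can be passed to estimators that accept per-class weights, so errors on the minority class cost more during training.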

---

## 📋 Summary of EDA Steps

| Step                 | Purpose                       | Tools                            |
| -------------------- | ----------------------------- | -------------------------------- |
| Data Inspection      | Structure, completeness       | `info()`, `head()`, `describe()` |
| Univariate Analysis  | Distribution of each variable | `histplot()`, `countplot()`      |
| Bivariate Analysis   | Variable relationships        | `scatterplot()`, `boxplot()`     |
| Missing Value Check  | Locate missing values         | `isnull()`, `heatmap()`          |
| Outlier Detection    | Identify extreme values       | `boxplot()`, IQR, Z-score        |
| Correlation Analysis | Detect multicollinearity      | `corr()`, `heatmap()`            |
| Class Balance Check  | Check label distribution      | `value_counts()`, `countplot()`  |


## 📌 Final Notes on EDA
- EDA helps you understand your data before modeling.
- Combine visualizations with statistics for best insights.
- It’s an iterative process: discover → clean → explore → repeat.