
# 📊 Exploratory Data Analysis (EDA)

This notebook provides **code templates and checklists** for performing **EDA on datasets** to understand their structure, relationships, and potential issues.

### 🔹 What’s Covered:
- Dataset overview (shape, types, missing values)
- Summary statistics
- Data visualization (histograms, boxplots, scatter plots, correlation heatmaps)
- Outlier detection & feature distributions


In [None]:

# Install the required libraries if needed (uncomment to run).
# %pip installs into the environment of the active notebook kernel.
# %pip install pandas numpy matplotlib seaborn



## 🏗️ Dataset Overview

✅ Check the **shape** of the dataset.  
✅ Inspect **data types** and detect incorrect types.  
✅ Identify **missing values** and assess their impact.  


In [None]:

import pandas as pd

# Load dataset (replace with your actual file)
df = pd.read_csv("your_dataset.csv")

# Display basic info
print("Dataset Shape:", df.shape)
print("Column Data Types:")
print(df.dtypes)

# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
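
The checklist above also asks for incorrect types and the impact of missing values. A minimal sketch of both follows, using a small made-up DataFrame; the column names `price`, `signup_date`, and `age` are hypothetical stand-ins for your own columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: numbers and dates stored as text, plus some gaps
demo = pd.DataFrame({
    "price": ["10.5", "20", "n/a", "7.25"],
    "signup_date": ["2024-01-05", "2024-02-10", None, "2024-03-15"],
    "age": [34, np.nan, 29, np.nan],
})

# Convert mistyped columns; values that cannot be parsed become NaN/NaT
demo["price"] = pd.to_numeric(demo["price"], errors="coerce")
demo["signup_date"] = pd.to_datetime(demo["signup_date"], errors="coerce")

# Missing values as counts and as a percentage of rows, worst column first
missing_summary = pd.DataFrame({
    "missing_count": demo.isna().sum(),
    "missing_pct": (demo.isna().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(demo.dtypes)
print(missing_summary)
```

Note that `errors="coerce"` silently turns unparseable entries into missing values, so it is worth comparing the missing counts before and after conversion.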



## 📈 Summary Statistics

✅ Check **mean, median, min, max, standard deviation**.  
✅ Compare distributions across different features.  
✅ Detect **skewness** and potential anomalies.  


In [None]:

# Summary statistics for numerical columns
print(df.describe())

# Summary statistics for categorical columns
# (raises ValueError if the DataFrame has no object or category columns)
print(df.describe(include=["object", "category"]))
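
`describe()` does not report skewness directly. One possible way to check it, sketched here on made-up data with hypothetical column names, is to combine `DataFrame.skew()` with the gap between mean and median:

```python
import pandas as pd

# Hypothetical toy data: one symmetric column, one with a long right tail
demo = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 40],
})

# Sample skewness per numeric column: about 0 for symmetric data,
# positive for a long right tail, negative for a long left tail
skewness = demo.skew(numeric_only=True)

# A mean well above the median is another hint of right skew
mean_minus_median = demo.mean() - demo.median()

print(pd.DataFrame({"skewness": skewness, "mean_minus_median": mean_minus_median}))
```

As a rough rule of thumb (not a strict threshold), columns with absolute skewness well above 1 are candidates for a closer look or a transformation.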



## 📊 Univariate Analysis (Single Variable)

✅ Use **histograms** to check data distributions.  
✅ Use **boxplots** to detect outliers.  
✅ Use **count plots** for categorical features.  


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram for a numerical column (missing values dropped explicitly before binning)
plt.hist(df["numeric_column"].dropna(), bins=30, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Numeric Column")
plt.show()

# Boxplot for detecting outliers
sns.boxplot(x=df["numeric_column"])
plt.title("Boxplot of Numeric Column")
plt.show()
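
The cell above covers histograms and boxplots but not count plots. A minimal sketch for a categorical feature follows; the column name `category_column` and the toy data are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for a categorical feature
demo = pd.DataFrame({"category_column": ["A", "B", "A", "C", "A", "B"]})

# Tabular view of the same information, sorted from most to least frequent
counts = demo["category_column"].value_counts()
print(counts)

# Count plot with bars ordered by frequency
sns.countplot(x=demo["category_column"], order=counts.index)
plt.xlabel("Category")
plt.ylabel("Count")
plt.title("Count Plot of Category Column")
plt.show()
```

Ordering the bars by frequency makes rare categories easier to spot; for columns with many distinct values, plotting only the top few categories may be more readable.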



## 🔍 Multivariate Analysis (Relationships Between Variables)

✅ Use **scatter plots** to check relationships.  
✅ Use **correlation heatmaps** to detect feature correlations.  
✅ Use **pair plots** for a broader perspective.  


In [None]:

# Scatter plot for two numerical variables
plt.scatter(df["column_x"], df["column_y"], alpha=0.5)
plt.xlabel("Column X")
plt.ylabel("Column Y")
plt.title("Scatter Plot of Column X vs Column Y")
plt.show()

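# Correlation heatmap (numeric columns only; in pandas 2.0+,
# df.corr() raises an error if non-numeric columns are present)
plt.figure(figsize=(8, 6))
corr_matrix = df.select_dtypes(include="number").corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")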
plt.title("Correlation Heatmap")
plt.show()



## 🚨 Outlier Detection

✅ Use the **interquartile range (IQR) method** to detect outliers.  
✅ Consider **log transformation** for highly skewed data.  
✅ Use **clipping** if outliers are extreme but valid.  


In [None]:

# Detecting outliers using IQR
Q1 = df["numeric_column"].quantile(0.25)
Q3 = df["numeric_column"].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df["numeric_column"] < lower_bound) | (df["numeric_column"] > upper_bound)]
print(outliers)
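
The log transformation and clipping mentioned in the checklist can be sketched as follows. This is an illustration on made-up values, reusing the same 1.5 × IQR bounds as above; `np.log1p` is used here on the assumption that the data is non-negative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 100], name="numeric_column")

# IQR bounds, computed the same way as in the cell above
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Clipping: keep every row, but cap values at the IQR bounds
clipped = values.clip(lower=lower_bound, upper=upper_bound)

# Log transformation: log(1 + x) compresses the long right tail
logged = np.log1p(values)

print("Raw skewness:   ", round(values.skew(), 2))
print("Logged skewness:", round(logged.skew(), 2))
print("Clipped values: ", clipped.tolist())
```

Clipping changes only the extreme values, while the log transformation rescales every value, so the choice depends on whether the original scale matters downstream.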



## ✅ Best Practices & Common Pitfalls

- **Understand the data first**: Don't apply transformations blindly.  
- **Visualize distributions**: Summary statistics alone may miss key patterns.  
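- **Beware of data leakage**: Set the test set aside before exploring, and base decisions such as imputation values or outlier bounds on the training data only.  
- **Handle missing values carefully**: Whether to drop, impute, or flag missing entries depends on how much is missing and why.  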
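
The last two points can be combined in a small sketch: hold out a test set first, then derive the imputation value from the training portion only. The data, the column names, the 80/20 split, and the choice of median imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing feature values
demo = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Hold out 20% of the rows before any exploration
test_df = demo.sample(frac=0.2, random_state=42)
train_df = demo.drop(test_df.index)

# Compute the imputation value from the training rows only...
train_median = train_df["feature"].median()

# ...and apply that same value to both splits
train_df = train_df.assign(feature=train_df["feature"].fillna(train_median))
test_df = test_df.assign(feature=test_df["feature"].fillna(train_median))

print("Train rows:", len(train_df), "| Test rows:", len(test_df))
print("Imputation value (training median):", train_median)
```

Exploration, plots, and outlier bounds would then be based on `train_df`, with `test_df` left untouched until evaluation.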
