# Exploratory Data Analysis

This notebook performs exploratory data analysis (EDA) on the dataset chosen for the classification project. The goal is to understand the data, identify patterns, and prepare for modeling.

## Objectives
- Load the dataset
- Visualize the data distributions
- Analyze correlations between features
- Identify any missing values or anomalies
- Document findings and insights


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style='whitegrid')  # sns.set is an alias of set_theme


In [None]:
# Load the dataset
data_path = '../data/processed/dataset.csv'  # Update with the actual processed data path
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]
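
Raw counts are easier to judge next to the share of rows they represent. The sketch below is one possible way to report both. The helper name `missing_summary` and the toy frame are hypothetical and stand in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, sorted by severity
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Hypothetical toy frame (assumption: not the project's real data)
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, np.nan],
    "label": [0, 1, 0, 1],
})
print(missing_summary(toy))
```

In the notebook itself, calling `missing_summary(df)` would give the same table for the loaded dataset.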

In [None]:
# Visualize the distribution of each numeric feature (df.hist skips non-numeric columns)
df.hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()
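
Histograms show skew and long tails, but the objectives also call for identifying anomalies. A minimal sketch of one common heuristic, the 1.5 × IQR rule, follows. The rule, the helper name `iqr_outlier_counts`, and the toy frame are assumptions for illustration, not a method prescribed by this project.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare each column against its own lower and upper bound
    too_low = numeric.lt(q1 - k * iqr, axis="columns")
    too_high = numeric.gt(q3 + k * iqr, axis="columns")
    return (too_low | too_high).sum().sort_values(ascending=False)


# Hypothetical toy frame (assumption: not the project's real data)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
})
print(iqr_outlier_counts(toy))
```

A flagged value is only a candidate anomaly; whether to keep, cap, or drop it depends on what the feature represents.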

In [None]:
# Analyze correlations between features
plt.figure(figsize=(12, 8))
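# numeric_only=True skips non-numeric columns, which raise an error in pandas >= 2.0
correlation_matrix = df.corr(numeric_only=True)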
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
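
With many features, an annotated heatmap gets hard to read. One option is to list the strongest pairs directly. The sketch below ranks feature pairs by absolute Pearson correlation. The helper name `top_correlated_pairs` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep the strict upper triangle: drops the diagonal and mirrored duplicates
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # dropna() removes the masked cells regardless of how stack() treats NaN
    return upper.stack().dropna().sort_values(ascending=False).head(n)


# Hypothetical toy frame (assumption: not the project's real data)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
print(top_correlated_pairs(toy))
```

Absolute values hide the sign, so it is worth checking `correlation_matrix` for the direction of any pair that stands out.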

## Findings

- Document any interesting patterns, correlations, or anomalies observed during the analysis.
- Note any preprocessing steps that may be necessary based on the findings.
