
# Exploratory Data Analysis (EDA)

**Exploratory Data Analysis** (EDA) is a crucial step in the data science process: it lets us understand the data before we start modeling. It summarizes the main characteristics of a dataset, usually with a mix of visual and statistical methods. The goal is to gain insight into the data's structure, detect anomalies, and identify relationships between variables.

## Goals of EDA

-   Understanding the structure and shape of the data
-   Detecting outliers, missing values, and noise
-   Identifying relationships between variables
-   Generating hypotheses or informing modeling decisions

## Typical Techniques

-   Summary statistics: mean, standard deviation, min, max, missing-value counts
-   Visualizations:
    -   Histograms, boxplots → distribution of features
    -   Scatter plots → relationships between variables
    -   Correlation heatmaps → strength of linear relationships between features
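
The summary-statistics items in this list map onto a handful of Pandas calls. The snippet below is a minimal sketch on a hypothetical toy DataFrame (the column names `rooms` and `price` are made up for illustration and are not part of any dataset used later).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric features, one with a missing value.
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0, np.nan],
    "price": [100.0, 150.0, 200.0, 250.0, 300.0],
})

# Summary statistics: count, mean, std, min, quartiles, and max per column.
summary = toy.describe()

# Number of missing values per column.
missing = toy.isnull().sum()

# Pairwise Pearson correlation; rows with a missing value are
# excluded for the column pairs they affect.
corr = toy.corr()

print(summary)
print(missing)
print(corr)
```

A correlation heatmap is simply a color-coded rendering of a matrix like `corr`.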

Perform EDA before building machine learning models: a model trained on flawed data produces flawed predictions ("garbage in, garbage out"), and EDA is where such flaws tend to surface.

## Practical Demonstration

In this section, we will perform EDA on the California Housing dataset using Python libraries such as Pandas, Matplotlib, and Seaborn. We will cover the following steps:

-   Load the dataset using Scikit-learn's `fetch_california_housing`.

In [None]:
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)

# data.frame is a single DataFrame holding the features and the MedHouseVal target
df = data.frame
df.info()

-   Show descriptive statistics (mean, std, min, max) with Pandas.

In [None]:
print("Descriptive Statistics:")
print(df.describe())

-   Count missing values (if any).

In [None]:
print("Missing Values:")
print(df.isnull().sum())
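
Raw counts are harder to read once a dataset has many columns. One possible refinement, sketched below, keeps only the columns that actually contain missing values and adds a percentage. The helper name `missing_report` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100.0 * counts / len(frame),
    })
    # Drop complete columns and list the worst offenders first.
    report = report[report["n_missing"] > 0]
    return report.sort_values("n_missing", ascending=False)

# Hypothetical toy data with gaps in a numeric and a text column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": ["x", None, "z", "w"],
})
print(missing_report(toy))
```

Calling `missing_report(df)` on a DataFrame without gaps returns an empty report.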

-   Explore pairwise relationships with `pairplot()`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, diag_kind='kde', markers='o', corner=True)
plt.suptitle('Pairwise Relationships in California Housing Dataset', y=1.02)
plt.show()
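
A pair plot draws one panel per pair of columns, so it can be slow on a dataset with tens of thousands of rows. A common workaround, sketched here under the assumption that a random subset preserves the overall patterns well enough for a first look, is to plot a reproducible sample. The helper `subsample` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def subsample(frame: pd.DataFrame, n: int = 2000, seed: int = 0) -> pd.DataFrame:
    """Return at most n randomly chosen rows, reproducibly."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)

# Synthetic stand-in for a large dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10_000, 3)), columns=["x", "y", "z"])

small = subsample(toy, n=500)
print(small.shape)
```

The result can be passed to `sns.pairplot` in place of the full DataFrame, for example `sns.pairplot(subsample(df), corner=True)`.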

-   Plot histograms of `MedHouseVal` (median house value) and `MedInc` (median income).

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['MedHouseVal'], bins=30, kde=True)
plt.title('Distribution of Median House Value (MedHouseVal)')
plt.subplot(1, 2, 2)
sns.histplot(df['MedInc'], bins=30, kde=True)
plt.title('Distribution of Median Income (MedInc)')
plt.tight_layout()
plt.show()
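
Two things worth checking numerically after looking at a histogram are skewness and whether many observations pile up at the maximum, which can indicate that values were capped during data collection. The sketch below uses a hypothetical toy series. Whether `MedHouseVal` or `MedInc` show either effect is something to verify on the real data.

```python
import pandas as pd

def ceiling_share(values: pd.Series) -> float:
    """Fraction of observations equal to the column maximum."""
    return float((values == values.max()).mean())

# Hypothetical toy series with half of its values stuck at the maximum.
toy = pd.Series([1.0, 2.0, 2.5, 3.0, 5.0, 5.0, 5.0, 5.0])

print(f"skewness: {toy.skew():.2f}")
print(f"share at max: {ceiling_share(toy):.2f}")
```

A share well above what a continuous variable would produce by chance suggests a ceiling, and that is worth knowing before fitting a regression model.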

-   Plot the correlation matrix of all numerical features.

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Correlation Matrix of California Housing Features')
plt.show()
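
For modeling, the column of the correlation matrix that belongs to the target is usually the most interesting part. The sketch below ranks features by the absolute value of their Pearson correlation with a target column. The helper `correlations_with` and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def correlations_with(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # Order by absolute strength but keep the sign in the result.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "strong": x,
    "noise": rng.normal(size=500),
    "target": 2.0 * x + rng.normal(scale=0.1, size=500),
})
print(correlations_with(toy, "target"))
```

With the housing data this would be called as `correlations_with(df, "MedHouseVal")`. Keep in mind that Pearson correlation only captures linear relationships.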

## Hands-on Exercises

In this exercise, you will apply the concepts learned in the theoretical introduction and practical demonstration to perform EDA on the Ames Housing dataset. Follow the steps below:

-   Load the dataset using Scikit-learn (`fetch_openml`).

In [None]:
from sklearn.datasets import fetch_openml

data = fetch_openml(name='house_prices', as_frame=True)
df = data.frame
df.info()

-   Show descriptive statistics (mean, std, min, max) with Pandas.

In [None]:
# describe() summarizes numeric columns only by default
print(df.describe())

-   Check for missing values.

In [None]:
print("Missing values:\n", df.isnull().sum())

-   Create a correlation matrix heatmap for all numerical features.

In [None]:
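import matplotlib.pyplot as plt
import seaborn as sns

# Large figure: the dataset has many numeric columns
plt.figure(figsize=(36, 30))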
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix for Ames Housing Dataset")
plt.show()

-   Plot a histogram of the `SalePrice` target variable and a boxplot of `OverallQual`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histogram of SalePrice
plt.figure(figsize=(10, 6))
sns.histplot(df['SalePrice'], bins=30, kde=True)
plt.title('Distribution of SalePrice')
plt.xlabel('SalePrice')
plt.ylabel('Frequency')
plt.show()

# Plot boxplot of OverallQual
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['OverallQual'])
plt.title('Boxplot of OverallQual')
plt.xlabel('OverallQual')
plt.show()
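
By default, a Seaborn boxplot draws points beyond 1.5 times the interquartile range (IQR) from the box as individual outliers. The same rule can be applied numerically to list those points. The sketch below uses a hypothetical helper `iqr_outliers` on made-up values.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical toy values with one obvious outlier.
toy = pd.Series([5, 6, 6, 7, 7, 7, 8, 8, 9, 30])
mask = iqr_outliers(toy)
print(toy[mask])
```

Flagged points are candidates for inspection, not automatic deletion: an extreme value may be a data-entry error or a perfectly valid observation.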

## Summary

This section covered why Exploratory Data Analysis (EDA) matters in the machine learning workflow and demonstrated it on the California Housing dataset with Pandas, Matplotlib, and Seaborn. The hands-on exercises repeat the same steps on the Ames Housing dataset. EDA exposes the data's structure, its anomalies, and the relationships between variables, so that modeling decisions rest on a solid foundation.