# Exploratory Data Analysis (EDA)

This notebook demonstrates essential EDA techniques using the Iris dataset.
We'll cover data loading, summary statistics, visualizations, and correlation analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# 1. Setup and Data Loading

The Iris dataset, also known as Fisher's Iris dataset, was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems." It has since become one of the most widely used datasets for teaching statistics and machine learning.

Dataset Description:
- Contains 150 samples from three different species of Iris flowers
- Each species has 50 samples
- Four features are measured for each sample:
  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm)
  4. Petal width (cm)
- Three target classes (species):
  1. Iris Setosa
  2. Iris Versicolor
  3. Iris Virginica
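The figures in this description can be checked against the copy of the dataset that ships with scikit-learn. The snippet below is a minimal sketch; the variable names `n_samples`, `n_features`, and `class_counts` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Shape of the feature matrix: (number of samples, number of features)
n_samples, n_features = iris.data.shape

# Count how many samples carry each integer class label (0, 1, 2)
class_counts = dict(
    zip(iris.target_names.tolist(), np.bincount(iris.target).tolist())
)

print(f"Samples: {n_samples}, features: {n_features}")
print("Feature names:", iris.feature_names)
print("Samples per species:", class_counts)
```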

![Iris Flower Types](../images/iris_types.png)

In [None]:
# Loading the Iris dataset
iris = load_iris()
df = pd.DataFrame(
    data=np.c_[iris["data"], iris["target"]], columns=iris["feature_names"] + ["target"]
)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print("First few rows of the dataset:")
df.head()

# 2. Basic Data Exploration

## 2.1 Dataset Information

In [None]:
# Dataset Information
df.info()

In [None]:
# Basic Statistics
df.describe()

In [None]:
# Missing values
df.isnull().sum()

## 2.2 Data Distribution Analysis

### 2.2.1 Histogram
> A histogram is like a bar chart that shows how often different measurements occur. From the histograms in this dataset, we can learn:

**Sepal Length Distribution (Top Left)**:
- Range: 4.3-7.9 cm
- Setosa (blue) tends to have shorter sepals (clustered around 4.5-5.5 cm)
- Versicolor (orange) and Virginica (green) overlap considerably
- Virginica generally has longer sepals (more measurements in 6.5-8.0 cm range)
- Overall distribution appears somewhat normal (bell-shaped)

**Sepal Width Distribution (Top Right)**:
- Range: 2.0-4.4 cm
- Most measurements cluster between 2.5-3.5 cm
- High overlap between all three species
- Setosa shows some distinction with wider sepals (more measurements in 3.5-4.0 cm)
- This feature alone isn't very good for distinguishing species due to high overlap

**Petal Length Distribution (Bottom Left)**:
- Range: 1.0-6.9 cm
- Shows very clear separation between species:
  - Setosa (blue) has distinctly short petals (1-2 cm)
  - Versicolor (orange) clusters around 3-5 cm
  - Virginica (green) has the longest petals (5-7 cm)
- This is probably the best single feature for distinguishing Setosa from the others

**Petal Width Distribution (Bottom Right)**:
- Range: 0.1-2.5 cm
- Similar pattern to petal length:
  - Setosa has very narrow petals (clustered around 0.2-0.5 cm)
  - Clear separation between Setosa and the others
  - Some overlap between Versicolor and Virginica
- Another good feature for identifying Setosa


In [None]:
# Histograms for each feature
plt.figure(figsize=(12, 8))
for i, feature in enumerate(iris.feature_names):
    plt.subplot(2, 2, i + 1)
    sns.histplot(data=df, x=feature, hue="species", multiple="stack")
    plt.title(f"Distribution of {feature}")
plt.tight_layout()
plt.show()

Key Insights:
1. Petal measurements (length and width) are better at distinguishing species than sepal measurements
2. Setosa is the most distinct species, especially in petal measurements
3. Versicolor and Virginica have some overlap in all measurements
4. Sepal width shows the most overlap between species, making it the least useful for classification
5. The distributions help explain why machine learning models might perform well on this dataset, especially for identifying Setosa
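The overlap claims above can also be backed with numbers rather than read off the plots. The following sketch tabulates the minimum and maximum of every feature within each species; `df_check`, `species_ranges`, `setosa_max`, and `others_min` are illustrative names introduced here, and the frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_check = pd.DataFrame(iris.data, columns=iris.feature_names)
df_check["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Minimum and maximum of every feature within each species
species_ranges = df_check.groupby("species", observed=True).agg(["min", "max"])
print(species_ranges.T)

# Setosa's largest petal length vs the smallest petal length of the other species
setosa_max = species_ranges.loc["setosa", ("petal length (cm)", "max")]
others_min = species_ranges.loc[
    ["versicolor", "virginica"], ("petal length (cm)", "min")
].min()
print(f"Setosa max petal length: {setosa_max}, others' min: {others_min}")
```

If Setosa's maximum lies below the other species' minimum, a single threshold on petal length is enough to isolate Setosa, which is what the histogram suggests.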

### 2.2.2 Box Plots

> Box plots are like "five-number summaries" in picture form. They show:

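- The middle line = median (middle value)
- The box = the interquartile range (IQR), i.e. the middle 50% of the data, from the first quartile (Q1) to the third quartile (Q3)
- The whiskers = extend to the most extreme values that lie within 1.5 × IQR of the box (the seaborn and matplotlib default)
- Dots beyond the whiskers = unusual measurements (outliers)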
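To make the elements of a box plot concrete, the sketch below computes them by hand for one group. It assumes the common 1.5 × IQR whisker rule and uses Setosa sepal width purely as an example; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Sepal width (column 1) of Setosa (class 0), chosen only as an example
values = iris.data[iris.target == 0, 1]

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 x IQR: points outside these limits are drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"Whiskers: {whisker_low:.2f} to {whisker_high:.2f}")
print("Outliers:", outliers)
```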

The box plots below show the distribution of each measurement for each Iris species:

**Sepal Length (Top Left)**:
- Setosa: Shortest sepals (median ~5cm)
- Versicolor: Medium length (median ~6cm)
- Virginica: Longest sepals (median ~6.5cm)
- Shows clear progression in size from Setosa → Versicolor → Virginica
- Some overlap between species, with one outlier in Virginica
- Boxes (interquartile ranges) show moderate separation

**Sepal Width (Top Right)**:
- Heaviest overlap between species of the four features
- Setosa: Widest sepals (median ~3.5cm)
- Versicolor and Virginica: Similar widths (median ~3cm)
- Several outliers present (dots)
- Least useful for species differentiation due to high overlap

**Petal Length (Bottom Left)**:
- Shows strongest separation between species
- Setosa: Very short petals (median ~1.5cm)
- Versicolor: Medium length (median ~4.5cm)
- Virginica: Longest petals (median ~5.5cm)
- Few outliers
- Clear separation of Setosa from others
- Some overlap between Versicolor and Virginica

**Petal Width (Bottom Right)**:
- Similar pattern to petal length
- Setosa: Very narrow petals (median ~0.2cm)
- Versicolor: Medium width (median ~1.3cm)
- Virginica: Widest petals (median ~2cm)
- Clear separation between species
- Minimal overlap between groups

In [None]:
# Box plots for each feature
plt.figure(figsize=(12, 6))
for i, feature in enumerate(iris.feature_names):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data=df, y=feature, x="species")
    plt.title(f"Box Plot of {feature} by Species")
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Key Insights:
1. Petal measurements (length and width) are most useful for species classification
2. Setosa is most distinct, especially in petal measurements
3. Sepal width shows most overlap, making it least useful for classification
4. Virginica generally has the largest measurements, except for sepal width, where Setosa is largest
5. Data is relatively clean with few outliers

## 2.3 Relationship Analysis

### 2.3.1 Pair Plot

This is a comprehensive visualization called a "pairs plot" or "scatter matrix" that shows relationships between all pairs of features in the Iris dataset.

This visualization is particularly useful for:
- Understanding feature relationships
- Identifying the most discriminative features
- Detecting patterns and clusters
- Spotting potential outliers
- Planning machine learning approaches

In [None]:
# Scatter plot matrix
sns.pairplot(df, hue="species", diag_kind="hist")
plt.show()

**Key Insights**:
- Petal measurements are best for species classification
- Setosa is the most distinctly separated species
- Some overlap exists between versicolor and virginica
- Sepal width is least useful for classification
- Multiple features together provide better separation than any single feature

### 2.3.2 Correlation Matrix

The correlation matrix shows how strongly features are related to each other, from -1 (opposite) to +1 (same direction):
- Dark red = strong positive correlation (when one increases, the other does too)
- Dark blue = strong negative correlation (when one increases, the other decreases)
- Light colors = weak correlation
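Each cell of the matrix is a Pearson correlation coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$. As a small worked example, the sketch below computes it by hand for petal length and petal width and compares the result with `np.corrcoef`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]
petal_width = iris.data[:, 3]

# Pearson's r: sum of products of deviations from the mean, divided by the
# square root of the product of the summed squared deviations
dx = petal_length - petal_length.mean()
dy = petal_width - petal_width.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Same quantity from NumPy: off-diagonal entry of the 2 x 2 correlation matrix
r_numpy = np.corrcoef(petal_length, petal_width)[0, 1]

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```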

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df.drop("species", axis=1).corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix")
plt.show()

Key insights:
- Petal length and width are strongly correlated (dark red)
- Sepal width has weak to moderate negative correlations with the other features (light blue shades)
- This helps us understand which measurements tend to change together

## 2.4 Statistical Analysis

In [None]:
# Group mean values by species
df.groupby("species", observed=True).mean()

In [None]:
# Standard deviation by species
df.groupby("species", observed=True).std()

# 3. Advanced Analysis

## 3.1 Distribution Tests

In [None]:
from scipy import stats

In [None]:
# Perform normality test for each feature
print("\nNormality Tests (Shapiro-Wilk):")
for feature in iris.feature_names:
    stat, p_value = stats.shapiro(df[feature])
    print(f"{feature}:")
    print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")
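
The test above pools all three species into one column. Because a pooled feature mixes three groups with different means, it may fail a normality test simply because it is a mixture. One possible complement, sketched here as an assumption rather than as part of the original analysis, is to run Shapiro-Wilk within each species; `shapiro_by_species` is an illustrative name.

```python
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris()

# Maps (species, feature) -> (W statistic, p-value)
shapiro_by_species = {}
for class_id, species in enumerate(iris.target_names):
    for col, feature in enumerate(iris.feature_names):
        values = iris.data[iris.target == class_id, col]
        stat, p_value = stats.shapiro(values)
        shapiro_by_species[(str(species), feature)] = (stat, p_value)
        print(f"{species:<10} {feature:<18} W={stat:.4f}, p={p_value:.4f}")
```

A p-value below the chosen threshold (commonly 0.05) suggests a departure from normality for that species and feature. With 50 samples per group and measurements recorded to one decimal place, the results are best read alongside the histograms rather than on their own.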

## 3.2 Feature Engineering

We create two new features, `petal_ratio` and `sepal_ratio`, by dividing the petal and sepal lengths by their respective widths. These shape features may improve our ability to distinguish between species based on their petal and sepal proportions.

In [None]:
# Create ratio features
df["petal_ratio"] = df["petal length (cm)"] / df["petal width (cm)"]
df["sepal_ratio"] = df["sepal length (cm)"] / df["sepal width (cm)"]

After creating these features, we visualize them using a scatter plot to observe potential patterns or separations among species. This visualization helps us evaluate if these engineered features can provide additional insights for classification or analysis tasks.

In [None]:
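# Visualize new features (a single scatter plot, so no subplot grid is needed)
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="petal_ratio", y="sepal_ratio", hue="species")
plt.title("Petal Ratio vs Sepal Ratio")
plt.xlabel("Petal ratio (length / width)")
plt.ylabel("Sepal ratio (length / width)")
plt.show()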
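
As a numeric companion to the scatter plot, the engineered ratios can be summarised per species. This is a minimal sketch that rebuilds the two ratio columns so it runs on its own; `ratios` and `ratio_summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
ratios = pd.DataFrame(iris.data, columns=iris.feature_names)
ratios["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Same definitions as above: length divided by width
ratios["petal_ratio"] = ratios["petal length (cm)"] / ratios["petal width (cm)"]
ratios["sepal_ratio"] = ratios["sepal length (cm)"] / ratios["sepal width (cm)"]

# Median, minimum, and maximum of each ratio within each species
ratio_summary = (
    ratios.groupby("species", observed=True)[["petal_ratio", "sepal_ratio"]]
    .agg(["median", "min", "max"])
    .round(2)
)
print(ratio_summary)
```

Ratios whose per-species intervals barely overlap are the ones most likely to add separating power beyond the raw measurements.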

# 4. Key Findings Summary

Key Insights from EDA:
1. Dataset Structure:
   - 150 samples with 4 features and 1 target variable
   - No missing values
   - Balanced classes (50 samples per species)

2. Distribution Patterns:
   - Most features show different distributions for each species
   - Setosa species is the most distinctly separated
   - Versicolor and Virginica show some overlap

3. Correlations:
   - Strong positive correlation between petal length and petal width
   - Strong positive correlation between sepal length and both petal features (roughly 0.8 to 0.9)
   - Sepal width shows weak to moderate negative correlations with the other features

4. Species Characteristics:
   - Setosa: Smallest petals, large sepal width
   - Versicolor: Intermediate in most features
   - Virginica: Largest in most measurements

# 5. Exercises

> **Exercise 1:** Calculate the range (max-min) for each feature by species.

In [None]:
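# Solution 1
# Restrict to the four original measurements so that `target` and the
# engineered ratio columns are not included in the result
ranges = df.groupby("species", observed=True)[iris.feature_names].agg(
    lambda x: x.max() - x.min()
)
print("\nRanges by species:")
ranges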

Range shows the spread of measurements:
- Calculated as largest value minus smallest value
- Important because it tells us:
  - How much variation exists within each species
  - Which features vary most/least
  - Whether any measurements look unusual (a single extreme value is enough to inflate the range)

> **Exercise 2:** Perform t-tests between species for each feature.

In [None]:
# Solution 2
for feature in iris.feature_names:
    print(f"\nT-tests for {feature}:")
    setosa = df[df["species"] == "setosa"][feature]
    versicolor = df[df["species"] == "versicolor"][feature]
    virginica = df[df["species"] == "virginica"][feature]

    t_stat, p_val = stats.ttest_ind(setosa, versicolor)
    print(f"Setosa vs Versicolor: t={t_stat:.4f}, p={p_val:.4f}")

    t_stat, p_val = stats.ttest_ind(setosa, virginica)
    print(f"Setosa vs Virginica: t={t_stat:.4f}, p={p_val:.4f}")

    t_stat, p_val = stats.ttest_ind(versicolor, virginica)
    print(f"Versicolor vs Virginica: t={t_stat:.4f}, p={p_val:.4f}")

The t-test results here compare the mean values of each feature (sepal length, sepal width, petal length, and petal width) across different species of the Iris dataset (Setosa, Versicolor, and Virginica). Let’s go through what each result implies:

1. **T-tests for Sepal Length (cm)**
   - **Setosa vs Versicolor**: $t = -10.5210$, $p < 0.0001$
   - **Setosa vs Virginica**: $t = -15.3862$, $p < 0.0001$
   - **Versicolor vs Virginica**: $t = -5.6292$, $p < 0.0001$

   **Interpretation**:
   - The negative t-values indicate that, on average, Setosa’s sepal length is smaller than both Versicolor's and Virginica's, and Versicolor’s sepal length is also smaller than Virginica's.
   - The p-values are all close to 0 (below 0.05), which suggests the differences in sepal length between each pair of species are statistically significant.

2. **T-tests for Sepal Width (cm)**
   - **Setosa vs Versicolor**: $t = 9.4550$, $p < 0.0001$
   - **Setosa vs Virginica**: $t = 6.4503$, $p < 0.0001$
   - **Versicolor vs Virginica**: $t = -3.2058$, $p = 0.0018$

   **Interpretation**:
   - Positive t-values for Setosa vs Versicolor and Setosa vs Virginica indicate that Setosa has a greater sepal width on average than both Versicolor and Virginica.
   - The negative t-value for Versicolor vs Virginica indicates that Versicolor has a slightly smaller sepal width on average than Virginica.
   - All p-values are below 0.05, indicating these differences in sepal width are statistically significant.

3. **T-tests for Petal Length (cm)**
   - **Setosa vs Versicolor**: $t = -39.4927$, $p < 0.0001$
   - **Setosa vs Virginica**: $t = -49.9862$, $p < 0.0001$
   - **Versicolor vs Virginica**: $t = -12.6038$, $p < 0.0001$

   **Interpretation**:
   - The highly negative t-values suggest that Setosa has a much smaller petal length compared to both Versicolor and Virginica, and Versicolor has a smaller petal length than Virginica.
   - With p-values close to 0, these differences are highly significant.

4. **T-tests for Petal Width (cm)**
   - **Setosa vs Versicolor**: $t = -34.0803$, $p < 0.0001$
   - **Setosa vs Virginica**: $t = -42.7858$, $p < 0.0001$
   - **Versicolor vs Virginica**: $t = -14.6254$, $p < 0.0001$

   **Interpretation**:
   - The negative t-values suggest that Setosa has a significantly smaller petal width than both Versicolor and Virginica, and Versicolor has a smaller petal width than Virginica.
   - The p-values indicate these differences in petal width are statistically significant.


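Across all four features, every pairwise t-test shows a statistically significant difference in means (p < 0.05). Because 12 tests are run in total, a Bonferroni-corrected threshold of 0.05 / 12 ≈ 0.0042 is a more conservative bar, and the largest p-value here (0.0018, Versicolor vs Virginica sepal width) still clears it. A significant difference in means does not by itself guarantee that a feature separates the species well, since the distributions can still overlap (as they do for sepal width), but together with the plots above these results support using sepal and petal dimensions to differentiate Setosa, Versicolor, and Virginica.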
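
One possible refinement of the solution, offered as a sketch rather than as the intended answer, is to use Welch's t-test (which does not assume equal variances), apply a Bonferroni correction for the 12 comparisons, and report Cohen's d as an effect size. The names `results`, `p_adj`, and `cohens_d` are illustrative.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris()
species_names = [str(name) for name in iris.target_names]
pairs = list(combinations(range(len(species_names)), 2))
n_tests = len(pairs) * len(iris.feature_names)  # 3 pairs x 4 features = 12

results = []
for col, feature in enumerate(iris.feature_names):
    for a, b in pairs:
        x = iris.data[iris.target == a, col]
        y = iris.data[iris.target == b, col]

        # Welch's t-test: does not assume equal variances in the two groups
        t_stat, p_val = stats.ttest_ind(x, y, equal_var=False)

        # Bonferroni correction: multiply by the number of tests, cap at 1
        p_adj = min(p_val * n_tests, 1.0)

        # Cohen's d using the average of the two sample variances
        # (appropriate here because both groups have the same size)
        pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        cohens_d = (x.mean() - y.mean()) / pooled_sd

        results.append(
            (feature, species_names[a], species_names[b], t_stat, p_adj, cohens_d)
        )
        print(
            f"{feature:<18} {species_names[a]:>10} vs {species_names[b]:<10} "
            f"t={t_stat:8.3f}  p_adj={p_adj:.2e}  d={cohens_d:6.2f}"
        )
```

The effect size is useful alongside the p-value: with 50 samples per group even modest mean differences reach significance, whereas the magnitude of d indicates how far apart the groups are relative to their spread.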