# 🌸 Exploratory Data Analysis with the Iris Dataset

**Author:** André Lopes Marinho  
**Goal:** Use descriptive statistics and Python to understand the structure and distribution of flower measurements in the Iris dataset.

---

## 📌 What You'll Learn

- What are **mean**, **median**, and **mode**?
- How to calculate **variance** and **standard deviation**
- How to summarize a real-world dataset using Python

---

## 📊 Dataset: The Iris Flowers

This famous dataset contains 150 records covering **three iris species** (*setosa*, *versicolor*, *virginica*), each with four measurements in centimeters:

- Sepal length & width
- Petal length & width

---

## 1. 📁 **Load the Data**

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## 2. 🔍 **Understanding the Data Structure**

Before analyzing or visualizing data, it's critical to understand what we're working with. Step 2 is all about inspecting the dataset's **structure, completeness, and summary statistics**.

We'll use the following tools from the `pandas` library:

- `.info()` – to check column names, data types, and non-null counts (which reveal missing values).
- `.describe()` – to generate summary statistics for numeric columns.

In [2]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |

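The `25%`, `50%`, and `75%` rows in the summary are quartiles. As a rough sketch of how such values can be obtained, the snippet below assumes linear interpolation between sorted values (the pandas default, to my understanding) and applies it to the five sepal lengths shown by `df.head()`. The helper name `quantile_linear` is hypothetical and not part of pandas.

```python
def quantile_linear(values, q):
    """Quantile q (between 0 and 1) using linear interpolation between sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted list
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return ordered[-1]
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

# Sepal lengths of the first five rows shown by df.head()
sepal_length_head = [5.1, 4.9, 4.7, 4.6, 5.0]

for q in (0.25, 0.50, 0.75):
    print(q, quantile_linear(sepal_length_head, q))
```

With only five rows the quartiles land exactly on sorted values; on the full 150-row column, the interpolation step matters whenever the fractional index is not a whole number.
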

## 3. 📐 **Descriptive Statistics**

Now that we’ve explored the dataset’s structure, let’s dive into **descriptive statistics**. These are numerical values that summarize and describe the main features of a dataset.

In this step, we’ll calculate:

- Measures of **central tendency**: mean, median, mode
- Measures of **spread (dispersion)**: variance and standard deviation

These statistics give insight into the typical values and variability of each feature.

---


### 🧠 Mean (Arithmetic Average)

The mean is the sum of all values divided by the number of values. It describes the **central location** of the data, but extreme values can pull it away from the bulk of the observations.

$$
\text{Mean} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

In [3]:
df.mean(numeric_only=True)

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

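To connect the formula to the pandas call, here is a minimal sketch that applies it by hand to the five sepal lengths shown by `df.head()`. The variable names are illustrative, and the result describes only those five rows, not the full column.

```python
import math
from statistics import mean

# Sepal lengths of the first five rows shown by df.head()
sepal_length_head = [5.1, 4.9, 4.7, 4.6, 5.0]

# Mean = (1/n) * sum of all x_i
n = len(sepal_length_head)
manual_mean = sum(sepal_length_head) / n

# The standard library agrees with the hand calculation
assert math.isclose(manual_mean, mean(sepal_length_head))
print(round(manual_mean, 2))  # 4.86
```
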
### 🧠 Median

The median is the middle value when all values are sorted. It’s less sensitive to outliers than the mean. Below, $x_{(k)}$ denotes the $k$-th smallest value.

- If $n$ is odd:

$$
\text{Median} = x_{\left(\frac{n+1}{2}\right)}
$$

- If $n$ is even:

$$
\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}
$$

In [4]:
df.median(numeric_only=True)

sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64

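The odd and even cases can be sketched in plain Python as follows. The function name `manual_median` is hypothetical, and the samples reuse the sepal lengths from `df.head()` (five values for the odd case, the first four for the even case).

```python
from statistics import median

def manual_median(values):
    """Median via the odd/even formulas, using sorted (order-statistic) values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # x_((n+1)/2) in 1-based terms is index n // 2 in 0-based terms
        return ordered[n // 2]
    # Average of x_(n/2) and x_(n/2 + 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [5.1, 4.9, 4.7, 4.6, 5.0]
even_sample = odd_sample[:4]

print(manual_median(odd_sample))  # 4.9
print(manual_median(even_sample) == median(even_sample))  # True
```
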
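### 🧠 Mode

The mode is the most frequently occurring value in a dataset. A column can have several modes; `df.mode()` returns all of them, sorted, one per row, and `.iloc[0]` keeps only the first row, i.e. the smallest mode of each column.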

$$
\text{Mode} = \text{value with highest frequency}
$$

In [5]:
df.mode(numeric_only=True).iloc[0]

sepal_length    5.0
sepal_width     3.0
petal_length    1.4
petal_width     0.2
Name: 0, dtype: float64

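Because a dataset can have several modes, it can help to list all of them before keeping just one. The sketch below counts frequencies by hand. `all_modes` is a hypothetical helper, the petal lengths come from `df.head()`, and the tied list is made-up data for illustration only.

```python
from collections import Counter

def all_modes(values):
    """Return every value that shares the highest frequency, sorted ascending."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Petal lengths of the first five rows shown by df.head()
petal_length_head = [1.4, 1.4, 1.3, 1.5, 1.4]
# Hypothetical values with a tie, NOT taken from the Iris dataset
tied_example = [1, 1, 2, 2, 3]

print(all_modes(petal_length_head))  # [1.4]
print(all_modes(tied_example))       # [1, 2]
```

Taking the first element of the sorted result mirrors what `.iloc[0]` does in the cell above: it reports one mode and silently drops any ties.
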
### 🔁 Variance

Variance measures how far the values are spread out from the mean. It is the average of the squared differences from the mean. A larger variance means more spread in the data.

- **Population variance**:

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

- **Sample variance** (used by default in `pandas`):

$$
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

In [6]:
df.var(numeric_only=True)

sepal_length    0.685694
sepal_width     0.189979
petal_length    3.116278
petal_width     0.581006
dtype: float64

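The two formulas differ only in the denominator. As a minimal sketch, the snippet below computes both by hand for the five sepal lengths shown by `df.head()` and cross-checks them against the standard library. The numbers describe those five rows only, not the full column.

```python
import math
from statistics import pvariance, variance

# Sepal lengths of the first five rows shown by df.head()
sepal_length_head = [5.1, 4.9, 4.7, 4.6, 5.0]

n = len(sepal_length_head)
x_bar = sum(sepal_length_head) / n
squared_deviations = [(x - x_bar) ** 2 for x in sepal_length_head]

population_variance = sum(squared_deviations) / n        # divide by n
sample_variance = sum(squared_deviations) / (n - 1)      # divide by n - 1

# Cross-check against the standard library implementations
assert math.isclose(population_variance, pvariance(sepal_length_head))
assert math.isclose(sample_variance, variance(sepal_length_head))
print(round(population_variance, 4), round(sample_variance, 4))  # 0.0344 0.043
```
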
### 🔁 Standard Deviation

The standard deviation is the square root of the variance and is in the same unit as the original data. It’s one of the most common ways to quantify variability.

- **Population standard deviation**:

$$
\sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }
$$

- **Sample standard deviation**:

$$
s = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }
$$


In [7]:
df.std(numeric_only=True)

sepal_length    0.828066
sepal_width     0.435866
petal_length    1.765298
petal_width     0.762238
dtype: float64

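As a quick consistency check, the square root of each variance reported by `df.var()` should reproduce the matching value from `df.std()`. The sketch below copies the rounded outputs shown in this notebook into plain dictionaries (the dictionary names are illustrative) and allows a small tolerance for rounding.

```python
import math

# Values copied from the df.var() and df.std() outputs in this notebook
variances = {
    "sepal_length": 0.685694,
    "sepal_width": 0.189979,
    "petal_length": 3.116278,
    "petal_width": 0.581006,
}
stds = {
    "sepal_length": 0.828066,
    "sepal_width": 0.435866,
    "petal_length": 1.765298,
    "petal_width": 0.762238,
}

for name, var in variances.items():
    # The tolerance absorbs the six-decimal rounding of the printed values
    assert math.isclose(math.sqrt(var), stds[name], abs_tol=1e-5)
    print(name, round(math.sqrt(var), 4))
```
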
> ⚠️ Note: This notebook uses **sample statistics**.
> pandas computes `.var()` and `.std()` with **n - 1** in the denominator by default (`ddof=1`),
> which corrects the bias of the variance estimate when the data are a sample from a larger population.

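If the population version is wanted instead, the sample standard deviation can be rescaled by $\sqrt{(n-1)/n}$. The sketch below applies this to the sepal length value reported above, assuming $n = 150$ as shown by `df.info()`, and verifies the identity on the five `df.head()` values. In pandas, passing `ddof=0` to `.std()` should give the population figure directly.

```python
import math
from statistics import pstdev, stdev

n = 150                   # number of rows reported by df.info()
sample_std = 0.828066     # sepal_length, copied from the df.std() output

# Rescale from the n - 1 denominator to the n denominator
population_std = sample_std * math.sqrt((n - 1) / n)
print(round(population_std, 4))  # roughly 0.8253

# Verify the identity on the five sepal lengths shown by df.head()
sepal_length_head = [5.1, 4.9, 4.7, 4.6, 5.0]
k = len(sepal_length_head)
assert math.isclose(pstdev(sepal_length_head),
                    stdev(sepal_length_head) * math.sqrt((k - 1) / k))
```

With 150 observations the two versions differ by well under one percent, so the choice matters far more for small samples.
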
## 4. ❓ **Interpretation Questions & Answers**

Using the descriptive statistics we calculated, let’s reflect on what we’ve learned:

---

### 💡 What is the average petal length across all species?

The average petal length is approximately **3.76 cm**.

---

### 💡 Which feature has the highest variance? What does that imply?

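**Petal length** has the highest variance (≈ 3.12), so it varies the most across observations. That makes it a **promising candidate for separating species**, but the pooled variance alone cannot show how much of the spread comes from differences between species; confirming that requires computing the statistics per species.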

---

### 💡 Which feature has the lowest standard deviation?

**Sepal width** has the lowest standard deviation (≈ 0.44), indicating **less variation** across samples. On its own, this suggests the feature might be **less useful for distinguishing species**, though that also needs a per-species check.

---

### 💡 How might these statistics help us identify features that are useful for classification?

In classification problems, a feature is most useful when its values differ **more between classes than within a class**. High overall variability is a hint in that direction, not proof. For the Iris dataset, **petal length** and **petal width** are likely to be more useful than **sepal width**, because their larger spread plausibly reflects differences across species.
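
To illustrate why overall variance is only a hint, the sketch below splits the pooled (population) variance into a between-group part and a within-group part, which always add up to the total. The two toy features are hypothetical values, not Iris measurements, and `variance_split` is an illustrative helper rather than a library function.

```python
import math
from statistics import mean, pvariance

def variance_split(groups):
    """Split the pooled population variance into between- and within-group parts."""
    all_values = [x for values in groups.values() for x in values]
    n = len(all_values)
    grand_mean = mean(all_values)
    between = sum(len(values) * (mean(values) - grand_mean) ** 2
                  for values in groups.values()) / n
    within = sum(len(values) * pvariance(values)
                 for values in groups.values()) / n
    return between, within

# Hypothetical toy measurements, NOT taken from the Iris dataset
separated = {"class_a": [1.0, 1.2, 1.4], "class_b": [4.0, 4.2, 4.4]}
overlapping = {"class_a": [1.0, 2.7, 4.4], "class_b": [1.0, 2.7, 4.4]}

for name, groups in [("separated", separated), ("overlapping", overlapping)]:
    between, within = variance_split(groups)
    pooled = pvariance([x for values in groups.values() for x in values])
    # Between-group and within-group parts sum to the pooled variance
    assert math.isclose(between + within, pooled)
    print(name, "between:", round(between, 3), "within:", round(within, 3))
```

Both toy features have a similar pooled variance, yet only the first one separates the classes: nearly all of its spread sits between the groups. Applying the same split per species to the real columns would be one way to test the interpretation above.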