
# 📈 Summary Statistics

Summary statistics help us describe a dataset with just a few numbers. Instead of staring at thousands of rows, we can quickly understand **center**, **spread**, and **shape**.

This notebook covers:
* Measures of central tendency (mean, median, mode)
* Measures of spread (range, variance, standard deviation, IQR)
* Skewness and kurtosis (just enough to impress your friends)
* Quick summaries with Pandas

## 1. Central Tendency
These describe the 'middle' of the data:
* **Mean** – average
* **Median** – middle value
* **Mode** – most frequent value

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=True)[0][0])

👉 Question: Which is more robust to outliers, the mean or the median?
👉 Answer: The median. A single extreme value pulls the mean toward it, while the middle value barely moves.
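
To make the answer concrete, here is a minimal sketch using a small illustrative list (these values are an assumption for demonstration and are not the notebook's dataset). It compares the mean and median before and after appending one extreme value.

```python
import numpy as np

# Illustrative values only (not the notebook's dataset)
values = [2, 3, 3, 4, 5]
with_outlier = values + [50]  # append a single extreme value

mean_before, median_before = np.mean(values), np.median(values)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print("Mean:  ", mean_before, "->", round(mean_after, 2))
print("Median:", median_before, "->", median_after)
```

Here the mean moves from 3.4 to roughly 11.17, while the median only moves from 3 to 3.5.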

## 2. Spread
Spread tells us how variable the data are.
* **Range**: max - min
* **Variance**: average squared deviation from the mean (the sample version, used with `ddof=1`, divides by n - 1 instead of n)
* **Standard Deviation**: square root of variance (in original units)
* **Interquartile Range (IQR)**: width of the middle 50% of the data (Q3 - Q1)

In [None]:
print("Range:", np.max(data) - np.min(data))
print("Variance:", np.var(data, ddof=1))
print("Standard Deviation:", np.std(data, ddof=1))
print("IQR:", stats.iqr(data))
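
To show what `ddof=1` is doing, the sketch below recomputes the variance, standard deviation, and IQR by hand on the same `data` list. The variable names are illustrative, and the quartiles assume NumPy's default linear interpolation.

```python
import numpy as np

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
x = np.asarray(data, dtype=float)
n = x.size

# Squared deviations from the mean
squared_dev = (x - x.mean()) ** 2

sample_var = squared_dev.sum() / (n - 1)   # matches np.var(x, ddof=1)
population_var = squared_dev.sum() / n     # matches np.var(x), i.e. ddof=0
sample_std = np.sqrt(sample_var)           # matches np.std(x, ddof=1)

# IQR as the gap between the 75th and 25th percentiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

print("Sample variance:    ", sample_var)
print("Population variance:", population_var)
print("Sample std:         ", sample_std)
print("IQR:                ", iqr)
```

Dividing by n - 1 gives a slightly larger value than dividing by n; the gap shrinks as the dataset grows.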

👉 **Exercise**: Add an extreme outlier (e.g., 100) to the dataset and see how the mean, median, and standard deviation change.

## 3. Shape: Skewness & Kurtosis
* **Skewness**: measures asymmetry (positive for a longer right tail, negative for a longer left tail)
* **Kurtosis**: measures how heavy the tails are (often loosely described as 'peakedness')

Most real-life datasets are not perfectly normal, so these help describe the difference.

In [None]:
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))
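
One detail worth checking when reading these numbers: `stats.kurtosis` returns *excess* kurtosis by default (`fisher=True`), which subtracts 3 so that a normal distribution scores 0. The sketch below, on the same `data` list, compares both conventions.

```python
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

skewness = stats.skew(data)
excess_kurt = stats.kurtosis(data)                  # Fisher definition (default)
pearson_kurt = stats.kurtosis(data, fisher=False)   # Pearson definition

print("Skewness:        ", skewness)
print("Excess kurtosis: ", excess_kurt)
print("Pearson kurtosis:", pearson_kurt)
print("Difference:      ", pearson_kurt - excess_kurt)
```

The skewness is positive for this dataset because the value 12 stretches the right tail, and the two kurtosis values differ by exactly 3.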

👉 **Note**: High kurtosis means heavier tails, so extreme values are more likely; low kurtosis means lighter tails and fewer extreme values.

## 4. Quick Summaries with Pandas
Instead of calling each function yourself, you can let Pandas compute the main summary statistics in one go with `.describe()`.

In [3]:
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

df = pd.DataFrame({"Values": data})
df.describe()

|       | Values   |
|-------|----------|
| count | 10.0     |
| mean  | 7.6      |
| std   | 2.221111 |
| min   | 5.0      |
| 25%   | 6.25     |
| 50%   | 7.0      |
| 75%   | 8.75     |
| max   | 12.0     |
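
Each row of the `.describe()` output corresponds to a single Series method. As a rough sketch (the `manual` name is illustrative), the summary can be rebuilt piece by piece and compared against the built-in result:

```python
import numpy as np
import pandas as pd

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
s = pd.Series(data, name="Values")

summary = s.describe()

# Rebuild the same rows one method at a time
manual = pd.Series({
    "count": s.count(),
    "mean": s.mean(),
    "std": s.std(),             # pandas defaults to ddof=1 (sample std)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.median(),
    "75%": s.quantile(0.75),
    "max": s.max(),
})

print(np.allclose(summary.values, manual.values.astype(float)))
```

Note that the `std` row is the sample standard deviation, so it agrees with `np.std(data, ddof=1)` rather than NumPy's default.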


👉 **Task**: Use `.describe()` on another dataset (e.g., `penguins` from Seaborn or your own CSV).

In [4]:
import pandas as pd

# Sample data
data = {'score': [88, 92, 79, 93, 85, 100, 67]}
df = pd.DataFrame(data)

# Get a full summary
summary = df['score'].describe()
print(summary)


count      7.000000
mean      86.285714
std       10.765907
min       67.000000
25%       82.000000
50%       88.000000
75%       92.500000
max      100.000000
Name: score, dtype: float64
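
If the default quartiles are not the cut points you care about, `.describe()` accepts a `percentiles` argument. A minimal sketch on the same scores, where the choice of the 10th and 90th percentiles is just an example:

```python
import pandas as pd

df = pd.DataFrame({"score": [88, 92, 79, 93, 85, 100, 67]})

# Request the 10th, 50th, and 90th percentiles instead of the quartiles
custom = df["score"].describe(percentiles=[0.1, 0.5, 0.9])

# Width of the middle 80% of the scores
middle_80 = custom["90%"] - custom["10%"]

print(custom)
print("Middle 80% width:", middle_80)
```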
