# 📈 Summary Statistics

Summary statistics help us describe a dataset with just a few numbers.
Instead of staring at thousands of rows, we can quickly understand **center, spread, and shape**.

This notebook covers:
- Measures of central tendency (mean, median, mode)
- Measures of spread (range, variance, standard deviation, IQR)
- Skewness and kurtosis (just enough to impress your friends)
- Quick summaries with Pandas


## 1. Central Tendency

These describe the 'middle' of the data:
- **Mean** – average
- **Median** – middle value
- **Mode** – most frequent value

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=True)[0][0])

Mean: 7.6
Median: 7.0
Mode: 7
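
To see what those library calls are doing, here is a minimal from-scratch sketch of the same three statistics on the same `data` list. The variable names (`ordered`, `mid`, `mode_count`) are just illustrative choices, and this is one simple way to compute them, not how NumPy or SciPy implement them internally.

```python
from collections import Counter

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

# Mean: sum of the values divided by how many there are
mean = sum(data) / len(data)

# Median: middle of the sorted values
# (with an even number of values, average the two middle ones)
ordered = sorted(data)
n = len(ordered)
mid = n // 2
median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the value with the highest count
mode, mode_count = Counter(data).most_common(1)[0]

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode, "(appears", mode_count, "times)")
```

Note that `Counter.most_common(1)` returns only one value, so a dataset with several equally frequent values would need extra handling.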


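👉 **Question:** Which is more robust to outliers, the mean or the median?

**Answer:** The median. A single extreme value can drag the mean a long way, while the median only depends on the middle of the sorted data.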

## 2. Spread

Spread tells us how variable the data are.
- **Range**: max – min
- **Variance**: average squared deviation from the mean (with `ddof=1`, the sample variance, the sum is divided by n – 1 instead of n)
- **Standard Deviation**: square root of variance (in original units)
- **Interquartile Range (IQR)**: width of the middle 50% of the data (Q3 – Q1)

In [5]:
print("Range:", np.max(data) - np.min(data))
print("Variance:", np.var(data, ddof=1))
print("Standard Deviation:", np.std(data, ddof=1))
print("IQR:", stats.iqr(data))

Range: 7
Variance: 4.933333333333334
Standard Deviation: 2.2211108331943574
IQR: 2.5
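
The `ddof=1` argument above is easy to gloss over, so here is a small sketch of what it changes, plus the IQR built from its two quartiles. It assumes the same `data` list as before; `population_var` and `sample_var` are names made up for this example.

```python
import numpy as np
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
n = len(data)
mean = np.mean(data)

# Squared distance of every value from the mean
squared_devs = [(x - mean) ** 2 for x in data]

# ddof=0 divides by n (population variance, NumPy's default)
population_var = sum(squared_devs) / n
# ddof=1 divides by n - 1 (sample variance, used in the cell above)
sample_var = sum(squared_devs) / (n - 1)

# IQR is the distance between the 75th and 25th percentiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print("Population variance:", population_var, "vs np.var:", np.var(data))
print("Sample variance:", sample_var, "vs np.var(ddof=1):", np.var(data, ddof=1))
print("Q1:", q1, "Q3:", q3, "IQR:", iqr, "vs stats.iqr:", stats.iqr(data))
```

For a small dataset like this one the two variances differ noticeably; the gap shrinks as n grows.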


👉 **Exercise:** Add an extreme outlier (e.g., 100) to the dataset and see how the mean, median, and standard deviation change.

In [8]:
import numpy as np
from scipy import stats

# Example dataset
data = np.array([5, 7, 8, 6, 9, 10, 7])

# Original statistics
print("Original Data:")
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data, ddof=1))

# Add an extreme outlier
data_with_outlier = np.append(data, 100)

# Statistics with outlier
print("\nData with Outlier (100):")
print("Mean:", np.mean(data_with_outlier))
print("Median:", np.median(data_with_outlier))
print("Standard Deviation:", np.std(data_with_outlier, ddof=1))


Original Data:
Mean: 7.428571428571429
Median: 7.0
Standard Deviation: 1.7182493859684491

Data with Outlier (100):
Mean: 19.0
Median: 7.5
Standard Deviation: 32.76757979641288
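
As a follow-up to the exercise, the sketch below runs the same comparison for the IQR, reusing the exercise's seven-value dataset. The names `iqr_original` and `iqr_outlier` are only for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([5, 7, 8, 6, 9, 10, 7])
data_with_outlier = np.append(data, 100)

# Standard deviation uses every value, so the outlier dominates it
std_original = np.std(data, ddof=1)
std_outlier = np.std(data_with_outlier, ddof=1)

# IQR only looks at the middle 50% of the data, so it barely moves
iqr_original = stats.iqr(data)
iqr_outlier = stats.iqr(data_with_outlier)

print("Std:", round(std_original, 2), "->", round(std_outlier, 2))
print("IQR:", iqr_original, "->", iqr_outlier)
```

Like the median, the IQR is built from positions in the sorted data, which is why a single extreme value changes it so little.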


## 3. Shape: Skewness & Kurtosis

- **Skewness**: measures asymmetry (left/right tail)
- **Kurtosis**: measures how heavy the tails are (often loosely described as 'peakedness')

Most real-life datasets are *not* perfectly normal, so these help describe the difference.

In [9]:
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))

Skewness: 0.13035868695491676
Kurtosis: -1.0158688865764833


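👉 **Note:** `stats.kurtosis` reports *excess* kurtosis by default, so a normal distribution scores 0. Positive values mean heavier tails (more extreme outliers); negative values, like the one above, mean lighter tails than a normal distribution.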
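
To make these two numbers less of a black box, here is a sketch that rebuilds them from central moments (averages of powers of the deviations from the mean). It assumes the seven-value exercise dataset and SciPy's default settings (`bias=True`, and `fisher=True` for kurtosis); `m2`, `m3`, and `m4` are names chosen for this example.

```python
import numpy as np
from scipy import stats

data = np.array([5, 7, 8, 6, 9, 10, 7])
deviations = data - data.mean()

# Central moments: the average of the deviations raised to a power
m2 = np.mean(deviations ** 2)  # population variance
m3 = np.mean(deviations ** 3)  # sign shows which tail is longer
m4 = np.mean(deviations ** 4)  # dominated by values far from the mean

# Dividing by powers of m2 makes both measures unit-free
skewness = m3 / m2 ** 1.5
# Subtracting 3 gives "excess" kurtosis, so a normal distribution scores 0
excess_kurtosis = m4 / m2 ** 2 - 3

print("Skewness:", skewness, "vs stats.skew:", stats.skew(data))
print("Excess kurtosis:", excess_kurtosis, "vs stats.kurtosis:", stats.kurtosis(data))
```

Passing `fisher=False` to `stats.kurtosis` skips the subtraction of 3, in which case a normal distribution scores 3 instead of 0.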

## 4. Quick Summaries with Pandas

Instead of calling a separate function for each statistic, let Pandas compute them all at once with `.describe()`.

In [5]:
import pandas as pd

# `data` must exist before this cell runs; re-create the exercise dataset
# so the cell also works in a fresh kernel
data = [5, 7, 8, 6, 9, 10, 7]
df = pd.DataFrame({"Values": data})
df.describe()

In [8]:
import pandas as pd

df = pd.read_csv('/content/penguins.csv')
display(df.head())

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


In [9]:
display(df.describe())

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |
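
Two details of this output are worth a closer look: `count` only counts non-missing values, and text columns such as `species` are left out by default. The sketch below shows both on a hypothetical mini-table (`df_small`) built from the five rows printed by `df.head()`, so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table: three columns copied from the df.head() rows above
df_small = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
})

summary = df_small.describe()

# The table has 5 rows, but one of them has no measurements
print("Rows in the table:", len(df_small))
print("Missing values per column:")
print(df_small.isna().sum())

# count skips NaN, and the text column 'species' is not summarized
print(summary.loc["count"])
print("Columns summarized:", list(summary.columns))
```

If the text columns matter too, `df_small.describe(include="all")` adds them, reporting the number of unique values and the most frequent one instead of a mean.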


---
✅ That's it for summary stats! Next up → [Matplotlib Basics](03-Matplotlib_Basics.ipynb)