# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 04 · Notebook 01 – Statistical Fundamentals
**Instructor:** Amir Charkhi  |  **Goal:** Master core statistical concepts for data-driven decisions.

> Format: short theory → quick practice → build understanding → mini-challenges.


---
## Learning Objectives
- Understand measures of central tendency and spread
- Master probability distributions with real data
- Calculate confidence intervals
- Prepare for hypothesis testing

## 1. Central Tendency: Beyond the Average
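The mean, median, and mode each summarize a "typical" value, but they respond differently to outliers. The cells below compare the mean and median on skewed data; the mode appears in mini-challenge M1.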

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for nice plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 4)

In [None]:
# Income data with outliers (realistic scenario)
incomes = [45000, 52000, 48000, 51000, 47000, 53000, 49000, 250000]  # CEO salary!

print(f"Mean income: ${np.mean(incomes):,.2f}")
print(f"Median income: ${np.median(incomes):,.2f}")

In [None]:
# Which is more representative?
print("Without outlier:")
normal_incomes = incomes[:-1]  # Remove CEO
print(f"Mean: ${np.mean(normal_incomes):,.2f}")
print(f"Median: ${np.median(normal_incomes):,.2f}")
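
The cells above cover the mean and median; the mode (the most frequent value) can be sketched the same way. The example below uses `statistics.multimode` from the standard library. The `devices` list is hypothetical data made up for illustration; `incomes` is the list from the cell above.

```python
from statistics import multimode

# Hypothetical survey answers: number of devices owned per customer
devices = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1]
modes = multimode(devices)  # list of every value tied for most frequent
print(f"Mode(s) of devices: {modes}")

# Income data from the cell above: every value occurs exactly once
incomes = [45000, 52000, 48000, 51000, 47000, 53000, 49000, 250000]
income_modes = multimode(incomes)
print(f"Number of tied modes in incomes: {len(income_modes)}")
```

When every value is unique, as in `incomes`, all values tie for the mode, so it says little about the data. The mode is most useful for discrete or categorical data with repeated values.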

**Exercise 1 – Website Load Times (easy)**  
Calculate the mean and median of the load times. Which metric should you report to stakeholders?


In [None]:
# Your turn
# load_times = [0.8, 1.2, 0.9, 1.1, 0.7, 0.8, 15.3, 0.9, 1.0]  # seconds


<details>
<summary><b>Solution</b></summary>

```python
load_times = [0.8, 1.2, 0.9, 1.1, 0.7, 0.8, 15.3, 0.9, 1.0]  # seconds
print(f"Mean load time: {np.mean(load_times):.2f}s")
print(f"Median load time: {np.median(load_times):.2f}s")
print("\nRecommendation: Use median for reporting as it's robust to outliers.")
print("The 15.3s outlier (timeout?) skews the mean but not the median.")
```
</details>

## 2. Measuring Spread: Variance & Standard Deviation
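Two datasets can share the same mean yet differ widely in how far their values fall from it. Variance and standard deviation quantify that spread.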

In [None]:
# Two teams with same average performance
team_a_scores = [75, 76, 74, 75, 76, 74, 75, 75]  # Consistent
team_b_scores = [60, 90, 65, 85, 70, 80, 75, 75]  # Variable

print(f"Team A - Mean: {np.mean(team_a_scores):.1f}, Std: {np.std(team_a_scores):.1f}")
print(f"Team B - Mean: {np.mean(team_b_scores):.1f}, Std: {np.std(team_b_scores):.1f}")
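
One detail worth knowing: `np.std` defaults to `ddof=0`, which divides by n (the population formula). When the data is a sample drawn from a larger population, `ddof=1` (dividing by n - 1) is the usual choice. A minimal sketch of the difference, reusing the Team B scores from above:

```python
import numpy as np

team_b_scores = [60, 90, 65, 85, 70, 80, 75, 75]

pop_std = np.std(team_b_scores)             # ddof=0: divide by n
sample_std = np.std(team_b_scores, ddof=1)  # ddof=1: divide by n - 1

print(f"Population std (ddof=0): {pop_std:.2f}")
print(f"Sample std (ddof=1): {sample_std:.2f}")
```

This prints 9.35 and 10.00. The gap shrinks as the sample grows, but with only 8 values it is noticeable.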

In [None]:
# Visualize the difference
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(team_a_scores, bins=10, alpha=0.7, color='blue', edgecolor='black')
ax1.set_title('Team A (Consistent)')
ax1.set_xlabel('Score')

ax2.hist(team_b_scores, bins=10, alpha=0.7, color='red', edgecolor='black')
ax2.set_title('Team B (Variable)')
ax2.set_xlabel('Score')

plt.tight_layout()
plt.show()

**Exercise 2 – Customer Wait Times (medium)**  
Compare variability between two service centers. Which is more predictable?


In [None]:
# Your turn
# center_1 = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]  # minutes
# center_2 = [2, 12, 3, 11, 4, 10, 3, 9, 4, 8]  # minutes


<details>
<summary><b>Solution</b></summary>

```python
center_1 = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]  # minutes
center_2 = [2, 12, 3, 11, 4, 10, 3, 9, 4, 8]  # minutes

print("Service Center Analysis:")
print(f"Center 1 - Mean: {np.mean(center_1):.1f} min, Std: {np.std(center_1):.2f} min")
print(f"Center 2 - Mean: {np.mean(center_2):.1f} min, Std: {np.std(center_2):.2f} min")
print("\nCoefficient of Variation (CV):")
print(f"Center 1 CV: {np.std(center_1)/np.mean(center_1)*100:.1f}%")
print(f"Center 2 CV: {np.std(center_2)/np.mean(center_2)*100:.1f}%")
print("\nCenter 1 is more predictable (lower variation)")
```
</details>

## 3. Normal Distribution & Z-Scores

In [None]:
# Generate normal distribution data
np.random.seed(42)
test_scores = np.random.normal(75, 10, 1000)  # mean=75, std=10

# Summary statistics and the ±1 SD range
print(f"Mean: {test_scores.mean():.1f}")
print(f"Std: {test_scores.std():.1f}")
print(f"\nAbout 68% of scores fall between {test_scores.mean()-test_scores.std():.1f} and {test_scores.mean()+test_scores.std():.1f}")

In [None]:
# Z-score: How many standard deviations from mean?
student_score = 92
z_score = (student_score - test_scores.mean()) / test_scores.std()

print(f"Student score: {student_score}")
print(f"Z-score: {z_score:.2f}")
print(f"This score is {z_score:.2f} standard deviations above the mean")
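
A z-score can also be read as a percentile under the normal model. The sketch below uses `statistics.NormalDist` from the standard library and assumes the population parameters used to generate the data (mean 75, std 10) rather than the sample estimates, so its z-score differs slightly from the one printed above.

```python
from statistics import NormalDist

# Assumed population parameters (the ones passed to np.random.normal above)
scores = NormalDist(mu=75, sigma=10)

student_score = 92
z = (student_score - scores.mean) / scores.stdev  # standard deviations above the mean
percentile = scores.cdf(student_score)            # share of scores at or below 92

print(f"Z-score: {z:.2f}")
print(f"Percentile: {percentile:.1%}")
```

Under these assumptions a score of 92 has a z-score of 1.70 and falls at roughly the 95.5th percentile.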

In [None]:
# Visualize with the 68-95-99.7 rule
plt.figure(figsize=(10, 5))
plt.hist(test_scores, bins=30, density=True, alpha=0.7, color='skyblue', edgecolor='black')

# Add normal curve
from scipy import stats
x = np.linspace(test_scores.min(), test_scores.max(), 100)
plt.plot(x, stats.norm.pdf(x, test_scores.mean(), test_scores.std()), 'r-', linewidth=2)

# Mark standard deviations
mean = test_scores.mean()
std = test_scores.std()
plt.axvline(mean, color='red', linestyle='--', label='Mean')
plt.axvline(mean-std, color='green', linestyle='--', label='±1 SD (68%)')
plt.axvline(mean+std, color='green', linestyle='--')

plt.xlabel('Test Score')
plt.ylabel('Density')
plt.title('Normal Distribution of Test Scores')
plt.legend()
plt.show()
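
The 68-95-99.7 rule can also be checked numerically rather than by eye. A minimal sketch that regenerates the same simulated scores and counts the share within 1, 2, and 3 standard deviations of the mean:

```python
import numpy as np

np.random.seed(42)
test_scores = np.random.normal(75, 10, 1000)  # same simulated data as above

mean, std = test_scores.mean(), test_scores.std()
shares = {}
for k in (1, 2, 3):
    within = np.abs(test_scores - mean) <= k * std  # boolean mask per score
    shares[k] = within.mean()                       # fraction of True values
    print(f"Within ±{k} SD: {shares[k]:.1%}")
```

With a finite sample the shares will be close to, but not exactly, 68%, 95%, and 99.7%.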

## 4. Confidence Intervals
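A confidence interval gives a range of plausible values for a population parameter, estimated from a sample. How confident are we in our estimates?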

In [None]:
# Sample of customer satisfaction scores (1-10)
np.random.seed(42)
sample_size = 30
satisfaction = np.random.normal(7.5, 1.2, sample_size)

# Calculate 95% confidence interval
sample_mean = satisfaction.mean()
sample_std = satisfaction.std(ddof=1)  # ddof=1 for sample std
standard_error = sample_std / np.sqrt(sample_size)

# 1.96 is the normal (z) critical value for a 95% CI; with n=30 the t critical value (about 2.05) gives a slightly wider interval
margin_of_error = 1.96 * standard_error
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"95% Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
print("\nInterpretation: We're 95% confident the true population mean")
print(f"satisfaction score is between {ci_lower:.2f} and {ci_upper:.2f}")
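
For a sample of 30 with an estimated standard deviation, the t distribution is the more common choice than the 1.96 approximation. The sketch below is one way to compute it, using `scipy.stats.t.ppf` with n - 1 degrees of freedom on the same simulated data:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
sample_size = 30
satisfaction = np.random.normal(7.5, 1.2, sample_size)  # same sample as above

sample_mean = satisfaction.mean()
standard_error = satisfaction.std(ddof=1) / np.sqrt(sample_size)

# Two-sided 95% interval: 2.5% in each tail, n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=sample_size - 1)
margin_t = t_crit * standard_error
margin_z = 1.96 * standard_error

print(f"t critical value: {t_crit:.3f}")
print(f"95% CI (t): [{sample_mean - margin_t:.2f}, {sample_mean + margin_t:.2f}]")
print(f"95% CI (z): [{sample_mean - margin_z:.2f}, {sample_mean + margin_z:.2f}]")
```

The t-based interval is slightly wider, reflecting the extra uncertainty from estimating the standard deviation from a small sample.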

**Exercise 3 – Conversion Rate CI (medium)**  
Calculate a 95% confidence interval for the website's conversion rate.


In [None]:
# Your turn
# visitors = 1000
# conversions = 47
# Calculate 95% CI for conversion rate


<details>
<summary><b>Solution</b></summary>

```python
visitors = 1000
conversions = 47
conversion_rate = conversions / visitors

# For a proportion p, the standard error is sqrt(p * (1 - p) / n)
import math
se = math.sqrt((conversion_rate * (1 - conversion_rate)) / visitors)
margin = 1.96 * se

ci_lower = conversion_rate - margin
ci_upper = conversion_rate + margin

print(f"Conversion rate: {conversion_rate:.1%}")
print(f"95% CI: [{ci_lower:.1%}, {ci_upper:.1%}]")
print(f"\nWe can be 95% confident the true conversion rate")
print(f"is between {ci_lower:.1%} and {ci_upper:.1%}")
```
</details>
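
The proportion formula in the solution relies on a normal approximation, which is only reasonable when there are enough conversions and non-conversions. The sketch below wraps the same calculation in a hypothetical helper, `proportion_ci`, and adds a check. The threshold of 10 is a common rule of thumb and an assumption here; textbooks vary.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    # Rule-of-thumb check (assumed threshold of 10; textbooks vary)
    if successes < 10 or n - successes < 10:
        raise ValueError("too few successes or failures for the normal approximation")
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    return p - z * se, p + z * se

lower, upper = proportion_ci(47, 1000)
print(f"95% CI: [{lower:.1%}, {upper:.1%}]")
```

With 47 conversions out of 1000 visitors the check passes and the interval matches the solution above.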

## 5. Correlation vs Causation

In [None]:
# Create correlated data
np.random.seed(42)
hours_studied = np.random.uniform(1, 10, 50)
exam_scores = 60 + 3 * hours_studied + np.random.normal(0, 5, 50)

# Calculate correlation
correlation = np.corrcoef(hours_studied, exam_scores)[0, 1]
print(f"Correlation coefficient: {correlation:.3f}")
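
To see what `np.corrcoef` computes, the Pearson coefficient can be written out by hand: the sum of the products of deviations from each mean, divided by the square root of the product of the summed squared deviations. A minimal sketch on the same simulated data:

```python
import numpy as np

np.random.seed(42)
hours_studied = np.random.uniform(1, 10, 50)
exam_scores = 60 + 3 * hours_studied + np.random.normal(0, 5, 50)

# Deviations of each variable from its own mean
dx = hours_studied - hours_studied.mean()
dy = exam_scores - exam_scores.mean()

# Pearson r: co-variation scaled by the spread of each variable
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_numpy = np.corrcoef(hours_studied, exam_scores)[0, 1]

print(f"Manual r: {r_manual:.3f}")
print(f"NumPy r:  {r_numpy:.3f}")
```

Both values agree, and because r is scaled by the spread of each variable it always lies between -1 and 1.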

In [None]:
# Visualize relationship
plt.figure(figsize=(8, 5))
plt.scatter(hours_studied, exam_scores, alpha=0.6)
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title(f'Study Hours vs Exam Scores (r = {correlation:.3f})')

# Add trend line
z = np.polyfit(hours_studied, exam_scores, 1)
p = np.poly1d(z)
plt.plot(hours_studied, p(hours_studied), "r--", alpha=0.8)
plt.show()

In [None]:
# Spurious correlation example
years = np.arange(2010, 2020)
ice_cream_sales = [100, 105, 112, 118, 125, 132, 140, 148, 156, 165]
shark_attacks = [70, 72, 78, 82, 87, 91, 96, 102, 108, 114]

correlation_spurious = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
print(f"Ice cream sales vs Shark attacks correlation: {correlation_spurious:.3f}")
print("\n⚠️ High correlation doesn't mean causation!")
print("Both are likely caused by warmer weather/more beach visits.")
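
The confounder explanation can be illustrated with a small simulation. Everything below is hypothetical: the temperature range, the coefficients, and the noise levels are made up so that temperature drives both variables and neither affects the other. Removing the linear effect of temperature from each variable (a simple partial correlation) should then leave almost no correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical model: temperature drives both outcomes independently
temperature = rng.uniform(15, 35, n)
ice_cream = 5 * temperature + rng.normal(0, 10, n)
sharks = 2 * temperature + rng.normal(0, 5, n)

raw_r = np.corrcoef(ice_cream, sharks)[0, 1]

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what is left after accounting for temperature
partial_r = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(sharks, temperature))[0, 1]

print(f"Raw correlation: {raw_r:.3f}")
print(f"After controlling for temperature: {partial_r:.3f}")
```

The raw correlation is high while the partial correlation is near zero, which is the signature of a shared cause rather than a direct link.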

**Exercise 4 – Correlation Analysis (hard)**  
Analyze the relationship between marketing spend and sales. Is it causal?


In [None]:
# Your turn
# marketing = [5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000]
# sales = [50000, 65000, 72000, 88000, 95000, 103000, 115000, 125000]


<details>
<summary><b>Solution</b></summary>

```python
marketing = np.array([5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000])
sales = np.array([50000, 65000, 72000, 88000, 95000, 103000, 115000, 125000])

# Calculate correlation
corr = np.corrcoef(marketing, sales)[0, 1]
print(f"Correlation: {corr:.3f}")

# Calculate a naive ROI (attributes all sales to marketing spend)
roi = (sales - marketing) / marketing
print(f"\nAverage ROI: {roi.mean():.1f}x")

# Plot
plt.figure(figsize=(8, 5))
plt.scatter(marketing/1000, sales/1000)
plt.xlabel('Marketing Spend ($1000s)')
plt.ylabel('Sales ($1000s)')
plt.title(f'Marketing vs Sales (r={corr:.3f})')

# Fit line
z = np.polyfit(marketing, sales, 1)
p = np.poly1d(z)
plt.plot(marketing/1000, p(marketing)/1000, "r--")
plt.show()

print("\n📊 Strong positive correlation suggests a relationship")
print("But a controlled experiment (A/B test) is needed to establish causation!")
```
</details>

## 6. Mini-Challenges
- **M1 (easy):** Calculate the mean, median, and mode of daily website visitors over 2 weeks
- **M2 (medium):** Find outliers in customer purchase amounts using the IQR method
- **M3 (hard):** Compute a bootstrap confidence interval for the median (not the mean)

In [None]:
# Your turn - try the challenges!


<details>
<summary><b>Solutions</b></summary>

```python
# M1 - Website visitors
visitors = [1250, 1180, 1320, 1250, 1400, 1550, 1600, 
            1200, 1250, 1350, 1250, 1480, 1520, 1580]
print(f"Mean: {np.mean(visitors):.0f} visitors/day")
print(f"Median: {np.median(visitors):.0f} visitors/day")
from scipy import stats
print(f"Mode: {stats.mode(visitors).mode} visitors/day")

# M2 - Outlier detection with IQR
purchases = [45, 52, 48, 67, 72, 58, 210, 65, 71, 53, 49, 68]
Q1 = np.percentile(purchases, 25)
Q3 = np.percentile(purchases, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = [x for x in purchases if x < lower_bound or x > upper_bound]
print(f"Outliers: {outliers}")

# M3 - Bootstrap CI for median
np.random.seed(42)  # fix the seed so the resampling is reproducible
data = np.random.normal(100, 15, 50)
n_bootstrap = 1000
bootstrap_medians = []
for _ in range(n_bootstrap):
    sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_medians.append(np.median(sample))

ci_lower = np.percentile(bootstrap_medians, 2.5)
ci_upper = np.percentile(bootstrap_medians, 97.5)
print(f"Bootstrap 95% CI for median: [{ci_lower:.2f}, {ci_upper:.2f}]")
```
</details>

## Wrap-Up & Next Steps
✅ You understand central tendency and when to use mean vs median  
✅ You can measure and interpret data spread  
✅ You know how to calculate confidence intervals  
✅ You understand correlation ≠ causation  

**Next:** Hypothesis Testing - Making data-driven decisions with statistical significance!
