# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 04 · Notebook 01 — Introduction to Statistics
**Instructor:** Amir Charkhi  |  **Goal:** Build a foundation in statistical thinking for AI & data science.

> Format: short theory → quick practice → build understanding → mini-challenges.


---
## Learning Objectives
- Distinguish between descriptive and inferential statistics
- Set up a Python environment for statistical analysis
- Apply basic statistical concepts to real data
- Understand statistics' role in AI and machine learning

## 1. What is Statistics?
Statistics is the science of **collecting, organizing, analyzing, and interpreting** data to make informed decisions.

In [None]:
# Essential libraries for statistical analysis
import numpy as np              # Numerical operations
import pandas as pd             # Data manipulation
import matplotlib.pyplot as plt # Basic plotting
import seaborn as sns          # Statistical visualization
from scipy import stats        # Statistical functions

# Set style for nice plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 4)
%matplotlib inline

In [None]:
# Real-world example: Daily app downloads
app_downloads = [1250, 1180, 1320, 1450, 1380, 1100, 980]

print("📱 Daily App Downloads (Week 1):")
print(app_downloads)
print("\nQuick insights:")
print(f"Average: {np.mean(app_downloads):.0f} downloads/day")
print(f"Best day: {max(app_downloads)} downloads")
print(f"Worst day: {min(app_downloads)} downloads")

## 2. Two Types of Statistics

### 2.1 Descriptive Statistics
**What happened?** - Summarizes and describes data you already have.

In [None]:
# Customer satisfaction scores (1-10 scale)
satisfaction = [8, 9, 7, 8, 9, 6, 8, 9, 7, 8]

print("📊 Customer Satisfaction Analysis:")
print(f"Average rating: {np.mean(satisfaction):.1f}/10")
mode_result = stats.mode(satisfaction, keepdims=True)  # ensures result is always array
print(f"Most common rating: {mode_result.mode[0]}/10")
print(f"Range: {min(satisfaction)} to {max(satisfaction)}")
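
The same kind of summary can be produced in one call with pandas. This is a minimal sketch, assuming the scores are loaded into a `Series` (the names `ratings` and `summary` are introduced here purely for illustration):

```python
import pandas as pd

# Same customer satisfaction scores as in the cell above (1-10 scale)
satisfaction = [8, 9, 7, 8, 9, 6, 8, 9, 7, 8]
ratings = pd.Series(satisfaction, name="satisfaction")

# describe() reports count, mean, sample std (ddof=1), min, quartiles, and max
summary = ratings.describe()
print(summary)

# value_counts() lists how often each rating occurs, most frequent first
rating_counts = ratings.value_counts()
print(rating_counts)
```

Note that `describe()` uses the sample standard deviation, so its `std` will differ slightly from `np.std` with default arguments.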

### 2.2 Inferential Statistics
**What can we conclude beyond the data?** - Uses a sample to estimate or predict properties of a larger population, with quantified uncertainty.

In [None]:
# Survey sample: 100 customers out of 10,000 total
np.random.seed(42)  # fixed seed so the simulated sample is reproducible
sample_satisfaction = np.random.normal(7.8, 1.2, 100)  # Simulated sample data

# Estimate population parameters
sample_mean = np.mean(sample_satisfaction)
# ddof=1 gives the sample standard deviation; 1.96 is the normal critical value for 95%
margin_error = 1.96 * (np.std(sample_satisfaction, ddof=1) / np.sqrt(100))

print("🔮 Population Estimate (from 100-person sample):")
print(f"Estimated population satisfaction: {sample_mean:.1f}")
print(f"95% confidence interval: {sample_mean:.1f} ± {margin_error:.1f}")
print("This estimates the average across all 10,000 customers.")
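
When the population standard deviation is unknown, a t-based interval is a common alternative to the fixed 1.96 multiplier. The sketch below shows one way to compute it, assuming simulated survey data; the seed, the `default_rng` generator, and the variable names are illustrative choices:

```python
import numpy as np
from scipy import stats

# Simulated survey of 100 customers (assumed parameters: mean 7.8, sd 1.2)
rng = np.random.default_rng(42)
sample = rng.normal(7.8, 1.2, 100)

n = len(sample)
sample_mean = sample.mean()
std_error = sample.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample standard deviation

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low = sample_mean - t_crit * std_error
ci_high = sample_mean + t_crit * std_error

print(f"t critical value: {t_crit:.3f} (vs 1.96 for the normal approximation)")
print(f"95% CI for the population mean: [{ci_low:.2f}, {ci_high:.2f}]")
```

For a sample of 100 the two multipliers are close, so the interval barely changes; the difference matters more for small samples.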

**Exercise 1 — Statistics Type Recognition (easy)**  
Classify each statement as descriptive or inferential:


In [None]:
# Your turn
statements = [
    "75% of our customers rated us 8+ last month",
    "Based on this test, the new algorithm will improve accuracy by 5%", 
    "Average response time was 2.3 seconds yesterday",
    "This sample suggests 60% of voters support the candidate"
]

# Classify each as 'descriptive' or 'inferential'


<details>
<summary><b>Solution</b></summary>

```python
classifications = [
    "Descriptive - summarizes past data",
    "Inferential - predicts future performance", 
    "Descriptive - reports what happened",
    "Inferential - estimates population from sample"
]

for i, (stmt, classification) in enumerate(zip(statements, classifications)):
    print(f"{i+1}. {stmt}")
    print(f"   → {classification}\n")
```
</details>

## 3. Statistics in AI & Data Science

### Why Statistics Matters for AI:
1. **Model Performance** - Accuracy, precision, recall
2. **Data Quality** - Outliers, missing values, distributions
3. **Uncertainty** - Confidence in predictions
4. **A/B Testing** - Which model performs better?
5. **Feature Selection** - Which variables matter?
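
The first item in the list can be made concrete with a small calculation. The labels below are hypothetical and exist only to show how accuracy, precision, and recall are derived from the four confusion-matrix counts:

```python
import numpy as np

# Hypothetical labels for a binary classifier (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Confusion-matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that were found

print(f"Accuracy: {accuracy:.1%}, Precision: {precision:.1%}, Recall: {recall:.1%}")
```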

In [None]:
# AI Model Performance Example
model_accuracies = [0.87, 0.89, 0.85, 0.91, 0.88, 0.90, 0.86]

print("🤖 AI Model Performance Analysis:")
print(f"Average accuracy: {np.mean(model_accuracies):.1%}")
print(f"Standard deviation: {np.std(model_accuracies):.1%}")
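# Note: the 0.02 cutoff is illustrative, and this sample's std is almost exactly 0.02,
# so the High/Low label here is sensitive to floating-point rounding
print(f"Model consistency: {'High' if np.std(model_accuracies) < 0.02 else 'Low'}")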

In [None]:
# Visualize model performance
plt.figure(figsize=(8, 4))
plt.plot(range(1, 8), model_accuracies, 'bo-', linewidth=2, markersize=8)
plt.axhline(np.mean(model_accuracies), color='red', linestyle='--', 
           label=f'Average: {np.mean(model_accuracies):.1%}')
plt.xlabel('Test Run')
plt.ylabel('Accuracy')
plt.title('Model Performance Across Multiple Runs')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 4. First Statistical Analysis

In [None]:
# E-commerce website data
daily_visitors = [1250, 1180, 1320, 1450, 1380, 1100, 980, 
                 1200, 1290, 1400, 1520, 1350, 1150, 1050]

# Calculate key statistics
mean_visitors = np.mean(daily_visitors)
median_visitors = np.median(daily_visitors)
std_visitors = np.std(daily_visitors)

print("🌐 Website Analytics (2 weeks):")
print(f"Average daily visitors: {mean_visitors:.0f}")
print(f"Median daily visitors: {median_visitors:.0f}")
print(f"Variability (std dev): {std_visitors:.0f}")
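
`np.std` divides by `n` by default (population standard deviation). If the 14 days are treated as a sample from a longer period, dividing by `n - 1` is the usual choice. A short sketch of the difference, reusing the visitor data (the names `pop_std` and `sample_std` are illustrative):

```python
import numpy as np

daily_visitors = [1250, 1180, 1320, 1450, 1380, 1100, 980,
                  1200, 1290, 1400, 1520, 1350, 1150, 1050]

pop_std = np.std(daily_visitors)             # ddof=0: divides by n
sample_std = np.std(daily_visitors, ddof=1)  # ddof=1: divides by n - 1

print(f"Population std (ddof=0): {pop_std:.1f}")
print(f"Sample std (ddof=1):     {sample_std:.1f}")
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.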

In [None]:
# Create comprehensive visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Time series plot
ax1.plot(range(1, 15), daily_visitors, 'o-', linewidth=2, markersize=6)
ax1.axhline(mean_visitors, color='red', linestyle='--', label='Mean')
ax1.set_xlabel('Day')
ax1.set_ylabel('Visitors')
ax1.set_title('Daily Visitors Over Time')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Distribution histogram
ax2.hist(daily_visitors, bins=8, alpha=0.7, color='skyblue', edgecolor='black')
ax2.axvline(mean_visitors, color='red', linestyle='--', label=f'Mean: {mean_visitors:.0f}')
ax2.axvline(median_visitors, color='green', linestyle='--', label=f'Median: {median_visitors:.0f}')
ax2.set_xlabel('Daily Visitors')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of Daily Visitors')
ax2.legend()

plt.tight_layout()
plt.show()

**Exercise 2 — Website Performance Analysis (medium)**  
Analyze conversion rates and identify trends.


In [None]:
# Your turn
# Daily conversions data
conversions = [48, 52, 45, 61, 58, 42, 38, 51, 49, 67, 72, 55, 46, 43]
visitors = daily_visitors  # Use same visitor data

# Calculate conversion rates and analyze
# conversion_rates = [conversions[i]/visitors[i] for i in range(len(conversions))]


<details>
<summary><b>Solution</b></summary>

```python
# Calculate conversion rates
conversion_rates = [conversions[i]/visitors[i] for i in range(len(conversions))]

print("🎯 Conversion Rate Analysis:")
print(f"Average conversion rate: {np.mean(conversion_rates):.1%}")
print(f"Best day: {max(conversion_rates):.1%}")
print(f"Worst day: {min(conversion_rates):.1%}")
print(f"Rate variability: {np.std(conversion_rates):.3f}")

# Find correlation between visitors and conversion rate
# (correlation describes a linear association; it does not show cause and effect)
correlation = np.corrcoef(visitors, conversion_rates)[0,1]
print(f"\nVisitors vs Conv. Rate correlation: {correlation:.3f}")
if correlation < -0.3:
    print("⚠️ Higher-traffic days tend to have lower conversion rates")
elif correlation > 0.3:
    print("✅ Higher-traffic days tend to have higher conversion rates")
else:
    print("➡️ No clear linear relationship between traffic and conversion rate")

# Plot conversion rates
plt.figure(figsize=(10, 4))
plt.plot(range(1, 15), [r*100 for r in conversion_rates], 'o-', linewidth=2)
plt.axhline(np.mean(conversion_rates)*100, color='red', linestyle='--', 
           label=f'Average: {np.mean(conversion_rates):.1%}')
plt.xlabel('Day')
plt.ylabel('Conversion Rate (%)')
plt.title('Daily Conversion Rates')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```
</details>
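
With only 14 days of data, a correlation coefficient on its own says little about reliability. One option is `scipy.stats.pearsonr`, which returns the same coefficient together with a p-value. This is a sketch under the assumption that the daily observations are independent; the names `rates`, `r`, and `p_value` are illustrative:

```python
import numpy as np
from scipy import stats

visitors = [1250, 1180, 1320, 1450, 1380, 1100, 980,
            1200, 1290, 1400, 1520, 1350, 1150, 1050]
conversions = [48, 52, 45, 61, 58, 42, 38, 51, 49, 67, 72, 55, 46, 43]

# Element-wise conversion rate per day
rates = np.array(conversions) / np.array(visitors)

# Pearson correlation coefficient and two-sided p-value
r, p_value = stats.pearsonr(visitors, rates)

print(f"Pearson r: {r:.3f}")
print(f"p-value:   {p_value:.3f}")
```

A large p-value suggests the observed correlation could plausibly arise by chance in a sample this small.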

## 5. Understanding Data Patterns

In [None]:
# Compare different data patterns
np.random.seed(42)

# Three datasets: the first two share a mean of 100; the outlier pulls the third mean up to 107
consistent = [100, 101, 99, 100, 101, 99, 100]  # Low variance
variable = [85, 120, 95, 110, 90, 115, 85]      # High variance
outlier_data = [100, 100, 99, 101, 100, 99, 150] # Has outlier

datasets = [consistent, variable, outlier_data]
names = ['Consistent', 'Variable', 'With Outlier']

print("📈 Comparing Data Patterns:")
for name, data in zip(names, datasets):
    print(f"\n{name}:")
    print(f"  Mean: {np.mean(data):.1f}")
    print(f"  Std Dev: {np.std(data):.1f}")
    print(f"  Range: {max(data) - min(data)}")
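
Adding the median to the comparison shows how a single extreme value affects the mean but not the middle of the data. A minimal sketch using the same three lists (the dictionary `results` is introduced here for illustration):

```python
import numpy as np

datasets = {
    "Consistent":   [100, 101, 99, 100, 101, 99, 100],
    "Variable":     [85, 120, 95, 110, 90, 115, 85],
    "With Outlier": [100, 100, 99, 101, 100, 99, 150],
}

# Store (mean, median) per dataset so the two can be compared side by side
results = {}
for name, data in datasets.items():
    results[name] = (float(np.mean(data)), float(np.median(data)))
    mean, median = results[name]
    print(f"{name:<13} mean={mean:.1f}  median={median:.1f}")
```

For the outlier dataset the mean moves to 107 while the median stays at 100, which is why the median is often described as robust.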

In [None]:
# Visualize the patterns
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for i, (name, data) in enumerate(zip(names, datasets)):
    axes[i].plot(data, 'o-', linewidth=2, markersize=8)
    axes[i].axhline(np.mean(data), color='red', linestyle='--', alpha=0.7)
    axes[i].set_title(f'{name}\nMean: {np.mean(data):.1f}')
    axes[i].set_ylabel('Value')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**Exercise 3 — Data Quality Assessment (hard)**  
Detect outliers and assess data quality using statistical methods.


In [None]:
# Your turn
# Response times (in seconds) - some may be outliers
response_times = [0.8, 1.2, 0.9, 1.1, 0.7, 15.3, 0.8, 0.9, 1.0, 1.3, 
                 0.6, 1.4, 0.8, 12.7, 0.9, 1.1, 0.7, 0.8, 1.2, 0.9]

# Use IQR method to detect outliers
# Calculate Q1, Q3, IQR, and identify outliers


<details>
<summary><b>Solution</b></summary>

```python
# Calculate quartiles and IQR
Q1 = np.percentile(response_times, 25)
Q3 = np.percentile(response_times, 75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = [x for x in response_times if x < lower_bound or x > upper_bound]
clean_data = [x for x in response_times if lower_bound <= x <= upper_bound]

print("🔍 Data Quality Assessment:")
print(f"Total data points: {len(response_times)}")
print(f"Q1: {Q1:.2f}s, Q3: {Q3:.2f}s, IQR: {IQR:.2f}s")
print(f"Outlier boundaries: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"Outliers detected: {outliers}")
print(f"Clean data points: {len(clean_data)} ({len(clean_data)/len(response_times):.1%})")

print(f"\n📊 Impact of outliers:")
print(f"Mean with outliers: {np.mean(response_times):.2f}s")
print(f"Mean without outliers: {np.mean(clean_data):.2f}s")
print(f"Median (robust): {np.median(response_times):.2f}s")

# Visualize with outliers highlighted
plt.figure(figsize=(10, 4))
plt.scatter(range(len(response_times)), response_times, 
           c=['red' if x in outliers else 'blue' for x in response_times], 
           s=50, alpha=0.7)
plt.axhline(np.mean(response_times), color='green', linestyle='--', 
           label=f'Mean: {np.mean(response_times):.2f}s')
plt.axhline(np.median(response_times), color='orange', linestyle='--', 
           label=f'Median: {np.median(response_times):.2f}s')
plt.xlabel('Measurement')
plt.ylabel('Response Time (s)')
plt.title('Response Times (Red = Outliers)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```
</details>
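
For comparison, here is a sketch of the z-score rule on the same response times. The cutoff of 3 standard deviations is a common convention and is assumed here. Because the outliers themselves inflate the mean and standard deviation, the z-score rule can miss values that the IQR rule catches:

```python
import numpy as np

response_times = np.array([0.8, 1.2, 0.9, 1.1, 0.7, 15.3, 0.8, 0.9, 1.0, 1.3,
                           0.6, 1.4, 0.8, 12.7, 0.9, 1.1, 0.7, 0.8, 1.2, 0.9])

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (response_times - response_times.mean()) / response_times.std()
z_outliers = response_times[np.abs(z_scores) > 3]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(response_times, [25, 75])
iqr = q3 - q1
iqr_mask = (response_times < q1 - 1.5 * iqr) | (response_times > q3 + 1.5 * iqr)
iqr_outliers = response_times[iqr_mask]

print(f"Z-score outliers (|z| > 3): {z_outliers.tolist()}")
print(f"IQR outliers:               {iqr_outliers.tolist()}")
```

On this data the z-score rule flags only 15.3, while the IQR rule flags both 15.3 and 12.7.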

## 6. Mini-Challenges
- **M1 (easy):** Calculate summary statistics for mobile app ratings (1-5 stars)
- **M2 (medium):** Compare performance metrics between two ML models using statistical measures
- **M3 (hard):** Design an A/B test framework with proper statistical foundations

In [None]:
# Your turn - try the challenges!
# M1 data:
app_ratings = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4, 2, 5, 4, 3, 4, 5, 4, 3, 4, 5]

# M2 data:
model_a_accuracy = [0.87, 0.89, 0.85, 0.91, 0.88]
model_b_accuracy = [0.84, 0.86, 0.88, 0.89, 0.87]

# M3: Design framework (conceptual)


<details>
<summary><b>Solutions</b></summary>

```python
# M1 - App ratings analysis
print("⭐ App Ratings Analysis:")
print(f"Average rating: {np.mean(app_ratings):.1f}/5")
print(f"Median rating: {np.median(app_ratings):.1f}/5")
print(f"Most common rating: {stats.mode(app_ratings, keepdims=True).mode[0]}/5")
print(f"Rating variability: {np.std(app_ratings):.2f}")

# Rating distribution
unique, counts = np.unique(app_ratings, return_counts=True)
for rating, count in zip(unique, counts):
    percentage = count/len(app_ratings)*100
    print(f"{rating} stars: {count} ratings ({percentage:.1f}%)")

# M2 - Model comparison
print("\n🤖 Model Comparison:")
print(f"Model A: {np.mean(model_a_accuracy):.1%} ± {np.std(model_a_accuracy):.1%}")
print(f"Model B: {np.mean(model_b_accuracy):.1%} ± {np.std(model_b_accuracy):.1%}")

# Rough practical check (not a significance test): is the gap larger than 2 points?
diff_means = np.mean(model_a_accuracy) - np.mean(model_b_accuracy)
print(f"Difference: {diff_means:.1%}")
if abs(diff_means) > 0.02:
    print("📈 Gap exceeds 2 percentage points; worth a formal test")
else:
    print("📊 Gap is under 2 percentage points; may not be meaningful")

# M3 - A/B Test Framework
print("\n🧪 A/B Test Framework:")
print("1. Define hypothesis: Version B increases conversion by 10%")
print("2. Set significance level: α = 0.05 (95% confidence)")
print("3. Calculate sample size needed")
print("4. Random assignment: 50/50 split")
print("5. Collect data for minimum duration")
print("6. Statistical test: Compare conversion rates")
print("7. Decision: Implement if statistically significant")
```
</details>
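
For M2, a formal two-sample test gives a more principled answer than comparing the gap to a fixed threshold. The sketch below uses Welch's t-test from SciPy, assuming the five runs of each model are independent; with so few runs the result should be read cautiously:

```python
import numpy as np
from scipy import stats

model_a_accuracy = [0.87, 0.89, 0.85, 0.91, 0.88]
model_b_accuracy = [0.84, 0.86, 0.88, 0.89, 0.87]

# Welch's t-test: compares two means without assuming equal variances
result = stats.ttest_ind(model_a_accuracy, model_b_accuracy, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # significance level
print(f"t statistic: {t_stat:.2f}")
print(f"p-value:     {p_value:.3f}")
if p_value < alpha:
    print("Difference is statistically significant at the 5% level")
else:
    print("No statistically significant difference at the 5% level")
```

Here the p-value is well above 0.05, so these ten runs do not provide evidence that one model is more accurate than the other.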

## Wrap-Up & Next Steps
✅ You understand the difference between descriptive and inferential statistics  
✅ You can perform basic statistical analysis in Python  
✅ You recognize statistics' crucial role in AI and data science  
✅ You can identify data quality issues and outliers  

**Next:** Data Types & Visualization - Choose the right charts and handle different data types!
