# Lab 2: Statistical Analysis - Interactive Notebook

## Learning Objectives

By the end of this lab, you will be able to:
1. Calculate and interpret descriptive statistics
2. Work with probability distributions
3. Perform hypothesis testing
4. Analyze real-world datasets statistically

**Estimated Time:** 3-4 hours

---

## Setup: Import Libraries

Run this cell first to import all necessary libraries.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical tests
from scipy import stats
from scipy.stats import norm, binom, poisson

# Dataset
from sklearn.datasets import load_iris, load_wine

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')

print('✅ All libraries imported successfully!')
print(f'Pandas version: {pd.__version__}')
print(f'NumPy version: {np.__version__}')

---

# Part 1: Descriptive Statistics

## Exercise 1.1: Load and Explore the Iris Dataset

In [None]:
# Load Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species_name'] = iris_df['species'].map({
    0: 'setosa', 
    1: 'versicolor', 
    2: 'virginica'
})

print('Dataset loaded successfully!')
print(f'Shape: {iris_df.shape}')
print(f'\nFirst 5 rows:')
iris_df.head()

### 📝 Task 1: Data Quality Check

Complete the following tasks:

In [None]:
# TODO: Display dataset info
# YOUR CODE HERE


# TODO: Check for missing values
# YOUR CODE HERE


# TODO: Display basic statistics
# YOUR CODE HERE


### ✅ Solution (Hidden - Click to expand)

In [None]:
# Solution
print('Dataset Info:')
print(iris_df.info())

print('\nMissing Values:')
print(iris_df.isnull().sum())

print('\nDescriptive Statistics:')
print(iris_df.describe())

### 📝 Task 2: Calculate Measures of Central Tendency

For the 'sepal length (cm)' column, calculate:

In [None]:
sepal_length = iris_df['sepal length (cm)']

# TODO: Calculate mean
mean_val = # YOUR CODE HERE

# TODO: Calculate median
median_val = # YOUR CODE HERE

# TODO: Calculate mode
mode_val = # YOUR CODE HERE

print(f'Mean: {mean_val:.2f}')
print(f'Median: {median_val:.2f}')
print(f'Mode: {mode_val:.2f}')

In [None]:
# Solution
mean_val = sepal_length.mean()
median_val = sepal_length.median()
mode_val = sepal_length.mode()[0]

print(f'Mean: {mean_val:.2f} cm')
print(f'Median: {median_val:.2f} cm')
print(f'Mode: {mode_val:.2f} cm')
print(f'\nDifference (Mean - Median): {abs(mean_val - median_val):.4f}')
print('The distribution is approximately symmetric.')
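
Central tendency is usually reported together with a measure of spread. The sketch below is illustrative only: it uses a small hypothetical sample (not the Iris column) to show how the population and sample formulas differ. NumPy's `var()` and `std()` default to `ddof=0` (divide by n), while pandas' `Series.var()` and `Series.std()` default to `ddof=1` (divide by n - 1).

```python
import numpy as np

# Hypothetical sepal lengths in cm, made up for illustration (not from Iris)
sample = np.array([5.1, 4.9, 6.3, 5.8, 7.1, 6.0, 5.5, 6.7])
n = len(sample)

# Population formulas divide by n (NumPy default, ddof=0)
pop_var = sample.var()
pop_std = sample.std()

# Sample formulas divide by n - 1 (ddof=1), the pandas default
sample_var = sample.var(ddof=1)
sample_std = sample.std(ddof=1)

# Interquartile range: a spread measure that is less sensitive to outliers
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

print(f'Population variance: {pop_var:.4f}, std: {pop_std:.4f}')
print(f'Sample variance:     {sample_var:.4f}, std: {sample_std:.4f}')
print(f'IQR: {iqr:.4f}')
```

On the lab data, `sepal_length.var()` and `sepal_length.std()` would give the sample versions directly, since `sepal_length` is a pandas Series.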

### 📊 Task 3: Visualize Distributions

In [None]:
# TODO: Create a histogram of sepal length with mean and median lines
plt.figure(figsize=(10, 6))

# YOUR CODE HERE: Create histogram

# YOUR CODE HERE: Add vertical lines for mean and median

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.title('Distribution of Sepal Length')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

In [None]:
# Solution
plt.figure(figsize=(10, 6))
plt.hist(sepal_length, bins=20, edgecolor='black', alpha=0.7, color='skyblue')
plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
plt.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.2f}')
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Sepal Length', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()

---

# Part 2: Probability Distributions

## Exercise 2.1: Normal Distribution

**Scenario:** Heights of adult males are normally distributed with mean = 175 cm and std = 7 cm.

In [None]:
mu = 175  # mean
sigma = 7  # standard deviation

# TODO: What percentage of males are taller than 180 cm?
prob_taller_180 = # YOUR CODE HERE

# TODO: What percentage are between 170 and 180 cm?
prob_between = # YOUR CODE HERE

# TODO: What height corresponds to the 90th percentile?
height_90th = # YOUR CODE HERE

print(f'P(X > 180): {prob_taller_180:.2%}')
print(f'P(170 < X < 180): {prob_between:.2%}')
print(f'90th percentile: {height_90th:.2f} cm')

In [None]:
# Solution
from scipy.stats import norm

# 1. Probability taller than 180 cm
prob_taller_180 = 1 - norm.cdf(180, mu, sigma)

# 2. Probability between 170 and 180
prob_between = norm.cdf(180, mu, sigma) - norm.cdf(170, mu, sigma)

# 3. 90th percentile
height_90th = norm.ppf(0.90, mu, sigma)

print(f'P(X > 180): {prob_taller_180:.2%}')
print(f'P(170 < X < 180): {prob_between:.2%}')
print(f'90th percentile: {height_90th:.2f} cm')

# Visualize
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
y = norm.pdf(x, mu, sigma)

plt.figure(figsize=(12, 6))
plt.plot(x, y, 'b-', linewidth=2, label='Normal Distribution')
plt.fill_between(x[x > 180], 0, norm.pdf(x[x > 180], mu, sigma), 
                 alpha=0.3, color='red', label=f'P(X > 180) = {prob_taller_180:.2%}')
plt.axvline(mu, color='green', linestyle='--', linewidth=2, label=f'Mean = {mu}')
plt.axvline(height_90th, color='orange', linestyle='--', linewidth=2, 
            label=f'90th percentile = {height_90th:.1f}')
plt.xlabel('Height (cm)', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Normal Distribution of Male Heights', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
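
As a cross-check on the `norm.cdf` results, the same probabilities can be obtained by standardizing to a z-score and evaluating the standard normal CDF with the error function. This is a minimal sketch using only the standard library; `normal_cdf` is a helper name introduced here for illustration, not part of SciPy.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) variable, computed via the error function."""
    z = (x - mu) / sigma  # standardize: distance from the mean in std units
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 175, 7  # values from the scenario above

z_180 = (180 - mu) / sigma
p_taller_180 = 1.0 - normal_cdf(180, mu, sigma)
p_between = normal_cdf(180, mu, sigma) - normal_cdf(170, mu, sigma)

print(f'z-score for 180 cm: {z_180:.3f}')
print(f'P(X > 180): {p_taller_180:.2%}')
print(f'P(170 < X < 180): {p_between:.2%}')
```

Because 170 and 180 are equally far from the mean, the middle probability equals 1 minus twice the upper-tail probability, which is a quick sanity check on any answer.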

---

# Part 3: Hypothesis Testing

## Exercise 3.1: One-Sample t-test

**Scenario:** A manufacturer claims their light bulbs last 1000 hours on average.

In [None]:
# Simulated data
np.random.seed(42)
bulb_lifetimes = np.random.normal(950, 100, 25)

print(f'Sample mean: {bulb_lifetimes.mean():.2f} hours')
print(f'Sample std: {bulb_lifetimes.std(ddof=1):.2f} hours')  # ddof=1 gives the sample (n - 1) estimate

# TODO: Perform one-sample t-test
# H0: μ = 1000 (manufacturer's claim is correct)
# H1: μ ≠ 1000 (manufacturer's claim is incorrect)

t_statistic, p_value = # YOUR CODE HERE

print(f'\nt-statistic: {t_statistic:.4f}')
print(f'p-value: {p_value:.4f}')

alpha = 0.05
if p_value < alpha:
    print(f'\nReject H0: Evidence suggests bulbs don\'t last 1000 hours')
else:
    print(f'\nFail to reject H0: No significant evidence against 1000 hours')

In [None]:
# Solution
from scipy.stats import ttest_1samp

t_statistic, p_value = ttest_1samp(bulb_lifetimes, 1000)

print(f't-statistic: {t_statistic:.4f}')
print(f'p-value: {p_value:.4f}')

alpha = 0.05
if p_value < alpha:
    print(f'\n✅ Reject H0 (p={p_value:.4f} < {alpha})')
    print('Evidence suggests bulbs do NOT last 1000 hours on average')
else:
    print(f'\n❌ Fail to reject H0 (p={p_value:.4f} >= {alpha})')
    print('No significant evidence against manufacturer\'s claim')

# 95% Confidence Interval
from scipy.stats import t
n = len(bulb_lifetimes)
mean = bulb_lifetimes.mean()
std_err = bulb_lifetimes.std(ddof=1) / np.sqrt(n)
ci = t.interval(0.95, n-1, loc=mean, scale=std_err)

print(f'\n95% Confidence Interval: ({ci[0]:.2f}, {ci[1]:.2f}) hours')
print(f'We are 95% confident the true mean lies between {ci[0]:.1f} and {ci[1]:.1f} hours')
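
To see what `ttest_1samp` computes, the t-statistic can be rebuilt from its definition: the difference between the sample mean and the hypothesized mean, divided by the standard error. The sketch below is a rough illustration on freshly simulated data (a hypothetical seed, so the numbers differ from the exercise) and compares the manual result with SciPy's.

```python
import numpy as np
from scipy import stats

# Hypothetical simulated lifetimes; seed and draws differ from the exercise
rng = np.random.default_rng(0)
sample = rng.normal(950, 100, 25)
mu0 = 1000  # hypothesized mean under H0

n = len(sample)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)      # sample std (n - 1 in the denominator)
std_err = sample_std / np.sqrt(n)    # standard error of the mean

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - mu0) / std_err

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f'Manual: t = {t_manual:.4f}, p = {p_manual:.4f}')
print(f'SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}')
```

Using `ddof=1` matters here: NumPy's default `std()` divides by n and would give a slightly larger t-statistic than SciPy reports.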

---

# 🎯 Practice Questions

Answer the following questions based on your analysis:

### Question 1
What is the probability of finding a male taller than 190 cm? (Use μ=175, σ=7)

In [None]:
# YOUR ANSWER HERE


### Question 2
Calculate the 95% confidence interval for the mean sepal length of Iris flowers.

In [None]:
# YOUR ANSWER HERE


---

# 🏆 Bonus Challenge

Compare the sepal length between all three Iris species using ANOVA.

- H0: All species have the same mean sepal length
- H1: At least one species has a different mean

In [None]:
# YOUR CODE HERE


In [None]:
# Solution
from scipy.stats import f_oneway

setosa = iris_df[iris_df['species_name'] == 'setosa']['sepal length (cm)']
versicolor = iris_df[iris_df['species_name'] == 'versicolor']['sepal length (cm)']
virginica = iris_df[iris_df['species_name'] == 'virginica']['sepal length (cm)']

f_stat, p_value = f_oneway(setosa, versicolor, virginica)

print(f'F-statistic: {f_stat:.4f}')
print(f'P-value: {p_value:.4e}')

if p_value < 0.05:
    print('\n✅ Reject H0: At least one species has different mean sepal length')
else:
    print('\n❌ Fail to reject H0: No significant difference between species')
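
The F-statistic from `f_oneway` is the ratio of between-group to within-group variability. As a sketch of that calculation, the example below uses three small hypothetical groups (made-up values, not the Iris species) and compares the manual result with SciPy's.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups, made up for illustration
groups = [
    np.array([5.0, 5.2, 4.8, 5.1, 4.9]),
    np.array([5.9, 6.1, 5.7, 6.3, 6.0]),
    np.array([6.5, 6.8, 6.4, 6.9, 6.6]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f'Manual: F = {f_manual:.4f}, p = {p_manual:.4e}')
print(f'SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.4e}')
```

A significant F only says that at least one group mean differs. It does not say which one, so a follow-up pairwise comparison is usually needed.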

---

# 📝 Summary

## What You Learned

- ✅ Calculate descriptive statistics (mean, median, mode, variance, std)
- ✅ Work with probability distributions (Normal, Binomial, Poisson)
- ✅ Perform hypothesis tests (t-test, ANOVA)
- ✅ Interpret p-values and make statistical decisions
- ✅ Visualize distributions and test results

## Next Steps

1. Practice with the Wine dataset (`load_wine`, already imported in Setup)
2. Explore Chi-Square tests
3. Learn about correlation analysis
4. Move to Lab 3: Clustering

---

**Great job! 🎉**