# Module 11: Statistics for Data Science

## Topics Covered
1. Descriptive Statistics
2. Measures of Central Tendency
3. Measures of Dispersion
4. Probability Basics
5. Probability Distributions
6. Hypothesis Testing Fundamentals
7. Correlation and Covariance
8. A/B Testing Basics

## Learning Objectives

By the end of this module, you will be able to:
- Calculate and interpret descriptive statistics for datasets
- Understand and apply measures of central tendency and dispersion
- Apply basic probability concepts to data problems
- Work with common probability distributions
- Perform and interpret hypothesis tests
- Calculate and interpret correlation and covariance
- Design and analyze basic A/B tests

---

In [None]:
# Import libraries we'll use throughout this module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

---
# Section 1: Descriptive Statistics
---

## What are Descriptive Statistics?

Descriptive statistics are numerical values that summarize the main features of a dataset. Unlike inferential statistics (covered later in this module), they do not draw conclusions beyond the data at hand; they simply describe what is there.

### Why This Matters in Data Science

Descriptive statistics are your first line of defense when exploring data. Before building models or making predictions, you need to understand:
- What does a "typical" value look like?
- How spread out are the values?
- Are there any unusual patterns or outliers?
- What's the shape of the distribution?

### Types of Descriptive Statistics

1. **Measures of Central Tendency** - Where is the "center" of the data?
2. **Measures of Dispersion** - How spread out is the data?
3. **Measures of Shape** - What does the distribution look like?
4. **Measures of Position** - Where do specific values fall in the distribution?
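
Shape and position are not demonstrated elsewhere in this section, so here is a minimal sketch of both. The `data` array is a made-up, right-skewed sample chosen for illustration; `stats.skew` and `stats.kurtosis` are SciPy functions, and `stats.kurtosis` reports excess kurtosis by default.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: one large value stretches the right tail
data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 15], dtype=float)

# Measures of shape
skewness = stats.skew(data)             # > 0 means a longer right tail
excess_kurtosis = stats.kurtosis(data)  # 0 for a normal distribution (SciPy default)

# Measures of position
p90 = np.percentile(data, 90)                       # value below which ~90% of data fall
z_scores = (data - data.mean()) / data.std(ddof=1)  # distance from the mean in std units

print(f"Skewness: {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
print(f"90th percentile: {p90:.2f}")
print(f"z-score of the largest value: {z_scores[-1]:.2f}")
```

A positive skewness together with a large z-score for the maximum is a quick signal that the mean may be pulled away from the bulk of the data.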

In [None]:
# Let's create a sample dataset to work with
# Simulating employee salaries at a tech company

np.random.seed(42)

# Generate salary data with different distributions for different departments
engineering_salaries = np.random.normal(95000, 15000, 100)
marketing_salaries = np.random.normal(75000, 12000, 60)
sales_salaries = np.random.normal(65000, 20000, 80)  # Higher variance
executive_salaries = np.random.normal(150000, 30000, 10)  # Small group, high salaries

# Combine into a DataFrame
salaries_df = pd.DataFrame({
    'salary': np.concatenate([engineering_salaries, marketing_salaries, 
                              sales_salaries, executive_salaries]),
    'department': (['Engineering'] * 100 + ['Marketing'] * 60 + 
                   ['Sales'] * 80 + ['Executive'] * 10)
})

# Floor salaries at $30,000 (this also rules out negative values)
salaries_df['salary'] = salaries_df['salary'].clip(lower=30000)

print(f"Dataset shape: {salaries_df.shape}")
print(f"\nDepartment counts:")
print(salaries_df['department'].value_counts())

In [None]:
# Quick overview with pandas describe()
print("Overall Salary Statistics:")
print(salaries_df['salary'].describe())

In [None]:
# Statistics by department
print("Salary Statistics by Department:")
print(salaries_df.groupby('department')['salary'].describe().round(2))

---
# Section 2: Measures of Central Tendency
---

## What is Central Tendency?

Measures of central tendency describe where the "center" or "typical value" of a dataset lies. The three main measures are:

1. **Mean** - The arithmetic average
2. **Median** - The middle value when data is sorted
3. **Mode** - The most frequently occurring value

### When to Use Each Measure

| Measure | Best Used When | Sensitivity to Outliers |
|---------|---------------|---------------|
| Mean | Data is symmetric, no outliers | High |
| Median | Data is skewed or has outliers | Low (robust) |
| Mode | Categorical data or finding most common value | Low, but several modes may exist |
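
To make the table concrete, the sketch below compares the three measures on a small, made-up income sample with one extreme value. The numbers and the `shirt_sizes` series are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical household incomes (in $1000s) with one very large value
incomes = pd.Series([32, 35, 35, 38, 41, 44, 47, 52, 58, 400])

mean_income = incomes.mean()          # 78.2, pulled upward by the 400
median_income = incomes.median()      # 42.5, barely affected by the 400
mode_income = incomes.mode().iloc[0]  # 35, the only repeated value

print(f"Mean:   {mean_income:.1f}")
print(f"Median: {median_income:.1f}")
print(f"Mode:   {mode_income}")

# For categorical data, the mode is the only one of the three that applies
shirt_sizes = pd.Series(["S", "M", "M", "L", "M", "XL"])
most_common_size = shirt_sizes.mode().iloc[0]
print(f"Most common size: {most_common_size}")
```

Nine of the ten incomes are below the mean here, which is why the median is usually the better "typical value" for skewed data.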

## Syntax

```python
# Using NumPy
np.mean(array)      # Calculate mean
np.median(array)    # Calculate median

# Using pandas
series.mean()       # Mean of a Series
series.median()     # Median of a Series
series.mode()       # Mode(s) of a Series

# Using scipy.stats
stats.mode(array)   # Mode with count
```

In [None]:
# Example: Calculating measures of central tendency
salaries = salaries_df['salary']

# Mean - sum of all values divided by count
mean_salary = salaries.mean()
print(f"Mean salary: ${mean_salary:,.2f}")

# Median - middle value when sorted
median_salary = salaries.median()
print(f"Median salary: ${median_salary:,.2f}")

# Mode - most frequent value. It is rarely informative for continuous data
# because exact values seldom repeat, so we compute it on binned data later.
print(f"\nDifference between mean and median: ${mean_salary - median_salary:,.2f}")

In [None]:
# Understanding the impact of outliers
# Let's add an extreme outlier (CEO salary)

salaries_with_outlier = pd.concat([salaries, pd.Series([500000])], ignore_index=True)

print("Without CEO salary:")
print(f"  Mean: ${salaries.mean():,.2f}")
print(f"  Median: ${salaries.median():,.2f}")

print("\nWith CEO salary ($500,000):")
print(f"  Mean: ${salaries_with_outlier.mean():,.2f}")
print(f"  Median: ${salaries_with_outlier.median():,.2f}")

print(f"\nMean increased by: ${salaries_with_outlier.mean() - salaries.mean():,.2f}")
print(f"Median increased by: ${salaries_with_outlier.median() - salaries.median():,.2f}")

In [None]:
# Visualizing mean vs median
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Without outlier
axes[0].hist(salaries, bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(salaries.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: ${salaries.mean():,.0f}')
axes[0].axvline(salaries.median(), color='green', linestyle='-', linewidth=2, label=f'Median: ${salaries.median():,.0f}')
axes[0].set_xlabel('Salary ($)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Salary Distribution (Without CEO)')
axes[0].legend()

# With outlier
axes[1].hist(salaries_with_outlier, bins=30, edgecolor='black', alpha=0.7)
axes[1].axvline(salaries_with_outlier.mean(), color='red', linestyle='--', linewidth=2, 
                label=f'Mean: ${salaries_with_outlier.mean():,.0f}')
axes[1].axvline(salaries_with_outlier.median(), color='green', linestyle='-', linewidth=2, 
                label=f'Median: ${salaries_with_outlier.median():,.0f}')
axes[1].set_xlabel('Salary ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Salary Distribution (With CEO at $500K)')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Mode - most useful for categorical data
# Let's look at the mode of departments

department_mode = salaries_df['department'].mode()
print(f"Most common department: {department_mode.values[0]}")

# For numerical data, we can bin it first
salary_bins = pd.cut(salaries_df['salary'], bins=10)
print(f"\nMost common salary range: {salary_bins.mode().values[0]}")

In [None]:
# Trimmed mean - a compromise between mean and median
# Removes a percentage of extreme values from both ends

from scipy.stats import trim_mean

# 10% trimmed mean (removes bottom and top 10%)
trimmed = trim_mean(salaries_with_outlier, 0.1)

print("Comparison of central tendency measures (with outlier):")
print(f"  Mean: ${salaries_with_outlier.mean():,.2f}")
print(f"  Trimmed Mean (10%): ${trimmed:,.2f}")
print(f"  Median: ${salaries_with_outlier.median():,.2f}")

---
# Section 3: Measures of Dispersion
---

## What is Dispersion?

Measures of dispersion (or spread) tell us how spread out the values in a dataset are. Two datasets can have the same mean but very different spreads.

### Common Measures of Dispersion

1. **Range** - Difference between max and min values
2. **Variance** - Average of squared deviations from the mean
3. **Standard Deviation** - Square root of variance (same units as data)
4. **Interquartile Range (IQR)** - Range of the middle 50% of data
5. **Coefficient of Variation** - Standard deviation relative to mean
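
The definitions above can be checked by hand on a tiny example. This is a sketch on made-up `values`; it also shows that the "average" in the variance divides by `n` for a population and by `n - 1` for a sample, which is what the `ddof` argument controls.

```python
import numpy as np

values = np.array([4.0, 8.0, 6.0, 5.0, 7.0])  # hypothetical measurements
n = len(values)
mean = values.sum() / n                        # 6.0

# Squared deviations from the mean: 4, 4, 0, 1, 1 (sum = 10)
squared_dev = (values - mean) ** 2

pop_var = squared_dev.sum() / n           # population variance: 10 / 5 = 2.0
sample_var = squared_dev.sum() / (n - 1)  # sample variance: 10 / 4 = 2.5
sample_std = np.sqrt(sample_var)          # back in the original units

data_range = values.max() - values.min()  # 8 - 4 = 4
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                             # spread of the middle 50%
cv = sample_std / mean * 100              # std as a percentage of the mean

print(f"Population variance: {pop_var:.2f} (np.var: {np.var(values):.2f})")
print(f"Sample variance:     {sample_var:.2f} (np.var ddof=1: {np.var(values, ddof=1):.2f})")
print(f"Sample std: {sample_std:.3f}, Range: {data_range}, IQR: {iqr}, CV: {cv:.1f}%")
```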

## Syntax

```python
# Range
data_range = data.max() - data.min()

# Variance and Standard Deviation
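variance = data.var()          # pandas Series: sample variance (ddof=1)
std_dev = data.std()           # pandas Series: sample standard deviation (ddof=1)

# Note: pandas uses ddof=1 (sample) by default, while NumPy uses ddof=0
# (population), so np.var(data) and np.std(data) return slightly smaller values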
sample_std = data.std(ddof=1)  # Sample standard deviation
pop_std = data.std(ddof=0)     # Population standard deviation

# Interquartile Range
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Or using scipy
from scipy.stats import iqr
IQR = iqr(data)
```

In [None]:
# Example: Calculating measures of dispersion

# Range
salary_range = salaries.max() - salaries.min()
print(f"Salary Range: ${salary_range:,.2f}")
print(f"  Min: ${salaries.min():,.2f}")
print(f"  Max: ${salaries.max():,.2f}")

In [None]:
# Variance and Standard Deviation
variance = salaries.var()
std_dev = salaries.std()

print(f"\nVariance: {variance:,.2f} (squared dollars)")
print(f"Standard Deviation: ${std_dev:,.2f}")

# Interpretation: Most salaries fall within 1-2 standard deviations of the mean
mean = salaries.mean()
print(f"\nMean +/- 1 Std Dev: ${mean - std_dev:,.2f} to ${mean + std_dev:,.2f}")
print(f"Mean +/- 2 Std Dev: ${mean - 2*std_dev:,.2f} to ${mean + 2*std_dev:,.2f}")

In [None]:
# Verify: What percentage actually falls within these ranges?
within_1_std = ((salaries >= mean - std_dev) & (salaries <= mean + std_dev)).mean() * 100
within_2_std = ((salaries >= mean - 2*std_dev) & (salaries <= mean + 2*std_dev)).mean() * 100

print(f"Percentage within 1 std dev: {within_1_std:.1f}%")
print(f"Percentage within 2 std dev: {within_2_std:.1f}%")
print("\n(For a normal distribution, these would be ~68% and ~95%)")

In [None]:
# Interquartile Range (IQR)
Q1 = salaries.quantile(0.25)
Q2 = salaries.quantile(0.50)  # Same as median
Q3 = salaries.quantile(0.75)
IQR = Q3 - Q1

print("Quartiles:")
print(f"  Q1 (25th percentile): ${Q1:,.2f}")
print(f"  Q2 (50th percentile/Median): ${Q2:,.2f}")
print(f"  Q3 (75th percentile): ${Q3:,.2f}")
print(f"\nInterquartile Range (IQR): ${IQR:,.2f}")
print(f"\nThe middle 50% of salaries fall between ${Q1:,.2f} and ${Q3:,.2f}")

In [None]:
# Coefficient of Variation (CV) - useful for comparing variability across different scales
cv = (std_dev / mean) * 100
print(f"Coefficient of Variation: {cv:.2f}%")

# Compare CV across departments
print("\nCoefficient of Variation by Department:")
for dept in salaries_df['department'].unique():
    dept_salaries = salaries_df[salaries_df['department'] == dept]['salary']
    dept_cv = (dept_salaries.std() / dept_salaries.mean()) * 100
    print(f"  {dept}: {dept_cv:.2f}%")

In [None]:
# Visualizing dispersion with a box plot
fig, ax = plt.subplots(figsize=(10, 6))

salaries_df.boxplot(column='salary', by='department', ax=ax)
ax.set_ylabel('Salary ($)')
ax.set_title('Salary Distribution by Department')
plt.suptitle('')  # Remove automatic title

plt.tight_layout()
plt.show()

## Practice Exercise 3.1

**Task:** You have test scores from two classes. Calculate the measures of central tendency and dispersion for each class, then determine which class performed more consistently.

```python
class_a = [85, 90, 78, 92, 88, 76, 95, 89, 84, 91]
class_b = [82, 95, 70, 98, 75, 88, 100, 65, 93, 84]
```

Calculate for each class:
1. Mean and Median
2. Standard Deviation
3. Range and IQR
4. Which class was more consistent? (Hint: Lower CV = more consistent)

In [None]:
# Your code here
class_a = [85, 90, 78, 92, 88, 76, 95, 89, 84, 91]
class_b = [82, 95, 70, 98, 75, 88, 100, 65, 93, 84]


In [None]:
# Solution 3.1
class_a = np.array([85, 90, 78, 92, 88, 76, 95, 89, 84, 91])
class_b = np.array([82, 95, 70, 98, 75, 88, 100, 65, 93, 84])

def analyze_scores(scores, class_name):
    print(f"\n{class_name}:")
    print(f"  Mean: {np.mean(scores):.2f}")
    print(f"  Median: {np.median(scores):.2f}")
    print(f"  Standard Deviation: {np.std(scores, ddof=1):.2f}")
    print(f"  Range: {np.max(scores) - np.min(scores)}")
    q1, q3 = np.percentile(scores, [25, 75])
    print(f"  IQR: {q3 - q1:.2f}")
    cv = (np.std(scores, ddof=1) / np.mean(scores)) * 100
    print(f"  Coefficient of Variation: {cv:.2f}%")
    return cv

cv_a = analyze_scores(class_a, "Class A")
cv_b = analyze_scores(class_b, "Class B")

print(f"\nConclusion: {'Class A' if cv_a < cv_b else 'Class B'} performed more consistently.")
print(f"(Lower CV indicates less variability relative to the mean)")

---
# Section 4: Probability Basics
---

## What is Probability?

Probability is a measure of the likelihood that an event will occur. It's expressed as a number between 0 (impossible) and 1 (certain).

### Key Probability Concepts

- **Experiment**: A process that produces an outcome (e.g., rolling a die)
- **Sample Space**: All possible outcomes (e.g., {1, 2, 3, 4, 5, 6})
- **Event**: A subset of the sample space (e.g., rolling an even number)
- **Probability**: When all outcomes are equally likely, P(Event) = Number of favorable outcomes / Total possible outcomes
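
As an illustration of these terms, the sketch below enumerates the sample space for rolling two dice and counts favorable outcomes. It assumes fair dice, so every outcome is equally likely; exact fractions come from the standard library's `fractions` module.

```python
from fractions import Fraction
from itertools import product

# Experiment: roll two fair six-sided dice
# Sample space: every ordered pair (first die, second die)
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the two dice sum to 7
event_sum_7 = [outcome for outcome in sample_space if sum(outcome) == 7]

# Event B: both dice show the same number
event_doubles = [outcome for outcome in sample_space if outcome[0] == outcome[1]]

# Probability = favorable outcomes / total outcomes (valid because outcomes are equally likely)
p_sum_7 = Fraction(len(event_sum_7), len(sample_space))
p_doubles = Fraction(len(event_doubles), len(sample_space))

print(f"Sample space size: {len(sample_space)}")
print(f"P(sum is 7) = {p_sum_7}")
print(f"P(doubles)  = {p_doubles}")
```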

### Why This Matters in Data Science

Probability forms the foundation of:
- Statistical inference and hypothesis testing
- Machine learning algorithms (especially Bayesian methods)
- Risk assessment and decision making
- Understanding uncertainty in predictions

In [None]:
# Example: Basic probability calculations

# Rolling a fair six-sided die
sample_space = [1, 2, 3, 4, 5, 6]
total_outcomes = len(sample_space)

# P(rolling a 4)
favorable_4 = [x for x in sample_space if x == 4]
p_four = len(favorable_4) / total_outcomes
print(f"P(rolling a 4) = {p_four:.4f} or {p_four:.2%}")

# P(rolling an even number)
favorable_even = [x for x in sample_space if x % 2 == 0]
p_even = len(favorable_even) / total_outcomes
print(f"P(rolling an even number) = {p_even:.4f} or {p_even:.2%}")

# P(rolling greater than 4)
favorable_gt4 = [x for x in sample_space if x > 4]
p_gt4 = len(favorable_gt4) / total_outcomes
print(f"P(rolling > 4) = {p_gt4:.4f} or {p_gt4:.2%}")

In [None]:
# Complement rule: P(not A) = 1 - P(A)
p_not_four = 1 - p_four
print(f"P(NOT rolling a 4) = {p_not_four:.4f}")

# Verify with simulation
np.random.seed(42)
rolls = np.random.randint(1, 7, size=10000)

simulated_p_four = (rolls == 4).mean()
simulated_p_even = (rolls % 2 == 0).mean()

print(f"\nSimulated probabilities (10,000 rolls):")
print(f"  P(4): {simulated_p_four:.4f} (theoretical: {p_four:.4f})")
print(f"  P(even): {simulated_p_even:.4f} (theoretical: {p_even:.4f})")

In [None]:
# Conditional Probability: P(A|B) = P(A and B) / P(B)
# Probability of A given that B has occurred

# Example: Using our salary data
# What's the probability that an employee earns > $100K given they're in Engineering?

engineering_employees = salaries_df[salaries_df['department'] == 'Engineering']
high_earners_in_eng = engineering_employees[engineering_employees['salary'] > 100000]

p_high_given_eng = len(high_earners_in_eng) / len(engineering_employees)
print(f"P(Salary > $100K | Engineering) = {p_high_given_eng:.4f} or {p_high_given_eng:.2%}")

# Compare with overall probability of earning > $100K
p_high_overall = (salaries_df['salary'] > 100000).mean()
print(f"P(Salary > $100K) across all departments = {p_high_overall:.4f} or {p_high_overall:.2%}")

In [None]:
# Independent Events: P(A and B) = P(A) * P(B)
# Events are independent if occurrence of one doesn't affect the other

# Example: Flipping two coins
p_heads = 0.5
p_two_heads = p_heads * p_heads
print(f"P(two heads in a row) = {p_two_heads:.4f}")

# Verify with simulation
np.random.seed(42)
flips = np.random.choice(['H', 'T'], size=(10000, 2))
two_heads = ((flips[:, 0] == 'H') & (flips[:, 1] == 'H')).mean()
print(f"Simulated P(two heads) = {two_heads:.4f}")

In [None]:
# Mutually Exclusive Events: P(A or B) = P(A) + P(B)
# Events that cannot occur at the same time

# Example: Rolling a 1 OR rolling a 6
p_one = 1/6
p_six = 1/6
p_one_or_six = p_one + p_six
print(f"P(rolling 1 OR 6) = {p_one_or_six:.4f}")

# Non-mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)
# Example: P(even OR greater than 4)
# Even: {2, 4, 6}, Greater than 4: {5, 6}
# Overlap: {6}

p_even = 3/6
p_gt4 = 2/6
p_even_and_gt4 = 1/6  # Just 6
p_even_or_gt4 = p_even + p_gt4 - p_even_and_gt4
print(f"P(even OR > 4) = {p_even_or_gt4:.4f}")
print("  (Outcomes {2, 4, 5, 6} = 4 outcomes out of 6)")

---
# Section 5: Probability Distributions
---

## What is a Probability Distribution?

A probability distribution describes how probabilities are distributed over the values of a random variable. There are two main types:

1. **Discrete Distributions** - For countable outcomes (integers)
2. **Continuous Distributions** - For continuous outcomes (any real number)
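
One practical consequence of this split: a discrete distribution assigns probabilities to individual values, while a continuous distribution assigns densities, and probabilities come only from areas under the curve. The sketch below illustrates this with arbitrarily chosen parameters.

```python
import numpy as np
from scipy import stats

# Discrete: pmf values are probabilities and sum to 1 over all outcomes
k = np.arange(0, 11)
pmf = stats.binom.pmf(k, 10, 0.3)
total_probability = pmf.sum()

# Continuous: pdf values are densities, not probabilities, and can exceed 1
density_at_zero = stats.norm.pdf(0, loc=0, scale=0.1)

# Probabilities for a continuous variable are areas: differences of the CDF
p_interval = (stats.norm.cdf(0.1, loc=0, scale=0.1)
              - stats.norm.cdf(-0.1, loc=0, scale=0.1))

print(f"Sum of binomial pmf: {total_probability:.4f}")
print(f"Normal density at 0 (std=0.1): {density_at_zero:.3f}")
print(f"P(-0.1 < X < 0.1): {p_interval:.4f}")
```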

### Common Distributions in Data Science

| Distribution | Type | Use Case |
|--------------|------|----------|
| Binomial | Discrete | Number of successes in n trials |
| Poisson | Discrete | Count of events in a time period |
| Normal | Continuous | Many natural phenomena |
| Uniform | Both | Equal probability for all values |
| Exponential | Continuous | Time between events |
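
The uniform and exponential rows of the table can be sketched with `scipy.stats` as follows. The call-center rate of 4 calls per minute and the interval `[0, 10)` are illustrative assumptions. Note that SciPy parameterizes the exponential by `scale = 1 / rate`, and that `stats.randint(low, high)` excludes `high`.

```python
import math
from scipy import stats

# Exponential: waiting time between events that occur at a constant average rate
rate = 4                             # hypothetical: 4 calls per minute
wait = stats.expon(scale=1 / rate)   # SciPy uses scale = 1 / rate
p_wait_over_30s = wait.sf(0.5)       # P(wait > 0.5 min) = exp(-rate * 0.5)
mean_wait = wait.mean()              # 1 / rate minutes

# Continuous uniform on [0, 10): loc is the lower bound, scale is the width
u = stats.uniform(loc=0, scale=10)
p_between_2_and_5 = u.cdf(5) - u.cdf(2)

# Discrete uniform: a fair die
die = stats.randint(1, 7)
p_die_shows_3 = die.pmf(3)

print(f"P(wait > 30 s) = {p_wait_over_30s:.4f} (exp(-2) = {math.exp(-2):.4f})")
print(f"Mean wait = {mean_wait:.2f} min")
print(f"P(2 < U < 5) = {p_between_2_and_5:.2f}")
print(f"P(die = 3) = {p_die_shows_3:.4f}")
```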

## Syntax

```python
from scipy import stats

# Normal Distribution
stats.norm.pdf(x, loc=mean, scale=std)  # Probability density function
stats.norm.cdf(x, loc=mean, scale=std)  # Cumulative distribution function
stats.norm.ppf(q, loc=mean, scale=std)  # Percent point function (inverse CDF)
stats.norm.rvs(loc=mean, scale=std, size=n)  # Random samples

# Binomial Distribution
stats.binom.pmf(k, n, p)  # Probability mass function
stats.binom.cdf(k, n, p)  # Cumulative distribution function

# Poisson Distribution
stats.poisson.pmf(k, mu)  # Probability mass function
stats.poisson.cdf(k, mu)  # Cumulative distribution function
```

In [None]:
# Normal (Gaussian) Distribution
# Arguably the most widely used distribution in statistics

# Create a normal distribution with mean=0, std=1 (standard normal)
x = np.linspace(-4, 4, 100)
standard_normal = stats.norm.pdf(x, loc=0, scale=1)

# Different means and standard deviations
normal_1 = stats.norm.pdf(x, loc=0, scale=1)
normal_2 = stats.norm.pdf(x, loc=0, scale=2)  # Wider spread
normal_3 = stats.norm.pdf(x, loc=2, scale=1)  # Shifted mean

plt.figure(figsize=(10, 6))
plt.plot(x, normal_1, label='mean=0, std=1', linewidth=2)
plt.plot(x, normal_2, label='mean=0, std=2', linewidth=2)
plt.plot(x, normal_3, label='mean=2, std=1', linewidth=2)
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Normal Distributions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Using the normal distribution for probability calculations
# Example: IQ scores are normally distributed with mean=100, std=15

mean_iq = 100
std_iq = 15

# What percentage of people have IQ above 130?
p_above_130 = 1 - stats.norm.cdf(130, loc=mean_iq, scale=std_iq)
print(f"P(IQ > 130) = {p_above_130:.4f} or {p_above_130:.2%}")

# What percentage of people have IQ between 85 and 115?
p_between = stats.norm.cdf(115, loc=mean_iq, scale=std_iq) - stats.norm.cdf(85, loc=mean_iq, scale=std_iq)
print(f"P(85 < IQ < 115) = {p_between:.4f} or {p_between:.2%}")

# What IQ score is at the 95th percentile?
iq_95th = stats.norm.ppf(0.95, loc=mean_iq, scale=std_iq)
print(f"95th percentile IQ = {iq_95th:.1f}")

In [None]:
# Binomial Distribution
# Models the number of successes in n independent trials with probability p

# Example: A website has a 5% conversion rate. 
# If 100 people visit, what's the probability of exactly 7 conversions?

n_visitors = 100
p_convert = 0.05

# P(exactly 7 conversions)
p_exactly_7 = stats.binom.pmf(7, n_visitors, p_convert)
print(f"P(exactly 7 conversions) = {p_exactly_7:.4f}")

# P(at most 3 conversions)
p_at_most_3 = stats.binom.cdf(3, n_visitors, p_convert)
print(f"P(at most 3 conversions) = {p_at_most_3:.4f}")

# P(at least 10 conversions)
p_at_least_10 = 1 - stats.binom.cdf(9, n_visitors, p_convert)
print(f"P(at least 10 conversions) = {p_at_least_10:.4f}")

# Expected value (mean) and standard deviation
expected = n_visitors * p_convert
std = np.sqrt(n_visitors * p_convert * (1 - p_convert))
print(f"\nExpected conversions: {expected:.2f}")
print(f"Standard deviation: {std:.2f}")

In [None]:
# Visualizing the binomial distribution
k_values = np.arange(0, 20)
probabilities = stats.binom.pmf(k_values, n_visitors, p_convert)

plt.figure(figsize=(10, 6))
plt.bar(k_values, probabilities, edgecolor='black')
plt.axvline(expected, color='red', linestyle='--', linewidth=2, label=f'Expected: {expected}')
plt.xlabel('Number of Conversions')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n_visitors}, p={p_convert})')
plt.legend()
plt.show()

In [None]:
# Poisson Distribution
# Models the number of events in a fixed interval when events occur independently at a constant average rate

# Example: A call center receives an average of 4 calls per minute
# What's the probability of receiving exactly 6 calls in a minute?

lambda_rate = 4  # average calls per minute

p_exactly_6 = stats.poisson.pmf(6, lambda_rate)
print(f"P(exactly 6 calls) = {p_exactly_6:.4f}")

# P(no calls in a minute)
p_zero = stats.poisson.pmf(0, lambda_rate)
print(f"P(0 calls) = {p_zero:.4f}")

# P(more than 7 calls)
p_more_than_7 = 1 - stats.poisson.cdf(7, lambda_rate)
print(f"P(more than 7 calls) = {p_more_than_7:.4f}")

In [None]:
# Comparing distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Normal
x = np.linspace(-4, 4, 100)
axes[0].plot(x, stats.norm.pdf(x), linewidth=2)
axes[0].fill_between(x, stats.norm.pdf(x), alpha=0.3)
axes[0].set_title('Normal Distribution\n(Continuous)')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Density')

# Binomial
k = np.arange(0, 21)
axes[1].bar(k, stats.binom.pmf(k, 20, 0.5), edgecolor='black')
axes[1].set_title('Binomial Distribution\n(n=20, p=0.5)')
axes[1].set_xlabel('k (successes)')
axes[1].set_ylabel('Probability')

# Poisson
k = np.arange(0, 15)
axes[2].bar(k, stats.poisson.pmf(k, 5), edgecolor='black')
axes[2].set_title('Poisson Distribution\n(lambda=5)')
axes[2].set_xlabel('k (events)')
axes[2].set_ylabel('Probability')

plt.tight_layout()
plt.show()

---
# Section 6: Hypothesis Testing Fundamentals
---

## What is Hypothesis Testing?

Hypothesis testing is a statistical method for making decisions based on data. It helps us determine whether observed differences or effects are statistically significant or just due to random chance.

### Key Concepts

1. **Null Hypothesis (H0)**: The default assumption (usually "no effect" or "no difference")
2. **Alternative Hypothesis (H1 or Ha)**: The claim we are looking for evidence to support
3. **Test Statistic**: A value calculated from the data that measures how far the sample departs from what H0 predicts
4. **P-value**: Probability of observing results at least as extreme as ours, assuming H0 is true
5. **Significance Level (alpha)**: Threshold for rejecting H0 (commonly 0.05)

### The Hypothesis Testing Process

1. State the hypotheses (H0 and H1)
2. Choose significance level (alpha)
3. Collect data and calculate test statistic
4. Calculate p-value
5. Make a decision: Reject H0 if p-value < alpha
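
The five steps can be traced by hand for a one-sample t-test. This is a minimal sketch on made-up measurements (`sample`, `mu0` are hypothetical); the last lines cross-check the manual result against `stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Step 1: H0: mu = 10.0, H1: mu != 10.0 (two-sided)
mu0 = 10.0
# Step 2: choose the significance level
alpha = 0.05

# Step 3: collect data (hypothetical) and compute the test statistic
sample = np.array([9.8, 10.2, 10.4, 9.9, 10.6, 10.3, 10.1, 10.5])
n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_stat = (sample.mean() - mu0) / standard_error

# Step 4: two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Step 5: decide
reject_h0 = p_value < alpha

print(f"t = {t_stat:.4f}, p = {p_value:.4f}, reject H0: {reject_h0}")

# Cross-check against SciPy's built-in test
t_check, p_check = stats.ttest_1samp(sample, mu0)
print(f"SciPy: t = {t_check:.4f}, p = {p_check:.4f}")
```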

## Syntax

```python
from scipy import stats

# One-sample t-test (compare sample mean to known value)
t_stat, p_value = stats.ttest_1samp(sample_data, population_mean)

# Two-sample t-test (compare means of two independent groups;
# assumes equal variances unless equal_var=False is passed)
t_stat, p_value = stats.ttest_ind(group1, group2)

# Paired t-test (compare before/after measurements)
t_stat, p_value = stats.ttest_rel(before, after)

# Chi-square test (for categorical data)
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Mann-Whitney U test (non-parametric alternative to t-test)
stat, p_value = stats.mannwhitneyu(group1, group2)
```
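
Two optional arguments of `stats.ttest_ind` are worth knowing: `equal_var=False` runs Welch's t-test, which does not assume equal variances, and `alternative` selects a one-sided test. The sketch below uses made-up groups and assumes SciPy 1.6 or later, where `alternative` is available.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with clearly different spreads and sizes
group1 = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2])
group2 = np.array([13.5, 10.9, 14.2, 12.8, 15.1, 11.6, 13.9, 12.3])

# Student's t-test: pools the two variances (the default)
student = stats.ttest_ind(group1, group2)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group1, group2, equal_var=False)

# One-sided Welch test of H1: mean(group1) < mean(group2)
one_sided = stats.ttest_ind(group1, group2, equal_var=False, alternative="less")

print(f"Student:   t = {student.statistic:.4f}, p = {student.pvalue:.4f}")
print(f"Welch:     t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
print(f"One-sided: t = {one_sided.statistic:.4f}, p = {one_sided.pvalue:.4f}")
```

When the observed difference points in the direction of the alternative, the one-sided p-value is half the two-sided one.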

In [None]:
# Example 1: One-sample t-test
# A company claims their batteries last 500 hours on average.
# We test 30 batteries and want to know if this claim is accurate.

np.random.seed(42)
# Simulate battery life data (true mean is 485 hours)
battery_life = np.random.normal(485, 30, 30)

print("Battery Life Test")
print(f"Sample size: {len(battery_life)}")
print(f"Sample mean: {battery_life.mean():.2f} hours")
print(f"Sample std: {battery_life.std(ddof=1):.2f} hours")

# Hypothesis test
# H0: mu = 500 (company's claim is true)
# H1: mu != 500 (company's claim is false)

claimed_mean = 500
t_stat, p_value = stats.ttest_1samp(battery_life, claimed_mean)

print(f"\nHypothesis Test Results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion: Reject H0 (p < {alpha})")
    print("The evidence suggests the batteries do NOT last 500 hours on average.")
else:
    print(f"\nConclusion: Fail to reject H0 (p >= {alpha})")
    print("There is not enough evidence to reject the company's claim.")

In [None]:
# Example 2: Two-sample t-test
# Compare salaries between Engineering and Marketing departments

engineering = salaries_df[salaries_df['department'] == 'Engineering']['salary']
marketing = salaries_df[salaries_df['department'] == 'Marketing']['salary']

print("Engineering vs Marketing Salary Comparison")
print(f"Engineering: n={len(engineering)}, mean=${engineering.mean():,.2f}")
print(f"Marketing: n={len(marketing)}, mean=${marketing.mean():,.2f}")

# Hypothesis test
# H0: There is no difference in mean salaries between departments
# H1: There is a difference in mean salaries

t_stat, p_value = stats.ttest_ind(engineering, marketing)

print(f"\nTwo-sample t-test Results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion: Reject H0 (p < {alpha})")
    print("There is a statistically significant difference in salaries.")
else:
    print(f"\nConclusion: Fail to reject H0 (p >= {alpha})")
    print("No significant difference detected.")

In [None]:
# Visualize the comparison
fig, ax = plt.subplots(figsize=(10, 6))

# Create overlaid histograms
ax.hist(engineering, bins=20, alpha=0.5, label=f'Engineering (mean=${engineering.mean():,.0f})')
ax.hist(marketing, bins=20, alpha=0.5, label=f'Marketing (mean=${marketing.mean():,.0f})')

ax.axvline(engineering.mean(), color='blue', linestyle='--', linewidth=2)
ax.axvline(marketing.mean(), color='orange', linestyle='--', linewidth=2)

ax.set_xlabel('Salary ($)')
ax.set_ylabel('Frequency')
ax.set_title(f'Salary Distribution Comparison (p-value = {p_value:.6f})')
ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Example 3: Paired t-test
# Compare employee productivity before and after a training program

np.random.seed(42)
n_employees = 25

# Productivity scores (tasks completed per day)
before_training = np.random.normal(20, 5, n_employees)
# After training, there's a small improvement (+ noise)
after_training = before_training + np.random.normal(2, 3, n_employees)

print("Productivity Before vs After Training")
print(f"Before: mean = {before_training.mean():.2f} tasks/day")
print(f"After: mean = {after_training.mean():.2f} tasks/day")
print(f"Mean improvement: {(after_training - before_training).mean():.2f} tasks/day")

# Paired t-test (same subjects measured twice)
# H0: No difference before and after training
# H1: There is a difference

t_stat, p_value = stats.ttest_rel(before_training, after_training)

print(f"\nPaired t-test Results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nConclusion: The training had a statistically significant effect.")
else:
    print("\nConclusion: No significant effect detected.")

In [None]:
# Understanding Type I and Type II Errors

print("Hypothesis Testing Errors:")
print("-" * 60)
print("                    | H0 True        | H0 False       ")
print("-" * 60)
print("Reject H0           | Type I Error   | Correct!       ")
print("                    | (False Pos)    | (True Pos)     ")
print("-" * 60)
print("Fail to Reject H0   | Correct!       | Type II Error  ")
print("                    | (True Neg)     | (False Neg)    ")
print("-" * 60)
print("\nType I Error (alpha): Probability of rejecting H0 when it's true")
print("Type II Error (beta): Probability of failing to reject H0 when it's false")
print("Power = 1 - beta: Probability of correctly rejecting a false H0")
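
The error rates in the table can be estimated by simulation. In this sketch, the number of experiments, the sample size, and the effect size of 0.5 are arbitrary choices for illustration: when H0 is true, roughly `alpha` of the tests should reject it, and when H0 is false, the rejection rate estimates the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments, n_per_sample = 2000, 30

# Case 1: H0 is true (the true mean really is 0)
null_samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_sample))
_, p_null = stats.ttest_1samp(null_samples, popmean=0.0, axis=1)
type_i_rate = (p_null < alpha).mean()   # fraction of false positives

# Case 2: H0 is false (the true mean is 0.5, but we still test against 0)
shifted_samples = rng.normal(loc=0.5, scale=1.0, size=(n_experiments, n_per_sample))
_, p_shifted = stats.ttest_1samp(shifted_samples, popmean=0.0, axis=1)
power = (p_shifted < alpha).mean()      # fraction of correct rejections
type_ii_rate = 1 - power                # fraction of missed effects

print(f"Estimated Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
print(f"Estimated power: {power:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```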

## Practice Exercise 6.1

**Task:** A coffee shop claims their medium coffee contains 12 oz on average. You suspect they might be under-filling cups. You measure 20 cups and get the following data:

```python
coffee_amounts = [11.8, 12.1, 11.5, 11.9, 12.0, 11.7, 11.6, 12.2, 11.4, 11.8,
                  11.9, 12.0, 11.5, 11.7, 11.8, 12.1, 11.6, 11.9, 11.7, 11.8]
```

1. State the null and alternative hypotheses
2. Perform a one-sample t-test
3. At alpha = 0.05, what is your conclusion?
4. Calculate the 95% confidence interval for the true mean

In [None]:
# Your code here
coffee_amounts = [11.8, 12.1, 11.5, 11.9, 12.0, 11.7, 11.6, 12.2, 11.4, 11.8,
                  11.9, 12.0, 11.5, 11.7, 11.8, 12.1, 11.6, 11.9, 11.7, 11.8]


In [None]:
# Solution 6.1
coffee_amounts = np.array([11.8, 12.1, 11.5, 11.9, 12.0, 11.7, 11.6, 12.2, 11.4, 11.8,
                           11.9, 12.0, 11.5, 11.7, 11.8, 12.1, 11.6, 11.9, 11.7, 11.8])

print("Coffee Shop Under-filling Test")
print("=" * 50)

# 1. State hypotheses
print("\n1. Hypotheses:")
print("   H0: mu = 12 oz (coffee shop fills correctly)")
print("   H1: mu < 12 oz (coffee shop under-fills)")

# Summary statistics
print(f"\nSample statistics:")
print(f"   n = {len(coffee_amounts)}")
print(f"   Sample mean = {coffee_amounts.mean():.3f} oz")
print(f"   Sample std = {coffee_amounts.std(ddof=1):.3f} oz")

# 2. Perform t-test
claimed_mean = 12
t_stat, p_value_two_sided = stats.ttest_1samp(coffee_amounts, claimed_mean)

# For one-sided test (H1: mu < 12), divide p-value by 2
# Only if t-statistic is negative (sample mean < claimed mean)
p_value_one_sided = p_value_two_sided / 2 if t_stat < 0 else 1 - p_value_two_sided / 2

print(f"\n2. Test Results:")
print(f"   t-statistic: {t_stat:.4f}")
print(f"   p-value (one-sided): {p_value_one_sided:.4f}")

# 3. Conclusion
alpha = 0.05
print(f"\n3. Conclusion (alpha = {alpha}):")
if p_value_one_sided < alpha:
    print(f"   Reject H0 (p = {p_value_one_sided:.4f} < {alpha})")
    print("   Evidence suggests the coffee shop is under-filling cups.")
else:
    print(f"   Fail to reject H0 (p = {p_value_one_sided:.4f} >= {alpha})")
    print("   Not enough evidence to conclude under-filling.")

# 4. Confidence interval
confidence_level = 0.95
ci = stats.t.interval(confidence_level, 
                      df=len(coffee_amounts)-1,
                      loc=coffee_amounts.mean(),
                      scale=stats.sem(coffee_amounts))

print(f"\n4. 95% Confidence Interval:")
print(f"   ({ci[0]:.3f}, {ci[1]:.3f}) oz")
print(f"   Note: 12 oz {'IS' if ci[0] <= 12 <= ci[1] else 'is NOT'} contained in the interval.")

---
# Section 7: Correlation and Covariance
---

## What are Correlation and Covariance?

Both measure the relationship between two variables, but they differ in interpretation:

**Covariance**: Measures how two variables change together
- Positive: Variables move in the same direction
- Negative: Variables move in opposite directions
- Problem: Scale-dependent, hard to interpret

**Correlation (Pearson's r)**: Standardized covariance
- Range: -1 to +1
- +1: Perfect positive linear relationship
- -1: Perfect negative linear relationship
- 0: No linear relationship

### Interpretation Guidelines

| Absolute Correlation | Strength |
|-------------|----------|
| 0.0 - 0.2 | Very weak |
| 0.2 - 0.4 | Weak |
| 0.4 - 0.6 | Moderate |
| 0.6 - 0.8 | Strong |
| 0.8 - 1.0 | Very strong |
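
The link between the two measures is that Pearson's r equals the covariance divided by the product of the two standard deviations. The sketch below, on made-up study data, checks that identity and shows why covariance is hard to interpret: changing the units of one variable rescales the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical data: hours studied and exam score
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
score = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 71.0])

# Correlation is covariance standardized by both standard deviations
# (np.cov uses ddof=1 by default, so the stds must use ddof=1 as well)
cov_xy = np.cov(hours, score)[0, 1]
r = cov_xy / (hours.std(ddof=1) * score.std(ddof=1))

# Express the same study time in minutes instead of hours
minutes = hours * 60
cov_minutes = np.cov(minutes, score)[0, 1]     # 60 times larger
r_minutes = np.corrcoef(minutes, score)[0, 1]  # unchanged

print(f"Covariance (hours):   {cov_xy:.3f}")
print(f"Covariance (minutes): {cov_minutes:.3f}")
print(f"Correlation (hours):   {r:.4f}")
print(f"Correlation (minutes): {r_minutes:.4f}")
```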

## Syntax

```python
# Covariance
cov_matrix = np.cov(x, y)           # Returns 2x2 covariance matrix
cov_value = np.cov(x, y)[0, 1]      # Extract covariance

# Correlation
corr_matrix = np.corrcoef(x, y)     # Returns 2x2 correlation matrix
corr_value = np.corrcoef(x, y)[0, 1]  # Extract correlation

# Using pandas
df['col1'].cov(df['col2'])          # Covariance between two columns
df['col1'].corr(df['col2'])         # Correlation between two columns
df.corr()                           # Correlation matrix for all columns

# Statistical test for correlation (with p-value)
from scipy.stats import pearsonr, spearmanr
corr, p_value = pearsonr(x, y)      # Pearson correlation
corr, p_value = spearmanr(x, y)     # Spearman correlation (rank-based)
```

In [None]:
# Create sample data to explore correlation
np.random.seed(42)
n = 100

# Create variables with different relationships
x = np.random.normal(50, 10, n)

# Strong positive correlation
y_positive = 2 * x + np.random.normal(0, 5, n)

# Strong negative correlation
y_negative = -1.5 * x + 150 + np.random.normal(0, 5, n)

# No correlation
y_none = np.random.normal(50, 10, n)

# Non-linear relationship (correlation may miss this!)
y_nonlinear = (x - 50) ** 2 + np.random.normal(0, 30, n)

# Calculate correlations
print("Pearson Correlations:")
print(f"  Positive relationship: r = {np.corrcoef(x, y_positive)[0,1]:.3f}")
print(f"  Negative relationship: r = {np.corrcoef(x, y_negative)[0,1]:.3f}")
print(f"  No relationship: r = {np.corrcoef(x, y_none)[0,1]:.3f}")
print(f"  Non-linear relationship: r = {np.corrcoef(x, y_nonlinear)[0,1]:.3f}")

In [None]:
# Visualize different correlation patterns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

datasets = [
    (x, y_positive, 'Strong Positive'),
    (x, y_negative, 'Strong Negative'),
    (x, y_none, 'No Correlation'),
    (x, y_nonlinear, 'Non-linear')
]

for ax, (data_x, data_y, title) in zip(axes.flat, datasets):
    ax.scatter(data_x, data_y, alpha=0.5)
    corr = np.corrcoef(data_x, data_y)[0, 1]
    ax.set_title(f'{title}\nr = {corr:.3f}')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    
    # Add trend line for linear relationships
    if title != 'Non-linear':
        z = np.polyfit(data_x, data_y, 1)
        p = np.poly1d(z)
        ax.plot(sorted(data_x), p(sorted(data_x)), "r--", alpha=0.8)

plt.tight_layout()
plt.show()

In [None]:
# Correlation with significance testing
from scipy.stats import pearsonr, spearmanr

print("Statistical Tests for Correlation:")
print("=" * 50)

# Test positive relationship
r, p_value = pearsonr(x, y_positive)
print(f"\nPositive relationship:")
print(f"  Pearson r = {r:.4f}")
print(f"  p-value = {p_value:.2e}")
print(f"  Significant: {'Yes' if p_value < 0.05 else 'No'}")

# Test no relationship
r, p_value = pearsonr(x, y_none)
print(f"\nNo relationship:")
print(f"  Pearson r = {r:.4f}")
print(f"  p-value = {p_value:.4f}")
print(f"  Significant: {'Yes' if p_value < 0.05 else 'No'}")
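
Where does the p-value from `pearsonr` come from? Under the null hypothesis of no correlation (and assuming roughly normal data), the statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) follows a t-distribution with n - 2 degrees of freedom. The sketch below, on freshly simulated data (the arrays `a` and `b` are illustrative), reproduces the two-sided p-value by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_obs = 30

# Illustrative data with a modest linear relationship
a = rng.normal(0, 1, n_obs)
b = 0.4 * a + rng.normal(0, 1, n_obs)

r, p_scipy = stats.pearsonr(a, b)

# Convert r to a t-statistic with n - 2 degrees of freedom
t_stat = r * np.sqrt(n_obs - 2) / np.sqrt(1 - r**2)

# Two-sided p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_stat), df=n_obs - 2)

print(f"r = {r:.4f}, t = {t_stat:.4f}")
print(f"p-value (pearsonr): {p_scipy:.6f}")
print(f"p-value (manual):   {p_manual:.6f}")
```

Because the p-value depends on n as well as r, a tiny correlation can be "significant" in a large sample. Always report r alongside the p-value.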

In [None]:
# Spearman correlation - for non-linear monotonic relationships
# Based on ranks, not actual values

# Create a monotonic but non-linear relationship
x_mono = np.arange(1, 101)
y_mono = np.log(x_mono) + np.random.normal(0, 0.2, 100)  # Logarithmic relationship

pearson_r, _ = pearsonr(x_mono, y_mono)
spearman_r, _ = spearmanr(x_mono, y_mono)

print(f"Logarithmic relationship (y = log(x) + noise):")
print(f"  Pearson r = {pearson_r:.4f}")
print(f"  Spearman rho = {spearman_r:.4f}")
print(f"\nSpearman is higher because it captures monotonic relationships")
print("even when they're not perfectly linear.")
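
"Based on ranks" can be checked directly: Spearman's coefficient is Pearson's r computed on the ranks of the data. A minimal sketch on freshly simulated data (the names `x_demo` and `y_demo` are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(2)

# Monotonic but non-linear relationship, similar in spirit to the cell above
x_demo = np.arange(1, 101)
y_demo = np.log(x_demo) + rng.normal(0, 0.2, 100)

# Spearman computed by SciPy
rho, _ = spearmanr(x_demo, y_demo)

# Pearson applied to the ranks (1 = smallest value, ties get average ranks)
r_on_ranks, _ = pearsonr(rankdata(x_demo), rankdata(y_demo))

print(f"spearmanr:        {rho:.6f}")
print(f"pearsonr on ranks: {r_on_ranks:.6f}")
```

Since only the ordering matters, any strictly increasing transformation of x or y (log, square root, and so on) leaves Spearman's coefficient unchanged.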

In [None]:
# Real-world example: Correlation matrix with salary data
# Let's create a more complete employee dataset

np.random.seed(42)
n_emp = 200

# Generate correlated features
years_experience = np.random.uniform(0, 20, n_emp)
education_years = np.random.uniform(12, 22, n_emp)  # 12 = high school, 22 = PhD

# Salary correlated with experience and education
base_salary = 40000
salary = (base_salary + 
          years_experience * 3000 + 
          education_years * 2000 + 
          np.random.normal(0, 8000, n_emp))

# Performance somewhat correlated with experience
performance_score = 3 + years_experience * 0.1 + np.random.normal(0, 1, n_emp)
performance_score = np.clip(performance_score, 1, 5)

# Age correlated with experience (logically)
age = 22 + years_experience + np.random.normal(0, 3, n_emp)

employee_data = pd.DataFrame({
    'age': age,
    'years_experience': years_experience,
    'education_years': education_years,
    'salary': salary,
    'performance_score': performance_score
})

# Calculate correlation matrix
corr_matrix = employee_data.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

In [None]:
# Visualize correlation matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, fmt='.2f', square=True)
plt.title('Employee Data Correlation Matrix')
plt.tight_layout()
plt.show()

In [None]:
# Important: Correlation does not imply causation!
# Example of spurious correlation

np.random.seed(42)
years = np.arange(2000, 2020)

# Two series with no causal link to each other; both simply trend upward over time
ice_cream_sales = 1000 + 50 * (years - 2000) + np.random.normal(0, 30, 20)
drowning_incidents = 100 + 5 * (years - 2000) + np.random.normal(0, 10, 20)

corr = np.corrcoef(ice_cream_sales, drowning_incidents)[0, 1]
print(f"Correlation between ice cream sales and drowning incidents: r = {corr:.3f}")
print("\nThis is a SPURIOUS correlation! Neither variable causes the other.")
print("In this simulation both simply trend upward over time. In the classic")
print("real-world version, hot summer weather drives both independently.")
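
One way to probe a suspected confounder is to remove its effect from both variables and correlate what is left. The sketch below assumes the shared driver is a linear time trend and uses freshly simulated series (`series_a`, `series_b`, and the helper `detrend` are illustrative names). It is a rough diagnostic, not a full causal analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
years_demo = np.arange(2000, 2020)
t = years_demo - 2000

# Two causally unrelated series that both rise over time
series_a = 1000 + 50 * t + rng.normal(0, 30, t.size)
series_b = 100 + 5 * t + rng.normal(0, 10, t.size)


def detrend(values, time):
    """Subtract a fitted straight-line trend and return the residuals."""
    slope, intercept = np.polyfit(time, values, 1)
    return values - (slope * time + intercept)


r_raw = np.corrcoef(series_a, series_b)[0, 1]
r_detrended = np.corrcoef(detrend(series_a, t), detrend(series_b, t))[0, 1]

print(f"Raw correlation:       r = {r_raw:.3f}")
print(f"Detrended correlation: r = {r_detrended:.3f}")
```

The raw correlation is driven almost entirely by the shared trend. Once the trend is removed, what remains is the correlation between two independent noise series, which should be much closer to zero (though with only 20 points it will not be exactly zero).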

---
# Section 8: A/B Testing Basics
---

## What is A/B Testing?

A/B testing (also called split testing) is a method of comparing two versions of something to determine which performs better. It's widely used in:

- Web design (button colors, layouts)
- Marketing (email subject lines, ads)
- Product development (features, pricing)

### The A/B Testing Process

1. **Define the goal**: What metric are you trying to improve?
2. **Create variants**: Control (A) and Treatment (B)
3. **Determine sample size**: How many observations are needed?
4. **Randomly assign users**: Ensure groups are comparable
5. **Run the experiment**: Collect data
6. **Analyze results**: Is the difference statistically significant?
7. **Make a decision**: Implement winner or iterate
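
Step 4 (random assignment) is what makes the two groups comparable. A minimal sketch, assuming a hypothetical list of 2,000 user IDs and a simple 50/50 split. Production systems often assign users by hashing their ID instead, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical user IDs
user_ids = np.arange(1, 2001)

# Shuffle once, then split down the middle
shuffled = rng.permutation(user_ids)
half = len(shuffled) // 2
control_ids = shuffled[:half]
treatment_ids = shuffled[half:]

print(f"Control group size:   {len(control_ids)}")
print(f"Treatment group size: {len(treatment_ids)}")
print(f"Users in both groups: {len(np.intersect1d(control_ids, treatment_ids))}")
```

Because assignment depends only on the random shuffle, any user characteristic (device, country, time of signup) should be balanced across groups on average.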

In [None]:
# Example: A/B Test for Website Conversion Rate
# A company wants to test if a new button color (green) increases conversions
# compared to the original (blue)

np.random.seed(42)

# Simulate data
n_control = 1000  # Users who saw blue button
n_treatment = 1000  # Users who saw green button

# True conversion rates (unknown in practice)
true_rate_control = 0.10  # 10% conversion for blue
true_rate_treatment = 0.12  # 12% conversion for green

# Simulate conversions (1 = converted, 0 = didn't convert)
control_conversions = np.random.binomial(1, true_rate_control, n_control)
treatment_conversions = np.random.binomial(1, true_rate_treatment, n_treatment)

# Calculate observed conversion rates
control_rate = control_conversions.mean()
treatment_rate = treatment_conversions.mean()

print("A/B Test: Button Color Experiment")
print("=" * 50)
print(f"\nControl (Blue Button):")
print(f"  Sample size: {n_control}")
print(f"  Conversions: {control_conversions.sum()}")
print(f"  Conversion rate: {control_rate:.2%}")

print(f"\nTreatment (Green Button):")
print(f"  Sample size: {n_treatment}")
print(f"  Conversions: {treatment_conversions.sum()}")
print(f"  Conversion rate: {treatment_rate:.2%}")

print(f"\nAbsolute Difference: {treatment_rate - control_rate:.2%}")
print(f"Relative Lift: {((treatment_rate - control_rate) / control_rate) * 100:.1f}%")

In [None]:
# Statistical test for A/B test (proportions)
from scipy.stats import chi2_contingency, norm

# Create contingency table
contingency_table = np.array([
    [control_conversions.sum(), n_control - control_conversions.sum()],
    [treatment_conversions.sum(), n_treatment - treatment_conversions.sum()]
])

print("Contingency Table:")
print(pd.DataFrame(contingency_table, 
                   index=['Control', 'Treatment'],
                   columns=['Converted', 'Not Converted']))

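# Chi-square test of independence
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction
# by default (correction=True), which makes the p-value slightly more conservative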
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square Test Results:")
print(f"  Chi-square statistic: {chi2:.4f}")
print(f"  Degrees of freedom: {dof}")
print(f"  p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion: The difference IS statistically significant (p < {alpha})")
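    # The test is two-sided, so read the direction from the observed rates
    direction = "increase" if treatment_rate > control_rate else "decrease"
    print(f"The green button appears to {direction} conversions.")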
else:
    print(f"\nConclusion: The difference is NOT statistically significant (p >= {alpha})")
    print("We cannot conclude that the button color affects conversions.")

In [None]:
# Z-test for proportions (alternative approach)
# This is often used for A/B tests with binary outcomes

def z_test_proportions(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    
    # Pooled proportion
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    
    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
    
    # Z-statistic
    z = (p_b - p_a) / se
    
    # P-value (two-tailed); norm.sf(z) equals 1 - norm.cdf(z) but is more accurate in the tails
    p_value = 2 * norm.sf(abs(z))
    
    return z, p_value, p_a, p_b

z_stat, p_val, p_ctrl, p_treat = z_test_proportions(
    control_conversions.sum(), n_control,
    treatment_conversions.sum(), n_treatment
)

print("Z-Test for Proportions:")
print(f"  Control proportion: {p_ctrl:.4f}")
print(f"  Treatment proportion: {p_treat:.4f}")
print(f"  Z-statistic: {z_stat:.4f}")
print(f"  p-value: {p_val:.4f}")
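
The chi-square test and the pooled two-proportion z-test are closely related: for a 2x2 table, the uncorrected chi-square statistic equals z squared, so the two p-values match. SciPy's `chi2_contingency` applies Yates' continuity correction by default, which is why its p-value can come out slightly larger. The sketch below uses made-up counts (100/1000 vs. 125/1000) to show this.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Illustrative counts, not taken from the experiment above
conv_a, n_a = 100, 1000
conv_b, n_b = 125, 1000

# Pooled two-proportion z-test
p_pooled = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
z = (conv_b / n_b - conv_a / n_a) / se
p_z = 2 * norm.sf(abs(z))

# Chi-square test with and without Yates' continuity correction
table = np.array([[conv_a, n_a - conv_a],
                  [conv_b, n_b - conv_b]])
chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)

print(f"z^2 = {z**2:.4f}, chi-square (no correction) = {chi2_plain:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi-square, no correction) = {p_plain:.4f}")
print(f"p (chi-square, Yates) = {p_yates:.4f}")
```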

In [None]:
# Sample Size Calculation
# How many samples do we need to detect a given effect?

def calculate_sample_size(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.80):
    """
    Calculate required sample size for A/B test.
    
    Parameters:
    - baseline_rate: Current conversion rate (e.g., 0.10 for 10%)
    - minimum_detectable_effect: Relative change to detect (e.g., 0.20 for 20% lift)
    - alpha: Significance level (Type I error rate)
    - power: Statistical power (1 - Type II error rate)
    """
    # New rate after effect
    new_rate = baseline_rate * (1 + minimum_detectable_effect)
    
    # Z-scores
    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = norm.ppf(power)
    
    # Average of the two rates (used for the variance under H0)
    p_avg = (baseline_rate + new_rate) / 2
    
    # Sample size formula
    numerator = (z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) + 
                 z_beta * np.sqrt(baseline_rate * (1 - baseline_rate) + 
                                  new_rate * (1 - new_rate))) ** 2
    denominator = (new_rate - baseline_rate) ** 2
    
    n_per_group = numerator / denominator
    
    return int(np.ceil(n_per_group))

# Example: Current conversion rate is 10%, want to detect 20% relative lift
baseline = 0.10
mde = 0.20  # 20% relative increase (from 10% to 12%)

required_n = calculate_sample_size(baseline, mde)

print("Sample Size Calculation")
print("=" * 50)
print(f"Baseline conversion rate: {baseline:.1%}")
print(f"Minimum detectable effect: {mde:.0%} relative lift")
print(f"Target conversion rate: {baseline * (1 + mde):.1%}")
print(f"Significance level (alpha): 0.05")
print(f"Statistical power: 80%")
print(f"\nRequired sample size: {required_n:,} per group")
print(f"Total required: {2 * required_n:,} observations")
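
A sample size formula can be sanity-checked by simulation: run many simulated experiments at the planned size and count how often the test rejects H0. The sketch below assumes a 10% baseline, a 12% treatment rate, and a per-group size of about 3,841 (roughly what the formula above gives for these inputs). The helper name `simulated_power` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)


def simulated_power(n_per_group, p_control, p_treatment, alpha=0.05, n_sims=20000):
    """Estimate power of the two-sided pooled z-test by Monte Carlo simulation."""
    # Number of conversions in each simulated experiment
    conv_c = rng.binomial(n_per_group, p_control, n_sims)
    conv_t = rng.binomial(n_per_group, p_treatment, n_sims)

    # Pooled two-proportion z-statistic for every simulated experiment
    pooled = (conv_c + conv_t) / (2 * n_per_group)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    z = (conv_t - conv_c) / n_per_group / se

    # Power = share of experiments in which H0 is rejected
    z_crit = norm.ppf(1 - alpha / 2)
    return np.mean(np.abs(z) > z_crit)


power_planned = simulated_power(3841, 0.10, 0.12)
power_small = simulated_power(1000, 0.10, 0.12)

print(f"Estimated power with 3,841 per group: {power_planned:.3f}")
print(f"Estimated power with 1,000 per group: {power_small:.3f}")
```

The estimate at the planned size should land close to the 80% target. With only 1,000 users per group, the same true effect is detected far less often, which is worth keeping in mind when reading the results of a small test.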

In [None]:
# Visualize A/B test results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of conversion rates
groups = ['Control\n(Blue)', 'Treatment\n(Green)']
rates = [control_rate, treatment_rate]
colors = ['steelblue', 'forestgreen']

bars = axes[0].bar(groups, rates, color=colors, edgecolor='black')
axes[0].set_ylabel('Conversion Rate')
axes[0].set_title('Conversion Rate by Group')
axes[0].set_ylim(0, max(rates) * 1.3)

# Add value labels on bars
for bar, rate in zip(bars, rates):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                 f'{rate:.2%}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Add confidence intervals (normal-approximation "Wald" interval for a proportion)
def confidence_interval(successes, n, confidence=0.95):
    p = successes / n
    z = norm.ppf((1 + confidence) / 2)
    margin = z * np.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

ci_control = confidence_interval(control_conversions.sum(), n_control)
ci_treatment = confidence_interval(treatment_conversions.sum(), n_treatment)

# Error bars
errors = [[rates[0] - ci_control[0], rates[1] - ci_treatment[0]],
          [ci_control[1] - rates[0], ci_treatment[1] - rates[1]]]
axes[0].errorbar([0, 1], rates, yerr=errors, fmt='none', color='black', capsize=5)

# Simulated sampling distributions, treating the observed rates as if they were the true rates
simulated_control = np.random.binomial(n_control, control_rate, 10000) / n_control
simulated_treatment = np.random.binomial(n_treatment, treatment_rate, 10000) / n_treatment

axes[1].hist(simulated_control, bins=50, alpha=0.5, label='Control', color='steelblue')
axes[1].hist(simulated_treatment, bins=50, alpha=0.5, label='Treatment', color='forestgreen')
axes[1].axvline(control_rate, color='steelblue', linestyle='--', linewidth=2)
axes[1].axvline(treatment_rate, color='forestgreen', linestyle='--', linewidth=2)
axes[1].set_xlabel('Conversion Rate')
axes[1].set_ylabel('Frequency (Simulations)')
axes[1].set_title('Distribution of Possible Outcomes')
axes[1].legend()

plt.tight_layout()
plt.show()
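
Per-group intervals can overlap even when the difference is significant, so it is often more useful to put a confidence interval on the difference itself. The sketch below uses the normal approximation with an unpooled standard error. The function name `diff_confidence_interval` and the counts are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation CI for the difference in proportions (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Unpooled standard error: each group contributes its own variance
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    z = norm.ppf((1 + confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Illustrative counts: 100/1000 conversions vs. 125/1000 conversions
diff, lower, upper = diff_confidence_interval(100, 1000, 125, 1000)

print(f"Observed difference: {diff:.2%}")
print(f"95% CI for the difference: ({lower:.2%}, {upper:.2%})")
print(f"Interval contains 0: {lower <= 0 <= upper}")
```

If the interval contains 0, the data are compatible with no difference at that confidence level. The width of the interval also shows how precisely the lift has been measured.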

## Practice Exercise 8.1

**Task:** An e-commerce company ran an A/B test on their checkout page. They tested a new "simplified checkout" (Treatment) against the original (Control). Here are the results:

- **Control**: 5,000 visitors, 350 completed purchases
- **Treatment**: 5,200 visitors, 390 completed purchases

1. Calculate the conversion rate for each group
2. Calculate the relative lift (% improvement)
3. Perform a statistical test to determine if the difference is significant (alpha = 0.05)
4. What is your recommendation to the company?

In [None]:
# Your code here
control_visitors = 5000
control_purchases = 350
treatment_visitors = 5200
treatment_purchases = 390


In [None]:
# Solution 8.1
control_visitors = 5000
control_purchases = 350
treatment_visitors = 5200
treatment_purchases = 390

print("E-commerce A/B Test Analysis")
print("=" * 50)

# 1. Conversion rates
control_rate = control_purchases / control_visitors
treatment_rate = treatment_purchases / treatment_visitors

print(f"\n1. Conversion Rates:")
print(f"   Control: {control_rate:.4f} ({control_rate:.2%})")
print(f"   Treatment: {treatment_rate:.4f} ({treatment_rate:.2%})")

# 2. Relative lift
absolute_diff = treatment_rate - control_rate
relative_lift = (treatment_rate - control_rate) / control_rate * 100

print(f"\n2. Lift:")
print(f"   Absolute difference: {absolute_diff:.4f} ({absolute_diff:.2%})")
print(f"   Relative lift: {relative_lift:.2f}%")

# 3. Statistical test
from scipy.stats import chi2_contingency  # re-imported so this cell runs on its own
contingency_table = np.array([
    [control_purchases, control_visitors - control_purchases],
    [treatment_purchases, treatment_visitors - treatment_purchases]
])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\n3. Statistical Test (Chi-square):")
print(f"   Chi-square statistic: {chi2:.4f}")
print(f"   p-value: {p_value:.4f}")
print(f"   Significant at alpha=0.05: {'Yes' if p_value < 0.05 else 'No'}")

# 4. Recommendation
print(f"\n4. Recommendation:")
if p_value < 0.05:
    print(f"   The simplified checkout shows a statistically significant")
    print(f"   improvement of {relative_lift:.1f}% in conversion rate.")
    print(f"   RECOMMEND: Implement the new checkout for all users.")
else:
    print(f"   While there appears to be a {relative_lift:.1f}% improvement,")
    print(f"   this difference is NOT statistically significant (p={p_value:.3f}).")
    print(f"   RECOMMEND: Continue testing with more data, or consider other factors.")

---
# Module Summary

## Key Takeaways

### Descriptive Statistics
- Descriptive statistics summarize data through measures of central tendency, dispersion, and shape
- Use pandas `.describe()` for a quick overview of numerical data

### Central Tendency
- **Mean**: Arithmetic average, sensitive to outliers
- **Median**: Middle value, robust to outliers
- **Mode**: Most frequent value, useful for categorical data

### Dispersion
- **Range**: Max - Min (simplest measure)
- **Variance/Standard Deviation**: Measure spread around the mean
- **IQR**: Range of middle 50%, robust to outliers
- **Coefficient of Variation**: Relative variability (std/mean)

### Probability
- P(A) is between 0 and 1
- P(not A) = 1 - P(A)
- Independent events: P(A and B) = P(A) x P(B)
- Conditional probability: P(A|B) = P(A and B) / P(B), defined when P(B) > 0

### Probability Distributions
- **Normal**: Bell curve, described by its mean and standard deviation
- **Binomial**: Number of successes in n independent trials with the same success probability
- **Poisson**: Count of events in a fixed interval of time or space

### Hypothesis Testing
- State null (H0) and alternative (H1) hypotheses
- Calculate test statistic and p-value
- Reject H0 if p-value < alpha (typically 0.05)
- Be aware of Type I and Type II errors

### Correlation
- Pearson's r measures linear relationship (-1 to +1)
- Spearman's rho measures monotonic relationship
- Correlation does NOT imply causation!

### A/B Testing
- Randomly assign users to control and treatment groups
- Calculate conversion rates and statistical significance
- Consider sample size requirements for detecting effects

## Next Module
The next module, Introduction to Machine Learning, shows how to build predictive models on the statistical foundation established here.

## Additional Practice
For extra practice, try these challenges:
1. Analyze the distribution of a real dataset (e.g., Kaggle) and report all descriptive statistics
2. Design an A/B test for a feature in an app you use frequently
3. Find examples of spurious correlations and explain the confounding variables