# Point Estimation of Population Parameters

Author & Instructor: Diana NURBAKOVA, PhD.

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

## Learning Objectives

By the end of this lesson, you will be able to:
- Define and distinguish point estimators
- Evaluate estimator properties (bias, variance, MSE)
- Derive MLE for standard distributions 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_style("whitegrid")
#sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
from ipywidgets import interact, IntSlider, FloatSlider, Dropdown
import ipywidgets as widgets

In [None]:
plt.rcParams['font.family'] = ['DejaVu Sans', 'Segoe UI Emoji']

In [None]:
import sys
from pathlib import Path

# Add the "resources" directory to the path
project_root = Path().resolve().parent
resources_path = project_root / 'resources'
sys.path.insert(0, str(resources_path))

In [None]:
from estimators import (generate_hook_data, explore_sampling_distribution, demonstrate_estimator_concept, visualize_bias_variance_tradeoff, plot_heads_tails, 
                        demonstrate_likelihood_concept, plot_binomial, demonstrate_prior_importance, compare_mle_map)

<div class="alert alert-info">
<h4>🎯 Today's Challenge: The Neural Network Initialization Mystery</h4>

You're training a neural network for image classification. A critical hyperparameter is the weight initialization scale $\sigma$.

- Too small → vanishing gradients, slow learning
- Too large → exploding gradients, instability
- Just right → optimal convergence

You run 50 training experiments and record the final validation accuracy for different $\sigma$ values. Your best result shows $\hat{\sigma} = 0.15$ gave 94.2% accuracy.

**Questions**:

1. Is $\sigma = 0.15$ the "true" optimal value?
2. If you had run 500 experiments instead of 50, would you get the same estimate?
3. How do you quantify how "wrong" your estimate might be?
4. Your colleague claims $\sigma = 0.12$ is better. Who's right?

By the end of today: You'll have a mathematical framework to answer all these questions with confidence.

</div>

In [None]:
# visualisation
generate_hook_data()

> If we run 50 MORE experiments, will we get exactly $\hat{\sigma} = 0.15$ again? (Yes / No / Maybe)

## What is a Statistic?

<div class="alert alert-success">
<h4>Definition: Statistic</h4>

A **statistic** is any function of the sample data that doesn't depend on unknown parameters.

If $X_1, X_2, ..., X_n$ is a random sample, then $T(X_1, X_2, ..., X_n)$ is a statistic.

*Examples of Statistics:*
- Sample mean: $\bar{X} = (1/n)\sum_{i=1}^n X_i$
- Sample variance: $s^2 = (1/(n-1))\sum_{i=1}^n(X_i - \bar{X})^2$
- Sample median: middle value when data is sorted
- Sample maximum: $\max(X_1, ..., X_n)$
- Sample range: $\max(X_i) - \min(X_i)$

*Not statistics (depend on unknown parameters)*:
- $(\bar{X} - \mu)/\sigma$ where $\mu, \sigma$ are unknown 
- $P(X > \theta)$ where $\theta$ is unknown 
</div>
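As a small illustration of the definition, the sketch below computes the example statistics listed above from one simulated sample. The sample itself (30 draws from a normal distribution with mean 5 and standard deviation 2) and all variable names are assumptions made for this example only.

```python
import numpy as np

# Hypothetical sample: 30 draws from N(5, 2^2); values chosen for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=30)

# Each quantity below is a function of the data alone, hence a statistic
sample_stats = {
    "mean": np.mean(sample),
    "variance (n-1)": np.var(sample, ddof=1),
    "median": np.median(sample),
    "maximum": np.max(sample),
    "range": np.max(sample) - np.min(sample),
}

for name, value in sample_stats.items():
    print(f"{name:>15}: {value:.3f}")
```

None of these computations needs the true $\mu$ or $\sigma$, which is exactly what makes them statistics.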

### Two Types of Statistics for Estimation

| | Point Statistic (Point Estimator) | Interval (Range) Statistic (Interval Estimator) |
|--|----|----|
|**Definition**| A **single number** that estimates the parameter| A **range of plausible values** for the parameter|
|**Notation**| $\hat{\theta}$ (theta-hat) represents a point estimate of $\theta$| interval built from the point estimate and the margin of error, e.g. $\hat{\mu} \pm E$ where $E$ is the margin of error| 
|**Characteristics:**|- Simple, easy to interpret</br>- No information about uncertainty</br>- "Best guess" at the true value|- Provides range of uncertainty</br>- Comes with confidence level (e.g., 95%)</br>- More informative than point estimate</br>- Accounts for sampling variability|
|**Interpretation**||*"We are 95% confident that the true parameter lies in this interval"*| 
|**Examples**|- Estimating population mean $\mu$: use $\bar{X} = 94.3$</br>- Estimating population variance $\sigma^2$: use $s^2 = 15.7$</br>- Estimating success probability $p$: use $\hat{p} = 0.65$|- Estimating $\mu$: [92.1, 96.5] (95% confidence interval)</br>- Estimating $\sigma^2$: [12.3, 21.4]</br>- Estimating $p$: [0.58, 0.72]|
| **Use in ML**|- Model parameters (weights, biases)</br>- Performance metrics (accuracy = 0.87)</br>- Quick decision-making|- Model comparison: "Is model A really better than B?"</br>- Performance ranges: accuracy $\in$ [0.84, 0.90]</br>- Risk assessment</br>- Communicating uncertainty to stakeholders|


## What is an Estimator?

<div class="alert alert-success">
<h4>Definition: Estimator</h4>

An **estimator** is a rule (function) that takes data and produces an estimate of an unknown parameter.

**Notation:**
- $\theta$ = true (unknown) parameter
- $X_1, X_2, ..., X_n$ = observed data (random sample)
- $\hat{\theta} = \hat{\theta}(X_1, ..., X_n)$ = estimator (a function of the data)

**Key insight:** An estimator is itself a random variable (because it depends on random data)
</div>
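To make the "estimator is a random variable" point concrete, here is a minimal sketch: one fixed rule (the sample mean) applied to several independent samples. The function name `mean_estimator` and the "true" values used to simulate data are assumptions for this demo.

```python
import numpy as np

def mean_estimator(data):
    """Estimator: a rule mapping a sample to a single number."""
    return float(np.mean(data))

rng = np.random.default_rng(1)
true_mu, true_sigma, n = 5.0, 2.0, 30  # assumed "true" values for the demo

# Same rule, five different random samples -> five different estimates
estimates = [mean_estimator(rng.normal(true_mu, true_sigma, n)) for _ in range(5)]
print([round(e, 3) for e in estimates])
```

The rule never changes; only the data do. Each printed number is one estimate, that is, one realisation of the random variable $\hat{\theta}$.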

Let's use a darts analogy to better understand estimators.

Imagine throwing darts at a target. The **bullseye** represents the **true parameter value** $\theta$, and each **dart** represents an **estimate** $\hat{\theta}$ from a different sample.

We simulate the same estimation process many times (like throwing many darts), each time with a different random sample. This shows us the **sampling distribution** of our estimator: how the estimates vary from sample to sample.

What we see:
- Red star (★): The true parameter value (bullseye)
- Blue dots: Individual estimates from different samples (dart throws)
- Spread pattern: How the estimator behaves across many samples


In [None]:
# darts analogy
demonstrate_estimator_concept()

<div class="alert alert-primary">
<h4>ü§ñ ML Application Spotlight: Where Do We Use Estimators?</h4>

Everywhere in machine learning:

- Neural Networks: Estimating optimal weights from training data
- Clustering: Estimating cluster centers (k-means uses sample means)
- Gaussian Mixture Models: Estimating $\mu$ and $\sigma$ for each component
- Reinforcement Learning: Estimating value functions from rewards
- Generative Models: Estimating distribution parameters (GANs, VAEs)

The fundamental question: Given data, what's our best guess for the model parameters?
</div>

Let's consider the following setup.

We repeatedly:
1. Draw a random sample of size $n$ from a population
2. Compute an estimate $\hat{\theta}$ from that sample
3. Plot the estimate as a point
4. Build up a histogram of all estimates

For the sake of comparison, we will use the true values of the distribution parameters (e.g. true $\mu = 5$ and $\sigma = 2$).

As we increase the number of samples:

1. Distribution Shape Emerges
   - Initially: just a few scattered points
   - Eventually: a clear bell-shaped curve (often normal by CLT)
   - The shape tells us about the estimator's behavior

2. Center (Bias)
   - Where is the histogram centered?
   - If centered on the green line (true $\theta$): **unbiased** 
   - If shifted left or right: **biased** 

3. Spread (Variance)
   - How wide is the histogram?
   - Narrow spread: **low variance** (precise) 
   - Wide spread: **high variance** (imprecise) 

4. Convergence
   - With more samples, the histogram stabilizes
   - This demonstrates the Law of Large Numbers
   - The empirical distribution → theoretical distribution


Parameters to explore:

1. Sample Size (n):
- Small n → wider sampling distribution (more uncertainty)
- Large n → narrower sampling distribution (less uncertainty)
- This is why "more data is better"

2. Number of Samples:
- More samples → smoother histogram
- Shows the sampling distribution more clearly
- In practice, we only get ONE sample, but this helps us understand uncertainty

3. Different Estimators:
- Compare sample mean vs. sample median
- See which has lower variance
- Understand when each is preferable

In [None]:
# interactive exploration of sampling distribution
interact(explore_sampling_distribution,
         true_mean=FloatSlider(min=0, max=10, step=0.5, value=5, description='True μ:'),
         true_std=FloatSlider(min=0.5, max=5, step=0.5, value=2, description='True σ:'),
         sample_size=IntSlider(min=10, max=200, step=10, value=30, description='Sample size:'),
         n_experiments=IntSlider(min=100, max=2000, step=100, value=1000, description='# Samples:'))

Key Insights from This Exploration

1. Every estimate has uncertainty
   - A single sample gives one estimate
   - Different samples give different estimates
   - The sampling distribution quantifies this variability

2. Sample size matters enormously
   - Larger n → estimates closer to truth
   - This is the foundation of statistical inference

3. Not all estimators are equal
   - Some are centered better (less bias)
   - Some are more consistent (less variance)
   - We need both properties!

4. Statistics is about distributions, not just numbers
   - Don't just report $\hat{\theta} = 5.3$
   - Think about: "How would this estimate change with different data?"
   - The sampling distribution answers this question

<div class="alert alert-warning">
<h4>💡 Key Insight: Estimators are Random Variables</h4>

Because estimators depend on random data, they have their own probability distributions (called sampling distributions).

This means:

- Different samples → different estimates
- We need to understand the distribution of our estimator
- Properties like bias and variance characterize estimator quality

</div>


## Properties of Estimators

<div class="alert alert-success">
<h4>Definition: Three Critical Properties of Estimators</h4>

Given an estimator $\hat{\theta}$ for parameter $\theta$:

1. **Bias**:

$$Bias(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta$$

- Measures systematic error

- $\hat{\theta}$ is **unbiased** if $E[\hat{\theta}] = \theta$, i.e. $Bias(\hat{\theta}, \theta) = 0$

2. **Variance**:

$$Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$$

- Measures spread/uncertainty

- Lower variance → more consistent estimates

3. **Mean Squared Error (MSE)**:

$$MSE(\hat{\theta}, \theta) = E[(\hat{\theta} - \theta)^2]$$

Combines both: $$MSE(\hat{\theta}, \theta) = Bias^2(\hat{\theta}, \theta) + Var(\hat{\theta})$$

- Measures the dispersion of the results around the true value

- Overall measure of estimator quality

- If $MSE(\hat{\theta}, \theta) \xrightarrow[n\rightarrow \infty]{} 0$, then the estimator $\hat{\theta}$ converges to $\theta$ 

</div>
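The three definitions can be checked numerically. The sketch below uses a deliberately biased, hypothetical estimator ($0.9\,\bar{X}$ as an estimate of $\mu$) and compares the empirical MSE with the empirical Bias² + Variance. The parameter values and the estimator are assumptions chosen only to make the bias visible.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu, true_sigma = 10.0, 3.0   # assumed true parameters
n, n_trials = 25, 20_000

# Hypothetical biased estimator: shrink the sample mean towards zero
samples = rng.normal(true_mu, true_sigma, size=(n_trials, n))
estimates = 0.9 * samples.mean(axis=1)

bias = estimates.mean() - true_mu            # empirical E[theta_hat] - theta
variance = estimates.var()                   # empirical E[(theta_hat - E[theta_hat])^2]
mse = np.mean((estimates - true_mu) ** 2)    # empirical E[(theta_hat - theta)^2]

print(f"bias              = {bias:.4f}")
print(f"variance          = {variance:.4f}")
print(f"bias^2 + variance = {bias**2 + variance:.4f}")
print(f"MSE               = {mse:.4f}")
```

For this estimator, theory gives a bias of $0.9\mu - \mu = -1$ and a variance of $0.81\,\sigma^2/n \approx 0.29$, so the simulated values should land close to those numbers, and the last two printed lines should agree up to floating-point error.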

<div class="alert alert-exercise">
<h4>Calculated Example: Bias of an Estimator of the Poisson Distribution Parameter λ</h4>

1. Calculate the bias of the estimator $\hat{m}_n = \bar{X}$ of the parameter $\lambda$ of Poisson distribution $\mathcal{P}(\lambda)$, i.e. $X_i \sim \mathcal{P}(\lambda)$.
2. Calculate MSE of the estimator $\hat{m}_n = \bar{X}$

*Reminder*: $P(X = k) = e^{-\lambda}\frac{\lambda^k}{k!}$, $\mathbb{E}X = \lambda$, $Var(X) = \lambda$
</div>

<details>
<summary>Reveal solution</summary>

1. Finding Bias

$$Bias(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta$$

In our case: $$\left\{\begin{array}{ll}\theta = \lambda\\ \hat{\theta} = \bar{X}\end{array}\right.$$

Hence:

$$Bias(\bar{X}, \lambda) = E[\bar{X}] - \lambda = \bigg[\text{by def. } \bar{X} = \frac{1}{n}\sum_i^nX_i\bigg] = E\bigg[\frac{1}{n}\sum_i^nX_i\bigg] - \lambda =$$

$$= \bigg[\text{by propr. of E } E[aX + b] = aE[X] + b\bigg] = \frac{1}{n}E\bigg[\sum_i^nX_i\bigg] - \lambda = \bigg[\text{by propr. of E } E[X + Y] = E[X] + E[Y]\bigg] = \frac{1}{n}\sum_i^n E[X_i] - \lambda =$$

$$= \bigg[\text{as } X_i \sim \mathcal{P}(\lambda) \text{ and } EX = \lambda \bigg] = \frac{1}{n}\sum_i^n \lambda - \lambda = \frac{1}{n}n \lambda - \lambda = \mathbf{0}$$

As $Bias = 0$, this is an *unbiased estimator*.

2. Finding MSE

$$MSE(\hat{\theta}, \theta) = E[(\hat{\theta} - \theta)^2] = Bias^2(\hat{\theta}, \theta) + Var(\hat{\theta})$$

In our case, $Bias(\hat{\theta}, \theta) = 0$

Hence:

$$MSE(\bar{X}, \lambda) = 0 + Var(\bar{X}) = \bigg[\text{by def. } \bar{X} = \frac{1}{n}\sum_i^nX_i\bigg] = Var\bigg(\frac{1}{n}\sum_i^nX_i\bigg) = $$

$$= \bigg[\text{by propr. of Var } Var(aX + b) = a^2Var(X)\bigg] =  \bigg(\frac{1}{n}\bigg)^2 Var\bigg(\sum_i^nX_i\bigg) =$$

$$= \bigg[\text{by propr. of Var for indep. r.v.} Var(X + Y) = Var(X) + Var(Y)\bigg] =  \frac{1}{n^2} \sum_i^n Var(X_i) =$$

$$= \bigg[\text{as } X_i \sim \mathcal{P}(\lambda) \text{ and } Var(X) = \lambda \bigg] = \frac{1}{n^2} n\lambda = \frac{\lambda}{n} \xrightarrow[n\rightarrow \infty]{} 0$$

</details>
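The two results of this example (zero bias, MSE equal to $\lambda/n$) can be sanity-checked by simulation. The values of $\lambda$, $n$ and the number of trials below are assumptions picked for the check, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, n_trials = 4.0, 20, 50_000   # assumed values for the check

# Each row is one sample of size n; each row mean is one realisation of X-bar
samples = rng.poisson(lam, size=(n_trials, n))
xbar = samples.mean(axis=1)

empirical_bias = xbar.mean() - lam
empirical_mse = np.mean((xbar - lam) ** 2)

print(f"empirical bias: {empirical_bias:+.4f} (theory: 0)")
print(f"empirical MSE : {empirical_mse:.4f} (theory: lambda/n = {lam / n:.4f})")
```

The empirical values will not match the theory exactly, since they are themselves estimates, but they should be close for this many trials.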

Thus, every estimator can be characterized along two independent dimensions:

|| **BIAS (Systematic Error)**| **VARIANCE (Random Error)**|
|---|----|---|
|**Question**| "On average, does the estimator hit the target?"| "How much do estimates vary from sample to sample?"|
|**Formula**| $Bias = E[\hat{\theta}] - \theta$ | $Variance = E[(\hat{\theta} - E[\hat{\theta}])^2]$|
|**Visual cue**| Where is the cluster of estimates *centered*?|How *spread out* is the cluster of estimates?|
|**Low value**| Estimates centered on true value (target)|Tight cluster of estimates (consistent)|
|**High value**| Estimates consistently shifted away from true value|Wide scatter of estimates (inconsistent)|
|**Reduces with more data?**| Not in general | Yes (typically) |

Let's get back to our darts analogy and consider four scenarios:

1. **Low Bias, Low Variance** (Excellent)
   - Darts cluster tightly around bullseye
   - Consistently accurate (low bias) and consistently precise (low variance)
   - *Example*: Using $\bar{X}$ to estimate $\mu$ for normal data

2. **Low Bias, High Variance** (Good)
   - Darts scattered but centered on bullseye
   - Unbiased on average
   - But individual throws are unreliable
   - Need more data to reduce the variance
   - Note: better than biased estimators
   - *Example*: Median with small sample size

3. **High Bias, Low Variance** (Problematic)
   - Darts cluster tightly but miss the target
   - Consistently wrong in the same direction
   - Precision without accuracy
   - More data won't help (bias doesn't decrease with $n$)
   - Note: sometimes accepted if variance reduction is dramatic
   - *Example*: Biased coin estimator that always adds 0.1

4. **High Bias, High Variance** (Worst)
   - Darts scattered AND off-target
   - Neither accurate nor precise
   - *Example*: Using a terrible estimator like "first observation + 5"

In [None]:
# bias variance 
visualize_bias_variance_tradeoff()

<div class="alert alert-warning">
<h4>💡 Key Insights: Bias and Variance</h4>

**Bias = Systematic error**
- Where is your dart cluster *centered*?
- Low bias: centered on target (bullseye)
- High bias: consistently off to one side

**Variance = Random error**
- How *spread out* are your darts?
- Low variance: tight cluster
- High variance: wide scatter

**The Goal**: Low bias AND low variance
- Hit the bullseye consistently
- This means: $E[\hat{\theta}] = \theta$ (unbiased) AND $Var(\hat{\theta})$ is small

**The Trade-off**: Sometimes we accept a little bias to get much lower variance
- Like a slightly off-center but very tight dart cluster
- This is the bias-variance trade-off in machine learning
- *Example*: Ridge regression accepts small bias for lower variance

</div>

<div class="alert alert-primary">
<h4>🤖 ML Application Spotlight: Bias-Variance in Model Selection</h4>

This is THE fundamental tradeoff in machine learning.

**Underfitting (High Bias)**:

- Model too simple
- Systematically misses patterns
- Low training AND test accuracy

**Overfitting (High Variance):**

- Model too complex
- Learns noise in training data
- High training accuracy, poor test accuracy

*Goal*: Find the sweet spot that minimizes MSE = Bias² + Variance

*Example*: Polynomial regression degree selection

- Degree 1: High bias (too simple)
- Degree 20: High variance (too complex)
- Degree 3-5: Often a reasonable middle ground (the best degree is problem-dependent)

</div>

<div class="alert alert-exercise">
<h4>Question: Mini-Exercise on Estimator Properties</h4>

Given data from $N(\mu=10, \sigma^2=3^2)$, analyze the three estimators below: sample mean, sample mean of first half of the data + 5, and sample median.

**Tasks:**

1. Run 1000 experiments with n=50 samples each
2. For each estimator, calculate:

- Empirical bias
- Empirical variance
- Empirical MSE

3. Which estimator is best? Why?

</div>

In [None]:
# Define estimators
def estimator_A(data):
    """Sample mean"""
    return np.mean(data)
    
def estimator_B(data):
    """Mean of first half + 5"""
    return np.mean(data[:len(data)//2]) + 5
    
def estimator_C(data):
    """Median"""
    return np.median(data)

In [None]:
# ANSWER
def solution_mini_exercise_estimator_properties(true_mean = 10, true_std = 3, n = 50, n_trials = 1000):
    """
    Solution for mini-exercise: Comparing Three Estimators
    """
    
    # Setup
    np.random.seed(42)
    
    # Run simulation
    estimates_A, estimates_B, estimates_C = [], [], []
    
    pass

In [None]:
# ANSWER


COMMENTS: <span style="color:red">YOUR COMMENTS HERE</span>

| When to use Median | When to use Mean |
|------|------|
| Data has outliers (robust to extreme values) | No significant outliers |
| Skewed distributions (median = 'typical' value)</br> Heavy-tailed distributions </br> When a few extreme values shouldn't influence result | Data is (approximately) normally distributed | 
| | Want most efficient estimator (lowest variance) | 
| | Need to use established statistical tests (t-test, etc.) |

Let's explore the case with outliers.

In [None]:
true_mean = 10
true_std = 3
n = 50 
n_trials = 1000

# Generate data with outliers
n_outlier_experiments = 1000
sample_size_outlier = 50
contamination_rate = 0.1  # 10% outliers
    
estimates_mean_outlier = []
estimates_median_outlier = []
    
for _ in range(n_outlier_experiments):
    # Generate mostly normal data
    sample = np.random.normal(true_mean, true_std, sample_size_outlier)
        
    # Add some outliers
    n_outliers = int(contamination_rate * sample_size_outlier)
    outlier_indices = np.random.choice(sample_size_outlier, n_outliers, replace=False)
    sample[outlier_indices] = np.random.uniform(30, 50, n_outliers)  # Extreme values
        
    estimates_mean_outlier.append(np.mean(sample))
    estimates_median_outlier.append(np.median(sample))
    
estimates_mean_outlier = np.array(estimates_mean_outlier)
estimates_median_outlier = np.array(estimates_median_outlier)
    
# Calculate MSE with outliers
mse_mean_outlier = np.mean((estimates_mean_outlier - true_mean)**2)
mse_median_outlier = np.mean((estimates_median_outlier - true_mean)**2)
    
print(f"With {contamination_rate:.0%} outliers:")
print(f"  MSE (Mean):   {mse_mean_outlier:.4f}")
print(f"  MSE (Median): {mse_median_outlier:.4f}")

if mse_median_outlier < mse_mean_outlier:
    print("Median WINS! It's robust to outliers.")
    print(f"Median reduces MSE by {(1 - mse_median_outlier/mse_mean_outlier)*100:.1f}%")
    

In [None]:
# visualisation 
fig, axes = plt.subplots(ncols=3, figsize=(16, 5))

# Plot 1: Distributions of all three estimators
ax = axes[0]

ax.hist(estimates_A, bins=40, alpha=0.6, color='steelblue', 
       edgecolor='black', density=True, label='A: Mean')
ax.hist(estimates_B, bins=40, alpha=0.6, color='red', 
       edgecolor='black', density=True, label='B: Half+5')
ax.hist(estimates_C, bins=40, alpha=0.6, color='green', 
       edgecolor='black', density=True, label='C: Median')

ax.axvline(true_mean, color='gold', linewidth=4, linestyle='--',
          label=f'True μ = {true_mean}', zorder=10)

ax.set_xlabel('Estimate', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Sampling Distributions of Three Estimators\n(No Outliers)', 
            fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Plot 2: Bias-Variance decomposition
ax = axes[1]

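# NOTE: `results` (like estimates_A/B/C used above) is expected to come from
# your mini-exercise solution: a dict mapping each estimator name to
# {'bias': ..., 'variance': ..., 'mse': ...}
names = list(results.keys())
biases_sq = [results[name]['bias']**2 for name in names]
variances = [results[name]['variance'] for name in names]

x = np.arange(len(names))
width = 0.35

bars1 = ax.bar(x, biases_sq, width, label='Bias²', color='red', alpha=0.7)
bars2 = ax.bar(x, variances, width, bottom=biases_sq, label='Variance', 
               color='blue', alpha=0.7)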

# Add MSE values on top
for i, name in enumerate(names):
    mse = results[name]['mse']
    ax.text(i, mse + 0.1, f'MSE={mse:.2f}', ha='center', fontsize=10, fontweight='bold')

ax.set_ylabel('Value', fontsize=12)
ax.set_title('Bias-Variance Decomposition: MSE = Bias² + Variance', 
            fontsize=13, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(['Mean', 'Half+5', 'Median'], fontsize=11)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')

# Plot 3: With outliers comparison
ax = axes[2]

ax.hist(estimates_mean_outlier, bins=40, alpha=0.6, color='steelblue',
       edgecolor='black', density=True, label='Mean (affected by outliers)')
ax.hist(estimates_median_outlier, bins=40, alpha=0.6, color='green',
       edgecolor='black', density=True, label='Median (robust)')

ax.axvline(true_mean, color='gold', linewidth=4, linestyle='--',
          label=f'True μ = {true_mean}', zorder=10)

ax.set_xlabel('Estimate', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Effect of Outliers: Median vs Mean\n(10% contamination)', 
            fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Is My Variance High?

> How to evaluate if variance is high? </br>

Note that variance by itself is just a number (e.g. $Var(\hat{\theta}) = 0.1$). To assess whether it is high or low, we need more context, e.g.:
- If $\theta \approx 1000$, then variance of 0.1 is tiny (very precise)
- If $\theta \approx 0.01$, then variance of 0.1 is huge (very imprecise)

1. **Method 1: Coefficient of Variation**

The most common approach is to compare variance to the magnitude of what is estimated.

<div class="alert alert-success">
<h4>Definition: Coefficient of Variation</h4>

**Coefficient of Variation** (CV) is a standardized measure of the dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation $\sigma$ to the absolute value of the mean, $|\mu|$. It shows the extent of variability relative to the mean of the population.

$$CV(X) = \frac{\sigma}{|\mu|} = \frac{\sqrt{Var(X)}}{|E[X]|}$$

**Interpretation**:

- CV < 0.1 (10%): Low variance, very precise
- CV ≈ 0.1 - 0.3 (10-30%): Moderate variance, acceptable
- CV > 0.3 (30%): High variance, imprecise 
- CV > 1 (100%): Very high variance, unreliable

</div>

In [None]:
# coefficient of variation
estimates = np.array([98, 102, 95, 103, 97, 101])
mean_estimate = np.mean(estimates)  # ≈ 99.33
std_estimate = np.std(estimates, ddof=1)  # ≈ 3.14

CV = std_estimate / mean_estimate
print(f"CV = {CV:.3f} = {CV*100:.1f}%")

2. **Method 2: Relative to Bias (MSE Context)**

**Question**: Is variance large compared to bias?

**Rule**: 
- If Variance >> Bias²: High variance problem (need more data)
- If Bias² >> Variance: High bias problem (need better model)
- If Bias² ≈ Variance: Balanced

In [None]:
# example
true_param = 10
estimates = np.array([15.1, 15.3, 14.9, 15.2, 15.0])

mean_est = np.mean(estimates)  # ≈ 15.1
bias = mean_est - true_param   # ≈ 5.1
variance = np.var(estimates, ddof=1)  # ≈ 0.025

print(f"Bias² = {bias**2:.3f}")      # ≈ 26.01
print(f"Variance = {variance:.3f}")  # ≈ 0.025

if variance < bias**2:
    print("Bias² >> Variance → HIGH BIAS problem")
    print("Variance is actually LOW relative to bias")
elif variance > bias**2:
    print("Bias² << Variance → HIGH VARIANCE problem (need more data)")

3. **Method 3: Relative to Sample Size**

**Expected behavior**: $Var(\hat{\theta})$ should decrease as $n$ increases

For most estimators:
$$Var(\hat{\theta}) \propto 1/n$$

**If variance doesn't decrease with n**: Something is wrong!

In [None]:
# example
# Simulate variance at different sample sizes
sample_sizes = [10, 50, 100, 500, 1000]
variances = []

for n in sample_sizes:
    estimates = []
    for _ in range(1000):
        sample = np.random.normal(10, 3, n)
        estimates.append(np.mean(sample))
    variances.append(np.var(estimates))

# Check if variance decreases as 1/n:
# if so, n×Var should stay roughly constant (close to σ² = 9 for the sample mean here)
for n, var in zip(sample_sizes, variances):
    print(f"n={n:4d}: Var={var:.4f}, n×Var={n*var:.2f}")

4. **Method 4: Relative to Other Estimators (Efficiency)**

**Relative Efficiency**: Compare variance of two estimators for same parameter

$$\text{Efficiency of } \hat{\theta}_1 \text{ relative to } \hat{\theta}_2 = Var(\hat{\theta}_2) / Var(\hat{\theta}_1)$$

In [None]:
# example
# Compare mean vs median for normal data
n_trials = 1000
n = 50

estimates_mean = []
estimates_median = []

for _ in range(n_trials):
    sample = np.random.normal(10, 3, n)
    estimates_mean.append(np.mean(sample))
    estimates_median.append(np.median(sample))

var_mean = np.var(estimates_mean, ddof=1)
var_median = np.var(estimates_median, ddof=1)

efficiency = var_median / var_mean
print(f"Var(mean) = {var_mean:.4f}")
print(f"Var(median) = {var_median:.4f}")
print(f"Efficiency = {efficiency:.2f}")

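In the example above, the median ($\hat{\theta}_2$) has higher variance than the mean: for normal data the ratio approaches $\pi/2 \approx 1.57$ for large $n$, i.e. roughly 57% more variance (the simulated value fluctuates from run to run). Hence, the mean ($\hat{\theta}_1$) has lower variance, i.e. it is more efficient.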

5. **Method 5: Confidence Interval Width**

**Practical perspective**: Is the uncertainty acceptable for your application?

**95% Confidence Interval**: $\hat{\theta} \pm 1.96 \times \sqrt{Var(\hat{\theta})}$

$$Width = 2 \times 1.96 \times \sqrt{Var(\hat{\theta})} \approx 4 \times SD(\hat{\theta})$$

Note that depending on the application domain, the same value may be acceptable or not.

In [None]:
# example
# Estimating model accuracy
mean_accuracy = 0.85
std_accuracy = 0.05

# 95% CI
ci_lower = mean_accuracy - 1.96 * std_accuracy
ci_upper = mean_accuracy + 1.96 * std_accuracy
ci_width = ci_upper - ci_lower

print(f"Accuracy: {mean_accuracy:.2f} ± {1.96*std_accuracy:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"CI width: {ci_width:.2f}")

In the example above, the CI width is about 0.20 (20 percentage points of accuracy), which is too wide for a critical medical application but may be acceptable for exploratory analysis.

**Rules of thumb for CI width**:
- Width < 0.1: Low variance (precise) 
- Width 0.1 - 0.3: Moderate variance (acceptable)
- Width > 0.3: High variance (imprecise)

6. **Method 6: Domain Knowledge**

**Context matters**: What's acceptable depends on your problem.

||| Scenario 1 | Scenario 2|
|--|--|--|---|
|**ML Model Accuracy** | **Domain** | **Medical diagnosis** | **Movie recommendation**|
||**Accuracy**| 0.90 ± 0.10 | 0.70 ± 0.10 | 
||**CI**|[0.80, 1.00]|[0.60, 0.80]|
||**Assessment**|HIGH VARIANCE (unacceptable)|Moderate variance (acceptable)|
||**Why?**|80% vs 90% is a huge difference in medicine| Small accuracy variations are OK here|
|**Parameter Estimation**|**Domain**| **Bridge engineering** | **Marketing campaign reach** |
||**Parameter**|Estimated load capacity|Estimated viewers|
||**Value**|1000 tons ± 200 tons|1M ± 200K|
||**CV**|20%|20%|
||**Assessment**|HIGH VARIANCE (unacceptable)|Low variance (acceptable)|
||**Why?**|Safety critical, need precision|Rough estimates are sufficient|



<div class="alert alert-warning">
<h4>💡 Key Insights: Variance Assessment: Decision Framework</h4>

**Step 1**: Calculate $CV = SD(\hat{\theta}) / |E[\hat{\theta}]|$
- CV < 10%: Low variance 
- CV 10-30%: Moderate variance
- CV > 30%: High variance 

**Step 2**: Check MSE decomposition
- If Variance >> Bias²: Variance is the problem
- If Bias² >> Variance: Bias is the problem

**Step 3**: Compare to other estimators
- Is there a lower-variance alternative?
- What's the efficiency loss?

**Step 4**: Consider practical implications
- Is the CI width acceptable?
- Does it meet your application requirements?

**Step 5**: Use domain knowledge
- What precision does your problem actually need?
- What are the consequences of uncertainty?

</div>

## Parameter Estimation Task

<div class="alert alert-primary">
<h4>Problem Statement: Parameter Estimation Task</h4>

Our observed data: $X_1, X_2, ..., X_n$

We believe they come from a distribution with parameter $\theta$ (e.g., $N(\theta, 1)$)

**Question**: Which value of $\theta$ makes our observed data "most likely"?

</div>

## Method of Moments (MoM)

<div class="alert alert-success">
<h4>Definition: Method of Moments (MoM)</h4>

**Match sample moments to population moments**, then solve for parameters.

**Population moments**: $E[X]$, $E[X^2]$, $E[X^3]$, ... (depend on unknown parameters $\theta$)

**Sample moments**: $\bar{X}$, $(1/n)\sum_{i=1}^n X_i^2$, $(1/n)\sum_{i=1}^n X_i^3$, ... (computed from data)

**Method**: Set sample moments equal to population moments, solve for $\theta$.

<h5> Algorithm</h5>

For a distribution with $k$ parameters $\theta = (\theta_1, ..., \theta_k)$:

1. Write first $k$ population moments in terms of $\theta$:
   - $\mu_1 = E[X] = g_1(\theta)$
   - $\mu_2 = E[X^2] = g_2(\theta)$
   - ...
   - $\mu_k = E[X^k] = g_k(\theta)$

2. Compute corresponding sample moments:
   - $m_1 = \bar{X} = (1/n)\sum_{i=1}^n X_i$
   - $m_2 = (1/n)\sum_{i=1}^n X_i^2$
   - ...
   - $m_k = (1/n)\sum_{i=1}^n X_i^k$

3. Solve the system of equations:
   - $m_1 = g_1(\hat{\theta})$
   - $m_2 = g_2(\hat{\theta})$
   - ...
   - $m_k = g_k(\hat{\theta})$

</div>

<div class="alert alert-example">
<h4>Calculated Example: MoM for Normal Distribution, N(μ, σ²)</h4>

Normal distribution $\mathcal{N}(\mu, \sigma^2)$ has two parameters $\mu, \sigma^2$ → need two moments

**Population moments:**
- $E[X] = \mu$
- $E[X^2] = \sigma^2 + \mu^2$

**Sample moments:**
- $m_1 = \bar{X}$
- $m_2 = (1/n)\sum_{i=1}^n X_i^2$

**Method of Moments equations:**
- $\hat{\mu} = \bar{X}$
- $\hat{\sigma}^2 + \hat{\mu}^2 = (1/n)\sum_{i=1}^n X_i^2$

**Solving:**
- $\hat{\mu}_{MoM} = \bar{X}$
- $\hat{\sigma}^2_{MoM} = (1/n)\sum_{i=1}^n X_i^2 - \bar{X}^2 = (1/n)\sum_{i=1}^n(X_i - \bar{X})^2$

**Note**: This gives the *biased* variance estimator (divides by $n$, not $n-1$)


</div>
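The closed-form result above translates directly into code. This is a minimal sketch on simulated data. The true values ($\mu = 5$, $\sigma^2 = 4$), the sample size and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)   # assumed true mu = 5, sigma^2 = 4

m1 = np.mean(x)        # first sample moment
m2 = np.mean(x ** 2)   # second sample moment

mu_mom = m1
sigma2_mom = m2 - m1 ** 2   # solves sigma^2 + mu^2 = m2 for sigma^2

print(f"mu_MoM      = {mu_mom:.3f}")
print(f"sigma^2_MoM = {sigma2_mom:.3f}")
print(f"np.var(x)   = {np.var(x):.3f}  (ddof=0, the same quantity)")
```

`np.var` with its default `ddof=0` divides by $n$, so it coincides with the MoM variance estimate up to floating-point error.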

<div class="alert alert-exercise">
<h4>Calculated Example: MoM for Exponential Distribution, Exp(λ)</h4>

One parameter ($\lambda$) → need one moment

**Population moment:**
- $E[X] = 1/\lambda$

**Sample moment:**
- $m_1 = \bar{X}$

**Method of Moments equation:**
- $1/\hat{\lambda} = \bar{X}$

**Solving:**
- $\hat{\lambda}_{MoM} = 1/\bar{X}$

</div>
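The same recipe for the exponential case, sketched on simulated data. The rate $\lambda = 2$ and the sample size are assumed values. Note that NumPy parameterises the exponential distribution by its scale $1/\lambda$ rather than by the rate.

```python
import numpy as np

rng = np.random.default_rng(5)
true_lambda = 2.0                                          # assumed rate
x = rng.exponential(scale=1.0 / true_lambda, size=5_000)   # NumPy uses scale = 1/lambda

# MoM: set the sample mean equal to E[X] = 1/lambda and solve for lambda
lambda_mom = 1.0 / np.mean(x)
print(f"lambda_MoM = {lambda_mom:.3f} (true value: {true_lambda})")
```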

When to Use Method of Moments

**Advantages:**
- Simple to compute (just solve equations)
- No optimization needed
- Good starting values for numerical MLE
- Works when MLE is intractable

**Disadvantages:**
- Less efficient than MLE
- May give biased estimates
- Ignores likelihood structure
- Can give invalid estimates (e.g., negative variance)

**Practical use:**
- Quick initial estimates
- Starting point for iterative MLE
- When MLE is computationally expensive
- Sanity check for MLE results


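Let $X_1, X_2, ..., X_n$ be independent realisations of a r.v. $X$ with $\mathbb{E}X = \mu$ and $Var(X) = \sigma^2$. 

We define the moment estimators as follows: 
$$\begin{array}{l}\hat{\mu} = \bar{X}\\\hat{\sigma}^2 = s^2\end{array}$$

where, in this section, $s^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2$ denotes the moment estimator (dividing by $n$, not $n-1$).

> Is their quality satisfactory?</br>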

**Part I: Quality of the estimator** $\hat{\mu} = \bar{X}$

1. Let's calculate bias of the estimator $\hat{\mu} = \bar{X}$

$$Bias(\hat{\theta},\theta) = \mathbb{E}[\hat{\theta}] - \theta = Bias(\hat{\mu}, \mu) = \mathbb{E}[\hat{\mu}] - \mu = $$

$$= \mathbb{E}[\bar{X}] - \mu = \mathbb{E}\bigg[\frac{1}{n} \sum_{i=1}^n X_i\bigg] - \mu = [\text{by propr. of E, } E[aX + b] = aE[X] + b] = \frac{1}{n} \mathbb{E}\bigg[\sum_{i=1}^n X_i\bigg] - \mu $$

$$= [\text{by propr. of E, } E[X + Y] = E[X] + E[Y]] = \frac{1}{n} \sum_{i=1}^n E[X_i] - \mu = \frac{1}{n} n \mu - \mu = \mathbf{0}$$

So, the estimator $\hat{\mu} = \bar{X}$ is **unbiased**.

2. Let's calculate the variance of the estimator $\hat{\mu} = \bar{X}$

$$Var(\hat{\mu}) = Var(\bar{X}) = Var\bigg(1/n \sum_{i=1}^n X_i\bigg) = [\text{by propr. of Var, } Var(aX + b) = a^2Var(X)] = \frac{1}{n^2}Var\bigg(\sum_{i=1}^n X_i\bigg) = $$
$$= [\text{by propr. of Var of indep. X and Y, } Var(X + Y) = Var(X) + Var(Y)] = \frac{1}{n^2}\sum_{i=1}^n Var(X_i) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n} \xrightarrow[n\rightarrow \infty]{} 0$$

3. Calculate MSE of the estimator $\hat{\mu} = \bar{X}$

$$MSE(\hat{\mu}, \mu) = Bias(\hat{\mu}, \mu)^2 + Var(\hat{\mu}) = \frac{\sigma^2}{n} \xrightarrow[n\rightarrow \infty]{} 0$$

So, it is a convergent estimator in the mean square sense.
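
The two results of Part I (zero bias and variance $\sigma^2/n$) can be checked by simulation. This is a minimal sketch assuming normally distributed data. The values of `mu`, `sigma`, `n` and the number of replications are arbitrary choices for illustration; the derivation itself only requires independent observations with finite variance.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
mu, sigma, n, n_rep = 3.0, 2.0, 25, 100_000

# Each row is one sample of size n; each row mean is one realisation of X-bar
sample_means = rng.normal(mu, sigma, size=(n_rep, n)).mean(axis=1)

empirical_bias = sample_means.mean() - mu
empirical_var = sample_means.var()
theoretical_var = sigma**2 / n

print(f"Empirical bias of X-bar:     {empirical_bias:.4f}")
print(f"Empirical variance of X-bar: {empirical_var:.4f}")
print(f"Theoretical sigma^2 / n:     {theoretical_var:.4f}")
```

Increasing `n` should shrink the empirical variance roughly in proportion to $1/n$, which is the mean-square convergence stated above.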

**Part II: Quality of the estimator** $\hat{\sigma}^2 = s^2$

1. Let's calculate bias of the estimator $\hat{\sigma}^2 = s^2$

$$Bias(\hat{\sigma}^2, \sigma^2) = \mathbb{E}[\hat{\sigma}^2] - \sigma^2 = \mathbb{E}[s^2] - \sigma^2 =$$
$$= \mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2\bigg] - \sigma^2 = \bigg[\text{by property of E, } E[X + Y] = EX + EY\bigg] = \mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^nX_i^2\bigg] - \mathbb{E}[\bar{X}^2] - \sigma^2 = $$

$$= \bigg[\text{by property of E, } E[aX + b] = aEX + b\bigg] = \frac{1}{n}\mathbb{E}\bigg[\sum_{i=1}^nX_i^2\bigg] - \mathbb{E}[\bar{X}^2] - \sigma^2 =$$

$$= \bigg[\text{by property of E, } E[X + Y] = EX + EY\bigg] = \frac{1}{n}\sum_{i=1}^n\mathbb{E}[X_i^2] - \mathbb{E}[\bar{X}^2] - \sigma^2 = \frac{1}{n}n\mathbb{E}[X^2] - \mathbb{E}[\hat{\mu}^2] - \sigma^2 =$$

$$= \mathbb{E}[X^2] - \mathbb{E}[\hat{\mu}^2] - \sigma^2 = \bigg[\text{by def. } Var(X) = E[X^2] - (EX)^2 \Rightarrow E[X^2] = Var(X) + (EX)^2\bigg] = $$

$$= (Var(X) + (EX)^2) - (Var(\hat{\mu}) + (\mathbb{E}[\hat{\mu}])^2) - \sigma^2 = (\sigma^2 + \mu^2) - \bigg(\frac{\sigma^2}{n} + \mu^2\bigg) - \sigma^2 =$$

$$= \bigg(1 - \frac{1}{n}\bigg)\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n} \mathbf{\neq 0}$$

So, the estimator $\hat{\sigma}^2 = s^2$ is **biased**.

To make it *unbiased*, we can apply the so-called **Bessel's correction**:

$$\hat{\sigma}'^2 = s'^2 = \frac{n}{n-1}s^2$$




<div class="alert alert-success">
<h4>Definition: Bessel's Correction</h4>

The unbiased estimator of the variance is given by:

$$\hat{\sigma}'^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$

*Note:* we divide by $(n-1)$ instead of $n$ here.
</div>
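
The effect of Bessel's correction is easiest to see on small samples. The simulation below is a minimal sketch under assumed settings (normal data, $\sigma^2 = 4$, $n = 5$, an arbitrary seed). It relies on NumPy's `ddof` argument, where `ddof=0` divides by $n$ and `ddof=1` divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
sigma2, n, n_rep = 4.0, 5, 200_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n))

biased = samples.var(axis=1, ddof=0)     # divides by n
unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1 (Bessel's correction)

mean_biased = biased.mean()
mean_unbiased = unbiased.mean()

print(f"True variance:                  {sigma2:.3f}")
print(f"Average of divide-by-n:         {mean_biased:.3f}")
print(f"Theory (1 - 1/n) * sigma^2:     {(1 - 1/n) * sigma2:.3f}")
print(f"Average of divide-by-(n-1):     {mean_unbiased:.3f}")
```

The divide-by-$n$ average should sit near $(1 - 1/n)\sigma^2$, matching the bias $-\sigma^2/n$ derived above, while the corrected estimator averages close to $\sigma^2$.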

## Maximum Likelihood Estimator (MLE)

Imagine you flip a coin 10 times and get HTHHHTHHHH:

<center>
<img src="img/coins.svg" width="800px" alt="10 coins: HTHHHTHHHH">
</center>

In [None]:
# visualisation
plot_heads_tails()

Our goal is to find the parameter value that best fits a chosen distribution to the observed data, so that we can generalise beyond these particular observations. 

Here, we are dealing with a sequence of Bernoulli trials, where the parameter $p$ is the probability of getting a head in a single coin flip. Since we observed HTHHHTHHHH (8 heads and 2 tails), the probability of getting this exact sequence for a given $p$ is the joint probability:

$$P(HTHHHTHHHH|p) = p\times (1-p)\times p\times p\times p\times (1-p) \times p\times p\times p\times p = p^8\times (1-p)^2$$

More generally, we model the number of heads $X$ in $n$ flips with a Binomial distribution with parameter $p$. Its probability mass function is $P(X=k | n, p) = \binom{n}{k}p^k q^{n-k} = \frac{n!}{k!(n-k)!}p^k q^{n-k}$, where $q = 1 - p$. For $n=10$ it looks like this:


In [None]:
# visualise binomial distribution mass function for n=10
plot_binomial()

For instance, the probability that we get 8 heads out of 10 flips of a fair coin is:
$$P(X=8 | n=10, p=0.5) = \frac{10!}{8!(10-8)!}0.5^8 0.5^{10-8} = \frac{10!}{8!2!}0.5^8 0.5^{2} = \frac{9\times 10}{2}\frac{1}{2^{10}} = 9 \times 5 \times \frac{1}{2^{10}} \approx 0.044$$


So our question is then:

> Which value of $p$ makes our observed data most probable?

If we want to calculate the likelihood of $p = 0.5$, then we need to rearrange our equation by **modifying only the left side**:

$$L(p=0.5 | n=10, X=8) = \frac{10!}{8!(10-8)!}0.5^8 0.5^{10-8} = \frac{10!}{8!2!}0.5^8 0.5^{2} = \frac{9\times 10}{2}\frac{1}{2^{10}} = 9 \times 5 \times \frac{1}{2^{10}} \approx 0.044$$

The left side of the equation reads "*the likelihood of $p$ (the probability to get a head), given $n$, the number of flips we make, and $X$, the number of heads*".

**Note** that we can modify the values of $p$ in this equation, but the observed data ($n=10$ and $X=8$) remains fixed:

$$L(p=0.3 | n=10, X=8) = \frac{10!}{8!(10-8)!}0.3^8 0.7^{10-8} = \frac{10!}{8!2!}0.3^8 0.7^{2} = \frac{9\times 10}{2}\times 0.0000656\times 0.49 = 9 \times 5 \times 0.0000656\times 0.49 \approx 0.00145$$

$$L(p=0.8 | n=10, X=8) = \frac{10!}{8!(10-8)!}0.8^8 0.2^{10-8} = \frac{10!}{8!2!}0.8^8 0.2^{2} \approx 0.302$$
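
The three likelihood values can be reproduced in a few lines. This is a minimal sketch: the helper name `binomial_likelihood` is hypothetical and introduced only for this illustration.

```python
from math import comb

def binomial_likelihood(p, n=10, k=8):
    """Likelihood of p given k heads in n flips (the data stay fixed)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

candidates = [0.3, 0.5, 0.8]
likelihoods = {p: binomial_likelihood(p) for p in candidates}
best_p = max(likelihoods, key=likelihoods.get)

for p, value in likelihoods.items():
    print(f"L(p={p} | n=10, X=8) = {value:.5f}")
print(f"Most plausible of the three candidates: p = {best_p}")
```

Among these three candidates, $p = 0.8$ gives the observed data the highest probability, which is the idea that maximum likelihood turns into a general procedure.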

In [None]:
demonstrate_likelihood_concept()

<div class="alert alert-primary">
<h4>Reminder: Joint Probability of Independent R.V.</h4>

> How likely is all the data together?</br>

We need to use **joint probability**.

Let $X_1, X_2, ..., X_n$ be i.i.d.

Recall that two events $A$ and $B$ are *independent* if:

$$P(A\cap B) = P(A)\times P(B)$$

For our data, independence means:

- Observing $X_1$ doesn't affect $X_2$
- Each data point is drawn separately from the same distribution
- Knowing one observation tells us nothing about another

For *discrete variables*, the joint probability of independent r.v. is expressed as a product of individual probabilities:
$$P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n \mid \theta) = \prod_{i=1}^n P(X_i=x_i \mid \theta)$$

For *continuous variables*, the joint probability of independent r.v. is expressed as a product of probability density functions (PDFs):
$$f(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^{n} f(x_i | \theta)$$

*Example*: Suppose you flip a fair coin 3 times and get: H, H, T with $P(H)=0.5$ for each flip. Flips are independent (one doesn't affect another).

Joint probability:
$P(H, H, T) = P(H)\times P(H)\times P(T) = 0.5\times 0.5 \times 0.5 = 0.125$

</div>

**What if Data Aren't Independent?**
If observations are not independent (e.g., time series, spatial data, grouped data):

- We cannot write likelihood as a simple product
- Need more complex joint distributions
- This is why the independence assumption is so important!

*Example*: Time series

- Today's stock price depends on yesterday's
- Need to model $P(X_t | X_{t-1}, X_{t-2}, \ldots)$
- Likelihood is more complex

<div class="alert alert-success">
<h4>Definition: Maximum Likelihood Estimation</h4>

Setup:

- Data: $X_1, X_2, ..., X_n \sim f(x | \theta)$ (i.i.d. from distribution with parameter $\theta$)
- Goal: Estimate $\theta$

**Likelihood Function**, "*probability of observing the data, as a function of $\theta$*": 
$$L(\theta | X) = \prod_{i=1}^{n} f(X_i | \theta)$$

**Log-Likelihood:**
$$\ell(\theta | X) = \log L(\theta | X) = \sum_{i=1}^{n} \log f(X_i | \theta)$$
(Taking log: easier to work with, doesn't change location of maximum)

**Maximum Likelihood Estimator:**

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta | X) = \arg\max_{\theta} \ell(\theta | X)$$

How to find it:

1. Write down likelihood (or log-likelihood)
2. Take derivative with respect to $\theta$
3. Set equal to zero and solve
4. Verify it's a maximum (second derivative test)

</div>
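
When the derivative is awkward to solve by hand, the same recipe can be carried out numerically by minimising the negative log-likelihood. The block below is a minimal sketch for the coin sequence HTHHHTHHHH, using SciPy's `minimize_scalar` with the `"bounded"` method. The bounds are an assumption chosen to keep $p$ strictly inside $(0, 1)$ so that the logarithms stay finite.

```python
import numpy as np
from scipy.optimize import minimize_scalar

flips = np.array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1])  # HTHHHTHHHH, 1 = head

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood of the observed flips
    return -np.sum(flips * np.log(p) + (1 - flips) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood,
                         bounds=(1e-6, 1 - 1e-6), method="bounded")
p_numeric = result.x
p_proportion = flips.mean()  # observed proportion of heads

print(f"Numerical maximiser:  {p_numeric:.4f}")
print(f"Observed proportion:  {p_proportion:.4f}")
```

The numerical maximiser agrees with the observed proportion of heads, which is the closed-form Bernoulli result derived in the next subsection.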

### MLE for Bernoulli Distribution

<div class="alert alert-example">
<h4>Worked Example: MLE for Bernoulli Distribution</h4>

**Scenario: Email Spam Rate Estimation**

You're building a spam classifier. From your training data, you observe:

- $n = 1000$ emails
- $k = 230$ are spam

**Question**: Estimate $p$ = probability that a random email is spam using MLE

Model: Each email is spam with probability $p$ (Bernoulli distribution)
</div>

In [None]:
# Observed data
np.random.seed(42)
n = 1000
k = 230
observed_data = np.array([1]*k + [0]*(n-k))  # 1=spam, 0=ham
np.random.shuffle(observed_data)

print(f"Data (10 first observations): {observed_data[:10]}")

1. **Step 1: Write the Likelihood Function**

Each email $X_i \sim Bernoulli(p)$. Therefore, $\begin{array}{ll} P(X_i = 1) = p   & \text{(spam)} \\ P(X_i = 0) = 1-p & \text{(not spam)}\end{array}$

$$L(p|X) = \prod_{i=1}^n P(X_i| p) = \prod_{i=1}^n p^{X_i} \times (1-p)^{(1-X_i)} = [k = \text{number of successes}, n = \text{number of trials}] = p^k\times (1-p)^{(n-k)}$$

In [None]:
# for our data
print("L(p) = ∏ᵢ P(Xᵢ | p)")
print("     = ∏ᵢ p^Xᵢ (1-p)^(1-Xᵢ)")
print(f"     = p^{k} (1-p)^{n-k}")

2. **Step 2: Write the Log-Likelihood Function**

$$l(p|X) = \log L(p|X) = \log (p^k\times (1-p)^{n-k}) = \log (p^k) + \log (1-p)^{n-k} = k \log (p) + (n-k) \log(1-p)$$

In [None]:
# for our data
print("ℓ(p) = log L(p)")
print(f"     = {k} log(p) + {n-k} log(1-p)")

3. **Step 3: Take the Derivative**

$$\frac{dl}{dp} = \frac{d}{dp}(k\log(p) + (n-k)\log(1-p)) = k\times\frac{1}{p} - (n-k)\frac{1}{1-p} = \frac{k}{p} - \frac{n-k}{1-p}$$

In [None]:
# for our data
print(f"dℓ/dp = {k}/p - {n-k}/(1-p)")

4. **Step 4: Set Derivative to 0 and Solve**

$$\frac{dl}{dp} = 0$$

$$\frac{k}{p} - \frac{n-k}{1-p} = 0$$

$$\frac{k}{p} = \frac{n-k}{1-p}$$

$$k\times (1-p) = (n-k)p$$

$$k - kp = np - kp$$

$$k = np + kp - kp$$

$$k = np$$

$$\hat{p}_{MLE} = \frac{k}{n}$$

5. **Step 5: Verify It's a Maximum**

Check second derivative:

$$\frac{d^2\ell}{dp^2} = \frac{d}{dp}\bigg(\frac{k}{p} - \frac{n-k}{1-p}\bigg) = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2}$$

At $\hat{p} = \frac{k}{n}$:
$$\frac{d^2\ell}{dp^2}\bigg|_{p = k/n} = - \frac{k}{(k/n)^2} - \frac{n-k}{(1-k/n)^2} =$$
$$= - \frac{kn^2}{k^2} - \frac{(n-k)n^2}{(n-k)^2} = - \frac{n^2}{k} - \frac{n^2}{n-k} = - n^2\bigg(\frac{1}{k} + \frac{1}{n-k}\bigg) < 0$$

The second derivative is negative → this is a maximum.


In [None]:
p_mle = k / n
# Calculate each term
term1 = -k / (p_mle**2)
term2 = -(n-k) / ((1-p_mle)**2)
second_deriv = term1 + term2
print(f"d²ℓ/dp²|_(p={p_mle}) = {term1:.4f} + {term2:.4f} = {second_deriv:.4f}")
alt_form = -n**2 * (1/k + 1/(n-k))
print()
print(f"d²ℓ/dp²|_(p={k}/{n}) = -{n}² × [1/{k} + 1/{(n-k)}]")
print(f"  = -{n**2} × [{1/k:.4f} + {1/(n-k):.4f}]")
print(f"  = {alt_form:.4f}")

6. **Step 6: Compute the MLE**

Intuitive result: MLE = observed proportion


In [None]:
p_mle = k / n
print(f"MLE = {p_mle:.2f}")

In [None]:
n_tails = n - k
n_heads = k
# Visualization
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.4, wspace=0.3)

p_range = np.linspace(0.001, 0.999, 500)

# Plot 1: Likelihood
ax1 = fig.add_subplot(gs[0, 0])
likelihood = p_range**n_heads * (1-p_range)**n_tails
ax1.plot(p_range, likelihood, linewidth=3, color='steelblue')
ax1.axvline(p_mle, color='red', linewidth=2, linestyle='--',
           label=f'MLE = {p_mle:.2f}')
ax1.scatter(p_mle, p_mle**n_heads * (1-p_mle)**n_tails, s=300,
           color='red', marker='*', edgecolors='darkred', linewidths=2, zorder=5)
ax1.set_xlabel('p', fontsize=12)
ax1.set_ylabel('L(p)', fontsize=12)
ax1.set_title('Likelihood Function\n(concave down → maximum)', 
             fontsize=12, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Log-likelihood
ax2 = fig.add_subplot(gs[0, 1])
log_likelihood = n_heads * np.log(p_range) + n_tails * np.log(1-p_range)
ax2.plot(p_range, log_likelihood, linewidth=3, color='green')
ax2.axvline(p_mle, color='red', linewidth=2, linestyle='--',
           label=f'MLE = {p_mle:.2f}')
ax2.scatter(p_mle, n_heads * np.log(p_mle) + n_tails * np.log(1-p_mle),
           s=300, color='red', marker='*', edgecolors='darkred', linewidths=2, zorder=5)
ax2.set_xlabel('p', fontsize=12)
ax2.set_ylabel('ℓ(p)', fontsize=12)
ax2.set_title('Log-Likelihood\n(easier to work with)', 
             fontsize=12, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

# Plot 3: First derivative (score)
ax3 = fig.add_subplot(gs[0, 2])
first_deriv = n_heads / p_range - n_tails / (1 - p_range)
ax3.plot(p_range, first_deriv, linewidth=3, color='purple')
ax3.axhline(0, color='black', linewidth=1, linestyle='-', alpha=0.5)
ax3.axvline(p_mle, color='red', linewidth=2, linestyle='--',
           label=f'Zero at p={p_mle:.2f}')
ax3.scatter(p_mle, 0, s=300, color='red', marker='o',
           edgecolors='darkred', linewidths=2, zorder=5)
ax3.set_xlabel('p', fontsize=12)
ax3.set_ylabel("dℓ/dp (score)", fontsize=12)
ax3.set_title('First Derivative\n(zero at maximum)', 
             fontsize=12, fontweight='bold')
ax3.legend(fontsize=10)
ax3.grid(True, alpha=0.3)
ax3.set_ylim(-50, 50)


plt.tight_layout()
plt.show()

<div class="alert alert-primary">
<h4>🤖 ML Application: Logistic Regression Loss Function</h4>

**Connection:** Logistic regression is fitted by maximum likelihood under a Bernoulli model in which each observation has its own success probability $\hat{p}_i$.

Logistic regression loss (binary cross-entropy):
$$\mathcal{L} = -\sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i)]$$

This is the **negative log-likelihood** for Bernoulli data.

**Training a logistic regression = Finding MLE of the parameters**
</div>
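
The link between the two quantities can be checked on the spam example ($n = 1000$, $k = 230$). The block below is a minimal sketch restricted to a constant prediction $\hat{p}_i = p$ for every email, which is an assumption made to keep the illustration small (a real logistic regression lets $\hat{p}_i$ depend on features). The function names are hypothetical.

```python
import numpy as np

n, k = 1000, 230
y = np.array([1] * k + [0] * (n - k))  # 1 = spam, 0 = ham

def binary_cross_entropy(y_true, p_hat):
    # Summed binary cross-entropy for a constant prediction p_hat
    p_hat = np.broadcast_to(p_hat, y_true.shape)
    return -np.sum(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

def bernoulli_log_likelihood(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

p_grid = np.linspace(0.01, 0.99, 99)
bce = np.array([binary_cross_entropy(y, p) for p in p_grid])
p_best = p_grid[np.argmin(bce)]

print(f"p minimising the cross-entropy on the grid: {p_best:.2f}")
print(f"Observed proportion k/n:                    {k / n:.2f}")
```

On this grid the cross-entropy equals the negative log-likelihood at every point, so minimising one is the same as maximising the other.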

### MLE for Exponential Distribution

<div class="alert alert-example">
<h4>Worked Example: MLE for Exponential Distribution</h4>

**Scenario: Server Response Times**

You're analyzing server response times (in seconds). Theory suggests response times follow an Exponential distribution.

Recall, $Exponential(\lambda)$ has:
- PDF: $f(x|\lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$
- Mean: $1/\lambda$
- Interpretation: $\lambda$ = rate parameter (events per unit time)

Observed data: response times for 20 requests

**Question:** find $\hat{\lambda}$ using MLE.

</div>

In [None]:
# Generate example data
np.random.seed(42)
true_lambda = 0.5  # True rate (unknown in practice)
n = 20
data = np.random.exponential(1/true_lambda, n)

print(f"Data (first 10 out of {n} observations): {data[:10]}")
# Summary statistics
print(f"Sample mean: {np.mean(data):.3f} seconds")
print(f"Sample min:  {np.min(data):.3f} seconds")
print(f"Sample max:  {np.max(data):.3f} seconds")

1. **Step 1: Write the Likelihood Function**

For i.i.d. data $X_1, X_2, ..., X_n \sim Exponential(\lambda)$:

$$L(\lambda | X) = \prod_{i=1}^n f(X_i | \lambda) = \prod_{i=1}^n \lambda e^{-\lambda X_i} = \lambda^n \prod_{i=1}^n e^{-\lambda X_i} = [\text{by prop. of exp. func., } e^a\times e^b = e^{a+b}] = \lambda^n e^{-\sum_{i=1}^n \lambda X_i} = \lambda^n e^{- \lambda \sum_{i=1}^n X_i}$$

In [None]:
# likelihood for our data
sum_X_i = np.sum(data)
print(f"∑Xᵢ = {sum_X_i:.3f}")
print(f"L(λ) = λ^{n} × exp(-λ × {sum_X_i:.3f})")

2. **Step 2: Take the Log-Likelihood**

$$\ell(\lambda|X) = \log L(\lambda|X) = \log (\lambda^n e^{- \lambda \sum_{i=1}^n X_i}) = [\text{by property of log, } \log(ab) = \log a + \log b] = \log(\lambda^n) + \log(e^{- \lambda \sum_{i=1}^n X_i}) =$$

$$= n\log(\lambda) + (- \lambda \sum_{i=1}^n X_i) = n\log(\lambda) - \lambda \sum_{i=1}^n X_i$$


In [None]:
# for our data
print(f"ℓ(λ) = {n} log(λ) - λ × {np.sum(data):.3f}")

3. **Step 3: Take the Derivative**

$$\frac{d\ell}{d\lambda} = \frac{d}{d\lambda}\bigg(n\log(\lambda) - \lambda \sum_{i=1}^n X_i\bigg) = n\frac{1}{\lambda} - \sum_{i=1}^n X_i = \frac{n}{\lambda} - \sum_{i=1}^n X_i$$

In [None]:
# for our data
print(f"dℓ/dλ = {n}/λ - {np.sum(data):.3f}")

4. **Step 4: Set Derivative to 0 and Solve**

$$\frac{d\ell}{d\lambda} = 0$$

$$\frac{n}{\lambda} - \sum_{i=1}^n X_i = 0$$

$$\frac{n}{\lambda} = \sum_{i=1}^n X_i$$

$$\lambda = \frac{n}{\sum_{i=1}^n X_i}$$

Note that $\bar{X} = 1/n\sum_{i=1}^n X_i$, so:

$$\hat{\lambda}_{MLE} = \frac{1}{\bar{X}}$$

So, MLE for Exponential rate = 1 / sample mean.

5. **Step 5: Verify It's a Maximum**

Check second derivative:

$$\frac{d^2\ell}{d\lambda^2} = \frac{d}{d\lambda}\bigg(\frac{n}{\lambda} - \sum_{i=1}^n X_i\bigg) = -\frac{n}{\lambda^2}$$

Since $n > 0$ and $\lambda^2 > 0$: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$

The second derivative is negative → this is a maximum.

6. **Step 6: Compute the MLE**

In [None]:
# compute MLE for our data
sample_mean = np.mean(data)
mle_lambda = 1 / sample_mean
print(f"Sample mean: X̄ = {sample_mean:.4f}")
print(f"λ̂_MLE = 1 / X̄ = 1 / {sample_mean:.4f} = {mle_lambda:.4f}")

print()
print(f"True λ (unknown in practice): {true_lambda:.4f}")
print(f"Error: {abs(mle_lambda - true_lambda):.4f}")
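
As a cross-check, SciPy's `stats.expon.fit` also performs maximum likelihood fitting. SciPy parameterises the exponential by a location `loc` and a `scale` equal to $1/\lambda$, so the sketch below fixes the location at zero with `floc=0` and inverts the fitted scale. The data are regenerated with the same seed and calls as in the cell above so that the block is self-contained.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
true_lambda = 0.5
data = np.random.exponential(1 / true_lambda, 20)

# Closed-form MLE derived above: 1 / sample mean
rate_closed_form = 1 / data.mean()

# SciPy fit with the location fixed at 0; returns (loc, scale) with scale = 1 / rate
loc, scale = stats.expon.fit(data, floc=0)
rate_scipy = 1 / scale

print(f"Closed-form MLE: {rate_closed_form:.4f}")
print(f"SciPy fit:       {rate_scipy:.4f}")
```

Leaving `loc` free would fit a shifted exponential, which is a different model from the one derived here.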

In [None]:
# Visualization
fig, axes = plt.subplots(ncols=3, figsize=(16, 5))
    
# Plot 1: Data histogram with fitted distribution
ax = axes[0]
    
ax.hist(data, bins=15, density=True, alpha=0.7, color='steelblue',
           edgecolor='black', label='Observed data')
    
x_range = np.linspace(0, np.max(data)*1.2, 200)
    
# True distribution
ax.plot(x_range, true_lambda * np.exp(-true_lambda * x_range),
           'g-', linewidth=3, label=f'True: Exp(λ={true_lambda})')

# MLE distribution
ax.plot(x_range, mle_lambda * np.exp(-mle_lambda * x_range),
           'r--', linewidth=3, label=f'MLE: Exp(λ̂={mle_lambda:.3f})')
    
ax.set_xlabel('Response Time (seconds)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Data vs Fitted Exponential Distribution', 
                fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
    
# Plot 2: Log-likelihood function
ax = axes[1]
    
lambda_range = np.linspace(0.1, 1.5, 200)
# Vectorised log-likelihood: ℓ(λ) = n log(λ) - λ ΣXᵢ
log_likelihoods = n * np.log(lambda_range) - lambda_range * np.sum(data)

ax.plot(lambda_range, log_likelihoods, linewidth=3, color='steelblue')
ax.axvline(mle_lambda, color='red', linewidth=3, linestyle='--',
              label=f'MLE: λ̂={mle_lambda:.3f}')
ax.scatter(mle_lambda, n * np.log(mle_lambda) - mle_lambda * np.sum(data),
              s=400, color='red', marker='*', edgecolors='darkred',
              linewidths=2, zorder=5)

ax.axvline(true_lambda, color='gold', linewidth=2, linestyle=':',
              label=f'True: λ={true_lambda}', alpha=0.7)

ax.set_xlabel('λ (rate parameter)', fontsize=12)
ax.set_ylabel('Log-Likelihood ℓ(λ)', fontsize=12)
ax.set_title('Log-Likelihood Function', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
    
# Plot 3: Derivative (score function)
ax = axes[2]
    
derivatives = n / lambda_range - np.sum(data)
    
ax.plot(lambda_range, derivatives, linewidth=3, color='green',
           label='Score function: dℓ/dλ')
ax.axhline(0, color='black', linewidth=1, linestyle='-', alpha=0.5)
ax.axvline(mle_lambda, color='red', linewidth=3, linestyle='--',
              label=f'MLE (dℓ/dλ=0): λ̂={mle_lambda:.3f}')
ax.scatter(mle_lambda, 0, s=400, color='red', marker='o',
              edgecolors='darkred', linewidths=2, zorder=5)

ax.set_xlabel('λ (rate parameter)', fontsize=12)
ax.set_ylabel('dℓ/dλ', fontsize=12)
ax.set_title('First Derivative (Score Function)\n= 0 at MLE', 
                fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim(-20, 20)

plt.tight_layout()
plt.show()

<div class="alert alert-warning">
<h4> 💡 Key Insights: MLE for Exponential Distribution</h4>

1. For Exponential: MLE = 1 / sample mean 

2. $\lambda$ is the RATE (events per unit time) and $1/\lambda$ is the MEAN TIME between events

3. If $\bar{X} = 2$ seconds → $\hat{\lambda} = 0.5$ per second (one event every 2 seconds on average)

</div>

<div class="alert alert-primary">
<h4> 🤖 ML APPLICATION: Time-to-Event Modeling </h4>
    
Exponential distribution is used in ML for:
1. Waiting times:
- Time between user clicks
- Server request intervals
- Time to next purchase

2. Survival Analysis:
- Customer churn modeling
- Time to equipment failure
- Session duration

3. Deep Learning (the exponential *function* rather than the distribution):
- Exponential learning rate decay

*Example*: If $\hat{\lambda} = 0.5$ requests/second:
- Average wait: 1/0.5 = 2 seconds
- $P(\text{wait} > 5\text{ s}) = e^{-0.5 \times 5} \approx 8.2\%$
- Useful for capacity planning

</div>
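
The capacity-planning numbers follow directly from the exponential survival function $P(X > t) = e^{-\lambda t}$. The block below is a minimal sketch. The 95th-percentile line is an extra illustration that is not in the example above; it inverts the same formula.

```python
import math

rate = 0.5                                  # estimated requests per second
mean_wait = 1 / rate                        # average wait in seconds
p_wait_over_5 = math.exp(-rate * 5)         # P(wait > 5 s)
wait_p95 = -math.log(1 - 0.95) / rate       # wait exceeded only 5% of the time

print(f"Average wait:        {mean_wait:.1f} s")
print(f"P(wait > 5 s):       {p_wait_over_5:.3f}")
print(f"95th percentile:     {wait_p95:.2f} s")
```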

### MLE for Normal Distribution

<div class="alert alert-exercise">
<h4>Worked Example: MLE for Normal Distribution</h4>

Find the MLE for both parameters of Normal distribution $N(\mu, \sigma^2)$.
</div>

In [None]:
# Generate sample data
np.random.seed(42)
true_mu = 5
true_sigma = 2
n = 100
data = np.random.normal(true_mu, true_sigma, n)

print(f"Data (10 first observations): {data[:10]}")

1. **Step 1: Write the Likelihood Function**

$$L(\mu, \sigma^2|X) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg) = \bigg(\frac{1}{\sqrt{2\pi}}\bigg)^n\prod_{i=1}^n \frac{1}{\sqrt{\sigma^2}} \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg) =$$
$$= (2\pi)^{-n/2}\prod_{i=1}^n \frac{1}{\sqrt{\sigma^2}} \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg) = (2\pi)^{-n/2} \bigg(\frac{1}{\sqrt{\sigma^2}}\bigg)^n\prod_{i=1}^n \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg) =$$
$$= (2\pi)^{-n/2} (\sigma^2)^{-n/2}\prod_{i=1}^n \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg)$$

2. **Step 2: Write the Log-Likelihood**

$$\ell(\mu, \sigma^2|X) = \log L(\mu, \sigma^2|X) = \log \bigg((2\pi)^{-n/2} (\sigma^2)^{-n/2}\prod_{i=1}^n \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg)\bigg) =$$
$$= \log\big((2\pi)^{-n/2}\big) + \log\big((\sigma^2)^{-n/2}\big) + \log\bigg(\prod_{i=1}^n \exp\bigg(-\frac{(X_i-\mu)^2}{2\sigma^2}\bigg)\bigg) =$$
$$= -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) + \log\bigg(\exp\bigg(-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2\bigg)\bigg) =$$
$$= -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n(X_i-\mu)^2$$

3. **Step 3: Take Derivatives**

$$\frac{\partial l}{\partial\mu} = \frac{\partial }{\partial\mu} \bigg(-n/2 \log(2\pi) - n/2 \log(\sigma^2) - 1/(2\sigma^2) \sum_{i=1}^n(X_i-\mu)^2\bigg) = \frac{\partial }{\partial\mu} \bigg(- 1/(2\sigma^2) \sum_{i=1}^n(X_i-\mu)^2\bigg) =$$
$$=- 1/(2\sigma^2)\times(-1)\times 2\times \sum_{i=1}^n(X_i-\mu) = 1/\sigma^2 \sum_{i=1}^n(X_i-\mu)$$

$$\frac{\partial l}{\partial\sigma^2} = \frac{\partial }{\partial\sigma^2} \bigg(-n/2 \log(2\pi) - n/2 \log(\sigma^2) - 1/(2\sigma^2) \sum_{i=1}^n(X_i-\mu)^2\bigg) = -n/2\times 1/\sigma^2 + 1/(2\sigma^4) \sum_{i=1}^n(X_i-\mu)^2 = -n/(2\sigma^2) + 1/(2\sigma^4) \sum_{i=1}^n(X_i-\mu)^2$$

4. **Step 4: Set Derivatives to 0 and Solve**

1. for $\mu$
$$\frac{\partial l}{\partial\mu} =  1/\sigma^2 \sum_{i=1}^n(X_i-\mu) = 0$$
$$\sum_{i=1}^n(X_i-\mu) = 0$$
$$\sum_{i=1}^nX_i - n\mu = 0$$
$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^nX_i = \bar{X}$$

Note that we get the *sample mean*.

2. for $\sigma^2$
$$\frac{\partial l}{\partial\sigma^2} =  -n/(2\sigma^2) + 1/(2\sigma^4) \sum_{i=1}^n(X_i-\mu)^2 = 0$$
$$-n/(2\sigma^2) + 1/(2\sigma^4) \sum_{i=1}^n(X_i-\mu)^2 = 0$$
$$-n + 1/(\sigma^2)\sum_{i=1}^n(X_i-\mu)^2 = 0$$
$$1/(\sigma^2)\sum_{i=1}^n(X_i-\mu)^2 = n$$
$$\sum_{i=1}^n(X_i-\mu)^2 = n\sigma^2$$
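$$\sigma^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2$$

Substituting the solution for $\mu$, i.e. the sample mean $\bar{X}$:
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2$$

Note that we get the *sample variance* (biased: it divides by $n$).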

5. **Step 5: Verify This Is a Maximum**

Since we have two parameters, we need to compute the Hessian matrix:
$$H = \begin{pmatrix}
\frac{\partial^2 \ell}{\partial \mu^2} & \frac{\partial^2 \ell}{\partial \mu \partial \sigma^2} \\
\frac{\partial^2 \ell}{\partial \sigma^2 \partial \mu} & \frac{\partial^2 \ell}{\partial (\sigma^2)^2}
\end{pmatrix}$$

For a maximum, we need the Hessian to be negative definite at the MLE.

Computing Each Second Derivative

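1. Second derivative with respect to $\mu$, using $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(X_i - \mu) = \frac{n}{\sigma^2}(\bar{X} - \mu)$
$$\frac{\partial^2 \ell}{\partial \mu^2} = \frac{\partial}{\partial \mu}\left[\frac{n}{\sigma^2}(\bar{X} - \mu)\right]$$
$$= \frac{n}{\sigma^2} \times (-1) = -\frac{n}{\sigma^2}$$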

*Note*: This doesn't depend on $\mu$.  It's constant.


2. Second derivative with respect to $\sigma^2$
$$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{\partial}{\partial \sigma^2}\left[-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(X_i - \mu)^2\right]$$

For the first term:

$$\frac{\partial}{\partial \sigma^2}\left[-\frac{n}{2\sigma^2}\right] = -\frac{n}{2} \times \frac{\partial}{\partial \sigma^2}[(\sigma^2)^{-1}] = -\frac{n}{2} \times (-1)(\sigma^2)^{-2} = \frac{n}{2\sigma^4}$$

For the second term:

$$\frac{\partial}{\partial \sigma^2}\left[\frac{1}{2\sigma^4}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \frac{1}{2}\sum_{i=1}^{n}(X_i - \mu)^2 \times (-2)(\sigma^2)^{-3} = -\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^6}$$

Combining:

$$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^6}$$

3. Cross partial derivative
$$\frac{\partial^2 \ell}{\partial \mu \partial \sigma^2} = \frac{\partial}{\partial \sigma^2}\left[\frac{n}{\sigma^2}(\bar{X} - \mu)\right]$$
$$= n(\bar{X} - \mu) \times \frac{\partial}{\partial \sigma^2}[(\sigma^2)^{-1}]=$$
$$= n(\bar{X} - \mu) \times (-1)(\sigma^2)^{-2} = -\frac{n(\bar{X} - \mu)}{\sigma^4}$$

Evaluating at the MLE

At the MLE: $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$

1. At $\hat{\mu}, \hat{\sigma}^2$:
$$\frac{\partial^2 \ell}{\partial \mu^2}\bigg|_{\hat{\mu}, \hat{\sigma}^2} = -\frac{n}{\hat{\sigma}^2} < 0 \quad \checkmark$$

*Interpretation*: Always negative → concave down in $\mu$ direction

2. At $\hat{\mu}, \hat{\sigma}^2$:
$$\frac{\partial^2 \ell}{\partial (\sigma^2)^2}\bigg|_{\hat{\mu}, \hat{\sigma}^2} = \frac{n}{2\hat{\sigma}^4} - \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\hat{\sigma}^6}$$

Substitute $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$:

$$= \frac{n}{2\hat{\sigma}^4} - \frac{n\hat{\sigma}^2}{\hat{\sigma}^6} = \frac{n}{2\hat{\sigma}^4} - \frac{n}{\hat{\sigma}^4} = -\frac{n}{2\hat{\sigma}^4} < 0 \quad \checkmark$$

*Interpretation*: Negative at the MLE → concave down in $\sigma^2$ direction

3. Cross partial at $\hat{\mu}, \hat{\sigma}^2$:
$$\frac{\partial^2 \ell}{\partial \mu \partial \sigma^2}\bigg|_{\hat{\mu}, \hat{\sigma}^2} = -\frac{n(\bar{X} - \hat{\mu})}{\hat{\sigma}^4}$$

Since $\hat{\mu} = \bar{X}$:

$$= -\frac{n \times 0}{\hat{\sigma}^4} = 0$$

*Interpretation*: No interaction between $\mu$ and $\sigma^2$ at the MLE (parameters are orthogonal)

The Hessian Matrix at MLE
$$H\bigg|_{\hat{\mu}, \hat{\sigma}^2} = \begin{pmatrix}
-\frac{n}{\hat{\sigma}^2} & 0 \\
0 & -\frac{n}{2\hat{\sigma}^4}
\end{pmatrix}$$

This is a diagonal matrix with both diagonal entries negative!

*Verifying Negative Definiteness*

For a matrix to be negative definite, we need:

- All eigenvalues negative, OR
- Leading principal minors alternate in sign (starting negative)

*Method 1: Eigenvalues*: 

Since $H$ is diagonal, eigenvalues are just the diagonal entries:
- $\lambda_1 = -\frac{n}{\hat{\sigma}^2} < 0$ ✓
- $\lambda_2 = -\frac{n}{2\hat{\sigma}^4} < 0$ ✓

Both negative → negative definite ✓

*Method 2: Principal Minors* 

First leading principal minor:

$$M_1 = -\frac{n}{\hat{\sigma}^2} < 0 \quad \checkmark$$

Second leading principal minor (determinant):

$$M_2 = \det(H) = \left(-\frac{n}{\hat{\sigma}^2}\right) \times \left(-\frac{n}{2\hat{\sigma}^4}\right) - 0^2 = \frac{n^2}{2\hat{\sigma}^6} > 0 \quad \checkmark$$

Signs alternate: (−, +) → negative definite ✓

Conclusion
1. Both diagonal entries of Hessian are negative
2. Off-diagonal entries are zero (parameters uncorrelated at MLE)
3. Hessian is negative definite
4. Therefore: $(\hat{\mu}, \hat{\sigma}^2) = (\bar{X}, \frac{1}{n}\sum(X_i - \bar{X})^2)$ is a MAXIMUM
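
The analytic Hessian can be compared against a finite-difference approximation. This is a minimal sketch: the helper `numerical_hessian`, the step size `h`, the seed and the simulated sample ($\mu = 5$, $\sigma = 2$, $n = 100$) are all assumptions made for this check, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
x = rng.normal(5.0, 2.0, size=100)
n = x.size

mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)   # MLE of the variance (divides by n)

def log_likelihood(mu, sigma2):
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

def numerical_hessian(f, point, h=1e-3):
    """Central finite-difference Hessian of f at the given point."""
    point = np.asarray(point, dtype=float)
    d = point.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d)
            e_j = np.zeros(d)
            e_i[i] = h
            e_j[j] = h
            H[i, j] = (f(*(point + e_i + e_j)) - f(*(point + e_i - e_j))
                       - f(*(point - e_i + e_j)) + f(*(point - e_i - e_j))) / (4 * h**2)
    return H

H_numeric = numerical_hessian(log_likelihood, [mu_hat, sigma2_hat])
H_analytic = np.diag([-n / sigma2_hat, -n / (2 * sigma2_hat**2)])
eigenvalues = np.linalg.eigvalsh(H_numeric)

print("Numerical Hessian:\n", np.round(H_numeric, 4))
print("Analytic Hessian:\n", np.round(H_analytic, 4))
print("Eigenvalues:", np.round(eigenvalues, 4))
```

Both eigenvalues come out negative and the off-diagonal entries are numerically zero, in line with the conclusion above.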

6. **Step 6: Compute MLE**

In [None]:
# Analytical MLEs
mle_mu = np.mean(data)
mle_sigma = np.std(data, ddof=0)  # ddof=0 for MLE (biased estimator)

print(f"μ̂_MLE = {mle_mu:.4f}  (true: {true_mu})")
print(f"σ̂_MLE = {mle_sigma:.4f}  (true: {true_sigma})")

In [None]:
# Visualization: 3D likelihood surface
fig = plt.figure(figsize=(16, 6))

# Create grid for parameters
mu_range = np.linspace(true_mu - 2, true_mu + 2, 50)
sigma_range = np.linspace(0.5, 4, 50)
MU, SIGMA = np.meshgrid(mu_range, sigma_range)

# Compute log-likelihood for each combination
log_likelihood = np.zeros_like(MU)
for i in range(len(mu_range)):
    for j in range(len(sigma_range)):
        mu = MU[j, i]
        sigma = SIGMA[j, i]
        ll = -n/2 * np.log(2*np.pi*sigma**2) - np.sum((data - mu)**2) / (2*sigma**2)
        log_likelihood[j, i] = ll

# Plot 1: 3D surface
ax1 = fig.add_subplot(131, projection='3d')
surf = ax1.plot_surface(MU, SIGMA, log_likelihood, cmap='viridis', alpha=0.8)
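# Log-likelihood evaluated at the MLE itself (not at the grid maximum)
ll_at_mle = -n/2 * np.log(2*np.pi*mle_sigma**2) - np.sum((data - mle_mu)**2) / (2*mle_sigma**2)
ax1.scatter([mle_mu], [mle_sigma], [ll_at_mle], 
           color='red', s=300, marker='*', edgecolors='darkred', linewidths=2)
ax1.set_xlabel('μ', fontsize=11)
ax1.set_ylabel('σ', fontsize=11)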
ax1.set_zlabel('Log-Likelihood', fontsize=11)
ax1.set_title('3D Log-Likelihood Surface', fontsize=13, fontweight='bold')

# Plot 2: Contour plot
ax2 = fig.add_subplot(132)
contour = ax2.contour(MU, SIGMA, log_likelihood, levels=20, cmap='viridis')
ax2.clabel(contour, inline=True, fontsize=8)
ax2.scatter(mle_mu, mle_sigma, s=400, color='red', marker='*', 
           edgecolors='darkred', linewidths=2, label='MLE', zorder=5)
ax2.scatter(true_mu, true_sigma, s=300, color='gold', marker='o', 
           edgecolors='darkgoldenrod', linewidths=2, label='True', zorder=5)
ax2.set_xlabel('μ', fontsize=11)
ax2.set_ylabel('σ', fontsize=11)
ax2.set_title('Contour Plot', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Fitted distribution
ax3 = fig.add_subplot(133)
ax3.hist(data, bins=30, density=True, alpha=0.7, color='lightblue', 
        edgecolor='black', label='Data')

# True distribution
x_range = np.linspace(data.min(), data.max(), 200)
ax3.plot(x_range, stats.norm.pdf(x_range, true_mu, true_sigma), 
        'g-', linewidth=3, label=f'True: N({true_mu}, {true_sigma}²)')

# MLE distribution
ax3.plot(x_range, stats.norm.pdf(x_range, mle_mu, mle_sigma), 
        'r--', linewidth=3, label=f'MLE: N({mle_mu:.2f}, {mle_sigma:.2f}²)')

ax3.set_xlabel('x', fontsize=11)
ax3.set_ylabel('Density', fontsize=11)
ax3.set_title('Fitted Distribution', fontsize=13, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<div class="alert alert-warning">
<h4>💡 Key Insight: MLE vs. Unbiased Estimator</h4>

**Important note:** For normal distribution variance:
- MLE: $\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^n(X_i - \hat{\mu})^2$  → **Biased** (underestimates on average)
- Unbiased: $s'^2 = \frac{1}{n-1} \sum_{i=1}^n(X_i - \hat{\mu})^2$  → **Unbiased**

**Why the difference?**
- MLE uses $\hat{\mu} = \bar{X}$ (estimated mean), not the true $\mu$
- $\bar{X}$ is computed from the same data and minimizes $\sum_i (X_i - c)^2$ over $c$, so the squared deviations around it are systematically too small
- Dividing by $(n-1)$ instead of $n$ corrects for this (Bessel's correction)

**In practice:** For large $n$, the difference is negligible!
</div>

### MLE Properties and Computation

<div class="alert alert-success">
<h4>Properties of MLE</h4>

**Asymptotic Properties** (as $n \to \infty$):

1. **Consistency:** $\hat{\theta}_{MLE} \xrightarrow{p} \theta$ (converges in probability to the true value)

2. **Asymptotic Normality:** 
   $$\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} N(0, I(\theta)^{-1})$$
   where $I(\theta)$ is the Fisher Information of a single observation

3. **Efficiency:** Under regularity conditions, no other consistent, asymptotically normal estimator has a smaller asymptotic variance than the MLE

4. **Invariance:** If $\hat{\theta}_{MLE}$ is MLE for $\theta$, then $g(\hat{\theta}_{MLE})$ is MLE for $g(\theta)$

**Why MLE is popular:**
- Strong theoretical properties
- Often has closed-form solution
- Intuitive interpretation
- Works well in practice
</div>
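
Asymptotic normality can be illustrated with the exponential model from earlier. Its second derivative $-n/\lambda^2$ gives a per-observation Fisher information of $I(\lambda) = 1/\lambda^2$, so $\sqrt{n}(\hat{\lambda}_{MLE} - \lambda)$ should have variance close to $\lambda^2$ for large $n$. The simulation below is a minimal sketch; the rate, sample size, number of replications and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
true_rate, n, n_rep = 0.5, 200, 20_000

# Each row is one sample of size n; compute the MLE 1 / X-bar for each row
samples = rng.exponential(scale=1 / true_rate, size=(n_rep, n))
rate_mle = 1 / samples.mean(axis=1)

# Standardised estimation error, which should be roughly N(0, 1 / I(lambda))
z = np.sqrt(n) * (rate_mle - true_rate)
fisher_info = 1 / true_rate**2

print(f"Empirical mean of z:     {z.mean():.4f}")
print(f"Empirical variance of z: {z.var():.4f}")
print(f"1 / I(lambda):           {1 / fisher_info:.4f}")
```

For finite $n$ the MLE of the rate is slightly biased upwards, so the empirical mean of `z` is small but not exactly zero; the gap shrinks as `n` grows.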

<div class="alert alert-primary">
<h4>🤖 ML Application: MLE via Optimization</h4>

When no closed-form solution exists, we use numerical optimization:

**Algorithm:** Gradient Ascent on Log-Likelihood

```
1. Initialize θ₀
2. Repeat:
   θₜ₊₁ = θₜ + α ∇ℓ(θₜ)
   where ∇ℓ(θ) = gradient of log-likelihood
3. Until convergence
```

This is how many ML models are trained (gradient descent on the negative log-likelihood is the same procedure with the sign flipped):

- Neural networks: gradient descent on negative log-likelihood
- Logistic regression: same thing
- Many other models: MLE via optimization

</div>
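
The gradient ascent loop can be tried on the exponential model, where the closed-form answer $1/\bar{X}$ is available for comparison. This is a minimal sketch with assumed settings: the simulated data, the starting value, the learning rate `alpha` and the stopping tolerance are illustrative choices, and the gradient is taken of the *average* log-likelihood so that the step size does not depend on $n$.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
x = rng.exponential(scale=2.0, size=200)   # simulated data, true rate 0.5

def grad_avg_log_likelihood(lam):
    # d/d(lam) of (1/n) * [n log(lam) - lam * sum(x)] = 1/lam - mean(x)
    return 1.0 / lam - x.mean()

lam = 1.0      # theta_0: initial guess
alpha = 0.1    # learning rate
for step in range(1000):
    grad = grad_avg_log_likelihood(lam)
    lam = lam + alpha * grad        # ascent step: move uphill on the log-likelihood
    if abs(grad) < 1e-10:           # stop once the gradient is essentially zero
        break

lam_closed_form = 1.0 / x.mean()
print(f"Gradient ascent: {lam:.6f} (stopped at step {step})")
print(f"Closed form:     {lam_closed_form:.6f}")
```

A learning rate that is too large can push `lam` to a non-positive value, where the log-likelihood is undefined, so in practice the step size needs tuning or the parameter is optimised on a log scale.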


## MoM vs MLE: Comparison

| Aspect | Method of Moments | Maximum Likelihood |
|--------|------------------|-------------------|
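| **Idea** | Match sample moments to population moments | Maximize the likelihood of the observed data |
| **Complexity** | Usually simpler | Can be complex |
| **Efficiency** | Often less efficient | Asymptotically efficient (under regularity conditions) |
| **Closed form** | Usually available when the moments exist | May not exist; may require numerical optimization |
| **Optimality** | Generally not optimal | Asymptotically optimal (under regularity conditions) |
| **Bias** | Often biased | May be biased in finite samples; asymptotically unbiased |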
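
To make the efficiency row concrete, here is a small simulation on a model where the two methods give different estimators. The Laplace distribution is an example chosen for this sketch: matching the first moment gives the sample mean as the location estimate, while maximizing the likelihood gives the sample median. All numeric settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
true_loc, scale = 1.0, 1.0       # made-up Laplace parameters
n, n_trials = 101, 20_000        # odd n gives a unique sample median

samples = rng.laplace(loc=true_loc, scale=scale, size=(n_trials, n))

loc_mom = samples.mean(axis=1)           # MoM: match the first moment
loc_mle = np.median(samples, axis=1)     # MLE: minimizes sum of |x_i - loc|

mse_mom = np.mean((loc_mom - true_loc) ** 2)
mse_mle = np.mean((loc_mle - true_loc) ** 2)
print(f"MSE of MoM estimate (mean):   {mse_mom:.4f}")
print(f"MSE of MLE estimate (median): {mse_mle:.4f}")
```

For this model the MoM variance is $2b^2/n$ while the MLE variance approaches $b^2/n$, so the MLE should come out with roughly half the MSE at this sample size.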

## Maximum A Posteriori (MAP) Estimation

<h4>The Limitation of MLE</h4>

MLE asks: "Which $\theta$ makes the data most likely?" But what if we have prior knowledge about $\theta$?

*Example:*

- You're estimating spam rate from 10 emails. You observe 9 spam, 1 ham.
- MLE: $\hat{p} = 0.9$ (90% spam rate)
- But: You know from experience that typical spam rate is ~20-30%

**Question**: Shouldn't we incorporate this knowledge?

**Answer**: Yes. Use MAP estimation.

*Background*: Based on millions of emails, we know spam rate ≈ 25%

Let's consider two scenarios: 

1. *Scenario 1: Small Sample*

- Observed: 9 spam in 10 emails
- MLE: 90% spam rate
- Problem: Seems too high. Small sample might be misleading.
- Better idea: Combine data with prior knowledge...

2. *Scenario 2: Large Sample*

- Observed: 250 spam in 1,000 emails
- MLE: 25% spam rate
- Prior: ~25% expected
- Assessment: Data is strong evidence, prior less important.

In [None]:
# visualisation
demonstrate_prior_importance()

<div class="alert alert-success">
<h4>Definition: Maximum A Posteriori (MAP) Estimation</h4>

Bayes' Theorem:
$$P(\theta | X) = \frac{P(X | \theta) P(\theta)}{P(X)} \propto P(X | \theta) P(\theta)$$

Where:

- $P(\theta|X)$ = posterior: probability of $\theta$ given data
- $P(X|\theta)$ = likelihood: probability of data given $\theta$
- $P(\theta)$ = prior: our belief about $\theta$ before seeing data
- $P(X)$ = evidence: normalizing constant (doesn't depend on $\theta$)

**MAP Estimator:**

$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta | X) = \arg\max_{\theta} [P(X | \theta) P(\theta)]$$

Or equivalently (taking logs):

$$\hat{\theta}_{MAP} = \arg\max_{\theta} [\log P(X | \theta) + \log P(\theta)] = \arg\max_{\theta} [\ell(\theta) + \log P(\theta)]$$

*Interpretation:*

- MLE: Maximize likelihood only
- MAP: Maximize likelihood + prior
- MAP incorporates prior knowledge

</div>
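
The MAP formula can be worked through for the spam example. The sketch below assumes a Bernoulli likelihood with a Beta$(a, b)$ prior, for which the posterior is Beta$(a + k, b + n - k)$ and its mode gives the MAP estimate in closed form. The prior Beta(6, 16) is a hypothetical choice made here so that the prior mode is 0.25; the function names are illustrative and are not part of the course's `estimators` module.

```python
def bernoulli_mle(k, n):
    """MLE of a Bernoulli rate: the observed fraction of successes."""
    return k / n

def bernoulli_map(k, n, a, b):
    """MAP under a Beta(a, b) prior: mode of the Beta(a + k, b + n - k) posterior.

    Valid when a + k > 1 and b + n - k > 1.
    """
    return (k + a - 1) / (n + a + b - 2)

a, b = 6, 16    # hypothetical prior; its mode is (a - 1) / (a + b - 2) = 0.25

for k, n in [(9, 10), (250, 1000)]:
    mle = bernoulli_mle(k, n)
    map_est = bernoulli_map(k, n, a, b)
    print(f"n={n:5d}  MLE={mle:.3f}  MAP={map_est:.3f}")
```

Under this prior, the small-sample MAP is $14/30 \approx 0.467$, pulled well below the MLE of 0.9, while for the large sample the MAP is $255/1020 = 0.25$, identical to the MLE.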

In [None]:
# MLE vs MAP
compare_mle_map()

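**Key Observation:** 
As $n \rightarrow \infty$, $\hat{\theta}_{MAP} \rightarrow \hat{\theta}_{MLE}$: with large samples the data dominate the prior (provided the prior assigns nonzero density to the true value).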

|MLE|MAP|
|---|---|
| ✓ Uses only data | ✓ Incorporates prior knowledge |
| ✓ No assumptions beyond model | ✓ Regularizes estimates |
| ✗ Ignores prior knowledge | ✓ Better with small samples |
| ✗ Can overfit with small samples | ✗ Requires choosing prior |
|  | ✗ Can be biased if prior is wrong |



<div class="alert alert-warning">
<h4>💡 Key Insight: When to Use MAP vs MLE?</h4>

Use MLE when:

- You have lots of data
- No strong prior knowledge
- Want purely data-driven estimates
- Interpretability is critical

Use MAP when:

- Limited data (prior helps regularize)
- Strong prior knowledge exists
- Want to incorporate domain expertise
- Overfitting is a concern

In practice: Many regularized ML methods are MAP estimators under a particular prior.
</div>

<div class="alert alert-primary">
<h4>🤖 ML Connection: MAP = Regularization</h4>

**Regularization in ML is just MAP estimation with specific priors**

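1. Ridge Regression (L2 regularization):
$$\min_w ||y - Xw||^2 + \lambda||w||^2$$

This is equivalent to MAP with a Gaussian likelihood $y \sim N(Xw, \sigma_\varepsilon^2 I)$ and a Gaussian prior on the weights:
$$w \sim N(0, \sigma^2 I)$$
where $\lambda = \sigma_\varepsilon^2/\sigma^2$ (the ratio of noise variance to prior variance)

2. Lasso Regression (L1 regularization):
$$\min_w ||y - Xw||^2 + \lambda||w||_1$$
This is equivalent to MAP with the same Gaussian likelihood and a Laplace prior on the weights:

$$w \sim \text{Laplace}(0, b)$$
where $\lambda = 2\sigma_\varepsilon^2/b$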

What this means:

- Regularization = imposing prior belief that weights should be small
- $\lambda$ = strength of prior belief
- Different regularizations = different prior distributions

</div>
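
The ridge/MAP equivalence can be verified numerically. The sketch below assumes a Gaussian likelihood with noise variance `noise_var` and a zero-mean Gaussian prior with variance `prior_var` (both made-up values). It compares the closed-form ridge solution for $\lambda$ = `noise_var / prior_var` with a numerical minimizer of the negative log-posterior found by `scipy.optimize.minimize`.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, d = 50, 3
noise_var, prior_var = 0.5, 2.0             # made-up variances for the demo
X = rng.normal(size=(n, d))
w_true = rng.normal(scale=np.sqrt(prior_var), size=d)
y = X @ w_true + rng.normal(scale=np.sqrt(noise_var), size=n)

# Ridge solution in closed form with lambda = noise_var / prior_var
lam = noise_var / prior_var
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior(w):
    # -log p(y | w) - log p(w), dropping terms that do not depend on w
    nll = np.sum((y - X @ w) ** 2) / (2 * noise_var)
    neg_log_prior = np.sum(w ** 2) / (2 * prior_var)
    return nll + neg_log_prior

w_map = minimize(neg_log_posterior, x0=np.zeros(d)).x    # numerical MAP

print("ridge:", w_ridge.round(4))
print("MAP:  ", w_map.round(4))
```

The two weight vectors should agree up to optimizer tolerance, because multiplying the negative log-posterior by $2\sigma_\varepsilon^2$ gives exactly the ridge objective.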

<div class="alert alert-danger">
<h4>⚠️ Common Mistake: Forgetting to Normalize</h4>

**Problem:** Prior strength depends on data scale!

**Example:**
- Features in meters: λ=1 might be good
- Same features in millimeters: λ=1 is now way too weak!

**Solution:** Always normalize/standardize features before regularization:

```python
from sklearn.preprocessing import StandardScaler

# X: feature matrix of shape (n_samples, n_features)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column: zero mean, unit variance
```

This ensures the regularization strength is interpretable and consistent across features and units.
</div>
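
The unit dependence described above can be reproduced with a one-feature ridge fit. This is a sketch on synthetic data: the slope, noise level, and $\lambda$ are assumptions, and `ridge_fit` is a hypothetical helper implementing the closed-form solution without an intercept.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x_m = rng.normal(size=(n, 1))                      # feature in meters
y = 3.0 * x_m[:, 0] + rng.normal(scale=0.5, size=n)
x_mm = 1000.0 * x_m                                # same feature in millimeters

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without an intercept (illustrative helper)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

lam = 100.0
ols_slope = ridge_fit(x_m, y, 0.0)[0]              # no regularization
slope_m = ridge_fit(x_m, y, lam)[0]                # y per meter
slope_mm = ridge_fit(x_mm, y, lam)[0] * 1000.0     # converted back to y per meter

# Standardizing removes the unit: both versions become the same column
z_m = (x_m - x_m.mean()) / x_m.std()
z_mm = (x_mm - x_mm.mean()) / x_mm.std()

print(f"OLS slope:                  {ols_slope:.3f}")
print(f"ridge slope, meters:        {slope_m:.3f}")
print(f"ridge slope, millimeters:   {slope_mm:.3f}")
print("standardized columns equal:", np.allclose(z_m, z_mm))
```

With the feature in meters the same $\lambda$ shrinks the slope noticeably, while in millimeters it has almost no effect and the fit is close to ordinary least squares. After standardization the two inputs are identical, so any $\lambda$ acts on them in the same way.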

## Return to Opening Challenge

Recall: We had 50 experiments with $\hat{\sigma} = 0.15$.

1. Is $\sigma = 0.15$ the 'true' optimal value?
   - No. It is a point estimate computed from one sample of 50 experiments, so it carries uncertainty.

2. If we ran 500 experiments, would we get the same estimate?
   - Probably not exactly, but it would likely be close.
   - With more data, the variance of the estimator shrinks, so our estimate becomes more reliable.

3. How do we quantify how 'wrong' our estimate might be?
   - Use confidence intervals! (Next class)
   - Or: compute standard error of estimator

4. Your colleague claims $\sigma = 0.12$ is better. Who's right?
   - Use hypothesis testing (covered in an upcoming class)
   - Or: compare likelihoods

## Common Mistakes

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls</h4>

- Confusing estimate with true parameter
- Ignoring bias-variance tradeoff
- Forgetting to normalize before regularization
- Choosing wrong prior in MAP
- Not checking if solution is actually a maximum

</div>

## ML Applications

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

- Model Training: Most supervised learning can be framed as parameter estimation
- Regularization: L2/L1 penalties = Gaussian/Laplace priors
- Loss Functions: Cross-entropy = negative log-likelihood
- Optimization: Gradient descent = finding MLE/MAP

</div>

## Key Takeaways

<div class="alert alert-summary">
<h4>🎓 Key Takeaways</h4>

1. Point Estimation:

- Estimator = function that produces estimate from data
- Key properties: Bias, Variance, MSE
- MSE = Bias² + Variance (fundamental tradeoff)

2. Maximum Likelihood Estimation:

- Principle: Choose θ that makes data most likely
- Method: Maximize L(θ|X) or ℓ(θ|X)
- Properties: Consistent, asymptotically efficient, asymptotically normal
- Computation: Closed-form or gradient ascent

3. Maximum A Posteriori:

- Principle: Maximize posterior ∝ likelihood × prior
- Incorporates prior knowledge
- Connection: MAP with Gaussian prior = Ridge regression
- Becomes MLE as n → ∞

</div>

## Useful Links

1. [Maximum Likelihood, Clearly Explained!!! by StatQuest](https://www.youtube.com/watch?v=XepXtl9YKwc)
2. [In Statistics, Probability is not Likelihood by StatQuest](https://www.youtube.com/watch?v=pYxNSUDSFH4)
3. [Maximum Likelihood For the Normal Distribution, step-by-step!!! by StatQuest](https://www.youtube.com/watch?v=Dn6b9fCIUpM)
4. [Maximum Likelihood for the Exponential Distribution, Clearly Explained!! by StatQuest](https://www.youtube.com/watch?v=p3T-_LMrvBc)
5. [Maximum Likelihood for the Binomial Distribution, Clearly Explained!!! by StatQuest](https://www.youtube.com/watch?v=4KKV9yZCoM4)
6. [What are degrees of freedom? by James Gilbert](https://www.youtube.com/watch?v=rATNoxKg1yA)