# Sampling, Monte Carlo Methods and Bootstrapping

Author & Instructor: Diana NURBAKOVA, PhD.

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

## Learning Objectives

By the end of this lesson, you will be able to:
1. Define what sampling from a distribution means and distinguish between population and sample
2. Explain the difference between population parameters ($\mu$, $\sigma$) and sample statistics ($\bar{x}$, $s$)
3. Understand and apply Monte Carlo methods: using randomness to solve deterministic problems
4. Generate bootstrap samples by resampling with replacement

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
#sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
from ipywidgets import interact, IntSlider, Dropdown
import ipywidgets as widgets

In [None]:
plt.rcParams['font.family'] = ['DejaVu Sans', 'Segoe UI Emoji']

In [None]:
from matplotlib.patches import Circle, Rectangle
import matplotlib.patches as mpatches

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
import sys
from pathlib import Path

# Add the "resources" directory to the path
project_root = Path().resolve().parent
resources_path = project_root / 'resources'
sys.path.insert(0, str(resources_path))

In [None]:
from montecarlo import (
    estimate_pi_visual,
    interactive_population_sample_demo,
    demonstrate_sampling_concept,
    sampling_process_diagram,
    motivating_examples_visualization,
    interactive_sampling_demo,
    ml_sampling_example,
    bootstrap_demo,
    visualize_uncertainty_concept,
    bootstrap_uncertainty_explanation,
)

<div class="alert alert-info">
<h4>🎯 Today's Challenge: Can We Trust This Model?</h4>
<p><strong>Scenario:</strong> You've trained a deep learning model that predicts customer churn with 87% accuracy on your test set. Your CEO wants to invest $50M based on this model's predictions.</p>

<p><strong>The Questions:</strong></p>
<ul>
<li>How confident are you that the <em>real</em> accuracy is above 85%?</li>
<li>How much would this estimate vary with different test data? What if you only have 500 samples? What if you have 500,000?</li>
<li>Can you quantify the uncertainty in this 87% figure?</li>
</ul>

<p><strong>Make your guess:</strong> Green light the $50M investment?</p>

</div>

## Population vs Sample: We Can't Measure Everything

<div class="alert alert-example">
<h4>The Real-World Dilemma</h4>

<p><strong>Imagine you're hired by Netflix to answer:</strong></p>
<p style="font-size: 1.1em; font-style: italic;">"What percentage of our users will enjoy our new recommendation algorithm?"</p>

<p><strong>The Impossible Approach:</strong></p>
<ul>
<li>Netflix has 250 million subscribers worldwide</li>
<li>To be 100% certain, test with ALL 250 million users</li>
<li>Cost: $50 million+ in infrastructure</li>
<li>Time: 6 months</li>
<li>Problem: By then, competitors have already launched their features!</li>
</ul>

<p><strong>The Practical Approach:</strong></p>
<ul>
<li>Test with 10,000 randomly selected users</li>
<li>Cost: $100,000</li>
<li>Time: 2 weeks</li>
<li>Magic: Get an estimate within about ±1 percentage point of the true figure (at 95% confidence)!</li>
</ul>

<p style="margin-top: 15px; padding: 10px; background-color: rgba(255, 193, 7, 0.1); border-left: 4px solid #ffc107;">
<strong>The Question:</strong> How can testing 10,000 users tell us about 250 million? This is the power of sampling!
</p>
</div>


In [None]:
# demo
motivating_examples_visualization()

<div class="alert alert-success">
<h4>Definitions: Population and Sample</h4>

**Population** is the complete set of ALL individuals/items/observations we want to study. 

*Symbol:* The population size is often denoted $N$

*Key property:* Contains the TRUE parameters we want to know ($\mu$, $\sigma$, etc.) 

*Problem:* Usually impossible or impractical to measure entirely

<br>

**Sample** is a subset of the population that we actually observe/measure. 

*Symbol:* The sample size is often denoted $n$, where $n \ll N$

*Key property:* Provides ESTIMATES of population parameters ($\bar{x}$, $s$, etc.)

*Goal:* Choose sample so it "represents" the population

<br>

**Random Sampling** is the process of selecting a subset of the population in which each member of the population has an equal probability of being selected.

*Why random?* Eliminates bias, allows mathematical theory to work

*Result:* Sample statistics approximate population parameters

</div>
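
To make these definitions concrete, here is a minimal sketch that draws a simple random sample from a synthetic population and compares the sample statistics with the population parameters. The population (one million simulated heights), the seed, and the sample size are illustrative assumptions, not data from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical population: 1,000,000 simulated heights in cm (illustrative values)
population = rng.normal(loc=170, scale=10, size=1_000_000)

# Simple random sample without replacement: every member is equally likely to be chosen
sample = rng.choice(population, size=1_000, replace=False)

# Population parameters (unknown in practice) vs. sample statistics (what we observe)
mu, sigma = population.mean(), population.std(ddof=0)
x_bar, s = sample.mean(), sample.std(ddof=1)

print(f"Population: N = {population.size}, mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Sample:     n = {sample.size}, x_bar = {x_bar:.2f}, s = {s:.2f}")
```

Only 0.1% of the population is measured here, yet the sample statistics should land close to the population parameters.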

In [None]:
# visualisation sampling process
sampling_process_diagram()

<div class="alert alert-success">
<h4>Definition: Sample Mean</h4>

Given a sample of $n$ observations $X_1, X_2, \ldots, X_n$, the **sample mean** is defined as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n X_i$$

**Key properties:**

1. Unbiased estimator: $E[\bar{x}] = \mu$, where $\mu$ is the population mean (usually unknown).

*The expected value of the sample mean equals the population mean: "on average, $\bar{x}$ gives us the right answer."*

2. Consistency: $\bar{x} \rightarrow \mu$ as $n \rightarrow \infty$.

*As the sample size increases, the sample mean converges to the population mean. This is the Law of Large Numbers.*
</div>
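
Both properties can be checked by simulation. The sketch below assumes a Normal population with $\mu = 5$ and $\sigma = 2$ (values chosen only for illustration): averaging many small-sample means illustrates unbiasedness, and growing $n$ illustrates consistency.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
mu, sigma = 5.0, 2.0            # assumed "true" parameters of the simulated population

# Unbiasedness: the average of many sample means (each from n = 10) is close to mu
means_small = rng.normal(mu, sigma, size=(20_000, 10)).mean(axis=1)
print(f"Average of 20,000 sample means (n=10): {means_small.mean():.3f}")

# Consistency: a single sample mean gets closer to mu as n grows
for n in [10, 1_000, 100_000]:
    x_bar = rng.normal(mu, sigma, size=n).mean()
    print(f"n = {n:>7}: x_bar = {x_bar:.4f}, |x_bar - mu| = {abs(x_bar - mu):.4f}")
```

Individual sample means with $n = 10$ still scatter noticeably around 5; it is their average that sits on $\mu$.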

<div class="alert alert-success">
<h4>Definition: Sample Variance and Sample Standard Deviation</h4>

Given a sample of $n$ observations $X_1, X_2, \ldots, X_n$ with sample mean $\bar{x}$, the **sample variance** is defined as:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2 = \frac{1}{n-1} \bigg(\sum_{i=1}^n X_i^2 - \frac{(\sum_{i=1}^n X_i)^2}{n}\bigg)$$

The **sample standard deviation** is:
$$s = \sqrt{s^2}$$

</div>

<div class="alert alert-example">
<h4>Example: Sample Mean, Sample Variance, and Sample Standard Deviation</h4>

We observe the following data: 12, 15, 18, 21, 19, 22, 16, 14, 17, 20.

</div>

The sample size is $n=10$.

1. Sample mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{10} \times (12 + 15 + 18 + 21 + 19 + 22 + 16 + 14 + 17 + 20) = \frac{174}{10} = 17.4$$


In [None]:
# with python
data = np.array([12, 15, 18, 21, 19, 22, 16, 14, 17, 20])
n = len(data)
print(data.sum())
sample_mean = np.mean(data)
    
print(f"\nSample data: {data}")
print(f"Sample size: n = {n}")
print(f"Sample mean: x̄ = {sample_mean}")
    

2. Sample Variance: Method 1

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2$$

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum |
| :--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |
| $X_i$ | $12$ | $15$ | $18$ | $21$ | $19$ | $22$ | $16$ | $14$ | $17$ | $20$ | $174$ |
| $X_i - \bar{x}$ | $12 - 17.4 = -5.4$ | $15 - 17.4 = -2.4$ | $18 - 17.4 = 0.6$ | $21 - 17.4 = 3.6$ | $19 - 17.4 = 1.6$ | $22 - 17.4 = 4.6$ | $16 - 17.4 = -1.4$ | $14 - 17.4 = -3.4$ | $17 - 17.4 = -0.4$ | $20 - 17.4 = 2.6$ | $0$ |
| $(X_i - \bar{x})^2$ | $(-5.4)^2 = 29.16$ | $(-2.4)^2 = 5.76$ | $0.6^2 = 0.36$ | $3.6^2 = 12.96$ | $1.6^2 = 2.56$ | $4.6^2 = 21.16$ | $(-1.4)^2 = 1.96$ | $(-3.4)^2 = 11.56$ | $(-0.4)^2 = 0.16$ | $2.6^2 = 6.76$ | $92.40$ |

Hence,

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2 = \frac{1}{9} \times 92.40 \approx 10.267$$

3. Standard Deviation

$$s = \sqrt{s^2} = \sqrt{10.267} = 3.204$$

In [None]:
print('Sample Variance')
print("METHOD 1: DEFINITION FORMULA")

print("\nStep 1: Compute deviations from mean (Xᵢ - x̄)")
print(f"{'i':<5} {'Xᵢ':<8} {'x̄':<8} {'(Xᵢ - x̄)':<15} {'(Xᵢ - x̄)²':<15}")
    
deviations = []
squared_deviations = []
    
for i, x in enumerate(data, 1):
    dev = x - sample_mean
    sq_dev = dev ** 2
    deviations.append(dev)
    squared_deviations.append(sq_dev)
        
    print(f"{i:<5} {x:<8} {sample_mean:<8} {dev:<15.2f} {sq_dev:<15.2f}")
    
print(f"{'Sum:':<22} {sum(deviations):<15.2f} {sum(squared_deviations):<15.2f}")
    
print("\n✓ Note: Sum of deviations = 0 (always!)")
print("  This is because deviations cancel out by definition of the mean")
    
print("\nStep 2: Sum the squared deviations")
sum_sq_dev = sum(squared_deviations)
print(f"  Σ(Xᵢ - x̄)² = {sum_sq_dev}")
    
print("\nStep 3: Divide by (n-1)")
print(f"  s² = Σ(Xᵢ - x̄)² / (n-1)")
print(f"     = {sum_sq_dev} / {n-1}")
    
sample_variance = sum_sq_dev / (n - 1)
print(f"     = {sample_variance:.4f}")
    
print("\nStep 4: Take square root for standard deviation")
sample_std = np.sqrt(sample_variance)
print(f"  s = √s²")
print(f"    = √{sample_variance:.4f}")
print(f"    = {sample_std:.4f}")

print(f"with numpy: {data.var(ddof=1)}")

4. Sample Variance: Method 2

$$s^2 = \frac{1}{n-1} \bigg(\sum_{i=1}^n X_i^2 - \frac{(\sum_{i=1}^n X_i)^2}{n}\bigg)$$

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum |
| :--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |:--: |
| $X_i$ | $12$ | $15$ | $18$ | $21$ | $19$ | $22$ | $16$ | $14$ | $17$ | $20$ | $174$ |
| $X_i^2$ | $12^2 = 144$ | $15^2 = 225$ | $18^2 = 324$ | $21^2 = 441$ | $19^2 = 361$ | $22^2 = 484$ | $16^2 = 256$ | $14^2 = 196$ | $17^2 = 289$ | $20^2 = 400$ | $3120$ |

Hence:
$$s^2 = \frac{1}{n-1} \bigg(\sum_{i=1}^n X_i^2 - \frac{(\sum_{i=1}^n X_i)^2}{n}\bigg) = \frac{1}{10-1} \bigg(3120 - \frac{174^2}{10}\bigg) = \frac{1}{9} \bigg(3120 - \frac{30276}{10}\bigg) = \frac{1}{9} \bigg(3120 - 3027.6\bigg) = \frac{92.4}{9} \approx 10.2667$$



In [None]:
n = data.shape[0]
print("\nMETHOD 2: COMPUTATIONAL FORMULA")
print("\nStep 1: Compute Σ Xᵢ and Σ Xᵢ²")
x_squared = data ** 2
print(f"Xᵢ²: {x_squared}")
sum_x = np.sum(data)
sum_x_squared = np.sum(x_squared)
    
print(f"  Σ Xᵢ = {sum_x}")
print(f"  Σ Xᵢ² = {sum_x_squared}")
    
print("\nStep 2: Apply formula")
print(f"  s² = [Σ Xᵢ² - (Σ Xᵢ)²/n] / (n-1)")
print(f"     = [{sum_x_squared} - ({sum_x})²/{n}] / {n-1}")
print(f"     = [{sum_x_squared} - {sum_x**2/n:.2f}] / {n-1}")
print(f"     = {sum_x_squared - sum_x**2/n:.2f} / {n-1}")
    
variance_alt = (sum_x_squared - sum_x**2/n) / (n-1)
print(f"     = {variance_alt:.4f}")

Let's also calculate the biased sample variance, which divides by $n$ instead of $n-1$:
$$s^2_{\text{biased}} = \frac{1}{\mathbf{n}} \sum_{i=1}^n (X_i - \bar{x})^2 = \frac{1}{10} \times 92.4 = 9.24$$

In [None]:
# Biased version (using n)
biased_var = sum_sq_dev / n
    
print(f"\nUsing n (BIASED):")
print(f"  Biased variance = Σ(Xᵢ - x̄)² / n")
print(f"                  = {sum_sq_dev} / {n}")
print(f"                  = {biased_var:.4f}")

print(f"with numpy: {data.var(ddof=0)}")
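
A single sample cannot show that dividing by $n$ is biased, but a simulation can. The sketch below assumes a Normal population with true variance $\sigma^2 = 4$ and a small sample size $n = 5$ (both chosen for illustration), and averages each estimator over many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
sigma2 = 4.0                    # assumed true variance of the simulated population
n = 5                           # small sample size, where the bias is most visible

# 100,000 independent samples of size n, one per row
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(100_000, n))

avg_biased = samples.var(axis=1, ddof=0).mean()    # divide by n
avg_unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1 (Bessel's correction)

print(f"True variance:              {sigma2:.3f}")
print(f"Average with ddof=0 (n):    {avg_biased:.3f}")
print(f"Average with ddof=1 (n-1):  {avg_unbiased:.3f}")
print(f"Theory for ddof=0:          {(n - 1) / n * sigma2:.3f}")
```

On average the `ddof=0` estimator should recover only $(n-1)/n$ of the true variance, while the `ddof=1` estimator should average out close to $\sigma^2$.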

In [None]:
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
# Plot 1: Visualize deviations
ax1 = axes[0, 0]
x_pos = range(1, n+1)
ax1.bar(x_pos, data, alpha=0.6, edgecolor='black', color='skyblue')
ax1.axhline(sample_mean, color='red', linewidth=3, linestyle='--',
               label=f'Mean x̄ = {sample_mean}')
    
# Show deviations as arrows
for i, (pos, val) in enumerate(zip(x_pos, data)):
    if val > sample_mean:
        ax1.annotate('', xy=(pos, sample_mean), xytext=(pos, val),
                        arrowprops=dict(arrowstyle='<->', color='green', lw=2))
        ax1.text(pos + 0.2, (val + sample_mean)/2, f'{val-sample_mean:.1f}',
                    fontsize=8, color='green')
    else:
        ax1.annotate('', xy=(pos, val), xytext=(pos, sample_mean),
                        arrowprops=dict(arrowstyle='<->', color='orange', lw=2))
        ax1.text(pos + 0.2, (val + sample_mean)/2, f'{val-sample_mean:.1f}',
                    fontsize=8, color='orange')
    
ax1.set_xlabel('Observation Index', fontsize=12)
ax1.set_ylabel('Value', fontsize=12)
ax1.set_title('Deviations from Mean', fontsize=13, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3, axis='y')
ax1.set_xticks(x_pos)
    
# Plot 2: Distribution with std dev bands
ax2 = axes[0, 1]
ax2.hist(data, bins=7, alpha=0.7, edgecolor='black', color='skyblue', density=False)
ax2.axvline(sample_mean, color='red', linewidth=3, linestyle='-',
               label=f'Mean = {sample_mean:.1f}')
ax2.axvline(sample_mean - sample_std, color='orange', linewidth=2, linestyle='--',
               label=f'Mean ± 1 SD')
ax2.axvline(sample_mean + sample_std, color='orange', linewidth=2, linestyle='--')
ax2.axvspan(sample_mean - sample_std, sample_mean + sample_std, 
               alpha=0.2, color='orange')
    
ax2.set_xlabel('Value', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title(f'Distribution with Standard Deviation\ns = {sample_std:.2f}', 
                 fontsize=13, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3, axis='y')
    
# Plot 3: Effect of n vs (n-1)
ax3 = axes[1, 0]
    
sample_sizes = range(2, 51)
bias_factors = [n/(n-1) for n in sample_sizes]
    
ax3.plot(sample_sizes, bias_factors, linewidth=3, color='purple')
ax3.axhline(1, color='red', linewidth=2, linestyle='--', alpha=0.7,
               label='No correction (bias)')
ax3.axhline(1.05, color='orange', linewidth=1, linestyle=':', alpha=0.7)
ax3.axhline(1.10, color='orange', linewidth=1, linestyle=':', alpha=0.7)
    
ax3.fill_between(sample_sizes, 1, bias_factors, alpha=0.3, color='purple')
    
ax3.set_xlabel('Sample Size (n)', fontsize=12)
ax3.set_ylabel('Correction Factor: n/(n-1)', fontsize=12)
ax3.set_title('Impact of Bessel\'s Correction', fontsize=13, fontweight='bold')
ax3.legend(fontsize=11)
ax3.grid(True, alpha=0.3)
ax3.set_ylim(0.98, 1.6)
    
# Annotate key points (the loop variable is k so it does not overwrite the sample size n)
for k in [5, 10, 20, 50]:
    if k <= max(sample_sizes):
        idx = k - 2
        factor = bias_factors[idx]
        ax3.plot(k, factor, 'ro', markersize=8)
        ax3.text(k, factor + 0.03, f'n={k}\n{factor:.3f}×',
                    ha='center', fontsize=8, fontweight='bold')
    
# Plot 4: Explanation box
ax4 = axes[1, 1]
ax4.axis('off')
    
explanation = f"""
    UNDERSTANDING SAMPLE VARIANCE
    {'='*50}
    
    Our Sample:
    {'─'*50}
    Data: {data}
    n = {n}
    x̄ = {sample_mean}
    s² = {sample_variance:.4f}
    s = {sample_std:.4f}
    
    What does s² = {sample_variance:.2f} mean?
    {'─'*50}
    • Average squared deviation from mean
    • Measures spread/dispersion
    • Units: (original units)²
    • s = {sample_std:.2f} is in original units
    
    Why (n-1) instead of n?
    {'─'*50}
    • We estimated x̄ from the data
    • This uses up 1 degree of freedom
    • Makes s² an UNBIASED estimator of σ²
    • Critical for small samples!
    
    Impact for our data:
    {'─'*50}
    • With n:   {biased_var:.4f} (underestimates!)
    • With n-1: {sample_variance:.4f} (unbiased)
    • Difference: {(sample_variance/biased_var - 1)*100:.1f}% larger
    
    In Monte Carlo Integration:
    {'─'*50}
    • Standard Error = (b-a) × s/√n
    • We MUST use s with (n-1)!
    • Otherwise SE is underestimated
    • Confidence intervals would be too narrow
    
    Rule of Thumb:
    {'─'*50}
    • Small samples (n < 30): Use n-1
    • Large samples (n > 100): Barely matters
    • ALWAYS use n-1 to be safe!
    """
    
ax4.text(0.05, 0.95, explanation, transform=ax4.transAxes,
            fontsize=8.5, verticalalignment='top', family='monospace',
            bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    
plt.tight_layout()
plt.show()

<div class="alert alert-success">
<h4>Definition: Standard Error (SE)</h4>

The **standard error** of the mean is the standard deviation of the sampling distribution of the sample mean $\bar{x}$.
    
POPULATION STANDARD ERROR (if $\sigma$ is known):
    
$$SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$
    
ESTIMATED STANDARD ERROR (using sample):
    
$$SE(\bar{x}) = \frac{s}{\sqrt{n}}$$
    
where:
- $\sigma$ = population standard deviation (usually unknown)
- $s$ = sample standard deviation
- $n$ = sample size
    

    
**Critical distinction:**

    
1. Standard Deviation ($s$):
- Measures variability of INDIVIDUAL observations
- Answers: "How spread out is my data?"
- Does NOT decrease with n
- Units: same as data
    
2. Standard Error ($SE$):
- Measures variability of the SAMPLE MEAN
- Answers: "How precisely do I know the mean?"
- Decreases as $1/\sqrt{n}$ (more data = more precise)
- Units: same as data
    

    
**Key properties**:
  
1. Decreases with sample size: $SE \propto 1/\sqrt{n}$
       
    - Double $n$ → SE decreases by a factor of $\sqrt{2} \approx 1.41$
    - 4× more data → SE halves
    - 100× more data → SE decreases by 10×
    
2. Measures precision:
    - Small SE → precise estimate of mean
    - Large SE → imprecise estimate
</div>
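
The claim that the sample mean has standard deviation $\sigma/\sqrt{n}$ can be verified empirically. The sketch below assumes a Normal population with $\sigma = 10$ (an illustrative choice), repeats the "experiment" 10,000 times for several sample sizes, and compares the observed spread of the sample means with the formula.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
sigma = 10.0                    # assumed population standard deviation

se_by_n = {}  # n -> (empirical SE, theoretical SE)
for n in [25, 100, 400]:
    # 10,000 repeated experiments, each yielding one sample mean
    sample_means = rng.normal(loc=170.0, scale=sigma, size=(10_000, n)).mean(axis=1)
    empirical_se = sample_means.std(ddof=1)  # spread of the sample means
    theoretical_se = sigma / np.sqrt(n)      # sigma / sqrt(n)
    se_by_n[n] = (empirical_se, theoretical_se)
    print(f"n = {n:>3}: empirical SE = {empirical_se:.3f}, sigma/sqrt(n) = {theoretical_se:.3f}")
```

Each fourfold increase in $n$ should roughly halve the empirical standard error, matching the $1/\sqrt{n}$ rule.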

Let's calculate SE for our example with the following observed data: 12, 15, 18, 21, 19, 22, 16, 14, 17, 20.

Recall: $s = 3.2042$ and $n = 10$.

Hence: $$SE(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{3.2042}{\sqrt{10}} = 1.0132$$

In [None]:
data = np.array([12, 15, 18, 21, 19, 22, 16, 14, 17, 20])
n = len(data)
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
se = sample_std / np.sqrt(n)
    
print(f"\nSample data: {data}")
print(f"Sample size: n = {n}")
print(f"Sample mean: x̄ = {sample_mean}")
print(f"Sample std dev: s = {sample_std:.4f}")
print(f"SE = s/√n")
print(f"     = {sample_std:.4f}/√{n}")
print(f"     = {sample_std:.4f}/{np.sqrt(n):.4f}")
print(f"     = {se:.4f}")
    
print("\nINTERPRETATION:")
print(f"  The sample mean x̄ = {sample_mean:.2f} has standard error {se:.4f}")
print(f"  ")
print(f"  This means:")
print(f"  • If we repeated this experiment many times,")
print(f"  • Each time taking a sample of size {n},")
print(f"  • The sample means would vary with SD ≈ {se:.4f}")
print(f"  • About 68% would fall within ± {se:.4f} of the true mean μ")
print(f"  • About 95% would fall within ± {1.96*se:.4f} of the true mean μ")
print()

# variation with sample size
print(f"{'Sample Size (n)':<20} {'SE = s/√n':<15} {'95% CI Width':<20}")
print("-" * 60)
for n_demo in [5, 10, 20, 50, 100, 500, 1000]:
    se_demo = sample_std / np.sqrt(n_demo)
    ci_width = 2 * 1.96 * se_demo  # full width of the interval x̄ ± 1.96·SE
    print(f"{n_demo:<20} {se_demo:<15.4f} {ci_width:<20.4f}")
    
print("\nOBSERVATIONS: HOW SE CHANGES WITH SAMPLE SIZE")
print("  ✓ SE decreases as n increases")
print("  ✓ To halve SE, need 4× more data")
print("  ✓ Diminishing returns: going from 100→1000 helps less than 10→100")

## Understanding Sampling from Distributions

<div class="alert alert-success">
<h4>Definition: Sampling from a Distribution</h4>

**Sampling** means generating random values that follow a specific probability distribution.

Intuition:

- A distribution describes the "recipe" for how values should occur
- Sampling is like following that recipe to create actual instances
- Each sample is random, but collectively they follow the distribution's pattern 


**Mathematical notation**: $X \sim p(x)$ means "$X$ is sampled from distribution $p$"

Sampling from a Distribution means:

1. Start with: A probability distribution $p(x)$ that describes the population
2. Process: Generate random values $X_1, X_2, ..., X_n$ where each $X_i$ follows distribution $p(x)$
3. Result: A sample of $n$ observations from that distribution

</div>

Let's consider several examples:

1. Uniform Distribution

Imagine a perfectly fair spinner that can land anywhere from 0 to 10 with equal probability.
Let's spin it 5 times:
  - Spin 1: 3.75
  - Spin 2: 9.51
  - Spin 3: 7.32
  - Spin 4: 5.99
  - Spin 5: 1.56 

> What happens with many samples?

With 1000 spins, the histogram matches the uniform shape.
- Sample mean: 4.91 (expected: 5.00)
- Sample std:  2.92 (expected: 2.89)

2. Normal (Gaussian) Distribution

Imagine measuring the heights of students (mean=170cm, std=10cm). Most students cluster around 170cm, extreme heights are rare.

Let's measure heights of 5 random students:
  - Student 1: 175.0 cm
  - Student 2: 168.6 cm
  - Student 3: 176.5 cm
  - Student 4: 185.2 cm
  - Student 5: 167.7 cm

> What happens with many samples?

With 1000 measurements, the histogram matches the bell curve!
- Sample mean: 170.19cm (expected: 170cm)
- Sample std:  9.79cm (expected: 10cm)
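
The two examples above can be reproduced approximately with a few lines of NumPy. This is a minimal sketch: the seed is arbitrary, so the individual draws and summary values will differ slightly from the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed; values differ from those quoted in the text

spins = rng.uniform(low=0, high=10, size=1_000)      # fair spinner on [0, 10)
heights = rng.normal(loc=170, scale=10, size=1_000)  # student heights in cm

# Uniform(a, b) has mean (a + b) / 2 and standard deviation (b - a) / sqrt(12)
uniform_std_expected = 10 / np.sqrt(12)

print(f"First 5 spins:   {np.round(spins[:5], 2)}")
print(f"Uniform: mean = {spins.mean():.2f} (expected 5.00), "
      f"std = {spins.std(ddof=1):.2f} (expected {uniform_std_expected:.2f})")
print(f"First 5 heights: {np.round(heights[:5], 1)}")
print(f"Normal:  mean = {heights.mean():.2f} (expected 170.00), "
      f"std = {heights.std(ddof=1):.2f} (expected 10.00)")
```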

In [None]:
# demo
demonstrate_sampling_concept()

<div class="alert alert-success">

**Two Perspectives on Sampling:**

<table style="width:100%; border-collapse: collapse; margin-top: 10px;">
<tr style="background-color: #f0f0f0;">
<th style="border: 1px solid #ddd; padding: 12px; text-align: left;">Real-World Perspective</th>
<th style="border: 1px solid #ddd; padding: 12px; text-align: left;">Mathematical Perspective</th>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 10px; vertical-align: top;">
<strong>Sample from Population</strong><br>
- Population has some distribution<br>
- Randomly select individuals<br>
- Measure their values<br>
- Each measurement is a "sample"
</td>
<td style="border: 1px solid #ddd; padding: 10px; vertical-align: top;">
<strong>Sample from Distribution</strong><br>
- Distribution p(x) describes process<br>
- Generate random values<br>
- Each follows probability rules<br>
- Each value is a "sample"
</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 10px; vertical-align: top;">
<em>Example:</em> Measure heights of 100 random students → sample of size n=100
</td>
<td style="border: 1px solid #ddd; padding: 10px; vertical-align: top;">
<em>Example:</em> Generate 100 values from Normal(170, 10) → sample of size n=100
</td>
</tr>
</table>

<p style="margin-top: 15px; padding: 10px; background-color: rgba(33, 150, 243, 0.1); border-left: 4px solid #2196f3;">
<strong>Key Connection:</strong> When we "sample from a distribution" computationally, we're simulating the process of randomly sampling from a population!
</p>
</div>

In [None]:
# interactive demo 
try:
    print("🎮 INTERACTIVE DEMO: Try different sample sizes and distributions!\n")
    interact(interactive_sampling_demo,
             n_samples=IntSlider(min=10, max=10000, step=10, value=100, 
                                description='Samples:', continuous_update=False),
             distribution=Dropdown(options=['Normal', 'Uniform', 'Exponential', 'Beta'],
                                  value='Normal', description='Distribution:'))
    
except ImportError:
    print("Note: ipywidgets not available. Install with: pip install ipywidgets")

<div class="alert alert-warning">
<h4>💡 Key Insights About Sampling</h4>

<p><strong>Observation 1: Randomness with Structure</strong></p>
<ul>
<li>Each individual sample is unpredictable (random)</li>
<li>But the <em>pattern</em> of many samples is predictable (follows the distribution)</li>
<li>It's like rolling a die: one roll is random, but after 1000 rolls you'll get ~167 sixes</li>
</ul>

<p><strong>Observation 2: More Samples = Better Picture</strong></p>
<ul>
<li>5 samples: Hard to see the distribution's shape</li>
<li>100 samples: Pattern starts to emerge</li>
<li>1000 samples: Clear picture of the distribution</li>
<li>This is the foundation of Monte Carlo methods!</li>
</ul>

<p><strong>Observation 3: Sample Statistics → Population Parameters</strong></p>
<ul>
<li>Sample mean approximates true mean</li>
<li>Sample standard deviation approximates true standard deviation</li>
<li>Larger samples → better approximations</li>
</ul>
</div>

<div class="alert alert-primary">
<h4>🤖 ML Connection: Why Sampling Powers Modern AI</h4>

<p><strong>Sampling enables three critical capabilities:</strong></p>

<ol>
<li><strong>Efficiency</strong>: Can't use all data at once? Sample a batch!
   <ul>
   <li>Mini-batch SGD: Sample 32 examples instead of using all 1M</li>
   <li>Reduces memory usage by 31,250×</li>
   </ul>
</li>

<li><strong>Exploration</strong>: Want to see different possibilities? Sample variations!
   <ul>
   <li>Dropout: Sample different sub-networks each iteration</li>
   <li>Data augmentation: Sample different transformations</li>
   <li>Prevents overfitting by showing the model diverse examples</li>
   </ul>
</li>

<li><strong>Approximation</strong>: Can't compute exactly? Sample to approximate!
   <ul>
   <li>Monte Carlo methods (today's topic!)</li>
   <li>Estimate complex integrals using random samples</li>
   <li>Works even when math is impossible</li>
   </ul>
</li>
</ol>

<p><strong>Bottom line:</strong> Understanding sampling is understanding how ML actually works.</p>
</div>

In [None]:
# some ML sampling examples
ml_sampling_example()

<div class="alert alert-exercise">
<h4>✏️ Quick Check: Do You Understand Sampling?</h4>

<p><strong>Question 1:</strong> You sample 10 values from Normal(μ=5, σ=2). Their mean is 5.3. Is something wrong?</p>
<details>
<summary>Click to reveal answer</summary>
<p><strong>Answer:</strong> No! This is completely normal. The sample mean will vary around the true mean. With only 10 samples, seeing 5.3 instead of 5.0 is expected. The difference will decrease as you collect more samples.</p>
</details>

<p><strong>Question 2:</strong> What's the difference between sampling 1000 times from Normal(0,1) versus computing the normal distribution at 1000 points?</p>
<details>
<summary>Click to reveal answer</summary>
<p><strong>Answer:</strong></p>
<ul>
<li><strong>Sampling:</strong> Get 1000 <em>random</em> values following the distribution (different each time)</li>
<li><strong>Computing:</strong> Evaluate the probability density at 1000 <em>fixed</em> points (same each time)</li>
<li>Sampling gives you <em>data</em>, computing gives you the <em>formula</em></li>
</ul>
</details>

<p><strong>Question 3:</strong> In mini-batch training with batch_size=32, how many times do we sample per epoch (if dataset has 3200 samples)?</p>
<details>
<summary>Click to reveal answer</summary>
<p><strong>Answer:</strong> 100 times (3200 / 32 = 100 batches per epoch)</p>
</details>
</div>

To generate a sample of size $n$ from a given distribution in Python, you can call the `rvs` method of the corresponding `scipy.stats` distribution, e.g.:

In [None]:
# EXAMPLE 1: Standard Normal distribution
X_norm = stats.norm(loc=0, scale=1)
# generate 10 samples
samples_norm = X_norm.rvs(size=10, random_state=42)
print(f"10 random draws from Standard Normal distribution: {samples_norm}")

# EXAMPLE 2: Bernoulli with p=0.3
X_bern = stats.bernoulli(p=0.3)
# generate 10 samples
samples_bernoulli = X_bern.rvs(size=10, random_state=42)
print(f"10 random draws from Bernoulli distribution with p=0.3: {samples_bernoulli}")


To generate a sample from a uniform distribution, it is also common to use [`numpy.random.uniform`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html):

In [None]:
# EXAMPLE 3: Uniform distribution
n_samples = 10
samples_uniform = np.random.uniform(low=0, high=1, size=n_samples)
print(f"10 random draws from Uniform distribution on [0,1]: {samples_uniform}")
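
As an alternative to the global-state functions used above, NumPy also provides the `Generator` interface created with `np.random.default_rng`. The sketch below is one possible way to draw the same three kinds of samples with it; the seed value is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # independent generator; does not touch the global seed

samples_uniform = rng.uniform(low=0, high=1, size=10)   # Uniform on [0, 1)
samples_normal = rng.normal(loc=0, scale=1, size=10)    # Standard Normal
samples_bernoulli = rng.binomial(n=1, p=0.3, size=10)   # Bernoulli(p) is Binomial with n=1

print(f"Uniform:   {np.round(samples_uniform, 3)}")
print(f"Normal:    {np.round(samples_normal, 3)}")
print(f"Bernoulli: {samples_bernoulli}")
```

Passing the generator object around explicitly makes it clear which code shares a random stream, which helps when results need to be reproducible.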

## Monte Carlo Methods

How do you calculate the area of a lake with an irregular shape? Or the probability that your ML model fails in exactly 3 out of 10 edge cases?

<div class="alert alert-example">
<h4>The Problem: Can We Estimate π Without Using π?</h4>

What we know:

- $\pi \approx 3.14159\ldots$ (but let's pretend we don't know this)
- $\pi$ is the ratio of a circle's circumference to its diameter
- Calculating $\pi$ analytically is complex (infinite series, etc.)


What we want:

- Estimate $\pi$ using only random numbers and simple geometry
- No calculus, no infinite series, no complex math
- Just: throw darts randomly and count!


<p style="margin-top: 15px; padding: 10px; background-color: rgba(255, 193, 7, 0.1); border-left: 4px solid #ffc107;">
<strong>The Monte Carlo Idea:</strong> Use randomness to approximate a deterministic constant!
</p>
</div>

Let's consider the following setup:

In [None]:
# graphical setup for pi calculation
fig, ax = plt.subplots(1, 1, figsize=(16, 7))
    
# Draw square
square = Rectangle((-1, -1), 2, 2, fill=False, edgecolor='tab:blue', linewidth=3)
ax.add_patch(square)
    
# Draw circle
circle = Circle((0, 0), 1, fill=False, edgecolor='#EF9A9A', linewidth=3)
ax.add_patch(circle)
    
# Add labels
ax.plot([-1, 1], [0, 0], 'b--', alpha=0.5, linewidth=1)
ax.plot([0, 0], [-1, 1], 'b--', alpha=0.5, linewidth=1)
    
# Annotations
ax.annotate('', xy=(1, 0), xytext=(0, 0),
                arrowprops=dict(arrowstyle='<->', color='#EF9A9A', lw=2))
ax.text(0.5, -0.15, 'r = 1', fontsize=12, ha='center', color='#EF9A9A', fontweight='bold')
    
ax.annotate('', xy=(1, 1), xytext=(1, -1),
                arrowprops=dict(arrowstyle='<->', color='tab:blue', lw=2))
ax.text(1.2, 0, 'side = 2', fontsize=12, ha='left', color='tab:blue', fontweight='bold')
   
# Formulas
formula_text = """
    Circle (red):
    • Radius: r = 1
    • Area: πr² = π(1)² = π
    
    Square (blue):
    • Side length: 2
    • Area: 2² = 4
    
    Ratio:
    Area(circle)   π
    ──────────── = ───
    Area(square)   4
    
    Solve for π:
            Area(circle)
    π = 4 × ────────────
            Area(square)
    
    """
    
ax.text(-3.5, 0, formula_text, fontsize=11, verticalalignment='center',
            family='monospace',
            bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    
ax.set_xlim(-4, 2)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.set_title('Geometric Setup', fontsize=14, fontweight='bold')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)


**The main (Monte Carlo) idea:**

1. Throw darts at random, uniformly over the square
    
2. For each dart, check whether it lands inside the circle, i.e. $x^2 + y^2 \leq 1$

3. Count the darts inside the circle and the total number of darts

4. Ratio of counts ≈ ratio of areas
    
5. Solve for $\pi$

$$\pi = 4 \times \frac{\text{Area circle}}{\text{Area square}} \approx 4 \times \frac{\text{\# darts inside the circle}}{\text{\# darts total}}$$
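
The five steps above can be sketched in a few lines of NumPy. The function name `estimate_pi` and the seed are illustrative assumptions; this is a plain, non-visual sketch and not the `estimate_pi_visual` helper imported from the course resources.

```python
import numpy as np

def estimate_pi(n_darts, rng):
    """Estimate pi by throwing n_darts uniformly at the square [-1, 1] x [-1, 1]."""
    x = rng.uniform(-1, 1, size=n_darts)  # step 1: random dart positions
    y = rng.uniform(-1, 1, size=n_darts)
    inside = (x**2 + y**2) <= 1           # step 2: inside the unit circle?
    return 4 * inside.sum() / n_darts     # steps 3-5: ratio of counts, times 4

rng = np.random.default_rng(5)  # arbitrary seed
for n_darts in [100, 10_000, 1_000_000]:
    pi_hat = estimate_pi(n_darts, rng)
    print(f"{n_darts:>9} darts: pi_hat = {pi_hat:.5f}, |error| = {abs(pi_hat - np.pi):.5f}")
```

`np.pi` is used only to report the error; the estimate itself relies solely on uniform random numbers and a comparison.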

In [None]:
results = []
sample_sizes = [10, 100, 1000, 5000, 10000, 50000, 100000]
for n in sample_sizes:
    pi_est, fig = estimate_pi_visual(n, True)
    print(f"\n{n} darts: {pi_est}")
    # calculate error
    error = abs(pi_est - np.pi)
    percent_error = error / np.pi * 100
    results.append({
        'n': n, 
        'pi_estimate': pi_est,
        'error': error,
        'percent_error': percent_error
    })

<div class="alert alert-success">
<h4>Definition: Monte Carlo Method</h4>

A **Monte Carlo method** is any computational technique that relies on repeated random sampling to obtain numerical results for problems that are:

- Analytically intractable (no closed-form solution)
- Too complex for traditional numerical methods
- High-dimensional

**Core Principle:** Use randomness to solve deterministic problems.

**Mathematical Foundation:** Law of Large Numbers: as $n \rightarrow \infty$, the sample mean converges to the true mean

**Formal Framework:**

To estimate an expectation $E[f(X)]$ where $X \sim p(x)$:

1. Sample: Draw $X_1, X_2, ..., X_n \sim p(x)$
2. Evaluate: Compute $f(X_1), f(X_2), ..., f(X_n)$
3. Average: $\text{Estimate} = \frac{1}{n} \sum_{i=1}^n f(X_i)$

Convergence Rate: Error decreases as $O(1/\sqrt{n})$ regardless of dimensionality.

</div>
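
The sample, evaluate, average recipe can be written as a small generic helper. In the sketch below, `mc_expectation` is a hypothetical name introduced for illustration, and the test case ($E[X^2] = 1$ for $X \sim \text{Normal}(0, 1)$) is chosen because the exact answer is known.

```python
import numpy as np

def mc_expectation(f, sampler, n):
    """Monte Carlo estimate of E[f(X)] and its estimated standard error."""
    values = f(sampler(n))                   # sample X_1..X_n, then evaluate f on each
    estimate = values.mean()                 # average of f(X_i)
    se = values.std(ddof=1) / np.sqrt(n)     # estimated standard error, s / sqrt(n)
    return estimate, se

rng = np.random.default_rng(6)  # arbitrary seed

# Known answer: for X ~ Normal(0, 1), E[X^2] = Var(X) = 1
estimate, se = mc_expectation(
    f=lambda x: x**2,
    sampler=lambda n: rng.normal(0, 1, size=n),
    n=100_000,
)
print(f"E[X^2] estimate = {estimate:.4f} (exact: 1), standard error = {se:.4f}")
```

Returning the standard error alongside the estimate makes the $O(1/\sqrt{n})$ uncertainty visible instead of reporting a bare number.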

Let's express it formally. 

Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ be independent random variables with distribution $\mathcal{U}[-1, 1]$.
Let's denote $R^2_i = X^2_i + Y^2_i$. 

Let $I$ be an indicator function: $I(x, y) = \left\{\begin{array}{ll}1 & \text{ if } x^2 + y^2 \leq 1 \\ 0 & \text{ otherwise} \end{array}\right.$ 

Recall our graphical setup. The probability $\mathbb{P}(R^2_i \leq 1)$ corresponds to the ratio of the area of the circle with radius 1, which is $\pi$, to the area of the square (all points), which is 4. Thus, 

$$E[I(X, Y)] = \mathbb{P}(R^2_i \leq 1) = \frac{\text{Area circle}}{\text{Area square}} = \frac{\pi}{4}$$

By Monte Carlo estimation:

$$E[I(X, Y)] \approx \frac{1}{n} \sum_{i=1}^n I(X_i, Y_i) = \frac{\text{count inside}}{n}$$

So, solving this for $\pi$, we obtain:

$$\hat{\pi} = 4\times \frac{\text{count inside}}{n}$$
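
Because $I(X_i, Y_i)$ is a 0/1 variable with success probability $p = \pi/4$, its variance is $p(1-p)$, so the standard error of $\hat{\pi}$ is approximately $4\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below computes the estimate, its standard error, and an approximate 95% interval; the seed and $n$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n = 100_000

x = rng.uniform(-1, 1, size=n)
y = rng.uniform(-1, 1, size=n)

p_hat = np.mean(x**2 + y**2 <= 1)             # fraction of darts inside the circle
pi_hat = 4 * p_hat                            # Monte Carlo estimate of pi
se_pi = 4 * np.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the estimate

# Approximate 95% confidence interval (normal approximation)
ci_low, ci_high = pi_hat - 1.96 * se_pi, pi_hat + 1.96 * se_pi
print(f"pi_hat = {pi_hat:.4f}, SE = {se_pi:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

The standard error is computed from the sample alone, so the precision of the estimate can be judged without knowing the true value of $\pi$.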

Let's explore the convergence of this method:

In [None]:
# demo convergence
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
# Plot 1: Estimates vs n
ax1 = axes[0, 0]
ns = [r['n'] for r in results]
estimates = [r['pi_estimate'] for r in results]
    
ax1.semilogx(ns, estimates, 'bo-', linewidth=2, markersize=8, label='MC Estimate')
ax1.axhline(np.pi, color='red', linewidth=2, linestyle='--', label=f'True π = {np.pi:.6f}')
ax1.fill_between(ns, np.pi - 0.1, np.pi + 0.1, alpha=0.2, color='red', 
                     label='±0.1 error band')
ax1.set_xlabel('Number of Samples (n)', fontsize=12)
ax1.set_ylabel('Estimate of π', fontsize=12)
ax1.set_title('Convergence of π Estimate', fontsize=13, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3, which='both')
    
# Plot 2: Error vs n (log-log)
ax2 = axes[0, 1]
errors = [r['error'] for r in results]
    
ax2.loglog(ns, errors, 'ro-', linewidth=2, markersize=8, label='Actual Error')
    
# Theoretical O(1/√n) line, anchored at the first observed error
theoretical = errors[0] * np.sqrt(ns[0]) / np.sqrt(np.array(ns))
ax2.loglog(ns, theoretical, 'b--', linewidth=2, label='O(1/√n) theoretical')
    
ax2.set_xlabel('Number of Samples (n)', fontsize=12)
ax2.set_ylabel('Absolute Error', fontsize=12)
ax2.set_title('Error Decay Rate', fontsize=13, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3, which='both')
    
# Plot 3: Visualize for n=1000
ax3 = axes[1, 0]
n_viz = 1000
x_viz = np.random.uniform(-1, 1, n_viz)
y_viz = np.random.uniform(-1, 1, n_viz)
inside_viz = (x_viz**2 + y_viz**2) <= 1
    
# Draw circle and square
circle = Circle((0, 0), 1, fill=False, edgecolor='#EF9A9A', linewidth=2, alpha=0.3)
ax3.add_patch(circle)
square = Rectangle((-1, -1), 2, 2, fill=False, edgecolor='tab:blue', linewidth=2)
ax3.add_patch(square)
    
# Plot points
ax3.scatter(x_viz[inside_viz], y_viz[inside_viz], c='red', s=2, alpha=0.5, label='Inside')
ax3.scatter(x_viz[~inside_viz], y_viz[~inside_viz], c='tab:blue', s=2, alpha=0.5, label='Outside')
    
inside_count_viz = np.sum(inside_viz)
pi_viz = 4 * inside_count_viz / n_viz
    
ax3.set_xlim(-1.2, 1.2)
ax3.set_ylim(-1.2, 1.2)
ax3.set_aspect('equal')
ax3.set_title(f'Visualization: n=1,000 points\nπ ≈ {pi_viz:.6f}', 
                  fontsize=13, fontweight='bold')
ax3.legend(fontsize=10)
ax3.grid(True, alpha=0.3)
    
# Plot 4: Distribution of estimates (run Monte Carlo multiple times)
ax4 = axes[1, 1]
    
n_trials = 1000
n_samples = 1000
trial_estimates = []
    
for _ in range(n_trials):
    x_trial = np.random.uniform(-1, 1, n_samples)
    y_trial = np.random.uniform(-1, 1, n_samples)
    inside_trial = np.sum((x_trial**2 + y_trial**2) <= 1)
    pi_trial = 4 * inside_trial / n_samples
    trial_estimates.append(pi_trial)
    
trial_estimates = np.array(trial_estimates)
    
ax4.hist(trial_estimates, bins=40, alpha=0.7, edgecolor='black', density=True, color='skyblue')
ax4.axvline(np.pi, color='red', linewidth=2.5, linestyle='--', label=f'True π = {np.pi:.6f}')
ax4.axvline(np.mean(trial_estimates), color='green', linewidth=2, 
                label=f'Mean estimate = {np.mean(trial_estimates):.6f}')
    
# Show standard error (spread of the estimates across trials)
se = np.std(trial_estimates)
ax4.axvspan(np.pi - se, np.pi + se, alpha=0.2, color='orange', label=f'±1 SE = {se:.6f}')
    
ax4.set_xlabel('Estimate of π', fontsize=12)
ax4.set_ylabel('Density', fontsize=12)
ax4.set_title(f'Distribution of Estimates\n{n_trials} trials with n={n_samples} each', 
                  fontsize=13, fontweight='bold')
ax4.legend(fontsize=10)
ax4.grid(True, alpha=0.3)
    
plt.tight_layout()
plt.show()

<div class="alert alert-warning">
<h4>💡 Key Insight: The Curse of Dimensionality... Broken!</h4>

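Grid-based numerical integration (like Simpson's rule) with $n$ points per axis requires $n^d$ function evaluations in $d$ dimensions. The Monte Carlo error rate, $O(1/\sqrt{n})$, does not depend on $d$, so the same $n$ samples can be used in any dimension!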

This is why Monte Carlo dominates in ML where we routinely work in 1000+ dimensional spaces.
</div>
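
To see the dimension-free rate in action, here is a minimal sketch that estimates the volume of the unit ball in $d$ dimensions with the same number of samples for every $d$ and compares it with the closed form $\pi^{d/2}/\Gamma(d/2+1)$. The function names and the seed are illustrative assumptions.

```python
import math
import numpy as np

def ball_volume_mc(d, n, rng):
    """Estimate the volume of the unit ball in R^d from n uniform points in [-1, 1]^d."""
    points = rng.uniform(-1.0, 1.0, size=(n, d))
    inside = (points**2).sum(axis=1) <= 1.0
    cube_volume = 2.0**d
    return cube_volume * inside.mean()

def ball_volume_exact(d):
    """Closed form: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

rng = np.random.default_rng(0)   # illustrative seed
n = 200_000                      # same sample size for every dimension
for d in (2, 3, 5):
    est = ball_volume_mc(d, n, rng)
    print(f"d={d}: MC={est:.4f}, exact={ball_volume_exact(d):.4f}")
```

One caveat: the rate stays $O(1/\sqrt{n})$, but the constant in front can grow with $d$. In this example the ball fills a shrinking fraction of the cube, so the relative error increases with dimension even though the sample size is unchanged.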

<div class="alert alert-primary">
<h4>🤖 ML Application: Dropout Training</h4>
<p>Dropout in neural networks is a Monte Carlo method!</p>
<ul>
<li><strong>Each forward pass</strong> samples a different sub-network (random dropout mask)</li>
<li><strong>Training</strong> approximates the expectation over all possible sub-networks</li>
<li><strong>Prediction</strong> can use Monte Carlo dropout (dropout kept active at inference, averaged over several forward passes) for uncertainty estimation</li>
</ul>
</div>
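
The sampling view of dropout can be illustrated on a deliberately tiny example. The sketch below is a toy, not a real network: it assumes a single linear unit with made-up weights `w`, a made-up input `x`, and inverted dropout applied to the inputs. Each random mask is one sampled "sub-network", and averaging over many masks is a Monte Carlo estimate of the expected output.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed

# Hypothetical toy "network": a single linear unit with fixed weights.
w = np.array([0.5, -1.0, 2.0, 0.25])
x = np.array([1.0, 2.0, 0.5, 4.0])
p_drop = 0.5          # dropout probability (assumed)
n_passes = 20_000     # number of stochastic forward passes

# Each row is one random dropout mask, i.e. one sampled sub-network.
keep = rng.random((n_passes, x.size)) >= p_drop
# Inverted dropout: rescale kept inputs so the expected input equals x.
outputs = (keep * x / (1.0 - p_drop)) @ w

mc_mean = outputs.mean()   # Monte Carlo estimate of the expected output
mc_std = outputs.std()     # spread across sub-networks (uncertainty proxy)
print(f"full network: {w @ x:.3f}, MC mean: {mc_mean:.3f}, MC std: {mc_std:.3f}")
```

For a linear unit the Monte Carlo mean matches the full-network output in expectation. In a real nonlinear network, where dropout is usually applied to hidden layers, the two generally differ, and the spread across passes is what MC dropout reports as uncertainty.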

<div class="alert alert-primary">
<h4>🤖 Why This Foundation Matters for Machine Learning</h4>

<p><strong>Every ML workflow follows this exact pattern:</strong></p>

<ol>
<li><strong>Population:</strong> All possible data points your model might see
   <ul>
   <li>Example: All possible images of cats</li>
   <li>Size: Infinite!</li>
   </ul>
</li>

<li><strong>Sample:</strong> Your training/test dataset
   <ul>
   <li>Example: ImageNet's 1M labeled images</li>
   <li>Size: Finite, manageable</li>
   </ul>
</li>

<li><strong>Distribution:</strong> The learned model
   <ul>
   <li>Example: Neural network learns P(cat|image)</li>
   <li>This is your estimate of the population distribution</li>
   </ul>
</li>

<li><strong>Sampling from Distribution:</strong> Generate synthetic data or make predictions
   <ul>
   <li>Example: GANs generate new images by sampling from learned distribution</li>
   <li>Example: Language models sample next words from P(word|context)</li>
   </ul>
</li>

<li><strong>Monte Carlo:</strong> Estimate model performance, uncertainty
   <ul>
   <li>Example: Dropout during inference (sample different sub-networks)</li>
   <li>Example: Bootstrap to estimate confidence intervals on accuracy</li>
   </ul>
</li>
</ol>

<p style="margin-top: 15px; padding: 12px; background-color: rgba(103, 58, 183, 0.1); border-left: 4px solid #673ab7;">
<strong>Bottom Line:</strong> Understanding sampling is understanding how ML learns from finite data and makes predictions about infinite possibilities!
</p>
</div>

## Numerical Integration using Monte Carlo

Consider the following **challenge**:

$$\int_a^b h(x)\,dx$$

becomes impractical to compute exactly when:
- No closed form exists
- High dimensional ($\int\int\int...\int$)
- Complex boundaries

> How to calculate it? 

We can use **Monte Carlo simulation** for that.

Let $X\sim U([a;b])$. Therefore, $X$ has the following probability density: $f_X(x) = \left\{\begin{array}{ll}\frac{1}{b-a} & \text{ if } a\leq x \leq b\\ 0 & \text{ otherwise} \end{array}\right.$

1. We can transform our integral of $h(x)$ into an expectation:
$$\int_a^b h(x)\,dx = \int_a^b h(x) \cdot 1\ dx = \int_a^b h(x) \cdot \frac{b-a}{b-a}\,dx = (b-a)\int_a^b h(x) \cdot \frac{1}{b-a}\,dx =\bigg[\text{since }f_X(x) = \frac{1}{b-a}\bigg]= (b-a)\int_a^b h(x) \cdot f_X(x)\,dx =[\text{def. of expectation}]= (b-a)\, E[h(X)]$$

**KEY INSIGHT**: The integral of $h$ over $[a;b]$ equals $(b-a)$ times the expected value of $h(X)$, where $X\sim U([a;b])$.

2. Using the Monte Carlo method, we estimate $E[h(X)]$ by sampling:

$$E[h(X)] \approx \frac{1}{n} \sum_{i=1}^n h(X_i) \text{ where } X_i \sim Uniform([a;b])$$

Hence:
$$\int_a^b h(x)\,dx \approx (b-a) \cdot \frac{1}{n}\sum_{i=1}^n h(X_i)$$

**Intuition**: $E[h(X)]$ tells us the AVERAGE HEIGHT of the function, $(b-a)$ is the WIDTH of the interval. Hence, $\text{Integral} = \text{WIDTH} \times \text{AVERAGE HEIGHT} = (b-a) E[h(X)]$

3. Variance and Standard Error

Let $Y = h(X)$, so we are estimating $E[Y]$. For independent draws $Y_i = h(X_i)$, the sample mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ has variance $Var(\bar{Y}) = Var(Y)/n = \frac{\sigma_Y^2}{n}$ where $\sigma_Y^2 = Var(h(X))$. 

Our estimator for the integral is $\hat{I} = (b-a)\bar{Y}$. 

Therefore: 

$$Var(\hat{I}) = Var((b-a)\bar{Y}) = [\text{variance of scaled r.v.}] = (b-a)^2 Var(\bar{Y}) = (b-a)^2 \frac{\sigma_Y^2}{n}$$

The standard error is:

$$SE(\hat{I}) = \sqrt{Var(\hat{I})} = \sqrt{(b-a)^2 \frac{\sigma_Y^2}{n}} = (b-a) \frac{\sigma_Y}{\sqrt{n}}$$

In practice, we estimate $\sigma_Y$ with the sample standard deviation $s$: $SE(\hat{I}) \approx (b-a) \cdot s/\sqrt{n}$ where $s = \sqrt{\frac{1}{n}\sum_{i=1}^n(Y_i - \bar{Y})^2}$ and $Y_i = h(X_i)$.

**Intuition**: if you double the interval width, the uncertainty also doubles (more area to estimate). Hence, SE scales with $(b-a)$.
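
The standard error formula can be checked empirically. The sketch below is one possible implementation under stated assumptions: the helper name `mc_integral`, the test integrand $\sin(x)$ on $[0, \pi]$ (exact integral 2), and the seed are all chosen here for illustration. It compares the formula-based SE from a single run with the observed spread of the estimate over many independent runs.

```python
import numpy as np

def mc_integral(h, a, b, n, rng):
    """Return (estimate, standard error) of the integral of h over [a, b]."""
    x = rng.uniform(a, b, size=n)
    y = h(x)
    estimate = (b - a) * y.mean()
    se = (b - a) * y.std() / np.sqrt(n)   # y.std() uses the 1/n convention
    return estimate, se

rng = np.random.default_rng(0)   # illustrative seed
a, b, n = 0.0, np.pi, 500

# One run: estimate and its formula-based standard error.
estimate, se = mc_integral(np.sin, a, b, n, rng)

# Many independent runs: the empirical spread should be close to the SE.
repeats = np.array([mc_integral(np.sin, a, b, n, rng)[0] for _ in range(2000)])
print(f"estimate = {estimate:.4f} (exact value is 2)")
print(f"formula SE = {se:.4f}, empirical std over repeats = {repeats.std():.4f}")
```

The two numbers should agree to within a few percent, which is what $SE(\hat{I}) = (b-a)\,\sigma_Y/\sqrt{n}$ predicts.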

<div class="alert-exercise">
<h5> QUESTION:</h5> 

Write a function that estimates the integral of $f$ using the Monte Carlo method. Test it on $f(x) = x^2$ from 0 to 1. Compare the result with the same function from 0 to 2.

```
def monte_carlo_integrate(f, a: float = 0, b: float = 1, n_samples: int = 10000) -> tuple[float, float]:
    """Estimates the integral of f from a to b using the Monte Carlo method.

    Args:
        f (function): function to be integrated
        a (float): lower bound of integration. Defaults to 0.
        b (float): upper bound of integration. Defaults to 1.
        n_samples (int, optional): number of samples to draw. Defaults to 10000.

    Returns:
        tuple[float, float]: integral estimate and its standard error
    """
```
</div>


In [None]:
# ANSWER
def monte_carlo_integrate(f, a:float=0, b:float=1, n_samples:int=10000) -> tuple[float, float]:
    """
    Estimate integral of f from a to b using Monte Carlo
    
    Theory: ∫f(x)dx ≈ (b-a) * E[f(X)] where X ~ Uniform(a,b)
    

    Args:
        f (function): function to be integrated
        a (float): min boundary of integration. Defaults to 0.
        b (float): max boundary of integration. Defaults to 1.
        n_samples (int, optional): Number of samples to consider. Defaults to 10000.

    Returns:
        integral estimate and standard error.
    """
    pass

In [None]:
# ANSWER


In [None]:
# ANSWER


<div class="alert example">
<h4>Calculated Example</h4>

Compute $I = \int_{0}^1 e^{-x^2} dx$ using the Monte Carlo method.

Note that this integral has no closed-form solution in terms of elementary functions (it can be written with the error function as $\frac{\sqrt{\pi}}{2}\,\mathrm{erf}(1)$). True value: $I \approx 0.746824132812427$ (computed numerically to high precision).

Consider the following 10 samples: 0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897, 0.42310646, 0.9807642,  0.68482974, 0.4809319,  0.39211752 (obtained using `numpy.random.uniform` with random seed 123). Perform a manual calculation of the value of the integral.

Then perform the calculation using Python.

</div>

<details>
<summary>Reveal solution</summary>

1. Transform integral to expectation:

$$I = \int_{a}^b f(x) dx = (b-a) E_X[f(X)] \text{ where } X\sim Uniform([a,b])$$

In our case: $a = 0, b = 1, f(x) = e^{-x^2}$, $X\sim Uniform([0,1])$
$$I = \int_{0}^1 e^{-x^2} dx = (1-0)\, E_X[e^{-X^2}] = E_X[e^{-X^2}] \text{ where } X\sim Uniform([0,1])$$

2. Monte Carlo Estimator:
    
$$\hat{I}_n = \frac{1}{n} \sum_{i=1}^n f(X_i) = \frac{1}{n} \sum_{i=1}^n e^{-X_i^2} \text{  where } X_i \sim Uniform([0,1])$$
    
3. Use samples

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum | 
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| $X_i$ | 0.69646919 | 0.28613933 | 0.22685145 | 0.55131477 | 0.71946897 | 0.42310646 | 0.9807642 |  0.68482974 | 0.4809319|  0.39211752 | 5.44199352975335 |
| $e^{-X_i^2}$ | 0.61565 | 0.92139 | 0.94984 | 0.73790 | 0.59593 | 0.83609 | 0.38217 | 0.62563 | 0.79350 | 0.85748 | 7.3155836853 |

Hence: $\hat{I}_n = \frac{1}{n} \sum_{i=1}^n e^{-X_i^2} = \frac{1}{10} \times 7.31558 = 0.731558$

</details>
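
As a quick check of the hand calculation, the sketch below recomputes the estimate from the ten listed samples (hard-coded, so it does not depend on the random seed) and compares it with the reference value written through the error function, $I = \frac{\sqrt{\pi}}{2}\,\mathrm{erf}(1)$. The variable names are illustrative.

```python
import math
import numpy as np

# The ten samples listed in the example (hard-coded, so no seed is needed).
samples = np.array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897,
                    0.42310646, 0.9807642, 0.68482974, 0.4809319, 0.39211752])

values = np.exp(-samples**2)   # f(X_i) = exp(-X_i^2)
estimate = values.mean()       # (b - a) = 1, so the mean is the estimate

# Reference value through the error function: I = sqrt(pi)/2 * erf(1).
reference = math.sqrt(math.pi) / 2 * math.erf(1.0)

print(f"estimate  = {estimate:.6f}")
print(f"reference = {reference:.6f}")
print(f"abs error = {abs(estimate - reference):.6f}")
```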

In [None]:
# define function
def f(x):
    return np.exp(-x**2)

In [None]:
# True value (computed with high precision)
from scipy.integrate import quad
true_value, _ = quad(f, 0, 1)
    
print(f"\nTrue value (high precision): I = {true_value:.15f}\n")

In [None]:
# Generate 10 random samples
np.random.seed(123)
n = 10
samples = np.random.uniform(0, 1, n)
    
print("Generated samples Xᵢ ~ Uniform(0,1):")
print(samples)
print(samples.sum())

In [None]:
# Detailed calculation for each sample
results = []
sum_f = 0
sum_f_squared = 0
    
print(f"{'i':<5} {'Xᵢ':<12} {'f(Xᵢ)=e^(-Xᵢ²)':<18} {'Running Sum':<15} {'Running Avg':<15}")

    
for i, x in enumerate(samples, 1):
    f_x = f(x)
    sum_f += f_x
    sum_f_squared += f_x**2
    running_avg = sum_f / i
        
    results.append({
            'i': i,
            'x': x,
            'f_x': f_x,
            'sum': sum_f,
            'avg': running_avg
        })
        
    print(f"{i:<5} {x:<12.8f} {f_x:<18.10f} {sum_f:<15.10f} {running_avg:<15.10f}")
    

In [None]:
# compute sample mean and the estimate
mean_f = sum_f / n
    
print(f"\nSample mean:")
print(f"  f̄ = (1/n) Σᵢ f(Xᵢ)")
print(f"    = (1/{n}) × {sum_f:.10f}")
print(f"    = {mean_f:.10f}")

integral_estimate = mean_f  # Since (b-a) = 1

print(f"\nMonte Carlo estimate of integral:")
print(f"  Î = (b-a) × f̄")
print(f"    = (1-0) × {integral_estimate:.10f}")
print(f"    = {integral_estimate:.10f}")

In [None]:
error = abs(integral_estimate - true_value)
relative_error = error / true_value * 100
    
print(f"\nTrue value:        I = {true_value:.15f}")
print(f"Estimated value:   Î = {integral_estimate:.15f}")
print(f"Absolute error:    |Î - I| = {error:.15f}")
print(f"Relative error:    {relative_error:.10f}%")
    

In [None]:
# Compute variance
variance = (sum_f_squared / n) - mean_f**2
std_dev = np.sqrt(variance)
se = std_dev / np.sqrt(n)
    
print(f"\nSample variance:")
print(f"  s² = (1/n)Σᵢ f(Xᵢ)² - f̄²")
print(f"     = (1/{n}) × {sum_f_squared:.10f} - ({mean_f:.10f})²")
print(f"     = {sum_f_squared/n:.10f} - {mean_f**2:.10f}")
print(f"     = {variance:.10f}")
    
print(f"\nSample standard deviation:")
print(f"  s = √s²")
print(f"    = √{variance:.10f}")
print(f"    = {std_dev:.10f}")
    
print(f"\nStandard error:")
print(f"  SE = (b-a) × s/√n")
print(f"     = (1-0) × {std_dev:.10f}/√{n}")
print(f"     = {std_dev:.10f}/{np.sqrt(n):.10f}")
print(f"     = {se:.10f}")

In [None]:
# 95% CI using normal approximation
z_95 = 1.96
ci_lower = integral_estimate - z_95 * se
ci_upper = integral_estimate + z_95 * se
# Check if true value is in CI
in_ci = ci_lower <= true_value <= ci_upper
print(f"\nIs true value in 95% CI? {in_ci}")
if in_ci:
    print(f"  ✓ YES: {ci_lower:.6f} ≤ {true_value:.6f} ≤ {ci_upper:.6f}")
else:
    print(f"  ✗ NO: True value outside [{ci_lower:.6f}, {ci_upper:.6f}]")

In [None]:
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
# Plot 1: Function and sample points
ax1 = axes[0, 0]
x_plot = np.linspace(0, 1, 1000)
y_plot = f(x_plot)
    
ax1.fill_between(x_plot, 0, y_plot, alpha=0.3, color='blue', label='Area = integral')
ax1.plot(x_plot, y_plot, 'b-', linewidth=2, label='f(x) = e^(-x²)')
    
# Plot sample points
for i, r in enumerate(results):
    ax1.plot([r['x'], r['x']], [0, r['f_x']], 'r--', alpha=0.5, linewidth=1)
    ax1.scatter([r['x']], [r['f_x']], c='red', s=100, zorder=5, edgecolor='black', linewidth=1.5)
    ax1.text(r['x'], r['f_x'] + 0.05, f"{i+1}", ha='center', fontsize=9, fontweight='bold')
    
# Show mean height
ax1.axhline(mean_f, color='green', linewidth=2, linestyle='--',
                label=f'Mean height: {mean_f:.4f}')
ax1.fill_between([0, 1], 0, mean_f, alpha=0.15, color='green')
    
ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('f(x)', fontsize=12)
ax1.set_title(f'Function and Sample Points\n'
                  f'Integral ≈ 1 × {mean_f:.4f} = {integral_estimate:.4f}',
                  fontsize=12, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-0.05, 1.05)
ax1.set_ylim(0, 1.1)
    
# Plot 2: Convergence of estimate
ax2 = axes[0, 1]
running_avgs = [r['avg'] for r in results]
ax2.plot(range(1, n+1), running_avgs, 'bo-', linewidth=2, markersize=8)
ax2.axhline(true_value, color='red', linewidth=2, linestyle='--',
                label=f'True value: {true_value:.6f}')
ax2.axhline(integral_estimate, color='green', linewidth=2, linestyle=':',
                label=f'Final estimate: {integral_estimate:.6f}')
    
ax2.set_xlabel('Number of samples (i)', fontsize=12)
ax2.set_ylabel('Running average', fontsize=12)
ax2.set_title('Convergence of Estimate', fontsize=12, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
    
# Plot 3: Distribution of f(Xi) values
ax3 = axes[1, 0]
f_vals = [r['f_x'] for r in results]
ax3.hist(f_vals, bins=7, alpha=0.7, edgecolor='black', color='skyblue')
ax3.axvline(mean_f, color='red', linewidth=2.5, linestyle='--',
                label=f'Mean: {mean_f:.4f}')
ax3.axvline(mean_f - std_dev, color='orange', linewidth=1.5, linestyle=':')
ax3.axvline(mean_f + std_dev, color='orange', linewidth=1.5, linestyle=':',
                label=f'±1 SD: {std_dev:.4f}')
    
ax3.set_xlabel('f(Xᵢ) values', fontsize=12)
ax3.set_ylabel('Frequency', fontsize=12)
ax3.set_title('Distribution of Sample Values', fontsize=12, fontweight='bold')
ax3.legend(fontsize=10)
ax3.grid(True, alpha=0.3, axis='y')
    
# Plot 4: Summary box
ax4 = axes[1, 1]
ax4.axis('off')
    
summary_text = f"""
    SUMMARY OF MONTE CARLO INTEGRATION
    {'='*50}
    
    Problem: ∫₀¹ e^(-x²) dx
    
    Method: Monte Carlo with n = {n} samples
    
    Results:
    ────────────────────────────────────────────
    Sample mean (f̄):        {mean_f:.10f}
    Sample std dev (s):      {std_dev:.10f}
    Standard error (SE):     {se:.10f}
    
    Integral estimate (Î):   {integral_estimate:.10f}
    95% Confidence Interval: [{ci_lower:.6f}, {ci_upper:.6f}]
    
    True value (I):          {true_value:.10f}
    Absolute error:          {error:.10f}
    Relative error:          {relative_error:.4f}%
    
    True value in CI?        {'✓ YES' if in_ci else '✗ NO'}
    
    {'='*50}
    
    Interpretation:
    • With just {n} samples, we estimated the integral
      to within {relative_error:.2f}% of the true value
    • Standard error of {se:.4f} tells us the
      typical error we expect
    • 95% CI means: if we repeated this {n}-sample
      experiment many times, ~95% of intervals
      would contain the true value
    • More samples would shrink the SE at rate O(1/√n)
    """
    
ax4.text(0.05, 0.95, summary_text, transform=ax4.transAxes,
            fontsize=9, verticalalignment='top', family='monospace',
            bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    
plt.tight_layout()
plt.show()
    

<div class="alert alert-summary">

<h4>Monte Carlo vs Traditional Numerical Integration Methods</h4>

<table style="width:100%; border-collapse: collapse; margin: 10px 0; font-size: 0.9em;">
<tr style="background-color: #f0f0f0;">
<th style="border: 1px solid #ddd; padding: 6px;">Method</th>
<th style="border: 1px solid #ddd; padding: 6px;">Convergence</th>
<th style="border: 1px solid #ddd; padding: 6px;">1D</th>
<th style="border: 1px solid #ddd; padding: 6px;">High-D</th>
<th style="border: 1px solid #ddd; padding: 6px;">Deterministic?</th>
<th style="border: 1px solid #ddd; padding: 6px;">Uncertainty Estimate?</th>
<th style="border: 1px solid #ddd; padding: 6px;">Best For</th>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 6px;">Rectangle</td>
<td style="border: 1px solid #ddd; padding: 6px;">O(1/n)</td>
<td style="border: 1px solid #ddd; padding: 6px;">Poor</td>
<td style="border: 1px solid #ddd; padding: 6px;">Infeasible</td>
<td style="border: 1px solid #ddd; padding: 8px;">✅ Yes</td>
<td style="border: 1px solid #ddd; padding: 8px;">❌ No</td>
<td style="border: 1px solid #ddd; padding: 6px;">Teaching</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 6px;">Trapezoidal</td>
<td style="border: 1px solid #ddd; padding: 6px;">O(1/n²)</td>
<td style="border: 1px solid #ddd; padding: 6px;">Good</td>
<td style="border: 1px solid #ddd; padding: 6px;">Infeasible</td>
<td style="border: 1px solid #ddd; padding: 8px;">✅ Yes</td>
<td style="border: 1px solid #ddd; padding: 8px;">❌ No</td>
<td style="border: 1px solid #ddd; padding: 6px;">1D smooth</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 6px;">Simpson</td>
<td style="border: 1px solid #ddd; padding: 6px;">O(1/n⁴)</td>
<td style="border: 1px solid #ddd; padding: 6px;">Excellent</td>
<td style="border: 1px solid #ddd; padding: 6px;">Infeasible</td>
<td style="border: 1px solid #ddd; padding: 8px;">✅ Yes</td>
<td style="border: 1px solid #ddd; padding: 8px;">❌ No</td>
<td style="border: 1px solid #ddd; padding: 6px;">1D very smooth</td>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="border: 1px solid #ddd; padding: 6px;"><strong>Monte Carlo</strong></td>
<td style="border: 1px solid #ddd; padding: 6px;"><strong>O(1/√n)</strong></td>
<td style="border: 1px solid #ddd; padding: 6px;"><strong>Moderate</strong></td>
<td style="border: 1px solid #ddd; padding: 6px;"><strong>✅ Feasible</strong></td>
<td style="border: 1px solid #ddd; padding: 8px;">❌ No (random)</td>
<td style="border: 1px solid #ddd; padding: 8px;">✅ Yes (SE)</td>
<td style="border: 1px solid #ddd; padding: 6px;"><strong>High-D / ML</strong></td>
</tr>
</table>

**Monte Carlo and the Curse of Dimensionality:**

- Dimension 1-3:<br>
    ✓ Traditional methods (Simpson) are faster and more accurate<br>
    ✓ Use them if you can!
    
- Dimension 4-5:<br>
    ~ Traditional methods still feasible but getting expensive<br>
    ~ Monte Carlo becomes competitive
    
- Dimension 10+:<br>
    ✗ Traditional methods INFEASIBLE<br>
    ✓ Monte Carlo is the ONLY practical option
    
- Dimension 100 (typical in ML):<br>
    • Simpson would need 10^100 evaluations<br>
    • Monte Carlo still needs only 10,000 evaluations

<p><strong>Key Insight:</strong></p>
<ul>
<li>Traditional methods: O(10^d) evaluations in d dimensions (10 grid points per axis) → exponential curse</li>
<li>Monte Carlo: O(n) evaluations regardless of d → dimension-free rate!</li>
<li>In ML (d=100+), Monte Carlo is often the only practical option</li>
</ul>

<p style="margin-top: 15px; padding: 12px; background-color: rgba(33, 150, 243, 0.1); border-left: 4px solid #2196f3; font-weight: bold;">
This is why so many modern ML algorithms rely on Monte Carlo methods!
</p>
</div>
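
The trade-off in the table can be illustrated with a small experiment. The sketch below uses a hand-written composite Simpson rule (the helper `simpson` is defined here for illustration and is not SciPy's routine) and a plain Monte Carlo estimate, each with 11 function evaluations of $e^{-x^2}$ on $[0, 1]$, then prints how many evaluations a 10-points-per-axis grid would need as the dimension grows.

```python
import numpy as np

def f(x):
    return np.exp(-x**2)

def simpson(func, a, b, n_intervals):
    """Composite Simpson's rule; n_intervals must be even."""
    if n_intervals % 2:
        raise ValueError("n_intervals must be even")
    x = np.linspace(a, b, n_intervals + 1)
    y = func(x)
    h = (b - a) / n_intervals
    # Weights 1, 4, 2, 4, ..., 2, 4, 1 on the grid values.
    return h / 3 * (y[0] + 4 * y[1:-1:2].sum() + 2 * y[2:-1:2].sum() + y[-1])

reference = 0.746824132812427   # value quoted earlier in this notebook
simpson_error = abs(simpson(f, 0.0, 1.0, 10) - reference)

rng = np.random.default_rng(0)  # illustrative seed
mc_estimate = f(rng.uniform(0.0, 1.0, size=11)).mean()
mc_error = abs(mc_estimate - reference)

print(f"1D, 11 evaluations: Simpson error = {simpson_error:.2e}, MC error = {mc_error:.2e}")

# Cost of a tensor-product grid with 10 points per axis in d dimensions.
for d in (1, 3, 10, 100):
    print(f"d={d:>3}: grid needs 10^{d} evaluations")
```

In one dimension Simpson's rule is far more accurate for the same budget. The grid cost, however, is multiplied by 10 with every added dimension, while the Monte Carlo budget can stay fixed.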

## Bootstrapping or Bootstrap Sampling

You trained a model on 1000 samples. How do you estimate how much your model's accuracy would vary if you collected the data again?

This is what we mean by **uncertainty**.

- ✓ Low uncertainty → Different samples give similar estimates (stable)
- ✗ High uncertainty → Different samples give very different estimates (unstable)

**The Dilemma:**

- Collecting new data: expensive, slow, sometimes impossible
- Mathematical formulas: Often don't exist for complex statistics
- **Bootstrap**: Treat your sample as the population, i.e. simulate "new samples" by resampling from what we have

In [None]:
# demo uncertainty
std_means = visualize_uncertainty_concept()

<div class="alert alert-success">
<h4>Definition: Bootstrap Sampling</h4>

**Bootstrap** is a resampling method that estimates the sampling distribution of a statistic by:


- Sampling <em>with replacement</em> from your original data
- Computing the statistic on each resample
- Using the distribution of these statistics to estimate uncertainty


**Key Idea:** The relationship between sample and population is similar to the relationship between bootstrap samples and original sample.

Invented by: [Bradley Efron](https://en.wikipedia.org/wiki/Bradley_Efron) (1979)

**Procedure:**

Step 1: Start with our ONE sample
    
Step 2: Create a "new sample" by:
- Randomly selecting n values FROM our sample
- Sampling WITH REPLACEMENT
- This gives us a "bootstrap sample"
    
Step 3: Compute the statistic on this bootstrap sample (e.g., mean, median, whatever we're estimating)
    
Step 4: Repeat steps 2-3 many times (e.g., 1000 times)
    
Step 5: Look at the DISTRIBUTION of bootstrap estimates
- The spread of this distribution = UNCERTAINTY
- Shows how much estimate varies across "fake samples"

*Interpretation*: If we COULD collect new samples of size $n$, our estimates would typically vary by about $\pm$ STD of bootstrap distribution around the value of bootstrap estimate (mean). 

</div>
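
The five steps of the procedure can be sketched as follows. This is a minimal illustration, not the `bootstrap_demo` helper used in this notebook: the function name `bootstrap_statistic`, the seed, and the added 95% percentile interval are assumptions made here. The textbook formula $s/\sqrt{n}$ is included only as a sanity check, since for the mean a closed form happens to exist.

```python
import numpy as np

def bootstrap_statistic(data, statistic, n_bootstrap, rng):
    """Return the statistic computed on n_bootstrap resamples of data."""
    n = len(data)
    estimates = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        resample = rng.choice(data, size=n, replace=True)   # Step 2
        estimates[b] = statistic(resample)                  # Step 3
    return estimates

rng = np.random.default_rng(0)                      # illustrative seed
data = rng.exponential(scale=2.0, size=100)         # Step 1: our one sample

boot_means = bootstrap_statistic(data, np.mean, 2000, rng)   # Step 4

# Step 5: the spread of the bootstrap distribution is the uncertainty.
boot_se = boot_means.std()
formula_se = data.std(ddof=1) / np.sqrt(len(data))  # textbook SE of the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap SE = {boot_se:.4f}, formula SE = {formula_se:.4f}")
print(f"95% percentile interval for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```

Swapping `np.mean` for `np.median` (or any other function of a sample) requires no other change, which is the main appeal of the method.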

In [None]:
# demo
np.random.seed(42)
sample_data = np.random.exponential(scale=2, size=100)
bootstrap_stats, ci = bootstrap_demo(sample_data, np.mean, n_bootstrap=2000)

In [None]:
# demo explaining uncertainty
bootstrap_means, bootstrap_std = bootstrap_uncertainty_explanation()

<div class="alert alert-warning">

Bootstrap gives us a DISTRIBUTION of estimates.

The SPREAD of this distribution tells us: "How different would my estimate be with a different sample?"

<h4>💡 Key Insight: Why Sampling WITH Replacement?</h4>

<p>Without replacement → drawing n of the n observations returns the same sample every time, only reordered (useless!)</p>
<p>With replacement → each bootstrap sample is different, mimicking the variability of collecting new data</p>
<p>On average, each bootstrap sample contains ~63.2% unique observations from the original sample (some appear multiple times, some don't appear at all).</p>
</div>    
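
The 63.2% figure comes from the probability that a given observation is picked at least once in $n$ draws, $1 - (1 - 1/n)^n$, which tends to $1 - 1/e \approx 0.632$. A short simulation (sample size, number of repetitions, and seed are illustrative choices) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n = 1000            # size of the original sample
n_bootstrap = 500   # number of bootstrap samples

unique_fractions = np.empty(n_bootstrap)
for b in range(n_bootstrap):
    indices = rng.integers(0, n, size=n)                 # resample indices with replacement
    unique_fractions[b] = np.unique(indices).size / n    # share of distinct original points

expected = 1 - (1 - 1 / n) ** n    # exact expectation; tends to 1 - 1/e as n grows
print(f"simulated: {unique_fractions.mean():.4f}")
print(f"expected:  {expected:.4f}  (limit 1 - 1/e = {1 - np.exp(-1):.4f})")
```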


<div class="alert alert-primary">
<h4>🤖 ML Application: Model Performance Uncertainty</h4>
<p><strong>Scenario:</strong> You have 1000 test samples with 87% accuracy.</p>
<p><strong>Bootstrap Approach:</strong></p>
<ol>
<li>Resample your test set 1000 times (with replacement)</li>
<li>Compute accuracy on each bootstrap sample</li>
<li>Get distribution of accuracy → confidence interval</li>
</ol>
<p><strong>Used in:</strong> model evaluation reports, Kaggle competitions, production ML monitoring. In scikit-learn, <code>sklearn.utils.resample</code> can draw the bootstrap samples.</p>
</div>

<div class="alert alert-summary">
<h4>🏆 Production Best Practices</h4>
<ol>
<li><strong>Always report confidence intervals</strong> alongside point estimates</li>
<li><strong>Use bootstrap for model comparison</strong> before deployment</li>
<li><strong>Monitor CI width</strong> as a model health metric (widening = degradation)</li>
<li><strong>Combine methods:</strong> Bootstrap for data uncertainty + MC Dropout for model uncertainty</li>
<li><strong>Cache bootstrap samples</strong> for faster re-evaluation</li>
</ol>
</div>

## Return to the Opening Challenge

The question we are about to ask is:

> Can we trust the 87% accuracy for a $50M decision?
<br>

Let's generate the data to simulate the scenario:

In [None]:
def generate_data(n_test=1000, observed_accuracy=0.87):
    # Simulate the scenario
    np.random.seed(42)
    n_correct = int(round(observed_accuracy * n_test))  # round first to avoid float truncation
    # Calculate how many predictions should be incorrect
    n_incorrect = n_test - n_correct

    
    # Create synthetic data matching the scenario
    y_true = np.random.randint(0, 2, size=n_test)
    y_pred = y_true.copy()  # start from perfect predictions
    # Randomly select the indices whose predictions will be flipped
    incorrect_indices = np.random.choice(n_test, size=n_incorrect, replace=False)
    # Flip exactly n_incorrect predictions so the accuracy matches observed_accuracy
    y_pred[incorrect_indices] = 1 - y_pred[incorrect_indices]
    # calculate accuracy
    actual_acc = accuracy_score(y_true, y_pred)
        
    print(f"""
    Test set size: {n_test} samples
    Correct predictions: {n_correct}
    Observed accuracy: {actual_acc:.1%}
    
    But this is just ONE test set!
    Question: Would we get similar accuracy with DIFFERENT test data?
    """)
    
    return y_true, y_pred

In [None]:
# Simulate the scenario
n_test = 1000
observed_accuracy = 0.87
n_correct = int(round(observed_accuracy * n_test))
y_true, y_pred = generate_data(n_test=n_test, observed_accuracy=observed_accuracy)


METHOD 1: Standard Error (Quick Assessment)

- For accuracy, we can think of each prediction as $Bernoulli(p)$. 
- For binary classification (correct/incorrect), we can use the formula of Standard error of a proportion: $SE = \sqrt{p(1-p)/n}$ where $p$ = observed accuracy and $n$ = test set size.

$$SE = \sqrt{p(1-p)/n} = \sqrt{0.87 (1 - 0.87) / 1000} = \sqrt{0.87 \times 0.13 / 1000} = \sqrt{0.1131 / 1000} \approx 0.01063$$

*Interpretation*:
- Our estimate: 87% accuracy
- Typical variation: $\pm SE = \pm 1.06\% \approx \pm 1.1\%$
- Range of typical accuracy: $[\text{Observed accuracy} - SE, \text{Observed accuracy} + SE] = [87 - 1.1, 87 + 1.1] = [85.9\%, 88.1\%]$
- Compared to 85% accuracy threshold from the problem statement, even our lower end 85.9% is above it $\Rightarrow$ green light
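
A range of $\pm 1$ SE covers only about 68% of the sampling distribution under the normal approximation, so it is worth comparing it with the approximate 95% interval ($\pm 1.96$ SE, the same $z$ value used in the integration example). The sketch below is a worked calculation with the numbers from the scenario; the variable names are illustrative.

```python
import math

p_hat = 0.87       # observed accuracy
n = 1000           # test set size
threshold = 0.85   # accuracy threshold from the problem statement

se = math.sqrt(p_hat * (1 - p_hat) / n)

# +/- 1 SE covers roughly 68% of the sampling distribution;
# +/- 1.96 SE gives an approximate 95% interval (normal approximation).
for label, z in (("±1 SE", 1.0), ("95%", 1.96)):
    low, high = p_hat - z * se, p_hat + z * se
    print(f"{label:>6}: [{low:.3f}, {high:.3f}]  lower bound above threshold: {low > threshold}")
```

Under these assumptions the $\pm 1$ SE range stays above 85%, while the lower end of the 95% interval falls just below it (about 84.9%). The quick check is therefore encouraging rather than conclusive, which motivates a closer look at the full distribution.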


In [None]:
se_formula = np.sqrt(observed_accuracy * (1 - observed_accuracy) / n_test)
    
print(f"""
    For binary classification (correct/incorrect), we can use the formula:
    
        SE = √[p(1-p)/n]
        
    where p = observed accuracy = {observed_accuracy}
          n = test set size = {n_test}
    
    Calculation:
        SE = √[{observed_accuracy} × {1-observed_accuracy:.2f} / {n_test}]
           = √[{observed_accuracy * (1-observed_accuracy):.4f} / {n_test}]
           = √{observed_accuracy * (1-observed_accuracy) / n_test:.6f}
           = {se_formula:.4f}
    
    INTERPRETATION:
    {'─'*70}
    • Our estimate: {observed_accuracy:.0%} accuracy
    • Typical variation: ±{se_formula:.1%} (that's ±{se_formula*100:.1f} percentage points)
    
    • If we tested on DIFFERENT data samples:
      - We'd typically get accuracy between:
        {observed_accuracy - se_formula:.1%} and {observed_accuracy + se_formula:.1%}
      - That's roughly {(observed_accuracy - se_formula)*100:.1f}% to {(observed_accuracy + se_formula)*100:.1f}%
    
    First concern: Even the lower end ({(observed_accuracy - se_formula)*100:.1f}%) 
                   is above our 85% threshold! ✓
    """)

METHOD 2: Bootstrap (Detailed Assessment)

Let's use bootstrap to really understand the variability:
    
1. We have 1,000 test predictions (correct/incorrect for each)
2. Resample these 1,000 predictions WITH replacement
3. Compute accuracy on each bootstrap sample
4. Repeat 10,000 times to see the full distribution of accuracies
5. Analyse the distribution
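
Because accuracy is just the mean of 0/1 "correct" indicators, the resampling steps above can be vectorised, and the number of correct predictions in a bootstrap sample follows a $Binomial(n, \hat{p})$ distribution. The sketch below shows both routes side by side. It is a minimal illustration under stated assumptions: it rebuilds the indicator vector from the scenario's counts (870 correct out of 1,000) instead of using `y_true`/`y_pred`, and it uses 2,000 resamples rather than 10,000 to keep the index matrix small.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n_test, n_correct, n_bootstrap = 1000, 870, 2_000

# correct[i] = 1 if prediction i was right, else 0.
correct = np.zeros(n_test, dtype=int)
correct[:n_correct] = 1

# Explicit resampling, vectorised: one row of indices per bootstrap sample.
indices = rng.integers(0, n_test, size=(n_bootstrap, n_test))
boot_acc = correct[indices].mean(axis=1)

# Equivalent shortcut: the count of correct items in a resample is Binomial(n, p_hat).
binom_acc = rng.binomial(n_test, n_correct / n_test, size=n_bootstrap) / n_test

print(f"resampling: mean={boot_acc.mean():.4f}, std={boot_acc.std():.4f}")
print(f"binomial:   mean={binom_acc.mean():.4f}, std={binom_acc.std():.4f}")
```

Both standard deviations should land close to the formula value $\sqrt{0.87 \times 0.13 / 1000} \approx 0.0106$. The shortcut only works for statistics that are simple proportions; for metrics such as F1 or AUC the explicit resampling is still needed.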

In [None]:
# Bootstrap analysis
n_bootstrap = 10000
bootstrap_accuracies = []
    
for _ in range(n_bootstrap):
    # Resample indices with replacement
    indices = np.random.choice(n_test, size=n_test, replace=True)
    # calculate accuracy score
    boot_acc = accuracy_score(y_true[indices], y_pred[indices])
    bootstrap_accuracies.append(boot_acc)
    
bootstrap_accuracies = np.array(bootstrap_accuracies)

# Analyze results
bootstrap_mean = np.mean(bootstrap_accuracies)
bootstrap_std = np.std(bootstrap_accuracies)
bootstrap_min = np.min(bootstrap_accuracies)
bootstrap_max = np.max(bootstrap_accuracies)
    
# Key percentiles
p05 = np.percentile(bootstrap_accuracies, 5)
p25 = np.percentile(bootstrap_accuracies, 25)
p75 = np.percentile(bootstrap_accuracies, 75)
p95 = np.percentile(bootstrap_accuracies, 95)

# How many bootstrap samples are above 85%?
above_85 = np.mean(bootstrap_accuracies >= 0.85) * 100
    
print(f"""
    Distribution of Bootstrap Accuracies:
    {'─'*70}
    
    Original accuracy:           {observed_accuracy:.1%}
    
    Bootstrap statistics:
    • Average:                   {bootstrap_mean:.1%}
    • Standard deviation:        {bootstrap_std:.2%} (this is the uncertainty!)
    • Minimum seen:              {bootstrap_min:.1%}
    • Maximum seen:              {bootstrap_max:.1%}
    
    Percentiles (how values are spread):
    • Bottom 5%:                 {p05:.1%}
    • Bottom 25%:                {p25:.1%}
    • Top 25%:                   {p75:.1%}
    • Top 5%:                    {p95:.1%}
    
    Middle 50% of estimates:     [{p25:.1%}, {p75:.1%}]
    Middle 90% of estimates:     [{p05:.1%}, {p95:.1%}]
    
    KEY FINDING:
    {'─'*70}
    • {above_85:.1f}% of bootstrap samples had accuracy ≥ 85%
    • Only {100-above_85:.1f}% were below 85%
    """)

In [None]:
# Create comprehensive visualization
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
    
# Plot 1: Bootstrap distribution
ax1 = fig.add_subplot(gs[0, :])
    
ax1.hist(bootstrap_accuracies, bins=50, alpha=0.7, edgecolor='black', 
             color='skyblue', density=False)
    
ax1.axvline(observed_accuracy, color='red', linewidth=3, linestyle='-',
               label=f'Our accuracy: {observed_accuracy:.1%}', zorder=5)
ax1.axvline(0.85, color='orange', linewidth=3, linestyle='--',
               label=f'Threshold: 85%', zorder=5)
    
# Shade regions
ax1.axvspan(bootstrap_min, 0.85, alpha=0.15, color='red', 
               label=f'Below 85%: {100-above_85:.1f}%')
ax1.axvspan(0.85, bootstrap_max, alpha=0.15, color='green',
               label=f'Above 85%: {above_85:.1f}%')
    
# Mark percentiles
ax1.axvline(p05, color='purple', linewidth=1.5, linestyle=':', alpha=0.7)
ax1.axvline(p95, color='purple', linewidth=1.5, linestyle=':', alpha=0.7)
ax1.text(p05, ax1.get_ylim()[1]*0.9, f'5th: {p05:.1%}', 
            ha='right', fontsize=9, rotation=90)
ax1.text(p95, ax1.get_ylim()[1]*0.9, f'95th: {p95:.1%}', 
            ha='left', fontsize=9, rotation=90)
    
ax1.set_xlabel('Accuracy', fontsize=12)
ax1.set_ylabel(f'Frequency (out of {len(bootstrap_accuracies):,} bootstrap samples)', fontsize=12)
ax1.set_title(f'Bootstrap Distribution: "How much does accuracy vary?"\n'
                 f'Standard deviation: {bootstrap_std:.2%} (the uncertainty measure)',
                 fontsize=13, fontweight='bold')
ax1.legend(fontsize=11, loc='upper left')
ax1.grid(True, alpha=0.3, axis='y')
    
# Plot 2: Cumulative distribution
ax2 = fig.add_subplot(gs[1, 0])
    
sorted_accs = np.sort(bootstrap_accuracies)
cumulative = np.arange(1, len(sorted_accs) + 1) / len(sorted_accs) * 100
    
ax2.plot(sorted_accs, cumulative, linewidth=2.5, color='blue')
ax2.axvline(0.85, color='orange', linewidth=3, linestyle='--', 
               label=f'85% threshold')
ax2.axhline(50, color='gray', linewidth=1, linestyle=':', alpha=0.5)
    
# Mark the 85% line
pct_below_85 = 100 - above_85
ax2.plot([0.85, 0.85], [0, pct_below_85], 'r--', linewidth=2, alpha=0.5)
ax2.plot([0.82, 0.85], [pct_below_85, pct_below_85], 'r--', linewidth=2, alpha=0.5)
ax2.text(0.848, pct_below_85/2, f'{pct_below_85:.1f}%\nbelow', 
            fontsize=10, ha='right', color='red', fontweight='bold')
    
ax2.set_xlabel('Accuracy', fontsize=12)
ax2.set_ylabel('Cumulative Percentage', fontsize=12)
ax2.set_title(f'Cumulative Distribution\n'
                 f'{above_85:.1f}% of estimates are ≥ 85%',
                 fontsize=12, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0.82, 0.92)
    
# Plot 3: Comparison of methods
ax3 = fig.add_subplot(gs[1, 1])
ax3.axis('off')
    
comparison_text = f"""
    COMPARING THE TWO METHODS
    {'='*50}
    
    Standard Error Method:
    {'─'*50}
    • SE = {se_formula:.4f} ({se_formula*100:.2f} percentage points)
    • ±1 SE range: {observed_accuracy - se_formula:.1%} to {observed_accuracy + se_formula:.1%}
    • Quick, formula-based
    
    Bootstrap Method:
    {'─'*50}
    • SD = {bootstrap_std:.4f} ({bootstrap_std*100:.2f} percentage points)
    • Middle 90%: {p05:.1%} to {p95:.1%}
    • Full distribution, no formula needed
    
    Agreement:
    {'─'*50}
    SE ≈ {se_formula*100:.2f} pp vs Bootstrap SD ≈ {bootstrap_std*100:.2f} pp
    
    Difference: {abs(se_formula - bootstrap_std)*100:.2f} pp
    
    ✓ Very similar! Both methods agree on 
      the uncertainty level.
    
    {'='*50}
    
    What This Tells Us:
    {'─'*50}
    
    1. Our 87% estimate is STABLE
       (standard deviation ~{bootstrap_std*100:.1f} pp)
    
    2. With different test data, we'd 
       typically get accuracy between
       {observed_accuracy - bootstrap_std:.1%} and {observed_accuracy + bootstrap_std:.1%}
    
    3. {above_85:.1f}% of possible outcomes are 
       above our 85% threshold
    
    4. The risk of being below 85% is 
       only {100-above_85:.1f}%
    """
    
ax3.text(0.05, 0.95, comparison_text, transform=ax3.transAxes,
            fontsize=9, verticalalignment='top', family='monospace',
            bbox=dict(boxstyle='round', facecolor='lightcyan', alpha=0.8))
    
# Plot 4: Decision framework
ax4 = fig.add_subplot(gs[2, :])
ax4.axis('off')
    
# Decision
decision_color = 'lightgreen' if above_85 > 95 else 'lightyellow' if above_85 > 90 else 'lightcoral'
    
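if above_85 > 95:
    decision = "GREEN LIGHT - PROCEED"
    reasoning = f"Very high confidence ({above_85:.1f}% of estimates ≥ 85%)"
    risk = "Low"
    recommendation = "Safe to deploy with $50M investment"
elif above_85 > 90:
    decision = "YELLOW LIGHT - PROCEED WITH CAUTION"
    reasoning = f"Good confidence ({above_85:.1f}% of estimates ≥ 85%)"
    risk = "Moderate"
    recommendation = "Proceed but with monitoring and contingency plan"
else:
    decision = "RED LIGHT - DO NOT PROCEED"
    reasoning = f"Insufficient confidence (only {above_85:.1f}% of estimates ≥ 85%)"
    risk = "High"
    if observed_accuracy > 0.85:
        # Normal approximation: smallest n for which the 5th percentile
        # (one-sided z = 1.645) of the accuracy estimate clears 85%
        n_needed = int(np.ceil(1.645**2 * observed_accuracy * (1 - observed_accuracy)
                               / (observed_accuracy - 0.85)**2))
        recommendation = f"Need ~{max(n_needed - n_test, 0):,} more test samples"
    else:
        # More data narrows the uncertainty but cannot lift the estimate itself
        recommendation = "Observed accuracy is not above 85%: improve the model before retesting"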
    
decision_text = f"""
    
    EXECUTIVE DECISION FRAMEWORK
    {'═'*80}
    
    INVESTMENT: $50 Million deployment
    REQUIREMENT: Accuracy must be ≥ 85% to justify investment
    
    {'═'*80}
    YOUR DATA:
    {'─'*80}
    • Test set size: {n_test:,} samples
    • Observed accuracy: {observed_accuracy:.1%}
    • Uncertainty (SD): ±{bootstrap_std:.2%}
    • Estimates ≥ 85%: {above_85:.1f}%
    
    {'═'*80}
    UNCERTAINTY ANALYSIS:
    {'─'*80}
    
    Best case scenario (95th percentile):    {p95:.1%}
    Typical high estimate:                   {p75:.1%}
    Your observed accuracy:                  {observed_accuracy:.1%}
    Typical low estimate:                    {p25:.1%}
    Worst case scenario (5th percentile):    {p05:.1%}
    
    Critical threshold:                      85.0%
    
    Probability accuracy ≥ 85%:              {above_85:.1f}%
    Probability accuracy < 85%:              {100-above_85:.1f}%
    
    {'═'*80}
    DECISION: {decision}
    {'─'*80}
    
    Risk Level: {risk}
    Reasoning: {reasoning}
    
    Recommendation:
    {recommendation}
    
    {'─'*80}
    
    Why this decision?
    
    • The uncertainty analysis shows that {above_85:.1f}% of bootstrap samples 
      (representing possible outcomes with different test data) achieve ≥ 85% accuracy
    
    • Your estimate of {observed_accuracy:.1%} would typically vary by ±{bootstrap_std*100:.1f} percentage points
      if you tested on different data
    
    • The 5th percentile ({p05:.1%}) {"is above" if p05 >= 0.85 else "is close to but below"} your 85% threshold
    
    • With {n_test:,} test samples, you have {"sufficient" if above_85 > 95 else "moderate" if above_85 > 90 else "insufficient"} 
      evidence to support the $50M investment
    """
    
ax4.text(0.02, 0.98, decision_text, transform=ax4.transAxes,
            fontsize=9, verticalalignment='top', family='monospace',
            bbox=dict(boxstyle='round', facecolor=decision_color, alpha=0.7))
    
plt.tight_layout()
plt.show()

So, the answer: GREEN LIGHT - PROCEED

- Even in a pessimistic scenario (the 5th percentile of the bootstrap distribution), accuracy stays above the 85% threshold.

<div class="alert alert-warning">
<h4>💡 What Your Intuition Missed</h4>
<p><strong>Initial Guess:</strong> Most people say "87% ± 2%" or just trust the 87%.</p>
<p><strong>Reality:</strong> With 1,000 test samples and 87% observed accuracy:</p>

1. Uncertainty (how much the estimate would vary): ±1.1 percentage points
    - With different test data, we'd typically get accuracy between 85.9% and 88.1%
2. Risk assessment: 96.9% of bootstrap samples meet the 85% threshold
    - Only 3.1% fall short
3. Pessimistic scenario: 85.2% (5th percentile)

<p><strong>Lesson:</strong> Sample size and uncertainty quantification are crucial for business decisions.</p>
</div>
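
The figures above can be roughly reproduced with a short sketch. It assumes a hypothetical test set of 1,000 predictions, 870 of them correct (the array `correct` is made up for illustration), and uses its own random generator, so the exact digits will differ slightly from those quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_test = 1000        # test set size from the scenario
n_correct = 870      # 87% observed accuracy
threshold = 0.85

# Hypothetical per-example outcomes: 1 = correct prediction, 0 = error
correct = np.zeros(n_test, dtype=int)
correct[:n_correct] = 1
observed_accuracy = correct.mean()

# Formula-based standard error of a proportion
se_formula = np.sqrt(observed_accuracy * (1 - observed_accuracy) / n_test)

# Bootstrap: resample the outcomes with replacement and recompute accuracy
n_boot = 10_000
boot_acc = np.array([
    rng.choice(correct, size=n_test, replace=True).mean()
    for _ in range(n_boot)
])

p05 = np.percentile(boot_acc, 5)
share_above = np.mean(boot_acc >= threshold)

print(f"SE (formula):          {se_formula:.4f}")
print(f"SD (bootstrap):        {boot_acc.std(ddof=1):.4f}")
print(f"5th percentile:        {p05:.1%}")
print(f"Share of samples >= 85%: {share_above:.1%}")
```

Both the formula and the bootstrap should give an uncertainty of roughly one percentage point, which is where the "±1.1" above comes from.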

## Common Mistakes

<div class="alert alert-danger">

<h4>⚠️ Common Pitfalls</h4>
<ul>
<li><strong>Bootstrap without replacement:</strong> Drawing n items from n without replacement only reshuffles the data, so every "resample" gives the same statistic. Useless!</li>
<li><strong>Too few iterations:</strong> Use at least 1,000 resamples for stable CI estimates</li>
<li><strong>Ignoring dependence:</strong> The standard bootstrap assumes independent observations</li>
<li><strong>Confusing SE with CI:</strong> SE is the standard deviation of the estimate (a single number); a CI is an interval of plausible values (e.g. the 2.5th to 97.5th percentiles)</li>
<li><strong>MC without convergence check:</strong> Always verify that the estimation error is acceptable</li>
</ul>

</div>
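
Three of these pitfalls can be seen directly in a small experiment. The sketch below is illustrative only: the data (`sample`) is synthetic, and the helper `resampled_means` is a hypothetical name introduced here, not part of the course's `montecarlo` module.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Hypothetical data: 200 draws from a skewed (exponential) distribution
sample = rng.exponential(scale=2.0, size=200)

def resampled_means(data, n_resamples, replace):
    """Return the mean of each full-size resample of `data`."""
    return np.array([
        rng.choice(data, size=len(data), replace=replace).mean()
        for _ in range(n_resamples)
    ])

# Pitfall: without replacement, every resample is a permutation of the data,
# so the mean never changes (up to floating-point rounding)
sd_without = resampled_means(sample, 1000, replace=False).std()
boot_means = resampled_means(sample, 1000, replace=True)
sd_with = boot_means.std(ddof=1)
print(f"SD of means without replacement: {sd_without:.2e}")
print(f"SD of means with replacement:    {sd_with:.4f}")

# Pitfall: SE is one number, a CI is an interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"SE = {sd_with:.4f}, 95% percentile CI = [{ci_low:.3f}, {ci_high:.3f}]")

# Pitfall: too few iterations make the CI endpoints unstable.
# Repeat the whole bootstrap 20 times and see how much the lower bound moves.
endpoint_spread = {}
for n_resamples in (50, 1000):
    lower_bounds = [
        np.percentile(resampled_means(sample, n_resamples, replace=True), 2.5)
        for _ in range(20)
    ]
    endpoint_spread[n_resamples] = np.std(lower_bounds)
    print(f"{n_resamples:>5} resamples: SD of lower CI bound = "
          f"{endpoint_spread[n_resamples]:.4f}")
```

With only 50 resamples the lower CI bound should move noticeably from run to run, while with 1,000 it settles down.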

## ML Application Summary

<div class="alert alert-summary">
<h4>🤖 ML Applications Summary</h4>

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #f0f0f0;">
<th style="border: 1px solid #ddd; padding: 8px;">Method</th>
<th style="border: 1px solid #ddd; padding: 8px;">ML Application</th>
<th style="border: 1px solid #ddd; padding: 8px;">When to Use</th>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 8px;">Monte Carlo</td>
<td style="border: 1px solid #ddd; padding: 8px;">Policy gradients, dropout, Bayesian inference</td>
<td style="border: 1px solid #ddd; padding: 8px;">Computing expectations, high-dim integrals</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 8px;">Bootstrap</td>
<td style="border: 1px solid #ddd; padding: 8px;">Model evaluation, CI for metrics, A/B testing</td>
<td style="border: 1px solid #ddd; padding: 8px;">Estimating uncertainty without distributional assumptions</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 8px;">MC + Bootstrap</td>
<td style="border: 1px solid #ddd; padding: 8px;">Complete uncertainty quantification</td>
<td style="border: 1px solid #ddd; padding: 8px;">Production deployment decisions</td>
</tr>
</table>
</div>
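
As one illustration of the "CI for metrics" row, the sketch below bootstraps the accuracy difference between two models evaluated on the same test set (a paired bootstrap). Everything in it is assumed for the example: the labels and both sets of predictions are synthetic, and the accuracy levels of 87% and 84% are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, for reproducibility only
n_test = 1000

# Synthetic ground truth and predictions for two hypothetical models
y_true = rng.integers(0, 2, size=n_test)
pred_a = np.where(rng.random(n_test) < 0.87, y_true, 1 - y_true)  # ~87% accurate
pred_b = np.where(rng.random(n_test) < 0.84, y_true, 1 - y_true)  # ~84% accurate

correct_a = (pred_a == y_true).astype(float)
correct_b = (pred_b == y_true).astype(float)
observed_diff = correct_a.mean() - correct_b.mean()

# Paired bootstrap: resample test examples (indices) with replacement and
# evaluate BOTH models on the same resampled examples
n_boot = 5000
diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n_test, size=n_test)
    diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"Observed accuracy difference (A - B): {observed_diff:+.3f}")
print(f"95% bootstrap CI: [{ci_low:+.3f}, {ci_high:+.3f}]")
print("CI excludes 0" if ci_low > 0 or ci_high < 0 else "CI includes 0")
```

Resampling indices rather than each model's outcomes separately keeps the pairing intact, so examples that both models find hard stay together in every resample.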

## Key Takeaways

<div class="alert alert-summary">
<h4>🎓 Key Takeaways</h4>
<ol>
<li><strong>Monte Carlo Methods:</strong> Use randomness to solve deterministic problems
<ul>
<li>Error: O(1/√n) regardless of dimensionality</li>
<li>Essential for high-dimensional integration</li>
</ul>
</li>
<li><strong>Numerical Integration:</strong> E[f(X)] ≈ (1/n) Σ f(Xᵢ)
<ul>
<li>Works when analytical integration is impossible</li>
<li>Convergence guaranteed by the Law of Large Numbers</li>
</ul>
</li>
<li><strong>Bootstrap:</strong> Resample with replacement to estimate uncertainty
<ul>
<li>Treats the sample as the population</li>
<li>No assumptions about the shape of the distribution needed</li>
<li>95% CI: [2.5th percentile, 97.5th percentile]</li>
</ul>
</li>
<li><strong>Production ML:</strong> Always quantify uncertainty
<ul>
<li>Bootstrap CI for performance metrics</li>
<li>MC dropout for neural network uncertainty</li>
<li>Statistical tests for model comparison</li>
</ul>
</li>
</ol>
</div>
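
The Monte Carlo estimate E[f(X)] ≈ (1/n) Σ f(Xᵢ) and its O(1/√n) error can be checked with a minimal sketch. The choice of f(x) = x² with X ~ N(0, 1) is an assumption made here because the exact answer is known to be 1; `mc_expectation` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, for reproducibility only

def mc_expectation(f, sampler, n):
    """Estimate E[f(X)] from n draws and report its standard error."""
    values = f(sampler(n))
    estimate = values.mean()
    std_error = values.std(ddof=1) / np.sqrt(n)  # shrinks like 1/sqrt(n)
    return estimate, std_error

def square(x):
    return x ** 2

# E[X^2] = 1 exactly for a standard normal X
results = []
for n in (100, 10_000, 1_000_000):
    estimate, std_error = mc_expectation(square, rng.standard_normal, n)
    results.append((n, estimate, std_error))
    print(f"n = {n:>9,}: estimate = {estimate:.4f}, standard error = {std_error:.4f}")
```

Each 100-fold increase in n should cut the standard error by about a factor of 10, which is the 1/√n rate in practice.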

## Useful Links

1. [Sampling from a Statistical Distribution, Clearly Explained!! by StatQuest](https://www.youtube.com/watch?v=XLCWeSVzHUU)
2. [Sample Size and Effective Sample Size, Clearly Explained!! by StatQuest](https://www.youtube.com/watch?v=67zCIqdeXpo)
3. [Bootstrapping Main Ideas!! by StatQuest](https://www.youtube.com/watch?v=Xz0x-8-cgaQ)