# Estimating Earthquake Probability: Two Approaches

**Question**: What's the probability of at least one M $\ge$ 4.0 earthquake in California in the next week?

We'll compare two natural approaches:

1. **Empirical CDF**: What fraction of past interarrival times were $\le$ 7 days?
2. **Model-based (MLE)**: Fit an exponential distribution, then compute $P(X \le 7) = 1 - e^{-\hat{\lambda} \cdot 7}$

Both estimators are approximately Gaussian by the CLT, but they have different variances. Which is more precise?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## Part 1: Magnitude $\ge$ 4.0 Earthquakes

### 1.1 Load and Prepare Data

In [None]:
# Load the declustered earthquake data (produced by usgs_gk_ca_mainshocks.ipynb)
earthquakes = pd.read_csv('data/california_earthquakes_declustered.csv')
earthquakes['time'] = pd.to_datetime(earthquakes['time'], format='ISO8601')

print(f"Total earthquakes in dataset: {len(earthquakes)}")
print(f"  - Mainshocks: {earthquakes['is_mainshock'].sum()}")
print(f"  - Dependent: {(~earthquakes['is_mainshock']).sum()}")
print(f"Date range: {earthquakes['time'].min().date()} to {earthquakes['time'].max().date()}")

In [None]:
# Filter to mainshocks with M >= 4.0
mag_threshold = 4.0
mainshocks = earthquakes[(earthquakes['is_mainshock']) & (earthquakes['mag'] >= mag_threshold)].copy()
mainshocks = mainshocks.sort_values('time').reset_index(drop=True)

print(f"Mainshocks with M >= {mag_threshold}: {len(mainshocks)}")

In [None]:
# Compute interarrival times (in days)
interarrivals = mainshocks['time'].diff().dt.total_seconds() / (60 * 60 * 24)
interarrivals = interarrivals.dropna().values

print(f"Number of interarrival times: {len(interarrivals)}")
print(f"Mean interarrival time: {np.mean(interarrivals):.2f} days")
print(f"Median interarrival time: {np.median(interarrivals):.2f} days")
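
The same interarrival computation can be seen on a tiny example. The sketch below uses NumPy only and three made-up timestamps (hypothetical values, not from the catalog); it is meant to illustrate the sort-then-difference step, not to replace the pandas code above.

```python
import numpy as np

# Hypothetical event timestamps (deliberately unsorted), not taken from the catalog.
times = np.array(
    ["2020-01-04T00:00", "2020-01-01T00:00", "2020-01-03T12:00"],
    dtype="datetime64[s]",
)

# Sort first so that consecutive differences are waiting times between neighbors.
times = np.sort(times)

# Dividing by a one-day timedelta converts the gaps to fractional days.
gaps_days = np.diff(times) / np.timedelta64(1, "D")

print(gaps_days)
```

Three events yield two interarrival times (2.5 and 0.5 days here), which is why the first `diff()` value is dropped in the notebook.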

### 1.2 Does the Exponential Model Fit?

The exponential distribution is the natural model for waiting times in a Poisson process. If mainshock occurrences follow a Poisson process with rate $\lambda$, then interarrival times are $\text{Exponential}(\lambda)$.
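
To make this connection concrete, here is a minimal simulation sketch. It assumes a homogeneous Poisson process with a made-up rate (`rate` and `horizon` are illustrative values, not estimates from the earthquake data) and checks that the gaps between simulated events behave like exponential draws.

```python
import numpy as np

rng = np.random.default_rng(0)

rate = 0.2          # hypothetical rate: events per day (not estimated from the catalog)
horizon = 50_000.0  # length of the simulated window, in days

# Homogeneous Poisson process: draw the event count, then place events uniformly.
n_events = rng.poisson(rate * horizon)
event_times = np.sort(rng.uniform(0.0, horizon, size=n_events))

# Gaps between consecutive events should behave like Exponential(rate) draws.
gaps = np.diff(event_times)

print(f"Simulated events:      {n_events}")
print(f"Mean gap:              {gaps.mean():.2f} days (theory: {1 / rate:.2f})")
print(f"Fraction of gaps <= 7: {np.mean(gaps <= 7):.3f} (theory: {1 - np.exp(-rate * 7):.3f})")
```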

Key property: For an exponential distribution, the mean equals the standard deviation. Let's check:

In [None]:
mean_ia = np.mean(interarrivals)
std_ia = np.std(interarrivals)

print(f"Mean: {mean_ia:.2f} days")
print(f"Std:  {std_ia:.2f} days")
print(f"Ratio (std/mean): {std_ia/mean_ia:.2f}")
print("\nFor a perfect exponential, this ratio would be 1.0")

In [None]:
# Histogram with exponential fit
fig, ax = plt.subplots(figsize=(10, 6))

# Histogram of data
ax.hist(interarrivals, bins=30, density=True, alpha=0.7, color='steelblue', 
        edgecolor='white', label='Data')

# Fitted exponential
lambda_hat = 1 / mean_ia
x = np.linspace(0, np.percentile(interarrivals, 99), 200)
y = lambda_hat * np.exp(-lambda_hat * x)
ax.plot(x, y, 'r--', linewidth=2.5, label=f'Exponential fit ($\\hat{{\\lambda}}$ = {lambda_hat:.3f}/day)')

ax.set_xlabel('Interarrival Time (days)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Interarrival Times for M $\\geq$ {mag_threshold} Mainshocks', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_xlim(0, None)

plt.tight_layout()
plt.show()

print("The exponential model appears to fit reasonably well.")
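
A histogram overlay is a fairly coarse check. One complementary sketch compares a few empirical quantiles with the quantiles implied by the fitted exponential, $-\ln(1 - q)/\hat{\lambda}$. The helper `exponential_qq` and the array `sample` are hypothetical names introduced here, and `sample` is synthetic stand-in data; in the notebook one would pass `interarrivals` instead.

```python
import numpy as np

def exponential_qq(sample, probs=(0.10, 0.25, 0.50, 0.75, 0.90)):
    """Return (empirical, fitted) quantiles of `sample` under an exponential fit."""
    sample = np.asarray(sample, dtype=float)
    probs = np.asarray(probs, dtype=float)
    lam = 1.0 / sample.mean()                 # exponential MLE of the rate
    empirical = np.quantile(sample, probs)    # quantiles observed in the data
    fitted = -np.log1p(-probs) / lam          # exponential quantile function
    return empirical, fitted

# Synthetic stand-in data; replace with `interarrivals` in the notebook.
rng = np.random.default_rng(1)
sample = rng.exponential(scale=10.0, size=20_000)

empirical, fitted = exponential_qq(sample)
for q, e, f in zip((0.10, 0.25, 0.50, 0.75, 0.90), empirical, fitted):
    print(f"q = {q:.2f}: empirical {e:6.2f} days, fitted {f:6.2f} days")
```

Large gaps between the two columns, especially in the upper quantiles, would suggest that the exponential model is missing something.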

### 1.3 Two Estimators for P(earthquake within 7 days)

Now we want to estimate $P(X \le 7)$ where $X$ is the time until the next earthquake.

**Estimator 1: Empirical CDF**
$$\hat{p}_{\text{ECDF}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \le 7)$$
This is just the proportion of observed interarrival times that were 7 days or less.

**Estimator 2: MLE-based**
$$\hat{p}_{\text{MLE}} = 1 - e^{-\hat{\lambda} \cdot 7} \quad \text{where} \quad \hat{\lambda} = \frac{1}{\bar{X}}$$
This fits the exponential model and computes the theoretical probability.
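
As a toy illustration of the two formulas, the sketch below uses five made-up interarrival times (hypothetical values chosen so that the mean is exactly 7 days); the `_toy` names are introduced here only for the example.

```python
import numpy as np

# Hypothetical interarrival times in days, chosen so the arithmetic is easy to follow.
toy = np.array([1.0, 2.0, 4.0, 8.0, 20.0])
t = 7

# Estimator 1: share of observations at or below t (3 of 5 here).
p_ecdf_toy = np.mean(toy <= t)

# Estimator 2: fit the rate (1 / mean = 1/7 per day), then plug into the exponential CDF.
lambda_toy = 1 / toy.mean()
p_mle_toy = 1 - np.exp(-lambda_toy * t)

print(f"ECDF estimate: {p_ecdf_toy:.3f}")   # 0.600
print(f"MLE estimate:  {p_mle_toy:.3f}")    # 0.632
```

The ECDF only registers that three values fall at or below 7, while the MLE-based estimate changes whenever any of the five values changes.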

Let's compute both from our data:

In [None]:
t = 7  # days

# Empirical CDF estimate
p_ecdf = np.mean(interarrivals <= t)

# MLE-based estimate
lambda_hat = 1 / np.mean(interarrivals)
p_mle = 1 - np.exp(-lambda_hat * t)

print(f"Probability of M >= {mag_threshold} earthquake within {t} days:\n")
print(f"  Empirical CDF:  {p_ecdf:.3f}  ({100*p_ecdf:.1f}%)")
print(f"  MLE-based:      {p_mle:.3f}  ({100*p_mle:.1f}%)")

The estimates are close, but which one is more *precise*? That is, which has lower sampling variability?

### 1.4 Comparing Sampling Variability via Bootstrap

To understand the sampling distributions of these estimators, we'll use the bootstrap:
1. Resample $n = 100$ interarrival times with replacement from our data (a fixed resample size, so the spread shown is what a sample of 100 observations would give)
2. Compute both estimates on the bootstrap sample
3. Repeat many times to see the distribution of each estimator
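
The three steps above can also be written without an explicit loop. The following is a sketch under stated assumptions: `bootstrap_estimates` is a hypothetical helper name, the data are synthetic exponential draws rather than the earthquake catalog, and it uses NumPy's `default_rng` generator instead of the legacy seeding call.

```python
import numpy as np

def bootstrap_estimates(data, t, sample_size, n_resamples, rng):
    """Return bootstrap draws of the ECDF and MLE-based estimates of P(X <= t)."""
    data = np.asarray(data, dtype=float)
    # Step 1: each row of `resamples` is one resample drawn with replacement.
    idx = rng.integers(0, len(data), size=(n_resamples, sample_size))
    resamples = data[idx]
    # Step 2: compute both estimates for every row.
    ecdf = (resamples <= t).mean(axis=1)
    mle = 1 - np.exp(-t / resamples.mean(axis=1))
    # Step 3: the returned arrays approximate the two sampling distributions.
    return ecdf, mle

rng = np.random.default_rng(42)
fake_interarrivals = rng.exponential(scale=10.0, size=500)  # synthetic stand-in data

ecdf_draws, mle_draws = bootstrap_estimates(
    fake_interarrivals, t=7, sample_size=100, n_resamples=2_000, rng=rng
)
print(f"ECDF std: {ecdf_draws.std():.4f}")
print(f"MLE std:  {mle_draws.std():.4f}")
```

Note that `sample_size` does not have to equal `len(data)`; a smaller resample size gives wider sampling distributions.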

In [None]:
n_boot_size = 100  # Sample size for each bootstrap sample
n_bootstrap = 5000
t = 7

np.random.seed(42)

# Storage for bootstrap estimates
boot_ecdf = np.zeros(n_bootstrap)
boot_mle = np.zeros(n_bootstrap)

for b in range(n_bootstrap):
    # Resample with replacement
    boot_sample = np.random.choice(interarrivals, size=n_boot_size, replace=True)
    
    # Empirical CDF estimate
    boot_ecdf[b] = np.mean(boot_sample <= t)
    
    # MLE-based estimate
    lambda_boot = 1 / np.mean(boot_sample)
    boot_mle[b] = 1 - np.exp(-lambda_boot * t)

print(f"Bootstrap complete: {n_bootstrap} resamples of size {n_boot_size}")

In [None]:
# Summary statistics
print(f"Bootstrap results for P(X <= {t} days), n = {n_boot_size}:\n")
print(f"{'Estimator':<18} {'Mean':>10} {'Std':>10}")
print("-" * 40)
print(f"{'Empirical CDF':<18} {np.mean(boot_ecdf):>10.4f} {np.std(boot_ecdf):>10.4f}")
print(f"{'MLE-based':<18} {np.mean(boot_mle):>10.4f} {np.std(boot_mle):>10.4f}")
print()
print(f"Std ratio (ECDF / MLE): {np.std(boot_ecdf) / np.std(boot_mle):.2f}")
print(f"Variance ratio: {np.var(boot_ecdf) / np.var(boot_mle):.2f}")
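
For a rough benchmark against the ratios printed above: if the interarrival times were exactly exponential, large-sample theory gives a standard deviation of $\sqrt{p(1-p)/n}$ for the ECDF estimate and approximately $\lambda t \, e^{-\lambda t}/\sqrt{n}$ for the MLE-based estimate. The sketch below evaluates both for hypothetical inputs; `theoretical_stds` is a name introduced here, and in the notebook one might plug in `lambda_hat`, `t`, and `n_boot_size`.

```python
import numpy as np

def theoretical_stds(lam, t, n):
    """Large-sample standard deviations of both estimators under an exact exponential model."""
    p = 1 - np.exp(-lam * t)
    std_ecdf = np.sqrt(p * (1 - p) / n)                 # binomial proportion
    std_mle = lam * t * np.exp(-lam * t) / np.sqrt(n)   # delta method applied to 1 - exp(-t / mean)
    return std_ecdf, std_mle

# Hypothetical inputs for illustration only.
std_ecdf, std_mle = theoretical_stds(lam=0.1, t=7, n=100)
print(f"ECDF std: {std_ecdf:.4f}")
print(f"MLE std:  {std_mle:.4f}")
print(f"Std ratio (ECDF / MLE): {std_ecdf / std_mle:.2f}")
```

A bootstrap ratio far from this benchmark would be one hint that the data depart from the exponential model.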

In [None]:
# Histogram comparison
fig, ax = plt.subplots(figsize=(10, 6))

# For ECDF, possible values are k/n. Create bins centered on these values.
# Bin edges at (k - 0.5)/n for k = k_min, ..., k_max + 1
ecdf_min, ecdf_max = boot_ecdf.min(), boot_ecdf.max()
k_min = int(np.floor(ecdf_min * n_boot_size))
k_max = int(np.ceil(ecdf_max * n_boot_size))
ecdf_bin_edges = (np.arange(k_min, k_max + 2) - 0.5) / n_boot_size

# For MLE, use regular bins spanning full range
mle_min, mle_max = boot_mle.min(), boot_mle.max()
mle_bins = np.linspace(mle_min, mle_max, 50)

ax.hist(boot_ecdf, bins=ecdf_bin_edges, alpha=0.6, color='coral', density=True,
        label=f'Empirical CDF (std = {np.std(boot_ecdf):.4f})')
ax.hist(boot_mle, bins=mle_bins, alpha=0.6, color='steelblue', density=True,
        label=f'MLE-based (std = {np.std(boot_mle):.4f})')

# Mark the point estimates
ax.axvline(p_ecdf, color='darkred', linestyle='--', linewidth=2, alpha=0.8)
ax.axvline(p_mle, color='darkblue', linestyle='--', linewidth=2, alpha=0.8)

ax.set_xlabel('Estimated Probability', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Bootstrap Sampling Distributions: P(earthquake within {t} days)\n'
             f'M $\\geq$ {mag_threshold}, n = {n_boot_size}', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)

plt.tight_layout()
plt.show()

### 1.5 Why is the MLE-based Estimator More Precise?

The MLE-based estimator has lower variance because, when the exponential model holds, it uses **more information** from each observation.

**Empirical CDF**: For each interarrival time $X_i$, it only uses the binary information "was $X_i \le 7$ or $X_i > 7$?" The actual value of $X_i$ is thrown away.

**MLE-based**: Uses the actual value of every $X_i$ to estimate $\lambda$, then transforms to get the probability.

**The key insight**: If the exponential model is correct, knowing $\lambda$ tells you *everything* about the distribution. Estimating $\lambda$ well (using all the data) lets you estimate any probability or quantile well.
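
As a small illustration of that point, the sketch below derives several probabilities and quantiles from a single rate. The value of `lambda_example` is hypothetical (not the fitted value from the data), and `exp_cdf` and `exp_quantile` are helper names introduced here.

```python
import numpy as np

def exp_cdf(lam, t):
    """P(X <= t) for X ~ Exponential(lam)."""
    return 1 - np.exp(-lam * t)

def exp_quantile(lam, q):
    """Waiting time x such that P(X <= x) = q."""
    return -np.log(1 - q) / lam

lambda_example = 0.1  # hypothetical fitted rate, per day

# Probability of at least one event within several horizons.
for days in (1, 7, 30):
    print(f"P(X <= {days:>2} days) = {exp_cdf(lambda_example, days):.3f}")

# Quantiles of the waiting time, all from the same rate.
print(f"Median wait:     {exp_quantile(lambda_example, 0.5):.1f} days")
print(f"90th percentile: {exp_quantile(lambda_example, 0.9):.1f} days")
```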

**Both are approximately Gaussian** by the CLT:
- ECDF: It's a sample mean of binary indicators $\to$ CLT applies directly
- MLE-based: It's a smooth function of $\bar{X}$ $\to$ CLT + **delta method**
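
The delta-method step can be made concrete. This is a sketch, assuming the interarrival times are i.i.d. $\text{Exponential}(\lambda)$ with mean $\mu = 1/\lambda$, and writing $g(\mu) = 1 - e^{-t/\mu}$ so that $\hat{p}_{\text{MLE}} = g(\bar{X})$ (the symbols $\mu$ and $g$ are introduced here for the derivation):

$$\text{Var}(\bar{X}) = \frac{\mu^2}{n}, \qquad g'(\mu) = -\frac{t}{\mu^2} \, e^{-t/\mu}$$

$$\text{Var}(\hat{p}_{\text{MLE}}) \approx g'(\mu)^2 \, \text{Var}(\bar{X}) = \frac{(\lambda t)^2 \, e^{-2\lambda t}}{n}$$

For the ECDF, each indicator is Bernoulli with success probability $p = 1 - e^{-\lambda t}$, so

$$\text{Var}(\hat{p}_{\text{ECDF}}) = \frac{p(1-p)}{n} = \frac{(1 - e^{-\lambda t}) \, e^{-\lambda t}}{n}$$

Dividing the two gives

$$\frac{\text{Var}(\hat{p}_{\text{ECDF}})}{\text{Var}(\hat{p}_{\text{MLE}})} \approx \frac{e^{\lambda t} - 1}{(\lambda t)^2}$$

Under these assumptions the ratio depends on the model only through $\lambda t$, and it exceeds 1 for every $\lambda t > 0$.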

---

## Part 2: Magnitude $\ge$ 5.0 Earthquakes

Larger earthquakes are rarer. Let's repeat the analysis for M $\ge$ 5.0.

In [None]:
# Filter to M >= 5.0 mainshocks
mag_threshold_5 = 5.0
mainshocks_5 = earthquakes[(earthquakes['is_mainshock']) & (earthquakes['mag'] >= mag_threshold_5)].copy()
mainshocks_5 = mainshocks_5.sort_values('time').reset_index(drop=True)

# Compute interarrival times
interarrivals_5 = mainshocks_5['time'].diff().dt.total_seconds() / (60 * 60 * 24)
interarrivals_5 = interarrivals_5.dropna().values

print(f"M >= {mag_threshold_5} mainshocks: {len(mainshocks_5)}")
print(f"Mean interarrival: {np.mean(interarrivals_5):.1f} days")

In [None]:
# Histogram with exponential fit
fig, ax = plt.subplots(figsize=(10, 6))

mean_ia_5 = np.mean(interarrivals_5)
lambda_hat_5 = 1 / mean_ia_5

ax.hist(interarrivals_5, bins=20, density=True, alpha=0.7, color='steelblue', 
        edgecolor='white', label='Data')

x = np.linspace(0, np.percentile(interarrivals_5, 99), 200)
y = lambda_hat_5 * np.exp(-lambda_hat_5 * x)
ax.plot(x, y, 'r--', linewidth=2.5, label=f'Exponential fit ($\\hat{{\\lambda}}$ = {lambda_hat_5:.4f}/day)')

ax.set_xlabel('Interarrival Time (days)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Interarrival Times for M $\\geq$ {mag_threshold_5} Mainshocks', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_xlim(0, None)

plt.tight_layout()
plt.show()

In [None]:
# Point estimates
t = 7
p_ecdf_5 = np.mean(interarrivals_5 <= t)
p_mle_5 = 1 - np.exp(-lambda_hat_5 * t)

print(f"P(M >= {mag_threshold_5} earthquake within {t} days):\n")
print(f"  Empirical CDF:  {p_ecdf_5:.3f}  ({100*p_ecdf_5:.1f}%)")
print(f"  MLE-based:      {p_mle_5:.3f}  ({100*p_mle_5:.1f}%)")

In [None]:
# Bootstrap
n_5 = len(interarrivals_5)
n_bootstrap = 5000

np.random.seed(42)

boot_ecdf_5 = np.zeros(n_bootstrap)
boot_mle_5 = np.zeros(n_bootstrap)

for b in range(n_bootstrap):
    boot_sample = np.random.choice(interarrivals_5, size=n_5, replace=True)
    boot_ecdf_5[b] = np.mean(boot_sample <= t)
    lambda_boot = 1 / np.mean(boot_sample)
    boot_mle_5[b] = 1 - np.exp(-lambda_boot * t)

print(f"Bootstrap results (n = {n_5}):\n")
print(f"{'Estimator':<18} {'Mean':>10} {'Std':>10}")
print("-" * 40)
print(f"{'Empirical CDF':<18} {np.mean(boot_ecdf_5):>10.4f} {np.std(boot_ecdf_5):>10.4f}")
print(f"{'MLE-based':<18} {np.mean(boot_mle_5):>10.4f} {np.std(boot_mle_5):>10.4f}")
print()
print(f"Std ratio (ECDF / MLE): {np.std(boot_ecdf_5) / np.std(boot_mle_5):.2f}")

In [None]:
# Histogram comparison
fig, ax = plt.subplots(figsize=(10, 6))

# For ECDF, possible values are k/n_5. Create bins centered on these values.
ecdf_min, ecdf_max = boot_ecdf_5.min(), boot_ecdf_5.max()
k_min = int(np.floor(ecdf_min * n_5))
k_max = int(np.ceil(ecdf_max * n_5))
ecdf_bin_edges = (np.arange(k_min, k_max + 2) - 0.5) / n_5

# For MLE, use regular bins spanning full range
mle_min, mle_max = boot_mle_5.min(), boot_mle_5.max()
mle_bins = np.linspace(mle_min, mle_max, 50)

ax.hist(boot_ecdf_5, bins=ecdf_bin_edges, alpha=0.6, color='coral', density=True,
        label=f'Empirical CDF (std = {np.std(boot_ecdf_5):.4f})')
ax.hist(boot_mle_5, bins=mle_bins, alpha=0.6, color='steelblue', density=True,
        label=f'MLE-based (std = {np.std(boot_mle_5):.4f})')

ax.axvline(p_ecdf_5, color='darkred', linestyle='--', linewidth=2, alpha=0.8)
ax.axvline(p_mle_5, color='darkblue', linestyle='--', linewidth=2, alpha=0.8)

ax.set_xlabel('Estimated Probability', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Bootstrap Sampling Distributions: P(earthquake within {t} days)\n'
             f'M $\\geq$ {mag_threshold_5}, n = {n_5}', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)

plt.tight_layout()
plt.show()

---

## Summary

| Magnitude | Sample Size | ECDF Std | MLE Std | Std Ratio (ECDF / MLE) |
|-----------|-------------|----------|---------|-------|
| M $\ge$ 4.0 | larger | lower | lower | ~same |
| M $\ge$ 5.0 | smaller | higher | higher | ~same |

**Key takeaways**:
1. Both estimators are approximately Gaussian (by the CLT)
2. The MLE-based estimator has lower variance in both cases
3. The efficiency gain is consistent across different magnitude thresholds
4. The MLE-based estimator's Gaussianity comes from the CLT combined with the **delta method**