<a href="https://colab.research.google.com/github/bnsreenu/python_for_microscopists/blob/master/363b_Time_Series_Statistical_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://youtu.be/ubItfMWR9aw

# Time Series Statistical Comparison: Handling Temporal Autocorrelation

## The Problem
Traditional statistical tests (like Mann-Whitney U) assume data points are independent. However, time series data violates this assumption because:
- Today's stock price depends on yesterday's price
- Observations close in time are more similar than distant ones
- This temporal autocorrelation can lead to incorrect statistical conclusions
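
To make the last point concrete, here is a minimal simulation sketch (not part of the original analysis). It assumes a synthetic AR(1) process; `simulate_ar1` and `false_positive_rate` are hypothetical helper names, and the coefficient 0.8 is chosen only for illustration. Both series in each trial come from the same process, so every "significant" result is a false positive.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def simulate_ar1(n, phi, rng):
    """Generate x_t = phi * x_{t-1} + noise (phi = 0 gives independent data)."""
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

def false_positive_rate(phi, n=200, n_trials=300, alpha=0.05, seed=0):
    """Fraction of trials where Mann-Whitney U rejects although no difference exists."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        a = simulate_ar1(n, phi, rng)
        b = simulate_ar1(n, phi, rng)
        _, p = mannwhitneyu(a, b, alternative='two-sided')
        rejections += int(p < alpha)
    return rejections / n_trials

fpr_independent = false_positive_rate(phi=0.0)
fpr_autocorrelated = false_positive_rate(phi=0.8)

print(f"False positive rate, independent data:    {fpr_independent:.3f}")
print(f"False positive rate, autocorrelated data: {fpr_autocorrelated:.3f}")
```

With independent data the rejection rate should sit near the nominal 5%. With strong autocorrelation it is typically several times higher, because the test treats 200 correlated points as if they were 200 independent pieces of evidence.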

## What We'll Learn
This notebook demonstrates 4 different approaches to compare two time series while properly handling temporal dependence:

### 1. **Naive Approach** (What NOT to do)
- Apply standard Mann-Whitney U test directly
- Problem: Ignores temporal correlation, may give false results

### 2. **Pre-whitening**
- Remove temporal correlation first using ARIMA models
- Compare the "cleaned" residuals
- Idea: If we remove the time pattern, we can use standard tests

### 3. **Dynamic Time Warping (DTW)**
- Aligns two time series optimally before comparison
- **Simple explanation**: Imagine stretching/compressing one series to best match the other
- Useful when series have similar patterns but different timing

### 4. **Block Bootstrap**
- Resampling method that preserves temporal correlation structure
- Idea: Instead of shuffling individual points, shuffle blocks of consecutive points
- Maintains the "memory" in the data while testing significance

## Our Example: AAPL vs MSFT Stock Returns
We'll compare Apple and Microsoft daily stock returns to answer:
- Are their return distributions significantly different?
- How do different methods handle the temporal correlation in stock prices?
- When do the methods agree vs disagree?


## Expected Outcome
Since AAPL and MSFT are both large tech stocks, we expect:
- High correlation between their returns
- Possible temporal autocorrelation in individual series
- Methods may show **no significant difference** (which is a valid scientific result!)
- Demonstrates importance of choosing appropriate comparison pairs

---


In [None]:
!pip install yfinance dtaidistance statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import mannwhitneyu
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller, acf
from dtaidistance import dtw
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 11



## 1. Data Loading & Visualization
**Process:**
1. Download AAPL and MSFT stock prices from Yahoo Finance
2. Calculate daily returns (percentage changes)
3. Create visualizations to explore the data

**Key Insight:** High correlation (0.685) and similar distributions suggest these stocks move together - setting expectation for "no significant difference" result.
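
As a small worked example of step 2, the sketch below computes percentage returns on a made-up three-day price series (the prices are illustrative, not real market data) and checks `pct_change` against the explicit formula `(p_t / p_{t-1} - 1) * 100`.

```python
import pandas as pd

# Toy prices: a 2% gain followed by a 2% loss
toy_prices = pd.Series([100.0, 102.0, 99.96], name="toy_price")

# Same calculation the notebook applies to the downloaded prices
toy_returns = toy_prices.pct_change().dropna() * 100

# Explicit formula for comparison
manual_returns = (toy_prices / toy_prices.shift(1) - 1).dropna() * 100

print(toy_returns.round(4).tolist())
print(manual_returns.round(4).tolist())
```

The first observation is dropped because it has no previous price to compare against, which is why the returns series is one element shorter than the price series.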

---



In [None]:
# Download stock data
print("Downloading stock data...")
tickers = ['AAPL', 'MSFT']
data = yf.download(tickers, start='2022-01-01', end='2024-12-01')['Close']
data = data.dropna()

# Calculate returns for better stationarity
returns = data.pct_change().dropna() * 100  # Convert to percentage
prices = data.copy()


print(f"Data shape: {data.shape}")
print(f"Date range: {data.index[0].date()} to {data.index[-1].date()}")
print(f"Total observations: {len(data)}")

# Basic visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot prices
axes[0,0].plot(prices.index, prices['AAPL'], label='AAPL', alpha=0.8)
axes[0,0].plot(prices.index, prices['MSFT'], label='MSFT', alpha=0.8)
axes[0,0].set_title('Stock Prices')
axes[0,0].set_ylabel('Price ($)')
axes[0,0].legend()

# Plot returns
axes[0,1].plot(returns.index, returns['AAPL'], label='AAPL', alpha=0.7)
axes[0,1].plot(returns.index, returns['MSFT'], label='MSFT', alpha=0.7)
axes[0,1].set_title('Daily Returns (%)')
axes[0,1].set_ylabel('Return (%)')
axes[0,1].legend()

# Correlation plot
axes[1,0].scatter(returns['AAPL'], returns['MSFT'], alpha=0.6)
correlation = returns['AAPL'].corr(returns['MSFT'])   # Pearson pairwise correlation
axes[1,0].set_xlabel('AAPL Returns (%)')
axes[1,0].set_ylabel('MSFT Returns (%)')
axes[1,0].set_title(f'Returns Correlation (r = {correlation:.3f})')

# Distribution comparison
axes[1,1].hist(returns['AAPL'], bins=30, alpha=0.7, label='AAPL', density=True)
axes[1,1].hist(returns['MSFT'], bins=30, alpha=0.7, label='MSFT', density=True)
axes[1,1].set_xlabel('Returns (%)')
axes[1,1].set_ylabel('Density')
axes[1,1].set_title('Return Distributions')
axes[1,1].legend()

plt.tight_layout()
plt.show()

## 2: Autocorrelation Check
**Process:**
1. Apply the Ljung-Box test to detect temporal correlation. It tests the null hypothesis that the observations are independent over time (no autocorrelation up to the chosen lag).
2. Plot autocorrelation functions (ACF) for visual inspection. Autocorrelation measures the similarity between a time series and a lagged copy of itself.
3. Determine if standard tests might be biased

**Key Insight:** MSFT shows significant autocorrelation (p=0.025) while AAPL doesn't (p=0.626), despite visually similar ACF plots. This mixed result makes it perfect for comparing methods.
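
For readers who want to see what the Ljung-Box test computes, here is a hand-rolled sketch on synthetic data. It uses the standard statistic Q = n(n+2) Σ ρ̂ₖ² / (n − k), summed over lags k = 1..h and compared against a chi-square distribution with h degrees of freedom. The function name `ljung_box` and the simulated series are illustrative assumptions; the notebook itself relies on `acorr_ljungbox` from statsmodels.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, h=10):
    """Return (Q statistic, p-value) for autocorrelation up to lag h."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    lags = np.arange(1, h + 1)
    # Sample autocorrelation at each lag k
    rho = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in lags])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - lags))
    p = chi2.sf(q, df=h)  # upper-tail probability
    return q, p

rng = np.random.default_rng(42)
white = rng.standard_normal(500)          # independent observations

ar = np.zeros(500)                         # AR(1) series with coefficient 0.5
for t in range(1, 500):
    ar[t] = 0.5 * ar[t - 1] + white[t]

q_white, p_white = ljung_box(white)
q_ar, p_ar = ljung_box(ar)
print(f"White noise: Q = {q_white:.2f}, p = {p_white:.4f}")
print(f"AR(1):       Q = {q_ar:.2f}, p = {p_ar:.4g}")
```

The autocorrelated series produces a much larger Q and a very small p-value, which is the pattern the test is designed to flag.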

---



In [None]:

def check_autocorrelation(series, name, max_lags=20):
    """Check for temporal autocorrelation in a time series"""
    print(f"\n=== Autocorrelation Analysis: {name} ===")

    # Ljung-Box test for autocorrelation
    from statsmodels.stats.diagnostic import acorr_ljungbox
    lb_stat = acorr_ljungbox(series, lags=10, return_df=True)

    print(f"Ljung-Box test p-value (lag 10): {lb_stat['lb_pvalue'].iloc[-1]:.4f}")
    if lb_stat['lb_pvalue'].iloc[-1] < 0.05:
        print("-> Autocorrelation detected")
    else:
        print("-> No significant autocorrelation")

    # Plot ACF
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Autocorrelation function
    autocorr = acf(series, nlags=max_lags, alpha=0.05)
    lags = range(len(autocorr[0]))
    ax1.plot(lags, autocorr[0], 'bo-', markersize=4)
    ax1.fill_between(lags, autocorr[1][:, 0], autocorr[1][:, 1], alpha=0.3)
    ax1.axhline(0, color='black', linestyle='--', alpha=0.8)
    ax1.set_title(f'Autocorrelation Function: {name}')
    ax1.set_xlabel('Lag')
    ax1.set_ylabel('Autocorrelation')

    # Time series plot
    ax2.plot(series.index, series, alpha=0.8)
    ax2.set_title(f'Time Series: {name}')
    ax2.set_ylabel('Value')

    plt.tight_layout()
    plt.show()

    return lb_stat['lb_pvalue'].iloc[-1] < 0.05

# Check autocorrelation in returns
AAPL_has_autocorr = check_autocorrelation(returns['AAPL'], 'AAPL Returns')
MSFT_has_autocorr = check_autocorrelation(returns['MSFT'], 'MSFT Returns')

## 3: Naive Comparison
**Process:**
1. Apply Mann-Whitney U test directly to original returns
2. Ignore any temporal correlation structure
3. Establish baseline comparison

**Key Insight:** P-value = 0.921 (not significant). This is our "what NOT to do" baseline, though it may still give correct results when autocorrelation is mild.
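
To show what the Mann-Whitney U statistic measures, the sketch below recomputes it on synthetic samples in two ways: from rank sums, and by directly counting the pairs in which a value from the first sample exceeds a value from the second. The sample sizes and the 0.3 shift are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.3, 1.0, size=60)
n1, n2 = len(x), len(y)

# Rank all observations together, then sum the ranks of the first sample
ranks = rankdata(np.concatenate([x, y]))
rank_sum_x = ranks[:n1].sum()
u_from_ranks = rank_sum_x - n1 * (n1 + 1) / 2

# Equivalent view: count pairs (x_i, y_j) with x_i > y_j (no ties in continuous data)
u_from_pairs = np.sum(x[:, None] > y[None, :])

u_scipy, p_value = mannwhitneyu(x, y, alternative='two-sided')
print(f"U from ranks: {u_from_ranks:.1f}")
print(f"U from pairs: {u_from_pairs}")
print(f"U from scipy: {u_scipy:.1f}, p-value: {p_value:.4f}")
```

Note that the p-value attached to U assumes the observations are independent draws, which is exactly the assumption autocorrelated time series can violate.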

---



In [None]:

def naive_comparison(series1, series2, name1, name2):
    """Standard Mann-Whitney U test ignoring temporal structure"""
    print(f"\n=== Naive Comparison: {name1} vs {name2} ===")

    # Mann-Whitney U test
    stat, p_value = mannwhitneyu(series1, series2, alternative='two-sided')

    # Basic statistics
    print(f"{name1}: Mean = {series1.mean():.3f}, Std = {series1.std():.3f}")
    print(f"{name2}: Mean = {series2.mean():.3f}, Std = {series2.std():.3f}")
    print(f"Mann-Whitney U statistic: {stat:.2f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Result: {'Significantly different' if p_value < 0.05 else 'Not significantly different'}")

    return p_value

naive_p = naive_comparison(returns['AAPL'], returns['MSFT'], 'AAPL', 'MSFT')

## 4: Pre-whitening Approach
**Process:**
1. Fit ARIMA(1,0,0) models to remove temporal correlation. Note that ARIMA(1,0,0) is a first-order autoregressive model. It predicts future values of a time series based on a linear combination of its previous value and a constant, plus a random error term.
2. Extract residuals (cleaned data with correlation removed)
3. Apply Mann-Whitney U test to residuals
4. Check if residuals are truly white noise

**Key Insight:** P-value = 0.882 (not significant). Since original autocorrelation was weak (AR coefficients ~0.01), pre-whitening had minimal impact on results.

**In summary** - After removing the temporal correlation patterns (pre-whitening), we will find that the "cleaned" return data from AAPL and MSFT look statistically indistinguishable.
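
The idea can be checked on synthetic data before applying it to stock returns. The sketch below is a simplified stand-in: it fits the AR(1) model x_t = c + φ·x_{t-1} + e_t by ordinary least squares rather than with the `ARIMA` class used in this notebook, and the helper names `lag1_autocorr` and `prewhiten_ar1` are hypothetical.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    xc = x - x.mean()
    return np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)

def prewhiten_ar1(x):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t; returns (phi, residuals)."""
    design = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    coef, *_ = np.linalg.lstsq(design, x[1:], rcond=None)
    c, phi = coef
    residuals = x[1:] - (c + phi * x[:-1])
    return phi, residuals

# Simulate an AR(1) series with a known coefficient
rng = np.random.default_rng(1)
n, true_phi = 1000, 0.6
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_phi * x[t - 1] + eps[t]

phi_hat, resid = prewhiten_ar1(x)
print(f"Estimated phi:                 {phi_hat:.3f}")
print(f"Lag-1 autocorrelation before:  {lag1_autocorr(x):.3f}")
print(f"Lag-1 autocorrelation after:   {lag1_autocorr(resid):.3f}")
```

When the series really has strong AR(1) structure, the residuals lose most of their lag-1 autocorrelation. When the coefficient is already close to zero, as reported for the stock returns here, the residuals are almost identical to the original series.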

---



In [None]:

def prewhiten_series(series, name):
    """Remove temporal correlation using ARIMA modeling"""
    print(f"\n=== Pre-whitening: {name} ===")

    # Fit ARIMA(1,0,0) model - simple AR(1)
    model = ARIMA(series, order=(1, 0, 0))
    fitted = model.fit()

    # Extract residuals
    residuals = fitted.resid

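    # Look up the AR(1) coefficient by name; positional indexing on a labeled Series is deprecated in recent pandas
    print(f"AR(1) coefficient: {fitted.params['ar.L1']:.4f}")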
    print(f"Residuals mean: {residuals.mean():.4f}")
    print(f"Residuals std: {residuals.std():.4f}")

    # Check if residuals are white noise
    from statsmodels.stats.diagnostic import acorr_ljungbox
    lb_residuals = acorr_ljungbox(residuals, lags=10, return_df=True)
    print(f"Ljung-Box test on residuals p-value: {lb_residuals['lb_pvalue'].iloc[-1]:.4f}")

    return residuals, fitted

# Pre-whiten both series
AAPL_residuals, AAPL_model = prewhiten_series(returns['AAPL'], 'AAPL')
MSFT_residuals, MSFT_model = prewhiten_series(returns['MSFT'], 'MSFT')

In [None]:
# Compare pre-whitened series
def compare_prewhitened(resid1, resid2, name1, name2):
    """Compare pre-whitened residuals"""
    print(f"\n=== Pre-whitened Comparison: {name1} vs {name2} ===")

    stat, p_value = mannwhitneyu(resid1, resid2, alternative='two-sided')

    print(f"Mann-Whitney U statistic: {stat:.2f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Result: {'Significantly different' if p_value < 0.05 else 'Not significantly different'}")

    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Time series of residuals
    axes[0].plot(resid1.index, resid1, label=name1, alpha=0.7)
    axes[0].plot(resid2.index, resid2, label=name2, alpha=0.7)
    axes[0].set_title('Pre-whitened Residuals')
    axes[0].set_ylabel('Residual')
    axes[0].legend()

    # Distribution comparison
    axes[1].hist(resid1, bins=30, alpha=0.7, label=name1, density=True)
    axes[1].hist(resid2, bins=30, alpha=0.7, label=name2, density=True)
    axes[1].set_title('Residual Distributions')
    axes[1].set_xlabel('Residual Value')
    axes[1].set_ylabel('Density')
    axes[1].legend()

    # Box plot
    axes[2].boxplot([resid1.dropna(), resid2.dropna()], labels=[name1, name2])
    axes[2].set_title('Residual Box Plots')
    axes[2].set_ylabel('Residual Value')

    plt.tight_layout()
    plt.show()

    return p_value

prewhitened_p = compare_prewhitened(AAPL_residuals, MSFT_residuals, 'AAPL', 'MSFT')


## 5: Dynamic Time Warping (DTW)
**Process:**
1. **Warp the time axis** - Find optimal alignment between series
2. **Create aligned datasets** - Extract optimally matched data points  
3. **Apply standard test** - Run Mann-Whitney U on aligned values
4. Visualize the warping path and alignment results

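**Key Insight:** P-value = 0.618 (not significant). DTW raised the correlation from 0.685 to 0.834, showing that the series are very similar once timing differences are removed. Some increase is expected by construction, because DTW picks the alignment that minimizes the distance between the two series. DTW is a preprocessing step here, not a different statistical test.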

**Note:** DTW doesn't remove autocorrelation - it removes timing differences. We're testing whether the return magnitudes are different when patterns are optimally matched, regardless of autocorrelation.
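
To illustrate what "warping the time axis" means, here is a from-scratch sketch of the classic dynamic-programming DTW recurrence on two synthetic sine waves. It uses absolute difference as the local cost, which is an assumption made for simplicity; the `dtaidistance` library used in this notebook may define its distance differently, so the numbers are not directly comparable. The function name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

t = np.linspace(0, 2 * np.pi, 60)
a = np.sin(t)
b = np.sin(t - 0.5)   # same shape, shifted in time

pointwise = np.sum(np.abs(a - b))   # compare index i with index i only
warped = dtw_distance(a, b)         # allow stretching and compressing

print(f"Point-by-point distance: {pointwise:.3f}")
print(f"DTW distance:            {warped:.3f}")
```

The DTW distance can never exceed the point-by-point distance, because the unwarped diagonal is itself one of the allowed paths. That is also why values paired along the DTW path tend to look more alike than values paired by date.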




In [None]:
def dtw_comparison(series1, series2, name1, name2):
    """Compare time series using Dynamic Time Warping"""
    print(f"\n=== DTW Analysis: {name1} vs {name2} ===")

    # Convert to numpy arrays and ensure they're clean
    s1 = series1.dropna().values
    s2 = series2.dropna().values

    # Truncate to same length for fair comparison
    min_len = min(len(s1), len(s2))
    s1 = s1[:min_len]
    s2 = s2[:min_len]

    print(f"Comparing series of length: {len(s1)}")

    # Calculate DTW distance
    dtw_distance = dtw.distance(s1, s2)
    print(f"DTW Distance: {dtw_distance:.4f}")

    # Get optimal warping path
    path = dtw.warping_path(s1, s2)

    # Create DTW visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Original time series
    axes[0,0].plot(range(len(s1)), s1, 'b-', label=name1, alpha=0.8)
    axes[0,0].plot(range(len(s2)), s2, 'r-', label=name2, alpha=0.8)
    axes[0,0].set_title('Original Time Series')
    axes[0,0].set_ylabel('Value')
    axes[0,0].legend()

    # DTW path visualization (simplified)
    # Pairwise absolute-difference (local cost) matrix, built with NumPy broadcasting
    dtw_matrix = np.abs(s1[:, None] - s2[None, :])

    im = axes[0,1].imshow(dtw_matrix, cmap='viridis', aspect='auto')
    axes[0,1].plot([p[1] for p in path], [p[0] for p in path], 'w-', linewidth=2)
    axes[0,1].set_title('DTW Distance Matrix & Optimal Path')
    axes[0,1].set_xlabel(f'{name2} Index')
    axes[0,1].set_ylabel(f'{name1} Index')
    plt.colorbar(im, ax=axes[0,1])

    # Aligned series using DTW path
    aligned_s1 = [s1[p[0]] for p in path]
    aligned_s2 = [s2[p[1]] for p in path]

    axes[1,0].plot(aligned_s1, 'b-', label=f'{name1} (aligned)', alpha=0.8)
    axes[1,0].plot(aligned_s2, 'r-', label=f'{name2} (aligned)', alpha=0.8)
    axes[1,0].set_title('DTW Aligned Series')
    axes[1,0].set_ylabel('Value')
    axes[1,0].legend()

    # Scatter plot of aligned values
    axes[1,1].scatter(aligned_s1, aligned_s2, alpha=0.6)
    axes[1,1].plot([min(aligned_s1), max(aligned_s1)], [min(aligned_s1), max(aligned_s1)], 'r--')
    correlation_aligned = np.corrcoef(aligned_s1, aligned_s2)[0,1]
    axes[1,1].set_xlabel(f'{name1} (aligned)')
    axes[1,1].set_ylabel(f'{name2} (aligned)')
    axes[1,1].set_title(f'Aligned Values (r = {correlation_aligned:.3f})')

    plt.tight_layout()
    plt.show()

    return dtw_distance, aligned_s1, aligned_s2

# Apply DTW to returns
dtw_dist, AAPL_aligned, MSFT_aligned = dtw_comparison(returns['AAPL'], returns['MSFT'], 'AAPL', 'MSFT')

### Note about the above Results:

- The white path shows the optimal alignment between the series (the optimal warping path)
- A path that stays close to the diagonal means the series are already well aligned in time (in other words, both stocks tend to move on the same days)

**Alignment Effect:**
- Original correlation: 0.685
- **After DTW alignment: 0.834** (a substantial increase, though not a statistical significance claim)
- DTW found subtle timing differences and corrected them

**Key Takeaway:** DTW revealed that AAPL and MSFT have very similar patterns - they just don't always move on exactly the same days. This explains why the statistical test may find even fewer differences after alignment.

In [None]:
def compare_dtw_aligned(aligned1, aligned2, name1, name2):
    """Compare DTW-aligned series"""
    print(f"\n=== DTW-Aligned Comparison: {name1} vs {name2} ===")

    stat, p_value = mannwhitneyu(aligned1, aligned2, alternative='two-sided')

    print(f"Mann-Whitney U statistic: {stat:.2f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Result: {'Significantly different' if p_value < 0.05 else 'Not significantly different'}")

    return p_value

dtw_aligned_p = compare_dtw_aligned(AAPL_aligned, MSFT_aligned, 'AAPL', 'MSFT')

## 6. Block Bootstrap

**Process:**

1. **Combine series data** - Concatenate AAPL and MSFT returns into one long series (since we're comparing overall distributions, not paired differences)
2. **Resample in chunks** - Take consecutive blocks of 10 days (e.g., days 50-59, 200-209) with random starting positions
3. **Preserve local correlation** - Don't shuffle individual points (breaks autocorrelation), shuffle blocks instead
4. **Create bootstrap samples** - Combine blocks to make new time series of same length, split back into two groups
5. **Calculate test statistic** - Find difference in means for each bootstrap sample (repeat 1000 times)
6. **Build null distribution** - If no true difference exists, our observed difference should fall within this distribution
7. **Compare observed to null** - Our actual difference (0.0024%) vs bootstrap distribution

**Key Insight:** P-value = 0.990 (not significant). Our observed difference falls near the center of what we'd expect by random chance, even after accounting for temporal correlation. The most conservative of the four methods reaches the same conclusion as the others.
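
Why blocks rather than single points? The sketch below compares the two resampling styles on a synthetic AR(1) series. It is a minimal illustration under assumed settings (coefficient 0.7, block size 10), and `block_resample` and `lag1_autocorr` are hypothetical helper names rather than functions from this notebook.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    xc = x - x.mean()
    return np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)

def block_resample(x, block_size, rng):
    """Moving-block resample: stitch together random consecutive blocks of x."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_size))
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [x[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:n]   # trim to the original length

rng = np.random.default_rng(7)
n = 2000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t]

shuffled = rng.permutation(x)                       # point-wise shuffle
blocked = block_resample(x, block_size=10, rng=rng)  # block-wise resample

print(f"Original lag-1 autocorrelation:  {lag1_autocorr(x):.3f}")
print(f"After point shuffle:             {lag1_autocorr(shuffled):.3f}")
print(f"After block resampling:          {lag1_autocorr(blocked):.3f}")
```

Point shuffling drives the autocorrelation toward zero, while block resampling keeps most of it. The dependence is only broken at the joins between blocks, so larger blocks preserve more of it at the cost of fewer distinct blocks to draw from.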

In [None]:

def block_bootstrap_test(series1, series2, name1, name2, block_size=10, n_bootstrap=1000):
    """Block bootstrap test preserving temporal correlation"""
    print(f"\n=== Block Bootstrap Test: {name1} vs {name2} ===")
    print(f"Block size: {block_size}, Bootstrap samples: {n_bootstrap}")

    # Original test statistic (difference in means)
    original_diff = series1.mean() - series2.mean()
    print(f"Original mean difference: {original_diff:.4f}")

    # Combine series for bootstrap
    combined = pd.concat([series1, series2])
    n1, n2 = len(series1), len(series2)
    n_total = len(combined)

    bootstrap_diffs = []

    for i in range(n_bootstrap):
        # Generate bootstrap sample using block sampling
        bootstrap_sample = []

        while len(bootstrap_sample) < n_total:
            # Random starting position
            start = np.random.randint(0, n_total - block_size + 1)
            block = combined.iloc[start:start + block_size].values
            bootstrap_sample.extend(block)

        # Trim to exact size
        bootstrap_sample = bootstrap_sample[:n_total]

        # Split into two groups
        boot_group1 = bootstrap_sample[:n1]
        boot_group2 = bootstrap_sample[n1:n1+n2]

        # Calculate test statistic
        boot_diff = np.mean(boot_group1) - np.mean(boot_group2)
        bootstrap_diffs.append(boot_diff)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # Calculate p-value (two-tailed); cap at 1 because doubling the smaller tail can exceed 1
    p_value = min(1.0, 2 * min(np.mean(bootstrap_diffs >= original_diff),
                               np.mean(bootstrap_diffs <= original_diff)))

    print(f"Bootstrap p-value: {p_value:.4f}")
    print(f"Result: {'Significantly different' if p_value < 0.05 else 'Not significantly different'}")

    # Visualization
    plt.figure(figsize=(10, 6))
    plt.hist(bootstrap_diffs, bins=50, alpha=0.7, density=True, color='lightblue')
    plt.axvline(original_diff, color='red', linestyle='--', linewidth=2,
                label=f'Observed difference: {original_diff:.4f}')
    plt.axvline(0, color='black', linestyle='-', alpha=0.5, label='Null hypothesis')
    plt.xlabel('Mean Difference')
    plt.ylabel('Density')
    plt.title(f'Block Bootstrap Distribution (p = {p_value:.4f})')
    plt.legend()
    plt.show()

    return p_value

block_bootstrap_p = block_bootstrap_test(returns['AAPL'], returns['MSFT'], 'AAPL', 'MSFT')


## 7: Final Comparison
**Process:**
1. Compare all four p-values side by side
2. Evaluate method consistency and sensitivity
3. Interpret results in context of detected autocorrelation

**Key Insight:** All four methods agree (no significant difference), which builds confidence in the conclusion. DTW alignment gave the smallest p-value and the block bootstrap the largest. This consistency supports, though it cannot prove, that AAPL and MSFT have similar return distributions.

In [None]:

print("\n" + "="*60)
print("SUMMARY OF ALL APPROACHES")
print("="*60)

results_summary = pd.DataFrame({
    'Method': ['Naive Mann-Whitney U', 'Pre-whitened Residuals', 'DTW Aligned', 'Block Bootstrap'],
    'P-value': [naive_p, prewhitened_p, dtw_aligned_p, block_bootstrap_p],
    'Significant': [p < 0.05 for p in [naive_p, prewhitened_p, dtw_aligned_p, block_bootstrap_p]]
})

print(results_summary.to_string(index=False))

print(f"\nKey Insights:")
print(f"- Temporal autocorrelation {'was' if AAPL_has_autocorr or MSFT_has_autocorr else 'was not'} detected")
print(f"- DTW distance between series: {dtw_dist:.4f}")
print(f"- Original correlation: {correlation:.3f}")

# Final visualization comparing all approaches
fig, ax = plt.subplots(figsize=(10, 6))
methods = results_summary['Method']
p_values = results_summary['P-value']
colors = ['red' if sig else 'gray' for sig in results_summary['Significant']]

bars = ax.bar(methods, p_values, color=colors, alpha=0.7)
ax.axhline(0.05, color='black', linestyle='--', label='α = 0.05')
ax.set_ylabel('P-value')
ax.set_title('Comparison of Statistical Test Results')
ax.tick_params(axis='x', rotation=45)

# Add p-value labels on bars
for bar, p_val in zip(bars, p_values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.001,
            f'{p_val:.4f}', ha='center', va='bottom')

plt.legend()
plt.tight_layout()
plt.show()

## Final Summary: Method Comparison

**Unanimous Agreement:** All four methods conclude no significant difference between AAPL and MSFT returns.

**Methods Ordered by P-value (smallest first):**
1. **DTW Aligned** (p = 0.618) - Smallest p-value, obtained after optimal alignment
2. **Pre-whitened** (p = 0.882) - Slightly lower p-value than the naive test after removing autocorrelation
3. **Naive Mann-Whitney** (p = 0.921) - Simple approach worked well
4. **Block Bootstrap** (p = 0.990) - Most conservative, accounts for correlation structure

**Key Takeaways:**
- **Method consistency builds confidence** - when all approaches agree, we trust the result
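- **Autocorrelation didn't bias results severely** - mild autocorrelation (AR(1) coefficients ≈ 0.01) didn't mislead the naive test; note that r = 0.685 is the correlation between the two stocks, not a measure of autocorrelation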
- **"No significant difference" IS a valid scientific conclusion** - confirms these tech stocks behave similarly
- **Sophisticated methods aren't always needed** - but checking multiple approaches is good practice

**Real-World Insight:** AAPL and MSFT are so similar (same sector, market cap, economic drivers) that even advanced statistical methods can't find meaningful differences in their return patterns.