In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
from sklearn.utils import resample

In [2]:
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), st.sem(a)
    h = se * st.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

**1. Let X1,..., Xn be independent and identically distributed random variables
having unknown mean μ. For given constants a < b, we are interested in
estimating p = P{a < sum(Xi)/n − μ < b}.**

**(a) Explain how we can use the bootstrap approach to estimate p.**

The bootstrap method is a technique to estimate the variance of an estimator based on sampling from the empirical solution. For a concrete number of iterations, it takes N samples from the original dataset **with replacement**. For a relatively high number of iterations we can use this algorithm to estimate the probability given by the estimator p. This is can done by checking if the condition (a < sum(Xi)/N − μ < b) for each of the generated samples and then simply compute the fraction (condition_ocurred/num_iterations).

**(b) Estimate p if n = 10 and the values of the Xi are 56, 101, 78, 67, 93, 87,
64, 72, 80, and 69. Take a = −5, b = 5.**

In [3]:
n = 10
a = -5
b = 5
data = [56, 101, 78, 67, 93, 87, 64, 72, 80, 69]
iterations = 1000

# bootstrap algorithm
statistics = []
mu = np.mean(data)
for i in range(iterations):
    sample = resample(data, replace=True, n_samples=n, random_state=i)
    estimator = (sum(sample) / n) - mu
    #print(f'a < {estimator} < b = {a < estimator and b > estimator}')
    statistics.append(estimator)
p = [1 if a < i and b > i else 0 for i in statistics]
variance = np.var(p)
CI = mean_confidence_interval(p)

print(f'p = {sum(p)/len(p)}')
print(f'Variance using bootstrap algorithm: {variance} [CI = ({CI[1]},{CI[2]})]')

p = 0.777
Variance using bootstrap algorithm: 0.17327099999999998 [CI = (0.7511562948998318,0.8028437051001682)]


In [4]:
print(np.var(data))

174.01


2. If n = 15 and the data are

In [5]:
n = 15
a = -5
b = 5
data = [5, 4, 9, 6, 21, 17, 11, 20, 7, 10, 21, 15, 13, 16, 8]
iterations = 1000

# bootstrap algorithm
statistics = []
mu = np.mean(data)
for i in range(iterations):
    sample = resample(data, replace=True, n_samples=n, random_state=i)
    estimator = (sum(sample) / n) - mu
    #print(f'a < {estimator} < b = {a < estimator and b > estimator}')
    statistics.append(estimator)
print(np.mean(statistics))
p = [1 if a < i and b > i else 0 for i in statistics]
variance = np.var(p)
CI = mean_confidence_interval(p)

print(f'p = {sum(p)/len(p)}')
print(f'Variance using bootstrap algorithm: {variance} [CI = ({CI[1]},{CI[2]})]')

-0.010733333333332616
p = 1.0
Variance using bootstrap algorithm: 0.0 [CI = (1.0,1.0)]


In [6]:
print(np.var(data))

32.026666666666664


**3. Write a subroutine that takes as input a “data” vector of
observed values, and which outputs the median as well as the
bootstrap estimate of the variance of the median, based on
r = 100 bootstrap replicates. Simulate N = 200 Pareto
distributed random variates with β = 1 and k = 1.05.**

**(a) Compute the mean and the median (of the sample)**

**(b) Make the bootstrap estimate of the variance of the sample
mean.**

**(c) Make the bootstrap estimate of the variance of the sample
median.**

**(d) Compare the precision of the estimated median with the
precision of the estimated mean.**

In [7]:
def bootstrap_mean(data, N, r=100):
    statistics = []
    median = np.mean(data)
    for i in range(r):
        sample = resample(data, replace=True, n_samples=N, random_state=i)
        estimator = np.mean(sample)
        statistics.append(estimator)
    return np.var(statistics)

def bootstrap_median(data, N, r=100):
    statistics = []
    median = np.median(data)
    for i in range(r):
        sample = resample(data, replace=True, n_samples=N, random_state=i)
        estimator = np.median(sample)
        statistics.append(estimator)
    return np.var(statistics)

In [8]:
N = 200
beta = 1
k = 1.05
u = np.random.uniform(size=N)
pareto = beta / np.power(u,(1/k))

# a)
mean = np.mean(pareto)
median = np.median(pareto)
print(f'Mean: {mean}')
print(f'Median: {median}')

# b)
print(f'Boostrap estimate of the variance of the sample mean: {bootstrap_mean(data,N)}')

# c)
print(f'Boostrap estimate of the variance of the sample median: {bootstrap_median(data,N)}')

Mean: 5.002711195183667
Median: 2.0284181538476607
Boostrap estimate of the variance of the sample mean: 0.15177431000000005
Boostrap estimate of the variance of the sample median: 0.785275


By the definition of mean and median, it is logic to find that the estimate variance of the mean is lower than the estimated variance for the median. Since the different samples of length N=200 are being generated from a set of (also) N values, the median will change its value per iteration more than the average of an N-sequence.