# Lab 6: Confidence and Credible Intervals
Welcome to the sixth DS102 lab! 

The goal of this lab is to build a better understanding of the confidence intervals that result from the Chebyshev and Hoeffding bounds we have seen in lecture.

The code you need to write is marked with the comment "TODO: fill this in". There is additional documentation for each part as you go along.


## Course Policies

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others please **include their names** in the cell below.

**Submission**: To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run All), then print it as a PDF (File > Download as > PDF) and submit it to Gradescope.


**This assignment should be completed and submitted before Wednesday October 23, 2019 at 11:59 PM.** This is intentionally one day later than usual since the homework is due on Tuesday.

Write collaborator names here.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as patches 
import numpy as np
import scipy.stats
import seaborn as sns

In [None]:
# Define the alpha and beta for the beta distribution
alpha=1.5
beta=1.5

#Plots the confidence interval computed from the samples.
def visualize_confidence_interval(samples,alpha,beta,ci,c, title ,ci_label,plot_dist=1,linestyle='--'):
    """
    Visualizes confidence interval. Plots underlying Beta distribution given alpha and beta,
    and then visualizes the confidence interval. 
    
    Parameters
    ----------
    samples: list of samples
    
    alpha,beta: parameters of beta distribution
    
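    ci: computed confidence interval [c1,c2] that captures the mean with 
        probability at least 1-delta.
    
    c: color of confidence interval
    
    title: title of plot
    
    ci_label: string for the type of confidence interval
    
    plot_dist: if truthy, also plot the Beta density, the sample mean, and the true mean
    
    linestyle: line style used for the boundaries of the confidence interval
    
    """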
    #If visualize distribution
    if plot_dist:
        sns.kdeplot(np.random.beta(alpha,beta,(50000,)),shade=False,lw=3,label='p.d.f')
        plt.plot([np.mean(samples),np.mean(samples)],[0,3.0],'k',lw=2, label='Sample mean')
        plt.plot([alpha/(alpha+beta),alpha/(alpha+beta)],[0,3.0],'darkred',lw=2, label='True mean')
        plt.title(title) 
        
    #Visualize confidence interval
    plt.text(x=0.1,y=-0.15,s='Mean is in the shaded area at least 95% of the time')
    rect=patches.Rectangle([ci[0],0], ci[1]-ci[0], 3.0, color=c,alpha=0.25,ls=linestyle)
    plt.gca().add_patch(rect)
    plt.plot([ci[0],ci[0]],[0,3.0],c,ls=linestyle,lw=2,label=ci_label)
    plt.plot([ci[1],ci[1]],[0,3.0],c,ls=linestyle,lw=2)
    plt.ylim([-0.2,3.0])
    plt.xlim([0,1.0])
    plt.legend(loc='upper right')  
    return

## Chebyshev Confidence Intervals

In this lab we are interested in building confidence intervals for the mean of the distribution. These are intervals $CI=[c_1(X),c_2(X)]$ such that, with probability at least $1-\delta$, the mean $\mu=\mathbb{E}[X]$ is captured in CI. That is: 

$$ \mathbb{P}(c_1(X)<\mu<c_2(X))\ge 1-\delta$$


We will begin by analyzing the Chebyshev bound, which is given by:

$$\mathbb{P}(|X-\mu| > \epsilon) \le \frac{Var(X)}{\epsilon^2},$$

where $\epsilon>0$.

Rearranging, we can see that this bound can be used to construct a confidence interval:

$$\begin{aligned}
\mathbb{P}(|X-\mu| \le \epsilon) &\ge 1-\frac{Var(X)}{\epsilon^2}\\
\mathbb{P}(X-\epsilon \le \mu \le X+\epsilon) &\ge 1-\frac{Var(X)}{\epsilon^2}
\end{aligned}$$


Therefore, the Chebyshev bound guarantees that, with probability at least $1-\frac{Var(X)}{\epsilon^2}$, the mean $\mu$ is in the interval: 

$$[c_1(X),c_2(X)]= [X-\epsilon,X+\epsilon].$$

Notice that the Chebyshev bound requires only knowledge of the variance of the distribution to derive a confidence interval for the mean $\mathbb{E}[X]$. 
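
To turn the bound into an interval with a chosen failure probability $\delta$, one can set $\frac{Var(X)}{\epsilon^2}=\delta$ and solve for $\epsilon$, which gives $\epsilon=\sqrt{Var(X)/\delta}$. The following is a minimal numeric sketch of that step only; the helper name `chebyshev_half_width` is hypothetical and is not part of the lab code.

```python
import math

def chebyshev_half_width(variance, delta):
    """Half-width eps that solves variance / eps**2 == delta (hypothetical helper)."""
    return math.sqrt(variance / delta)

# Example: variance 0.0625 (the variance of a Beta(1.5, 1.5)) at delta = 0.05
eps = chebyshev_half_width(0.0625, 0.05)
print(round(eps, 3))  # 1.118
```

Here the half-width is larger than 1, so for a single observation on $[0,1]$ the resulting interval would cover the whole range and carry no information.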


Fill out the following function, which returns the confidence interval for the mean that the Chebyshev bound guarantees with probability at least $1-\delta$. 



In [None]:
def confidence_interval_Chebyshev(x,variance,delta):
    """
    Given the sample, variance, and desired confidence level, returns the interval [c1,c2], that 
    captures the mean with probability at least 1-delta.
    
    
    Parameters
    ----------
    x : sample
    
    variance: variance of distribution

    delta : confidence level such that the interval [c1,c2] captures the mean with probability at least 1-delta.
    
    Returns
    -------
    [c1,c2]: where c1, and c2 are the lower and upper bounds of the confidence interval
    
    """

    c1=#TODO: fill this in
    c2= #TODO: fill this in
    return [c1,c2]

## 1: Chebyshev Confidence Interval for the Beta Distribution

In this lab we are interested in constructing confidence intervals for data coming from a $Beta$ distribution. Remember that the $Beta$ distribution is parametrized by two parameters $\alpha>0, \beta>0$, and its density is given by:

$$Beta(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$$

where $0<x<1$. The mean and variance of the $Beta$ distribution are given by:

$$ \mu=\mathbb{E}[X]=\frac{\alpha}{\alpha+\beta}$$
$$ \sigma^2=Var[X]=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
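
As a quick sanity check, the closed-form mean and variance above can be compared with the values reported by SciPy's Beta distribution. This is only an illustrative sketch; the variable names `a`, `b`, and `dist` are chosen here and do not appear in the lab code.

```python
import scipy.stats

a, b = 1.5, 1.5  # same values the lab uses for alpha and beta

# Closed-form mean and variance from the formulas above
mean_formula = a / (a + b)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))

# Reference values from SciPy's frozen Beta(a, b) distribution
dist = scipy.stats.beta(a, b)
print(mean_formula, dist.mean())
print(var_formula, dist.var())
```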

We will assume that we know the variance $\sigma^2$ of the distribution, but not the mean, and that we get $n$ i.i.d. samples $X_1,...,X_n$ from the distribution. 

We would like to use our Chebyshev confidence interval function to construct a confidence interval from the sample mean $\bar X$ defined as:

$$ \bar X= \frac{1}{n}\sum_{i=1}^n X_i $$

Note that, because the samples are i.i.d., the variance of the sample mean is given by:

$$ \sigma_n^2 = \frac{\sigma^2}{n} $$
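
The $\sigma^2/n$ scaling can be checked by simulation. The sketch below draws many independent sample means and compares their empirical variance with $\sigma^2/n$; the seeded generator `rng`, the choice $n=10$, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
a, b, n = 1.5, 1.5, 10
sigma2 = a * b / ((a + b) ** 2 * (a + b + 1))  # variance of a single draw

# 100,000 independent sample means, each over n i.i.d. Beta(a, b) draws
sample_means = rng.beta(a, b, size=(100_000, n)).mean(axis=1)

print(sample_means.var())  # empirical variance of the sample mean
print(sigma2 / n)          # theoretical value sigma^2 / n
```

The two printed numbers should agree closely, up to simulation noise.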

Fill out the following function, which takes in the samples and the variance of the distribution and returns the confidence interval around the sample mean. We will make use of the function you filled out above.

In [None]:
def confidence_interval_Chebyshev_sample_mean(samples,variance,delta):
    """
    Given the samples, variance, and desired confidence level, returns the interval [c1,c2], that 
    captures the mean with probability at least 1-delta.
    
    
    Parameters
    ----------
    samples : list of samples
    
    variance: variance of distribution

    delta : confidence level such that the interval [c1,c2] captures the mean with probability at least 1-delta.
    
    Returns
    -------
    [c1,c2]: where c1, and c2 are the lower and upper bounds of the confidence interval
    """
    n=len(samples)
    sample_mean=#TODO: fill this in
    sample_mean_variance = #TODO: fill this in
    return confidence_interval_Chebyshev(sample_mean,sample_mean_variance,delta)

Run the following code to see the $95 \%$ confidence intervals with different numbers of samples from the same distribution.

In [None]:
#Compute the variance of the Beta distribution given alpha and beta
variance=alpha*beta/((alpha+beta)**2*(alpha+beta+1))

#1 sample from distribution
x1= np.random.beta(alpha,beta,(1,))

#10 samples from distribution
x10= np.random.beta(alpha,beta,(10,))

#50 samples from distribution
x50= np.random.beta(alpha,beta,(50,))

#Compute the 95% confidence intervals for each set of samples using the Chebyshev Bounds.
delta=0.05
chebyshev_ci_1_sample=confidence_interval_Chebyshev_sample_mean(x1,variance,delta)
chebyshev_ci_10_samples=confidence_interval_Chebyshev_sample_mean(x10,variance,delta)
chebyshev_ci_50_samples=confidence_interval_Chebyshev_sample_mean(x50,variance,delta)


#Visualize the 95% confidence intervals for each set of samples
plt.figure()
visualize_confidence_interval(x1,alpha,beta,chebyshev_ci_1_sample,'darkgreen','1 Sample', 'Chebyshev CI')

xlabel='#TODO: fill this in'
ylabel='#TODO: fill this in'
plt.xlabel(xlabel)
plt.ylabel(ylabel)

plt.figure()
visualize_confidence_interval(x10,alpha,beta,chebyshev_ci_10_samples,'darkgreen','10 Samples','Chebyshev CI')
plt.xlabel(xlabel)
plt.ylabel(ylabel)

plt.figure()
visualize_confidence_interval(x50,alpha,beta,chebyshev_ci_50_samples,'darkgreen','50 Samples','Chebyshev CI')
plt.xlabel(xlabel)
plt.ylabel(ylabel)



## 2. Frequentist Properties of Confidence Intervals

We will now test the frequentist properties of confidence intervals. Fill out the following function, which generates 10,000 sets of 10 samples from a $Beta(\alpha,\beta)$ distribution and counts the number of times the confidence interval you calculated above captures the true mean.
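
The frequentist guarantee says that, over many repetitions of the experiment, at least a $1-\delta$ fraction of the computed intervals should contain the true mean. Below is a vectorized sketch of such a coverage check. It is not the loop-based function used in this lab: it assumes the Chebyshev half-width $\sqrt{\sigma^2/(n\delta)}$ for the sample mean, and the names `rng`, `half_width`, and `coverage` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
a, b, n, delta, trials = 1.5, 1.5, 10, 0.05, 10_000

true_mean = a / (a + b)
sigma2 = a * b / ((a + b) ** 2 * (a + b + 1))
half_width = np.sqrt(sigma2 / (n * delta))  # assumed Chebyshev half-width for the sample mean

# One sample mean per repetition, then check whether each interval covers the true mean
sample_means = rng.beta(a, b, size=(trials, n)).mean(axis=1)
covered = np.abs(sample_means - true_mean) <= half_width
coverage = covered.mean()
print(coverage)
```

Because the Chebyshev bound is conservative, the empirical coverage is typically well above $1-\delta$.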

In [None]:
def test_confidence_intervals(confidence_interval_function, alpha,beta,delta,bound_type='Chebyshev'):
    """
    Counts the number of times (out of 10,000) the confidence interval calculated using a 
    confidence_interval_function captures the true mean.
    
    
    Parameters
    ----------
    confidence_interval_function  : a function that computes the confidence interval given the 
                                    sample, variance, and level
    
    alpha, beta: parameters of beta distribution

    delta : confidence level such that the interval [c1,c2] captures the mean with probability at least 1-delta.
    
    Returns
    -------
    count: number of times the confidence interval captured the true mean.
    """
    
    #compute true mean and variance of beta distribution
    mean=alpha/(alpha+beta)
    variance=alpha*beta/((alpha+beta)**2*(alpha+beta+1))
    
    count=0
    for test in range(10000):
        #collect 10 samples
        samples=np.random.beta(alpha,beta,(10,))
        
        #compute confidence interval
        c1,c2=confidence_interval_function(samples,variance,delta)
        
        #check if confidence interval encompasses mean.
        
        #TODO: fill this in
        
        
    print('#############################################################')
    print(bound_type+r' Confidence Interval captured the mean: {}% of the time'.format(count/100.0))
    print('#############################################################')
    return count

Given this function, let us test out the frequentist properties of the Chebyshev confidence interval.

In [None]:
test_confidence_intervals(confidence_interval_Chebyshev_sample_mean, alpha,beta,0.05)

## 3. Confidence Intervals from Hoeffding Bounds
In the previous sections, we constructed confidence intervals using the Chebyshev bound. In this part we will use the Hoeffding bound. Recall that the Hoeffding bound is defined for a bounded random variable. This means that there exist $a,b$ such that:

$$\mathbb{P}(a\le X \le b)=1.$$

For such random variables, the Hoeffding bound controls the deviation of the sample mean from the true mean:

$$ \mathbb{P}(| \bar X -\mu| > \epsilon ) \le 2e^{-\frac{2n^2\epsilon^2}{\sum_{i=1}^n(b-a)^2}}$$

Note that this does not require any knowledge of the variance!

Since the $Beta$ distribution is supported on $[0,1]$, we can take $a=0$ and $b=1$, so that $\sum_{i=1}^n(b-a)^2=n$ and the Hoeffding bound simplifies to:

$$ \mathbb{P}(| \bar X -\mu| > \epsilon ) \le 2e^{-2n\epsilon^2}$$ 

Following the same derivation as for the Chebyshev inequality, we can construct the following confidence interval for the mean:

$$ \mathbb{P}(\bar X- \epsilon \le \mu \le \bar X+ \epsilon ) \ge 1-2e^{-2n\epsilon^2}.$$

Setting $2e^{-2n\epsilon^2}=\delta$ and solving for $\epsilon$ gives $\epsilon=\sqrt{\frac{1}{2n}\log{\frac{2}{\delta}}}$, so for a given $\delta$ we get the following confidence interval:

$$ \mathbb{P}\left(\bar X- \sqrt{\frac{1}{2n}\log{\frac{2}{\delta}}} \le \mu \le \bar X+ \sqrt{\frac{1}{2n}\log{\frac{2}{\delta}}} \right) \ge 1-\delta.$$
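
To get a feel for the size of this interval, the half-width $\sqrt{\frac{1}{2n}\log{\frac{2}{\delta}}}$ can be evaluated for a few sample sizes. This is a small sketch under the assumption that the samples lie in $[0,1]$; the helper name `hoeffding_half_width` is hypothetical.

```python
import math

def hoeffding_half_width(n, delta):
    """Half-width sqrt(log(2/delta) / (2n)) for samples bounded in [0, 1] (hypothetical helper)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

for n in (1, 10, 50):
    print(n, round(hoeffding_half_width(n, 0.05), 3))
# 1 1.358
# 10 0.429
# 50 0.192
```

With a single sample the half-width exceeds 1, so the interval covers all of $[0,1]$; it shrinks at the rate $1/\sqrt{n}$ as more samples are collected.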

Fill out the following function that returns the Hoeffding confidence interval from a list of samples.

In [None]:
def confidence_interval_Hoeffding_sample_mean(samples,variance,delta):
    """
    Given the samples and desired confidence level, returns the interval [c1,c2], that 
    captures the mean with probability at least 1-delta.
    
    
    Parameters
    ----------
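    samples : list of samples
    
    variance: variance of distribution (not used by the Hoeffding bound; kept so the
              signature matches confidence_interval_Chebyshev_sample_mean)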

    delta : confidence level such that the interval [c1,c2] captures the mean with probability at least 1-delta.
    
    Returns
    -------
    [c1,c2]: where c1, and c2 are the lower and upper bounds of the confidence interval
    """
    c1=#TODO: fill this in
    c2=#TODO: fill this in
    return [c1,c2]

In [None]:
#Compute the 95% confidence intervals for each set of samples using the Hoeffding Bounds.
hoeffding_ci_1_sample=confidence_interval_Hoeffding_sample_mean(x1,variance,delta)
hoeffding_ci_10_samples=confidence_interval_Hoeffding_sample_mean(x10,variance,delta)
hoeffding_ci_50_samples=confidence_interval_Hoeffding_sample_mean(x50,variance,delta)


#Visualize both the Chebyshev and Hoeffding Confidence Intervals
visualize_confidence_interval(x1,alpha,beta,chebyshev_ci_1_sample,'darkgreen','1 Sample','Chebyshev CI')
visualize_confidence_interval(x1,alpha,beta,hoeffding_ci_1_sample,'gold','1 Sample','Hoeffding CI',False,'-')
plt.xlabel(xlabel)
plt.ylabel(ylabel)

plt.figure()
visualize_confidence_interval(x10,alpha,beta,chebyshev_ci_10_samples,'darkgreen','10 Samples','Chebyshev CI')
visualize_confidence_interval(x10,alpha,beta,hoeffding_ci_10_samples,'gold','10 Samples', 'Hoeffding CI',False,'-')
plt.xlabel(xlabel)
plt.ylabel(ylabel)

plt.figure()
visualize_confidence_interval(x50,alpha,beta,chebyshev_ci_50_samples,'darkgreen','50 Samples','Chebyshev CI')
visualize_confidence_interval(x50,alpha,beta,hoeffding_ci_50_samples,'gold','50 Samples','Hoeffding CI',False,'-')
plt.xlabel(xlabel)
plt.ylabel(ylabel)

Let us test the frequentist properties of the Hoeffding Confidence Interval.

In [None]:
test_confidence_intervals(confidence_interval_Hoeffding_sample_mean, alpha,beta,0.05, 'Hoeffding')

## 4: Comparing the Hoeffding and Chebyshev Confidence Intervals
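
Both half-widths shrink at the rate $1/\sqrt{n}$: the Chebyshev half-width for the sample mean is $\sqrt{\sigma^2/(n\delta)}$ and the Hoeffding half-width on $[0,1]$ is $\sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$. Which interval is narrower therefore depends only on $\sigma^2$ and $\delta$: Chebyshev is narrower exactly when $\sigma^2 < \frac{\delta}{2}\log\frac{2}{\delta}$. The sketch below evaluates this threshold for the three Beta distributions used in this lab; the helper `beta_variance` and the variable `threshold` are names introduced here for illustration.

```python
import math

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

delta = 0.05
# Chebyshev is narrower when sigma^2 falls below this threshold
threshold = 0.5 * delta * math.log(2.0 / delta)
print(round(threshold, 4))  # 0.0922

for a, b in [(1.5, 1.5), (3.0, 3.0), (0.25, 0.2)]:
    var = beta_variance(a, b)
    narrower = "Chebyshev" if var < threshold else "Hoeffding"
    print((a, b), round(var, 4), narrower)
```
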
### Lower Variance Beta Distribution

Now let us compare the two confidence intervals when the variance of the distribution is lower:

In [None]:
#Define the parameters of the lower-variance beta distribution and compute its variance
alpha=3.0
beta=3.0
variance=alpha*beta/((alpha+beta)**2*(alpha+beta+1))

#Test the frequentist properties of the confidence intervals with this new distribution
test_confidence_intervals(confidence_interval_Chebyshev_sample_mean, alpha,beta,0.05,'Chebyshev')
test_confidence_intervals(confidence_interval_Hoeffding_sample_mean, alpha,beta,0.05,'Hoeffding')

#Collect samples from this distribution.
x20_lower_variance= np.random.beta(alpha,beta,(20,))

#Compute and plot confidence intervals
hoeffding_ci_lower_var=confidence_interval_Hoeffding_sample_mean(x20_lower_variance,variance,delta)
chebyshev_ci_lower_var=confidence_interval_Chebyshev_sample_mean(x20_lower_variance,variance,delta)

visualize_confidence_interval(x20_lower_variance,alpha,beta,chebyshev_ci_lower_var,'darkgreen','20 Samples','Chebyshev CI')
visualize_confidence_interval(x20_lower_variance,alpha,beta,hoeffding_ci_lower_var,'gold','20 Samples','Hoeffding CI',False,'-')
plt.xlabel(xlabel)
plt.ylabel(ylabel)

### Higher Variance Beta Distribution
Now let us compare the two confidence intervals when the variance of the distribution is higher:

In [None]:
#Define the parameters of the higher-variance beta distribution and compute its variance
alpha=0.25
beta=0.2
variance=alpha*beta/((alpha+beta)**2*(alpha+beta+1))

#Test the frequentist properties of the confidence intervals with this new distribution
test_confidence_intervals(confidence_interval_Chebyshev_sample_mean, alpha,beta,0.05,'Chebyshev')
test_confidence_intervals(confidence_interval_Hoeffding_sample_mean, alpha,beta,0.05,'Hoeffding')

#Collect samples from this distribution.
x20_higher_variance= np.random.beta(alpha,beta,(20,))


#Compute and plot confidence intervals
hoeffding_ci_higher_var=confidence_interval_Hoeffding_sample_mean(x20_higher_variance,variance,delta)
chebyshev_ci_higher_var=confidence_interval_Chebyshev_sample_mean(x20_higher_variance,variance,delta)

visualize_confidence_interval(x20_higher_variance,alpha,beta,chebyshev_ci_higher_var,'darkgreen','20 Samples','Chebyshev CI')
visualize_confidence_interval(x20_higher_variance,alpha,beta,hoeffding_ci_higher_var,'gold','20 Samples','Hoeffding CI',False,'-')
plt.xlabel(xlabel)
plt.ylabel(ylabel)

## 5: Pros and Cons of Chebyshev and Hoeffding Bounds
Fill out the cell below with your takeaways on the pros and cons for Chebyshev and Hoeffding Bounds.

#TODO: fill this in