Data is from the paper shown in Slide 26:

A Blood Test Might One Day Mass Screen Military Personnel for Post-Traumatic Stress Disorder (PTSD)

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import t

Constructing confidence intervals via central limit theorem (CLT):

Recap: the CLT says that, in many situations, when independent random variables are summed, their properly normalized sum tends toward a normal distribution, even if the original variables are not normally distributed. This is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions. Here, the test accuracy is the mean of binary (0/1) outcomes, so the CLT justifies treating it as approximately normal.
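To make the recap concrete, here is a minimal simulation sketch. The success probability `p_true`, the number of repetitions, and all variable names are illustrative assumptions, not values from the paper. It draws many samples of binary outcomes and checks that their means behave like a normal variable with standard deviation $\sqrt{p(1-p)/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

p_true = 0.8        # assumed success probability (illustrative only)
sample_size = 52    # number of binary outcomes per simulated sample
n_repeats = 20_000  # number of simulated samples

# Each row is one sample of 0/1 outcomes; take the mean of every row
samples = rng.binomial(1, p_true, size=(n_repeats, sample_size))
sample_means = samples.mean(axis=1)

# CLT prediction: the means are roughly Normal(p_true, sqrt(p(1-p)/n))
predicted_sd = np.sqrt(p_true * (1 - p_true) / sample_size)
print("Mean of sample means: %.3f (theory: %.3f)" % (sample_means.mean(), p_true))
print("SD of sample means:   %.3f (theory: %.3f)" % (sample_means.std(), predicted_sd))

# Fraction of sample means within 1.96 predicted SDs of p_true
within = np.mean(np.abs(sample_means - p_true) <= 1.96 * predicted_sd)
print("Fraction within 1.96 SD: %.3f" % within)
```

The last fraction should land near 0.95, though not exactly on it, because the mean of 52 binary outcomes can only take a discrete set of values.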

In [12]:
# From PTSD paper, there were 52 test cases, 42 of which were correct

n = 52 # total number of test cases
c = 42 # number of cases predicted correctly

# Create array of zeros, size n
rawdata = np.zeros(n)

# Set first c elements to 1
rawdata[:c] = 1

# Create pandas data frame
data = pd.DataFrame({"Match": rawdata})
data.head(2)

   Match
0    1.0
1    1.0


First, we'll build the confidence interval using the normal approximation for the sample mean.

In [13]:
## Using central limit theorem, compute confidence interval

# Degrees of freedom of an estimate is the number of independent
# pieces of information that went into calculating the estimate.
# ddof=1 divides by n-1, giving the sample standard deviation.
stderr = np.std(data.Match, ddof=1) / np.sqrt(len(data.Match))
print("Stderr: %.3f" % stderr)

# Area under a standard normal from -1.96 to 1.96 is about 95% (from -3 to 3 it is about 99.7%)
critval = 1.96 # critical value

# Confidence interval
norm_ci = [(data.Match.mean() - critval*stderr).round(3), 
           (data.Match.mean() + critval*stderr).round(3)]

print("Norm ci:",norm_ci)

Stderr: 0.055
Norm ci: [0.7, 0.916]
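The hard-coded 1.96 can be recovered from the standard normal distribution itself. The sketch below uses `scipy.stats.norm`; the variable names are illustrative. It computes the 97.5% quantile and checks the areas between ±1.96 and ±3.

```python
from scipy.stats import norm

# A two-sided 95% interval leaves 2.5% in each tail,
# so the critical value is the 97.5% quantile of the standard normal
z_95 = norm.ppf(0.975)
print("95%% critical value: %.3f" % z_95)

# Area under the standard normal between -z and z
area_196 = norm.cdf(1.96) - norm.cdf(-1.96)
area_3 = norm.cdf(3) - norm.cdf(-3)
print("Area within +/-1.96: %.4f" % area_196)
print("Area within +/-3:    %.4f" % area_3)
```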


[t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)

[t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)

If we take a sample of $n$ observations from a normal distribution, then the t-distribution with $n-1$ degrees of freedom is the distribution of the standardized sample mean $\dfrac{\bar{x}-\mu}{\hat{\sigma}/\sqrt{n}}$, where $\mu$ is the true mean and $\hat{\sigma}$ is the sample standard deviation. Because the distribution of this quantity is known, the t-distribution can be used to construct a confidence interval for the true mean.

What if we use the t-distribution instead of the normal? The $100(1-\alpha)\%$ confidence interval is

$$ \bar{x} \pm  t_{1-\alpha/2, n-1} \dfrac{\hat{\sigma}}{\sqrt{n}} $$

The t-distribution is available in SciPy as the [scipy.stats.t](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) distribution object.

The other probability distributions implemented in SciPy are listed under [Statistical functions](https://docs.scipy.org/doc/scipy/reference/stats.html).

In [14]:
# Get the critical value for t for a two-sided 95% interval:
# 5% split over two tails gives 0.05/2 = 0.025 per tail, with 52-1 = 51 dof.
alpha = (5/100)/2 # per-tail probability (0.025)
dof = n-1
crit_val = 1-alpha # cumulative probability (0.975) passed to ppf, not the critical value itself

# ppf: percent point function (or inverse cumulative distribution function).
# It returns the value x of the variable that has a given cumulative probability.
# Thus, given cdf(x) for some x, ppf returns x itself, operating as the inverse of cdf.
t_quantile = t.ppf(crit_val, df=dof) # 1st arg: cumulative probability; 2nd arg: dof
print('The t-based critical value is equal to %.2f' % t_quantile)

t_ci = data.Match.mean() + t_quantile * stderr * np.array([-1, 1])
print('The t-based confidence interval is equal to {}'.format(t_ci.round(3)))

The t-based critical value is equal to 2.01
The t-based confidence interval is equal to [0.697 0.918]
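As a cross-check, the same interval can be obtained in one call with the `interval` method of `scipy.stats.t`. This is a self-contained sketch that rebuilds the 42-out-of-52 data; the variable names are illustrative. The confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy.stats import t

n_cases, n_correct = 52, 42

# 1.0 for each correct prediction, 0.0 for each incorrect one
matches = np.concatenate([np.ones(n_correct), np.zeros(n_cases - n_correct)])

mean = matches.mean()
sem = matches.std(ddof=1) / np.sqrt(n_cases)  # standard error of the mean

# Two-sided 95% interval from a t-distribution with n-1 degrees of freedom,
# centered at the sample mean and scaled by the standard error
low, high = t.interval(0.95, n_cases - 1, loc=mean, scale=sem)
print("t interval: [%.3f, %.3f]" % (low, high))
```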


The t-based critical value equals 2.01, slightly larger than the normal value of 1.96. The difference is about 0.05, which is roughly 2% of the normal value, so the t-based interval is slightly wider. Depending on your application, a 2% difference may or may not matter. If it does not, you can use the normal approximation.
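How much the two critical values differ depends on the sample size. The sketch below compares them for a few illustrative sample sizes; only 52 comes from the data above, and the others are arbitrary choices.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # normal critical value for a two-sided 95% interval

rel_gap = {}
for size in (5, 10, 30, 52, 100, 1000):
    t_crit = t.ppf(0.975, df=size - 1)
    # Relative difference between the t and normal critical values
    rel_gap[size] = (t_crit - z_crit) / z_crit
    print("n=%4d  t=%.3f  gap=%.1f%%" % (size, t_crit, 100 * rel_gap[size]))
```

The gap shrinks as the sample size grows, so the normal approximation becomes safer for larger test sets.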