# Confidence Intervals

In statistics, a confidence interval (CI) is a type of estimate computed from the observed data. It gives a range of plausible values for an unknown parameter (for example, a population mean). The interval has an associated confidence level: the proportion of intervals, over repeated sampling, that would contain the true value of the parameter.

The confidence level is designated before examining the data. Most commonly, a 95% confidence level is used.

Suppose a dataset $x_1, \ldots, x_n$ is given, modeled as realization of random variables $X_1, \ldots, X_n$. Let $\theta$ be the parameter of interest, and $\gamma$ a number between 0 and 1. If there exist sample statistics $L_n = g(X_1, \ldots, X_n)$ and $U_n = h(X_1, \ldots, X_n)$ such that:

$$P(L_n < \theta < U_n) = \gamma \quad \text{for every value of } \theta$$

then $(l_n, u_n)$, where $l_n = g(x_1, \ldots, x_n)$ and $u_n = h(x_1, \ldots, x_n)$, is called a $\gamma \times 100\%$ confidence interval for $\theta$. The number $\gamma$ is called the confidence level.[1]
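The coverage property in this definition can be checked empirically. The sketch below is only an illustration under stated assumptions: the data are normal with a known standard deviation, the interval is the textbook z-interval (mean plus or minus 1.96 standard errors), and the names `z_interval`, `theta`, `sigma`, and `trials` are hypothetical choices that do not appear in the text above. It draws many datasets, builds an interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np


def z_interval(sample, sigma, z=1.96):
    """Return (lower, upper) for the mean, assuming sigma is known.

    z=1.96 is the standard normal critical value for a 95% level.
    """
    center = sample.mean()
    half_width = z * sigma / np.sqrt(sample.size)
    return center - half_width, center + half_width


rng = np.random.default_rng(0)
theta, sigma, n = 5.0, 2.0, 50  # hypothetical true mean, known std, sample size
trials = 10_000

covered = 0
for _ in range(trials):
    # each trial plays the role of one realization x_1, ..., x_n
    sample = rng.normal(loc=theta, scale=sigma, size=n)
    lower, upper = z_interval(sample, sigma)
    if lower < theta < upper:
        covered += 1

coverage = covered / trials
print(f"empirical coverage: {coverage:.3f}")
```

The printed proportion should land close to 0.95, which is the sense in which $\gamma$ is a property of the procedure rather than of any single computed interval.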

In [1]:
# bootstrap confidence intervals
from numpy.random import seed, rand, randint
from numpy import mean, median, percentile
# seed the random number generator
seed(1)
# generate dataset
dataset = 0.5 + rand(1000) * 0.5
# bootstrap
scores = list()
for _ in range(100):
	# bootstrap sample
	indices = randint(0, 1000, 1000)
	sample = dataset[indices]
	# calculate and store statistic
	statistic = mean(sample)
	scores.append(statistic)
print('50th percentile (median) = %.3f' % median(scores))
# calculate 95% confidence intervals (100 - alpha)
alpha = 5.0
# calculate lower percentile (e.g. 2.5)
lower_p = alpha / 2.0
# retrieve observation at lower percentile
lower = max(0.0, percentile(scores, lower_p))
print('%.1fth percentile = %.3f' % (lower_p, lower))
# calculate upper percentile (e.g. 97.5)
upper_p = (100 - alpha) + (alpha / 2.0)
# retrieve observation at upper percentile
upper = min(1.0, percentile(scores, upper_p))
print('%.1fth percentile = %.3f' % (upper_p, upper))

50th percentile (median) = 0.750
2.5th percentile = 0.741
97.5th percentile = 0.757


We can then use these observations to make a claim about the sample distribution, such as:

There is a 95% likelihood that the range 0.741 to 0.757 covers the true population mean (0.75 for this synthetic dataset, whose values are drawn uniformly from [0.5, 1.0)).
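As a rough cross-check on the bootstrap percentile interval, the same dataset can be given a normal-approximation interval built from the sample mean and its standard error. This is a sketch, not part of the original procedure: it assumes the sample is large enough for the central limit theorem to apply, and the names `standard_error` and `z` are introduced here for illustration.

```python
import numpy as np
from numpy.random import seed, rand

# regenerate the same dataset as in the bootstrap example
seed(1)
dataset = 0.5 + rand(1000) * 0.5

n = dataset.size
sample_mean = dataset.mean()
# standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = dataset.std(ddof=1) / np.sqrt(n)

z = 1.96  # standard normal critical value for a 95% level
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print('normal-approximation 95%% interval = (%.3f, %.3f)' % (lower, upper))
```

The two intervals should be of similar width. A noticeable gap would suggest either too few bootstrap resamples (the example above uses only 100) or a statistic whose sampling distribution is far from normal.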