
# Confidence Interval
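- The `interval` method in `scipy.stats` can give a different result than R and GraphPad (the R result is preferred).
- The `interval` method of `scipy.stats` distributions (`t`, `norm`, etc.) has a parameter historically named `alpha`, but it is actually the confidence level (1 - alpha), not the significance level. Newer SciPy releases name it `confidence`.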
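
To illustrate the naming pitfall in the second bullet, here is a minimal sketch that passes the value positionally, so it does not depend on the keyword name. The variable names (`ci_95`, `ci_05`, and the widths) are chosen for illustration only.

```python
import numpy as np
from scipy import stats

a = np.array([1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 4, 4, 6, 7, 8])
mean, sem, df = np.mean(a), stats.sem(a), len(a) - 1

# Intended usage: the first positional argument is the confidence level.
ci_95 = stats.t.interval(0.95, df=df, loc=mean, scale=sem)

# Pitfall: passing the significance level (0.05) asks for a 5% interval,
# which is much narrower than the 95% interval that was intended.
ci_05 = stats.t.interval(0.05, df=df, loc=mean, scale=sem)

width_95 = ci_95[1] - ci_95[0]
width_05 = ci_05[1] - ci_05[0]
print(f"95% interval width: {width_95:.3f}")
print(f" 5% interval width: {width_05:.3f}")
```

If an interval from SciPy looks suspiciously tight, checking which of these two values was passed is a reasonable first step.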

In [54]:
import numpy as np
import pandas as pd

from scipy import stats
import statsmodels.stats.api as sms
import tqdm
from tqdm import trange

# Confidence Level calculations

In [47]:
def good_confIntMean(a, alpha=0.05):
    conf = 1-alpha # confidence level
    mean, sem, m = np.mean(a), stats.sem(a), stats.t.ppf((1+conf)/2., len(a)-1)
    return mean - m*sem, mean + m*sem

In [31]:
def mean_confidence_interval(data, alpha=0.05):
    confidence = 1-alpha
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m-h, m+h
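
Both functions above compute the same quantity, the usual two-sided t-interval for the mean. Written as a formula, with $\bar{x}$ the sample mean, $s$ the sample standard deviation (with $n-1$ in the denominator, which is what `stats.sem` uses by default), and $n$ the sample size:

$$
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\,\frac{s}{\sqrt{n}}
$$

Here $(1 + \text{confidence})/2 = 1 - \alpha/2$ is the quantile passed to `stats.t.ppf`.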

In [48]:
a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])

# preferred method: t-based CI for the mean at the default alpha=0.05
good_confIntMean(a)

(3.5255162351846367, 5.349483764815363)

In [33]:
mean_confidence_interval(a)

(3.5255162351846367, 5.349483764815363)

In [34]:
sms.DescrStatsW(a).tconfint_mean()

(3.5255162351846367, 5.349483764815363)

In [40]:
sms.DescrStatsW(a).tconfint_mean(alpha=0.05) # in statsmodels, alpha really is the significance level

(3.5255162351846367, 5.349483764815363)

In [35]:
stats.t.interval(1-0.05, df = len(a)-1, loc=np.mean(a), scale=stats.sem(a))

(3.5255162351846367, 5.349483764815363)

In [29]:
# for the normal distribution (0.68 is the confidence level)
stats.norm.interval(0.68, loc=np.mean(a), scale=stats.sem(a)) # norm has no df parameter

(4.012001096603741, 4.862998903396259)
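
The normal interval above uses a 68% level while the t-intervals use 95%, so their widths are not directly comparable. A small sketch that puts both at the same level (the name `conf_level` is introduced here for illustration):

```python
import numpy as np
from scipy import stats

a = np.array([1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 4, 4, 6, 7, 8])
mean, sem, n = np.mean(a), stats.sem(a), len(a)
conf_level = 0.95  # same level for both, so the widths can be compared

# t-interval: accounts for estimating the standard deviation from n points.
ci_t = stats.t.interval(conf_level, df=n - 1, loc=mean, scale=sem)

# normal interval: treats the standard error as if it were known exactly.
ci_norm = stats.norm.interval(conf_level, loc=mean, scale=sem)

print("t   :", ci_t)
print("norm:", ci_norm)
```

For a sample this small the t-interval comes out wider than the normal one; the two converge as the sample size grows.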

# CI from bootstrapping
- https://stackoverflow.com/questions/44392978/compute-a-confidence-interval-from-sample-data-assuming-unknown-distribution/66008548#66008548

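- If we have the whole population, we don't need to bootstrap; the quantity of interest can be computed directly.
- If we only have a sample, we can resample it with replacement (bootstrap), calculate a CI for each bootstrap sample, and use the distribution of those estimates to estimate the CI for the population.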

In [88]:
import numpy as np
import tqdm
from scipy import stats

alpha = 0.05
a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])


reps = 1_000

sample = a # suppose that a is a sample drawn from a large population
ci_los = []
ci_his = []
ci_points = [] # point estimate
for _ in tqdm.trange(reps):
    bootsample = np.random.choice(sample, size=len(sample), replace=True)

    # make sure to use bootsample, not sample!
    ci_lo, ci_hi = stats.t.interval(1 - alpha,
                                    df=len(bootsample) - 1,
                                    loc=np.mean(bootsample),
                                    scale=stats.sem(bootsample))
    ci_los.append(ci_lo)
    ci_his.append(ci_hi)
    # the t-interval is symmetric, so its midpoint is the bootstrap sample mean
    ci_point = (ci_lo + ci_hi) / 2
    ci_points.append(ci_point)

## calculate CI from point estimates

In [84]:
np.percentile(ci_points, [alpha/2*100,100-alpha/2*100])

array([3.5625, 5.25  ])
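
Because the midpoint of each t-interval is just the mean of that bootstrap sample, the percentile interval above depends only on the bootstrap means. A minimal vectorized sketch of that idea follows; the function name `bootstrap_percentile_ci`, the `seed` argument, and the default `reps` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np


def bootstrap_percentile_ci(sample, alpha=0.05, reps=10_000, seed=0):
    """Percentile bootstrap CI for the mean (illustrative sketch)."""
    rng = np.random.default_rng(seed)  # seeded so the result is reproducible
    sample = np.asarray(sample, dtype=float)

    # Draw all resamples at once: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(sample), size=(reps, len(sample)))
    boot_means = sample[idx].mean(axis=1)

    # Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap means.
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi


a = np.array([1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 4, 4, 6, 7, 8])
lo, hi = bootstrap_percentile_ci(a)
print(lo, hi)
```

The exact bounds depend on the seed and the number of replicates, so they will differ slightly from the output shown above.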

## Look at CI of lower and upper interval estimates

In [85]:
np.mean(ci_los), np.mean(ci_his)

(3.5725661374525783, 5.327183862547421)

In [86]:
np.percentile(ci_los,[2.5,97.5])

array([2.67006742, 4.41318561])

In [87]:
np.percentile(ci_his,[2.5,97.5])

array([4.43445622, 6.1977602 ])