# Tests for calculating confidence interval

### Exemplary description
The application provides several methods for calculating a confidence interval for the accuracy obtained with different training techniques. The aim of the confidence interval is to measure the degree of uncertainty or certainty in a sampling method. There are two available options, each with a number of tests to choose from:
- holdout
    - Z-test -> use when the holdout sample size is large (>30) and you know the standard deviation
    - Z-test with precision -> use when the holdout sample size is large (>30) and you don't know the standard deviation
    - T-test -> use when the holdout sample size is small (<30) and you know the standard deviation
    - Loose test set bound -> use to obtain wide intervals
    - Clopper-Pearson -> use to obtain narrow intervals
    - Wilson score -> use to obtain the narrowest intervals
- bootstrap
    - Percentile Bootstrap Method -> use when the number of bootstrap resamples is large
    - Standard error method -> use when you additionally know the overall accuracy
    
Moreover, reverse tests are available for the Z-test and the loose test set bound. Given the confidence interval you want to obtain at a given confidence level, they return the number of samples needed for the holdout method.

In [88]:
import scipy.stats as st
import math
import statsmodels as sm
import statsmodels.stats.proportion  # load the submodule so sm.stats.proportion resolves
import numpy as np

#### Z-test confidence interval

upper/lower bound of a confidence interval:<br>
X ± Z*(s/√n)<br>
X is the obtained accuracy<br>
Z is the value from the standard normal distribution for the selected confidence level<br>
s is the standard deviation<br>
n is the number of observations<br>

A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-tests test the mean of a distribution. For each significance level, the Z-test has a single critical value (for example, 1.96 for 5% two-tailed).<br>
The Z-test can be used when the sample size is large, usually > 30.
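
As a quick check of the critical values mentioned above, the following sketch computes the two-sided Z value for a few confidence levels. It uses the standard library's `statistics.NormalDist` instead of `scipy`, and the helper name `z_critical` is hypothetical, not part of the application.

```python
from statistics import NormalDist


def z_critical(conf):
    """Two-sided critical value of the standard normal for confidence `conf`."""
    # Same quantile as st.norm.ppf(1 - (1 - conf) / 2) in the cells of this notebook.
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)


for conf in (0.90, 0.95, 0.98, 0.99):
    print(f"{conf:.0%}: z = {z_critical(conf):.3f}")
```

The 95% level gives a value of about 1.96, and the critical value grows with the confidence level, which is why higher confidence produces wider intervals.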

In [89]:
# Z-Test


def ztest(n, acc, std_dev, conf):
    '''
    This test assumes that data is normally distributed and works well for bigger number of samples (>30).
    Function takes number of samples (n), obtained accuracy (acc),
    standard deviation (std_dev) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    
    z90 = st.norm.ppf(1-(1-0.9)/2)
    upper_bound90 = acc + z90*std_dev/math.sqrt(n)
    lower_bound90 = acc - z90*std_dev/math.sqrt(n)
    int90 = [lower_bound90, upper_bound90]
    
    z95 = st.norm.ppf(1-(1-0.95)/2)
    upper_bound95 = acc + z95*std_dev/math.sqrt(n)
    lower_bound95 = acc - z95*std_dev/math.sqrt(n)
    int95 = [lower_bound95, upper_bound95]
    
    z98 = st.norm.ppf(1-(1-0.98)/2)
    upper_bound98 = acc + z98*std_dev/math.sqrt(n)
    lower_bound98 = acc - z98*std_dev/math.sqrt(n)
    int98 = [lower_bound98, upper_bound98]
    
    z99 = st.norm.ppf(1-(1-0.99)/2)
    upper_bound99 = acc + z99*std_dev/math.sqrt(n)
    lower_bound99 = acc - z99*std_dev/math.sqrt(n)
    int99 = [lower_bound99, upper_bound99]
    
    
    z = st.norm.ppf(1-(1-conf)/2)
    upper_bound = acc + z*std_dev/math.sqrt(n)
    lower_bound = acc - z*std_dev/math.sqrt(n)
    int_conf = [lower_bound, upper_bound]
    
    return (int_conf, int90, int95, int98, int99)


print(ztest(50, 50, 5, 0.95))


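def reverse_ztest(std_dev, diff, conf):
    '''
    Function takes standard deviation (std_dev), difference from accuracy to lower/upper bound
    which is upper_bound-acc or acc-lower_bound (diff) and confidence (conf).
    Returns rounded number of samples which should be taken to obtain a given confidence interval.
    '''
    z = st.norm.ppf(1-(1-conf)/2)
    n = (z*std_dev/(diff))**2
    
    return int(round(n))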

print(reverse_ztest(5, 1.39, 0.95))


([48.614096175650324, 51.385903824349676], [48.83691284632333, 51.16308715367667], [48.614096175650324, 51.385903824349676], [48.35502364286681, 51.64497635713319], [48.178613632281554, 51.821386367718446])
50


#### Z-test with precision
upper/lower bound of a confidence interval:<br>
X ± 10*Z*√(pr^2/n)<br>
X is the obtained accuracy<br>
Z is the value from the standard normal distribution for the selected confidence level<br>
pr is the precision; the test assumes the worst possible precision = 0.5<br>
n is the number of observations<br>
This is an alternative way to calculate confidence intervals using the Z-test. Use it when the standard deviation of the data is unknown. As before, the sample size should be large (> 30).
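
One way to read the 0.5 above, offered here as an assumption since the text does not derive it: for a single correct/incorrect outcome with success probability p, the standard deviation is √(p(1−p)), which is largest at p = 0.5. The sketch below checks this numerically on a grid; `bernoulli_std` is a hypothetical helper.

```python
import math


def bernoulli_std(p):
    """Standard deviation of a single 0/1 outcome with success probability p."""
    return math.sqrt(p * (1 - p))


# Scan p = 0.00, 0.01, ..., 1.00 and keep the p with the largest spread.
grid = [k / 100 for k in range(101)]
worst_p = max(grid, key=bernoulli_std)
print(worst_p, bernoulli_std(worst_p))
```

Under this reading, using 0.5 never understates the spread, so the resulting interval is at least as wide as the one computed from the true (unknown) standard deviation.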

In [90]:
# Z-Test precision

def ztest_pr(n, acc, conf):
    '''
    This test assumes that data is normally distributed and works well for bigger number of samples (>30).
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    
    z90 = st.norm.ppf(1-(1-0.9)/2)
    pr = z90*math.sqrt(0.25/n)
    upper_bound90 = acc + 10*pr
    lower_bound90 = acc - 10*pr
    int90 = [lower_bound90, upper_bound90]
    
    z95 = st.norm.ppf(1-(1-0.95)/2)
    pr = z95*math.sqrt(0.25/n)
    upper_bound95 = acc + 10*pr
    lower_bound95 = acc - 10*pr
    int95 = [lower_bound95, upper_bound95]
    
    z98 = st.norm.ppf(1-(1-0.98)/2)
    pr = z98*math.sqrt(0.25/n)
    upper_bound98 = acc + 10*pr
    lower_bound98 = acc - 10*pr
    int98 = [lower_bound98, upper_bound98]
    
    z99 = st.norm.ppf(1-(1-0.99)/2)
    pr = z99*math.sqrt(0.25/n)
    upper_bound99 = acc + 10*pr
    lower_bound99 = acc - 10*pr
    int99 = [lower_bound99, upper_bound99]    
    
    
    z = st.norm.ppf(1-(1-conf)/2)
    pr = z*math.sqrt(0.25/n)
    upper_bound = acc + 10*pr
    lower_bound = acc - 10*pr
    int_conf = [lower_bound, upper_bound]
    
    return (int_conf, int90, int95, int98, int99)

print(ztest_pr(50, 50, 0.95))


def reverse_ztest_pr(diff, conf):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and confidence (conf).
    Returns rounded number of samples which should be taken to obtain a given confidence interval.
    '''
    z = st.norm.ppf(1-(1-conf)/2)
    n = (z*math.sqrt(0.25)/(diff/10))**2
    return int(round(n))

print(reverse_ztest_pr(1.39, 0.95))

([48.614096175650324, 51.385903824349676], [48.83691284632333, 51.16308715367667], [48.614096175650324, 51.385903824349676], [48.35502364286681, 51.64497635713319], [48.178613632281554, 51.821386367718446])
50


#### t-test confidence interval

upper/lower bound of a confidence interval:<br>
X ± t*(s/√n)<br>
X is the obtained accuracy<br>
t is the value from the Student's t-distribution for the selected confidence level<br>
s is the standard deviation<br>
n is the number of observations<br>

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.<br> 
The t-test can be used for a small number of samples (< 30).

In [91]:
# T-test

def ttest(n, acc, std_dev, conf):
    '''
    This test works for a smaller number of samples (<30) and uses the t-distribution instead of the Gaussian.
    Function takes number of samples (n), obtained accuracy (acc), standard deviation (std_dev) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    
    t90 = st.t.ppf(1-(1-0.9)/2, n-1)
    upper_bound90 = acc + t90*std_dev/math.sqrt(n)
    lower_bound90 = acc - t90*std_dev/math.sqrt(n)
    int90 = [lower_bound90, upper_bound90]
    
    t95 = st.t.ppf(1-(1-0.95)/2, n-1)
    upper_bound95 = acc + t95*std_dev/math.sqrt(n)
    lower_bound95 = acc - t95*std_dev/math.sqrt(n)
    int95 = [lower_bound95, upper_bound95]
    
    t98 = st.t.ppf(1-(1-0.98)/2, n-1)
    upper_bound98 = acc + t98*std_dev/math.sqrt(n)
    lower_bound98 = acc - t98*std_dev/math.sqrt(n)
    int98 = [lower_bound98, upper_bound98]
    
    t99 = st.t.ppf(1-(1-0.99)/2, n-1)
    upper_bound99 = acc + t99*std_dev/math.sqrt(n)
    lower_bound99 = acc - t99*std_dev/math.sqrt(n)
    int99 = [lower_bound99, upper_bound99]
    
    
    t = st.t.ppf(1-(1-conf)/2, n-1)
    upper_bound = acc + t*std_dev/math.sqrt(n)
    lower_bound = acc - t*std_dev/math.sqrt(n)
    int_conf = [lower_bound, upper_bound]
    
    return (int_conf, int90, int95, int98, int99)

print(ttest(9, 50, 3, 0.95))


#def reverse_ttest()

([47.69399586496663, 52.30600413503337], [48.14045196247716, 51.85954803752284], [47.69399586496663, 52.30600413503337], [47.10354055723948, 52.89645944276052], [46.6446126686666, 53.3553873313334])


#### Loose test set bound (Langford)

In [92]:
# Loose test set bound Langford

def loose_langford(n, acc, conf):
    '''
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    
    pr90 = math.sqrt(math.log(2/(1-0.90))/(n*2))
    upper_bound90 = acc + pr90*10
    lower_bound90 = acc - pr90*10
    int90 = [lower_bound90, upper_bound90]
    
    pr95 = math.sqrt(math.log(2/(1-0.95))/(n*2))
    upper_bound95 = acc + pr95*10
    lower_bound95 = acc - pr95*10
    int95 = [lower_bound95, upper_bound95]
    
    pr98 = math.sqrt(math.log(2/(1-0.98))/(n*2))
    upper_bound98 = acc + pr98*10
    lower_bound98 = acc - pr98*10
    int98 = [lower_bound98, upper_bound98]
    
    pr99 = math.sqrt(math.log(2/(1-0.99))/(n*2))
    upper_bound99 = acc + pr99*10
    lower_bound99 = acc - pr99*10
    int99 = [lower_bound99, upper_bound99]
    
    
    pr = math.sqrt(math.log(2/(1-conf))/(n*2))
    upper_bound = acc + pr*10
    lower_bound = acc - pr*10
    int_conf = [lower_bound, upper_bound]
    return (int_conf, int90, int95, int98, int99)

print(loose_langford(50, 50, 0.95))

def loose_langford_reverse(diff, conf):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and confidence (conf).
    Returns rounded number of samples which should be taken to obtain a given confidence interval.
    '''
    n = math.log(2/(1-conf))/(2*(diff/10)**2)
    return int(round(n))

print(loose_langford_reverse(1.92, 0.95))

([48.07935441736016, 51.92064558263984], [48.26918161739771, 51.73081838260229], [48.07935441736016, 51.92064558263984], [47.854033973710656, 52.145966026289344], [47.69819258699864, 52.30180741300136])
50
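
The cell above comes without a formula. Reading it off the code, the half-width on the proportion scale is √(ln(2/δ)/(2n)) with δ = 1 − conf, which has the form of a two-sided Hoeffding bound, and the reverse test solves the same expression for n. The sketch below round-trips the two on the proportion scale, without the ×10 scaling used in the cell; both function names are hypothetical.

```python
import math


def loose_half_width(n, conf):
    """Half-width sqrt(ln(2/delta) / (2n)) with delta = 1 - conf."""
    delta = 1 - conf
    return math.sqrt(math.log(2 / delta) / (2 * n))


def loose_sample_size(half_width, conf):
    """Invert the half-width formula for n, rounded as in the cell above."""
    delta = 1 - conf
    return int(round(math.log(2 / delta) / (2 * half_width ** 2)))


eps = loose_half_width(50, 0.95)
print(eps)                           # half-width for n = 50 at 95% confidence
print(loose_sample_size(eps, 0.95))  # inverting it recovers the sample size
```

Multiplying `eps` by 10 reproduces the distance between the accuracy and the 95% bounds printed above.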


#### Clopper-Pearson (beta distribution)
The Clopper–Pearson interval is an early and very common method for calculating binomial confidence intervals. This is often called an 'exact' method, because it is based on the cumulative probabilities of the binomial distribution (i.e., exactly the correct distribution rather than an approximation). However, in cases where we know the population size, the intervals may not be the smallest possible.
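
To make the "exact" construction concrete, here is a minimal sketch that finds the limits directly from binomial tail probabilities by bisection, using only the standard library. It is an illustration under the usual definition of the interval, not the routine the application calls (that is `proportion_confint` with `method="beta"`), and all function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def solve_increasing(f, target, iters=60):
    """Bisection for f(p) = target on [0, 1], assuming f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson_exact(k, n, conf):
    """Clopper-Pearson interval for k successes out of n trials."""
    alpha = 1 - conf
    # Lower limit: the p at which P(X >= k | p) equals alpha/2 (0 when k == 0).
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= k | p) equals alpha/2 (1 when k == n).
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


print(clopper_pearson_exact(25, 50, 0.95))
```

For 25 successes out of 50 the limits are symmetric around 0.5; for counts near 0 or n they become noticeably asymmetric.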

In [93]:
# Clopper-Pearson

def clopper_pearson(n, acc, conf):
    '''
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    
    low90, high90 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.90, method = "beta")
    int90 = [acc-(0.5-low90)*10, acc+(high90-0.5)*10]
    
    low95, high95 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.95, method = "beta")
    int95 = [acc-(0.5-low95)*10, acc+(high95-0.5)*10]
    
    low98, high98 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.98, method = "beta")
    int98 = [acc-(0.5-low98)*10, acc+(high98-0.5)*10]
    
    low99, high99 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.99, method = "beta")
    int99 = [acc-(0.5-low99)*10, acc+(high99-0.5)*10]
    
    
    low, high = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-conf, method = "beta")
    int_conf = [acc-(0.5-low)*10, acc+(high-0.5)*10]
    return (int_conf, int90, int95, int98, int99)

print(clopper_pearson(50, 50, 0.95))

([48.55272997129908, 51.44727002870092], [48.76245891209321, 51.23754108790679], [48.55272997129908, 51.44727002870092], [48.31404705573435, 51.68595294426565], [48.1551042112981, 51.8448957887019])


#### Wilson score interval
The Wilson score interval is an improvement over the normal approximation interval in multiple respects. Unlike the symmetric normal approximation interval, the Wilson score interval is asymmetric. It does not suffer from problems of overshoot and zero-width intervals that afflict the normal interval, and it may be safely employed with small samples and skewed observations.<br>
The Wilson interval is derived from the Wilson Score Test, which belongs to a class of tests called Rao Score Tests. It relies on the asymptotic normality of your estimator.
Wilson intervals get their asymmetry from the underlying likelihood function for the binomial, which is used to compute the "expected standard error" and "score" (i.e., first derivative of the likelihood function) under the null hypothesis. Since these values will change as you vary your null hypothesis, the interval where the normalized score (score/expected standard error) exceeds your pre-specified Z-cutoff for significance will not be symmetric, in general.
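
The asymmetry can be seen with the closed-form Wilson score formula. The sketch below is a standard-library illustration, not the routine the application calls (that is `proportion_confint` with `method="wilson"`); `wilson_interval` is a hypothetical name and the 45-out-of-50 example is made up.

```python
import math
from statistics import NormalDist


def wilson_interval(k, n, conf):
    """Wilson score interval for k successes out of n trials."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    # The centre is pulled from p_hat towards 0.5, which creates the asymmetry.
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half


low, high = wilson_interval(45, 50, 0.95)
print(low, high)
print(0.9 - low, high - 0.9)  # distances from p_hat = 0.9 to each limit
```

With an observed proportion of 0.9 the lower limit lies further from 0.9 than the upper one, and the upper limit stays below 1, unlike a symmetric normal interval that can overshoot.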

In [94]:
# Wilson score interval

def wilson(n, acc, conf):
    '''
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    
    low90, high90 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.90, method = "wilson")
    int90 = [acc-(0.5-low90)*10, acc+(high90-0.5)*10]
    
    low95, high95 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.95, method = "wilson")
    int95 = [acc-(0.5-low95)*10, acc+(high95-0.5)*10]
    
    low98, high98 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.98, method = "wilson")
    int98 = [acc-(0.5-low98)*10, acc+(high98-0.5)*10]
    
    low99, high99 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.99, method = "wilson")
    int99 = [acc-(0.5-low99)*10, acc+(high99-0.5)*10]
    
    
    low, high = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-conf, method = "wilson")
    int_conf = [acc-(0.5-low)*10, acc+(high-0.5)*10]
    return (int_conf, int90, int95, int98, int99)

print(wilson(50, 50, 0.95))

([48.664451431682856, 51.335548568317144], [48.86715859686505, 51.13284140313495], [48.664451431682856, 51.335548568317144], [48.43741675431165, 51.56258324568835], [48.28862561286529, 51.71137438713471])


#### Percentile Bootstrap Method
The percentile bootstrap interval is just the interval between the 100*(α/2) and 100*(1−α/2) percentiles of the distribution of θ estimates obtained from resampling, where θ represents a parameter of interest and α is the level of significance (e.g., α = 0.05 for 95% CIs) (Efron, 1982). A bootstrap percentile CI of $\hat{\theta}$ (an estimator of θ) can be obtained as follows: (1) B random bootstrap samples are generated, (2) a parameter estimate is calculated from each bootstrap sample, (3) all B bootstrap parameter estimates are ordered from the lowest to highest, and (4) the CI is constructed as follows,<br>

$(\hat{\theta}_{\text{lower limit}}, \hat{\theta}_{\text{upper limit}}) = (\hat{\theta}_{j}, \hat{\theta}_{k})$ <br>

where $\hat{\theta}_{j}$ denotes the jth quantile (lower limit), and $\hat{\theta}_{k}$ denotes the kth quantile (upper limit); j=(α/2)*B, k=(1−α/2)*B. For example, a 95% percentile bootstrap CI with 1,000 bootstrap samples is the interval between the 25th quantile value and the 975th quantile value of the 1,000 bootstrap parameter estimates.
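
The four steps can be sketched end to end with the standard library. This is an illustration under stated assumptions rather than the application's code: the input is assumed to be a list of per-sample 0/1 correctness flags, the function name and the 80/20 example data are hypothetical, and the limits are taken as ordered estimates without the interpolation that `np.percentile` applies.

```python
import random


def percentile_bootstrap_ci(correct, conf=0.95, n_resamples=1000, seed=0):
    """Percentile bootstrap CI for accuracy from per-sample 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(n_resamples):
        # (1) draw a bootstrap sample of size n with replacement
        resample = rng.choices(correct, k=n)
        # (2) compute the parameter estimate (accuracy) on that sample
        estimates.append(sum(resample) / n)
    # (3) order the estimates from lowest to highest
    estimates.sort()
    # (4) take the j-th and k-th ordered estimates,
    #     j = (alpha/2)*B and k = (1 - alpha/2)*B, rounded to whole positions
    alpha = 1 - conf
    j = max(round(alpha / 2 * n_resamples), 1)
    k = min(round((1 - alpha / 2) * n_resamples), n_resamples)
    return estimates[j - 1], estimates[k - 1]


# Hypothetical data: 80 correct and 20 incorrect predictions on a test set.
outcomes = [1] * 80 + [0] * 20
print(percentile_bootstrap_ci(outcomes))
```

Fixing the seed makes the resampling reproducible; with a different seed the limits shift slightly, and the shift shrinks as the number of resamples grows.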

In [95]:
# Percentile bootstrap method

def percentile_BM(accs, conf):
    '''
    Function takes list of resample accuracies obtained from bootstrap method (accs) and confidence (conf).
    Returns lower and upper bounds for the given confidence interval as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    # Sort a copy so that the caller's list is not modified in place.
    accs = np.sort(np.asarray(accs))
    
    lower_bound90 = np.percentile(accs, 100*((1-0.9)/2))
    upper_bound90 = np.percentile(accs, 100*(0.9 + (1-0.9)/2))
    int90 = [lower_bound90, upper_bound90]
    
    lower_bound95 = np.percentile(accs, 100*((1-0.95)/2))
    upper_bound95 = np.percentile(accs, 100*(0.95 + (1-0.95)/2))
    int95 = [lower_bound95, upper_bound95]
    
    lower_bound98 = np.percentile(accs, 100*((1-0.98)/2))
    upper_bound98 = np.percentile(accs, 100*(0.98 + (1-0.98)/2))
    int98 = [lower_bound98, upper_bound98]
    
    lower_bound99 = np.percentile(accs, 100*((1-0.99)/2))
    upper_bound99 = np.percentile(accs, 100*(0.99 + (1-0.99)/2))
    int99 = [lower_bound99, upper_bound99]
    
    
    lower_bound = np.percentile(accs, 100*((1-conf)/2))
    upper_bound = np.percentile(accs, 100*(conf + (1-conf)/2))
    int_conf = [lower_bound, upper_bound]
    return (int_conf, int90, int95, int98, int99)


test_lst = list(np.random.normal(loc = 50, size=1000))
print(percentile_BM(test_lst, 0.95))

([48.13924775312831, 52.03775143699797], [48.37459766391131, 51.645317916031864], [48.13924775312831, 52.03775143699797], [47.73783994056898, 52.458574437952514], [47.52274273370016, 52.76294393519458])


#### Standard error method
The standard error rule can only be used when the bootstrap distribution is roughly normal in shape.

upper/lower bound of a confidence interval:<br>
X ± Z*SE<br>
X is the obtained accuracy<br>
Z is the value from the standard normal distribution for the selected confidence level<br>
SE is the standard error (standard deviation of the resample accuracies)<br>

In [96]:
# Standard error method

def std_method(acc, accs, conf):
    '''
    Function takes obtained accuracy (acc), list of resamples accuracies obtained from bootstrap method(accs) 
    and confidence (conf). Returns lower and upper bounds for the given confidence interval 
    as well as confidence intervals for 90%, 95%, 98% and 99% confidences.
    '''
    
    z90 = st.norm.ppf(1-(1-0.9)/2)
    se90 = np.std(accs)
    lower_bound90 = acc - z90*se90
    upper_bound90 = acc + z90*se90
    int90 = [lower_bound90, upper_bound90]
    
    z95 = st.norm.ppf(1-(1-0.95)/2)
    se95 = np.std(accs)
    lower_bound95 = acc - z95*se95
    upper_bound95 = acc + z95*se95
    int95 = [lower_bound95, upper_bound95]
    
    z98 = st.norm.ppf(1-(1-0.98)/2)
    se98 = np.std(accs)
    lower_bound98 = acc - z98*se98
    upper_bound98 = acc + z98*se98
    int98 = [lower_bound98, upper_bound98]
    
    z99 = st.norm.ppf(1-(1-0.99)/2)
    se99 = np.std(accs)
    lower_bound99 = acc - z99*se99
    upper_bound99 = acc + z99*se99
    int99 = [lower_bound99, upper_bound99]
    
    
    z = st.norm.ppf(1-(1-conf)/2)
    se = np.std(accs)
    lower_bound = acc - z*se
    upper_bound = acc + z*se
    int_conf = [lower_bound, upper_bound]
    return (int_conf, int90, int95, int98, int99)

print(std_method(50, test_lst, 0.95))

([48.05141868827758, 51.94858131172242], [48.36469901320731, 51.63530098679269], [48.05141868827758, 51.94858131172242], [47.68716260723283, 52.31283739276717], [47.439130064290325, 52.560869935709675])
