# Tests for calculating confidence intervals

### Description
The application provides several methods for calculating a confidence interval for the accuracy obtained with different validation techniques. A confidence interval quantifies the uncertainty of an estimate computed from a sample. There are four validation techniques, each with one or more tests to choose from:
- for holdout
    - Z-test -> use when the holdout sample size is large (> 30) and the distribution of the test statistic can be approximated by a normal distribution
    - T-test -> use when the holdout sample size is small (< 30) and the test statistic follows a normal distribution
    - Loose test set bound -> use when the interval must have at least the requested confidence; the interval may be much wider than the tightest possible one
    - Clopper-Pearson -> use when the interval must have at least the requested confidence; the interval may be wider than necessary, but not as wide as with the Loose test set bound
    - Wilson score -> use to obtain an interval whose confidence, on average, equals the requested one; do not use when the accuracy is close to 0% or 100%
- for bootstrap
    - Percentile Bootstrap Method -> use when the number of bootstrap resamples is large (at least 100) and you have the accuracy from each resample
- for Cross-Validation
    - cv_interval -> use when you have the accuracy from CV and you know the number of samples and the number of folds
- for Progressive Validation
    - prog_val -> use when you have the accuracy from the Progressive Validation technique and you know the test set size
    
Moreover, for the z-test and the loose test set bound there are reverse tests: given the half-width of the interval you want (the distance from the accuracy to either bound) and a confidence, they return the number of samples needed for the holdout method.<br>
For the z-test, t-test and loose test set bound, if you know the half-width of the interval and the number of holdout samples, you can obtain the confidence.<br>

### All accuracies are in percentage form, e.g. 50% -> 50; all confidences are in the range (0, 1), e.g. confidence 95% -> 0.95

#### Imports

In [1]:
import scipy.stats as st
import math
import statsmodels as sm
import statsmodels.stats.proportion  # loads the submodule used below as sm.stats.proportion
import numpy as np

#### Functions that assure correct interval ranges

In [2]:
def min_max(tup):
    '''
    Function takes a tuple of [lower_bound, upper_bound] lists holding accuracy confidence intervals
    and clamps them in place, so that each lower bound is at least 0 and each upper bound is at most 100.
    '''
    for i in range(len(tup)):
        tup[i][0] = max(0, tup[i][0])
        tup[i][1] = min(tup[i][1], 100)
        
    return tup


def min_max_conf(conf):
    '''
    Function takes confidence and guarantees that it will be in the range 0-1
    '''
    if conf < 0:
        conf = 0
    if conf > 1:
        conf = 1
        
    return conf


## Holdout
Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is using 80% of data for training and the remaining 20% of the data for testing.
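
The 80/20 split described above can be sketched as follows. This is a minimal illustration that assumes the data fit in a Python list; the helper name `holdout_split` and the fixed seed are hypothetical and not part of the application.

```python
import random


def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(100))
print(len(train), len(test))  # 80 20
```

The size of the test set (20 here) is the number of samples `n` that the holdout tests below expect.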

#### Z-test
A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-tests test the mean of a distribution. For each significance level, the Z-test has a single critical value (for example, 1.96 for 5%, two-tailed).<br>
The Z-test can be used when the sample size is large, usually > 30.

upper/lower bound of a confidence interval:<br>
X ± 100*Z*√(pr^2/n)<br>
X is the obtained accuracy<br>
Z is the value from the standard normal distribution for the selected confidence level<br>
pr is the assumed standard deviation of a single observation; the test assumes the worst case, pr = 0.5 (an accuracy of 50% gives the largest variance, 0.25)<br>
n is the number of observations
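
As a worked example of the formula above, the following sketch computes the half-width of the interval for 100 samples at 95% confidence and then inverts the formula to recover the sample size. It uses only the standard library: `statistics.NormalDist().inv_cdf` stands in for `st.norm.ppf`, and the function names `z_half_width` and `z_samples_needed` are illustrative assumptions.

```python
import math
from statistics import NormalDist


def z_half_width(n, conf):
    """Half-width of the interval in percentage points (worst case, pr = 0.5)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return 100 * z * math.sqrt(0.25 / n)


def z_samples_needed(half_width, conf):
    """Invert the formula: number of samples that gives the requested half-width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return (z * 0.5 / (half_width / 100)) ** 2


width = z_half_width(100, 0.95)
print(round(width, 2))                       # 9.8
print(round(z_samples_needed(width, 0.95)))  # 100
```

An accuracy of 80% measured on 100 holdout samples would therefore be reported as roughly 70.2% to 89.8% at 95% confidence.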

In [3]:
# Z-Test

def ztest_pr(n, acc, conf):
    '''
    This test assumes that data is normally distributed and works well for bigger number of samples (>30).
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    z90 = st.norm.ppf(1-(1-0.9)/2)
    pr = z90*math.sqrt(0.25/n)
    upper_bound90 = acc + 100*pr
    lower_bound90 = acc - 100*pr
    int90 = [lower_bound90, upper_bound90]
    
    z95 = st.norm.ppf(1-(1-0.95)/2)
    pr = z95*math.sqrt(0.25/n)
    upper_bound95 = acc + 100*pr
    lower_bound95 = acc - 100*pr
    int95 = [lower_bound95, upper_bound95]
    
    z98 = st.norm.ppf(1-(1-0.98)/2)
    pr = z98*math.sqrt(0.25/n)
    upper_bound98 = acc + 100*pr
    lower_bound98 = acc - 100*pr
    int98 = [lower_bound98, upper_bound98]
    
    z99 = st.norm.ppf(1-(1-0.99)/2)
    pr = z99*math.sqrt(0.25/n)
    upper_bound99 = acc + 100*pr
    lower_bound99 = acc - 100*pr
    int99 = [lower_bound99, upper_bound99]    
    
    
    z = st.norm.ppf(1-(1-conf)/2)
    pr = z*math.sqrt(0.25/n)
    upper_bound = acc + 100*pr
    lower_bound = acc - 100*pr
    int_conf = [lower_bound, upper_bound]
    
    return min_max((int_conf, int90, int95, int98, int99))


def reverse_ztest_pr(diff, conf):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and confidence (conf).
    Returns rounded number of samples which should be taken to obtain a given confidence interval.
    '''
    z = st.norm.ppf(1-(1-conf)/2)
    n = (z*math.sqrt(0.25)/(diff/100))**2
    
    return int(round(n))


def reverse_ztest_pr_conf(diff, n):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and number of samples (n).
    Returns confidence rounded to two decimal places.
    '''
    z = (math.sqrt(n)*diff)/50
    
    return min_max_conf(round(2*st.norm.cdf(z)-1, 2))
    

#### T-test
The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.<br>
The t-test can be used for a small number of samples (< 30).

upper/lower bound of a confidence interval:<br>
X ± 100*t*√(pr^2/n)<br>
X is the obtained accuracy<br>
t is the value from the Student's t-distribution for the selected confidence level<br>
pr is the assumed standard deviation of a single observation; the test assumes the worst case, pr = 0.5 (an accuracy of 50% gives the largest variance, 0.25)<br>
n is the number of observations
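
The bound can also be solved for the confidence when the half-width and the number of samples are known. A sketch of the algebra, where $d$ (the half-width in percentage points, i.e. upper bound minus accuracy) and $F_{n-1}$ (the CDF of Student's t-distribution with $n-1$ degrees of freedom) are symbols introduced here for illustration:

$$
d = 100\,t\,\sqrt{\frac{pr^2}{n}}
\;\Longrightarrow\;
t = \frac{d/100}{\sqrt{0.25/n}},
\qquad
\text{conf} = 2\,F_{n-1}(t) - 1
$$

The last step follows because $t$ is the $(1+\text{conf})/2$ quantile of the distribution, so $F_{n-1}(t) = (1+\text{conf})/2$.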

In [4]:
# T-test

def ttest_pr(n, acc, conf):
    '''
    This test works for smaller number of samples (<30), uses t-distribution.
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    t90 = st.t.ppf(1-(1-0.9)/2, n-1)
    pr90 = t90*math.sqrt(0.25/n)
    upper_bound90 = acc + 100*pr90
    lower_bound90 = acc - 100*pr90
    int90 = [lower_bound90, upper_bound90]
    
    t95 = st.t.ppf(1-(1-0.95)/2, n-1)
    pr95 = t95*math.sqrt(0.25/n)
    upper_bound95 = acc + 100*pr95
    lower_bound95 = acc - 100*pr95
    int95 = [lower_bound95, upper_bound95]
    
    t98 = st.t.ppf(1-(1-0.98)/2, n-1)
    pr98 = t98*math.sqrt(0.25/n)
    upper_bound98 = acc + 100*pr98
    lower_bound98 = acc - 100*pr98
    int98 = [lower_bound98, upper_bound98]
    
    t99 = st.t.ppf(1-(1-0.99)/2, n-1)
    pr99 = t99*math.sqrt(0.25/n)
    upper_bound99 = acc + 100*pr99
    lower_bound99 = acc - 100*pr99
    int99 = [lower_bound99, upper_bound99]
    
    
    t = st.t.ppf(1-(1-conf)/2, n-1)
    pr = t*math.sqrt(0.25/n)
    upper_bound = acc + 100*pr
    lower_bound = acc - 100*pr
    int_conf = [lower_bound, upper_bound]
    
    return min_max((int_conf, int90, int95, int98, int99))

    
def reverse_ttest_pr_conf(diff, n):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and number of samples (n).
    Returns confidence rounded to two decimal places.
    '''
    pr = diff/100
    t = pr/math.sqrt(0.25/n)
    
    return min_max_conf(round(2*st.t.cdf(t, n-1)-1, 2))


#### Loose test set bound (Langford)

upper/lower bound of a confidence interval:<br>
X ± 100*√(ln(2/(1-conf))/2/n)<br>
X is the obtained accuracy<br>
conf is the given confidence<br>
n is the number of samples
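
The following sketch evaluates the bound above for 1000 samples at 95% confidence and inverts it to recover the sample size. The function names are illustrative assumptions, and only the `math` module is needed.

```python
import math


def loose_half_width(n, conf):
    """Half-width in percentage points of the loose test set bound."""
    return 100 * math.sqrt(math.log(2 / (1 - conf)) / (2 * n))


def loose_samples_needed(half_width, conf):
    """Invert the bound: number of samples that gives the requested half-width."""
    return math.log(2 / (1 - conf)) / (2 * (half_width / 100) ** 2)


width = loose_half_width(1000, 0.95)
print(round(width, 2))                           # 4.29
print(round(loose_samples_needed(width, 0.95)))  # 1000
```

For the same sample size and confidence, the Z-test formula gives a half-width of about 3.10 percentage points, which illustrates how much wider the loose bound is.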

In [5]:
# Loose test set bound Langford

def loose_langford(n, acc, conf):
    '''
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    pr90 = math.sqrt(math.log(2/(1-0.90))/(n*2))
    upper_bound90 = acc + pr90*100
    lower_bound90 = acc - pr90*100
    int90 = [lower_bound90, upper_bound90]
    
    pr95 = math.sqrt(math.log(2/(1-0.95))/(n*2))
    upper_bound95 = acc + pr95*100
    lower_bound95 = acc - pr95*100
    int95 = [lower_bound95, upper_bound95]
    
    pr98 = math.sqrt(math.log(2/(1-0.98))/(n*2))
    upper_bound98 = acc + pr98*100
    lower_bound98 = acc - pr98*100
    int98 = [lower_bound98, upper_bound98]
    
    pr99 = math.sqrt(math.log(2/(1-0.99))/(n*2))
    upper_bound99 = acc + pr99*100
    lower_bound99 = acc - pr99*100
    int99 = [lower_bound99, upper_bound99]
    
    
    pr = math.sqrt(math.log(2/(1-conf))/(n*2))
    upper_bound = acc + pr*100
    lower_bound = acc - pr*100
    int_conf = [lower_bound, upper_bound]
    
    return min_max((int_conf, int90, int95, int98, int99))


def loose_langford_reverse(diff, conf):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and confidence (conf).
    Returns rounded number of samples which should be taken to obtain a given confidence interval.
    '''
    n = math.log(2/(1-conf))/(2*(diff/100)**2)
    
    return int(round(n))


def loose_langford_conf(diff, n):
    '''
    Function takes difference from accuracy to lower/upper bound which is upper_bound-acc 
    or acc-lower_bound (diff) and number of samples (n).
    Returns confidence rounded to two decimal places.
    '''
    pr = diff/100
    expnt = math.exp(2*n*pr**2)
    conf = 1 - 2/expnt
    
    return min_max_conf(round(conf, 2))


#### Clopper-Pearson (beta distribution)
The Clopper–Pearson interval is an early and very common method for calculating binomial confidence intervals. This is often called an 'exact' method, because it is based on the cumulative probabilities of the binomial distribution (i.e., exactly the correct distribution rather than an approximation). However, in cases where we know the population size, the intervals may not be the smallest possible.<br>
This method always ensures that the actual confidence level is at least the level you requested. On average, therefore, these intervals have a greater confidence level than you requested and so are wider than they need to be.
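
To make the 'exact' construction concrete, the sketch below finds the two limits directly from binomial tail probabilities by bisection. It is an illustration under stated assumptions, not the application's implementation: the names `binom_cdf`, `solve_increasing` and `clopper_pearson_exact` are hypothetical, and the result is a proportion in [0, 1] rather than a percentage.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, tol=1e-12):
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson_exact(k, n, conf):
    """Clopper-Pearson interval (as proportions) for k successes out of n."""
    alpha = 1 - conf
    # Lower limit solves P(X >= k | p) = alpha/2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper limit solves P(X <= k | p) = alpha/2, i.e. P(X > k | p) = 1 - alpha/2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


low, high = clopper_pearson_exact(5, 10, 0.95)
print(round(low, 3), round(high, 3))  # 0.187 0.813
```

The `method="beta"` option of statsmodels' `proportion_confint` computes the same interval through quantiles of the beta distribution, which avoids the numerical search.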

In [6]:
# Clopper-Pearson

def clopper_pearson(n, acc, conf):
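    '''
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    The interval widths are computed for the worst case of 50% accuracy (n/2 successes)
    and then placed around acc.
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''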
    low90, high90 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.90, method = "beta")
    int90 = [acc-(0.5-low90)*100, acc+(high90-0.5)*100]
    
    low95, high95 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.95, method = "beta")
    int95 = [acc-(0.5-low95)*100, acc+(high95-0.5)*100]
    
    low98, high98 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.98, method = "beta")
    int98 = [acc-(0.5-low98)*100, acc+(high98-0.5)*100]
    
    low99, high99 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.99, method = "beta")
    int99 = [acc-(0.5-low99)*100, acc+(high99-0.5)*100]
    
    
    low, high = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-conf, method = "beta")
    int_conf = [acc-(0.5-low)*100, acc+(high-0.5)*100]
    
    return min_max((int_conf, int90, int95, int98, int99))


#### Wilson score interval
The Wilson score interval is an improvement over the normal approximation interval in multiple respects. Unlike the symmetric normal approximation interval, the Wilson score interval is asymmetric. It does not suffer from problems of overshoot and zero-width intervals that afflict the normal interval, and it may be safely employed with small samples and skewed observations.<br>
The Wilson interval is derived from the Wilson Score Test, which belongs to a class of tests called Rao Score Tests. It relies on the asymptotic normality of your estimator.
Wilson intervals get their asymmetry from the underlying likelihood function for the binomial, which is used to compute the "expected standard error" and "score" (i.e., first derivative of the likelihood function) under the null hypothesis. Since these values will change as you vary your null hypothesis, the interval where the normalized score (score/expected standard error) exceeds your pre-specified Z-cutoff for significance will not be symmetric, in general.<br>
With some data the actual confidence level is greater than what you requested, and for some data the actual confidence level is less. On average, the actual confidence level equals the confidence level you requested. Wilson's method works well except when the probability is very close to 0 or 1.
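
The asymmetry is easy to see from the closed-form expression of the interval. The sketch below is a standard-library illustration (the function name `wilson_interval` is an assumption, and `statistics.NormalDist().inv_cdf` stands in for `st.norm.ppf`); it returns proportions in [0, 1] rather than percentages.

```python
import math
from statistics import NormalDist


def wilson_interval(k, n, conf):
    """Wilson score interval (as proportions) for k successes out of n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = k / n
    denom = 1 + z**2 / n
    # The centre is pulled from p towards 0.5, which makes the interval asymmetric around p.
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(9, 10, 0.95)
print(round(low, 3), round(high, 3))  # 0.596 0.982
```

For 9 successes out of 10, the interval reaches about 0.30 below the observed proportion of 0.9 but only about 0.08 above it, and it stays inside [0, 1].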

In [7]:
# Wilson score interval

def wilson(n, acc, conf):
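    '''
    Function takes number of samples (n), obtained accuracy (acc) and confidence (conf).
    The interval widths are computed for the worst case of 50% accuracy (n/2 successes)
    and then placed around acc.
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''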
    low90, high90 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.90, method = "wilson")
    int90 = [acc-(0.5-low90)*100, acc+(high90-0.5)*100]
    
    low95, high95 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.95, method = "wilson")
    int95 = [acc-(0.5-low95)*100, acc+(high95-0.5)*100]
    
    low98, high98 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.98, method = "wilson")
    int98 = [acc-(0.5-low98)*100, acc+(high98-0.5)*100]
    
    low99, high99 = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-0.99, method = "wilson")
    int99 = [acc-(0.5-low99)*100, acc+(high99-0.5)*100]
    
    
    low, high = sm.stats.proportion.proportion_confint(n/2, n, alpha=1-conf, method = "wilson")
    int_conf = [acc-(0.5-low)*100, acc+(high-0.5)*100]
    
    return min_max((int_conf, int90, int95, int98, int99))


## Bootstrap
Bootstrap is a resampling method: new samples of the same size n are drawn independently, with replacement, from the existing sample, and inference is performed on these resampled data.

#### Percentile Bootstrap Method
The percentile bootstrap interval is the interval between the 100*(α/2) and 100*(1−α/2) percentiles of the distribution of θ estimates obtained from resampling, where θ represents a parameter of interest and α is the level of significance (e.g., α = 0.05 for 95% CIs) (Efron, 1982). A bootstrap percentile CI of $\hat{\theta}$ (an estimator of θ) can be obtained as follows: (1) B random bootstrap samples are generated, (2) a parameter estimate is calculated from each bootstrap sample, (3) all B bootstrap parameter estimates are ordered from the lowest to highest, and (4) the CI is constructed as follows,<br>

$\hat{\theta}_{\text{lower limit}}$, $\hat{\theta}_{\text{upper limit}}$ = $\hat{\theta}_{j}$, $\hat{\theta}_{k}$ <br>

where $\hat{\theta}_{j}$ denotes the jth ordered estimate (lower limit), and $\hat{\theta}_{k}$ denotes the kth ordered estimate (upper limit); j=(α/2)*B, k=(1−α/2)*B. For example, a 95% percentile bootstrap CI with 1,000 bootstrap samples is the interval between the 25th and the 975th of the 1,000 ordered bootstrap parameter estimates.
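
Steps (1) to (4) can be sketched with the standard library alone. The code below is an illustration under stated assumptions: the test-set outcomes are made-up data, the function names are hypothetical, and the limits are taken as the jth and kth ordered estimates exactly as in the text, whereas `np.percentile` interpolates between neighbouring values and may give slightly different limits.

```python
import random


def percentile_interval(estimates, conf):
    """Return the j-th and k-th ordered estimates, j = (alpha/2)*B, k = (1 - alpha/2)*B."""
    ordered = sorted(estimates)                 # step (3)
    b = len(ordered)
    alpha = 1 - conf
    j = max(1, round(alpha / 2 * b))
    k = min(b, round((1 - alpha / 2) * b))
    return ordered[j - 1], ordered[k - 1]       # step (4); ranks are 1-based


def bootstrap_accuracies(correct, n_resamples, seed=0):
    """Steps (1) and (2): accuracy in % of each resample of per-example 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    return [100 * sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)]


# Hypothetical data: 80 correct and 20 wrong predictions on a test set.
correct = [1] * 80 + [0] * 20
accs = bootstrap_accuracies(correct, n_resamples=1000)
lower, upper = percentile_interval(accs, 0.95)
print(lower, upper)

# With the values 1..1000 the 95% limits are the 25th and 975th ordered values.
print(percentile_interval(range(1, 1001), 0.95))  # (25, 975)
```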

In [8]:
# Percentile bootstrap method

def percentile_BM(accs, conf):
    '''
    Function takes list of resample accuracies obtained from bootstrap method (accs) and confidence (conf).
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    # Work on a sorted copy so that the caller's list is not modified
    accs = np.sort(np.asarray(accs, dtype=float))
    
    lower_bound90 = np.percentile(accs, 100*((1-0.9)/2))
    upper_bound90 = np.percentile(accs, 100*(0.9 + (1-0.9)/2))
    int90 = [lower_bound90, upper_bound90]
    
    lower_bound95 = np.percentile(accs, 100*((1-0.95)/2))
    upper_bound95 = np.percentile(accs, 100*(0.95 + (1-0.95)/2))
    int95 = [lower_bound95, upper_bound95]
    
    lower_bound98 = np.percentile(accs, 100*((1-0.98)/2))
    upper_bound98 = np.percentile(accs, 100*(0.98 + (1-0.98)/2))
    int98 = [lower_bound98, upper_bound98]
    
    lower_bound99 = np.percentile(accs, 100*((1-0.99)/2))
    upper_bound99 = np.percentile(accs, 100*(0.99 + (1-0.99)/2))
    int99 = [lower_bound99, upper_bound99]
    
    
    lower_bound = np.percentile(accs, 100*((1-conf)/2))
    upper_bound = np.percentile(accs, 100*(conf + (1-conf)/2))
    int_conf = [lower_bound, upper_bound]
    
    return min_max((int_conf, int90, int95, int98, int99))


## Cross-Validation
Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly split up into ‘k’ groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set.
For example, for 5-fold cross validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times so each group would get a chance to be the test set.
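
The splitting itself can be sketched as follows, assuming the samples are addressed by index. The generator name `k_fold_indices` and the fixed seed are illustrative; training and scoring a model on each fold is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random assignment to groups
    for fold in range(k):
        test = indices[fold::k]                # every k-th shuffled index
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


folds = list(k_fold_indices(10, 5))
print(len(folds), [len(test) for _, test in folds])  # 5 [2, 2, 2, 2, 2]
```

Every sample lands in the test set of exactly one fold, so each fold is tested on n/k samples.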

upper/lower bound of a confidence interval:<br>
X ± 100*√(-ln((1-conf)/2)*k/2/n)<br>
X is the obtained accuracy<br>
conf is the selected confidence level<br>
n is the number of all samples<br>
k is the number of folds
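
A quick numeric check of the formula, with the half-width expressed in percentage points as in the code that follows (the function name `cv_half_width` is an illustrative assumption):

```python
import math


def cv_half_width(n, k, conf):
    """Half-width in percentage points of the cross-validation bound."""
    return 100 * math.sqrt(-math.log((1 - conf) / 2) * k / 2 / n)


print(round(cv_half_width(1000, 10, 0.95), 2))  # 13.58
```

Since -ln((1-conf)/2) equals ln(2/(1-conf)), this is the loose test set bound evaluated for n/k samples, that is, for the size of a single fold (100 samples in this example).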

In [9]:
# Cross-validation

def cv_interval(n, k, acc, conf):
    '''
    Function takes number of samples (n), number of folds(k), obtained accuracy (acc) and confidence (conf).
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    x90 = math.log((1-0.9)/2)*k/2/n
    t90 = math.sqrt(-x90)
    lower_bound90 = acc-t90*100
    upper_bound90 = acc+t90*100
    int90 = [lower_bound90, upper_bound90]
    
    x95 = math.log((1-0.95)/2)*k/2/n
    t95 = math.sqrt(-x95)
    lower_bound95 = acc-t95*100
    upper_bound95 = acc+t95*100
    int95 = [lower_bound95, upper_bound95]
    
    x98 = math.log((1-0.98)/2)*k/2/n
    t98 = math.sqrt(-x98)
    lower_bound98 = acc-t98*100
    upper_bound98 = acc+t98*100
    int98 = [lower_bound98, upper_bound98]
    
    x99 = math.log((1-0.99)/2)*k/2/n
    t99 = math.sqrt(-x99)
    lower_bound99 = acc-t99*100
    upper_bound99 = acc+t99*100
    int99 = [lower_bound99, upper_bound99]
    
    
    x = math.log((1-conf)/2)*k/2/n
    t = math.sqrt(-x)
    lower_bound = acc-t*100
    upper_bound = acc+t*100
    int_conf = [lower_bound, upper_bound]
    
    return min_max((int_conf, int90, int95, int98, int99))


## Progressive Validation
Suppose that you have a training set of size $m_{train}$ and a test set of size $m_{pv}$. Progressive validation starts by learning a hypothesis on the training set and testing it on the first example of the test set. Then we train on the training set plus the first example of the test set and test on the second example of the test set. The process continues for $m_{pv}$ iterations. The Progressive Validation technique is used with data streams.
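
The loop described above can be sketched with a toy learner. The majority-class predictor used here is an assumption chosen only to keep the example self-contained; any learner that can be updated with one example at a time fits in its place.

```python
from collections import Counter


def progressive_validation(train_labels, test_labels):
    """Test on each test example in turn, then add it to the training data."""
    counts = Counter(train_labels)            # 'learn' on the training set
    n_correct = 0
    for label in test_labels:
        prediction = counts.most_common(1)[0][0]
        n_correct += prediction == label      # test on the next example
        counts[label] += 1                    # then train on it
    return 100 * n_correct / len(test_labels)


print(progressive_validation([1, 1, 0], [1, 0, 1, 1]))  # 75.0
```

The returned accuracy and the test set size are the inputs that the bound in this section needs.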

upper/lower bound of a confidence interval:<br>
X ± 100*√(-ln((1-conf)/2)/2/s)<br>
X is the obtained accuracy<br>
conf is the selected confidence level<br>
s is the number of samples in the test set
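
To give a feel for how the bound scales with the test set size, the sketch below evaluates it at 95% confidence for three sizes (the function name `prog_val_half_width` is an illustrative assumption):

```python
import math


def prog_val_half_width(s, conf):
    """Half-width in percentage points of the progressive validation bound."""
    return 100 * math.sqrt(-math.log((1 - conf) / 2) / 2 / s)


for s in (100, 1000, 10000):
    print(s, round(prog_val_half_width(s, 0.95), 2))
# 100 13.58
# 1000 4.29
# 10000 1.36
```

The half-width shrinks with the square root of s, so a test set 100 times larger gives an interval 10 times narrower.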

In [10]:
# Progressive Validation

def prog_val(s, acc, conf):
    '''
    Function takes number of samples from a test set (s), obtained accuracy (acc) and confidence (conf).
    Returns confidence interval for the given confidence as well as confidence intervals 
    for 90%, 95%, 98% and 99% confidences.
    '''
    x90 = math.log((1-0.9)/2)/2/s
    t90 = math.sqrt(-x90)
    lower_bound90 = acc-t90*100
    upper_bound90 = acc+t90*100
    int90 = [lower_bound90, upper_bound90]
    
    x95 = math.log((1-0.95)/2)/2/s
    t95 = math.sqrt(-x95)
    lower_bound95 = acc-t95*100
    upper_bound95 = acc+t95*100
    int95 = [lower_bound95, upper_bound95]
    
    x98 = math.log((1-0.98)/2)/2/s
    t98 = math.sqrt(-x98)
    lower_bound98 = acc-t98*100
    upper_bound98 = acc+t98*100
    int98 = [lower_bound98, upper_bound98]
    
    x99 = math.log((1-0.99)/2)/2/s
    t99 = math.sqrt(-x99)
    lower_bound99 = acc-t99*100
    upper_bound99 = acc+t99*100
    int99 = [lower_bound99, upper_bound99]
    
    
    x = math.log((1-conf)/2)/2/s
    t = math.sqrt(-x)
    lower_bound = acc-t*100
    upper_bound = acc+t*100
    int_conf = [lower_bound, upper_bound]
    
    return min_max((int_conf, int90, int95, int98, int99))
