<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Qn1-proportion-test" data-toc-modified-id="Qn1-proportion-test-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Qn1 proportion test</a></span><ul class="toc-item"><li><span><a href="#using-direct-formula" data-toc-modified-id="using-direct-formula-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>using direct formula</a></span></li><li><span><a href="#using-statsmodels" data-toc-modified-id="using-statsmodels-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>using statsmodels</a></span></li><li><span><a href="#getting-p-value-instead-of-confidence-interval" data-toc-modified-id="getting-p-value-instead-of-confidence-interval-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>getting p-value instead of confidence interval</a></span></li><li><span><a href="#p-value-from-statsmodels-proportion-ztest" data-toc-modified-id="p-value-from-statsmodels-proportion-ztest-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>p-value from statsmodels proportion ztest</a></span></li></ul></li><li><span><a href="#Proportion-test-Qn2" data-toc-modified-id="Proportion-test-Qn2-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Proportion test Qn2</a></span><ul class="toc-item"><li><span><a href="#Proprotion-test-Qn3" data-toc-modified-id="Proprotion-test-Qn3-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Proprotion test Qn3</a></span></li></ul></li><li><span><a href="#CI-for-two-samples" data-toc-modified-id="CI-for-two-samples-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>CI for two samples</a></span><ul class="toc-item"><li><span><a href="#Qn-1-Two-sample-t-test" data-toc-modified-id="Qn-1-Two-sample-t-test-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Qn 1 Two sample t-test</a></span></li><li><span><a href="#Two-sample-z-test-Qn-2" data-toc-modified-id="Two-sample-z-test-Qn-2-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Two sample z-test Qn 2</a></span></li><li><span><a href="#Qn-3" 
data-toc-modified-id="Qn-3-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Qn 3</a></span></li></ul></li></ul></div>

In [1]:
import numpy as np
from scipy import stats

import statsmodels
import statsmodels.stats.api as sms

# Qn1 proportion test

- https://towardsdatascience.com/40-statistics-interview-problems-and-answers-for-data-scientists-6971a02b7eee

- https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

- https://sigmazone.com/binomial-confidence-intervals/

You are running for office and your pollster polled one hundred people. Sixty of them claimed they will vote for you. Can you relax?

![](images/ci.png)
![](images/ci2.png)
```
p-hat = 60/100 = 0.6
z* = 1.96
n = 100
```
This gives a 95% confidence interval of [50.4%, 69.6%]. The lower bound is only just above 50%, so at the 95% confidence level you can relax if you are okay with a worst case that is close to a tie. Otherwise, you cannot relax until at least 61 out of 100 say they will vote for you.


## using direct formula

In [17]:
# find critical value for 95% confidence interval
alpha = 0.05
z_crit = stats.norm.ppf(1-alpha/2)
z_crit

1.959963984540054

In [21]:
alpha_by_2 = 1 - stats.norm.cdf(z_crit)
alpha = alpha_by_2 * 2

alpha

0.050000000000000044

In [23]:
alpha = stats.norm.sf(z_crit) * 2
alpha

0.05

In [2]:
phat = 60/100
z_crit = 1.96 # 95% confidence interval
n = 100

second = z_crit * np.sqrt(phat * (1-phat)) / np.sqrt(n)

ci = (phat-second, phat + second)
ci

(0.5039800020828994, 0.6960199979171006)

## using statsmodels

In [3]:
from statsmodels.stats import proportion as smp

smp.proportion_confint(
    count=60,nobs=100,alpha=0.05,method='normal')

(0.5039817664728937, 0.6960182335271062)
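The direct-formula interval and the statsmodels interval differ from the sixth decimal place on. A likely reason is that the direct formula uses the rounded value 1.96 while statsmodels uses the unrounded normal quantile. The sketch below checks this using only the standard library (`statistics.NormalDist` is a choice made for this example, not something the notebook imports).

```python
from math import sqrt
from statistics import NormalDist

count, nobs, alpha = 60, 100, 0.05
phat = count / nobs

# exact two-sided critical value instead of the rounded 1.96
z_exact = NormalDist().inv_cdf(1 - alpha / 2)
se = sqrt(phat * (1 - phat) / nobs)

ci_exact = (phat - z_exact * se, phat + z_exact * se)
ci_rounded = (phat - 1.96 * se, phat + 1.96 * se)
print(ci_exact)    # should agree with the statsmodels result
print(ci_rounded)  # should agree with the direct-formula result
```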

## getting p-value instead of confidence interval

In [4]:
x = 60
n = 100
p = 0.5

# stats.binom_test was removed in SciPy 1.12; stats.binomtest is its replacement
pval = stats.binomtest(k=x, n=n, p=p, alternative='greater').pvalue
pval

0.02844396682049039

In [5]:
# using survival function ( 1-cdf)
pval = stats.binom.sf(x-1,n,p)
pval

0.02844396682049039

In [6]:
# sf(x) is P(X > x), so the p-value P(X >= x) needs sf(x-1), not sf(x)
stats.binom.sf(x,n,p), 1 - stats.binom.sf(x,n,p)

(0.017600100108852407, 0.9823998998911476)

In [7]:
# these do not give the p-value either: 1 - cdf(x) equals sf(x), so use the survival function with x-1.
stats.binom.cdf(x,n,p), 1 - stats.binom.cdf(x,n,p)

(0.9823998998911476, 0.01760010010885238)
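To see why the `x-1` shift is needed, the upper tail can be summed directly from the binomial probability mass function. This is a minimal standard-library sketch; the helper name `binom_upper_tail` is made up for illustration.

```python
from math import comb

def binom_upper_tail(k_min, n, p):
    """Return P(X >= k_min) for X ~ Binomial(n, p) by summing the pmf."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

x, n, p = 60, 100, 0.5
p_at_least_x = binom_upper_tail(x, n, p)       # P(X >= 60), the one-sided p-value
p_more_than_x = binom_upper_tail(x + 1, n, p)  # P(X > 60), which is what sf(x) returns
print(p_at_least_x, p_more_than_x)
```

The first number matches `stats.binom.sf(x-1, n, p)` and the second matches `stats.binom.sf(x, n, p)`. The difference between them is the probability of observing exactly 60.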

## p-value from statsmodels proportion ztest

- https://docs.w3cub.com/statsmodels/generated/statsmodels.stats.proportion.proportions_ztest

In [8]:
from statsmodels.stats.proportion import proportions_ztest

# the null proportion is 0.5 (value=0.05 would test against p = 0.05)
stat, pval = proportions_ztest(count=60, nobs=100, value=0.5, alternative='two-sided')

round(pval, 3)

0.041
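For reference, the one-sample z statistic for 60 successes out of 100 against a null proportion of 0.5 can be computed by hand. The sketch below uses only the standard library and shows two common choices for the standard error. The statsmodels documentation describes `proportions_ztest` as using the sample proportion for the variance unless `prop_var` is given; treat that as an assumption to verify for your installed version.

```python
from math import sqrt
from statistics import NormalDist

count, nobs, p0 = 60, 100, 0.5
phat = count / nobs

# standard error from the sample proportion (assumed statsmodels default)
z_sample = (phat - p0) / sqrt(phat * (1 - phat) / nobs)
# standard error from the null proportion (textbook one-sample z-test)
z_null = (phat - p0) / sqrt(p0 * (1 - p0) / nobs)

std_normal = NormalDist()
pval_sample = 2 * (1 - std_normal.cdf(abs(z_sample)))  # two-sided
pval_null = 2 * (1 - std_normal.cdf(abs(z_null)))      # two-sided
print(z_sample, pval_sample)
print(z_null, pval_null)
```

Both p-values are two-sided, so they are not directly comparable to the one-sided binomial p-value computed earlier.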

# Proportion test Qn2

- https://stackoverflow.com/questions/62562064/how-to-calculate-pvalue-for-one-tailed-test-in-python
    
In previous years 52% of 1018 parents believed that electronics and social media were the cause of their teenager’s lack of sleep.

Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media?


Population: Parents with a teenager (age 13-18)


Parameter of Interest: p 

Null Hypothesis: p = 0.52 

Alternative Hypothesis: p > 0.52 (note that this is a one-sided test)


1018 Parents

56% believe that their teenager’s lack of sleep is caused by electronics and social media


![](images/z_for_prop.jpeg)

In [9]:
p0 = 0.52
p = 0.56
n = 1018
Z = (p-p0)/np.sqrt(p0*(1-p0)/n)

Z

2.5545334262132955

In [10]:
# stats.norm.cdf(Z) gives you the cumulative probability up till Z
# and since we need the probability of observing 
# something more extreme than this we need 1 - cdf

pval = 1-stats.norm.cdf(Z) # one-sided
pval

0.0053165109918223985

In [11]:
# The probability density function of the
# normal distribution is symmetric

pval = stats.norm.cdf(-Z)
pval

0.005316510991822442

In [12]:
pval = stats.norm.sf(Z) # survival function gives 1 -cdf
pval

0.005316510991822442

## Proprotion test Qn3

- https://sigmazone.com/binomial-confidence-intervals/
    
Let’s assume that two candidates are running in an election for Governor of California. This fictitious election pits Mr. Gubinator vs. Mr. Ventura.


We would like to know who is winning the race, and therefore we conduct a poll of likely voters in California. If the poll gives the voters a choice between the two candidates, then the results can be reasonably modeled with the Binomial Distribution.


In our poll of 50 likely voters, 58% indicate they intend to vote for Mr. Gubinator.

Does this mean that 58% of all voters intend to vote for Mr. Gubinator? Probably not.


If we were to repeat this poll several times in the same day (using a different group of 50 each time) we would find that the percentage that intends to vote for Mr. Gubinator would change with each poll. 


Using our previous example, if a poll of 50 likely voters resulted in 29 expressing their desire to vote for Mr. Gubinator, the resulting 95% CI would be calculated as follows.

![](images/sigmazone.png)

Thus, we would be 95% confident that the proportion of the target population (all voters in California) who intend to vote for Mr. Gubinator falls between 44% and 72%.
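The 44% to 72% interval can be reproduced with a few lines of standard-library Python. This is a sketch of the Normal Approximation formula, with variable names chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

count, nobs, conf = 29, 50, 0.95
phat = count / nobs

# two-sided critical value for the chosen confidence level
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
margin = z * sqrt(phat * (1 - phat) / nobs)

lower, upper = phat - margin, phat + margin
print(round(lower, 2), round(upper, 2))
```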

While this method is very easy to teach and understand, you may have noticed that $z_{1-\alpha/2}$ is derived from the Normal Distribution and not the Binomial Distribution. The use of the z value from the Normal Distribution is where the method earns its moniker “Normal Approximation”. While the use of the Normal Distribution seems odd at first, it is supported by the central limit theorem and with sufficiently large n, the Normal Distribution is a good estimate of the Binomial Distribution.

However, there are times when the Normal Distribution is not a good estimator of the Binomial. When p is very small or very large, the Normal Approximation starts to suffer from increased inaccuracy. Specifically, when np < 5 or n(1-p) < 5 the Normal Approximation method should not be used.

Additionally, if you try to calculate a CI with p=0 or p=1, the term p(1-p) is zero, so the Normal Approximation collapses to a zero-width interval and gives no usable CI.
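A small sketch of these two limitations: a rule-of-thumb check on the expected counts, and the degenerate interval at the extremes. It assumes the usual form of the rule, namely that the Normal Approximation is reasonable only when both np and n(1-p) are at least 5. The function names are hypothetical.

```python
from math import sqrt

def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb: both n*p and n*(1-p) should be at least `threshold`."""
    return n * p >= threshold and n * (1 - p) >= threshold

def normal_approx_margin(phat, n, z=1.96):
    """Half-width of the Normal Approximation interval."""
    return z * sqrt(phat * (1 - phat) / n)

print(normal_approx_ok(50, 0.58))  # poll example: about 29 and 21 expected counts
print(normal_approx_ok(50, 0.02))  # only about 1 expected success
# at phat = 0 or phat = 1 the half-width is zero, so the interval is a single point
print(normal_approx_margin(0.0, 50), normal_approx_margin(1.0, 50))
```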



# CI for two samples

- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

- https://www.reneshbedre.com/blog/ttest.html

- https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals5.html


![](images/two_sample_t_test.png)
![](images/two_sample_t_df.png)

## Qn 1 Two sample t-test

In a study of emergency room waiting times, investigators compare a new triage system with the standard one.

To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. 

They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. 

Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance.

What is the interval? Subtract in this order (New System - Old System).

Note: $s_p^2$ is the pooled sample variance.


Answer:

![](images/qn_39.png)

In [13]:
from scipy.stats import ttest_ind_from_stats

mean1, mean2 = 3,5
var1, var2 = 0.6, 0.68
nobs1, nobs2 = 10,10


ttest_ind_from_stats(mean1=mean1, std1=np.sqrt(var1), nobs1=nobs1,
                     mean2=mean2, std2=np.sqrt(var2), nobs2=nobs2)

Ttest_indResult(statistic=-5.5901699437494745, pvalue=2.6367163153806834e-05)
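The t-test above gives a test statistic and p-value, while the question asks for a confidence interval. A minimal sketch of the pooled-variance interval follows, assuming equal population variances as the question states; the variable names are chosen for this example.

```python
import numpy as np
from scipy import stats

mean_new, mean_old = 3, 5
var_new, var_old = 0.60, 0.68
n_new, n_old = 10, 10

df = n_new + n_old - 2
# pooled sample variance, assuming equal population variances
sp2 = ((n_new - 1) * var_new + (n_old - 1) * var_old) / df
se = np.sqrt(sp2 * (1 / n_new + 1 / n_old))
t_crit = stats.t.ppf(0.975, df)  # two-sided 95% critical value

diff = mean_new - mean_old  # New System - Old System
ci = (diff - t_crit * se, diff + t_crit * se)
print(ci)
```

The interval is roughly (-2.75, -1.25) hours. It excludes zero, which agrees with the small p-value from `ttest_ind_from_stats`.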

## Two sample z-test Qn 2

To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights.

They calculated the nightly median waiting time (MWT) to see a physician. 


The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours.

Consider the hypothesis of a decrease in the mean MWT associated with the new treatment.

What does the 95% independent group confidence interval with unequal variances suggest vis a vis this hypothesis?


(Because there are so many observations per group, just use the Z quantile instead of the T.)


Answer:
![](images/qn_40.png)
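The interval can also be computed directly. The sketch below uses only the standard library and follows the question's instructions: unequal variances and the Z quantile. It is an independent calculation from the numbers in the question, and the variable names are chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

mean_new, sd_new, n_new = 4, 0.5, 100
mean_old, sd_old, n_old = 6, 2, 100

# unequal variances: add the two squared standard errors
se = sqrt(sd_new**2 / n_new + sd_old**2 / n_old)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

diff = mean_new - mean_old  # New System - Old System
ci = (diff - z * se, diff + z * se)
print(ci)
```

The interval is roughly (-2.40, -1.60) hours. Since it lies entirely below zero, it is consistent with a decrease in the mean MWT under the new system.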

## Qn 3

- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html

```
                 Sample   Sample
           Size   Mean   Variance
Sample 1    13    15.0     87.5
Sample 2    11    12.0     39.0
```

Apply the t-test to this data (with the assumption that the population variances are equal).

In [14]:
from scipy.stats import ttest_ind_from_stats
ttest_ind_from_stats(mean1=15.0, std1=np.sqrt(87.5), nobs1=13,
                     mean2=12.0, std2=np.sqrt(39.0), nobs2=11)

Ttest_indResult(statistic=0.9051358093310269, pvalue=0.3751996797581487)

In [15]:
from scipy.stats import ttest_ind

# raw samples whose sizes, means, and sample variances match the summary table above
a = np.array([1, 3, 4, 6, 11, 13, 15, 19, 22, 24, 25, 26, 26])
b = np.array([2, 4, 6, 9, 11, 13, 14, 15, 18, 19, 21])

ttest_ind(a, b)

Ttest_indResult(statistic=0.905135809331027, pvalue=0.3751996797581486)