# Hypothesis Testing

Hypothesis testing is a critical tool in determining what the value of a parameter could be.

Every test is built on two competing hypotheses:

**Null Hypothesis: $H_0$**

**Alternative Hypothesis: $H_a$**

The tests we have discussed in lecture are:

* One Population Proportion
* Difference in Population Proportions
* One Population Mean
* Difference in Population Means

### Setting Up a Test for a Population Proportion

Our research question comes with some background information: in previous years, 52 percent of parents believed that electronics and social media were the cause of their teenager's lack of sleep. Do more parents today believe that their teenager's lack of sleep is caused by electronics and social media?


So, with this background, we first want to define our parameter of interest.

**Population** - parents with a teenager (ages 13 to 18)  
**Parameter of interest** - p, the population proportion.


**Goal**: Test for a significant increase in the proportion of parents with a teenager who believe that electronics and social media is the cause for lack of sleep. 


*With any hypothesis test, you start by stating your hypotheses. You do this before you collect any data, so that the data cannot influence what you set out to test.*

Null Hypothesis $$H_0: p = 0.52$$
Alternative Hypothesis $$H_a: p \;?\; 0.52$$
where '?' stands for >, < or $\neq$. Since we are testing for a significant increase, in this case
Alternative Hypothesis $$H_a: p > 0.52$$
where p is the population proportion of parents with a teenager who believe that electronics and social media is the cause of their teen's lack of sleep.

Finally, we set a significance level $$\alpha = 0.05$$ The significance level is typically 0.05. It is the cut-off for deciding whether a result is significant: we reject the null hypothesis when the p-value falls below it.
In this tutorial, I will introduce some functions that are extremely useful when calculating a test statistic and p-value for a hypothesis test.

**Survey Result:
A random sample of 1,018 parents with a teenager was taken and 56% said they believe electronics and social media was the cause of their teenager's lack of sleep.**


Before running the test, we need to check two assumptions.

First: we need a random sample of parents.
Second: the sample size must be large enough that the distribution of the sample proportion is approximately normal.


How do we check that the sample size is large enough? We need
$$ n \cdot p ~\text{to be at least}~ 10$$
$$ n \cdot (1-p) ~\text{to be at least}~ 10$$
In other words, we expect at least 10 people to say yes and at least 10 people to say no to the question.

Of course, we don't know exactly what p is. So, instead of p, we use $p_0$, the population proportion under the null hypothesis.

So, $$ n \cdot p \geq 10 \rightarrow n \cdot p_0 \geq 10$$
$$ n \cdot (1-p) \geq 10 \rightarrow n \cdot (1-p_0) \geq 10$$

Under the null hypothesis, we take the population proportion to be 0.52, and our n is 1,018.
So, $$ n \cdot p_0 = 1018(0.52) \approx 529 $$
$$ n \cdot (1-p_0) = 1018(1-0.52) \approx 489$$
Both values are well above 10, so the sample size is large enough.
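The sample-size check above is easy to script. The following is a minimal sketch; the helper name `large_enough` and its `threshold` parameter are illustrative and not part of any library.

```python
def large_enough(n, p0, threshold=10):
    """Return the expected 'yes' and 'no' counts under the null,
    plus whether both counts reach the threshold."""
    expected_yes = n * p0
    expected_no = n * (1 - p0)
    return expected_yes, expected_no, min(expected_yes, expected_no) >= threshold


# Values from the survey example: n = 1018, p0 = 0.52
expected_yes, expected_no, ok = large_enough(1018, 0.52)
print(round(expected_yes, 2), round(expected_no, 2), ok)
```

Both expected counts are far above 10 here, so `ok` comes back true.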

Now that we've set up our hypotheses and checked our assumptions, we can run the test and see whether we have a significant result. We calculate a test statistic by taking our best estimate, subtracting the hypothesized estimate, and dividing by the standard error of the estimate.


Let's quickly review how to calculate a test statistic for the tests listed above.

The equation is:

$$\frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$$ 

Where, 
$$Standard\ Error \ for\ Population\ Proportion (\ Or \ Estimate)= \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

In a hypothesis test, the population proportion in this formula is the null value $p_0$.

We will use the examples from our lectures and use Python functions to streamline our tests.

In [1]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import scipy.stats.distributions as dist

In [2]:
best_estimate = 0.56 # p_hat
null_est = 0.52 #p_0
n = 1018
se_nu =  np.sqrt((null_est*(1-null_est))/n)
z = (best_estimate-null_est)/se_nu
z

2.5545334262132955

### One Population Proportion

#### Research Question 

In previous years 52% of parents believed that electronics and social media was the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused due to electronics and social media? 

**Population**: Parents with a teenager (age 13-18)  
**Parameter of Interest**: p  
**Null Hypothesis:** p = 0.52  
**Alternative Hypothesis:** p > 0.52 (note that this is a one-sided test)

1018 Parents

56% believe that their teenager’s lack of sleep is caused due to electronics and social media.

In [3]:
n = 1018
pnull = .52
phat = .56
sm.stats.proportions_ztest(phat * n, n, pnull, alternative='larger', prop_var=0.52)

(2.5545334262132955, 0.005316510991822442)

This Z test statistic means that our observed sample proportion is 2.555 null standard errors above our hypothesized population proportion. We took our sample proportion, subtracted the hypothesized value, and divided by the standard error, which gives the number of null standard errors.


### Test Statistic Distribution

This Z test statistic is itself a random variable with a distribution. Under the null hypothesis, and when the assumptions above hold, it approximately follows a standard normal distribution, N(0,1). This is because we center and scale our original estimate. To be more clear, look at the equation below,

$$\frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$$ 

The numerator centers the estimate and the denominator scales it.

So, now that we have the Z test statistic, we can find a p-value from it:
 
 ![imagew.png](image\18.png)

In [5]:
sm.stats.proportions_ztest(phat * n, n, pnull, alternative='larger', prop_var=0.52)

(2.5545334262132955, 0.005316510991822442)

We can get the p-value from sm.stats.proportions_ztest: it is 0.0053, which is less than $\alpha$ = 0.05. That means we can reject the null hypothesis $H_0$.
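The p-value can also be recovered directly from the Z statistic, since for this one-sided test it is the area in the upper tail of the standard normal distribution. A minimal sketch, assuming SciPy is available and reusing the Z value computed earlier:

```python
from scipy.stats import norm

z = 2.5545334262132955  # test statistic computed earlier
alpha = 0.05            # significance level chosen before looking at the data

# One-sided (upper-tail) p-value: P(Z >= z) under the standard normal
p_value = norm.sf(z)
reject_null = p_value < alpha
print(round(p_value, 4), reject_null)
```

Using `norm.sf` (the survival function) instead of `1 - norm.cdf(z)` is a small design choice that avoids loss of precision for large Z values.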

So, to summarize our finding: there is sufficient evidence to conclude that the population proportion of parents with a teenager who believe that electronics and social media is the cause of lack of sleep is greater than 0.52, or 52 percent.

**So overall, there are four main steps to a hypothesis test.** 
  - we state our hypotheses and select a significance level, alpha, 
  - we check our assumptions, 
  - we calculate a test statistic and get a p-value from that test statistic, 
  - and finally we draw a conclusion from that p-value.

### Difference in Population Proportions

#### Research Question

Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

**Populations**: All parents of black children age 6-18 and all parents of Hispanic children age 6-18  
**Parameter of Interest**: p1 - p2, where p1 is the proportion for parents of black children and p2 is the proportion for parents of Hispanic children  
**Null Hypothesis:** p1 - p2 = 0  
**Alternative Hypothesis:** p1 - p2 $\neq$ 0  


91 out of 247 (36.8%) sampled parents of black children report that their child has had some swimming lessons.

120 out of 308 (39.0%) sampled parents of Hispanic children report that their child has had some swimming lessons.

We'd like to test for a significant difference in the population proportions of parents reporting that their child has had swimming lessons, and we'll do this at the 10 percent significance level.

In [6]:
# This example implements the analysis from the "Difference in Two Proportions" lecture videos

# Sample sizes
n1 = 247
n2 = 308

# Number of parents reporting that their child had some swimming lessons
y1 = 91
y2 = 120

# Estimates of the population proportions (rounded to two decimal places)
p1 = round(y1 / n1, 2)
p2 = round(y2 / n2, 2)

p1 - p2

-0.020000000000000018

The negative difference means that the sample proportion for black children is smaller than the sample proportion for Hispanic children.

Recall the general form of the test statistic.

The equation is:

$$\frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$$ 

Here, Best Estimate = $\hat{p}_1 - \hat{p}_2$ and Hypothesized Estimate = 0 (since the null hypothesis assumes equal population proportions).

Because the null hypothesis assumes a common population proportion, the standard error uses the pooled estimate $\hat{p}$:

$$Standard\ Error = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$ where $$\hat{p} = \frac{y_1+y_2}{n_1+n_2}$$

and $y_1$, $y_2$ are the numbers of parents in each sample who report that their child has had swimming lessons.

In [9]:
# Estimate of the combined population proportion
phat = (y1 + y2) / (n1 + n2)

# Estimate of the variance of the combined population proportion
va = phat * (1 - phat)

# Estimate of the standard error of the combined population proportion
se = np.sqrt(va * (1 / n1 + 1 / n2))

# Test statistic 
test_stat = (p1 - p2) / se  #z
print("Test Statistic")
print(round(test_stat, 2))

Test Statistic
-0.48


Our z is -0.48, which means that our observed difference in sample proportions is 0.48 estimated standard errors below the hypothesized difference of zero. That gives us a sense of where our sample falls. But let's look at what the p-value is and what that means. Because we have a z test statistic, it follows a standard normal distribution under the null hypothesis: a bell-shaped distribution with a mean of zero and a standard deviation of one.

In [10]:
pvalue = 2*dist.norm.cdf(-np.abs(test_stat))

print("\nP-Value")
print(round(pvalue, 2))


P-Value
0.63


Now we compare our p-value of 0.63 to our alpha of 0.1. Since 0.63 is larger than 0.1, we fail to reject the null hypothesis. This doesn't mean that we accept the null hypothesis, only that we can't reject it: our null hypothesis was that the population proportions are equal, and we don't have evidence against it. That is not proof that the null hypothesis is true.

Formally, based on our sample and our p-value, we fail to reject the null hypothesis. We conclude that there is no significant difference between the population proportions of parents of black and Hispanic children who report that their child has had swimming lessons. In essence, the data are consistent with those two population proportions being roughly equal.
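The cells above round each sample proportion to two decimal places before taking the difference, which shifts the test statistic slightly. As a cross-check, here is a sketch of the same pooled two-proportion z test using the unrounded proportions; the variable names `p1_hat`, `p2_hat` and `p_pooled` are chosen for this illustration only.

```python
import numpy as np
from scipy.stats import norm

# Sample sizes and counts from the swimming lessons example
n1, n2 = 247, 308
y1, y2 = 91, 120

# Unrounded sample proportions
p1_hat = y1 / n1
p2_hat = y2 / n2

# Pooled proportion and standard error under the null of equal proportions
p_pooled = (y1 + y2) / (n1 + n2)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# Test statistic and two-sided p-value
z = (p1_hat - p2_hat) / se
p_value = 2 * norm.cdf(-abs(z))
print(round(z, 2), round(p_value, 2))
```

Without rounding, the statistic comes out at roughly -0.51 instead of -0.48. The p-value remains far above 0.1, so the conclusion does not change.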

### One Population Mean

#### Research Question 

Is the average cartwheel distance (in inches) for adults more than 80 inches?

**Population**: All adults  
**Parameter of Interest**: $\mu$, population mean cartwheel distance.  
**Null Hypothesis:** $\mu$ = 80  
**Alternative Hypothesis:** $\mu$ > 80

25 Adults

In [13]:
df = pd.read_csv("dataset/Cartwheeldata.csv")
df.describe()['CWDistance']

count     25.000000
mean      82.480000
std       15.058552
min       63.000000
25%       70.000000
50%       81.000000
75%       92.000000
max      115.000000
Name: CWDistance, dtype: float64

From the summary above, the sample mean and sample standard deviation are

$\bar{x} = 82.48$

$s = 15.06$

In [14]:
n = len(df)
mean = df["CWDistance"].mean()
sd = df["CWDistance"].std()
(n, mean, sd)

(25, 82.48, 15.058552387264855)

In [16]:
null_h = 80
t = (mean - null_h)/(sd/np.sqrt(n))
t

0.8234523266982027

Our test statistic turned out to be 0.82.

In [15]:
sm.stats.ztest(df["CWDistance"], value = 80, alternative = "larger")

(0.8234523266982029, 0.20512540845395266)

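Since our p-value (0.205) is much bigger than the 0.05 significance level, we have only weak evidence against the null hypothesis, and we fail to reject it. Note that sm.stats.ztest uses the standard normal distribution as its reference distribution.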
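With only 25 observations, a t distribution with n - 1 degrees of freedom is a common alternative reference distribution for this statistic. The sketch below is not part of the original analysis: it recomputes the statistic from the summary values printed above and looks up the one-sided p-value with `scipy.stats.t`.

```python
import numpy as np
from scipy.stats import t as t_dist

# Summary statistics printed above for CWDistance
n = 25
sample_mean = 82.48
sample_sd = 15.058552387264855
null_mean = 80

# Same test statistic as before
t_stat = (sample_mean - null_mean) / (sample_sd / np.sqrt(n))

# One-sided (upper-tail) p-value from a t distribution with n - 1 degrees of freedom
p_value = t_dist.sf(t_stat, n - 1)
print(round(t_stat, 4), round(p_value, 3))
```

The t-based p-value is slightly larger than the normal-based one because the t distribution has heavier tails. Either way it is well above 0.05, so we still fail to reject the null.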

### Difference in Population Means

#### Research Question 

Considering adults in the NHANES data, do males have a significantly higher mean Body Mass Index than females?

**Population**: Adults in the NHANES data.  
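**Parameter of Interest**: $\mu_1 - \mu_2$, the difference in mean Body Mass Index, where $\mu_1$ is the mean for females and $\mu_2$ is the mean for males.  
**Null Hypothesis:** $\mu_1 = \mu_2$  
**Alternative Hypothesis:** $\mu_1 \neq \mu_2$ (note that this is a two-sided test)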

2976 Female Adults  
$\bar{x}_1 = 29.94$  
$s_1 = 7.75$  

2759 Male Adults  
$\bar{x}_2 = 28.78$  
$s_2 = 6.25$  

$\bar{x}_1 - \bar{x}_2 = 1.16$

In [20]:
da = pd.read_csv("dataset/nhanes_2015_2016.csv")
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [21]:
females = da[da["RIAGENDR"] == 2]
male = da[da["RIAGENDR"] == 1]

In [22]:
n1 = len(females)
mu1 = females["BMXBMI"].mean()
sd1 = females["BMXBMI"].std()

(n1, mu1, sd1)

(2976, 29.939945652173996, 7.75331880954568)

In [23]:
n2 = len(male)
mu2 = male["BMXBMI"].mean()
sd2 = male["BMXBMI"].std()

(n2, mu2, sd2)

(2759, 28.778072111846985, 6.252567616801485)

In [24]:
sm.stats.ztest(females["BMXBMI"].dropna(), male["BMXBMI"].dropna())

(6.1755933531383205, 6.591544431126401e-10)
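To see roughly where this statistic comes from, the sketch below rebuilds a two-sample z statistic from the summary values printed above. It is an approximation under stated assumptions, not a reproduction of `sm.stats.ztest`: it uses an unpooled standard error, and the counts from `len()` include rows with missing BMI values that the call above drops. The result should therefore land near the reported value without matching it exactly.

```python
import numpy as np
from scipy.stats import norm

# Summary statistics printed above (counts include rows with missing BMI)
n1, mean1, sd1 = 2976, 29.939945652173996, 7.75331880954568   # females
n2, mean2, sd2 = 2759, 28.778072111846985, 6.252567616801485  # males

# Unpooled standard error of the difference in sample means (an assumption here)
se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Test statistic and two-sided p-value
z = (mean1 - mean2) / se
p_value = 2 * norm.sf(abs(z))
print(round(z, 2), p_value)
```

Either way the statistic is around 6 standard errors from zero and the p-value is far below 0.05, so we reject the null hypothesis of equal mean BMI. In this sample the female mean is the higher one.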