# Null Hypothesis & Confidence Interval

# Null Hypothesis

**Formal definition:** The formal method of reaching conclusions, based on population statistics and sample data, about whether a change applied to the population had an effect.

**SF STANDS FOR SURVIVAL FUNCTION:** `sf(x) = 1 - cdf(x)`

* **Scenario A:** We know the *population size* and the population's *standard deviation* (σ). We apply changes, and **now**, we are in....
* **Scenario B:** We gather *n* samples from the population.
    * **x̄** : sample mean
    * **n** : number of samples taken
    * **s** : sample standard deviation
* **Goal:** Draw a conclusion about the population mean for B (μB).
* **Define:** Significance level (α)

* **Null:** H0: μA = μB
* **Alternative:** H1: μA != μB
* ***Z-test:*** If σ is known, or n >= 30 (using s in place of σ):
    * **z_score** = (x̄ - μA)/(σ/sqrt(n))
    * **p_value** = 2 * stats.norm.sf(|z_score|)
        * equivalently, **2 * stats.norm.cdf(-np.abs(z))**
* **In contrast (t-test),** if σ is unknown and n < 30:
    * **t_score** = (x̄ - μA)/(s/sqrt(n))
        * s = sample standard deviation
    * **p_value** = 2 * stats.t.sf(|t_score|, df=n-1)
        * don't worry about the syntax. Just use:
        * **stats.ttest_1samp**

1. **If** p_value < α, **reject the null hypothesis.**
1. **Else**, we **fail to reject the null hypothesis** (loosely, we "accept" it).
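The rules above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the helper names `two_sided_p_from_z` and `two_sided_p_from_t` and all the numbers are made up for illustration.

```python
import numpy as np
from scipy import stats


def two_sided_p_from_z(x_bar, mu_a, sigma, n):
    """z-score and two-sided p-value when sigma is known (hypothetical helper)."""
    z = (x_bar - mu_a) / (sigma / np.sqrt(n))
    return z, 2 * stats.norm.sf(abs(z))


def two_sided_p_from_t(x_bar, mu_a, s, n):
    """t-score and two-sided p-value when only the sample std s is available."""
    t = (x_bar - mu_a) / (s / np.sqrt(n))
    return t, 2 * stats.t.sf(abs(t), df=n - 1)


alpha = 0.05  # significance level (illustrative)

# Illustrative numbers only, not taken from any dataset
z, p_z = two_sided_p_from_z(x_bar=10.4, mu_a=10.0, sigma=2.0, n=64)
t, p_t = two_sided_p_from_t(x_bar=10.4, mu_a=10.0, s=2.0, n=16)

# The survival function and the mirrored CDF give the same p-value
assert np.isclose(p_z, 2 * stats.norm.cdf(-abs(z)))

for name, p in [("z-test", p_z), ("t-test", p_t)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(name, "p =", p, "->", decision)
```

Both helpers compare the p-value against α, not against σ.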

Testing assumes:
* There's a relationship between our test variable (drug) and our other variable (rat reaction)
* That relationship ISN'T caused by chance or errors (ex. what the rats ate)

In [38]:
from scipy import stats
import numpy as np
import pandas as pd

## Z-test activity:

In [2]:
# Write a function that takes these values. Should we accept or reject the null hypothesis?

Rats

* **Significance level** α = 0.05
* **Sample size** n = 100
* **Population mean A** μA = 1.2 sec
* **Standard dev** σ = 0.5 sec
* **Sample mean** x̄ = 1.05 sec

z = (1.05 - 1.2)/(0.5/sqrt(100)) = (1.05 - 1.2)/(0.5/10)

p-value = 2 * (1 - norm.cdf(|z|))

In [14]:
zscore = (1.05-1.2)/(0.5/(10))
# zscore is the z-score: (sample mean - population mean) / (sigma / sqrt(n)), with sqrt(100) = 10
print(zscore)

-2.9999999999999982


In [20]:
# pval formula: pval = 2 * SF(|zscore|), which equals 2 * CDF(-|zscore|)
pval = 2 * stats.norm.cdf(-np.abs(zscore))
print(pval)

0.0026997960632602026


### Actual function

In [23]:
def null_hyp(mu, sample_mean, sig_level, n, sigma):
    # Standard error of the mean: std dev of the sampling distribution
    standard_error = sigma / np.sqrt(n)
    # Calc z-score from population mean (mu), sample mean, and standard error
    z = (sample_mean - mu) / standard_error

    # Calc two-sided p-value from z-score
    p = 2 * stats.norm.cdf(-np.abs(z))

    # Determine whether to accept/reject null hypothesis
    if p < sig_level:
        print('reject null hypothesis')
    else:
        print('accept null hypothesis')

In [24]:
null_hyp(1.2, 1.05, 0.05, 100, 0.5)

reject null hypothesis


Yay, it matches.

## T-test activity:

The average British man is 175.3 cm tall. A survey recorded the heights of 10 British men. Calculate the t-score from the formula, then use the available function `stats.ttest_1samp`, and compare the results.

`x = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]`

In [31]:
x = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
mu = 175.3
sample_mean = np.array(x).mean()

# Calc the standard deviation of sample distribution

N = len(x) # num of data samples
S = np.array(x).std(ddof=1)
den = S / np.sqrt(N)

# T-test from formula
t = (sample_mean - mu)/den
print("t-statistic: ",t)

# A one-sample t-test that also returns the p-value can be done with scipy:
# ttest_1samp calculates the t-test for the mean of ONE group of scores.
# This is a two-sided test for the null hypothesis that the expected value (mean)
# of a sample of independent observations is equal to the given population mean, popmean.
t, p = stats.ttest_1samp(x, mu)
print("t = ", t, ", p = ", p)

t-statistic:  2.295568968083183
t =  2.295568968083183 , p =  0.04734137339747034
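The p-value that `stats.ttest_1samp` returns can also be reproduced by hand from the t-score, using the survival function of the t-distribution with n - 1 degrees of freedom. A minimal sketch on the same height data (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

x = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
mu = 175.3
n = len(x)

# t-score from the formula: (sample mean - mu) / (s / sqrt(n)), with s using ddof=1
t_manual = (np.mean(x) - mu) / (np.std(x, ddof=1) / np.sqrt(n))

# Two-sided p-value: twice the upper-tail area beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Library result for comparison
t_scipy, p_scipy = stats.ttest_1samp(x, mu)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```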


## Write a function that determines whether to use the z-test or t-test in order to accept or reject null hypothesis

In [33]:
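def z_t_null_hypothesis(data_sample, mu, sigma, sig_level):
    # Pass sigma=None when the population standard deviation is unknown
    n = len(data_sample)
    if sigma is not None:
        # sigma known: z-test with the population standard deviation
        z_score = (np.mean(data_sample)-mu)/(sigma/np.sqrt(n))
        p = stats.norm.sf(abs(z_score))*2
    elif n >= 30:
        # sigma unknown, large sample: z-test with the sample standard deviation
        z_score = (np.mean(data_sample)-mu)/(np.std(data_sample, ddof=1)/np.sqrt(n))
        p = stats.norm.sf(abs(z_score))*2
    else:
        # sigma unknown, small sample: one-sample t-test
        t, p = stats.ttest_1samp(data_sample, mu)

    if p < sig_level:
        print('Reject null hypothesis')
    else:
        print('Accept null hypothesis')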
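For a quick check of the selection logic, here is a self-contained sketch that returns its decision instead of printing it. The name `choose_test` and its return format are hypothetical, and `sigma=None` is assumed to mean "population standard deviation unknown".

```python
import numpy as np
from scipy import stats


def choose_test(data_sample, mu, sig_level, sigma=None):
    """Hypothetical variant: returns (test name, p-value, reject H0?)."""
    n = len(data_sample)
    mean = np.mean(data_sample)
    if sigma is not None:
        name = "z-test"
        z = (mean - mu) / (sigma / np.sqrt(n))
        p = 2 * stats.norm.sf(abs(z))
    elif n >= 30:
        name = "z-test (s in place of sigma)"
        z = (mean - mu) / (np.std(data_sample, ddof=1) / np.sqrt(n))
        p = 2 * stats.norm.sf(abs(z))
    else:
        name = "t-test"
        _, p = stats.ttest_1samp(data_sample, mu)
    return name, p, p < sig_level


# Heights from the t-test activity: sigma unknown and n = 10, so the t-test branch runs
heights = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
name, p, reject = choose_test(heights, mu=175.3, sig_level=0.05)
print(name, p, "reject H0" if reject else "fail to reject H0")
```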

## What is one-tail or two-tail calculation for p-value?

If the alternative hypothesis says the mean of the sample is different from the mean of the overall population, we should compute a two-tailed p-value. If it says the mean of the sample is greater than (or lower than) the mean of the population, we should compute a one-tailed p-value.
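As a minimal sketch, the three p-values can be computed from the same z-score. The value z = -3.0 is chosen for illustration (it is the z-score from the z-test activity, up to rounding).

```python
from scipy import stats

z = -3.0  # illustrative z-score

# Two-tailed: H1 says the mean is different (either direction)
p_two_tailed = 2 * stats.norm.sf(abs(z))

# One-tailed, lower: H1 says the mean is lower than the population mean
p_lower = stats.norm.cdf(z)

# One-tailed, upper: H1 says the mean is greater than the population mean
p_upper = stats.norm.sf(z)

print(p_two_tailed, p_lower, p_upper)
```

With a negative z-score, the lower-tail p-value is half the two-tailed one, while the upper-tail p-value is close to 1, so the direction of H1 has to be fixed before looking at the data.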

## Possible errors that can happen when we accept or reject the null hypothesis

Type I error: We reject the null hypothesis when it is true.

$\alpha = P(\text{rejecting } H_0 \mid H_0 \text{ is true})$

Type II error: We fail to reject (accept) the null hypothesis when it is false.

$\beta = P(\text{accepting } H_0 \mid H_0 \text{ is false})$

Example: $H_1$: the drug has an effect on the brain.

$H_0$: the drug has no effect on the brain.
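Both error rates can be estimated by simulation. The sketch below is only an illustration under assumed settings: normally distributed data, known σ = 1, n = 25, a two-sided z-test at α = 0.05, and a true shift of 0.5 for the case where the null is false. None of these numbers come from the notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed for reproducibility
alpha, mu0, sigma, n, trials = 0.05, 0.0, 1.0, 25, 5000


def rejection_rate(true_mu):
    """Fraction of simulated samples for which the z-test rejects H0: mu = mu0."""
    samples = rng.normal(true_mu, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p < alpha)


# H0 is true: every rejection is a Type I error, so the rate should be near alpha
type_1_rate = rejection_rate(mu0)

# H0 is false (assumed true mean mu0 + 0.5): every non-rejection is a Type II error
type_2_rate = 1 - rejection_rate(mu0 + 0.5)

print("Type I rate:", type_1_rate)
print("Type II rate:", type_2_rate)
```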

## Unpaired t-test

This test compares two unrelated samples. In the following example, data was collected on the weight (kg) of 8 elderly women and 8 elderly men. We are interested in whether the mean weights of these 2 samples are different.

In [34]:
# Missed. 
# female = [63.8, 56.4, 55.2, 58.5, 64.0, 51.6, 54.6, 71.0]
# male = [75.5, 83.9, 75.7, 72.5, 56.2, 73.4, 67.7, 87.9]

# two_sample
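One way to fill in the missed cell is `stats.ttest_ind`, which runs a two-sided test for two independent samples. This is a sketch, not necessarily what was done in class. By default it assumes equal variances in both groups; passing `equal_var=False` switches to Welch's t-test.

```python
from scipy import stats

female = [63.8, 56.4, 55.2, 58.5, 64.0, 51.6, 54.6, 71.0]
male = [75.5, 83.9, 75.7, 72.5, 56.2, 73.4, 67.7, 87.9]

# H0: the two groups have the same mean weight
t_stat, p_value = stats.ttest_ind(female, male)
print("t =", t_stat, ", p =", p_value)

alpha = 0.05  # significance level (assumed here)
if p_value < alpha:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```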

## Confidence Interval

It's useful to examine an interval for the possible values of the parameter and put a probability on how confident we are that the true parameter value falls inside this interval.

### EXAMPLE
We have the data X and assume we know the population standard deviation (sigma). What is the confidence interval for population mean?


$P(L < \mu < U) = 1 - \alpha$

We want to obtain $L$ and $U$, with 1-$\alpha$ confidence

## From statistics references

$L = \bar{x} - z_{1- \alpha/2}\frac{\sigma}{\sqrt{N}}$

$U = \bar{x} + z_{1- \alpha/2}\frac{\sigma}{\sqrt{N}}$

## Flower time

Tasks:

1- Explore this dataset. How many features, records, and plants does it have?

2- Gather all of the sepal lengths for Iris-setosa.

3- Write a function that calculates the lower and upper bounds for the mean sepal length of Iris-setosa with 95% confidence.

Assume the population standard deviation is $\sigma = 0.3525$.


**The functions below aim to find the Lower and Upper bounds of our Confidence Interval.**

In [70]:
df = pd.read_csv("Iris.csv")
x = df[df['Species'] == 'Iris-setosa']['SepalLengthCm'].tolist()

print(np.mean(x))

def ci_z(data_sample, significant_level, sigma):
    z = stats.norm.ppf(1-significant_level/2)
    L = np.mean(data_sample) - z*sigma/np.sqrt(len(data_sample))
    U = np.mean(data_sample) + z*sigma/np.sqrt(len(data_sample))
    return L, U


def ci_t(data_sample, significant_level):
    t = stats.t.ppf(1 - significant_level/2, len(data_sample) - 1)
    L = np.mean(data_sample) - t * np.std(data_sample, ddof=1) / np.sqrt(len(data_sample))
    U = np.mean(data_sample) + t * np.std(data_sample, ddof=1) / np.sqrt(len(data_sample))
    return L, U


# Both return lower, upper bounds
print(ci_z(x, 0.05, 0.3525))
print(ci_t(x,0.05))

5.006
(4.908293780383348, 5.103706219616653)
(4.905823539430869, 5.106176460569132)
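As a cross-check, the z-based interval can be reproduced from summary statistics with `stats.norm.interval`. This sketch assumes n = 50 Iris-setosa records and reuses the sample mean printed above, so it does not need `Iris.csv`.

```python
import numpy as np
from scipy import stats

x_bar = 5.006      # sample mean printed above
sigma = 0.3525     # assumed population standard deviation
n = 50             # assumed number of Iris-setosa records
confidence = 0.95  # 1 - alpha

# Standard error of the mean
standard_error = sigma / np.sqrt(n)

# Interval centered on x_bar that covers `confidence` of the normal distribution
L, U = stats.norm.interval(confidence, loc=x_bar, scale=standard_error)
print(L, U)
```

The bounds should agree with the `ci_z` output above up to floating-point rounding.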
