<a href="https://colab.research.google.com/github/ava11235/it125/blob/main/week7_notes_inferential_stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Reading**

zyBooks Ch 5.1 - 5.9

**Hypothesis Testing** *

Hypothesis testing is used to answer a question about a process or the world by testing two hypotheses, a null and an alternative. Usually the null hypothesis makes a statement that “the world/process works this way”, and the alternative hypothesis says “the world/process does not work that way”.

Example

Null:

“The customer was not cheating: his chances of winning and losing were like random tosses of a fair coin, with a 50% chance of winning and a 50% chance of losing. Any variation from what we expect is due to chance variation.”

Alternative:

“The customer was cheating: his chances of winning were something other than 50%.”

Pro tip:

You must be very precise about chances in your hypotheses. Hypotheses such as “the customer cheated” or “Their chances of winning were normal” are vague and might be considered incorrect, because you don’t state the exact chances associated with the events.

Pro tip: 

The null hypothesis should also explain differences in the data. For example, if your hypothesis states that the coin is fair, then why did you get 70 heads out of 100 flips? Since it’s possible to get that many (though not expected), your null hypothesis should also contain a statement along the lines of “Any difference in outcome from what we expect is due to chance variation”.
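
To see how unusual 70 heads in 100 flips would be under the fair-coin null, the exact binomial chance can be computed directly. This is a minimal sketch using only the standard library; `chance_at_least` is a hypothetical helper name, not something from the course materials.

```python
from math import comb

def chance_at_least(k_heads, n_flips, p_heads=0.5):
    """Chance of k_heads or more heads in n_flips independent flips."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads)**(n_flips - k)
        for k in range(k_heads, n_flips + 1)
    )

# Chance of 70 or more heads in 100 flips of a fair coin
chance = chance_at_least(70, 100)
print(chance)
```

The chance is small but not zero, which is why the null hypothesis has to allow for chance variation.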

**P-value**

The p-value is the chance, under the null hypothesis, of getting a test statistic equal to the observed test statistic or more extreme in the direction of the alternative. In other words, we want to see whether the observed test statistic is likely to have come from the null distribution. If the observed test statistic is inconsistent with the test statistics generated under the null (i.e., it looks nothing like them), then it starts to look like the null hypothesis is not true.
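
The definition above can be sketched by simulation: generate many test statistics under the null and count how often they are at least as extreme as the observed one. The observed count of 60 heads, the choice of statistic (distance from 50 heads), and the function names are all assumptions made for illustration.

```python
import random

def simulate_heads(n_flips, rng):
    """Number of heads in n_flips tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips))

def empirical_p_value(observed_heads, n_flips=100, n_sims=10_000, seed=0):
    """Share of simulated statistics at least as extreme as the observed one."""
    rng = random.Random(seed)
    expected = n_flips / 2
    # Test statistic: distance between the number of heads and what we expect
    observed_stat = abs(observed_heads - expected)
    extreme = 0
    for _ in range(n_sims):
        simulated_stat = abs(simulate_heads(n_flips, rng) - expected)
        if simulated_stat >= observed_stat:
            extreme += 1
    return extreme / n_sims

# Hypothetical observation: 60 heads in 100 flips
p_value = empirical_p_value(60)
print(p_value)
```

Because the alternative here is "something other than 50%", results far from 50 in either direction count as extreme.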


**Confidence Intervals**

A confidence interval provides an interval of estimates for a population parameter. For example, let’s say we wanted to estimate the median annual household income in the United States. We collect a large random sample from the population. However, the median of our sample isn’t a good estimate by itself. Due to random chance, our sample could have come out differently, and then the sample median would have been different. Thus, we need to take many samples and provide an interval of estimates from the sample medians. However, we do not have the resources to physically take more samples, so we bootstrap (or “resample”) from our original sample.
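
The bootstrap idea can be sketched as follows. The income data are synthetic and purely hypothetical, and the percentile method used to pick the interval endpoints is one common choice, assumed here rather than taken from the reading.

```python
import random
import statistics

def bootstrap_median_interval(sample, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap interval for the median of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, each resample the same size as the original
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = medians[round(tail * n_resamples)]
    upper = medians[round((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical sample of 500 household incomes (synthetic, not real data)
data_rng = random.Random(1)
incomes = [data_rng.lognormvariate(11, 0.6) for _ in range(500)]

low, high = bootstrap_median_interval(incomes)
print(statistics.median(incomes), (low, high))
```

Each resample stands in for a fresh sample from the population, and the middle 95% of the resampled medians forms the interval.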

**Testing Hypotheses Using Confidence Intervals**

For certain types of hypothesis tests, it is easier and more informative to draw a conclusion about the hypotheses from a confidence interval. This applies when your hypotheses look like:

Null: “The population parameter is some number X.”

Alternative: “The population parameter is not that number X.”

In that case, you can construct a confidence interval to answer the question.

Example:

Null: The true slope of the regression line between X and Y is 0.

Alternative: The true slope of the regression line between X and Y is not 0.

Pro tip:

In order to use this method, the hypotheses must be in the format of “this parameter is this number X”. A hypothesis test about whether a coin was fair does not work for this format, because there is no number to construct a confidence interval around.
 
If a 95% confidence interval does not contain the number in question, that’s like a p-value being less than a 0.05 cutoff value.


Likewise, a 99% confidence interval that does not contain the number in question is like a p-value being less than a 0.01 cutoff value.

*Adapted from Data 8 
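
The correspondence between a 95% confidence interval and a 0.05 cutoff can be checked numerically. The sketch below uses a z-interval for a mean with a known standard deviation. The sample mean, sigma, sample size, and null values are illustrative assumptions, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(mean, sigma, n, confidence):
    """Confidence interval for a mean when sigma is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return mean - margin, mean + margin

def two_sided_p_value(mean, sigma, n, null_value):
    """Two-sided p-value for the null 'the population mean is null_value'."""
    z = (mean - null_value) / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative summary numbers (assumed, not computed from a data set)
sample_mean, sigma, n = 82.7, 2.5, 50

checks = []
low, high = z_interval(sample_mean, sigma, n, 0.95)
for null_value in (82.0, 83.0, 84.0):
    outside = not (low <= null_value <= high)
    significant = two_sided_p_value(sample_mean, sigma, n, null_value) < 0.05
    checks.append((null_value, outside, significant))
    print(null_value, outside, significant)
```

For every null value, "outside the 95% interval" and "p-value below 0.05" agree.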


**Reference**

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.norm.html#scipy.stats.norm

https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html


**Practice**

zyBooks Ch 5.1 - 5.9 Participation and Challenge Activities

**Learning Outcomes**

Upon successful completion of the material, students will be able to:
* Find the margin of error based on values above and below the point estimate.

* Use Python functions to calculate confidence intervals for population means and proportions.

* Use Python functions for hypothesis testing for population means and proportions.
* Analyze problems with the use of hypothesis testing.
* Use Python functions to calculate one-way ANOVA test to determine if a statistically significant difference exists amongst the means of 3 or more populations.



In [6]:
# The norm.interval() function is used to find a confidence interval for a normal distribution
import pandas as pd
import scipy.stats as st
import math

scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')
sigma = 2.5
mean = scores['Exam1'].mean()
stderr = sigma/math.sqrt(len(scores['Exam1']))
print(st.norm.interval(0.99, mean, stderr))

(81.78930681614078, 83.61069318385923)
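
The point estimate and margin of error can be recovered from the two endpoints printed above, since the interval is symmetric around the sample mean. A small sketch, with variable names chosen for illustration:

```python
# Endpoints of the 99% interval printed above
lower, upper = 81.78930681614078, 83.61069318385923

# The point estimate sits at the midpoint of the interval
point_estimate = (lower + upper) / 2

# The margin of error is the distance from the midpoint to either endpoint
margin_of_error = (upper - lower) / 2

print(point_estimate, margin_of_error)
```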


In [7]:
# The t.interval() function is used to find a confidence interval for a variable with a t-distribution.

import pandas as pd
import scipy.stats as st
scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')

# Let n be the number of students who took Exam 1.
n = scores[['Exam1']].count()

# Degrees of freedom is number of samples minus 1
df = n - 1

# Compute the mean of the Exam 1 scores
mean = scores[['Exam1']].mean()

# The standard error is standard deviation/sqrt(number of samples)
stdev = scores[['Exam1']].std()
stderr = stdev/(n ** 0.5)

print(st.t.interval(0.95, df, mean, stderr))

(array([80.05931209]), array([85.34068791]))
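
As a sketch of what `t.interval()` computes, the interval can also be built by hand as mean ± t* × standard error. The scores below are hypothetical, and a 1-D array is used so the results come back as plain floats rather than one-element arrays.

```python
import numpy as np
import scipy.stats as st

# Hypothetical exam scores (not the ExamScores data set)
exam_scores = np.array([78.0, 85.0, 91.0, 67.0, 88.0, 73.0, 95.0, 82.0, 79.0, 86.0])

n = exam_scores.size
df = n - 1
mean = exam_scores.mean()
# Sample standard deviation (ddof=1) divided by sqrt(n)
stderr = exam_scores.std(ddof=1) / np.sqrt(n)

# Critical value leaving 2.5% in each tail for a 95% interval
t_star = st.t.ppf(0.975, df)
manual = (mean - t_star * stderr, mean + t_star * stderr)

builtin = st.t.interval(0.95, df, loc=mean, scale=stderr)
print(manual, builtin)
```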


In [8]:
# The norm.interval() function can also be used to find the confidence interval for population proportion.

# In a survey of 1200 randomly selected registered voters, 
# 348 were in favor of banning public smoking. Find the 95% 
# confidence interval for the proportion of voters in favor 
# of banning public smoking.

import scipy.stats as st

# Let n be the number of voters surveyed.
n = 1200

# Let p be the proportion of voters that voted in favor
p = 348.0/1200.0

# The standard error is sqrt(p * (1-p)/n)
stderr = (p * (1 - p)/n) ** 0.5

print(st.norm.interval(0.95, p, stderr))


(0.26432646675431226, 0.3156735332456877)
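
The same interval can be reproduced with the formula p ± z* × sqrt(p(1 − p)/n). This sketch uses only the standard library. `NormalDist().inv_cdf` supplies the critical value z*, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 1200
p = 348 / n
confidence = 0.95

# Critical value leaving (1 - confidence)/2 in each tail
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

stderr = math.sqrt(p * (1 - p) / n)
margin = z_star * stderr
interval = (p - margin, p + margin)
print(interval)
```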


In [9]:
# A confidence interval can also be calculated from raw data. 

# In the Exam Scores data set, find a 99% confidence interval
# for the proportion of students who scored more than 90 on Exam 1.

import pandas as pd
import scipy.stats as st
scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')

# Let n be the number of students who took Exam 1.
n = scores[['Exam1']].count()

# Let x be the number of Exam 1 scores greater than 90
x = (scores[['Exam1']] > 90).values.sum()

# Let p be x/n, the proportion of all students that scored over 90 on Exam 1
# In Python 3, / is already true division, so multiplying by 1.0 is optional
p = x/n*1.0

# The standard error is sqrt(p * (1-p)/n)
stderr = (p * (1 - p)/n) ** 0.5

print(st.norm.interval(0.99, p, stderr))

(array([0.02645375]), array([0.29354625]))


In [11]:
# ztest() tests a mean based on the normal distribution, for one or two samples.
# In the case of two samples, the samples are assumed to be independent.

from statsmodels.stats.weightstats import ztest
import pandas as pd
scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')
print(ztest(x1 = scores['Exam1'],  value = 86))

(-2.5113146627890988, 0.012028242796839027)
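
The two numbers returned are the z statistic and the two-sided p-value. A sketch of the underlying arithmetic, assuming the default behaviour in which the standard deviation is estimated from the sample, is shown below. The scores and the helper name `one_sample_z_test` are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def one_sample_z_test(sample, null_mean):
    """Return (z statistic, two-sided p-value) for the null 'mean == null_mean'."""
    n = len(sample)
    # Standard error uses the sample standard deviation (n - 1 denominator)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - null_mean) / stderr
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical exam scores (not the ExamScores data set)
exam_scores = [80, 85, 90, 75, 70, 95, 85, 80]
z_stat, p_value = one_sample_z_test(exam_scores, 86)
print(z_stat, p_value)
```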


In [12]:
# The proportions_ztest(count, nobs, value, prop_var = value) function is used to perform a one-sample z-test for proportions.

# Test for proportions based on the normal (z) test
from statsmodels.stats.proportion import proportions_ztest
counts = 31
nobs = 50
value = 0.50
print(proportions_ztest(counts, nobs, value, prop_var = value))

(1.697056274847714, 0.08968602177036457)
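
The printed statistic and p-value can be reproduced by hand. Because `prop_var` is set to the null value, the standard error is based on 0.50 rather than on the sample proportion. A standard-library sketch, with illustrative variable names:

```python
import math
from statistics import NormalDist

count, nobs, null_proportion = 31, 50, 0.50

sample_proportion = count / nobs
# Standard error computed under the null proportion
stderr = math.sqrt(null_proportion * (1 - null_proportion) / nobs)

z_stat = (sample_proportion - null_proportion) / stderr
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
print(z_stat, p_value)
```

Since the p-value is about 0.09, the result would not be significant at a 0.05 cutoff.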


In [13]:
# A one-way ANOVA compares the means of three or more groups of one predictor variable.
# The f_oneway() function performs a one-way ANOVA

# The Exam Score dataset includes scores obtained in 4 exams in a class.
# Perform a hypothesis test to determine if the mean scores of the exams 
# are different. Use the 5% level of significance. 

import pandas as pd
import scipy.stats as st
scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')

# Scores for each exam
exam1_score = scores[['Exam1']]
exam2_score = scores[['Exam2']]
exam3_score = scores[['Exam3']]
exam4_score = scores[['Exam4']]

print(st.f_oneway(exam1_score, exam2_score, exam3_score, exam4_score))


F_onewayResult(statistic=array([3.85696089]), pvalue=array([0.01034867]))
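
The F statistic reported by `f_oneway()` is the ratio of between-group to within-group variability. The sketch below computes it by hand for three hypothetical groups and compares the result with SciPy. The group data and variable names are assumptions for illustration.

```python
import numpy as np
import scipy.stats as st

# Hypothetical scores for three groups (not the ExamScores data set)
groups = [
    np.array([85.0, 86.0, 88.0, 75.0, 78.0, 94.0, 98.0, 79.0, 71.0, 80.0]),
    np.array([91.0, 92.0, 93.0, 85.0, 87.0, 84.0, 82.0, 88.0, 95.0, 96.0]),
    np.array([79.0, 78.0, 88.0, 94.0, 92.0, 85.0, 83.0, 85.0, 82.0, 81.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)
n_total = all_values.size

# Variability of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

result = st.f_oneway(*groups)
print(f_manual, p_manual, result)
```

For the exam data above, the printed p-value of about 0.0103 is below 0.05, so the null hypothesis of equal means would be rejected at the 5% level of significance.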
