<a href="https://colab.research.google.com/github/s-brez/openintro-statistics/blob/master/Ch5_Inference_foundations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#INFERENCE FOR A PROPORTION
 As part of a quality control process for computer chips, an engineer randomly samples 232 chips at a factory during a week of production to test the current rate of chips with severe defects. She finds that 34 of the chips are defective.

**SE/variability = sqrt( (p_hat*(1-p_hat)) / n )**

A) What is the population under consideration in this study?

B) What parameter is being estimated?

C) What is the point estimate, p_hat, for the parameter?

D) Compute the standard error to measure the uncertainty in the point estimate.

In [0]:
import numpy as np
import math
from scipy.stats import norm

In [26]:
x, n = 34, 232
p_hat = x / n   # computed only after x and n have been assigned
se = math.sqrt( (p_hat*(1-p_hat)) / n )

print("A: All chips manufactured at the factory that week.")

print("\nB: She is trying to estimate the proportion of all chips made that week that have severe defects.")

print("\nC: Sample proportion, or point estimate p_hat = 34/232 =", round(p_hat, 4))

print("\nD: SE:", round(se, 4))

A: All chips manufactured at the factory that week.

B: She is trying to estimate the proportion of all chips made that week that have severe defects.

C: Sample proportion, or point estimate p_hat = 34/232 = 0.1466

D: SE: 0.0232


#SAMPLING DISTRIBUTION



18% of first year students made the Dean's list in the current year. As part of a class project, students randomly sample 40 students and check if those students made the list. They repeat this 1,000 times and build a distribution of sample proportions.

A) What is this distribution called?

B) Calculate the variability of this distribution.

In [25]:
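p, n, reps = 0.18, 40, 1000   # population proportion, sample size, number of repeated samples
se = math.sqrt( (p*(1-p)) / n )   # variability depends on the sample size (40), not on the 1,000 repetitions

print("A: Sampling distribution")

print("\nB:", round(se, 4))

A: Sampling distribution

B: 0.0607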
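
The class project can also be imitated by simulation. The sketch below is an illustration rather than part of the exercise: it uses only the standard library `random` module, an arbitrary seed, and a hypothetical helper `sample_proportion`. It draws 1,000 samples of 40 students from a population where 18% made the Dean's list and compares the spread of the simulated sample proportions with the formula sqrt( (p*(1-p)) / n ) evaluated at n = 40.

```python
import math
import random
import statistics

p, n, reps = 0.18, 40, 1000   # population proportion, sample size, number of samples
random.seed(0)                # arbitrary seed, used only so the run is repeatable

def sample_proportion(p, n):
    """Proportion of successes in one random sample of size n (hypothetical helper)."""
    return sum(random.random() < p for _ in range(n)) / n

# One sample proportion per repetition: this list is the simulated sampling distribution.
sample_props = [sample_proportion(p, n) for _ in range(reps)]

theoretical_se = math.sqrt(p * (1 - p) / n)
simulated_se = statistics.pstdev(sample_props)

print("Theoretical SE:", round(theoretical_se, 4))
print("Simulated SE:", round(simulated_se, 4))
```

The two values should be close but not identical, because the simulated figure is itself based on a finite number of samples.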


#CONFIDENCE INTERVAL FOR A PROPORTION



A website is trying to increase registration for first-time visitors, exposing 1% of these visitors to a new site design. Of 782 randomly sampled visitors over a month who saw the new design, 64 of them registered.

**CI = p_hat +/- zcrit x SE**

A) Check any conditions required for constructing a confidence interval.

B) Compute the standard error.

C) Construct and interpret a 90% confidence interval for the fraction of first-time visitors of the site who would register under the new design (assuming stable behaviors by new visitors over time).

D) This time, construct a 95% confidence interval. Is this interval wider or narrower? Why?

In [0]:
import math

In [96]:
n, k = 782, 64
p_hat = k / n                                # keep full precision; round only when printing
se = math.sqrt( (p_hat*(1-p_hat)) / n )
zcrit = norm.ppf(0.95)                       # a 90% CI leaves 5% in each tail
me = zcrit * se

print("A: We check for independence: sampling is random in this case, so independence is satisfied.\n",
      "We also need to check success-failure, which is satisfied with both counts above 10 (64 successes and 718 failures).\n",
      "Because these conditions are satisfied we can model p_hat using a normal distribution.")

print("\nB:\nSample proportion p_hat:", round(p_hat, 4), "SE:", round(se, 4))

print("\nC:\nUse z-critical value of", round(zcrit, 4), "for a 90% CI")
print("CI = p_hat +/- zcrit x SE",
      "\nCI =", round(p_hat, 4), "+/-", round(me, 4))
print("We are 90% confident that between", round((p_hat-me)*100, 2),
      "% and", round((p_hat+me)*100, 2), "% of first-time visitors would register under the new design.")

zcrit = norm.ppf(0.975)                      # a 95% CI leaves 2.5% in each tail
me = zcrit * se
print("\nD:\nUse z-critical value of", round(zcrit, 4), "for a 95% CI")
print("CI = p_hat +/- zcrit x SE",
      "\nCI =", round(p_hat, 4), "+/-", round(me, 4))
print("We are 95% confident that between", round((p_hat-me)*100, 2),
      "% and", round((p_hat+me)*100, 2), "% of first-time visitors would register under the new design.")

print("\nThe 95% interval is wider. To be more confident that the interval captures the true proportion"
      "\nof first-time visitors who register, we need a larger z-critical value and so a larger margin of error.")



A: We check for independence: sampling is random in this case, so independence is satisfied.
 We also need to check success-failure, which is satisfied with both counts above 10 (64 successes and 718 failures).
 Because these conditions are satisfied we can model p_hat using a normal distribution.

B:
Sample proportion p_hat: 0.0818 SE: 0.0098

C:
Use z-critical value of 1.6449 for a 90% CI
CI = p_hat +/- zcrit x SE 
CI = 0.0818 +/- 0.0161
We are 90% confident that between 6.57 % and 9.8 % of first-time visitors would register under the new design.

D:
Use z-critical value of 1.96 for a 95% CI
CI = p_hat +/- zcrit x SE 
CI = 0.0818 +/- 0.0192
We are 95% confident that between 6.26 % and 10.11 % of first-time visitors would register under the new design.

The 95% interval is wider. To be more confident that the interval captures the true proportion
of first-time visitors who register, we need a larger z-critical value and so a larger margin of error.
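
The relationship between confidence level and interval width can be seen by wrapping the formula CI = p_hat +/- zcrit x SE in a small function. This is a minimal sketch, not part of the original exercise: `proportion_ci` is a hypothetical helper, the 99% level is added purely for comparison, and the standard library's `statistics.NormalDist` is assumed here as a stand-in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist

def proportion_ci(k, n, level):
    """Normal-approximation confidence interval for a proportion (hypothetical helper)."""
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value: a 90% level leaves 5% in each tail, and so on.
    zcrit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    me = zcrit * se
    return p_hat - me, p_hat + me

# Same data as the website exercise: 64 registrations out of 782 visitors.
for level in (0.90, 0.95, 0.99):
    lower, upper = proportion_ci(64, 782, level)
    print(f"{level:.0%} CI: {lower:.4f} to {upper:.4f} (width {upper - lower:.4f})")
```

Each step up in confidence level raises the critical value while the standard error stays fixed, so the width can only grow.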


#HYPOTHESIS TESTING FOR A PROPORTION
A food safety inspector is called upon to investigate a restaurant. The food safety inspector uses a hypothesis testing framework to evaluate whether regulations are not being met. If he decides the restaurant is in gross violation, its license to serve food will be revoked.

A) Write the hypotheses in words.

B) What is a Type 1 Error in this context?

C) What is a Type 2 Error in this context?

In [101]:
print("A:\nNull hypothesis = The restaurant meets health and safety standards.")
print("Alt hypothesis = The restaurant does not meet health and safety standards.")

print("\nB:\nA Type 1 Error would be the inspector declaring the restaurant to be in violation when it is actually safe.")

print("\nC:\nA Type 2 Error would be the inspector failing to declare the restaurant in violation when it is actually unsafe.")

A:
Null hypothesis = The restaurant meets health and safety standards.
Alt hypothesis = The restaurant does not meet health and safety standards.

B:
A Type 1 Error would be the inspector declaring the restaurant to be in violation when it is actually safe.

C:
A Type 2 Error would be the inspector failing to declare the restaurant in violation when it is actually unsafe.
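
A Type 1 Error rate can be made concrete by simulation. The sketch below is not part of the restaurant exercise: it assumes a hypothetical two-sided proportion test with p0 = 0.5, n = 200 and alpha = 0.05, an arbitrary seed, and a hypothetical helper `two_sided_p_value`. Every sample is drawn from a population where the null hypothesis is true, so each rejection is a Type 1 Error.

```python
import math
import random
from statistics import NormalDist

random.seed(1)                               # arbitrary seed, for repeatability only
p0, n, alpha, reps = 0.5, 200, 0.05, 2000    # all four values are illustrative assumptions

def two_sided_p_value(k, n, p0):
    """p-value of a two-sided z-test for a proportion (hypothetical helper)."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (k / n - p0) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The null hypothesis is true in every simulated sample,
# so any rejection counted here is a Type 1 Error.
rejections = 0
for _ in range(reps):
    k = sum(random.random() < p0 for _ in range(n))
    if two_sided_p_value(k, n, p0) < alpha:
        rejections += 1

type1_rate = rejections / reps
print("Share of true null hypotheses rejected:", type1_rate)
```

The printed share should land roughly near alpha. It will not match exactly, because the count of successes is discrete and the number of simulated samples is finite.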


400 students were randomly sampled and 220 of them said they go to the gym at least once a week. Conduct a hypothesis test to check whether this represents a statistically significant difference from 50%:

A) Using a significance level of 1%, and

B) Using a significance level of 5%.

C) Explain how and why the conclusion changes when the significance level changes.

In [126]:
p0 = 0.5  # the null value
n, k = 400, 220
p_hat = k / n
se = math.sqrt( (p0*(1-p0)) / n )   # under the null hypothesis the SE uses the null value p0
z_score = round((p_hat - p0) / se, 4)
one_tail = round(1-norm.cdf(abs(z_score)), 4)
p_value = round(2*(1-norm.cdf(abs(z_score))), 3)

print("Null hypothesis: p = 0.5, Alt hypothesis: p != 0.5")
print("Sampling is random in this case, so the independence condition is satisfied.",
      "The success-failure condition is also satisfied, with", round(n*p0), "expected successes and",
      round(n*(1-p0)), "expected failures under the null hypothesis, both above 10.")

print("\nSE = sqrt( (p0*(1-p0)) / n ):", round(se, 4))
print("Z-Score = (p_hat - p0) / SE:", z_score)
print("One-tail area for this z-score:", one_tail)
print("p-value is twice the one-tail area:", p_value)

print("\nA: alpha = 1%: We do not reject the null hypothesis because the p-value is larger than alpha = 1%."
      "\nThe data do not provide convincing evidence that the proportion of students who go to the gym weekly differs from 50%.")

print("\nB: alpha = 5%: We reject the null hypothesis because the p-value is smaller than alpha = 5%,"
      "\nand conclude that the proportion of students who go to the gym weekly is different from 50%.")

print("\nC: The significance level is the Type 1 Error rate we are willing to accept. A 1% alpha is very conservative and demands",
      "\nstronger evidence before rejecting the null hypothesis; a 5% alpha rejects more easily. In this case, with",
      "\na p-value of", p_value, "we reject the null hypothesis at the 5% level but not at the 1% level.")


Null hypothesis: p = 0.5, Alt hypothesis: p != 0.5
Sampling is random in this case, so the independence condition is satisfied. The success-failure condition is also satisfied, with 200 expected successes and 200 expected failures under the null hypothesis, both above 10.

SE = sqrt( (p0*(1-p0)) / n ): 0.025
Z-Score = (p_hat - p0) / SE: 2.0
One-tail area for this z-score: 0.0228
p-value is twice the one-tail area: 0.046

A: alpha = 1%: We do not reject the null hypothesis because the p-value is larger than alpha = 1%.
The data do not provide convincing evidence that the proportion of students who go to the gym weekly differs from 50%.

B: alpha = 5%: We reject the null hypothesis because the p-value is smaller than alpha = 5%,
and conclude that the proportion of students who go to the gym weekly is different from 50%.

C: The significance level is the Type 1 Error rate we are willing to accept. A 1% alpha is very conservative and demands 
stronger evidence before rejecting the null hypothesis; a 5% alpha rejects more easily. In this case, with 
a p-value of 0.046 we reject the null hypothesis at the 5% level but not at the 1% level.
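
The steps of this test can be collected into one reusable function. This is a minimal sketch under stated assumptions: `proportion_ztest` is a hypothetical helper, the standard error is computed from the null value p0, and the standard library's `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist

def proportion_ztest(k, n, p0):
    """Two-sided one-proportion z-test (hypothetical helper for illustration)."""
    p_hat = k / n
    se = math.sqrt(p0 * (1 - p0) / n)              # SE under the null hypothesis
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails
    return z, p_value

# Gym data: 220 of 400 sampled students, tested against a null value of 50%.
z, p_value = proportion_ztest(220, 400, 0.5)
for alpha in (0.01, 0.05):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"alpha = {alpha:.0%}: z = {z:.2f}, p-value = {p_value:.4f} -> {decision}")
```

The data and the p-value are identical in both passes of the loop; only the threshold alpha changes, which is why the decision flips.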