# Introduction

These are some exercises and quizzes from the __"Become a Probability & Statistics Master"__ (https://www.udemy.com/course/statistics-probability/) course on Udemy.

The main topics of this notebook are:

- Sampling
- Hypothesis testing
- Regression

---

In [4]:
# Importing useful libraries

import pandas as pd
import math
from math import sqrt
import numpy as np
import statistics as st
import scipy
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# 06 Sampling
Link: https://drive.google.com/drive/folders/1Vu56xgUToj9XqvnMqDeNCxWJmDFGSgum

---
### 03 Sampling distribution of the sample mean

__1.Question:__ A hospital finds that the average birth weight of a newborn is 7.5 lbs with a standard deviation of 0.4 lbs. The hospital randomly selects 45 newborns to test this claim. What is the standard deviation of the sampling distribution?

__Explanation:__ 

- The standard deviation of the sampling distribution, also called the __standard error__, is given by:  $SE = \sigma/\sqrt{n}$

We can use this formula because the population standard deviation σ is known.

In [5]:
std = 0.4
n = 45

se = std/sqrt(n)

print("The standard deviation of the sampling distribution is",round(se,4))

The standard deviation of the sampling distribution is 0.0596
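
As a rough sanity check on the formula, the sketch below simulates many samples of 45 newborns and measures the spread of their sample means. It assumes, purely for illustration, that birth weights are normally distributed (the question does not say so); the function name, the number of simulated samples and the seed are arbitrary choices.

```python
import random
import statistics
from math import sqrt


def simulate_standard_error(mu, sigma, n, n_samples=5000, seed=0):
    """Return the standard deviation of n_samples simulated sample means."""
    rng = random.Random(seed)
    sample_means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_samples)
    ]
    return statistics.stdev(sample_means)


# Theoretical standard error from the formula SE = sigma / sqrt(n)
theoretical_se = 0.4 / sqrt(45)

# Empirical spread of the simulated sample means (normality is an assumption)
simulated_se = simulate_standard_error(mu=7.5, sigma=0.4, n=45)

print(round(theoretical_se, 4), round(simulated_se, 4))
```

The two numbers should be close, and the gap should shrink as `n_samples` grows.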


---
__2.Question:__ A company produces tires in a factory. Individual tires are filled to an approximate pressure of 36 PSI (pounds per square inch), with a standard deviation of 0.8 PSI. The pressure in the tires is normally distributed. The company randomly selects 125 tires to check their pressure. What is the probability that the mean amount of pressure in the tires is within 0.1 PSI of the population mean?

__Explanation:__ 

We know from the exercise that the pressure in the tires is normally distributed and that the sample is randomly selected. The sample size (n = 125) is also greater than 30, so the sample mean is normally distributed and we can use the z-statistic:

-  $Z = \frac{\hat{x}-\mu}{\sigma/\sqrt{n}}$

In [6]:
# We want to know the probability that the sample mean x̂ is within 0.1 PSI of the population mean.
diff = 0.1
std = 0.8
n = 125

z = diff/(std/sqrt(n))
z = round(z,5)
z

1.39754

In [7]:
# This means we want to know the probability P(−1.39754 < z < 1.39754).
# Using norm cdf function to find the area under the curve
p1 = stats.norm.cdf(z)
p2 = stats.norm.cdf(-z)

print(p1)
print(p2)

0.9188743765809492
0.08112562341905083


In [8]:
# The probability under the normal curve between these z-scores is:
p = p1-p2

print(round(p,3))

0.838
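
The three cells above can be folded into one small helper. This is only a sketch: the function name `prob_mean_within` is made up here, and it uses the standard library's `statistics.NormalDist` instead of `scipy.stats.norm` so that it runs without SciPy.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_within(diff, sigma, n):
    """P(|sample mean - population mean| < diff) when sigma is known."""
    se = sigma / sqrt(n)           # standard error of the sample mean
    z = diff / se                  # distance expressed in standard errors
    std_normal = NormalDist()      # standard normal: mean 0, sd 1
    return std_normal.cdf(z) - std_normal.cdf(-z)


p_tires = prob_mean_within(diff=0.1, sigma=0.8, n=125)
print(round(p_tires, 3))  # 0.838, matching the result above
```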


---
### 04 Sampling distribution of the sample proportion

__1.Question:__ A restaurant wants to know the percentage of their customers who order dessert. The restaurant has 1,500 customers in one week and finds by randomly surveying 100 customers that 35 of them order dessert. What is the standard deviation for the sample?

__Explanation:__ 

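The standard deviation of the sampling distribution of the sample proportion, estimating the unknown population proportion p with the sample proportion (35/100 = 0.35), is given by:

-  $σ_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$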

In [9]:
n = 100
p = 35/100

std = sqrt((p*(1-p))/n)

print("The standard deviation of the sampling distribution of the sample proportion is",round(std,6))

The standard deviation of the sampling distribution of the sample proportion is 0.047697
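
Before trusting the normal approximation behind this formula, it can help to check that the sample contains enough "successes" and "failures". The sketch below uses the threshold of 5 that appears later in this notebook (other texts use 10); the function name and the `min_count` parameter are illustrative assumptions.

```python
from math import sqrt


def proportion_standard_error(successes, n, min_count=5):
    """Standard error of a sample proportion, with a basic sample-size check."""
    failures = n - successes
    if successes < min_count or failures < min_count:
        raise ValueError("too few successes or failures for the normal approximation")
    p_hat = successes / n
    return sqrt(p_hat * (1 - p_hat) / n)


se_dessert = proportion_standard_error(successes=35, n=100)
print(round(se_dessert, 6))
```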


---
__2.Question:__ A group of scientists is studying 10,000 manatees and finds that 20 % are calves. You want to verify the claim, but can’t conduct a study of all 10,000, so you randomly sample 500. What’s the probability that your results are within 5 % of the first study?

In [10]:
# find the mean and standard error for the sample.
n = 500
p = 0.2

std = sqrt((p*(1-p))/n)
std

0.01788854381999832

In [11]:
# We want to know the probability that the sample proportion is within 0.05 of the population proportion p = 0.2
diff = 0.05

z = diff/std
z = round(z,5)
z

2.79508

In [12]:
# This means we want to know the probability P(−2.79508 < z < 2.79508).
# Using norm cdf function to find the area under the curve
p1 = stats.norm.cdf(z)
p2 = stats.norm.cdf(-z)

print(p1)
print(p2)

0.9974056563240676
0.0025943436759323472


In [13]:
# The probability under the normal curve between these z-scores is:
p = p1-p2

print("The probability that our sample proportion will fall within 5 % of the first study’s claim is",round(p,4))

The probability that our sample proportion will fall within 5 % of the first study’s claim is 0.9948


---
### 05 Confidence interval for a population mean

__1.Question:__ The height of students in your school is normally distributed with a standard deviation of σ = 4 inches. You take a sample of 50 of your
classmates and get a sample mean of x̂ = 66 inches. What is the confidence interval for a confidence level of 95 % ?

__Explanation:__ 

When the population standard deviation is known, the confidence interval is given by:

-  $(a,b) = \hat{x} ± Z * σ / {\sqrt n}$

In [14]:
## data:

# height of students, sample mean x̂ and standard deviation σ, n =50
x_hat = 66
std = 4
n = 50

We are looking for the confidence interval at a 95 % confidence level. From the Z table, we know that Z = 1.96.

In [15]:
# Now we calculate the confidence interval using the given data

z = 1.96
a = x_hat - z*(std/sqrt(n))
b = x_hat + z*(std/sqrt(n))

In [16]:
print("The confidence interval for a confidence level of 95 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 95 % is: 64.89 , 67.11
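
Instead of reading Z from a table, the critical value can be computed from the confidence level. The helper below is a sketch with a made-up name, using `statistics.NormalDist.inv_cdf` from the standard library.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a mean when the population sd is known."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    margin = z_crit * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = z_confidence_interval(x_bar=66, sigma=4, n=50, confidence=0.95)
print(round(low, 2), round(high, 2))
```

For 95 % the computed critical value is about 1.96, so the interval agrees with the table-based result above.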


---
__2.Question:__ You want to know the mean number of daylight hours in a day in your city (the time between sunrise and sunset) over the course of a year.
You take a random sample of 30 days throughout the year and get a sample mean of 𝑥̂ = 13.15 hours and a sample standard deviation of s = 0.85 hours. What is the confidence interval for a confidence level of 90 % ?

__Explanation:__ 

When the population standard deviation is unknown, we use the sample standard deviation S and the t-distribution, so the confidence interval is given by:

-  $(a,b) = \hat{x} ± T * S / {\sqrt n}$

In [17]:
## data:

# daylight hours in a day, sample mean x̂ and sample standard deviation s, n = 30
x_hat = 13.15
s = 0.85
n = 30

We are looking for the confidence interval at a 90 % confidence level. From the T table, with n - 1 (30 - 1 = 29) degrees of freedom, we know that T = 1.699.

In [18]:
# Now we calculate the confidence interval using the given data

t = 1.699
a = x_hat - t*(s/sqrt(n))
b = x_hat + t*(s/sqrt(n))

In [19]:
print("The confidence interval for a confidence level of 90 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 90 % is: 12.89 , 13.41
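
The table lookup can also be replaced by `scipy.stats.t.ppf`, which returns the t-value for a given cumulative probability and degrees of freedom. The wrapper function below is a hypothetical helper, not part of the course material.

```python
from math import sqrt
from scipy import stats


def t_confidence_interval(x_bar, s, n, confidence):
    """Confidence interval for a mean when only the sample sd is known."""
    alpha = 1 - confidence
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-tailed critical value
    margin = t_crit * s / sqrt(n)
    return t_crit, (x_bar - margin, x_bar + margin)


t_crit, (low, high) = t_confidence_interval(x_bar=13.15, s=0.85, n=30, confidence=0.90)
print(round(t_crit, 3), round(low, 2), round(high, 2))
```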


---
### 06 Confidence interval for a population proportion

__1.Question:__ A study shows that 78 % of patients who try a new medication for migraines feel better within 30 minutes of taking the medicine. If the
study involved 120 patients, construct a 95 % confidence interval for the proportion of patients who feel better within 30 minutes of taking the
medicine.

__Explanation:__ 

For a population proportion, when the sample is large enough for the normal approximation to hold, the confidence interval is built from the sample proportion p̂:

-  $(a,b) = \hat{p} ± Z * \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

In [20]:
## data:

p = 0.78
n = 120

We are looking for the confidence interval at a 95 % confidence level. From the Z table, we know that Z = 1.96.

In [21]:
# Now we calculate the confidence interval using the given data

z = 1.96
a = p - z*(sqrt(p*(1-p)/n))
b = p + z*(sqrt(p*(1-p)/n))

In [22]:
print("The confidence interval for a confidence level of 95 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 95 % is: 0.71 , 0.85


---
__2.Question:__ A study shows that 243 of 500 randomly selected households were using a family member to care for their children who were under
preschool age. Build a 90 % confidence interval for the proportion of households using a family member to care for children under preschool age.

In [23]:
## data:

p = 243/500
n = 500

We are looking for the confidence interval at a 90 % confidence level. From the Z table, we know that Z = 1.65 (1.645 to three decimals).

In [24]:
# Now we calculate the confidence interval using the given data

z = 1.65
a = p - z*(sqrt(p*(1-p)/n))
b = p + z*(sqrt(p*(1-p)/n))

In [25]:
print("The confidence interval for a confidence level of 90 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 90 % is: 0.45 , 0.52
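
Both proportion intervals in this section follow the same recipe, so they can be wrapped in one function. This is a minimal sketch with an illustrative name; it computes the critical value with `statistics.NormalDist` rather than rounding it from a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(successes=243, n=500, confidence=0.90)
print(round(low, 2), round(high, 2))
```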


# 07 Hypothesis testing
Link: https://drive.google.com/drive/folders/1k23HKSl6N-4tVy5ID1GVJ6jDAp1lThBI

---
### 03 Test statistics for one- and two-tailed tests

__1.Question:__ You’ve decided to give all of your friends a small box of homemade cookies. You’ve already baked a variety of cookies and randomly placed them into the boxes for each of your friends. You want to make sure that each box is close to 0.5 pounds. So you take a sample of 10 boxes and find a sample mean of 0.54 pounds, and a sample standard deviation of s = 0.3 pounds. Assuming that the weights of all the boxes are normally distributed, calculate the test statistic.

__Explanation:__ 

When the population standard deviation is unknown, but we know that the underlying data are normally distributed, the t-test statistic is given by:

-  $t = \frac{X̂ - μ_0}{\frac{s}{\sqrt n}} $

In [26]:
## data:

mu = 0.5
x = 0.54
s = 0.3
n = 10

In [27]:
# Now we calculate the t-test statistic, using the given data

t = (x - mu)/(s/sqrt(n))

In [28]:
print("The t-test statistic is:",round(t,2))

The t-test statistic is: 0.42
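
The question only asks for the test statistic, but a p-value is a natural next step. The sketch below assumes a two-tailed test (the boxes should be "close to" 0.5 pounds, so deviations in both directions matter); that choice and the function name are assumptions, not part of the original exercise.

```python
from math import sqrt
from scipy import stats


def one_sample_t_test(x_bar, mu_0, s, n):
    """Return the t statistic and its two-tailed p-value with n - 1 df."""
    t_stat = (x_bar - mu_0) / (s / sqrt(n))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # sf(t) = 1 - cdf(t)
    return t_stat, p_value


t_stat, p_value = one_sample_t_test(x_bar=0.54, mu_0=0.5, s=0.3, n=10)
print(round(t_stat, 2), round(p_value, 2))
```

A p-value this large would give no evidence that the mean box weight differs from 0.5 pounds.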


### 05 Hypothesis testing for the population proportion

__1.Question:__ We want to test the hypothesis that 10 % of people are left-handed, so we collect a random sample of 500 people and find that 43 of them are left-handed. What can you conclude at a significance level of α = 0.10?

__Explanation:__ 

The standard error of the proportion is given by:

-  $σ_{\hat{p}} = \sqrt{\frac{p_0(1-p_0)}{n}}$

The test statistic is given by:

-  $z = \frac{\hat{p}-p_0}{σ_{\hat{p}}}$

First build the hypothesis statements.

- H0: 10 % of people are left-handed, p = 0.1

- Ha: The proportion of left-handed people is different than 10 % , p ≠ 0.1

In [29]:
## data:

p_hat = 43/500
p_zero = 0.10
n = 500
alpha = 0.10

In [30]:
# calculate the standard error:

se = sqrt((p_zero*(1-p_zero))/n)
se

0.01341640786499874

In [31]:
# calculate the test statistics:

z = (p_hat-p_zero)/se
round(z,3)

-1.043

The critical z-values for 90 % confidence with a two-tail test are z = ± 1.65. 
Since the test statistic we found is negative (z = − 1.043), we’ll compare it to z = − 1.65.
Our z-value of z = − 1.043 is not less than z = − 1.65, and therefore falls in the region of acceptance, which means we’ll fail to reject the null hypothesis and fail to conclude that the proportion of left-handed people is different than 10 % .
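
The critical-value comparison described above can be written as a short function. This is a sketch under the same assumptions as the text (two-tailed test, normal approximation); the function name is hypothetical, and the critical value is computed exactly (about 1.645) rather than taken from the rounded table value.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_proportion_test(successes, n, p_0, alpha):
    """Return (z, critical z, reject H0?) for H0: p = p_0 vs Ha: p != p_0."""
    p_hat = successes / n
    se = sqrt(p_0 * (1 - p_0) / n)            # standard error under H0
    z = (p_hat - p_0) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit


z, z_crit, reject = two_tailed_proportion_test(successes=43, n=500, p_0=0.10, alpha=0.10)
print(round(z, 3), round(z_crit, 3), reject)
```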

---
__2.Question:__ We want to test the hypothesis that fewer than 80 % of Americans eat breakfast, so we collect a random sample of 650 Americans and find that 496 of them eat breakfast. What can you conclude at a significance level of α = 0.05?

First build the hypothesis statements.

- H0: 80% or more of Americans eat breakfast , p >= 0.8

- Ha: fewer than 80% of Americans eat breakfast, p < 0.8

It is a lower tailed test

In [32]:
## data:

p_hat = 496/650
p_zero = 0.8
n = 650
alpha = 0.05

In [33]:
# calculate the standard error:

se = sqrt((p_zero*(1-p_zero))/n)
se

0.01568929081105472

In [34]:
# calculate the test statistics:

z = (p_hat-p_zero)/se
round(z,3)

-2.353

From the z table, the area to the left of z = −2.35 (our z-score rounded to two decimals) is 0.00939, which is lower than alpha (0.05), so we reject the null hypothesis and conclude that fewer than 80% of Americans eat breakfast.

---
__3.Question:__ We want to test the hypothesis that more than 25 % of NBA players (professional basketball players) started playing basketball before
age 5, so we collect a random sample of 117 NBA players and find that 34 of them started playing before they turned 5. What can you conclude at a significance level of α = 0.01?

First build the hypothesis statements.

- H0: 25% or less of NBA players started playing basketball before age 5, p <= 0.25

- Ha: more than 25% of NBA players started playing basketball before age 5, p > 0.25

It is an upper-tailed test

In [35]:
## data:

p_hat = 34/117
p_zero = 0.25
n = 117
alpha = 0.01

In [36]:
# calculate the standard error:

se = sqrt((p_zero*(1-p_zero))/n)
se

0.04003203845127178

In [37]:
# calculate the test statistics:

z = (p_hat-p_zero)/se
round(z,3)

1.014

From the z table, the area to the left of z = 1.01 (our z-score rounded to two decimals) is 0.84375, so the area to the right is 1 − 0.84375 = 0.15625, which is greater than alpha (0.01). We therefore fail to reject the null hypothesis at the 0.01 significance level.
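
The p-value approach used in the last two questions can be sketched as a single function that handles lower-tailed, upper-tailed and two-tailed alternatives. The function name and the `alternative` labels are choices made here for illustration. Because it uses the unrounded z-score, the p-values differ slightly from the table-based ones.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_p_value(successes, n, p_0, alternative):
    """Return (z, p-value); alternative is 'less', 'greater' or 'two-sided'."""
    p_hat = successes / n
    z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
    cdf = NormalDist().cdf
    if alternative == "less":
        p_value = cdf(z)                   # area to the left of z
    elif alternative == "greater":
        p_value = 1 - cdf(z)               # area to the right of z
    elif alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z)))    # both tails
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value


z_breakfast, p_breakfast = proportion_z_p_value(496, 650, 0.80, "less")
z_nba, p_nba = proportion_z_p_value(34, 117, 0.25, "greater")
print(round(p_breakfast, 4), round(p_nba, 4))
```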

---
### 06 Confidence interval for the difference of means

__1.Question:__ A professor is interested in whether exam scores differ between two nearby colleges. He selects a simple random sample of 20 students each from both colleges and finds a mean test score of 350 with a standard deviation of 15 at the first college, and a mean test score of 390 with a standard deviation of 30 at the second college. Assuming exam scores are normally distributed at both colleges, find a 95 % confidence interval around the difference in exam scores.

__Explanation:__ 

When the population standard deviations are unknown and the sample sizes are smaller than 30, the confidence interval for the difference of two means, with independent samples, is given by:

-  $(a,b) = \hat{x} -\hat{y} ± t_{\alpha/2} * \sqrt{\frac{S_x^2}{n_x}+\frac{S_y^2}{n_y}}$

In [47]:
## data:

# second college (used as x so that the difference x - y is positive)
x_hat = 390
x_std = 30
nx = 20
# first college
y_hat = 350
y_std = 15
ny = 20

In [54]:
# Calculate the degrees of freedom. Because the sample variances are not pooled, we use the Welch-Satterthwaite formula

df = (((x_std**2/nx)+(y_std**2/ny))**2)/(((1/(nx-1))*(x_std**2/nx)**2)+((1/(ny-1))*(y_std**2/ny)**2))
df

27.94117647058824

In [49]:
# calculate the difference of means and the standard error

diff = x_hat-y_hat 
se = sqrt((x_std**2/nx)+(y_std**2/ny))
print(diff)
print(se)

40
7.5


Rounding down to the nearest whole number in order to keep the estimate conservative, we find 27 degrees of freedom. Together with a 95 % confidence level, the t-table gives tα/2 = 2.052.

In [52]:
# Now we calculate the confidence interval using the given data

t = 2.052
a = diff - t*se
b = diff + t*se

In [53]:
print("The confidence interval for a confidence level of 95 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 95 % is: 24.61 , 55.39
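
The whole unpooled procedure (degrees of freedom, critical value, interval) can be collected into one function. This is a sketch rather than a reference implementation: the function name is made up, and rounding the degrees of freedom down with `floor` mirrors the conservative choice described above.

```python
from math import floor, sqrt
from scipy import stats


def welch_confidence_interval(mean_x, s_x, n_x, mean_y, s_y, n_y, confidence):
    """Unpooled (Welch) confidence interval for the difference of two means."""
    var_x = s_x**2 / n_x
    var_y = s_y**2 / n_y
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (var_x + var_y) ** 2 / (var_x**2 / (n_x - 1) + var_y**2 / (n_y - 1))
    # Round df down to stay conservative, as in the text
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=floor(df))
    margin = t_crit * sqrt(var_x + var_y)
    diff = mean_x - mean_y
    return df, (diff - margin, diff + margin)


df, (low, high) = welch_confidence_interval(390, 30, 20, 350, 15, 20, 0.95)
print(round(df, 2), round(low, 2), round(high, 2))
```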


---
__2.Question:__ Two college directors want to determine whether there’s a difference in the amount that their students spend annually on textbooks.
They sample 200 students from college A and 230 from college B and find mean spends of x̄A = 1,258 and x̄B = 1,150. Assuming both populations are
normally distributed with σA = 52 and σB = 64, find a 99 % confidence interval around the difference in annual textbook spend.

__Explanation:__ 

When the population standard deviations are known, the confidence interval for the difference of two means, with independent samples, is given by:

-  $(a,b) = \hat{x} -\hat{y} ± Z * \sqrt{\frac{\sigma_x^2}{n_x}+\frac{\sigma_y^2}{n_y}}$

In [67]:
## data:

# 1st group of students, A
x_hat = 1258
x_std = 52
nx = 200
# 2nd group of students, B
y_hat = 1150
y_std = 64
ny = 230

In [68]:
# calculate the difference of means and the standard error

diff = x_hat-y_hat 
se = sqrt((x_std**2/nx)+(y_std**2/ny))
print(diff)
print(se)

108
5.597204271078009


We are looking for the confidence interval at a 99 % confidence level. From the Z table, we know that Z = 2.58.

In [69]:
# Now we calculate the confidence interval using the given data

z = 2.58
a = diff - z*se
b = diff + z*se

In [70]:
print("The confidence interval for a confidence level of 99 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 99 % is: 93.56 , 122.44
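
For comparison, the same interval can be computed with an unrounded critical value. In the sketch below (the function name is illustrative) `statistics.NormalDist.inv_cdf` returns about 2.576 for a 99 % level, so the interval comes out slightly narrower than with the table value 2.58.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_interval(mean_x, sigma_x, n_x, mean_y, sigma_y, n_y, confidence):
    """Confidence interval for a difference of means with known population sds."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(sigma_x**2 / n_x + sigma_y**2 / n_y)
    diff = mean_x - mean_y
    return diff - z_crit * se, diff + z_crit * se


low, high = two_sample_z_interval(1258, 52, 200, 1150, 64, 230, 0.99)
print(round(low, 2), round(high, 2))
```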


---
__3.Question:__ The owners of two restaurants on the same street are interested in whether or not their daily earnings differ. They take simple random
samples of earnings over 15 days, and find mean daily earnings of 1,365 with a standard deviation of 48 for the first restaurant, and mean daily
earnings of 1,230 with a standard deviation of 28 for the second restaurant. Assuming daily earnings at both restaurants follow a normal distribution, find a 90 % confidence interval around the difference in daily earnings.

__Explanation:__ 

When the population standard deviations are unknown and the sample sizes are smaller than 30, the confidence interval for the difference of two means, with independent samples, is given by:

-  $(a,b) = \hat{x} -\hat{y} ± t_{\alpha/2} * \sqrt{\frac{S_x^2}{n_x}+\frac{S_y^2}{n_y}}$

In [72]:
## data:

# 1st restaurant 
x_hat = 1365
x_std = 48
nx = 15
# 2nd restaurant
y_hat = 1230
y_std = 28
ny = 15

In [73]:
# Calculate the degrees of freedom. Because the sample variances are not pooled, we use the Welch-Satterthwaite formula

df = (((x_std**2/nx)+(y_std**2/ny))**2)/(((1/(nx-1))*(x_std**2/nx)**2)+((1/(ny-1))*(y_std**2/ny)**2))
df

22.539050006483127

In [74]:
# calculate the difference of means and the standard error

diff = x_hat-y_hat 
se = sqrt((x_std**2/nx)+(y_std**2/ny))
print(diff)
print(se)

135
14.348054455802245


Rounding down to the nearest whole number in order to keep the estimate conservative, we find 22 degrees of freedom. Together with a 90 % confidence level, the t-table gives tα/2 = 1.717.

In [76]:
# Now we calculate the confidence interval using the given data

t = 1.717
a = diff - t*se
b = diff + t*se

In [77]:
print("The confidence interval for a confidence level of 90 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 90 % is: 110.36 , 159.64
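
A confidence interval that excludes 0 corresponds to a significant two-tailed test at the matching level. As a cross-check, SciPy's `ttest_ind_from_stats` runs the unpooled (Welch) test directly from summary statistics. Note that it uses the unrounded degrees of freedom, so it is not an exact mirror of the table-based calculation above.

```python
from scipy import stats

# Welch's t-test from summary statistics (equal_var=False: variances not pooled)
result = stats.ttest_ind_from_stats(
    mean1=1365, std1=48, nobs1=15,   # first restaurant
    mean2=1230, std2=28, nobs2=15,   # second restaurant
    equal_var=False,
)
t_stat, p_value = result.statistic, result.pvalue

print(round(t_stat, 2), p_value < 0.10)
```

The test statistic equals the difference of means divided by the standard error computed earlier (135 / 14.348).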


---
### 07 Confidence interval for the difference of proportions

__1.Question:__ A video game developer wants to know how the number of male video game players compares to the number of female players. They randomly select 500 males and 500 females and find that 368 of the males play video games, while 230 of the females play video games. Find a 95 % confidence interval around the difference between the number of male and female players.

__Explanation:__ 

When both samples are large enough for the normal approximation to hold, the confidence interval for the difference of two proportions is given by:

-  $(a,b) = (\hat{p}_1 - \hat{p}_2) ± z_{\alpha/2} * \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

In [85]:
## data:

# male video players
m = 368
nx = 500
px = m/nx

# female video players
f =230
ny = 500
py = f/ny

print(px)
print(py)

0.736
0.46


In [91]:
# calculate the difference of proportions and the standard error

diff = px - py 
se = sqrt((px*(1-px)/nx)+(py*(1-py)/ny))
print(diff)
print(se)

0.27599999999999997
0.029755806156110107


We are looking for the confidence interval at a 95 % confidence level. From the Z table, we know that Z = 1.96.

In [92]:
# Now we calculate the confidence interval using the given data

z = 1.96
a = diff - z*se
b = diff + z*se

In [94]:
print("The confidence interval for a confidence level of 95 % is:",round(a,3),",",round(b,3))

The confidence interval for a confidence level of 95 % is: 0.218 , 0.334
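
The same steps apply to every question in this section, so a small helper avoids repeating the cells. This is a sketch with an illustrative function name; it takes the raw counts and computes the critical value instead of reading it from the table.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_interval(x1, n1, x2, n2, confidence):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se


low, high = diff_proportions_interval(x1=368, n1=500, x2=230, n2=500, confidence=0.95)
print(round(low, 3), round(high, 3))
```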


---
__2.Question:__ A college director wonders about the difference between the number of male and female students who scored higher than 90 on a recent final exam. He randomly selects 25 males and 25 females and finds that 15 males and 12 females scored more than 90. Find a 99 % confidence interval around the true difference of male students and female students who scored higher than 90 on the recent exam.

In [104]:
## data:

# male students
m = 15
nx = 25
px = m/nx

# female students
f =12
ny = 25
py = f/ny

print(px)
print(py)

0.6
0.48


In [105]:
# calculate the difference of proportions and the standard error

diff = px - py 
se = sqrt((px*(1-px)/nx)+(py*(1-py)/ny))
print(diff)
print(se)

0.12
0.13994284547628721


We are looking for the confidence interval at a 99 % confidence level. From the Z table, we know that Z = 2.58.

In [106]:
# Now we calculate the confidence interval using the given data

z = 2.58
a = diff - z*se
b = diff + z*se

In [115]:
print("The confidence interval for a confidence level of 99 % is:",round(a,3),",",round(b,3))
print("Because the confidence interval includes 0, we can’t conclude that there’s a significant difference between the number of male and female students who scored higher than 90.")

The confidence interval for a confidence level of 99 % is: -0.241 , 0.481
Because the confidence interval includes 0, we can’t conclude that there’s a significant difference between the number of male and female students who scored higher than 90.


---
__3.Question:__ Directors at colleges A and B are interested whether there’s a difference in the number of students who work while attending their colleges. They take a random sample of 100 students from each college and find that 37 students at college A and 40 students at college B are currently working. Find a 90 % confidence interval around the difference of population proportions.

In [116]:
## data:

# college A
a = 37
nx = 100
px = a/nx

# college B
b = 40
ny = 100
py = b/ny

print(px)
print(py)

0.37
0.4


In [118]:
# calculate the difference of proportions and the standard error

diff = px - py 
se = sqrt((px*(1-px)/nx)+(py*(1-py)/ny))
print(diff)
print(se)

-0.030000000000000027
0.06878226515607057


We are looking for the confidence interval at a 90 % confidence level. From the Z table, we know that Z = 1.65.

In [119]:
# Now we calculate the confidence interval using the given data

z = 1.65
a = diff - z*se
b = diff + z*se

In [122]:
print("The confidence interval for a confidence level of 90 % is:",round(a,2),",",round(b,2))
print("Because the confidence interval includes 0, we can’t conclude that there’s a significant difference.")

The confidence interval for a confidence level of 90 % is: -0.14 , 0.08
Because the confidence interval includes 0, we can’t conclude that there’s a significant difference.


---
### 08 Hypothesis testing for the difference of proportions

__1.Question:__ Assuming p1 − p2 = 0, find the value of the z-test statistic, given p̂1 = 0.295 for n1 = 130, and p̂2 = 0.226 for n2 = 110.

__Explanation:__ 

As long as we take independent random samples from each population, and n1p̂1 ≥ 5, n1(1 − p̂1) ≥ 5, n2p̂2 ≥ 5, and n2(1 − p̂2) ≥ 5, then the test statistic formula we’ll use is:

-  $Z = \frac{(p̂_1- p̂_2) - (p_1-p_2)}{\sqrt{{p̂(1-p̂)}(\frac{1}{n_1}+\frac{1}{n_2})}}$

If the null hypothesis states a zero difference between population proportions, such that p1 − p2 = 0, we can simplify the formula:

-  $Z = \frac{(p̂_1- p̂_2)}{\sqrt{{p̂(1-p̂)}(\frac{1}{n_1}+\frac{1}{n_2})}}$

where p̂1 and p̂2 are the sample proportions, p1 and p2 are the population proportions, n1 and n2 are the sample sizes, and p̂ is the proportion of the combined sample:

-  $p̂ = \frac{x_1 + x_2}{n_1 + n_2}$

In [130]:
## data:

# group 1
p1 = 0.295
n1 = 130

# group 2
p2 = 0.226
n2 = 110

p_pool = (p1*n1 + p2*n2)/(n1 + n2)
se_pool = sqrt(p_pool*(1-p_pool)*(1/n1 + 1/n2))

d = p1 - p2
d

0.06899999999999998

In [133]:
# calculate the test statistics
# The null hypothesis states a zero difference between population proportions, such that p1 − p2 = 0:

z = d/se_pool
round(z,2)

1.21

---
__2.Question:__ A scientist wants to test how fast two flu drugs help patients recover from flu. He randomly assigns 100 patients each to two groups, and gives group 1 the first drug and group 2 the second drug. In the first group, 57 patients recovered from flu in 5 days, while 49 patients in the second group recovered from flu in 5 days. Using a critical value approach at a 99 % confidence level, can the scientist conclude that either drug is more effective than the other?

First build the hypothesis statements.

- H0: p1 - p2 = 0

- Ha: p1 - p2 ≠ 0

It is a two tailed test

In [134]:
## data:

# group 1
x1 = 57
n1 = 100
p1 = x1/n1

# group 2
x2 = 49
n2 = 100
p2 = x2/n2

p_pool = (p1*n1 + p2*n2)/(n1 + n2)
se_pool = sqrt(p_pool*(1-p_pool)*(1/n1 + 1/n2))

d = p1 - p2
d

0.07999999999999996

In [142]:
# calculate the test statistics:
# The null hypothesis states a zero difference between population proportions, such that p1 − p2 = 0

z = d/se_pool
round(z,2)

1.13

For a two-tailed test with α = 0.01, the critical values of z are ± 2.58. Since 1.13 < 2.58, the test statistic falls into the region of acceptance, so we fail to reject the null hypothesis, and we can’t conclude that either of the drugs is more effective than the other.
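
The pooled test statistic and the critical-value decision can be reproduced in a few lines. The helper name below is hypothetical, and the critical value is computed (about 2.576) rather than taken from the rounded table entry.

```python
from math import sqrt
from statistics import NormalDist


def pooled_two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 = 0 using the pooled sample proportion."""
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se_pool


alpha = 0.01
z = pooled_two_proportion_z(x1=57, n1=100, x2=49, n2=100)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
reject = abs(z) > z_crit

print(round(z, 2), round(z_crit, 3), reject)
```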

---
__3.Question:__ 50 randomly chosen well-prepared students, and 50 randomly chosen poorly-prepared students, all took a math test. The response to a specific question was examined by the professor, who was interested whether the proportion of well-prepared students who answered the question correctly was at least 18 % higher than the proportion of poorly-prepared students who answered the question correctly. The professor found that 39 of the well-prepared and 35 of the poorly-prepared students answered the question correctly. What can he conclude at a 95 % confidence level?

First build the hypothesis statements.

- H0: p1 - p2 <= 0.18

- Ha: p1 - p2 > 0.18

It is an upper-tailed test

In [139]:
## data:

# hypothesized proportion p
p = 0.18

# group 1
x1 = 39
n1 = 50
p1 = x1/n1

# group 2
x2 = 35
n2 = 50
p2 = x2/n2

p_pool = (p1*n1 + p2*n2)/(n1 + n2)
se_pool = sqrt(p_pool*(1-p_pool)*(1/n1 + 1/n2))

d = p1 - p2
d

0.08000000000000007

In [143]:
# calculate the test statistics:
# The null hypothesis states 0.18 difference between population proportions, such that p = 0.18

z = (d-p)/se_pool
round(z,2)

-1.14

For an upper-tailed test with α = 0.05, the critical value of z is 1.65. Since −1.14 < 1.65, the test statistic does not fall in the rejection region, so we fail to reject the null hypothesis and therefore can’t conclude that there’s an at least 18 % difference between the well-prepared and poorly-prepared students.

---
Sandbox:

__Explanation:__ 

We can use the pooled variance if our two samples were taken from the same population, and/or if neither sample variance is more than twice the other:

-  $(a,b) = \hat{x} -\hat{y} ± t_{\alpha/2} * s_p \sqrt{\frac{1}{n_x}+\frac{1}{n_y}}$

with $df = n_x + n_y - 2$
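
The formula above does not spell out $s_p$. The sketch below uses the usual pooled standard deviation, $s_p = \sqrt{\frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}}$. The input numbers are hypothetical and chosen so that neither sample variance is more than twice the other; the function name is also an assumption.

```python
from math import sqrt
from scipy import stats


def pooled_t_interval(mean_x, s_x, n_x, mean_y, s_y, n_y, confidence):
    """Pooled-variance confidence interval for the difference of two means."""
    df = n_x + n_y - 2
    # Pooled standard deviation: weighted average of the two sample variances
    s_p = sqrt(((n_x - 1) * s_x**2 + (n_y - 1) * s_y**2) / df)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=df)
    margin = t_crit * s_p * sqrt(1 / n_x + 1 / n_y)
    diff = mean_x - mean_y
    return s_p, df, (diff - margin, diff + margin)


# Hypothetical data: variances 100 and 144, so their ratio is below 2
s_p, df, (low, high) = pooled_t_interval(50, 10, 15, 45, 12, 15, 0.95)
print(round(s_p, 3), df, round(low, 2), round(high, 2))
```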