# Introduction

These are some exercises and quizzes from the __"Become a Probability & Statistics Master"__ (https://www.udemy.com/course/statistics-probability/) course on Udemy.

The main topics of this notebook are:

- Sampling
- Hypothesis testing
- Regression

---

In [45]:
# Importing useful libraries

import pandas as pd
import math
from math import sqrt
import numpy as np
import statistics as st
import scipy
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# 06 Sampling
Link: https://drive.google.com/drive/folders/1Vu56xgUToj9XqvnMqDeNCxWJmDFGSgum

---
### 03 Sampling distribution of the sample mean

__1.Question:__ A hospital finds that the average birth weight of a newborn is 7.5 lbs with a standard deviation of 0.4 lbs. The hospital randomly selects 45 newborns to test this claim. What is the standard deviation of the sampling distribution?

__Explanation:__ 

- The standard deviation of the sampling distribution, also called the __standard error__, is given by:  $SE = σ/\sqrt{n}$

We can use this formula because the population standard deviation σ is known.

In [46]:
std = 0.4
n = 45

se = std/sqrt(n)

print("The standard deviation of the sampling distribution is",round(se,4))

The standard deviation of the sampling distribution is 0.0596


---
__2.Question:__ A company produces tires in a factory. Individual tires are filled to an approximate pressure of 36 PSI (pounds per square inch), with a standard deviation of 0.8 PSI. The pressure in the tires is normally distributed. The company randomly selects 125 tires to check their pressure. What is the probability that the mean amount of pressure in the tires is within 0.1 PSI of the population mean?

__Explanation:__ 

We know from the exercise that the pressure in the tires is normally distributed and that the sample is randomly selected with a size greater than 30, so we can use the Z test statistic:

-  $Z = \frac{x̂-μ}{σ/\sqrt{n}}$

In [47]:
# We want to know the probability that the sample mean x̂ is within 0.1 PSI of the population mean.
diff = 0.1
std = 0.8
n = 125

z = diff/(std/sqrt(n))
z = round(z,5)
z

1.39754

In [48]:
# This means we want to know the probability P(−1.39754 < z < 1.39754).
# Using norm cdf function to find the area under the curve
p1 = stats.norm.cdf(z)
p2 = stats.norm.cdf(-z)

print(p1)
print(p2)

0.9188743765809492
0.08112562341905083


In [49]:
# The probability under the normal curve between these z-scores is:
p = p1-p2

print(round(p,3))

0.838
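
As a cross-check, the same probability can be read directly off the sampling distribution of the mean, without standardizing by hand. This is only a sketch: the names `sampling_dist` and `p_within` are illustrative, and the population mean of 36 PSI is taken from the question text.

```python
from math import sqrt
from scipy import stats

mu = 36.0      # population mean pressure in PSI (from the question)
sigma = 0.8    # population standard deviation in PSI
n = 125        # sample size

# Sampling distribution of the sample mean: Normal(mu, sigma / sqrt(n))
se = sigma / sqrt(n)
sampling_dist = stats.norm(loc=mu, scale=se)

# P(mu - 0.1 < sample mean < mu + 0.1)
p_within = sampling_dist.cdf(mu + 0.1) - sampling_dist.cdf(mu - 0.1)
print(round(p_within, 3))
```

This should agree with the z-score approach above, since subtracting `mu` and dividing by `se` is exactly what `loc` and `scale` do internally.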


---
### 04 Sampling distribution of the sample proportion

__1.Question:__ A restaurant wants to know the percentage of their customers who order dessert. The restaurant has 1,500 customers in one week and finds by randomly surveying 100 customers that 35 of them order dessert. What is the standard deviation for the sample?

__Explanation:__ 

The standard deviation of the sampling distribution of the sample proportion is given by:

-  $σ = \sqrt{\frac{p(1-p)}{n}}$

In [50]:
n = 100
p = 35/100

std = sqrt((p*(1-p))/n)

print("The standard deviation of the sampling distribution of the sample proportion is",round(std,6))

The standard deviation of the sampling distribution of the sample proportion is 0.047697


---
__2.Question:__ A group of scientists is studying 10,000 manatees and finds that 20 % are calves. You want to verify the claim, but can’t conduct a study of all 10,000, so you randomly sample 500. What’s the probability that your results are within 5 % of the first study?

In [51]:
# find the mean and standard error for the sample.
n = 500
p = 0.2

std = sqrt((p*(1-p))/n)
std

0.01788854381999832

In [52]:
# We want to know the probability that the sample proportion is within 0.05 of the population proportion p = 0.2
diff = 0.05

z = diff/std
z = round(z,5)
z

2.79508

In [53]:
# This means we want to know the probability P(−2.79508 < z < 2.79508).
# Using norm cdf function to find the area under the curve
p1 = stats.norm.cdf(z)
p2 = stats.norm.cdf(-z)

print(p1)
print(p2)

0.9974056563240676
0.0025943436759323472


In [54]:
# The probability under the normal curve between these z-scores is:
p = p1-p2

print("The probability that our sample proportion will fall within 5 % of the first study’s claim is",round(p,4))

The probability that our sample proportion will fall within 5 % of the first study’s claim is 0.9948
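
The result above relies on the normal approximation. One way to sanity-check it, sketched here under the assumption that the 500 draws can be treated as independent (strictly, sampling without replacement from 10,000 animals is hypergeometric), is to compute the same probability from the binomial distribution. The names `low`, `high` and `p_exact` are illustrative.

```python
from scipy import stats

n = 500   # sample size
p = 0.2   # claimed proportion of calves

# "Within 5 % of the first study" means a sample proportion between 0.15 and 0.25,
# i.e. between 75 and 125 calves out of 500.
low = int(round(0.15 * n))
high = int(round(0.25 * n))

# P(low <= X <= high) for X ~ Binomial(n, p)
p_exact = stats.binom.cdf(high, n, p) - stats.binom.cdf(low - 1, n, p)
print(round(p_exact, 4))
```

The binomial value should land close to the normal-approximation answer; small differences are expected because the binomial is discrete.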


---
### 05 Confidence interval for a population mean

__1.Question:__ The height of students in your school is normally distributed with a standard deviation of σ = 4 inches. You take a sample of 50 of your
classmates and get a sample mean of x̂ = 66 inches. What is the confidence interval for a confidence level of 95 % ?

__Explanation:__ 

When the population standard deviation is known, the confidence interval is given by:

-  $(a,b) = x̂ ± Z * σ / {\sqrt n}$

In [55]:
## data:

# height of students, sample mean x̂ and standard deviation σ, n =50
x_hat = 66
std = 4
n = 50

We are looking for the confidence interval at a confidence level of 95 %. From the Z table we know that Z = 1.96.

In [56]:
# Now we calculate the confidence interval using the given data

z = 1.96
a = x_hat - z*(std/sqrt(n))
b = x_hat + z*(std/sqrt(n))

In [57]:
print("The confidence interval for a confidence level of 95 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 95 % is: 64.89 , 67.11
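
Instead of reading Z from a table, the critical value can be computed with `stats.norm.ppf`. The sketch below repeats the calculation above that way; `confidence`, `z_crit` and `margin` are names introduced here for illustration.

```python
from math import sqrt
from scipy import stats

confidence = 0.95
x_hat, sigma, n = 66, 4, 50

# Two-sided interval: put (1 - confidence) / 2 in each tail
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
margin = z_crit * sigma / sqrt(n)

print(round(z_crit, 2), round(x_hat - margin, 2), round(x_hat + margin, 2))
```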


---
__2.Question:__ You want to know the mean number of daylight hours in a day in your city (the time between sunrise and sunset) over the course of a year.
You take a random sample of 30 days throughout the year and get a sample mean of 𝑥̂ = 13.15 hours and a sample standard deviation of s = 0.85 hours. What is the confidence interval for a confidence level of 90 % ?

__Explanation:__ 

When the population standard deviation is unknown, we use the sample standard deviation S and the T distribution, so the confidence interval is given by:

-  $(a,b) = x̂ ± T * S / {\sqrt n}$

In [58]:
## data:

# daylight hours in a day, sample mean x̂ and sample standard deviation s, n = 30
x_hat = 13.15
s = 0.85
n = 30

We are looking for the confidence interval at a confidence level of 90 %. From the T table, with n-1 (30 - 1 = 29) degrees of freedom, we know that T = 1.699.

In [59]:
# Now we calculate the confidence interval using the given data

t = 1.699
a = x_hat - t*(s/sqrt(n))
b = x_hat + t*(s/sqrt(n))

In [60]:
print("The confidence interval for a confidence level of 90 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 90 % is: 12.89 , 13.41
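
The T table lookup can likewise be replaced with `stats.t.ppf`, which takes the cumulative probability and the degrees of freedom. A minimal sketch, with `t_crit` and `margin` as illustrative names:

```python
from math import sqrt
from scipy import stats

confidence = 0.90
x_hat, s, n = 13.15, 0.85, 30

# Two-sided critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
margin = t_crit * s / sqrt(n)

print(round(t_crit, 3), round(x_hat - margin, 2), round(x_hat + margin, 2))
```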


---
### 06 Confidence interval for a population proportion

__1.Question:__ A study shows that 78 % of patients who try a new medication for migraines feel better within 30 minutes of taking the medicine. If the
study involved 120 patients, construct a 95 % confidence interval for the proportion of patients who feel better within 30 minutes of taking the
medicine.

__Explanation:__ 

For a population proportion, the confidence interval is built from the sample proportion p̂ and is given by:

-  $(a,b) = p̂ ± Z \sqrt{\frac{p̂(1-p̂)}{n}}$

In [61]:
## data:

p = 0.78
n = 120

We are looking for the confidence interval at a confidence level of 95 %. From the Z table we know that Z = 1.96.

In [62]:
# Now we calculate the confidence interval using the given data

z = 1.96
a = p - z*(sqrt(p*(1-p)/n))
b = p + z*(sqrt(p*(1-p)/n))

In [63]:
print("The confidence interval for a confidence level of 95 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 95 % is: 0.71 , 0.85


---
__2.Question:__ A study shows that 243 of 500 randomly selected households were using a family member to care for their children who were under
preschool age. Build a 90 % confidence interval for the proportion of households using a family member to care for children under preschool age.

In [64]:
## data:

p = 243/500
n = 500

We are looking for the confidence interval at a confidence level of 90 %. From the Z table we know that Z = 1.65.

In [65]:
# Now we calculate the confidence interval using the given data

z = 1.65
a = p - z*(sqrt(p*(1-p)/n))
b = p + z*(sqrt(p*(1-p)/n))

In [66]:
print("The confidence interval for a confidence level of 90 % is:",round(a,2),",",round(b,2))

The confidence interval for a confidence level of 90 % is: 0.45 , 0.52
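
Both proportion intervals in this section follow the same recipe, so they can be wrapped in a small function. `proportion_ci` is a hypothetical helper written for this notebook, not a SciPy function, and it uses the exact critical value from `stats.norm.ppf` (about 1.645 for 90 %) rather than the rounded table value of 1.65.

```python
from math import sqrt
from scipy import stats

def proportion_ci(p_hat, n, confidence):
    """Normal-approximation confidence interval for a proportion (hypothetical helper)."""
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Question 2: 243 of 500 households, 90 % confidence
lo, hi = proportion_ci(243 / 500, 500, 0.90)
print(round(lo, 2), round(hi, 2))
```

Calling `proportion_ci(0.78, 120, 0.95)` should reproduce the interval from question 1 in the same way.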


# 07 Hypothesis testing
Link: https://drive.google.com/drive/folders/1k23HKSl6N-4tVy5ID1GVJ6jDAp1lThBI

---
### 03 Test statistics for one- and two-tailed tests

__1.Question:__ You’ve decided to give all of your friends a small box of homemade cookies. You’ve already baked a variety of cookies and randomly placed them into the boxes for each of your friends. You want to make sure that each box is close to 0.5 pounds. So you take a sample of 10 boxes and find a sample mean of 0.54 pounds, and a sample standard deviation of s = 0.3 pounds. Assuming that the weights of all the boxes are normally distributed, calculate the test statistic.

__Explanation:__ 

When the population standard deviation is unknown, but we know that the underlying data are normally distributed, the t-test statistic is given by:

-  $t = \frac{X̂ - μ_0}{\frac{s}{\sqrt n}} $

In [67]:
## data:

mu = 0.5
x = 0.54
s = 0.3
n = 10

In [68]:
# Now we calculate the t-test statistic, using the given data

t = (x - mu)/(s/sqrt(n))

In [69]:
print("The t-test statistic is:",round(t,2))

The t-test statistic is: 0.42
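
The question only asks for the test statistic, but a natural follow-up is the p-value. The sketch below assumes a two-tailed test (the boxes should be "close to" 0.5 pounds in either direction) and uses the t distribution with n - 1 degrees of freedom; `t_stat` and `p_value` are illustrative names.

```python
from math import sqrt
from scipy import stats

mu, x, s, n = 0.5, 0.54, 0.3, 10

t_stat = (x - mu) / (s / sqrt(n))

# Two-tailed p-value: twice the upper-tail area beyond |t|
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(round(t_stat, 2), round(p_value, 3))
```

A p-value this far above common significance levels would mean the sample gives no evidence that the mean box weight differs from 0.5 pounds.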


### 05 Hypothesis testing for the population proportion

__1.Question:__ We want to test the hypothesis that 10 % of people are left-handed, so we collect a random sample of 500 people and find that 43 of them are left-handed. What can you conclude at a significance level of α = 0.10?

__Explanation:__ 

The standard error of the proportion is given by:

-  $σ_{\hat{p}} = \sqrt{\frac{p_0(1-p_0)}{n}}$

The test statistic is given by:

-  $z = \frac{\hat{p}-p_0}{σ_{\hat{p}}}$

First build the hypothesis statements.

- H0: 10 % of people are left-handed, p = 0.1

- Ha: The proportion of left-handed people is different than 10 % , p ≠ 0.1

In [70]:
## data:

p_hat = 43/500
p_zero = 0.10
n = 500
alpha = 0.10

In [71]:
# calculate the standard error:

se = sqrt((p_zero*(1-p_zero))/n)
se

0.01341640786499874

In [72]:
# calculate the test statistic:

z = (p_hat-p_zero)/se
round(z,3)

-1.043

The critical z-values for a two-tailed test at α = 0.10 (90 % confidence) are z = ± 1.65.
Since the test statistic we found is negative (z = − 1.043), we’ll compare it to z = − 1.65.
Our z-value of z = − 1.043 is not less than z = − 1.65, so it falls in the region of acceptance. We therefore fail to reject the null hypothesis: there is not enough evidence to conclude that the proportion of left-handed people is different than 10 % .
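
The same decision can be reached in code, either by comparing |z| with the critical value or by comparing the p-value with α. This is a minimal sketch; `z_crit`, `p_value` and `reject` are names introduced here.

```python
from math import sqrt
from scipy import stats

p_hat, p_zero, n, alpha = 43 / 500, 0.10, 500, 0.10

se = sqrt(p_zero * (1 - p_zero) / n)
z = (p_hat - p_zero) / se

# Two-tailed test: alpha is split evenly between the two tails
z_crit = stats.norm.ppf(1 - alpha / 2)
p_value = 2 * stats.norm.sf(abs(z))

reject = abs(z) > z_crit   # equivalent to p_value < alpha
print(round(z_crit, 3), round(p_value, 3), reject)
```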

---
__2.Question:__ We want to test the hypothesis that fewer than 80 % of Americans eat breakfast, so we collect a random sample of 650 Americans and find that 496 of them eat breakfast. What can you conclude at a significance level of α = 0.05?

First build the hypothesis statements.

- H0: 80% or more of Americans eat breakfast , p >= 0.8

- Ha: fewer than 80% of Americans eat breakfast, p < 0.8

It is a lower-tailed test.

In [73]:
## data:

p_hat = 496/650
p_zero = 0.8
n = 650
alpha = 0.05

In [74]:
# calculate the standard error:

se = sqrt((p_zero*(1-p_zero))/n)
se

0.01568929081105472

In [80]:
# calculate the test statistic:

z = (p_hat-p_zero)/se
round(z,3)

-2.353

From the z table we find that the area to the left of the z-score (-2.353) is 0.00939, which is lower than alpha (0.05), so we reject the null hypothesis and conclude that fewer than 80% of Americans eat breakfast.
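
For a lower-tailed test the p-value is simply the area to the left of z, which `stats.norm.cdf` returns directly. The sketch below recomputes it from the unrounded z, so the later decimals may differ slightly from a table lookup at z = -2.35.

```python
from math import sqrt
from scipy import stats

p_hat, p_zero, n, alpha = 496 / 650, 0.8, 650, 0.05

se = sqrt(p_zero * (1 - p_zero) / n)
z = (p_hat - p_zero) / se

# Lower-tailed test: p-value is the area to the left of z
p_value = stats.norm.cdf(z)
print(round(p_value, 4), p_value < alpha)
```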

---
__3.Question:__ We want to test the hypothesis that more than 25 % of NBA players (professional basketball players) started playing basketball before
age 5, so we collect a random sample of 117 NBA players and find that 34 of them started playing before they turned 5. What can you conclude at a significance level of α = 0.01?

First build the hypothesis statements.

- H0: 25% or less of NBA players started playing basketball before age 5, p <= 0.25

- Ha: more than 25% of NBA players started playing basketball before age 5, p > 0.25

It is an upper-tailed test.

In [81]:
## data:

p_hat = 34/117
p_zero = 0.25
n = 117
alpha = 0.01

In [82]:
# calculate the standard error:

se = sqrt((p_zero*(1-p_zero))/n)
se

0.04003203845127178

In [83]:
# calculate the test statistic:

z = (p_hat-p_zero)/se
round(z,3)

1.014

From the z table we find that the area to the left of the z-score (1.014) is 0.84375, so the area to the right is 1 - 0.84375 = 0.15625. This is greater than alpha (0.01), so we fail to reject the null hypothesis at the 0.01 significance level.
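
The three proportion tests in this section differ only in which tail the p-value comes from, so they can be collected into one function. `one_proportion_ztest` is a hypothetical helper written for this notebook (not a SciPy function), and its `alternative` argument names are an assumption chosen for readability.

```python
from math import sqrt
from scipy import stats

def one_proportion_ztest(successes, n, p_zero, alternative):
    """Return (z, p_value); alternative is 'two-sided', 'less' or 'greater'."""
    p_hat = successes / n
    se = sqrt(p_zero * (1 - p_zero) / n)   # standard error under H0
    z = (p_hat - p_zero) / se
    if alternative == "less":
        p_value = stats.norm.cdf(z)            # lower tail
    elif alternative == "greater":
        p_value = stats.norm.sf(z)             # upper tail
    else:
        p_value = 2 * stats.norm.sf(abs(z))    # both tails
    return z, p_value

# The three questions above: (successes, n, p_zero, alternative, alpha)
for successes, n, p_zero, alt, alpha in [
    (43, 500, 0.10, "two-sided", 0.10),
    (496, 650, 0.80, "less", 0.05),
    (34, 117, 0.25, "greater", 0.01),
]:
    z, p_value = one_proportion_ztest(successes, n, p_zero, alt)
    print(alt, round(z, 3), "reject H0" if p_value < alpha else "fail to reject H0")
```

Only the second test should reject the null hypothesis, matching the conclusions reached with the z table.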