In [1]:
import numpy as np
import scipy.stats as stats

## Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

|Point|Z-Test|t-Test|
|---|---|---|
|**Definition**| A hypothesis test for a mean (or a difference of means) that is used when the population standard deviation is given. | A hypothesis test for a mean (or for how the means of 2 datasets differ) that is used when the population standard deviation is not given and must be estimated from the sample.|
|**Population variance**| Population variance or standard deviation is known here.| Population variance or standard deviation is unknown here.|
|**Sample Size**| Typically used when the sample size is large (at least 30).| Typically used when the sample size is small (below 30).|
|**Distribution Type**| The test statistic follows the standard Normal distribution.| The test statistic follows Student's t-distribution.|
|**Degrees of Freedom**| Not applicable here.| It is applicable here and is equal to size of sample - 1.|
|**Formula**| $z = \frac{\bar{x} - \mu }{\sigma / \sqrt{n}}$ | $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ |
|**Confidence Interval(CI)** for the difference of 2 means| $CI = (\bar{x}_{1} - \bar{x}_{2}) \pm z_{\alpha/2} \sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}$ | $CI = (\bar{x}_{1} - \bar{x}_{2}) \pm t_{\alpha / 2, df}s_{pooled} \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}$ |

$$s_{pooled} = \sqrt{\frac{(n_{1} - 1) s_{1}^2 + (n_{2} - 1) s_{2}^2}{n_{1} + n_{2} - 2}}$$

![image-2.png](attachment:image-2.png)

Example Scenarios:
--------------------------------

1. t-test : Suppose we want to compare the average height of 2 groups of students belonging to different classes. We are not given the population standard deviation, and the 2 groups are independent of each other, so we use a 2-sample independent t-test to compare their means.

2. z-test : Suppose we want to find out the difference between the average sales of a shop's branches in 2 different locations, and we know the population standard deviation. Since the sample size is also greater than 30, we can use a 2-sample independent z-test to compare the means.
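
To make the practical difference concrete, the sketch below compares two-sided 95% critical values from the Normal distribution and from Student's t-distribution. The degrees of freedom (5, 29, 999) are illustrative assumptions, not values taken from the scenarios above.

```python
import scipy.stats as stats

confidence = 0.95
alpha = 1 - confidence

# Critical value used by a z-test (population standard deviation known)
z_crit = stats.norm.ppf(1 - alpha / 2)

# Critical values used by a t-test for a few illustrative degrees of freedom
t_crits = {dof: stats.t.ppf(1 - alpha / 2, df=dof) for dof in (5, 29, 999)}

print(f"z critical value: {z_crit:.3f}")
for dof, t_crit in t_crits.items():
    print(f"t critical value (dof = {dof}): {t_crit:.3f}")
```

The t critical value is always larger than the z critical value and approaches it as the degrees of freedom grow, which is why the two tests give nearly the same answer for large samples.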

## Q2: Differentiate between one-tailed and two-tailed tests.

|Points|One-Tailed Test|Two-Tailed Test|
|---|---|---|
|**Definition**| A test of any statistical hypothesis, where the alternative hypothesis is one-tailed either right-tailed or left-tailed.| A test of a statistical hypothesis, where the alternative hypothesis is two-tailed.|
|**Critical Regions**| A one-tailed test has only one critical region.| A two-tailed test has 2 critical regions.|
|**Symbols Used**| For one-tailed, we use either > or < sign for the alternative hypothesis.| For two-tailed, we use ≠ sign for the alternative hypothesis.|
|**Significance Level**| The entire level of significance ($\alpha$), e.g. 5%, lies in either the left tail or the right tail.| The level of significance ($\alpha$) is split in half between the two tails, i.e. $\alpha/2$ in each.|
|**Rejection Area**| Rejection region is either from the left side or right side of the sampling distribution.| Rejection region is from both sides i.e. left and right of the sampling distribution.|
|**What it finds out**| It is used to check whether one mean is greater than (or less than) another mean.| It is used to check whether the two means differ from one another in either direction.|
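
The sketch below shows how the same test statistic yields different p-values and critical values depending on the number of tails. The statistic `z_stat = 1.8` and `alpha = 0.05` are made-up values chosen only for illustration.

```python
import scipy.stats as stats

alpha = 0.05
z_stat = 1.8  # hypothetical test statistic, for illustration only

# One-tailed (right-tailed) test: all of alpha sits in the right tail
p_one_tailed = stats.norm.sf(z_stat)
crit_one_tailed = stats.norm.ppf(1 - alpha)

# Two-tailed test: alpha is split in half between both tails
p_two_tailed = 2 * stats.norm.sf(abs(z_stat))
crit_two_tailed = stats.norm.ppf(1 - alpha / 2)

print(f"One-tailed: p = {p_one_tailed:.4f}, critical value = {crit_one_tailed:.3f}")
print(f"Two-tailed: p = {p_two_tailed:.4f}, critical values = ±{crit_two_tailed:.3f}")
```

With these numbers the one-tailed test rejects the null hypothesis at the 5% level while the two-tailed test does not, because the two-tailed p-value is twice the one-tailed one.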

## Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

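**Reality** - The Null Hypothesis is either True or False  
**Decision** - The Null Hypothesis is either rejected or not rejected

**Type 1 Error** - The Null Hypothesis is rejected when in reality it is True (False Positive). Its probability is the significance level $\alpha$.  
Eg: A medical test reports that a healthy person has a disease (Null Hypothesis: the person is healthy).

**Type 2 Error** - The Null Hypothesis is not rejected when in reality it is False (False Negative). Its probability is denoted by $\beta$.  
Eg: The same medical test reports that a sick person is healthy.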
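
As a rough illustration, both error probabilities can be computed for a right-tailed one-sample z-test. All numbers here (null mean 100, true mean 103, population standard deviation 10, sample size 50) are hypothetical.

```python
import numpy as np
import scipy.stats as stats

alpha = 0.05       # chosen significance level
mu_null = 100      # mean under the null hypothesis (hypothetical)
mu_true = 103      # actual mean when the null is false (hypothetical)
sigma = 10         # known population standard deviation (hypothetical)
n = 50             # sample size (hypothetical)

se = sigma / np.sqrt(n)

# Reject the null when the sample mean exceeds this threshold
threshold = mu_null + stats.norm.ppf(1 - alpha) * se

# Type 1 error: rejecting the null although it is true
type1_prob = stats.norm.sf(threshold, loc=mu_null, scale=se)

# Type 2 error: not rejecting the null although the true mean is mu_true
type2_prob = stats.norm.cdf(threshold, loc=mu_true, scale=se)

print(f"P(Type 1 error) = {type1_prob:.3f}")
print(f"P(Type 2 error) = {type2_prob:.3f}")
```

The Type 1 error probability equals the chosen significance level by construction, while the Type 2 error probability depends on how far the true mean is from the null mean and on the sample size.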

## Q4: Explain Bayes's theorem with an example.

**Bayes’ theorem** describes the probability of an event given that a related condition has occurred, i.e. it is a statement about **conditional probability**. It lets us update the probability of a cause after observing its effect, which is why Bayes’ theorem is also known as the formula for the probability of **“causes”**.

Formula:

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$

**P(A)** = Probability of event A occurring on its own  
**P(B)** = Probability of event B occurring on its own  
**P(A|B)** = Probability of event A occurring given event B has already occurred  
**P(B|A)** = Probability of event B occurring given event A has already occurred
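
The formula translates directly into a short function. The function name `bayes_posterior` and the probabilities passed to it are hypothetical and only serve as an illustration.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if prob_b <= 0:
        raise ValueError("P(B) must be positive")
    return prior_a * likelihood_b_given_a / prob_b


# Hypothetical inputs: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5
posterior = bayes_posterior(prior_a=0.3, likelihood_b_given_a=0.8, prob_b=0.5)
print(f"P(A|B) = {posterior:.2f}")
```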

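E.g:  A bag contains 5 Yellow and 6 Green balls, and 2 balls are drawn one after the other without replacement. We have to calculate the probability that the second ball is Yellow given that the first ball drawn is Green.

A = The second ball drawn is Yellow  
B = The first ball drawn is Green  

We need to calculate P(A|B) i.e. the probability of drawing a Yellow ball given a Green ball has already been drawn from the bag.

P(A) = 5/11  
P(B) = 6/11  
P(B|A) = 6/10 (if the second ball is Yellow, the first ball is one of the other 10 balls, 6 of which are Green)  

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$
$$= \frac{\frac{5}{11} * \frac{6}{10}}{\frac{6}{11}}$$
$$= 0.5$$

This agrees with direct counting: once a Green ball has been removed, 5 of the remaining 10 balls are Yellow.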

## Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

**Confidence Interval(CI)** - In frequentist statistics, a confidence interval is a range of estimates for an unknown parameter of the population. A confidence interval is computed at a designated confidence level; the 95% confidence level is most common, but other levels, such as 90% or 99%, are sometimes used. The confidence level is the proportion of such intervals, computed from repeated samples, that would contain the true population parameter.

|**How is Confidence Interval Calculated**|
|---|

The confidence level is the overall capture rate if the method is used many times. The sample mean will vary from sample to sample, but the method **estimate $\pm$ margin_of_error** is used to get an interval based on each sample. So for calculating the Confidence Interval, we first need to know the sample mean as well as the standard error of the mean (SEM), which is a measure of the variation in the sample mean. To calculate the standard error, we in turn need to know the population standard deviation or estimate it with the sample standard deviation.

$$SEM = \frac{\sigma}{\sqrt{SampleSize}}$$

$$MarginOfError = z * SEM$$

$$ConfidenceInterval = SampleMean \pm MarginOfError$$
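
The "capture rate" reading of the confidence level can be checked with a small simulation. This is only a sketch: the true mean, standard deviation, sample size, number of repetitions and random seed are all arbitrary assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

true_mean, true_std = 50, 5      # hypothetical population parameters
sample_size, n_repeats = 30, 2000
confidence = 0.95
z = stats.norm.ppf(1 - (1 - confidence) / 2)

# Draw all samples at once: one row per repeated experiment
samples = rng.normal(true_mean, true_std, size=(n_repeats, sample_size))
sample_means = samples.mean(axis=1)

# The population standard deviation is treated as known, matching the z-based formulas
sem = true_std / np.sqrt(sample_size)
margin = z * sem

# An interval "captures" the truth when the true mean lies inside it
covered = (sample_means - margin <= true_mean) & (true_mean <= sample_means + margin)
coverage = covered.mean()

print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, which is what a 95% confidence level promises over many repetitions.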

**Eg:** Suppose a researcher has taken a sample of size 100 from a class population, with a mean height of 180cm and a standard deviation of 10.2cm. Find out the true population mean height with a confidence of 95%.

**Steps**

1. Here we are provided with a sample of size 100. This sample has a mean height of 180cm with a standard deviation of 10.2cm.
2. We are not given the population standard deviation, and hence we need to use the t-distribution.
3. First we calculate the standard error.
4. Then we calculate the critical t value for the 95% confidence level with 99 degrees of freedom (sample_size - 1).
5. We then calculate the margin of error and apply $\pm$ to the given sample mean to find the confidence interval of the true population mean with 95% confidence.

In [2]:
sample_size = 100
sample_mean = 180
sample_std = 10.2

dof = sample_size - 1

ci = 0.95
alpha = 1 - ci

t_stats = stats.t.ppf(q = 1- alpha/2, df = dof)

se = sample_std / np.sqrt(sample_size)
me = t_stats * se

lb = sample_mean - me
ub = sample_mean + me

print(f"The true population mean height lies between {lb:.2f} and {ub:.2f} with a confidence of {ci * 100}%")

The true population mean height lies between 177.98 and 182.02 with a confidence of 95.0%
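
As a cross-check, SciPy's `stats.t.interval` builds the same kind of interval in a single call. The sketch reuses the sample summary from the example (size 100, mean 180, standard deviation 10.2).

```python
import numpy as np
import scipy.stats as stats

sample_size = 100
sample_mean = 180
sample_std = 10.2

se = sample_std / np.sqrt(sample_size)

# Positional arguments: confidence level, then degrees of freedom
lb, ub = stats.t.interval(0.95, sample_size - 1, loc=sample_mean, scale=se)

print(f"95% confidence interval: ({lb:.2f}, {ub:.2f})")
```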


## Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.

**Problem** - A company manufactures a certain type of product that has a defect rate of 2%. The company uses a quality control process that is 95% effective in catching defective products, and 99% effective in passing non-defective products. If a product passes the quality control process, what is the probability that it is actually defect-free?

**Solution**

Assumptions:

1. D: Defective Product
2. ~D: Non Defective product
3. P: Product passes the quality control

Problem Breakdown:  
====================
We need to find out the following:
$$P(\sim{D}|P) = \frac{P(\sim{D}) P(P|\sim{D})}{P(P)}$$


Given:
1. P(D) = 0.02
2. P(~D) = 0.98
3. P(P|D) = 1 - 0.95 = 0.05
4. P(P|~D) = 0.99

$$P(P) = P(D)P(P|D) + P(\sim{D})P(P|\sim{D})$$
$$= 0.02 * 0.05 + 0.98 * 0.99$$
$$= 0.9712$$

Therefore,

$$P(\sim{D}|P) = \frac{0.98 * 0.99}{0.9712}$$
$$= 0.99897$$
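
The arithmetic above can be reproduced in a few lines of Python. The variable names are chosen for this sketch only.

```python
p_defective = 0.02
p_good = 1 - p_defective

p_pass_given_defective = 1 - 0.95   # QC misses 5% of defective products
p_pass_given_good = 0.99

# Law of total probability: overall chance that a product passes QC
p_pass = p_defective * p_pass_given_defective + p_good * p_pass_given_good

# Bayes' theorem: chance that a product is defect-free given that it passed
p_good_given_pass = p_good * p_pass_given_good / p_pass

print(f"P(P) = {p_pass:.4f}")
print(f"P(~D|P) = {p_good_given_pass:.5f}")
```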

## Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

In [3]:
ci = 0.95
alpha = 1 - ci

sample_mean = 50
sample_std = 5
# Assuming sample size to be 50
sample_size = 50

dof = sample_size - 1

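# two-sided interval, so alpha is split between both tails
t_crit = stats.t.ppf(q = 1 - alpha / 2, df = dof)

se = sample_std / np.sqrt(sample_size)
me = t_crit * se

lb = sample_mean - me
ub = sample_mean + me

print(f"The population mean lies between {lb:.2f} and {ub:.2f} with a confidence level of {ci * 100}%")

The population mean lies between 48.58 and 51.42 with a confidence level of 95.0%

Interpretation: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true population mean.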


## Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

**Margin of Error** - The margin of error is half the width of the confidence interval (i.e. the radius of the interval). It tells us how far our sample estimate is expected to be from the real population value. For example, a 95% confidence interval with a 4 percent margin of error means that the statistic will be within 4 percentage points of the real population value 95% of the time.
![image.png](attachment:image.png)

Eg: Suppose a poll reports a 98% confidence interval of 4.88 to 5.26. This means that if the poll is repeated using the same techniques, the interval estimates will contain the true population parameter (parameter vs. statistic) 98% of the time.

**Impact of Sample Size on Margin of Error** - The larger the sample size, the smaller the margin of error. Also, for a proportion, the further from 50% the reported percentage, the smaller the margin of error. The reason for this is that a larger sample provides more information about the population, and hence the sample statistics should be closer to the population parameters.

$$Margin of Error(MoE) = critical\_value * \frac{sample\_standard\_deviation}{\sqrt{sample\_size}}$$

So as the sample_size increases, the fraction decreases and the MoE reduces. The MoE is inversely proportional to the square root of the sample size.

In [4]:
critical_val = 2.325

sample_std = 3.5
sample_size1 = 50

moe1 = critical_val * sample_std / np.sqrt(sample_size1)

sample_size2 = 100

moe2 = critical_val * sample_std / np.sqrt(sample_size2)

print(f"Margin of error with sample size {sample_size1} = {moe1}")
print(f"Margin of error with sample size {sample_size2} = {moe2}")

Margin of error with sample size 50 = 1.1508162863811062
Margin of error with sample size 100 = 0.8137500000000001
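
Because the margin of error shrinks with the square root of the sample size, quadrupling the sample size halves it. A short sketch using the same critical value and standard deviation as the cell above; the list of sample sizes is an arbitrary choice.

```python
import numpy as np

critical_val = 2.325
sample_std = 3.5

sample_sizes = np.array([25, 100, 400, 1600])  # each is 4x the previous one
moes = critical_val * sample_std / np.sqrt(sample_sizes)

for size, moe in zip(sample_sizes, moes):
    print(f"Margin of error with sample size {size} = {moe:.4f}")
```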


## Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

In [5]:
obs = 75
pop_mean = 70
pop_std = 5

z_score = (obs - pop_mean) / pop_std

print(f"The Z-Score is {z_score}")

The Z-Score is 1.0


This means the observed value is 1 standard deviation above the population mean, i.e. it lies to the right of the mean.
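
If the population is additionally assumed to be Normally distributed (an assumption not stated in the question), the z-score can be converted into a percentile:

```python
import scipy.stats as stats

obs = 75
pop_mean = 70
pop_std = 5

z_score = (obs - pop_mean) / pop_std

# Share of the population expected to fall below the observed value
percentile = stats.norm.cdf(z_score)

print(f"About {percentile * 100:.1f}% of values lie below {obs}")
```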

## Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

In [6]:
sample_size = 50
sample_std = 2.5
sample_mean = 6

mean = 0

dof = sample_size - 1

ci = 0.95
alpha = 1 - ci

null_hypo = "The drug is not significantly effective"
alt_hypo = "The drug is significantly effective"

t_score = (sample_mean - mean) / (sample_std / np.sqrt(sample_size))
# right-tailed test (alternate: mean weight loss > 0), so the p-value is the upper-tail area
p_val = stats.t.sf(x = t_score, df = dof)

if p_val < alpha:
    print("Reject the null hypothesis:", null_hypo)
    print("Accept the alternate hypothesis:", alt_hypo)
else:
    print("Failed to reject the null hypothesis:", null_hypo)
    print("Reject the alternate hypothesis:", alt_hypo)

Reject the null hypothesis: The drug is not significantly effective
Accept the alternate hypothesis: The drug is significantly effective


## Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

**Standard Error of the Proportion:** Often in statistics we’re interested in estimating the proportion of individuals in a population with a certain characteristic. For example, we might be interested in estimating the proportion of residents in a certain city who support a new law. Instead of going around and asking every individual resident if they support the law, we would instead collect a simple random sample and find out how many residents in the sample support the law. We would then calculate the sample proportion (p̂) as:

Sample Proportion Formula:

 

$$\hat{p} = x / n$$

where:

x: The count of individuals in the sample with a certain characteristic.  
n: The total number of individuals in the sample.

Standard Error of the Proportion Formula:

 

$$\text{Standard Error} = \sqrt{\hat{p}(1-\hat{p}) / n}$$

In [8]:
n = 500
p = 0.65

z_score = 1.96 # z-score for a two-sided 95% confidence interval

se = np.sqrt(p * (1 - p) / n)
me = z_score * se

lb = p - me
ub = p + me

print(f"The true proportion of people who are satisfied with their job lies between {lb * 100:.2f}% and {ub * 100:.2f}%")

The true proportion of people who are satisfied with their job lies between 60.82% and 69.18%


## Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

In [12]:
sampleA_mean = 85
sampleA_std = 6
n1 = 100 # sample size is not given in the question; assumed to be 100

sampleB_mean = 82
sampleB_std = 5
n2 = 100 # sample size is not given in the question; assumed to be 100

alpha = 0.01

null_hypo = "The 2 teaching methods don't have a significant difference in student performance"
alt_hypo = "The 2 teaching methods have a significant difference in student performance"

# pooled standard deviation
sp = np.sqrt(((n1 - 1) * sampleA_std ** 2 + (n2 - 1) * sampleB_std ** 2) / (n1 + n2 - 2))

t_stats = (sampleA_mean - sampleB_mean) / (sp * np.sqrt(1 / n1 + 1 / n2))

dof = n1 + n2 - 2

# two-tailed p-value, since the alternate hypothesis is that the means differ in either direction
p_val = 2 * stats.t.sf(x = abs(t_stats), df = dof)

if p_val < alpha:
    print("Reject the null hypothesis:", null_hypo)
    print("Accept the alternate hypothesis:", alt_hypo)
else:
    print("Failed to reject the null hypothesis:", null_hypo)
    print("Reject the alternate hypothesis:", alt_hypo)

Reject the null hypothesis: The 2 teaching methods don't have a significant difference in student performance
Accept the alternate hypothesis: The 2 teaching methods have a significant difference in student performance
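
SciPy can run a pooled two-sample t-test directly from summary statistics via `stats.ttest_ind_from_stats`. As in the cell above, the group sizes of 100 are an assumption because the question does not give them.

```python
import scipy.stats as stats

# Two-sided pooled-variance t-test computed from summary statistics
result = stats.ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=100,   # sample A (size assumed)
    mean2=82, std2=5, nobs2=100,   # sample B (size assumed)
    equal_var=True,
)

alpha = 0.01
print(f"t statistic = {result.statistic:.3f}, two-tailed p-value = {result.pvalue:.5f}")
print("Significant difference" if result.pvalue < alpha else "No significant difference")
```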


## Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

In [20]:
pop_mean = 60
pop_std = 8

sample_size = 50
sample_mean = 65

ci = 0.9
alpha = 1 - ci

# critical z value for a two-sided 90% confidence interval
z_crit = stats.norm.ppf(q = 1 - alpha / 2)

# the population standard deviation is known, so it is used for the standard error
se = pop_std / np.sqrt(sample_size)
me = z_crit * se

lb = sample_mean - me
ub = sample_mean + me

print(f"The true population mean lies between {lb:.2f} and {ub:.2f} with a confidence level of {ci * 100}%")

The true population mean lies between 63.14 and 66.86 with a confidence level of 90.0%


## Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

In [30]:
sample_size = 30
sample_mean = 0.25
sample_std = 0.05

pop_mean = 0.3 # baseline reaction time is not given in the question; assumed to be 0.3 seconds

dof = sample_size - 1

ci = 0.9
alpha = 1 - ci

null_hypo = "The caffeine doesn't have a significant effect on reaction time"
alt_hypo = "The caffeine has a significant effect on reaction time"

t_stats = (sample_mean - pop_mean) / (sample_std / np.sqrt(sample_size))
t_crit = stats.t.ppf(q = 1 - alpha / 2, df = dof)

se = (sample_std / np.sqrt(sample_size))
me = t_crit * se

lb = sample_mean - me
ub = sample_mean + me

# two-tailed p-value, compared against alpha (not against the critical value)
p_val = 2 * stats.t.sf(x = abs(t_stats), df = dof)

print(f"90% Confidence Interval: ({lb:.3f}, {ub:.3f})")

if p_val < alpha:
    print("Reject the null hypothesis:", null_hypo)
    print("Accept the alternate hypothesis:", alt_hypo)
else:
    print("Fail to reject the null hypothesis:", null_hypo)
    print("Reject the alternate hypothesis:", alt_hypo)

90% Confidence Interval: (0.234, 0.266)
Reject the null hypothesis: The caffeine doesn't have a significant effect on reaction time
Accept the alternate hypothesis: The caffeine has a significant effect on reaction time
