## Essentials of Statistics and Math for Data Science - Assignment 4
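#### Question 1: Assume a manufacturer states that their new light bulbs have an average lifespan of 1000 hours. You want to put this claim to the test, so you conduct a hypothesis test at a significance level of 0.05. You take 25 light bulbs at random and find that they have an average lifespan of 980 hours with a standard deviation of 50 hours.

Null Hypothesis: The mean lifespan of the light bulbs is 1000 hours (μ = 1000).

Alternative Hypothesis: The mean lifespan of the light bulbs is not 1000 hours (μ ≠ 1000).

Because the population standard deviation is unknown and the sample is small (n = 25), a two-tailed one-sample t-test is used.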

In [1]:
import scipy.stats as stats
import numpy as np

mean_sample = 980
std_sample = 50
n = 25
mean_population = 1000
alpha = 0.05

In [2]:
# Calculate the test statistic
t_statistic = (mean_sample - mean_population) / (std_sample / np.sqrt(n))

In [3]:
# Calculate degrees of freedom
df = n - 1

In [4]:
# Calculate the critical t-value for a two-tailed test
t_critical = stats.t.ppf(1 - alpha / 2, df)

In [5]:
# Calculate the p-value
p_value = 2 * (1 - stats.t.cdf(np.abs(t_statistic), df))

In [6]:
t_statistic, t_critical, p_value

(-2.0, 2.0638985616280205, 0.056939849936591624)

Test Statistic (t): -2.00 (calculated value)
Critical t-Value (two-tailed, α = 0.05, df = 24): ±2.064
P-Value: 0.057

Conclusion:

Since the absolute value of the test statistic, |t| = 2.00, is less than the critical value 2.064 (equivalently, the p-value 0.057 is greater than 0.05), we do not reject the null hypothesis.
Based on this analysis, the data do not provide sufficient evidence to conclude that the average lifespan of the light bulbs differs from 1000 hours. The manufacturer's claim of a 1000-hour average lifespan cannot be rejected at the 5% significance level, although the result is close to the threshold.
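The same decision can be read off a confidence interval. The following is a minimal sketch, assuming the summary statistics given in the question; the variable names `std_error`, `ci_low`, `ci_high` and `claim_inside` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Summary statistics from the question
mean_sample, std_sample, n = 980, 50, 25
mean_population, alpha = 1000, 0.05

# Standard error of the mean and two-tailed critical t-value (df = n - 1)
std_error = std_sample / np.sqrt(n)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# 95% confidence interval for the true mean lifespan
ci_low = mean_sample - t_crit * std_error
ci_high = mean_sample + t_crit * std_error

# The null hypothesis is not rejected when the claimed mean lies inside the interval
claim_inside = ci_low <= mean_population <= ci_high
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), claim inside: {claim_inside}")
```

The interval runs from roughly 959.4 to 1000.6 hours, so the claimed value of 1000 falls just inside it, which agrees with the t-test decision.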


#### Question 2: A company produces two product categories, A and B, and sells them to clients in a 3:2 ratio. For type A, the defect probability is 0.05, while for type B it is 0.08. If a consumer complains that a product is defective, what is the chance that the product is of type B?

In [9]:
from sympy import Rational

# Given data
P_A = Rational(3, 5)  # Probability of product being type A
P_B = Rational(2, 5)  # Probability of product being type B
P_D_given_A = 0.05    # Probability of defect given type A
P_D_given_B = 0.08    # Probability of defect given type B

In [10]:
# Calculate the total probability of defect P(D)
P_D = (P_D_given_A * P_A) + (P_D_given_B * P_B)

In [11]:
# Calculate the probability that a defective product is of type B P(B|D)
P_B_given_D = (P_D_given_B * P_B) / P_D

In [12]:
# Output the result
P_B_given_D.evalf()  # Evaluate to a floating-point number

0.516129032258065

Thus, if a consumer complains about a defective product, there is a 51.6% chance that the product is of type B.
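The Bayes' theorem calculation can also be carried out in exact fractions, which avoids mixing floats with rationals. This is a sketch using only the standard library; the dictionary names `prior`, `defect_rate` and `posterior` are illustrative choices, not part of the original notebook.

```python
from fractions import Fraction

# Prior probabilities from the 3:2 sales ratio
prior = {"A": Fraction(3, 5), "B": Fraction(2, 5)}

# Defect probabilities for each product type (0.05 and 0.08 as exact fractions)
defect_rate = {"A": Fraction(5, 100), "B": Fraction(8, 100)}

# Law of total probability: P(D) = sum over types of P(D | type) * P(type)
p_defect = sum(defect_rate[t] * prior[t] for t in prior)

# Bayes' theorem: P(type | D) = P(D | type) * P(type) / P(D)
posterior = {t: defect_rate[t] * prior[t] / p_defect for t in prior}

for t, p in posterior.items():
    print(f"P({t} | defective) = {p} = {float(p):.4f}")
```

The posterior for type B comes out as 16/31, matching the decimal value 0.516 above, and the two posteriors sum to 1.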

#### Question 3: A company employs ten people, four of whom are managers and six of whom are non-managers. The company will choose three employees at random to participate in a training program. What is the likelihood that exactly two of the chosen employees are managers?

In [14]:
from math import comb

total_employees = 10
managers = 4
non_managers = 6

employees_chosen = 3

In [15]:
# Number of ways to choose 3 employees from 10
total_ways = comb(total_employees, employees_chosen)

In [16]:
# Number of ways to choose 2 managers from 4
ways_to_choose_managers = comb(managers, 2)

In [17]:
# Number of ways to choose 1 non-manager from 6
ways_to_choose_non_managers = comb(non_managers, 1)

In [18]:
# Number of favorable outcomes
favorable_ways = ways_to_choose_managers * ways_to_choose_non_managers

In [19]:
# Calculate the probability
probability = favorable_ways / total_ways

probability

0.3

Thus there is a 30% likelihood that exactly two of the chosen employees are managers.
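The count above is one term of a hypergeometric distribution. As a sanity check, the sketch below tabulates the probability of every possible number of managers among the three chosen; the name `distribution` is an illustrative choice, and exact fractions are used so the probabilities can be verified to sum to 1.

```python
from fractions import Fraction
from math import comb

total_employees, managers, employees_chosen = 10, 4, 3
non_managers = total_employees - managers

# P(k managers) = C(managers, k) * C(non_managers, chosen - k) / C(total, chosen)
distribution = {
    k: Fraction(
        comb(managers, k) * comb(non_managers, employees_chosen - k),
        comb(total_employees, employees_chosen),
    )
    for k in range(employees_chosen + 1)
}

for k, p in distribution.items():
    print(f"P({k} managers) = {p} = {float(p):.4f}")
```

The entry for two managers is 3/10, in agreement with the result above.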

#### Question 4: Suppose a researcher wishes to see if the variances of two populations, A and B, are the same. Population A is sampled with 20 observations, whereas population B is sampled with 25 observations. The variances of the samples are 12 and 16, respectively. Assuming that the populations are normally distributed, test the hypothesis at a significance level of 5%.

To test whether the variances of two populations are the same, we can use the F-test for equality of variances. 

Null Hypothesis: The variances of the two populations are equal.

Alternative Hypothesis: The variances of the two populations are not equal.

In [20]:
from scipy.stats import f

n_A = 20
n_B = 25
s_A2 = 12
s_B2 = 16
alpha = 0.05

In [21]:
# Calculate the F statistic
F = s_A2 / s_B2

In [22]:
# Degrees of freedom
df1 = n_A - 1
df2 = n_B - 1

In [23]:
# Calculate the critical value for a two-tailed test
critical_value = f.ppf(1 - alpha / 2, df1, df2)

In [24]:
# Print results
F, critical_value

(0.75, 2.3451537596631566)

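Since F = 0.75 is less than the upper critical value 2.345, we fail to reject the null hypothesis. Strictly, because F is below 1, a two-tailed test should also compare it with the lower critical value, f.ppf(alpha / 2, df1, df2). Equivalently, placing the larger variance in the numerator gives F = 16 / 12 ≈ 1.33, which is also below its upper critical value, so the decision is unchanged.

Thus, at the 5% significance level, there is insufficient evidence to conclude that the variances of the two populations differ.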
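A two-tailed F-test has a lower as well as an upper critical value, and a p-value gives the same decision without reading a table. The sketch below computes both from the summary statistics in the question; the names `lower_critical`, `upper_critical`, `p_two_sided` and `reject_null` are illustrative, and doubling the smaller tail probability is one common convention for a two-sided F-test p-value.

```python
from scipy.stats import f

# Sample sizes and sample variances from the question
n_A, n_B = 20, 25
var_A, var_B = 12, 16
alpha = 0.05

# F statistic and its degrees of freedom (numerator, denominator)
F_stat = var_A / var_B
df1, df2 = n_A - 1, n_B - 1

# Both critical values of the two-tailed rejection region
lower_critical = f.ppf(alpha / 2, df1, df2)
upper_critical = f.ppf(1 - alpha / 2, df1, df2)

# Two-sided p-value: double the smaller of the two tail probabilities
p_two_sided = 2 * min(f.cdf(F_stat, df1, df2), f.sf(F_stat, df1, df2))

# Reject only if F falls outside the interval between the critical values
reject_null = F_stat < lower_critical or F_stat > upper_critical
print(F_stat, lower_critical, upper_critical, p_two_sided, reject_null)
```

With these inputs the statistic lies between the two critical values, so the null hypothesis of equal variances is not rejected.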


#### Question 5: A university is conducting a study to determine if there is a relationship between students' study hours and their exam scores. The following data represent the study hours (in hours per week) and the corresponding exam scores (out of 100) for ten students:

Study Hours: 12, 15, 22, 10, 8, 16, 18, 17, 9, 13
Exam Scores: 58, 72, 80, 65, 55, 70, 80, 71, 66, 62
Calculate the coefficient of correlation between study hours and exam scores. Based on the correlation coefficient, can we conclude that study hours have a significant impact on exam scores?
Find the probable error as well.

In [26]:
import numpy as np
from scipy.stats import pearsonr

study_hours = np.array([12, 15, 22, 10, 8, 16, 18, 17, 9, 13])
exam_scores = np.array([58, 72, 80, 65, 55, 70, 80, 71, 66, 62])

In [27]:
# Calculate Pearson correlation coefficient
correlation_coefficient, _ = pearsonr(study_hours, exam_scores)

In [28]:
# Number of data points
n = len(study_hours)

In [29]:
# Calculate Probable Error: P.E. = 0.6745 * (1 - r^2) / sqrt(n)
probable_error = 0.6745 * (1 - correlation_coefficient**2) / np.sqrt(n)

In [30]:
# Print results, rounded to four decimal places
round(float(correlation_coefficient), 4), round(float(probable_error), 4)

(0.8586, 0.0561)

The correlation coefficient r ≈ 0.859 indicates a strong positive linear relationship between study hours and exam scores. By the usual rule of thumb, a correlation is considered significant when r exceeds six times its probable error; here 6 × 0.0561 ≈ 0.336, which is well below 0.859, so the correlation is significant. This suggests that, in general, students who study more hours tend to have higher exam scores. Note that with only ten students, correlation alone does not establish that additional study hours cause the higher scores.
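Significance can also be checked with a formal test rather than the probable-error rule of thumb. The sketch below keeps the p-value that `pearsonr` returns and reproduces it with the t-test for a correlation coefficient, t = r·√(n − 2) / √(1 − r²) with n − 2 degrees of freedom; the names `t_stat`, `t_crit` and `p_from_t` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

study_hours = np.array([12, 15, 22, 10, 8, 16, 18, 17, 9, 13])
exam_scores = np.array([58, 72, 80, 65, 55, 70, 80, 71, 66, 62])
alpha = 0.05

# pearsonr returns the coefficient and a two-sided p-value for H0: rho = 0
r, p_value = stats.pearsonr(study_hours, exam_scores)

# Equivalent t-test for the correlation coefficient with n - 2 degrees of freedom
n = len(study_hours)
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
p_from_t = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.4f}, t = {t_stat:.3f}, critical t = {t_crit:.3f}, p = {p_value:.4f}")
```

The test statistic exceeds the critical value and the p-value is below 0.05, so the null hypothesis of zero correlation is rejected at the 5% level.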