# One- and Two-Sample Estimation of the Mean

## Estimation Problems

### Problem 1

9.10) A random sample of 12 graduates of a certain secretarial school typed an average of 79.3 words per minute with a standard deviation of 7.8 words per minute. Assuming a normal distribution for the number of words typed per minute, find a 95% confidence interval for the average number of words typed by all graduates of this school.

In [30]:
n <- 12 # sample size (number of graduates)
dof <- n - 1 # degrees of freedom
sample_mean <- 79.3 # sample mean ("x bar"), words per minute
s <- 7.8 # sample standard deviation "s" (an estimate of the population sigma)
confidence_level <- 0.95 # 95% confidence level

# Significance level implied by the confidence level
alpha <- (1 - confidence_level)

# Calculate the critical t-value
t_value <- qt(1 - alpha/2, dof)
cat("t-value: ", t_value, "\n")

# Calculate Margin of Error with the formula for the confidence interval
margin_of_error <- t_value * (s / sqrt(n))
cat("Margin of Error: ", margin_of_error, "\n")
cat("95% Confidence Interval : [ ", sample_mean - margin_of_error, " , ", sample_mean + margin_of_error , " ]")

t-value:  2.200985 
Margin of Error:  4.955884 
95% Confidence Interval : [  74.34412  ,  84.25588  ]
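
The calculation above follows one pattern: critical t-value, margin of error, then interval. A minimal sketch of that pattern as a reusable function is shown below. The name `t_confidence_interval` and its argument names are illustrative choices, not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): t-based confidence
# interval for a population mean, computed from summary statistics.
t_confidence_interval <- function(sample_mean, s, n, confidence_level = 0.95) {
  alpha <- 1 - confidence_level
  t_value <- qt(1 - alpha / 2, df = n - 1)   # two-sided critical t-value
  margin_of_error <- t_value * s / sqrt(n)
  c(lower = sample_mean - margin_of_error,
    upper = sample_mean + margin_of_error)
}

# Typing-speed example: n = 12, x bar = 79.3, s = 7.8, 95% confidence
ci_typing <- t_confidence_interval(79.3, 7.8, 12, confidence_level = 0.95)
print(ci_typing)
```

Because the function only needs the sample mean, the sample standard deviation, the sample size, and the confidence level, it can be reused for any one-sample t-interval built from summary statistics.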

### Problem 2

9.13) A random sample of 12 shearing pins is taken in a study of the Rockwell hardness of the pin head. Measurements on the Rockwell hardness are made for each of the 12, yielding an average value of 48.50 with a sample standard deviation of 1.5. Assuming the measurements to be normally distributed, construct a 90% confidence interval for the mean Rockwell hardness.

In [8]:
n <- 12 # sample size (number of shearing pins)
dof <- n - 1 # degrees of freedom
sample_mean <- 48.50 # sample mean Rockwell hardness
s <- 1.5 # sample standard deviation "s" (an estimate of the population sigma)
confidence_level <- 0.90 # 90% confidence level

# Significance level implied by the confidence level
alpha <- (1 - confidence_level)

# Calculate the critical t-value
t_value <- qt(1 - alpha/2, dof)
cat("t-value: ", t_value, "\n")

# Calculate Margin of Error with the formula for the confidence interval
margin_of_error <- t_value * (s / sqrt(n))
cat("Margin of Error: ", margin_of_error, "\n")
cat("90% Confidence Interval : [ ", sample_mean - margin_of_error, " , ", sample_mean + margin_of_error , " ]")

t-value:  1.795885 
Margin of Error:  0.7776409 
90% Confidence Interval : [  47.72236  ,  49.27764  ]

### Problem 3

9.19) A random sample of 25 tablets of buffered aspirin contains, on average, 325.05 mg of aspirin per tablet, with a standard deviation of 0.5 mg. Find the 95% tolerance limits that will contain 90% of the tablet contents for this brand of buffered aspirin. Assume that the aspirin content is normally distributed.

In [9]:
n <- 25 # sample size (tablets of buffered aspirin)
dof <- n - 1 # degrees of freedom
sample_mean <- 325.05 # sample mean, mg of aspirin per tablet
s <- 0.5 # sample standard deviation "s", in mg
confidence_level <- 0.95 # confidence attached to the tolerance limits
p <- 0.90 # proportion of the population to be covered

# Two-sided limits: z-value that leaves (1 - p)/2 in each tail
z_value <- qnorm((1 + p) / 2)
cat("z_value: ", z_value, "\n")

# Lower-tail chi-squared quantile with n - 1 degrees of freedom
chi_squared <- qchisq(1 - confidence_level, dof)
cat("Chi-Squared value: ", round(chi_squared, 3), "\n")

z_value:  1.644854 
Chi-Squared value:  13.848 


In [10]:
# k = tolerance factor (approximation for two-sided normal tolerance limits)
tolerance_factor <- (z_value * sqrt(1 + (1/n))) / sqrt(chi_squared / dof)
cat("Tolerance factor k: ", round(tolerance_factor, 3), "\n")

# calculate tolerance limits: x bar +/- k * s
lower_bound <- sample_mean - (tolerance_factor * s)
upper_bound <- sample_mean + (tolerance_factor * s)

cat("95% tolerance limits covering 90% of tablets : [ ", round(lower_bound, 3), " , ", round(upper_bound, 3), " ]")

Tolerance factor k:  2.208 
95% tolerance limits covering 90% of tablets : [  323.946  ,  326.154  ]

### Problem 4

9.28) In Section 9.3, we emphasized the notion of “most efficient estimator” by comparing the variance of two unbiased estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$. However, this does not take into account bias in case one or both estimators are not unbiased. Consider the quantity 

> $\mathrm{MSE} = E(\hat{\Theta} - \theta)^2,$

where MSE denotes mean squared error. The MSE is often used to compare two estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$ of $\theta$ when either or both is unbiased because (i) it is intuitively reasonable and (ii) it accounts for bias. Show that MSE can be written

> $\mathrm{MSE} = E[\hat{\Theta} - E(\hat{\Theta})]^2 + [E(\hat{\Theta} - \theta)]^2$
>
> $= \mathrm{Var}(\hat{\Theta}) + [\mathrm{Bias}(\hat{\Theta})]^2$


**Solution:**

$\hat{\Theta}$ denotes the estimator, $\theta$ the true parameter value, and $E$ the expected value.

Step 1: Definition of Mean Squared Error
> <b>$\mathrm{MSE}(\hat{\Theta}) = E[(\hat{\Theta} - \theta)^2]$</b>

Step 2: Add and subtract the expected value of the estimator, $E(\hat{\Theta})$, inside the squared term
> $\mathrm{MSE}(\hat{\Theta}) = E[(\hat{\Theta} - E(\hat{\Theta}) + E(\hat{\Theta}) - \theta)^2]$

Step 3: Expanding the Squared Term
> $(\hat{\Theta} - \theta)^2 = (\hat{\Theta} - E(\hat{\Theta}) + E(\hat{\Theta}) - \theta)^2$
> 
> $(\hat{\Theta} - \theta)^2 = (\hat{\Theta} - E(\hat{\Theta}))^2 + 2(\hat{\Theta} - E(\hat{\Theta}))(E(\hat{\Theta}) - \theta) + (E(\hat{\Theta}) - \theta)^2$

Step 4: Taking the Expectation of each term separately (linearity of expectation)
> $E[(\hat{\Theta} - \theta)^2] = E[(\hat{\Theta} - E(\hat{\Theta}))^2] + E[2(\hat{\Theta} - E(\hat{\Theta}))(E(\hat{\Theta}) - \theta)] + E[(E(\hat{\Theta}) - \theta)^2]$

Step 5: Simplify each term
> "first term": $E[(\hat{\Theta} - E(\hat{\Theta}))^2] = \mathrm{Var}(\hat{\Theta})$
> 
> "second term": $E[2(\hat{\Theta} - E(\hat{\Theta}))(E(\hat{\Theta}) - \theta)] = 0$
> 
> > <b>Justification</b>: $E(\hat{\Theta}) - \theta = k$ is a constant, so the term equals $2k\,E[\hat{\Theta} - E(\hat{\Theta})] = 2k \cdot 0 = 0$
> > 
> "third term": since $E(\hat{\Theta}) - \theta = k$ is a constant, $E[(E(\hat{\Theta}) - \theta)^2] = (E(\hat{\Theta}) - \theta)^2$

Step 6: Combine Term Results
> $\mathrm{MSE}(\hat{\Theta}) = \mathrm{Var}(\hat{\Theta}) + (E(\hat{\Theta}) - \theta)^2$

Conclusion
> $\mathrm{MSE}(\hat{\Theta}) = \mathrm{Var}(\hat{\Theta}) + [\mathrm{Bias}(\hat{\Theta})]^2$
> 
> where the bias of $\hat{\Theta}$ is defined as $\mathrm{Bias}(\hat{\Theta}) = E(\hat{\Theta}) - \theta$
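
The decomposition can also be checked numerically. The sketch below is an illustration under assumed settings (a normal population with variance 4, samples of size 10, and the divide-by-n variance estimator as a deliberately biased estimator); none of these choices come from the exercise itself.

```r
set.seed(1) # fixed seed so the sketch is reproducible

true_variance <- 4   # theta: variance of the simulated normal population
n <- 10              # size of each simulated sample
n_sim <- 20000       # number of simulated samples

# Biased estimator of the variance: divide by n instead of n - 1
estimates <- replicate(n_sim, {
  x <- rnorm(n, mean = 0, sd = sqrt(true_variance))
  sum((x - mean(x))^2) / n
})

# Empirical versions of the three quantities in the identity
mse <- mean((estimates - true_variance)^2)
variance_part <- mean((estimates - mean(estimates))^2)
bias_part <- (mean(estimates) - true_variance)^2

cat("MSE:          ", mse, "\n")
cat("Var + Bias^2: ", variance_part + bias_part, "\n")
```

The two printed numbers agree up to floating-point error, because the algebra in Steps 2 to 6 holds for empirical averages just as it does for expectations.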

### Problem 5

The following data represent the length of time, in days, to recovery for patients randomly treated with one of two medications to clear up severe bladder infections:

![image.png](attachment:3b272b92-4b06-40aa-8ccb-3509a384ff38.png)


- Medication 1: $n_1 = 14$, $\bar{x}_1 = 17$, $s_1^2 = 1.5$
- Medication 2: $n_2 = 16$, $\bar{x}_2 = 19$, $s_2^2 = 1.8$

Find a 99% confidence interval for the difference μ2−μ1 in the mean recovery times for the two medications, assuming normal populations with equal variances.

In [11]:
# Medication 1
n_1 <- 14 # sample size n_1
sample_mean_1 <- 17 # sample mean (x bar 1), in days
s_1 <- 1.5 # sample VARIANCE s_1^2 (not the standard deviation)

# Medication 2
n_2 <- 16 # sample size n_2
sample_mean_2 <- 19 # sample mean (x bar 2), in days
s_2 <- 1.8 # sample VARIANCE s_2^2 (not the standard deviation)

In [12]:
# Pooled Variance calculation
pooled_variance <- (((n_1 - 1) * s_1 ) + ((n_2 - 1) * s_2)) / (n_1 + n_2 - 2) # = s squared sub p
cat("Pooled Variance: ", pooled_variance, "\n")

# Standard Error, (SE)
standard_error <- sqrt(pooled_variance * ((1 / n_1) + (1 / n_2)))# Difference in means = "SE"
cat("Standard Error (SE): ", standard_error, "\n")

# Margin of Error, (ME) = (t sub alpha/2) * SE 
alpha <- 0.01 # since we want the 99% confidence interval
dof <- n_1 + n_2 - 2 # degrees of freedom 
t_value <- qt(1 - alpha/2, dof) # get t value since n <30

margin_of_error <- t_value * standard_error

cat("Critical T Value: ", t_value, "\n")
cat("Margin of Error (ME): ", margin_of_error, "\n")

Pooled Variance:  1.660714 
Standard Error (SE):  0.4716112 
Critical T Value:  2.763262 
Margin of Error (ME):  1.303185 


In [13]:
# Confidence interval for mu_2 - mu_1: (x bar 2 - x bar 1) +/- ME
mean_combined <- sample_mean_2 - sample_mean_1

lower_bound <- mean_combined-margin_of_error
upper_bound <- mean_combined+margin_of_error
cat("99% Confidence Interval : [ ", lower_bound, " , ", upper_bound , " ]")

99% Confidence Interval : [  0.6968146  ,  3.303185  ]

## R Built-in Datasets, Assess data with Confidence interval for the Mean

In [1]:
# enable built-in R datasets for analysis
library(datasets)

In [11]:
# Traverse data frame, identify desired dataset
# data()

In [14]:
head(USArrests)

Unnamed: 0_level_0,Murder,Assault,UrbanPop,Rape
Unnamed: 0_level_1,<dbl>,<int>,<int>,<dbl>
Alabama,13.2,236,58,21.2
Alaska,10.0,263,48,44.5
Arizona,8.1,294,80,31.0
Arkansas,8.8,190,50,19.5
California,9.0,276,91,40.6
Colorado,7.9,204,78,38.7


In [22]:
# identify dataframe characteristics
names(USArrests)
nrow(USArrests)
attach(USArrests) #attaches columns to callable "Variables"

In [39]:
n <- length(Murder) # sample size
df <- n - 1 # Degree of Freedom
sample_mean <- mean(Murder) # sample mean

confidence_level <- .95
alpha <- (1 - confidence_level)

t_value <- qt(1 - alpha/2, df) # critical t-value (sigma is estimated from the data)
standard_error <- sd(Murder) / sqrt(n)
margin_of_error <- t_value * standard_error

cat("t value: ",t_value,"\n")
cat("Standard Error: ",standard_error,"\n")
cat("95% Confidence Interval : [ ", sample_mean - margin_of_error, " , ", sample_mean + margin_of_error , " ]")

t value:  2.009575 
Standard Error:  0.6159621 
95% Confidence Interval : [  6.550178  ,  9.025822  ]
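
As a cross-check, the built-in `t.test()` function computes the same t-based interval directly from the data. The sketch below accesses the column as `USArrests$Murder` so that it does not depend on `attach()`; the variable names `murder_test` and `murder_ci` are illustrative.

```r
library(datasets)

# One-sample t procedure on the Murder column; only the interval is used here
murder_test <- t.test(USArrests$Murder, conf.level = 0.95)
murder_ci <- murder_test$conf.int

print(murder_ci)
```

The interval returned in `conf.int` should match the manual calculation above, since both use the sample mean, the sample standard deviation, and the t quantile with n - 1 degrees of freedom.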

## One- and Two-Sample Tests of Hypotheses

### Problem 1 

10.8) In Relief from Arthritis published by Thorsons Publishers, Ltd., John E. Croft claims that over 40% of those who suffer from osteoarthritis receive measurable relief from an ingredient produced by a particular species of mussel found off the coast of New Zealand. To test this claim, the mussel extract is to be given to a group of 7 osteoarthritic patients. If 3 or more of the patients receive relief, we shall not reject the null hypothesis that p =0.4; otherwise, we conclude that p<0.4.

 (a) Evaluate α, assuming that p =0.4.

 (b) Evaluate β for the alternative p =0.3.


In [6]:
# Null hypothesis (H_0): p = 0.4
# Alternative hypothesis (H_1): p < 0.4
# X follows a binomial distribution: X ~ Binomial(n = 7, p)

# evaluating alpha ... α = P( X<3 | p=0.4 ), the probability of rejecting H_0 when it is true
# Lower tail of the binomial distribution... α = P(X=0) + P(X=1) + P(X=2)
# P(X=k) = choose(7,k) * 0.4^(k) * (0.6)^(7-k)

P_X_0 <- choose(7,0) * 0.4^(0) * (1- 0.4)^(7-0)
P_X_1 <- choose(7,1) * 0.4^(1) * (1- 0.4)^(7-1)
P_X_2 <- choose(7,2) * 0.4^(2) * (1- 0.4)^(7-2)

alpha <- P_X_0+P_X_1+P_X_2

alpha

In [1]:
# Same test as in part (a): H_0: p = 0.4 versus H_1: p < 0.4
# Now assume the specific alternative p = 0.3 is true

# Type II error (β) is the probability of not rejecting the null hypothesis when the alternative hypothesis is true.
# X follows a binomial distribution: X ~ Binomial(n = 7, p = 0.3)

# evaluating beta ... β = P( X>=3 | p=0.3 )
# Upper tail of the binomial distribution... β = P(X=3) + P(X=4) + P(X=5) + P(X=6) + P(X=7)
# P(X=k) = choose(7,k) * 0.3^(k) * (0.7)^(7-k)

P_X_3 <- choose(7,3) * 0.3^(3) * (1- 0.3)^(7-3)
P_X_4 <- choose(7,4) * 0.3^(4) * (1- 0.3)^(7-4)
P_X_5 <- choose(7,5) * 0.3^(5) * (1- 0.3)^(7-5)
P_X_6 <- choose(7,6) * 0.3^(6) * (1- 0.3)^(7-6)
P_X_7 <- choose(7,7) * 0.3^(7) * (1- 0.3)^(7-7)

beta <- P_X_3+P_X_4+P_X_5+P_X_6+P_X_7

beta

(a) The significance level α (the probability of rejecting the null hypothesis when it is true, assuming p=0.4) is approximately 0.420.

(b) The probability of a Type II error β (the probability of not rejecting the null hypothesis when the alternative hypothesis is true, assuming p=0.3) is approximately 0.353.
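
The term-by-term sums above can be cross-checked with the binomial distribution function `pbinom()`. This is a minimal sketch; the names `alpha_check` and `beta_check` are illustrative and chosen so they do not overwrite the notebook's `alpha` and `beta`.

```r
n_patients <- 7

# (a) alpha = P(X <= 2 | p = 0.4): H_0 is rejected when fewer than 3 patients get relief
alpha_check <- pbinom(2, size = n_patients, prob = 0.4)

# (b) beta = P(X >= 3 | p = 0.3): H_0 is not rejected although p = 0.3 is true
beta_check <- pbinom(2, size = n_patients, prob = 0.3, lower.tail = FALSE)

cat("alpha:", round(alpha_check, 3), "\n")
cat("beta: ", round(beta_check, 3), "\n")
```

Using `lower.tail = FALSE` returns P(X > 2), which for an integer-valued X is the same as P(X >= 3).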

### Problem 2

In the American Heart Association journal Hypertension, researchers report that individuals who practice Transcendental Meditation (TM) lower their blood pressure significantly. If a random sample of 225 male TM practitioners meditate for 8.5 hours per week with a standard deviation of 2.25 hours, does that suggest that, on average, men who use TM meditate more than 8 hours per week? Quote a P-value in your conclusion.

In [15]:
# null hypothesis
mu_0 <- 8 # Hypothesized population mean (mu sub 0) in hours
n <- 225
sample_mean <- 8.5
s <- 2.25 # sample std. dev.
df <- n - 1 # degrees of freedom

# one-sample t-test
t_test <- (sample_mean - mu_0) / (s / sqrt(n))

# Calculate the P-value for a one-tailed test for the t-distribution
p_value <- 1 - pt(t_test, df)

cat("One Sample T-Test (t-statistic):", t_test, "\n")
cat("P-Value:", p_value)

One Sample T-Test (t-statistic): 3.333333 
P-Value: 0.0005019815

Since the P-value (0.0005) is much smaller than the common significance level (α=0.05), we reject the null hypothesis. This suggests that, on average, men who practice Transcendental Meditation meditate more than 8 hours per week. The evidence is statistically significant, supporting the claim of increased meditation time.

### Problem 3


A study was conducted by the Department of Zoology at Virginia Tech to determine if there is a significant difference in the density of organisms at two different stations located on Cedar Run, a secondary stream in the Roanoke River drainage basin. Sewage from a sewage treatment plant and overflow from the Federal Mogul Corporation settling pond enter the stream near its headwaters. The following data give the density measurements, in number of organisms per square meter, at the two collecting stations:

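Number of Organisms per Square Meter

| Station 1 | Station 1 | Station 2 | Station 2 |
|---|---|---|---|
| 5030 | 4980 | 2800 | 2810 |
| 13,700 | 11,910 | 4670 | 1330 |
| 10,730 | 8130 | 6890 | 3320 |
| 11,400 | 26,850 | 7720 | 1230 |
| 860 | 17,660 | 7030 | 2130 |
| 2200 | 22,800 | 7330 | 2190 |
| 4250 | 1130 | | |
| 15,040 | 1690 | | |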

![image.png](attachment:458095e7-6226-4411-bc5b-2289fbed8b6d.png)

Can we conclude, at the 0.05 level of significance, that the average densities at the two stations are equal? Assume that the observations come from normal populations with different variances.

In [24]:
# 1st 
# Data
station1 <- c(5030, 4980, 13700, 11910, 10730, 8130, 11400, 26850, 860, 17660, 2200, 22800, 4250, 1130, 15040, 1690)
station2 <- c(2800, 2810, 4670, 1330, 6890, 3320, 7720, 1230, 7030, 2130, 7330, 2190)

# Calculate means
mean1 <- mean(station1)
mean2 <- mean(station2)

# Calculate variances
var1 <- var(station1)
var2 <- var(station2)

# Calculate sample sizes
n1 <- length(station1)
n2 <- length(station2)

# Calculate t-statistic
t_stat <- (mean1 - mean2) / sqrt((var1 / n1) + (var2 / n2))

# Calculate degrees of freedom
df <- ((var1 / n1) + (var2 / n2))^2 / (((var1 / n1)^2 / (n1 - 1)) + ((var2 / n2)^2 / (n2 - 1)))

# Calculate p-value for two-tailed test
p_value <- 2 * (1 - pt(abs(t_stat), df))

# Output the results
cat("Degrees of Freedom:", df, "\n")
cat("T-statistic:", t_stat, "\n")
cat("P-Value:", p_value)

Degrees of Freedom: 18.78065 
T-statistic: 2.757793 
P-Value: 0.01261219

In [23]:
# 2nd Example  
# Data
station1 <- c(5030, 4980, 13700, 11910, 10730, 8130, 11400, 26850, 860, 17660, 2200, 22800, 4250, 1130, 15040, 1690)
station2 <- c(2800, 2810, 4670, 1330, 6890, 3320, 7720, 1230, 7030, 2130, 7330, 2190)

# Perform Welch's t-test
t_test_result <- t.test(station1, station2, var.equal = FALSE)

# Output the results
t_test_result



	Welch Two Sample t-test

data:  station1 and station2
t = 2.7578, df = 18.781, p-value = 0.01261
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1389.003 10164.331
sample estimates:
mean of x mean of y 
 9897.500  4120.833 


Results: 

The P-value is 0.0126. This value is the probability, assuming that the null hypothesis (mu_1 = mu_2) is true, of observing a t-statistic at least as extreme as 2.7578 in absolute value.
Since the P-value 0.0126 is less than the significance level α=0.05, we reject the null hypothesis.

There is enough statistical evidence to conclude that there is a significant difference in the average densities of organisms at the two stations. The evidence suggests that the means are not equal.

### Problem 4

10.52) For testing

H0: μ=14,

H1: μ≠14,

an α=0.05 level t-test is being considered. What sample size is necessary in order for the probability to be 0.1 of falsely failing to reject H0 when the true population mean differs from 14 by 0.5? From a preliminary sample we estimate σ to be 1.25.


In [15]:
# Sample-Size Calculation, two-sided test
# n = ((z_alpha/2 + z_beta) / (delta / sigma))^2

alpha <- 0.05 # significance level
beta <- 0.1 # Type II Error (Power = 1 - beta)
power <- 1 - beta # Power of the test
sigma <- 1.25 # Estimated Population Standard Deviation
delta <- 0.5 # Difference to Detect

# Critical values
z_alpha <- qnorm(1 - alpha / 2) # z_alpha/2
z_beta <- qnorm(1 - beta)

# Calculate sample size (always round up to the next whole observation)
n <- ((z_alpha + z_beta) / (delta / sigma))^2
cat("Sample-Size Calculated=", n, "=", ceiling(n))

Sample-Size Calculated= 65.67139 = 66
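
The formula above relies on normal quantiles, which treats σ as known. As a hedged comparison, the `power.t.test()` function in R's `stats` package solves for the sample size using the t distribution instead; the sketch below assumes the same inputs (δ = 0.5, σ estimated as 1.25, α = 0.05, power = 0.90). The variable name `t_based` is illustrative.

```r
# Leave n unspecified so that power.t.test() solves for it
t_based <- power.t.test(delta = 0.5, sd = 1.25, sig.level = 0.05,
                        power = 0.90, type = "one.sample",
                        alternative = "two.sided")

cat("t-based sample size (before rounding up):", t_based$n, "\n")
cat("t-based sample size (rounded up):", ceiling(t_based$n), "\n")
```

The t-based answer is typically slightly larger than the z-based one, because estimating σ from the sample adds uncertainty that the normal approximation ignores.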

### Problem 5 

Suppose that, in the past, 40% of all adults favored capital punishment. Do we have reason to believe that the proportion of adults favoring capital punishment has increased if, in a random sample of 15 adults, 8 favor capital punishment? Use a 0.05 level of significance.

In [18]:
alpha <- 0.05 # Significance Level
n <- 15 # Random Sample Size 
X <- 8 # Number of adults favoring capital punishment
p_0 <- 0.4 # Hypothesized Population proportion

# z-statistic for a one-sample proportion (normal approximation to the binomial)
p_hat <- X / n # sample proportion ("p hat")
z <- (p_hat - p_0) / sqrt((p_0*(1-p_0)) / n)
z

# P-value for a one-tailed test
p_value <- 1 - pnorm(z)
p_value

Since the P-value (0.1459) is greater than the significance level (α=0.05), we do not have sufficient evidence to reject the null hypothesis.

The test results suggest that, at the 0.05 significance level, the sample data do not provide strong enough evidence to support the claim that more than 40% of adults now favor capital punishment. Failing to reject the null hypothesis does not prove that the true proportion is 0.4 or lower; it only means that this small sample is consistent with p = 0.4.
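
With only 15 observations, the normal approximation used above is rough. A minimal sketch of the exact alternative, using `binom.test()` from R's `stats` package, is shown below; the names `exact_test` and `p_exact` are illustrative.

```r
# Exact one-sided binomial test of H_0: p = 0.4 against H_1: p > 0.4
exact_test <- binom.test(x = 8, n = 15, p = 0.4, alternative = "greater")
p_exact <- exact_test$p.value

# The exact P-value is P(X >= 8) for X ~ Binomial(15, 0.4)
cat("Exact binomial P-value:", round(p_exact, 4), "\n")
```

The exact P-value is P(X ≥ 8 | n = 15, p = 0.4), roughly 0.21. It is larger than the normal-approximation value, and it leads to the same decision: the null hypothesis is not rejected at the 0.05 level.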

## One & Two Sample T-Tests

In [10]:
# one sample t test
weights <- c(48, 50, 52, 49, 51, 47, 53, 50, 49, 52)
t.test(weights, mu = 50)



	One Sample t-test

data:  weights
t = 0.1654, df = 9, p-value = 0.8723
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
 48.73227 51.46773
sample estimates:
mean of x 
     50.1 


In [11]:
# two sample t test
batch_a <- c(48, 50, 52, 49, 51)
batch_b <- c(47, 53, 50, 49, 52)
t.test(batch_a, batch_b, var.equal = TRUE)



	Two Sample t-test

data:  batch_a and batch_b
t = -0.15617, df = 8, p-value = 0.8798
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.153126  2.753126
sample estimates:
mean of x mean of y 
     50.0      50.2 


## Final Estimation Problems

### Problem 1 (Mean & Variance)

The tar contents of 8 brands of cigarettes selected at random from the latest list released by the Federal Trade Commission are as follows: 7.6, 8.97, 10.2, 16.1, 12.9, 15.1, 14.5, and 9.50 milligrams. Calculate 

(a) the mean; 

(b) the variance. 


In [10]:
tar_contents <- c(7.6, 8.97, 10.2, 16.1, 12.9, 15.1, 14.5, 9.5)

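# Calculate mean
mean_tar <- mean(tar_contents)

# Sample variance (divides by n - 1), appropriate for a random sample of brands
variance_tar <- var(tar_contents)

# Variance with divisor n, shown for comparison
variance_tar_n <- variance_tar * (length(tar_contents) - 1) / length(tar_contents)

cat("Mean:", mean_tar, "\n")
cat("Sample variance (n - 1):", variance_tar, "\n")
cat("Variance with divisor n:", variance_tar_n)

Mean: 11.85875 
Sample variance (n - 1): 10.19304 
Variance with divisor n: 8.918911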

### Problem 2 (Critical Values of Chi-Squared Distribution)

For a chi-squared distribution, find $\chi^2_\alpha$ such that 

(a) $P(\chi^2 > \chi^2_\alpha) = 0.95$ when $v = 5$;

(b) $P(\chi^2 > \chi^2_\alpha) = 0.020$ when $v = 17$;

(c) $P(37.652 < \chi^2 < \chi^2_\alpha) = 0.040$ when $v = 20$.


In [26]:
# Comments use plain ASCII (X2 for the chi-squared variable) to avoid locale warnings

# (a) P(X2 > x2_alpha) = 0.95 when v = 5
alpha_a <- 0.95
v_a <- 5
chi2_alpha_a <- qchisq(1 - alpha_a, v_a)

# (b) P(X2 > x2_alpha) = 0.020 when v = 17
alpha_b <- 0.020
v_b <- 17
chi2_alpha_b <- qchisq(1 - alpha_b, v_b)

# (c) P(37.652 < X2 < x2_alpha) = 0.040 when v = 20
alpha_c <- 0.040
v_c <- 20
lower_bound <- 37.652
p_lower_bound <- pchisq(lower_bound, v_c)
total_prob_c <- p_lower_bound + alpha_c # P(X2 < x2_alpha) = P(X2 < 37.652) + 0.040
# If total_prob_c exceeds 1, no such x2_alpha exists and qchisq() returns NaN with a warning
chi2_alpha_c <- qchisq(total_prob_c, v_c)

chi2_alpha_a
chi2_alpha_b
chi2_alpha_c
total_prob_c

"NaNs produced"


### Problem 3 (Confidence Interval)

An electrical firm manufactures light bulbs that have a life span that is approximately normally distributed. The population standard deviation is not known. A sample of 35 bulbs are found to have an average life span of 800 hours and a sample standard deviation of 40 hours. 

(a) Find a 99% confidence interval for the population mean. 

(b) Would a 95% confidence interval computed from the same sample be wider or narrower than the confidence interval found in part (a)? 

(c) Find a 90% confidence lower bound for the population mean. 


In [33]:
s <- 40 # hours (sample standard deviation)
n <- 35 # bulbs (sample size)
x_bar <- 800 # hours, average life span (sample mean)

# Degrees of freedom
df <- n - 1

# (a) 99% confidence interval
confidence_level_a <- 0.99
t_value_a <- qt(1 - (1 - confidence_level_a) / 2, df)
margin_of_error_a <- t_value_a * (s / sqrt(n))
lower_bound_a <- x_bar - margin_of_error_a
upper_bound_a <- x_bar + margin_of_error_a
cat("99% Confidence Interval is [", lower_bound_a,",", upper_bound_a,"]")

99% Confidence Interval is [ 781.5527 , 818.4473 ]

(b) The width of a confidence interval depends on the critical t-value, which decreases as the confidence level decreases. So, a 95% confidence interval computed from the same sample will be narrower than the 99% confidence interval, as the critical t-value for 95% is smaller than that for 99%.

In [36]:
# (c) 90% confidence lower bound
confidence_level_c <- 0.90
t_value_c <- qt(confidence_level_c, df) # one-sided critical t-value
lower_bound_c <- x_bar - t_value_c * (s / sqrt(n))
cat("90% confidence lower bound is", lower_bound_c)

90% confidence lower bound is 791.1634

### Problem 4 (Confidence Interval, Two Sided)

We are interested in evaluating the performance of two brands of CPUs in terms of their maximum sustainable clock speeds. 

(a) A sample of 40 CPUs of brand 1 are found to have an average maximum sustainable clock speed of 550 MHz. A sample of 150 CPUs of brand 2 are found to have an average maximum sustainable clock speed of 530 MHz. The manufacturer for brand 1 reports that the population standard deviation for the maximum sustainable clock speed of their CPUs is 10 MHz. The manufacturer for brand 2 reports that the population standard deviation for the maximum sustainable clock speed of their CPUs is 15 MHz. Find a 99% confidence interval for mu1 - mu2, the difference of the population means of the maximum sustainable clock speeds for brand 1 and 2. 

Note: If you can't find the exact entry you are looking for in the table, then use the closest one. If there are two equally close entries, use their average. 


In [44]:
# 99% confidence interval for the difference in means (mu1 - mu2)
x1_bar <- 550
sigma1 <- 10
n1 <- 40

x2_bar <- 530
sigma2 <- 15
n2 <- 150

# Z-score for 99% confidence level
alpha <- 0.01  # Significance level for 99% confidence(100% - 99%)
z_alpha <- qnorm(1 - alpha / 2)

# Calculate the standard error
SE <- sqrt((sigma1^2 / n1) + (sigma2^2 / n2))

# Calculate the margin of error
margin_of_error <- z_alpha * SE

# Calculate the confidence interval
lower_bound <- (x1_bar - x2_bar) - margin_of_error
upper_bound <- (x1_bar - x2_bar) + margin_of_error

cat("99% confidence interval of mu1 - mu2 is[", lower_bound,",", upper_bound,"]")

99% confidence interval of mu1 - mu2 is[ 14.84834 , 25.15166 ]

(b) This time let's consider the following scenario: Assume that the population distributions for the maximum sustainable clock speed are approximately normally distributed but the population variances are unknown. Assume that the population variances are equal. A sample of 41 CPUs of brand 1 are found to have a sample mean of 550 MHz and a sample variance of 200. A sample of 21 CPUs of brand 2 are found to have a sample mean of 530 MHz and a sample variance of 100. Find a 95% confidence interval for mu1 - mu2, the difference of the population means of the maximum sustainable clock speeds for brand 1 and 2. 

In [53]:
n1 <- 41
xbar1 <- 550
s1_sq <- 200
n2 <- 21
xbar2 <- 530
s2_sq <- 100
# Degrees of freedom
df <- n1 + n2 - 2

# Calculate pooled standard deviation
sp_sq <- ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
sp <- sqrt(sp_sq)

# Calculate standard error of the difference
SE <- sp * sqrt(1 / n1 + 1 / n2)

# Total significance level for 95% confidence
alpha <- 1 - 0.95  # = 0.05
# Significance level for one tail
alpha_one_tail <- alpha / 2  # = 0.025

# t-value for 95% confidence
t_value <- qt(1 - alpha_one_tail, df)

# Confidence interval
CI_lower <- (xbar1 - xbar2) - t_value * SE
CI_upper <- (xbar1 - xbar2) + t_value * SE
cat("95% confidence interval of mu1 - mu2 is[", CI_lower,",", CI_upper,"]")

95% confidence interval of mu1 - mu2 is[ 13.07032 , 26.92968 ]

### Problem 5 (Confidence Interval)

A manufacturer of MP3 players conducts a set of comprehensive tests on the electrical functions of its product. All MP3 players must pass all tests prior to being sold. Of a random sample of 400 MP3 players, 15 failed one or more tests. Find a 90% confidence interval for the proportion of MP3 players from the population that pass all tests. 

In [60]:
n <- 400
x <- 15

# Sample proportion
p_hat <- (n - x) / n # n-x = number passed 

# Standard error
SE <- sqrt(p_hat * (1 - p_hat) / n)

# Total significance level for 90% confidence
alpha <- 1 - 0.90  # = 0.10
# Significance level for one tail
alpha_one_tail <- alpha / 2  # = 0.05
# Z-value for 90% confidence
z_value <- qnorm(1 - alpha_one_tail)

# Confidence interval
CI_lower <- p_hat - z_value * SE
CI_upper <- p_hat + z_value * SE

cat("90% confidence interval is [", CI_lower,",",CI_upper,"]")

90% confidence interval is [ 0.9468752 , 0.9781248 ]

### Problem 6 (Null & Alternative Hypothesis, Two-tailed)

A DC power supply manufacturer wants to test the hypothesis that the mean output voltage is 50 V (for a variety of different loads). Assume that the output voltage has a normal probability distribution.  A quality control engineer measures the output voltage when the supply is connected to 9 different loads and computes a sample mean x bar = 46.7 and sample standard deviation S = 5.

(a) State the null and alternative hypothesis.

(b) What is the critical region if the hypothesis test is to be conducted at significance level α = 0.05? Is the null hypothesis rejected? 


In [10]:
x_bar <- 46.7 # sample mean
s <- 5 # sample standard deviation
n <- 9 # loads connected (sample size)

# Null Hypothesis H_0: mu = 50
# Alternative Hypothesis H_1: mu != 50
h_0 <- 50 # hypothesized population mean

t_test <- (x_bar - h_0) / (s / sqrt(n)) # Test Statistic 
cat("T-Statistic: ", t_test, "\n")


# Critical t-value for a two-tailed test
df <- n - 1 # degrees of freedom
alpha <- 0.05 # significance level
t_critical <- qt(1 - alpha / 2, df)
# The critical (rejection) region lies outside the interval [-t_critical, t_critical]
cat("Critical Region: t <", -t_critical, "or t >", t_critical)


T-Statistic:  -1.98 
Critical Region: t < -2.306004 or t > 2.306004

Since −1.98 does not fall into the critical region, we do not reject the null hypothesis. This means there is not enough evidence to conclude that the mean output voltage is different from 50 V at the 0.05 significance level.
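
The same decision can be reached through a P-value instead of a critical region. The sketch below recomputes the test statistic from the summary values in the problem and doubles the lower-tail probability for the two-sided alternative; the variable names are illustrative.

```r
x_bar <- 46.7  # sample mean
s <- 5         # sample standard deviation
n <- 9         # sample size
mu_0 <- 50     # hypothesized mean under H_0

t_stat <- (x_bar - mu_0) / (s / sqrt(n))

# Two-sided P-value: probability of a t-statistic at least this extreme in either tail
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

cat("t-statistic:", t_stat, "\n")
cat("Two-sided P-value:", p_value, "\n")
```

A P-value above 0.05 corresponds exactly to the test statistic falling outside the critical region, so both approaches lead to the same conclusion of not rejecting the null hypothesis.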

### Problem 7 (Null & Alt. Hypothesis, One-Tailed)

A company manufacturing pacemakers is testing a new electrode. The electrodes must adhere to a silicone substrate for at least 20 years. The company is going to test the hypothesis that the mean adherence time is 20 years vs. the alternative that it is less than 20 years at the significance level α = 0.01. The experiment will be conducted with a sample of 27 volunteers. Assume that the population distribution for the adherence time is approximately normally distributed. The average adherence time for the pacemakers in the 25 volunteers is found to be 21.8 years and the standard deviation of the sample is found to be 3.5 years. 

(a) Is the null hypothesis rejected? 

(b) If the company wants to decrease the probability of making a type I error without increasing the sample size, should the critical value be increased or decreased? 

(c) Find the 95% confidence interval for the population variance σ^2. 


In [17]:
alpha <- 0.01 # significance level
n <- 27 # sample size (the problem statement mentions both 27 and 25 volunteers; 27 is used here)
x_bar <- 21.8 # years (sample mean)
s <- 3.5 # years (sample std. dev.)

# Null Hypothesis is mu = 20
# Alternative Hypothesis mu < 20
h_0 <- 20 

t_test <- (x_bar - h_0) / (s / sqrt(n)) # Test Statistic 
cat("T-Statistic: ", t_test, "\n")

# Critical t-value for a one-tailed test
df <- n - 1
t_critical <- qt(alpha, df, lower.tail = TRUE)
cat("Critical T-Value: ",t_critical)

T-Statistic:  2.672307 
Critical T-Value:  -2.47863

(a) Since 2.67 > −2.479, we do not reject the null hypothesis. There is not enough evidence to suggest that the mean adherence time is less than 20 years at the 0.01 significance level.

(b) Decreasing α makes it harder to reject H_0. For this left-tailed test the critical value is negative, so it must be decreased, that is, moved further from zero into the left tail (larger in absolute value).

In [25]:
# (c) Confidence interval of 95%
alpha <- 1 - 0.95

# Chi-squared critical values
chi2_lower <- qchisq(alpha / 2, df = 26)
chi2_upper <- qchisq(1 - alpha / 2, df = 26)

cat("Chi-squared Critical Value:[", chi2_lower,",", chi2_upper,"]\n")

# Sample variance
S_sq <- s^2

# Confidence interval for population variance
CI_lower_var <- (26 * S_sq) / chi2_upper
CI_upper_var <- (26 * S_sq) / chi2_lower

cat("Population Variance for 95% Confidence Interval:[", CI_lower_var,",",CI_upper_var,"]")


Chi-squared Critical Value:[ 13.8439 , 41.92317 ]
Population Variance for 95% Confidence Interval:[ 7.597231 , 23.00651 ]

### Problem 8 (chi-squared test of independence)

A random sample of 200 married men, all retired, was classified according to education and number of children: 

| Education | 0–1 children | 2–3 children | Over 3 children |
|---|---|---|---|
| Elementary | 16 | 37 | 30 |
| Secondary | 17 | 42 | 15 |
| College | 13 | 19 | 11 |

Test the hypothesis, at the 0.05 level of significance, that the size of a family is independent of the level of education attained by the father. 

>Null Hypothesis (H_0): The size of a family is independent of the level of education attained by the father.
>
>Alternative Hypothesis (H_1): The size of a family is not independent of the level of education attained by the father.

In [36]:
# Observed frequencies
observed <- matrix(c(16, 37, 30, 17, 42, 15, 13, 19, 11), nrow = 3, byrow = TRUE)

# Row and column totals
row_totals <- rowSums(observed)
col_totals <- colSums(observed)
grand_total <- sum(observed)

# Expected frequencies
expected <- outer(row_totals, col_totals) / grand_total

# Chi-squared statistic
chi_squared_stat <- sum((observed - expected)^2 / expected)

# Critical value
alpha <- 0.05
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical_value <- qchisq(1 - alpha, df)

# p-value
p_value <- 1 - pchisq(chi_squared_stat, df)

observed # Original Matrix
expected # Expected Frequency Matrix
cat("Chi-squared statistic:",chi_squared_stat, "\n")
cat("Critical Value:",critical_value, "\n")
cat("p-value:", p_value)

0,1,2
16,37,30
17,42,15
13,19,11


0,1,2
19.09,40.67,23.24
17.02,36.26,20.72
9.89,21.07,12.04


Chi-squared statistic: 6.556584 
Critical Value: 9.487729 
p-value: 0.1612599

In [33]:
# Chi-squared test, ***easy way, pre-defined function***
chisq.test(observed)


	Pearson's Chi-squared test

data:  observed
X-squared = 6.5566, df = 4, p-value = 0.1613


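We reject the null hypothesis (H_0) if and only if the chi-squared statistic exceeds the critical value or, equivalently, the p-value is below 0.05.

Here 6.557 < 9.488 (p-value 0.161 > 0.05), so we do not reject the null hypothesis: the data do not provide sufficient evidence that family size depends on the father's level of education.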
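
The chi-squared approximation is usually considered reliable when every expected count is at least 5. The object returned by `chisq.test()` exposes the expected counts, so this rule of thumb can be checked directly. The sketch below rebuilds the table with row and column labels; the labels and the name `independence_test` are illustrative additions.

```r
observed <- matrix(c(16, 37, 30, 17, 42, 15, 13, 19, 11),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Education = c("Elementary", "Secondary", "College"),
                                   Children = c("0-1", "2-3", "Over 3")))

independence_test <- chisq.test(observed)

# Expected counts under independence: (row total * column total) / grand total
print(round(independence_test$expected, 2))

# Rule-of-thumb check for the chi-squared approximation
cat("Smallest expected count:", min(independence_test$expected), "\n")
```

The `expected` component matches the matrix computed by hand with `outer()`, and the smallest expected count here is well above 5.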

### Problem 9 (T-test, Null and Alt. Hypothesis)

A random sample of 62 bags of cheddar popcorn weighed, on average, 5.63 ounces with a standard deviation of 0.24 ounce. Test the hypothesis that μ = 5.5 ounces against the alternative hypothesis, μ > 5.5 ounces, at the 0.01 and .05 levels of significance. 

In [44]:
n <- 62
x_bar <- 5.63
s <- 0.24

# Null Hypothesis is mu = 5.5
# Alternative Hypothesis mu > 5.5
h_0 <- 5.5 # ounces

t_test <- (x_bar - h_0) / (s / sqrt(n)) # Test Statistic 
cat("T-Statistic: ", t_test, "\n")

# significance level 0.01
alpha_01 <- 0.01
df <- n-1
t_critical_01 <- qt(1 - alpha_01, df)
t_critical_01

# significance level 0.05
alpha_05 <- 0.05
t_critical_05 <- qt(1 - alpha_05, df)
t_critical_05

T-Statistic:  4.265088 


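The calculated t value (4.27) exceeds the critical t-value at both significance levels (about 1.670 at 0.05 and about 2.389 at 0.01, each with 61 degrees of freedom), so we reject the null hypothesis at both the 0.05 and the 0.01 significance levels.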
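
As a cross-check on the critical-value comparison, the same conclusion can be reached through the one-sided p-value. This is a minimal sketch under the problem's stated summary statistics; `mu_0`, `t_stat`, and `p_value` are illustrative names (the cell above calls the hypothesized mean `h_0`).

```r
n <- 62
x_bar <- 5.63
s <- 0.24
mu_0 <- 5.5   # hypothesized mean under H_0 (illustrative name)
df <- n - 1

# Test statistic
t_stat <- (x_bar - mu_0) / (s / sqrt(n))

# One-sided (upper-tail) p-value for H_1: mu > 5.5
p_value <- pt(t_stat, df, lower.tail = FALSE)
cat("p-value:", p_value, "\n")

# Compare the statistic against the critical value at each level
for (alpha in c(0.05, 0.01)) {
  t_crit <- qt(1 - alpha, df)
  cat("alpha =", alpha, " critical t =", t_crit,
      " reject H_0:", t_stat > t_crit, "\n")
}
```

If the p-value is below a given α, the null hypothesis is rejected at that level, which should agree with the critical-value comparison.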

### Problem 10 (Z-test, Critical Region, Type II Error)

To meet the ISO 4 standard, a clean room for semiconductor manufacturing must not have more than 350 particles (size 0.5 microns or larger) per cubic meter. Assume that the number of particles per cubic meter has an approximately normal probability distribution with population standard deviation σ = 25. An engineer working at a CPU manufacturing plant wants to test the hypothesis that µ, the mean number of particles per cubic meter, is equal to 350 against the alternative hypothesis that it is greater than 350. He takes air samples on 47 different occasions and finds x bar = 362.

(a) State the null and alternative hypothesis.

(b) What is the critical region if the hypothesis test is to be conducted at significance level α = 0.05? Is the null hypothesis rejected? 

(c) Based on your answer for part (b), can you determine if the null hypothesis would be rejected if the test was performed at a level of significance α = 0.001? 

(d) What is the probability of type II error for the critical region computed in the previous part when testing against the specific alternative µ = 368? 


(a) 
>Null Hypothesis (H_0): The mean number of particles per cubic meter is equal to 350. H_0 : mu = 350
>
>Alternative Hypothesis (H_1): The mean number of particles per cubic meter is greater than 350. H_1 : mu > 350

In [50]:
# (b) Critical Region and Hypothesis Test at α=0.05
n <- 47 # sample size
sigma <- 25 # particles, population std. dev. (known)
x_bar <- 362 # particles, sample mean
h_0 <- 350 # null hypothesis

# Test Statistic
z_test <- (x_bar - h_0) / (sigma / sqrt(n)) 
cat("Z Test Statistic: ", z_test, "\n")

# Critical Value 
alpha <- 0.05
z_critical <- qnorm(1 - alpha)
cat("Z Critical Value:",z_critical)

Z Test Statistic:  3.290714 
Z Critical Value: 1.644854

Since Z ≈ 3.29 is greater than the critical value of 1.645, we reject the null hypothesis at the 0.05 significance level.
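
Part (b) asks for the critical region, which can also be stated on the scale of the sample mean rather than the Z scale. The sketch below, using the problem's values, converts the critical Z-value into a cutoff for x bar; `se` and `x_bar_critical` are illustrative names not used in the original cell.

```r
n <- 47
sigma <- 25      # known population standard deviation
mu_0 <- 350      # mean under H_0
alpha <- 0.05
x_bar <- 362     # observed sample mean

# Standard error of the sample mean
se <- sigma / sqrt(n)

# Upper-tail critical region: reject H_0 when x bar exceeds this cutoff
x_bar_critical <- mu_0 + qnorm(1 - alpha) * se

cat("Reject H_0 if x bar >", x_bar_critical, "\n")
cat("Observed x bar in critical region:", x_bar > x_bar_critical, "\n")
```

The two forms are equivalent: Z > 1.645 holds exactly when x bar exceeds the cutoff, which is roughly 356 particles per cubic meter.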

In [52]:
# (c) Hypothesis Test at α = 0.001

alpha_001 <- 0.001
z_critical_001 <- qnorm(1 - alpha_001)
z_critical_001


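The rejection in part (b) does not settle this by itself, because a smaller α gives a larger critical value, so the comparison has to be redone. Since Z ≈ 3.29 is greater than the critical value of 3.090, we would also reject the null hypothesis at the 0.001 significance level.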
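
The p-value answers the question for every significance level at once, since H_0 is rejected exactly when the p-value is below α. A minimal sketch using the problem's values, with `p_value` as an illustrative name:

```r
# Test statistic from part (b)
z_test <- (362 - 350) / (25 / sqrt(47))

# Upper-tail p-value for H_1: mu > 350
p_value <- pnorm(z_test, lower.tail = FALSE)

cat("p-value:", p_value, "\n")
cat("Reject at alpha = 0.05: ", p_value < 0.05, "\n")
cat("Reject at alpha = 0.001:", p_value < 0.001, "\n")
```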

In [1]:
# (d) Probability of Type II Error (β)
mu_a <- 368
sigma <- 25
n <- 47

z_critical <- 1.645 # critical value at alpha = 0.05, from part (b)

# Z-score under alternative hypothesis
z_alternative <- (mu_a - 350) / (sigma / sqrt(n))

# Probability of Type II error
beta <- pnorm(z_critical - z_alternative)
beta


The probability of a Type II error is very low (β ≈ 0.0005) because the alternative mean μ = 368 lies about 4.94 standard errors above 350, well beyond the critical Z-value of 1.645.
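
To see how β depends on the alternative, the same formula can be wrapped in a small function and evaluated at several values of μ. This is only a sketch: `type2_error` is a hypothetical helper, the alternatives 355 and 360 are illustrative values that do not come from the problem, and the critical value is taken from `qnorm` rather than the rounded 1.645.

```r
n <- 47
sigma <- 25
mu_0 <- 350
alpha <- 0.05

se <- sigma / sqrt(n)
z_critical <- qnorm(1 - alpha)

# beta(mu_a) = P(fail to reject H_0 | true mean is mu_a)
type2_error <- function(mu_a) {
  pnorm(z_critical - (mu_a - mu_0) / se)
}

# 368 is the alternative from the problem; 355 and 360 are illustrative
mu_alternatives <- c(355, 360, 368)
beta <- type2_error(mu_alternatives)
power <- 1 - beta

print(data.frame(mu_a = mu_alternatives, beta = beta, power = power))
```

β shrinks as the alternative mean moves further from 350, which is why it is so small at μ = 368.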

### Problem 11 (T-distribution probabilities with degrees of freedom provided)

Similar to Exercise 8.45 and 8.47 from textbook 

(a) Find P(T < 2.65) when v = 6. 

(b) Find P(T > 1.5) when v = 20. 

(c) Given a sample size 20, find k such that P(k < T < 2.85) = 0.095. 

In [10]:
# (a)
v_a <- 6
t_a <- 2.65

# Calculate P(T < 2.65)
p_a <- pt(t_a, df = v_a)
p_a
cat(p_a*100,"%\n")

98.09857 %


In [11]:
# (b)
v_b <- 20
t_b <- 1.5

# Calculate P(T > 1.5)
p_b <- 1 - pt(t_b, df = v_b)
p_b
cat(p_b*100,"%\n")


7.461789 %


In [8]:
# (c)
# P( k < T < 2.85) = 0.095
v_c <- 19 # degrees of freedom: n - 1 = 20 - 1
t_c <- 2.85

# Calculate P(T < 2.85)
p_2.85 <- pt(t_c, df = v_c)
p_2.85

# Calculate P(T < k)
p_k <- p_2.85 - 0.095

# Find k using the inverse CDF
k <- qt(p_k, df = v_c)
cat("k =", k)


k = 1.326981
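
The value of k can be checked by substituting it back into the original probability statement. A minimal sketch, assuming the same 19 degrees of freedom; `prob` is an illustrative name:

```r
v_c <- 19  # degrees of freedom for a sample of size 20

# Recompute k as in the cell above
k <- qt(pt(2.85, df = v_c) - 0.095, df = v_c)

# P(k < T < 2.85) should recover the target probability 0.095
prob <- pt(2.85, df = v_c) - pt(k, df = v_c)
cat("k =", k, "\n")
cat("P(k < T < 2.85) =", prob, "\n")
```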

### Problem 12 (estimator to be unbiased & relative efficiency between estimators)

Let $Y_1, Y_2, \dots, Y_n$ denote a random sample from a population with mean $\mu$ and variance $\sigma^2$. Consider the following three estimators of $\mu$:

$$\hat{\mu}_1 = \frac{1}{3}(Y_1 + Y_2 + Y_3), \qquad \hat{\mu}_2 = \frac{1}{2}\bar{Y}, \qquad \hat{\mu}_3 = \frac{1}{2}Y_1 + \frac{Y_2 + \dots + Y_n}{2n - 2}$$

(a) Show that each of the three estimators is unbiased.

![image.png](attachment:ec9b2c12-41ae-42dc-97a9-6e36c411558b.png)

(b) Find the efficiency of $\hat{\mu}_3$ relative to $\hat{\mu}_2$ and $\hat{\mu}_1$, respectively.

![image.png](attachment:624cf36d-9b27-49c5-ac4a-620baf5174e1.png)

![image.png](attachment:6fae0937-7e68-41fa-87c6-6401afc9ca37.png)