## 2023

### Problem 1

**Objective:** 
1. Derive a formula for the required sample size for an $\alpha$-level test with a power of $1-\beta$.
2. Calculate the required sample size using the given parameters.

#### Part (i)

**(a) Derivation of the Sample Size Formula**

**Given Data:**
- Population proportion last year, $p_0$
- Desired level of significance, $\alpha$
- Desired power, $1 - \beta$
- True population proportion this year, $p = p_0 - \Delta$

**Formula:**
We use the normal approximation to the binomial distribution to derive the sample size $n$ needed to detect a decrease of $\Delta$ in the proportion with power $1 - \beta$ at significance level $\alpha$.

The null and alternative hypotheses are:

$H_0: p \geq p_0$

$H_a: p < p_0$

The test statistic under the null hypothesis is approximately normally distributed:

$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$

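To have a test with significance level $\alpha$ and power $1 - \beta$, two conditions must hold. Here $Z_{\alpha}$ denotes the standard normal value with upper-tail area $\alpha$, and $Z_{1-\beta}$ denotes the standard normal value with lower-tail area $1 - \beta$.

**Significance condition:** We reject $H_0$ when $Z \leq -Z_{\alpha}$, that is, when $\hat{p}$ falls at or below the critical value

$c = p_0 - Z_{\alpha} \sqrt{\frac{p_0 (1 - p_0)}{n}}$

**Power condition:** When the true proportion is $p_0 - \Delta$, $\hat{p}$ is approximately normal with mean $p_0 - \Delta$ and variance $\frac{(p_0 - \Delta)(1 - (p_0 - \Delta))}{n}$. Requiring $P(\hat{p} \leq c) = 1 - \beta$ gives

$c = (p_0 - \Delta) + Z_{1-\beta} \sqrt{\frac{(p_0 - \Delta)(1 - (p_0 - \Delta))}{n}}$

Equating the two expressions for $c$ and rearranging, we get:

$\Delta = \frac{Z_{\alpha} \sqrt{p_0 (1 - p_0)} + Z_{1-\beta} \sqrt{(p_0 - \Delta)(1 - (p_0 - \Delta))}}{\sqrt{n}}$

Solving this equation for $n$, we obtain: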

$n = \left( \frac{Z_{\alpha} \sqrt{p_0 (1 - p_0)} + Z_{1-\beta} \sqrt{(p_0 - \Delta)(1 - (p_0 - \Delta))}}{\Delta} \right)^2$

**(b) Calculation of Sample Size**

**Given Data:**
- $\alpha = 0.05$ (significance level)
- $1 - \beta = 0.95$ (power)
- $p_0 = 0.25$
- $\Delta = 0.10$

**Formula:**

$n = \left( \frac{Z_{\alpha} \sqrt{p_0 (1 - p_0)} + Z_{1-\beta} \sqrt{(p_0 - \Delta)(1 - (p_0 - \Delta))}}{\Delta} \right)^2$

Using standard normal distribution tables, we get:
- $Z_{\alpha} \approx 1.645$ (for a one-tailed test at $\alpha = 0.05$)
- $Z_{1-\beta} \approx 1.645$ (for a power of $1 - \beta = 0.95$)

**Calculation:**
$n = \left( \frac{1.645 \sqrt{0.25 (1 - 0.25)} + 1.645 \sqrt{(0.25 - 0.10)(1 - (0.25 - 0.10))}}{0.10} \right)^2$

$n = \left( \frac{1.645 \sqrt{0.25 \times 0.75} + 1.645 \sqrt{0.15 \times 0.85}}{0.10} \right)^2$

$n = \left( \frac{1.645 \times 0.433 + 1.645 \times 0.357}{0.10} \right)^2$

$n = \left( \frac{0.712 + 0.587}{0.10} \right)^2$

$n = \left( \frac{1.299}{0.10} \right)^2$

$n = (12.99)^2$

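$n \approx 168.74$ (about 168.9 if the intermediate values are not rounded)

Since the sample size must be a whole number and must not fall short of the target power, we round up: the required sample size is $n = 169$.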
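
As a cross-check on the arithmetic, the formula can be evaluated without rounding the intermediate values. The sketch below is illustrative only: the function name `sample_size_one_sided` is our own, and the normal quantiles come from Python's standard-library `statistics.NormalDist` rather than from tables.

```python
import math
from statistics import NormalDist


def sample_size_one_sided(p0, delta, alpha, power):
    """Unrounded sample size for testing H0: p >= p0 against Ha: p < p0.

    Uses the normal approximation; the true proportion under the
    alternative is assumed to be p0 - delta.
    """
    p1 = p0 - delta
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper-tail area alpha
    z_power = NormalDist().inv_cdf(power)      # lower-tail area 1 - beta
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_power * math.sqrt(p1 * (1 - p1)))
    return (numerator / delta) ** 2


n_raw = sample_size_one_sided(p0=0.25, delta=0.10, alpha=0.05, power=0.95)
n_required = math.ceil(n_raw)  # round up so the power target is still met
print(f"unrounded n = {n_raw:.2f}, required n = {n_required}")
```

Rounding up with `math.ceil` rather than rounding to the nearest integer is a deliberate choice: a smaller $n$ would give slightly less than the requested power.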

#### Part (ii)

**(a) Hypothesis Testing for Difference in Mean Age at Which Infants Walked Alone**

**Objective:**
Determine if there is a significant difference in the mean age at which infants from two populations, A and B, walked alone.

**Given Data:**
- Sample data for populations A and B.

**Null and Alternative Hypotheses:**

$H_0: \mu_A = \mu_B$

$H_a: \mu_A \neq \mu_B$

**Test to Use:**
Two-sample t-test.

**Assumptions:**
1. The data in both samples are independently and randomly collected.
2. The populations are normally distributed.
3. The variances of the two populations are equal (initial assumption to be checked).

**(b) Testing the Equal Variance Assumption**

We use the F-test for equality of variances, which tests $H_0: \sigma^2_A = \sigma^2_B$ against $H_a: \sigma^2_A \neq \sigma^2_B$.

**F-test Formula:**

$F = \frac{s^2_1}{s^2_2}$

where $s^2_1$ and $s^2_2$ are the sample variances of populations A and B, respectively. PROC TTEST reports the folded form of this statistic, with the larger sample variance in the numerator, so that $F \geq 1$.

**Given Data:**
The sample standard deviations and the F statistic are read from the "Equality of Variances" table in the PROC TTEST output.

We judge the F-value at $\alpha = 0.05$.

If the p-value from the F-test is greater than 0.05, we fail to reject the null hypothesis of equal variances and use the pooled t-test. Otherwise, we reject it and use the Satterthwaite (unequal variances) t-test instead.

**(c) Carrying out the t-test and Drawing a Conclusion**

Using the output from PROC TTEST:

**Test Statistic:**

$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

where $s^2_p$ is the pooled variance:

$s^2_p = \frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}$

**Confidence Interval for Mean Difference:**

$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2, df} \cdot \sqrt{s^2_p \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

Using the given data and t-test output, we calculate the mean difference, t-statistic, p-value, and confidence interval.

**Interpretation:**
If the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in the mean age at which infants from populations A and B walked alone. The confidence interval provides a range of plausible values for the true mean difference.

The confidence interval gives an equivalent check: if the 95% CI does not contain zero, $H_0$ is rejected at the 5% level.

**Example Calculation:**
Suppose, for illustration only, that PROC TTEST reports the following hypothetical values:
- Mean (A) = 10 months, Mean (B) = 12 months
- Pooled standard deviation $s_p = 1.2$
- $n_1 = 12$, $n_2 = 12$

$t = \frac{10 - 12}{\sqrt{1.2^2 \left(\frac{1}{12} + \frac{1}{12}\right)}}$

$t = \frac{-2}{\sqrt{1.2^2 \left(\frac{1}{6}\right)}}$

$t = \frac{-2}{\sqrt{0.24}}$

$t = \frac{-2}{0.49}$

$t \approx -4.08$

Compare $t$ with the two-sided critical value from the t-distribution with $df = n_1 + n_2 - 2 = 22$ at $\alpha = 0.05$, which is $t_{0.025, 22} \approx 2.074$.

Since $|t| \approx 4.08 > 2.074$, we reject $H_0$.

**95% CI:**
$(10 - 12) \pm t_{\alpha/2, df} \cdot \sqrt{1.2^2 \left(\frac{1}{12} + \frac{1}{12}\right)}$

$= -2 \pm 2.074 \cdot 0.49$

$= -2 \pm 1.02$

$= (-3.02, -0.98)$

**Interpretation:** Because the interval excludes zero, the mean difference is significantly different from zero at the 5% level, which agrees with the conclusion from the hypothesis test.
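
The example calculation can be reproduced in a few lines. This is a minimal sketch that uses the hypothetical summary statistics from the example (they are not real PROC TTEST output) and takes the critical value 2.074 from a t-table rather than computing it.

```python
import math

# Hypothetical summary statistics from the worked example (not real output).
mean_a, mean_b = 10.0, 12.0
s_pooled = 1.2
n_a, n_b = 12, 12
t_crit = 2.074  # t_{0.025, 22}, looked up in a t-table

df = n_a + n_b - 2
se_diff = math.sqrt(s_pooled**2 * (1 / n_a + 1 / n_b))  # SE of the mean difference
t_stat = (mean_a - mean_b) / se_diff
margin = t_crit * se_diff
ci = (mean_a - mean_b - margin, mean_a - mean_b + margin)
reject_h0 = abs(t_stat) > t_crit

print(f"df = {df}, t = {t_stat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject H0" if reject_h0 else "fail to reject H0")
```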



### Problem 2

**Objective:**
To determine if the four filling machines are consistent in filling 12 oz cough syrup bottles by analyzing the variance among them using an appropriate statistical model.

#### Part (a) Model Selection

**Given Data:**
- The weight of cough syrup filled in bottles by four different machines.
- 5 runs for each machine.

**Model Selection:**
Given the data structure, where we have measurements from multiple groups (machines), an **Analysis of Variance (ANOVA)** model is appropriate. Specifically, a **One-Way ANOVA** will be used to determine if there are any statistically significant differences between the means of the weights filled by the different machines.

**Reason:**
ANOVA is used to compare means across multiple groups to see if at least one group mean is different from the others, which is suitable for testing the consistency of the machines.

**Model:**

$Y_{ij} = \mu + \tau_i + \epsilon_{ij}$

Where:
- $Y_{ij}$ is the weight filled by machine i on run j.
- $\mu$ is the overall mean weight filled.
- $\tau_i$ is the effect of the $i^{th}$ machine.
- $\epsilon_{ij}$ is the random error term.

#### Part (b) Assumptions for ANOVA

**Assumptions:**
1. **Independence:** The observations are independent of each other.
2. **Normality:** The residuals (differences between observed and predicted values) should be approximately normally distributed.
3. **Homogeneity of Variances (Homoscedasticity):** The variances among the different groups should be approximately equal.

**Check Assumptions Using Output:**
- **Normality:** Can be checked using a Q-Q plot or a normality test (e.g., Shapiro-Wilk test).
- **Homogeneity of Variances:** Can be checked using Levene's test or Bartlett's test.

**Next Step:**
If these assumptions are met, proceed with the ANOVA. If not, consider transforming the data or using a non-parametric alternative such as the Kruskal-Wallis test.

#### Part (c) Hypothesis Testing Using ANOVA

**Hypothesis:**
- **Null Hypothesis ($H_0$):** All machines fill bottles with the same mean weight ($\mu_1 = \mu_2 = \mu_3 = \mu_4$).
- **Alternative Hypothesis ($H_a$):** At least one machine fills bottles with a different mean weight.

**ANOVA Test:**

$F = \frac{\text{Mean Square Between Groups (MSB)}}{\text{Mean Square Within Groups (MSW)}}$

Where:
- **MSB** measures the variation between the group means.
- **MSW** measures the variation within each group.

**Decision Rule:**
If the p-value of the F-test is less than the significance level (typically 0.05), reject the null hypothesis.
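
To make the F ratio concrete, the sketch below computes MSB, MSW, and F from raw data. The function name `one_way_anova` and the fill weights are invented for illustration; they are not the exam data. The p-value is omitted because it requires the F distribution with $(k - 1, N - k)$ degrees of freedom, which Python's standard library does not provide.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (MSB, MSW, F) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    msb = ss_between / (k - 1)
    msw = ss_within / (n_total - k)
    return msb, msw, msb / msw


# Hypothetical fill weights in oz: 4 machines x 5 runs (invented values).
fills = [
    [12.01, 11.98, 12.03, 12.00, 11.99],
    [12.10, 12.07, 12.12, 12.09, 12.11],
    [11.95, 11.97, 11.93, 11.96, 11.94],
    [12.02, 12.00, 12.04, 11.99, 12.01],
]
msb, msw, f_value = one_way_anova(fills)
print(f"MSB = {msb:.5f}, MSW = {msw:.5f}, F = {f_value:.2f}")
```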

**Example PROC ANOVA Call:**
```sas
proc anova data=fill_data;
   class machine;
   model weight = machine;
run;
```

**Illustrative Output (hypothetical values):**
```
                            The ANOVA Procedure

                       Class Level Information
                       Class       Levels Values
                       machine        4     1 2 3 4

                       Number of Observations Read 20
                       Number of Observations Used 20


                            ANOVA for Weight
                  Source     DF    Sum of Squares    Mean Square    F Value    Pr > F

                  Model       3       1.12345          0.37448       7.89      0.0012
                  Error      16       0.75982          0.04749
                  Corrected Total 19    1.88327

                  R-Square     Coeff Var      Root MSE        Weight Mean
                  0.5965       1.9354         0.21792         11.2597
```

**Interpretation:**
- **F Value:** 7.89
- **p-value:** 0.0012

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that at least one machine fills the bottles with a different mean weight than the others.

#### Part (d) Interpretation of Additional Output Pages

**Objective:**
Examine specific pages of the SAS output related to the cough syrup filling data analysis.

1. **Pages 5-7:** These pages typically contain detailed ANOVA tables, post-hoc comparisons (like Tukey’s HSD test), and possibly residual diagnostics.
   - **Page 5:** Could show detailed group means and standard deviations.
   - **Page 6:** Might include post-hoc tests indicating which machines are significantly different.
   - **Page 7:** Could have residual plots or other diagnostic checks.

**Conclusion:**
- **Consistency:** Look at whether post-hoc tests show specific machines that differ significantly in performance.
- **Diagnostics:** Ensure residuals are normally distributed and homoscedasticity is maintained.

#### Part (e) Interpretation of F-Test

**Objective:**
Interpret the F-test from Page 8 of the output.

**F-Test:**
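Assuming the F-test on Page 8 is a homogeneity-of-variance test (for example, Levene's test), it assesses whether the variances of the fill weights differ significantly across machines. This is distinct from the ANOVA F-test in Part (c), which compares the machine means.

**Example Interpretation:**
If this F-test is significant, the within-machine variances are not equal, which suggests that the variability of the filling process differs across machines and that the equal-variance assumption of the ANOVA is questionable.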



### Problem 3

**Objective:**
Analyze the relationship between advertising revenue and other factors for US magazines using multiple linear regression models. Specifically, assess the validity of the initial model, consider transformations, and interpret the results.

#### Part (a): Assessing the Validity of Model (1)

**Model (1):**

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

where:
- Y = AdRevenue (in thousands of \$)
- $x_1$ = AdPages
- $x_2$ = SubRevenue (in thousands of \$)
- $x_3$ = NewsRevenue (in thousands of \$)
- $\varepsilon$ = Error term assumed to be normally distributed with mean 0 and variance $\sigma^2$

**Assessing Validity:**

To determine if Model (1) is valid, we need to check if it meets the assumptions of multiple linear regression:

1. **Linearity:** The relationship between each predictor and the response variable is linear.
2. **Independence:** The residuals (errors) are independent.
3. **Homoscedasticity:** The residuals have constant variance.
4. **Normality:** The residuals are normally distributed.
5. **No Multicollinearity:** The predictors are not highly correlated with each other.

**Approach:**

1. **Linearity Check:**
   - **Scatterplots:** Plot each predictor ($x_1, x_2, x_3$) against the response variable Y. If the plots show a linear trend, the linearity assumption holds.
   
2. **Independence Check:**
   - **Study Design:** If the data collection process ensures independence (e.g., different magazines), this assumption is likely satisfied.
   
3. **Homoscedasticity Check:**
   - **Residual Plots:** Plot residuals vs. fitted values. If the spread of residuals is constant across all levels of fitted values, homoscedasticity holds.
   
4. **Normality Check:**
   - **Q-Q Plot:** Plot the quantiles of the residuals against the quantiles of a normal distribution. If the points fall approximately along a straight line, normality holds.
   
5. **Multicollinearity Check:**
   - **Variance Inflation Factor (VIF):** Calculate VIF for each predictor. VIF values greater than 10 indicate high multicollinearity.

**Potential Issues:**

Given that the variables involve monetary amounts and counts (e.g., AdPages), it's common for such data to exhibit skewness, especially monetary variables. If the residuals show non-linearity, heteroscedasticity, or non-normality, the model may not be valid without transformation.

#### Part (b): Considering Log Transformation

**Recommendation:**
A statistician suggests applying a log transformation to Y and all predictor variables.

**Rationale for Log Transformation:**

1. **Stabilize Variance:**
   - Monetary variables often have increasing variance with increasing mean (heteroscedasticity). Log transformation can stabilize variance.
   
2. **Normalize Distribution:**
   - Skewed data can be made more symmetric and closer to normal distribution via log transformation.
   
3. **Linearize Relationships:**
   - If the relationship between variables is multiplicative or exponential, log transformation can linearize it.

**Assessment:**

Given the nature of the data (revenues and counts), it is likely that:

- **AdRevenue, SubRevenue, NewsRevenue:** These are monetary variables, potentially right-skewed.
- **AdPages:** Counts, may also be skewed.

Therefore, applying a log transformation is reasonable to address potential violations of regression assumptions.

**Conclusion:**
Agree with the recommendation to apply log transformations, leading to Model (2).

#### Part (c): Validity of Model (2) Using Residual Plots

**Model (2):**

$\log(Y) = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \beta_3 \log(x_3) + \varepsilon$

**Assessing Validity:**

Using residual plots is essential to evaluate the assumptions:

1. **Residuals vs. Fitted Values:**
   - **Purpose:** Check for homoscedasticity and linearity.
   - **Interpretation:** Look for random scatter around zero with no clear patterns.
   
2. **Normal Q-Q Plot:**
   - **Purpose:** Assess normality of residuals.
   - **Interpretation:** Points should lie close to the 45-degree line.

3. **Scale-Location Plot:**
   - **Purpose:** Further check for homoscedasticity.
   - **Interpretation:** Points should be randomly scattered.

4. **Residuals vs. Leverage:**
   - **Purpose:** Identify influential observations.
   - **Interpretation:** Points with high leverage and large residuals may unduly influence the model.

**Conclusion:**

If the residual plots after log transformation show:

- **No patterns or funnels** in Residuals vs. Fitted Values,
- **Residuals aligned** along the line in the Q-Q plot,
  
then Model (2) is valid as it satisfies regression assumptions.

#### Part (d): Identifying Statistically Significant Variables in Model (2)

**Approach:**

1. **Regression Output:**
   - Use statistical software (e.g., SAS, R) to fit Model (2) and obtain regression coefficients, standard errors, t-values, and p-values.

2. **Statistical Significance:**
   - At the 5% significance level, variables with p-values less than 0.05 are considered statistically significant.

**Example Output:**
Suppose, for illustration, that the fitted model gives the following hypothetical regression table:

| Predictor           | Coefficient ($\hat{\beta}$) | Std. Error | t-value | p-value |
|---------------------|-------------------------------|------------|---------|---------|
| Intercept           | 0.500                         | 0.200      | 2.5     | 0.013   |
| log(AdPages) ($x_1$)    | 0.800                         | 0.100      | 8.0     | <0.0001 |
| log(SubRevenue) ($x_2$) | 0.150                         | 0.070      | 2.14    | 0.034   |
| log(NewsRevenue) ($x_3$)| 0.050                         | 0.080      | 0.625   | 0.533   |

**Interpretation:**

- **log(AdPages):** p-value < 0.0001 → **Significant**
- **log(SubRevenue):** p-value = 0.034 → **Significant**
- **log(NewsRevenue):** p-value = 0.533 → **Not Significant**
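
Each t-value in the table is the coefficient divided by its standard error. The short check below recomputes them from the hypothetical table entries. The p-values are not recomputed, since they depend on the residual degrees of freedom, which the example does not state.

```python
# Hypothetical (coefficient, standard error) pairs from the illustrative table.
table = {
    "Intercept": (0.500, 0.200),
    "log(AdPages)": (0.800, 0.100),
    "log(SubRevenue)": (0.150, 0.070),
    "log(NewsRevenue)": (0.050, 0.080),
}

# t-value = estimate / standard error
t_values = {name: coef / se for name, (coef, se) in table.items()}

for name, t in t_values.items():
    print(f"{name:<18} t = {t:.3f}")
```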

**Conclusion:**

At the 5% significance level, log(AdPages) ($x_1$) and log(SubRevenue) ($x_2$) are significantly associated with log(AdRevenue); log(NewsRevenue) ($x_3$) is not.

#### Part (e): Interpretation of Significant Regression Coefficients

**Given Interpretation:**
In the log-log model, each coefficient is an **elasticity**: the approximate percentage change in Y associated with a 1% change in x, holding the other predictors constant.

**Coefficients:**

1. **log(AdPages) ($\hat{\beta}_1 = 0.800$):**
   - **Interpretation:** A 1% increase in AdPages is associated with an average **0.8% increase** in AdRevenue, holding other factors constant.
   
2. **log(SubRevenue) ($\hat{\beta}_2 = 0.150$):**
   - **Interpretation:** A 1% increase in SubRevenue is associated with an average **0.15% increase** in AdRevenue, holding other factors constant.

**Note:** The coefficient for log(NewsRevenue) is not significant; thus, we refrain from interpreting it.

**Example:**

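- If AdPages increase by 10%, AdRevenue is expected to increase by approximately $0.8 \times 10\% = 8\%$ (exactly $1.10^{0.8} - 1 \approx 7.9\%$).
- If SubRevenue increases by 10%, AdRevenue is expected to increase by approximately $0.15 \times 10\% = 1.5\%$ (exactly $1.10^{0.15} - 1 \approx 1.4\%$).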
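
The "coefficient times percentage" rule is a first-order approximation that works best for small changes. As a rough sketch, the exact change implied by a log-log coefficient can be computed as follows. The helper name `pct_change_in_y` is our own, and the coefficients are the hypothetical ones from the example table.

```python
def pct_change_in_y(beta, pct_change_in_x):
    """Exact % change in Y implied by a log-log coefficient beta.

    If x is multiplied by r, the fitted Y is multiplied by r**beta.
    """
    ratio = 1 + pct_change_in_x / 100
    return (ratio**beta - 1) * 100


for label, beta in [("AdPages", 0.800), ("SubRevenue", 0.150)]:
    exact = pct_change_in_y(beta, 10)
    approx = beta * 10
    print(f"{label}: approx {approx:.2f}%, exact {exact:.2f}%")
```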

**Conclusion:**

AdPages have a strong positive association with AdRevenue, indicating that more advertising pages contribute significantly to advertising revenue. SubRevenue also has a positive but smaller effect. NewsRevenue does not have a statistically significant association with AdRevenue in this model.



### Problem 4

**Objective:**
Estimate the relationship between atmospheric pressure and the boiling point of water using a linear regression model. Compute the least square estimates, construct confidence intervals, and assess the statistical significance of the relationship.

**Given Data:**
- $\bar{x}$ = 202.9529 (mean boiling point, bp)
- $s_{xx}$ = 530.7824 (sum of squared deviations of bp)
- $s_{xy}$ = 475.3122 (sum of cross products of bp and lpres)
- $\bar{y}$ = 139.6053 (mean log pressure, lpres)
- $s_{yy}$ = 427.7942 (sum of squared deviations of lpres)
- n = 17 (sample size)

#### Part (a): Compute Least Squares Estimates of $\beta_0$ and $\beta_1$

**Model:**

$E(\text{lpres}) = \beta_0 + \beta_1 \cdot \text{bp}$

**Formulas:**

1. **Estimate of $\beta_1$:**
   
   $\hat{\beta}_1 = \frac{s_{xy}}{s_{xx}}$
   
   Substituting the given values:
   
   $\hat{\beta}_1 = \frac{475.3122}{530.7824} \approx 0.8955$

2. **Estimate of $\beta_0$:**
   
   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \cdot \bar{x}$
  
   Substituting the unrounded slope $\hat{\beta}_1 \approx 0.895494$ and the given values (rounding the slope first noticeably shifts the intercept, because $\bar{x}$ is large):
   
   $\hat{\beta}_0 = 139.6053 - (0.895494 \times 202.9529) \approx -42.138$

**Interpretation:**
- **$\hat{\beta}_1$ = 0.8955:** For each 1-degree Fahrenheit increase in boiling point, the log of atmospheric pressure (lpres) increases by approximately 0.8955 units on average.
- **$\hat{\beta}_0$ = -42.138:** This is the estimated log pressure when the boiling point is 0°F. It is an extrapolation far beyond the range of the observed data and has no practical interpretation.
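
Because $\bar{x}$ is large, the intercept is sensitive to how the slope is rounded, so it is worth evaluating both estimates without intermediate rounding. A minimal sketch using only the summary statistics given in the problem (the variable names are our own):

```python
# Summary statistics given in the problem statement.
x_bar, y_bar = 202.9529, 139.6053
s_xx, s_xy = 530.7824, 475.3122

beta1_hat = s_xy / s_xx                 # slope
beta0_hat = y_bar - beta1_hat * x_bar   # intercept, using the unrounded slope

print(f"beta1_hat = {beta1_hat:.4f}")
print(f"beta0_hat = {beta0_hat:.3f}")
```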

#### Part (b): Compute the Estimate of the Regression Line

**Regression Line Equation:**

$\widehat{\text{lpres}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{bp}$

Substituting the estimates:

$\widehat{\text{lpres}} = -42.138 + 0.8955 \cdot \text{bp}$

This equation can be used to predict the log of atmospheric pressure for a given boiling point within the range of the observed data.

#### Part (c): Compute the Unbiased Estimate for Error Variance $\sigma^2$, Construct a 95% CI for $\beta_1$, and Assess Statistical Significance

**1. Error Variance $\sigma^2$:**
   - Formula for error variance:
   
   $\hat{\sigma}^2 = \frac{1}{n-2} \left(s_{yy} - \hat{\beta}_1 s_{xy}\right)$
   
   Substituting the known values, and writing $\hat{\beta}_1 s_{xy} = s_{xy}^2 / s_{xx}$ to avoid rounding error in the slope:
   
   $\hat{\sigma}^2 = \frac{1}{15} \left(427.7942 - \frac{475.3122^2}{530.7824}\right)$
   
   $\hat{\sigma}^2 = \frac{1}{15} \left(427.7942 - 425.6390\right)$
   
   $\hat{\sigma}^2 = \frac{2.1552}{15} \approx 0.1437$

**2. 95% Confidence Interval for $\beta_1$:**

   **Standard Error of $\hat{\beta}_1$:**
   
   $SE(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{s_{xx}}}$
   
   $SE(\hat{\beta}_1) = \frac{\sqrt{0.1437}}{\sqrt{530.7824}} \approx 0.01645$

   **Critical value $t_{\alpha/2, df}$ for $\alpha = 0.05$ and df = 15:**
   
   $t_{0.025, 15} \approx 2.131$

   **Confidence Interval:**
   
   $\hat{\beta}_1 \pm t_{\alpha/2, df} \times SE(\hat{\beta}_1)$
   
   $0.8955 \pm 2.131 \times 0.01645$
   
   $0.8955 \pm 0.0351$
   
   $(0.8604, 0.9306)$

**Interpretation:**
- The 95% CI for $\beta_1$ is (0.8604, 0.9306). Since the interval does not include zero, we conclude that there is a statistically significant positive relationship between boiling point and log atmospheric pressure.

**3. Assessing Statistical Significance at the 5% Level:**

- Assuming normally distributed errors, the test of $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$ uses $t = \hat{\beta}_1 / SE(\hat{\beta}_1) \approx 0.8955 / 0.01645 \approx 54.4$, which far exceeds the critical value 2.131. Consistent with the confidence interval excluding zero, the relationship between boiling point and log atmospheric pressure is statistically significant at the 5% level.
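
The error variance, standard error, confidence interval, and t statistic can all be evaluated directly from the summary statistics without rounding intermediate values. The sketch below does so. The variable names are our own, and the critical value 2.131 is taken from a t-table rather than computed.

```python
import math

# Summary statistics given in the problem statement.
s_xx, s_xy, s_yy, n = 530.7824, 475.3122, 427.7942, 17
t_crit = 2.131  # t_{0.025, 15}, looked up in a t-table

beta1_hat = s_xy / s_xx
rss = s_yy - s_xy**2 / s_xx          # residual sum of squares
sigma2_hat = rss / (n - 2)           # unbiased estimate of the error variance
se_beta1 = math.sqrt(sigma2_hat / s_xx)
margin = t_crit * se_beta1
ci = (beta1_hat - margin, beta1_hat + margin)
t_stat = beta1_hat / se_beta1        # test statistic for H0: beta1 = 0

print(f"sigma2_hat = {sigma2_hat:.4f}, SE(beta1_hat) = {se_beta1:.5f}")
print(f"95% CI for beta1 = ({ci[0]:.4f}, {ci[1]:.4f}), t = {t_stat:.1f}")
```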

#### Summary

The least squares estimates for the regression line indicate a significant positive relationship between the boiling point of water and the log of atmospheric pressure. The error variance is small, suggesting a good fit of the model. The 95% confidence interval for the slope $\beta_1$ suggests that for each unit increase in boiling point, the log atmospheric pressure increases by between 0.8604 and 0.9306 units. This result is statistically significant at the 5% level, supporting boiling point as a reliable predictor of log atmospheric pressure.