## 2022


### Problem 1: Soluble Vascular Adhesion Protein-1 (sVAP-1) as a Biomarker for Atherosclerosis

**Objective:** Determine whether the true mean sVAP-1 level in diabetics on insulin treatment is higher than in diabetics on other treatments.

**Given Data:**
- Group 1 (Diabetics on insulin treatment): $n_1 = 7$
- Group 2 (Diabetics on other treatments): $n_2 = 41$
- Statistical test: Student’s t-test

**(a) Hypotheses:**

**Null Hypothesis ($H_0$):**

$H_0: \mu_1 \leq \mu_2$

Where $\mu_1$ is the mean sVAP-1 level for diabetics on insulin treatment, and $\mu_2$ is the mean sVAP-1 level for diabetics on other treatments.

**Alternative Hypothesis ($H_A$):**

$H_A: \mu_1 > \mu_2$

The alternative hypothesis suggests that the mean sVAP-1 level in diabetics on insulin treatment is higher than that in diabetics on other treatments.

**(b) Justification of Analysis Method:**

**Method Used:**
The authors used the equal variance t-test (also known as a pooled t-test).

**Reasoning:**
The pooled t-test assumes that the two populations have the same variance. To check whether this assumption is reasonable, one typically conducts an F-test for equality of variances; if it shows no significant difference between the variances, the pooled t-test is appropriate. Without the F-test results, we cannot definitively agree or disagree with the authors' method. However, given the small and very unbalanced samples ($n_1 = 7$ versus $n_2 = 41$), it would be prudent to use Welch’s t-test, which does not assume equal variances and is more robust to heteroscedasticity.
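As a rough illustration of the Welch alternative, the statistic and its Satterthwaite degrees of freedom can be computed from summary statistics alone. Only $n_1 = 7$ and $n_2 = 41$ come from the problem; the means and standard deviations below are placeholders, since the SAS output is not reproduced here.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Satterthwaite degrees of freedom."""
    v1 = sd1**2 / n1  # squared standard error of the group 1 mean
    v2 = sd2**2 / n2  # squared standard error of the group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics (NOT from the problem); only n1 and n2 are given.
t_welch, df_welch = welch_t(mean1=120.0, sd1=30.0, n1=7,
                            mean2=95.0, sd2=20.0, n2=41)
print(f"Welch t = {t_welch:.3f}, df = {df_welch:.2f}")
```

When the smaller group has the larger variance, as in these placeholder values, the Welch degrees of freedom fall close to $n_1 - 1$, well below the pooled value of $n_1 + n_2 - 2$.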

**(c) Test Statistic, Degrees of Freedom, and p-value:**

**Formula:**
The test statistic for the pooled t-test is calculated as:

$t_0 = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$

Where $s_p^2$ is the pooled variance, calculated as:

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

**Given in the Problem:**
- $t_0$: The calculated t-statistic from the SAS output
- Degrees of freedom (df): $n_1 + n_2 - 2$
- p-value: From the SAS output
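The two formulas above can be sketched in a few lines of Python. The sample means and standard deviations are hypothetical placeholders (the actual values would be read from the SAS output); only the sample sizes come from the problem.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled (equal-variance) t statistic, its degrees of freedom, and s_p^2."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df  # pooled variance
    t0 = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t0, df, sp2

# Hypothetical summary statistics (NOT from the problem); n1 = 7 and n2 = 41 are given.
t0, df, sp2 = pooled_t(mean1=120.0, sd1=30.0, n1=7,
                       mean2=95.0, sd2=20.0, n2=41)
print(f"t0 = {t0:.3f}, df = {df}, pooled variance = {sp2:.2f}")
```

Whatever the summary statistics turn out to be, the degrees of freedom are fixed at $7 + 41 - 2 = 46$.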

**(d) Conclusion:**

**Interpretation:**
If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis ($H_0$) and conclude that there is sufficient evidence to suggest that the mean sVAP-1 level in diabetics on insulin treatment is higher than in diabetics on other treatments.

**(e) 95% Confidence Interval for the Pooled Population Variances:**

**Formula:**
The confidence interval for the pooled variance is calculated using the following:

$\text{CI} = \left( \frac{(n_1 + n_2 - 2)s_p^2}{\chi^2_{\alpha/2, \text{df}}}, \frac{(n_1 + n_2 - 2)s_p^2}{\chi^2_{1-\alpha/2, \text{df}}} \right)$

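Where $\text{df} = n_1 + n_2 - 2$, and $\chi^2_{\alpha/2, \text{df}}$ and $\chi^2_{1-\alpha/2, \text{df}}$ are the critical values of the chi-square distribution with df degrees of freedom. The subscript denotes the upper-tail area, so $\chi^2_{\alpha/2, \text{df}}$ is the larger value and gives the lower limit. For a 95% interval, $\alpha = 0.05$.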

**Calculation:**
Using the variance information from the output, plug in the values to compute the confidence interval.

**Interpretation:**
The confidence interval provides a range of plausible values for the population variance, taking into account the uncertainty in the sample data.
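A minimal sketch of this interval in Python follows. To stay within the standard library, it replaces the table lookup with the Wilson-Hilferty approximation to the chi-square quantiles, which is an assumption of this sketch and not part of the original solution. The pooled variance of 100 is a placeholder.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, df):
    """Approximate chi-square quantile for lower-tail probability p (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c**0.5) ** 3

def variance_ci(s2, df, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a variance."""
    larger_crit = chi2_quantile_wh(1 - alpha / 2, df)   # upper-tail area alpha/2
    smaller_crit = chi2_quantile_wh(alpha / 2, df)      # upper-tail area 1 - alpha/2
    return df * s2 / larger_crit, df * s2 / smaller_crit

df = 7 + 41 - 2                                   # 46, from the problem's sample sizes
ci_low, ci_high = variance_ci(s2=100.0, df=df)    # s2 = 100 is a hypothetical value
print(f"95% CI for the variance: ({ci_low:.1f}, {ci_high:.1f})")
```

The interval is not symmetric around the point estimate because the chi-square distribution is right-skewed.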



### Problem 2: Association Between Maternal Rubella and Congenital Cataracts

**Objective:** Determine if there is an association between maternal rubella and congenital cataracts in children using a chi-square test.

**Given Data:**
- Sample of 20 children with congenital cataracts
- Sample of 25 children without congenital cataracts
- Mothers were asked whether they had rubella while carrying the child

**Data Summary:**

|                | Cataracts | No Cataracts | Total |
|----------------|-----------|--------------|-------|
| Rubella        | 16        |  5           | 21    |
| No Rubella     | 4         | 20           | 24    |
| Total          | 20        | 25           | 45    |

**(a) Appropriateness of the Chi-Square Test:**

**Verification:**
- The chi-square test is appropriate if the expected frequencies in each cell of the contingency table are at least 5.
- Calculation of expected frequencies:


$E_{ij} = \frac{( \text{row total} \times \text{column total})}{\text{grand total}}$

|                | Cataracts                         | No Cataracts                      |
|----------------|-----------------------------------|-----------------------------------|
| Rubella        | $\frac{20 \times 21}{45} = 9.33$  | $\frac{25 \times 21}{45} = 11.67$ |
| No Rubella     | $\frac{20 \times 24}{45} = 10.67$ | $\frac{25 \times 24}{45} = 13.33$ |

All expected frequencies are greater than 5, so the chi-square test is appropriate.
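The expected-count check can be reproduced directly from the observed table. This is a small sketch; the variable names are illustrative.

```python
# Observed counts: rows = rubella / no rubella, columns = cataracts / no cataracts
observed = [[16, 5],
            [4, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_ij = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
all_at_least_5 = all(e >= 5 for row in expected for e in row)

for row in expected:
    print([round(e, 2) for e in row])
print("All expected counts >= 5:", all_at_least_5)
```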

**(b) Chi-Square Test for Independence:**

**Hypotheses:**

**Null Hypothesis ($H_0$):**

$H_0: \text{Maternal rubella is independent of congenital cataracts in children.}$

**Alternative Hypothesis ($H_A$):**

$H_A: \text{Maternal rubella is associated with congenital cataracts in children.}$

**Test Statistic:**

**Formula:**

$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency.

**Calculation:**

|                | Cataracts | No Cataracts | Total |
|----------------|-----------|--------------|-------|
| Rubella        | 16        |  5           | 21    |
| No Rubella     | 4         | 20           | 24    |
| Total          | 20        | 25           | 45    |


$\chi^2 = \frac{(16 - 9.33)^2}{9.33} + \frac{(4 - 10.67)^2}{10.67} + \frac{(5 - 11.67)^2}{11.67} + \frac{(20 - 13.33)^2}{13.33}$



$\chi^2 = \frac{(6.67)^2}{9.33} + \frac{(-6.67)^2}{10.67} + \frac{(-6.67)^2}{11.67} + \frac{(6.67)^2}{13.33}$


$\chi^2 = \frac{44.49}{9.33} + \frac{44.49}{10.67} + \frac{44.49}{11.67} + \frac{44.49}{13.33}$


$\chi^2 = 4.77 + 4.17 + 3.81 + 3.34$


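$\chi^2 = 16.09$

(Carrying unrounded expected counts through the calculation gives $\chi^2 \approx 16.07$; the conclusion is unchanged.)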

**Degrees of Freedom (df):**

$df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$

**p-value:**
Using a chi-square distribution table, the p-value corresponding to $\chi^2 = 16.09$ and df = 1 is less than 0.001.
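As a cross-check on the hand calculation, the statistic and its p-value can be computed without intermediate rounding. The sketch relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper-tail probability is `erfc(sqrt(x / 2))`; this shortcut applies only when df = 1.

```python
import math

observed = [[16, 5],
            [4, 20]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (O - E)^2 / E
chi2 = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        e = r * c / n
        chi2 += (observed[i][j] - e) ** 2 / e

# Upper-tail probability for df = 1 only
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p < 0.001: {p_value < 0.001}")
```

The unrounded statistic is about 16.07, slightly below the 16.09 obtained with rounded expected counts, and the p-value remains well below 0.001.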

**(c) Conclusion:**

**Interpretation:**
Since the p-value is less than the significance level of 0.05, we reject the null hypothesis ($H_0$). Therefore, there is sufficient evidence to conclude that maternal rubella is associated with congenital cataracts in children.


### Problem 3: Effects of Various Drugs on Peptide Levels in Rats with High Blood Pressure

**Objective:** Analyze the effects of various treatments and gender on peptide levels in rats using ANOVA.

**Given Data:**
- 24 male rats and 24 female rats
- 3 treatment groups: candesartan, candesartan plus angiotensin II, control
- 2 × 3 factorial design (2 genders × 3 treatments)
- Sample variance for y-values: $s^2_y$ = 85.106383

**Partial ANOVA Table Provided:**

| Source               | SS     | d.f. | MS      | F    |
|----------------------|--------|------|---------|------|
| Model                |        |      |         |      |
| Gender               | 1410   |      |         |      |
| Treatment            | 1056   |      |         |      |
| Gender x Treatment   |        |      |         |      |
| Error                | 1470   |      |         |      |
| Total                | 4000   |      |         |      |

**(a) Filling in the ANOVA Table:**

1. **Degrees of Freedom:**


$df_{\text{Total}} = N - 1 = 48 - 1 = 47$

$df_{\text{Error}} = N - \text{number of groups} = 48 - 6 = 42$

$df_{\text{Model}} = df_{\text{Total}} - df_{\text{Error}} = 47 - 42 = 5$

$df_{\text{Gender}} = 1$

$df_{\text{Treatment}} = 2$

$df_{\text{Gender x Treatment}} = df_{\text{Model}} - df_{\text{Gender}} - df_{\text{Treatment}} = 5 - 1 - 2 = 2$

2. **Sum of Squares (SS):**


$SS_{\text{Total}} = (N - 1)s^2_y = 47 \times 85.106383 \approx 4000$ (matching the value given in the table)

$SS_{\text{Error}} = 1470$

$SS_{\text{Model}} = SS_{\text{Total}} - SS_{\text{Error}} = 4000 - 1470 = 2530$

$SS_{\text{Gender x Treatment}} = SS_{\text{Model}} - SS_{\text{Gender}} - SS_{\text{Treatment}} = 2530 - 1410 - 1056 = 64$

3. **Mean Squares (MS):**

$MS_{\text{Model}} = \frac{SS_{\text{Model}}}{df_{\text{Model}}} = \frac{2530}{5} = 506$

$MS_{\text{Gender}} = \frac{SS_{\text{Gender}}}{df_{\text{Gender}}} = \frac{1410}{1} = 1410$

$MS_{\text{Treatment}} = \frac{SS_{\text{Treatment}}}{df_{\text{Treatment}}} = \frac{1056}{2} = 528$

$MS_{\text{Gender x Treatment}} = \frac{SS_{\text{Gender x Treatment}}}{df_{\text{Gender x Treatment}}} = \frac{64}{2} = 32$

$MS_{\text{Error}} = \frac{SS_{\text{Error}}}{df_{\text{Error}}} = \frac{1470}{42} = 35$

4. **F-Statistics:**

$F_{\text{Model}} = \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{506}{35} = 14.46$

$F_{\text{Gender}} = \frac{MS_{\text{Gender}}}{MS_{\text{Error}}} = \frac{1410}{35} = 40.29$

$F_{\text{Treatment}} = \frac{MS_{\text{Treatment}}}{MS_{\text{Error}}} = \frac{528}{35} = 15.09$

$F_{\text{Gender x Treatment}} = \frac{MS_{\text{Gender x Treatment}}}{MS_{\text{Error}}} = \frac{32}{35} = 0.91$

**Completed ANOVA Table:**

| Source               | SS     | d.f. | MS      | F    |
|----------------------|--------|------|---------|------|
| Model                | 2530   | 5    | 506     | 14.46|
| Gender               | 1410   | 1    | 1410    | 40.29|
| Treatment            | 1056   | 2    | 528     | 15.09|
| Gender x Treatment   | 64     | 2    | 32      | 0.91 |
| Error                | 1470   | 42   | 35      |      |
| Total                | 4000   | 47   |         |      |
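The table can be rebuilt programmatically from the four given sums of squares, which is a convenient way to double-check the arithmetic. The dictionary layout below is just one possible arrangement.

```python
n_total, a, b = 48, 2, 3            # 48 rats, 2 genders, 3 treatments
ss_total, ss_error = 4000.0, 1470.0
ss_gender, ss_treatment = 1410.0, 1056.0

# Consistency check: SS_Total should equal (N - 1) * s_y^2
assert abs((n_total - 1) * 85.106383 - ss_total) < 1e-3

df = {
    "Model": a * b - 1,
    "Gender": a - 1,
    "Treatment": b - 1,
    "Gender x Treatment": (a - 1) * (b - 1),
    "Error": n_total - a * b,
}
ss = {
    "Model": ss_total - ss_error,
    "Gender": ss_gender,
    "Treatment": ss_treatment,
    "Gender x Treatment": ss_total - ss_error - ss_gender - ss_treatment,
    "Error": ss_error,
}
ms = {source: ss[source] / df[source] for source in ss}
f_stat = {source: ms[source] / ms["Error"] for source in ss if source != "Error"}

for source in ss:
    f_text = f"{f_stat[source]:.2f}" if source in f_stat else ""
    print(f"{source:<20} {ss[source]:>7.0f} {df[source]:>3} {ms[source]:>8.1f} {f_text:>6}")
```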

**(b) Overall F-Test Significance:**

**Calculation:**

$F_{\text{Model}} = \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{SS_{\text{Model}} / df_{\text{Model}}}{MS_{\text{Error}}} = \frac{2530 / 5}{35} = 14.46$

**Interpretation:**
Since the F-value of 14.46 far exceeds the critical value $F_{0.05;\,5,\,42} \approx 2.44$, we reject the hypothesis that all six cell means are equal and conclude that the overall model is significant at the 0.05 level.

**(c) Significance of Interaction:**

**Interpretation:**
The F-value for the Gender x Treatment interaction is 0.91, which is below the critical value $F_{0.05;\,2,\,42} \approx 3.22$ and therefore not significant at the 0.05 level. There is no significant interaction between treatment and gender.

**(d) Interpretation of Interaction:**

**Interpretation:**
The lack of a significant interaction suggests that the effect of treatment on peptide levels is consistent across male and female rats. This means that the treatment effect does not depend on the gender of the rats.

**(e) Significance of Treatments:**

**Interpretation:**
The F-value for treatment is 15.09, which exceeds the critical value $F_{0.05;\,2,\,42} \approx 3.22$, indicating a significant difference among the three treatments. At least one treatment mean differs from the others.

**(f) Significance of Gender:**

**Interpretation:**
The F-value for gender is 40.29, which exceeds the critical value $F_{0.05;\,1,\,42} \approx 4.07$, indicating a significant difference between male and female rats. Gender has a significant effect on peptide levels.

**(g) Next Step in Analysis:**

**Recommendation:**
Because the interaction is not significant and both main effects are, the next step is to conduct post-hoc pairwise comparisons (e.g., Tukey’s HSD) among the three treatment means, averaged over gender, to determine which specific treatments differ from each other. Gender has only two levels, so its significant F-test already establishes that males and females differ and no further comparison is needed. It would also be helpful to examine main effects plots to visualize the effects of treatment and gender on peptide levels.

### Problem 4: Multiple Linear Regression Analysis

**Objective:** Perform a multiple linear regression (MLR) analysis to determine if multicollinearity exists and assess its impact on the model.

**Given Data:**
- 6 independent variables (X1 - X6)
- n = 166

**(a) Correlation Matrix Analysis:**

**Given Correlation Matrix:**

| Variable | Y       | X1      | X2      | X3      | X4      | X5      | X6      |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Y        | 1.00000 | 0.62513 | -0.10533| -0.04553| -0.02666| 0.92444 | 0.94290 |
| X1       | 0.62513 | 1.00000 | -0.07462| -0.05846| 0.02681 | 0.50680 | 0.52256 |
| X2       | -0.10533| -0.07462| 1.00000 | -0.05532| -0.25287| -0.08218| -0.08604|
| X3       | -0.04553| -0.05846| -0.05532| 1.00000 | 0.80498 | -0.04046| -0.02660|
| X4       | -0.02666| 0.02681 | -0.25287| 0.80498 | 1.00000 | -0.01602| -0.00597|
| X5       | 0.92444 | 0.50680 | -0.08218| -0.04046| -0.01602| 1.00000 | 0.99424 |
| X6       | 0.94290 | 0.52256 | -0.08604| -0.02660| -0.00597| 0.99424 | 1.00000 |

**Potential Multicollinearity Issues:**

- High correlation between X5 and X6 (r = 0.99424)
- High correlation between X3 and X4 (r = 0.80498)
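The screening step can be sketched as a scan of the predictor block of the correlation matrix. The cutoff of 0.8 is a rule-of-thumb assumption, not a value stated in the problem.

```python
names = ["X1", "X2", "X3", "X4", "X5", "X6"]

# Predictor-by-predictor correlations copied from the matrix above (Y row/column omitted)
corr = [
    [1.00000, -0.07462, -0.05846, 0.02681, 0.50680, 0.52256],
    [-0.07462, 1.00000, -0.05532, -0.25287, -0.08218, -0.08604],
    [-0.05846, -0.05532, 1.00000, 0.80498, -0.04046, -0.02660],
    [0.02681, -0.25287, 0.80498, 1.00000, -0.01602, -0.00597],
    [0.50680, -0.08218, -0.04046, -0.01602, 1.00000, 0.99424],
    [0.52256, -0.08604, -0.02660, -0.00597, 0.99424, 1.00000],
]

THRESHOLD = 0.8  # assumed rule-of-thumb cutoff for "high" pairwise correlation

# Scan the upper triangle so each pair is reported once
flagged = [
    (names[i], names[j], corr[i][j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i][j]) >= THRESHOLD
]
for x, y, r in flagged:
    print(f"{x} and {y}: r = {r:.5f}")
```

Pairwise correlations only detect collinearity between two variables at a time, which is why tolerance and condition indices are examined in part (b).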

**Remedial Measures:**

- Variance Inflation Factor (VIF) analysis to quantify multicollinearity.
- Consider removing or combining highly correlated variables (e.g., X5 and X6).

**(b) MLR Model Diagnostics:**

**SAS Output for MLR Model:**

| Variable   | Parameter Estimate | Tolerance    |
|------------|--------------------|--------------|
| Intercept  | 6.65802            |              |
| X1         | 11.47388           | 0.69608      |
| X2         | -0.02785           | 0.86639      |
| X3         | 0.07244            | 0.31805      |
| X4         | -0.03181           | 0.30292      |
| X5         | -0.00053889        | 0.01101      |
| X6         | 0.00002486         | 0.01075      |

**Collinearity Diagnostics:**

| Number | Eigenvalue | Condition Index | Proportion of Variation |
|--------|------------|-----------------|------------------------|
| 1      | 4.91389    | 1.00000         |                        |
| 2      | 1.94602    | 1.58905         |                        |
| 3      | 0.10090    | 6.97871         |                        |
| 4      | 0.02710    | 13.46511        |                        |
| 5      | 0.00555    | 29.74993        |                        |
| 6      | 0.00465    | 32.51359        |                        |
| 7      | 0.00189    | 50.96699        |                        |

**Analysis:**

**(i) Tolerance and Condition Index Diagnostics:**

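- Tolerance values for X5 (0.01101) and X6 (0.01075) are very low, far below the common 0.1 cutoff. The corresponding VIFs (1/tolerance) are about 91 and 93, indicating severe multicollinearity.
- The two largest condition indices (32.51 and 50.97) exceed 30, which also points to serious multicollinearity.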
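Both diagnostics follow from simple formulas, VIF = 1 / tolerance and condition index = $\sqrt{\lambda_{\max} / \lambda_i}$, so the SAS values can be checked by hand. The cutoffs of 10 for VIF and 30 for the condition index are common rules of thumb assumed here.

```python
import math

tolerance = {"X1": 0.69608, "X2": 0.86639, "X3": 0.31805,
             "X4": 0.30292, "X5": 0.01101, "X6": 0.01075}
eigenvalues = [4.91389, 1.94602, 0.10090, 0.02710, 0.00555, 0.00465, 0.00189]

# Variance inflation factor is the reciprocal of tolerance
vif = {name: 1.0 / tol for name, tol in tolerance.items()}
high_vif = [name for name, value in vif.items() if value > 10]  # assumed cutoff

# Condition index: sqrt(largest eigenvalue / each eigenvalue)
condition_index = [math.sqrt(eigenvalues[0] / ev) for ev in eigenvalues]
high_ci = [k + 1 for k, ci in enumerate(condition_index) if ci > 30]  # assumed cutoff

for name, value in vif.items():
    print(f"{name}: VIF = {value:.2f}")
print("Variables with VIF > 10:", high_vif)
print("Dimensions with condition index > 30:", high_ci)
```

The recomputed condition indices differ from the SAS column only in the second decimal place, because the eigenvalues in the table are rounded.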

**(ii) Characteristics Suggesting Adverse Effects:**

1. **High Variance in Coefficients:**
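   - With VIFs of roughly 91 and 93 (1/tolerance), the coefficient estimates for X5 and X6 are unstable. One symptom is that the X5 estimate is negative (-0.00053889) even though X5 is strongly positively correlated with Y (r = 0.92444).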

2. **Non-significant t-values despite high R-squared:**
   - The model shows high R-squared (0.9254) but some variables (X2, X3, X4) have non-significant t-values, indicating possible multicollinearity.

**(c) Next Step in Analysis:**

**Recommendation:**

1. **Remove or Combine Highly Correlated Variables:**
   - Consider removing either X5 or X6, or combining them into a single variable to reduce multicollinearity.

2. **Rerun the MLR Analysis:**
   - After addressing multicollinearity, rerun the regression analysis and re-evaluate the model.

3. **Use Principal Component Analysis (PCA):**
   - Alternatively, apply PCA to transform the correlated variables into a smaller set of uncorrelated components.
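To give a sense of what PCA would do with the X5 and X6 pair, consider the special case of two standardized variables with correlation $r$: their correlation matrix has eigenvalues $1 + r$ and $1 - r$. The sketch below assumes X5 and X6 are standardized and looks at that pair in isolation, which is a simplification of a full PCA on all six predictors.

```python
r = 0.99424  # correlation between X5 and X6 from the matrix in part (a)

# Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]]
eigenvalues = (1 + r, 1 - r)

# Share of the pair's total variance captured by the first principal component
share_first = eigenvalues[0] / sum(eigenvalues)
print(f"First component explains {share_first:.2%} of the X5/X6 variance")
```

Under these assumptions, a single component proportional to the sum of the standardized X5 and X6 retains nearly all of the information in the pair, which is consistent with the recommendation to drop or combine them.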

**Justification:**

- Reducing multicollinearity will lead to more stable and interpretable coefficient estimates, improving the overall reliability of the regression model.