### Problem 1: Complete the sentence

**Objective:** 
- The objective is to complete the sentence, "In the NHST framework, the p-value is the probability of…"

**Given Data:** 
- The sentence to complete is related to the p-value in the context of Null Hypothesis Significance Testing (NHST).

**Explanation:**
- In the NHST (Null Hypothesis Significance Testing) framework, the p-value represents the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is true.

**Interpretation:**
- The correct completion of the sentence is:
  - **“In the NHST framework, the p-value is the probability of observing data as extreme as, or more extreme than, what was actually observed, under the assumption that the null hypothesis is true.”**
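As a small illustration of this definition, the sketch below computes an exact one-sided p-value for a made-up coin-flip experiment. The data (9 heads in 10 flips), the null value `p0 = 0.5`, and the helper name `binomial_upper_tail` are all assumptions chosen for the example, not part of the problem.

```python
from math import comb

def binomial_upper_tail(k_obs: int, n: int, p0: float) -> float:
    """P(X >= k_obs) for X ~ Binomial(n, p0), i.e. the one-sided p-value."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(k_obs, n + 1))

# Hypothetical data: 9 heads in 10 flips; null hypothesis: the coin is fair.
p_value = binomial_upper_tail(9, 10, 0.5)
print(p_value)  # 11/1024, about 0.0107
```

The sum runs over every outcome at least as extreme as the observed one, with probabilities computed under the null hypothesis, which is exactly the quantity described in the completed sentence.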


### Problem 2: Linear Regression Assumptions

**Objective:** 
- The objective is to outline the assumptions of linear regression, specify a simple linear regression model, and analyze situations where these assumptions might be violated.

#### Part A: Write out the four usual assumptions of linear regression.

**Given Data:**
- Four assumptions of linear regression are required.

**Explanation:**
- The four usual assumptions of linear regression are:

1. **Linearity:** The relationship between the dependent variable y and the independent variable(s) X is linear.
2. **Independence:** The residuals (errors) are independent of each other. This means that there is no correlation between the errors for different observations.
3. **Homoscedasticity (Constant Variance):** The residuals have constant variance across all levels of the independent variable(s). This implies that the spread of the residuals should be roughly the same at all levels of X.
4. **Normality of Residuals:** The residuals of the model are normally distributed. This assumption is important for making valid inferences about the regression coefficients.

**Interpretation:**
- These assumptions are critical for the validity of hypothesis tests and confidence intervals in linear regression. Violations of these assumptions can lead to biased estimates and incorrect conclusions.

#### Part B: Write the complete formal specification of the simple linear regression model.

**Formula:**
- The formal specification of the simple linear regression model can be written as:
  $
  y_i = \beta_0 + \beta_1 x_i + \epsilon_i
  $
  where:
  - $y_i$ is the dependent variable for observation i,
  - $x_i$ is the independent variable for observation i,
  - $\beta_0$ is the intercept of the regression line,
  - $\beta_1$ is the slope of the regression line,
  - $\epsilon_i$ is the error term for observation i; the errors are assumed to be independent and normally distributed with mean 0 and constant variance $\sigma^2$ (i.e., $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$).

**Interpretation:**
- This equation represents the expected relationship between the independent variable (x) and the dependent variable (y), with the error term accounting for the deviation from this relationship.

#### Part C: Correlated Covariates and Response

**Objective:**
- Determine if the assumptions of linear regression are still satisfied when covariates are correlated and potentially correlated with the response.

**Given Data:**
- Two covariates $x_1$ and $x_2$ are correlated, and the response (y) is suspected to be correlated with at least one of the covariates.

**Explanation:**
- The four assumptions concern the error terms and the form of the mean function, not the relationships among the covariates. Correlation between $x_1$ and $x_2$ (multicollinearity) therefore does not violate any of them, provided the two covariates are not perfectly collinear. Correlation between the response (y) and a covariate is not a violation either; it is exactly the relationship the regression is meant to estimate.

**Interpretation:**
- The assumptions can still be satisfied. Multicollinearity does not violate the basic assumptions of the linear model, but it can lead to inflated standard errors and make it difficult to determine the individual effect of each predictor. If multicollinearity is severe, techniques like Ridge Regression or Principal Component Analysis can be used to address it.
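The inflation of standard errors can be quantified with the variance inflation factor (VIF). With exactly two covariates, the VIF of each one is $1/(1 - r^2)$, where r is their sample correlation. The sketch below uses made-up values for `x1` and `x2`; the helper `pearson_r` is written out by hand for the example.

```python
def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / (s_aa * s_bb) ** 0.5

# Hypothetical, strongly correlated covariates.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]

r = pearson_r(x1, x2)
vif = 1.0 / (1.0 - r**2)  # valid as written only for the two-covariate case
print(r, vif)
```

A VIF far above 1 means the variance of the corresponding coefficient estimate is much larger than it would be with uncorrelated covariates, even though no model assumption is violated.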

#### Part D: Example Violating Normality Assumption

**Objective:**
- Provide a real-world example that would likely violate the normality assumption of residuals.

**Example:**
- A dataset that records income levels in a population is likely to violate the normality assumption. Income data often follows a right-skewed distribution, with a large number of individuals earning below the average and a few earning significantly more, leading to a long right tail. When income is the response in a linear regression, this skewness typically carries over to the residuals, which are then not normally distributed.

**Interpretation:**
- Non-normality in residuals can affect the validity of hypothesis tests in the linear regression model. Transformations of the dependent variable or using robust statistical methods can sometimes correct this violation.

#### Part E: Example Violating Independence Assumption

**Objective:**
- Provide a real-world example that would likely violate the independence assumption.

**Example:**
- Time series data, such as daily stock prices, would likely violate the independence assumption. Observations in time series data are often autocorrelated, meaning that past values influence future values.

**Interpretation:**
- Violation of independence can lead to underestimated standard errors and incorrect inferences. In such cases, time series models like ARIMA should be used instead of ordinary linear regression.
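One common diagnostic for this kind of dependence is the Durbin-Watson statistic, which is close to 2 when consecutive residuals are uncorrelated and close to 0 under strong positive lag-1 autocorrelation. The residual values below are invented for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e**2 for e in residuals)
    return num / den

# Hypothetical residuals that drift slowly, as autocorrelated errors tend to do.
resid = [1.0, 0.8, 0.9, 0.7, -0.2, -0.5, -0.6, -0.4]
dw = durbin_watson(resid)
print(dw)  # well below 2, pointing to positive autocorrelation
```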

#### Part F: Example Violating Constant Variance Assumption

**Objective:**
- Provide a real-world example that would likely violate the constant variance assumption.

**Example:**
- A dataset that records house prices across different neighborhoods might violate the constant variance assumption. In wealthier neighborhoods, the variance in house prices is often larger compared to less affluent areas, leading to heteroscedasticity (non-constant variance).

**Interpretation:**
- Heteroscedasticity can lead to inefficient estimates and invalid inferences. To address this, weighted least squares or robust standard errors can be used to adjust for non-constant variance.

### Problem 3: One Sample Normal Model

**Objective:** 
- The objective is to work with the one-sample normal model, derive maximum likelihood estimates (MLEs), and compute biases.

#### Part A: Write down the likelihood for $y_1$, $y_2$, $\dots$, $y_n$.

**Given Data:**
- The observations $y_1, y_2, \dots, y_n$ are assumed to be independent and identically distributed (i.i.d.) following a normal distribution $N(\mu, \sigma^2)$.

**Formula:**
- The likelihood function for a sample of n observations from a normal distribution $N(\mu, \sigma^2)$ is given by:
  $
  L(\mu, \sigma^2 | y_1, y_2, \dots, y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)
  $
  This is the joint density of the observed data given the parameters $\mu$ and $\sigma^2$; the product form follows from independence.

**Interpretation:**
- The likelihood function is the joint density of the observed data, viewed as a function of the parameters $\mu$ and $\sigma^2$.

#### Part B: Derive the MLE of $\mu$ assuming $\sigma^2$ is known.

**Formula:**
- To find the MLE of $\mu$, we take the natural logarithm of the likelihood function (log-likelihood) and differentiate it with respect to $\mu$:
  
  $
  \log L(\mu | y_1, y_2, \dots, y_n) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2
  $
  
  Differentiating with respect to $\mu$:
  
  $
  \frac{\partial \log L(\mu | y_1, y_2, \dots, y_n)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mu)
  $
  
  Setting this derivative to zero:
  
  $
  \sum_{i=1}^{n} (y_i - \mu) = 0
  $
  
  Solving for $\mu$:
  
  $
  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i
  $

**Interpretation:**
- The MLE for $\mu$ is the sample mean $\hat{\mu}$, which is the most likely value of $\mu$ given the observed data when $\sigma^2$ is known.

#### Part C: Derive the MLE of $\sigma^2$ when $\mu$ is known.

**Formula:**
- With $\mu$ known, the log-likelihood function is:
  
  $
  \log L(\sigma^2 | y_1, y_2, \dots, y_n) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2
  $
  
  Differentiating with respect to $\sigma^2$:
  
  $
  \frac{\partial \log L(\sigma^2 | y_1, y_2, \dots, y_n)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \mu)^2
  $
  
  Setting this derivative to zero:
  
  $
  \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)^2
  $

**Interpretation:**
- The MLE for $\sigma^2$ is the average of the squared deviations from the mean $\mu$, representing the most likely value of the variance given the observed data.

#### Part D: Derive the MLEs for $\mu$ and $\sigma^2$ when both are unknown.

**Formula:**
- The log-likelihood function when both $\mu$ and $\sigma^2$ are unknown is:
  
  $
  \log L(\mu, \sigma^2 | y_1, y_2, \dots, y_n) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2
  $
  
  To find the MLEs, differentiate with respect to $\mu$ and $\sigma^2$ separately and set the derivatives to zero:
  - For $\mu$:
    
    $
    \frac{\partial \log L(\mu, \sigma^2 | y_1, y_2, \dots, y_n)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mu) = 0 \quad \Rightarrow \quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i
    $
    
  - For $\sigma^2$:
    
    $
    \frac{\partial \log L(\mu, \sigma^2 | y_1, y_2, \dots, y_n)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \hat{\mu})^2 = 0 \quad \Rightarrow \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2
    $
    

**Interpretation:**
- The MLEs for $\mu$ and $\sigma^2$ are the sample mean $\hat{\mu}$ and the average squared deviation from the sample mean $\hat{\sigma}^2$ (the sample variance with divisor n rather than n - 1), respectively. These are the parameter values under which the observed data are most likely.

#### Part E: Compute the bias for both of the MLEs you calculated in the previous part.

**Explanation:**
- **Bias of $\hat{\mu}$**:
  - The sample mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is an unbiased estimator of $\mu$, meaning:
    
    $
    \text{Bias}(\hat{\mu}) = E(\hat{\mu}) - \mu = \mu - \mu = 0
    $
    
- **Bias of $\hat{\sigma}^2$**:
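  - The MLE $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2$ is a biased estimator of $\sigma^2$. Because $\sum_{i=1}^{n} (y_i - \hat{\mu})^2 / \sigma^2$ follows a $\chi^2_{n-1}$ distribution, which has mean $n - 1$, we have $E(\hat{\sigma}^2) = \frac{n-1}{n}\sigma^2$, so: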
    
    $
    \text{Bias}(\hat{\sigma}^2) = E(\hat{\sigma}^2) - \sigma^2 = \left(\frac{n-1}{n}\right)\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}
    $
    
    Thus, the bias of $\hat{\sigma}^2$ is $-\frac{\sigma^2}{n}$.

**Interpretation:**
- The sample mean $\hat{\mu}$ is unbiased, while the sample variance $\hat{\sigma}^2$ is biased downward. This bias can be corrected by using $\frac{1}{n-1}$ instead of $\frac{1}{n}$ in the calculation of the variance, which leads to the unbiased estimator of variance.
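The downward bias can be checked with a quick Monte Carlo sketch. The sample size, true parameters, seed, and number of replications below are arbitrary choices for the example; the simulated average of $\hat{\sigma}^2$ should land near $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.

```python
import random

random.seed(0)
n, mu, sigma2 = 5, 10.0, 4.0  # hypothetical sample size and true parameters
reps = 20000

def mle_variance(sample):
    """MLE of the variance: mean squared deviation from the sample mean (divisor n)."""
    m = sum(sample) / len(sample)
    return sum((y - m) ** 2 for y in sample) / len(sample)

total = 0.0
for _ in range(reps):
    sample = [random.gauss(mu, sigma2**0.5) for _ in range(n)]
    total += mle_variance(sample)

mean_mle = total / reps
expected = (n - 1) / n * sigma2  # theoretical E[sigma_hat^2] = 3.2
print(mean_mle, expected)
```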

### Problem 4: One-Way ANOVA

**Objective:** 
- The objective is to verify an identity in ANOVA, write the null hypothesis, describe key ANOVA terms, and address various aspects related to the ANOVA model.

#### Part A: Verify the identity $\text{SST} = \text{SSW} + \text{SSB}$.

**Given Data:**
- SST: Total Sum of Squares
- SSW: Within-group Sum of Squares
- SSB: Between-group Sum of Squares

**Formula:**
- The identity $\text{SST} = \text{SSW} + \text{SSB}$ can be derived as follows:
  
  1. **Total Sum of Squares (SST):**
     $
     \text{SST} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2
     $
     
     where $y_{ij}$ is the i-th observation in group j and $\bar{y}$ is the overall mean.

  2. **Within-group Sum of Squares (SSW):**
     
     $
     \text{SSW} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2
     $
     
     where $\bar{y}_j$ is the mean of group j and $n_j$ is the number of observations in group j.

  3. **Between-group Sum of Squares (SSB):**
     $
     \text{SSB} = \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2
     $

**Calculation:**
- To verify the identity, we express the total variation $\text{SST}$ as the sum of the variation within groups $\text{SSW}$ and the variation between groups $\text{SSB}$:

  $
  \text{SST} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2
  $

  Decompose $(y_{ij} - \bar{y})$ into two parts:
  $
  y_{ij} - \bar{y} = (y_{ij} - \bar{y}_j) + (\bar{y}_j - \bar{y})
  $

  Expanding the square:
  $
  (y_{ij} - \bar{y})^2 = (y_{ij} - \bar{y}_j)^2 + 2(y_{ij} - \bar{y}_j)(\bar{y}_j - \bar{y}) + (\bar{y}_j - \bar{y})^2
  $

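  Summing over all observations:
  $
  \text{SST} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 + 2 \sum_{j=1}^{k} (\bar{y}_j - \bar{y}) \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j) + \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2
  $

  The cross-product term equals zero because, within each group, the deviations from the group mean sum to zero: $\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j) = 0$. The first term is $\text{SSW}$, and the last term is $\text{SSB}$ because $(\bar{y}_j - \bar{y})^2$ does not depend on i and is therefore added $n_j$ times.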

  Therefore:
  $
  \text{SST} = \text{SSW} + \text{SSB}
  $

**Interpretation:**
- The identity $\text{SST} = \text{SSW} + \text{SSB}$ confirms that the total variability in the data is the sum of the variability within groups and the variability between groups. This is a fundamental result in ANOVA that allows partitioning the total variation into components attributable to different sources.
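The decomposition can also be checked numerically. The sketch below uses three small made-up groups of unequal size and computes each sum of squares directly from its definition.

```python
# Hypothetical data: three groups with unequal sizes.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [8.0, 9.0]]

all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / len(all_obs)
group_means = [sum(g) / len(g) for g in groups]

sst = sum((y - grand_mean) ** 2 for y in all_obs)
ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

print(sst, ssw, ssb)  # 60.0 7.5 52.5
```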

#### Part B: Write the null hypothesis for the one-way ANOVA model.

**Objective:**
- Formulate the null hypothesis for the one-way ANOVA.

**Hypothesis:**
- The null hypothesis $H_0$ for the one-way ANOVA model is:
  $
  H_0: \mu_1 = \mu_2 = \dots = \mu_k
  $
  
  where $\mu_j$ represents the mean of the j-th group.

**Interpretation:**
- The null hypothesis asserts that all group means are equal, meaning that any observed differences among group means are due to random chance rather than systematic effects.

#### Part C: Provide a (brief) intuitive description of what SST, SSW, and SSB are measuring.

**Explanation:**
- **SST (Total Sum of Squares):** 
  - Measures the total variability in the observed data. It quantifies how much individual observations deviate from the overall mean, providing a measure of the total variation in the dataset.

- **SSW (Within-group Sum of Squares):**
  - Measures the variability within each group. It quantifies how much the individual observations within each group deviate from their respective group means. This captures the variability that is not explained by the grouping factor.

- **SSB (Between-group Sum of Squares):**
  - Measures the variability between the group means. It quantifies how much the group means deviate from the overall mean. This component represents the variability that is explained by the differences between the groups.

**Interpretation:**
- SST represents the overall variation in the data, SSW represents the variation within each group, and SSB represents the variation between groups. The ANOVA test examines whether the between-group variability (SSB) is large enough relative to the within-group variability (SSW) to conclude that the group means are significantly different.

### Problem 5: Drug Dosage Effect Analysis

**Objective:** 
- The objective is to determine the appropriate methods or models to analyze the effects of different drug dosages and to construct linear regression models for the data.

#### Part A: What method or model would you use to determine if the 50mg dose produces a lower response than the 25mg dose?

**Objective:**
- Determine if the 50mg dose produces a lower response than the 25mg dose.

**Method:**
- To determine if the 50mg dose produces a lower response than the 25mg dose, a **paired t-test** or an **independent two-sample t-test** can be used, depending on whether the samples for the 50mg and 25mg doses are paired or independent.

**Explanation:**
- **Paired t-test**: If the same subjects are given both the 25mg and 50mg doses, then a paired t-test would be appropriate. This test compares the means of the two related groups to determine if there is a statistically significant difference between them.

- **Independent two-sample t-test**: If the 25mg and 50mg dose groups are independent (different subjects), an independent two-sample t-test should be used to compare the means of the two groups.

**Formula for Independent t-test:**
$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$

where:
- $\bar{x}_1$ and $\bar{x}_2$ are the sample means for the 25mg and 50mg doses, respectively.
- $s_1^2$ and $s_2^2$ are the sample variances for the two groups.
- $n_1$ and $n_2$ are the sample sizes for the two groups.

**Interpretation:**
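- With $\bar{x}_1$ the 25mg mean and $\bar{x}_2$ the 50mg mean, a lower response at 50mg corresponds to $\bar{x}_1 - \bar{x}_2 > 0$. A significantly positive t-value (with a correspondingly low one-sided p-value) would therefore suggest that the 50mg dose produces a lower response than the 25mg dose.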
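As a rough sketch, the statistic above (the unequal-variance, or Welch, form) can be computed with the standard library alone. The response values are invented, group 1 is taken to be the 25mg dose and group 2 the 50mg dose, and the Welch-Satterthwaite degrees of freedom are included as an assumption about how the reference distribution would be chosen.

```python
from statistics import mean, variance

# Hypothetical responses for two independent groups.
dose_25 = [12.1, 11.4, 13.0, 12.7, 11.9]
dose_50 = [10.2, 9.8, 11.1, 10.5, 9.9]

def welch_t(a, b):
    """Welch t statistic for mean(a) - mean(b) and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # s^2 / n for each group
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(dose_25, dose_50)
print(t_stat, df)
```

Because the difference is taken as 25mg minus 50mg, a large positive `t_stat` is the evidence that the 50mg response is lower; the one-sided p-value would come from the upper tail of a t distribution with `df` degrees of freedom.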

#### Part B: What method or model would you use to determine if the effect of the dose with the highest average response value has a significantly greater effect relative to the other dosages?

**Objective:**
- Determine if the dose with the highest average response has a significantly greater effect than the other dosages.

**Method:**
- **One-way ANOVA** followed by **post hoc tests** (e.g., Tukey's HSD) would be appropriate to compare the mean responses across all dosage levels and determine if the highest response is significantly greater than the others.

**Explanation:**
- **One-way ANOVA**: This test compares the means across multiple groups (dosages in this case) to see if at least one mean is significantly different from the others.
- **Post hoc tests**: If the ANOVA is significant, post hoc tests like Tukey's HSD (Honestly Significant Difference) are conducted to determine which specific groups (dosages) differ from each other.

**Formula for ANOVA F-statistic:**
$
F = \frac{\text{MSB}}{\text{MSW}}
$

where:
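- **MSB (Mean Square Between)** is the between-group sum of squares divided by its degrees of freedom, $\text{SSB}/(k - 1)$, where k is the number of dosage groups.
- **MSW (Mean Square Within)** is the within-group sum of squares divided by its degrees of freedom, $\text{SSW}/(N - k)$, where N is the total number of observations.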

**Interpretation:**
- If the ANOVA shows a significant F-statistic, and the post hoc tests reveal that the dose with the highest average response is significantly greater than the others, then it can be concluded that this dose has a significantly greater effect.

#### Part C: Construct a linear regression model for the sample, assuming nothing is known regarding the relationship of the average response values at the different dosages.

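**Objective:**
- Construct a linear regression model for the sample data that imposes no structure on how the average response changes with dosage.

**Model:**
- Because nothing is assumed about the shape of the dose-response relationship, dosage is treated as a categorical factor. With k distinct dosage levels $d_1, \dots, d_k$ and $d_1$ as the reference level, the model is:
$
y_i = \beta_0 + \sum_{j=2}^{k} \beta_j I(\text{Dose}_i = d_j) + \epsilon_i
$

where:
- $y_i$ is the response variable for observation i.
- $\text{Dose}_i$ represents the dosage level (e.g., 10mg, 25mg, 50mg).
- $I(\text{Dose}_i = d_j)$ is an indicator equal to 1 if observation i received dosage $d_j$ and 0 otherwise.
- $\beta_0$ is the mean response at the reference dosage $d_1$.
- $\beta_j$ is the difference in mean response between dosage $d_j$ and the reference dosage.
- $\epsilon_i$ is the error term.

**Interpretation:**
- This model gives each dosage level its own mean and is equivalent to the one-way ANOVA model. A model of the form $y_i = \beta_0 + \beta_1 \text{Dose}_i + \epsilon_i$ would instead assume that the mean response changes linearly with dosage, which is not justified when nothing is known about the relationship.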

#### Part D: Construct an alternative linear regression model for the sample, now assuming that the average response is quadratic with respect to the dosages.

**Objective:**
- Construct a quadratic regression model to capture a potential nonlinear relationship.

**Model:**
- The quadratic regression model is constructed as:
$
y_i = \beta_0 + \beta_1 \text{Dose}_{i} + \beta_2 \text{Dose}_{i}^2 + \epsilon_i
$

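where:
- $\text{Dose}_{i}$ is the numeric dosage level and $\text{Dose}_{i}^2$ is its square,
- $\beta_0$, $\beta_1$, and $\beta_2$ are the regression coefficients, and $\epsilon_i$ is the error term.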

**Interpretation:**
- This model accounts for the possibility that the relationship between dosage and response is not purely linear but may have a quadratic (curved) relationship. The coefficient $\beta_2$ captures the curvature, indicating how the effect of dosage changes as the dosage level increases.
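The quadratic model is still linear in its coefficients, so ordinary least squares applies. The sketch below solves the normal equations $X^\top X \beta = X^\top y$ by hand for a design matrix with columns 1, Dose, and Dose squared. The dosage levels, the noise-free responses, and the helper names are all made up for illustration; in practice a regression library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_quadratic(dose, y):
    """Least-squares fit of y = b0 + b1*dose + b2*dose^2 via the normal equations."""
    design = [[1.0, d, d * d] for d in dose]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Hypothetical dosages (mg) and responses generated from a known quadratic.
dose = [10.0, 25.0, 50.0, 75.0]
y = [1.0 + 0.5 * d - 0.004 * d * d for d in dose]

b0, b1, b2 = fit_quadratic(dose, y)
print(b0, b1, b2)  # close to 1.0, 0.5, -0.004
```

A negative `b2`, as in this made-up example, describes a response that rises with dosage at a decreasing rate.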

### Problem 6: Simple Linear Regression (SLR)

**Objective:** 
- The objective is to write the likelihood function for simple linear regression, and to derive the maximum likelihood estimate (MLE) for the coefficient, assuming the error variance is known.

#### Part A: Write the likelihood for SLR.

**Given Data:**
- Simple Linear Regression model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.

**Formula:**
- The likelihood function for a simple linear regression model is derived based on the assumption that the errors $\epsilon_i$ are independent and normally distributed with mean 0 and variance $\sigma^2$.

  For n observations, the likelihood function is given by:
  $
  L(\beta_0, \beta_1, \sigma^2 | y_1, y_2, \dots, y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)
  $
  
  - $y_i$ is the observed value,
  - $x_i$ is the value of the independent variable,
  - $\beta_0$ and $\beta_1$ are the intercept and slope of the regression line, respectively,
  - $\sigma^2$ is the error variance.

**Interpretation:**
- The likelihood function represents the probability of observing the given data $(y_1, y_2, \dots, y_n)$ as a function of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$.

#### Part B: Derive the MLE for the coefficient $\beta_1$, assuming the error variance is known.

**Objective:**
- Derive the MLE for $\beta_1$ assuming that the error variance $\sigma^2$ is known.

**Formula:**
- First, take the natural logarithm of the likelihood function to obtain the log-likelihood function:
  $
  \log L(\beta_0, \beta_1 | y_1, y_2, \dots, y_n) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
  $

- To find the MLE for $\beta_1$, differentiate the log-likelihood function with respect to $\beta_1$ and set the derivative equal to zero:

  $
  \frac{\partial \log L(\beta_1)}{\partial \beta_1} = \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  $

- Simplifying this equation:
  $
  \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \beta_0 - \beta_1 \sum_{i=1}^{n} x_i^2 = 0
  $

- Solve for $\beta_1$:
  $
  \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}
  $

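  Here $\hat{\beta}_0$ comes from the corresponding equation for the intercept: setting $\partial \log L / \partial \beta_0 = 0$ gives $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. Substituting this into the expression above and solving for $\hat{\beta}_1$ gives the usual form: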
  $
  \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  $
  
  where:
  - $\bar{x}$ and $\bar{y}$ are the sample means of $x_i$ and $y_i$, respectively.

**Interpretation:**
- The MLE for $\beta_1$ is the slope of the regression line that minimizes the sum of squared differences between the observed values $y_i$ and the values predicted by the model. This estimate is unbiased and consistent under the assumption that the errors $\epsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$.
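A short numerical check, on made-up data, that the centered formula and the expression in terms of $\hat{\beta}_0$ give the same slope:

```python
# Hypothetical data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Centered form of the MLE / least-squares slope.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Uncentered form: (sum x*y - beta0_hat * sum x) / sum x^2.
beta1_alt = (sum(xi * yi for xi, yi in zip(x, y)) - beta0_hat * sum(x)) / sum(xi * xi for xi in x)

print(beta1_hat, beta0_hat, beta1_alt)  # about 1.99, 0.05, 1.99
```

Note that $\sigma^2$ does not appear anywhere in the computation: the maximizing value of $\beta_1$ is the same whatever the known error variance is.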

### Problem 7: Impact of Adding Irrelevant Covariates on $R^2$

**Objective:** 
- The objective is to explain why adding irrelevant covariates (i.e., covariates that are not associated with the response) to a linear regression model will strictly increase the $R^2$ measure.

#### Explanation:

**Given Data:**
- Linear regression model with $R^2$ as a measure of goodness-of-fit.

**Concept:**
- $R^2$, or the coefficient of determination, is a measure that indicates the proportion of the variance in the dependent variable that is predictable from the independent variables in the model. It is calculated as:
  $
  R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}
  $
  
  where:
  - $\text{SS}_{\text{res}}$ is the residual sum of squares (unexplained variation),
  - $\text{SS}_{\text{tot}}$ is the total sum of squares (total variation in the dependent variable).

**Explanation:**
- **Adding Irrelevant Covariates:**
  - When an irrelevant covariate (a variable not associated with the response) is added to the model, it does not explain any systematic variation in the dependent variable. However, the original model is a special case of the larger one (with the new coefficient set to zero), so the minimized residual sum of squares $\text{SS}_{\text{res}}$ cannot increase: it decreases or, at worst, stays the same.
  
  - Even though the irrelevant covariate does not contribute meaningful information, the mathematical structure of the least squares estimation process can still find a coefficient for it that minimizes the residual sum of squares. This is because the $R^2$ measure is calculated purely based on the reduction in $\text{SS}_{\text{res}}$, regardless of whether the reduction is statistically meaningful.

- **Impact on $R^2$:**
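  - $R^2$ is a non-decreasing function of the number of covariates in the model. This means that adding any new covariate, relevant or not, will either increase $R^2$ or leave it unchanged, but it will never decrease it. $R^2$ stays exactly unchanged only when the estimated coefficient of the new covariate is exactly zero, which essentially never happens with continuous data, so in practice the increase is strict.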
  
  - Therefore, even if the added covariate is irrelevant (i.e., its true coefficient is zero), the model's $R^2$ can still increase because the additional parameter allows the model to fit the data more closely, even if only due to random fluctuations.

**Interpretation:**
- The increase in $R^2$ from adding irrelevant covariates does not imply that the model has improved in a meaningful way. Instead, it reflects a mathematical artifact of how $R^2$ is calculated. To prevent overfitting and to ensure that added covariates are truly useful, adjusted $R^2$ or other model selection criteria like AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) should be used, which penalize the addition of unnecessary covariates.

**Adjusted $R^2$:**
- Adjusted $R^2$ is defined as:
  $
  \text{Adjusted } R^2 = 1 - \left( \frac{(1-R^2)(n-1)}{n-p-1} \right)
  $
  
  where n is the number of observations, and p is the number of predictors. This version of $R^2$ takes into account the number of predictors in the model and only increases if the added covariate improves the model sufficiently to justify the increase in complexity.
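The effect can be seen in a small simulation. The sketch below fits `y` on `x`, then adds a pure-noise covariate `z`. To stay within the standard library it uses the Frisch-Waugh-Lovell result: the residual sum of squares after adding `z` equals the old one minus $(e_y^\top e_z)^2 / (e_z^\top e_z)$, where $e_y$ and $e_z$ are the residuals of `y` and `z` after regressing each on the intercept and `x`. The data-generating values, seed, and helper names are assumptions for the example.

```python
import random

def slr_residuals(x, v):
    """Residuals from regressing v on an intercept and x."""
    n = len(x)
    x_bar, v_bar = sum(x) / n, sum(v) / n
    slope = sum((a - x_bar) * (b - v_bar) for a, b in zip(x, v)) / sum((a - x_bar) ** 2 for a in x)
    return [b - v_bar - slope * (a - x_bar) for a, b in zip(x, v)]

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

random.seed(1)
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
z = [random.gauss(0, 1) for _ in range(n)]  # irrelevant: generated independently of y

y_bar = sum(y) / n
ss_tot = sum((yi - y_bar) ** 2 for yi in y)

e_y = slr_residuals(x, y)
ss_res_1 = sum(e * e for e in e_y)  # model with x only

e_z = slr_residuals(x, z)
ss_res_2 = ss_res_1 - sum(a * b for a, b in zip(e_y, e_z)) ** 2 / sum(b * b for b in e_z)

r2_1 = 1 - ss_res_1 / ss_tot
r2_2 = 1 - ss_res_2 / ss_tot
print(r2_1, r2_2)
print(adjusted_r2(r2_1, n, 1), adjusted_r2(r2_2, n, 2))
```

The subtracted term is a square divided by a positive number, so `r2_2` can never fall below `r2_1`; the adjusted values need not move in the same direction.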

### Problem 8: Estimating the Average Growth Rate of Bacteria

**Objective:** 
- The objective is to propose a method for estimating the average growth rate of bacteria given population counts at evenly spaced intervals and to critique a method for averaging growth rates across multiple samples.

#### Part A: The Principal Investigator (PI) asks you to produce a confidence interval for the average growth rate. The data consists of population counts for one culture, recorded at evenly spaced intervals, which can be assumed to be scaled to [0, 1]. Propose an appropriate method or model.

**Given Data:**
- Population counts for one culture recorded at evenly spaced intervals.
- Growth is known to be exponential with an unknown rate.

**Method:**
- The population growth of bacteria can be modeled using an exponential growth model. The exponential growth equation is:
  $
  y(t) = y_0 \exp(rt)
  $
  
  where:
  - $y(t)$ is the population size at time t,
  - $y_0$ is the initial population size,
  - r is the growth rate.

- Taking the natural logarithm of both sides:
  $
  \log y(t) = \log y_0 + rt
  $
  
  This is a linear relationship where $\log y(t)$ is the dependent variable, t is the independent variable, $\log y_0$ is the intercept, and r is the slope (growth rate).

**Steps:**
1. **Fit a Linear Model:**
   - Fit a simple linear regression model using $\log y(t)$ as the response variable and t as the predictor.
   - The slope of the fitted line will be an estimate of the growth rate r.

2. **Estimate the Growth Rate:**
   - The estimated growth rate $\hat{r}$ is the slope from the linear regression.

3. **Construct a Confidence Interval for $\hat{r}$:**
   - Assuming the residuals from the linear regression are normally distributed, construct a confidence interval for the growth rate (r) using the standard error of the slope.
   - The confidence interval can be calculated as:
     $
     \hat{r} \pm t_{\alpha/2, n-2} \cdot \text{SE}(\hat{r})
     $
     
     where $t_{\alpha/2, n-2}$ is the critical value from the t-distribution with $n-2$ degrees of freedom, and $\text{SE}(\hat{r})$ is the standard error of the estimated growth rate.

**Interpretation:**
- This method provides a statistically valid confidence interval for the average growth rate r of the bacteria population based on the observed data.
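The steps above can be sketched as follows. The eight counts are synthetic, generated from a growth rate of 0.3 with small multiplicative perturbations, and the critical value 2.447 is the tabulated $t_{0.025, 6}$ for a 95% interval with $n - 2 = 6$ degrees of freedom; both are assumptions for the example.

```python
from math import exp, log, sqrt

# Synthetic scaled counts at times 0..7 (true rate 0.3, hypothetical perturbations).
t = list(range(8))
noise = [1.02, 0.97, 1.01, 0.99, 1.03, 0.98, 1.00, 1.01]
y = [0.05 * exp(0.3 * ti) * m for ti, m in zip(t, noise)]

# Step 1: regress log(y) on t.
log_y = [log(v) for v in y]
n = len(t)
t_bar, l_bar = sum(t) / n, sum(log_y) / n
s_tt = sum((ti - t_bar) ** 2 for ti in t)

# Step 2: the slope is the estimated growth rate.
r_hat = sum((ti - t_bar) * (li - l_bar) for ti, li in zip(t, log_y)) / s_tt
intercept = l_bar - r_hat * t_bar

# Step 3: standard error of the slope and the confidence interval.
rss = sum((li - intercept - r_hat * ti) ** 2 for ti, li in zip(t, log_y))
se_r = sqrt(rss / (n - 2) / s_tt)
T_CRIT = 2.447  # t_{0.025, 6} from a t table; changes with n and the confidence level
ci = (r_hat - T_CRIT * se_r, r_hat + T_CRIT * se_r)
print(r_hat, ci)
```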

#### Part B: The PI now has data from 4 additional cultures. They ask you to repeat the analysis you did in the previous part on each of these new samples so that the average growth rates can be averaged together to provide a better estimate of the population growth rate. Explain one or two significant flaws in this method and propose a better one.

**Objective:**
- Critique the method of simply averaging the growth rates from multiple samples and propose a better method.

**Flaws in Averaging the Growth Rates:**
1. **Ignoring Variability:**
   - Simply averaging the estimated growth rates from multiple samples ignores the variability in the estimates. Different cultures might have different growth conditions or measurement errors, leading to growth rates with different levels of uncertainty.
   - Averaging the rates without considering their variance might give undue weight to less reliable estimates.

2. **Assumption of Independence:**
   - This method assumes that the growth rates are independent across samples. However, if there are shared environmental factors or systematic biases affecting all cultures, this assumption might not hold, leading to misleading results.

**Proposed Better Method:**
- **Use a Hierarchical (Random Effects) Model:**
  - A better approach would be to use a hierarchical model that accounts for the variability between cultures. This model treats the growth rate as a random effect that varies across cultures.

**Steps:**
1. **Model Setup:**
   - Assume the growth rate $r_j$ for the j-th culture is drawn from a normal distribution:
     $
     r_j \sim N(\mu_r, \tau^2)
     $
     
     where $\mu_r$ is the overall mean growth rate, and $\tau^2$ is the between-culture variance.

2. **Estimation:**
   - Fit the hierarchical model using all the data from the 5 cultures. The overall mean $\mu_r$ will be estimated, taking into account both the within-culture variability and the between-culture variability.

3. **Construct a Confidence Interval:**
   - Construct a confidence interval for $\mu_r$, which represents the population-level average growth rate.

**Interpretation:**
- This method provides a more accurate estimate of the population growth rate by accounting for variability between different cultures. It also provides a more reliable confidence interval by incorporating the uncertainty from multiple sources.
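A full hierarchical fit normally needs a mixed-model library. As a lighter stand-in, the sketch below uses a two-stage random-effects approximation (the DerSimonian-Laird method-of-moments estimator from meta-analysis): each culture contributes its estimated slope and standard error, $\tau^2$ is estimated from the spread of the slopes, and $\mu_r$ is an inverse-variance weighted mean. This is a swapped-in simplification, not the hierarchical model itself, and the per-culture numbers are invented.

```python
from math import sqrt

# Hypothetical per-culture growth-rate estimates and their standard errors.
r_hat = [0.31, 0.28, 0.35, 0.30, 0.26]
se = [0.010, 0.015, 0.012, 0.020, 0.011]

def random_effects_mean(estimates, std_errors):
    """DerSimonian-Laird estimate of the overall mean, its SE, and tau^2."""
    v = [s * s for s in std_errors]
    w = [1.0 / vi for vi in v]  # fixed-effect (within-culture) weights
    fixed = sum(wi * ri for wi, ri in zip(w, estimates)) / sum(w)
    q = sum(wi * (ri - fixed) ** 2 for wi, ri in zip(w, estimates))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)  # between-culture variance
    w_star = [1.0 / (vi + tau2) for vi in v]  # weights using both variance sources
    mu = sum(wi * ri for wi, ri in zip(w_star, estimates)) / sum(w_star)
    return mu, sqrt(1.0 / sum(w_star)), tau2

mu_r, se_mu, tau2 = random_effects_mean(r_hat, se)
print(mu_r, se_mu, tau2)
```

Less precise cultures receive less weight, and the standard error of `mu_r` grows with `tau2`, which addresses the first flaw listed above.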

### Problem 9: Two-Sample t-Test and ANOVA

**Objective:** 
- The objective is to show the equivalence between the two-sample t-test and ANOVA, and to demonstrate how a regression model can be used to conduct an equivalent test.

#### Part A: Show that the two-sample t-test and ANOVA are identical.

**Given Data:**
- We have two groups, with observations from each group assumed to be normally distributed with equal variances.

**Two-Sample t-Test:**
- The two-sample t-test compares the means of two independent groups to determine if they are significantly different.
- The test statistic for the two-sample t-test is:
  $
  t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
  $
  
  where:
  - $\bar{y}_1$ and $\bar{y}_2$ are the sample means for the two groups,
  - $n_1$ and $n_2$ are the sample sizes for the two groups,
  - $s_p^2$ is the pooled variance, calculated as:
    $
    s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
    $
    
  - $s_1^2$ and $s_2^2$ are the sample variances for the two groups.

**One-Way ANOVA:**
- One-way ANOVA compares the means across multiple groups (including just two groups) by partitioning the total variation into variation between groups (SSB) and within groups (SSW).
- The F-statistic in ANOVA is given by:
  $
  F = \frac{\text{SSB}/\text{df}_{\text{between}}}{\text{SSW}/\text{df}_{\text{within}}}
  $
  
  where:
  - $\text{SSB}$ is the sum of squares between the groups,
  - $\text{SSW}$ is the sum of squares within the groups,
  - $\text{df}_{\text{between}} = k - 1$ (for two groups, $k = 2$, so $\text{df}_{\text{between}} = 1$),
  - $\text{df}_{\text{within}} = n_1 + n_2 - 2$.

**Equivalence:**
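- When there are only two groups, the F-statistic from ANOVA is exactly the square of the t-statistic from the two-sample t-test. With $k = 2$, the overall mean is $\bar{y} = (n_1 \bar{y}_1 + n_2 \bar{y}_2)/(n_1 + n_2)$, so:
  $
  \text{SSB} = n_1 (\bar{y}_1 - \bar{y})^2 + n_2 (\bar{y}_2 - \bar{y})^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{y}_1 - \bar{y}_2)^2 = \frac{(\bar{y}_1 - \bar{y}_2)^2}{\frac{1}{n_1} + \frac{1}{n_2}}
  $
  
  and
  $
  \frac{\text{SSW}}{\text{df}_{\text{within}}} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = s_p^2
  $
  
  Since $\text{df}_{\text{between}} = 1$:
  $
  F = \frac{(\bar{y}_1 - \bar{y}_2)^2}{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = t^2
  $
  
- This shows that the F-statistic from a one-way ANOVA with two groups equals the square of the t-statistic from a two-sample t-test. Because the square of a t random variable with $n_1 + n_2 - 2$ degrees of freedom follows an F distribution with 1 and $n_1 + n_2 - 2$ degrees of freedom, the two tests are equivalent in this case.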

**Interpretation:**
- The equivalence of the t-test and ANOVA for two groups means that either test can be used to determine if the means of two groups are significantly different. They both rely on the same underlying assumptions, and the ANOVA F-test produces the same p-value as the two-sided pooled-variance t-test.
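The relationship $F = t^2$ can be confirmed numerically. The two groups below are made up, and the t-statistic uses the pooled variance as in the formula above.

```python
from statistics import mean, variance

# Hypothetical observations from two independent groups.
g1 = [5.1, 4.8, 6.0, 5.5]
g2 = [6.9, 7.2, 6.4, 7.0, 6.6]
n1, n2 = len(g1), len(g2)

# Pooled-variance two-sample t statistic.
sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
t_stat = (mean(g1) - mean(g2)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F statistic with k = 2 groups.
grand_mean = mean(g1 + g2)
ssb = n1 * (mean(g1) - grand_mean) ** 2 + n2 * (mean(g2) - grand_mean) ** 2
ssw = sum((y - mean(g1)) ** 2 for y in g1) + sum((y - mean(g2)) ** 2 for y in g2)
f_stat = (ssb / 1) / (ssw / (n1 + n2 - 2))

print(t_stat**2, f_stat)  # the two values agree up to rounding
```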

#### Part B: Show that the regression model $y_{ij} = \beta_0 + \beta_1 I(j = 1) + \epsilon_{ij}$ can also be used to conduct an equivalent test as the two-sample t-test and ANOVA.

**Given Data:**
- $y_{ij}$ represents the response for the i-th observation in the j-th group.
- $I(j = 1)$ is an indicator variable that takes the value 1 if the observation is in group 1, and 0 otherwise.

**Regression Model:**
- The model can be written as:
  $
  y_{ij} = \beta_0 + \beta_1 I(j = 1) + \epsilon_{ij}
  $
  
  where:
  - $\beta_0$ represents the mean of the group when $I(j = 1) = 0$ (i.e., the second group),
  - $\beta_1$ represents the difference in means between the two groups,
  - $\epsilon_{ij}$ is the error term assumed to be normally distributed with mean 0 and constant variance.

**Equivalence to t-Test:**
- The hypothesis $H_0: \beta_1 = 0$ in this regression model tests whether there is a significant difference in means between the two groups. This is equivalent to the hypothesis tested in the two-sample t-test and one-way ANOVA.
- The t-statistic for $\beta_1$ in the regression model is calculated as:
  $
  t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)}
  $
  
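  where $\text{SE}(\hat{\beta}_1)$ is the standard error of the estimated coefficient $\hat{\beta}_1$. Least squares gives $\hat{\beta}_0 = \bar{y}_2$ and $\hat{\beta}_1 = \bar{y}_1 - \bar{y}_2$, and because the residual mean square equals the pooled variance $s_p^2$, $\text{SE}(\hat{\beta}_1) = \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$. The regression t-statistic is therefore identical to the two-sample t-statistic.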

**Equivalence to ANOVA:**
- The sum of squares explained by the regression model (i.e., due to $\beta_1$) corresponds to the sum of squares between groups (SSB) in ANOVA.
- The residual sum of squares in the regression model corresponds to the sum of squares within groups (SSW) in ANOVA.
- The F-statistic for the regression model is the square of the t-statistic for $\beta_1$, just as in the equivalence between the two-sample t-test and ANOVA.

**Interpretation:**
- The regression model $y_{ij} = \beta_0 + \beta_1 I(j = 1) + \epsilon_{ij}$ provides an equivalent test to the two-sample t-test and one-way ANOVA. This demonstrates that the same statistical test can be framed in multiple ways, each of which provides the same result when the underlying assumptions are met.
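A numerical sketch of this equivalence on made-up data: regressing the response on the group-1 indicator reproduces the difference in means as the slope and the pooled two-sample t-statistic as the slope's t-statistic.

```python
from statistics import mean, variance

# Hypothetical observations; x is the indicator I(j = 1).
g1 = [5.1, 4.8, 6.0, 5.5]
g2 = [6.9, 7.2, 6.4, 7.0, 6.6]
x = [1.0] * len(g1) + [0.0] * len(g2)
y = g1 + g2
n = len(y)

# Least-squares fit of y on the indicator.
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta0 = y_bar - beta1 * x_bar

# t statistic for the slope: estimate divided by its standard error.
rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
t_reg = beta1 / (rss / (n - 2) / s_xx) ** 0.5

# Pooled two-sample t statistic for comparison.
n1, n2 = len(g1), len(g2)
sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
t_pooled = (mean(g1) - mean(g2)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

print(beta0, beta1, t_reg, t_pooled)
```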