### Problem 7: Interaction Effect of Smoking Status and BMI on Systolic Blood Pressure

#### Objective:
To determine if there is an interaction effect between current smoking status and BMI on systolic blood pressure using the Framingham Heart Study dataset.

#### Part (a): Fit the Baseline Main Effect Model

**Given Data:**
- Outcome variable: Systolic Blood Pressure (sysBP)
- Covariates: Current Smoking Status, BMI

**Formula:**
$\text{sysBP} \sim \text{Current Smoking Status} + \text{BMI}$

**R Code for Model Fitting:**
```r
# Fit the baseline main effect model
baseline_model <- lm(sysBP ~ currentSmoker + BMI, data = framingham_data)
summary(baseline_model)

# Confidence intervals for coefficients
confint(baseline_model)
```

**Calculation:**
- Fit a linear regression model using the given covariates.
- Extract and display the regression coefficients, their p-values, and the 95% confidence intervals.

**Interpretation:**
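The coefficient table describes an additive model. With `currentSmoker` coded 0/1, its coefficient is the estimated difference in mean systolic blood pressure between current smokers and non-smokers at the same BMI. The `BMI` coefficient is the estimated change in mean systolic blood pressure per one-unit increase in BMI, assumed to be the same for smokers and non-smokers.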

#### Part (b): Fit the Interaction Model

**Formula:**
$ 
\text{sysBP} \sim \text{Current Smoking Status} + \text{BMI} + \text{Current Smoking Status} \times \text{BMI}
$

**R Code for Model Fitting:**
```r
# Fit the interaction model
interaction_model <- lm(sysBP ~ currentSmoker * BMI, data = framingham_data)
summary(interaction_model)

# Confidence intervals for coefficients
confint(interaction_model)
```

**Calculation:**
- Fit a linear regression model including the interaction term between current smoking status and BMI.
- Extract and display the regression coefficients, their p-values, and the 95% confidence intervals.

**Interpretation:**
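The table now includes a `currentSmoker:BMI` row, whose coefficient estimates how much the BMI slope differs between current smokers and non-smokers. The main-effect rows change meaning: the `BMI` coefficient is the BMI slope among non-smokers only, and the `currentSmoker` coefficient is the smoker versus non-smoker difference at BMI = 0, which is an extrapolation outside the observed BMI range.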

#### Part (c): Comparison and Interpretation

**Objective:**
Compare the results of the baseline and interaction models to understand how the interpretation of the effect of smoking and BMI changes when considering their interaction.

**Interpretation:**
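- In the baseline model, the effects of smoking and BMI are additive: the BMI slope is the same for smokers and non-smokers, and the smoking effect is the same at every BMI.
- In the interaction model, the interaction term allows the BMI slope to differ by smoking status. The main-effect coefficients are then conditional effects (the BMI slope for non-smokers, and the smoking effect at BMI = 0), so they should not be compared directly with the baseline coefficients.
- Whether smoking status modifies the effect of BMI is judged from the interaction coefficient: its p-value and whether its 95% confidence interval excludes zero.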
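
The comparison can also be made with a nested-model F-test. The sketch below is only an illustration: it uses a simulated stand-in data frame, `sim_data`, with made-up coefficients (not the Framingham data), and the object names `baseline_fit` and `interaction_fit` are hypothetical. With the real data, `sim_data` would be replaced by `framingham_data`.

```r
# Simulated stand-in data (hypothetical values, NOT the Framingham data)
set.seed(1)
n <- 500
sim_data <- data.frame(
  currentSmoker = rbinom(n, 1, 0.5),
  BMI = rnorm(n, mean = 26, sd = 4)
)
sim_data$sysBP <- 100 + 1.2 * sim_data$BMI -
  0.4 * sim_data$currentSmoker * sim_data$BMI + rnorm(n, sd = 15)

# Fit the additive and the interaction model on the same data
baseline_fit <- lm(sysBP ~ currentSmoker + BMI, data = sim_data)
interaction_fit <- lm(sysBP ~ currentSmoker * BMI, data = sim_data)

# Nested-model F-test: does adding the interaction term improve the fit?
f_test <- anova(baseline_fit, interaction_fit)
print(f_test)

# BMI slopes implied by the interaction model for each smoking group
b <- coef(interaction_fit)
slope_nonsmoker <- unname(b["BMI"])
slope_smoker <- unname(b["BMI"] + b["currentSmoker:BMI"])
cat("BMI slope, non-smokers:", slope_nonsmoker, "\n")
cat("BMI slope, smokers:    ", slope_smoker, "\n")
```

Because only one term is added, the F-test p-value matches the t-test p-value for the interaction coefficient in `summary(interaction_fit)`.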

#### Part (d): Boxplot of sysBP Stratified by Male

**Objective:**
To visualize the distribution of systolic blood pressure across different levels of the binary variable 'male', with color differentiation based on TenYearCHD.

**R Code for Boxplot:**
```r
# Boxplot of sysBP stratified by male
library(ggplot2)
ggplot(framingham_data, aes(x = factor(male), y = sysBP, fill = factor(TenYearCHD))) +
  geom_boxplot() +
  labs(x = "Male", y = "Systolic Blood Pressure", fill = "Ten Year CHD") +
  theme_minimal()
```

**Interpretation:**
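- The boxplot shows how the distribution of systolic blood pressure (median, spread, and outliers) differs between the two levels of `male`.
- The fill color for TenYearCHD places the CHD and non-CHD groups side by side within each level of `male`. If the gap between the two CHD groups looks different for males and females, that is a visual hint of an interaction between sex and TenYearCHD; it is not a formal test.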

### Problem 8: Variance of the Estimator for the Difference in Means

#### Objective:
Calculate the variance of the estimator for the difference in means of two observations, where one observation has a covariate value $x_2$ = 1 and the other has $x_2$ = 0. Assume that the linear model is given by:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \epsilon_i$

with $\epsilon_i \sim \text{i.i.d. } N(0, \sigma^2)$.

#### Given Data:
- Linear model: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \epsilon_i$
- Covariate values for two observations: 
  - Observation A: $x_2$ = 1
  - Observation B: $x_2$ = 0

#### Formula:
The difference in means for the two observations is:

$ 
\Delta = E[y_A] - E[y_B] = (\beta_0 + \beta_1 x_{A1} + \beta_2 x_{A2} + \beta_3 x_{A1} x_{A2}) - (\beta_0 + \beta_1 x_{B1} + \beta_2 x_{B2} + \beta_3 x_{B1} x_{B2})
$

Given $x_{A2}$ = 1 and $x_{B2}$ = 0:

$\Delta = (\beta_1 x_{A1} + \beta_2 + \beta_3 x_{A1}) - (\beta_1 x_{B1})$

$\Delta = \beta_2 + \beta_1(x_{A1} - x_{B1}) + \beta_3 x_{A1}$

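The quantity that has a variance is the estimator of $\Delta$, obtained by plugging in the least-squares estimates:

$
\hat{\Delta} = \hat{\beta}_2 + \hat{\beta}_1(x_{A1} - x_{B1}) + \hat{\beta}_3 x_{A1} = \mathbf{c}^\top \hat{\boldsymbol{\beta}}, \quad \mathbf{c} = (0,\; x_{A1} - x_{B1},\; 1,\; x_{A1})^\top
$

The true coefficients $\beta_1$, $\beta_2$, and $\beta_3$ are constants, so $\Delta$ itself has no variance. The estimates $\hat{\boldsymbol{\beta}}$ are random, with $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$, where $\mathbf{X}$ is the design matrix. Therefore:

$
\text{Var}(\hat{\Delta}) = \sigma^2 \, \mathbf{c}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{c}
$

Written out term by term:

$
\text{Var}(\hat{\Delta}) = (x_{A1} - x_{B1})^2 \, \text{Var}(\hat{\beta}_1) + \text{Var}(\hat{\beta}_2) + x_{A1}^2 \, \text{Var}(\hat{\beta}_3) + 2 (x_{A1} - x_{B1}) \, \text{Cov}(\hat{\beta}_1, \hat{\beta}_2) + 2 (x_{A1} - x_{B1}) x_{A1} \, \text{Cov}(\hat{\beta}_1, \hat{\beta}_3) + 2 x_{A1} \, \text{Cov}(\hat{\beta}_2, \hat{\beta}_3)
$

#### Calculation:
1. Identify the values of $x_{A1}$ and $x_{B1}$.
2. Form the contrast vector $\mathbf{c} = (0,\; x_{A1} - x_{B1},\; 1,\; x_{A1})^\top$.
3. Evaluate $\sigma^2 \, \mathbf{c}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{c}$ using $(\mathbf{X}^\top \mathbf{X})^{-1}$ from the design matrix.

**Example Calculation:**

Assume $x_{A1}$ = 2 and $x_{B1}$ = 1, so $\mathbf{c} = (0, 1, 1, 2)^\top$.

$
\text{Var}(\hat{\Delta}) = \text{Var}(\hat{\beta}_1) + \text{Var}(\hat{\beta}_2) + 4 \, \text{Var}(\hat{\beta}_3) + 2 \, \text{Cov}(\hat{\beta}_1, \hat{\beta}_2) + 4 \, \text{Cov}(\hat{\beta}_1, \hat{\beta}_3) + 4 \, \text{Cov}(\hat{\beta}_2, \hat{\beta}_3)
$

Each variance and covariance equals $\sigma^2$ times the corresponding entry of $(\mathbf{X}^\top \mathbf{X})^{-1}$, so the result is a multiple of $\sigma^2$ that can be evaluated numerically only once the design matrix is known.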

#### Interpretation:
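The variance of the estimator for the difference in means is directly proportional to the error variance $\sigma^2$. It also depends on the covariate values $x_{A1}$ and $x_{B1}$ and, through $(\mathbf{X}^\top \mathbf{X})^{-1}$, on the design of the original data, because the variances and covariances of the coefficient estimates all contribute.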
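
As a numerical illustration, the variance of such a contrast can be computed from a fitted model with `vcov()`. This is a minimal sketch under stated assumptions: the data are simulated, the coefficient values and covariate distributions are chosen only for illustration, and the names `contrast`, `delta_hat`, and `var_delta_hat` are hypothetical.

```r
# Simulated data from the interaction model (all values are illustrative assumptions)
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y <- -1 + 0.25 * x1 + 0.1 * x2 + 0.2 * x1 * x2 + rnorm(n, sd = sqrt(0.7))
fit <- lm(y ~ x1 * x2)

# Observation A has x2 = 1, observation B has x2 = 0
x_A1 <- 2
x_B1 <- 1

# Contrast vector in coefficient order: intercept, x1, x2, x1:x2
contrast <- c(0, x_A1 - x_B1, 1, x_A1)

# Estimated difference in means and its estimated variance
delta_hat <- sum(contrast * coef(fit))
var_delta_hat <- as.numeric(t(contrast) %*% vcov(fit) %*% contrast)

# Same variance from sigma^2 * c' (X'X)^{-1} c, with sigma^2 replaced by its estimate
X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2
var_manual <- sigma2_hat * as.numeric(t(contrast) %*% solve(crossprod(X)) %*% contrast)

cat("Estimated difference:", delta_hat, "\n")
cat("Estimated variance:  ", var_delta_hat, "\n")
```

Here `vcov(fit)` is the estimated $\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$, so the two variance calculations agree.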

### Problem 9: Power Plot for Testing a Main Effect

#### Objective:
Construct a power plot for testing a main effect $H_0: \beta_1$ = 0 with the following parameter settings:

- $\beta_0$ = -1
- $\beta_1$ = 0.25
- $\sigma^2$ = 0.7

Vary the sample size (n) in increments of 50, from n = 50 up to n = 250. Run 500 replicates for each setting of n and plot the results.

#### R Code:

```r
# Load necessary libraries
library(ggplot2)

# Fix the random seed so the simulated power curve is reproducible
set.seed(2024)

# Set parameters
beta_0 <- -1
beta_1 <- 0.25
sigma2 <- 0.7
sample_sizes <- seq(50, 250, by = 50)
replicates <- 500
alpha <- 0.05

# Function to simulate power for a given sample size
simulate_power <- function(n, beta_0, beta_1, sigma2, replicates, alpha) {
  power <- numeric(replicates)
  for (i in 1:replicates) {
    x <- rnorm(n)  # Covariate distribution is not specified in the problem; standard normal is assumed
    y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sqrt(sigma2))
    model <- lm(y ~ x)
    p_value <- summary(model)$coefficients[2, 4]
    power[i] <- ifelse(p_value < alpha, 1, 0)
  }
  return(mean(power))
}

# Calculate power for each sample size
power_results <- sapply(sample_sizes, simulate_power, beta_0 = beta_0, beta_1 = beta_1, sigma2 = sigma2, replicates = replicates, alpha = alpha)

# Create data frame for plotting
power_data <- data.frame(SampleSize = sample_sizes, Power = power_results)

# Plot the power curve
ggplot(power_data, aes(x = SampleSize, y = Power)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  labs(title = "Power Plot for Testing Main Effect", x = "Sample Size", y = "Power") +
  theme_minimal()
```

#### Calculation:
- The code simulates the power of the test for different sample sizes by running 500 replicates for each sample size.
- For each replicate, a linear model is fitted, and the p-value for $\beta_1$ is checked against the significance level $\alpha$ = 0.05.
- The proportion of replicates where the null hypothesis is rejected (p-value < 0.05) is calculated to estimate the power for that sample size.

#### Interpretation:
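- The resulting plot shows how the estimated power changes with sample size. Power should rise with n because the standard error of $\hat{\beta}_1$ shrinks roughly like $1/\sqrt{n}$, making a true effect of 0.25 easier to detect.
- Each point is a Monte Carlo estimate from 500 replicates, so it carries a standard error of at most $\sqrt{0.25/500} \approx 0.022$. Small wiggles in the curve are simulation noise rather than real changes in power.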
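
The simulated curve can be sanity-checked against an approximate analytic power based on the noncentral t distribution. This is a rough sketch, not an exact result: it assumes a standard normal covariate and replaces the random sum of squares of x by its expected value n - 1. The name `approx_power` is hypothetical.

```r
# Parameter settings from the problem statement
beta_1 <- 0.25
sigma2 <- 0.7
alpha <- 0.05
sample_sizes <- seq(50, 250, by = 50)

# Approximate power of the two-sided t-test for H0: beta_1 = 0
approx_power <- sapply(sample_sizes, function(n) {
  df <- n - 2
  # Noncentrality parameter beta_1 * sqrt(Sxx) / sigma, with Sxx approximated by n - 1
  ncp <- beta_1 * sqrt(n - 1) / sqrt(sigma2)
  t_crit <- qt(1 - alpha / 2, df)
  # Probability that |T| exceeds the critical value under the alternative
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp = ncp)
})

print(data.frame(SampleSize = sample_sizes, ApproxPower = round(approx_power, 3)))
```

If the simulated powers differ from these values by much more than the Monte Carlo error, that would suggest a bug in the simulation rather than a property of the test.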

### Problem 10: Power Plot for Testing an Interaction Effect

#### Objective:
Construct a power plot for testing an interaction effect $H_0: \beta_3$ = 0 with the following parameter settings:

- $\beta_0$ = -1
- $\beta_1$ = 0.25
- $\beta_2$ = 0.1
- $\beta_3$ = 0.2
- $\sigma^2$ = 0.7

Vary the sample size (n) in increments of 50, from n = 50 up to n = 250. Run 500 replicates for each setting of n and plot the results.

#### R Code:

```r
# Load necessary libraries
library(ggplot2)

# Fix the random seed so the simulated power curve is reproducible
set.seed(2024)

# Set parameters
beta_0 <- -1
beta_1 <- 0.25
beta_2 <- 0.1
beta_3 <- 0.2
sigma2 <- 0.7
sample_sizes <- seq(50, 250, by = 50)
replicates <- 500
alpha <- 0.05

# Function to simulate power for a given sample size
simulate_power_interaction <- function(n, beta_0, beta_1, beta_2, beta_3, sigma2, replicates, alpha) {
  power <- numeric(replicates)
  for (i in 1:replicates) {
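    x1 <- rnorm(n)  # Continuous covariate (standard normal is an assumption)
    x2 <- rbinom(n, 1, 0.5)  # Binary covariate (balanced 0/1 split is an assumption)
    y <- beta_0 + beta_1 * x1 + beta_2 * x2 + beta_3 * x1 * x2 + rnorm(n, mean = 0, sd = sqrt(sigma2))
    model <- lm(y ~ x1 * x2)
    # p-value for the interaction term, selected by name rather than by row position
    p_value <- summary(model)$coefficients["x1:x2", "Pr(>|t|)"]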
    power[i] <- ifelse(p_value < alpha, 1, 0)
  }
  return(mean(power))
}

# Calculate power for each sample size
power_results_interaction <- sapply(sample_sizes, simulate_power_interaction, beta_0 = beta_0, beta_1 = beta_1, beta_2 = beta_2, beta_3 = beta_3, sigma2 = sigma2, replicates = replicates, alpha = alpha)

# Create data frame for plotting
power_data_interaction <- data.frame(SampleSize = sample_sizes, Power = power_results_interaction)

# Plot the power curve
ggplot(power_data_interaction, aes(x = SampleSize, y = Power)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  labs(title = "Power Plot for Testing Interaction Effect", x = "Sample Size", y = "Power") +
  theme_minimal()
```

#### Calculation:
- The code simulates the power of the test for different sample sizes by running 500 replicates for each sample size.
- For each replicate, a linear model is fitted, and the p-value for the interaction term $\beta_3$ is checked against the significance level $\alpha$ = 0.05.
- The proportion of replicates where the null hypothesis is rejected (p-value < 0.05) is calculated to estimate the power for that sample size.

#### Interpretation:
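- The resulting plot shows how the estimated power of the interaction test increases with sample size.
- In this simulated design, power is expected to be noticeably lower than for the main-effect test in Problem 9 at the same n. The interaction coefficient is a difference between two group-specific slopes, each estimated from roughly half the data, so its standard error is about twice as large, and the effect being tested (0.2) is smaller.
- The plot helps judge how large a sample is needed to have a reasonable chance of detecting a true interaction effect of this size.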

### Problem 11: Confidence Interval for a Prediction in a New Observation

#### Objective:
Calculate a 95% confidence interval for the predicted response in a new observation using the linear model $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \epsilon_i$.

Assume the new observation has covariate values $x_1$ = 1.5 and $x_2$ = 0.75, and the model parameters are as follows:
- $\beta_0 = -1$
- $\beta_1 = 0.5$
- $\beta_2 = 0.25$
- $\beta_3 = 0.1$
- The residual variance $\sigma^2$ = 0.7

#### Given Data:
- Covariate values for new observation: $x_1$ = 1.5, $x_2$ = 0.75
- Model parameters: $\beta_0$ = -1, $\beta_1$ = 0.5, $\beta_2$ = 0.25, $\beta_3$ = 0.1
- Residual variance: $\sigma^2$ = 0.7

#### Formula:
The predicted response for the new observation is given by:
$
\hat{y}_{\text{new}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
$

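For a new observation, the relevant variance is that of the prediction error $y_{\text{new}} - \hat{y}_{\text{new}}$, which combines the uncertainty in the fitted mean with the new observation's own error variance $\sigma^2$:
$
\text{Var}(y_{\text{new}} - \hat{y}_{\text{new}}) = \sigma^2 \left(1 + \mathbf{x}_{\text{new}}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_{\text{new}} \right)
$

where $\mathbf{x}_{\text{new}}$ = (1, $x_1$, $x_2$, $x_1 x_2$) and $\mathbf{X}$ is the design matrix for the original data. Dropping the leading 1 gives $\text{Var}(\hat{y}_{\text{new}})$, the variance used for a confidence interval on the mean response.

The 95% interval for the new observation (strictly, a prediction interval) is then:
$\hat{y}_{\text{new}} \pm t_{n-4,0.025} \times \sqrt{\text{Var}(y_{\text{new}} - \hat{y}_{\text{new}})}$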

#### Calculation:

1. **Calculate the predicted response:**
   $
   \hat{y}_{\text{new}} = -1 + 0.5 \times 1.5 + 0.25 \times 0.75 + 0.1 \times 1.5 \times 0.75
   $
   
   $
   \hat{y}_{\text{new}} = -1 + 0.75 + 0.1875 + 0.1125 = 0.05
   $

2. **Assume $\mathbf{X}^\top \mathbf{X}$ is known (typically this would be provided or computed from the data). The design matrix is not given here, so for illustration take the hypothetical value $\mathbf{x}_{\text{new}}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_{\text{new}} = 0.05$.**

3. **Estimate the standard error for the prediction:**
   $
   \text{SE}_{\text{pred}} = \sqrt{\sigma^2 \times (1 + 0.05)} = \sqrt{0.7 \times 1.05} = \sqrt{0.735} \approx 0.857
   $

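4. **Calculate the 95% interval:**
   - The critical value $t_{n-4,0.025}$ (for large n) is approximately 1.96.
   - The interval is:
   $
   0.05 \pm 1.96 \times 0.857
   $
   
   $
   0.05 \pm 1.68 \quad \text{(approximately)}
   $
   
   $
   \text{Interval} \approx [-1.63, 1.73]
   $

#### Interpretation:
The 95% interval for the response in the new observation is approximately $[-1.63, 1.73]$. Because the variance includes the new observation's own error term, this is a prediction interval: across repeated samples, intervals built this way would contain the new response about 95% of the time. A confidence interval for the mean response at the same covariate values would omit that term and be much narrower. The numbers are illustrative, since they rest on the assumed value 0.05 for $\mathbf{x}_{\text{new}}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_{\text{new}}$.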
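
In practice both kinds of interval come from `predict()` on a fitted model. The sketch below is a hedged illustration on simulated data: the covariate distributions and sample size are assumptions, and the fitted coefficients will differ from the stated parameter values. It also rebuilds the prediction interval by hand from the variance formula.

```r
# Simulated data using the stated parameter values (design is an illustrative assumption)
set.seed(3)
n <- 200
x1 <- rnorm(n, mean = 1.5)
x2 <- runif(n)
y <- -1 + 0.5 * x1 + 0.25 * x2 + 0.1 * x1 * x2 + rnorm(n, sd = sqrt(0.7))
fit <- lm(y ~ x1 * x2)

new_obs <- data.frame(x1 = 1.5, x2 = 0.75)

# Confidence interval for the mean response vs prediction interval for a new response
ci_mean <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
pi_new <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

# Prediction interval by hand: y_hat +/- t * sqrt(s^2 * (1 + x_new' (X'X)^{-1} x_new))
x_new <- c(1, 1.5, 0.75, 1.5 * 0.75)
X <- model.matrix(fit)
s2 <- summary(fit)$sigma^2
h_new <- as.numeric(t(x_new) %*% solve(crossprod(X)) %*% x_new)
t_crit <- qt(0.975, df = n - 4)
pi_manual <- sum(x_new * coef(fit)) + c(-1, 1) * t_crit * sqrt(s2 * (1 + h_new))

print(ci_mean)
print(pi_new)
print(pi_manual)
```

The prediction interval is wider than the confidence interval because of the extra 1 inside the square root.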

### Problem 12: Influence of Leverage Points in Linear Regression

#### Objective:
Discuss how leverage points influence the fitted linear regression model, including their impact on the slope, intercept, and the overall fit of the model. Provide an example using R code to illustrate the effect of a leverage point on a simple linear regression model.

#### Explanation:
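**Leverage Points** are observations with extreme values of the predictor variable(s). Leverage is measured by the hat values $h_{ii}$, the diagonal entries of $\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$, and depends only on the predictors, not on the response. A high leverage point has the potential to change the fit; it is influential only if removing it noticeably changes the slope or intercept.

- **High Leverage Points:** These are points that are far from the mean of the predictor variable. A common rule of thumb flags points with $h_{ii} > 2p/n$, where $p$ is the number of coefficients and $n$ the number of observations. Such points can "pull" the regression line towards themselves, which distorts the fit if they are not representative of the general trend in the data.
- **Impact on Slope and Intercept:** If a high leverage point is consistent with the overall trend, it barely changes the estimates and reduces the standard error of the slope, which stabilizes the regression line. If its response is also far from the trend (an outlier in y), it can distort the slope and intercept, leading to incorrect inferences.
- **Influence on Fit:** Leverage points can also change summary measures such as R-squared and p-values, making the model appear better or worse than it actually is. Cook's distance, which combines leverage with residual size, is a common measure of a point's actual influence.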

#### Example Using R Code:

```r
# Load necessary libraries
library(ggplot2)

# Simulate data
set.seed(123)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Add a leverage point
x <- c(x, 10)  # Adding a point far from the mean of x
y <- c(y, 20)  # y = 2 * 10, so this point lies exactly on the true line y = 2x

# Fit linear model without the leverage point
model_without_leverage <- lm(y[1:100] ~ x[1:100])

# Fit linear model with the leverage point
model_with_leverage <- lm(y ~ x)

# Summary of models
summary(model_without_leverage)
summary(model_with_leverage)

# Plot the data and the regression lines
data <- data.frame(x = x, y = y)
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x) +  # With leverage point
  geom_abline(intercept = coef(model_without_leverage)[1], slope = coef(model_without_leverage)[2], color = "red") +  # Without leverage point
  labs(title = "Effect of Leverage Point on Regression Line", x = "Predictor (x)", y = "Response (y)") +
  theme_minimal()
```

#### Calculation and Interpretation:
1. **Simulate Data:** The code generates 100 random points following a simple linear relationship.
2. **Add Leverage Point:** A single point with a large x-value is added to create a high leverage point.
3. **Fit Models:** Two linear models are fitted: one without the leverage point and one with it.
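4. **Compare Models:** The summaries of the models show the impact of the leverage point on the slope, intercept, and overall fit. Because the added point (10, 20) lies on the true line y = 2x, the coefficient estimates change little, while the standard error of the slope shrinks and R-squared typically rises.
5. **Plot:** The plot compares the regression line without the leverage point (red) and with it (blue). In this example the two lines nearly coincide, which illustrates the stabilizing case. If the added point's y-value were far from the trend (for example y = 0 at x = 10), the blue line would be pulled towards it, showing how a leverage point can skew the regression line.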
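
To contrast the two cases directly, the following sketch fits the same data with an on-trend and an off-trend point at x = 10 and reports leverage and Cook's distance for each. The off-trend value y = 0 and the object names (`fit_on`, `fit_off`, `slopes`) are assumptions chosen for illustration.

```r
# Same simulated data as above, plus one added point at x = 10
set.seed(123)
x <- c(rnorm(100), 10)
y_on <- c(2 * x[1:100] + rnorm(100), 20)  # added point on the true line y = 2x
y_off <- c(y_on[1:100], 0)                # added point far below the trend (assumed value)

fit_base <- lm(y_on[1:100] ~ x[1:100])    # without the added point
fit_on <- lm(y_on ~ x)                    # with the on-trend leverage point
fit_off <- lm(y_off ~ x)                  # with the off-trend leverage point

# Slopes under the three fits
slopes <- c(base = unname(coef(fit_base)[2]),
            on_trend = unname(coef(fit_on)[2]),
            off_trend = unname(coef(fit_off)[2]))
print(slopes)

# Leverage depends only on x, so it is identical in both fits
h <- hatvalues(fit_off)
threshold <- 2 * 2 / length(x)            # rule of thumb 2p/n with p = 2 coefficients
cat("Leverage of added point:", h[101], " threshold:", threshold, "\n")

# Cook's distance separates harmless leverage from actual influence
cooks_on <- unname(cooks.distance(fit_on)[101])
cooks_off <- unname(cooks.distance(fit_off)[101])
cat("Cook's distance, on-trend: ", cooks_on, "\n")
cat("Cook's distance, off-trend:", cooks_off, "\n")
```

The added point has the same high leverage in both fits, but only the off-trend version should show a large Cook's distance and a clearly shifted slope.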

#### Conclusion:
Leverage points can have a profound impact on the regression model. While they can stabilize the model if they are consistent with the trend, they can also lead to significant bias if they are outliers. Detecting and addressing leverage points is crucial in building robust linear regression models.