1. The theoretical Simple Linear Regression model describes a linear relationship between two variables: an independent variable X (the predictor) and a dependent variable Y (the outcome).
X: This is the independent variable that we believe influences the outcome. It is a fixed or observed value in each sample.
Y: This is the dependent variable, whose value we want to predict based on X.
intercept: This is the expected value of Y when X = 0.
slope: The change in the expected value of Y per unit change in X.
In the theoretical regression model, we assume that for any value of X, Y is normally distributed around the line intercept + slope * X with a constant variance sigma^2. Equivalently, Y = beta_0 + beta_1 * X + error, where the error term follows N(0, sigma^2).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters for the regression model
beta_0 = 5      # Intercept
beta_1 = 2      # Slope
sigma = 1.5     # Standard deviation of the error term

# Generate some predictor values (X) as an array
X = np.linspace(0, 10, 100)

# Calculate the expected outcome values without error (pure linear relationship)
Y_true = beta_0 + beta_1 * X

# Simulate the actual outcome values with random error
# Each Y is a sample from a normal distribution centered on (beta_0 + beta_1 * X)
Y = Y_true + np.random.normal(0, sigma, size=X.shape)

# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(X, Y_true, label="True Regression Line (no error)", color="green", linestyle="--")
plt.scatter(X, Y, label="Sampled Y values with error", color="blue", alpha=0.6)

# Show normal distributions around selected points on the line to visualize error distribution
for x in [2, 5, 8]:
    mean_y = beta_0 + beta_1 * x
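    y_vals = np.linspace(mean_y - 3*sigma, mean_y + 3*sigma, 100)
    # Density of N(mean_y, sigma^2), drawn sideways and scaled by 3 for visibility
    density = norm.pdf(y_vals, loc=mean_y, scale=sigma)
    plt.plot([x]*100, y_vals, 'k-', lw=0.5, alpha=0.5)
    plt.fill_betweenx(y_vals, x, x + 3*density, color='gray', alpha=0.3)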
    # Literal braces must be doubled inside str.format, otherwise {N} raises a KeyError
    plt.text(x + 0.1, mean_y, r"$\mathcal{{N}}({}, {}^2)$".format(round(mean_y, 1), sigma), color="black")

plt.xlabel("Predictor Variable X")
plt.ylabel("Outcome Variable Y")
plt.legend()
plt.title("Simple Linear Regression Model with Error Term")
plt.show()

2. To demonstrate how to fit and visualize a Simple Linear Regression model, we'll first simulate a dataset from the theoretical model using numpy. Then we will organize the data in a pandas DataFrame, fit the model with statsmodels, and plot the fitted line with matplotlib.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Step 1: Simulate the dataset

# Parameters for the regression model
beta_0 = 5      # Intercept
beta_1 = 2      # Slope
sigma = 1.5     # Standard deviation of the error term

# Generate some predictor values (X) as an array
np.random.seed(42)  # For reproducibility
X = np.random.uniform(0, 10, 100)  # Random X values between 0 and 10

# Simulate the observed outcome values: the true line plus a normal error term
Y = beta_0 + beta_1 * X + np.random.normal(0, sigma, size=X.shape)

# Create a DataFrame
data = pd.DataFrame({'X': X, 'Y': Y})

# Step 2: Fit the Simple Linear Regression Model using statsmodels

# Define the formula and fit the model
model = smf.ols(formula='Y ~ X', data=data).fit()

# Print the summary of the model
print(model.summary())

# Step 3: Visualize the fitted regression line and the data points

# Scatter plot of the data
plt.figure(figsize=(10, 6))
plt.scatter(data['X'], data['Y'], color='blue', label='Data points', alpha=0.6)

# Plot the fitted regression line
plt.plot(data['X'], model.fittedvalues, color='red', label='Fitted Regression Line', linewidth=2)

# Label the plot
plt.xlabel("Predictor Variable X")
plt.ylabel("Outcome Variable Y")
plt.legend()
plt.title("Fitted Simple Linear Regression Model")

plt.show()

3. To add the line from Question 1 (the theoretical regression line) to the fitted regression line in the figure from Question 2, we’ll modify the plotting code to include both lines for comparison. The theoretical line represents the "true" model used to generate the data, while the fitted line represents the model estimated from the sample data.
Difference explanation:
Theoretical line: It represents the underlying data-generating process. This line is fixed and does not vary from sample to sample.
Fitted line: The fitted line is the regression line calculated from our sample data. It is our estimate of the true relationship based on this specific sample, so a different sample would give a slightly different fitted line.
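
A minimal sketch of that comparison, assuming the same parameters and seed as the simulation in Question 2. The names `x_grid`, `y_theoretical`, and `y_fitted` are illustrative and not part of the original cells.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Data-generating parameters, assumed to match Question 2
beta_0, beta_1, sigma = 5, 2, 1.5

np.random.seed(42)
X = np.random.uniform(0, 10, 100)
Y = beta_0 + beta_1 * X + np.random.normal(0, sigma, size=X.shape)
data = pd.DataFrame({'X': X, 'Y': Y})

model = smf.ols(formula='Y ~ X', data=data).fit()

# Evaluate both lines on a common, sorted grid of X values
x_grid = np.linspace(0, 10, 100)
y_theoretical = beta_0 + beta_1 * x_grid
y_fitted = model.params['Intercept'] + model.params['X'] * x_grid

plt.figure(figsize=(10, 6))
plt.scatter(data['X'], data['Y'], color='blue', label='Data points', alpha=0.6)
plt.plot(x_grid, y_fitted, color='red', linewidth=2, label='Fitted Regression Line')
plt.plot(x_grid, y_theoretical, color='green', linestyle='--', label='Theoretical Regression Line')
plt.xlabel("Predictor Variable X")
plt.ylabel("Outcome Variable Y")
plt.legend()
plt.title("Theoretical vs. Fitted Regression Line")
plt.show()
```

Changing the seed changes the red line but leaves the green dashed line where it is, which is the difference described above.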

4. Each value in fitted_model.fittedvalues is the model's prediction for one observed X value: the estimated intercept plus the estimated slope times that X, with both estimates taken from fitted_model.params. The same calculation is repeated for every observed X in the dataset. This gives a fitted line that represents the best linear approximation, in the least-squares sense, of the relationship between X and Y.
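
As a quick check of this description, the sketch below recomputes the fitted values by hand and compares them with `fittedvalues`. It assumes a model specified with the formula `'Y ~ X'` on simulated data like that in Question 2, so the coefficient labels are `'Intercept'` and `'X'`. The names `manual_fitted` and `matches` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data, assumed to follow the setup from Question 2
np.random.seed(42)
X = np.random.uniform(0, 10, 100)
Y = 5 + 2 * X + np.random.normal(0, 1.5, size=X.shape)
data = pd.DataFrame({'X': X, 'Y': Y})

fitted_model = smf.ols('Y ~ X', data=data).fit()

# Estimated coefficients
b0_hat = fitted_model.params['Intercept']
b1_hat = fitted_model.params['X']

# Fitted value for each observation: b0_hat + b1_hat * x_i
manual_fitted = b0_hat + b1_hat * data['X']

matches = np.allclose(manual_fitted, fitted_model.fittedvalues)
print("Manual fitted values match fittedvalues:", matches)
```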

5. The Ordinary Least Squares (OLS) method chooses the line that minimizes the sum of squared differences between the observed and predicted Y values (the residuals). Squaring the differences keeps positive and negative residuals from cancelling out and gives more weight to larger errors, so OLS finds the line that best fits the data by minimizing the overall squared prediction error.
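
The sketch below illustrates this on simulated data (parameters assumed, as in the earlier questions). It computes the OLS slope and intercept from the standard closed-form expressions and checks that their sum of squared residuals is no larger than that of other candidate lines. The helper `sum_squared_residuals` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 5 + 2 * x + rng.normal(0, 1.5, size=x.shape)

def sum_squared_residuals(b0, b1):
    """Sum of squared differences between observed y and the line b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Closed-form OLS estimates for simple linear regression
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

ssr_ols = sum_squared_residuals(b0_hat, b1_hat)
ssr_true = sum_squared_residuals(5, 2)                            # the data-generating line
ssr_nudged = sum_squared_residuals(b0_hat + 0.1, b1_hat - 0.05)   # a slightly different line

print(f"OLS line:    {ssr_ols:.2f}")
print(f"True line:   {ssr_true:.2f}")
print(f"Nudged line: {ssr_nudged:.2f}")
```

Even the line used to generate the data has a sum of squared residuals at least as large as the OLS line on this sample, because OLS is tuned to the sample rather than to the underlying model.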

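6. In Simple Linear Regression, R-squared is the proportion of the variation in Y that is explained by the model. It is calculated by comparing the variation left unexplained by the model (the sum of squared residuals) with the total variation of Y around its mean: R-squared = 1 - (residual sum of squares) / (total sum of squares). A higher R-squared means the fitted line accounts for more of the variation in Y.
The expression mentioned captures the same idea, giving R-squared as the square of the correlation between Y and the model's predictions, which indicates how well X predicts Y.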
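
A small numerical check of these equivalent forms, on simulated data with assumed parameters. The variable names are illustrative. In Simple Linear Regression the squared correlation between Y and X gives the same number as well.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 5 + 2 * x + rng.normal(0, 1.5, size=x.shape)

# Least-squares line and its fitted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# Form 1: one minus unexplained variation over total variation
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_variation = 1 - ss_res / ss_tot

# Form 2: squared correlation between Y and the fitted values
r2_corr_fitted = np.corrcoef(y, y_hat)[0, 1] ** 2

# Form 3: squared correlation between Y and X
r2_corr_x = np.corrcoef(y, x)[0, 1] ** 2

print(r2_variation, r2_corr_fitted, r2_corr_x)
```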

7. Linearity: The model assumes a linear relationship between X and Y, meaning each unit change in X is associated with the same change in Y. If the data shows a curved or non-linear pattern, the linearity assumption is violated, and a simple linear model may not be appropriate.
Homoscedasticity: The model assumes the errors have the same variance at every value of X. If the residuals spread out more at higher or lower values of X, this assumption is violated, suggesting that the model's accuracy varies across different levels of X.
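
One rough way to look for a homoscedasticity violation is to compare the residual spread at low and high values of X. The sketch below does this on hypothetical data that was deliberately simulated with an error standard deviation that grows with X. The data and the median split are assumptions made for illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)

# Hypothetical heteroscedastic data: the error standard deviation grows with x
y = 5 + 2 * x + rng.normal(0, 0.2 + 0.5 * x)

# Fit a least-squares line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Compare residual spread below and above the median of x
low_half = x < np.median(x)
spread_low = residuals[low_half].std()
spread_high = residuals[~low_half].std()

print(f"Residual std for low X:  {spread_low:.2f}")
print(f"Residual std for high X: {spread_high:.2f}")
```

A clearly larger spread in one half, as here, is the fanning-out pattern that a residuals-versus-X plot would show.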

8. The p-value is below 0.05, which is moderate evidence against the null hypothesis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plt.title/plt.show calls below
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.utils import resample

# Load the Old Faithful dataset
old_faithful = sns.load_dataset('geyser')

# Subset the data for long wait times (waiting > 63)
long_wait = old_faithful[old_faithful['waiting'] > 63]

# 1. Bootstrap Sampling and Visualizing Slope Coefficients
n_bootstrap = 1000  # Number of bootstrap samples
slope_bootstrap = []

# Generate bootstrap samples, fit models, and collect slope coefficients
for _ in range(n_bootstrap):
    bootstrap_sample = resample(long_wait)
    model = smf.ols('duration ~ waiting', data=bootstrap_sample).fit()
    slope_bootstrap.append(model.params['waiting'])

# Visualize the distribution of bootstrapped slope coefficients
sns.histplot(slope_bootstrap, kde=True, bins=30, color='blue')
plt.title("Bootstrap Sampling Distribution of the Slope Coefficients")
plt.xlabel('Slope Coefficients')
plt.ylabel('Frequency')
plt.show()

# 2. Simulate Samples Under the Null Hypothesis
n_simulations = 1000  # Number of simulations
slope_simulated = []

# Generate simulated data under the null hypothesis (intercept = 1.65, slope = 0, stddev = 0.37)
for _ in range(n_simulations):
    # Resample the observed waiting times, keeping the sample size of long_wait
    n_obs = len(long_wait)
    waiting_simulated = np.random.choice(long_wait['waiting'], size=n_obs, replace=True)
    # Simulate durations using the null model
    error_term = np.random.normal(0, 0.37, size=n_obs)  # Standard deviation = 0.37
    duration_simulated = 1.65 + 0 * waiting_simulated + error_term  # Slope = 0 under null hypothesis
    
    # Fit the model and collect the slope coefficient
    model = smf.ols('duration ~ waiting', data=pd.DataFrame({'waiting': waiting_simulated, 'duration': duration_simulated})).fit()
    slope_simulated.append(model.params['waiting'])

# Visualize the simulated sampling distribution under the null hypothesis
sns.histplot(slope_simulated, kde=True, bins=30, color='red')
plt.title("Simulated Sampling Distribution of the Slope Coefficients (Null Hypothesis)")
plt.xlabel('Slope Coefficients')
plt.ylabel('Frequency')
plt.show()

# 3. Report Confidence Interval and p-value

# 95% Confidence Interval from the bootstrap distribution
ci_lower = np.percentile(slope_bootstrap, 2.5)
ci_upper = np.percentile(slope_bootstrap, 97.5)
print(f"95% Confidence Interval for the Slope Coefficient: ({ci_lower:.4f}, {ci_upper:.4f})")

# Check if 0 is in the 95% Confidence Interval
contains_zero = ci_lower <= 0 <= ci_upper
print(f"Does the 95% CI contain 0? {contains_zero}")

# Compare p-value from the fitted model to the simulated model's p-value
model_full = smf.ols('duration ~ waiting', data=long_wait).fit()
p_value_full = model_full.pvalues['waiting']
print(f"P-value from the full model fit: {p_value_full:.4f}")

# Simulated p-value calculation
# Two-sided: proportion of simulated slopes at least as extreme (in absolute value)
# as the full model's slope, so it is comparable to the two-sided p-value from statsmodels
simulated_p_value = np.mean(np.abs(slope_simulated) >= np.abs(model_full.params['waiting']))
print(f"Simulated p-value: {simulated_p_value:.4f}")

12. Model 1 (All Data using slope) is most likely to support the assumption of normality of residuals if its histogram looks symmetric, bell-shaped, and aligns closely with the normal distribution curve.
Models 2, 3, and 4 might fail to support the normality assumption due to their specific data subsets and model structures.

13. A) Steps for the permutation test:
Null hypothesis: There is no difference in mean duration between the short and long wait groups. This means that the labels of the two groups (short and long) are exchangeable.
Test Statistic: The observed difference in mean duration between the short and long wait groups.
Shuffling Process: Randomly shuffle the short/long labels across the observations, so that each duration gets a label unrelated to its actual group, and calculate the new difference in means. Repeat this process many times (e.g., 10,000 times) to build a distribution of differences under the null hypothesis.
p-value: The proportion of permuted differences that are at least as extreme as the observed difference.
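
The permutation steps above can be sketched as follows. The function name `permutation_test_mean_diff` and its arguments are introduced here for illustration, and the data is a synthetic stand-in with made-up group means, not the Old Faithful measurements.

```python
import numpy as np

def permutation_test_mean_diff(durations, is_long, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean duration between groups."""
    rng = np.random.default_rng(seed)
    durations = np.asarray(durations, dtype=float)
    is_long = np.asarray(is_long, dtype=bool)

    # Test statistic: observed difference in means (long minus short)
    observed = durations[is_long].mean() - durations[~is_long].mean()

    # Shuffle the labels and recompute the difference each time
    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(is_long)
        permuted[i] = durations[shuffled].mean() - durations[~shuffled].mean()

    # p-value: share of permuted differences at least as extreme as the observed one
    p_value = np.mean(np.abs(permuted) >= np.abs(observed))
    return observed, p_value

# Synthetic stand-in data (assumed values, for illustration only)
rng = np.random.default_rng(1)
durations = np.concatenate([rng.normal(2.0, 0.3, 80), rng.normal(4.3, 0.4, 120)])
is_long = np.concatenate([np.zeros(80, dtype=bool), np.ones(120, dtype=bool)])

observed_diff, p_value = permutation_test_mean_diff(durations, is_long, n_permutations=2000)
print(f"Observed difference in means: {observed_diff:.3f}")
print(f"Permutation p-value: {p_value:.4f}")
```

With the real dataset, `durations` would be the eruption durations and `is_long` the indicator of a long wait.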

B) Steps for Bootstrapping:
Sample with Replacement: For each group (short and long wait), sample with replacement, keeping the original group size, and calculate the mean of each bootstrap sample.
Compute the Difference: For each pair of bootstrap samples, compute the difference in the means between the two groups.
Repeat: Perform this resampling process a large number of times to build a distribution of bootstrapped differences.
Percentile Method: Calculate the 2.5th and 97.5th percentiles of the bootstrapped differences to form the 95% confidence interval.
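
A minimal sketch of these bootstrap steps, again on synthetic stand-in data with made-up group means. The function name `bootstrap_ci_mean_diff` and its parameters are hypothetical.

```python
import numpy as np

def bootstrap_ci_mean_diff(short_group, long_group, n_bootstrap=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(long) - mean(short)."""
    rng = np.random.default_rng(seed)
    short_group = np.asarray(short_group, dtype=float)
    long_group = np.asarray(long_group, dtype=float)

    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Resample each group with replacement, keeping its original size
        short_sample = rng.choice(short_group, size=short_group.size, replace=True)
        long_sample = rng.choice(long_group, size=long_group.size, replace=True)
        diffs[i] = long_sample.mean() - short_sample.mean()

    # Percentile method: cut off (1 - level) / 2 in each tail
    alpha = 1 - level
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Synthetic stand-in data (assumed values, for illustration only)
rng = np.random.default_rng(1)
short_durations = rng.normal(2.0, 0.3, 80)
long_durations = rng.normal(4.3, 0.4, 120)

ci_lower, ci_upper = bootstrap_ci_mean_diff(short_durations, long_durations, n_bootstrap=2000)
print(f"95% bootstrap CI for the difference in means: ({ci_lower:.3f}, {ci_upper:.3f})")
```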

Permutation Test: The p-value helps us determine whether the difference in means is statistically significant by comparing it to a threshold (typically 0.05).
Bootstrap Confidence Interval: The 95% confidence interval provides a range of plausible values for the difference in means. If the interval does not contain 0, the difference is statistically significant at the 5% level.
By applying both methods, you can assess the strength of evidence for or against a difference in the eruption durations between short and long wait times in the Old Faithful dataset.

14. I've read through the wiki textbook.