<div class="alert alert-warning">

# PS 1 - Bias and variance

In this problem set, we will replicate the code from lecture 2, to visualize bias and variance in model fitting, and get an understanding of what overfitting means.

## Reminder - problem

We are trying to decide how well a model fits data. We assume our data is a set of $(x_i,y_i)$ points (e.g. age, task accuracy) generated by a true target function $f$ (unknown; for example, accuracy in a task as a function of age): $$y_i \sim f(x_i) + noise$$
Our goal is to figure out $f$ (for example, does task accuracy increase linearly with age, or is the link more complex?).


## Reminder - approach
We model it with a proposed "computational model", which here is a function $g(x)$: for example a linear regression (2 parameters: intercept, slope), or a polynomial of degree $n$ ($n+1$ parameters).

After we find the best parameters for a model, we can measure how well our model is doing by 
- checking its prediction for each data point in the training set (e.g. how well do I expect a 16 year-old to do in the task?): $g(x_i)$. 
- computing the prediction error for this data point as the squared distance to the observed value: $(y_i-g(x_i))^2$.
- averaging this error over all the data points: $Error_{train} = \frac{1}{N}\sum_i ^N (y_i-g(x_i))^2$.

This tells us how close we are to the training data, but not how close we are to the true function. To assess that, we measure how well we can predict *new* test data, that is a new set of points $(x^{test} _i, y^{test} _i)$. We compute the testing error the same way, where $N$ is now the number of test points: $Error_{test} = \frac{1}{N}\sum_i ^N (y^{test} _i-g(x^{test} _i))^2$.
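
As a quick check of the formula, here is a minimal sketch that computes the mean squared error by hand with numpy. The function name `mean_squared_error_by_hand` and the numbers are made up for illustration; the notebook itself uses `mean_squared_error` from sklearn further down.

```python
import numpy as np

def mean_squared_error_by_hand(y, y_pred):
    """Average of the squared differences between observations and predictions."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

# Made-up numbers, for illustration only
y_obs = np.array([0.80, 0.90, 1.00])    # observed accuracies y_i
y_model = np.array([0.70, 0.90, 0.80])  # model predictions g(x_i)

error = mean_squared_error_by_hand(y_obs, y_model)
print(error)  # (0.01 + 0 + 0.04) / 3, i.e. about 0.0167
```

The same function gives the training error when called on the training points and the testing error when called on the test points.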

## Reminder - Bias and variance
If our training error is low but our testing error is high, our model gets close to the training data but cannot predict new data. That's called "overfitting": our model is too complex and captures the noise in the data instead of the signal. Instead, we want both the training and testing error to be low, which shows that the model can generalize to new data.

We did some maths to show that the expected error can be decomposed into two terms that depend on the model, the (squared) bias and the variance, plus the noise in the data, which no model can remove:
- the bias is the distance from the model to the true function, in expectation over noisy data sets [am I systematically not capturing the true phenomenon?]
- the variance is the variance in the model itself [with different training data sets, do I make very different predictions?]
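
One common way to write this decomposition is sketched below. The notation is an assumption made here, not necessarily the one used in lecture: $g_D$ is the model fitted on training set $D$, $\bar{g}(x) = E_D[g_D(x)]$ is the average model over training sets, $y = f(x) + noise$ is a new observation at $x$, and $\sigma^2$ is the variance of the noise, which is assumed to have zero mean and to be independent of $D$.

$$E_{D,\,noise}\left[(y - g_D(x))^2\right] = \underbrace{(\bar{g}(x) - f(x))^2}_{bias^2} + \underbrace{E_D\left[(g_D(x) - \bar{g}(x))^2\right]}_{variance} + \underbrace{\sigma^2}_{noise}$$

The first term asks whether the average model misses the true function, the second asks how much individual fits scatter around that average, and the third is the error that remains even for a perfect model.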


## This problem set

We'll use the target function from class, plot the results, and see how different assumptions (number of data points, amount of noise, etc.) impact the model and its fit. Your goal is to experience the phenomenon yourself, so that you understand better what happens and get a good feel for the concepts. It is also a chance to practice coding a model and making good visualizations.

We haven't learned how to fit a model yet, so we provide the fitting code here. You do not need to understand the model fitting part at this point.

In [None]:
# numpy as always
import numpy as np
# plotting!
import matplotlib.pyplot as plt
# the packages below are used to fit the polynomial models and generate predictions/measure errors from them. 
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# --- Constants for x-axis ranges ---
X_TRAIN_MIN = 12  # Minimum age for training data
X_TRAIN_MAX = 30  # Maximum age for training data
X_PLOT_MIN = 10   # Minimum age for plotting (slightly wider to show edge behavior)
X_PLOT_MAX = 32   # Maximum age for plotting

# Setting up the polynomial regression model for bias-variance

In the code below, we will create the true function that we want to model (target_function, the 2nd degree polynomial from class: accuracy as a function of age). We will then create a noisy data set generated by this function, with a chosen number of samples (n_samples) and noise level (noise_level).


In [None]:
# 1. Setup Data Generation
np.random.seed(42)  # For reproducibility

def target_function(x):
    # Quadratic target: y = 1 - 0.0025 * (x - 25)^2, which peaks at age 25
    return 1-.0025*(x-25)**2

# Generate Training Data (10 points)
n_samples = 10
noise_level = .05  # Standard deviation of noise

X_train = np.sort(np.random.uniform(X_TRAIN_MIN, X_TRAIN_MAX, n_samples))
y_train = target_function(X_train) + np.random.normal(0, noise_level, n_samples)

# Generate Test Data (10 points)
X_test = np.sort(np.random.uniform(X_TRAIN_MIN, X_TRAIN_MAX, n_samples))# pick ages uniform from 12-30
y_test = target_function(X_test) + np.random.normal(0, noise_level, n_samples)# generate data as target + noise

# For plotting the smooth curves
X_plot = np.linspace(X_PLOT_MIN, X_PLOT_MAX, 100)
y_true_plot = target_function(X_plot)

# Reshape X for sklearn (needs 2D array)
X_train_r = X_train[:, np.newaxis]
X_test_r = X_test[:, np.newaxis]
X_plot_r = X_plot[:, np.newaxis]


Same as in class, we will fit 3 models: a linear regression (2 parameters), a 2nd degree polynomial (3 parameters), and an 8th degree polynomial (9 parameters).
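
To make the parameter counts concrete, here is a small sketch that fits the three polynomial degrees and counts the fitted coefficients. It uses `numpy.polynomial.Polynomial.fit` as a stand-in for the sklearn pipeline used in this notebook, and the seed and data are chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)           # seed chosen arbitrarily
x = np.sort(rng.uniform(12, 30, 10))     # 10 ages between 12 and 30
y = 1 - 0.0025 * (x - 25) ** 2 + rng.normal(0, 0.05, 10)  # target + noise

n_params = {}
for degree in [1, 2, 8]:
    fitted = Polynomial.fit(x, y, degree)   # least-squares polynomial fit
    n_params[degree] = len(fitted.coef)     # one coefficient per power 0..degree
    print(f"degree {degree}: {n_params[degree]} parameters")
```

Note that the degree 8 model has 9 parameters for only 10 data points, so it has almost enough freedom to pass through every training point.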

In [None]:

# 2. Define Models
degrees = [1, 2, 8]
model_names = ["Linear (Underfitting)", "Quadratic (Optimal)", "8th Degree (Overfitting)"]
colors = ['red', 'green', 'blue']



Below is the fitting code. Make sure you read and understand what each step is doing at the high level. At this stage of the class, you do not need to understand the details of how the sklearn package functions perform the fitting - you should just understand at the conceptual level what their output is.

In [None]:

# Use 'constrained_layout=True' to fix the tight spacing/overlap
fig, axes = plt.subplots(1, 3, figsize=(18, 6), constrained_layout=True)
# 3. Fit Models and Visualize
for i, degree in enumerate(degrees):
    ax = axes[i] # Access the specific subplot
    
    # Create and fit the model
    # obtain the best parameters for each model, given the data
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train_r, y_train)
    
    # Predictions
    # obtain the predictions based on known x values (ages), and fit parameters (values on the fit curve)
    y_plot_pred = model.predict(X_plot_r)
    y_train_pred = model.predict(X_train_r)
    y_test_pred = model.predict(X_test_r)
    
    # Calculate Errors
    # distance between true data (accuracy) and predicted
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    
    # Plotting
    ax.plot(X_plot, y_true_plot, color='gray', linestyle='--', label="True Function")
    ax.scatter(X_train, y_train, color='navy', s=50, label="Train Data")
    ax.scatter(X_test, y_test, color='orange', marker='x', s=50, label="Test Data")
    ax.plot(X_plot, y_plot_pred, color=colors[i], linewidth=2, label=f"Model (deg={degree})")
    
    # Formatting
    ax.set_title(f"{model_names[i]}\nTrain MSE: {train_mse:.3f} | Test MSE: {test_mse:.3f}")
    ax.set_ylim(0.5, 1.1)
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)
    
    plt.xlabel('age')
    plt.ylabel('Accuracy')

#plt.tight_layout()
plt.show()

<div class="alert alert-success">

# 1. Exercise

1. Fix the code above so that all three plots have an x and y label.
2. Change the random seed (or do not set it) to see what happens with different data sets. Which fitted model changes the most?
3. Change the noise level up and down. What happens to the training and testing error?
4. Change the number of training data points up and down. What happens to the training and testing error?

Tips for playing with models:
Consider:
- creating new cells for each attempt, so you can compare the plots; 
- adding markdown cells with your notes/questions/conclusions to help you review your work;
- adding titles that indicate the value of the variable you're investigating, so you know which plot is which; 
- and refactoring the code as functions so that you can call the function with parameter values of interest, rather than copy-pasting the code again.
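
As one possible starting point for the last tip, here is a sketch of the data generation step wrapped in a function. The name `make_dataset` and its arguments are hypothetical, and it uses `np.random.default_rng` rather than the `np.random.seed` call in the cells above, so the same seed will not reproduce the same points as the original code.

```python
import numpy as np

def make_dataset(n_samples, noise_level, seed=None, x_min=12, x_max=30):
    """Draw ages uniformly in [x_min, x_max] and return noisy accuracies."""
    rng = np.random.default_rng(seed)  # seed=None gives a different data set each call
    x = np.sort(rng.uniform(x_min, x_max, n_samples))
    # Same quadratic target as target_function, plus Gaussian noise
    y = 1 - 0.0025 * (x - 25) ** 2 + rng.normal(0, noise_level, n_samples)
    return x, y

# Two data sets that differ in size and noise level
x_small, y_small = make_dataset(n_samples=10, noise_level=0.05, seed=1)
x_large, y_large = make_dataset(n_samples=50, noise_level=0.20, seed=1)
print(x_small.shape, x_large.shape)
```

The fitting and plotting steps can be wrapped the same way, so that each experiment becomes a single function call with the parameter values you want to compare.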

In [None]:
# YOUR CODE

### YOUR NOTES

# 2. Iterating over multiple datasets

Let's verify the insights from this first exercise by visualizing them systematically: what happens if we train the model on different data sets?

This will also allow us to formally compute the bias and variance, in expectation over training on multiple data sets. 

Make sure to carefully read through the code to understand what we are doing.

In [None]:
# --- Configuration ---
np.random.seed(42)
n_simulations = 100
n_samples = 15        # Size of each training set
noise_level = .05     # Irreducible error (sigma)
degrees_to_plot = [1, 2, 8] # The specific models we want to visualize in detail
max_degree_analysis = 9     # For the final tradeoff curve

# Fixed Test Set (ground truth for evaluating bias/variance)
X_test = np.linspace(X_PLOT_MIN, X_PLOT_MAX, 100)
y_true = target_function(X_test)
X_test_r = X_test[:, np.newaxis]

In [None]:
# Storage for predictions across all simulations
# Structure: {degree: [pred_sim_1, pred_sim_2, ...]}
all_predictions = {d: [] for d in range(1, max_degree_analysis + 1)}

# --- 1. Run Simulations ---
for i in range(n_simulations):
    # Generate a fresh training set with random noise
    X_train = np.sort(np.random.uniform(X_TRAIN_MIN, X_TRAIN_MAX, n_samples))
    y_train = target_function(X_train) + np.random.normal(0, noise_level, n_samples)
    X_train_r = X_train[:, np.newaxis]
    
    # Fit models of varying complexity
    for d in range(1, max_degree_analysis + 1):
        model = make_pipeline(PolynomialFeatures(d), LinearRegression())
        model.fit(X_train_r, y_train)
        y_pred = model.predict(X_test_r)
        all_predictions[d].append(y_pred)

# Convert lists to numpy arrays for calculation
for d in all_predictions:
    all_predictions[d] = np.array(all_predictions[d])

In [None]:
# --- 2. Visualization Part A: Model Stability (Spaghetti Plots) ---
plt.figure(figsize=(18, 5))
titles = ["Degree 1 (High Bias, Low Var)", "Degree 2 (Optimal)", "Degree 8 (Low Bias, High Var)"]

for i, d in enumerate(degrees_to_plot):
    ax = plt.subplot(1, 3, i + 1)
    
    # Plot the 100 different models generated
    # We use high transparency (alpha) to show density
    preds = all_predictions[d]
    for j in range(n_simulations):
        ax.plot(X_test, preds[j], color='blue', alpha=0.1, linewidth=1)
        
    # Plot the Average Model (Expected Value)
    avg_pred = np.mean(preds, axis=0)
    ax.plot(X_test, avg_pred, color='red', linewidth=3, linestyle='--', label="Average Model")
    
    # Plot Truth
    ax.plot(X_test, y_true, color='black', linewidth=2, label="True Function")
    
    ax.set_title(titles[i])
    ax.set_ylim(.5, 1.1)
    if i == 0: ax.legend()

plt.suptitle("Visualizing Variance: 100 Simulations of Training Data", fontsize=16)
plt.show()

<div class="alert alert-success">

# 2. Exercise

Oh no! Poor plotting practices! What are the blue lines? Add them to the legend, and explain what they show.

### Your Notes here

Write your answers, thoughts, questions

In [None]:
# --- 3. Calculation & Visualization Part B: The Tradeoff Curve ---
max_degree_for_plotting = 4
degrees = range(1, max_degree_for_plotting + 1)
bias_squared = []
variance = []
total_error = []

for d in degrees:
    preds = all_predictions[d] # Shape: (100 simulations, 100 test points)
    
    # Expected Prediction (Mean across simulations)
    E_y_hat = np.mean(preds, axis=0)
    
    # Bias^2: (Expected_Pred - Truth)^2
    # We take the mean over all test points to get a single scalar score
    bias_sq = np.mean((E_y_hat - y_true) ** 2)
    
    # Variance: E[(Pred - Expected_Pred)^2]
    # We take the mean variance over all test points
    var = np.mean(np.var(preds, axis=0))
    
    bias_squared.append(bias_sq)
    variance.append(var)
    total_error.append(bias_sq + var + noise_level**2)

# Plotting the metrics
plt.figure(figsize=(10, 6))
plt.plot(degrees, bias_squared, 'o-', label='$Bias^2$', color='blue', linewidth=2)
plt.plot(degrees, variance, 'o-', label='Variance', color='orange', linewidth=2)
plt.plot(degrees, total_error, 'o-', label='Total Error', color='red', linewidth=2)

plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Error')
plt.title('The Bias-Variance Tradeoff')
plt.axvline(x=2, color='gray', linestyle=':', label='Optimal Complexity')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

<div class="alert alert-success">

# 3. Exercise

Same as in exercise 1, manipulate noise and number of samples in the training data. You can also test more values of model complexity by changing max_degree_for_plotting.

You should draw the same conclusions as in exercise 1, but see if the new visualizations make it more obvious/easy to interpret.


Make sure this helps you understand the concepts of bias and variance to assess model fit, as well as overfitting.

<div class="alert alert-success">

# 4. Exercise

The cell below defines a new, more complex target function. Add code to visualize this function. Reproduce the exercises above with the more complex function. What changes?

In [None]:
def new_target_function(x):
    # Degree 4 polynomial: the quadratic from before plus a quartic term
    return 1-.0025*(x-25)**2 - .00005*(x-11)*(x-17)*(x-27)*(x-31)

### YOUR NOTES/ANSWERS