In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Problem: Does More Data Improve a Linear Fit?

In this exercise, you will explore how adding more data points can improve a linear regression model. The task involves generating data from a linear model, adding noise, and analyzing how the model's fit improves as the number of data points increases.

We will assume that there is a true linear relationship between a driver's speed $x$ and their lap time $y$ in a Formula 1 race, represented by the equation:
$y = \alpha x + \beta$
where $\alpha$ is the true slope and $\beta$ is the true intercept. Your job is to estimate $\alpha$ and $\beta$ from noisy data using linear regression.

## Data Generation:

To use linear regression, we need data generated from a linear model. This does not mean that every data point falls exactly on one line, but a scatterplot of the data should show a roughly linear trend. One way to generate data that is appropriate for linear regression is to pick points that lie on a single straight line and then shift each point up or down by a small random amount.

For our experiment, we will generate data in this way.
Let $\alpha$ and $\beta$ be the true slope and intercept, respectively.
Generate data by randomly choosing $x$ values from the interval $[-10, 10]$ and calculating corresponding $y$ values using the formula:
$$
y_i = \alpha x_i + \beta + \epsilon_i
$$
where each $\epsilon_i$ is a random noise term uniformly drawn from the interval $[-\delta, \delta]$.
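The sampling scheme above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_noisy_line`, its argument list, and the example values of `alpha`, `beta`, and `delta` are hypothetical and are not the required signature of `generate_data`.

```python
import numpy as np

def sample_noisy_line(n, alpha, beta, delta, rng):
    """Draw n points from y = alpha * x + beta + eps, with eps uniform on [-delta, delta)."""
    x = rng.uniform(-10.0, 10.0, size=n)      # x values from the interval [-10, 10)
    eps = rng.uniform(-delta, delta, size=n)  # bounded, zero-mean uniform noise
    y = alpha * x + beta + eps
    return x, y

# Example values chosen for illustration only (hypothetical, not from the notebook).
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
x, y = sample_noisy_line(n=50, alpha=2.0, beta=1.0, delta=1.0, rng=rng)
```

Passing the generator `rng` explicitly is a design choice that makes repeated experiments reproducible; calling `np.random.uniform` directly, as the notebook does for `alpha` and `beta`, would work as well.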


## Task:

Below we outline the components you need to code to complete this question. For each component we provide a function heading, which you need to complete, including defining the function arguments (inputs).

### **Generate Data:**
   - Randomly generate the true values $\alpha$ and $\beta$.
   - Generate $n$ data points using the relationship above, where $\epsilon_i$ represents noise added to the true model.


In [None]:
# True slope and intercept, drawn once at random
alpha = np.random.uniform(1.0, 5.0)
beta = np.random.uniform(0.0, 10.0)

def generate_data():




### **Fit a Linear Model:**
   - Using the generated data, find the optimal slope $w_1^*$ and intercept $w_0^*$ by explicitly calculating the least squares estimates for a linear model:
   $$y = w_1x + w_0$$
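For reference, minimizing the sum of squared residuals of a one-variable linear model gives the standard closed-form estimates

$$
w_1^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
w_0^* = \bar{y} - w_1^* \bar{x},
$$

where $\bar{x}$ and $\bar{y}$ are the sample means. A minimal sketch of these formulas is shown below. The function name `least_squares_line` and the single-function layout are assumptions for illustration; the assignment asks for separate slope and intercept functions.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (w1, w0) minimizing sum((y - (w1 * x + w0)) ** 2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    sxx = np.sum((x - x_mean) ** 2)               # spread of x around its mean
    sxy = np.sum((x - x_mean) * (y - y_mean))     # co-variation of x and y
    w1 = sxy / sxx                                # slope; undefined if all x are identical
    w0 = y_mean - w1 * x_mean                     # intercept
    return w1, w0

# Noise-free check on a hypothetical line y = 3x + 2.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 3.0 * x_demo + 2.0
w1, w0 = least_squares_line(x_demo, y_demo)
```

On noise-free data the estimates should recover the line exactly, which makes this a convenient sanity check before moving on to noisy data.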


In [None]:
def find_optimal_slope():



def find_optimal_intercept():




### **Evaluate the Fit:**

To evaluate the quality of the solution, calculate the empirical risk $R_{sq}(w_0^*, w_1^*)$, i.e. the mean squared error (MSE) of the fitted line on the generated data.
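Assuming the squared-loss empirical risk is the average squared residual,

$$
R_{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (w_1 x_i + w_0)\bigr)^2,
$$

it can be computed as in the sketch below. Some courses include an extra factor of $\tfrac{1}{2}$, so check the definition used in your lectures. The function name `mean_squared_error` and its arguments are hypothetical.

```python
import numpy as np

def mean_squared_error(x, y, w1, w0):
    """Average squared residual of the line y = w1 * x + w0 on the data (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (w1 * x + w0)
    return float(np.mean(residuals ** 2))

# Tiny hand-checkable example: predictions are 1, 3, 5, so residuals are 0, 0, 1.
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([1.0, 3.0, 6.0])
risk = mean_squared_error(x_demo, y_demo, w1=2.0, w0=1.0)
```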





In [None]:
def calculate_mse():

### **Plotting**

- Plot the generated data points, the true line $y = \alpha x + \beta$, and the fitted line $y = w_1^*x + w_0^*$ on the same plot.
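One possible way to draw the three elements is sketched below. The helper name `draw_fit`, its arguments, the colors, and the demo values are all assumptions; `np.polyfit` is used only as a shortcut to obtain a fitted line for the demo and does not replace the explicit least squares calculation asked for above.

```python
import numpy as np
import matplotlib.pyplot as plt

def draw_fit(ax, x, y, alpha, beta, w1, w0):
    """Draw the data, the true line, and the fitted line on an existing Axes."""
    grid = np.linspace(-10.0, 10.0, 200)
    ax.scatter(x, y, s=15, color="gray", label="noisy data")
    ax.plot(grid, alpha * grid + beta, color="tab:green", label="true line")
    ax.plot(grid, w1 * grid + w0, color="tab:red", linestyle="--", label="fitted line")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return ax

# Demo with hypothetical true parameters alpha = 2, beta = 1 and delta = 1.
rng = np.random.default_rng(1)
x_demo = rng.uniform(-10.0, 10.0, size=30)
y_demo = 2.0 * x_demo + 1.0 + rng.uniform(-1.0, 1.0, size=30)
w1_demo, w0_demo = np.polyfit(x_demo, y_demo, deg=1)  # returns [slope, intercept]
fig, ax = plt.subplots(figsize=(5, 4))
draw_fit(ax, x_demo, y_demo, alpha=2.0, beta=1.0, w1=w1_demo, w0=w0_demo)
```

Taking the `Axes` as an argument, rather than creating a figure inside the function, lets the same helper fill each panel of a multi-panel figure, for example by looping over `axes.flat` from `plt.subplots(2, 4)`.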


In [None]:
def plot_fit():


## Experiments:
Include the plots you generate in this question in your LaTeX write-up.

- For each experiment below, report the estimated slope $w_1^*$ and intercept $w_0^*$, and compare them with the true values $\alpha$ and $\beta$.


### **Experiment 1: Varying the Number of Points**
   - Fix the noise bound $\delta = 1$, so that the noise is uniform on $[-1, 1]$.
   - Vary the number of points $n = 2,5,7,10, 20, 50, 75, 100$.
   - For each value of $n$, generate data, fit the model, calculate MSE, and plot MSE as a function of $n$.
   - Generate 1 plot with 8 subplots, one for each value of $n$, each subplot showing the true model, the generated noisy data points, and the predicted model.
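The loop for this experiment might look like the sketch below. The true parameters are fixed example values rather than the notebook's random draws, and `np.polyfit` stands in for the explicit least squares formulas; both choices are assumptions made to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha_true, beta_true, delta = 3.0, 4.0, 1.0   # hypothetical example values
n_values = [2, 5, 7, 10, 20, 50, 75, 100]

results = []
for n in n_values:
    x = rng.uniform(-10.0, 10.0, size=n)
    y = alpha_true * x + beta_true + rng.uniform(-delta, delta, size=n)
    w1, w0 = np.polyfit(x, y, deg=1)            # stand-in for the explicit formulas
    mse = float(np.mean((y - (w1 * x + w0)) ** 2))
    results.append((n, w1, w0, mse))
    print(f"n={n:3d}  slope={w1:.3f}  intercept={w0:.3f}  MSE={mse:.4f}")
```

With $n = 2$ the fitted line passes through both points, so the training MSE is zero up to rounding even though the estimates may be far from $\alpha$ and $\beta$. For that reason it may help to look at the parameter errors alongside the MSE when interpreting the results.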

### **Experiment 2: Varying the Noise Interval**
   - Fix the number of points $n = 100$.
   - Vary the noise bound $\delta = 0.1, 1, 5, 10$.
   - For each value of $\delta$, generate data, fit the model, calculate MSE, and plot MSE as a function of $\delta$.
   - Generate 1 plot with 4 subplots, one for each value of $\delta$, each subplot showing the true model, the generated noisy data points, and the predicted model.
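A sketch of the noise sweep is shown below, again with fixed example parameters and `np.polyfit` as a stand-in for the explicit formulas (both are assumptions). Because noise drawn uniformly from $[-\delta, \delta]$ has variance $\delta^2/3$, that value is printed next to each MSE as a rough reference point.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha_true, beta_true, n = 3.0, 4.0, 100       # hypothetical example values
delta_values = [0.1, 1.0, 5.0, 10.0]

mse_by_delta = {}
for delta in delta_values:
    x = rng.uniform(-10.0, 10.0, size=n)
    y = alpha_true * x + beta_true + rng.uniform(-delta, delta, size=n)
    w1, w0 = np.polyfit(x, y, deg=1)            # stand-in for the explicit formulas
    mse = float(np.mean((y - (w1 * x + w0)) ** 2))
    mse_by_delta[delta] = mse
    print(f"delta={delta:5.1f}  MSE={mse:8.4f}  delta^2/3={delta ** 2 / 3:8.4f}")
```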







In your LaTeX write-up, discuss:

- how adding more data points affects the fit, and
- how increasing the noise affects the model's accuracy.

