# Project 1.1: Understanding linear models on synthetic data

```
From ML Theory to Practice
Universität Potsdam, fall semester 2025

Authors: Juan L. Gamella and Simon Bing
License: CC-BY-4.0 https://creativecommons.org/licenses/by/4.0/
```

## Imports

These packages should already be installed in your Python virtual environment.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

## Generating synthetic data

<mark style="background-color: #40E0D0;"> Task </mark> 

Write code to generate `N` samples from the following linear model

$Y := \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$

where $\beta \in \mathbb{R}^4$ and $\epsilon \sim \mathcal{N}(0,1)$, and $X_1$, $X_2$, $X_3$ are sampled independently and uniformly at random from $[0,10]$.

For now, set $\beta = (0,1,2,3)$, `N = 100` and store the samples in a dataframe with columns `X0, X1, X2, X3, Y`, where `X0` is just a vector of ones.
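As a starting point, here is a minimal sketch of the sampling pattern for a deliberately simpler model with a single predictor. The coefficients `0.5` and `2.0`, the sample size, and the names `toy`, `x1`, `eps` are assumptions made for illustration only; they are not the values required by the task.

```python
import numpy as np
import pandas as pd

# Hypothetical toy model with ONE predictor: Y = 0.5 + 2 * X1 + eps
rng = np.random.default_rng(0)   # seeded generator, so results are reproducible
n = 5                            # tiny sample size, for illustration only

x1 = rng.uniform(0, 10, size=n)  # X1 ~ Uniform[0, 10]
eps = rng.normal(0, 1, size=n)   # eps ~ N(0, 1)
y = 0.5 + 2.0 * x1 + eps         # linear model with additive noise

# X0 is a column of ones that plays the role of the intercept
toy = pd.DataFrame({"X0": np.ones(n), "X1": x1, "Y": y})
print(toy)
```

Drawing every random quantity from the same `rng` object is what makes the dataset depend only on the seed.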

In [None]:
true_coefficients = np.array([0,1,2,3])
def generate_dataset(N, seed, model_coefficients = true_coefficients):
    rng = np.random.default_rng(seed)
    
    # TODO: your code goes here
    
    # Store in a dataframe    
    data = pd.DataFrame({'X0': X0, 'X1': X1, 'X2': X2, 'X3': X3, 'Y': Y})
    return data

In [None]:
data = generate_dataset(N=100, seed=42)

## Visualize the data

<mark style="background-color: #40E0D0;"> Task </mark> 

Before fitting a model, let's do a sanity check on the data. Make a corner plot using [`sns.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html), and specify `vars=['X1', 'X2', 'X3', 'Y']` and `corner=True`. Look at the row for `Y` vs. the other variables and check that there is indeed the linear effect you expect.

In [None]:
# TODO: your code goes here

<mark style="background-color: #648fff;">Question:</mark> Write down anything that surprises you to discuss later with your classmates.

## Fitting a linear model using `statsmodels`

Now, we will use statsmodels to fit a linear model using Ordinary Least Squares (OLS).

You can do this by calling:

```results = sm.OLS(<outcome>,<predictors>).fit()```

where `outcome` and `predictors` are dataframes. You can find an example [here](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS) (feel free to ask your favorite chatbot for help).

<mark style="background-color: #40E0D0;"> Task </mark> 

Fit a linear model to predict `Y` from `X0, X1, X2, X3`. Store the fitted model in a variable called `results`.

In [None]:
# TODO: your code goes here

## Interpreting the results

<mark style="background-color: #40E0D0;"> Task </mark> 


Print the summary table of the linear model by calling `print(results.summary())`.

In [None]:
print(results.summary())

<br>
<mark style="background-color: #648fff;">Question:</mark> What does the column `coef` show?

<mark style="background-color: #648fff;">Question:</mark> Why is $\hat{\beta}_0$ not zero?

<mark style="background-color: #648fff;">Question:</mark> What do the columns `[0.025` and `0.975]` show?

<mark style="background-color: #648fff;">Question:</mark> What does the column `P>|t|` show? Why is the entry for `X0` larger than the others?

<mark style="background-color: #648fff;">Question:</mark> Change the random seed above, generate fresh data, and re-fit the model. What has happened to the values in `P>|t|`? Why?

<mark style="background-color: #648fff;">Question:</mark> If you increase the sample size `N`, what will happen to the columns `coef`, `P>|t|`, `[0.025`, and `0.975]`?

<mark style="background-color: #40E0D0;"> Task </mark> 

Now, generate fresh data with `N=1000` and `N=10000` and check your hypothesis:

In [None]:
# TODO: your code goes here

<br>
<mark style="background-color: #648fff;">Question:</mark> What happened when you increased the sample size? Was your hypothesis correct?

## Understanding confidence intervals

We will now do a few experiments to test your understanding of confidence intervals.

You can access the 95% confidence intervals (i.e., $\alpha=0.05$) computed by a model by calling `results.conf_int()`.

In [None]:
results.conf_int()

Here, the rows correspond to the coefficient of each predictor, and the columns give you the lower (0) and upper (1) bounds. You can access individual entries using `.loc`:

In [None]:
# The lower bound on the CI for X1
results.conf_int().loc['X1', 0]

To access the actual estimates for the coefficients, you can call `results.params`:

In [None]:
results.params

<mark style="background-color: #40E0D0;"> Task </mark> 

Now, write code to
- generate a fresh dataset with a different random seed (but the same `N=100` and `true_coefficients`)
- fit a linear model on this dataset
- store the coefficient estimates and the confidence intervals for each model
  
Run this code 1000 times, storing the results in (for example) `all_coefs` and `all_cis`.

In [None]:
# TODO: your code goes here

<mark style="background-color: #40E0D0;"> Task </mark> 

Now, for each variable `X0, X1, X2, X3`, plot the distribution of the fitted coefficients using [`sns.kdeplot`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).

In [None]:
# TODO: your code goes here

<br>
<mark style="background-color: #648fff;">Question:</mark> What kind of distributions are these?

<br>
<mark style="background-color: #648fff;">Question:</mark> Why does the distribution for the coefficient of $X_0$ have larger variance?

<br>
<mark style="background-color: #40E0D0;"> Task </mark> 

Now, make the following plot to visualize the confidence intervals resulting from the 1000 models.

For each variable $X_j$ (`X0, ..., X3`):

- Using [`plt.hlines`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hlines.html), plot each 95% confidence interval ($\alpha = 0.05$) as a horizontal line, extending from its lower to its upper limit. The line should be drawn at height `y=i`, where `i` is the index of the model (out of the 1000 fitted above).
- Using [`plt.vlines`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.vlines.html), draw a vertical line (e.g., in red), at the value of the true coefficient $\beta_j$ for that variable.

In [None]:
# TODO: your code goes here

<br>


<mark style="background-color: #648fff;">Question:</mark> Now, make a guess: what percentage of the resulting confidence intervals do you think contain the true coefficients $\beta_0, \ldots, \beta_3$? Will this percentage be the same for all coefficients?

<br>
<mark style="background-color: #40E0D0;"> Task </mark>

Now, for each variable ($X_0, X_1, X_2, X_3$), compute how often the true coefficient falls inside the corresponding confidence interval (i.e., give a percentage).

In [None]:
# TODO: your code goes here

<br>
<mark style="background-color: #648fff;">Question:</mark> Was your prediction correct? If not, what do you think is happening?

Now, we will do the same, but this time we will count how often the true coefficients of all variables _simultaneously_ fall inside their confidence intervals.

<mark style="background-color: #648fff;">Question:</mark>  Will the resulting percentage be higher, lower or the same (make a guess)? Try to explain your reasoning.

<br>
<mark style="background-color: #40E0D0;"> Task </mark>

Now write the code to compute the simultaneous coverage percentage.

In [None]:
# TODO: your code goes here

<br>
<mark style="background-color: #648fff;">Question:</mark>  Were you right in your prediction? If not, what could be going wrong?

## Understanding prediction intervals

We will now look at prediction intervals.

Given a DataFrame `new_covariates` containing observations of covariates (i.e., the predictors `X0, ..., X3`), you can compute the prediction intervals at level $\alpha=0.05$ for each observation by calling:

In [None]:
new_covariates = data[['X0', 'X1', 'X2', 'X3']] # Placeholder for this example: reuses the dataset generated above
results.get_prediction(new_covariates).summary_frame(alpha=0.05)

The limits of the prediction interval are given by `obs_ci_lower` and `obs_ci_upper`.

<br>
<mark style="background-color: #40E0D0;"> Task </mark>

Generate a fresh dataset by calling `generate_dataset`. Pick a random seed that has not been selected before and set `N=1000`. Then compute the prediction intervals for each observation following the example above.

In [None]:
# TODO: your code goes here

<br>
<mark style="background-color: #648fff;">Question:</mark> Now, make a guess. What percentage of the measured outcomes ($Y$) will be contained inside their prediction interval?

<br>
<mark style="background-color: #40E0D0;"> Task </mark>

Compute the actual percentage and print it.

In [None]:
# TODO: your code goes here

<br>
<mark style="background-color: #648fff;">Question:</mark> Was your hypothesis correct? If not, what do you think is going wrong? Make a guess.