# Solution to Exercise set 1: Least squares regression

This exercise focuses on practical applications of least squares regression with Python. You will learn how to apply least squares regression to fit models of different kinds and how to evaluate the results.

**Learning Objectives:**

After completing this exercise set, you will be able to:

* Apply ordinary least squares regression to fit a linear model to data, using [NumPy](https://numpy.org/), [SciPy](https://scipy.org/), [scikit-learn](https://scikit-learn.org/), and [statsmodels](https://www.statsmodels.org/).
* Compare the quality of different regression models by inspecting their residuals.
* Apply weighting to data points in least squares regression.
* Estimate errors for coefficients by obtaining their confidence intervals.
* Use non-linear least squares regression.


**To get the exercise approved, complete the following problems:**

- [1.1(c)](#1.1(c)): Ordinary least squares regression and comparison of quality by residuals.
- [1.3(b)](#1.3(b)): Non-linear least squares regression.

**Files required for this exercise:**
* For [Exercise 1.1](#Exercise-1.1:-Polynomial-regression-with-least-squares): [temperature.csv](temperature.csv)
* For [Exercise 1.2](#Exercise-1.2:-Weighted-least-squares): [erdinger.csv](erdinger.csv)

Please ensure that these files are saved in the same directory as this notebook.

## Exercise 1.1: Polynomial regression with least squares

The temperature (°C) is measured continuously over time at high altitude
in the atmosphere using a weather balloon. Every hour, a measurement is made and sent to an on-board computer.
The measurements are shown in Figure 1 and can be found in the [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file [temperature.csv](temperature.csv).

<figure>
<img src="Fig_1_1.png" width="50%">
<figcaption><p style='text-align: center;'><b>Figure 1:</b> Measured temperature as a function of time.</p></figcaption> 
</figure>

### 1.1(a)

To model the temperature ($y$) as a function of time ($x$), we choose a second-order polynomial:

\begin{equation}
y = b_0 + b_1 x + b_2 x^2.
\end{equation}

Explain how you can formulate this in a form suitable for least-squares regression,
$\mathbf{y} = \mathbf{X} \mathbf{b}$. That is, **what do the vectors $\mathbf{y}$ and $\mathbf{b}$ and the matrix $\mathbf{X}$ contain?**

#### Your answer to question 1.1(a): What do the vectors $\mathbf{y}$ and $\mathbf{b}$ and the matrix $\mathbf{X}$ contain?

First, we rewrite the given model as a linear model by introducing
the variables $z_1 = x$ and $z_2 = x^2$:

\begin{equation*}
\hat{y} = b_0 + b_1 x + b_2 x^2 = b_0 + b_1 z_1 + b_2 z_2 .
\end{equation*}

We then let $y_{i}$ be the result of
measurement no. $i$ and
$z_{ij}$ the value of variable $z_j$ in the same measurement. Our model for this particular
point is then:

\begin{equation*}
y_i = b_0 + b_1 z_{i1} + b_2 z_{i2} .
\end{equation*}

In matrix form (assuming we have $n$ measurements), we get,

$$
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix}
= \begin{bmatrix}
1 & z_{11} & z_{12} \\
1 & z_{21} & z_{22} \\
\vdots & \vdots & \vdots  \\
1 & z_{n1} & z_{n2} 
\end{bmatrix}
\begin{bmatrix}
b_0 \\
b_1 \\
b_2
\end{bmatrix} .
$$

We can also write the design matrix $\mathbf{X}$ using the original variables,

$$
\mathbf{X} =
\begin{bmatrix}
1 & z_{11} & z_{12} \\
1 & z_{21} & z_{22} \\
\vdots & \vdots & \vdots \\
1 & z_{n1} & z_{n2}
\end{bmatrix}
=
\begin{bmatrix}
1 & x_{1} & x_{1}^2 \\
1 & x_{2} & x_{2}^2 \\
\vdots & \vdots & \vdots \\
1 & x_{n} & x_{n}^2 
\end{bmatrix}
.
$$

Thus we have the following:

1. $\mathbf{y}$ contains the measured $y_i$ values and $\mathbf{b}$ contains the parameters $b_0$, $b_1$, and $b_2$.
2. $\mathbf{X}$ contains the variables as columns and a column of $1$'s
  to account for the constant term.
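
As a minimal sketch of this formulation, the design matrix can be built column by column and the system solved with `np.linalg.lstsq`. The data below are hypothetical (generated from $y = 1 + 2x + 3x^2$, not taken from `temperature.csv`), so the expected parameters are known in advance:

```python
import numpy as np

# Hypothetical measurements generated from y = 1 + 2x + 3x^2
# (illustration only, not the temperature data):
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix X with the columns [1, x, x^2]:
X = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solution of y = X b:
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)
```

Each row of `X` corresponds to one measurement, and each column to one term of the polynomial.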

### 1.1(b)

Fit a second-order polynomial model,

\begin{equation}
y = b_0 + b_1 x + b_2 x^2 ,
\end{equation}

to the given data by making use of [numpy.polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) from [NumPy](https://numpy.org/). 

1. Obtain the parameters $b_0$, $b_1$, and $b_2$.
2. Plot your model: Create a scatter plot of the original data points and overlay the fitted quadratic curve to visualise the model's fit.
3. Calculate the [residuals](https://en.wikipedia.org/wiki/Errors_and_residuals) and create a scatter plot of the residuals against the fitted values. 
4. Based on your results, how do you assess your model? Please see [What is Considered a Good vs. Bad Residual Plot?](https://www.statology.org/good-vs-bad-residual-plot/) for a short explanation of what to look for in the residual plot.

Below, you will find some code to get you started:

In [None]:
# Import libraries:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Styling for plots:
sns.set_theme(style="ticks", context="notebook", palette="colorblind")

# For displaying matplotlib plots within the Jupyter Notebook:
%matplotlib inline

In [None]:
# Load the raw data:
data = pd.read_csv("temperature.csv")
# Show the first few rows of the data:
data.head()

To fit a polynomial to your data, you have several options. Here are three popular choices:

1. [polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) from [NumPy](https://numpy.org/). This is the simplest option, suitable for most basic polynomial fitting tasks. It uses a least squares approach to fit a polynomial of a given degree to your data. 
2. [Ordinary least squares (OLS)](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) from [statsmodels](https://www.statsmodels.org): This method provides more detailed results than polyfit, including error estimates for the coefficients. It is a good choice if you need more information about the fit. 
3. [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) from [scikit-learn](https://scikit-learn.org/): This is a more general approach that can be used for a variety of regression tasks, including polynomial fitting. It is particularly useful if you want to combine polynomial fitting with other methods, such as cross-validation.

For options 2 and 3, we have to "construct" the $\mathbf{X}$-matrix (see [1.1(a)](#1.1(a))), while `polyfit` does this automatically. We select the simplest option and use `polyfit` in this exercise.
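
One detail worth keeping in mind is that `np.polyfit` returns the coefficients with the highest power first, which is the reverse of the $b_0, b_1, b_2$ ordering used in [1.1(a)](#1.1(a)). The sketch below illustrates this on hypothetical data (generated from $y = 1 + 2x + 3x^2$) and compares the result with an explicit design matrix built by `np.vander`, which by default also uses decreasing powers:

```python
import numpy as np

# Hypothetical data with known parameters b0 = 1, b1 = 2, b2 = 3:
x = np.linspace(0.0, 5.0, 6)
y = 1.0 + 2.0 * x + 3.0 * x**2

# polyfit returns the coefficients with the highest power first:
param = np.polyfit(x, y, deg=2)
b2, b1, b0 = param
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")

# The same fit with an explicit design matrix. np.vander(x, 3) has the
# columns [x^2, x, 1], so the solution uses the same ordering as polyfit:
X = np.vander(x, 3)
param_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```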

Here is an example for finding a first-order polynomial to get you started:

In [None]:
# Extract the data:
x = data["hour"]
y = data["temperature"]

# Fit a first-order polynomial:
param = np.polyfit(x, y, deg=1)  

# Format the fitted equation as text. Note that polyfit returns
# the coefficients with the highest power first (stored in `param`):
equation = f"y = {param[0]:.3f}x + {param[1]:.3f}"

# To evaluate the polynomial, we use np.polyval.
# This will evaluate the polynomial using the parameters
# we found at each x-value:
y_hat = np.polyval(param, x)

# Compute the residuals:
residual = y - y_hat

# Plot the fitted polynomial and residuals:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True,  # Remove some whitespace in the figure
    ncols=2,  # Create two columns
    figsize=(8, 4)  # Adjust the size of the plot
)
# Plot the raw data:
ax1.scatter(data["hour"], data["temperature"])
ax1.set(xlabel="Time (hour)", ylabel="Temperature (°C)")

# Plot the fitted curve:
ax1.plot(x, y_hat, lw=3, color="black", label=f"Fitted line\n{equation}")

# Show a legend:
ax1.legend()

# Plot the residuals:
ax2.scatter(y_hat, residual)
ax2.set(xlabel="Predicted by model ($ŷ_i$)", ylabel="Residual ($y_i - ŷ_i$)")

# Remove some of the spines (top and right lines around the plot):
sns.despine(fig=fig)

In [None]:
# Your code here for fitting the second-order model, plotting it, and the residuals
x = data["hour"]
y = data["temperature"]
param = np.polyfit(x, y, deg=2)  # Selects a second order polynomial

# param now contains the parameters:
equation = f"y = {param[0]:.3f}x² + {param[1]:.3f}x + {param[2]:.3f}"

# To evaluate the polynomial, we use np.polyval:
y_hat = np.polyval(param, x)

# And we can find the residuals/errors
residual = y - y_hat

ssr = np.sum(residual**2)  # Sum of squared residuals

# Plot the fitted polynomial and residuals
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(8, 4)
)
ax1.scatter(data["hour"], data["temperature"])
ax1.set(xlabel="Time (hour)", ylabel="Temperature (°C)")
ax1.plot(x, y_hat, lw=3, color="black", label=f"Fitted line\n{equation}")
ax1.legend()
ax2.scatter(y_hat, residual)
ax2.set(xlabel="Predicted by model ($ŷ_i$)", ylabel="Residual ($y_i - ŷ_i$)")
ax2.set_title(f"SSR = {ssr:.3f}", loc="left")
sns.despine(fig=fig)

#### Your answer to question 1.1(b): What are the coefficients of the second-order polynomial and how do you assess (based on the two plots you made) your model?

<p style="text-align: center;"><b>Table A:</b> Coefficients for the quadratic polynomial</p>  

|  Coefficient | Value       |
|--------------|-------------|
| $b_0$        | 12.96       |
| $b_1$        | 0.0091      |
| $b_2$        | -0.012      |


The quadratic curve captures the general downward trend but fails to account for the systematic "waves" or oscillations in the temperature data. The residuals are not randomly distributed around zero and they show a trend. In regression analysis, such a trend indicates that the model is "underfitting". This suggests that a higher-order polynomial or a different model type might be more appropriate.

### 1.1(c)

In this problem, you will explore how the choice of polynomial degree affects the model's ability to fit the temperature data.

1. Extend your code from [1.1(b)](#1.1(b)) to fit polynomial models of degrees 1 to 5 to the temperature data.
2. Create a single plot displaying the raw data as a scatter plot and overlay the fitted curves for all five models as lines.
3. Plot the residuals for each model in separate scatter plots.

Which polynomial degree best models temperature as a function of time?

In [None]:
x = data["hour"]
y = data["temperature"]


order = [1, 2, 3, 4, 5]
models = [np.polyfit(x, y, deg=orderi) for orderi in order]

fig1, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(x, y, label="Raw data", color="black")

fig2, axes2 = plt.subplots(
    nrows=1,
    ncols=len(order),
    constrained_layout=True,
    sharex=True,
    sharey=True,
    figsize=(len(order) * 4, 4),
)
fig2.suptitle("Residuals:")


ssr_all = []
for i, (orderi, param) in enumerate(zip(order, models)):  # Reuse the fitted models
    y_hat = np.polyval(param, x)
    residual = y - y_hat
    ssr = np.sum(residual**2)
    ssr_all.append(ssr)

    ax1.plot(x, y_hat, label=f"Order: {orderi}")
    axes2[i].scatter(y_hat, residual)
    if i == 0:
        axes2[i].set(ylabel="Residual ($y_i - ŷ_i$)")
    axes2[i].set(xlabel="Predicted by model ($ŷ_i$)")
    axes2[i].set_title(f"Order: {orderi}, SSR = {ssr:.3f}", loc="left")


# Label the axes and add a legend to the plot of the fitted polynomials:
ax1.set(xlabel="Time (hour)", ylabel="Temperature (°C)")
ax1.legend()

sns.despine(fig=fig1)
sns.despine(fig=fig2)

#### Your answer to question 1.1(c): Which polynomial degree best models temperature as a function of time?

The residual plots for the third, fourth, and fifth-order models show no clear patterns or trends, indicating that these models are capturing the underlying structure of the data well. These polynomial degrees also fit the raw data well (see the first figure). While higher-order models (degree 4 and 5) provide a marginally better fit to the raw data (lower SSR values), the improvement is small. 

As there is no compelling evidence to prefer a higher-degree model, we follow the principle of parsimony (Occam's Razor) and recommend the third-degree polynomial. The third-degree polynomial captures the main features of the data without unnecessary complexity.

### 1.1(d)
Obtain the [sum of squared residuals](https://en.wikipedia.org/wiki/Residual_sum_of_squares) for each polynomial you made in [1.1(c)](#1.1(c)) and plot this as a function of the
polynomial degree. 

Use this plot to determine (from visual inspection) the best polynomial
degree for modelling the temperature as a function of time. Does this agree with your
assessment from [1.1(c)](#1.1(c))?

In [None]:
# Using the data stored from 1.1(c):
fig, ax = plt.subplots(constrained_layout=True)
ax.plot(order, ssr_all, ls="--", marker="o")
ax.set_title("Sum of squared residuals", loc="left")
ax.set_xticks(order)
ax.set_xlabel("Polynomial degree")
ax.set_ylabel("Sum of squared residuals (°C)²")
sns.despine(fig=fig)

#### Your answer to question 1.1(d): What polynomial degree do you recommend?

We observe a significant drop in the sum of squared residuals when going from a second to a third-order polynomial. After that, the decrease in the sum of squared residuals becomes negligible. This suggests that further increases in complexity (for instance, going to a 4th or 5th order polynomial) do not substantially improve the fit. This is consistent with our findings in 1.1(c), where we observed that a third-order polynomial captured the main features of the data without unnecessary complexity.
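
To put a number on "negligible", one option is to compute the relative reduction in the sum of squared residuals for each added degree. The sketch below does this for hypothetical data (a cubic with added noise, not the temperature measurements); the variable names are chosen for illustration only:

```python
import numpy as np

# Hypothetical data: a cubic polynomial with a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 0.05 * (x - 2.0) * (x - 5.0) * (x - 9.0) + rng.normal(0.0, 0.1, x.size)

degrees = [1, 2, 3, 4, 5]
ssr = []
for deg in degrees:
    param = np.polyfit(x, y, deg=deg)
    residual = y - np.polyval(param, x)
    ssr.append(float(np.sum(residual**2)))
ssr = np.array(ssr)

# Relative reduction in SSR when the degree is increased by one:
relative_drop = 1.0 - ssr[1:] / ssr[:-1]
for deg, drop in zip(degrees[1:], relative_drop):
    print(f"Degree {deg - 1} -> {deg}: SSR reduced by {100 * drop:.1f}%")
```

The sum of squared residuals can never increase when a term is added, since the lower-degree polynomial is a special case of the higher-degree one. What matters is therefore the size of the drop, not its sign.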

## Exercise 1.2: Weighted least squares

In this exercise, we will use least-squares regression to investigate a real-world phenomenon: the decay of beer froth over time. The goal is to illustrate how regression can be used to extract meaningful physical quantities and quantify their associated uncertainties.

Arnd Leike was awarded the 2002 [Ig Nobel prize](https://en.wikipedia.org/wiki/Ig_Nobel_Prize) for his [research on the decay of beer froth](https://doi.org/10.1088/0143-0807/23/1/304), and here we will reproduce the data analysis. In particular, we will use the reported raw data and carry out a weighted least squares regression. In addition, we will obtain an error estimate (as a confidence interval) for the determined physical quantity.


The file [erdinger.csv](erdinger.csv)
contains [measured heights](https://doi.org/10.1088/0143-0807/23/1/304) for beer
froth as a function of time, along with the errors in the measured heights. 


**Please use [scikit-learn](https://scikit-learn.org/) and [statsmodels](https://www.statsmodels.org) for the fitting in this exercise!**

### 1.2(a)
Create a linear model (first-order polynomial) for the beer froth height as a function of time using least squares.
Plot your model with the raw data, calculate the [coefficient of determination ($R^2$)](https://en.wikipedia.org/wiki/Coefficient_of_determination), and plot
the residuals. Is this linear model suitable for estimating the froth height as a function of time?

**Note:** You do not need to write the code for this part, but ensure that you understand the implementation and the resulting output.

In [None]:
# Import libraries:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Styling for plots:
sns.set_theme(style="ticks", context="notebook", palette="colorblind")

# For displaying matplotlib plots within the Jupyter Notebook:
%matplotlib inline

# Load the data:
data = pd.read_csv("erdinger.csv")
data.head()

In [None]:
# Extract data as NumPy arrays:
time = data["time"].to_numpy()
height = data["height"].to_numpy()
height_error = data["height-error"].to_numpy()

In [None]:
# Import libraries for using scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# To fit a model with scikit-learn, we do the following:
model1 = LinearRegression(fit_intercept=True)

# We create an X-matrix for the fitting by reshaping 'time':
X = time.reshape(-1, 1)
model1.fit(X, height)

# We can use the model for prediction by:
y_hat_1 = model1.predict(X)

# To calculate R²:
r2_model1 = model1.score(X, height)
# or:
r2_model1 = r2_score(height, y_hat_1)

# To calculate the mean squared error (MSE) for the model:
mse_model1 = mean_squared_error(height, y_hat_1)
# To summarise the model with a short text string:
model1_txt = f"y = {model1.coef_[0]:.3g}x + {model1.intercept_:.3g}"
model1_txt = f"{model1_txt}\n(R² = {r2_model1:.3g}, MSE = {mse_model1:.3g})"
print(model1_txt)

In [None]:
# Plot the results:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(8, 4)
)
# Add the raw data with error bars:
ax1.errorbar(
    time,
    height,
    yerr=height_error,
    label="Raw data",
    fmt="o",  # Show only the markers, no connecting lines
    capsize=4,  # Adjust the width of the error bar caps
)
# Plot the fitted curve:
ax1.plot(
    time,
    y_hat_1,
    lw=3,
    label=model1_txt,
)
ax1.set(xlabel="Time (s)", ylabel="Height (cm)")
ax1.legend()
# Plot the residuals:
ax2.scatter(y_hat_1, height - y_hat_1)
ax2.set(xlabel="Predicted by the model (ŷ)", ylabel="Residuals (y - ŷ)")
sns.despine(fig=fig)

#### Your answer to question 1.2(a): Is this linear model suitable for estimating the froth height as a function of time?

The linear model captures most of the trend in the decreasing height as a function of time and R² is quite high (0.944). However, it fails to account for the curvature in the data, especially at earlier times. The residual plot shows a clear pattern (U-shape), and the presence of this pattern suggests that the linear model does not capture the relationship between time and height accurately. Furthermore, the linear model predicts a negative froth height at around 480 seconds, which is physically impossible!

In [None]:
# Extrapolate to see where we get negative heights:
X2 = np.arange(400, 500, 10).reshape(-1, 1)
model1.predict(X2)

### 1.2(b)
If we assume that the change in froth volume is proportional
to the volume present at any given time, we can show that this leads to
exponential decay of the froth height,

\begin{equation}
\frac{h(t)}{h(0)} = \exp \left(-\frac{t}{\tau} \right),
\end{equation}

where $h(t)$ is the height of the froth as a function of time $t$, and $\tau$ is a parameter (characteristic time constant).
We will assume that $h(0)$ is a known parameter, equal to the initial height of the froth.

Show how you can transform the equation above into a linear equation of the form,

\begin{equation}
y = b x,
\end{equation}

and express $b, x, y$ in terms of $h, h(0), t, \tau$.

#### Your answer to question 1.2(b):
If we take the natural logarithm on both sides of the equation, we get,

\begin{equation}
\ln \left( \frac{h(t)}{h(0)} \right) = -\frac{t}{\tau} = -\frac{1}{\tau} \times t .
\end{equation}

Setting,
\begin{equation}
y = \ln \left( \frac{h(t)}{h(0)} \right), \quad x = t, \quad b=-\frac{1}{\tau},
\end{equation}
 we get,
\begin{equation}
\underbrace{\ln \left( \frac{h(t)}{h(0)} \right)}_{y} = -\frac{t}{\tau} = \underbrace{-\frac{1}{\tau}}_{b} \times \underbrace{t}_{x},
\end{equation}

or $y = bx$.
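
The transformation can be checked numerically. The sketch below generates noise-free heights from hypothetical values of $\tau$ and $h(0)$, applies the transformation, and recovers $\tau$ from the least-squares slope through the origin, $b = \sum_i x_i y_i / \sum_i x_i^2$ (the standard result for a model without an intercept):

```python
import numpy as np

# Hypothetical parameters, for illustration only:
tau_true = 250.0  # s
h0 = 17.0  # cm

# Noise-free heights that follow the exponential decay exactly:
t = np.array([0.0, 30.0, 60.0, 120.0, 240.0, 360.0])
h = h0 * np.exp(-t / tau_true)

# Apply the transformation: y = ln(h / h(0)) and x = t.
y = np.log(h / h0)
x = t

# Least-squares slope for a line through the origin:
b = np.sum(x * y) / np.sum(x * x)
tau_est = -1.0 / b
print(f"Estimated τ = {tau_est:.4g} s")
```

Since the data here are noise-free, the estimate reproduces the value of $\tau$ used to generate them.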

### 1.2(c)
Use the linear transformation you found in [1.2(b)](#1.2(b)) to create a new linear model where you estimate
the value of $\tau$. Plot your new model together with the raw data and calculate $R^2$.

**Hint:** The equation, $y=bx$, above does not include the usual constant term.
This will modify the least-squares equation as shown in [Appendix A](#A.-Least-squares-without-the-intercept).
To do the fitting without the intercept, set `fit_intercept=False` when creating the linear regression model:
```python
model2 = LinearRegression(fit_intercept=False)  # New model, without intercept
```

In [None]:
# First, transform y:
y = np.log(height / height[0])
X = time.reshape(-1, 1)

# Create the new model, without intercept:
model2 = LinearRegression(fit_intercept=False)
model2.fit(X, y)

# Calculate R²:
r2_model2 = model2.score(X, y)

# Convert predicted y back to heights:
y_hat_2 = model2.predict(X)
height_hat_2 = height[0] * np.exp(y_hat_2)

# Calculate the mean squared error, based on the heights
# (this is to compare with model 1)
mse_model2 = mean_squared_error(height, height_hat_2)

tau = -1.0 / model2.coef_[0]
print(f"τ = {tau:.4g} s")

model2_txt = f"h(t) = h(0) exp(-t/{tau:.4g})"
model2_txt = f"{model2_txt}\n(R² = {r2_model2:.3g}, MSE = {mse_model2:.3g})"

In [None]:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(8, 4)
)
ax1.errorbar(
    time,
    height,
    yerr=height_error,
    label="Raw data",
    fmt="o",
    capsize=4,
)
# Plot the new model:
ax1.plot(
    time,
    height_hat_2,
    lw=3,
    label=model2_txt,
)
ax1.set(xlabel="Time (s)", ylabel="Height (cm)")
ax1.legend()

# Show residuals in height for the new model:
ax2.scatter(height_hat_2, height - height_hat_2)
ax2.set(xlabel="ŷ", ylabel="Residual, y - ŷ")
ax2.set_ylim(-1.1, 2.0)

sns.despine(fig=fig)

#### Your answer to question 1.2(c): What value did you get for $\tau$?
From the coefficient found in the least squares fit: $\tau \approx 290$ s. We see that the residuals are now all
smaller in magnitude, but we are overestimating the height for many of the points. The residuals still show a slight curvature, suggesting that the exponential model might not be perfectly capturing the decay of the froth height.

The R² value is now very close to 1, indicating that the exponential model explains a large proportion of the variance in the transformed data.

### 1.2(d)
[Leike](https://doi.org/10.1088/0143-0807/23/1/304) found a
value of $\tau = 276 \pm 14$ s, which is likely lower than the value you obtained in [1.2(c)](#1.2(c)).
We will now attempt to reproduce the results of Leike by using weighted least squares regression. The motivation for this approach is that the measurement errors in the raw data are not constant. Weighted regression accounts for this by assigning more influence to data points with smaller uncertainties.

To assign the weights ($w_i$) we can use $w_i = 1/\sigma_i^2$ where $\sigma_i$ is the
reported error for observation $i$. But we need to consider the fact that we
are now fitting the log-transformed values, $y = \ln (h(t) / h(0))$, and this will modify the errors.
If you are familiar with [propagation of errors](https://en.wikipedia.org/wiki/Propagation_of_uncertainty),
you should be able to show that the error in $y$ ($\sigma_y$) is given by,

\begin{equation}
\sigma_y^2 = \frac{\sigma_h^2}{h^2} ,
\end{equation}

which says that we can get the error in $y$ by dividing the measured error in the height ($\sigma_h$) by the measured height ($h$).
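
For reference, this result can be sketched with first-order propagation of errors, under the assumption that $h(0)$ is known exactly and thus contributes no uncertainty. Writing $y = \ln h - \ln h(0)$, only the first term depends on the measured height:

\begin{equation*}
\sigma_y^2 \approx \left( \frac{\partial y}{\partial h} \right)^2 \sigma_h^2
= \left( \frac{1}{h} \right)^2 \sigma_h^2
= \frac{\sigma_h^2}{h^2} .
\end{equation*}

In other words, the error in $y$ is the relative error in the height, so points with a small height and a large measurement error receive the smallest weights.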

Do the following steps to perform the weighted least squares:

1. Calculate errors for your $y$ values according to $\sigma_y^2 = \sigma_{h}^2 / h^2$.

2. Calculate weights for your $y$ values as $1/\sigma_y^2$. Note: If
  a $\sigma_y$ value is zero, set the corresponding weight to zero.
  
3. Perform a weighted least squares fitting (see the example below) using the calculated weights. Estimate $\tau$, plot your new model and calculate $R^2$.
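
The three steps can also be sketched with NumPy alone. The numbers below are hypothetical (not the values in `erdinger.csv`), and the fit uses the closed-form weighted slope through the origin, $b = \sum_i w_i x_i y_i / \sum_i w_i x_i^2$, rather than scikit-learn:

```python
import numpy as np

# Hypothetical data, for illustration only. The first point has
# zero reported error, as can happen for the reference height h(0).
t = np.array([0.0, 50.0, 100.0, 200.0])  # s
h = np.array([10.0, 8.0, 6.5, 4.5])  # cm
sigma_h = np.array([0.0, 0.2, 0.2, 0.4])  # cm

y = np.log(h / h[0])

# 1. Errors for y:
sigma_y_sq = sigma_h**2 / h**2

# 2. Weights for y. Entries with zero error keep the weight 0:
weights = np.zeros_like(sigma_y_sq)
np.divide(1.0, sigma_y_sq, out=weights, where=sigma_y_sq > 0)

# 3. Weighted least squares for a line through the origin:
b = np.sum(weights * t * y) / np.sum(weights * t**2)
tau = -1.0 / b
print(f"τ = {tau:.4g} s")
```

Using `np.divide` with `where=` avoids the division by zero altogether, which is an alternative to replacing infinite weights afterwards.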

In [None]:
# Example on how to run weighted least squares:

# Set up the model:
model = LinearRegression(fit_intercept=False)

# Create some weights (note: these are for illustration and not correct for 1.2(d))
weights = 1.0 / data["height-error"].to_numpy() ** 2

# Handle potential division by zero by setting infinite weights to zero:
weights[weights == float("inf")] = 0

# Perform the fit using the weights:
model.fit(X, height, sample_weight=weights)

# Calculate R² (ensuring the weights are considered in the calculation):
r2 = model.score(X, height, sample_weight=weights)

In [None]:
model3 = LinearRegression(fit_intercept=False)


# 1. Calculate errors for y:
sigma_y_sq = height_error**2 / height**2

# 2. Calculate weights for y:
weights = 1.0 / sigma_y_sq
weights[weights == float("inf")] = 0  # Set infinite values to zero

# 3. Do weighted least squares:
model3.fit(X, y, sample_weight=weights)

# Calculate R² (considering the weights).
r2_model3 = model3.score(X, y, sample_weight=weights)
y_hat_3 = model3.predict(X)

# Recalculate the heights:
height_hat_3 = height[0] * np.exp(y_hat_3)

# Calculate MSE, based on heights. For this, we need the weights for the heights:
weights_h = 1.0 / height_error**2
weights_h[weights_h == float("inf")] = 0
# Normalize the weights (since scikit-learn is using normalized weights)
weights_h /= sum(weights_h)

mse_model3 = mean_squared_error(height, height_hat_3, sample_weight=weights_h)

In [None]:
tau_ = -1.0 / model3.coef_[0]
print(f"τ = {tau_:.4g} s")

model3_txt = f"h(t) = h(0) exp(-t/{tau_:.4g})"
model3_txt = f"{model3_txt}\n(R² = {r2_model3:.3g}, MSE = {mse_model3:.3g})"
print(model3_txt)

In [None]:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(8, 4)
)
ax1.errorbar(
    time,
    height,
    yerr=height_error,
    label="Raw data",
    fmt="o",
    capsize=4, 
)
ax1.plot(
    time,
    height_hat_3,
    lw=3,
    label=model3_txt,
)
ax1.set(xlabel="Time (s)", ylabel="Height (cm)")
ax1.legend()


ax2.scatter(height_hat_3, (height - height_hat_3) * np.sqrt(weights_h))
ax2.set(xlabel="ŷ", ylabel="Weighted residual, √w × (y - ŷ)")
ax2.set_ylim(-1.1, 2.0)

sns.despine(fig=fig)

#### Your answer to question 1.2(d): What value did you get for $\tau$? How does it compare to Leike's result?
With the weighted approach, we get $\tau = 277$ s, which is very close to the $276$ s reported by Leike.
The weighted residuals are smaller in magnitude, but there might still be a weak trend. The high R² value indicates that the model captures the general trend well.

We will next quantify the uncertainty in our estimate of $\tau$.

### 1.2(e)
We can use the measured errors to estimate the uncertainty in the $\tau$ parameter. Adapt the example code below, using [statsmodels](https://www.statsmodels.org/stable/examples/notebooks/generated/wls.html) to compute a 95% confidence interval for $\tau$.

In [None]:
# Example code for obtaining a 95% confidence interval.

# Use statsmodels:
import statsmodels.api as sm

# Prepare the data for statsmodels:
X = time.reshape(-1, 1)

# Obtain weights:
weights = 1.0 / data["height-error"].to_numpy() ** 2
# Set infinite values to zero:
weights[weights == float("inf")] = 0

# Create the model and fit it:
model_wls = sm.WLS(height, X, weights=weights)
results_wls = model_wls.fit()

# Display the full statistical summary:
# alpha=0.05 calculates a 95% confidence interval (100 * (1 - alpha)%).
print(results_wls.summary(alpha=0.05))

**Note:** A description of the summary from statsmodels can be found in [Appendix B](#B.-The-summary-results-from-statsmodels). We only need this part:

```code
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0452      0.015      2.997      0.010       0.013       0.077
==============================================================================
```

where `coef` gives the fitted coefficient and the numbers below `[0.025      0.975]` are the limits of the confidence interval.

In [None]:
# Alternatively, we can get the confidence interval by:
results_wls.conf_int()

In [None]:
# Your code here:

# We use the same X and we use the logarithmic y:
X = time.reshape(-1, 1)
y = np.log(height / height[0])

# 1. Calculate errors for y (same as before):
sigma_y_sq = height_error**2 / height**2

# 2. Calculate weights for y (same as before):
weights = 1.0 / sigma_y_sq
weights[weights == float("inf")] = 0  # Set infinite values to zero

model_wls = sm.WLS(y, X, weights=weights)
results_wls = model_wls.fit()

In [None]:
# To get the confidence interval we use:
tau_statsmodels = -1.0 / results_wls.params[0]
tau_ci = results_wls.conf_int()[0]
lower_limit = min(-1.0 / tau_ci)
upper_limit = max(-1.0 / tau_ci)
print(f"τ = {tau_statsmodels:.4g} s")
print(f"Confidence interval: {lower_limit:.4g} to {upper_limit:.4g}")

# To get this as a ±
uncertainty1 = upper_limit - tau_statsmodels
uncertainty2 = tau_statsmodels - lower_limit
uncertainty = max(uncertainty1, uncertainty2)
print(f"τ = {round(tau_statsmodels, 0):.4g} ± {round(uncertainty, 0):.4g} s")

#### Your answer to question 1.2(e): What confidence interval did you get for $\tau$?
The confidence interval was 265.7 to 290.1 s. The uncertainty is 13 s, so τ = 277 ± 13 s. 
This compares well with the results of Leike: τ = 276 ± 14 s at 95% confidence.
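
A note on the conversion used above: the confidence interval is computed for the slope $b$, and since $\tau = -1/b$ is a monotonic function of $b$ for $b < 0$, the limits for $\tau$ follow from transforming the two limits for $b$. The resulting interval is generally not symmetric around the estimate, which is why the larger of the two differences is reported as the ± value. A small sketch with hypothetical numbers (not the fitted values):

```python
import math

# Hypothetical slope and confidence limits, for illustration only:
b_hat = -0.005
b_low, b_high = -0.00625, -0.004

# Transform the estimate and the limits to τ = -1/b:
tau_hat = -1.0 / b_hat
tau_low, tau_high = sorted(-1.0 / b for b in (b_low, b_high))

# The interval is asymmetric around the estimate:
minus = tau_hat - tau_low
plus = tau_high - tau_hat
uncertainty = max(minus, plus)
print(f"τ = {tau_hat:.4g} s, interval: {tau_low:.4g} to {tau_high:.4g} s")
print(f"-{minus:.3g} / +{plus:.3g} s, reported as ± {uncertainty:.3g} s")
```

This transformation is only meaningful when the interval for $b$ does not contain zero, since $-1/b$ diverges there.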

## Exercise 1.3: Non-linear least squares

In spectroscopy, it is often necessary to "deconvolve" a spectrum into a number of overlapping spectral peaks. The individual peaks can often be approximated as Gaussian functions of amplitude $\beta_i$, peak wavelength $\lambda_i$, and standard deviation $\sigma_i$. Consequently, the total spectrum $S(\lambda)$ may be written as, 

\begin{equation}
S(\lambda) = \sum_{i=1}^N \beta_i \exp \left( -\frac{(\lambda - \lambda_i)^2}{2\sigma_i^2} \right),
\end{equation}

where $N$ is the number of overlapping peaks.

In this exercise, we will use non-linear least squares to determine the parameters for a spectrum consisting of $N=3$ peaks. Specifically, we will use the parameters provided in Table 1 to generate a synthetic spectrum, add random noise, and then attempt to recover the original parameters through regression.


<p style="text-align: center;"><b>Table 1:</b> Parameters for spectral peaks</p>  

|             | Peak 1   | Peak 2   | Peak 3   |
|-------------|----------|----------|----------|
| $\beta_i$   | 0.2      | 0.4      | 0.3      |
| $\lambda_i$ | 4.0      | 5.5      | 7.2      |
| $\sigma_i$  | 0.5      | 0.8      | 0.9      |

### 1.3(a)

We will first generate a synthetic spectrum using the parameters in Table 1 to use for the non-linear least-squares fitting. To create the noise-free absorption spectrum, execute the following code:

In [None]:
# Import libraries
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.optimize import minimize

# Styling for plots:
sns.set_theme(style="ticks", context="notebook", palette="colorblind")

# For displaying matplotlib plots within the Jupyter Notebook:
%matplotlib inline

In [None]:
# Define parameters from Table 1:
betas = [0.2, 0.4, 0.3]
lambdas = [4.0, 5.5, 7.2]
sigmas = [0.5, 0.8, 0.9]


def gaussian(x, beta_i, lambda_i, sigma_i):
    """Evaluate a Gaussian function at points x."""
    return beta_i * np.exp(-((x - lambda_i) ** 2) / (2.0 * sigma_i ** 2))

# Create the noise-free spectrum, sample it at 100 points between 0 and 10:
x = np.linspace(0, 10, 100)
spectrum_without_noise = np.zeros_like(x)

# Sum the individual Gaussian peaks to create the synthetic spectrum:
for beta_i, lambda_i, sigma_i in zip(betas, lambdas, sigmas):
    spectrum_without_noise += gaussian(x, beta_i, lambda_i, sigma_i)

Next, create a noisy spectrum by adding zero-mean Gaussian noise with a standard deviation $\gamma=0.025$, using the code below:

In [None]:
gamma = 0.025

# Since we will be using random numbers, we set the
# seed to get reproducible numbers:
np.random.seed(4175)

# Generate Gaussian noise and add it to the original spectrum:
noise = np.random.normal(loc=0, scale=gamma, size=spectrum_without_noise.shape)
spectrum_with_noise = spectrum_without_noise + noise

# Show the spectrum with and without noise:
fig, ax = plt.subplots()
ax.scatter(x, spectrum_with_noise, label="With noise")
ax.plot(x, spectrum_without_noise, color="k", label="Without noise")
ax.set(xlabel="λ", ylabel="S(λ)")
ax.set_title(f"γ = {gamma}", loc="left")
ax.legend()
sns.despine(fig=fig)

In [None]:
# Since we will try different gamma values, we generate some
# more signals with noise here:

gammas = [0.025, 0.05, 0.075]

spectra_with_noise = []

for gamma in gammas:
    noise = np.random.normal(loc=0, scale=gamma, size=spectrum_without_noise.shape)
    with_noise = spectrum_without_noise + noise
    spectra_with_noise.append(with_noise)

fig, axes = plt.subplots(constrained_layout=True, ncols=3, figsize=(9, 3), sharex=True, sharey=True)

for ax, signal, gamma in zip(axes, spectra_with_noise, gammas):
    ax.scatter(x, signal, label="With noise")
    ax.plot(x, spectrum_without_noise, color="k", label="Without noise")
    ax.set(xlabel="λ", ylabel="S(λ)")
    ax.set_title(f"γ = {gamma}", loc="left")
axes[0].legend(fontsize="xx-small")
sns.despine(fig=fig)

### 1.3(b)

Use non-linear least squares to estimate the Gaussian parameters for the noisy spectrum you generated in [1.3(a)](#1.3(a)). Investigate how the solution depends on your initial guess and the noise level:

- Initial guess: Compare "good" initial guesses (e.g., those from Table 1) with "bad" values (e.g. placing the peak centres far from their actual positions).
- Noise level: Try running the fitting for a signal containing a higher level of noise, for instance, by setting $\gamma=0.05$ in [1.3(a)](#1.3(a)).

**Hint:** The code below outlines how to perform the fit using the noise-free spectrum.

**Note:** We use [minimize](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) from [SciPy](https://scipy.org/) to run non-linear least squares. You can also use [curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) from [SciPy](https://scipy.org/) for non-linear least squares: this is simpler and requires less setup, but you have less control over the optimisation process.
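
As a rough illustration of the `curve_fit` alternative mentioned in the note, the sketch below fits a sum of three Gaussians to a synthetic spectrum. The names `three_gaussians`, `x_demo`, `y_demo`, `p0`, `lower` and `upper` are made up for this example, and the noise seed is arbitrary. It is a minimal sketch under these assumptions, not the approach used in the rest of this solution.

```python
import numpy as np
from scipy.optimize import curve_fit


def three_gaussians(x, *params):
    """Sum of Gaussians; params ordered as [beta_1, lambda_1, sigma_1, ...]."""
    signal = np.zeros_like(x, dtype=float)
    for beta_i, lambda_i, sigma_i in zip(params[0::3], params[1::3], params[2::3]):
        signal += beta_i * np.exp(-((x - lambda_i) ** 2) / (2.0 * sigma_i**2))
    return signal


# Synthetic spectrum built from the Table 1 parameters plus some noise
# (the seed and noise level are arbitrary choices for this example):
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 100)
true_params = [0.2, 4.0, 0.5, 0.4, 5.5, 0.8, 0.3, 7.2, 0.9]
y_demo = three_gaussians(x_demo, *true_params) + rng.normal(scale=0.025, size=x_demo.size)

# Initial guess and bounds, given as (all lower limits, all upper limits):
p0 = [0.3, 4.0, 1.0, 0.3, 6.0, 1.0, 0.3, 8.0, 1.0]
lower = [0.01, 0.0, 0.01] * 3
upper = [2.0, 10.0, 2.0] * 3

popt, pcov = curve_fit(
    three_gaussians, x_demo, y_demo, p0=p0, bounds=(lower, upper), max_nfev=5000
)

# Approximate standard errors from the diagonal of the covariance matrix:
perr = np.sqrt(np.diag(pcov))
print("Fitted parameters:", np.round(popt, 3))
print("Approximate standard errors:", np.round(perr, 3))
```

Because the model function takes a variable number of parameters, `p0` must be given so that `curve_fit` knows how many parameters to fit.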

In [None]:
# First we define the model we will fit:
def model(x, params):
    """Calculate the total spectral signal using the given parameters.

    Args:
        x: The independent variable (e.g., wavelength).
        params: A flat list of parameters for N peaks.
            The length must be a multiple of 3, ordered as:
            [beta_1, lambda_1, sigma_1, beta_2, lambda_2, sigma_2, ...]

    Returns:
        The value of the function at each given x.
    """
    betas = params[0::3]
    lambdas = params[1::3]
    sigmas = params[2::3]
    
    signal = np.zeros_like(x)

    for beta_i, lambda_i, sigma_i in zip(betas, lambdas, sigmas):
        signal += gaussian(x, beta_i, lambda_i, sigma_i)
    return signal

In [None]:
# Next, we define the objective function we will minimise:
def objective(params, x, y):
    """Calculate sum of squared errors between the model and the data."""
    y_model = model(x, params)
    residuals = y - y_model
    # Return sum of squared errors:
    return np.sum(residuals**2)

In [None]:
# Set up the optimisation:

# Initial guess for the parameters:
initial_guess = [
    0.3, 4.0, 1.0,  # Peak 1
    0.3, 6.0, 1.0,  # Peak 2
    0.3, 8.0, 1.0,  # Peak 3
]

# Set up boundaries for the parameters, these
# are in the form (min, max) for each parameter
bounds = [
    (0.01, 2.0), (0.0, 10.0), (0.01, 2.0),  # Peak 1
    (0.01, 2.0), (0.0, 10.0), (0.01, 2.0),  # Peak 2
    (0.01, 2.0), (0.0, 10.0), (0.01, 2.0),  # Peak 3
]

# Run the minimisation:
result = minimize(
    objective,
    initial_guess,
    args=(x, spectrum_without_noise),
    bounds=bounds,
    options={
        "maxiter": 5000,
    },
)
# Check if the optimisation converged successfully:
print(result.message)

# Extract the fitted parameters:
fit_params = result.x
print(f"Fitted parameters: {fit_params}")

In [None]:
# Recalculate using the fitted parameters:
y_model = model(x, fit_params) 

# Plot the fitted signal
fig, ax = plt.subplots(constrained_layout=True)
ax.plot(x, spectrum_without_noise, label="True signal", lw=5, alpha=0.4)
ax.plot(x, y_model, label="Fitted model", color="black", lw=2, ls=":")
ax.set(xlabel=r"$\lambda$", ylabel=r"$S(\lambda)$")
ax.legend()
sns.despine(fig=fig)

In [None]:
# We first do the fitting for the three cases with noise,
# and we use the guess from before:
results_with_good_guess = []

for spectrum in spectra_with_noise:
    result = minimize(
        objective,
        initial_guess,
        args=(x, spectrum),
        bounds=bounds,
        options={"maxiter": 5000},
    )
    print(result.message)
    print(f"Fitted parameters: {result.x}")
    results_with_good_guess.append(result)

In [None]:
from IPython.display import display_html
import pandas as pd

# Make a function to create a table for the fitted parameters:
def make_table(result):
    table_data = {
        "Peak 1": [round(i, ndigits=3) for i in result.x[:3]],
        "Peak 2": [round(i, ndigits=3) for i in result.x[3:6]],
        "Peak 3": [round(i, ndigits=3) for i in result.x[6:]],
    }
    tablei = pd.DataFrame(
        table_data, index=[r"$\beta_i$", r"$\lambda_i$", r"$\sigma_i$"]
    )
    return tablei

# Print out the parameters:
for result in results_with_good_guess:
    tablei = make_table(result)
    display_html(tablei)

In [None]:
# Let us show the results graphically to make them easier to compare:
fig, axes = plt.subplots(
    constrained_layout=True, ncols=3, sharex=True, sharey=True, figsize=(9,3)
)

for result, gamma, spectrum, ax in zip(results_with_good_guess, gammas, spectra_with_noise, axes):
    fit_params = result.x
    # Recalculate using the fitted parameters:
    y_fit = model(x, fit_params)
    ax.scatter(x, spectrum, label="Spectrum with noise", alpha=0.3)
    ax.plot(x, y_fit, color="k", label="Fitted model")
    ax.set_title(f"γ = {gamma}", loc="left")
    ax.set(xlabel="λ", ylabel="S(λ)")
axes[0].legend(fontsize="xx-small")
sns.despine(fig=fig)

We see that there are some differences in the fitted models. To quantify the accuracy, we compare them to the noise-free spectrum. We plot the fitted absorption values against the true values. Ideally, the points would fall on the line $y=x$, indicating a perfect fit. Deviations from this line reveal discrepancies between the model and the true spectrum.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score


fig, axes = plt.subplots(
    constrained_layout=True, ncols=3, nrows=3, figsize=(12,8)
)

for i, (result, gamma, spectrum) in enumerate(zip(results_with_good_guess, gammas, spectra_with_noise)):
    
    fit_params = result.x
    # Recalculate using the fitted parameters:
    y_fit = model(x, fit_params)
    r2 = r2_score(spectrum_without_noise, y_fit)
    mse = mean_squared_error(spectrum_without_noise, y_fit)
    residual = spectrum_without_noise - y_fit
    
    axes[i, 0].plot(x, spectrum_without_noise, label=f"Without noise", color="k")
    axes[i, 0].scatter(
        x,
        spectrum,
        marker="o",
        edgecolor="k",
        facecolor="none",
        alpha=0.3,
        label="With noise"
    )
    axes[i, 0].plot(x, y_fit, label=f"Fitted spectrum", color="red")
    axes[i, 0].set_title(f"γ = {gamma}, R² = {r2:.3f}, MSE = {mse:.2g}", loc="left")
    
    axes[i, 1].scatter(
        spectrum_without_noise,
        y_fit,
        marker="o",
        edgecolor="red",
        facecolor='none',
    )
    axes[i, 2].axhline(y=0.0, ls="--", color="k")
    axes[i, 2].scatter(
        y_fit,
        residual,
        marker="o",
        edgecolor="red",
        facecolor='none',
    )
    
    axes[i, 0].legend(fontsize="xx-small")
    axes[i, 0].set(xlabel="λ", ylabel="S(λ)")
    axes[i, 1].set(xlabel="Spectrum without noise (y)", ylabel="Fitted model (ŷ)")
    axes[i, 2].set(xlabel="Fitted model (ŷ)",ylabel="Residual, y - ŷ")
    
    axes[i, 0].sharex(axes[0, 0])
    axes[i, 0].sharey(axes[0, 0])
    axes[i, 1].sharex(axes[0, 1])
    axes[i, 1].sharey(axes[0, 1])
    axes[i, 2].sharex(axes[0, 2])
    axes[i, 2].sharey(axes[0, 2])
    
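def add_xy_line(ax):
    """Add the y=x line to help reading plots."""
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Use a common range for both coordinates so the line really is y = x:
    lower = min(xlim[0], ylim[0])
    upper = max(xlim[1], ylim[1])
    ax.plot([lower, upper], [lower, upper], color="black", ls="--")
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)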
    
for ax in axes[:, 1]:
    add_xy_line(ax)


axes[0, 0].autoscale()
axes[0, 1].autoscale()
axes[0, 2].autoscale()
sns.despine(fig=fig)

In [None]:
# Next, we try poorer initial guesses. For simplicity, we use the lowest noise level:
initial_guesses = [
    # First the "good" guess from before:
    [0.3, 4.0, 1.0, 0.3, 6.0, 1.0, 0.3, 8.0, 1.0],
    # Then, moving all locations to the left:
    [0.3, 4.0, 1.0, 0.3, 3.0, 1.0, 0.3, 2.0, 1.0],
    # Increase sigmas for all:
    [0.3, 4.0, 1.5, 0.3, 6.0, 1.5, 0.3, 8.0, 1.5],
    # Set everything to 1.0:
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    # Increase beta for all:
    [0.8, 4.0, 1.0, 0.8, 6.0, 1.0, 0.8, 8.0, 1.0],
    # With locations in the "centre" of the range:
    [0.1, 5.0, 1.0, 0.1, 5.0, 1.0, 0.1, 5.0, 1.0],
]
    
results_with_different_guess = []

fig, axes = plt.subplots(
    constrained_layout=True, ncols=3, nrows=2, sharex=True, sharey=True
)
axes = axes.flatten()

for i, guess in enumerate(initial_guesses):
    result = minimize(
        objective,
        guess,
        args=(x, spectrum_with_noise),
        bounds=bounds,
        options={"maxiter": 5000},
    )
    print(result.message)
    print(f"Fitted parameters: {result.x}")
    results_with_different_guess.append(result)
    
    fit_params = result.x
    tablei = make_table(result)
    display_html(tablei)
    y_fit = model(x, fit_params)
    mse = mean_squared_error(spectrum_with_noise, y_fit)
    r2 = r2_score(spectrum_with_noise, y_fit)
    
    axes[i].plot(x, spectrum_without_noise, label=f"Without noise", color="k")
    axes[i].scatter(x, spectrum_with_noise, marker="o", edgecolor="k",facecolor="none", alpha=0.3, label="With noise")
    axes[i].plot(x, y_fit, label=f"Fitted spectrum", color="red")
    axes[i].set_title(f"R² = {r2:.3f}, MSE = {mse:.2g}", loc="left", fontsize="xx-small")

axes[0].legend(fontsize="xx-small")
sns.despine(fig=fig)

#### Your answer to question 1.3(b): What values did you find for the parameters? Do you find your parameters to depend on the initial guess and the noise level?

From the results above, we see that the non-linear least squares method finds a good overall fit to the noisy spectrum, as indicated by the high R² values and low mean squared errors in all cases.
However, increasing the noise level leads to overfitting (the model starts to fit the noise) and to deviations from the true spectral shape (compare, for instance, the lowest and highest noise levels). We also note patterns in the residuals in all cases, which means that the fitted models contain some systematic error.

Inspecting the plots above, we also see that the fitted parameters can vary significantly depending on the initial guess. The initial guess for the peak position seems to be important for the quality of the fitted curve.

The parameter values are given in the tables above. They vary with the noise level and the initial guess, but the method yields reasonable estimates in most cases. This suggests that, while a careful choice of initial guess and noise reduction are important, the method is fairly robust to variations in these factors.
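
One common way to reduce the dependence on a single initial guess is to start the optimisation from several guesses and keep the fit with the lowest sum of squared errors. The sketch below illustrates this idea on a synthetic spectrum. The names `gaussian_sum`, `sse`, `x_demo`, `y_demo` and `bounds_demo`, the number of starts, and the seeds are all assumptions made for this example. This is not part of the exercise, only an illustration.

```python
import numpy as np
from scipy.optimize import minimize


def gaussian_sum(x, params):
    """Sum of Gaussians; params ordered as [beta_1, lambda_1, sigma_1, ...]."""
    signal = np.zeros_like(x, dtype=float)
    for beta_i, lambda_i, sigma_i in zip(params[0::3], params[1::3], params[2::3]):
        signal += beta_i * np.exp(-((x - lambda_i) ** 2) / (2.0 * sigma_i**2))
    return signal


def sse(params, x, y):
    """Sum of squared errors between the model and the data."""
    return np.sum((y - gaussian_sum(x, params)) ** 2)


# Synthetic noisy spectrum (arbitrary seed, parameters from Table 1):
rng = np.random.default_rng(1)
x_demo = np.linspace(0, 10, 100)
true_params = [0.2, 4.0, 0.5, 0.4, 5.5, 0.8, 0.3, 7.2, 0.9]
y_demo = gaussian_sum(x_demo, true_params) + rng.normal(scale=0.025, size=x_demo.size)

bounds_demo = [(0.01, 2.0), (0.0, 10.0), (0.01, 2.0)] * 3

# Draw random initial guesses inside the bounds and keep the best fit:
n_starts = 10
all_sse = []
best = None
for _ in range(n_starts):
    guess = [rng.uniform(low, high) for low, high in bounds_demo]
    res = minimize(sse, guess, args=(x_demo, y_demo), bounds=bounds_demo)
    all_sse.append(res.fun)
    if best is None or res.fun < best.fun:
        best = res

print("SSE for each start:", np.round(all_sse, 4))
print("Best parameters:", np.round(best.x, 3))
```

The spread in the printed sums of squared errors gives a quick impression of how many different local minima the optimiser ends up in.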

### 1.3(c)

Modify your code to use separable least squares (SLS).

In [None]:
# We modify our set-up so that we can find the beta parameters
# via least squares, and the other parameters via non-linear least squares:


def model_sls(x, betas, lambdas, sigmas):
    """Calculate y using the given parameters."""
    signal = np.zeros_like(x)
    for beta_i, lambda_i, sigma_i in zip(betas, lambdas, sigmas):
        signal += gaussian(x, beta_i, lambda_i, sigma_i)
    return signal


def calculate_beta(x, y, lambdas, sigmas):
    X = []
    for lambda_i, sigma_i in zip(lambdas, sigmas):
        X.append(gaussian(x, 1, lambda_i, sigma_i))
    X = np.array(X).T
    return np.linalg.lstsq(X, y, rcond=None)[0]


def sls_objective(nonlinear_params, x, y):
    lambdas = nonlinear_params[0:3]
    sigmas = nonlinear_params[3:]

    betas = calculate_beta(x, y, lambdas, sigmas)
    y_fit = model_sls(x, betas, lambdas, sigmas)
    return np.sum((y - y_fit) ** 2)

In [None]:
initial_guess_sls = [
    4.0, 6.0, 8.0,  # Lambda
    1.0, 1.0, 1.0,  # Sigma
]

# Set up boundaries for the parameters, these
# are in the form (min, max) for each parameter
bounds_sls = [
    (0.0, 10.0), (0.0, 10.0), (0.0, 10.0),  # Lambda
    (0.01, 2.0), (0.01, 2.0), (0.01, 2.0),  # Sigma
]


spectrum_to_use = spectra_with_noise[1]
gamma_selected = gammas[1]

result = minimize(
    sls_objective,
    initial_guess_sls,
    args=(x, spectrum_to_use),
    bounds=bounds_sls,
    options={
        "maxiter": 5000,
    },
)

lambdas = result.x[0:3]
sigmas = result.x[3:]
betas = calculate_beta(x, spectrum_to_use, lambdas, sigmas)


table_data = {
    "Peak 1": [round(i, ndigits=3) for i in [betas[0], lambdas[0], sigmas[0]]],
    "Peak 2": [round(i, ndigits=3) for i in [betas[1], lambdas[1], sigmas[1]]],
    "Peak 3": [round(i, ndigits=3) for i in [betas[2], lambdas[2], sigmas[2]]],
}
table = pd.DataFrame(
    table_data, index=[r"$\beta_i$", r"$\lambda_i$", r"$\sigma_i$"]
)
table

In [None]:
y_fit = model_sls(x, betas, lambdas, sigmas)

fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(x, spectrum_to_use, label="Spectrum with noise", alpha=0.3)
ax.plot(
    x, spectrum_without_noise, label="Spectrum without noise", alpha=0.7, lw=3
)
ax.plot(x, y_fit, label="Fitted model", color="black", alpha=0.5, lw=3)
ax.set(xlabel="λ", ylabel="S(λ)")
ax.set_title(f"γ = {gamma_selected}", loc="left")
ax.legend()
sns.despine(fig=fig)

#### Your answer to question 1.3(c): What values did you find for the parameters? Are they different from the ones found in [1.3(b)](#1.3(b))?

The SLS method results in parameters that are similar to the ones found by non-linear least squares. The parameters depend on both the initial guess and the noise amplitude (it is easier to fit to lower noise levels). Varying the initial guess indicates that SLS is not particularly more robust than non-linear least squares. Nevertheless, SLS successfully recovers reasonable parameter estimates in many scenarios, even with high noise or less accurate initial guesses. Again, we find that careful selection of initial guesses and noise reduction are important factors to obtain accurate parameters.
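
The linear step in SLS can be checked in isolation: when the peak positions and widths are fixed at their true values, an ordinary least squares solve should recover the amplitudes of a noise-free spectrum. The sketch below illustrates this with the Table 1 parameters. The variable names (`x_demo`, `X_demo`, `y_demo`, `betas_hat`) are chosen for this example only.

```python
import numpy as np

x_demo = np.linspace(0, 10, 100)
true_betas = np.array([0.2, 0.4, 0.3])
true_lambdas = np.array([4.0, 5.5, 7.2])
true_sigmas = np.array([0.5, 0.8, 0.9])

# Design matrix: each column is a unit-height Gaussian evaluated at x_demo.
X_demo = np.exp(-((x_demo[:, None] - true_lambdas) ** 2) / (2.0 * true_sigmas**2))

# Noise-free spectrum as a linear combination of the columns:
y_demo = X_demo @ true_betas

# With lambdas and sigmas fixed, the amplitudes follow from linear least squares:
betas_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(betas_hat)
```

This is why SLS only needs to search over the non-linear parameters: for every trial set of positions and widths, the best amplitudes are available in closed form.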

# Appendix

## A. Least squares without the intercept
We are going to determine the parameter $b$ for the linear model,

\begin{equation}
y =  b x,
\end{equation}

and we do this by minimising the sum of squared errors ($S$). With $n$
measurements of $y$ and $x$,

\begin{equation}
S = \sum_{i=1}^n (y_i - b x_i)^2 = \sum_{i=1}^n r_i^2
\end{equation}

To minimise $S$ we calculate the derivatives with respect to $b$:

\begin{equation*}
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^n r_i x_i, \quad
\frac{\partial^2 S}{\partial b^2} = 2\sum_{i=1}^n x_i^2 \geq 0.
\end{equation*}

Note that the second derivative is positive, except for the
trivial case when all $x_i = 0$, so the stationary point we
find is indeed a minimum.
Requiring that $\frac{\partial S}{\partial b} = 0$ gives,

\begin{equation}
-2 \sum_{i=1}^n r_i x_i = 0 \implies \sum_{i=1}^n (y_i x_i - b x_i^2) = 0 \implies 
b = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} .
\end{equation}
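
As a quick numerical check of this formula, the sketch below compares it with a general least squares solver on made-up data. The data, the seed, and the variable names are assumptions for this example.

```python
import numpy as np

# Made-up data roughly following y = 2.5 x:
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 20)
y_demo = 2.5 * x_demo + rng.normal(scale=0.5, size=x_demo.size)

# Closed-form estimate for the model without an intercept:
b_formula = np.sum(y_demo * x_demo) / np.sum(x_demo**2)

# Same model solved with a general least squares routine
# (the design matrix has a single column and no constant term):
b_lstsq = np.linalg.lstsq(x_demo[:, None], y_demo, rcond=None)[0][0]

print(b_formula, b_lstsq)
```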

We can also repeat this derivation for weighted least squares. The sum of squared errors
is then,

\begin{equation}
S = \sum_{i=1}^n w_i (y_i - b x_i)^2,
\end{equation}

where $w_i$ are the weights and, after minimisation,

\begin{equation}
b = \frac{\sum_{i=1}^n w_i y_i x_i}{\sum_{i=1}^n w_i x_i^2} .
\end{equation}
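
The weighted formula can be checked in the same way. Multiplying both $x_i$ and $y_i$ by $\sqrt{w_i}$ turns the weighted problem into an ordinary one, so a standard solver should give the same $b$. The sketch below uses made-up data with weights $w_i = 1/\sigma_i^2$, which is a common choice but an assumption here.

```python
import numpy as np

# Made-up data where the noise grows with x:
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 20)
sigma_demo = 0.1 * x_demo
y_demo = 2.5 * x_demo + rng.normal(scale=sigma_demo)

# Weights chosen as the inverse variance (an assumption for this example):
w = 1.0 / sigma_demo**2

# Closed-form weighted estimate:
b_formula = np.sum(w * y_demo * x_demo) / np.sum(w * x_demo**2)

# Equivalent ordinary least squares problem after scaling by sqrt(w):
sqrt_w = np.sqrt(w)
b_lstsq = np.linalg.lstsq((sqrt_w * x_demo)[:, None], sqrt_w * y_demo, rcond=None)[0][0]

print(b_formula, b_lstsq)
```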

You can find more information on the weighted least squares method (with error analysis)
in Bevington and Robinson <a name="cite_ref-1"></a>[[1]](#bevington).
Taylor <a name="cite_ref-2"></a>[[2]](#taylor) states error formulas for
the parameters that might be useful for cases when
the error in $y$ is known and constant (e.g., as in "normal" least squares).


<a name="bevington"></a>[[1]](#cite_ref-1) Philip R. Bevington and D. Keith Robinson. Data reduction and error analysis for the physical sciences. 3rd ed. New York, NY: McGraw-Hill, 2003.

<a name="taylor"></a>[[2]](#cite_ref-2) John R. Taylor. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. 2nd ed. Sausalito, CA: University Science Books, 1997.

## B. The summary results from `statsmodels`

The summary method in `statsmodels` prints out a lot of information.
Here is an example where we have fitted a model $y=a + bx$ to 10 $(x, y)$ points with `statsmodels`
and the resulting summary output is printed below. In the table below, the parameter `a` corresponds to `const` and `b` corresponds to `x1`. The output is described further in the
sections below. 

```text
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.956
Model:                            OLS   Adj. R-squared:                  0.951
Method:                 Least Squares   F-statistic:                     175.6
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           1.00e-06
Time:                        08:42:06   Log-Likelihood:                -16.957
No. Observations:                  10   AIC:                             37.91
Df Residuals:                       8   BIC:                             38.52
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.4248      0.931      4.754      0.001       2.278       6.571
x1             1.9235      0.145     13.253      0.000       1.589       2.258
==============================================================================
Omnibus:                        3.674   Durbin-Watson:                   2.067
Prob(Omnibus):                  0.159   Jarque-Bera (JB):                0.755
Skew:                           0.464   Prob(JB):                        0.686
Kurtosis:                       3.975   Cond. No.                         13.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

### B.1. Information about the model


- **Dep. Variable:** The dependent variable (the variable we are predicting, $y$) in the model.
- **Model:** The type of model we have created (OLS = Ordinary Least Squares).
- **Method:** We have used Least squares to find the parameters.
- **Date & Time:** The date and time for when we created the model.
- **No. Observations:** The number of observations in the data set (we had 10 ($x$,$y$) values here).

### B.2. Information about the calculation
- **Df Residuals:** Degrees of freedom for the residuals (sum of squares). 
  This is equal to $n - k - 1$ where $n$ is the number of observations and $k$ is
  the number of predictors (variables excluding the constant).
  In our case: $n - k - 1 = 10 - 1 - 1 = 8$. If we did the
  fitting without the constant term (for instance, by centring the data first), this
  number would be $n-k = 10-1=9$.
- **Df Model:** Degrees of freedom for the model (number of variables in the model).
- **Covariance Type:** Calculations of standard errors assume homoscedastic errors.
  If this is not the case, then the standard error is not computed correctly. There
  are alternative ways of calculating the standard error; this field tells you
  whether statsmodels used one of these more robust methods (here it did not: `nonrobust`).
  
### B.3. Information about the overall quality
- **R-squared:** Coefficient of determination ($R^2$) for the model.
- **Adj. R-squared:** The adjusted $R^2$ for the model. Useful for comparing
  models as this one will only increase (when adding more variables) if the
  increase in $R^2$ is more than one would expect by chance.
- **F-statistic:** This is the result of an F-test where the null hypothesis is that all
  regression coefficients (excluding the constant) are equal to zero. Effectively, this compares the model we
  have just made to an alternative model equal to the constant intercept term. 
  To use this value, we would have to decide on an $\alpha$ level and look up a critical F-value.
  This is some extra work for us, so we typically focus on the **Prob (F-statistic)** instead.
- **Prob (F-statistic):** This is the probability of getting an **F-statistic** at
  least as extreme as the one above if all regression coefficients are zero. 
  It is also known as the $p$-value.
  If we have selected an $\alpha$ value, we will reject the null hypothesis if 
  the $p$-value is smaller than $\alpha$. Here, we have a very small $p$-value, and we reject the
  null hypothesis: We conclude that at least one regression parameter is
  significant for predicting $y$.
- **Log-Likelihood:** In least squares, we are minimising the squared error.
  This is equivalent (if the errors are normally distributed)
  to maximising the likelihood. The value printed here is the
  logarithm of the likelihood for the model.
- **AIC and BIC:** The
  [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion) and
  [Bayesian information criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion).
  These can be directly calculated from the Log-Likelihood and are useful for comparing alternative
  models. Generally, we prefer models with lower AIC and BIC.
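
Several of the quantities above can be reproduced from other numbers in the summary. The sketch below does this for the adjusted $R^2$, the $p$-value of the F-test, and AIC and BIC, using the standard textbook definitions with the constant counted as a parameter. The inputs are copied from the printed summary and are therefore rounded, so the results should only agree with the summary to about the displayed precision.

```python
import numpy as np
from scipy import stats

# Values copied (rounded) from the summary above:
n = 10                    # No. Observations
k = 1                     # Df Model (number of predictors)
r2 = 0.956                # R-squared
f_stat = 175.6            # F-statistic
log_likelihood = -16.957  # Log-Likelihood

# Adjusted R-squared penalises the number of predictors:
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# p-value of the F-test: upper tail of the F distribution with
# (k, n - k - 1) degrees of freedom:
p_value_f = stats.f.sf(f_stat, k, n - k - 1)

# AIC and BIC from the log-likelihood; the constant counts as a parameter:
n_params = k + 1
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * np.log(n) - 2 * log_likelihood

print(f"Adj. R-squared: {adj_r2:.3f}")
print(f"Prob (F-statistic): {p_value_f:.2e}")
print(f"AIC: {aic:.2f}, BIC: {bic:.2f}")
```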
  
### B.4. Information about the coefficients

- **coef:** The determined coefficients for the model.

- **std err:** The standard error of the coefficients. These are the square
  roots of the estimated variances of the coefficients,
  
  \begin{equation*}
  \text{Var}(\mathbf{b}) = s^2 \cdot \text{diag} \left( \left(\mathbf{X}^\top \mathbf{X} \right)^{-1} \right),
  \end{equation*}
  
  where,
  
  \begin{equation*}
  s^2 = \frac{SSE}{n - k - 1},
  \end{equation*}
  
  and $SSE$ is the sum of squared error/residuals, $n$ the number of data points (10 in this case)
  and $k$ the number of variables (1 in this case).

- **t, P>|t|, and [0.025 0.975]:** Some statistics for the
  coefficients. **t** is the $t$ statistic, which is obtained by dividing
  the coefficient by the standard error.
  This is the statistic in a test where the null hypothesis is that the coefficient is zero.
  To use the $t$ statistic we would have to consult a table with critical $t$-values for $n-k-1$
  degrees of freedom. The **P>|t|** is the $p$-value for such a $t$-test.
  Here, the $t$ statistic
  is high (and the p-value is low) and we would reject this null hypothesis for both the
  constant and x1. In other words, these coefficients are indeed different from
  zero.
  Finally, the **[0.025 0.975]**
  represents a $100(1-\alpha)\%$ confidence interval for the coefficients. We did not specify 
  $\alpha$ here, but we can give it as a parameter. The default is $\alpha=0.05$.
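
The coefficient statistics described above can be computed by hand with NumPy and SciPy. The sketch below does this for made-up data (not the data behind the summary above) and cross-checks the standard errors against `scipy.stats.linregress`. The data, the seed, and the variable names are assumptions for this example.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration:
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 10)
y_demo = 4.0 + 2.0 * x_demo + rng.normal(scale=1.0, size=x_demo.size)

# Design matrix with a constant column, so p = k + 1 parameters:
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
n, p = X_demo.shape

# Least squares coefficients and residual variance s^2 = SSE / (n - k - 1):
b, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
residuals = y_demo - X_demo @ b
s2 = residuals @ residuals / (n - p)

# Standard errors: square roots of s^2 times the diagonal of (X^T X)^{-1}:
std_err = np.sqrt(s2 * np.diag(np.linalg.inv(X_demo.T @ X_demo)))

# t statistics and two-sided p-values with n - k - 1 degrees of freedom:
t_values = b / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), df=n - p)

# 100(1 - alpha)% confidence intervals:
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
conf_int = np.column_stack([b - t_crit * std_err, b + t_crit * std_err])

# Cross-check with SciPy's simple linear regression:
check = stats.linregress(x_demo, y_demo)
print("std err (const, x1):", std_err)
print("linregress:", check.intercept_stderr, check.stderr)
print("Confidence intervals:\n", conf_int)
```

With 8 degrees of freedom and $\alpha=0.05$, the critical $t$-value is about 2.306, which is the factor that relates the `std err` column to the confidence interval in the summary above.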