# Exercise 3.5

The temperature (°C) is measured over time at a high altitude in the atmosphere using a weather balloon.
Every hour, a measurement is made and sent to an on-board computer.
The measurements can be found in [the data file](Data/temperature.txt) (located at 'Data/temperature.txt').

**(a)**  Create a Python script that performs polynomial
fitting to the data using a first, second, third, fourth,
and fifth order polynomial model. Hint: Make use of `numpy`, `matplotlib`
and `pandas`.

Here, we will also make use of scikit-learn and define the fitting using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). A pipeline is a series of transformations that we apply to some data, before making a model. The pipeline below consists of:

1. Transforming the input $x$ into polynomial features up to some degree. For example, if the degree is 2, this transform generates $1$, $x$, and $x^2$ from the input x-values. This is a convenient way of generating a data matrix to use with least squares.

2. Creating a least squares model using the output data matrix from step 1.
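
The two steps can be sketched with plain `numpy`, which may help to see what the pipeline does internally. This is only an illustration under stated assumptions: the values in `x_demo` and `y_demo` are made up (not the balloon data), and `np.vander` is used here as a stand-in for the data matrix that `PolynomialFeatures` builds for a single input column.

```python
import numpy as np

# Hypothetical demo data generated from y = 1 + 2x + 3x^2 (no noise)
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 1 + 2 * x_demo + 3 * x_demo**2

# Step 1: build the data matrix with columns 1, x, x^2
X_demo = np.vander(x_demo, N=3, increasing=True)

# Step 2: solve the least squares problem X b = y
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(X_demo)
print(coef)  # close to [1, 2, 3], the coefficients used above
```

Since the first column of the data matrix is already the constant term, no separate intercept needs to be fitted, which is why `fit_intercept=False` is used in the pipeline below.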

In [None]:
# Here, we are going to make use of sklearn to generate some polynomials:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

sns.set_theme(style="ticks", context="notebook", palette="muted")
%matplotlib notebook

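# Load the data (whitespace-separated columns) from the location given above.
# `delim_whitespace=True` is deprecated in recent pandas; `sep=r"\s+"` is equivalent.
data = pd.read_csv("Data/temperature.txt", sep=r"\s+")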

# Select the x and y values
xdata = data["hour"].values
ydata = data["yobs"].values

x = xdata.reshape(-1, 1)

models = []

degrees = [1, 2, 3, 4, 5, 11]

for degree in degrees:
    # We define the steps in our pipeline. Here, we give them some descriptive names
    # in case we have to refer to them later:
    steps = [
        # ('scale', StandardScaler()),
        ("polynomial", PolynomialFeatures(degree=degree)),
        ("leastsquares", LinearRegression(fit_intercept=False)),
    ]
    # Define a pipeline using the steps above:
    pipeline = Pipeline(steps=steps)
    # Use the pipeline to fit a polynomial of the specified degree:
    pipeline.fit(x, ydata)
    models.append(pipeline)  # Store the pipeline/model for later

**(b)**  Plot the fitted curves for the five models together with the raw data.



Here, we also report R² and the [adjusted R²](https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2), which
is given by

\begin{equation}
R^{2}_\text{adjusted} = 1 - (1 - R^{2}) \frac{n-1}{n-k-1},
\end{equation}

where $n$ is the number of observations and $k$ the number of variables (not including the constant term).
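
As a small worked example of this formula, the helper below (a hypothetical function, not part of any library) evaluates the adjusted R² for made-up values of $R^2$, $n$, and $k$. It shows that, for the same R², a model with more variables gets a lower adjusted R².

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for n observations and k variables (constant excluded)."""
    if n - k - 1 <= 0:
        raise ValueError("Need n > k + 1 to compute the adjusted R².")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Illustrative values only: R² = 0.90 with n = 24 observations
print(adjusted_r2(0.90, n=24, k=3))   # about 0.885
print(adjusted_r2(0.90, n=24, k=11))  # about 0.808
```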

In [None]:
fig, axs = plt.subplots(
    constrained_layout=True,
    ncols=len(models),
    figsize=(20, 4),
    sharex=True,
    sharey=True,
)

for degree, model, axi in zip(degrees, models, axs):
    y_hat = model.predict(x)
    axi.scatter(xdata, ydata)
    r2 = r2_score(ydata, y_hat)
    n = len(ydata)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - degree - 1)  # Add adjusted R²
    axi.plot(xdata, y_hat, color="darkorange")
    axi.set_title(f"Degree {degree}\nR² = {r2:.3f}\nR²(adj) = {r2_adj:.3f}")
    axi.set_xlabel("Time (hour)")
    axi.set_ylabel("Temperature (°C)")
sns.despine(fig=fig)

**(c)** Plot the residual curves for the five models and determine,
from a visual inspection, the best polynomial order to use for modeling the
temperature as a function of time. 



In [None]:
# First prepare the figure
fig, axs = plt.subplots(
    constrained_layout=True,
    ncols=len(models),
    figsize=(20, 4),
    sharex=True,
    sharey=True,
)

for degree, model, axi in zip(degrees, models, axs):
    y_hat = model.predict(x)
    res = ydata - y_hat
    axi.scatter(xdata, res)
    # Plot a hline at 0 as a guidance for the eye:
    axi.axhline(0, c="k", lw=2, ls=":")
    axi.set_title(f"Degree {degree}")
    axi.set_xlabel("Time (hour)")
    axi.set_ylabel("Temperature residual (°C)")
sns.despine(fig=fig)

**Answer 3.5c:** The models of degree 3 or higher show unstructured residuals, so we choose the simplest of these models, **degree 3**.

**(d)**  Obtain the sum of squared residuals for each polynomial. Plot this as a function
of the degree of the polynomial and determine from visual inspection
the best polynomial order to use for modeling the
temperature as a function of time. Does this agree with your conclusion in point **3.5(c)**?


In [None]:
all_ssr = []
for model in models:
    y_hat = model.predict(x)
    # Sum of squared residuals for this model:
    ssr = np.sum((ydata - y_hat) ** 2)
    all_ssr.append(ssr)
fig, ax = plt.subplots(constrained_layout=True)
ax.plot(degrees, all_ssr, "--o")
# '--o' draws a dashed line with circular markers at the data points
ax.set_title("Sum of squared residuals")
ax.set_xticks(degrees)
ax.set_xlabel("Polynomial order")
ax.set_ylabel(r"Sum of squared residuals (°C)²")
sns.despine(fig=fig)

**Answer 3.5d:** Again, **order 3** seems to be the best choice, as there is no significant drop in the sum of squared residuals after that. This agrees with the conclusion in **3.5(c)**.
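
When reading such a plot, it helps to remember that the sum of squared residuals cannot increase when the polynomial order is raised, because a lower-order polynomial is a special case of a higher-order one. We therefore look for the point where the curve flattens, not for its minimum. The sketch below illustrates this on synthetic data (made up for this example, not the balloon measurements), using `numpy.polynomial.Polynomial.fit` in place of the pipeline.

```python
import numpy as np

# Synthetic data: a cubic trend plus random noise (illustration only)
rng = np.random.default_rng(0)
x_demo = np.arange(24, dtype=float)
y_demo = 0.01 * (x_demo - 12) ** 3 - 0.5 * x_demo
y_demo = y_demo + rng.normal(scale=1.0, size=x_demo.size)

ssr_demo = []
for deg in range(1, 7):
    # Least squares fit of a polynomial of degree `deg`
    poly = np.polynomial.Polynomial.fit(x_demo, y_demo, deg)
    residuals = y_demo - poly(x_demo)
    ssr_demo.append(float(np.sum(residuals**2)))

for deg, ssr in zip(range(1, 7), ssr_demo):
    print(f"degree {deg}: SSR = {ssr:.2f}")
```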

We repeat the fitting with statsmodels to obtain some additional statistics:

In [None]:
import statsmodels.api as sm

In [None]:
table = {
    "degree": [],
    "R2": [],
    "R2(adj)": [],
    "AIC": [],
    "BIC": [],
    "SSR": [],
}
# Scale x before generating the polynomial terms; this keeps the
# high powers (e.g. degree 11) numerically better behaved.
x_scaled = StandardScaler().fit_transform(x)
for degree in degrees:
    print(degree)
    polynomial = PolynomialFeatures(degree=degree)
    X = polynomial.fit_transform(x_scaled)
    print(polynomial.get_feature_names_out())
    model = sm.OLS(ydata, X).fit()
    table["degree"].append(degree)
    table["R2"].append(model.rsquared)
    table["R2(adj)"].append(model.rsquared_adj)
    table["AIC"].append(model.aic)
    table["BIC"].append(model.bic)
    table["SSR"].append(model.ssr)  # sum of squared residuals
    print(model.summary())

In [None]:
table = pd.DataFrame(table)
table

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
table.plot("degree", "BIC", ax=ax, marker="o", ylabel="BIC", ls="--")
sns.despine(fig=fig)
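
To interpret the AIC and BIC values, it can help to see how they relate to the sum of squared residuals. The sketch below assumes Gaussian errors and counts all fitted coefficients (including the constant) as parameters. As far as we know, this corresponds to the values statsmodels reports for OLS, but it is worth checking against `model.aic` and `model.bic`. The function name and the numbers used are hypothetical.

```python
import numpy as np


def gaussian_aic_bic(ssr, n, n_params):
    """AIC and BIC of a least squares fit, assuming Gaussian errors."""
    # Maximized log-likelihood with the error variance estimated as ssr / n
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    aic = -2 * llf + 2 * n_params
    bic = -2 * llf + np.log(n) * n_params
    return aic, bic


# Illustrative values: the same SSR reached with 4 and with 5 parameters
aic_4, bic_4 = gaussian_aic_bic(ssr=10.0, n=24, n_params=4)
aic_5, bic_5 = gaussian_aic_bic(ssr=10.0, n=24, n_params=5)
print(bic_5 - bic_4)  # the extra parameter costs ln(n) in BIC
```

Lower values are better for both criteria. An extra polynomial term therefore only pays off if it reduces the sum of squared residuals enough to outweigh the penalty, which is larger for BIC than for AIC once $\ln n > 2$.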