# Exercise 03

This exercise will not be graded and is provided for practice.

## Linear regression
Implement the linear regression model $\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n$.
Use the analytical solution to find the optimal parameters.
How do you have to modify the data matrix $\mathbf{X}$ to find the optimal $w_0$?
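
For reference, the analytical solution is commonly written as the normal equation. The following is a sketch, assuming $\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$ is invertible. The symbols $\tilde{\mathbf{X}}$ (the data matrix extended by a column of ones) and $\hat{\mathbf{w}}$ (the stacked weights) are notation introduced here, not taken from the exercise text:

$$
\tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} & \mathbf{1} \end{bmatrix}, \qquad
\hat{\mathbf{w}} = \left(\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}\right)^{-1} \tilde{\mathbf{X}}^\top \mathbf{y}
$$

With the ones column placed last, the final entry of $\hat{\mathbf{w}}$ plays the role of the intercept $w_0$.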

In [4]:
import numpy as np
from numpy.linalg import inv
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

In [7]:

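class LinearRegressor:
    def __init__(self) -> None:
        # Weight vector; the intercept w_0 is stored as the last entry
        self.w = None

    def _add_constant(self, X: np.ndarray) -> np.ndarray:
        """Append a column of ones to X so that the last weight acts as the intercept w_0"""
        return np.hstack((X, np.ones((len(X), 1))))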

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """Fit the parameters of the model to the data

        Args:
            X (np.ndarray): features
            y (np.ndarray): targets
        """
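        # Extend X with a ones column so the intercept is learned with the other weights
        X = self._add_constant(X)
        X_tr = np.transpose(X)
        # Normal equation: w = (X^T X)^{-1} X^T y
        self.w = inv(X_tr @ X) @ X_tr @ y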

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Use parameters to predict values

        Args:
            X (np.ndarray): features

        Returns:
            np.ndarray: predicted targets
        """
        X = self._add_constant(X)
        return X @ self.w


X, y, true_coefs = sklearn.datasets.make_regression(
    n_samples=100, n_features=50, n_informative=3, random_state=0, coef=True, noise=10)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=0, train_size=0.7)
model = LinearRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
print("R-squared (train)", sklearn.metrics.r2_score(y_train, y_pred))


R-squared (train) 0.9962173319667742


Run the next cell and observe how the model performs. Apart from splitting the data as in the previous cell, how can you evaluate the model robustly (04_slides)? Use `sklearn.model_selection` to implement it. Also implement R-squared as a scoring function.
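
For reference, R-squared is usually defined as one minus the ratio of the residual sum of squares to the total sum of squares. The symbols below ($y_i$ for the targets, $\hat{y}_i$ for the predictions, $\bar{y}$ for the mean target, and $m$ for the number of evaluated samples) are notation introduced here:

$$
R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}
$$

Dividing numerator and denominator by $m$ shows that this equals one minus the mean squared error divided by the (population) variance of the targets.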

In [8]:
y_pred = model.predict(X_test)
print("R-squared (test)", sklearn.metrics.r2_score(y_test, y_pred))

R-squared (test) 0.917687692647308


In [11]:
def r2_score(y, y_pred):
    # R-squared = 1 - MSE / Var(y), with the population variance (NumPy default, ddof=0)
    return 1 - np.mean((y - y_pred) ** 2) / y.var()


# Evaluation of model here:
scores = []
for train_idx, test_idx in sklearn.model_selection.KFold(n_splits=10).split(X, y):
    model = LinearRegressor()
    model.fit(X[train_idx], y[train_idx])
    score = r2_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(score)

print("Avg R-squared:", np.mean(scores))

Avg R-squared: 0.9174901577520226
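
As a possible cross-check of the manual loop, the same 10-fold evaluation can be sketched with `sklearn.model_selection.cross_val_score`. This assumes scikit-learn's own `sklearn.linear_model.LinearRegression` as a stand-in for the hand-written `LinearRegressor`, since `cross_val_score` expects an estimator that follows the scikit-learn estimator interface. The variable names `cv` and `cv_scores` are chosen here for illustration.

```python
import numpy as np
import sklearn.datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Same synthetic data settings as in the exercise (coefficients not requested here)
X, y = sklearn.datasets.make_regression(
    n_samples=100, n_features=50, n_informative=3, random_state=0, noise=10)

# 10 folds without shuffling, matching the manual KFold loop
cv = KFold(n_splits=10)

# scoring="r2" selects scikit-learn's built-in R-squared scorer
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("Avg R-squared (sklearn):", cv_scores.mean())
print("Std of R-squared across folds:", cv_scores.std())
```

Reporting the spread across folds next to the average gives a rough sense of how much the score depends on the particular split.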
