Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 4: regression methods

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/course/view.php?id=10682) Lab session #04

Jérémie DECOCK - Adrien EHRHARDT

## Objectives

In lab session 02, we used a **parametric model** to solve **regression problems**. In lab session 03, we used k-NN, a **non-parametric model**, on (mostly) classification **and** regression problems, as well as logistic regression, a **parametric model**, on classification tasks.

Today you will continue the exploration of regression methods (both parametric and non-parametric), and of logistic regression. On today's agenda:

- Pathological Cases: ordinary least squares gone wrong
- Regularization with linear and logistic regression
- Weighted Least Squares
- Kernel Regression: "local" dependence, like k-NN
- Local Linear Regression

**Note**: as in the previous labs, the notation differs slightly from the lecture slides. For instance, parameters are denoted $w$ in the lectures (machine learning community) but $\theta$ here (statistics community).

## Imports and tool functions

In [None]:
%matplotlib inline

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing
from sklearn.utils import shuffle

In [None]:
def gen_2d_classification_samples(n_samples: int = 20, nclass: int = 3) -> pd.DataFrame:
    """
    Generates 2-dimensional samples which belong to either 2 or 3 classes

    :param int n_samples: number of samples to draw per class
    :param int nclass: number of classes the samples belong to (either 2 or 3)
    :returns: dataframe containing X (2 coordinates x1, x2) and y (as int!)
    """
    cov = np.diag([2., 2.])

    x1 = np.random.multivariate_normal(mean=[0., 0.], cov=cov, size=n_samples)
    y1 = np.full(n_samples, 1, dtype=int)

    x2 = np.random.multivariate_normal(mean=[4., 0.], cov=cov, size=n_samples)
    y2 = np.full(n_samples, 2, dtype=int)

    x3 = np.random.multivariate_normal(mean=[2., 4.], cov=cov, size=n_samples)
    y3 = np.full(n_samples, 3, dtype=int)

    if nclass == 3:
        X = np.concatenate([x1, x2, x3])
        y = np.concatenate([y1, y2, y3])
    elif nclass == 2:
        X = np.concatenate([x1, x2])
        y = np.concatenate([y1, y2])
    else:
        raise ValueError("Only 2 or 3 classes")

    df = pd.DataFrame(X, columns=['x1', 'x2'])
    df['y'] = y

    df = shuffle(df).reset_index(drop=True)
    
    return df

In [None]:
def gen_1d_polynomial_regression_samples(n_samples: int = 15) -> pd.DataFrame:
    """
    Generate 1-dimensional regression samples (x, y)

    :param int n_samples: how many samples to return
    """
    x = np.random.uniform(low=0., high=1.5, size=n_samples)
    y = np.cos(2. * np.pi * x) + np.random.normal(scale=0.1, size=x.shape)
    df = pd.DataFrame(np.array([x, y]).T, columns=['x', 'y'])
    df = sklearn.utils.shuffle(df).reset_index(drop=True)
    return df

In [None]:
def plot_1d_regression_samples(dataframe: pd.DataFrame, model=None):
    """
    Plot the data in dataframe, as well as (optionally) the predictions from a model

    :param pandas.DataFrame dataframe: dataframe containing 'x' and 'y'
    :param model: model to predict
    """
    fig, ax = plt.subplots(figsize=(8, 8))
    
    df = dataframe.copy()  # work on a copy so the caller's dataframe is left untouched
    
    ERROR_MSG1 = "The `dataframe` parameter should be a Pandas DataFrame having the following columns: ['x', 'y']"
    assert df.columns.values.tolist() == ['x', 'y'], ERROR_MSG1
    
    if model is not None:
        
        # Compute the model's prediction
        x_pred = np.linspace(df.x.min(), df.x.max(), 100).reshape(-1, 1)
        y_pred = model.predict(x_pred)
        df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])
        df_pred.plot(x='x', y='y', style='r--', ax=ax)

    # Plot also the training points
    df.plot.scatter(x='x', y='y', ax=ax)
    delta_y = df.y.max() - df.y.min()
    plt.ylim((df.y.min() - 0.15 * delta_y,
              df.y.max() + 0.15 * delta_y))

In [None]:
def plot_regression_1d(X, y, theta=None, x_min=0, x_max=2):
    """
    Plot the samples (X[:, 1], y) and, if theta is given, the corresponding regression line
    """
    assert X.ndim == 2 and X.shape[1] == 2, X.shape
    assert y.ndim == 2 and y.shape[1] == 1, y.shape
    if theta is not None:
        assert theta.ndim == 2 and theta.shape == (2, 1), theta.shape
    
    fig, ax = plt.subplots()
    ax.scatter(X[:,1], y)

    if theta is not None:
        x = np.linspace(x_min, x_max, 50)
        y = theta[0] + theta[1] * x

        ax.plot(x, y, "--r")

## Pathological cases

Consider the following implementation of the least squares method:

In [None]:
def least_squares(X: np.array, y: np.array) -> np.array:
    """
    Perform linear regression via least squares, return coefficient

    :param numpy.array X: design matrix (n-sample as rows, p features as columns)
    :param numpy.array y: response vector (n elements)
    :return: linear regression coefficient found via ols
    :rtype: numpy.array
    """
    XX = np.dot(X.T, X)
    Xy = np.dot(X.T, y)
    invXX = np.linalg.inv(XX)
    theta = np.dot(invXX, Xy)
    
    return theta

We want to use it to apply linear regression to some datasets.
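As a sanity check before moving on to the problematic datasets, here is a minimal sketch of the same normal-equation formula applied to a well-posed toy problem. The data and the names `normal_equation_fit`, `X_toy`, `y_toy` are hypothetical and only serve this illustration; the result is compared with NumPy's own `np.linalg.lstsq`.

```python
import numpy as np

def normal_equation_fit(X, y):
    # Same formula as least_squares() above: theta = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

# Hypothetical toy data lying exactly on the line y = 1 + 2 x
# (the first column of ones plays the role of the intercept)
X_toy = np.array([[1., 0.],
                  [1., 1.],
                  [1., 2.]])
y_toy = np.array([1., 3., 5.])

theta_toy = normal_equation_fit(X_toy, y_toy)

# Reference solution computed by NumPy's least-squares solver
theta_ref, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print("normal equations:", theta_toy)
print("np.linalg.lstsq: ", theta_ref)
```

Here there are more samples than features and the two columns are not proportional, so $X^T X$ is invertible and both approaches should agree.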

### Exercise 1

#### Question 1

What is wrong with the following dataset? (Run the two cells below, then give your answer in the third one.)

In [None]:
X = np.array([[1, 3, 0],
              [2, 3, 4]])
y = np.array([1.8, 2.7])

X

In [None]:
# theta = least_squares(X, y)   # <- **TODO: UNCOMMENT**
# theta                         # <- **TODO: UNCOMMENT**

YOUR ANSWER HERE

#### Question 2

What is wrong with the following dataset? (Run the two cells below, then give your answer in the third one.)

In [None]:
X = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]).T
y = np.array([1.8, 2.7, 3.4, 3.8, 3.9])

X

In [None]:
# theta = least_squares(X, y)   # <- **TODO: UNCOMMENT**
# theta                         # <- **TODO: UNCOMMENT**

YOUR ANSWER HERE

## Regularization with Ridge regression

We have the following dataset:

In [None]:
x = np.array([1, 2.5, 3, 4.2, 5.5])
y = np.array([3.1, 3.5, 6.8, 10.9, 12.3])

plt.scatter(x, y);

We apply basis expansion to fit a polynomial model to the data (similar to lab_session_02).

In [None]:
def basis_expansion(x: np.array, degree: int = 4) -> np.array:
    """
    Basis expansion (1, x, ..., x^degree)

    :param numpy.array x: vector to be expanded
    :param int degree: degree up to which (included) to perform the expansion
    """
    # Intercept
    Z_list = [np.ones(shape=x.shape)]
    
    # x^1, x^2, ..., x^degree
    for deg_index in range(1, degree + 1):
        Z_list.append(x**deg_index)
    
    return np.array(Z_list).T
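
To see what such an expansion looks like, the sketch below builds the matrix $(1, x, \dots, x^4)$ for three hypothetical inputs with `np.vander`, which, with `increasing=True`, should produce the same column layout as `basis_expansion()` with `degree=4`. The names `x_demo` and `Z_demo` are illustrative only.

```python
import numpy as np

# Three hypothetical inputs
x_demo = np.array([1., 2., 3.])

# Columns are x^0, x^1, x^2, x^3, x^4 (N = degree + 1 columns)
Z_demo = np.vander(x_demo, N=5, increasing=True)

print(Z_demo.shape)
print(Z_demo)
```

Each row corresponds to one sample and each column to one power of $x$, so a model that is linear in the columns of this matrix is a polynomial in $x$.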

In [None]:
Z = basis_expansion(x)

# Instantiate and fit the model
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(Z, y)

print("Coefs:", model.coef_)

In [None]:
def plot_regression(x: np.array, model=None, theta=None, degree: int = 4):
    # Compute the model's prediction
    x_pred = np.linspace(x.min(), x.max(), 100)
    Z_pred = basis_expansion(x_pred, degree=degree)
    if model is not None:
        y_pred = model.predict(Z_pred)
    elif theta is not None:
        y_pred = np.dot(Z_pred, theta)
    else:
        raise ValueError('Provide either model or theta')

    # Plot prediction and training set
    fig, ax = plt.subplots(figsize=(18, 8))
    ax.plot(x_pred, y_pred)
    ax.scatter(x, y)  # note: `y` is the global variable holding the dataset's targets
    plt.show();

In [None]:
plot_regression(x, model=model)

As you can see, a polynomial of degree 4 is not well suited to these data. This is a clear case of over-fitting: the model is too complex for the data and will generalize poorly (i.e. make large errors on new, unseen data).
In fact, since we only have 5 data points, any polynomial regression of degree >= 4 has at least 5 coefficients (counting the intercept term) and can therefore fit our 5 points perfectly (i.e. go through all of them).
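
The interpolation claim can be checked numerically. The sketch below copies the 5 data points into hypothetical variables (`x_pts`, `y_pts`), builds the square $5 \times 5$ matrix of powers $1, x, \dots, x^4$ and solves the resulting linear system exactly instead of running a regression.

```python
import numpy as np

# Copy of the 5 data points used in this section
x_pts = np.array([1, 2.5, 3, 4.2, 5.5])
y_pts = np.array([3.1, 3.5, 6.8, 10.9, 12.3])

# Degree-4 expansion: 5 samples x 5 coefficients, i.e. a square system
Z_pts = np.vander(x_pts, N=5, increasing=True)

# With distinct x values this matrix is invertible, so the system has a unique solution
theta_interp = np.linalg.solve(Z_pts, y_pts)

residuals = y_pts - Z_pts @ theta_interp
print("largest absolute residual:", np.abs(residuals).max())
```

The residuals should vanish up to floating-point error: the degree-4 polynomial passes through every training point, which says nothing about how it behaves between or beyond them.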

A solution is to reduce the complexity of the model using a lower polynomial degree.

An alternative is to apply a *regularization method* like the *ridge regularization* (a.k.a. *L2 regularization*) which applies a penalty on the value of $\theta$ to constrain it to be as small as possible.

A $\boldsymbol{\theta}$ with small elements usually makes the model simpler and brings better generalization performances.

This L2 regularization is included in the least square method as follows:

$$
\boldsymbol{\theta}^*
\leftarrow \arg\min_{\boldsymbol{\theta}} E(\boldsymbol{\theta})
\quad \text{with} \quad
E(\boldsymbol{\theta})
= \underbrace{||\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\theta}||^2_2}_{\text{error term}} ~~ + \underbrace{\lambda ||\boldsymbol{\theta}||^2_2}_{\text{regularization}}$$

where $\lambda \in \mathbb{R}^+$ is the *regularization strength* coefficient:
- when $\lambda$ goes to infinity, the regularization term dominates the error term (MSE) and the coefficients $\boldsymbol{\theta}$ tend to zero;
- when $\lambda$ goes to 0, the regularization term loses its importance and is eventually ignored;
- $\lambda$ is a *meta-parameter* or *hyperparameter*;
- the best $\lambda$ for a problem can be determined empirically or automatically;
- we're not interested in the best $\lambda$ *per se* but in the best prediction performance on a test set, *i.e.* achieving a good bias-variance tradeoff. This is made easy by having a single parameter, $\lambda$, to control this tradeoff, and calculating the desired criterion, e.g. MSE on a test set, for many values of $\lambda$ (e.g. grid search);
- to find this "best" $\lambda$ (i.e. the one giving the best MSE on a test set), we usually plot the training and testing errors w.r.t. $\lambda$ and, more generally, w.r.t. model complexity. This will be part of subsequent labs.
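
To make the role of $\lambda$ concrete, here is a small sketch that simply evaluates $E(\boldsymbol{\theta})$ for two candidate parameter vectors and several values of $\lambda$. The function `ridge_objective` and the toy data are hypothetical and only illustrate the formula above; nothing is optimized here.

```python
import numpy as np

def ridge_objective(X, y, theta, lambda_):
    """Evaluate E(theta) = ||y - X theta||^2 + lambda * ||theta||^2."""
    residual = y - X @ theta
    return residual @ residual + lambda_ * (theta @ theta)

# Hypothetical toy data (first column of ones = intercept)
X_demo = np.array([[1., 1.],
                   [1., 2.],
                   [1., 3.]])
y_demo = np.array([1., 2., 2.])

theta_zero = np.zeros(2)             # no penalty, but large error term
theta_demo = np.array([0.5, 0.5])    # small error term, non-zero penalty

for lambda_ in (0., 1., 100.):
    e_zero = ridge_objective(X_demo, y_demo, theta_zero, lambda_)
    e_demo = ridge_objective(X_demo, y_demo, theta_demo, lambda_)
    print(f"lambda={lambda_:6.1f}  E(0)={e_zero:7.2f}  E(theta_demo)={e_demo:7.2f}")
```

For small $\lambda$ the vector with the smaller error term wins; for large enough $\lambda$ the penalty dominates and the zero vector has the lower objective, which is the shrinking effect described in the first bullet.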

### Exercise 2

#### Question 1

On a sheet of paper:
- Compute the analytic formulation of the gradient $\nabla_{\boldsymbol{\theta}} E(\boldsymbol{\theta})$
- Compute the analytic formulation of the optimal parameter $\boldsymbol{\theta^*}$

YOUR ANSWER HERE

#### Question 2

Is it a convex optimization problem, like *Ordinary Least Squares*?

YOUR ANSWER HERE

#### Question 3

Check the following Scikit Learn implementation of the Ridge Regression (more info here: https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression ).

In [None]:
model = sklearn.linear_model.Ridge(alpha=0, fit_intercept=False)
model.fit(Z, y)
coefs = model.coef_  # Z already contains the intercept column, so coefs[0] is the intercept
print("Coefs:", coefs)

plot_regression(x, model=model)

Change the value of the `alpha` parameter in `sklearn.linear_model.Ridge` and explain what happens (in Scikit Learn the regularization strength $\lambda$ is named $\alpha$, and sometimes its inverse is referred to as $C$, e.g., in `LogisticRegression`, see previous lab).

YOUR ANSWER HERE

#### Question 4

Plot Ridge coefficients as a function of the regularization parameter.

Evaluate the following sequence of regularization strength: `alphas = np.logspace(-2, 5, 50)`.

In [None]:
# Compute paths
alphas = np.logspace(-2, 5, 50)

coefs = []
for a in alphas:
    # Fit a `Ridge` object
    # coefs.append(...)  # TO UNCOMMENT: append the ridge coefficients to the `coefs` list
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Display results
fig, ax = plt.subplots(figsize=(18, 8))
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

#### Question 5

Update the following function to implement the ridge regression in Python (without using Scikit Learn). Check it as in question 3.

In [None]:
def ridge_regression(X, y, lambda_):
    XX = np.dot(X.T, X)
    Xy = np.dot(X.T, y)
    # invXX = ...
    # theta = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return theta

In [None]:
theta = ridge_regression(Z, y, lambda_=6.)

print("Coefs:", theta)

plot_regression(x, theta=theta)

## Ridge Logistic Regression

### Exercise 3

Let's use our 2D classification example from lab_session_03.

In [None]:
df = gen_2d_classification_samples(n_samples=50, nclass=2)

In [None]:
x_min, x_max = df.x1.min() - .5, df.x1.max() + .5
y_min, y_max = df.x2.min() - .5, df.x2.max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
plt.scatter(df.x1, df.x2, c=df.y);

We used gradient descent to fit a Logistic Regression and draw a linear decision boundary between these two classes.

Recall the `LogisticRegression` class:

In [None]:
logistic_regression = sklearn.linear_model.LogisticRegression(C = 1e9).fit(
    df[['x1', 'x2']].values,
    df.y.values - 1)

In [None]:
Z = logistic_regression.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 0]
plt.pcolormesh(xx, yy, (Z.reshape(xx.shape) > 0.5) * 1, cmap=plt.cm.Paired)
plt.scatter(df.x1, df.x2, c=df.y - 1);

Let's try doing polynomial logistic regression:

In [None]:
logistic_regression = sklearn.linear_model.LogisticRegression(C = 1e9).fit(
    np.array([df.x1, df.x2, df.x1 ** 2, df.x2 ** 2, df.x1 ** 3, df.x2 ** 3]).T,
    np.array(df.y) - 1)

In [None]:
Z = logistic_regression.predict_proba(np.c_[xx.ravel(), yy.ravel(),
                                            xx.ravel() ** 2, yy.ravel() ** 2,
                                            xx.ravel() ** 3, yy.ravel() ** 3])[:, 0]
plt.pcolormesh(xx, yy, (Z.reshape(xx.shape) > 0.5) * 1, cmap=plt.cm.Paired)
plt.scatter(df.x1, df.x2, c=df.y - 1);

This is a relatively poor fit. Just like linear regression, let's try to regularize logistic regression with a ridge penalty ($\lambda ||\boldsymbol{\theta}||_2^2$).

On a sheet of paper, compute the analytic formulation of the gradient $\nabla_{\boldsymbol{\theta}} E(\boldsymbol{\theta})$

*Note*: this is similar to Exercise 2 **but** the error function is now the log-loss.
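
As a starting point for the derivation, the sketch below evaluates one common form of the penalized objective. It assumes labels in $\{0, 1\}$ and a log-loss summed (not averaged) over the samples, which may differ from the convention of the lecture by a constant factor; `ridge_log_loss` and the toy data are hypothetical names used only here.

```python
import numpy as np

def ridge_log_loss(X, y, theta, lambda_):
    """Assumed objective: summed log-loss + lambda * ||theta||^2, with y in {0, 1}."""
    z = X @ theta
    # -[y log(s) + (1 - y) log(1 - s)] with s = sigmoid(z) simplifies to log(1 + e^z) - y z;
    # np.logaddexp(0, z) computes log(1 + e^z) in a numerically stable way
    log_loss = np.sum(np.logaddexp(0., z) - y * z)
    return log_loss + lambda_ * (theta @ theta)

# Hypothetical toy data: 4 samples, 2 features
X_cls = np.array([[0., 1.],
                  [1., 0.],
                  [2., 1.],
                  [3., 3.]])
y_cls = np.array([0., 0., 1., 1.])

theta_a = np.zeros(2)
theta_b = np.array([1., -1.])

print(ridge_log_loss(X_cls, y_cls, theta_a, lambda_=0.))
print(ridge_log_loss(X_cls, y_cls, theta_b, lambda_=0.))
print(ridge_log_loss(X_cls, y_cls, theta_b, lambda_=2.))
```

With $\boldsymbol{\theta} = 0$ every predicted probability is $0.5$, so the unpenalized loss is $n \log 2$; adding the penalty shifts the objective by exactly $\lambda ||\boldsymbol{\theta}||_2^2$.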

YOUR ANSWER HERE

We can use the $C$ parameter in `sklearn`, inversely proportional to $\lambda$, to fit such a penalization.

You can play with the parameter $C$, as well as the penalty and solver arguments. [See the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
logistic_regression = sklearn.linear_model.LogisticRegression(C = 0.000001, penalty='l2').fit(
    np.array([df.x1, df.x2, df.x1 ** 2, df.x2 ** 2, df.x1 ** 3, df.x2 ** 3]).T,
    np.array(df.y) - 1)

In [None]:
Z = logistic_regression.predict_proba(np.c_[xx.ravel(), yy.ravel(),
                                            xx.ravel() ** 2, yy.ravel() ** 2,
                                            xx.ravel() ** 3, yy.ravel() ** 3])[:, 0]
plt.pcolormesh(xx, yy, (Z.reshape(xx.shape) > 0.5) * 1, cmap=plt.cm.Paired)
plt.scatter(df.x1, df.x2, c=df.y - 1);

## Weighted Least Squares

For some regression problems, it may be helpful to give different importance to the examples in the *learning set* $\mathcal{D} = \{(\boldsymbol{x^{(i)}}, y^{(i)})\}_{1 \leq i \leq n}$, that is to say, to associate a weight $\omega^{(i)}$ with each example $\boldsymbol{x}^{(i)}$ in order to prioritize some of them and down-weight or ignore others (e.g. outliers).

Introducing these weights into the method of Least Squares, the regression problem becomes:

$$E(\boldsymbol{\theta}) = \sum_{i=1}^n \omega^{(i)} (y^{(i)} - \boldsymbol{x}^{(i)} \boldsymbol{\theta})^2$$

In order to use the matrix notation, we put weights $\omega^{(i)}$ in the diagonal of the following matrix $\Omega$:

$$
\Omega =
\begin{pmatrix}
\omega^{(1)} & 0            & \cdots & 0 \\
0            & \omega^{(2)} & \cdots & 0 \\
\vdots       & \vdots       & \ddots & \vdots \\
0            & 0            & \cdots & \omega^{(n)} \\
\end{pmatrix}
$$

Then we can write:

$$E(\boldsymbol{\theta}) = (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\theta})^T \Omega (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\theta})$$

with:
$$
\boldsymbol{X} = \begin{pmatrix} 1 & x_1^{(1)} & \dots & x_p^{(1)} \\ \vdots & \vdots & \dots & \vdots \\ 1 & x_1^{(n)} & \dots & x_p^{(n)} \end{pmatrix}
\quad \quad
\boldsymbol{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}
\quad \quad
\boldsymbol{\theta} = \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_p \end{pmatrix}
$$
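
The equivalence between the sum form and the matrix form of $E(\boldsymbol{\theta})$ can be checked numerically. The sketch below uses hypothetical toy values (`X_w`, `y_w`, `omega`, `theta_w`) and only evaluates the error for a fixed $\boldsymbol{\theta}$; it does not solve for the optimum.

```python
import numpy as np

# Hypothetical toy data: 3 samples, intercept + 1 feature
X_w = np.array([[1., 1.],
                [1., 2.],
                [1., 3.]])
y_w = np.array([[1.], [3.], [2.]])      # column vector, shape (3, 1)
omega = np.array([1., 2., 3.])          # one weight per sample
Omega_w = np.diag(omega)                # weights on the diagonal
theta_w = np.array([[0.5], [0.5]])      # arbitrary parameter vector, shape (2, 1)

residual = y_w - X_w @ theta_w          # shape (3, 1)

# Sum form: sum_i omega_i * (y_i - x_i theta)^2
E_sum = np.sum(omega * residual.ravel() ** 2)

# Matrix form: (y - X theta)^T Omega (y - X theta)
E_mat = (residual.T @ Omega_w @ residual).item()

print(E_sum, E_mat)
```

Because $\Omega$ is diagonal, the quadratic form reduces to a weighted sum of squared residuals, so both expressions should give the same number.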

### Exercise 4

#### Question 1

On a sheet of paper:
- Compute the analytic formulation of the gradient $\nabla_{\boldsymbol{\theta}} E(\boldsymbol{\theta})$
- Compute the analytic formulation of the optimal parameter $\boldsymbol{\theta^*}$

YOUR ANSWER HERE

#### Question 2

Is it a convex optimization problem, like *Ordinary Least Squares*?

YOUR ANSWER HERE

#### Question 3

We have the following dataset and weights:

In [None]:
X = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [1, 4],
              [1, 5]])

y = np.array([1.8, 4.5, 3.4, 3.6, 4.2]).reshape([-1, 1])

Omega = np.diag([1, 2, 3, 2, 1])

Omega

In [None]:
plt.scatter(X[:,1], y);

Complete the following Python implementation of the `weighted_least_squares()` procedure.
It should return the optimal parameter $\boldsymbol{\theta^*}$ given by the method of Least Squares weighted by the matrix $\Omega$.

Here, we expect $\theta$ to be a Numpy array of shape `(2, 1)` (i.e. a column vector of two elements), since `y` has shape `(5, 1)`.

Numpy recall:
- The transpose of a matrix `X` is obtained with `X.T`
- The inverse of a matrix `X` is obtained with `np.linalg.inv(X)`
- The product of two matrices `X` and `Y` is obtained with `np.dot(X, Y)` or `np.matmul(X, Y)` or `X @ Y`
- The product of a matrix `X` and a vector `y` is obtained with `np.dot(X, y)`
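
A quick illustration of these operations on small hypothetical matrices (`A`, `B` and `v` are made-up names for this example):

```python
import numpy as np

A = np.array([[2., 0.],
              [0., 4.]])
B = np.array([[1., 2.],
              [3., 4.]])
v = np.array([[1.],
              [1.]])               # column vector, shape (2, 1)

B_t = B.T                          # transpose
A_inv = np.linalg.inv(A)           # inverse
AB = A @ B                         # matrix product, same as np.dot(A, B) or np.matmul(A, B)
Av = np.dot(A, v)                  # matrix-vector product, shape (2, 1)

print(B_t, A_inv, AB, Av, sep="\n\n")
```

Keeping vectors as explicit columns of shape `(n, 1)` makes the shapes of the results easy to predict.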

In [None]:
def weighted_least_squares(X, Omega, y):
    # theta = ...  # TO UNCOMMENT AND COMPLETE
    # YOUR CODE HERE
    raise NotImplementedError()
    return theta

In [None]:
theta = weighted_least_squares(X, Omega, y)
assert len(theta) == 2

#### Question 4

Check your model graphically using the following code:

In [None]:
plot_regression_1d(X, y, theta, x_min=0, x_max=6)

#### Question 5

Change the weights in $\Omega$ to ignore the second point $x = 2$ (give the same weight to all other points) then recompute $\theta$ using `weighted_least_squares()` and check the results on plots with `plot_regression_1d()`.

In [None]:
# Omega = ...
# YOUR CODE HERE
raise NotImplementedError()
theta = weighted_least_squares(X, Omega, y)
plot_regression_1d(X, y, theta, x_min=0, x_max=6)

In [None]:
assert len(theta) == 2

## Nadaraya-Watson Kernel Regression (Bonus)

Like k-Nearest Neighbors, *Nadaraya-Watson Kernel Regression* is a non-parametric model, i.e. decisions are made according to known examples from the *learning set* $\mathcal{D} = \{(\boldsymbol{x^{(i)}}, y^{(i)})\}_{1 \leq i \leq n}$ of $n$ examples, using some notion of proximity between examples.

With k-Nearest Neighbors, decisions are based only on the closest neighbors and other examples are simply ignored. If you remember correctly from the end of Lab 03, we used `sklearn`'s implementation of k-Nearest Neighbors and its `weights="distance"` option to weight the labels of the nearest neighbors **by the inverse of their distance to the point to predict**.
Contrary to standard k-NN but in the same fashion as the aforementioned option, with Kernel Regression all examples $(\boldsymbol{x}^{(i)}, y^{(i)})$ from $\mathcal{D}$ are used to predict the label $y$ of any new point $\boldsymbol{x}$, but their respective contribution in this prediction is weighted using a *kernel function* $K(\boldsymbol{x}^{(i)}, \boldsymbol{x})$. 

$$
y
= f(\boldsymbol{x})
= \frac{\sum^{n}_{i=1} K(\boldsymbol{x}^{(i)}, \boldsymbol{x}) ~ y^{(i)}}{\sum^{n}_{j=1} K(\boldsymbol{x}^{(j)}, \boldsymbol{x})}
= \sum^{n}_{i=1} y^{(i)} \omega^{(i)}
$$

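where $\omega^{(i)} = \frac{K(\boldsymbol{x}^{(i)}, \boldsymbol{x})}{\sum^{n}_{j=1} K(\boldsymbol{x}^{(j)}, \boldsymbol{x})}$, so that $\sum^{n}_{i=1} \omega^{(i)} = 1$.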

Recall about the notation used here:
- $\boldsymbol{x}^{(i)}$ is the feature (input) vector of the $i^{\text{th}}$ example in $\mathcal{D}$ (and $y^{(i)}$ is its label). Beware: $\boldsymbol{x}^{(i)}$ is not the $i^{\text{th}}$ power of $\boldsymbol{x}$ (we will write $\boldsymbol{x}^{(i)2}$ for the square of $\boldsymbol{x}^{(i)}$)!
- $x_i$ is the value of $\boldsymbol{x}$ on the $i^{\text{th}}$ dimension
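
To illustrate how the normalized weights behave, here is a minimal sketch with a hypothetical kernel, `laplace_kernel` (deliberately not the Gaussian kernel used in the exercise below), and made-up training points. It only shows the mechanics of the formula on three examples.

```python
import numpy as np

def laplace_kernel(u, v, scale=1.):
    # Illustrative kernel (an assumption, not the one from the exercise): exp(-|u - v| / scale)
    return np.exp(-np.abs(u - v) / scale)

# Hypothetical 1-D training examples and a new point to predict
x_train = np.array([0., 1., 3.])
y_train = np.array([1., 2., 4.])
x_new = 1.

k = laplace_kernel(x_train, x_new)   # one kernel value per training example
weights = k / k.sum()                # normalized weights omega^(i), summing to 1
y_hat = np.sum(weights * y_train)    # prediction = weighted average of the labels

print("weights:", weights)
print("prediction:", y_hat)
```

The training example closest to `x_new` receives the largest weight, and since the weights are non-negative and sum to 1, the prediction always lies between the smallest and largest training labels.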

### Exercise 5

#### Question 1

Implement the Gaussian kernel $K$ in the following `gaussian_kernel()` Python function.

$$
K(\boldsymbol{u}, \boldsymbol{v})
= \exp\left(\frac{-||\boldsymbol{u} - \boldsymbol{v}||^2_2}{2 \sigma^2} \right)
$$

where $\sigma$ is a parameter equal to $1$ by default.

You can assume $u$ and $v$ to be simple scalars to simplify this Python implementation (i.e. restrict yourself to regression problems with one-dimensional inputs $x \in \mathbb{R}$).

Recall: $e^x$ is written `math.exp(x)` in Python.

In [None]:
def gaussian_kernel(u, v, sigma = 1.):
    # return expression above
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert gaussian_kernel(0,0) == 1
assert gaussian_kernel(1,0) == math.exp(-1/2)

#### Question 2

Implement the Nadaraya-Watson kernel regression in the following `kernel_regression()` Python function.

$$
\text{kernel_regression}(\boldsymbol{x}, \mathcal{D})
= \frac{\sum^{n}_{i=1} K(\boldsymbol{x}^{(i)}, \boldsymbol{x}) ~ y^{(i)}}{\sum^{n}_{j=1} K(\boldsymbol{x}^{(j)}, \boldsymbol{x})}
= y
$$

You can assume that $x$ is a scalar to simplify the Python implementation.
We assume `dataset` contains examples $\mathcal{D} = \{(\boldsymbol{x^{(i)}}, y^{(i)})\}_{1 \leq i \leq n}$ in a Pandas DataFrame having:
- one row per example
- a column "x" containing the examples' features (only one dimension here)
- a column "y" containing the examples' labels

**Hint**: you can use the following `for` loop to compute $\sum K(\boldsymbol{x}^{(i)}, \boldsymbol{x}) ~ y^{(i)}$: `for xi, yi in zip(dataset.x, dataset.y)`.

In [None]:
def kernel_regression(x, dataset):
    # Hint: calculate numerator and denominator separately with list comprehensions
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert kernel_regression(1, pd.DataFrame({"x": [0], "y": [0]})) == 0

We have the following dataset:

In [None]:
dataset = pd.DataFrame([[2., 0.],
                        [5., 2.],
                        [7., 1.],
                        [10., 2.],
                        [14., 4.],
                        [16., 3.],
                        [17., 0.]], columns=['x', 'y'])
dataset

Check your `kernel_regression()` function with the following code:

In [None]:
x_pred = np.linspace(0., 20., 200)
y_pred = [kernel_regression(x, dataset) for x in x_pred]

ax = dataset.plot.scatter(x='x', y='y', label="Dataset", figsize=(12,8))
ax.plot(x_pred, y_pred, "-r", label="Kernel regression")
plt.legend();

## Local Linear Regression (Bonus)

Another possible application of *Weighted Least Squares* and the *Nadaraya-Watson Kernel regression* is the *Local Linear Regression*. It uses a *Kernel* $K(\boldsymbol{x}^{(i)}, \boldsymbol{x})$ to define the weight $\omega^{(i)}$ assigned to example $i$. Thus, it's a linear regression giving more importance to examples close to the point $\boldsymbol{x}$ to predict. This means that this method does a new fit (in other words it computes a new $\boldsymbol{\theta}^*$) for each new point to predict! (Of course, this is much more computationally intensive than ordinary / weighted least squares).

For each point $\boldsymbol{x}$ to predict:
1. Compute weights $\omega^{(i)}$ assigned to examples $\boldsymbol{x}^{(i)}$ w.r.t their distance to $\boldsymbol{x}$: $\omega^{(i)} = K(\boldsymbol{x}^{(i)}, \boldsymbol{x})$
2. Fit Weighted Least Squares to obtain the $\boldsymbol{\theta}^*$ vector associated to $\boldsymbol{x}$
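3. Return the prediction $y = \boldsymbol{x\theta}^*$ (where $\boldsymbol{x}$ is written as a row vector with a leading 1 for the intercept, like the rows of $\boldsymbol{X}$)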
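
The three steps above can be sketched with NumPy alone. This is only an illustration under stated assumptions: it uses a hypothetical `laplace_kernel` rather than the Gaussian kernel of this lab, and it delegates the weighted fit to `np.polyfit` instead of a hand-written weighted least squares. Note that `np.polyfit` applies its `w` argument to the unsquared residuals, hence the square root.

```python
import numpy as np

def laplace_kernel(u, v, scale=1.):
    # Illustrative kernel (an assumption): strictly positive, decreasing with |u - v|
    return np.exp(-np.abs(u - v) / scale)

def local_linear_predict(x0, x_train, y_train, scale=1.):
    # Step 1: one weight per training example, larger for examples close to x0
    omega = laplace_kernel(x_train, x0, scale)
    # Step 2: weighted degree-1 fit; polyfit minimizes sum((w_i * residual_i)^2),
    # so w = sqrt(omega) gives sum(omega_i * residual_i^2)
    slope, intercept = np.polyfit(x_train, y_train, deg=1, w=np.sqrt(omega))
    # Step 3: evaluate the local line at x0
    return intercept + slope * x0

# Hypothetical noiseless data on the line y = 1 + 2 x
x_line = np.linspace(0., 5., 11)
y_line = 1. + 2. * x_line

print(local_linear_predict(2.5, x_line, y_line))
print(local_linear_predict(0.3, x_line, y_line))
```

On exactly linear data every local fit recovers the same line whatever the weights, which makes this a convenient sanity check; on curved data the fitted line changes with `x0`, which is the whole point of the method.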

### Exercise 6

We have the following dataset:

In [None]:
dataset = gen_1d_polynomial_regression_samples(n_samples=30)

plot_1d_regression_samples(dataset)

#### Question 1

Complete the following Python implementation of the `locally_weighted_regression()` procedure defined above.
It should use the previously implemented `gaussian_kernel()` function (with `sigma=0.1`) and `weighted_least_squares()` function. It should return the predicted label $y$ corresponding to one input $\boldsymbol{x}$.

In [None]:
def locally_weighted_regression(x, dataset, sigma=0.1):    
    # Compute a weight wi for each example xi of the dataset: the closer xi is to x, the larger wi is
    
    # wi = ...                                      # <- **TODO: UNCOMMENT AND COMPLETE**
    # Omega = ...                                   # <- **TODO: UNCOMMENT AND COMPLETE**
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Fit weighted least squares to obtain theta
    intercept = np.ones(shape=len(dataset.x))
    X = np.array([intercept, dataset.x]).T
    y = dataset.y.values.reshape([-1, 1])
    theta = weighted_least_squares(X, Omega, y)
    
    # Return prediction y = f(x)
    # y = ...                                       # <- **TODO: UNCOMMENT AND COMPLETE**
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return y

In [None]:
assert len(locally_weighted_regression(0, dataset, sigma=0.1)) == 1

In [None]:
# Generate "test" points in the ~ [0;1.5] support of the data previously generated 
x_pred = np.linspace(0., 1.5, 100)
# Predict the target y of these test points using the Locally-weighted regression model
y_pred = [locally_weighted_regression(x, dataset, sigma=0.1) for x in x_pred]
# Plot the results
ax = dataset.plot.scatter(x='x', y='y', figsize=(16, 8))
ax.plot(x_pred, y_pred);

#### Question 2

What happens when you change the value of the `sigma` parameter (try e.g. `sigma=0.3`)?
Why?

In [None]:
x_pred = np.linspace(0., 1.5, 100)
# y_pred = ...

# YOUR CODE HERE
raise NotImplementedError()

ax = dataset.plot.scatter(x='x', y='y', figsize=(16, 8))
ax.plot(x_pred, y_pred);

YOUR ANSWER HERE

#### Question 3

Can we use Local Linear Regression to forecast time series, as was asked in exercise 7 of lab session 2? (You can try it out in the cells below.) Why?

In [None]:
URL = "https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/natural_gas_co2_emissions_for_electric_power_sector.csv"
df = pd.read_csv(URL, parse_dates=[0])

df['x'] = df.index
df['y'] = df.co2_emissions

df[['x','y']].head()

In [None]:
x_pred = np.linspace(0., len(df) + 5, 1000)
# y_pred = ...

# YOUR CODE HERE
raise NotImplementedError()

ax = df.plot.scatter(x='x', y='y', figsize=(16, 8))
ax.plot(x_pred, y_pred, "-r", label="Prediction")
plt.legend();

YOUR ANSWER HERE