Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 5 - Exam

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/enrol/index.php?id=12838) Lab session #05 - Lab Exam

Théo Lacombe - Adrien Ehrhardt

## Overall presentation

This lab is composed of 3 exercises, granting up to 6, 7.5 and 5.5 points, and 2 additional exercises (2 pts each - the final grade is out of 20, so there are bonus points).

These exercises are independent.

There are examples of automatic tests that are run against your code. **They are neither exhaustive nor sufficient** (we will run other - hidden - tests), **but they are necessary**: they have to pass, otherwise you can be sure *not* to get the points.

You **cannot** use past lab sessions' solutions, nor Google anything. The exam is open book w.r.t. the lectures. Some help and hints are provided for each question (e.g. which function to use and how), and you can also use the `help(...)` function.

- **Do not** delete any pre-existing cell (you can create and delete your own cells for testing).
- **Do not** change the type (Markdown / Code / ...) of any pre-existing cell.
- Run the notebook with the **CSE204** kernel - if you didn't install it beforehand (come on, it's the $5^{th}$ lab!), proceed at your own risk.
- **Do not** rename the file when uploading your work on Moodle.
- **Do not** edit the notebook's or a cell's metadata.

## Imports

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn import model_selection

In [None]:
def gen_d_regression_samples(n_samples: int = 50, p: int = 3,
                             reg_type: str = 'linear', seed: int = 1) -> pd.DataFrame:
    """
    Generate p-dimensional regression samples

    :param int n_samples: number of samples to draw
    :param int p: dimension of the inputs
    :param str reg_type: either linear or polynomial
    :param int seed: random seed
    :return: dataframe with (x_i)_1^p and y
    :rtype: pandas.DataFrame
    """
    np.random.seed(seed)  # use the seed argument (default 1) instead of a hard-coded value
    if reg_type=='linear':
        x = np.random.uniform(low=1.5, high=3, size=(n_samples, p))
        y = np.sum(x, axis=1) + np.random.normal(scale=0.15, size=n_samples)
        df = pd.DataFrame(x, columns=['x' + str(i) for i in range(p)])
        df['y'] = y
    else:
        x = np.random.uniform(low=0, high=1.5, size=(n_samples, p))
        y = np.sum(x, axis=1) + np.sum(x**2, axis=1) + np.random.normal(scale=0.1, size=n_samples)
        df = pd.DataFrame(x, columns=['x' + str(i) for i in range(p)])
        df['y'] = y

    return df

# Exercise 1: Linear regression and its extensions

We briefly recall the framework of linear regression.

We suppose that each data point is generated as $y^{(i)} = f(x^{(i)}) + \epsilon^{(i)} = \theta_0 + \theta_1 x^{(i)}_1 + \dots + \theta_p x^{(i)}_p + \epsilon^{(i)}$ where $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$.

Consider observations $\boldsymbol{X} = \left( x^{(1)} \dots x^{(n)} \right)^T$, with $x^{(i)} \in \mathbb{R}^p$, and labels $\boldsymbol{y} = \left( y^{(1)} \dots y^{(n)} \right)^T$.
Given an observation $x^{(i)}$, and a vector $\theta = (\theta_0 \dots \theta_p)^T$, we produce an estimate $\hat{y}^{(i)}$ of $y^{(i)}$ of the following form:
$$ \hat{y}^{(i)} = \hat{f}(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}_1 + \dots + \theta_p x^{(i)}_p.$$

## Vanilla linear regression

Let's sample some data.

In [None]:
linear_df = gen_d_regression_samples(n_samples=200, p=2)

In [None]:
linear_df.head()

**Question 1: (1pt)** Implement a function `pred` which, given an observation `x` of shape `p` and a vector `theta` of shape `p+1`, returns a predicted value `y_hat`, given the mathematical expression above.

_Hint:_ Do not forget the constant term $\theta_0$ - hence the different shapes for `x` and `theta` - for example, `1` can be "added" to `x` inside the `pred` function if you wish to use vector / matrix multiplication.

_Python hints_ : 
- You can use `np.concatenate((A, B))` to concatenate two numpy arrays. 
- You can use `np.dot(A,B)`, or equivalently `A.dot(B)` to compute the matrix-matrix (or matrix-vector) $A \cdot B$ product between two numpy arrays `A` and `B`. Equivalently, you can use matrix multiplication with vectors (either `A @ B` or `np.matmul(A, B)`), but beware of the shapes!
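
As an illustration of the hints above, here is a minimal sketch on toy numbers. The names `x_toy`, `theta_toy` and `x_aug` are made up for this example and are not part of the exam code. It prepends a `1` to the sample so that the dot product with the parameter vector includes the constant term.

```python
import numpy as np

# Toy values, chosen only for illustration
x_toy = np.array([2.0, 3.0])            # one sample with p = 2 features
theta_toy = np.array([1.0, 0.5, -1.0])  # p + 1 = 3 parameters

# Prepend a 1 so that theta_0 multiplies a constant feature
x_aug = np.concatenate((np.array([1.0]), x_toy))

# 1 * 1.0 + 2 * 0.5 + 3 * (-1.0)
y_hat_toy = np.dot(x_aug, theta_toy)
print(x_aug.shape, y_hat_toy)
```

Both arrays have shape `(3,)` here, so `np.dot`, `@` and `np.matmul` all return a scalar.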

In [None]:
def pred(x: np.array, theta: np.array) -> float:
    """
    Implementation of f_hat

    :param numpy.array x: a sample with p features
    :param numpy.array theta: a vector of (p + 1) parameters
    :return: y_hat
    :rtype: float
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert pred(np.array([0]), np.array([0,0])) == 0  # This computes y_hat = 0 + 0 * 0 which should be 0

Recall that all samples are gathered in matrices $\boldsymbol{X}$ and $\boldsymbol{y}$, where
$$
\boldsymbol{X} = \begin{pmatrix} x_1^{(1)} & \dots & x_p^{(1)} \\ \vdots & \vdots & \vdots \\ x_1^{(n)} & \dots & x_p^{(n)} \end{pmatrix}
\quad \text{and} \quad
\boldsymbol{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}.
$$

We propose a loss function $E(\theta) := \| \boldsymbol{y} - \boldsymbol{X} \theta\|^2_2$ (here $\boldsymbol{X}$ is understood with a column of ones prepended, so that it has shape $n \times (p+1)$ and accounts for $\theta_0$). This is the sum of squared errors (differences between the true and predicted value for each $x$ in $\boldsymbol{X}$), or equivalently (in vector form) the squared norm of the vector of each sample's prediction error $y^{(i)} - \hat{y}^{(i)}$.

**Question 2: (1pt)** Implement a function `E` which, given `X, y, theta` computes the error $E(\theta)$ defined above (you can use the `pred` function defined in question 1).

_Python hint:_ You can compute the norm of a vector `A` using `np.linalg.norm(A)`.
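
The following sketch, on made-up numbers (`y_true_toy` and `y_pred_toy` are hypothetical), shows that `np.linalg.norm` returns the norm itself, so it has to be squared to obtain a sum of squared errors.

```python
import numpy as np

# Hypothetical true and predicted values, for illustration only
y_true_toy = np.array([1.0, 2.0, 3.0])
y_pred_toy = np.array([1.5, 2.0, 2.0])

residuals = y_true_toy - y_pred_toy

# Two equivalent ways of computing the sum of squared errors
sse_from_norm = np.linalg.norm(residuals) ** 2
sse_from_sum = np.sum(residuals ** 2)
print(sse_from_norm, sse_from_sum)
```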

In [None]:
def E(X: np.array, y: np.array, theta: np.array) -> float:
    """
    Implementation of the error function: sum of squared errors

    :param numpy.array X: design matrix of shape N x p
    :param numpy.array y: response values of shape N
    :param numpy.array theta: coefficient of shape (p+1)
    :return: evaluation of the error function
    :rtype: float
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Test for a particular value of E
assert int(E(linear_df[['x0', 'x1']].to_numpy(), linear_df['y'], np.array([0,0,0]))) == 4180

We want to find an optimal vector $\theta^*$ so that the loss made by using $\hat{y}^{(i)}$ to approximate $y^{(i)}$ is small.
With vector notations, it reads:
$$\theta^* \in \mathrm{argmin} (E(\theta)).$$

Since $E(\theta)$ is convex (the square is convex and everything else is linear), any stationary point is a global minimum, and $\theta^\star$ is unique whenever $\boldsymbol{X}^T \boldsymbol{X}$ is invertible (you can also calculate the Hessian matrix to convince yourself).

Recall that to find $\theta^\star$, a straightforward solution is to differentiate $E(\theta)$ so as to obtain its gradient, set it to 0 and solve the resulting equation. 

This yields a closed form equation. The optimal $\theta^\star$ is given by:
$$ \theta^\star = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$$

**Question 3: (1pt)** Implement a function `linreg` which, given `X` and `y`, returns the optimal vector `theta_star` as suggested above.

_Python hints_ : 
- You can use `np.linalg.inv(A)` to compute the inverse of a (non-singular) **square** matrix. 
- You can use `np.transpose(A)` or equivalently `A.T` to compute the transpose of a numpy array `A`. 
- You can use `np.dot(A,B)`, or equivalently `A.dot(B)` to compute the matrix-matrix (or matrix-vector) $A \cdot B$ product between two numpy arrays `A` and `B`. Equivalently, you can use matrix multiplication with matrix and vectors (either `A @ B` or `np.matmul(A, B)`), but beware of the shapes!
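
To see how these hints fit together, here is a sketch on synthetic, noiseless data. All names (`X_toy`, `theta_true`, ...) are assumptions made for this example. It evaluates the closed form and compares it with `np.linalg.lstsq`, which solves the same least-squares problem without forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toy = 20

# Synthetic design matrix: a column of ones followed by 2 random features
features = rng.uniform(size=(n_toy, 2))
X_toy = np.concatenate((np.ones((n_toy, 1)), features), axis=1)

# Noiseless responses generated from known (made-up) coefficients
theta_true = np.array([1.0, 2.0, -3.0])
y_toy = X_toy @ theta_true

# Closed form: (X^T X)^{-1} X^T y
theta_closed = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Reference solution from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
print(theta_closed, theta_lstsq)
```

Since the data are noiseless, both estimates should recover `theta_true` up to rounding errors.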

In [None]:
def linreg(X: np.array, y: np.array) -> np.array:
    """
    Compute linear regression coefficient with OLS

    :param numpy.array X: design matrix of shape N x (p+1)
    :param numpy.array y: response vector of shape N
    :return: coefficients of shape (p+1)
    :rtype: numpy.array
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Test your code with the following code - please **do not** be satisfied with the fact that it runs... Figure out what is printed, and whether it's correct - no qualitative answer expected though.

In [None]:
# Transform columns of features into numpy array and add a column of 1
design_matrix = np.concatenate((np.ones(linear_df.shape[0]).reshape(-1, 1),
                                linear_df[['x0', 'x1']].to_numpy()), axis=1)

# Linear regression coefficient
theta = linreg(design_matrix,
               linear_df['y'])

print("Coefficients:\t", theta)

Another way to verify the validity of your implementation is to plot your solution, with the following code:

In [None]:
# Instantiate a figure
fig = plt.figure(dpi=150)
# fig.gca() no longer accepts keyword arguments in recent Matplotlib versions
ax = fig.add_subplot(111, projection='3d')

# Plotting our data points
ax.scatter(linear_df[['x0']].to_numpy(),
           linear_df[['x1']].to_numpy(),
           linear_df['y'].to_numpy())

# Creating a mesh to draw the linear regression hyperplane
x0_surf = np.arange(linear_df['x0'].min(), linear_df['x0'].max(), 0.1)
x1_surf = np.arange(linear_df['x1'].min(), linear_df['x1'].max(), 0.1)
x0_surf, x1_surf = np.meshgrid(x0_surf, x1_surf)
exog = pd.DataFrame({'x0': x0_surf.ravel(), 'x1': x1_surf.ravel()})

# Prediction on the mesh
out = [pred(x, theta) for x in exog.to_numpy()]

# Plotting the hyperplane
ax.plot_surface(x0_surf,
                x1_surf,
                np.array(out).reshape(x0_surf.shape),
                rstride=1,
                cstride=1,
                color='red',
                alpha = 0.4);

## Polynomial regression

We consider the specific case where the observations $x^{(i)}$ are real-valued (that is $p=1$). For an integer $k$ fixed, and a vector $\theta = (\theta_0 \dots \theta_k)^T$, we propose to estimate $y^{(i)}$ in the following way:
$$ \hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)} + \theta_2 (x^{(i)})^2 + \dots + \theta_k (x^{(i)})^k. $$
Here, $(x^{(i)})^j$ means the $j$-th power of $x^{(i)}$.

As in "standard" linear regression, the goal is to find a $\theta^*$ so that 
$ \sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2 $
is minimal.

We consider a dataset of $n = 100$ pairs of the form (observation, label), which is split into a _train set_ of size $80$ and a _test set_ of size $20$. The data are loaded using the code below.

In [None]:
polynomial_df = gen_d_regression_samples(n_samples=100, p=1, reg_type='polynomial')

In [None]:
polynomial_df.head()

In [None]:
X = polynomial_df['x0'].to_numpy()
y = polynomial_df['y'].to_numpy()

print(X[:5], "\n")
print(y[:5])

In [None]:
x_train, x_test, y_train, y_test = sk.model_selection.train_test_split(X, 
                                                                       y, 
                                                                       test_size=1/5, 
                                                                       random_state=1)

**Question 4 (1pt):** Write a function `polynomial_expand` which, given a **real valued** vector $x \in \mathbb{R}^n$ and an integer $k$, returns the $n \times (k+1)$ matrix
$$X = \begin{pmatrix} 1 & x^{(1)} & \dots & (x^{(1)})^k \\ \vdots & \vdots & \dots & \vdots \\ 1 & x^{(n)} & \dots & (x^{(n)})^k \end{pmatrix}$$
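
For reference, NumPy can build a matrix of this form directly with `np.vander`. The sketch below uses toy values (`x_toy` and `k_toy` are hypothetical); it is one possible way to check your own function, not the expected implementation.

```python
import numpy as np

x_toy = np.array([1.0, 2.0, 3.0])  # n = 3 real-valued inputs
k_toy = 2                          # polynomial degree

# increasing=True orders the columns as x^0, x^1, ..., x^k
X_poly = np.vander(x_toy, N=k_toy + 1, increasing=True)
print(X_poly)
```

Each row contains the successive powers of one input, starting with the constant `1`.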

In [None]:
def polynomial_expand(x: np.array, k: int = 5) -> np.array:
    """
    Compute design matrix X as above

    :param numpy.array x: vector of inputs of shape n
    :param int k: polynomial degree
    :return: design matrix X (as above) of shape n x (k+1)
    :rtype: numpy.array
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert polynomial_expand(x_train, 10).shape == (80, 11)  # correct (n, k+1) shape?
assert polynomial_expand(x_train, 10)[0, 0] == 1  # column of 1?

**Question 5** (open question): Solve the following questions using the functions that you have written above. 

**5.a. (1pt)** Perform a polynomial regression to fit `x_train, y_train` and obtain `theta` with $k = 1 \dots 12$ and then predict labels for the test set `x_test`. 

The test error is defined as previously as
$$ E_\mathrm{test} (\theta ; \hat{\boldsymbol{y}}_\mathrm{test}, \boldsymbol{y}_\mathrm{test}) =  \|\hat{\boldsymbol{y}}_\mathrm{test} - \boldsymbol{y}_\mathrm{test}\|_2^2, $$
where $\boldsymbol{y}_\mathrm{test}$ is the vector of all true values we're trying to predict from all the samples in `x_test` and $\hat{\boldsymbol{y}}_\mathrm{test}$ is the vector of values that we're predicting given the vector $\theta$ learned using `x_train, y_train`. The training error has the same expression, using the training set, i.e. $ E_\mathrm{train} (\theta ; \hat{\boldsymbol{y}}_\mathrm{train}, \boldsymbol{y}_\mathrm{train}) =  \|\hat{\boldsymbol{y}}_\mathrm{train} - \boldsymbol{y}_\mathrm{train}\|_2^2$.

Store the training errors and test errors you obtain in two lists of size $12$, whose names must be (respectively) `train_errors` and `test_errors`, and such that `train_errors[k - 1]` (resp. `test_errors[k - 1]`) gives the training error (resp. test error) obtained by performing polynomial regression with maximal degree $k$ (recall that Python lists are indexed from $0$).

*Remark:* you are **not** allowed to use `sklearn`.

*Hint:* Notice you already implemented the cost function, the polynomial expansion and the linear regression, but pay attention to the size of their inputs / outputs...

In [None]:
# train_errors = ...  # <- TO UNCOMMENT AND COMPLETE
# test_errors = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(train_errors) == 12  # Training errors is a list of size 12?
assert len(test_errors) == 12  # Test errors is a list of size 12?

**5.b. (0.5pt)** Plot both arrays, i.e. the training and test errors, w.r.t. the polynomial degree $k$ on the same graph in the following cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**5.c. (0.5pt)** What is the name of the phenomenon / phenomena we are observing here? (short qualitative answer in subsequent cell - bonus points for conciseness AND completeness)

YOUR ANSWER HERE

# Exercise 2: logistic regression

Logistic regression is a common model used to perform binary classification that is actually formulated as a regression problem. 

We have a *learning / training set* $\mathcal{D} = \{x^{(i)}, y^{(i)}\}_{1 \leq i \leq n}$, where $y^{(i)} \in \{0,1\}$ and $x^{(i)} \in \mathbb{R}^p$.

We consider the logistic / sigmoid function:
$$ \sigma : t \mapsto \frac{1}{1 + e^{-t}}.$$
Note that $\sigma$ takes values in $(0,1)$ and will be used to estimate $y^{(i)}$ given $x^{(i)}$. More precisely, given a weight vector $\theta = ( \theta_0, \dots, \theta_p )^T$ and a vector $x^{(i)} = (1, x^{(i)}_1, \dots ,x^{(i)}_p)^T$, we make the following estimation:
$$
    \hat{y}^{(i)} = \begin{cases} 1 \text{ if } \sigma(x^{(i)T} \theta) > 1/2 \\
                                  0 \text{ otherwise } 
                    \end{cases}
$$
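
As a small numerical illustration of the sigmoid and of the thresholding rule, consider the sketch below. The helper name `sigmoid` and the values in `t_values` are assumptions made for this example.

```python
import numpy as np

def sigmoid(t):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical values of the linear score x^T theta
t_values = np.array([-2.0, 0.0, 2.0])
probabilities = sigmoid(t_values)

# Strict threshold at 1/2, as in the rule above
labels = (probabilities > 0.5).astype(int)
print(probabilities, labels)
```

Note that a score of exactly $0$ gives $\sigma = 1/2$, which the strict inequality maps to class $0$.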

We want to find a good $\theta$. One can show that the optimal (in the sense of the *likelihood*) $\boldsymbol{\theta}^* = \begin{pmatrix} \theta_0^* & \dots & \theta_p^* \end{pmatrix}^T$ solves the following optimization problem:

$$\boldsymbol{\theta}^* \leftarrow {\arg\!\min}_{\boldsymbol{\theta}} E(\boldsymbol{\theta}) \quad \text{with} \quad E(\boldsymbol{\theta}) = - \sum_{i=1}^n \left[ (1 - y^{(i)}) \ln(1 - \sigma^{(i)}) + y^{(i)} \ln(\sigma^{(i)}) \right],$$
where $\sigma^{(i)} = \sigma(x^{(i)T} \boldsymbol{\theta})$ and

$$
\boldsymbol{X} = \begin{pmatrix} 1 & x_1^{(1)} & \dots & x_p^{(1)} \\ \vdots & \vdots & \dots & \vdots \\ 1 & x_1^{(n)} & \dots & x_p^{(n)} \end{pmatrix}
\quad \quad
\boldsymbol{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}
\quad \quad
\boldsymbol{\theta} = \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_p \end{pmatrix}
$$
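
To make the expression of $E$ concrete, the sketch below evaluates it for three made-up samples, assuming $\sigma^{(i)}$ denotes $\sigma(x^{(i)T} \boldsymbol{\theta})$. The arrays `y_toy` and `sigma_toy` are hypothetical and are given directly rather than computed from a $\theta$.

```python
import numpy as np

# Hypothetical labels and predicted probabilities sigma^{(i)}
y_toy = np.array([0, 1, 1])
sigma_toy = np.array([0.2, 0.9, 0.5])

# Negative log-likelihood: each sample contributes -ln(probability of its true class)
loss_toy = -np.sum((1 - y_toy) * np.log(1 - sigma_toy) + y_toy * np.log(sigma_toy))
print(loss_toy)
```

Here the three samples contribute $-\ln(0.8)$, $-\ln(0.9)$ and $-\ln(0.5)$ respectively: confident correct predictions cost little, uncertain ones cost more.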

**Question 1: (0.5pt)** Does the error function $E$ have a unique minimum? Why? If it exists, is there an analytical (closed form) solution corresponding to this minimum?

YOUR ANSWER HERE

### Iris dataset
We now consider the `iris` dataset of $n = 150$ points, with $3$ labels 'setosa', 'versicolor', and 'virginica', encoded respectively by $0, 1, 2$ in the following. Since the logistic regression presented above is a binary classifier, the cell below keeps only the classes $0$ and $1$ (i.e. $100$ observations). This subset is split into a training set `x_train, y_train` with $3/4$ of the observations, and a test set `x_test, y_test` with the remaining $1/4$; it must be loaded by running the following cell.

In [None]:
iris = datasets.load_iris()
X = iris.data[:, :2][np.where([target in [0, 1] for target in iris.target])[0]]
y = iris.target[np.where([target in [0, 1] for target in iris.target])[0]]
np.random.seed(seed=1)
x_train, x_test, y_train, y_test = sk.model_selection.train_test_split(X, 
                                                                       y, 
                                                                       test_size=1/4, 
                                                                       random_state=1)

The description of this dataset is printed below (not mandatory for what follows; note that we've kept only the first two features, sepal length and sepal width).

In [None]:
print(iris.DESCR)

First 10 rows of `x_train`:

In [None]:
x_train[:10]

The corresponding labels ($0$ for 'setosa' and $1$ for 'versicolor', since 'virginica' has been filtered out):

In [None]:
y_train[:10].reshape(-1, 1)

**Question 2:** 

**2.a. (1pt)** Implement a function `sigma` which computes, given $x \in \mathbb{R}^p$ and $\theta \in \mathbb{R}^{p+1}$ the quantity $\sigma(x^{T} \theta)$.

_Python hints_ : 
- You can use `np.exp(...)` to compute the exponential. 
- You can use `np.concatenate((A, B))` to concatenate two numpy arrays. 
- You can use `np.dot(A,B)`, or equivalently `A.dot(B)` to compute the matrix-matrix (or matrix-vector) $A \cdot B$ product between two numpy arrays `A` and `B`. Equivalently, you can use matrix multiplication with vectors (either `A @ B` or `np.matmul(A, B)`), but beware of the shapes!

In [None]:
def sigma(x: np.array, theta: np.array) -> float:
    """
    Compute the sigmoid function

    :param numpy.array x: input vector shape p
    :param numpy.array theta: parameter vector of shape (p + 1)
    :return: probability for class 1
    :rtype: float
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert sigma(X[0,:], np.array([0, 0, 0])) == 0.5

**2.b. (1pt)** Implement a function `pred` which provides, given $x \in \mathbb{R}^p$ and $\theta \in \mathbb{R}^{p+1}$, an estimate $\hat{y}$ of $y$ as defined above.

In [None]:
def pred(x: np.array, theta: np.array) -> int:
    """
    Compute a class prediction from a class probability

    :param numpy.array x: input vector of shape p
    :param numpy.array theta: parameter vector of shape p + 1
    :return: predicted class
    :rtype: int
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert pred(X[0,:], np.array([1,1,1])) == 1  # Test for a particular value

**Question 3 (1pt):** Implement the error function $E$ defined above, and which we want to subsequently minimize w.r.t. $\theta$. It takes $X, y, \theta$ as arguments.

In [None]:
def E(X: np.array, y: np.array, theta: np.array) -> float:
    """
    Compute error for logistic regression

    :param numpy.array X: design matrix of shape N x p
    :param numpy.array y: vector of real classes of shape N
    :param numpy.array theta: parameter vector of shape (p+1)
    :return: error
    :rtype: float
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
print(E(X, y, np.array([0, 0, 0])))  # Prints a particular value

As there is no closed form for the optimal $\theta^*$, we will minimize $E$ using a gradient descent. Compute, either here or on a separate sheet of paper (upload it in a **new** Markdown cell via Edit > Insert Image), the **analytical expression** (as opposed to a particular value given some particular inputs $\boldsymbol{X}, \boldsymbol{y}$) of the gradient of $E$ w.r.t. $\theta$, as a function of $\boldsymbol{X}, \boldsymbol{y}, \theta$.

YOUR ANSWER HERE

Implement a function `grad_E(X,y,theta)` which returns the gradient (with respect to $\theta$) of $E$ for a given estimate $\theta$. Again, beware of the shape of the input arguments. You can make use of the `sigma` function already implemented, as well as `np.concatenate`, `np.dot`, `np.matmul` and the `@` operator suggested earlier.
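
One generic way to sanity-check an analytical gradient is to compare it with a finite-difference approximation. The sketch below does so on a toy quadratic function, not on $E$ itself. The names `numerical_gradient`, `toy_loss` and `toy_grad` are hypothetical helpers introduced for this example.

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

def toy_loss(theta):
    """Toy function: squared Euclidean norm of theta."""
    return np.sum(theta ** 2)

def toy_grad(theta):
    """Analytical gradient of toy_loss."""
    return 2 * theta

theta_toy = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(toy_loss, theta_toy)
exact = toy_grad(theta_toy)
print(approx, exact)
```

If the analytical gradient is correct, the two vectors should agree up to a small numerical error.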

In [None]:
def grad_E(X: np.array, y: np.array, theta: np.array) -> np.array:
    """
    Compute the gradient of the error w.r.t. theta

    :param numpy.array X: design matrix of shape n x p
    :param numpy.array y: vector of real classes of shape n
    :param numpy.array theta: parameter vector of shape (p+1)
    :return: gradient of shape (p+1)
    :rtype: numpy.array
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert len(grad_E(X, y, np.array([0,0,0]))) == 3  # Shape of the gradient

**Question 4 (1pt):** Complete the following code in order to implement the gradient descent algorithm to minimize $E$, which returns the final estimate of $\theta$ and the evolution of the loss over the iterations. The hyperparameters (number of steps, learning rate, stopping criterion) are set by default and should not be changed. Note that you **must** implement an early stopping criterion. Recall the complete algorithm:

Starting from a random point $\boldsymbol{\theta}$, the gradient descent method proposes a new point 
$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}}E(\boldsymbol{\theta})$ at each iteration $t$ until a stopping criterion has been reached: e.g. $E^{(t-1)}(\boldsymbol{\theta}^{(t-1)}) - E^{(t)}(\boldsymbol{\theta}^{(t)}) < \epsilon$ with $\epsilon$ a chosen minimal improvement of the error, $t$ and $t-1$ denoting the current (resp. previous) estimation of the error $E$ and the parameter $\theta$.
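
The update rule and the stopping criterion recalled above can be sketched on a toy quadratic function whose minimum is known. This is not the logistic loss: `toy_loss`, `toy_grad`, the starting point and the values of `eta` and `epsilon` are all assumptions chosen for illustration.

```python
import numpy as np

def toy_loss(theta):
    """Toy convex function, minimal at theta = (3, 3)."""
    return np.sum((theta - 3.0) ** 2)

def toy_grad(theta):
    """Gradient of toy_loss."""
    return 2.0 * (theta - 3.0)

theta = np.array([0.0, 10.0])  # arbitrary starting point
eta = 0.1                      # learning rate
epsilon = 1e-8                 # minimal improvement of the loss
previous_loss = np.inf
loss_history = []

for t in range(1000):
    # Gradient step
    theta = theta - eta * toy_grad(theta)
    current_loss = toy_loss(theta)
    loss_history.append(current_loss)
    # Early stopping: the loss no longer improves by at least epsilon
    if previous_loss - current_loss < epsilon:
        break
    previous_loss = current_loss

print(len(loss_history), theta)
```

With these settings the loop stops well before the maximal number of steps, and the recorded losses decrease at every iteration.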

In [None]:
def grad_descent(X: np.array, y: np.array, eta: float = 0.001,
                 nb_max_step: int = 1000, stopping_criterion: float = 0.01,
                 seed: int = 1):
    """
    Perform gradient descent
    Remark: do not change the default values, do not normalize the gradient by the number of points!

    :param numpy.array X: input observations, shape N x p
    :param numpy.array y: input labels, shape N (filled with 0 and 1)
    :param float eta: learning rate.
    :param int nb_max_step: maximal number of steps done in the gradient descent
    :param float stopping_criterion: stopping criterion on gradient process
    :param int seed: numpy random seed
    :return: last parameter value of shape p+1 (theta), list of float (evolution_of_loss)
    :rtype: numpy.array, list
    """
    # Storing some useful values
    N, p = X.shape
    # List to store the evolution of E(theta) over steps
    evolution_of_loss = []
    # Store the current value of loss to stop gradient descent
    e = np.inf
    # Initialization of theta
    np.random.seed(seed=seed)  # DO NOT change this
    theta = np.random.rand(p + 1)  # DO NOT change this

    ### Perform the gradient descent
    for t in range(nb_max_step):
        
        # Perform a gradient step
        # grad = ... # <-- to complete
        # theta = ... # <-- to complete
        # new_e = ... # <-- to complete, store the loss according to the new parameter theta
        # Do not normalize by the number of points!

        # YOUR CODE HERE
        raise NotImplementedError()
    
    return theta, evolution_of_loss

Test your code with the cell below.

In [None]:
theta, evol = grad_descent(x_train, y_train)  # Perform gradient descent
plt.plot(evol);  # Plot evolution of loss: does this seem correct?

**Question 6:** 

**6.a. (0.5pt)** What is the training error rate, defined in classification as the proportion (between $0$ and $1$) of predicted labels different from the true ones, of the logistic regression on the training dataset? Put the result in `training_error`.

In [None]:
# training_error = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 0 <= training_error <= 1  # Must be between 0 and 1

**6.b. (0.5pt)** What is the test error rate, defined in classification as the proportion (between $0$ and $1$) of predicted labels different from the true ones, of the logistic regression on the test dataset? Put the result in `test_error`.

In [None]:
# test_error = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 0 <= test_error <= 1  # Must be between 0 and 1

# Exercise 3: $k$-nearest neighbors

We briefly recall the $k$-nearest neighbors ($k$-NN) algorithm seen in lab session 2.

Consider a dataset $\mathcal{D} = (X, y)$ where the $j$-th coordinate of the $i$-th observation is denoted by $x^{(i)}_j$ and the corresponding label is $y^{(i)}$, or equivalently:
$$\boldsymbol{X} = \begin{pmatrix} x_1^{(1)} & \dots & x_p^{(1)} \\ \vdots & \dots & \vdots \\ x_1^{(n)} & \dots & x_p^{(n)} \end{pmatrix}
\quad \quad
\boldsymbol{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}.$$

Fix an integer $k$. Given a new observation $x \in \mathbb{R}^p$, we predict its class $\hat{y}$ in the following way.

Let $x^{(i_1)}, \dots, x^{(i_k)}$ be the $k$ observations in $X$ which minimize the Euclidean distance $\|x - x^{(i)}\|_2$.

In classification, we choose $\hat{y}$ as the dominant occurrence among the corresponding classes $\{ y^{(i_1)}, \dots, y^{(i_k)} \}$ (majority vote). In case of a tie, i.e., two (or more) classes are the most frequent, $\hat{y}$ is chosen randomly (uniformly) among these equally dominant classes.

In regression, we compute $\hat{y}$ as the mean of the values of $\{ y^{(i_1)}, \dots, y^{(i_k)} \}$.
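
The procedure above can be sketched on a tiny made-up dataset. The names `X_toy`, `y_toy`, `x_new` and `k_toy` are hypothetical. For simplicity, this sketch breaks ties deterministically (smallest label) instead of randomly, which differs from the rule stated above.

```python
import numpy as np

# Tiny made-up dataset: 4 points in the plane with labels 0 or 1
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])
x_new = np.array([0.2, 0.1])
k_toy = 3

# Euclidean distance from x_new to every training point
distances = np.linalg.norm(X_toy - x_new, axis=1)

# Indices of the k closest training points
nearest = np.argsort(distances)[:k_toy]

# Classification: majority vote (np.argmax keeps the first maximum, so ties are NOT random here)
classes, counts = np.unique(y_toy[nearest], return_counts=True)
y_hat_class = classes[np.argmax(counts)]

# Regression: mean of the neighbors' values
y_hat_reg = y_toy[nearest].mean()
print(nearest, y_hat_class, y_hat_reg)
```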

We generate some data so that you can test your implementations.

In [None]:
knn_df = gen_d_regression_samples(n_samples=50, p=2)

In [None]:
knn_df.head()

## $1$-nearest neighbors regressor

**Question 1 (1pt):** Implement a function which takes as input a matrix `X_train` of shape `n x p`, the corresponding values `y_train` (shape `n`) and a new observation `x` of shape `p`, and which returns a prediction `y_hat` for this new observation using the **one**-nearest neighbor algorithm.

_Python hint:_ You can compute the norm of a numpy array using `np.linalg.norm`. You can use `np.argmin` (resp. `np.argmax`) to find the index of the minimum (resp. maximum) of a numpy array.

In [None]:
def oneNN(X_train: np.array, y_train: np.array, x: np.array) -> float:
    """
    Search nearest neighbor of x in X_train and output its response value

    :param numpy.array X_train: training dataset of shape N x p
    :param numpy.array y_train: training response values of shape N
    :param numpy.array x: test point of shape p
    :return: y_hat, predicted value
    :rtype: float
    """
    # First, we construct a list of distances from x to each data point in X_train.
    # Each distance is computed using np.linalg.norm(x - x_train).
    # Then, we predict y_hat label by choosing the label of the closest point in X_train (1-NN).
    # The smallest distance is the min in the list. However, we don't need the distance itself but its position.
    # YOUR CODE HERE
    raise NotImplementedError()

Test your function by running the following cell (note that `knn_df` is a DataFrame and we require a numpy array).

In [None]:
# Print the value y_hat corresponding to the closest point in knn_df to the origin
print(oneNN(knn_df[['x0', 'x1']].to_numpy(),
            knn_df['y'].to_numpy(), np.array((0,0))))

**Question 2 (1pt):** What is the training error (recall that in classification, it is the proportion of misclassified points in the training dataset; in regression, the sum of squared errors) achieved by this **algorithm** for $k=1$?

*Hint*: you can try returning `y_hat` from your previous function for `x in X_train` and compare it to `y_train` to give you an idea in the following cell (optional), and explain your answer in the subsequent cell (mandatory).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## $k$-nearest neighbors classifier

**Question 3 (1pt):** Is the $k$-NN algorithm a _parametric_ or _non-parametric_ model?

YOUR ANSWER HERE

We go back to the iris dataset (in particular, its two first features):

In [None]:
x_train, x_test, y_train, y_test = sk.model_selection.train_test_split(iris.data[:, :2], 
                                                                       iris.target, 
                                                                       test_size=1/4, 
                                                                       random_state=1)

**Question 4 (1pt):** In this question, we allow the use of the `scikit-learn` library. 

**4.a.** Using the library, perform $k$-NN classification for all $k = 1 \dots 20$ by training on (`x_train`, `y_train`), and use it to predict labels for `x_train` and `x_test`. Evaluate the quality of the classification using the `accuracy_score` function provided by `scikit-learn`. Store the results in two lists, `accuracy_scores_train` and `accuracy_scores_test`, which should contain, for $k = 1 \dots 20$, the accuracy (evaluated by `accuracy_score`) on `x_train` and `x_test` respectively.

We recall how to perform $k$-NN classification using `scikit-learn`:
- `clf = KNeighborsClassifier(n_neighbors=k)` instantiates an object `clf` of the class `KNeighborsClassifier`, which can be used to perform `n_neighbors`-NN classification (of course, you have to provide the value of `k` when instantiating the classifier).
- `clf.fit(x_train, y_train)` trains this classifier on a training set, where `x_train` is the observation matrix, `y_train` contains the labels.
- `y_pred = clf.predict(x_test)` for a new set of observations, encoded as a matrix `x_test`, returns label predictions `y_pred`.
- `accuracy_score(y_true, y_pred)` given the true labels of the test set `y_true` (`y_test` here) and the labels `y_pred` predicted by the $k$-NN classifier, computes the accuracy score of the classification.
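
Put together, the four calls above look as follows on a synthetic two-class dataset. The data generated by `make_blobs`, the variable names and the choice `n_neighbors=3` are assumptions made for this sketch; it uses neither the iris data nor the loop over $k$.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2D data with two classes, for illustration only
X_toy, y_toy = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Instantiate, fit, predict and evaluate
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_tr, y_tr)
acc_train = accuracy_score(y_tr, clf.predict(X_tr))
acc_test = accuracy_score(y_te, clf.predict(X_te))
print(acc_train, acc_test)
```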

In [None]:
# accuracy_scores_train = ...  # <- TO UNCOMMENT AND COMPLETE
# accuracy_scores_test = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(accuracy_scores_train) == 20  # Did you compute 20 values of k?
assert np.max(accuracy_scores_train) <= 1  # The proportion should be between 0 and 1
assert np.min(accuracy_scores_train) >= 0  # The proportion should be between 0 and 1

You can plot the accuracies on the same graph w.r.t. $k$ using the following cell:

In [None]:
plt.plot(range(1,21), accuracy_scores_train, label='train')
plt.plot(range(1,21), accuracy_scores_test, label='test')
plt.legend()
plt.show()

**4.b. (0.5pt)** Which value(s) of $k$ yield(s) the best accuracy on the test set?

YOUR ANSWER HERE

**4.c. (0.5pt)** What is the phenomenon you are witnessing for $k > 6$?

YOUR ANSWER HERE

**4.d. (0.5pt)** What is the phenomenon you are witnessing for $k < 6$?

YOUR ANSWER HERE

# Additional exercise 1

## Cross-validation

**Open question (2pts)** Use cross-validation to choose the number of neighbors $k$ of a $k$-NN classifier on the Titanic dataset.
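
As a reminder of how `cross_val_score` can be used to compare several values of a hyperparameter, here is a sketch on synthetic data. The dataset from `make_classification`, the range of candidate values and the choice of 5 folds are assumptions; the Titanic data is not used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

candidate_ks = list(range(1, 11))
mean_scores = []
for k in candidate_ks:
    # 5-fold cross-validated accuracy of a k-NN classifier
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_toy, y_toy, cv=5)
    mean_scores.append(scores.mean())

# Keep the candidate with the highest mean validation accuracy
best_k = candidate_ks[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

Each candidate value is scored only on held-out folds, so the selection does not rely on the training accuracy.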

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

titanic = pd.read_csv("https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/titanic_train.csv")
titanic = titanic.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)
titanic.dropna(inplace=True)
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1}).astype(int)
X = titanic.drop("Survived", axis=1).to_numpy()
y = titanic["Survived"].to_numpy()

# YOUR CODE HERE
raise NotImplementedError()

# Additional exercise 2 (hard)

## Ridge Logistic Regression 

**Open question (2pts)** Add the ridge penalty to the gradient descent algorithm implemented for logistic regression, apply it to the Titanic dataset, and perform cross-validation to choose the amount of penalty. We can expect the code to be computationally expensive, so a sketch of the solution is enough.
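
For reference, one common way of writing a ridge-penalized loss and its gradient is given below, with $\alpha \geq 0$ the amount of penalty and $E$ the unpenalized loss. This is a sketch under one convention among several: some authors use $\alpha / 2$ instead of $\alpha$, and many leave the intercept $\theta_0$ out of the penalty.

$$ E_\alpha(\boldsymbol{\theta}) = E(\boldsymbol{\theta}) + \alpha \|\boldsymbol{\theta}\|_2^2, \qquad \nabla_{\boldsymbol{\theta}} E_\alpha(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} E(\boldsymbol{\theta}) + 2 \alpha \boldsymbol{\theta}. $$

With this convention, the gradient step only changes by the additional term $2 \alpha \boldsymbol{\theta}$, which shrinks the parameters towards $0$.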

In [None]:
from sklearn.model_selection import KFold

def grad_descent(X: np.array, y: np.array, eta: float = 0.0001,
                 nb_max_step: int = 1000, stopping_criterion: float = 0.00001,
                 alpha: float = 0.01) -> tuple:
    """
    Compute gradient descent with ridge penalty for logistic regression

    :param numpy.array X: design matrix of shape (N, p)
    :param numpy.array y: vector of responses of shape N
    :param float eta: step size
    :param int nb_max_step: maximum number of iterations
    :param float stopping_criterion: whenever theta does not move more than this criterion, stop
    :param float alpha: amount of ridge penalty
    :return: last value of parameter, list of loss values, list of values of theta
    :rtype: numpy.array, list, list
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()