# Assignment 1

In this assignment you will complete this notebook by answering the questions. Some questions require a written answer; to provide one, insert a Markdown cell below the question and type your answer in it.

## Goal:
The goals of this assignment are:
- Implement, debug and visualize multivariate linear regression applied to nonlinear data
- Get introduced to the concepts of overfitting and underfitting
- Implement linear regression with regularization and understand the importance of train/test errors

Throughout this assignment you will be using the `Assignment_Data.csv` file.

## Import Libraries and Load Data

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv("./Assignment_Data.csv")

## Understand and Visualize the Data

In [None]:
data.head()

In [None]:
data.describe()

### Question 1
Make a scatter plot to visualize the data. What are your comments?

In [None]:
# write your code here

As explained in class, linear regression might not be directly suitable for nonlinear data.
However, by doing feature expansion we can still use linear regression techniques to fit nonlinear data. As a result, you will be able to fit the data using polynomials of different degrees, e.g. a degree-two polynomial (a linear combination of $1, x$ and $x^2$) or a degree-three polynomial (a linear combination of $1, x, x^2$ and $x^3$).

Higher degree polynomials are more expensive to compute and to fit, but can capture finer details in the data, which results in more expressive models.
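
As a small illustration of feature expansion, the sketch below uses NumPy's `np.vander` on a made-up three-point array. The toy values and the names `x_demo` and `phi_demo` are assumptions for illustration only and are not part of the assignment data. Each row holds the powers $1, x, x^2$ of one sample, so a model that is linear in these columns is a degree-two polynomial in $x$.

```python
import numpy as np

# Toy inputs, chosen only for illustration (not the assignment data).
x_demo = np.array([1.0, 2.0, 3.0])

# Columns are x**0, x**1, x**2; increasing=True puts the constant column first.
phi_demo = np.vander(x_demo, N=3, increasing=True)
print(phi_demo)
```

A matrix like this may be handy as a reference when checking the shape and column order of your own expansion.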

In [None]:
x = np.array(data['X'])
y = np.array(data['Y'])
print(f'Shape of x {x.shape}')
print(f'Shape of y {y.shape}')

## Process the Data

### Question 2
Complete the function `build_poly()` below. It takes the 1D array `x` and an integer `degree` as input and returns a 2D array `phi` that expands `x` into a polynomial of the given degree.

In [None]:
def build_poly(x, degree):
    """Polynomial expansion of x with the given degree"""
    # write your code in between these two lines
    # ***************************************************

    # ***************************************************
    
    return phi

The function `plot_fitted_curve()` plots the learned curve on top of the data points.

In [None]:
def plot_fitted_curve(y, x, weights, degree, ax):
    """plot the fitted curve."""
    ax.scatter(x, y, color='b', s=12, facecolors='none', edgecolors='r')
    xvals = np.arange(min(x) - 0.1, max(x) + 0.1, 0.1)
    tx = build_poly(xvals, degree)
    f = tx.dot(weights)
    ax.plot(xvals, f)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Polynomial degree " + str(degree))

In [None]:
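def ToRescaled(x, mean=None, std=None):
    """Standardize x; mean and std are computed from x unless they are provided."""
    if mean is None:
        mean = np.mean(x)
    if std is None:
        std = np.std(x)
    x_ = (x - mean) / std
    return x_, mean, std

def FromRescaled(x_, mean, std):
    """Undo ToRescaled: map standardized values back to the original scale."""
    x = x_ * std + mean
    return x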

### Question 3
Use the function `ToRescaled()` to rescale the values of X and Y.

In [None]:
# x_ = ...
# y_ = ...

## Learning

### Question 4
Complete the function `polynomial_regression_direct()` to implement the direct method of solving linear regression for polynomials of degree 1, 3, 7, and 12. Use `plot_fitted_curve()` to plot the results, and show the MSE for each.
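
One way to sanity-check a direct solver is to compare it against NumPy's least-squares routine on data with a known answer. The sketch below is only a reference under assumed toy data: `x_toy`, `y_toy` and the helper `mse_of` are hypothetical names, and the MSE here is taken as the plain mean of squared residuals (adjust it if the course defines the cost with a $\frac{1}{2N}$ factor). It is not the required implementation.

```python
import numpy as np

def mse_of(y, phi, weights):
    """Mean squared error of the linear model phi @ weights (assumed 1/N convention)."""
    residual = y - phi @ weights
    return np.mean(residual ** 2)

# Toy data generated exactly from y = 1 + 2x, so a perfect fit exists.
x_toy = np.linspace(-1.0, 1.0, 20)
y_toy = 1.0 + 2.0 * x_toy

# Design matrix with columns [1, x].
phi_toy = np.column_stack([np.ones_like(x_toy), x_toy])

# Reference least-squares solution from NumPy.
w_ref, *_ = np.linalg.lstsq(phi_toy, y_toy, rcond=None)
print(w_ref, mse_of(y_toy, phi_toy, w_ref))
```

If your direct method is run on the same toy design matrix, its weights should agree with `w_ref` up to numerical precision.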

In [None]:
def polynomial_regression_direct():
    """Constructing the polynomial basis function expansion of the data,
       and then running least squares regression."""
    # define parameters
    degrees = [1, 3, 7, 12]
    
    # define the structure of the figure
    num_row = 2
    num_col = 2
    f, axs = plt.subplots(num_row, num_col)

    for ind, degree in enumerate(degrees):
        # write your code in between these two lines
        # ***************************************************

        # ***************************************************
        print("Processing {i}th experiment, degree={d}, mse={mse}".format(
              i=ind + 1, d=degree, mse=mse))
        # plot fit
        plot_fitted_curve(
            y_, x_, weights, degree, axs[ind // num_col][ind % num_col])
    plt.tight_layout()
    plt.show()

In [None]:
polynomial_regression_direct()

### Question 5
Comment on the values of the MSE.

In [None]:
# write your answer here

### Question 6
Complete the function `polynomial_regression_GradientDescent()` to do the same as Question 4 but by using gradient descent this time.
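
For orientation, here is a minimal sketch of batch gradient descent for least squares on toy data. Everything in it is an assumption for illustration: the data, the names `theta_toy` and `alpha_toy`, the zero initialization, and the iteration count were picked by hand for this tiny problem and are not the values used in the assignment.

```python
import numpy as np

# Toy data from y = 1 + 2x (an assumption for illustration).
x_toy = np.linspace(-1.0, 1.0, 20)
y_toy = 1.0 + 2.0 * x_toy
phi_toy = np.column_stack([np.ones_like(x_toy), x_toy])

n_samples = len(y_toy)
theta_toy = np.zeros(phi_toy.shape[1])  # zero start keeps the run reproducible
alpha_toy = 0.5                         # learning rate chosen by hand for this problem

for _ in range(500):
    error = phi_toy @ theta_toy - y_toy
    # Gradient of J(theta) = 1/(2N) * sum(error**2) with respect to theta.
    grad = phi_toy.T @ error / n_samples
    theta_toy = theta_toy - alpha_toy * grad

print(theta_toy)
```

On this well-conditioned toy problem the iterates approach the weights `[1, 2]`; higher-degree expansions are typically worse conditioned, which is one reason smaller learning rates and more iterations can be needed.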

In [None]:
def polynomial_regression_GradientDescent():
    degrees = [1, 3, 7, 12]
    T = [10000, 10000, 100000, 10000000]   # number of time steps (iterations) for each degree
    alpha = [1.0, 0.01, 0.001, 0.00001]  # learning rate for each degree
    num_row = 2
    num_col = 2
    f, axs = plt.subplots(num_row, num_col)
    for ind, degree in enumerate(degrees):
        # write your code in between these two lines
        # ***************************************************

        # ***************************************************
        theta = np.random.rand(M)
        for t in range(T[ind]):
            # write your code in between these two lines
            # ***************************************************

            # ***************************************************
            
        print("Processing {i}th experiment, degree={d}, mse={loss}".format(
                  i=ind + 1, d=degree, loss=J))
        plot_fitted_curve(
            y_, x_, theta, degree, axs[ind // num_col][ind % num_col])
    plt.tight_layout()
    plt.show()


In [None]:
polynomial_regression_GradientDescent()

## Train/Test Split and Evaluation

### Question 7
Let us look at train and test splits for various polynomial degrees. First, fill in the function `split_data()`.

In [None]:
def split_data(x, y, ratio, seed=1):
    """
    Split the dataset based on the split ratio. If ratio is 0.8,
    80% of the data set is dedicated to training and the rest
    to testing. This function returns four arrays:
    x and y for training, and x and y for testing.
    """
    np.random.seed(seed)
    data_size = len(y)
    # write your code in between these two lines
    # ***************************************************
   
    # ***************************************************
    
    return train_x, train_y, test_x, test_y

### Question 8
Fill in the `train_test_split_demo()` function. It should split the data using `split_data()`, solve the polynomial regression with the direct method, and print the split ratio, the polynomial degree, the training error, and the testing error.

In [None]:
def train_test_split_demo(x, y, degree, ratio, seed):
    """polynomial regression with different split ratios and different degrees."""
    # write your code in between these two lines
    # ***************************************************

    # ***************************************************
    
    print("proportion={p}, degree={d}, Training MSE={tr:.3f}, Testing MSE={te:.3f}".format(
          p=ratio, d=degree, tr=mse_tr[0,0], te=mse_te[0,0]))

Run the cell below to test your functions.

In [None]:
seed = 6
degrees = [1, 3, 7, 12]
split_ratios = [0.9, 0.5, 0.1]

for split_ratio in split_ratios:
    for degree in degrees:
        train_test_split_demo(x_, y_, degree, split_ratio, seed)

### Question 9
Do the training and testing MSE make sense for different ratios? Why? Which one is the best? <br>
If you had 5000 samples instead of 50, which split might be best? <br>
Comment on the high testing MSE for the 10% split, especially with high-degree polynomials.

In [None]:
# write your answer here

## Ridge Regression

The previous exercise shows overfitting when using complex models. Here we will try to correct that by using Ridge Regression. Ridge Regression is very similar to linear regression; the only difference is that we add to the cost function a regularization term that penalizes large weight values.
$$J(\theta) = \frac{1}{2N}
\left[\sum_{i=1}^N \left(h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda\sum_{j=1}^D {\theta_j}^2\right]$$
Here $D$ is the degree of the polynomial used and $\lambda>0$ is the penalty parameter. <br>
The penalty is useful to prevent overfitting when the model used is too complex. <br>
It turns out that Ridge Regression also has a closed-form solution:
$$\theta^*_{ridge} = \left(X^T X + \lambda \tilde I\right)^{-1} X^T y $$
where $\tilde I$ is the diagonal matrix of dimension $D+1$ with $(0, 1, \ldots, 1)$ on the diagonal, so the constant term $\theta_0$ is not penalized. <br>
In what follows we will be working with polynomials of degree 7 only.
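
The closed form above can be sketched on toy data as follows. This is an illustrative sketch, not the required `ridge_regression()`: the helper name `ridge_closed_form`, the toy data, and the penalty value `10.0` are assumptions, and `np.linalg.solve` is used in place of an explicit matrix inverse.

```python
import numpy as np

def ridge_closed_form(phi, y, lam):
    """Solve (phi^T phi + lam * I_tilde) theta = phi^T y for theta."""
    i_tilde = np.eye(phi.shape[1])
    i_tilde[0, 0] = 0.0  # first diagonal entry is 0: the constant term is not penalized
    return np.linalg.solve(phi.T @ phi + lam * i_tilde, phi.T @ y)

# Toy data from y = 1 + 2x (an assumption for illustration).
x_toy = np.linspace(-1.0, 1.0, 20)
y_toy = 1.0 + 2.0 * x_toy
phi_toy = np.column_stack([np.ones_like(x_toy), x_toy])

theta_ols = ridge_closed_form(phi_toy, y_toy, 0.0)     # lambda = 0: ordinary least squares
theta_ridge = ridge_closed_form(phi_toy, y_toy, 10.0)  # lambda > 0: slope is shrunk
print(theta_ols, theta_ridge)
```

With $\lambda = 0$ the result matches ordinary least squares, while a positive $\lambda$ shrinks the slope toward zero and leaves the constant term alone on this centered toy input.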

### Question 10
Complete the function `ridge_regression()`, which takes the output `y`, the feature matrix `tx`, and the penalty parameter $\lambda$, and computes the ridge regression weights and the corresponding MSE. To test the function, set $\lambda=0$ and check that you get the same solution as regular multivariate linear regression.

In [None]:
def ridge_regression(y, tx, lambda_):
    """implement ridge regression."""
    # write your code in between these two lines
    # ***************************************************

    # ***************************************************
    return w, mse

In [None]:
def test_ridge_regression():
    degree = 7
    phi = build_poly(x_, degree)
    weights, mse = ridge_regression(y_, phi, 0)
    print("degree={d}, mse={mse}".format(d=degree, mse=mse))

In [None]:
test_ridge_regression()

The function `plot_train_test()` plots the train and test errors for different values of $\lambda' = \frac{\lambda}{2N}$.

In [None]:
def plot_train_test(train_errors, test_errors, lambdas, degree):
    """
    train_errors, test_errors and lambdas should be lists of the same size, holding
    the train error and test error obtained for each value of lambda, e.g.
    * lambdas[0] = the first penalty value tried
    * train_errors[0] = MSE of the ridge regression with lambdas[0] on the train set
    * test_errors[0] = MSE of the weights found with lambdas[0], evaluated on the test set
    
    degree is just used for the title of the plot.
    """
    plt.semilogx(lambdas, train_errors, color='b', marker='*', label="Train error")
    plt.semilogx(lambdas, test_errors, color='r', marker='*', label="Test error")
    plt.xlabel(r"$\lambda$'")  # raw string avoids the invalid "\l" escape sequence
    plt.ylabel("MSE")
    plt.title("Ridge regression for polynomial degree " + str(degree))
    leg = plt.legend(loc=1, shadow=True)
    leg.draw_frame(False)
    plt.savefig("ridge_regression")

### Question 11
Complete the function `ridge_regression_demo()` to run ridge regression with different values of $\lambda'$. Then run the cell below it to test the function and generate the plot, and comment on the generated plot.

In [None]:
def ridge_regression_demo(x, y, degree, ratio, seed):
    """ridge regression demo."""
    # define parameter
    lambdas = np.logspace(-5, 0, 15)
    phi = build_poly(x, degree)
    train_x, train_y, test_x, test_y = split_data(phi, y, ratio, seed=seed)

    mse_tr = []
    mse_te = []
    for ind, lambda_ in enumerate(lambdas):
        # write your code in between these two lines
        # ***************************************************

        # ***************************************************
        print("proportion={p}, degree={d}, lambda={l:.5f}, Training MSE={tr:.3f}, Testing MSE={te:.3f}".format(
               p=ratio, d=degree, l=lambda_, tr=mse_tr[ind], te=mse_te[ind]))
        
    # Plot the obtained results
    plot_train_test(mse_tr, mse_te, lambdas, degree)

In [None]:
seed = 56
degree = 7
split_ratio = 0.5
ridge_regression_demo(x, y, degree, split_ratio, seed)