# Backprop and Gradient Descent Test
MLPs, SGD, MSE: all of the 3-letter acronyms

## Section 1 (15 min)

In this section you will derive the gradient of an MLP with sigmoid nonlinearities for use in gradient descent.

1. Calculate the derivative of the following nonlinearity w.r.t. (with respect to) its input. This was on the previous test, but now you must simplify (hint: $\sigma(X)$ will appear in your solution):

\begin{align*}
    σ(X) = \frac{1}{1+e^{-X}}
\end{align*}

\begin{align*}
    \frac{∂σ}{∂X}=
\end{align*}
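
One way to sanity-check a hand-derived derivative is to compare it against a central finite difference. The sketch below is only an illustration: `check_derivative` is a hypothetical helper, not part of the test, and it is demonstrated on `np.tanh` (whose derivative is $1 - \tanh^2(x)$) rather than on the sigmoid, so it does not give away the answer.

```python
import numpy as np

def check_derivative(f, df, xs, eps=1e-5, tol=1e-6):
    """Return True if df matches a central finite difference of f at every point in xs."""
    xs = np.asarray(xs, dtype=float)
    numeric = (f(xs + eps) - f(xs - eps)) / (2 * eps)
    return bool(np.allclose(numeric, df(xs), atol=tol))

xs = np.linspace(-3, 3, 13)
# A correct derivative passes the check
print(check_derivative(np.tanh, lambda x: 1 - np.tanh(x) ** 2, xs))
# A wrong candidate derivative fails it
print(check_derivative(np.tanh, lambda x: np.tanh(x), xs))
```

Passing your own sigmoid and candidate derivative as `f` and `df` gives a quick check before moving on.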

2. Suppose we apply this nonlinearity after a single neural network layer. Calculate the derivative of the output w.r.t. the weights.

\begin{align*}
    A = X \cdot W + b \quad \text{  and  } \quad \hat{y} = σ(A) = \frac{1}{1+e^{-A}}
\end{align*}

\begin{align*}
    \frac{∂\hat{y}}{∂W}=
\end{align*}

3. Now consider finding the derivative of a model with an additional layer like so:

\begin{align*}
    \hat{y} = (σ(X\cdot W_{0}+b_{0})) \cdot W_{1} + b_{1}
\end{align*}

\begin{align*}
    \frac{∂\hat{y}}{∂W_0}=
\end{align*}

4. Describe and implement a modification of X and W into X' and W' such that:

\begin{align*}
    X \cdot W + b = X' \cdot W'
\end{align*}

Describe here:

In [None]:
def modify_x_w(x, w, b):
    '''
    You may assume that x, w, and b are all given as np arrays
    '''
    X_new = None
    W_new = None

    return X_new, W_new

In [None]:
# Test for your implementation
import numpy as np

# test vectors:
X = np.array([1,2,3,4])
W = np.ones((4,2))
b = np.ones(2)

X_new, W_new = modify_x_w(X,W,b)

print("original (vector): " + str((X @ W) + b))
print("modified (vector): " + str((X_new @ W_new)))

# test matrices:
X = np.array(([1,2,3], [4,5,6]))
W = np.array(([1,2,3,10], [4,5,6,10], [7,8,9,10]))
b = np.array([[5,4,6,9]])

X_new, W_new = modify_x_w(X,W,b)

print("original (matrix): " + str((X @ W) + b))
print("modified (matrix): " + str((X_new @ W_new)))

## Section 2 (15 min)
In this section you will create a dataset and plot it

1. Create a function to sample n points from the following distribution:

\begin{align*}
X \sim U[-1, 1], \quad \quad
Y \sim U[-1, 1], \quad \quad          
Err \sim N(0, 0.5) \\ \\
Z = X^2 - Y^2 + 1.2 + Err
\end{align*}

The function should return an n x 3 matrix

In [None]:
def sample_points(n):
    '''
    Sample from the above distribution n times and return the resulting n x 3 matrix as an np array
    Ensure that the order of the columns is X, Y, Z (i.e. points[:, 2] should hold just the n Z values)
    '''
    points = np.ones((n, 3))

    return points
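
The notation $N(0, 0.5)$ leaves open whether 0.5 is the standard deviation or the variance. NumPy's `Generator.normal` takes the standard deviation as `scale`, so the two readings differ in code. The sketch below shows both. Which reading the test intends is not stated, so treat your choice as an assumption and note it in your answer.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch reproducible
n = 100_000

# Reading 0.5 as the standard deviation
err_if_std = rng.normal(loc=0.0, scale=0.5, size=n)
# Reading 0.5 as the variance, so the standard deviation is sqrt(0.5)
err_if_var = rng.normal(loc=0.0, scale=np.sqrt(0.5), size=n)

print(err_if_std.std(), err_if_var.std())
```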

2. Now plot these points as a 3d scatter plot using matplotlib. The x, y, and z values should correspond to their respective axes. The points should be colored (*) according to their z value (blue is low, red is high) and should have opacity 0.5.

In [None]:
import matplotlib.pyplot as plt

def plot_3d_scatter(points):
    
    plt.show()
    return None

In [None]:
# Test for sampling and plotting:

points = sample_points(100)
plot_3d_scatter(points)
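
For orientation, the matplotlib calls involved in a colored 3d scatter look roughly like the sketch below. It uses random placeholder data rather than the distribution above, and `demo_colored_scatter` is a hypothetical name. The choice of the `coolwarm` colormap is also an assumption; any map that runs from blue to red would fit the requirement.

```python
import numpy as np
import matplotlib.pyplot as plt

def demo_colored_scatter(n=50, seed=0):
    """Scatter random placeholder points in 3D, colored by z (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x, y, z = rng.uniform(-1, 1, size=(3, n))  # placeholder data, not the test distribution
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # 'coolwarm' maps low values to blue and high values to red; alpha sets opacity
    ax.scatter(x, y, z, c=z, cmap="coolwarm", alpha=0.5)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_zlabel("z")
    return ax

ax = demo_colored_scatter()
```

Calling `plt.show()` afterwards displays the figure.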

3. Next create a function that will create a train and test dataset of size 100 and 20 respectively. Then plot both on the same 3d scatter in different shapes.

In [None]:
def create_train_and_test(train_size=100, test_size=20):
    pass

In [None]:
def plot_train_and_test(train_data, test_data):
    pass

In [None]:
# Test your code
train_data, test_data = create_train_and_test()
plot_train_and_test(train_data, test_data)

## Section 3 (10 min)
Now we will initialize a model to train on this data.

1. First we initialize weights. Create a function that will initialize an IN x OUT weight matrix with each value sampled from a normal distribution with mean 1 and std 0.25.

In [4]:
import numpy as np

def init_weight_matrix(In=2, Out=1):

    weight_matrix = np.ones((In, Out)) # Not correct, just an example

    return weight_matrix

In [None]:
# test this

test_mat = init_weight_matrix(5, 10)
print(test_mat)
print(test_mat.shape)     # expect (5, 10)
print(np.mean(test_mat))  # should be close to 1
print(np.std(test_mat))   # should be close to 0.25

2. Now we want to be able to initialize an MLP. We will be given a list of layer sizes. The first element is the input size of the first matrix (2 for our 2-D data), and each subsequent layer's input size equals the previous layer's output size. For example, the size list [2,5,10,1] should produce 2x5, 5x10, and 10x1 matrices.

The model should be returned as a dictionary in which the key `'W' + str(n)` maps to the n-th weight matrix of the MLP.

In [None]:
def init_mlp(output_sizes=[5,1]):
    model = {}

    return model
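
The size-list convention can be made concrete by pairing consecutive entries of the list. The helper below is a hypothetical illustration that only computes the shapes. It does not create any weights or dictionary keys, and whether the keys should start at `W0` or `W1` is left to your reading of the question.

```python
def layer_shapes(sizes):
    """Pair consecutive entries of a size list into (input, output) matrix shapes."""
    return list(zip(sizes[:-1], sizes[1:]))

print(layer_shapes([2, 5, 10, 1]))  # [(2, 5), (5, 10), (10, 1)]
```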

In [None]:
# test

test_mlp = init_mlp([10,5,2,1])

for layer in test_mlp:
    print('layer : ' + layer)
    print(test_mlp[layer].shape)

## Section 4 (10 min)
Now we will complete the forward pass and loss calculation of our model

1. First let's complete the forward pass of the sigmoid function (we will use this as our nonlinearity).

In [None]:
def sigmoid_forward(x):
    '''
    returns the output of the sigmoid non-linearity performed on input X (elementwise)
    '''

    pass

In [None]:
import matplotlib.pyplot as plt
import numpy as np

example_inputs = np.arange(-5,5,0.25)
your_output = sigmoid_forward(example_inputs)

plt.plot(example_inputs, your_output)
plt.title("Your sigmoid function")
plt.xlabel("Input")
plt.ylabel("Output")
plt.show()


2. Now we will do the full forward pass of our model.

In [None]:
def mlp_forward(my_mlp, x):
    '''
    To complete this we must do the following steps:
    1. Alter x such that the bias term is implicit in the matrix operations (via your solution to 1.4)
    2. Repeatedly perform matrix multiplication with our mlp weights, applying a nonlinearity (sigmoid_forward) after every layer except the final one
    3. Each layer's output (after sigmoid) must be stored in a cache under the key 'A' + str(l) for the l-th layer's output, with 'A0' as X
    '''

    cache = {'A0': x}
    output = None

    return cache, output

    

In [None]:
# testing your code in tandem

test_mlp = init_mlp([2,5,1]) # change the input size if necessary for your implicit bias
test_data = sample_points(100)
test_input = test_data[:, :2]  # first two columns: X and Y
test_label = test_data[:, 2]   # third column: Z

# You may make modifications to this if necessary to test your code -- though I would like this as a quick way to check correctness

test_cache, test_output = mlp_forward(test_mlp, test_input)

## Section 5 (*) (25 min)
Now we will finally do our loss, backprop, and gradient descent.

1. First let's create our loss function

In [None]:
def loss(label, pred):
    '''
    return the MSE loss of some pred against some true label -- can be one line
    '''


    return 
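
For reference, the mean squared error over $n$ predictions is commonly written as below, where $y_i$ is the true label and $\hat{y}_i$ the prediction. Some texts include an extra factor of $\frac{1}{2}$, which only rescales the gradient; the test does not say which convention it expects.

\begin{align*}
    \text{MSE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{align*}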

2. Second, let's get the gradient of our model via backprop

In [None]:
def backprop(my_mlp, cache, pred):
    '''
    Using the cache, find the derivatives of loss with respect to each weight matrix,
    store each of them in a cache under the key 'dW' + str(l), which holds the derivative for weight layer l
    '''

    dcache = {}

    return dcache

In [None]:
# test that the dimensions line up

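test_dcache = backprop(test_mlp, test_cache, test_output)  # backprop takes pred as its third argument

for derivative in test_dcache:
    # .shape is a tuple, so convert it to a string before concatenating
    print(derivative + ' shape: ' + str(test_dcache[derivative].shape))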

# Should have the same shapes as the MLP -- here 2x5, 5x1
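
Matching shapes do not guarantee correct values. A common complementary check is to compare an analytic gradient against a central-difference estimate. The sketch below shows the idea on a toy linear model with an MSE loss, not on the MLP from this test. `numerical_gradient` and `toy_loss` are hypothetical names introduced for illustration.

```python
import numpy as np

def numerical_gradient(loss_fn, W, eps=1e-6):
    """Central-difference estimate of d loss_fn(W) / dW, one entry at a time."""
    W = W.astype(float)  # astype returns a copy, so the caller's array is not modified
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        loss_plus = loss_fn(W)
        W[idx] = original - eps
        loss_minus = loss_fn(W)
        W[idx] = original  # restore the entry before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Toy example: a linear model with MSE loss, whose gradient is known in closed form
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
W = rng.normal(size=(3, 1))

toy_loss = lambda W_: np.mean((X @ W_ - y) ** 2)
analytic = 2 * X.T @ (X @ W - y) / len(X)

print(np.allclose(numerical_gradient(toy_loss, W), analytic, atol=1e-6))
```

The same comparison can be applied to each `dW` entry of your own gradient cache by wrapping your forward pass and loss in a function of a single weight matrix.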


3. Now we can finally implement gradient descent. We will allow a choice of the number of iterations and the learning rate as inputs

In [None]:
def grad_descent(data, my_mlp, iterations, learning_rate):
    '''
    This will repeatedly calculate the output of the model and then apply a step of gradient descent.
    A list of the loss after each iteration (and also the loss before any iterations) should be returned
    along with the final model.
    '''

    losses = []

    return losses, my_mlp

In [None]:
# test grad descent

test_losses, test_final_mlp = grad_descent(test_data, test_mlp, iterations=10, learning_rate=0.0001) # You may need to tune learning rate

plt.plot(np.arange(len(test_losses)), test_losses)
plt.title("Learning Curve")
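
The update rule and the bookkeeping for the loss list can be seen in isolation on a one-dimensional toy problem. The sketch below minimizes $f(w) = (w - 3)^2$, which is not the MLP loss. `toy_gradient_descent` is a hypothetical name, and the point is only that the returned list has `iterations + 1` entries because it includes the loss before any update.

```python
def toy_gradient_descent(w0, iterations, learning_rate):
    """Minimize f(w) = (w - 3)^2 with plain gradient descent (toy problem)."""
    f = lambda w: (w - 3.0) ** 2
    grad_f = lambda w: 2.0 * (w - 3.0)

    w = float(w0)
    losses = [f(w)]  # loss before any iterations
    for _ in range(iterations):
        w = w - learning_rate * grad_f(w)  # step against the gradient
        losses.append(f(w))                # loss after this iteration
    return losses, w

losses, w_final = toy_gradient_descent(w0=0.0, iterations=10, learning_rate=0.1)
print(len(losses))  # 11
```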

4. Now we plot our model

In [None]:
def plot_data_and_pred(my_mlp, train_points, test_points):
    '''
    This should plot the data as a 3d scatter and the prediction of the model (i.e. the forward)
    as a 3d surface (that is it should be across all possible values of x and y).
     
    Make sure that both the surface and the points are distinguishable
    '''

    plt.show()
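
Plotting a surface requires predictions on a regular grid of (x, y) values. One way to build that grid with NumPy is sketched below. `make_grid_inputs` is a hypothetical helper, and the noiseless target surface stands in for the model's output so the example runs on its own.

```python
import numpy as np

def make_grid_inputs(low=-1.0, high=1.0, steps=25):
    """Return meshgrid arrays and the matching (steps*steps, 2) input matrix."""
    xs = np.linspace(low, high, steps)
    ys = np.linspace(low, high, steps)
    grid_x, grid_y = np.meshgrid(xs, ys)
    inputs = np.column_stack([grid_x.ravel(), grid_y.ravel()])
    return grid_x, grid_y, inputs

grid_x, grid_y, inputs = make_grid_inputs()
# Stand-in for model predictions: one value per input row (the noiseless target surface)
fake_pred = inputs[:, 0] ** 2 - inputs[:, 1] ** 2 + 1.2
grid_z = fake_pred.reshape(grid_x.shape)  # back to the 2-D shape a surface plot expects
```

On a 3d axes, `ax.plot_surface(grid_x, grid_y, grid_z, alpha=0.5)` would then draw the surface, and a partly transparent surface helps keep scattered points visible.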

## Section 6 (**) Hyperparameter Tuning (20 min)
Now we want to find a good MLP for a problem.

Note on this section -- if any of the test data is used at any point to influence the choice of hyperparameters then your solution will be useless. 

1. First, let us observe the behavior of a one-layer vs two layer network on the above problem. Plot the learning curve (loss vs iteration) of a single-layer and two-layer network trained via your gradient descent. Then plot the surfaces of each compared to the training and testing data.

2. Now, train each, but reserve 20% of the *training data* and evaluate each model on that portion each iteration without training on it. Add these "validation" plots to the learning curve plot (each line of a given model should be the same color but the validation line should be dashed).

3. Design and code a procedure to choose the best hyperparameters (e.g. the architecture, such as the number of layers and weight matrix sizes, the learning rate, and the loss function) to fit this data. You may only evaluate on the test data (or plot it) once you have chosen a final model. Your performance on this data is considered your final score here.
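
One possible shape for such a procedure is a held-out validation split combined with an exhaustive grid search, sketched below. This is only one option among many. `split_train_validation`, `grid_search`, and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "train a model and return its validation loss" so that the sketch runs on its own. The test data never appears in it.

```python
import itertools
import numpy as np

def split_train_validation(data, val_fraction=0.2, seed=0):
    """Shuffle the rows of data and hold out val_fraction of them for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_val = int(round(val_fraction * len(data)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return data[train_idx], data[val_idx]

def grid_search(train_part, val_part, grid, fit_and_score):
    """Try every combination in grid and return (best_score, best_params).

    fit_and_score(train_part, val_part, **params) is supplied by the caller and
    must return a validation loss, where lower is better.
    """
    best_score, best_params = np.inf, None
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(train_part, val_part, **params)
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Toy stand-in scorer so the sketch runs on its own; replace it with real training
def toy_score(train_part, val_part, learning_rate, hidden_size):
    return (np.log10(learning_rate) + 2) ** 2 + (hidden_size - 5) ** 2

data = np.arange(300, dtype=float).reshape(100, 3)  # placeholder rows, not real samples
train_part, val_part = split_train_validation(data)
grid = {"learning_rate": [1e-1, 1e-2, 1e-3], "hidden_size": [2, 5, 10]}
best_score, best_params = grid_search(train_part, val_part, grid, toy_score)
print(train_part.shape, val_part.shape, best_params)
```

Because the selection only ever sees the training and validation parts, the test set stays untouched until a final model has been chosen.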