# Project 3: Advanced Training (from scratch) - Part 2

In this notebook, you will adapt your code from Jupyter Notebook 3 to train a feedforward neural network (FNN) on two different datasets: 

- Lincoln Home Sales: contains data on single-family homes sold between January 2016 and August 2022, with the following features:
    - Overall Rating of Home Condition
    - Rating of Building Material Quality
    - Year Remodeled
    - Remodel Type
    - Total Living Area
    - Year Built
    - Garage Capacity
    - Bedroom Count
    - Total Basement Area
    - Minimally Finished Basement Area
    - Completely Finished Basement Area
    - Fireplace Count
    - Number of Fixtures
    - Pool Area
    - Total Acres
    - Sale Date
    - Price

- MNIST: contains handwritten digits 0-9

In Part 1, you trained on the Lincoln Home Sales dataset. Now we turn our attention to the MNIST dataset.

## MNIST Dataset

To train on the MNIST dataset effectively, we need to explore the following adjustments to your training functions:

1. using a new loss function called *Categorical Crossentropy*,

1. implementing a different activation function in the final layer than in the hidden layer, and

1. building an FNN with multiple hidden layers of different sizes.
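
To make adjustments 2 and 3 concrete, here is a minimal sketch of a forward pass through a network whose hidden layers have different sizes and whose final layer uses its own activation. The names `init_layers` and `forward`, the layer sizes `[784, 128, 64, 10]`, and the weight initialization are illustrative assumptions, not part of your Notebook 3 code. The final activation is left as a plain identity placeholder here.

```python
import numpy as np

def relu(z):
    # Elementwise ReLU, used for the hidden layers in this sketch.
    return np.maximum(0.0, z)

def init_layers(layer_sizes, seed=0):
    # Hypothetical helper: one (W, b) pair per pair of consecutive layer sizes.
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(a, params, hidden_activation, final_activation):
    # Apply the hidden activation on every layer except the last one.
    for W, b in params[:-1]:
        a = hidden_activation(W @ a + b)
    # The final layer gets its own activation function.
    W, b = params[-1]
    return final_activation(W @ a + b)

layer_sizes = [784, 128, 64, 10]   # assumed sizes: flattened 28x28 input, 10 classes
params = init_layers(layer_sizes)
sample = np.zeros(784)             # stand-in for one flattened image
output = forward(sample, params, relu, lambda z: z)
print(output.shape)
```

Passing the two activations as arguments keeps the loop over layers unchanged when the final activation is swapped out later.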

In [24]:
# I'll load the MNIST dataset for you (this is the only code cell in which you're allowed to use TensorFlow)
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

### Categorical Crossentropy

Recall when we first constructed a model for the MNIST dataset in the FNN Slides and Project 1, we utilized the softmax activation function
    $$ \text{softmax}({\bf z}) = \frac{1}{\sum_{j=1}^k e^{z_j}} \begin{bmatrix} e^{z_1} \\ e^{z_2} \\ \vdots \\ e^{z_k} \end{bmatrix} .$$
Remember, this activation function yields a probability vector.



**Definition:** Consider the label of a single sample
    $$ {\bf y} = \begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_k \end{bmatrix} $$
and let $\tilde{\bf y}$ be the model's predicted probability vector for that sample. Then the *Categorical Cross-Entropy* loss function is defined as follows:
    $$ CE({\bf y}, \tilde{\bf y}) = -\sum_{j=1}^k y_j \ln(\tilde{y}_j) $$



Categorical Cross-Entropy is ideal for classification neural networks. Specifically, it is the usual loss function used when softmax is applied in the final layer of a neural network.
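
The sum in the definition runs over the entries of a label vector ${\bf y}$, whereas `mnist.load_data()` returns each label as a single integer from 0 to 9. Assuming the labels are meant to be one-hot vectors (as in the test values used in Problem 1), a minimal sketch of the conversion could look like this. The helper name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Turn integer class labels into rows with a single 1 in the label's position.
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example_labels = np.array([3, 0, 9])
example_one_hot = one_hot(example_labels)
print(example_one_hot)
```

With a one-hot label, only the term for the true class survives in the cross-entropy sum.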

### **Problem 1**

Define and compute functions for Categorical Cross-Entropy and $\nabla_{{\bf a}^F} CE$. Recall that

$$ \nabla_{{\bf a}^F} CE = \begin{bmatrix} \dfrac{\partial CE}{\partial a^F_1} \\ ~ \\ \dfrac{\partial CE}{\partial a^F_2} \\ ~\\ \vdots \\ ~ \\ \dfrac{\partial CE}{\partial a^F_k} \end{bmatrix} , $$

where $k$ is the number of neurons in the final layer and 

$$ CE({\bf y}, \tilde{\bf y}) = CE({\bf y}, {\bf a}^F) = -\sum_{j=1}^k y_j \ln({a}^F_j) $$

In [None]:
def CategoricalCrossEntropy(y_true, y_pred):
    """
        Add your code below
    """
    CE = 
    return CE

# test your function with some example values
y_true = np.array([1, 0, 0, 0])
y_pred = np.array([0.7, 0.1, 0.1, 0.1])
print(CategoricalCrossEntropy(y_true, y_pred)) # should be 0.35667494393873245

def grad_CategoricalCrossEntropy(y_true, y_pred):
    """
        Add your code below
    """
    grad = 
    return grad

# test your function with some example values
print(grad_CategoricalCrossEntropy(y_true, y_pred)) # should be array([-1.42857143,  0.        ,  0.        ,  0.        ])

### **Problem 2A**

Define a function for softmax. 

In [None]:
def softmax(x):
    """
        Add your code below
    """
    s =
    return s

# test your function with some example values
x = np.array([2.0, 1.0, 0.1])
print(softmax(x)) # should be array([0.65900114, 0.24243297, 0.09856589])

### **Problem 2B**

Note that softmax is a vector-valued function, so its "derivative" is more complicated than that of functions like sigmoid or ReLU. In fact, we must compute $\dfrac{\partial a^F_k}{\partial z^F_j}$ for all $1\leq j, k \leq n_F$, where $n_F$ is the number of neurons in the final layer. Recall that
$$  {\bf a}^F = \text{softmax}({\bf z^F}) = \frac{1}{\sum_{i=1}^{n_F} e^{z^F_i}} \begin{bmatrix} e^{z^F_1} \\ e^{z^F_2} \\ \vdots \\ e^{z^F_{n_F}} \end{bmatrix} , $$
and
$$ a^F_k = \dfrac{e^{z^F_k}}{\sum_{i=1}^{n_F} e^{z^F_i}} .$$

Compute $\dfrac{\partial a^F_k}{\partial z^F_j}$. *Hint: your answer should be a piecewise function based on $j$ and $k$.*

For this problem, you do not need to program anything.

### **Problem 2C**

Notice that $\dfrac{\partial a^F_k}{\partial z^F_j}$ is dependent on two indices. So, we use a matrix to record all of these partials:

$$ \nabla_{{\bf z}^F} {\bf a}^F = 
\begin{bmatrix} 
    \dfrac{\partial a^F_1}{\partial z^F_1} & \dfrac{\partial a^F_1}{\partial z^F_2} & \cdots & \dfrac{\partial a^F_1}{\partial z^F_{n_F}} \\ ~ \\
    \dfrac{\partial a^F_2}{\partial z^F_1} & \dfrac{\partial a^F_2}{\partial z^F_2} & \cdots & \dfrac{\partial a^F_2}{\partial z^F_{n_F}} \\ ~ \\
    \vdots & \vdots & \ddots & \vdots \\ ~\\
    \dfrac{\partial a^F_{n_F}}{\partial z^F_1} & \dfrac{\partial a^F_{n_F}}{\partial z^F_2} & \cdots & \dfrac{\partial a^F_{n_F}}{\partial z^F_{n_F}}
\end{bmatrix} $$

Using your answer from Problem 2B, define a function for computing the matrix $\nabla_{{\bf z}^F} {\bf a}^F$. 

This seems crazy, but your answer should be fairly nice if you write it in terms of softmax. You also could do this in a for loop if you'd like!
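
One way to check a matrix of partials like this is to compare it against a central finite-difference approximation. The sketch below is a generic checker under that assumption. The name `numerical_jacobian` is hypothetical, and the demonstration uses `np.tanh` (whose Jacobian is diagonal) rather than softmax.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # Approximate J[i, j] = d f_i / d x_j with central differences.
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    jac = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

demo_x = np.array([2.0, 1.0, 0.1])
demo_jac = numerical_jacobian(np.tanh, demo_x)

# tanh acts elementwise, so its exact Jacobian is diagonal with entries 1 - tanh(x)^2.
analytic_jac = np.diag(1 - np.tanh(demo_x) ** 2)
print(np.allclose(demo_jac, analytic_jac, atol=1e-6))
```

Once your own function is written, the same comparison can be made between `numerical_jacobian(softmax, x)` and `softmax_derivative(x)`.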

In [None]:
def softmax_derivative(x):
    """
        Add your code below
    """
    s = softmax(x)
    deriv =
    return deriv

# test your function with some example values
print(softmax_derivative(x)) # should be approximately array([[ 0.2247, -0.1598, -0.0650], [-0.1598,  0.1837, -0.0239], [-0.0650, -0.0239,  0.0889]])

### **Problem 2D**

Where does the activation function of the last layer appear in our `backpropagation` function? Will the derivative of softmax change how we compute the error vector ${\bf d}^F$?

Discuss in a sentence or two.


### **Problem 3**

At this point, we could work on adapting your `backpropagation` function to use softmax in the final layer, but for the sake of time let's instead rely on TensorFlow to do the hard work for us.

Using TensorFlow, design and train an FNN on the MNIST dataset using `tf.activations.softmax` in your output layer, `tf.activations.relu` for your hidden layers, `tf.losses.CategoricalCrossentropy` for your loss function, and `tf.optimizers.SGD(learning_rate=0.01)`.

*Tip: It may be helpful to review your submission for Project 1!* 

### **Problem 4A**

During our Project 2 presentations, we saw that not only are there different kinds of activation and loss functions but there are also different kinds of optimizers!

Experiment with using the optimizer `tf.optimizers.Adam(learning_rate=0.01)`. 

The Adam optimizer is a variation of gradient descent. Rather than simply updating by a multiple of the gradient of the loss with respect to our weights and biases, it scales each update using running averages of past gradients and of past squared gradients.
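
To see what those extra terms look like, here is a minimal NumPy sketch of the commonly published Adam update rule, applied to a toy one-parameter problem. The function name `adam_step`, the default values of `beta1`, `beta2`, and `eps`, and the toy loss $(w - 3)^2$ are assumptions chosen for illustration; TensorFlow's implementation may differ in its details.

```python
import numpy as np

def adam_step(w, grad, m, v, t, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running average of gradients (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # Running average of squared gradients (second moment).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, which matters most during the first few steps.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter's step is scaled by its own gradient history.
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w_demo = np.array([0.0])
m_demo = np.zeros_like(w_demo)
v_demo = np.zeros_like(w_demo)
for t in range(1, 5001):
    grad = 2 * (w_demo - 3.0)
    w_demo, m_demo, v_demo = adam_step(w_demo, grad, m_demo, v_demo, t)
    if t == 1:
        w_after_first_step = w_demo.copy()

print(w_after_first_step, w_demo)
```

Note that the very first step has size close to `learning_rate` regardless of how large the gradient is, which is one visible difference from plain gradient descent.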

### **Problem 4B**

What did you find? Explain how Adam affected your training.

### When to Stop Training?

Deciding when to stop training is nontrivial. So far, we've relied on our model's performance on training data. However, in practice our model will be used to make decisions on data it has never seen before, so it's important to evaluate its performance on data it was not trained on. This leads us to partition our data into training, validation, and testing sets, which we saw in the Project 2 presentations!

We'll use `x_test` and `y_test` as our validation dataset.
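
For reference, a three-way partition can be sketched with a shuffled index array. Everything here is illustrative: the helper name `split_indices`, the 10% fractions, and the toy arrays are assumptions, and this notebook itself simply reuses the provided test set for validation.

```python
import numpy as np

def split_indices(num_samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    # Shuffle all sample indices once, then carve out disjoint slices.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_samples)
    num_val = int(num_samples * val_fraction)
    num_test = int(num_samples * test_fraction)
    val_idx = shuffled[:num_val]
    test_idx = shuffled[num_val:num_val + num_test]
    train_idx = shuffled[num_val + num_test:]
    return train_idx, val_idx, test_idx

# Toy stand-ins for a feature array and a label array.
toy_x = np.arange(100).reshape(50, 2)
toy_y = np.arange(50)

train_idx, val_idx, test_idx = split_indices(len(toy_x))
x_tr, y_tr = toy_x[train_idx], toy_y[train_idx]
x_val, y_val = toy_x[val_idx], toy_y[val_idx]
print(len(train_idx), len(val_idx), len(test_idx))
```

Indexing features and labels with the same index array keeps each sample paired with its label.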



### **Problem 5A**

Add `x_test` and `y_test` as a validation dataset in your `model.fit` call by passing `validation_data = (x_test, y_test)` as an argument. Then retrain your model to demonstrate how this changes the output of `model.fit`.

### **Problem 5B**

Discuss how we should use the output of our validation data to determine when to stop training.

### **Problem 6**

Revisit the Fashion MNIST dataset and train a new model with your newfound knowledge.

Include your code in the code cell below and use the final markdown cell below that to explain why you've trained a reasonable model. Be sure to reference your validation data.
