# Week 5: Neural Network Training

- Neural network training
- Activation functions
- Multiclass classification (MNIST)
- Softmax regression model for multiclass classification
- Adam Algorithm Intuition
- Convolutional Neural Network


General Training Steps
1. specify how to compute the output given input x and parameters w, b, which defines the model f_w,b(x)
2. specify the loss L and the cost J(w,b)
3. train on data to minimize J(w,b)
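
As a minimal sketch of these three steps outside of any framework, the example below fits a one-feature linear model with plain gradient descent. The toy data, learning rate, and step count are illustrative assumptions, not values from the course.

```python
# Step 1: define the model f_w,b(x) = w*x + b
def predict(x, w, b):
    return w * x + b

# Step 2: define the loss L (squared error) and the cost J (average loss, with the 1/2 factor)
def cost(xs, ys, w, b):
    m = len(xs)
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Step 3: minimize J(w,b) with gradient descent
def train(xs, ys, alpha=0.05, steps=2000):
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [predict(x, w, b) - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, for illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}, J = {cost(xs, ys, w, b):.6f}")
```

The Keras calls in the steps below play the same roles: `Sequential` defines the model, `compile` picks the loss, and `fit` runs the minimization.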

### Step 1 for neural networks

model = Sequential([
    Dense(units = ...),
    Dense(...),
])

### Step 2
Loss function examples
 - MeanSquaredError(): for regression, i.e. when we're predicting numbers and not categories
 - BinaryCrossentropy(): should only be used for classification with exactly 2 classes
 - SparseCategoricalCrossentropy(): used for multiclass classification with Softmax
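
To make the three losses concrete, here is a small standard-library sketch of what each one computes on a batch. The function names and sample values are hypothetical, and the real Keras classes have extra options (such as from_logits) that are left out here.

```python
import math

def mean_squared_error(y_true, y_pred):
    # average squared difference (Keras averages without the 1/2 factor)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(y_true, p_pred):
    # y_true holds 0/1 labels, p_pred holds predicted probabilities of class 1
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)

def sparse_categorical_crossentropy(y_true, p_pred):
    # y_true holds integer class indices, p_pred holds one probability vector per example
    return -sum(math.log(p[t]) for t, p in zip(y_true, p_pred)) / len(y_true)

mse = mean_squared_error([3.0, 5.0], [2.5, 5.5])
bce = binary_crossentropy([1, 0], [0.9, 0.2])
scce = sparse_categorical_crossentropy([2, 0], [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
print(mse, bce, scce)
```

"Sparse" refers to the labels being plain class indices (2, 0) rather than one-hot vectors.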

Implementation
model.compile(loss = BinaryCrossentropy(from_logits = True))

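from_logits = True tells the loss that the model outputs raw z values (logits) from a linear output layer; the sigmoid or softmax is then applied inside the loss, which reduces numerical rounding errors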
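
The rounding issue can be seen without TensorFlow. The sketch below compares a naive binary cross-entropy, which first turns the logit into a probability with a sigmoid, against an algebraically equivalent form computed directly from the logit. The function names and the test logit of 40 are illustrative assumptions; this shows the idea behind from_logits = True, not Keras's actual implementation.

```python
import math

def bce_from_probability(z, y):
    # naive two-step version: sigmoid first, then the log loss
    a = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def bce_from_logit(z, y):
    # same loss rearranged to work on the logit z directly,
    # so no intermediate probability gets rounded to exactly 0 or 1
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For a moderate logit the two versions agree
print(bce_from_probability(2.0, 1), bce_from_logit(2.0, 1))

# For a large logit the sigmoid rounds to exactly 1.0, so log(1 - a) fails
try:
    print(bce_from_probability(40.0, 0))
except ValueError as err:
    print("naive version failed:", err)
print(bce_from_logit(40.0, 0))
```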

### Step 3
model.fit(X,y, epochs = 100)


**Activation functions**
- Sigmoid g(z) = 1/(1+e^-z)
    - typically used for binary classification
    - but could also be used in multilabel classification: is it a car? is it a bus? is it a pedestrian?
- Linear activation function g(z) = z
    - use for the output layer when y can take positive or negative values (regression)
- ReLU (rectified linear unit) g(z) = max(0,z)
    - use for the output layer when y can only be 0 or positive
    - the most common choice for hidden layers
- Softmax (multiclass classification)
    - we can either set activation = softmax, OR set activation = linear and from_logits = True in the loss; the second option is more numerically accurate (see the optional lab below)
 
Note: if every g(z) is linear, the network is no different from linear regression, because a linear function of a linear function is still linear
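
The activations above, and the reason an all-linear network collapses into a single linear model, can be sketched in plain Python. The two-layer weights below are arbitrary illustrative numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(z):
    return z

def relu(z):
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), linear(z), relu(z))

# Two stacked one-unit layers with linear activations (hypothetical weights)
w1, b1 = 3.0, -1.0
w2, b2 = 0.5, 4.0

def two_linear_layers(x):
    a1 = linear(w1 * x + b1)
    return linear(w2 * a1 + b2)

# The same function written as one linear model: w = w2*w1, b = w2*b1 + b2
w, b = w2 * w1, w2 * b1 + b2

def single_linear_model(x):
    return w * x + b

for x in (-1.0, 0.0, 2.5):
    print(x, two_linear_layers(x), single_linear_model(x))
```

Swapping `linear` for `relu` in the first layer breaks this equivalence, which is why hidden layers need a non-linear activation.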


**Multiclass Classification (MNIST)**

example: trying to identify handwritten digits 0-9
y can take one of a fixed set of discrete values (here, 10 classes)
Can use Softmax

Softmax
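z_j = w_j . x + b_j for each output unit j = 1..N
a_j = e^z_j / (e^z_1 + ... + e^z_N)
the activations a_1 + ... + a_N add up to 1, so they can be read as class probabilities
used when the final output layer has more than 1 output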
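
A worked example of the softmax formula, using only the standard library. The four z values are made up; subtracting the largest z before exponentiating is a common stability trick that does not change the result.

```python
import math

def softmax(z):
    # shift by max(z) so the largest exponent is e^0 = 1 (avoids overflow)
    z_max = max(z)
    ez = [math.exp(zj - z_max) for zj in z]
    total = sum(ez)
    return [e / total for e in ez]

z = [2.0, 1.0, 0.1, -1.0]   # hypothetical outputs of a 4-unit final layer
a = softmax(z)
print(a)
print("sum of activations:", sum(a))
print("predicted class:", a.index(max(a)))
```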


**Adam Algorithm Intuition (ADAptive Moment estimation)**

- usually faster than plain gradient descent, because it automatically adjusts a separate learning rate for each parameter

Add to model.compile
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-3), loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True))
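
For intuition about what the optimizer does on each step, here is a minimal single-parameter sketch of the Adam update rule as it is usually written (first and second moment estimates with bias correction). The cost J(w) = (w - 3)^2, the learning rate of 0.1, and the step count are assumptions chosen for the demo; the beta and epsilon values are commonly used defaults.

```python
import math

def adam_minimize(grad, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    w, m, v = w0, 0.0, 0.0
    history = [w]
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
    return w, history

# J(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final, history = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print("after 1 step:", history[1])
print("after 500 steps:", w_final)
```

The first step moves w by almost exactly the learning rate, whatever the size of the gradient, which is one way to see how Adam rescales the step for each parameter.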


**Convolutional Neural Network**

example: each unit looks at only a window of the input values instead of all of them (like a sliding window over an EKG signal)

The first hidden layer can be set up so that each unit takes a window of several input values
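
A rough sketch of the window idea: each unit in a 1-D convolutional layer sees only a short slice of the input signal. As in a standard convolution, the units here share one set of weights. The signal values, window size, stride, and weights below are invented for illustration.

```python
def conv1d_layer(signal, weights, bias=0.0, stride=1):
    # slide a window of len(weights) samples across the signal;
    # every output unit applies the same weights to its own window, then ReLU
    k = len(weights)
    outputs = []
    for start in range(0, len(signal) - k + 1, stride):
        window = signal[start:start + k]
        z = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(max(0.0, z))
    return outputs

# Hypothetical EKG-like trace with one spike in the middle
signal = [0.0, 0.1, 0.0, 1.0, 0.2, 0.0, 0.1, 0.0]
# Hypothetical weights that respond when the middle sample exceeds its neighbours
weights = [-1.0, 2.0, -1.0]

activations = conv1d_layer(signal, weights)
print(activations)
```

With 8 input samples and a window of 3, the layer has 6 units, and the unit centred on the spike produces the largest activation.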

### Optional Lab - Softmax Function

In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
%matplotlib widget
from matplotlib.widgets import Slider
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

In [None]:
def my_softmax(z):
    ez = np.exp(z)              # element-wise exponential
    sm = ez/np.sum(ez)          # normalize so the outputs sum to 1
    return sm

In [None]:
# make a 4-class dataset for the example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

In [None]:
preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note use this instead of softmax when using from_logits = True
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_model.fit(
    X_train,y_train,
    epochs=10
)
        

Notice that in the preferred model, the outputs are not probabilities, but can range from large negative numbers to large positive numbers. The output must be sent through a softmax when performing a prediction that expects a probability. 

In [None]:
p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))

The output predictions are not probabilities!
If the desired outputs are probabilities, the output should be processed by a [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax).

In [None]:
sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))

To select the most likely category, the softmax is not required. One can find the index of the largest output using [np.argmax()](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html).

In [None]:
for i in range(5):
    print( f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
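
Why the softmax can be skipped here: it is monotonic, so the largest logit always maps to the largest probability. A small standard-library check with made-up logits (the helper names are hypothetical):

```python
import math

def softmax(z):
    z_max = max(z)
    ez = [math.exp(zj - z_max) for zj in z]
    total = sum(ez)
    return [e / total for e in ez]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical raw outputs (logits) for three examples and four classes
logits = [
    [-3.2, 1.5, 0.3, -0.7],
    [4.1, -2.0, 0.0, 3.9],
    [-1.0, -1.5, -0.2, -0.9],
]
for z in logits:
    print(argmax(z), argmax(softmax(z)))
```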