Answers (3-9)
------------
3. A single perceptron can only learn a linear decision boundary, so it cannot learn a non-linear function such as XOR. To learn a
  non-linear function, we need to stack perceptrons into layers (an MLP) and use a non-linear activation function between them.
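
The XOR function is a common illustration of this limit. The sketch below uses hand-picked (not learned) weights, which are assumptions chosen for the example: two threshold units computing OR and NAND feed a third unit computing AND, and together they reproduce XOR. A brute-force search over a small weight grid finds no single unit that does the same; the search is only an illustration, not a proof.

```python
from itertools import product

def step(z):
    """Heaviside step activation used by the classical perceptron."""
    return 1 if z >= 0 else 0

def perceptron(x1, x2, w1, w2, b):
    """A single threshold unit: step(w1*x1 + w2*x2 + b)."""
    return step(w1 * x1 + w2 * x2 + b)

def xor_mlp(x1, x2):
    """Two hidden units (OR and NAND) feeding an AND output unit."""
    h_or = perceptron(x1, x2, 1, 1, -0.5)
    h_nand = perceptron(x1, x2, -1, -1, 1.5)
    return perceptron(h_or, h_nand, 1, 1, -1.5)

inputs = list(product([0, 1], repeat=2))   # (0,0), (0,1), (1,0), (1,1)
xor_targets = [a ^ b for a, b in inputs]
mlp_outputs = [xor_mlp(a, b) for a, b in inputs]

# Try every (w1, w2, b) on a coarse grid: no single unit matches XOR.
grid = [i / 2 for i in range(-6, 7)]       # -3.0, -2.5, ..., 3.0
single_unit_solves_xor = any(
    [perceptron(a, b, w1, w2, bias) for a, b in inputs] == xor_targets
    for w1, w2, bias in product(grid, repeat=3)
)

print(mlp_outputs, single_unit_solves_xor)
```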

4. It's differentiable, so we can use it with gradient descent. In addition, its derivative is non-zero everywhere, so every step
  yields some update of the weights and training can make progress.
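
A small numeric check can make this concrete. The sketch assumes the function in question is the sigmoid (the answer does not name it) and compares its derivative with that of the step function, which is zero wherever it is defined. The sample points are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def step_derivative(z):
    """The step function is flat on both sides of 0, so its derivative is 0 there."""
    return 0.0

points = [-5.0, -1.0, 0.0, 1.0, 5.0]
sigmoid_grads = [sigmoid_derivative(z) for z in points]
step_grads = [step_derivative(z) for z in points if z != 0]

print(sigmoid_grads)
print(step_grads)
```

The sigmoid's derivative peaks at 0.25 for z = 0 and shrinks quickly as |z| grows, so in practice the updates can become very small when units saturate.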

5. Sigmoid
  ReLU
  tanh

6. MLP:
  * Input: 10 
  * Hidden 1: 50
  * Output: 3
  a. Shape of input matrix: (m, 10)
  b. Shape of hidden layer:
    b1. Weights matrix: (50, 10)
    b2. Bias vector: (50,), broadcast across the m rows
    X @ W1.T + b1 = (m, 10) @ (10, 50) + (50,) = (m, 50)
  c. Shape of output layer:
    c1. Weights matrix: (3, 50)
    c2. Bias vector: (3,), broadcast across the m rows
  d. Shape of output matrix:
    (m, 50) @ (50, 3) + (3,) = (m, 3)
  e. Equation of network:
    f(X,W1,b1,W2,b2) = a(a(X @ W1.T + b1) @ W2.T + b2)
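
The shapes can be checked with a short NumPy sketch. The batch size `m = 4`, the random weights, and the choice of ReLU for the activation `a` are assumptions made for illustration. The biases are 1-D so that NumPy broadcasting adds them to every row.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # batch size, chosen arbitrarily for the check

X = rng.normal(size=(m, 10))    # input matrix
W1 = rng.normal(size=(50, 10))  # hidden-layer weights
b1 = np.zeros(50)               # hidden-layer bias
W2 = rng.normal(size=(3, 50))   # output-layer weights
b2 = np.zeros(3)                # output-layer bias

def relu(z):
    return np.maximum(z, 0)

H = relu(X @ W1.T + b1)  # hidden activations
Y = relu(H @ W2.T + b2)  # network output

print(H.shape, Y.shape)
```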
    
7. For spam vs. ham, we'd need a single output neuron. Its result is the probability of spam (i.e. the positive class); the probability of ham is 1 - probability.
  The activation function should be sigmoid.
  For MNIST, we'd need 10 output neurons, one per digit class. The activation function should be softmax.
  For house prices (regression), we'd need a single output neuron with no activation function.
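
To illustrate the two classification cases, the sketch below applies sigmoid and softmax to made-up output-layer values (the logits are hypothetical, not taken from a trained model). For regression, the raw output would be used directly, so there is nothing to apply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum before exponentiating, for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Binary classification: one output neuron, sigmoid gives P(spam).
spam_logit = np.array([0.8])
p_spam = sigmoid(spam_logit)
p_ham = 1.0 - p_spam

# Multiclass classification: 10 output neurons, softmax gives one probability per digit.
digit_logits = np.array([[2.0, 0.5, -1.0, 0.0, 0.1, 3.0, -0.5, 1.2, 0.3, -2.0]])
digit_probs = softmax(digit_logits)
predicted_digit = int(digit_probs.argmax(axis=-1)[0])

print(p_spam, p_ham, predicted_digit)
```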

8. Backpropagation is an algorithm for training MLPs. It applies the chain rule through all transformations of the MLP to compute
  the gradient of the loss function with respect to each parameter in the network, and then performs a gradient descent step to update the
  parameters. Reverse-mode autodiff is a computational technique for efficiently computing the gradient of a composite function. It's
  the technique TensorFlow uses to implement the gradient computation in backpropagation.
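
The chain-rule step can be sketched by hand on a tiny network with one sigmoid hidden unit and one linear output. The network, the parameter values, and the squared-error loss are assumptions chosen for illustration. The hand-derived gradients are compared with finite differences, and one gradient descent step is then applied.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, target):
    """Forward pass: returns hidden activation, output, and loss."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    loss = 0.5 * (y - target) ** 2
    return h, y, loss

def backward(params, x, target):
    """Backward pass: chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y, _ = forward(params, x, target)
    d_y = y - target         # dL/dy
    d_w2 = d_y * h           # dL/dw2
    d_b2 = d_y               # dL/db2
    d_h = d_y * w2           # dL/dh
    d_z = d_h * h * (1 - h)  # through the sigmoid
    return [d_z * x, d_z, d_w2, d_b2]  # order: w1, b1, w2, b2

def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        diff = forward(plus, x, target)[2] - forward(minus, x, target)[2]
        grads.append(diff / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]
x, target = 1.5, 1.0
analytic = backward(params, x, target)
numeric = numerical_gradient(params, x, target)

# One gradient descent step using the backpropagated gradients.
learning_rate = 0.1
new_params = [p - learning_rate * g for p, g in zip(params, analytic)]
loss_before = forward(params, x, target)[2]
loss_after = forward(new_params, x, target)[2]
print(loss_before, loss_after)
```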

9. Hyperparameters available for MLPs:
  - Number of hidden layers
  - Number of neurons in each hidden layer
  - Activation functions 
  - Learning rate
  - Batch size
  - Optimizer
  - Number of epochs
  If an ANN overfits, we should:
    * Decrease the number of hidden layers and neurons
    * Reduce the number of epochs or use early stopping
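
Early stopping can be sketched as a patience rule on the validation loss. The function name `early_stopping_epoch` and the loss values below are hypothetical, made up for the example. This is a simplified sketch of the idea, not how any particular library implements it.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based indices into val_losses.

    Training stops once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 3.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44, 0.46]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Restoring the weights from `best_epoch` rather than from `stop_epoch` keeps the model from the point before it started to overfit.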


## Hyperparameter Tuning over MNIST

In [14]:
import keras_tuner as kt
import tensorflow as tf
import numpy as np
from pathlib import Path
from time import strftime

In [16]:
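# Load MNIST as NumPy arrays (28x28 grayscale images with integer labels 0-9)
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel intensities from [0, 255] to [0, 1]
X_train_full, X_test = X_train_full / 255, X_test / 255

# Hold out the last 5,000 training images as a validation set
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]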



In [18]:
# Build and compile a model whose hyperparameters are drawn from ranges declared on a KerasTuner HyperParameters object

def build_model(hp: kt.HyperParameters):
  n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
  n_neurons = hp.Int("n_neurons", min_value=50, max_value=500)
  learning_rate = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
  optimizer = hp.Choice("optimizer", values=["sgd", "adam"])

  if optimizer == "sgd":
    optimizer = tf.keras.optimizers.legacy.SGD(learning_rate=learning_rate)
  else:
    optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=learning_rate)

  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Flatten())
  for _ in range(n_hidden):
    model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
  model.add(tf.keras.layers.Dense(10, activation="softmax"))

  model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

  return model

def get_run_logdir(root_logdir="tensorboard_logs"):
  return Path(root_logdir) / strftime("run_%Y_%m_%d_%H_%M_%S")

# example: tensorboard_logs/run_2023_10_06_08_31_16
run_logdir = get_run_logdir()

bayesian_opt_tuner = kt.BayesianOptimization(
  build_model, objective="val_accuracy", seed=42, max_trials=10, alpha=1e-4, beta=2.6, overwrite=True,
  directory="my_mnist", project_name="bayesian_opt"
)

tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir, profile_batch=(100,200))
bayesian_opt_tuner.search(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), callbacks=[tensorboard_cb])

Trial 10 Complete [00h 00m 32s]
val_accuracy: 0.9814000129699707

Best val_accuracy So Far: 0.9832000136375427
Total elapsed time: 00h 04m 41s
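
The learning-rate range in `build_model` uses `sampling="log"`, which treats the range on a logarithmic scale, so values between 1e-4 and 1e-3 are explored about as often as values between 1e-3 and 1e-2. The sketch below illustrates log-uniform sampling in plain Python. The helper `sample_log_uniform` is hypothetical and is not KerasTuner's implementation, and the Bayesian tuner does more than draw random samples.

```python
import math
import random

def sample_log_uniform(rng, min_value, max_value):
    """Draw a value whose logarithm is uniform between log(min) and log(max)."""
    return math.exp(rng.uniform(math.log(min_value), math.log(max_value)))

rng = random.Random(42)
samples = [sample_log_uniform(rng, 1e-4, 1e-2) for _ in range(10_000)]

# About half of the samples should fall below 1e-3, the geometric midpoint of the range.
fraction_below_midpoint = sum(s < 1e-3 for s in samples) / len(samples)
print(fraction_below_midpoint)
```

With linear sampling over the same range, only about 9% of the draws would fall below 1e-3, so small learning rates would rarely be tried.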


In [19]:
# Get best model's parameters:
top3_params = bayesian_opt_tuner.get_best_hyperparameters(num_trials=3)
top3_params[0].values

{'n_hidden': 7,
 'n_neurons': 253,
 'learning_rate': 0.0005509513888645584,
 'optimizer': 'adam'}

In [21]:
# Create a model using these parameters and train it for longer on the full training set
# (validation_split=0.1 holds out the last 10% of it for validation and early stopping)
model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),

        tf.keras.layers.Dense(253, activation="relu"),
        tf.keras.layers.Dense(253, activation="relu"),
        tf.keras.layers.Dense(253, activation="relu"),
        tf.keras.layers.Dense(253, activation="relu"),
        tf.keras.layers.Dense(253, activation="relu"),
        tf.keras.layers.Dense(253, activation="relu"),
        tf.keras.layers.Dense(253, activation="relu"),

        tf.keras.layers.Dense(10, activation="softmax")
])

optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.0005)  # best trial's learning rate (~0.00055), rounded
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
tensorboard_cb = tf.keras.callbacks.TensorBoard("tensorboard_logs/best_model", profile_batch=(100,200))

history = model.fit(X_train_full, y_train_full, epochs=50, validation_split=0.1, callbacks=[early_stopping_cb, tensorboard_cb])

Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


In [22]:
model.evaluate(X_test, y_test)



[0.08346851915121078, 0.9768999814987183]