# Introduction to Artificial Neural Networks

# Imports

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import matplotlib
import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats('retina')
matplotlib.rcParams['figure.figsize'] = (12, 8)
matplotlib.rcParams['figure.dpi'] = 100

# Artificial Neural Networks (ANNs)

An Artificial Neural Network (ANN) is a composition of functions arranged in layers: the output of each layer is the input of the next one.

## What is a Neuron?

![image](https://drive.google.com/uc?export=view&id=1qSVls_zidJ-nGcExfBHQrdEwD_xO_ZjR)


Linear function: $Wx + b$

Non-linearity (activation function): $f(x)$

Every neuron computes: $f(Wx + b)$

Without an activation function, a single neuron is just a linear regression model.
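
As a rough illustration of $f(Wx + b)$, here is a minimal NumPy sketch of a single neuron. The function name `neuron` and the input values are made up for this example; they are not part of any library.

```python
import numpy as np

def neuron(x, W, b, activation=None):
    """Compute f(Wx + b); with activation=None the neuron is purely linear."""
    z = np.dot(W, x) + b
    return z if activation is None else activation(z)

def relu(z):
    return np.maximum(z, 0)

# Hypothetical example values: 3 inputs feeding one neuron
x = np.array([1.0, 2.0, 3.0])
W = np.array([0.5, -1.0, 2.0])

linear_output = neuron(x, W, b=0.5)                       # plain linear regression
positive_output = neuron(x, W, b=0.5, activation=relu)    # Wx + b > 0, passes through
clipped_output = neuron(x, W, b=-5.0, activation=relu)    # Wx + b < 0, ReLU returns 0

print(linear_output, positive_output, clipped_output)
```

With no activation the neuron returns $Wx + b$ unchanged, which is exactly the prediction of a linear regression with coefficients `W` and intercept `b`.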

![image](https://drive.google.com/uc?export=view&id=1SniRhePcgf4SBNKUBaFsU4VDjzo3Kz1V)

---



Hidden layers learn intermediate features (combinations of the inputs) automatically during training.

Deep Neural Networks (DNNs) have several hidden layers.

![image](https://drive.google.com/uc?export=view&id=1pzhyOacSy7rzcOsMPzdaxVmOr0g5CCTF)

## Why multiple layers?

The intuition is that the network builds up representations of the data gradually, from simple to complex.
Each layer models a relation on top of the output of the previous layer.

Face recognition:
Image -> Edges -> Face parts -> Faces -> Desired face

Audio recognition:
Audio -> Low level sound features -> Phonemes -> Words -> Sentences

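## Shallow or deep?

Deep > Shallow: in practice, deep networks tend to outperform shallow ones on complex tasks such as the recognition examples above.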

## Is it similar to how our brain works?

Only loosely: the analogy between a single brain neuron and a NN unit is oversimplified.

## Activation Functions

Popular functions:
* `Sigmoid`, $\sigma(x) = \frac{1}{1 + e^{-x}}$
* `tanh`, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
* Leaky ReLU, $\text{LeakyReLU}(x) = \max(0.1x, x)$
* Rectified Linear Unit (ReLU), $\text{ReLU}(x) = \max(0, x)$

In [None]:
def sigmoid(x):
    return 1. / (1. + np.exp(-x))

def relu(x):
    return np.maximum(x, 0)

def leakyrelu(x):
    return np.maximum(x, 0.1*x)

def tanh(x):
    return np.tanh(x)

fig, ax = plt.subplots(nrows=2, ncols=2)

x = np.linspace(-10, 10, num=50)
ax[0, 0].plot(x, relu(x))
ax[0, 0].set_title('relu(x)')
ax[0, 0].grid(True)

ax[0, 1].plot(x, leakyrelu(x))
ax[0, 1].set_title('leakyrelu(x)')
ax[0, 1].grid(True)

ax[1, 0].plot(x, sigmoid(x))
ax[1, 0].set_title('sigmoid(x)')
ax[1, 0].grid(True)

ax[1, 1].plot(x, tanh(x))
ax[1, 1].set_title('tanh(x)')
ax[1, 1].grid(True)

plt.show()

### Why non-linear activation functions?

If we remove the activation functions, the whole network is linear.
Stacking multiple linear functions results in another linear function, no matter how many layers are used.
The intuition here is that since our dataset is non-linear, we give the network the ability to learn non-linearities.
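
The collapse of stacked linear layers can be checked numerically. The sketch below uses arbitrary layer shapes and random weights (all chosen only for illustration) and shows that two linear layers without activation are equivalent to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers without activation (shapes are arbitrary examples)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping expressed as a single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))
```

Inserting a non-linearity such as ReLU between the two layers breaks this equivalence, which is what lets deeper networks represent more than a linear model.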

# More vocabulary

## Hyperparameters

* Number of hidden layers
* Number of neurons in each layer
* Learning rate
* Activation functions
* Initialization of weights
* Batch size

## Weight initialization

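It's important to initialize the weights with values that are not all $0$ and not all the same constant; otherwise every neuron in a layer computes the same output and receives the same update.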

There are multiple initialization strategies:
* uniform/normal distributions
* LeCun uniform/normal
* Glorot/Xavier
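
As an example of one of these strategies, here is a minimal NumPy sketch of Glorot (Xavier) uniform initialization, which samples weights from $U(-l, l)$ with $l = \sqrt{6 / (n_{in} + n_{out})}$. The helper name `glorot_uniform` and the layer sizes are assumptions made for this illustration.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(42)

# Hypothetical layer: 784 inputs, 128 units
W = glorot_uniform(784, 128, rng)
limit = np.sqrt(6.0 / (784 + 128))

print(W.shape)
print(W.min() >= -limit, W.max() <= limit)
```

The weights are random (so neurons start out different from each other) and their scale shrinks as the layer gets wider, which helps keep the magnitude of activations and gradients roughly stable across layers.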

## Types of ANNs

* Fully Connected Neural Networks, for structured (tabular) data
* Convolutional Neural Networks (CNN), for Computer Vision
* Recurrent Neural Networks (RNN), for speech recognition and natural language processing (NLP)

## Tensors

A Tensor is a generalization of vectors and matrices to higher dimensions.
It has a rank (number of dimensions):
* 0, scalar, magnitude only
* 1, vector, magnitude and direction
* 2, matrix, table of numbers
* 3, cube of numbers
* n, n-dimensional array
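
The ranks listed above map directly onto NumPy arrays, where `ndim` gives the rank and `shape` the size along each dimension. The array contents and the batch of images below are placeholder values for illustration.

```python
import numpy as np

scalar = np.array(5.0)
vector = np.array([1.0, 2.0, 3.0])
matrix = np.ones((2, 3))
cube = np.zeros((2, 3, 4))
image_batch = np.zeros((32, 28, 28, 1))  # hypothetical batch of 32 grayscale images

tensors = [
    ("scalar", scalar),
    ("vector", vector),
    ("matrix", matrix),
    ("cube", cube),
    ("image_batch", image_batch),
]
for name, t in tensors:
    print(name, "rank:", t.ndim, "shape:", t.shape)
```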

## Computational Graph

Example from GoogLeNet (Inception v1)

![image](https://drive.google.com/uc?export=view&id=1E-hbbTwb7cJb2-6Gy0vUXjgwiD908Ri7)


## Forward/Backward Propagation

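The forward pass computes the output of the network (and the loss) from the inputs; it is all that is needed for inference.

Backward propagation computes the gradient of the loss with respect to every weight, which is then used to optimize the weights.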
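
To make the two passes concrete, here is a hand-written sketch for a single linear neuron with a mean squared error loss. The function names and data are chosen for this example only; real frameworks derive the backward pass automatically. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def forward(w, b, x):
    """Forward pass: prediction of a single linear neuron."""
    return w * x + b

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def backward(w, b, x, y_true):
    """Backward pass: analytic gradients of the MSE loss w.r.t. w and b."""
    error = forward(w, b, x) - y_true
    grad_w = np.mean(2 * error * x)
    grad_b = np.mean(2 * error)
    return grad_w, grad_b

# Hypothetical data generated from y = 3x - 2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 * x - 2
w, b = 0.5, 0.0

grad_w, grad_b = backward(w, b, x, y)

# Sanity check with a central finite difference on w
eps = 1e-6
numeric_grad_w = (mse(forward(w + eps, b, x), y) - mse(forward(w - eps, b, x), y)) / (2 * eps)

print(grad_w, grad_b, numeric_grad_w)
```

Both gradients are negative here, meaning that increasing `w` and `b` would reduce the loss, which is consistent with the true values being larger than the starting ones.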

## Optimizer

The optimizer specifies how the gradient of the loss is used to update the parameters.

* SGD
* RMSprop
* Adam
...
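
The simplest of these rules, plain SGD, can be sketched in a few lines: move the parameters a small step against the gradient. The helper names below are made up for this illustration, and the optional rescaling of the gradient to a maximum L2 norm is one common way to keep steps bounded, not the only one.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def sgd_step(params, grad, learning_rate=0.01, clipnorm=None):
    """One plain SGD update: params <- params - learning_rate * grad."""
    if clipnorm is not None:
        grad = clip_by_norm(grad, clipnorm)
    return params - learning_rate * grad

# Hypothetical parameters and gradient (the gradient has L2 norm 5)
params = np.array([1.0, -2.0])
grad = np.array([3.0, 4.0])

unclipped = sgd_step(params, grad)
clipped = sgd_step(params, grad, clipnorm=1.0)

print(unclipped)
print(clipped)
```

RMSprop and Adam follow the same pattern but additionally keep running statistics of past gradients to adapt the step size per parameter.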


## Loss function

* Mean Squared Error (MSE)
* Mean Absolute Error (MAE)
* Categorical Crossentropy
...
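
For reference, these three losses can be written in a few lines of NumPy. This is a simplified sketch with made-up example values; the function names mirror the loss names but are defined here, not imported from a framework.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def categorical_crossentropy(y_true_one_hot, y_prob, eps=1e-12):
    """Average cross-entropy between one-hot labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true_one_hot * np.log(y_prob), axis=1))

# Regression example
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse_value = mean_squared_error(y_true, y_pred)
mae_value = mean_absolute_error(y_true, y_pred)
print(mse_value, mae_value)

# Classification example: 2 samples, 3 classes
labels = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
cce_value = categorical_crossentropy(labels, probs)
print(cce_value)
```

MSE penalizes large errors more strongly than MAE because the errors are squared, while cross-entropy only looks at the probability assigned to the correct class.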


## Epochs

The number of epochs is the number of complete passes over the training data made during training.

## Batch/mini-batch size

The batch size is the number of samples used for every iteration (weight update) of the optimizer.

It ranges from small numbers (e.g. 1) to higher values (1024, 2048, etc.).

In general, larger batches give a less noisy estimate of the gradient and make better use of the hardware, so each epoch runs faster. They do not automatically make the model more accurate: with fewer updates per epoch, a large batch often needs a higher learning rate or more epochs to reach the same loss.

Since we want to use the entire training set during an epoch and the dataset most likely does not fit in CPU/GPU memory, we have to split it into multiple mini-batches.
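
Splitting a dataset into mini-batches can be sketched with a small generator. The helper `iterate_minibatches` and the toy dataset of 100 samples are assumptions made for this example.

```python
import math
import numpy as np

def iterate_minibatches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices; the last one may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Hypothetical dataset with 100 samples and 1 feature
X = np.arange(100, dtype=float).reshape(100, 1)
y = 3 * X[:, 0] - 2

batch_size = 32
batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X, y, batch_size)]
iterations_per_epoch = math.ceil(len(X) / batch_size)

print(batch_sizes)            # sizes of the batches seen in one epoch
print(iterations_per_epoch)   # optimizer iterations per epoch
```

With 100 samples and a batch size of 32, one epoch consists of four optimizer iterations, the last one on a smaller batch of 4 samples.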

# Summary

**Training** an Artificial Neural Network boils down to minimizing a cost (loss) function by optimizing the parameters (weights) of the model.
This is not an easy task since there can be millions of parameters to optimize.

**The loss function** that is being minimized includes a sum, or a form of penalty over all of the training samples available.

**The optimization** is commonly done using Stochastic Gradient Descent over batches of input data, until all the data has been used.

**Epoch**. A complete cycle through all of the training data is called an **epoch**.
The training (optimization) runs for a number of epochs, until the loss is low enough and the model has reached an acceptable accuracy level, or until it stops improving.
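
Putting these pieces together, the following is a minimal sketch of a training loop written by hand in NumPy: mini-batch SGD minimizing the mean squared error of a single linear neuron. The target function, learning rate, batch size and number of epochs are arbitrary choices for this illustration, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data for a hypothetical target y = 3x - 2
X = np.linspace(-1, 1, 200)
y = 3 * X - 2

w, b = 0.0, 0.0            # parameters to optimize
learning_rate = 0.1
batch_size = 20
epochs = 100

for epoch in range(epochs):
    order = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]        # forward pass on the batch
        grad_w = np.mean(2 * error * X[idx])     # gradients of the batch MSE
        grad_b = np.mean(2 * error)
        w -= learning_rate * grad_w              # SGD update
        b -= learning_rate * grad_b

final_loss = np.mean((w * X + b - y) ** 2)
print(w, b, final_loss)
```

After training, `w` and `b` should end up very close to 3 and -2. Frameworks such as Keras hide this loop behind a single call and compute the gradients automatically.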

# Popular Frameworks

* TensorFlow backed by Google
* Keras, a high-level API backed by Google
* PyTorch backed by Facebook
* Caffe2 backed by Facebook
* Microsoft Cognitive Toolkit (CNTK)

# TensorFlow

Backed by Google, Open Source (Apache 2.0 license), https://github.com/tensorflow/tensorflow

Released in November, 2015.

Has stable Python and C APIs.

Mobile variant for Android, iOS: TensorFlow Lite.

Backends for CPU, GPU (Nvidia CUDA, AMD ROCm, OpenCL/SYCL), and Google Tensor Processing Unit (TPU) ASICs.

# Keras

Keras is a high-level neural networks API written in Python. Its earlier multi-backend releases could run on top of TensorFlow, CNTK, or Theano; this notebook uses the `tf.keras` implementation bundled with TensorFlow.

It was developed with a focus on enabling fast experimentation.

http://keras.io

# Linear Regression with Keras and TensorFlow

The core data structure in Keras is a *model*, which is a way to organize layers.

The simplest type of model is the `Sequential` model, a linear stack of layers.

More complex architectures are also possible; see the **Keras functional API**.

Let's define the simplest possible neural network. It has one layer, and that layer has a single neuron, so the input shape is just one value.

In [None]:
# check doc, you can directly add the layers here
model = keras.Sequential()

Adding layers is done with the `add()` method.

In [None]:
from tensorflow.keras.layers import Dense

In [None]:
mylayer = keras.layers.Dense(units=1, input_shape=[1])

model.add(mylayer)

In [None]:
# equivalent approach
model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])

Now we need to compile the network, but in order to do that, we need to specify an optimizer and a loss function.

In [None]:
# All parameter gradients will be clipped to a maximum norm of 1.
sgd = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.)

In [None]:
model.compile(optimizer=sgd, loss='mean_squared_error')

In [None]:
model.summary()

# see the number of parameters?

## Example data

Let's assume that our function is the following:

In [None]:
def myfunc(x):
    return 3 * x - 2

In [None]:
# X = np.array([-1.0,  0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
X = np.linspace(-1, 20, 100)
y = myfunc(X).astype(float)

plt.plot(X, y)
plt.grid()

## Training the network

Training, where the network learns the relationship between the Xs and the Ys, happens in the `model.fit` call.

This involves a loop in which the loss function and the optimizer are used to update the network, repeated for `epochs` passes over the data. The log that is printed lets us monitor the process.

In [None]:
print(X.shape)

In [None]:
history = model.fit(X, y,
                    epochs=200,
                    validation_split=0.1
)

In [None]:
print(history.history['loss'])

Now we have a model that has been trained. We can use this model to predict the values on previously unseen data using `model.predict`.

In [None]:
# Plot training & validation loss values

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.grid()

## Making predictions

In [None]:
x_test = np.array([0, 10])
print(model.predict(x_test))
print(f"True values: {myfunc(x_test)}")

In [None]:
len(model.layers)

In [None]:
l1 = model.layers[0]

print(type(l1))

In [None]:
l1.get_weights()

# Going Deeper with a Computer Vision Example

Now that we've seen how to model the relationship between a single feature and a single output, let's try something more difficult.

The Fashion MNIST dataset contains a training set of 60,000 examples and a test set of 10,000 examples.
Each example is a 28x28 grayscale image (i.e. 28,28,1) associated with a label from 10 classes.

Classes:
* 0 T-shirt/top
* 1 Trouser
* 2 Pullover
* 3 Dress
* 4 Coat
* 5 Sandal
* 6 Shirt
* 7 Sneaker
* 8 Bag
* 9 Ankle boot 


The Fashion MNIST dataset is available directly through the `keras.datasets` API.

In [None]:
# notice we get both training and test images

(training_images, training_labels), (test_images, test_labels) = keras.datasets.fashion_mnist.load_data()

Remember, these are images, so we're dealing with pixels organized in rows and columns.

In [None]:
print(training_images.shape)
print(test_images.shape)

In [None]:
28 * 28

In [None]:
training_labels

In [None]:
a_sample = training_images[0]
print(a_sample.shape)
print(a_sample[:10])

In [None]:
for i in range(0, 10):
    index = np.argwhere(training_labels == i).ravel()
    first_idx = index[0]
    plt.figure()
    plt.imshow(training_images[first_idx], cmap='gray')
    plt.title('class {}'.format(i))
    plt.show()

## Normalization

Notice that all of the pixel values are between 0 and 255.

Training Neural Networks is easier if all values are in the range $[0, 1]$ or $[-1, 1]$.
Rescaling the data to such a range is called normalization.

In [None]:
training_images = training_images / 255.0
test_images = test_images / 255.0

training_images = training_images.reshape((training_images.shape[0], training_images.shape[1] * training_images.shape[2]))
test_images = test_images.reshape((test_images.shape[0], test_images.shape[1] * test_images.shape[2]))

print(training_images.shape)
print(test_images.shape)

## Standardization

In [None]:
# Compute the statistics on the training data only, then apply them to both sets
mean = np.mean(training_images)
stddev = np.std(training_images)

training_images = (training_images - mean) / stddev
test_images = (test_images - mean) / stddev

## Building the model

Let's create a model which has two layers:
* One hidden layer with 128 units (neurons)
* The output layer with 10 units since we're doing a classification problem with 10 classes
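
The number of trainable parameters of such a model can be worked out by hand: each fully connected layer has one weight per (input, unit) pair plus one bias per unit. The small helper below is written for this example and assumes the 28 x 28 images are flattened to 784 inputs.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_param_count(784, 128)   # 28 * 28 flattened pixels -> 128 units
output = dense_param_count(128, 10)    # 128 hidden units -> 10 classes
total = hidden + output

print(hidden, output, total)
```

These numbers can be compared with the parameter counts reported by `model.summary()`.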

In [None]:
model = keras.models.Sequential(
    [
        keras.layers.Dense(128, input_shape=(784, ), activation='relu'), 
        keras.layers.Dense(10, activation='softmax')
    ]
)

In [None]:
model = keras.models.Sequential()

#model.add(keras.layers.Dense(128, input_shape=(784, ), activation='relu'))
model.add(keras.layers.Dense(128, input_shape=(784, ), activation=keras.activations.relu))
model.add(keras.layers.Dense(10, activation=keras.activations.softmax))

## Softmax?

We want the outputs of the output layer to be probabilities in $[0, 1]$.

As we saw, the output of a unit depends on the function $Wx + b$, which can output any real value.
Softmax takes $K$ real numbers and normalizes them into a probability distribution consisting of $K$ probabilities: $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$.
This means that all outputs add up to 1 and can be interpreted as probabilities.
Larger input components correspond to larger probabilities.
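
A minimal NumPy sketch of softmax is shown below. Subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result; the input values are made up for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1, -3.0])   # hypothetical outputs of Wx + b
probs = softmax(logits)

print(probs)
print(probs.sum(), np.argmax(probs))
```

The predicted class is the index of the largest probability, which is also the index of the largest input value.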

In [None]:
# sparse_categorical_crossentropy
# categorical_crossentropy

optimizer = keras.optimizers.Adam()
model.compile(optimizer = optimizer,
    loss = 'sparse_categorical_crossentropy',
    metrics=['acc'],
)

In [None]:
model.compile(optimizer='adam',
    loss = 'sparse_categorical_crossentropy',
    metrics=['acc'],
)

In [None]:
model.summary()

In [None]:
history = model.fit(
    training_images,
    training_labels,
    validation_split=0.1,
    epochs=50,
    shuffle=False,
    batch_size=2048*8,
)

Once the training is done, the accuracy values for the final epoch are shown at the end of the log.

Remember, the `acc` value is computed on the training set, so it is an optimistic estimate; `val_acc` is computed on the held-out validation split.

In [None]:
history = model.fit(training_images, training_labels,
    validation_split=0.1,
    epochs=10,
    shuffle=False,
    batch_size=2048*16,
)

In [None]:
print(history.history.keys())

In [None]:
# Plot training & validation accuracy values
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.grid()
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.grid()
plt.show()

In [None]:
model.summary()

In [None]:
784 * 128 + 128 + 10 * 128 + 10

Evaluate the network on unseen data

In [None]:
model.evaluate(test_images, test_labels,
    batch_size=2048*16
)

In [None]:
for i in range(0, 10):
    index = np.argwhere(test_labels == i).ravel()
    first_idx = index[0]
    plt.figure()
    plt.imshow(test_images.reshape(10000,28,28)[first_idx], cmap='gray')
    plt.title('True class : {}, Predicted class : {}'.format(i, np.argmax(model.predict(test_images[first_idx,np.newaxis]))))
    plt.show()

In [None]:
# Predict all test images once, then show the first misclassified sample of each class
predicted_labels = np.argmax(model.predict(test_images), axis=1)

for i in range(0, 10):
    wrong = np.argwhere((test_labels == i) & (predicted_labels != i)).ravel()
    if wrong.size == 0:
        # Every sample of this class was classified correctly
        print('No misclassified sample for class {}'.format(i))
        continue
    first_idx = wrong[0]
    plt.figure()
    plt.imshow(test_images.reshape(10000,28,28)[first_idx], cmap='gray')
    plt.title('True class : {}, Predicted class : {}'.format(i, predicted_labels[first_idx]))
    plt.show()

# Neural networks for image reconstruction

In [None]:
(X_train, _), (X_test, _) = tf.keras.datasets.mnist.load_data()

In [None]:
X_train.shape

In [None]:
image_shape = X_train.shape[1:]
image_shape

In [None]:
plt.imshow(X_train[0, ...], cmap='gray')

In [None]:
def prepare_train_data(X_train):
    image_shape = X_train.shape[1:]
    
    # Neural networks work best with normalized data
    X_train_normalized = X_train / 255
    # Since our goal is to reconstruct the original images, the reference is the same array
    X_reference = X_train_normalized.reshape(-1, image_shape[0] * image_shape[1])
    
    return X_train_normalized, X_reference

In [None]:
X_train_normalized, X_reference = prepare_train_data(X_train)

In [None]:
def build_model(X_train_normalized, X_reference, hidden_layer_size):
    image_shape = X_train_normalized.shape[1:]
    
    # Define the neural network architecture
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
        tf.keras.layers.Dense(image_shape[0] * image_shape[1], activation='relu'),
    ])
    
    # Choose an optimizer and a loss function
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=tf.keras.losses.MeanSquaredError(),
    )
    
    # Train the model on data
    model.fit(
        X_train_normalized,
        X_reference,
        epochs=10,
        batch_size=64)
    
    return model

In [None]:
HIDDEN_LAYER_SIZE = image_shape[0] * image_shape[1]

In [None]:
model = build_model(X_train_normalized, X_reference, HIDDEN_LAYER_SIZE)

In [None]:
def reconstruct_image(model, img):
    image_shape = img.shape
    
    # Normalize the input image
    img_normalized = img / 255
    img_normalized = img_normalized.reshape(1, image_shape[0], image_shape[1])

    # Use the model to create a reconstruction
    reconstructed_img_normalized = model.predict(img_normalized)
    
    # Undo the normalization for plotting purposes
    reconstructed_img_normalized = reconstructed_img_normalized.reshape(image_shape)
    reconstructed_img = reconstructed_img_normalized * 255
    
    return reconstructed_img

In [None]:
def compare_reconstruction(model, img):
    plt.imshow(img, cmap='gray')
    plt.title('Original image')
    plt.show()
   
    reconstructed_img = reconstruct_image(model, img)

    plt.imshow(reconstructed_img, cmap='gray')
    plt.title('Reconstructed image')
    plt.show()

In [None]:
IMG_ID = 42
img = X_test[IMG_ID, ...]

compare_reconstruction(model, img)

# Resources

* https://github.com/lmoroney/dlaicourse/blob/master/Course%201%20-%20Part%202%20-%20Lesson%202%20-%20Notebook.ipynb
* https://github.com/fonnesbeck/Bios8366/blob/master/notebooks/Section7_1-Introduction-to-Tensorflow.ipynb
* https://github.com/fonnesbeck/Bios8366/blob/master/notebooks/Section7_2-Neural-Networks.ipynb
* https://github.com/fonnesbeck/Bios8366/tree/master/notebooks
* https://github.com/mbadry1/DeepLearning.ai-Summary/tree/master/1-%20Neural%20Networks%20and%20Deep%20Learning
