# Applied AI: Deep Learning

Key components of cloud architecture for applied AI:
* Object Storage - for capacity, redundancy, automated backup and I/O performance
* Real-Time Data Stream - for real-time ingestion and model applications
* Scaling - how to scale models on GPU and CPU clusters
* Jupyter Notebooks - to create models
* Deep Learning Framework - high or low level options to implement/test neural networks (ex: Keras)
* Open Neural Network Exchange Formats - to facilitate import/export between frameworks (ex: ONNX)
* Execution Environment - for large scale parallel execution (ex: DeepLearning4J or Apache SystemML on top of Apache Spark)

Popular deep learning frameworks include Keras, CNTK and Theano (deprecated). Keras, the de facto choice, uses TensorFlow as its execution engine and can export models to be ingested by other frameworks, such as DeepLearning4J and Apache SystemML, via open standard exchange formats such as ONNX.

## Neural Networks

Given an input vector and associated weight (parameter) vectors, a neural network is trained to minimize the error between its estimate $\hat{y}$ and the actual output $y$. This error is measured by the cost function $J$, and training minimizes it by optimizing the weight vectors over all training data.

Although grid search (brute force) and Monte Carlo methods can be used to determine the correct weights, they are far too computationally expensive for practical applications. Instead, **gradient descent** is used to iteratively refine the weights in the "downhill" direction along the hypersurface of the cost function. That is, the parameters $\theta$ are updated at each timestep $t$ such that $\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t, X, Y)$, where $\eta$ is a chosen learning rate, $\nabla_\theta J(\theta_t, X, Y)$ is the gradient of the cost function with respect to the parameters, $X$ is the training inputs and $Y$ is the corresponding outputs. Other methods seek to improve upon vanilla gradient descent:

* **Stochastic Gradient Descent**: updates the weights using the gradient of a single training example at a time
  * More efficient, can be used for online data/streaming
  * Requires tuning of the learning rate to ensure convergence; otherwise, tends to bounce around
* **Mini-Batch Gradient Descent**: updates the weights using the gradient of a small batch of training examples (typically between 50 and 256)
  * Both efficient and stable in convergence
  * Takes advantage of fast, vectorized matrix calculations of the gradient
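The three variants differ only in how many training examples feed each update. The following is a minimal NumPy sketch on a synthetic linear regression problem; the helper names `gradient` and `fit`, the mean squared error cost, the learning rates and the data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of the cost J = mean((X @ theta - y)^2) with respect to theta."""
    residual = X @ theta - y
    return 2.0 * X.T @ residual / len(y)

def fit(X, y, batch_size, eta=0.05, epochs=200, seed=0):
    """Gradient descent where batch_size selects the variant (illustrative helper).

    batch_size == len(y) -> vanilla (full-batch) gradient descent
    batch_size == 1      -> stochastic gradient descent
    anything in between  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle the examples every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            # theta_{t+1} = theta_t - eta * gradient of J on the current batch
            theta = theta - eta * gradient(theta, X[idx], y[idx])
    return theta

# Synthetic data: y = 3*x0 - 2*x1 plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 2))
true_theta = np.array([3.0, -2.0])
y = X @ true_theta + 0.01 * rng.normal(size=256)

theta_batch = fit(X, y, batch_size=len(y))
theta_sgd = fit(X, y, batch_size=1, eta=0.01, epochs=20)
theta_mini = fit(X, y, batch_size=64)
print(theta_batch, theta_sgd, theta_mini)
```

Note that the stochastic run uses a smaller learning rate than the other two, reflecting the tuning needed to keep single-example updates from bouncing around.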

Still, additional challenges remain, such as tuning/adapting the learning rate over time and for specific parameter updates. Other important algorithms that implement different updater strategies include:

* **Momentum**: accelerates SGD in the correct direction and smooths oscillations by carrying forward a "momentum term" from the previous timestep
* **Nesterov Accelerated Gradient**: gives Momentum a forward-looking intelligence, i.e. "smart ball"
* **Adagrad**: adapts learning rate to the parameters based on gradients that have been previously calculated, allowing us to deal with sparse data
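* **Adadelta**: a version of Adagrad that uses a decaying average of past squared gradients, so the learning rate does not shrink continuously
* **RMSprop**: similar to Adadelta; divides the learning rate by a decaying average of squared gradients
* **Adam**: RMSprop plus Momentum
* **AdaMax**: a variant of Adam based on the infinity norm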
* **Nadam**: Adam plus Nesterov Accelerated Gradient

These algorithms produce varying results in terms of convergence on optima, particularly around saddle points.
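To make two of these updaters concrete, here is a small NumPy sketch of Momentum and Adam minimizing a toy quadratic cost. The cost function, the hyperparameter values and the function names are assumptions for illustration, not settings recommended by any particular library.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy cost J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return np.array([1.0, 10.0]) * theta

def momentum(theta, eta=0.05, gamma=0.9, steps=300):
    v = np.zeros_like(theta)
    for _ in range(steps):
        # Carry forward a fraction gamma of the previous update (the momentum term)
        v = gamma * v + eta * grad(theta)
        theta = theta - v
    return theta

def adam(theta, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(theta)  # decaying average of gradients (Momentum-like)
    v = np.zeros_like(theta)  # decaying average of squared gradients (RMSprop-like)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

start = np.array([2.0, 2.0])
theta_momentum = momentum(start)
theta_adam = adam(start)
print(theta_momentum, theta_adam)  # both should end up close to the minimum at (0, 0)
```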

Selecting the correct activation function for a given problem requires knowledge of the problem's linearity. Non-linear functions can only be approximated with a non-linear activation function, such as sigmoid or tanh (similar to sigmoid, but with outputs ranging from -1 to 1 rather than 0 to 1). Multi-class classification applications also use softmax, which produces a vector of values that sum to one and can be read as class probabilities. ReLU (Rectified Linear Unit) is the most widely used activation function due to its simplicity; however, it can cause dead neurons, which output zero for every input and stop learning. In these cases leaky ReLU, which keeps a small slope in the negative range, can be used instead. Often it is best to start with ReLU for input and hidden layers and choose the output layer based on the task at hand: a linear output unit for a regressor, softmax or sigmoid for a classifier.
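The activation functions mentioned above are short enough to write out directly. This is a plain NumPy sketch; the leaky ReLU slope `alpha=0.01` is a commonly used value assumed here, not one taken from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)  # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)  # zero for negative inputs, identity otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope instead of a flat zero

def softmax(z):
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()  # non-negative values that sum to one

z = np.array([-2.0, 0.0, 3.0])
for fn in (sigmoid, tanh, relu, leaky_relu, softmax):
    print(fn.__name__, fn(z))
```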

Due to neural networks' inherent flexibility in approximating any function, they tend to fit datasets (and their noise) extremely well. To detect overfitting, always compare training and validation results; training accuracy should never be much higher than validation accuracy. To prevent it, you can use regularization, which penalizes large weights, and early stopping, which halts training once the validation error stops improving. Lastly, you can use dropout to randomly deactivate neurons at each training step, forcing the network to generalize.
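Two of these techniques can be sketched in a few lines of NumPy. The version of dropout below is the "inverted" form, which rescales the surviving activations so their expected value is unchanged; the `patience` rule for early stopping and the validation losses are made-up values for illustration only.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))
dropped = dropout(hidden, rate=0.2, rng=rng)
print("fraction of units dropped:", np.mean(dropped == 0))

val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]  # hypothetical validation losses
stop_at = early_stopping_epoch(val_losses, patience=2)
print("stop at epoch index:", stop_at)
```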

Why neural networks? Linear machine learning models are limited to linear functions; on the other hand, neural networks can be used when data isn't linearly separable. 

#### Deep Feedforward Neural Networks

The simplest type of neural network is called a **perceptron**, which consists of a linear combination of an input vector and a weights vector, passed into a step activation function. This system acts as a binary linear classifier and is used to approximate some function. Deep feedforward neural networks consist of multilayer perceptrons, with information flowing in the forward direction, through one or more hidden layers, in order to calculate the function output.
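As a rough illustration, a single perceptron can learn the logical AND function, which is linearly separable. The sketch below uses the classic perceptron learning rule; the function names, learning rate and number of epochs are assumptions for this example.

```python
import numpy as np

def predict(X, w, b):
    """Step activation applied to the linear combination of inputs and weights."""
    return (X @ w + b >= 0).astype(int)

def train_perceptron(X, y, eta=1.0, epochs=20):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, target in zip(X, y):
            error = target - predict(x_i, w, b)  # -1, 0 or +1
            w += eta * error * x_i               # nudge weights only on mistakes
            b += eta * error
    return w, b

# Truth table for logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y_and)
predictions = predict(X, w, b)
print(predictions)
```

The same procedure would not converge on XOR, which is not linearly separable; that limitation is what hidden layers address.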

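Feedforward neural networks can approximate any continuous function to arbitrary accuracy, even with a single hidden layer, provided it has enough neurons (the Universal Approximation Theorem). However, being able to represent a function does not make it learnable in practice: a single hidden layer may need an impractically large number of neurons, which is why deeper networks are used instead.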

#### Convolutional Neural Networks

Convolutional neural networks are favored for image classification due to their lower computational cost and their ability to capture dependencies between pixels throughout the image.
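The operation at the heart of these networks slides a small kernel over the image, reusing the same few weights at every position. Below is a naive NumPy sketch of a single-channel, stride-1 convolution without padding; the image and the edge-detecting kernel are hypothetical examples, and real frameworks use far more optimized implementations.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value depends only on a local patch of pixels
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hypothetical 3x3 kernel that responds to dark-to-bright vertical edges
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map)
```

The kernel has only 9 weights regardless of the image size, which is where much of the computational saving over a fully connected layer comes from.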

#### Recurrent Neural Networks

While deep feedforward networks are effective at learning functions, they do not work well with sequences or time series data - enter RNNs. In these networks, feedback connections between neurons pass back temporal information, giving the system a form of memory.
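The feedback connection can be sketched as a hidden state that is fed back in at every timestep. This is a minimal NumPy forward pass of a vanilla (Elman-style) RNN; the weight names, sizes and random initialization are assumptions for illustration, and no training is shown.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The previous hidden state h feeds back into the current step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden (feedback)
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(seq_len, input_size))
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)  # one hidden state per timestep
```

Because each state depends on the one before it, changing the first element of the sequence changes the final hidden state as well, which is the "memory" described above.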

#### Long Short-Term Memory Networks

LSTMs map an input vector to an output vector using weights and an activation function along with additional components, including an input gate, an output gate and a forget gate. Data flows through a central node called a cell state, which is the memory of the neuron.

The input vector is used not only as input to the neuron but also as input to the input gate. This gate has its own weights vector, which enables it to modulate the influx of information into the cell state. Likewise, the output gate, which controls the output to downstream neurons, takes the input vector and the current cell state and applies a weights vector. Finally, the neuron needs a way to forget the cell state, hence the addition of a forget gate. This gate is controlled by the input vector and the current cell state to control how much of the prior state is preserved.
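A single LSTM step can be sketched in NumPy as follows. Note the assumption: this follows the commonly used formulation in which the gates see the input vector and the previous output (hidden state), rather than reading the cell state directly as in "peephole" variants. The weight layout and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W has shape (4 * hidden, input + hidden); its rows are stacked as
    [input gate, forget gate, output gate, candidate values].
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate: how much new information enters
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate: how much prior state is kept
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much state is exposed
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values for the cell state
    c_t = f * c_prev + i * g               # updated cell state (the memory)
    h_t = o * np.tanh(c_t)                 # output passed to downstream neurons
    return h_t, c_t

hidden, inputs = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * hidden, inputs + hidden))
b = np.zeros(4 * hidden)
b[0:hidden] = -20.0           # force the input gate almost fully closed
b[hidden:2 * hidden] = 20.0   # force the forget gate almost fully open

c_prev = np.array([0.5, -1.0, 2.0])
h_t, c_t = lstm_step(np.ones(inputs), np.zeros(hidden), c_prev, W, b)
print(c_t)  # with these biases the cell state is carried over almost unchanged
```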

#### Autoencoders

Autoencoders map an input vector to itself via a bottleneck architecture. In other words, they attempt to reconstruct a dataset by mimicking the identity function. Since intermediate layers have fewer neurons than the outer layers, data must be compressed, forcing the network to learn an efficient compression. Autoencoders can outperform longstanding dimensionality reduction techniques, including PCA (linear) and t-distributed Stochastic Neighbor Embedding, aka t-SNE (non-linear).

One application of autoencoders is anomaly detection. Since the network must learn how to reconstruct the training data, if it fails to do so on subsequent data, it is likely that that data is anomalous.
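The anomaly detection idea can be sketched with a deliberately tiny model. The example below trains a linear autoencoder (3 inputs, 1 bottleneck unit, no activation function, which is a simplification) with plain gradient descent in NumPy; the synthetic data, the learning rate and the 99th-percentile threshold are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 3-D points that lie close to a single direction
direction = np.array([1.0, 2.0, 2.0]) / 3.0
X_train = np.outer(rng.normal(size=500), direction) + 0.05 * rng.normal(size=(500, 3))

# Linear autoencoder 3 -> 1 -> 3 (encoder and decoder weights)
W_enc = 0.1 * rng.normal(size=(3, 1))
W_dec = 0.1 * rng.normal(size=(1, 3))

eta = 0.05
for _ in range(2000):
    Z = X_train @ W_enc                 # compress through the bottleneck
    error = Z @ W_dec - X_train         # reconstruction minus input
    grad_dec = 2.0 * Z.T @ error / len(X_train)
    grad_enc = 2.0 * X_train.T @ (error @ W_dec.T) / len(X_train)
    W_dec -= eta * grad_dec
    W_enc -= eta * grad_enc

def reconstruction_error(x):
    """Squared reconstruction error for one point or a batch of points."""
    x = np.atleast_2d(x)
    return np.sum((x @ W_enc @ W_dec - x) ** 2, axis=1)

train_errors = reconstruction_error(X_train)
threshold = np.percentile(train_errors, 99)  # assumed cut-off for "anomalous"

x_normal = 1.5 * direction                # lies on the learned direction
x_anomalous = np.array([2.0, -1.0, 0.0])  # orthogonal to it
print(reconstruction_error(x_normal)[0] > threshold)     # expected: not flagged
print(reconstruction_error(x_anomalous)[0] > threshold)  # expected: flagged
```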

## TensorFlow

TensorFlow, originally created by Google Brain, is an open-source symbolic math library for tasks involving heavy numerical computations. While not limited to one specific field, its main application is machine learning, particularly deep neural networks. TensorFlow enables algorithms to run at scale across a cluster backed by CPUs, GPUs, TPUs and mobile devices.

In TensorFlow 1.x, every numerical computation is expressed as a graph, with nodes representing computations and edges representing the flow of multidimensional arrays (tensors). **Placeholders** enable us to feed data (ex: training data) into the computational graph at execution time, after the graph is constructed. **Variables** represent tensors whose values can be changed during training (ex: weights and biases). No computation takes place until the computational graph is constructed and a session is instantiated; the session deploys the graph onto an execution context (ex: CPU or GPU).

In a healthy neural network, accuracy and loss should be inversely related during training. Convergence to the local optimum can be visualized using TensorBoard and adjustments to the learning rate might be required to dampen oscillations. In assessing the weights histogram, extreme values indicate oversaturation, while a uniform distribution indicates insufficient parameter updates. Weights centered very close to zero mean that the gradients are very small.

A nice secondary output of TensorFlow is its so-called automatic differentiation. Since every operator registers the first derivative of its operation, TF can compute the derivative of any complex function by applying the chain rule.
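The idea behind automatic differentiation can be shown with a toy example in pure Python. This is a conceptual sketch only, not TensorFlow's implementation; the `Node` class and the `sin` and `backward` functions are hypothetical names. Each operation records its local first derivative, and a backward pass combines them with the chain rule.

```python
import math

class Node:
    """A value in a computational graph that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def sin(node):
    return Node(math.sin(node.value), ((node, math.cos(node.value)),))

def backward(output):
    """Propagate d(output)/d(node) to every node via the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse topological order: a node's gradient is complete before it is passed on
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x = Node(2.0)
y = Node(3.0)
f = x * y + sin(x)  # f(x, y) = x*y + sin(x)
backward(f)
print(x.grad, y.grad)  # df/dx = y + cos(x), df/dy = x
```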

#### TensorFlow Intro: MNIST Digit Recognition

```python
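# Note: this example targets the TensorFlow 1.x API (placeholders and sessions);
# the tensorflow.examples.tutorials module is not available in TensorFlow 2.x
from tensorflow.examples.tutorials.mnist import input_data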
import tensorflow as tf
import matplotlib.pyplot as plt

%matplotlib inline

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Preview a digit and a label
batch_xs, batch_ys = mnist.train.next_batch(1)
X = batch_xs
X = X.reshape([28, 28])
plt.gray()
print(batch_ys)
plt.imshow(X)

# Set up variables/placeholders
x = tf.placeholder(tf.float32, shape=[None,784]) # Training data
W = tf.Variable(tf.zeros([784,10])) # Weights
b = tf.Variable(tf.zeros([10])) # Biases
y_ = tf.placeholder(tf.float32, [None, 10]) # Training labels
y = tf.nn.softmax(tf.matmul(x,W) + b) # Create model

# Cross entropy cost function: the negative sum of actual labels multiplied by log(predicted values)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))

# Gradient descent with learning rate = 0.5
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

# Create a session, necessary to deploy a computation graph on a specific context
sess = tf.InteractiveSession()

# Initialize all variables
tf.global_variables_initializer().run()

# Gradient descent loop
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x:batch_xs, y_:batch_ys})
    
# Create boolean vector of equality between predictions and actuals
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

# Determine number of correctly predicted values
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Execute via session to calculate accuracy on test data
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
```

## Keras

Keras is a deep learning framework written in Python that is known for its strong user base and support documentation. Test data sets are widely available as are pre-built models. Keras can be hooked into a variety of backends including TensorFlow, CNTK and Theano.

There are two types of models in Keras: **Sequential** and **Model (non-sequential)**. To create a sequential model consisting of stacked layers:

* Instantiate a Sequential model
* Add layers, one by one
* Compile the model using a loss function (ex: mean squared error) and optimizer (ex: SGD)
* Fit the model to the training data
* Evaluate the model
* Apply the model to new data to generate predictions

Non-sequential models are built with the Keras functional API.

Keras models can be saved either as a complete model (architecture, weights and training configuration - HDF5) or as individual components (JSON or YAML).

#### Example Feedforward Network

```python
from keras.datasets import mnist
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout

batch_size = 128
num_classes = 10
epochs = 20

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255

y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

model = Sequential()

model.add(Dense(512, activation="relu", input_shape=(784,))) # 512 output
model.add(Dropout(0.2))
model.add(Dense(512, activation="relu")) # 512 output
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation="softmax")) # 10 output

model.summary()

model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

score = model.evaluate(x_test, y_test, verbose=0)          

print("Loss: {}, Accuracy: {}".format(score[0], score[1]))
```