# Applied AI: Deep Learning

Key components of cloud architecture for applied AI:
* Object Storage - for capacity, redundancy, automated backup and I/O performance
* Real-Time Data Stream - for real-time ingestion and model applications
* Scaling - how to scale models on GPU and CPU clusters
* Jupyter Notebooks - to create models
* Deep Learning Framework - high or low level options to implement/test neural networks (ex: Keras)
* Open Neural Network Exchange Formats - to facilitate import/export between frameworks (ex: ONNX)
* Execution Environment - for large scale parallel execution (ex: DeepLearning4J or Apache SystemML on top of Apache Spark)

Popular deep learning frameworks include Keras, CNTK and Theano (deprecated). Keras, the de facto choice, uses TensorFlow as its execution engine and can export models for use by other frameworks, such as DeepLearning4J and Apache SystemML, via open exchange formats such as ONNX.

### Neural Networks

Given an input vector and associated weight (parameter) vectors, a neural network is trained to minimize the error between its estimate $\hat{y}$ and the actual output $y$. This error is measured by the cost function $J$, and training minimizes it by optimizing the weight vectors over all of the training data.

Although grid search and Monte Carlo methods can be used to search for good weights, they are far too computationally expensive for practical applications. Instead, **gradient descent** iteratively refines the weights in the "downhill" direction along the hypersurface of the cost function. That is, the parameters $\theta$ are updated at each timestep $t$ such that $\theta_{t+1} = \theta_t - \eta\nabla_\theta J(\theta_t, X, Y)$, where $\eta$ is a chosen learning rate, $\nabla_\theta J$ is the gradient of the cost function with respect to the parameters, $X$ is the input data and $Y$ is the corresponding output data. Variants seek to improve upon vanilla gradient descent by computing the gradient on individual points (**Stochastic Gradient Descent**) or on small batches of points (**Mini-Batch Gradient Descent**) rather than on the entire dataset.
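
The update rule can be sketched in plain Python. This is a minimal illustration, not a production optimizer: it assumes a one-feature linear model $\hat{y} = wx + b$ with a mean squared error cost, and the names `gradient`, `descend`, `batch_size` and the toy dataset are all invented for this example.

```python
import random

def gradient(theta, xs, ys):
    """Gradient of the mean squared error J for the model y_hat = w*x + b."""
    w, b = theta
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def descend(xs, ys, eta=0.05, steps=2000, batch_size=None, seed=0):
    """Apply theta <- theta - eta * gradient repeatedly.

    batch_size=None uses the entire dataset (vanilla gradient descent),
    batch_size=1 is stochastic, anything in between is mini-batch.
    """
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    indices = list(range(len(xs)))
    for _ in range(steps):
        batch = indices if batch_size is None else rng.sample(indices, batch_size)
        bx = [xs[i] for i in batch]
        by = [ys[i] for i in batch]
        dw, db = gradient(theta, bx, by)
        theta = (theta[0] - eta * dw, theta[1] - eta * db)
    return theta

# Toy data generated from y = 2x + 1 (an assumption made for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_full, b_full = descend(xs, ys)                # full-batch
w_mini, b_mini = descend(xs, ys, batch_size=2)  # mini-batch
```

Both runs should recover weights close to $w = 2$ and $b = 1$. The mini-batch run takes a noisier path because each step only sees two points.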

Why neural networks? Linear machine learning models are limited to linear functions, whereas neural networks can model non-linear relationships and can therefore be used when data isn't linearly separable.

#### Deep Feedforward Neural Networks

The simplest type of neural network is called a **perceptron**, which computes a weighted sum of its inputs (the dot product of an input vector and a weights vector) and passes the result into a step activation function. This system acts as a binary linear classifier and is used to approximate some function. Deep feedforward neural networks consist of multilayer perceptrons, with information flowing in the forward direction, through one or more hidden layers, in order to calculate the function output.
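
A single perceptron can be sketched as follows. The training loop uses the classic perceptron learning rule, which the text above does not describe, so treat it as an assumption of this example. The function names and the `AND`/`XOR` datasets are likewise illustrative.

```python
def step(z):
    """Step activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    """Weighted sum of the inputs passed through the step function."""
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge the weights toward misclassified targets."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# AND is linearly separable, XOR is not.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and = train_perceptron(and_gate)
and_predictions = [predict(w_and, b_and, x) for x, _ in and_gate]

w_xor, b_xor = train_perceptron(xor_gate)
xor_predictions = [predict(w_xor, b_xor, x) for x, _ in xor_gate]
```

The perceptron learns `AND` exactly, but no choice of weights lets it reproduce `XOR`, because a single linear boundary cannot separate those four points.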

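Given enough neurons, a feedforward neural network with even a single hidden layer can approximate any continuous function to arbitrary accuracy (the Universal Approximation Theorem). However, being able to represent a function does not make it practical to learn: a single hidden layer may need an impractically large number of neurons, which is why deeper networks are used in practice.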
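
To illustrate how a hidden layer extends what a network can represent, the sketch below runs a forward pass through a two-layer network that computes `XOR`, a function a single perceptron cannot represent. The weights are picked by hand for this example rather than learned, and the function names are hypothetical.

```python
def step(z):
    """Step activation, as used by a perceptron."""
    return 1 if z >= 0 else 0

def xor_network(x1, x2):
    """Forward pass: inputs -> hidden layer (two neurons) -> output neuron."""
    # Hidden neuron 1 behaves like OR, hidden neuron 2 like AND.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output fires when OR is on but AND is off, which is exactly XOR.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

xor_outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# xor_outputs == [0, 1, 1, 0]
```

Information only flows forward here: each layer's output feeds the next layer, with no feedback connections.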

#### Convolutional Neural Networks

Convolutional neural networks are favored for image classification because of their lower computational cost, which comes from reusing a small set of filter weights across the whole image, and their ability to capture dependencies between pixels throughout the image.
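
The core operation can be sketched in a few lines. This is a simplified example assuming a single-channel image, no padding and a stride of 1. The `conv2d` helper and the edge-detecting kernel are illustrative and not taken from any framework.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 4x4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# The same four weights are reused at every position (weight sharing).
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The filter responds only where neighboring pixels change from dark to bright. It needs 4 weights, whereas a fully connected layer mapping the 16 input pixels to the 9 outputs would need 144.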

#### Recurrent Neural Networks

While deep feedforward networks are effective at learning functions, they do not work well with sequences or time series data - enter RNNs. In these networks, feedback connections between neurons pass back temporal information, giving the system a form of memory.
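
The feedback idea can be sketched with a single recurrent neuron. This is a minimal example assuming scalar inputs and a `tanh` activation. The weights `w_x`, `w_h` and `b` are arbitrary values chosen for illustration.

```python
import math

def rnn_forward(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Process a sequence one step at a time, feeding the hidden state back in."""
    h = 0.0  # hidden state: the network's memory of earlier inputs
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

with_early_signal = rnn_forward([1.0, 0.0, 0.0])
without_signal = rnn_forward([0.0, 0.0, 0.0])
```

Both sequences end with the same input, yet the final hidden states differ: the first sequence still carries a fading trace of the `1.0` seen two steps earlier.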

#### Long Short-Term Memory Networks

Long short-term memory networks (LSTMs) map an input vector to an output vector using weights and an activation function along with additional components, including an input gate, an output gate and a forget gate. Data flows through a central node called the cell state, which is the memory of the neuron.

The input vector serves not only as input to the neuron but also as input to the input gate. This gate has its own weights vector, which enables it to modulate the influx of information into the cell state. Likewise, the output gate, which controls the output to downstream neurons, takes the input vector and the current cell state and applies a weights vector. Finally, the neuron needs a way to forget the cell state, hence the addition of a forget gate. This gate takes the input vector and the current cell state and controls how much of the prior state is preserved.
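
The gating logic can be sketched for a single scalar cell. Note one assumption: this follows the commonly used LSTM formulation, in which each gate reads the input and the previous output `h_prev` rather than the cell state directly. The parameter names, the `make_params` helper and the bias values are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell; returns (output h, cell state c)."""
    def gate(name, activation):
        w, u, b = params[name]  # each gate has its own weights and bias
        return activation(w * x + u * h_prev + b)

    i = gate("input", sigmoid)         # how much new information to let in
    f = gate("forget", sigmoid)        # how much of the prior state to keep
    o = gate("output", sigmoid)        # how much of the state to expose
    g = gate("candidate", math.tanh)   # proposed new information
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # output passed to downstream neurons
    return h, c

def make_params(forget_bias, input_bias):
    """Hypothetical helper: fixed weights, adjustable gate biases."""
    return {
        "input": (1.0, 1.0, input_bias),
        "forget": (1.0, 1.0, forget_bias),
        "output": (1.0, 1.0, 0.0),
        "candidate": (1.0, 1.0, 0.0),
    }

# Forget gate wide open, input gate shut: the memory is carried over.
remember = make_params(forget_bias=20.0, input_bias=-20.0)
# Forget gate shut, input gate shut: the memory is wiped.
forget = make_params(forget_bias=-20.0, input_bias=-20.0)

_, c_kept = lstm_step(x=0.5, h_prev=0.0, c_prev=0.8, params=remember)
_, c_dropped = lstm_step(x=0.5, h_prev=0.0, c_prev=0.8, params=forget)
```

With the extreme biases used here, `c_kept` stays very close to the prior state of `0.8` while `c_dropped` falls to roughly zero. In a trained network the gates take intermediate values learned from data.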

#### Autoencoders

Autoencoders map an input vector to itself via a bottlenecking architecture. In other words, they attempt to reconstruct a dataset by mimicking the identity function. Since intermediary layers have fewer neurons than outer layers, data must be compressed, forcing the network to learn an efficient compression. Autoencoders can outperform longstanding dimensionality reduction techniques, including PCA (linear) and t-distributed Stochastic Neighbor Embedding, aka t-SNE (non-linear).

One application of autoencoders is anomaly detection. Since the network must learn how to reconstruct the training data, a failure to reconstruct subsequent data suggests that the new data is anomalous.
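
The idea can be sketched with a deliberately tiny autoencoder. This is a simplified example under several assumptions: the network is linear with no biases, it compresses 2 values into 1, it is trained with plain gradient descent, and the "normal" data all lies on the line $y = 2x$. The dataset, starting weights and the threshold of `0.5` are made up for illustration.

```python
def reconstruct(x, enc, dec):
    """Encode 2 values into 1 (the bottleneck), then decode back to 2."""
    code = sum(e * xi for e, xi in zip(enc, x))
    return [d * code for d in dec]

def reconstruction_error(x, enc, dec):
    """Euclidean distance between a point and its reconstruction."""
    x_hat = reconstruct(x, enc, dec)
    return sum((xi - ri) ** 2 for xi, ri in zip(x, x_hat)) ** 0.5

def train_autoencoder(data, eta=0.02, steps=2000):
    """Minimize the mean squared reconstruction error by gradient descent."""
    enc, dec = [0.5, 0.5], [0.5, 0.5]
    n = len(data)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            code = sum(e * xi for e, xi in zip(enc, x))
            resid = [d * code - xi for d, xi in zip(dec, x)]  # x_hat - x
            back = sum(r * d for r, d in zip(resid, dec))     # error sent back to the encoder
            for j in range(2):
                g_dec[j] += 2 * resid[j] * code / n
                g_enc[j] += 2 * back * x[j] / n
        enc = [e - eta * g for e, g in zip(enc, g_enc)]
        dec = [d - eta * g for d, g in zip(dec, g_dec)]
    return enc, dec

# Training data: points on the line y = 2x.
normal_data = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]
enc, dec = train_autoencoder(normal_data)

normal_error = reconstruction_error((0.75, 1.5), enc, dec)   # unseen, same pattern
anomaly_error = reconstruction_error((1.0, -0.5), enc, dec)  # off the line

threshold = 0.5  # arbitrary cut-off chosen for this example
is_anomaly = anomaly_error > threshold
```

The unseen point that follows the training pattern is reconstructed almost perfectly, while the off-pattern point cannot pass through the bottleneck intact, so its reconstruction error is large. In practice the threshold would be calibrated on held-out normal data.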