## Section 1
Hybrid models - combine several different AI models into a single system
* CNN, MDN, GA, VAE, RNN, ES
* Very powerful but computationally expensive to train

**Useful resources:**
<br>
worldmodels.github.io
<br>
blog.otoro.net

## Section 2

**Neurons**
<br>
A simplified model of a human (biological) neuron
<br>
Takes in input signals and has an output signal
<br>
Inputs are usually normalized
<br>
Weights - connections between the inputs and the neuron; adjusting them is how the network learns
<br>
Each input value is multiplied by its weight, the products are summed, and the sum is then passed through an activation function
<br>
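
A minimal sketch of the neuron described above, assuming NumPy and made-up input and weight values. The `bias` term is not part of these notes and is set to 0 here; the sigmoid is just one possible activation function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs, passed through an activation function."""
    weighted_sum = np.dot(weights, inputs) + bias
    return activation(weighted_sum)

# Hypothetical example values: three normalized inputs and their weights.
inputs = np.array([0.5, 0.1, 0.9])
weights = np.array([0.4, -0.6, 0.2])
bias = 0.0  # assumption: no bias, to match the description above

output = neuron_output(inputs, weights, bias, sigmoid)
print(output)  # a single output signal between 0 and 1
```
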
**Activation Functions**
<br>
Threshold: values < 0 are set to 0; values >=0 are set to 1
<br>
Sigmoid: 1 / (1 + e^(-x))
* S shaped curve bounded by 0 and 1
<br>
Rectifier: values < 0 are set to 0; values >=0 follow y=x line
<br>
Hyperbolic tangent: (1 - e^(-2x)) / (1 + e^(-2x))
* Similar to sigmoid but the S shaped curve is bounded between -1 and 1
<br>
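
The four activation functions above can be sketched in NumPy as follows. The function names are my own choice for illustration; the formulas are the ones listed in this section.

```python
import numpy as np

def threshold(x):
    # values < 0 become 0, values >= 0 become 1
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # S-shaped curve bounded by 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    # values < 0 become 0, values >= 0 follow the y = x line
    return np.maximum(0.0, x)

def hyperbolic_tangent(x):
    # S-shaped curve bounded by -1 and 1
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

x = np.array([-2.0, 0.0, 2.0])
for fn in (threshold, sigmoid, rectifier, hyperbolic_tangent):
    print(fn.__name__, fn(x))
```
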

**Neural Networks**
<br>
Hidden layers allow the network to pick out specific attributes
<br>
Two different approaches: supervised learning (providing the AI with labeled data) vs. unsupervised learning (having the AI learn on its own)
<br>
Cost function tells us the error in our prediction (which we try to minimize)
<br>
Weights are updated according to the cost function
<br>
Process: inputs are fed into the network -> weights are applied -> output is generated -> output is compared to the actual value -> if the cost/error/loss is significant: weights are adjusted -> inputs are fed into the network again -> process repeats
<br>
Gradient descent: repeatedly adjusts the weights in the direction that lowers the cost function the fastest (the negative gradient) until a minimum is reached
<br>
Stochastic gradient descent: the noisier, per-input updates can help it escape a local minimum that batch gradient descent may get stuck in, although finding the global minimum is still not guaranteed
* Weights are readjusted after each input whereas in normal gradient descent (batch gradient descent) all inputs are fed into the network and THEN the weights are readjusted
* Each update is much cheaper to compute, so in practice it often makes progress faster than batch gradient descent
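
A rough sketch of the difference between the two update schedules, assuming a toy problem with a single weight `w` and a squared-error cost. The data, learning rates, and epoch counts are made up for illustration.

```python
import numpy as np

data_rng = np.random.default_rng(0)
x = data_rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x  # hypothetical target: the "true" weight is 3.0

def batch_gradient_descent(x, y, lr=0.5, epochs=100):
    w = 0.0
    for _ in range(epochs):
        # All inputs are fed in first, THEN the weight is adjusted once.
        grad = np.mean(2.0 * (w * x - y) * x)
        w -= lr * grad
    return w

def stochastic_gradient_descent(x, y, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(epochs):
        # The weight is adjusted after every single input.
        for i in rng.permutation(len(x)):
            grad = 2.0 * (w * x[i] - y[i]) * x[i]
            w -= lr * grad
    return w

w_batch = batch_gradient_descent(x, y)
w_sgd = stochastic_gradient_descent(x, y)
print(w_batch, w_sgd)  # both should end up close to 3.0
```
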

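Backpropagation: propagates the error backward through the network (using the chain rule) to work out how much each weight contributed to it, so that all the weights can be readjusted at once in order to minimize the cost function and better approach the desired behavior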
<br>
Epoch: one complete run through of all input values through the network
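
The training process above (forward pass, cost, backpropagation, weight update, repeated per epoch) can be sketched for a tiny network. Everything specific here is an assumption for illustration: the XOR dataset, 4 hidden neurons, sigmoid activations, mean squared error as the cost, and the learning rate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR: a small labeled dataset (supervised learning).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))  # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))  # hidden -> output weights
b2 = np.zeros(1)
lr = 0.5
losses = []

for epoch in range(2000):  # one epoch = one pass over all inputs
    # Forward pass: inputs -> hidden layer -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(np.mean((out - y) ** 2))  # cost function (MSE)

    # Backward pass: chain rule from the cost back to each weight
    d_out = 2.0 * (out - y) / len(X) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)

    # Update all weights at once
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(losses[0], losses[-1])  # the cost should go down over the epochs
```
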

## Section 3

**Convolutional Neural Networks**
<br>
Feature maps are created by sliding kernels (filters) over the original image
* Kernels are small grids of numbers that each represent a feature
* As the kernel passes over the image, how strongly each part of the image matches the kernel is calculated and stored in the feature map
* Purpose is to decrease the size of the image while preserving special features and the relationships among pixels
* Feature map is then passed through ReLU (rectified linear unit) - an activation function
<br>
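
A minimal sketch of building a feature map and passing it through ReLU, assuming stride 1, no padding, and a made-up 4x4 image and 2x2 kernel. Note that, like most CNN libraries, this computes a sliding element-wise product and sum (cross-correlation) without flipping the kernel.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    result = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            result[i, j] = np.sum(patch * kernel)  # match strength
    return result

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical image with a vertical edge between columns 1 and 2.
image = np.array([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
# Hypothetical kernel that responds to dark-to-bright vertical edges.
kernel = np.array([
    [-1.0, 1.0],
    [-1.0, 1.0],
])

fmap = relu(feature_map(image, kernel))
print(fmap)  # 3x3 map, smaller than the 4x4 image; peaks where the edge is
```
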

**Processing**
<br>
Pooling: Downsamples the feature map so that the CNN can still detect certain features despite small distortions (feature is slightly rotated, shifted, smaller than what the CNN trained on, etc.)
<br>
Max pooling: A small grid is passed over the original feature map to produce a smaller, pooled feature map
* The biggest number in the grid is extracted and put into the corresponding slot of the pooled feature map
* Removing excess information this way is useful because it helps prevent overfitting

Mean pooling: Instead of taking the maximum value in the grid, take the average of all the values
<br>
Flattening: Take the values of the pooled feature map and align them into a column
<br>
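
A small sketch of max pooling, mean pooling, and flattening, assuming non-overlapping 2x2 windows and a made-up 4x4 feature map. The helper name `pool` is hypothetical.

```python
import numpy as np

def pool(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            out[i, j] = reduce_fn(window)
    return out

fmap = np.array([
    [1.0, 0.0, 2.0, 3.0],
    [4.0, 6.0, 6.0, 8.0],
    [3.0, 1.0, 1.0, 0.0],
    [1.0, 2.0, 2.0, 4.0],
])

max_pooled = pool(fmap, 2, np.max)    # [[6, 8], [3, 4]]
mean_pooled = pool(fmap, 2, np.mean)  # [[2.75, 4.75], [1.75, 1.75]]

# Flattening: line the pooled values up in a single column.
flattened = max_pooled.reshape(-1, 1)
print(flattened)
```
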
The final steps involve taking the column of values and passing them as inputs into an artificial neural network
<br>
When performing classification tasks, the output layer learns which neurons from the final fully connected layer (hidden layer in ANN terms) are important/relevant
<br>
* If certain neurons are producing a high value and they lead to a correct classification, then the network will listen to those specific neurons more when classifying
* Conversely, if certain neurons are generating a high value but they lead to an incorrect classification, then those neurons will be given less weight
<br>

**Performance/Error Analysis**
<br>
Softmax function: Takes a set of values and rescales them so that each one lies between 0 and 1 and their collective sum equals 1
<br>
Important for CNN classification tasks (and other applications) in order to generate probabilities that make sense (Dog: 95%, Cat: 5% rather than Dog: 80%, Cat: 45%, which does not add up to 100%)
<br>
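
A minimal softmax sketch, assuming three made-up raw output scores. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw network outputs
probs = softmax(scores)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # the probabilities add up to 1
```
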
Cross-entropy function: the loss function typically used for classification networks such as CNNs, usually together with softmax (it plays the same role as the cost function in the ANN section)
* Cross-entropy is also great at registering improvements that look subtle or negligible under MSE but are in reality huge (e.g. the predicted probability for the correct class going from 0.0000001 -> 0.0001)
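
To make the 0.0000001 -> 0.0001 example concrete, here is a small comparison. It assumes those numbers are the predicted probability of the correct class, and uses the single-sample forms of each loss: `-log(p)` for cross-entropy and `(1 - p)^2` for squared error.

```python
import math

def cross_entropy(p_correct):
    # loss when the correct class is predicted with probability p_correct
    return -math.log(p_correct)

def squared_error(p_correct):
    # squared distance between the prediction and the target of 1.0
    return (1.0 - p_correct) ** 2

before, after = 1e-7, 1e-4

ce_drop = cross_entropy(before) - cross_entropy(after)  # about 6.9
se_drop = squared_error(before) - squared_error(after)  # about 0.0002
print(ce_drop, se_drop)
```

The prediction improved by a factor of 1000, which cross-entropy registers as a large drop in loss while squared error barely moves.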



## Section 4

**Autoencoders**
<br>
Autoencoders aim to make the output layer generate values similar to the input layer
* Encode data to make inputs take up less space
* Important features are extracted while less important features are disregarded
* The result is a compact set of important features that takes up less space than the original input, since the less important features have been dropped
* Uses ANNs as usual
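
A bare-bones autoencoder sketch, under several simplifying assumptions: made-up data with 6 features that depend on only 2 underlying factors, a 2-node hidden layer, no activation functions (a purely linear network), and plain gradient descent on the mean squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 6 features driven by 2 hidden factors,
# so a 2-node hidden layer is enough to capture the important structure.
factors = rng.normal(size=(200, 2))
X = factors @ rng.normal(size=(2, 6))

W_enc = 0.1 * rng.normal(size=(6, 2))  # input (6) -> hidden code (2)
W_dec = 0.1 * rng.normal(size=(2, 6))  # hidden code (2) -> output (6)
lr = 0.05
losses = []

for epoch in range(500):
    code = X @ W_enc      # encode: compress each input to 2 numbers
    X_hat = code @ W_dec  # decode: try to rebuild the original input
    error = X_hat - X
    losses.append(np.mean(error ** 2))

    # Gradients of the mean squared reconstruction error
    d_out = 2.0 * error / error.size
    grad_dec = code.T @ d_out
    grad_enc = X.T @ (d_out @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(losses[0], losses[-1])  # reconstruction error should decrease
```
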

Overcomplete hidden layers - more nodes in the hidden layers than the input layer
* Slightly confused on the significance of this
* Problem: network can cheat by simply passing information from the input layer directly to the next node in the hidden layer and then into the output layer
* Various types of autoencoders exist to address this issue

**Types of autoencoders**
<br>
Sparse autoencoders
* Very popular type of autoencoder
* At any time, the autoencoder can only use a portion of the total number of hidden nodes (ex: nodes ABC out of the 8 total nodes for the first pass, nodes DEF out of the 8 total nodes for the second pass, nodes ACF out of the 8 total nodes for the third pass, etc)
* Which hidden nodes are activated/deactivated changes from pass to pass, depending on the input
* The number of hidden nodes activated at any given time will be less than the number of nodes in the input layer
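
One way to picture "only a portion of the hidden nodes is used" is to keep just the `k` strongest hidden activations and zero out the rest. This top-k masking is only one possible mechanism and is an assumption here; many sparse autoencoders instead add a sparsity penalty to the loss. The activation values are made up.

```python
import numpy as np

def keep_top_k(hidden, k):
    """Zero out all but the k largest hidden activations."""
    sparse = np.zeros_like(hidden)
    top = np.argsort(hidden)[-k:]  # indices of the k largest values
    sparse[top] = hidden[top]
    return sparse

# Hypothetical activations of 8 hidden nodes for one input.
hidden = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.4])
sparse_hidden = keep_top_k(hidden, k=3)
print(sparse_hidden)  # only nodes 1, 3 and 5 stay active
```
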

Denoising autoencoders
* Some input nodes are randomly set to 0, and the network is trained to reconstruct the original, uncorrupted input
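
A small sketch of the corruption step, assuming each input node is independently zeroed with probability `drop_prob` (a made-up parameter name and value).

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Randomly set some input values to 0."""
    keep_mask = rng.random(x.shape) >= drop_prob  # True = keep the value
    return x * keep_mask

rng = np.random.default_rng(0)
clean = np.array([0.2, 0.7, 0.5, 0.9, 0.1, 0.6])  # hypothetical input
noisy = corrupt(clean, drop_prob=0.5, rng=rng)

# Training pair: the network sees `noisy` but is scored against `clean`.
print(noisy)
```
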

Contractive autoencoders
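* A penalty is added to the loss that discourages the network from simply copying the input into the output; it penalizes hidden-layer values that are overly sensitive to small changes in the input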
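
A sketch of what such a penalty can look like, assuming a single sigmoid hidden layer with weights `W` of shape (hidden, input) and bias `b` (all hypothetical values). The penalty used here is the sum of squared partial derivatives of the hidden values with respect to the inputs, which is the usual formulation for contractive autoencoders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contractive_penalty(x, W, b):
    """Sum of squared d(hidden_j)/d(input_i) for a sigmoid encoder."""
    h = sigmoid(W @ x + b)
    # For a sigmoid layer: dh_j/dx_i = h_j * (1 - h_j) * W[j, i]
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))  # 5 inputs -> 3 hidden nodes
b = rng.normal(size=3)
x = rng.normal(size=5)

penalty = contractive_penalty(x, W, b)
print(penalty)  # added (with some weighting) to the reconstruction loss
```
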

Stacked autoencoders
* Adds an additional hidden layer, so the data goes through more than one stage of encoding

Deep autoencoders
* Restricted boltzmann machines stacked on top of each other (?)

## Section 5

**Variational Autoencoders**
<br>
Variational autoencoders are a type of autoencoder where "dreams" are created to help the AI learn
* These "dreams" are just variations in training that improve the learning process
* Ex: If an AI is learning how to drive a car through a race track, then the "dreams" may feature altered versions of the race track which help the AI learn in various environments

**Structure**
<br>
Instead of passing inputs into the latent vector (the compressed hidden layer), the inputs are first passed into a mean vector and a standard deviation vector
* The values from those two vectors are then passed into the latent vector, which is then passed into the output layer
* This process enables the values passed into the latent vector to be stochastic/varying

Reparameterization
<br>
Since the latent vector is randomly sampled and will vary every time, backpropagation cannot pass gradients through the sampling step and fails
* To solve this, the "randomness" component is separated from the neural network
* This way, backpropagation can continue to improve the network while randomness is being introduced as a separate component
* Before: Input -> Mean + Standard Deviation -> Latent (sampled randomly) -> Output
* After: Input -> Mean + Standard Deviation + Random (a separate input) -> Latent -> Output
* This way, part of what the neural network has to learn is how to adapt to the randomness being introduced
* This is where the "dreams" come from (I believe); the network "dreams", or creates variations in its training environment to help better its learning
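
A minimal sketch of the reparameterization idea, assuming the usual form `latent = mean + std * epsilon`, where `epsilon` is standard normal noise drawn outside the network. The mean and standard deviation values are made up; in a real VAE they would come from the encoder.

```python
import numpy as np

def sample_latent(mean, std, rng):
    # The randomness lives in epsilon, separate from the network outputs.
    epsilon = rng.standard_normal(mean.shape)
    return mean + std * epsilon, epsilon

rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0])  # hypothetical mean vector
std = np.array([0.1, 2.0])    # hypothetical standard deviation vector

z, epsilon = sample_latent(mean, std, rng)
# With epsilon treated as a fixed input, z is an ordinary function of
# mean and std: dz/dmean = 1 and dz/dstd = epsilon, so backpropagation
# can pass gradients through to the layers that produced mean and std.

samples = np.array([sample_latent(mean, std, rng)[0] for _ in range(10000)])
print(samples.mean(axis=0), samples.std(axis=0))  # close to mean and std
```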