## Chapter 2: Deep learning and language: the basics

## Basic Architectures of Deep Learning

### Deep Multilayer Perceptrons (MLP)
* Consists of layers of artificial neurons
    * Artificial Neurons: Mathematical functions that receive input from weighted connections to other neurons.
        * ANs produce output values through a variety of mathematical operations.
    * Typically a deep network will have many neurons with lots of weights to calculate!
    * While there is no magic number, networks with more than 2 layers may be deemed 'deep'.
* Sequential Model: A Keras container for a stacked set of layers, with facilities for defining those layers.
* This is the plain vanilla version of neural networks!
* Creating a Model:
    * Fully Connected: A dense layer connects every neuron in one layer to every neuron in the next layer.
    * Tensor: Container for numerical data
    * Important Features:
        * Batch Size: Determines the grouping of data points in batches that are handled collectively during training.
        * Loss Function: Computes the mismatch between the model's predictions and the labels it should assign according to the training data.
        * Numerical Optimizer Algorithm: Carries out the gradient descent process
        * Evaluation Metric: Performs intermediate evaluation of the model during training
            * Ex: Accuracy, MSE, etc.
        * Loss: The value returned by the loss function
        * Cost Function: Function that takes all of the current weights and determines the 'lousiness' of the weights based on the training examples.
        * Gradient Descent: Iterative procedure that minimizes the cost function by repeatedly adjusting the weights in the direction that lowers it.
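
To make the loss function, batch size, and gradient descent terms above concrete, here is a minimal sketch in plain Python (not Keras). It fits a single linear neuron `y = w * x + b` with mini-batch gradient descent on a mean squared error loss. The function names, the toy data, and the learning rate are assumptions chosen for illustration, not something taken from the chapter.

```python
def mse(predictions, targets):
    """Mean squared error: average squared mismatch between predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def train_linear_neuron(xs, ys, batch_size=2, learning_rate=0.05, epochs=500):
    """Fit y = w * x + b with mini-batch gradient descent on the MSE loss."""
    w, b = 0.0, 0.0  # starting weights (a real network would usually start from random values)
    for _ in range(epochs):  # one epoch = one full sweep through the training data
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            n = len(batch_x)
            # Prediction errors for this batch.
            errors = [(w * x + b) - y for x, y in zip(batch_x, batch_y)]
            # Gradients of the batch MSE with respect to w and b.
            grad_w = 2 * sum(e * x for e, x in zip(errors, batch_x)) / n
            grad_b = 2 * sum(errors) / n
            # Step against the gradient to reduce the loss.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy training data generated by y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = train_linear_neuron(xs, ys)
final_loss = mse([w * x + b for x in xs], ys)
print(f"w={w:.3f}, b={b:.3f}, loss={final_loss:.6f}")
```

With this noise-free data the weights should approach `w = 2` and `b = 1`. In Keras, the optimizer and loss passed when compiling a model play the roles that the hand-written update and `mse` play here.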

### Two Basic Operators: Spatial and Temporal
* Spatial Filtering:
    * Addresses properties of the structure of input data
    * Weeds out irrelevant patches and lets valuable ones through
* Convolutional Neural Network (CNN):
    * Applies a set of convolutions to input data and learns the weights of these filters on the basis of training data.
        * Convolutions: Weighted filters
    * The convolutions scan the data at multiple points and gradually extract the important pieces of information.
        * Imagine the scanning as looking at a small subsection of the image that shifts by one stride at a time.
            * Stride: The step size, in pixels
    * These filters are just weight matrices!
        * The filters merely emphasize or deemphasize input
    * While you start with a ton of initial random weights, the network will eventually figure out which weights are better and become optimized.
    * Activation and Pooling Operations:
        * ReLU: max(x, 0)
        * Sigmoid: 1 / (1 + e^(-x))
        * Max Pooling: Returns the maximum value within each pooling window
    * Dimensionality Reduction of Feature Representations!
        * Converts representations of a certain dimension into lower dimensional representations.
    * Hyperparameters:
        * Dimensionality of the Output Space: Number of filters
        * Kernel Size: Size of the filter
        * Stride: Step size
        * These values are typically chosen arbitrarily or estimated based on validation data
    * Layers:
        * Dense: Fully connected layer; as the final layer it produces the binary representation of the output labels
        * Dropout: Randomly sets a specified fraction of the input units to 0 to prevent overfitting
    * Results:
        * Epoch: Full sweep through the training data
        * Binary Cross-Entropy: Loss that expresses how far a classifier's predicted probabilities are from the true binary labels (lower is better).
        * Time: Amount of time spent on the epoch
        * Acc: Accuracy of the model during training
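
The scanning, stride, ReLU, and max pooling ideas above can be sketched in a few lines of plain Python. This is a one-dimensional toy version with a hand-picked kernel, whereas a real CNN learns its kernel weights from training data. The names `conv1d`, `relu`, and `max_pool`, and all numbers, are assumptions for illustration. As in most deep learning libraries, the kernel is applied without flipping.

```python
def conv1d(signal, kernel, stride=1):
    """Slide a weighted filter over the signal; return one weighted sum per position."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """ReLU activation: max(x, 0) applied to every value."""
    return [max(v, 0.0) for v in values]


def max_pool(values, pool_size=2):
    """Keep only the maximum of each non-overlapping window (dimensionality reduction)."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
kernel = [1.0, -1.0]  # hypothetical filter: positive where a value is followed by a smaller one

feature_map = conv1d(signal, kernel, stride=1)  # kernel size 2, stride 1
activated = relu(feature_map)                   # negative responses are zeroed out
pooled = max_pool(activated, pool_size=2)       # shorter, lower-dimensional representation

print(feature_map)  # [-1.0, 2.0, 1.0, -4.0, 2.0]
print(activated)    # [0.0, 2.0, 1.0, 0.0, 2.0]
print(pooled)       # [2.0, 1.0]
```

Raising `stride` to 2 makes the filter skip every other position, so `conv1d(signal, kernel, stride=2)` returns three values instead of five.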
* Temporal Filtering:
    * Like Spatial Filtering but with information from the past
        * The depth here extends across the horizontal direction of a timeline.
* Recurrent Neural Networks (RNN):
    * Similarities to CNN:
        * Selection process boils down to learning weight matrices
        * Emphasis/de-emphasis of certain aspects of the info
    * RNNs memorize the decisions made in the past!
        * This lets the network bias its current classifications so that they stay in line with what was done in the past
    * Simple RNNs:
        * Neural network with a limited amount of memory
        * At any time, an RNN memory state is determined by:
            * Memory state at the previous time tick
            * A weight matrix weighting the previous memory state
            * A weight matrix weighting the current input to the RNN
        * When unrolled over time, the depth of the network grows with the number of inputs!
        * The weights here are shared across all inputs and updated jointly!
            * This allows the network to 'learn' from previous experiences.
            * The shared weights are optimized globally to minimize the training error
        * Example: Predict the next character for a string, based on characters that precede that one
            * One-hot encode the characters, which then get fed into the RNN
            * These units form a temporal chain which feeds into the next hidden unit
            * The one-hot vectors are then reconstructed from the hidden unit outputs
        * RNNs are quite crude!
            * They fail on long sequences
            * They blindly reuse hidden states, which means they can't tell trash apart from valuable data.
    * Long Short-Term Memory (LSTM):
        * This was developed to help fix the problems associated with RNNs
            * Fixed via adding gating operations
            * These gates decide which historical information should be used for local processing of the current input
        * Every LSTM consists of a number of cells, chained in sequence, which consume the same input
            * The number of cells is a hyperparameter which is up to the engineer to determine
        * Since the LSTM encodes the context of each character/word, this information is again available globally.
        * 9 Important Weight Matrices
            * Weight on the input gate, applied to the input
            * Weight on the output gate, applied to the input
            * Weight on the forget gate, applied to the input
            * Weights on the input, computing the activity of the entire cell
            * Input gate weights applied to the previous hidden state
            * Weights applied to the hidden state of the net, for computing the activity of the cell
            * Weight on the forget gate, applied to the previous hidden state of the net
            * Weight on the output gate, applied to the previous hidden state of the net
            * Weights applied to the cell activity
        * Inputs to an LSTM:
            * Samples: Number of data blocks (sequences)
            * Timesteps: Number of observations per sample
            * Features: Dimensionality in each observation
        * When trained on characters that repeat multiple times (almost randomly), the network figures out when to produce suffixes and is able to condition those suffixes on old information!
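
The simple RNN state update listed under "Simple RNNs" above (previous memory state, a weight matrix on that state, a weight matrix on the current input) can be sketched in plain Python over one-hot encoded characters. This is only an illustration: the `tanh` squashing function, the two-letter alphabet, and the fixed weight values are assumptions, and a trained network would learn its weights instead. LSTM gating is not shown here.

```python
import math


def one_hot(char, alphabet):
    """Encode a character as a vector with a single 1.0 at its alphabet position."""
    return [1.0 if c == char else 0.0 for c in alphabet]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, w_input, w_hidden):
    """New memory state from the current input and the previous memory state."""
    from_input = matvec(w_input, x)         # weight matrix on the current input
    from_memory = matvec(w_hidden, h_prev)  # weight matrix on the previous memory state
    return [math.tanh(a + b) for a, b in zip(from_input, from_memory)]


def run_rnn(text, alphabet, w_input, w_hidden):
    """Feed characters one at a time; the same weights are reused at every time step."""
    h = [0.0] * len(w_hidden)  # empty memory before the first character
    states = []
    for char in text:
        h = rnn_step(one_hot(char, alphabet), h, w_input, w_hidden)
        states.append(h)
    return states


alphabet = "ab"
w_input = [[0.5, -0.5], [0.0, 1.0]]  # hidden_size x alphabet_size, illustrative values
w_hidden = [[0.1, 0.0], [0.0, 0.1]]  # hidden_size x hidden_size, illustrative values

states = run_rnn("ab", alphabet, w_input, w_hidden)
same_last_char = run_rnn("bb", alphabet, w_input, w_hidden)

# Both strings end in "b", but the final memory states differ because the history differs.
print(states[-1])
print(same_last_char[-1])
```

The two runs end on the same character yet produce different final states, which is the "memory of the past" that the notes describe. A prediction layer on top of the hidden state, which the character prediction example would need, is left out to keep the sketch short.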

### Deep Learning and NLP: A New Paradigm
* Deep Learning is not just a revamped, highly performant type of machine learning!
    * Spatial filtering: Allows for the emphasis/deemphasis of parts of data
    * Temporal filtering: Allows historical information to be gated in a keep/forget manner
    * You can also combine the RNNs with CNNs to get many different combinations!

### Summary:
* Basic deep learning architectures are multilayer perceptrons, spatial (convolutional) and temporal (RNN and LSTM-based) filters.
* Convolutional and Recurrent Neural Networks, as well as Long Short-Term Memory-based networks, can be easily coded in Keras.
* Deep learning examples applied to language analysis in this chapter include sentiment analysis and character prediction.