## Chapter 1: Deep learning for NLP

## 1.1 A Selection of Machine Learning Methods for NLP

* Classification Goal: To arrive at linear separability of data that is labeled with classes
    * Classes: Labels that indicate a (usually exclusive) category to which points belong
    * Input Space: Vector representations of descriptive traits
    * Feature Space: Processing, manipulation, and abstraction of the input space during the learning stage
    * Output Space: Class labels that separate the various data points based on class boundaries
* Input Space -> Feature Space -> Output Space
    * Through deep learning, the relationships between the input and the output are defined
* Training a machine learning component involves learning boundaries between classes.
    * Linear Classifier: A linear function that separates classes with a straight line (a hyperplane in more than two dimensions)

### The Perceptron (Neural / Cognitive)
* Thought: What if you had a vector of features that describes aspects of a certain object and you wanted to create a function that turns these features into a binary label?
    * Eg: Words in a document -> +/- sentiment
* Rosenblatt's perceptron
    * Biologically inspired machine learning component
    * Apparatus: 20x20 photosensitive cells
    * Weights: Set by electromotors driving potentiometers
    * Learning: One-layer neural network
* Example Perceptron: 
    * Build a document classifier that categorizes raw texts as atheist or medical
    * Logic:
        * Make a subselection for two newsgroups of interest
        * Train a simple perceptron on a vector representation of the documents
    * Code:
        * Import sklearn's basic perceptron classifier
        * Import the newsgroup data and filter it down to the two newsgroups of interest
        * Fit the CountVectorizer onto the training data
        * Compute TF.IDF representations of the count vectors
            * TF.IDF -> Documents into vectors for ML modeling
        * Train the perceptron on the TF.IDF vectors
        * Convert the test data into a form for the perceptron
        * Apply the perceptron to the test data
        * Print the results
* IN REAL LIFE, TOPICS ARE NOT EASY ENOUGH TO SEPARATE LINEARLY
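
The notes describe the sklearn pipeline (CountVectorizer, TF.IDF, perceptron) only as an outline. As a dependency-free sketch of the learning rule underneath it, the following trains a perceptron on raw word counts. The toy corpus, the labels, the function names, and the epoch count are all illustrative assumptions, not the newsgroup data or the book's code.

```python
from collections import Counter


def vectorize(text, vocabulary):
    """Turn a text into a list of word counts, one position per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def train_perceptron(vectors, labels, epochs=10):
    """Rosenblatt-style rule: adjust the weights only when a prediction is wrong."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):  # y is +1 or -1
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != y:
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(x, weights, bias):
    """Return +1 or -1 depending on the side of the linear boundary."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


# Hypothetical toy corpus: +1 = medical, -1 = atheism.
train_texts = [
    "the doctor prescribed medicine for the patient",
    "the patient recovered after surgery",
    "the argument about god and belief",
    "belief in god is debated by atheists",
]
train_labels = [1, 1, -1, -1]

vocabulary = sorted({w for t in train_texts for w in t.lower().split()})
train_vectors = [vectorize(t, vocabulary) for t in train_texts]
weights, bias = train_perceptron(train_vectors, train_labels)

# Words outside the vocabulary (here "saw") are simply ignored.
test_vector = vectorize("the doctor saw the patient", vocabulary)
prediction = predict(test_vector, weights, bias)
print(prediction)
```

This toy corpus is linearly separable, which is exactly the property that real topics often lack.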

### Support Vector Machines (Eager)
* Thought: What if you could add an additional dimension to object perception such that you create linear boundaries between data points in that dimension? 
* An SVM is a binary classifier that uses a kernel function to map data into a higher-dimensional feature space in which the data is separable by a hyperplane.
    * Kernel functions transform the input space into an alternative representation that has a higher dimensionality with the aim of making data linearly separable.
* Kernel Function:
    * A kernel function computes a product between two vectors.
        * This product is a number expressing a relation between two input vectors.
    * Takes two vectors, mixes in a constant, and produces a specific form of a dot product of the two vectors.
    * Example quadratic kernel: K(x, y) = (c + x^T y)^2
* Kernel Trick:
    * Hopefully, in the higher-dimensional space, things become easier to separate. The trick is that the kernel returns the dot product the two vectors would have in that space, so you never explicitly transform the data.
* Two classes are best separated with maximally wide boundaries (maximal margins)
* Support Vectors: The data points determining the slope of these boundaries
    * Learning weights that optimize the margins with the least error is what SVMs do!
* SVMs are eager because they throw away a lot of their training data and only care about the support vectors. An eager model is a compact representation of the training data.
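
To make the kernel trick concrete, the sketch below compares the quadratic kernel from the notes with an explicit mapping to a higher-dimensional space. The explicit 6-dimensional feature map for 2-dimensional inputs and the sample vectors are assumptions added for illustration. Both routes should give the same number, but the kernel never builds the larger vectors.

```python
import math


def quadratic_kernel(x, y, c=1.0):
    """K(x, y) = (c + x^T y)^2, computed in the original input space."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (c + dot) ** 2


def explicit_feature_map(x, c=1.0):
    """Explicit 6-dimensional mapping of a 2-dimensional input (for comparison only)."""
    x1, x2 = x
    return [
        x1 * x1,
        x2 * x2,
        math.sqrt(2) * x1 * x2,
        math.sqrt(2 * c) * x1,
        math.sqrt(2 * c) * x2,
        c,
    ]


x, y = [1.0, 2.0], [3.0, -1.0]

# Route 1: the kernel, working on the 2-dimensional inputs directly.
kernel_value = quadratic_kernel(x, y)

# Route 2: transform both vectors first, then take an ordinary dot product.
explicit_value = sum(
    a * b for a, b in zip(explicit_feature_map(x), explicit_feature_map(y))
)

print(kernel_value, explicit_value)
```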

### Memory-Based Learning (Lazy)
* Lazy in that it does not generalize over the training data and keeps all training data in memory
* Similar to SVMs in that it also computes distance measures for similarity.
    * But no dimensionality tricks!
* IB1 Distance Metric:
    * Computes the distance between feature vectors based on exact matches for non-numeric values: a match contributes 0, a mismatch 1
    * Find the training items that lie at the same distance from the current test item
    * Then you vote to see which is the most probable label for the test item.
        * K parameter lets you limit how many similar items you look at.
        * K-nearest distances NOT k-nearest neighbors
* Keeping everything in memory allows for exception handling to occur!
    * Eager ML models tend to compile away exceptions
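
The following is a minimal sketch of an IB1-style classifier as described above, assuming symbolic (non-numeric) features and an unweighted overlap metric. The fruit data and the function names are hypothetical. Note that `k` counts distinct distances, so several stored items at the same distance all get a vote.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count mismatching feature values: a match contributes 0, a mismatch 1."""
    return sum(0 if ai == bi else 1 for ai, bi in zip(a, b))


def classify(test_item, memory, k=1):
    """Vote among all stored items that lie at the k nearest distinct distances."""
    scored = [(overlap_distance(test_item, features), label)
              for features, label in memory]
    nearest_distances = sorted({distance for distance, _ in scored})[:k]
    votes = Counter(label for distance, label in scored
                    if distance in nearest_distances)
    return votes.most_common(1)[0][0]


# Hypothetical training data; everything stays in memory, nothing is generalized.
memory = [
    (("red", "round", "small"), "cherry"),
    (("red", "round", "large"), "apple"),
    (("green", "round", "large"), "apple"),
    (("yellow", "long", "large"), "banana"),
]

# With k=2, items at distance 1 and distance 2 vote: one cherry, two apples.
label = classify(("red", "round", "medium"), memory, k=2)
print(label)
```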

## 1.2 Deep Learning

* Deep Learning: A neural network with lots of internal/hidden layers and specific filtering operations
    * Very effective statistical technique for working with (very) many parameters.
        * Millions!
    * General Architecture
        * Input Data -> Layer 1 -> Layer 2 ... -> Layer N -> (Output label 1 .. Output label n)
    * Hierarchical representations of data
        * "Lower" layers get fed into "higher" layers
            * Layers = complex functions processing inputs and weights
            * Weights encode the importance of the information within the network
                * These weights are estimated and fine-tuned by neurons
                    * Neurons = Basic processing units of a neural network
        * Once the layers are complete, the network produces probabilities for the possible outcomes
            * All layers except the input and the output are called hidden b/c they can't be readily observed.
            * The possibility with the highest probability gets the final output label.
    * Developments in Deep Learning: 
        * Restricted Boltzmann Machines (RBMs):
            * Issue: Vanishing Gradient
                * In networks with many layers (and thus many parameters), weight adjustments become too tiny to be useful.
                * Repeated multiplication of small numbers going from layer to layer makes the gradients eventually 'disappear', so the updates stop being useful
            * RBM: Complete networks that learn probability distributions from data
                * You stack the RBMs such that each one passes its hidden layer output on as input to the next one, and you train them one layer at a time.
            * Now that there's layer-wise training and the gradients don't travel as far, the vanishing gradient problem is mitigated.
        * Rectified Linear Unit (ReLU):
            * Issue: Having a function that allows for your function to learn both is computationally expensive and may be super complicated.
            * Very simple function that returns the input value or 0, whichever is larger: max(x, 0)
                * This maps all negative numbers to 0
            * Allows for increased speed and scalability of the network computations
* Advantages:
    * Repeated application of data decluttering
    * Great for handling sequential information with memory operators and buffers
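
As a rough sketch of the architecture above, the code below feeds an input through a hidden layer with ReLU and turns the output scores into probabilities with a softmax. The weights are hand-picked, hypothetical values rather than learned ones, and softmax is an assumed choice for the probability step. The last function illustrates the vanishing gradient: multiplying a small per-layer factor (0.25 is used as an assumed example) over many layers shrinks the signal to almost nothing.

```python
import math


def relu(values):
    """Return max(v, 0) for every value, so negative inputs become 0."""
    return [max(v, 0.0) for v in values]


def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One layer: a weighted sum of the inputs per neuron, plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Apply ReLU after every hidden layer and softmax after the output layer."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    return softmax(dense(activations, out_weights, out_biases))


def gradient_scale(layer_count, per_layer_factor):
    """Size of a gradient after being multiplied by the same factor in every layer."""
    return per_layer_factor ** layer_count


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 2 output labels.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
probabilities = forward([2.0, 0.5], layers)

# The label with the highest probability becomes the final output.
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
print(gradient_scale(10, 0.25), gradient_scale(10, 1.0))
```

A per-layer factor of 1.0, which is the slope of ReLU for positive inputs, leaves the gradient size unchanged across layers.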

## 1.3 Vector representations of language
* Since ML is all about measuring distances between objects in multi-dimensional spaces, we must convert all text to vectors. They must be computed directly and exactly from data.
* Vector representations:
    * Representational vectors:
        * Represent texts by describing them across a number of human-interpretable feature dimensions.
        * Ex:
            * Dutch diminutives: hospitaal -> hospitaal+tje and woning -> wonin+kje
            * +,h,O,s,-,p,i,=,-,t,a,l,T
            * In this representation, words receive dummy values for absent dimensions
            * Words longer than 12 characters are truncated
        * CountVectorizer:
            * Maps words to vector positions such that every word gets a unique position
        * OneHot Vector Encoding:
            * Uses a sparsely populated N-dimensional vector
            * Every word is represented by a single 1 at its own position, with 0 everywhere else
    * Operational vectors
        * Derived representation of data as produced by an algorithm
        * Term Frequency Inverse Document Frequency (TF.IDF):
            * Words are weighted with numerical scores based on saliency
            * Frequency of the word in a document x inverse of the number of documents that contain the word.
                * Higher scores indicate words that are more specific to a document.
                * The lower the score, the less special the word is, as it appears in a lot of documents.
        * Neural Word Embeddings
            * Embeddings are produced by neural networks that predict words from their contexts (or contexts from words)
            * Popular input to deep neural networks.
            * Distributional Semantic Similarity:
                * Associate words together based on their context then associate the vectors with relatively similar scores
            * Shallow: 1 Hidden Layer + 1 Input Layer
            * Deep: >1 Hidden Layer or >1 Input Layer
            * Thus a phrase can be represented by one embedding that geometrically corresponds to the centroid (average) of the embedding vectors of its words.
            * Sample Flow:
                * Text -> Words via Tokenization
                * Words -> Vectors via Word2Vec
                * Vectors -> Documents as Vectors via Averaging
                * Documents as Vectors -> Machine Learning algorithm
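
The three ideas in this section (one-hot vectors, TF.IDF weights, and averaging word embeddings into a document vector) can be sketched in a few lines. This is a plain textbook variant under stated assumptions: whitespace tokenization, an unsmoothed logarithmic IDF (libraries often use smoothed variants), and made-up 2-dimensional "embeddings" that stand in for real Word2Vec vectors.

```python
import math
from collections import Counter


def one_hot(word, vocabulary):
    """Sparse N-dimensional vector: 1 at the word's position, 0 elsewhere."""
    return [1 if word == entry else 0 for entry in vocabulary]


def tf_idf(documents):
    """Weight each word by term frequency times inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    document_frequency = Counter(
        word for tokens in tokenized for word in set(tokens)
    )
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens))
            * math.log(len(documents) / document_frequency[word])
            for word, count in counts.items()
        })
    return scores


def centroid(vectors):
    """Average a list of equally long vectors, dimension by dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


documents = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(documents)
# "the" occurs in every document, so its score is 0; "sat" is the most specific.
print(scores[0])

# Hypothetical embeddings, invented for this example only.
embeddings = {"cat": [1.0, 3.0], "sat": [3.0, 1.0]}
document_vector = centroid([embeddings[w] for w in "cat sat".split()])
print(document_vector)
```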

## 1.4 Vector Sanitization
### The Hashing Trick
* Large vectors are unwieldy to handle!
    * Large memory allotments w/ sparsely populated dimensions
* Feature Hashing:
    * Map every feature to an index with a hash function; the algorithm then updates the info at those indices only
    * Utilize an inverted lexicon (Integers -> Words)
        * Similar input values will lead to similar numerical indices
            * Amount of similarity dependent on the specific hash function
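
A minimal sketch of feature hashing, assuming a fixed vector size of 16 and MD5 as the hash function. MD5 is chosen here only because it gives the same index on every run (Python's built-in `hash()` for strings is randomized per process); it makes no attempt to give similar inputs similar indices. The function names are hypothetical.

```python
import hashlib


def hashed_index(feature, size):
    """Map a feature string to a stable index in the range [0, size)."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def hash_vectorize(tokens, size=16):
    """Count tokens directly at their hashed positions; no vocabulary is stored."""
    vector = [0] * size
    for token in tokens:
        vector[hashed_index(token, size)] += 1
    return vector


tokens = "the cat sat on the mat".split()
vector = hash_vectorize(tokens, size=16)
print(vector)
```

The vector length stays fixed no matter how many distinct words appear. The price is that two different words can collide on the same index.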

### Vector Normalization
* Vectors represent quantities as a magnitude and a direction
* Unit Vector: A vector scaled to magnitude 1. Normalizing squeezes vectors into a subspace, which reduces the variance across dimensions.
* Normalization is good for forcing vectors to be within the same data range which reduces sensitivity to outlier data!
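
As a small illustration, the sketch below scales a vector to unit length using the L2 (Euclidean) norm. The choice of the L2 norm and the handling of the zero vector are assumptions; other norms are possible.

```python
import math


def l2_normalize(vector):
    """Scale a vector to magnitude 1 while keeping its direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return list(vector)  # a zero vector has no direction; return it unchanged
    return [v / norm for v in vector]


unit = l2_normalize([3.0, 4.0])
outlier = l2_normalize([300.0, 400.0])

# Both vectors point in the same direction, so they normalize to the same result.
print(unit, outlier)
```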

## Summary
* Many different forms of NLP are rooted in machine learning and statistics.
* Deep learning traces back to the 1960s, but it only became operational a few decades later.
* Text needs to be vectorized in order for machine learning to perform natural language processing.
* While many options are open for vectorization, inferring and optimizing vectorization from data within machine learning is preferable.