**Lecture 5 - Recurrent Neural Networks**


- Modern neural networks are enormous (over 100 billion parameters)
	- We need regularization to prevent neural networks from overfitting
		- With models this large, regularization is what makes them generalize well
		- We no longer care that the model overfits the training data, as long as it generalizes

**Dropout**
- During training, each time a data point is seen, randomly set each input to 0 with probability p (the dropout ratio; often p = 0.5, except p = 0.15 for the input layer) using a dropout mask
- Why it works
	- The model knows that some features may be missing, so it has to be flexible in what it learns
    - Prevents feature co-adaptation = good regularization
- In a single layer - middle ground between Naïve Bayes (all feature weights set independently) and logistic regression models (weights are set in the context of all others)
    - A form of model bagging like an ensemble model
    - Usually thought of as strong, feature-dependent regularizer
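
The masking step above can be sketched in NumPy as follows. The helper name `dropout` is hypothetical, and the 1/(1-p) rescaling ("inverted dropout") is an assumption that these notes do not state; it keeps the expected activation unchanged so that nothing has to change at test time.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Zero each entry of x with probability p (hypothetical helper)."""
    if not train or p == 0.0:
        return x  # dropout is only applied during training
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p   # True = keep; drawn independently per entry
    return x * mask / (1.0 - p)       # assumed rescaling: expected value stays equal to x

h = np.ones((4, 6))
h_train = dropout(h, p=0.5, rng=np.random.default_rng(0))  # fresh mask on every call
h_test = dropout(h, p=0.5, train=False)                    # unchanged at test time
```

The mask is built with one vectorized comparison rather than a loop over entries.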

**Vectorization**
- Looping over word vectors vs. concatenating them all into one large matrix and then multiplying the softmax weights with that matrix
- Always try to use vectors and matrices instead of for loops
- Use vector operations for your mask

Setup for the comparison (NumPy):

    from numpy import random

    N = 500  # number of windows to classify
    d = 300  # dimensionality of each window
    C = 5    # number of classes
    W = random.rand(C, d)  # softmax weights
    wordvectors_list = [random.rand(d, 1) for i in range(N)]  # one column vector per window
    wordvectors_one_matrix = random.rand(d, N)  # all windows as columns of one matrix
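
One way to compare looping with a single matrix multiplication is sketched below. The function names `scores_loop` and `scores_matrix` are made up for illustration, and the timings depend on the machine, so no numbers are quoted here.

```python
import timeit
import numpy as np

N, d, C = 500, 300, 5
rng = np.random.default_rng(0)
W = rng.random((C, d))                                    # softmax weights
wordvectors_list = [rng.random((d, 1)) for _ in range(N)]
wordvectors_one_matrix = np.hstack(wordvectors_list)      # shape (d, N)

def scores_loop():
    # one small matrix-vector product per window
    return [W.dot(v) for v in wordvectors_list]

def scores_matrix():
    # one large matrix-matrix product for all windows
    return W.dot(wordvectors_one_matrix)

looped = np.hstack(scores_loop())                         # shape (C, N)
same_result = np.allclose(looped, scores_matrix())

t_loop = timeit.timeit(scores_loop, number=100)
t_matrix = timeit.timeit(scores_matrix, number=100)
print(f"loop: {t_loop:.4f}s  matrix: {t_matrix:.4f}s  same result: {same_result}")
```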

**Parameter initialization**
- Must initialize weights to small random values (not zero matrices)
- To avoid symmetries that prevent learning/specialization
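
The symmetry problem can be seen in a toy one-hidden-layer network. This is only a sketch: the squared-error loss, the tanh hidden layer, the sizes, and the helper `hidden_grad` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3, 1))   # one input example
y = 1.0                  # scalar target

def hidden_grad(W1, w2):
    """Gradient of 0.5 * (w2.T tanh(W1 x) - y)^2 with respect to W1."""
    h = np.tanh(W1 @ x)               # hidden activations, shape (4, 1)
    err = (w2.T @ h).item() - y       # scalar prediction error
    dh = err * w2 * (1 - h ** 2)      # backprop through tanh, shape (4, 1)
    return dh @ x.T                   # one gradient row per hidden unit, shape (4, 3)

# Zero initialization: every hidden unit receives the same (here zero) gradient.
g_zero = hidden_grad(np.zeros((4, 3)), np.zeros((4, 1)))

# Small random initialization: each hidden unit receives a different gradient.
g_rand = hidden_grad(0.01 * rng.standard_normal((4, 3)),
                     0.01 * rng.standard_normal((4, 1)))

print(np.allclose(g_zero, g_zero[0]), np.allclose(g_rand, g_rand[0]))
```

When all rows of the gradient are identical, the hidden units stay identical after every update and cannot specialize.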


**Optimizers**
- Getting SGD right depends heavily on choosing the right scale for the learning rate (step size)
- For more complex nets, try more sophisticated "adaptive" optimizers that scale the adjustment to individual parameters by an accumulated gradient
    - These optimizers give differential per-parameter learning rates
        - Adagrad - simplest, but tends to stall early
        - RMSprop
        - Adam - usually begin with this
        - AdamW - Adam with decoupled weight decay (the "W"); can be better with word vectors
        - NAdamW - AdamW plus Nesterov acceleration (the "N"); can be better for speed
    - Start them with an initial learning rate, around 0.001 - many have other hyperparameters
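
To make "per-parameter learning rates" concrete, here is a sketch of the Adagrad and Adam update rules in NumPy. The function names, the toy quadratic objective, and the hyperparameter defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are assumptions taken from common practice, not from these notes.

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.001, eps=1e-8):
    cache = cache + grad ** 2                            # accumulated squared gradient (only grows)
    theta = theta - lr * grad / (np.sqrt(cache) + eps)   # so the effective step keeps shrinking
    return theta, cache

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                   # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2              # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                         # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

scales = np.array([100.0, 0.01])                         # gradient scales differ by a factor of 10^4

def grad_fn(theta):
    return 2 * scales * theta                            # gradient of sum(scales * theta**2)

theta0 = np.array([1.0, 1.0])
theta_adam, m, v = adam_step(theta0, grad_fn(theta0), np.zeros(2), np.zeros(2), t=1)
theta_ada, cache = adagrad_step(theta0, grad_fn(theta0), np.zeros(2))
print(theta_adam, theta_ada)
```

Although the two raw gradients differ by four orders of magnitude, both parameters move by roughly the learning rate on the first step, because each update is divided by that parameter's own accumulated gradient magnitude.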

**Language modeling**
- The task of predicting the next word
- Given a sequence of words, compute the probability distribution of the next word, where next word can be any word in the vocabulary V
- A system that assigns a probability to a piece of text
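
In symbols, using the x^(t) notation that appears later in these notes (the sequence length T is an added symbol), the next-word distribution and the probability a language model assigns to a whole text can be written as:

```latex
% next-word prediction: a distribution over the vocabulary V
P\left(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\right), \qquad x^{(t+1)} \in V

% probability of a piece of text, by the chain rule
P\left(x^{(1)}, \dots, x^{(T)}\right)
  = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}\right)
```

The second line is why the two descriptions above are equivalent: multiplying next-word probabilities gives the probability of the full text.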

**n-gram Language Models**
- How to learn a language model
	- Pre-deep learning - learn an n-gram language model
		- n-gram = chunk of n consecutive words (the last one is the word we are trying to predict)
		- Unigram, bigram, trigram, four-gram
		- Idea: collect statistics about how frequent different n-grams are and use these to predict next word
		- we make a Markov assumption: x^(t+1) depends only on the preceding n-1 words
			- Get n-gram and (n-1)-gram probabilities by counting them in some large corpus of text (statistical approximation)
- Ex. Learning a 4-gram Language Model
	- "students opened their ____"
	- P(w | students opened their) = count(students opened their w) / count(students opened their)
- Sparsity problems
	- Problem 1: what if "students opened their w" never occurred in data? Then w has probability 0.
		- Solution - add small _delta_ to the count for every w in V ("smoothing")
	- Problem 2: what if "students opened their" never occurred in data? Then can't calculate probability for any w
		- Solution - just condition on "opened their" instead ("backoff")
	- Increasing n makes sparsity worse, so 5-grams are about the largest people really use
- Storage problems
	- Need to store count for all n-grams you saw in the corpus
	- Increasing n or increasing corpus increases model size
- In practice
	- You can build a simple trigram language model over a 1.7 million word corpus (Reuters) in a few seconds on your laptop (nlpforhackers.io/language-models/)
- Generating text
	- Incoherent. We need to consider more than 3 words at a time if we want to model language well
	- But increasing n worsens sparsity problem, and increases model size
		- Solution - neural language model
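
The counting, smoothing, and backoff ideas above can be sketched with the standard library alone. The toy corpus and the helper names (`train_ngram`, `prob`, `backoff_prob`) are invented for illustration; real toolkits use more refined smoothing and backoff schemes.

```python
from collections import Counter

def train_ngram(tokens, n):
    """Count n-grams and the (n-1)-gram contexts they start with."""
    starts = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in starts)
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in starts)
    return ngrams, contexts

def prob(word, context, ngrams, contexts, vocab, delta=0.0):
    """(count(context word) + delta) / (count(context) + delta * |V|)."""
    context = tuple(context)
    num = ngrams[context + (word,)] + delta
    den = contexts[context] + delta * len(vocab)
    return num / den if den > 0 else None   # None: context never seen

def backoff_prob(word, context, models, vocab, delta=0.0):
    """Drop the oldest context word until the context has been seen."""
    context = tuple(context)
    while True:
        ngrams, contexts = models[len(context) + 1]
        p = prob(word, context, ngrams, contexts, vocab, delta)
        if p is not None:
            return p
        context = context[1:]               # the empty context always has a count

tokens = ("students opened their books . students opened their laptops . "
          "students opened their books .").split()
vocab = set(tokens)
models = {n: train_ngram(tokens, n) for n in range(1, 5)}
four_grams, four_contexts = models[4]

ctx = ("students", "opened", "their")
p_books = prob("books", ctx, four_grams, four_contexts, vocab)             # 2 of 3 continuations
p_unseen = prob(".", ctx, four_grams, four_contexts, vocab, delta=1.0)     # smoothing: no longer 0
p_backoff = backoff_prob("books", ("teachers", "opened", "their"), models, vocab)
print(p_books, p_unseen, p_backoff)
```

With delta greater than 0 the denominator is never zero, so backoff only triggers in the unsmoothed case here.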

**Neural Language Models**
- How to build a neural language model?
	- Window-based neural model
		- Output distribution
		- Hidden layer
		- Concatenated word embeddings
		- Words/one-hot vectors
		- improvements over n-gram LM:
			- no sparsity problem
			- don't need to store all observed n-grams
		- remaining problems:
			- fixed window is too small
			- enlarging window enlarges W
			- window can never be large enough
			- x^(1) and x^(2) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed
		- --> we need a neural architecture that can process any length input
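
A forward pass through the window-based model might look like the sketch below. The toy sizes, the tanh non-linearity, and the parameter names `E`, `b1`, `U`, `b2` are assumptions (only W is named in these notes), and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 10, 4, 3, 8                     # toy sizes (assumed)

E = 0.1 * rng.standard_normal((V, d))                  # word embeddings, one row per word
W = 0.1 * rng.standard_normal((hidden, window * d))    # grows with the window size
b1 = np.zeros(hidden)
U = 0.1 * rng.standard_normal((V, hidden))
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())                            # subtract max for numerical stability
    return e / e.sum()

def window_lm(word_ids):
    e = np.concatenate([E[i] for i in word_ids])       # concatenated word embeddings
    h = np.tanh(W @ e + b1)                            # hidden layer
    return softmax(U @ h + b2)                         # output distribution over V

y_hat = window_lm([1, 5, 7])                           # exactly `window` word ids
print(y_hat.shape, y_hat.sum())
```

Each position in the window uses its own block of columns in `W`, which is the lack of symmetry noted above, and a longer window needs a wider `W`.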

**Recurrent Neural Networks (RNN)**
- idea: apply the same weights W repeatedly, updating a hidden state at every timestep

- a simple RNN LM:
    - words/one-hot vectors --> embeddings --> hidden states (multiply by the weight matrices, add a bias term, pass through a non-linearity)
    - hidden states store a memory of everything that's been seen so far
    - repeat for each timestep!
    - then output
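
The steps above can be sketched as a short NumPy forward pass. `W_h` follows the name used later in these notes; `E`, `W_e`, `b1`, `U`, `b2`, the tanh non-linearity, and the toy sizes are assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10, 4, 8                              # vocab, embedding, hidden sizes (assumed)

E = 0.1 * rng.standard_normal((V, d))           # word embeddings
W_e = 0.1 * rng.standard_normal((H, d))         # input-to-hidden weights
W_h = 0.1 * rng.standard_normal((H, H))         # hidden-to-hidden weights, reused every step
b1 = np.zeros(H)
U = 0.1 * rng.standard_normal((V, H))           # hidden-to-output weights
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_forward(word_ids):
    h = np.zeros(H)                             # initial hidden state
    outputs = []
    for i in word_ids:                          # same weights at every timestep
        h = np.tanh(W_h @ h + W_e @ E[i] + b1)  # new hidden state from old state and input
        outputs.append(softmax(U @ h + b2))     # next-word distribution at this step
    return np.array(outputs), h

short_out, _ = rnn_lm_forward([1, 5])
long_out, _ = rnn_lm_forward([1, 5, 7, 2, 9, 3])
print(short_out.shape, long_out.shape)
```

The same parameters handle inputs of length 2 and length 6; nothing in the model grows with the input.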

- advantages:
    - can process any length input
    - computation for step t can use information from many steps back (in theory)
    - model size doesn't increase for longer input context
    - same weights applied on every timestep, so there is a symmetry in how inputs are processed

- disadvantages:
    - recurrent computation is slow
    - in practice, difficult to access information from many steps back

- training an RNN LM
    - get a big corpus of text which is a sequence of words
    - feed into RNN-LM --> compute output distribution for every step t (predict prob dist of every word, given words so far)
    - loss function on step t is cross-entropy between predicted prob dist and the true next word (one hot)
    - average this to get overall loss for entire training set
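    - at every step the input is the true previous word from the corpus, not the model's own prediction (the **teacher forcing** method)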
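
Because the target is one-hot, the cross-entropy at step t reduces to the negative log probability of the true next word. A small sketch, with a hypothetical helper `lm_loss` and a uniform "model" standing in for real RNN outputs:

```python
import numpy as np

def lm_loss(probs, targets):
    """Average cross-entropy; probs[t] is the predicted distribution at step t,
    targets[t] is the id of the true next word."""
    probs = np.asarray(probs)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]   # probability given to each true word
    return float(-np.log(picked).mean())

tokens = [0, 2, 1, 3]                          # a tiny corpus of word ids
inputs, targets = tokens[:-1], tokens[1:]      # teacher forcing: inputs are the true words
uniform = np.full((len(inputs), 4), 0.25)      # a model that knows nothing (|V| = 4)
loss_uniform = lm_loss(uniform, targets)       # equals log(4)
print(loss_uniform)
```

Shifting the corpus by one position gives the input/target pairs: the target at each step is simply the next input.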

- backpropagation for RNNs
    - what's the derivative of the loss (gradient) wrt the repeated weight matrix W_h?
        - dJ^(t) / dW_h = sum of the gradient wrt each time it appears (multivariable chain rule)
    - calculate this via backpropagation through time
        - in practice, often truncated after 20 timesteps for training efficiency reasons
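
Written out, the sum over appearances of W_h looks like this. The subscript (i) marks the copy of W_h used at step i, and k is an added symbol for the truncation length (about 20 in the note above).

```latex
% full backpropagation through time
\frac{\partial J^{(t)}}{\partial W_h}
  = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}

% truncated version: only the most recent k steps contribute
\frac{\partial J^{(t)}}{\partial W_h}
  \approx \sum_{i=\max(1,\, t-k+1)}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}
```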

- generating with an RNN LM ("generating **roll outs**")
    - like n-gram LM, can use an RNN LM to generate text by repeated sampling
    - sampled output becomes next step's input
    - `<s>` = start of sequence
    - `</s>` = end of sequence
    - ex. ChatGPT generates text the same way (by repeated sampling), although it is not an RNN
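
A roll-out can be sketched as a sampling loop. The weights below are random, so the output is gibberish; the toy vocabulary, the sizes, and the helper names `step` and `generate` are assumptions made for illustration.

```python
import numpy as np

vocab = ["<s>", "</s>", "students", "opened", "their", "books"]
V, d, H = len(vocab), 4, 8
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))          # untrained toy parameters
W_e = rng.standard_normal((H, d))
W_h = rng.standard_normal((H, H))
U = rng.standard_normal((V, H))

def step(word_id, h):
    """One RNN step: returns the next-word distribution and the new hidden state."""
    h = np.tanh(W_h @ h + W_e @ E[word_id])
    logits = U @ h
    p = np.exp(logits - logits.max())
    return p / p.sum(), h

def generate(max_len=20, seed=0):
    sampler = np.random.default_rng(seed)
    h = np.zeros(H)
    word_id = vocab.index("<s>")                 # start-of-sequence token is the first input
    words = []
    for _ in range(max_len):
        p, h = step(word_id, h)
        word_id = int(sampler.choice(V, p=p))    # sample; the sample becomes the next input
        if vocab[word_id] == "</s>":             # stop at end-of-sequence
            break
        words.append(vocab[word_id])
    return words

sample = generate()
print(" ".join(sample))
```

Sampling from the distribution, rather than always taking the most likely word, is what makes different seeds produce different roll-outs.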