# RNNs

## Introduction

Recurrent neural networks (RNNs) operate on 1-D sequences, i.e. data ordered along a single axis such as time or position in a text. They can be used for a variety of tasks, pictured below:

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg" width = 500>

Examples of the settings in the picture:
- one to one: vanilla MLPs that map a fixed-size 1-D vector to a 1-D vector for classification or regression.
- one to many: image captioning: given an input embedding (obtained with a CNN), a textual caption of variable length is generated.
- many to one: (1) sentence classification such as sentiment analysis or (2) image generation from text: in both cases a variable-length text is given as input and a fixed-dimensional output is generated.
- many to many: (1) machine translation of a variable-length sentence to another variable-length sentence or (2) transcription of a variable-length .mp3 audio to a variable-length text.
- many to many (one-to-one correspondence): (1) video classification: one label for every frame of a video with a variable number of frames (the video frame embeddings can be obtained with a CNN and then fed into an RNN), (2) autoregressive language modeling: predicting the next word in the sentence, for generative purposes, or (3) word classification: classifying every word as belonging to a category.

Note that these settings are not exclusive to recurrent neural networks. In fact, any network type that works on variable-length input sequences can be used towards these ends. The most famous of these are, of course, Transformers, which have all but replaced RNNs in NLP and many other fields. An explanation and implementation of Transformers is out of the scope of this course. It suffices to know that RNNs process an input sequence step by step through memory cells, whereas Transformers process all positions in parallel through an $n \times n$ attention matrix, with $n$ the sequence length. Other than RNNs and Transformers, convolutional networks can also be used on variable-length inputs: a 1D kernel can convolve over a sequence of length $100$ just as well as over one of length $1000$. It is only because the linear layers at the end for classification require a specific number of input nodes that typical CNNs are applicable to only one specific input size.
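The point about convolutions and variable-length inputs can be sketched in a few lines of PyTorch. This is only an illustration: the channel counts and sequence lengths are made-up values, and the global average pooling step is an assumption added here (it is one common way to get a fixed-size vector before a linear layer, not something prescribed above).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen for illustration only.
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=1)

short_seq = torch.randn(1, 8, 100)    # (batch, channels, length)
long_seq = torch.randn(1, 8, 1000)

# The same kernel slides over both sequences; only the output length differs.
short_features = conv(short_seq)      # shape (1, 16, 100)
long_features = conv(long_seq)        # shape (1, 16, 1000)

# Assumption: pooling over the length axis yields a fixed-size vector,
# so a single linear layer can serve both input lengths.
pool = nn.AdaptiveAvgPool1d(1)
classifier = nn.Linear(16, 2)

short_logits = classifier(pool(short_features).squeeze(-1))  # shape (1, 2)
long_logits = classifier(pool(long_features).squeeze(-1))    # shape (1, 2)

print(short_features.shape, long_features.shape)
print(short_logits.shape, long_logits.shape)
```

Without the pooling step, flattening the convolutional features would give $16 \times 100$ values in one case and $16 \times 1000$ in the other, which is exactly why a linear layer placed directly after them fixes the input size.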

## Partim 1: Character-level autoregressive language modeling

Autoregressive language modeling is the task of trying to predict the next word or character in a sentence given the words or characters that came before it: $P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1)$.
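For reference, these per-step conditionals are what the chain rule of probability uses to factorize the probability of a whole sequence of length $T$ (the symbol $T$ is introduced here for illustration):

$$
P(x_1, x_2, \ldots, x_T) = \prod_{i=1}^{T} P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1)
$$

Taking the negative logarithm and averaging over positions gives the usual training objective, the mean negative log-likelihood of each next token:

$$
\mathcal{L} = -\frac{1}{T} \sum_{i=1}^{T} \log P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1)
$$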

In this PC-lab we will explore language modeling on the character level as opposed to the word level. A few considerations in this regard:

The biggest advantage is that there are far fewer classes ($N_{\text{words}} \gg N_{\text{chars}}$), which results in a classification task that is easier to optimize and generally needs less data. The disadvantage is that the model has no built-in notion of words and has to compose them character by character, which often results in gibberish words.

Architecturally, autoregressive language modeling of characters using a vanilla RNN looks like this:

<img src="http://karpathy.github.io/assets/rnn/charseq.jpeg" width = 300>

The model will embed input characters to a hidden layer which acts as a memory bank. The memory bank at every time step consists of a combination of the information at that time step and the information coming in from the memory cell at the previous time step. The specific way this information is brought together depends on the specific construction of the RNN. We refer you to the theory lectures for details. The most popular constructions are the LSTM and the GRU memory cells. For every time step, the model outputs a vector of $n$ dimensions, with $n$ the number of possible characters. We compute the cross entropy for every character on these vectors as a loss function and backpropagate.
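As a rough sketch of the architecture described above, the following uses an embedding layer, a GRU as the memory cell and a linear output layer. The class name `CharRNN`, the layer sizes and the random toy batch are all hypothetical choices for illustration, not the model used later in this lab.

```python
import torch
import torch.nn as nn


class CharRNN(nn.Module):
    """Hypothetical minimal character-level RNN (illustration only)."""

    def __init__(self, n_chars, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)               # character -> vector
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # memory cells
        self.out = nn.Linear(hidden_dim, n_chars)                   # hidden -> character logits

    def forward(self, x, hidden=None):
        h, hidden = self.rnn(self.embed(x), hidden)
        return self.out(h), hidden


torch.manual_seed(0)
n_chars = 5                                # toy vocabulary size
model = CharRNN(n_chars)

x = torch.randint(0, n_chars, (2, 7))      # (batch, time) input character indices
y = torch.randint(0, n_chars, (2, 7))      # (batch, time) random stand-in targets

logits, hidden = model(x)                  # logits: (2, 7, n_chars)

# Cross entropy over every time step: flatten batch and time into one axis.
loss = nn.functional.cross_entropy(logits.reshape(-1, n_chars), y.reshape(-1))
loss.backward()
print(logits.shape, loss.item())
```

Swapping `nn.GRU` for `nn.LSTM` would work in the same way, except that the returned state is then a tuple of hidden and cell state.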

Code-wise, it is important to know that for a given sentence, we have an input $x$ consisting of the characters in that sentence, and an output $y$, consisting of the same characters, but **shifted one time-step to the left**. **Because of the directionality of the RNN, for every time-step, it will predict the next character given only the preceding ones.**
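A tiny example of this shift, using the word "hello" as a stand-in sentence. The variable names and the alphabetical index mapping are illustrative assumptions, not the encoding used further on.

```python
sentence = "hello"

# Map every distinct character to an integer index (alphabetical order).
chars = sorted(set(sentence))                       # ['e', 'h', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(chars)}

encoded = [char_to_idx[c] for c in sentence]        # [1, 0, 2, 2, 3]

x = encoded[:-1]   # input: every character except the last
y = encoded[1:]    # target: the same sequence shifted one step to the left

for inp, tgt in zip(sentence[:-1], sentence[1:]):
    print(f"{inp!r} -> {tgt!r}")
```

At every position, the target is simply the character that follows the input character, so the pairs are `h -> e`, `e -> l`, `l -> l` and `l -> o`.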

To train a character-level language model, we'll use the text of the book "Anna Karenina" by Leo Tolstoy.

In [None]:
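import torch
import torch.nn as nn
import numpy as np

import urllib.request

# Download the full text of "Anna Karenina" into the working directory.
url = "https://raw.githubusercontent.com/cdemutiis/LSTM_RNN_Text_generation/master/anna.txt"
urllib.request.urlretrieve(url, "anna.txt")

# Read the whole book into a single string.
# The explicit encoding avoids platform-dependent defaults.
with open('./anna.txt', 'r', encoding='utf-8') as f:
    text = f.read()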