In [1]:
from fastai.text.all import *

We are initially going to use fastai.

In [3]:
path = untar_data(URLs.IMDB)

In [40]:
file_id = 1

In [41]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])
txt = files[file_id].open().read(); txt[:75] # show the first 75 characters of the file

'I have seen this movie and I did not care for this movie anyhow. I would no'

# Neural Networks for Sequences using FastAI

- input is a sequence
- output is a sequence

A *language model* is a model that has been trained to guess the next word in a text. This is *self-supervised learning*: no labels are required, because the text itself provides the targets.
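To make the "no labels" point concrete, here is a minimal sketch in plain Python (not fastai code; the helper name `next_word_pairs` is hypothetical) of how input and target pairs can be read off the text itself: each target is simply the word that follows the context.

```python
def next_word_pairs(text):
    """Return (context, next_word) pairs built from the text alone."""
    words = text.split()
    # Every prefix of the text is an input; the word that follows is its target.
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("I did not care for this movie")
for context, target in pairs[:3]:
    print(context, "->", target)
```

A real language model works on token ids rather than raw words, but the idea of shifting the sequence by one position is the same.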

**Basic Outline**

1) Tokenization (Word or subword) 
2) Numericalization

## 1) Tokenization

There are three main approaches to tokenization:
1) Word-based
2) Subword-based
3) Character-based
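As a rough illustration of the two extremes, the sketch below splits one sentence by words and by characters using only the standard library. The regular expression is an assumption chosen for this example and is much cruder than the tokenizers fastai wraps; subword tokenization sits between the two and is covered further down.

```python
import re

sample = "I did not care for this movie."

# Word-based: words and punctuation marks become separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Character-based: every character, including spaces, is a token.
char_tokens = list(sample)

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens[:10])
```

The word-based split gives few tokens drawn from a large vocabulary, while the character-based split gives many tokens drawn from a tiny one.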

### Word Tokenization

In [42]:
tokenizer = WordTokenizer()
toks = first(tokenizer([txt]))
toks

(#133) ['I','have','seen','this','movie','and','I','did','not','care'...]

The `Tokenizer` class wraps the word tokenizer and adds further functionality, such as lowercasing and special tokens.

In [43]:
tkn = Tokenizer(tokenizer)
print(coll_repr(tkn(txt), 31))

(#152) ['xxbos','i','have','seen','this','movie','and','i','did','not','care','for','this','movie','anyhow','.','i','would','not','think','about','going','to','xxmaj','paris','because','i','do','not','like','this'...]


Special Tokens
- `xxbos` indicates the beginning of a text; “BOS” is a standard NLP acronym that means “beginning of stream”.
- `xxmaj` indicates that the next word begins with a capital letter (needed because everything has been lowercased).
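The two rules above can be approximated in a few lines of plain Python. This is a simplified stand-in, not fastai's implementation: the function name `add_special_tokens` is hypothetical, the other special tokens are ignored, and the "multi-letter words only" condition is an assumption inferred from the output above, where `I` becomes plain `i` while `Paris` becomes `xxmaj paris`.

```python
def add_special_tokens(words):
    """Prepend xxbos, lowercase every word, and mark capitalized words with xxmaj."""
    out = ["xxbos"]  # marks the beginning of the text
    for word in words:
        # Assumption: only multi-letter capitalized words get the marker.
        if len(word) > 1 and word[0].isupper():
            out.append("xxmaj")
        out.append(word.lower())
    return out

print(add_special_tokens(["I", "would", "not", "go", "to", "Paris"]))
```

The marker lets the model keep the information that a word was capitalized without doubling the vocabulary with separate capitalized and lowercase entries.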

### Subword Tokenization (more popular approach)
Word tokenization relies on the assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. Subword tokenization does not depend on it, and can even handle other sequential data such as genomic sequences or MIDI music notation.

Subword Tokenization works in two steps:
1. Analyse a corpus of documents to find the most commonly occurring groups of letters. These become the **vocab**.
2. Tokenize the corpus using this vocab of *subword units*

Instantiate the tokenizer: 
- you have to pass the size of the vocab. This size determines how many tokens are needed to represent a sentence.
    - If we use a larger vocab, most common words will end up in the vocab itself
    - If we use a smaller vocab, each token will be made up of fewer characters and therefore it will take more tokens to represent a sentence.
- then we need to "train" it, which is done by calling `setup`

Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.
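The effect of vocab size on token count can be seen with a toy tokenizer. The sketch below uses greedy longest-match splitting against a hand-written vocab; this is an illustrative substitute, not the algorithm `SubwordTokenizer` uses, and both the function name and the two vocabularies are made up for the example.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` left to right into the longest pieces found in `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character is always accepted.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

small_vocab = {"er", "an"}
large_vocab = {"er", "an", "perform", "ance", "performance"}

print(greedy_subword_tokenize("performance", small_vocab))
print(greedy_subword_tokenize("performance", large_vocab))
```

With the small vocab the word falls apart into many short pieces, while the large vocab covers it with a single token, at the price of more vocab entries to learn embeddings for.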

In [35]:
txts = L(o.open().read() for o in files[:2000])

In [18]:
def subword(txt_train, txt_tok, vocab_size):
    subwordtok = SubwordTokenizer(vocab_sz=vocab_size) # instantiate a subword tokenizer
    subwordtok.setup(txt_train) # train the subword tokenizer
    return ' '.join(first(subwordtok([txt_tok]))[:40]) # tokenize the text and return the first 40 tokens, joined by spaces

subword(txts,txt, 1000)

'▁A l an ▁R ick man ▁ & ▁E mm a ▁Th o mp s on ▁give ▁good ▁performance s ▁with ▁so u ther n / N e w ▁O r le an s ▁a c c ent s ▁in'

## 2) Numericalization

Numericalization is the process of mapping tokens to integers. The steps are the same as those for encoding the classes of a categorical variable, for example the labels in a multi-class classification problem such as MNIST.

1) Make a list of all possible levels of the vocab (the categorical variable)
2) Replace each level with its index in the vocab

Just as with `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the vocab.
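The two steps can be sketched in plain Python as follows. This is a simplified illustration rather than fastai's `Numericalize`: the helper names `build_vocab` and `numericalize` are hypothetical, only two special tokens are included, and there is no cap on vocab size.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=("xxunk", "xxpad")):
    """Step 1: list every token seen at least `min_freq` times, after the special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common()
                if n >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Step 2: replace each token with its index; unknown tokens map to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [["i", "did", "not", "care"], ["i", "did", "like", "it"]]
vocab = build_vocab(docs)
ids = numericalize(["i", "did", "not", "like", "it"], vocab)
print(vocab)
print(ids)
```

Tokens that occur fewer than `min_freq` times never enter the vocab, so they all collapse onto the unknown-token index.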

In [44]:
toks200 = txts[:200].map(tkn)
toks200[file_id]

(#152) ['xxbos','i','have','seen','this','movie','and','i','did','not'...]

In [45]:
num = Numericalize(min_freq=2, max_vocab=1000) # defaults are min_freq=3, max_vocab=60000
num.setup(toks200)
coll_repr(num.vocab,20)

"(#1008) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','to','of','i','it','is','in'...]"

In [46]:
nums = num(toks)[:20]
nums 

TensorText([  0,  38, 140,  20,  27,  12,   0,  73,  36, 255,  28,  20,  27,
              0,  10,   0,  68,  36, 142,  60])

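We can convert the tensor of integers back to text. Note that `toks` here is the raw `WordTokenizer` output rather than the output of `tkn`, so capitalized words such as `I` are not found in the lowercased vocab and map to `xxunk`.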

In [47]:
print(' '.join(num.vocab[o] for o in nums))

xxunk have seen this movie and xxunk did not care for this movie xxunk . xxunk would not think about


## Natural Language Processing

## Training a Text Classifier

## Recurrent Neural Networks (RNN)