# Natural Language Processing (NLP) and RNNs
> A notebook that helps us explore one of the pioneering frameworks in Deep Learning - FastAI
- toc: true
- branch: master
- badges: true
- comments: true
- metadata_key1: metadata_value1
- metadata_key2: metadata_value2
- image: https://miro.medium.com/max/768/1*JbQ58utmVAnAQ1G7ueLMAA.png
- description: Fifth in a series on understanding FastAI.

# Objectives
In this notebook, we are going to dive deep into natural language processing (NLP) using Deep Learning ([info](https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d)). Relying on a pretrained language model, we are going to fine-tune it to classify reviews, which amounts to sentiment analysis.

Starting from a `language model` that has been trained to guess the next word in a text, we will apply transfer learning to this NLP task.

We will start with a language model pretrained on a subset of Wikipedia called Wikitext103. Then, we are going to create an IMDb language model which predicts the next word of a movie review. This intermediate step helps the model learn IMDb-specific vocabulary, such as the names of actors and directors. Finally, we end up with our classifier.
![](https://github.com/fastai/fastbook/blob/master/images/att_00027.png?raw=1)

# Text Preprocessing

Building a language model involves several complexities, such as sentences of different lengths and very long documents. A neural network can deal with these issues.

Previously, we talked about categorical variables (words) which can be used as independent variables for a neural network (using an embedding matrix). We can do the same thing with text!
First, we concatenate all of the documents in our dataset into one long string and split it into words. Our independent variable will be the sequence of words starting with the first word and ending with the second-to-last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word.
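
As a minimal sketch of this offset-by-one setup, the snippet below builds the inputs and targets in plain Python. The sentence is made up, and `str.split` only stands in for a real tokenizer; both are assumptions for illustration.

```python
# Toy example: a whitespace split stands in for a real tokenizer (illustrative assumption).
text = "I did n't know what to expect when I started watching this movie"
tokens = text.split()

# Independent variable: every token from the first to the second-to-last.
x = tokens[:-1]
# Dependent variable: every token from the second to the last.
y = tokens[1:]

# Each target is simply the token that follows the corresponding input.
for inp, target in list(zip(x, y))[:3]:
    print(f"{inp!r} -> {target!r}")
```

Both sequences have the same length, and `y[i]` is always the word that comes right after `x[i]`.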

Our vocab will contain a mix of very common words and new words. For the new words, we have no prior knowledge, so we just initialize the corresponding row of the embedding matrix with a random vector.

The steps above can be summarized as follows:
- Tokenization: convert the text into a list of words
- Numericalization: make a list of all the unique words which appear, and convert each word into a number, by looking up its index in the vocab.
- Language model data loader creation: handle creating the dependent variable
- Language model creation: handle input lists of varying length by using a recurrent neural network.

## Tokenization
Basically, tokenization converts the text into a list of words. First, we will grab our IMDb dataset and try out the tokenizer on the text files.

In [1]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [2]:
files = get_text_files(path,folders=['train','test','unsup'])

The default English word tokenizer that FastAI uses is spaCy, which has a sophisticated rules engine with special rules for particular words and URLs. Rather than directly using ```SpacyTokenizer```, we are going to use ```WordTokenizer```, which always points to fastai's current default word tokenizer.

In [5]:
txt = files[0].open().read()
txt[:60]
spacy = WordTokenizer()
toks = first(spacy([txt]))

print(coll_repr(toks,30))

(#212) ['I','did',"n't",'know','what','to','expect','when','I','started','watching','this','movie',',','by','the','end','of','it','I','was','pulling','my','hairs','out','.','This','was','one','of'...]
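
To get a feel for what "rules for particular words and URLs" means, here is a toy regex tokenizer. It is only a sketch under simplifying assumptions: the pattern `TOKEN_RE` and the function `toy_tokenize` are hypothetical names, and the rules are far simpler than spaCy's actual engine.

```python
import re

# Toy rules (NOT spaCy's): keep URLs whole, split off the "n't" contraction,
# then match plain words, then single punctuation marks.
TOKEN_RE = re.compile(r"https?://\S+|n't|\w+(?=n't)|\w+|[^\w\s]")

def toy_tokenize(text):
    """Return the list of tokens matched by the toy rules above."""
    return TOKEN_RE.findall(text)

sample = "I didn't know what to expect. See https://www.fast.ai now!"
print(toy_tokenize(sample))
```

Even this small pattern shows why a plain whitespace split is not enough: punctuation, contractions and URLs each need their own rule.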


### Subword tokenization

In addition to word tokenizers, subword tokenizers are really useful for languages in which spaces are not used to separate the components of a sentence (e.g. Chinese). Subword tokenization takes two steps:
- Analyze a corpus of documents to find the most commonly occurring groups of letters, which form the vocab
- Tokenize the corpus using this vocab of subword units

For example, we will first look at 2000 movie reviews:

In [7]:
txts = L(o.open().read() for o in files[:2000])

In [8]:
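def subword(sz):
    # Train a subword tokenizer with a vocab of size `sz` on the 2000 reviews
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    # Show the first 40 subword tokens of our sample review
    return ' '.join(first(sp([txt]))[:40])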


In [9]:
subword(1000)

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=1000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj --hard_vocab_limit=false


"▁I ▁didn ' t ▁know ▁what ▁to ▁expect ▁when ▁I ▁start ed ▁watching ▁this ▁movie , ▁by ▁the ▁end ▁of ▁it ▁I ▁was ▁p ul ling ▁my ▁ ha ir s ▁out . ▁This ▁was ▁one ▁of ▁the ▁most ▁pa"

Here, the special character `▁` stands for a space in the original text, so we can tell where each word actually starts and stops.

In [10]:
subword(10000)

"▁I ▁didn ' t ▁know ▁what ▁to ▁expect ▁when ▁I ▁started ▁watching ▁this ▁movie , ▁by ▁the ▁end ▁of ▁it ▁I ▁was ▁pull ing ▁my ▁hair s ▁out . ▁This ▁was ▁one ▁of ▁the ▁most ▁pathetic ▁movies ▁of ▁this ▁year"

If we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence. So there is a trade-off to take into account when choosing the subword vocab size: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember, but it comes with the downside of a larger embedding matrix, which requires more data to learn.
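
The effect of vocab size on token count can be illustrated with a toy greedy longest-match splitter. This is a simplified stand-in, not the unigram algorithm that SentencePiece runs in the cells above, and both vocabularies are hand-written for the example.

```python
def greedy_subword_tokenize(text, vocab):
    """Split text into the longest pieces found in vocab, left to right (toy method)."""
    # Mark word boundaries with "▁", mimicking the look of the output above.
    s = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(s):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces

# Hand-made vocabularies (assumptions for illustration only).
small_vocab = {"▁p", "ul", "ling", "▁my", "▁", "ha", "ir", "s", "▁out"}
large_vocab = {"▁pull", "ing", "▁my", "▁hair", "s", "▁out"}

small_pieces = greedy_subword_tokenize("pulling my hairs out", small_vocab)
large_pieces = greedy_subword_tokenize("pulling my hairs out", large_vocab)
print(len(small_pieces), " ".join(small_pieces))
print(len(large_pieces), " ".join(large_pieces))
```

With the vocab that contains longer pieces, the same phrase needs fewer tokens, which is the trade-off described above.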

## Numericalization

In order to numericalize, we need to call ```setup``` first to create the vocab. 

In [13]:
tkn = Tokenizer(spacy)
toks300 = txts[:300].map(tkn)
toks300[0]

(#231) ['xxbos','i','did',"n't",'know','what','to','expect','when','i'...]

In [16]:
num = Numericalize()
num.setup(toks300)
coll_repr(num.vocab,20)

"(#2576) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','i','is','it','this'...]"

The vocab lists our special rule tokens first, followed by every word that appears, in frequency order.
Once we have created our ```Numericalize``` object, we can use it as if it were a function.
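
The idea behind numericalization can be sketched in a few lines of plain Python. This is not fastai's implementation: `build_vocab`, `numericalize`, the shortened `SPECIALS` list and the two tiny documents are all hypothetical, chosen only to mirror the behavior described above.

```python
from collections import Counter

# Shortened list of special tokens (assumption: the real vocab has more of them).
SPECIALS = ["xxunk", "xxpad", "xxbos"]

def build_vocab(token_lists, min_freq=1):
    """Special tokens first, then the remaining words by decreasing frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [w for w, c in counts.most_common() if c >= min_freq and w not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Map each token to its index in vocab; unknown tokens map to index 0 (xxunk)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [
    ["xxbos", "i", "like", "this", "movie", "."],
    ["xxbos", "i", "did", "n't", "like", "it", "."],
]
vocab = build_vocab(docs)
ids = numericalize(["xxbos", "I", "like", "it", "."], vocab)
print(vocab)
print(ids)
```

Any token that is missing from the vocab, such as the capitalized `I` here, falls back to the index of `xxunk`.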

In [17]:
nums = num(toks)[:20]
nums

TensorText([  0,  90,  32, 133,  63,  15, 495,  73,   0, 670, 160,  19,  26,  11,
         70,   9, 138,  14,  18,   0])

In [18]:
' '.join(num.vocab[o] for o in nums)

"xxunk did n't know what to expect when xxunk started watching this movie , by the end of it xxunk"

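Note that ```xxunk``` appears wherever the original text had `I`: ```toks``` came from the raw ```WordTokenizer```, while the vocab was built from the lowercased output of ```Tokenizer```, so the capitalized `I` is not in the vocab.

Now that we have numerical data, we need to put it into batches for our model.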

### Batches of texts

Recall that when creating batches of images, we had to resize all the images to the same size before grouping them together in a single tensor for efficient computation. Things are a little different with text, because it is not desirable to resize the text length. Also, we want the model to read text in order so that it can efficiently predict what the next word is. This suggests that each new batch should begin precisely where the previous one left off.

So, the text stream will be cut into a certain number of contiguous pieces (as many as the batch size), preserving the order of the tokens, because we want the model to read continuous rows of text.

To recap, at every epoch, we shuffle our collection of documents and concatenate them into a stream of tokens. Then, that stream is cut into a batch of fixed-size consecutive mini-streams. The model reads these mini-streams in order and, thanks to its internal state, produces the same activations whatever sequence length we pick.
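
The cutting scheme can be sketched with plain lists. This is a simplified illustration, not the library's actual code: `lm_batches` is a hypothetical helper, and a stream of 25 integers stands in for token ids.

```python
def lm_batches(stream, bs, seq_len):
    """Cut a token stream into bs contiguous rows, then into chunks of seq_len."""
    # Reserve one token so that the last target (shifted by one) still exists.
    row_len = (len(stream) - 1) // bs
    # Row r is one contiguous slice of the stream (plus one token for the target).
    rows = [stream[r * row_len:(r + 1) * row_len + 1] for r in range(bs)]
    batches = []
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]          # inputs
        y = [row[start + 1:end + 1] for row in rows]  # targets: inputs shifted by one
        batches.append((x, y))
    return batches

stream = list(range(25))  # stand-in for a stream of token ids
batches = lm_batches(stream, bs=2, seq_len=4)
for x, y in batches:
    print(x, y)
```

Row `r` of each batch starts exactly where row `r` of the previous batch ended, so the model reads every mini-stream in order.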

In FastAI, this is all done with ```LMDataLoader```.

In [19]:
nums300 = toks300.map(num)

In [20]:
dl = LMDataLoader(nums300)

In [22]:
x,y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

Each batch has shape 64x72: 64 is the default batch size and 72 is the default sequence length.

# Training a Text Classifier