[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1blkWFKMYqinrtTSKWO_Uvv7v_TvtX6zs?usp=sharing)

# Lesson 8: Going Deeper into NLP

Previously we saw how convenient it was to use the `pipeline` method of the HuggingFace.co `transformers` library to perform a variety of Natural Language Processing (NLP) tasks. But there's a lot going on under the hood that was hidden from us.  If we want to learn how these models work, we're going to have to peel back several layers, on multiple levels.  

What we did in the previous lesson was a bit like watching a big rocket take off from a distance. There are many systems in the rocket that are all working together to effect the launch.  To understand how the big rocket operates, it will help if we go back to study smaller, simpler rockets so that we understand the principles of rocketry. 

In this lesson we'll learn the parts of an NLP model and see how they go together. 

## 1. Tokenization

Whatever NLP task we're interested in performing, there will be a large amount of text (sometimes called a "corpus") that we will use for training the model. That text needs to be split up somehow into bite-sized parts to operate upon. This process is known as *tokenization*. We could try treating individual characters as tokens, or [regard entire sentences as our tokens](https://claritynlp.readthedocs.io/en/latest/developer_guide/algorithms/sentence_tokenization.html), but a common mid-point is to use *words* as tokens.  

> For a great example of a character-based neural network, see [Andrej Karpathy's Char-RNN](https://github.com/karpathy/char-rnn).

The simplest -- and typically *the default* -- scheme for word-level tokenization is just to split the text at every space and at every punctuation mark. So, for instance,
```
I'm going to the store, because I need some milk.
```
might become
```
["I", "'", "m", "going", "to", "the",  "store", ",", "because", "I", "need", "some", "milk", "."]
```
Tokenization is something that many computational linguists have spent a great deal of time on, and there are [a variety of tokenizers](https://towardsdatascience.com/overview-of-nlp-tokenization-algorithms-c41a7d5ec4f9?gi=73a2ec14356e) available. It's generally in our best interest to just call a library such as the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) to do the tokenizing for us instead of trying to do it from scratch. Both FastAI and HuggingFace allow us to choose between a variety of tokenizers.  (FastAI's default tokenizer is currently from the [spaCy NLP library](https://spacy.io/).)
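
Purely to illustrate the naive "split at every space and punctuation mark" scheme described above (and not as a substitute for a library tokenizer), here is a minimal sketch using Python's built-in `re` module. The function name `naive_tokenize` is made up for this example.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      : one or more letters, digits, or underscores (a "word")
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I'm going to the store, because I need some milk."
print(naive_tokenize(sentence))
```

Notice that this scheme chops "I'm" into three pieces, which is one reason dedicated tokenizers exist.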

Let's try an actual example using the NLTK word tokenizer:

In [None]:
import nltk
nltk.download('punkt')    # this is a resource needed by NLTK
sentence = "I'm going to the store, because I need some milk."
tokens = nltk.word_tokenize(sentence)
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['I', "'m", 'going', 'to', 'the', 'store', ',', 'because', 'I', 'need', 'some', 'milk', '.']


Interesting that the apostrophe from "I'm" went with the "m" (as in "'m") instead of being its own thing. Presumably this is so we can then expand it into "am".  What about the "n" in "don't"? 

In [12]:
sentence2 = "I don't know what's going to happen in this case, but it should be interesting!"
tokens = nltk.word_tokenize(sentence2)
print(tokens)

['I', 'do', "n't", 'know', 'what', "'s", 'going', 'to', 'happen', 'in', 'this', 'case', ',', 'but', 'it', 'should', 'be', 'interesting', '!']


Ok, so in that case the "n" from "don't" went with the "'t". Again, this best facilitates filling in the missing "o".  Let's try some spirited Tennessee-style language:

In [13]:
sentence3 = "I'm fixin' to spend $1499.95 on a new four wheeler and you ain't gonna stop me, ma!"
nltk.word_tokenize(sentence3)

['I',
 "'m",
 'fixin',
 "'",
 'to',
 'spend',
 '$',
 '1499.95',
 'on',
 'a',
 'new',
 'four',
 'wheeler',
 'and',
 'you',
 'ai',
 "n't",
 'gon',
 'na',
 'stop',
 'me',
 ',',
 'ma',
 '!']

Wow, it even handles "ain't", splitting it into "ai" and "n't" just like it did for "don't"!  And it splits "gonna" into "gon" and "na", presumably in preparation for a mapping to "going", "to".

Do we need the commas and exclamation points though?  Maybe, maybe not.  It depends on our use case.  Sometimes other punctuation is relevant, such as hashtags and @-symbols for social media.  NLTK has a special tokenizer for Twitter:

In [22]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()

tweet = "OMG I love @SuperFamousPerson's new look! #fridays #nofilter"

# Let's compare the two tokenizers:
print("Regular word tokenizer:", nltk.word_tokenize(tweet))
print("Tweet tokenizer:       ",tt.tokenize(tweet))

Regular word tokenizer: ['OMG', 'I', 'love', '@', 'SuperFamousPerson', "'s", 'new', 'look', '!', '#', 'fridays', '#', 'nofilter']
Tweet tokenizer:        ['OMG', 'I', 'love', '@SuperFamousPerson', "'", 's', 'new', 'look', '!', '#fridays', '#nofilter']


...So the specialty `TweetTokenizer` kept certain kinds of punctuation with their associated words, rather than splitting at all forms of punctuation like the regular word tokenizer did.
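
To get a feel for how such behavior could arise, here is a rough sketch with a regular expression that tries the "hashtag or mention" pattern before falling back to plain words and punctuation. This is only an illustration under our own assumptions; it is not how NLTK's `TweetTokenizer` is implemented, and the name `toy_tweet_tokenize` is made up.

```python
import re

def toy_tweet_tokenize(text):
    """Keep a leading @ or # attached to the word that follows it."""
    # Alternatives are tried left to right, so "[@#]\w+" wins whenever it can match.
    return re.findall(r"[@#]\w+|\w+|[^\w\s]", text)

tweet = "OMG I love @SuperFamousPerson's new look! #fridays #nofilter"
print(toy_tweet_tokenize(tweet))
```

The order of the alternatives matters: if the plain-word pattern came first, the `@` and `#` symbols would be split off as separate tokens again.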



Beyond the question of which punctuation to keep, we must also recognize that words come in a variety of forms, and that some words may be "filler" that we don't need for the task at hand (e.g., articles like "a", "an", and "the" are often discarded).  We may wish to regard related words such as "jump", "jumping", "jumps",... as variations on the *stem* "jump".  We may hang on to the endings such as "-ing" for later use, regarding them as additional tokens. The process of *stemming* (or "*stemmification*") is the breaking up of words into their stems and hanging on to the endings (or not).

Also, what about compound words?  Some languages such as German will make very long single words (e.g., "Geschwindigkeitsbegrenzung" for "speed limit") that in other languages would be written as separate words. If language translation is our goal, some way of tokenizing that accommodates such variability would be important.

Finally, back to punctuation: to keep things simple, we could just delete all forms of punctuation -- or expand contractions like "I'll" to "I will", and so forth -- and yet if we want a highly accurate model we may find that holding on to some forms of punctuation will be important.
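
To make the stemming idea described above concrete, here is a deliberately crude sketch that strips a few common English endings and hands back both the stem and the ending. The suffix list and the name `toy_stem` are our own assumptions; real stemmers (for example NLTK's `PorterStemmer`) use far more careful rules.

```python
# A toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ed", "s")   # assumed list, checked in this order

def toy_stem(word):
    """Return (stem, ending); ending is '' when no suffix was removed."""
    for suffix in SUFFIXES:
        # Require at least 3 letters to remain so that e.g. "is" stays intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)], suffix
    return word, ""

for w in ["jump", "jumps", "jumping", "jumped"]:
    print(w, "->", toy_stem(w))
```

A rule this simple will mangle plenty of words (think "sing" versus "running"), which is exactly why we usually lean on a library.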



####  Special Token Codes
Often language models will make use of special tokens such as `UNK` (a token to substitute for unknown words) or `PAD` (for extra padding words), or `EOS` (end of sentence), depending on the task at hand. Sometimes these will have extra characters like `<UNK>` or `[UNK]`. There may or may not be `<START>` and `<END>` tokens for the beginning and end of the text.  The exact list of special tokens depends on the tokenizer and the model, but those few are pretty universal. So when you see those, in what follows, you'll be prepared.  
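
As a small illustration of how such special tokens might be used, the sketch below swaps out-of-vocabulary words for an unknown token, appends an end-of-sentence token, and pads to a fixed length. The token spellings, the helper name `prepare`, and the tiny vocabulary are all assumptions made for this example.

```python
UNK, PAD, EOS = "<UNK>", "<PAD>", "<EOS>"   # assumed spellings; real models vary

def prepare(tokens, vocab, length):
    """Replace unknown words, mark the end, then pad (or truncate) to `length`."""
    out = [t if t in vocab else UNK for t in tokens] + [EOS]
    out = out[:length]                        # truncation may drop the EOS token
    return out + [PAD] * (length - len(out))

vocab = {"i", "need", "some", "milk"}
print(prepare(["i", "need", "kombucha"], vocab, length=6))
```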
 





## Numericalization / Word Vectors
Once we have the tokens, we still need to convert these into numbers somehow so we can operate on them mathematically. Depending on the application, different numericalization schemes are available. 

One *very simple* way to do this if we were, say, doing *Sentiment Classification* in tweets, movie reviews, or other kinds of "posts", would be to count the frequency of all the words that appear in positive posts, and do the same for all the negative posts.  Expressing these frequencies as fractions of the total number of words, we could then assign to each word its pair of "positive use" and "negative use" frequency values $(f_p, f_n)$, which lie in the two-dimensional [unit square](https://en.wikipedia.org/wiki/Unit_square) (shown below). These would then form the coordinates for a *word vector* of our word in its *embedding* space (i.e., the unit square in this case).  Then to classify a post, we could just take the mean of the word vectors of all the words in the post and see whether the result is more "positive" than "negative". In other words, we could ask: which region of the following embedding diagram does the mean of the word vectors in the post lie in?

![img of regions of positive and negative](https://i.imgur.com/mLpQHBj.png)
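
Here is a minimal sketch of that frequency-based scheme on a made-up toy corpus. The four example posts, the function names, and the choice of the diagonal $f_p = f_n$ as the boundary between the two regions are all assumptions for illustration.

```python
from collections import Counter

# Tiny made-up training data
positive_posts = ["great fun movie", "great acting"]
negative_posts = ["boring movie", "awful boring plot"]

def frequencies(posts):
    """Map each word to its share of all the words in `posts`."""
    counts = Counter(word for post in posts for word in post.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

f_pos = frequencies(positive_posts)
f_neg = frequencies(negative_posts)

def word_vector(word):
    """The (f_p, f_n) coordinates of a word in the unit square."""
    return (f_pos.get(word, 0.0), f_neg.get(word, 0.0))

def classify(post):
    """Average the word vectors of a non-empty post and compare coordinates."""
    vectors = [word_vector(w) for w in post.split()]
    mean_p = sum(v[0] for v in vectors) / len(vectors)
    mean_n = sum(v[1] for v in vectors) / len(vectors)
    return "positive" if mean_p > mean_n else "negative"   # ties count as negative

print(word_vector("movie"), classify("great movie"), classify("boring plot"))
```

Note that "movie" lands exactly on the diagonal here, since it is used equally often in both kinds of posts.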



That might suffice as a simple baseline model, and it can work "ok", but there are issues with it. For example, it's possible that different words could get mapped to the exact same point. If all you care about is how positive or how negative the post (or tweet, or review) is, this may not be a problem,  but if you want to "understand" the text, produce a translation of it, or generate new text, then this method is useless. 

> Terminology: note that this method of just combining the word vectors pays no attention to the *order* of the words, so the above model would be termed a "Bag of Words" model.  

In order to help preserve uniqueness as well as to better allow words to express their ranges of meanings, one typically uses many more than two dimensions for word vector embeddings.  It's quite common to see 256 or more (e.g. 300) dimensions for words.  While these are too many dimensions to visualize (which is why I gave the simple example above!) the computer is able to deal with them just fine.  

The way one typically gets these word vectors is to take in the list of all the (unique) words in the corpus and produce a "vocabulary" which indexes the words and generates a one-hot encoding by treating the words as categories.  Then we map these categories into word vectors via a matrix of trainable weights. So, for example, a corpus with 10,000 unique words mapped into 300-dimensional word vectors would involve a weights matrix of 300\*10000 = 3 million weights.  Thus *the "embedding" mapping is itself a neural network* which we train as the front-end of our full (larger) neural network.
This means that the more words you allow in your vocabulary (or "vocab"), the bigger that initial embedding operation will be.  Typically, in order to keep this matrix from getting too big, one will truncate the list of words by removing the less frequent or less important words from the vocab and replacing them with special tokens such as `UNK`. 
The form the embedding takes may depend on the task.  
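
To see why a one-hot encoding followed by a weights matrix is the same thing as simply looking up a row of that matrix, here is a small pure-Python sketch. The four-word vocabulary, the 3-dimensional embedding, and the random (untrained) weights are assumptions chosen to keep the example tiny.

```python
import random

vocab = ["<UNK>", "the", "store", "milk"]          # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}
embedding_dim = 3

random.seed(0)
# One row of (here random, normally trainable) weights per word in the vocab
weights = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(word):
    """Multiply the one-hot row vector by the weights matrix."""
    x = one_hot(word_to_index[word], len(vocab))
    return [sum(x[i] * weights[i][j] for i in range(len(vocab)))
            for j in range(embedding_dim)]

def embed_by_lookup(word):
    """Just pick out the row for this word."""
    return weights[word_to_index[word]]

print(embed_by_matmul("milk") == embed_by_lookup("milk"))
print("number of weights:", len(vocab) * embedding_dim)
```

The same bookkeeping gives the 300 x 10,000 = 3,000,000 weights mentioned above for a 10,000-word vocab with 300-dimensional vectors.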

## Language Modeling as a Pretraining Task
One very useful method is to use a *language model* task to produce word embeddings.  A language model tries to predict the next word in a sequence given its preceding words (how many preceding words you use determines the sophistication of the model). This forms a "self-supervised learning" method in the sense that the target data you train on is the same as the input data, just shifted ahead by one word. 
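
As a bare-bones illustration of "predict the next word from the preceding words", here is a sketch of the simplest possible case: a count-based model that looks at only *one* preceding word. This is a stand-in to show the input/target relationship, not the neural network approach discussed in this lesson; the function names and the toy text are made up.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count which word follows which: the targets are the inputs shifted by one."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or <UNK> if never seen."""
    if word not in model:
        return "<UNK>"
    return model[word].most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram(tokens)
print(predict_next(model, "the"), predict_next(model, "on"))
```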

This approach was used to great effect by Jeremy Howard and Sebastian Ruder in their [ULMFiT paper](https://paperswithcode.com/method/ulmfit), in which they used a language model task of predicting the next word in Wikipedia (specifically, the [Wikitext-103](https://paperswithcode.com/dataset/wikitext-103) dataset) in order to pretrain a model that could then be used for other tasks such as sentiment analysis of IMDB movie reviews.  Their result was that they beat other competing sentiment analysis methods by a long shot!  

The idea is that a model that has to predict the next word in a large text has to develop somewhat of an "understanding" of how language works, and thus will be a more powerful model for text classification than a simpler model that has not been pretrained in this way.

> Note: A neat effect of this form of pre-training is that you also end up with a text generation model.

Now, we're not going to train a model on Wikipedia right now.  That would be a waste of time, as we can just download pretrained weights and go from there.  Let's use the fastai set of methods for doing this, and we'll work through their IMDB example problem [as described in Chapter 10 of the `fastbook`](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb).  To get started we'll need to download the dataset and start using fastai's tokenizer(s).

In [1]:
!pip install -Uqq fastai fastbook

In [17]:
# if the next line produces an error, restart the runtime and try again.
import fastbook  
from fastai.text.all import *
from IPython.display import display, HTML

In [16]:
path = untar_data(URLs.IMDB)  # download the dataset

In [5]:
# make a list of all the files in all the folders of the dataset
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

# let's look at the first 75 characters of the first file in the list
txt = files[0].open().read();  txt[:75]

'Kannathil Muthamittal is for sure a great movie. I have to give it to Mani '

As we mentioned above, the current default tokenizer in FastAI is from the spaCy NLP package:

In [8]:
spacy = WordTokenizer()
spacified = spacy([txt])  
print(spacified)

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7f9f15d77f50>


So calling the word tokenizer returns a generator. In order to access its output we can use fastai's `first()` (or Python's built-in `next()`):


In [10]:
toks = first(spacy([txt]))
print(toks)   # This prints out all the tokens
print(coll_repr(toks, 30))  # fastai's coll_repr method gives the total size and first N (=30) tokens

['Kannathil', 'Muthamittal', 'is', 'for', 'sure', 'a', 'great', 'movie', '.', 'I', 'have', 'to', 'give', 'it', 'to', 'Mani', 'Ratnam', 'for', 'a', 'great', 'directing', 'job', 'and', 'A.R.', 'Rahman', 'for', 'great', 'songs', '.', 'The', 'camera', 'work', 'is', 'just', 'excellent', 'and', 'is', 'similar', 'to', 'Black', 'Hawk', 'Down', 'and', 'Saving', 'Private', 'Ryan', '.', 'I', 'will', 'be', 'shocked', 'if', 'this', 'movie', 'does', 'not', 'win', 'an', 'Oscar', 'for', 'Best', 'Foreign', 'Film', 'or', 'even', 'Best', 'Camera', 'Work', '.']
(#69) ['Kannathil','Muthamittal','is','for','sure','a','great','movie','.','I','have','to','give','it','to','Mani','Ratnam','for','a','great','directing','job','and','A.R.','Rahman','for','great','songs','.','The'...]


In addition to `WordTokenizer`, fastai adds some extra functionality via its `Tokenizer` class, which will turn all words to lower case but precede each such intervention with a special code, `xxmaj`, indicating that the next word should be capitalized (or `xxup` if the whole word was in upper case).  It also adds `xxbos` to denote the beginning of the text (the "stream"). 

In [12]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#91) ['xxbos','xxmaj','kannathil','xxmaj','muthamittal','is','for','sure','a','great','movie','.','i','have','to','give','it','to','xxmaj','mani','xxmaj','ratnam','for','a','great','directing','job','and','xxup','a.r','.'...]
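
To get a feel for what these case markers are doing, here is a simplified imitation in plain Python. It is only a sketch that reproduces the pattern seen in the output above (including single letters like "I" being lowercased without a marker); it is not fastai's actual set of rules, and the name `mark_case` is made up.

```python
def mark_case(tokens):
    """Lowercase every token, inserting xxup / xxmaj markers before changed words."""
    out = ["xxbos"]                       # beginning-of-stream marker
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += ["xxup", tok.lower()]  # whole word was upper case, e.g. "API"
        elif len(tok) > 1 and tok[0].isupper():
            out += ["xxmaj", tok.lower()] # word was capitalized
        else:
            out.append(tok.lower())       # already lower case, or a single character
    return out

print(mark_case(["Kannathil", "Muthamittal", "is", "great", ".", "I", "have"]))
```

The point of the markers is that "Movie" and "movie" can share one vocabulary entry without throwing away the capitalization information.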


> Note: fastai also has a tokenization method that will use sub-words -- i.e., groups of characters -- but we're going to skip that part for now.

To do calculations on the GPU, it's helpful to work with "batches" of data, just like we did for images.  Every item in a batch needs the same dimensions, so we will chop the text up into "chunks" of length `seq_len` and then group these into batches.  Rather than totally randomly assigning the order of the batches, we will have the model "read" the text sequentially, so that each row of a new batch simply picks up where the same row of the previous batch left off. 

See this fastai example where they use a batch size of `bs=6` and sequence length of `seq_len=5` to produce one batch from a sample text:

In [40]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
print(stream)
tokens = tkn(stream)
print("\n",len(tokens),"tokens in stream.")

In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.
Then we will study how we build a language model and train it for a while.

 90 tokens in stream.


Although we could randomly grab "chunks" from all over the file and try to predict the word following each chunk, the fastai folks recommend making the text in each row of each batch follow immediately from the text in the corresponding row of the previous batch.  That means writing some fancy slicing code like the following, in which we show three sequential batches.   

In [52]:

bs,seq_len = 6, 5                          # batch size and sequence length
num_batches = len(tokens)// bs // seq_len  # 30 tokens per batch, 90 tokens = 3 batches. 
print("num_batches = ",num_batches)  
num_rows = len(tokens) // seq_len          # total rows of all batches == 18
print("num_rows = ",num_rows)

for b in range(num_batches):
    stride = seq_len * num_batches 
    d_tokens = np.array([tokens[i*stride + b*seq_len :i*stride + b*seq_len + seq_len] for i in range(bs)]) # i is the row number
    df = pd.DataFrame(d_tokens)
    print(f"\nbatch = {b}:")
    display(HTML(df.to_html(index=False,header=None)))


num_batches =  3
num_rows =  18

batch = 0:


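| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| xxbos | xxmaj | in | this | chapter |
| movie | reviews | we | studied | in |
| first | we | will | look | at |
| how | to | customize | it | . |
| of | the | preprocessor | used | in |
| will | study | how | we | build |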



batch = 1:


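| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| , | we | will | go | back |
| chapter | 1 | and | dig | deeper |
| the | processing | steps | necessary | to |
| xxmaj | by | doing | this | , |
| the | data | block | xxup | api |
| a | language | model | and | train |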



batch = 2:


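| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| over | the | example | of | classifying |
| under | the | surface | . | xxmaj |
| convert | text | into | numbers | and |
| we | 'll | have | another | example |
| . | `\n` | xxmaj | then | we |
| it | for | a | while | . |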


See how each row of each batch continues the text from the same row in the preceding batch? 

When we were training on images, we shuffled the order of images between epochs.  In the case of NLP we don't want to shuffle the words or even the rows.  Instead, when we take a bunch of movie reviews and concatenate them to form a stream (which is then broken into tokens and then batches), what we do is randomize the *order in which the reviews are concatenated* at each epoch.  This lets the word orderings stay the same while the places where they appear in the training dataset still shift around a bit, which helps prevent overfitting. 
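
The sketch below shows the idea of shuffling at the level of whole reviews: the reviews change places from epoch to epoch, but the words inside each review keep their order. The helper `make_stream` and the toy "reviews" are made up for illustration, and this is not fastai's actual code.

```python
import random

def make_stream(tokenized_reviews, seed):
    """Concatenate the reviews in a random order; each review stays intact."""
    order = list(range(len(tokenized_reviews)))
    random.Random(seed).shuffle(order)       # a different seed for each epoch
    stream = []
    for i in order:
        stream.extend(tokenized_reviews[i])
    return stream

reviews = [["xxbos", "a", "b"], ["xxbos", "c"], ["xxbos", "d", "e", "f"]]
for epoch in range(3):
    print(epoch, make_stream(reviews, seed=epoch))
```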


This is generally handled automatically by fastai, which will define the Tokenizer, set it up, specify a Numericalize function, and set that up too.  Here we show a brief example of that:

In [56]:
txts = L(o.open().read() for o in files[:2000])  # read texts of the first 2000 files
txts[0]

'Kannathil Muthamittal is for sure a great movie. I have to give it to Mani Ratnam for a great directing job and A.R. Rahman for great songs. The camera work is just excellent and is similar to Black Hawk Down and Saving Private Ryan. I will be shocked if this movie does not win an Oscar for Best Foreign Film or even Best Camera Work.'

In [58]:
toks200 = txts[:200].map(tkn)   # tokenize the first 200 files, by mapping the "tkn" function to the elements of text.
toks200[0]  # show us the tokens corresponding to the text in the first file 

(#91) ['xxbos','xxmaj','kannathil','xxmaj','muthamittal','is','for','sure','a','great'...]

In [59]:
num = Numericalize()
num.setup(toks200)   # create a vocab for the stream we've created. 
coll_repr(num.vocab,20)  # show the first 20 words in the vocab, in order of descending frequency

"(#1904) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','i'...]"

Then we can show how these individual tokens are rendered as numbers.  Each token is replaced by its index in the vocab (so `xxbos` becomes 2 and `xxmaj` becomes 8):

In [65]:
nums200 = toks200.map(num);
print(toks200[0])
print(nums200[0])

['xxbos', 'xxmaj', 'kannathil', 'xxmaj', 'muthamittal', 'is', 'for', 'sure', 'a', 'great', 'movie', '.', 'i', 'have', 'to', 'give', 'it', 'to', 'xxmaj', 'mani', 'xxmaj', 'ratnam', 'for', 'a', 'great', 'directing', 'job', 'and', 'xxup', 'a.r', '.', 'xxmaj', 'rahman', 'for', 'great', 'songs', '.', 'xxmaj', 'the', 'camera', 'work', 'is', 'just', 'excellent', 'and', 'is', 'similar', 'to', 'xxmaj', 'black', 'xxmaj', 'hawk', 'xxmaj', 'down', 'and', 'xxmaj', 'saving', 'xxmaj', 'private', 'xxmaj', 'ryan', '.', 'i', 'will', 'be', 'shocked', 'if', 'this', 'movie', 'does', 'not', 'win', 'an', 'xxmaj', 'oscar', 'for', 'xxmaj', 'best', 'xxmaj', 'foreign', 'xxmaj', 'film', 'or', 'even', 'xxmaj', 'best', 'xxmaj', 'camera', 'xxmaj', 'work', '.']
TensorText([   2,    8,    0,    8,    0,   16,   30,  273,   13,   72,   29,   11,   19,   45,   15,  223,   18,   15,    8,    0,    8,    0,   30,   13,   72,  469,  297,   12,    7,    0,   11,    8,
           0,   30,   72,  470,   11,    8,    9,  426, 

^Note how the unknown / low frequency words get mapped to 0. 
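
If you're curious what a numericalization step like this boils down to, here is a stripped-down sketch in plain Python: build a vocab from token counts, drop rare words, and map everything else to index 0. The function names, the `min_freq=2` threshold, and the shortened list of special tokens are assumptions for illustration, not fastai's implementation.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=("xxunk", "xxpad", "xxbos")):
    """Special tokens first, then remaining words by descending frequency."""
    counts = Counter(t for toks in token_lists for t in toks)
    words = [w for w, n in counts.most_common()
             if n >= min_freq and w not in specials]
    return list(specials) + words

def numericalize(tokens, vocab):
    """Replace each token by its vocab index; unknown tokens get 0 (xxunk)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(t, 0) for t in tokens]

docs = [["xxbos", "a", "great", "movie"], ["xxbos", "a", "great", "film"]]
vocab = build_vocab(docs)
print(vocab)
print(numericalize(docs[0], vocab))
```

Here "movie" and "film" each appear only once, so they fall below the threshold and come out as 0.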

These can then go into a fastai DataLoader which has been set up for language modeling, [`LMDataLoader`](https://docs.fast.ai/text.data.html#LMDataLoader), which is designed to load a batch of text as an input and the *same text shifted ahead by one word* as the target data:

In [68]:
dl = LMDataLoader(nums200)

# test it
x,y = first(dl)
print(x.shape,y.shape)

# we can print out x & y, but let's convert them from numbers to text when we view them
print(', '.join(num.vocab[o] for o in x[0][:20]))
print(', '.join(num.vocab[o] for o in y[0][:20]))

torch.Size([64, 72]) torch.Size([64, 72])
xxbos, xxmaj, xxunk, xxmaj, xxunk, is, for, sure, a, great, movie, ., i, have, to, give, it, to, xxmaj, xxunk
xxmaj, xxunk, xxmaj, xxunk, is, for, sure, a, great, movie, ., i, have, to, give, it, to, xxmaj, xxunk, xxmaj


See how each word in y is just the corresponding "next word in x" at the same index?  As a simple exercise, can you do the same?  Write a "shift left" function that just shifts a set of list elements to the left.  Add a "xxpad" on the end:

In [69]:
## EXERCISE. Fill in your code below as directed

def shift_left(orig:list):   
    ## Your code below. Define a variable called "shifted" that is the original 
    #  list, shifted to the left by one, and filled in with a "xxpad" at the end.
 
    shifted =  
 
    ## end of your code
    return shifted 

Test your code:

In [70]:
shift_left([1,2,3,4,5])

[2, 3, 4, 5, 'xxpad']

```
Expected output:
[2, 3, 4, 5, 'xxpad']
```

In [72]:
# and another check
assert shift_left([]) == ['xxpad']

## Pausing Here