[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/drscotthawley/DLAIE/blob/main/Lessons/08_DeeperNLP.ipynb)

# Lesson 8: Going Deeper into NLP

---

# Part I

[Previously](https://github.com/drscotthawley/DLAIE/blob/main/Lessons/7_NLP_via_HuggingFace_Transformers.ipynb) we saw how convenient it was to use the `pipeline` method of the HuggingFace.co `transformers` library to perform a variety of Natural Language Processing (NLP) tasks. But there's a lot going on under the hood that was hidden from us.  If we want to learn how these models work, we're going to have to peel back several layers, on multiple levels.  

What we did in the previous NLP lesson was a bit like watching a big rocket take off from a distance. There are many systems in the rocket that are all working together to effect the launch.  To understand how the big rocket operates, it will help if we go back to study smaller, simpler rockets so that we understand the principles of rocketry.

In this lesson we'll learn the parts of an NLP model and see how they go together.

## 1. Tokenization

Whatever NLP task we're interested in performing, there will be a large amount of text (sometimes called a "corpus") that we will use to train the model. That text needs to be split up somehow into bite-sized parts to operate upon. This process is known as *tokenization*. We could try treating individual characters as tokens, or [regard entire sentences as our tokens](https://claritynlp.readthedocs.io/en/latest/developer_guide/algorithms/sentence_tokenization.html), but a common mid-point is to use *words* as tokens.  

> For a great example of a character-based neural network, see [Andrej Karpathy's Char-RNN](https://github.com/karpathy/char-rnn). \[OPTIONAL, not required\]

The simplest -- and typically *the default* -- scheme for word-level tokenization is just to split the text at every space and at every punctuation mark. So, for instance, the following sample text:
```
I'm going to the store, because I need some milk.
```
might become
```
["I", "'", "m", "going", "to", "the",  "store", ",", "because", "I", "need", "some", "milk", "."]
```
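
As a rough illustration, the naive "split at every space and every punctuation mark" scheme can be sketched with a regular expression. The function name `naive_tokenize` is made up for this example, and real tokenizers are considerably more careful than this:

```python
import re

def naive_tokenize(text):
    # \w+      grabs each run of letters/digits (a "word")
    # [^\w\s]  grabs each single character that is neither word nor whitespace,
    #          i.e. every punctuation mark becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)

sample = "I'm going to the store, because I need some milk."
print(naive_tokenize(sample))
```

Note that the apostrophe in "I'm" ends up as a token by itself, just as in the list above.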
Tokenization is something that many computational linguists have spent a great deal of time on, and there are [a variety of tokenizers](https://towardsdatascience.com/overview-of-nlp-tokenization-algorithms-c41a7d5ec4f9?gi=73a2ec14356e) available. It's generally in our best interest to just call a library such as the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) to do the tokenizing for us instead of trying to do it from scratch. Both FastAI and HuggingFace allow us to choose between a variety of tokenizers.  (FastAI's default tokenizer is currently from the [spaCy NLP library](https://spacy.io/).)

Let's try an actual example using the NLTK word tokenizer:

In [None]:
import nltk
nltk.download('punkt')    # this is a resource needed by NLTK
sentence = "I'm going to the store, because I need some milk."
tokens = nltk.word_tokenize(sentence)
print("tokens = ",tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
tokens =  ['I', "'m", 'going', 'to', 'the', 'store', ',', 'because', 'I', 'need', 'some', 'milk', '.']


Interesting that the apostrophe from "I'm" went with the "m" (as in "'m") instead of being its own thing. Presumably this is so we can then expand it into "am".  What about the "n" in "don't"?

In [None]:
sentence2 = "I don't know what's going to happen in this case, but it should be interesting!"
tokens = nltk.word_tokenize(sentence2)
print(tokens)

['I', 'do', "n't", 'know', 'what', "'s", 'going', 'to', 'happen', 'in', 'this', 'case', ',', 'but', 'it', 'should', 'be', 'interesting', '!']


In this case the "n" from "don't" went with the "'t". Again, this best facilitates filling in the missing "o".  Let's try some spirited Tennessee-style language:

In [None]:
sentence3 = "I'm fixin' to spend $1499.95 on a new four wheeler and you ain't gonna stop me, ma!"
print(nltk.word_tokenize(sentence3))

['I', "'m", 'fixin', "'", 'to', 'spend', '$', '1499.95', 'on', 'a', 'new', 'four', 'wheeler', 'and', 'you', 'ai', "n't", 'gon', 'na', 'stop', 'me', ',', 'ma', '!']


Wow, it even knows how to handle "ain't", splitting it into "ai" and "n't"!  And it splits "gonna" into "gon" and "na", presumably in preparation for a mapping to "going", "to".

Do we need the commas and exclamation points though?  Maybe, maybe not.  It depends on our use case.  Sometimes other punctuation is relevant, such as hashtags and @-symbols for social media.  NLTK has a special tokenizer for Twitter:

In [None]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()

tweet = "OMG I love @SuperFamousPerson's new look! #fridays #nofilter"

# Let's compare the two tokenizers:
print("Regular word tokenizer:", nltk.word_tokenize(tweet))
print("Tweet tokenizer:       ",tt.tokenize(tweet))

Regular word tokenizer: ['OMG', 'I', 'love', '@', 'SuperFamousPerson', "'s", 'new', 'look', '!', '#', 'fridays', '#', 'nofilter']
Tweet tokenizer:        ['OMG', 'I', 'love', '@SuperFamousPerson', "'", 's', 'new', 'look', '!', '#fridays', '#nofilter']


...So the specialty `TweetTokenizer` kept certain kinds of punctuation with their associated words, rather than splitting at all forms of punctuation like the regular word tokenizer did.



Beyond the question of which punctuation to keep, we must also recognize that words come in a variety of forms.  And some words may be "filler" that we may not need for the task at hand (e.g., articles like "a", "an", and "the" are often discarded).  So we may wish to regard related words such as "jump", "jumping", "jumps",... as variations on the *stem* of "jump".  We may hang on to the endings such as "-ing" for later use, regarding them as additional tokens. The process of *stemming* or "*stemmification*" is the breaking up of words into their stems and hanging on to endings (or not).

Also, what about compound words?  Some languages such as German will make very long single words (e.g. Geschwindigkeitsbegrenzung for "speed limit") that in other languages would be considered as separate words. If language translation is our goal, some way of tokenizing that includes such variability would be important.

Also, what about punctuation? To keep things simple, we could just delete all forms of punctuation -- or expand contractions like "I'll" to "I will", and so forth -- and yet if we want a highly accurate model we may find that holding on to some forms of punctuation will be important.
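
To make the stemming idea concrete, here is a deliberately crude sketch in plain Python. The stop-word list, the suffix list, and the names `toy_stem` and `stem_tokens` are all invented for this illustration; a real stemmer (such as the ones that ship with NLTK) uses far more careful rules.

```python
STOP_WORDS = {"a", "an", "the"}   # tiny illustrative list of "filler" words
SUFFIXES = ("ing", "ed", "s")     # endings to split off, checked in this order

def toy_stem(word):
    """Return (stem, ending). The ending is '' when no suffix is split off."""
    for suffix in SUFFIXES:
        # only split when at least 3 characters of stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)], "-" + suffix
    return word, ""

def stem_tokens(tokens):
    out = []
    for tok in tokens:
        if tok.lower() in STOP_WORDS:
            continue                  # discard filler words
        stem, ending = toy_stem(tok)
        out.append(stem)
        if ending:
            out.append(ending)        # keep the ending as its own token
    return out

print(stem_tokens(["the", "dog", "jumps", "and", "a", "cat", "is", "jumping"]))
```

Both "jumps" and "jumping" come out as the stem "jump" followed by a separate ending token, which is the behavior described above.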



####  Special Token Codes
Often language models will make use of special tokens such as `UNK` (a token to substitute for unknown words) or `PAD` (for extra padding words), or `EOS` (end of sentence), depending on the task at hand. Sometimes these will have extra characters like `<UNK>` or `[UNK]`. There may or may not be `<START>` and `<END>` tokens for the beginning and end of the text.  The exact list of special tokens depends on the tokenizer and the model, but those few are pretty universal. So when you see those, in what follows, you'll be prepared.  
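
As a small illustration of why padding and end-of-sentence tokens are useful, the sketch below appends an end marker to each tokenized sentence and then pads them all to the same length. The function name `pad_batch` and the `<PAD>` / `<EOS>` spellings are just choices made for this example; each library has its own spellings.

```python
def pad_batch(sentences, pad="<PAD>", eos="<EOS>"):
    """Append an end-of-sentence token, then pad every sentence to equal length."""
    with_eos = [list(s) + [eos] for s in sentences]
    max_len = max(len(s) for s in with_eos)
    return [s + [pad] * (max_len - len(s)) for s in with_eos]

batch = pad_batch([["hi"], ["how", "are", "you"]])
for row in batch:
    print(row)
```

Every row now has the same number of tokens, which is what lets us stack them into a rectangular array later on.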






## Numericalization & Word Vectors
Once we have the tokens, we still need to convert these into numbers somehow so we can operate on them mathematically. Depending on the application, different numericalization schemes are available.

One *very simple* way to do this if we were, say, doing *Sentiment Classification* in tweets, movie reviews, or other kinds of "posts",  would be to count the frequency of all the words that appear in positive posts, and do the same for all the negative posts.  Expressing these frequencies as fractions of the total number of words, we could then assign to each word its pair of "positive use" and "negative use" frequency values $(f_p, f_n)$ which lie in the two-dimensional [unit square](https://en.wikipedia.org/wiki/Unit_square) (shown below). These would then form the coordinates for a *word vector* of our word in its *embedding* space (i.e., the unit square in this case).  Then to classify a post, we could just take the sum of the word vectors of all the words in the post and see whether the result is more "positive" than "negative". In other words, we could ask: in which region of the following embedding diagram does the mean of the word vectors in the post (the sum divided by the number of words) lie?

![img of regions of positive and negative](https://i.imgur.com/WauRtOR.png)



That might suffice as a simple baseline model, and it might work "ok", but there are issues with it. For example, it's possible that different words could get mapped to the exact same point.  If all you care about is how positive or how negative the post (or tweet, or review) is, this may not be a problem,  but if you want to "understand" the text, produce a translation of it, or generate new text, then this method is useless.  Another issue is that words that mean almost the same thing but are used with different frequencies (e.g. "amazing" and "stupendous") would receive very different word vectors, even though we'd want them to have essentially the same effects on the model's output.

> Terminology: our simplistic method of just summing up the word vectors together pays no attention to the *order* of the words, so the above model would be termed a "Bag of Words" type of model.  
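
Here is a minimal sketch of that frequency-based baseline. The four tiny "posts" are invented purely for illustration, the function names are hypothetical, and sending ties to "negative" is an arbitrary choice made for this sketch.

```python
from collections import Counter

# Invented toy data: each post is already tokenized
pos_posts = [["great", "movie"], ["great", "fun"]]
neg_posts = [["awful", "movie"], ["boring", "awful"]]

def word_fracs(posts):
    """Fraction of all tokens in `posts` accounted for by each word."""
    counts = Counter(tok for post in posts for tok in post)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

f_pos, f_neg = word_fracs(pos_posts), word_fracs(neg_posts)

def word_vector(word):
    # (f_p, f_n): a point in the unit square; unseen words map to (0, 0)
    return (f_pos.get(word, 0.0), f_neg.get(word, 0.0))

def classify(post):
    sum_p = sum(word_vector(w)[0] for w in post)
    sum_n = sum(word_vector(w)[1] for w in post)
    return "positive" if sum_p > sum_n else "negative"

print(word_vector("movie"))            # used equally often in both kinds of post
print(classify(["great", "movie"]))
print(classify(["movie", "great"]))    # same words, different order, same answer
```

Because the word vectors are simply added up, shuffling the words in a post cannot change the result, which is exactly the "Bag of Words" property.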

In order to help preserve uniqueness as well as to better allow words to express their ranges of meanings, one typically uses many more than two dimensions for word vector embeddings.  It's quite common to see 256 or more (e.g. 300) dimensions for words.  While these are too many dimensions to visualize (which is why I gave the simple example above!) the computer is able to deal with them just fine.  

The way one typically gets these word vectors is to take in the list of all the (unique) words in the corpus and produce a "vocabulary" which indexes the words and generates a one-hot encoding by treating the words as categories.  Then we map these categories into word vectors via a matrix of trainable weights. So, for example, a corpus with 10,000 unique words mapped into 300-dimensional word vectors would involve a weights matrix of 300\*10000 = 3 million weights.

\[TODO: Add a picture someday! ;-) \]

Thus *the "embedding" mapping is itself a neural network* which we train as the front-end of our full (larger) neural network.
This means that the more words you allow in your vocabulary (or "vocab"), the bigger that initial embedding operation will be.  Typically, in order to keep this matrix from getting too big, one will truncate the list of words by removing the less frequent or less important words from the vocab and replacing them with special tokens such as `UNK`.
The form the embedding takes may depend on the task.  

## Language Modeling as a Pretraining Task
One very useful method is to use a *language model* task to produce word embeddings.  A language model tries to predict the next word in a sequence given its preceding words (how many preceding words you use determines the sophistication of the model). This forms a "self-supervised learning" method in the sense that the target data you train on is the same as the input data, just shifted ahead by one word.
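
To see what "self-supervised" means here, the sketch below builds (context, next word) training pairs directly from a token list, with no human labels involved. The name `next_word_pairs` and the parameter `context_len` are made up for this illustration.

```python
def next_word_pairs(tokens, context_len=3):
    """Pair every window of `context_len` tokens with the token that follows it."""
    pairs = []
    for i in range(context_len, len(tokens)):
        context = tokens[i - context_len:i]   # the preceding words (model input)
        target = tokens[i]                    # the word to be predicted
        pairs.append((context, target))
    return pairs

tokens = "the cat sat on the mat".split()
for context, target in next_word_pairs(tokens):
    print(context, "->", target)
```

The text itself supplies both the inputs and the targets, so any large body of text can serve as training data.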

This approach was used to great effect by Jeremy Howard and Sebastian Ruder in their [ULMFiT paper](https://paperswithcode.com/method/ulmfit), in which they used a language model task of predicting the next word in Wikipedia (specifically, the [Wikitext-103](https://paperswithcode.com/dataset/wikitext-103) dataset) in order to condition the model to use for other tasks such as sentiment analysis of IMDB movie reviews.  Their result was that they beat other competing sentiment analysis methods by a long shot!  

The idea is that a model that has to predict the next word in a large text has to develop somewhat of an "understanding" of how language works, and thus will be a more powerful model for text classification than a simpler model that has not been pretrained in this way.

> Note: A neat effect of this form of pre-training is that you also end up with a text generation model.

Now, we're not going to train a model on Wikipedia right now.  That would be a waste of time, as we can just download pretrained weights and go from there.  Let's use the fastai set of methods for doing this, and we'll work through their IMDB example problem [as described in Chapter 10 of the `fastbook`](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb).  To get started we'll need to download the dataset and start using fastai's tokenizer(s).

In [None]:
!pip install -Uqq fastai fastbook

In [None]:
# if the next line produces an error, restart the runtime and try again.
import fastbook
from fastai.text.all import *
from IPython.display import display, HTML

In [None]:
path = untar_data(URLs.IMDB)  # download the dataset

In [None]:
# make a list of all the files in all the folders of the dataset
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

# let's look at the first 75 characters of the first file in the list
txt = files[0].open().read();  txt[:75]

'One True Thing rises above its potentially schlocky material to give us a v'

As we mentioned above, the current default tokenizer in FastAI is from the spaCy NLP package:

In [None]:
spacy = WordTokenizer()
spacified = spacy([txt])
print(spacified)

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7fc605448050>


So calling the word tokenizer gives us back a generator. In order to access its output we can use `first()` and `next()`:


In [None]:
toks = first(spacy([txt]))
print(toks)   # This prints out all the tokens
print(coll_repr(toks, 30))  # fastai's coll_repr method gives the total size and first N (=30) tokens

['One', 'True', 'Thing', 'rises', 'above', 'its', 'potentially', 'schlocky', 'material', 'to', 'give', 'us', 'a', 'view', 'of', 'a', 'family', 'of', 'complex', 'relationships', 'and', 'flawed', ',', 'real', 'people', '.', 'It', 'opens', 'with', 'Rene', 'Zeleweger', 'discussing', 'her', 'mother', "'s", 'death', 'with', 'the', 'District', 'Attorney', ';', 'sparing', 'us', 'the', 'cheap', 'cinematic', 'shots', 'of', 'a', '"', 'shocking', '"', 'illness', 'and', 'death', '.', 'From', 'there', 'it', 'proceeds', 'into', 'a', 'look', 'at', 'a', 'family', 'system', ',', 'in', 'which', 'everyone', 'plays', 'by', 'a', 'set', 'of', 'unexamined', 'rules', ',', 'and', 'uses', 'the', 'mother', "'s", 'cancer', 'to', 'show', 'what', 'happens', 'when', 'all', 'the', 'rules', 'change', '.', '<', 'br', '/><br', '/>William', 'Hurt', 'as', 'the', 'self', '-', 'important', 'father', ',', 'and', 'Meryl', 'Streep', 'as', 'the', 'Suzy', 'Homemaker', 'mother', 'are', 'both', 'superb', ';', 'nuanced', 'and', 'not

In addition to `WordTokenizer`, fastai adds some extra functionality via its `Tokenizer` class, which will turn all words to lower case but precede each formerly capitalized word with a special code `xxmaj`, indicating that the next word should be capitalized.  It also adds `xxbos` to denote the beginning of the text (the "stream").

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#217) ['xxbos','xxmaj','one','xxmaj','true','xxmaj','thing','rises','above','its','potentially','schlocky','material','to','give','us','a','view','of','a','family','of','complex','relationships','and','flawed',',','real','people','.','xxmaj'...]
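
The idea behind the `xxmaj` marker can be sketched in a few lines of plain Python. This is a toy rule written for illustration (the name `mark_caps` is made up), not fastai's actual implementation, which applies a longer list of rules.

```python
def mark_caps(tokens):
    """Lowercase every token, but flag capitalized words with 'xxmaj' first."""
    out = ["xxbos"]                      # marks the beginning of the text
    for tok in tokens:
        # toy rule: first letter upper case, the rest lower case
        if len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out.append("xxmaj")          # "the next word was capitalized"
        out.append(tok.lower())
    return out

print(mark_caps(["One", "True", "Thing", "rises", "above"]))
```

The benefit of this trick is that "One" and "one" share a single vocabulary entry, while the information about capitalization is still kept in the token stream.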


> Note: fastai also has a tokenization method that will use sub-words -- i.e., groups of characters -- but we're going to skip that part for now.

To do calculations on the GPU, it's helpful to work with "batches" of data, just like we did for images.  In each batch we need the same dimensions, so we will chop the text up into "chunks" of length `seq_len` and then group these into batches.  Rather than totally randomly assigning the order of the batches, we will have the model "read" the text sequentially, where each new element of a batch will simply be shifted ahead one word.

See this fastai example where they use a batch size of `bs=6` and sequence length of `seq_len=5` to produce one batch from a sample text:

In [None]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
print(stream)
tokens = tkn(stream)
print("\n",len(tokens),"tokens in stream.")

In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.
Then we will study how we build a language model and train it for a while.

 90 tokens in stream.


Although we could randomly grab "chunks" from all over the file and try to predict the word following each chunk, the fastai folks recommend making the text in each row of each batch follow immediately from the text in the corresponding row of the previous batch.  Which means making some fancy slicing code like the following, in which we show three sequential batches.  

> Note: The motivation for doing this may not become clear until we define language models below (e.g. `LMModel3`) that can maintain an internal state between batches.  This internal state will be the "glue" that holds the sentences together in between batches.

In [None]:
bs,seq_len = 6, 5                          # batch size and sequence length
num_batches = len(tokens)// bs // seq_len  # 30 tokens per batch, 90 tokens = 3 batches.
print("num_batches = ",num_batches)
num_rows = len(tokens) // seq_len          # total rows of all batches == 18
print("num_rows = ",num_rows)

for b in range(num_batches):
    stride = seq_len * num_batches
    d_tokens = np.array([tokens[i*stride + b*seq_len :i*stride + b*seq_len + seq_len] for i in range(bs)]) # i is the row number
    df = pd.DataFrame(d_tokens)
    print(f"\nbatch = {b}:")
    display(HTML(df.to_html(index=False,header=None)))


num_batches =  3
num_rows =  18

batch = 0:


0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build



batch = 1:


0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train



batch = 2:


0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


See how each row of each batch continues the text from the same row in the preceding batch?  Don't worry, you won't have to reproduce that code, fastai will do it internally.  

When we were training on images, we shuffled the order of images between epochs.  In the case of NLP we don't want to shuffle the words or even the rows.  Instead when we take a bunch of movie reviews and concatenate them to form a stream (which is then broken into tokens and then batches), what we do is randomize the *order in which the reviews are concatenated* at each epoch.  This allows for word orderings to stay the same but where they appear in the training dataset to still shift around a bit in order to prevent overfitting.
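
A minimal sketch of that per-epoch shuffling, using invented toy reviews that are already tokenized (the function name `make_stream` is hypothetical, and fastai's own loader does considerably more than this):

```python
import random

def make_stream(reviews, seed=None):
    """Concatenate tokenized reviews in a random order; words inside a review stay put."""
    rng = random.Random(seed)
    order = list(range(len(reviews)))
    rng.shuffle(order)                   # shuffle which review comes first, second, ...
    stream = []
    for idx in order:
        stream.extend(reviews[idx])      # each review is appended whole
    return stream

reviews = [["xxbos", "good", "film"],
           ["xxbos", "bad"],
           ["xxbos", "so", "so", "plot"]]

for epoch in range(2):
    print(epoch, make_stream(reviews, seed=epoch))
```

Calling this with a different seed at each epoch gives a different ordering of the reviews, while every review still reads correctly from start to finish.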


This is generally handled automatically by fastai, which will define the Tokenizer, set it up, and specify a Numericalize function, and set that up.  Here we show a brief example of that:

In [None]:
txts = L(o.open().read() for o in files[:2000])  # read texts of the first 2000 files
txts[0]

'One True Thing rises above its potentially schlocky material to give us a view of a family of complex relationships and flawed, real people. It opens with Rene Zeleweger discussing her mother\'s death with the District Attorney; sparing us the cheap cinematic shots of a "shocking" illness and death. From there it proceeds into a look at a family system, in which everyone plays by a set of unexamined rules, and uses the mother\'s cancer to show what happens when all the rules change. <br /><br />William Hurt as the self-important father, and Meryl Streep as the Suzy Homemaker mother are both superb; nuanced and not what they appear to be. Zeleweger is seething, angry and surprised with herself. Tom Everett Scott doesn\'t have much to do, but he does it well.<br /><br />The story is predictable, and takes at least one badly soppy turn it needn\'t have taken, but the performances, and the view of family as a place where anger and love are equally mixed, make it worthwhile.'

In [None]:
toks200 = txts[:200].map(tkn)   # tokenize the first 200 files, by mapping the "tkn" function to the elements of text.
toks200[0]  # show us the tokens corresponding to the text in the first file

(#217) ['xxbos','xxmaj','one','xxmaj','true','xxmaj','thing','rises','above','its'...]

In [None]:
num = Numericalize()
num.setup(toks200)   # create a vocab for the stream we've created.
coll_repr(num.vocab,20)  # show the first 20 words in the vocab, in order of descending frequency

"(#2016) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','it','in','i'...]"

Then we can show how these individual tokens are rendered as numbers, with each token replaced by its index in the vocab:

In [None]:
nums200 = toks200.map(num);
print(toks200[0])
print(nums200[0])

['xxbos', 'xxmaj', 'one', 'xxmaj', 'true', 'xxmaj', 'thing', 'rises', 'above', 'its', 'potentially', 'schlocky', 'material', 'to', 'give', 'us', 'a', 'view', 'of', 'a', 'family', 'of', 'complex', 'relationships', 'and', 'flawed', ',', 'real', 'people', '.', 'xxmaj', 'it', 'opens', 'with', 'xxmaj', 'rene', 'xxmaj', 'zeleweger', 'discussing', 'her', 'mother', "'s", 'death', 'with', 'the', 'xxmaj', 'district', 'xxmaj', 'attorney', ';', 'sparing', 'us', 'the', 'cheap', 'cinematic', 'shots', 'of', 'a', '"', 'shocking', '"', 'illness', 'and', 'death', '.', 'xxmaj', 'from', 'there', 'it', 'proceeds', 'into', 'a', 'look', 'at', 'a', 'family', 'system', ',', 'in', 'which', 'everyone', 'plays', 'by', 'a', 'set', 'of', 'unexamined', 'rules', ',', 'and', 'uses', 'the', 'mother', "'s", 'cancer', 'to', 'show', 'what', 'happens', 'when', 'all', 'the', 'rules', 'change', '.', '\n\n', 'xxmaj', 'william', 'xxmaj', 'hurt', 'as', 'the', 'self', '-', 'important', 'father', ',', 'and', 'xxmaj', 'meryl', 'xx

^Note how the unknown / low frequency words get mapped to 0, which is the code for `UNK` (or "xxunk" in fastai parlance).

These can then go into a fastai DataLoader which has been set up for language modeling, [`LMDataLoader`](https://docs.fast.ai/text.data.html#LMDataLoader), which is designed to load a batch of text as an input and the *same text shifted ahead by one word* as the target data.

In [None]:
dl = LMDataLoader(nums200)

# test it
x,y = first(dl)
print(x.shape,y.shape)

# we can print out x & y but let's convert them from numbers to text when we view them
print(', '.join(num.vocab[o] for o in x[0][:20]))
print(', '.join(num.vocab[o] for o in y[0][:20]))

torch.Size([64, 72]) torch.Size([64, 72])
xxbos, xxmaj, one, xxmaj, true, xxmaj, thing, xxunk, above, its, xxunk, xxunk, material, to, give, us, a, view, of, a
xxmaj, one, xxmaj, true, xxmaj, thing, xxunk, above, its, xxunk, xxunk, material, to, give, us, a, view, of, a, family


See how each word in y is just the corresponding "next word in x" at the same index?  As a simple exercise, can you do the same?  Write a "shift left" function that just shifts a set of list elements to the left.  Add an "xxpad" on the end:

In [None]:
## UNGRADED EXERCISE 8.0. Fill in your code below as directed

def shift_left(orig:list):
    ## Your code below. Define a variable called "shifted" that is the original
    #  list, shifted to the left by one, and filled in with a "xxpad" at the end.

    shifted = ...

    ## end of your code
    return shifted

Test your code:

In [None]:
shift_left([1,2,3,4,5])

[2, 3, 4, 5, 'xxpad']

```
Expected output:
[2, 3, 4, 5, 'xxpad']
```

In [None]:
# and another check
assert shift_left([]) == ['xxpad']

---

# Part II

## More Exercises!

HuggingFace and fastai will end up hiding a lot of what's happening from us, so let's try writing a few more simple helper routines of our own so that we get a feel for what's involved.  The following will be graded.


### Exercise 8.1: `count_freqs`
Given a list, count up the number of times that each element appears in the list.  Return this as a Python dict called `freqs`:

Note that this can be done as a one-liner using `Counter` from the `collections` module in Python's standard library, or you can write something similar from scratch yourself.


In [None]:
## GRADED EXERCISE 8.1
from collections import Counter

def count_freqs(tokens:list):
    ## YOUR CODE HERE

    ### END OF YOUR CODE
    return freqs

Here's some code to check yourself:

In [None]:
test_list = ['a','b','c','a','d','z','z','q','z','b']
freqs = count_freqs(test_list); freqs

Expected output (note that your order may be different depending on how you build the dict, but the values should be the same):
```
Counter({'a': 2, 'b': 2, 'c': 1, 'd': 1, 'q': 1, 'z': 3})
```
or
```
dict({'a': 2, 'b': 2, 'c': 1, 'd': 1, 'q': 1, 'z': 3})
```


In [None]:
# another test:
assert freqs['z'] == 3

### Exercise 8.2: `sort_by_freq`
Given a list, return its unique elements sorted in **descending** order of frequency (elements with equal counts keep their order of first appearance). You should call `count_freqs` in this function.  [Here's a hint](https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value)

In [None]:
## GRADED EXERCISE 8.2
def sort_by_freq(tokens:list):
    #call count_freqs to get the frequencies
    freqs = ...

    # then sort the tokens according to freqs
    sorted_tokens = ...

    return sorted_tokens

Test code for you:

In [None]:
assert sort_by_freq(test_list) == ['z', 'a', 'b', 'c', 'd', 'q']


### Exercise 8.3: `set_vocab_codes`
This will be akin to the "setup" method of fastai's Numericalize: Given an input text,...

1. Tokenize it via the defined `tokenizer` method. This will give you a list we'll call `tokens`.
2. Then rank `tokens` in decreasing order of frequency of their occurrence in the text.  Call your `sort_by_freq` function for this.
3. Truncate the list of tokens and only keep the top `keep_frac` fraction of it.
4. Add an 'xxunk' token at the beginning of the list of tokens.
5. Finally produce a Python `dict` called `vocab_codes` that will map tokens to their index on the sorted list.

Also, make sure that any unknown words applied to `vocab_codes` return as a [default dict value](https://stackoverflow.com/questions/52195897/how-to-create-a-dict-that-can-account-for-unknown-keys) the code for `xxunk`.

> Note: The fastai/spacy tokenizer is set up as a *generator*, which is not helpful for this exercise. For this reason we'll use NLTK's tokenizer instead.

In [None]:
from fastai.text.all import *
from collections import defaultdict
import nltk
nltk.download('punkt')    # this is a resource needed by NLTK

In [None]:
## GRADED EXERCISE 8.3

def set_vocab_codes(text:str, tokenizer=nltk.word_tokenize, keep_frac=0.5):
    # INSERT YOUR OWN CODE BELOW
    # 1. Tokenize text via the defined `tokenizer` method. This will give you a list we'll call `tokens`.
    tokens = ...

    # 2. Then rank `tokens` in decreasing order of frequency of their occurrence in the text.  Call your sort_by_freq()
    tokens = ...

    # 3. Truncate the list of tokens and only keep the top `keep_frac` fraction of it.
    tokens = ...

    # 4. Add an 'xxunk' token at the beginning of the ranked list of tokens.
    tokens = ...

    # 5. Finally produce a Python `dict` called `vocab_codes` that will map tokens to their index on the sorted list.
    vocab_codes = ...

    # Also (you may want to do this before #5): make sure that any unknown words applied to `vocab_codes` return, as a default,
    # the code for 'xxunk'.


    ## END OF YOUR CODE
    return vocab_codes

In [None]:
text = 'The quick brown fox jumped over the lazy dog'
codes = set_vocab_codes(text)
codes

Expected output:    Your codes dict may have a different order than this, but the values should be the same:

```
defaultdict(<function __main__.set_vocab_codes.<locals>.<lambda>>,
            {'The': 1, 'brown': 3, 'fox': 4, 'quick': 2, 'xxunk': 0})
```


In [None]:
# more tests for you:
assert codes['fox'] == 4
assert codes['Kwisatz Haderach'] == 0

In [None]:
text = 'It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.'
codes = set_vocab_codes(text)
codes

Expected output:  (Again, the dict order may not be the same, but you should see the same values)

```
defaultdict(<function __main__.set_vocab_codes.<locals>.<lambda>>,
            {',': 4,
             'It': 10,
             'age': 7,
             'best': 11,
             'epoch': 8,
             'it': 5,
             'of': 3,
             'season': 9,
             'the': 2,
             'times': 6,
             'was': 1,
             'xxunk': 0})
```

In [None]:
assert codes['it'] == 5

You'll notice in the above example that "It" and "it" are treated as two separate words. We could send the whole text in as lowercase to get a different result:

In [None]:
codes = set_vocab_codes(text.lower())
codes

Expected output: (order may not be the same)
```
defaultdict(<function __main__.set_vocab_codes.<locals>.<lambda>>,
            {',': 5,
             'age': 7,
             'best': 10,
             'epoch': 8,
             'it': 1,
             'of': 4,
             'season': 9,
             'the': 3,
             'times': 6,
             'was': 2,
             'worst': 11,
             'xxunk': 0})
```

...And now "it" is the most frequent non-unk word in the list.   We could still do like fastai and insert a 'xxmaj' code before every capitalization, but... let's move on for now.  


### Exercise 8.4: `codes_to_words`
One other useful thing will be a way to convert from the codes *back* to the words themselves.  Let's create a function that will return a dict in which the keys and values have been swapped.  [Here's a hint](https://www.geeksforgeeks.org/python-program-to-swap-keys-and-values-in-dictionary/)

In [None]:
## GRADED EXERCISE 8.4

def codes_to_words(codes:dict):
    ## YOUR CODE BELOW
    words = ...

    ## END OF YOUR CODE
    return words

In [None]:
words = codes_to_words(codes)
assert words[9] == 'season'
words[0]

### How We Get Word Vector Embeddings

It's incredibly simple: The `dict` variables that map words to codes and codes to words serve as what computational scientists call "look-up tables". Neural networks don't exactly work with look-up tables but they can work with a very close analogue via matrix multiplication: What we do is take the word codes (/indexes) and use these to denote the rows (or columns) of a *one-hot encoding*, which is a matrix with zeros almost everywhere and a single 1 in each row and column.  The simplest version is just a diagonal matrix of 1's, which can be created by functions like NumPy's `eye`.

> Note: Literally, the code you'd write is simply `np.eye(len(tokens))`. It's so simple that we're not even going to make a programming assignment for it, but we *will* do a slightly trickier exercise below.

Say for example we only had 3 words in our vocabulary, "James", "loves", and "tacos".  The one-hot encoding for these could be

```
'James' = [1, 0, 0]
'loves' = [0, 1, 0]
'tacos' = [0, 0, 1]
```

Then a look-up table could be created when we multiply the *matrix* of one-hot encoded words (i.e.  `np.eye(3)` in this case) with whatever number we want.

The "number we want" will be a set of numbers, in the form of *weights* of a neural network `Linear` layer (with no activation, i.e. linear activation).  This will produce our word embeddings!  To be clear: the weights will be the same as the activations because we are multiplying "1" by the weights and using no activation.  The dimension of the word vectors produced will be determined by how many dimensions we want in this Linear layer of weights -- for example, we mentioned 300 before.

These weights (i.e. the word embeddings themselves) are initialized randomly and *learned* in the context of training the model for whatever task we want.
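
Here's a small sketch of that idea using the three-word vocabulary above. The 4-dimensional embedding size and the random `weights` matrix are assumptions made up for illustration (in a real model the weights would be learned):

```python
import numpy as np

vocab = ['James', 'loves', 'tacos']
one_hot = np.eye(len(vocab))                 # row i is the one-hot vector for word i

rng = np.random.default_rng(0)
weights = rng.normal(size=(len(vocab), 4))   # hypothetical 4-dimensional embeddings

# Multiplying a one-hot row by the weights just selects the matching row of weights
loves_vec = one_hot[1] @ weights
assert np.allclose(loves_vec, weights[1])

# Doing it for the whole vocabulary at once returns the weight matrix itself
all_vecs = one_hot @ weights
assert np.allclose(all_vecs, weights)
```

So the "look-up" is nothing more than picking out a row of the weight matrix.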

You might wonder: is using just one Linear layer with no activation sufficient to accurately map out human language to the point of being able to produce meaningful embeddings?  In practice, this is remarkably effective, and in fact the embedding layers in both PyTorch (`nn.Embedding`) and Keras behave just like one-hot encoders attached to Linear layers (in Keras these are called "Dense" layers); internally they skip the multiplication and simply look up the matching row of weights.  

How useful will these embeddings be?  Well, that's an interesting question. "Universal" word embeddings such as Word2Vec or GloVe are trained on huge datasets in order to be as general as possible, whereas if you were to simply train on a very small dataset, your embeddings might only be useful for the specific task you want.  Generally, it's useful to start with pretrained embeddings in a "frozen" (non-trainable) state as you train the downstream part of your neural network, and then gradually "unfreeze" the network starting from the later layers and working backward as the model trains.  (The ULMFiT method referenced earlier describes a detailed way of doing this.)
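
As a rough sketch of what "frozen" and "unfrozen" mean in PyTorch terms: the `pretrained` tensor below is a random stand-in for real pretrained vectors, and `head` is a hypothetical downstream layer; neither comes from this lesson. `nn.Embedding.from_pretrained` with `freeze=True` marks the embedding weights as non-trainable, and `requires_grad_` flips that later:

```python
import torch
import torch.nn as nn

# Stand-in for pretrained vectors (e.g. GloVe); random here purely for illustration
pretrained = torch.randn(5, 8)           # 5 words, 8 dimensions (assumed sizes)

emb = nn.Embedding.from_pretrained(pretrained, freeze=True)   # frozen: not trainable
head = nn.Linear(8, 2)                   # hypothetical downstream layer, trainable

frozen_at_start = emb.weight.requires_grad      # False: optimizer updates skip it
head_trainable = head.weight.requires_grad      # True

# Later on, "unfreeze" the embeddings so they get fine-tuned too
emb.weight.requires_grad_(True)
unfrozen_later = emb.weight.requires_grad       # True
```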

### Exercise 8.5: `token_to_one_hot`
In this example, we're not going to encode *all* the available tokens at once, rather we're just going to produce the one-hot encodings of the particular words one might find in a sequence.  This will simply involve using the `codes` dict you created to get the index of a word and then forming a one-hot version of that word -- i.e. all zeros except for a 1 at the element corresponding to the word's index/code. The length of the one-hot vector will be the total number of possible words.  So for example, if we had 1000 words, the 'xxunk' would be one-hot encoded as a 1 in the first (0th) spot followed by a list of 999 zeros.  

In the following, use `torch.zeros()` to initialize the vector, with a length of `len(codes)`.

***As an additional requirement: Your routine should return the one-hot vector for 'xxunk' for any token not already assigned a code.*** (You may use recursion to achieve this if you like.)

In [None]:
## GRADED EXERCISE 8.5
import torch
def token_to_one_hot(token, codes):
    ### Your code below. Produce a 1D torch tensor corresponding to the one-hot vector for token


    ### end of your code
    return onehot_vec

In [None]:
# a little test code for you
from collections import defaultdict
test_codes = defaultdict(lambda x:0)
for key, value in {'xxunk':0, 'the':1, 'apples':2, 'are':3, 'tasty':4}.items():
    test_codes[key] = value

print(token_to_one_hot('apples', test_codes))

Expected output:
```
    tensor([0., 0., 1., 0., 0.])
```

In [None]:
# More tests:
assert torch.equal( token_to_one_hot('tasty',test_codes), torch.Tensor([0,0,0,0,1]) )

# defaultdict will not automatically handle this next one! Your routine will need to catch it
assert torch.equal( token_to_one_hot('smorgasbord',test_codes), torch.Tensor([1,0,0,0,0]) )

Now, if our "batch" is going to have sequences of words, and each word is a one-hot vector, then won't we end up having a 3-dimensional array for our input?  That was fine for images because we were using 2D convolution operations.  How will we structure the one-hot encodings in our word vectors -- as rows or columns?

Let's try a simple example.  Pretend these words are the tokens. We're going to want to process each word and produce its word vector.

In [None]:
test_batch = [['here','is','a','sequence'],['and','here','is','another'],['one','more','sequence','here']]

# Uhhh ok actually our set_vocab_codes needs a string, so we'll convert to an array and then flatten it
batch_array = np.array(test_batch)
print("batch_array = \n",batch_array)
print("batch_array.shape =",batch_array.shape)
test_text = ' '.join( batch_array.flatten().tolist() )
print("test_text = ",test_text)  # ok, now we've got our string to send to set_vocab_codes

test_codes = set_vocab_codes( test_text )
print("test_codes =",test_codes)

Now we'll produce the one-hot encodings for this batch of sequences:

In [None]:
onehot_batch = torch.zeros( (batch_array.shape[0], batch_array.shape[1], len(test_codes)) )

for i, row in enumerate(batch_array):
    for j, word in enumerate(row):
        onehot_batch[i, j] = token_to_one_hot(word, test_codes)

print(onehot_batch) # here the onehot encodings of words will appear along rows

(^Coulda had you do that as an exercise, but imagined it might get confusing. ;-) )

That batch of inputs will then be matrix-multiplied by a set of weights in the Linear layer of a PyTorch model we'll define as we go forward.
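
To see why the 3-dimensional input is not a problem, here's a shape-only sketch in NumPy. All the sizes, the random word codes, and the `weights` matrix are assumptions for illustration, not the values from `test_batch`:

```python
import numpy as np

batch_size, seq_len, vocab_size, emb_dim = 3, 4, 7, 5   # assumed sizes

# Random word codes standing in for a batch of sequences
rng = np.random.default_rng(0)
codes = rng.integers(0, vocab_size, size=(batch_size, seq_len))
onehot = np.eye(vocab_size)[codes]        # shape (3, 4, 7): one one-hot row per word

weights = rng.normal(size=(vocab_size, emb_dim))
vectors = onehot @ weights                # shape (3, 4, 5): one word vector per word
```

The matrix multiplication acts on the last axis only, so the batch and sequence axes simply come along for the ride.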

The *output* of that PyTorch model will be a single word, namely a one-hot vector for the next word in the sequence (given by our `shift_left`-ed target data). We will use `softmax` activation and a categorical cross-entropy loss function for this since it's effectively the same thing as predicting one of a variety of categories.  These are just the multi-category versions of sigmoid and binary cross-entropy we saw before.

> Note: Or even better, we can get a bit more numerical precision if we *don't* use softmax and cross-entropy explicitly but rather use the PyTorch forms of these functions that will avoid any funny exponential blowups (as discussed in our [Santa Claus example](https://hedges.belmont.edu/naughty/) for binary classification). The PyTorch `nn.CrossEntropyLoss` actually expects pre-softmax "logit" values so we won't apply the softmax in our model.   
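
Here's a plain-Python sketch of the "exponential blowup" issue. It illustrates the general log-sum-exp idea and is *not* PyTorch's actual implementation; the function names and example logits are made up for this demo:

```python
import math

def naive_cross_entropy(logits, target):
    # softmax first, then -log of the target's probability
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    # log-sum-exp trick: subtract the largest logit before exponentiating
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, 1.0, 0.1]
agree_on_small = abs(naive_cross_entropy(small, 0) - stable_cross_entropy(small, 0)) < 1e-9

big = [1000.0, 999.0, 0.0]
try:
    naive_cross_entropy(big, 0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True       # math.exp(1000.0) is too large for a float

stable_loss = stable_cross_entropy(big, 0)   # still finite
```

Working directly from the logits lets the loss avoid ever forming those huge exponentials.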

---

# Part III

### At this point, we've covered all the moving parts of the system except for the model itself!  
We can swap in a variety of models, but for definiteness we're going to use what's called a "Recurrent Neural Network" (RNN) that will retain some "memory" of previous "states" when it looked at earlier sequences.  (And *this* is why we made the requirement earlier of having one batch feed directly into the next -- it's because of this stateful memory).  

The particular form of RNN we'll use is called an LSTM (which stands for "Long Short-Term Memory") and it has a few nice properties for not just remembering things but for "forgetting" things too, and even for "deciding" what's worth remembering or forgetting!

I want to strike a balance between writing our own code and developing a full model from scratch vs. interfacing with more powerful packages such as PyTorch and fastai.  So, now that we have a better idea of what's going on "under the hood," in order to continue efficiently without having to write a bunch more code, we'll switch over to the PyTorch and fastai libraries (which do similar work as we have but also a whole lot more) so that we can take advantage of all the other "goodies" that these packages provide.

> **Attribution:** **In what follows, we will follow a combination of [Chapter 10](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb) and [Chapter 12](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb) from the fastai "fastbook"**, almost verbatim.  After all, it *is* listed as one of the textbooks for the course. ;-)   We will however change up the order, covering the Chapter 12 content before the Chapter 10 content. (You don't need to read these separately for what follows; I'm just citing my sources.)

## The Plan
What they do is first load in a dataset and define their data loaders, then define and train (from scratch) a series of RNN models ordered by increasing sophistication, finally building up to the LSTM model. (This will be the part from fastbook Chapter 12).
In order for this to proceed quickly rather than taking hours or days to train, they do this with a small text dataset called "Human Numbers".  

We would love to then be able to use our newly-trained models to demonstrate how language-model-pretraining can help improve performance on text classification, **however the complexity of real human language demands that we employ larger datasets and larger models which then take a lot longer to train** -- in effect, we would have to reproduce the entire [ULMFiT paper](https://arxiv.org/abs/1801.06146) which was trained on WikiText-103.

> **Ethics Advisory:** Language Models have become bigger and bigger in recent years (notably, [GPT-3](https://towardsdatascience.com/gpt-3-a-complete-overview-190232eb25fd) with its 175 billion parameters) in order to address more and more complexity of language. The cost of training these models is felt both economically and *environmentally*: burning all that energy [incurs a significant carbon footprint](https://www.theregister.com/2020/11/04/gpt3_carbon_footprint_estimate/).  Thus, we can see that in practical cases where we want sophisticated, accurate results that require large models and large datasets (rather than simple examples for teaching purposes),  Transfer Learning (i.e. starting from a pre-trained model and then fine-tuning on a smaller domain-specific dataset) rather than training from scratch becomes *not only the expedient choice but also the ethically sounder choice*.

In plainer language: You don't want to pay the money and wait around to train your own sophisticated model from scratch, and we don't want to burn all that energy anyway,...but the very simple models we're going to start with for illustrative purposes won't work for what we ultimately want to do. So when it comes time to do the text classification, we will switch over (to Chapter 10 content) and load the fastai checkpoints that were already trained on Wikitext.  Think of it like a "cooking show" on TV, where they come back from commercial and everything's already cooked!



### The Data
The fastai "Human Numbers" dataset is just a list of the first 10,000 numbers written out sequentially in plain English.  It's not sufficient for learning the nuances of all of English, but it's a good demo for how a language model can learn to predict what comes next.  Let's take a look:

> **IMPORTANT:** For what follows, make sure you have GPU acceleration enabled.  Go to `Edit > Notebook settings > Hardware accelerator > GPU`.   This may reset your runtime, in which case you'll need to re-do the install & imports from the top of the notebook -- but I've gone ahead and repeated those lines below so you don't have to scroll up:

In [None]:
# We'll make this so that you can restart the notebook from here instead of having to scroll up
!pip install -Uqq fastai fastbook

In [None]:
# if the next line produces an error, restart the runtime and try again.
import fastbook
from fastai.text.all import *
from IPython.display import display, HTML

In [None]:
path = untar_data(URLs.HUMAN_NUMBERS)  # download the dataset

In [None]:
lines = L()   # L() is fastai's "super" list class. It's like a regular list, but can do more (e.g. NumPy-style indexing)
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
print(lines[100:125]) # show somewhere in the middle
lines  # and show the beginning

['one hundred one \n', 'one hundred two \n', 'one hundred three \n', 'one hundred four \n', 'one hundred five \n', 'one hundred six \n', 'one hundred seven \n', 'one hundred eight \n', 'one hundred nine \n', 'one hundred ten \n', 'one hundred eleven \n', 'one hundred twelve \n', 'one hundred thirteen \n', 'one hundred fourteen \n', 'one hundred fifteen \n', 'one hundred sixteen \n', 'one hundred seventeen \n', 'one hundred eighteen \n', 'one hundred nineteen \n', 'one hundred twenty \n', 'one hundred twenty one \n', 'one hundred twenty two \n', 'one hundred twenty three \n', 'one hundred twenty four \n', 'one hundred twenty five \n']


(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

Turn that list into one long string, and then tokenize the string by splitting at spaces:

In [None]:
text = ' . '.join([l.strip() for l in lines])
print(text[:100])
tokens = text.split(' ')
print(tokens[:10])

one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']


To make our vocab and numericalize it into codes/indexes, the `L()` class has a handy `.unique()` method.  You'll see that there are only 30 different unique tokens:

In [None]:
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

The dict for mapping words to codes/indexes goes like this:

In [None]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]

For starters, we're going to use a model that only looks at sequences of 3 words at a time, so we'll produce a series of 3-element "windows" of the numericalized text:  (Note that period '.' gets the index of 1 so it shows up a lot.  We might be tempted to remove these, but they serve as separators between our numbers expressed as natural language.)  First, here's what that looks like in terms of words:

In [None]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Then here's what it looks like in terms of numbers:

In [None]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

In what follows, we're going to try to compare things against each other, so it's helpful to set the seed of the random number generator (RNG) every time.  In fastai & PyTorch there are a lot of different random numbers being generated so as a convenience, we're going to define [the following routine](https://forums.fast.ai/t/solved-reproducibility-where-is-the-randomness-coming-in/31628/5?u=drscotthawley) to set them all at once:

In [None]:
def set_seed(dls, seed=1):   # must have dls, as it has an internal random.Random
    # code from https://forums.fast.ai/t/solved-reproducibility-where-is-the-randomness-coming-in/31628/5
    random.seed(seed)
    dls.rng.seed(seed) #added this line
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

Fastai uses the DataLoader class to feed into its Learner class, so let's define a DataLoader:

In [None]:
bs = 64  # batch size of 64 will work for this small data
cut = int(len(seqs) * 0.8)    # cut will split into training and validation sets
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
set_seed(dls)



### The First Model
Let's take a look at the simplest model they consider first. It's designed to take in three words at a time, which appear in the `.forward()` part of the class as `x[:,0]`, `x[:,1]`, and `x[:,2]`:

In [None]:
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden,vocab_sz)

    def forward(self, x):  # SHH: I slightly edited this from fastbook to make it more similar to LMModel2 below
        h = 0
        h = h + self.i_h(x[:,0])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

So we've got a `nn.Embedding` layer, which as we said earlier is just a (very efficient) proxy for one-hot encoding paired with a linear layer.
After that it's just a 3-layer network, but one in which we add the embeddings (`i_h`) of each word to the "hidden state" (`h`) of the network for this 3-word sequence.  This network has no "memory"; we'll add that later.

Let's train this model, and look at the accuracy over a few epochs:

In [None]:
set_seed(dls)     # for comparison with the next model's run
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.789529,2.008283,0.473972,00:01
1,1.381144,1.828828,0.468743,00:01
2,1.419095,1.656361,0.490611,00:01
3,1.40474,1.676169,0.447112,00:01


An accuracy of 0.49 (the best epoch above; the final epoch lands at 0.45) is better than randomly guessing among 30 possible tokens (which would be a score of about 0.033), but not nearly as good as we'll be able to get. Thinking a little more carefully, we might inspect the dataset itself and see what the most common token is:

In [None]:
n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

(tensor(29), 'thousand', 0.15165200855716662)

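We might expect "thousand" to be rather common, since 90% of the dataset has the word "thousand" in it, but the low-value integers like "one", "two", "three" and so on also appear a lot.  Always guessing "thousand" would score about 0.15 on the validation set, so that (rather than 0.033) is the baseline our models really need to beat.  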


### Exercise 8.6: The Second Model: `LMModel2`  ( = `LMModel1` but with a loop )
Let's make a model that is *exactly the same as model 1* but with the `.forward()` method written using a loop.  Can you do it yourself -- without peeking at the fastbook version?  There are two empty lines in the model code below for you to fill in your own code.

In [None]:
## UNGRADED EXERCISE 8.6  (this will not be graded, just do it on your own)
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden,vocab_sz)

    def forward(self, x):
        h = 0
        for i in range(3):
            ### YOUR CODE BELOW: fill in the TWO LINES needed to perform the same
            # operations as LMModel1 (but using a loop instead)

            ### END YOUR CODE.
        return self.h_o(h)

Train the network below. It should give the same numbers as above:

In [None]:
set_seed(dls)   # for reproducibility / comparison with LMModel1's results above
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.789529,2.008283,0.473972,00:01
1,1.381144,1.828828,0.468743,00:01
2,1.419095,1.656361,0.490611,00:01
3,1.40474,1.676169,0.447112,00:01


### The Third Model
Now we finally come to a model that has some "memory": the state `h` is going to persist inside the class (as `self.h`) from one `.forward()` call to the next instead of being reset to 0 every time.  They do include a method to reset the state but it has to be called explicitly, otherwise `self.h` remains whatever value it had *last time* the forward method was called.

In [None]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0                   # this is the "memory" part. self.h persists with the class after .forward() is finished

    def forward(self, x):
        for i in range(3):           # what follows is same as LMModel2 but with "h" replaced by "self.h"
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()  # for next time: we'll keep the value of h but discard its extra gradient info
        return out

    def reset(self): self.h = 0     # we'll call this at the beginning of a new set of text.
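
What does that `.detach()` call actually do? Here's a tiny sketch on scalars (the numbers are arbitrary, chosen only so the gradients are easy to check by hand). The detached tensor keeps its value but forgets how it was computed, so later gradients stop there:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
h = w * 3.0                      # h is attached to the graph that produced it
h_tracks_history = h.requires_grad          # True

h_kept = h.detach()              # same value, but no history behind it
same_value = (h_kept.item() == h.item())    # True
kept_tracks_history = h_kept.requires_grad  # False

# Gradients from later computations stop at the detached tensor
out = h_kept * w
out.backward()
grad_w = w.grad.item()           # 6.0: h_kept is treated as a constant
# Without the detach, out would be 3*w*w and the gradient would be 12.0
```

In `LMModel3` this is what keeps backpropagation from reaching all the way back through every earlier batch.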

In order to take advantage of the state of the network, we need to make sure the "rows" of each batch line up across the batch boundaries.  (Remember when we wrote out those three batches of IMDB text, above?)

Currently our dataset isn't set up to do this, so we're going to change it and create a new dataset of "chunks" that follow one another:


In [None]:
m = len(seqs)//bs  # m will be the number of chunks
m,bs,len(seqs)

(328, 64, 21031)

In [None]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
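
To see what this reordering buys us, here's the same logic run on a toy dataset using plain Python lists. The function name `group_chunks_list` and the sizes (12 items, batch size 3) are made up for this illustration:

```python
def group_chunks_list(ds, bs):
    # Same reordering as group_chunks above, written with plain lists
    m = len(ds) // bs
    new_ds = []
    for i in range(m):
        new_ds += [ds[i + m*j] for j in range(bs)]
    return new_ds

ds = list(range(12))                 # pretend these are 12 consecutive sequences
toy_bs = 3
reordered = group_chunks_list(ds, toy_bs)
batches = [reordered[k:k+toy_bs] for k in range(0, len(reordered), toy_bs)]
# batches == [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

Reading down the first "row" of each batch gives 0, 1, 2, 3: each row of one batch is continued by the same row of the next batch, which is exactly what the stateful model needs.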

In [None]:
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False)

Now we can train model 3.  The first few epochs actually won't show an advantage, so we'll train it longer:

In [None]:
set_seed(dls)
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)  # ModelResetter callback will call our model's .reset()
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.703804,1.852335,0.431971,00:01
1,1.291301,1.85834,0.426683,00:01
2,1.105003,1.684123,0.467308,00:01
3,0.996649,1.709893,0.545673,00:01
4,0.948535,1.794689,0.550481,00:01
5,0.891645,1.726101,0.571875,00:01
6,0.889067,1.50499,0.569471,00:01
7,0.823918,1.686259,0.573798,00:01
8,0.78842,1.686563,0.599519,00:01
9,0.77284,1.679055,0.599279,00:01


Better than before!  But it's a slightly different dataset split, and of course we also trained it longer, and you'll notice we even bumped up the learning rate.

Just so we're not comparing "apples and oranges", let's re-run LMModel1 on this modified dataset for the same number of epochs to see how it compares to LMModel3:

In [None]:
set_seed(dls)
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.715092,1.878626,0.405769,00:01
1,1.385061,1.949975,0.35601,00:01
2,1.334724,2.01896,0.356731,00:01
3,1.317794,2.040227,0.358894,00:01
4,1.308069,2.021837,0.361779,00:01
5,1.302087,2.073891,0.371635,00:01
6,1.297517,2.083388,0.372356,00:01
7,1.294183,2.077134,0.353846,00:01
8,1.289541,2.12882,0.327163,00:01
9,1.284398,2.165285,0.325721,00:01


Notice how much worse this one was than the stateful model.

How can we improve on our results (LMModel3)?  For the datasets used in the above examples we've actually been "skipping" 3 words at a time, predicting only the word that follows each 3-word window. We'd get a lot more training data (or "signal" as the fastbook describes it) if we predicted the next word after *every* word in the dataset.  

In [None]:
sl = 3     # sl is the sequence length, which we'll increase from 3 in just a bit
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

Here's an example of the result:

In [None]:
[L(vocab[o] for o in s) for s in seqs[0]]

[(#3) ['one','.','two'], (#3) ['.','two','.']]

Notice how the second element (the target) reads `['.','two','.']`: it's just the input shifted over by one token, rather than the `['.','three','.']` window that we had ^^up above a ways.

We need to modify our model slightly so that it can handle the change we just made.

In [None]:
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden, sl=3):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        self.sl = sl

    def forward(self, x):
        outs = []
        for i in range(self.sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)

    def reset(self): self.h = 0

From fastbook: "This model will return outputs of shape `bs x sl x vocab_sz` (since we stacked on dim=1). Our targets are of shape `bs x sl`, so we need to flatten those before using them in F.cross_entropy:"

In [None]:
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
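
If the flattening seems mysterious, here's a quick shape check with random stand-in tensors. The sizes below are assumed for illustration; the real ones come from `bs`, `sl` and `len(vocab)`:

```python
import torch
import torch.nn.functional as F

toy_bs, toy_sl, vocab_sz = 4, 3, 30               # small assumed sizes
preds = torch.randn(toy_bs, toy_sl, vocab_sz)     # what the model returns
targs = torch.randint(0, vocab_sz, (toy_bs, toy_sl))

flat_preds = preds.view(-1, vocab_sz)     # shape (toy_bs*toy_sl, vocab_sz)
flat_targs = targs.view(-1)               # shape (toy_bs*toy_sl,)
loss = F.cross_entropy(flat_preds, flat_targs)    # a single scalar
```

After flattening, every (sequence, position) pair is treated as its own independent classification example.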

Let's train on that sequence length of 3, and keep training longer:

In [None]:
learn = Learner(dls, LMModel4(len(vocab), 64, sl=sl), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.844128,1.86156,0.466506,00:01
1,1.450525,1.947844,0.400721,00:01
2,1.407147,2.097088,0.346394,00:01
3,1.366056,2.111259,0.389744,00:01
4,1.306964,2.129084,0.392228,00:01
5,1.28748,2.061908,0.370753,00:01
6,1.290272,2.078658,0.449038,00:01
7,1.290166,2.45074,0.463141,00:01
8,1.275506,2.14876,0.477885,00:01
9,1.227539,2.292798,0.489984,00:01


And now let's bump up the sequence length to 16:

In [None]:
sl = 16     # sl is the sequence length, now increased from 3 to 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)


# we won't bother setting the random seed since the dataset will be different
learn = Learner(dls, LMModel4(len(vocab), 64, sl=sl), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.116494,2.964314,0.184896,00:00
1,2.225303,1.936567,0.453451,00:00
2,1.703538,1.771978,0.483724,00:00
3,1.444374,1.84152,0.504883,00:00
4,1.259405,2.190436,0.531738,00:00
5,1.130306,2.082433,0.57251,00:00
6,1.009578,2.410586,0.59082,00:00
7,0.911287,2.603492,0.603841,00:00
8,0.827644,2.804047,0.614095,00:00
9,0.759426,2.857082,0.632487,00:00


Ooh and that's even better, but can we still improve it?  What about adding more layers to the network?  "Deeper" learning should be better, right? ;-)
Let's use PyTorch's [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) class to help us make an even deeper model, where we can specify the depth via `n_layers`:

In [None]:
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(res)

    def reset(self): self.h.zero_()
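
It can help to check what `nn.RNN` hands back. In this sketch the input is a random stand-in for the embedded tokens, and the sizes are assumed to mirror the ones used in this lesson:

```python
import torch
import torch.nn as nn

toy_bs, toy_sl, n_hidden, n_layers = 4, 16, 64, 2     # assumed sizes
rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)

x = torch.randn(toy_bs, toy_sl, n_hidden)             # stand-in for self.i_h(x)
h0 = torch.zeros(n_layers, toy_bs, n_hidden)          # one hidden state per layer
res, h = rnn(x, h0)

# res: the top layer's output at every time step -> (toy_bs, toy_sl, n_hidden)
# h:   the final hidden state of every layer     -> (n_layers, toy_bs, n_hidden)
```

So `res` is what gets fed to `h_o` to make a prediction at every position, while `h` is the "memory" carried over to the next batch.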

In [None]:
learn = Learner(dls, LMModel5(len(vocab), 64, 2),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.047085,2.589521,0.440023,00:00
1,2.160752,1.784279,0.471273,00:00
2,1.71288,1.852068,0.331706,00:00
3,1.504376,1.861352,0.355469,00:00
4,1.329749,1.794908,0.462891,00:00
5,1.152756,1.793787,0.4764,00:00
6,0.997835,1.835657,0.503499,00:00
7,0.871044,1.938171,0.510579,00:00
8,0.77198,1.946762,0.529378,00:00
9,0.689576,2.003894,0.537028,00:00


Wait, that was actually WORSE this time?  How come? Isn't deeper better?

One problem is that our [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) class includes a lot of $\tanh$ activations internally, whose gradients tend to approach zero everywhere except near the "middle" -- recall the gradient of the sigmoid function we talked about in the Santa Claus example?  tanh and sigmoid are "cousins": In the following graph, we plot sigmoid in red, $\tanh$ in blue, and then $(1+\tanh(x/2))/2$ in green -- which lies right on top of the red line so you can't even see it, because they're exactly the same:



In [None]:
HTML('<iframe src="https://www.desmos.com/calculator/1xlbu6tbx0?embed" width="500" height="500" style="border: 1px solid #ccc" frameborder=0></iframe>')

So it's only in the region near x=0 that the gradient of tanh is significantly bigger than zero.  And with lots of tanh's feeding into other tanh's, chances are that we'll get a lot of small gradients being multiplied by other (small or large) gradients, leading to what's known as the "vanishing gradient problem".
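
We can check both claims numerically with nothing but the standard library. The identity used below is sigmoid(x) = (1 + tanh(x/2))/2, and the derivative of tanh is 1 - tanh²; the sample points are arbitrary:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2     # derivative of tanh

# sigmoid is a shifted, rescaled tanh
max_diff = max(abs(sigmoid(x) - (1.0 + math.tanh(x / 2.0)) / 2.0)
               for x in [-3.0, -0.5, 0.0, 1.0, 4.0])

# the gradient is 1 at x=0 and shrinks rapidly as |x| grows
grads = {x: tanh_grad(x) for x in [0.0, 1.0, 3.0, 5.0]}
```

By x=3 the gradient has already dropped below 0.01.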

Actually, whenever multiple tanh functions in multiple layers are doing the same thing, we'll either get vanishing gradients or else we'll get "exploding gradients" when the "middle" parts of the tanh's line up, especially if these tanh's are "sharpening", i.e. becoming more like step functions.  
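
Here's a deliberately oversimplified sketch of both failure modes. It pretends every step of backpropagation contributes the same factor (a weight times the tanh gradient), which a real RNN would not do; the function name and the particular numbers are assumptions chosen to make the effect obvious:

```python
import math

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def chained_gradient(pre_activation, weight, n_steps):
    # Product of n_steps identical factors (weight * tanh'(pre_activation)),
    # a simplified stand-in for backpropagating through n_steps
    return (weight * tanh_grad(pre_activation)) ** n_steps

# Away from the "middle": each factor is about 0.07, so the product collapses
vanishing = chained_gradient(pre_activation=2.0, weight=1.0, n_steps=16)

# Right at the "middle" with a steep ("sharpened") weight: each factor is 2
exploding = chained_gradient(pre_activation=0.0, weight=2.0, n_steps=16)
```

Sixteen steps is all it takes for the first product to become vanishingly small and the second to reach 65536.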

What if we replaced the tanh's in the RNN with ReLU like our earlier models were using? Let's try it...





In [None]:
class LMModel5b(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True, nonlinearity='relu') # change the tanh to a relu
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(res)

    def reset(self): self.h.zero_()


learn = Learner(dls, LMModel5b(len(vocab), 64, 2),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.222643,2.858331,0.397217,00:00
1,2.179497,1.889524,0.471191,00:00
2,1.64632,1.754807,0.518392,00:00
3,1.393186,1.772523,0.539307,00:00
4,1.125879,1.783095,0.58374,00:00
5,0.871718,2.140252,0.557617,00:00
6,0.685769,2.31324,0.673665,00:00
7,0.537473,2.512885,0.671305,00:00
8,0.437053,2.61596,0.685221,00:00
9,0.362013,2.838107,0.69751,00:00


Great!  That's the best so far! But still, HOW CAN WE DO BETTER?  
One very important architecture uses the tanh instead of ReLU activation, but does so in concert with a set of tunable "logic gates" that allow the model to learn what's worth keeping in memory and what's worth "forgetting".  It's known as Long Short-Term Memory or LSTM.  These gates allow the model to manage the problems of vanishing and exploding gradients.


## Long Short-Term Memory (LSTM)
Here's a picture of a "neuron" based on LSTM.  We call it a "cell".  It looks a bit intimidating, but we'll unpack it below.


![LSTM cell from fastbook](https://raw.githubusercontent.com/fastai/fastbook/780b76bef3127ce5b64f8230fce60e915a7e0735/images/LSTM.png)
*Image source: [The fastai Book](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb)*

> Note: The following "unpacking" of the LSTM is (IMHO) fairly clear, but if you find it confusing, you will not be alone.  If this next part stresses you out, don't worry about it too much, and go ahead with the training code that follows.


Importantly, the cell has not just one hidden state $h$ like before, but a new state $c$ as well, and they complement each other.  Along the bottom we see the hidden state of the cell "h" that persists from the previous "time" in the sequence (i.e. the previous token) $h_{t-1}$ to the state at the end of the cell's processing $h_t$. Note that $h_t$ is also the "output" of the cell, whereas $c$ is just passed along to the (later version of the) cell at a later time.  The new input at time $t$ is on the bottom as $x_t$. And rather than just adding $x$ and $h$ like we were doing before, we're going to concatenate them into one big matrix.

The yellow symbols are just elementwise mathematical operations like addition, multiplication, or running through tanh.  The orange symbols denote the LSTM's "gates", which include weights for tuning the gate -- i.e. the gates are like mini neural networks.

The leftmost sigmoid is called the "forget gate".  The variable $c$ functions as the "memory" of the cell.  Based on the input $x$ and the previous hidden state $h$, the weights feeding into the forget gate determine whether to "forget" $c$ by multiplying it by zero, or to "remember" it (i.e. allow it to pass through) by multiplying it by 1, or -- and this is important -- *some number between 0 and 1* -- thus the remembering isn't an all-or-nothing proposition, but instead is *continuous* and therefore *differentiable* and therefore *amenable to training by gradient descent!*

The next sigmoid is called the "input gate", which works in concert with the orange tanh gate (known as the "cell gate"), which performs an operation somewhat analogous to the older RNN's tanh activation, except this is *only* for modifying the cell's memory $c$.  The input gate decides how much of the new input to "remember".

What the cell actually outputs is then in the bottom right, in which we combine some amount of the old hidden state, the new input, and the new cell's memory, into a final output.  It's interesting to note that the contribution from the memory $c$ can range from -1 to 1 (as determined by the yellow tanh function), whereas the part from $x$ and $h$ is only between 0 and 1 (as determined by the last sigmoid in the bottom right).

The following is the code for an LSTM cell. For some readers, this may help clarify exactly what's described above.

In [None]:
class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.ih = nn.Linear(ni,4*nh)
        self.hh = nn.Linear(nh,4*nh)

    def forward(self, input, state):
        h,c = state
        # One big multiplication for all the gates is better than 4 smaller ones
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1)  # .chunk is a PyTorch method
        ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()

        c = (forgetgate*c) + (ingate*cellgate)
        h = outgate * c.tanh()
        return h, (h,c)

The PyTorch [.chunk()](https://pytorch.org/docs/stable/generated/torch.chunk.html) method just splits the tensor into (in this case) 4 chunks along dimension 1, so each chunk is `nh` wide: one chunk per gate.
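As a quick check of what `.chunk(4, 1)` returns, here's a sketch on a made-up tensor. The sizes (`nh = 3`, a batch of 2) are arbitrary assumptions for illustration and are not from the model above:

```python
import torch

nh = 3                                           # hypothetical hidden size
x = torch.arange(2 * 4 * nh).reshape(2, 4 * nh)  # a fake batch of 2 rows, each 4*nh wide

chunks = x.chunk(4, 1)                           # 4 chunks, split along dimension 1
print(len(chunks))                               # how many chunks we got back
print([tuple(c.shape) for c in chunks])          # each keeps the batch dim and is nh wide

# Gluing the chunks back together along dim 1 recovers the original tensor.
print(torch.equal(torch.cat(chunks, dim=1), x))
```

So in `LSTMCell`, the first three chunks become the input, forget, and output gates, and the fourth becomes the cell gate.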

We then chain these cells together. Here's a model that does that:

In [None]:
class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)

    def reset(self):
        for h in self.h: h.zero_()

In [None]:
learn = Learner(dls, LMModel6(len(vocab), 64, 2),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)    # note that we're cranking up the learning rate even more now!

epoch,train_loss,valid_loss,accuracy,time
0,3.020345,2.683045,0.363851,00:01
1,2.124043,2.11064,0.297933,00:01
2,1.601818,1.802153,0.459473,00:01
3,1.317922,2.01407,0.507161,00:01
4,1.107276,2.052923,0.529134,00:01
5,0.872074,1.887227,0.61263,00:01
6,0.631941,1.788551,0.625651,00:01
7,0.436306,1.298946,0.67277,00:01
8,0.287841,0.9223,0.769613,00:01
9,0.191388,0.828305,0.790934,00:01


Wonderful! That's the best so far! But WAIT, THERE'S MORE!

### Regularization Methods

Another thing we can do to improve the model's performance on the validation set goes under the topic of "regularization".  Regularization, conceptually, is any means we might try to "make life harder" for the training part of the code, so that it can generalize better.  Data augmentation -- such as rotating images, changing their intensity or contrast, or removing groups of pixels from the image -- is one example of regularization that we've already seen.

#### Dropout

Akin to removing pixels from an image in the training set (never in the validation set BTW), is to *randomly turn off neurons in the network during training*.  This is called [Dropout](https://dl.acm.org/doi/pdf/10.5555/2627435.2670313) and is a powerful method that is standard practice in Deep Learning. Dropout works by randomly turning off neurons during training, but keeping them all on when doing predictions (e.g. on the validation set).  In so doing, you make the network "work harder" to develop more powerful, more general, internal representations and to filter out "noisy" details you might not want the model to focus on. Dropout appears as a standard network layer in most DL libraries, and includes a parameter whereby you can tell it what fraction of cells in the layer to turn off during each iteration.  
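Here's a minimal sketch of that train-versus-predict difference using PyTorch's `nn.Dropout` layer. The input of all ones and `p=0.5` are made-up values for illustration. One detail not mentioned above: PyTorch scales the surviving values by 1/(1-p) during training so that the expected total stays the same.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # p is the fraction of activations to zero out
x = torch.ones(1000)       # made-up "activations", all equal to 1

drop.train()               # training mode: units are randomly zeroed
y_train = drop(x)
# Every value is now either 0.0 (dropped) or 1/(1-p) = 2.0 (kept and rescaled).
print(y_train.unique())

drop.eval()                # evaluation mode: dropout does nothing
y_eval = drop(x)
print(torch.equal(y_eval, x))
```

fastai switches between these two modes for us when it moves between training and validation.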

#### Weight Decay

There's another important method called Weight Decay that gets discussed earlier in the fastai book.  Weight Decay is where you *make the magnitude of the weights part of the loss function* -- you just add up the squares of the weights (so called "L2 regularization" because adding up squared things is part of an "L2 norm") and add that into the loss function.  The result is that the model will learn to keep the weights from getting too big, which could otherwise lead to overfitting -- again, the idea is to make things *hard* for the model while training. Schematically it looks like:

```
loss = loss + weights.pow(2).mean()*wd
```
where `wd` is the "strength" of the weight decay regularization: More means that the weights will make up a bigger part of the loss, which we're trying to *minimize* via gradient descent, so a larger `wd` parameter will lead to more minimizing of the model weights. This has the effect of reducing overfitting and (usually) improving generalization.
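To see how `wd` changes the balance, here's a plain-Python sketch of the schematic formula. The function name, the base loss of 1.0, and the weight values are all made up for illustration:

```python
def loss_with_weight_decay(base_loss, weights, wd):
    """Add wd times the mean squared weight to the loss, as in the schematic formula."""
    penalty = sum(w * w for w in weights) / len(weights)
    return base_loss + wd * penalty

small_weights = [0.1, -0.2, 0.05]   # made-up values
large_weights = [3.0, -4.0, 5.0]    # made-up values

for wd in [0.0, 0.01, 0.1]:
    print(wd,
          loss_with_weight_decay(1.0, small_weights, wd),
          loss_with_weight_decay(1.0, large_weights, wd))
```

With `wd=0` the weights don't matter at all; as `wd` grows, the model with large weights is penalized much more than the one with small weights.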

#### Activation Regularization
Related to weight decay is the act of making sure the neuron activations themselves don't get too large: we just add the L2 norm of the activations to the loss function as well.  To implement this we'll need to output not just the hidden state of the LSTM cell but the activations (so we can put them in the loss function).  fastai is going to handle all this for us by using the `RNNRegularizer()` callback.



### Final Model: `LMModel7`
Here's a model with Dropout added.

In [None]:
class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        raw,h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out),raw,out    # the raw, out are the activations we'll use in the RNNRegularizer

    def reset(self):
        for h in self.h: h.zero_()

One extra thing we did in the model above is the line that reads `self.h_o.weight = self.i_h.weight`, a regularization method called "weight tying".  From the fastai book:

> "Another useful trick we can add from the [AWD LSTM paper](https://arxiv.org/abs/1708.02182) is weight tying. In a language model, the input embeddings represent a mapping from English words to activations, and the output hidden layer represents a mapping from activations to English words. We might expect, intuitively, that these mappings could be the same."

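To see what tying does mechanically, here's a small sketch with made-up sizes (a 10-word vocabulary and 4 hidden units, not the lesson's values). It relies on the fact that the weight of `nn.Embedding(vocab_sz, n_hidden)` and the weight of `nn.Linear(n_hidden, vocab_sz)` have the same shape:

```python
import torch
import torch.nn as nn

vocab_sz, n_hidden = 10, 4               # tiny made-up sizes

i_h = nn.Embedding(vocab_sz, n_hidden)   # weight shape: (vocab_sz, n_hidden)
h_o = nn.Linear(n_hidden, vocab_sz)      # weight shape: (vocab_sz, n_hidden)
h_o.weight = i_h.weight                  # tie: both layers now share one Parameter

print(h_o.weight is i_h.weight)          # the same object, not a copy

# Changing the embedding weights in place changes the output layer too.
with torch.no_grad():
    i_h.weight.zero_()
print(h_o.weight.abs().sum().item())
```

Because the two layers share one set of weights, there are also fewer parameters to train.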
With all this then, here's (almost) our last from-scratch-weighted learner of the lesson:

In [None]:
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])

Incidentally, there's a special kind of Learner class, `TextLearner`, that includes some of this automatically:

In [None]:
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

Then we can do the training and pass in a weight decay strength of 0.1:

In [None]:
learn.fit_one_cycle(15, 1e-2, wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,2.477206,1.809246,0.476644,00:01
1,1.614283,1.281632,0.612793,00:01
2,0.878191,0.883364,0.778971,00:01
3,0.431828,0.635024,0.840007,00:01
4,0.211996,0.594336,0.860677,00:01
5,0.114491,0.688404,0.827555,00:01
6,0.065761,0.5316,0.872803,00:01
7,0.043977,0.462177,0.876546,00:01
8,0.032514,0.451317,0.886393,00:01
9,0.025291,0.450308,0.888591,00:01


88% in 15 seconds, training from scratch?!  Wow, pretty good, eh?!

And yet, as we said before, learning the English language involves a lot more than just being able to parse numbers in natural language form. So for greater sophistication, we're going to switch datasets, switch models, and use Transfer Learning instead of training from scratch.


---

# Part IV - Using Language Model Weights for Other Things

Earlier we looked at the IMDB dataset for movie reviews. Let's re-run the same code we did in Part I before to set up what comes next:

In [None]:
# in case you're coming back to Colab and want to restart without scrolling up,
# re-run the following line:
!pip install -Uqq fastai fastbook

In [None]:
import fastbook
from fastai.text.all import *
from IPython.display import display, HTML

All the tokenization and numericalization and special "continue-along-rows-between-batches" slicing that we've done earlier in Part I happens under the hood in the fastai [TextBlock](https://docs.fast.ai/text.data.html#TextBlock) special class, which can be fed into the more generic `DataBlock` class.


### How much time do you have to get good results?

There are two different-sized IMDB datasets.  The original fastai lesson -- which gives great results -- trains on the full 140 MB IMDB movie review dataset.  There's a much smaller sample dataset which is NOT supposed to be great for language model training but it'll go a heck of a lot faster.  In what follows, I'm including both options.  


I recommend that you leave `full_dataset` set to `False` at first --- so that you can run the following code in **a few minutes** rather than **6 HOURS** --- and then we'll load the fully-trained model weights below as with the "cooking show" metaphor mentioned earlier.  Feel free to come back later and change `full_dataset` to `True` and run it yourself if you want (but you don't need to).

In [None]:

def setup_imdb_dataloaders(full_dataset=False, bs=128):
    "made this a function instead of a cell so we can call it again below"

    if full_dataset:  # Full IMDB dataset as per fastbook Chapter 10; everything will take long

        path = untar_data(URLs.IMDB)  # Full IMDB dataset, ~140 MB in size

        # get_imdb is a data-getter function, where path is passed in below
        get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

        dls_lm = DataBlock(
            blocks=TextBlock.from_folder(path, is_lm=True),
            get_items=get_imdb, splitter=RandomSplitter(0.1)
        ).dataloaders(path, path=path, bs=bs, seq_len=80)

    else:  # Faster, won't produce as good end results.

        path = untar_data(URLs.IMDB_SAMPLE)  # smaller IMDB examples, 558K

        # IMDB_SAMPLE is a .csv, so follow https://docs.fast.ai/text.data.html#TextBlock.from_df
        df = pd.read_csv(path/'texts.csv')

        db_lm = DataBlock(
            blocks=TextBlock.from_df('text', is_lm=True),
            get_x=ColReader('text'), splitter=RandomSplitter(0.1))
        dls_lm = db_lm.dataloaders(df, bs=bs, seq_len=80)

    return path, dls_lm


full_dataset = False   # change to True if you want to train this thing yourself (~5 hours on Colab)
bs = 64   # batch size. if you get out of memory errors in training, cut this in half, restart & retry


path, dls_lm = setup_imdb_dataloaders(full_dataset=full_dataset, bs=bs)



Let's take a look:

In [None]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj want to watch a scary horror film ? xxmaj then steer clear of this one . xxmaj there 's not enough beer in the world to make this film enjoyable . \n\n xxmaj however , there is enough xxunk . xxmaj single - xxunk , if you can manage it . \n\n xxmaj if the previous comments were n't enough to keep you from watching this film xxunk , allow me to xxunk . xxup nasa xxunk one","xxmaj want to watch a scary horror film ? xxmaj then steer clear of this one . xxmaj there 's not enough beer in the world to make this film enjoyable . \n\n xxmaj however , there is enough xxunk . xxmaj single - xxunk , if you can manage it . \n\n xxmaj if the previous comments were n't enough to keep you from watching this film xxunk , allow me to xxunk . xxup nasa xxunk one man"
1,"design , having no real impact on the story . xxmaj i 'd argue that the whole point of using drawn animation ( instead of actors / xxup cgi ) is to really push the limits of imagination and design ; to do that which is too difficult / xxunk in other xxunk . xxmaj although the animation in xxmaj renaissance is certainly stunning and incredibly well - accomplished , i never felt like i was seeing something that has",", having no real impact on the story . xxmaj i 'd argue that the whole point of using drawn animation ( instead of actors / xxup cgi ) is to really push the limits of imagination and design ; to do that which is too difficult / xxunk in other xxunk . xxmaj although the animation in xxmaj renaissance is certainly stunning and incredibly well - accomplished , i never felt like i was seeing something that has n't"


### Fine-Tuning the Pretrained Language Model

The pretrained model we're going to use is an LSTM called "AWD_LSTM" that was pretrained by Howard et al. on Wikipedia. Earlier in the course when we were working with images, we used a fastai Learner called `cnn_learner` that loaded a pretrained "ResNet" model, so this is similar: There's `language_model_learner()` we can call:

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()   # fp16, if our GPU will support it natively, will help this run faster

That progress bar you just saw was a download of the pretrained model weights.  When we train the model in the next cell, the progress bar you'll see is going to be for *just one epoch* through this dataset, because the IMDB dataset is huge.  

> **Warning: If you use the full IMDB dataset** (`full_dataset=True`) **then expect the next cell to take 20 to 30 minutes to run!**

(If you're not using Colab and you get a CUDA Out of Memory error, then go back and decrease the batch size in the dataloaders definition, e.g. `bs=32`.)

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.340217,4.236506,0.268823,69.165794,00:09


What's that "perplexity" metric?  From the fastai book:
> "The perplexity metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy."

Since perplexity is the exponential of the loss, it rises and falls with the loss, which means that ***lower values are better***.
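We can check that relationship against the numbers in the training output above. This is just a plain-Python sketch (the function name is my own), using the `valid_loss` from the one-epoch run:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the (average) cross-entropy loss."""
    return math.exp(cross_entropy_loss)

valid_loss = 4.236506           # valid_loss from the one-epoch run above
print(perplexity(valid_loss))   # compare with the perplexity column in that same row

# A lower loss always gives a lower perplexity.
print(perplexity(3.5) < perplexity(4.0))
```

The first number should agree with the reported perplexity of about 69.17, up to rounding.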

At this point, the fastai book (Chapter 10) notes that it's high time we learn about saving model checkpoints, so that we can resume our work later if something happens, and reload a model we were training.  With fastai it's as simple as `learn.save(<filename>)` where the actual file gets a `.pth` appended to it, and it goes in a new directory called `models/` off of wherever the learner's `path` is currently set to:

In [None]:
learn.save('1epoch')  # this will create <learn.path/>models/1epoch.pth

Path('models/1epoch.pth')

Loading the trained model back is as simple as `learn.load()`:

In [None]:
learn = learn.load('1epoch')

> Note: When you're using `learn.load()`, you need to have *already defined* `learn` *and its model*: all the `load()` function does is overwrite the model weights and the state of the optimizer, and a few other things. (Think of it this way, in order to call "learn.load()", `learn` needs to be a defined variable already!)

Now, by default when you load a model in fastai, the weights are "frozen" except for the very last layer (and maybe a few of the other later layers, depending on the specific sub-class of Learner you're using).  Frozen means that most of the weights in the network are not training at all, only the last (few) layer(s).  This is to fine-tune the model, as it's been found that the most generic representations in the network happen in the early layers and are probably a great guess to get started, whereas the later layers are closer to new data and will need to evolve.

We could train all the model's layers at once in an "unfrozen" state --- and we're about to do just that -- but experience with Transfer Learning has shown that doing this (mostly) frozen training step first is better, because it doesn't cause the earlier layers to fluctuate wildly in response to the later layers being initialized from scratch.

That said, unfreezing the whole model means it'll take longer to train because we'll now be computing *gradients* for everything instead of just the later layers.  In many cases though, it doesn't take much longer, e.g. usually only about 20% longer (or less) than the frozen training.
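To make "frozen" concrete, here's a sketch in plain PyTorch rather than fastai. The tiny `body` and `head` are hypothetical stand-ins, and fastai's own `freeze()` and `unfreeze()` work on groups of layers instead of this hand-written loop, so treat this as an analogy and not as what the library does internally:

```python
import torch.nn as nn

# A stand-in model: a "body" followed by a "head" (both hypothetical).
body = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
head = nn.Linear(8, 2)
model = nn.Sequential(body, head)

def n_trainable(m):
    """Count the parameters that will receive gradient updates."""
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# "Freeze" the body: no gradients are computed for its weights.
for p in body.parameters():
    p.requires_grad_(False)
print("frozen:  ", n_trainable(model))

# "Unfreeze" everything: all the weights train again.
for p in model.parameters():
    p.requires_grad_(True)
print("unfrozen:", n_trainable(model))
```

The unfrozen count is larger because every layer now needs gradients, which is why unfrozen training takes longer per epoch.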

In [None]:
learn.unfreeze()

learn.fit_one_cycle(10, 2e-3)

learn.save('finetuned_full')

# output immediately below is shown for full_dataset=False

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.137601,4.090182,0.283821,59.750782,00:11
1,4.041873,4.033553,0.287096,56.461143,00:11
2,3.92197,4.027807,0.285063,56.13768,00:11
3,3.750067,4.066389,0.281642,58.345871,00:11
4,3.544308,4.154185,0.274848,63.700016,00:11
5,3.313109,4.279891,0.267754,72.232536,00:11
6,3.081216,4.418332,0.259444,82.957764,00:11
7,2.873045,4.520293,0.256298,91.862488,00:11
8,2.69766,4.572013,0.253701,96.73864,00:11
9,2.581851,4.581793,0.253474,97.689377,00:11


Path('models/finetuned_full.pth')

Expected output for the above when `full_dataset=True`:

(copied from my run on an RTX 3080 GPU, where I needed `bs=64` instead of 128 for memory reasons):

```
epoch train_loss valid_loss  accuracy    perplexity   time
--   --------    --------    --------    ---------    -----
0	3.778447	3.754353	0.317398	42.706593	11:45
1	3.741970	3.716871	0.321567	41.135479	11:36
2	3.654867	3.671912	0.326534	39.327045	11:35
3	3.607213	3.639998	0.330397	38.091766	11:38
4	3.534192	3.614742	0.334226	37.141747	11:34
5	3.478336	3.590213	0.337204	36.241806	11:38
6	3.395994	3.572024	0.339824	35.588547	11:33
7	3.321614	3.564811	0.341225	35.332790	11:37
8	3.243030	3.566489	0.341502	35.392105	11:28
9	3.223321	3.571197	0.341280	35.559132	11:37
```
(and note that Colab is 3x slower than the GPU used here, so that would amount to 5 to 6 hours of training on Colab.)

So, from the "expected output" for the full IMDB dataset (`full_dataset=True`) and training for 5 hours, one sees an accuracy of around 0.34 and a perplexity of about 35.  The fact that the perplexity for the small (`full_dataset=False`) method comes out to around 98 (in the run shown above) indicates that the smaller dataset is NOT going to be good for modeling language.  Also note that the `valid_loss` flattens out whereas the `train_loss` keeps decreasing, indicating that we are overfitting.

### Cooking Show Approach: ...and we're back from commercial!
Taking the "cooking show" approach, let's say you did all that training on the larger dataset and now we're ready to "pull it out of the oven":

In [None]:
## COOKING SHOW: ...AND we're back from commercial and everything's finished!

# If you skipped the above training (which is fine!), then download & load the
# (~500 MB) weights from when I ran it:
%pip install -Uqq mrspuff
from mrspuff.scrape import download

my_pretrained_url = 'https://www.dropbox.com/s/ozyhw44argi5th1/finetuned_full.pth?dl=1'
local_file = str(learn.path) + '/models/finetuned_full.pth'
if not os.path.exists(local_file): download(my_pretrained_url, local_file)

path, dls_lm = setup_imdb_dataloaders(full_dataset=True, bs=bs)  # be sure we have the full IMDB dataset

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()   # fp16, if our GPU will support it natively, will help this run faster

learn.load('finetuned_full')
print("Success! Fine-tuned weights loaded.")

Success! Fine-tuned weights loaded.


In addition to the usual `learn.save()`, the `language_model_learner` comes with its own special `.save_encoder()` method that will just save the "body" of the model without the predictive text head.  We will use this body for the text classification task, after the following interlude:

In [None]:
learn.save_encoder('finetuned_encoder')  # we'll use this after the interlude

## Interlude:  Let's Generate Text!
Before we get to text classification, let's have some fun and make our language model generate text!  To do this we feed the model's own predictions back in as inputs and have it continue predicting word after word.  We could write this part of the code from scratch like before, but in this case fastai's `language_model_learner` automatically knows to do this if we ask it to keep predicting for a long time.

Part of the prediction is a parameter called "temperature" which is related to how temperature affects the probability distribution of molecule speeds in physics and chemistry: the higher the temperature, the greater the variability in the model's outputs. This feature is part of many generative language models so that rather than deterministically outputting the same thing each time, the output is *sampled* from a probability distribution that is skewed by the temperature.
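Here's a plain-Python sketch of one common way temperature is applied: divide the model's raw scores by the temperature before turning them into probabilities, then sample. The scores and the 3-word vocabulary are made up, the function names are my own, and fastai's internals may differ in detail:

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; the temperature sharpens or flattens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Sample one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # made-up scores for a 3-word vocabulary
for t in [0.25, 0.75, 1.5]:
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample_token(logits, temperature=0.75))
```

At low temperature nearly all the probability lands on the top-scoring word, so the output is almost deterministic; at higher temperature the probabilities even out and the samples vary more.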

In [None]:
def generate_text(prompt, n_words=40, n_sentences=2, temperature=0.75):

    preds = [learn.predict(prompt, n_words, temperature=temperature)
            for _ in range(n_sentences)]
    return "\n".join(preds)

prompt = "I liked this movie because"
print(generate_text(prompt))

i liked this movie because i expected it to be a mystery and i hope that this one for me will be a good one for this movie . Not as room as i thought it was . 

 i girl not sure how
i liked this movie because i watched it in Paris and Paris in a small English town . The scenes in the movie were great in the early beings ( when Paris Crippled and Deeds Scarlet were


Note how we got multiple different outputs due to the random sampling.


We can make it generate longer text too:

In [None]:
prompt = "It's strange that people are going nuts over Squid Game because"
print( generate_text(prompt, n_words=75) )

It 's strange that people are going nuts over Squid Game because they all dinner to love one another . The music of this song is just as much a part of the movie as a movie . 

 Nobody knows where Squid / Squid powerful . 

 The first i can say Squid was a genius , a genius , and now he is only the one who has a whole private to it , but it 's simply a
It 's strange that people are going nuts over Squid Game because it 's just for a movie . It 's a perfectly Hollywood movie , but it hurts to say that this movie is really short to the Hollywood subtitles . There are many minor absolute in this movie . Nothing to do with this movie , a good thriller is n't enough , but it 's certainly worth watching when you 're not worth awaiting your money because it is


Go ahead and try your own prompt below.  Also try changing the default value of the temperature (to other values between 0 and 1) to observe its effect on the generated text.

In [None]:
prompt = "YOUR PROMPT HERE"
print( generate_text(prompt, temperature=0.9) )

YOUR PROMPT HERE is a truly excellent story . 

 This is a season show The Russia Glue . This story is told from the point of view of the main character . It 's a perfect
YOUR PROMPT HERE : The Duck and Duck ( i will dinner it , contains a experiences of duck or people are called for they were one of the first dark COHEN applause shown in my life ) .


## Now for the Text Classification

In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [None]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj this movie was recently released on xxup dvd in the xxup us and i finally got the chance to see this hard - to - find gem . xxmaj it even came with original theatrical previews of other xxmaj italian horror classics like "" xxunk "" and "" beyond xxup the xxup darkness "" . xxmaj unfortunately , the previews were the best thing about this movie . \n\n "" zombi 3 "" in a bizarre way is actually linked to the infamous xxmaj lucio xxmaj fulci "" zombie "" franchise which began in 1979 . xxmaj similarly compared to "" zombie "" , "" zombi 3 "" consists of a threadbare plot and a handful of extremely bad actors that keeps this ' horror ' trash barely afloat . xxmaj the gore is nearly non - existent ( unless one is frightened of people running around with",neg
2,"xxbos xxmaj polish film maker xxmaj walerian xxmaj borowczyk 's xxmaj la xxmaj bête ( french , 1975 , aka xxmaj the xxmaj beast ) is among the most controversial and brave films ever made and a very excellent one too . xxmaj this film tells everything that 's generally been hidden and denied about our nature and our sexual nature in particular with the symbolism and silence of its images . xxmaj the images may look wild , perverse , "" sick "" or exciting , but they are all in relation with the lastly mentioned . xxmaj sex , desire and death are very strong and primary things and dominate all the flesh that has a human soul inside it . xxmaj they interest and xxunk us so powerfully ( and by our nature ) that they are considered scary , unacceptable and something too wild to be",pos


In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()

This learner has a builtin `load_encoder()` method that will match with the `save_encoder()` we executed for our language model:

In [None]:
learn = learn.load_encoder('finetuned_encoder')  # load just the encoder part

Let's start training it.  Remember, by default this model is frozen except for the final layer, as befits the Transfer Learning methodology:

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.445914,0.387409,0.82848,01:48


Now the authors of ULMFit progressively unfreeze the model in a few steps, first unfreezing the second-to-last layer:

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.374789,0.304921,0.87172,01:59


Then the layer before that:

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.287808,0.256451,0.89456,02:43


And finally the whole model:

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.245164,0.247775,0.90304,03:19
1,0.229836,0.248729,0.90476,03:19


(Note how the execution time increased as we unfroze more & more layers, because we were adding more gradient-descent updates of the weights when unfreezing layers.)

...Interesting.  We ended up with 90% accuracy which is very good but the fastai book authors showed 94% at the same point, which is a significant difference.

 **TODO:** I'll need to look into this some more.

---
# Part V:  Further Readings / More Ethics

Two more readings that I want you to read for Friday so that we can discuss them:

* ["Disinformation and Language Models"](https://render.githubusercontent.com/view/ipynb?color_mode=auto&commit=780b76bef3127ce5b64f8230fce60e915a7e0735&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6661737461692f66617374626f6f6b2f373830623736626566333132376365356236346638323330666365363065393135613765303733352f31305f6e6c702e6970796e62&nwo=fastai%2Ffastbook&path=10_nlp.ipynb&repository_id=243838973&repository_type=Repository#Disinformation-and-Language-Models) from Chapter 10 of the fastai book.  (If URL doesn't open to the right heading, then just search on that title string within the web page.)

* ["Faithful Text Prediction"](https://www.christiancourier.ca/faithful-text-prediction/) by Calvin University CS Prof Kenneth C. Arnold.


## Appendix: Optional (Ungraded) Exercise: Compare Against Non-Transfer Learning Approach

Repeat the above training after defining the learner *again* (i.e. just copy & paste the learner definition from above), only this time *don't* load from the checkpoint of the language model, just train it "from scratch".  Compare the performance to what we got above.  Note:
- how the earlier model already starts with a much higher accuracy.
- how this "from scratch" model can overfit before it starts reaching the accuracy of the pretrained model

---
Acknowledgements: In addition to the liberal re-use of Howard & Gugger's fastai book code, the author wishes to acknowledge Zach Mueller for helpful coaching via the fastai Discord on the various NLP datasets, their sizes, and what they're good for.