In [None]:
import sys
!{sys.executable} -m pip install fastbook

%matplotlib inline

import fastbook
fastbook.setup_book()
from fastbook import *
from IPython.display import display,HTML

The goal here is to build a classifier that tells us whether IMDb reviews are positive or negative. We'll build it on top of a language model (a model that can predict the next word in a sequence) by fine-tuning it. In fact, there are two rounds of fine-tuning: we start from a language model that was pretrained on Wikipedia data, fine-tune it on IMDb reviews, and then fine-tune that into our classifier.

This will allow us to go from a language model that has a good sense of English, to one that has specific understanding of language related to film (director & actor names, titles), which will be particularly well-suited to classify our reviews in this realm. 

So how might you create a language model? We know that neural networks work with numbers. They have weights that our numerical data gets multiplied by, summed, and potentially transformed again to get another number - our activation value. Obviously, words aren't numbers. Seems like a pretty big problem. How do we get past this? By turning the words into numb3r5. 0k4y 50 wh4t w3 w4nn4 d0 g01ng f0rw4rd 15nt 45 1337 45 th15, but it's still pretty clever. We'll essentially just assign every word we deal with to a number, and use that number for our numerical calculations. Then we'll use these numericalized words to train our model. This is how that's done at a high level:

1. Make a list of all our words (we'll call this the vocab)

2. Replace all of our words with their index in the vocab 

3. Create an 'embedding matrix' with a row for each word in our vocab 

4. Populate the embedding matrix with a vector representation of the word associated with that index -> this is basically just a list of numbers (the length of which is a setting we choose, and it'll often depend on the size of the vocab, among other things) that will represent our word so that we can perform calculations with the word. It'll be initialized randomly until the values are settled on during our training.

5. Train our model - We'll split up our corpus of text into different sequences of words. Our independent variable will be the first to second-to-last words of the sequence. The dependent variable will be the second to the last words of the sequence, i.e. the same sequence shifted along by one word. Remember that the model actually works with the embedding vectors of the input words, so the input is essentially just lists of numbers that represent the words in our sequence, while the targets stay as vocab indices. We offset the dependent variable by one because, at every position, we want the model to use all of the words up to that point (our independent variable) to predict the word that comes next. The first word of the sequence is left out of the dependent variable simply because there's nothing before it to predict it from.
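
Steps 1, 2 and 5 can be sketched in plain Python. This is a toy illustration rather than anything fastai does internally; the sentence and the names `words`, `ids`, `x` and `y` are made up for the example.

```python
# Toy corpus, already split into words (illustrative only).
words = "the movie was great and the cast was great".split()

# Step 1: the vocab is the list of unique words, in order of first appearance.
vocab = list(dict.fromkeys(words))

# Step 2: replace each word with its index in the vocab.
word_to_idx = {w: i for i, w in enumerate(vocab)}
ids = [word_to_idx[w] for w in words]

# Step 5: the target is the input shifted one position to the left,
# so at every position the model is asked for the next word.
x = ids[:-1]  # first word .. second-to-last word
y = ids[1:]   # second word .. last word

for xi, yi in zip(x, y):
    print(f"{vocab[xi]:>6} -> {vocab[yi]}")
```

Each printed pair is one "given this word (and everything before it), predict that word" training example.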

Then, based on a loss function that depends on our goal, we'll use gradient descent to optimize our embedding matrix (the first layer of our network, which translates our words into their numerical representation) along with the rest of the network's weights. Since our aim is to predict the next word, we'll use 'cross-entropy', which measures the difference between the predicted word distribution and the actual next word. With gradient descent using this loss function, we'll find the embedding matrix that turns these words into vectors of numbers that, when operated on by subsequent layers, give the actual next word as high a predicted probability as possible.
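
To make cross-entropy a bit more concrete, here's a minimal sketch in plain Python. The scores are invented, and the helper names `softmax` and `cross_entropy` are just for this example (the real loss runs on tensors inside the library).

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_idx):
    # Negative log of the probability assigned to the correct next token.
    probs = softmax(logits)
    return -math.log(probs[target_idx])

# Hypothetical scores over a 4-token vocab; token 2 is the true next word.
confident = cross_entropy([0.1, 0.2, 5.0, 0.3], target_idx=2)
unsure = cross_entropy([1.0, 1.0, 1.0, 1.0], target_idx=2)
print(confident, unsure)  # the confident, correct prediction has the lower loss
```

The more probability the model puts on the right next word, the closer the loss gets to zero.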

HOLY SHIET. That's a lot of preliminary info before we got started with any coding. And it doesn't even end there. Don't worry - I'm suffering as much as you are. In fact probably more so since I had to figure out and type this all out. But now you're in the position I was just in and it's your problem now. Okay, hopefully some of these principles become clearer when we start training our model.

Numericalization - Making a list of our unique tokens/words (our vocab) and converting our text into numbers based on their index in our vocab list. After this we'll use our embedding matrix to convert these indices to embedding vectors, which will be numerical representations of our text that contain semantic information (info about the meaning of these words) about the tokens. These vector embeddings are uninterpretable to humans, but it's the way we quantitatively encode our text to allow our model to gain an understanding of language and perform functions like analyzing sentiment or predicting next words, by performing mathematical calculations with the numerical representations of these words (their embedding vectors).
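
Here's a rough sketch of what "using the embedding matrix" means: it's just row lookup. The vocab, the tiny `emb_dim` and the random values are all assumptions for illustration; a real model uses a much larger matrix and learns the values during training.

```python
import random

random.seed(0)  # fixed seed so the toy example is repeatable

vocab = ["xxunk", "the", "movie", "was", "great"]
emb_dim = 3  # hypothetical size; real models use hundreds of dimensions

# One row per vocab entry, initialised randomly (training would adjust these).
embedding_matrix = [[random.uniform(-1, 1) for _ in range(emb_dim)]
                    for _ in vocab]

ids = [1, 2, 3, 4]  # "the movie was great" after numericalization
vectors = [embedding_matrix[i] for i in ids]  # the lookup is just row indexing

for i, vec in zip(ids, vectors):
    print(vocab[i], [round(v, 2) for v in vec])
```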


In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

files = get_text_files(path, folders = ['train', 'test', 'unsup'])

txt = files[0].open().read(); txt[:75]

I know you're getting antsy for some code so here's a little taste to curb your addiction. It's just to load the text files we'll be using. I need to talk more about tokenization so that's all you get for now you fiends.

In tokenization we're splitting up our text into words or, more specifically, semantic units called 'tokens'.

There are a few approaches here depending on the language we're processing, among other things.

Word-based: Split up a sentence based on spaces. Pretty straightforward. There might need to be special handling of cases like contractions  as well as punctuation, where words are split at punctuation as well. 

Subword-based: In English there are many prefixes and suffixes that have a consistent way of altering the meaning of the base word. For example an 's' or 'es' suffix denotes plurality. Thus, these semantic units that are more granular than the whole word may provide useful insight into the entire word's meaning, and should be isolated. This is usually done based on the most commonly occurring substrings.

Character-based: Splitting text into individual  ["c", "h", "a", "r", "a", "c", "t", "e", "r", "s"]
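
A quick sketch of the word-based and character-based approaches using only the standard library. The regex is a deliberately naive assumption for illustration; real tokenizers such as spaCy make more careful choices about contractions and punctuation.

```python
import re

text = "Don't panic, it's fine."

# Word-based: runs of word characters (allowing one inner apostrophe),
# or a single punctuation mark as its own token.
word_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# Character-based: every character, including spaces, is a token.
char_tokens = list(text)

print(word_tokens)
print(char_tokens[:6])
```

Subword tokenization needs a pass over the corpus first, so it doesn't fit in a one-liner like these two.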


In [None]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

We're using fast.ai's default WordTokenizer, which uses a library called spaCy under the hood, hence the name of our object. We'll print out the first 30 tokens of our text using the coll_repr() function.

These tokenizers take a collection of documents to tokenize, so we wrap our txt in a list and use first() to get our first item from the generator that's returned.


In [None]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

This example illustrates how crafty spaCy is at handling different punctuation it might encounter. 


In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

The Tokenizer class provides even more tokenizing functionality. We can see some interesting tokens like 'xxbos' or 'xxmaj'. These represent special characters or patterns in the text. 'xxbos' marks the beginning of a text, and 'xxmaj' indicates the next word begins with a capital. Through this, the model is able to learn about the concept of capitalization itself (e.g. for proper nouns), and we don't have to keep separate capitalized and lowercase versions of words, which will save on memory resources in our vocab list and embedding matrix if nothing else.
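
To show the idea behind 'xxmaj', here's a simplified stand-in. The function `mark_capitals` is hypothetical and is not fastai's actual rule; it only handles the "first letter capital, rest lowercase" case.

```python
def mark_capitals(tokens):
    # Simplified stand-in for the idea behind 'xxmaj' (not fastai's actual rule):
    # emit a marker token, then the lowercased word.
    out = []
    for tok in tokens:
        if tok[:1].isupper() and tok[1:].islower():
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok)
    return out

print(mark_capitals(["xxbos", "The", "movie", "was", "Great", "."]))
```

After this step 'The' and 'the' share one vocab entry, and the capital is carried by the marker token instead.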


In [None]:
defaults.text_proc_rules

??replace_rep

You can check the rules used to generate these special tokens with defaults.text_proc_rules, and check what each rule does with '??'.


In [None]:
txts = L(o.open().read() for o in files[:2000])

Now let's go over to subword tokenization. Word tokenization was very straightforward - just look for spaces. How do we decide how subword tokens should be extracted from our words? We'll have to do a little work before a text is tokenized, where we analyze our corpus to find our most commonly occurring groups of letters, which will become our vocab. Then we'll tokenize our corpus with this vocab of the most common substrings. Let's write a function to do this for us:


In [None]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

Let's instantiate a SubwordTokenizer with parameter 'vocab_sz', which is the size we want our vocab to be. Then we call setup() and pass in our corpus, which will have the most common substrings extracted into our vocab. Our function will then string together the first 40 tokens from our text and show it to us. We can call this with different vocab_sz arguments to see how having vocabs of different sizes will affect the tokenization of our text.
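
If you're curious how "most common substrings" can be found at all, here's a toy byte-pair-style merge loop. To be clear, this is an assumption-laden sketch: fastai's SubwordTokenizer relies on the SentencePiece library, which is far more sophisticated, and `learn_merges` is a made-up name for this example.

```python
from collections import Counter

def learn_merges(words, n_merges):
    # Start from single characters and repeatedly merge the most frequent
    # adjacent pair, so common substrings grow into single tokens.
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        new_seqs = []
        for seq in seqs:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_seqs.append(merged)
        seqs = new_seqs
    return merges, seqs

corpus = ["walked", "talked", "walking", "talking"]
for n in (0, 2, 6):
    _, tokenized = learn_merges(corpus, n)
    print(n, tokenized[0])
```

More merges play the same role as a bigger vocab_sz: the tokens get longer and each word needs fewer of them.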


In [None]:
print(f"1000 token vocab: {subword(1000)} \n 200 token vocab: {subword(200)} \n 10000 token vocab: {subword(10000)}")

We can see that with a larger vocab, there's room for more substrings (and thus longer substrings) to be included. That means we can represent more text with fewer tokens, because entire words themselves will be tokens in our vocab.

Therefore, subword tokenization can also be equivalent to word-based tokenization (large vocabs with a size equivalent to the number of unique words in a text) or character-based tokenization (a vocab with one token for each of the 26 letters of the alphabet, and then some more for special characters and punctuation), depending on the vocab size.

There are trade-offs in the scale of subword tokens. Large vocabs mean more text is represented with fewer tokens, providing faster training. The downside is that the embedding matrix will be larger, so more data is required for the model to truly learn the meaning of the tokens, i.e. optimize the values of each token's embedding vector.

YAY now our words are smaller words. Let's turn our smaller words into numbers. 


In [None]:
toks200 = txts[:200].map(tkn)
num = Numericalize()
num.setup(toks200)

We'll tokenize the first 200 reviews of our data with the word-based tokenizer we used previously. 

After instantiating a Numericalize object, we'll call its setup() method and pass in our tokenized text, which will do the same thing as before in analyzing the text to create our vocab. There are some arguments we can pass into Numericalize, with default values of min_freq=3 (the minimum number of times a word has to appear to be included in the vocab) and max_vocab=60000 (pretty self-explanatory). Everything else will be replaced by the special token 'xxunk'.
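
The min_freq / max_vocab / 'xxunk' behaviour can be sketched with a Counter. This is a simplified assumption of how such a vocab could be built, not Numericalize's actual code (the real vocab reserves several special tokens, for one thing); `build_vocab` and `numericalize` are names invented for this example.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    # Count every token, keep the most frequent ones that appear often enough.
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    return ["xxunk"] + kept  # index 0 is reserved for unknown tokens here

def numericalize(tokens, vocab):
    idx = {tok: i for i, tok in enumerate(vocab)}
    return [idx.get(tok, 0) for tok in tokens]  # unseen or rare -> 'xxunk'

docs = [["the", "film", "was", "good"],
        ["the", "film", "was", "bad"],
        ["the", "plot", "was", "thin"]]
vocab = build_vocab(docs, min_freq=2)
print(vocab)
print(numericalize(["the", "plot", "was", "good"], vocab))
```

Words that only show up once ('plot', 'good') don't make it into the vocab, so they all collapse to index 0.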


In [None]:
nums = num(toks)[:20]; nums

With the Numericalize object set up, we'll pass one of our reviews into it like a function, and take a look at them. Each number corresponds to the index of our vocab for that particular token.


In [None]:
' '.join(num.vocab[o] for o in nums)

Now we can recreate our original text by using each of our numbers to index into our vocab.


In [None]:
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)

Let's use the LMDataLoader class in the fastai library to make the next step, which is batching, a little easier for ourselves. We'll numericalize our tokenized text and pass that into our LMDataLoader instance.
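
Here's roughly what batching for a language model involves, as a plain-Python sketch. It's a simplified picture of the idea rather than LMDataLoader's actual code, and `lm_batches` is a hypothetical helper: the whole token stream is cut into `bs` long rows, and each batch takes the next `seq_len` tokens from every row.

```python
def lm_batches(stream, bs, seq_len):
    # Split one long token stream into `bs` contiguous rows, then walk across
    # the rows in windows of `seq_len`, pairing each window with its
    # one-step-ahead target.
    row_len = (len(stream) - 1) // bs  # leave one token spare for the final target
    rows = [stream[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(25))  # stand-in for token ids of all reviews joined together
for x, y in lm_batches(stream, bs=2, seq_len=4):
    print(x, y)
```

Notice that row 0 of each batch picks up exactly where row 0 of the previous batch left off, which is what lets a recurrent model carry its state from one batch to the next.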


In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

We'll use the DataBlock class, which we pass a TextBlock to, for even more convenience. It will take care of tokenization and numericalization for us automatically. With our data ready, we'll be able to fine-tune (with IMDb review data) the pretrained (on Wikipedia) language model to give it a better understanding of the language associated with movie reviews.


In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 2e-2)

We'll put our DataLoaders into a Learner for our language model. Perplexity is the exponential of the cross-entropy loss, which is also our loss function. AWD_LSTM is the architecture for the recurrent neural network we'll be using.
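
The relationship between the loss and perplexity is a one-liner. This sketch assumes the loss uses natural logs; the 50-token vocab is a made-up example.

```python
import math

def perplexity(cross_entropy_loss):
    # Perplexity is exp(loss); with a natural-log loss it reads as the
    # effective number of equally likely choices per token.
    return math.exp(cross_entropy_loss)

# A model guessing uniformly over a hypothetical 50-token vocab has
# loss log(50), which corresponds to a perplexity of 50.
print(perplexity(math.log(50)))
```

So a falling perplexity means the model is, on average, torn between fewer candidate next words.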

Now we'll also train our model for one epoch and use this as an opportunity to talk about saving model states. 


In [None]:
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

By calling .save() we'll get a file in learn.path/models/ named '1epoch.pth'. We'll continue with the rest of fine-tuning. 


In [None]:
learn.save_encoder('finetuned')

Now by calling .save_encoder() we'll save the entire model except for the final layer that converts activations to probabilities of picking each token in our vocabulary. The entire model minus the function-specific final layer is called the 'Encoder'. 


In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]
print("\n".join(preds))  # show the generated continuations

At this point we've fine-tuned our language model to better understand movie reviews. We can see it in action predicting the next words in text that seems like it'd belong in a movie review by prompting it with the starting sequence 'I liked this movie because...'. 
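
The temperature argument controls how adventurous the sampling is. Below is a sketch of the usual way temperature is applied to a set of scores; it's an assumption that fastai does something equivalent internally, and the logits are invented.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide scores by the temperature before the softmax: values below 1
    # sharpen the distribution, values above 1 flatten it.
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.75, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

With a temperature of 0.75 the top candidate gets picked a bit more often than at 1.0, so the generated text is slightly less random.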

This isn't our final product, and is actually just an intermediate step towards our real goal, which is to fine-tune a language model for sentiment analysis of reviews.


In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

We'll make the DataBlock for this sentiment analysis task, which will just be another classification task, with our classes being the positive or negative sentiment a particular review had.


In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

learn = learn.load_encoder('finetuned')

With our classification DataLoaders set up, we'll create our text classification model. Before training this classifier, we'll load the encoder of the model we fine-tuned in the last step, which was what we went to all that trouble for.


In [None]:
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

Now we can train our classifier by gradually unfreezing more and more of the model while running fine-tuning epochs. Amazing! Now we have a sentiment classifier for our IMDb movie reviews. Technology is so amazing. We can now detect if a comment is a hate or a dickride without even reading it! What an amazing time to be alive and literate. 
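
One last detail: those slice(lo, hi) learning rates in the training cell give the early layers a smaller learning rate than the later ones. Here's a sketch of how a range like that can be spread out geometrically. The helper `spread_lrs` is hypothetical, and the five layer groups are an assumption about how the model is split.

```python
def spread_lrs(lo, hi, n_groups):
    # Geometric spacing: each layer group gets the previous rate times a
    # constant factor, from `lo` (earliest layers) to `hi` (the head).
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

# Assuming five layer groups, 2.6**4 means neighbouring groups differ by 2.6x.
lrs = spread_lrs(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
print([f"{lr:.2e}" for lr in lrs])
```

The intuition is that the early layers already know general English, so they only need gentle nudges, while the freshly added classifier head needs the biggest updates.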
