We will build a text classifier for IMDb movie reviews.

In [1]:
import pandas as pd
import numpy as np
from fastai.text.all import (
    coll_repr,
    defaults,
    first,
    get_text_files,
    Numericalize,
    L,
    LMDataLoader, # Language Model Data Loader
    Tokenizer,
    untar_data,
    URLs,
    WordTokenizer,
)

In [2]:
path = untar_data(URLs.IMDB)
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

print("Data downloaded at", path)

Data downloaded at /root/.fastai/data/imdb


In [12]:
word_tokenizer = Tokenizer(tok=WordTokenizer())

### Creating text batches

In [13]:
# Imagine we have a text stream. Tokenization will add special tokens and split off
# punctuation.

# After tokenization we have 90 tokens, separated by spaces. Let's say we want a batch size of 6.
# We need to break this text into 6 contiguous parts of length 15:

stream = (
    "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
)
tokens = word_tokenizer(stream)
batch_size, seq_len = 6, 15

d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(batch_size)])
df = pd.DataFrame(d_tokens)
df


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
1,movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
2,first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
3,how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
4,of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
5,will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


It is important to maintain order within and across these subarrays, because we will use a model that maintains a state so that it remembers what it read previously when predicting what comes next.

The first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order of the inputs, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside them).
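
As a rough illustration of the shuffle-then-concatenate step described above, here is a minimal sketch in plain Python. The toy documents and the `make_stream` helper are hypothetical and are not part of fastai; the point is only that the document order changes while the word order inside each document does not.

```python
import random

# Toy corpus: each inner list is one already-tokenized document (hypothetical data).
docs = [
    ["xxbos", "a", "great", "film"],
    ["xxbos", "too", "long"],
    ["xxbos", "i", "liked", "it"],
]

def make_stream(docs, seed):
    """Shuffle the document order, then concatenate the tokens into one stream."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)  # shuffles documents, not the words inside them
    stream = [tok for i in order for tok in docs[i]]
    return order, stream

order, stream = make_stream(docs, seed=0)
print(order)
print(stream)
```

Calling `make_stream` with a different seed at each epoch would give a new stream built from the same intact documents.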

In [14]:
"""So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. 
We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the 
mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence 
length we picked.
"""
numericalizer = Numericalize()

In [15]:
# example

# to demonstrate this method we will select a corpus of 2000 movie reviews
txts = L(o.open().read() for o in files[:2000])

# Just like SubwordTokenizer, we need to call setup on Numericalize, so we tokenize 200 reviews first
toks200 = txts[:200].map(word_tokenizer)
toks200[0]

(#158) ['xxbos','xxmaj','jiang','xxmaj','xian','uses','the','complex','backstory','of'...]

In [16]:
numericalizer.setup(toks200)

In [17]:
nums200 = toks200.map(numericalizer)

dl = LMDataLoader(nums200)

x,y = first(dl)
x.shape,y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))
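
The inputs and targets have the same shape because, for a language model, the target is the input shifted by one token. The following is a minimal sketch of that idea on a toy stream of ids; `lm_pairs` is a hypothetical helper and does not reproduce how `LMDataLoader` is implemented.

```python
# Toy stream of token ids (hypothetical values) and a short sequence length.
stream = list(range(100, 121))  # 21 ids: 100 .. 120
seq_len = 5

def lm_pairs(stream, seq_len):
    """Yield (x, y) pairs where y is x shifted one token to the right."""
    for start in range(0, len(stream) - 1, seq_len):
        x = stream[start : start + seq_len]
        y = stream[start + 1 : start + seq_len + 1]
        if len(x) == len(y):  # drop a trailing partial pair
            yield x, y

pairs = list(lm_pairs(stream, seq_len))
for x, y in pairs:
    print(x, "->", y)
```

Each target sequence starts one position after its input, so every position in `x` has the next token of the stream as its label.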

In [18]:
" ".join(numericalizer.vocab[o] for o in x[0][:20])

'xxbos xxmaj xxunk xxmaj xxunk uses the complex xxunk of xxmaj xxunk xxmaj xxunk and xxmaj xxunk xxmaj xxunk to'
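
The many `xxunk` tokens above appear because the vocabulary was built from only 200 reviews, so rare words are not in it. Below is a simplified sketch of numericalization under stated assumptions: the `build_vocab` and `numericalize` helpers, the `min_freq=2` threshold, and the order of special tokens are all illustrative choices, not fastai's actual implementation.

```python
from collections import Counter

# Toy tokenized texts (hypothetical data).
toks = [
    ["xxbos", "the", "movie", "was", "good"],
    ["xxbos", "the", "plot", "was", "thin"],
]

def build_vocab(texts, min_freq=2, specials=("xxunk", "xxpad", "xxbos")):
    """Keep special tokens first, then every token seen at least min_freq times."""
    counts = Counter(tok for text in texts for tok in text)
    vocab = list(specials)
    vocab += [t for t, c in counts.most_common() if c >= min_freq and t not in specials]
    return vocab

vocab = build_vocab(toks)
tok2id = {t: i for i, t in enumerate(vocab)}

def numericalize(text):
    """Map each token to its index; unseen or rare tokens map to xxunk."""
    return [tok2id.get(t, tok2id["xxunk"]) for t in text]

ids = numericalize(toks[0])
decoded = " ".join(vocab[i] for i in ids)
print(vocab)
print(ids)
print(decoded)
```

Decoding the ids with the vocab gives back the text with rare words replaced by `xxunk`, which is the same effect visible in the notebook output.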

### Training a Text Classifier

There are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.

In [10]:
from functools import partial

from fastai.text.all import (
    accuracy,
    language_model_learner,
    AWD_LSTM,
    DataBlock,
    TextBlock,
    Perplexity,
    RandomSplitter,
)

In [19]:
# fastai handles tokenization and numericalization automatically when TextBlock is passed to DataBlock
get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
batch_size = 128
seq_len = 80

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=batch_size, seq_len=seq_len)

In [20]:
# we can then show a pair of examples
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj huh ? \n\n xxmaj what ? \n\n xxmaj vampire cavemen ? xxmaj sex replaced by flashing multi - colored light bulbs ? xxmaj guys in dinosaur suits ? a film half made of stock footage ? \n\n xxmaj this is n't just bad , it 's inexplicably bad . xxup do xxup not xxup watch xxup this xxup alone . xxmaj make sure to have a friend or two with whom you can swap wisecracks about this …","xxmaj huh ? \n\n xxmaj what ? \n\n xxmaj vampire cavemen ? xxmaj sex replaced by flashing multi - colored light bulbs ? xxmaj guys in dinosaur suits ? a film half made of stock footage ? \n\n xxmaj this is n't just bad , it 's inexplicably bad . xxup do xxup not xxup watch xxup this xxup alone . xxmaj make sure to have a friend or two with whom you can swap wisecracks about this … this"
1,"3 0 , while he 's 40 . xxmaj besides , many transitions take place from 2 xxrep 3 0 to the 70 's or the other way around without any warning . xxmaj this is to show that the character did n't really evolved much . xxmaj he was a dreamer when younger , and unlike many he did n't change when he grew up . \n\n xxmaj about transitions , they all are very very smooth , and","0 , while he 's 40 . xxmaj besides , many transitions take place from 2 xxrep 3 0 to the 70 's or the other way around without any warning . xxmaj this is to show that the character did n't really evolved much . xxmaj he was a dreamer when younger , and unlike many he did n't change when he grew up . \n\n xxmaj about transitions , they all are very very smooth , and you"


#### Fine-tuning the language model

To convert the integer word indices into activations that we can use for our neural network, we will use embeddings. 

Then we'll feed those embeddings into a recurrent neural network (RNN), using an architecture called AWD-LSTM.
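
Conceptually, an embedding layer is a lookup table with one row of numbers per token id. The sketch below shows only that lookup, with toy sizes and random values chosen for illustration; in the real model the rows are learned weights and the lookup is done by the framework.

```python
import random

rng = random.Random(0)
vocab_size, emb_dim = 6, 4  # toy sizes (hypothetical)

# One row of emb_dim floats per token id.
embedding = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(vocab_size)]

def embed(token_ids):
    """Replace each integer id with its row from the embedding table."""
    return [embedding[i] for i in token_ids]

activations = embed([2, 3, 0, 3])
print(len(activations), len(activations[0]))
```

The same id always maps to the same row, which is why the token-to-index correspondence has to stay fixed once the embeddings have been trained.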

In [21]:
learner = language_model_learner(
    dls_lm,
    AWD_LSTM,
    drop_mult=0.3, # for regularization
    metrics=[accuracy, Perplexity()]
).to_fp16()

The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The perplexity metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`).
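
To make the relationship between loss and perplexity concrete, here is a small worked example in plain Python. The two probability distributions are made up for illustration; the last lines apply the same formula to the validation loss from the first training table in this notebook.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for a single prediction: minus the log-probability of the target."""
    return -math.log(probs[target])

# A model that is completely unsure among 4 tokens.
uniform = [0.25, 0.25, 0.25, 0.25]
loss_uniform = cross_entropy(uniform, target=2)
ppl_uniform = math.exp(loss_uniform)

# A model that puts most of its probability on the correct token.
confident = [0.7, 0.1, 0.1, 0.1]
loss_confident = cross_entropy(confident, target=0)
ppl_confident = math.exp(loss_confident)

print(round(ppl_uniform, 4), round(ppl_confident, 4))

# Same formula applied to the reported validation loss of the first epoch.
reported_loss = 3.900625
print(round(math.exp(reported_loss), 2))
```

A uniform guess over N tokens has perplexity N, so perplexity can be read as the effective number of tokens the model is hesitating between.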

In [22]:
?learner.fit_one_cycle

Signature:
learner.fit_one_cycle(
    n_epoch,
    lr_max=None,
    div=25.0,
    div_final=100000.0,
    pct_start=0.25,
    wd=None,
    moms=None,
    cbs=None,
    reset_opt=False,
    start_epoch=0,
)
Docstring: Fit `self.model` for `n_epoch` using the 1cycle policy.


In [23]:
"""language_model_learner automatically calls freeze when using a pretrained model 
(which is the default), so this will only train the embeddings (the only part of the model 
that contains randomly initialized weights—i.e., embeddings for words that are in our IMDb vocab, 
but aren't in the pretrained model vocab).
"""

# execute on Colab or Paperspace
learner.fit_one_cycle(n_epoch=1, lr_max=2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.010259,3.900625,0.300411,49.433323,29:00


#### Saving and Loading Models

In [24]:
# save weights
learner.save("1epoch")
# load weights
learner = learner.load("1epoch")

In [28]:
!ls $learner.path/models 

1epoch.pth


In [31]:
!cp -R $learner.path/models /kaggle/output

In [32]:
# once the initial training has completed, we can continue fine-tuning the model
# after unfreezing

learner.unfreeze()
learner.fit_one_cycle(n_epoch=7, lr_max=2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.987087,3.975214,0.296409,53.26149,29:53
1,4.106718,4.063072,0.289464,58.152668,30:13
2,3.98665,3.952163,0.30089,52.047848,31:22
3,3.863941,3.835424,0.311763,46.31308,30:50
4,3.697453,3.709002,0.324382,40.813057,30:45
5,3.548256,3.620828,0.33413,37.368492,30:51
6,3.430043,3.606477,0.336346,36.836044,31:07


Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the encoder. We can save it with `save_encoder`:

In [33]:
learner.save_encoder("finetuned")

In [34]:
!ls $learner.path/models 

1epoch.pth  finetuned.pth


In [35]:
!cp -R $learner.path/models /kaggle/output

### Text generation

In [52]:
# At this point the model is trained to guess the next word of a sentence,
# so we can use it to generate text and write new reviews

PROMPT = "I like SpiderMan because"
N_WORDS = 40
N_SENTENCES = 1

preds = [learner.predict(PROMPT, N_WORDS, temperature=0.5) for _ in range(N_SENTENCES)]
#print("\n".join(preds))

In [53]:
for idx, stream in enumerate(preds[0].split(" ")):
    print(stream, end=" ")
    
    if idx > 0 and idx % 8 == 0:
        print("")

i like spiderman because it 's a good movie 
. But that 's not the point . 
This movie is about a robot who is 
trying to create a robot . As it 
turns out , he 's a super hero 
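
The `temperature` argument controls how random the sampled words are. The sketch below shows one common way temperature is applied, by dividing scores before a softmax; the scores are made up, and this is an illustration of the concept rather than a copy of what `learner.predict` does internally.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.5)

# Sample the next token id according to the sharpened probabilities.
rng = random.Random(0)
next_token = rng.choices(range(len(logits)), weights=p_sharp, k=1)[0]

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

With a temperature below 1 the most likely token gets an even larger share of the probability, so generated text becomes more predictable and less varied.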


### Creating the classifier

We're now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label—in the case of IMDb, it's the sentiment of a document.

In [56]:
from fastai.text.all import (
    CategoryBlock,
    GrandparentSplitter,
    parent_label,
)

In [57]:
batch_size = 128
seq_len = 72

# The reason that we pass the vocab of the language model is to make sure we use 
# the same correspondence of token to index. Otherwise the embeddings we learned in our 
# fine-tuned language model won't make any sense to this model, and the fine-tuning 
# step won't be of any use.
dls_class = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=["train", "test"]),
    splitter=GrandparentSplitter(valid_name="test")
).dataloaders(path, path=path, bs=batch_size, seq_len=seq_len)

In [59]:
dls_class.show_batch(max_n=2)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos * ! ! - xxup spoilers - ! ! * \n\n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the "" authorized xxmaj version "" of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n\n xxmaj both advantages made me appreciate this version of "" the xxmaj shining , "" all the more . \n\n xxmaj also , let me say that xxmaj i 've read xxmaj mr . xxmaj king 's book , "" the xxmaj shining "" on many occasions over the years , and while i love the book and am a huge fan of his work , xxmaj stanley xxmaj kubrick 's retelling of this story is far more compelling … and xxup scary . \n\n xxmaj kubrick",pos


There is one challenge we have to deal with, however: collating multiple documents into a mini-batch. Let's look at an example by trying to create a mini-batch containing the first 10 documents.

In [62]:
nums_samp = toks200[:10].map(numericalizer)
# we notice that each review has a different number of tokens
nums_samp.map(len)

(#10) [158,319,181,193,114,145,260,146,252,295]

PyTorch DataLoaders need to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape.

We will pad the shorter texts to make them all the same size. To do this, we use a special padding token that will be ignored by our model.
 
Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. 

The result of this is that the documents collated into a single batch will tend to be of similar lengths. We won't pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size. 

The sorting and padding are automatically done by the data block API for us when using a TextBlock, with `is_lm=False`.
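
To see why grouping documents of similar length helps, the sketch below pads the ten document lengths printed above, first in their original order and then sorted by length. The `PAD_ID` value, the dummy token ids, the batch size of 5, and padding on the right are all assumptions made for the illustration; fastai's own collation is not reproduced here.

```python
PAD_ID = 1  # hypothetical id of the padding token

# Lengths of the ten sample documents shown above, filled with a dummy token id.
lengths = [158, 319, 181, 193, 114, 145, 260, 146, 252, 295]
docs = [[2] * n for n in lengths]

def collate(batch, pad_id=PAD_ID):
    """Pad every document in the batch to the length of the longest one."""
    target = max(len(d) for d in batch)
    return [d + [pad_id] * (target - len(d)) for d in batch]

def total_padding(docs, bs):
    """Count how many padding tokens are needed when batching docs in the given order."""
    pads = 0
    for i in range(0, len(docs), bs):
        padded = collate(docs[i : i + bs])
        pads += sum(row.count(PAD_ID) for row in padded)
    return pads

unsorted_pads = total_padding(docs, bs=5)
sorted_pads = total_padding(sorted(docs, key=len), bs=5)
print(unsorted_pads, sorted_pads)
```

Sorting by length before batching cuts the number of padding tokens by more than half in this small example, which is wasted computation and memory avoided.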

In [70]:
from fastai.text.all import (
    text_classifier_learner,
)

In [65]:
# we can now create a model to classify our texts

classifier = text_classifier_learner(
    dls_class, AWD_LSTM, drop_mult=0.5, metrics=accuracy
).to_fp16()

In [66]:
# The final step prior to training the classifier is to load the encoder from our 
# fine-tuned language model. We use load_encoder instead of load because we 
# only have pretrained weights available for the encoder

classifier = classifier.load_encoder("finetuned")

### Fine-Tuning the Classifier

The last step is to train with discriminative learning rates and gradual unfreezing. In computer vision we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference.

Discriminative learning rates are one of the tricks that help guide fine-tuning. By using lower learning rates on the earlier layers of the network (those closest to the input), we avoid tampering too much with the blocks that have already learned general patterns, and we concentrate fine-tuning on the later layers.

In [67]:
classifier.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.242175,0.181831,0.93016,01:10


In [68]:
?classifier.fit_one_cycle

Signature:
classifier.fit_one_cycle(
    n_epoch,
    lr_max=None,
    div=25.0,
    div_final=100000.0,
    pct_start=0.25,
    wd=None,
    moms=None,
    cbs=None,
    reset_opt=False,
    start_epoch=0,
)
Docstring: Fit `self.model` for `n_epoch` using the 1cycle policy.

In [71]:
?slice

Init signature: slice(self, /, *args, **kwargs)
Docstring:
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:

In [73]:
lr = slice(1e-2/(2.6**4),1e-2)
"""A slice object is used to specify how to slice a sequence. You can specify
where to start the slicing, and where to end. You can also specify the step.
"""
print(type(lr))
lr

<class 'slice'>


slice(0.00021882987290360977, 0.01, None)
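
Here the slice is not used for indexing: its start and stop are read as the lowest and highest learning rates. The sketch below shows how such a range could be spread over the parameter groups, assuming geometric (evenly multiplied) spacing and five groups, which is what the exponent 4 in `2.6**4` suggests. The `geometric_lrs` helper is hypothetical and only illustrates the arithmetic.

```python
def geometric_lrs(lo, hi, n_groups):
    """Spread learning rates from lo to hi so each group is a constant multiple of the previous one."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio**i for i in range(n_groups)]

lr = slice(1e-2 / (2.6**4), 1e-2)
n_groups = 5  # assumption: five parameter groups in the classifier

lrs = geometric_lrs(lr.start, lr.stop, n_groups)
for i, value in enumerate(lrs):
    print(f"group {i}: {value:.6f}")
```

Under these assumptions each group trains 2.6 times faster than the one before it, with the earliest group at the slice's start and the last group at its stop.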

In [75]:
# We can pass -2 to freeze_to to freeze all except the last two parameter groups

classifier.freeze_to(-2)
classifier.fit_one_cycle(n_epoch=1, lr_max=lr)

epoch,train_loss,valid_loss,accuracy,time
0,0.212653,0.167675,0.936,01:14
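
The effect of `freeze_to` on the parameter groups can be pictured with a plain list. The group names below are hypothetical placeholders and the helper is a stand-in written for this illustration; it only mimics which groups end up frozen and which stay trainable for a given index.

```python
# Hypothetical names for the parameter groups, from input side to output side.
groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "head"]

def freeze_to(groups, n):
    """Return (frozen, trainable): every group before index n is frozen."""
    cut = n if n >= 0 else len(groups) + n
    return groups[:cut], groups[cut:]

for n in (-1, -2, -3, 0):
    frozen, trainable = freeze_to(groups, n)
    print(n, "trainable:", trainable)
```

Read this way, moving from `-1` to `-2` to `-3` and finally to `0` unfreezes one more group at each stage, which is the gradual unfreezing schedule used in the cells of this section.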


In [77]:
lr = slice(5e-3/(2.6**4), 5e-3)

classifier.freeze_to(-3)
classifier.fit_one_cycle(n_epoch=1, lr_max=lr)

epoch,train_loss,valid_loss,accuracy,time
0,0.172222,0.157002,0.9418,01:28


In [78]:
# unfreeze the whole model
classifier.unfreeze()
classifier.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.145825,0.158671,0.94232,01:43
1,0.138071,0.158919,0.94228,01:46
