# Text Classification


We shall use a pre-trained network which at least knows how to read English. We will train a model that predicts the next word of a sentence (i.e. a language model) and then, just like in computer vision, stick some new layers on the end and ask it to predict whether something is positive or negative.

Fine-tuning a pre-trained network is really powerful. 

If we can get it to learn some related tasks first, then we can use all that information to try and help it on the second task.

Without that, the model reads a thousand words knowing nothing about how English is structured, or even the concept of a word or punctuation, and all it gets back is a 1 or a 0 (positive or negative).

Trying to learn the entire structure of English and then how it expresses positive and negative sentiments from a single number is just too much to expect.

In [3]:
# IMDb movie review dataset

In [4]:
#! wget --header="Host: files.fast.ai" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: http://localhost:8888/notebooks/courses/dl1/lesson4-imdb.ipynb" "http://files.fast.ai/data/aclImdb.tgz" -O "aclImdb.tgz" -c

In [5]:
# Spacy does a lot of NLP stuff, and it has the best tokenizer 
#! pip install spacy
#! python -m spacy download en

In [6]:
# Auto-reload modules in the Jupyter notebook (so that changes to *.py files don't require manual reloading):
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

# torchtext: NLP library for PyTorch
import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy



In [7]:
PATH = '/home/paperspace/fastai/courses/SelfCodes/Text_Class/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


In [8]:
#import os, sys, tarfile

#import tarfile
#tar = tarfile.open("aclImdb.tgz")
#tar.extractall()
#tar.close()


We do not have separate test and validation sets in this case. Just like in vision, the training directory has a bunch of files in it.

Let's look inside the training folder...

In [9]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

Here is an example review:

In [10]:
review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

Check how many words are in the dataset:

In [11]:
!find {TRN} -name '*.txt' | xargs cat | wc -w
#17486581
!find {VAL} -name '*.txt' | xargs cat | wc -w
#5686719

17486581
5686719


Before we can do anything with text, we have to turn it into a list of tokens.

A token is basically like a word. Eventually we will turn the tokens into a list of numbers, but the first step is to turn the text into a list of words. This is called “tokenization” in NLP.

A good tokenizer will do a good job of recognizing the pieces in your sentence.

Each piece of punctuation will be separated, and each part of a multi-part word will be separated as appropriate.

spaCy does a lot of NLP stuff, and it has the best tokenizer, so the fast.ai library is designed to work well with the spaCy tokenizer, as it is with torchtext.
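To make the idea concrete, here is a minimal sketch of a tokenizer written with Python's `re` module. It is not how spaCy works internally: `simple_tokenize` is a hypothetical helper that only separates punctuation and splits off the contraction `n't`, which is enough to mimic the behaviour described above on simple sentences.

```python
import re

def simple_tokenize(text, lower=True):
    """Toy tokenizer: illustrative only, far cruder than spaCy's."""
    if lower:
        text = text.lower()
    # Split the contraction off its stem: "wasn't" -> "was n't"
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # A token is "n't", a run of word characters, or one punctuation mark
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokens = simple_tokenize("So, it wasn't bad!")
print(tokens)
```

The `lower` flag plays the same role as `lower=True` in the torchtext field defined below.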

In [12]:
import spacy
spacy_tok = spacy.load('en')

In [13]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they see

In [14]:
# Text Pre Processing

TEXT = data.Field(lower=True, tokenize="spacy")

In [15]:
# Now we create the usual Fast.ai model data object:

# bptt: how many words are processed at a time in each row of the mini-batch.
# Making it higher increases memory requirements, but also the model's ability to handle long sentences.
bs=64; bptt=70

We create a model data object for language modelling. As we do not have a separate test set, we reuse the validation set as the test set.

In [16]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)

In [17]:
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

`PATH` : as usual, where the data is, where to save models, etc.

`TEXT` : torchtext’s Field definition.

`**FILES` : a dictionary of all the files we have: training, validation, and test (to keep things simple, we do not have a separate validation and test set, so both point to the validation folder).

`bs` : batch size.

`bptt` : backprop through time. It sets how long a sequence we will stick on the GPU at once.

`min_freq=10` : in a moment, we are going to replace words with integers (a unique index for every word). Any word that occurs fewer than 10 times is simply treated as unknown.

After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which unique words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id.
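The following sketch shows, under simplifying assumptions, what building such a vocabulary amounts to. `build_vocab` and `numericalize` are hypothetical helpers, not torchtext functions; the special tokens `<unk>` and `<pad>` and the `min_freq` cutoff mirror what the notebook uses, while the tie-breaking order is an assumption made only to keep the example deterministic.

```python
from collections import Counter

def build_vocab(tokens, min_freq=2, specials=("<unk>", "<pad>")):
    counts = Counter(tokens)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically: an assumption for this sketch)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept                    # int -> string
    stoi = {tok: i for i, tok in enumerate(itos)}   # string -> int
    return itos, stoi

def numericalize(tokens, stoi):
    # Anything not in the vocab maps to the id of "<unk>"
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]

tokens = "the movie was good , the plot was thin".split()
itos, stoi = build_vocab(tokens, min_freq=2)
ids = numericalize(tokens, stoi)
print(itos)
print(ids)
```

Only `the` and `was` occur twice here, so every other word collapses to id 0, which is the same thing that happens to rare words under `min_freq=10`.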

In [18]:
# Save the TEXT field (with its vocab) for later

In [19]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

In [20]:
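# Here are the:

# number of batches
print(len(md.trn_dl))

# number of unique tokens in the vocab
print(md.nt)

# number of items in the training dataset (just one: the whole corpus joined together)
print(len(md.trn_ds))

# number of tokens in that single training item
print(len(md.trn_ds[0].text))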

4583
37392
1
20540756


In [21]:
# 'itos': 'int-to-string' 
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

In [22]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2

In [23]:
# In a LanguageModelData object there is only one item in each dataset: all the words of the text joined together.

In [24]:
md.trn_ds[0].text[:12]

['at',
 'first',
 ',',
 'i',
 'thought',
 'this',
 'was',
 'a',
 'sequel',
 'to',
 'entre',
 'nous']

In [25]:
# torchtext will handle turning words into ints

In [26]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
    40
   102
     3
    12
   213
    13
    19
     6
   701
     8
 36172
     0
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our LanguageModelData object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our bptt parameter - backprop through time).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

In [27]:
next(iter(md.trn_dl))

(Variable containing:
     40     20     11  ...      20     11   2519
    102      6     16  ...    9324     27      4
      3   8852     31  ...      20      2      8
         ...            ⋱           ...         
    101     76     27  ...    3859     20      0
      7  13402    108  ...      68     22     18
     13     18   2026  ...      23     74  13003
 [torch.cuda.LongTensor of size 72x64 (GPU 0)], Variable containing:
    102
      6
     16
   ⋮   
      3
    125
   2439
 [torch.cuda.LongTensor of size 4608 (GPU 0)])
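The layout of such a batch can be sketched in plain Python. This is a simplified illustration, not fastai's loader: `batchify` and `get_batch` are hypothetical helpers, and token ids are replaced by the integers 0 to 19 so the shift between inputs and labels is easy to see.

```python
def batchify(ids, bs):
    # Trim the stream so it divides evenly, then cut it into bs columns
    n = len(ids) // bs
    cols = [ids[i * n:(i + 1) * n] for i in range(bs)]
    # rows[t][b] is the t-th token of column b
    return [[col[t] for col in cols] for t in range(n)]

def get_batch(rows, start, seq_len):
    # Leave room for the one-token shift at the end of the data
    seq_len = min(seq_len, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    # Labels: the same positions one token later, flattened row by row
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

rows = batchify(list(range(20)), bs=4)   # 5 rows x 4 columns
x, y = get_batch(rows, start=0, seq_len=3)
print(x)
print(y)
```

Note that the first entries of `y` equal the second row of `x`, just as the labels above (102, 6, 16, ...) repeat the second row of the input tensor.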

A neat trick torchtext does is to randomly change the bptt number every time so each epoch it is getting slightly different bits of text — similar to shuffling images in computer vision. We cannot randomly shuffle the words because they need to be in the right order, so instead, we randomly move their breakpoints a little bit.
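A rough sketch of how the sequence length could be jittered is shown below. The exact rule used by the library is not shown in this notebook; the 95% / half-length split, the standard deviation of 5, and the minimum of 5 are assumptions borrowed from the AWD-LSTM training recipe, and `sample_seq_len` is a hypothetical name.

```python
import random

def sample_seq_len(bptt, rng, std=5, min_len=5):
    # Usually centre on bptt; occasionally (about 5% of the time) on half of it
    base = bptt if rng.random() < 0.95 else bptt / 2
    # Draw the actual length from a normal distribution around that centre
    return max(min_len, int(rng.gauss(base, std)))

rng = random.Random(0)
lengths = [sample_seq_len(70, rng) for _ in range(1000)]
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

With `bptt=70` most sampled lengths land close to 70, which would be consistent with the 72-row batch shown above.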

Now that we have a model data object that can feed us batches, we can create a model. First, we are going to create an embedding matrix.

# Train

In [28]:

em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

The embedding size is 200 which is much bigger than our previous embedding vectors. Not surprising because a word has a lot more nuance to it than the concept of Sunday. Generally, an embedding size for a word will be somewhere between 50 and 600.
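As a small illustration, an embedding matrix is just a table with one row per token id, and "embedding" a token means picking out its row. The sketch below uses tiny sizes and hypothetical helper names; in this notebook the real matrix would have 37392 rows (the vocab size printed earlier) and 200 columns.

```python
import random

def make_embedding(vocab_size, em_sz, rng):
    # One row of em_sz small random numbers per token id
    return [[rng.uniform(-0.1, 0.1) for _ in range(em_sz)]
            for _ in range(vocab_size)]

def embed(ids, emb):
    # Looking up an id is just selecting that row of the matrix
    return [emb[i] for i in ids]

rng = random.Random(0)
emb = make_embedding(vocab_size=12, em_sz=4, rng=rng)
vectors = embed([2, 0, 2], emb)
print(len(vectors), len(vectors[0]))
```

The same id always maps to the same vector, and during training those rows are updated like any other weights.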

Researchers have found that large amounts of momentum (which we’ll learn about later) don’t work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9. Any time you are doing NLP, you should probably include this line:

In [29]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
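To see where the first value of `betas` enters, here is a minimal single-parameter sketch of the standard Adam update. It is written from the published algorithm rather than taken from PyTorch's source, and `adam_step` is a hypothetical helper; the first beta controls how much of the previous gradient average (the momentum) is carried over.

```python
import math

def adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    b1, b2 = betas
    state["t"] += 1
    # Running average of gradients: b1 is the momentum coefficient
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    # Running average of squared gradients
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    # Bias correction for the zero-initialised averages
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
p = adam_step(0.0, grad=1.0, state=state, lr=3e-3, betas=(0.7, 0.99))
print(p)
```

With `betas=(0.7, 0.99)` each new gradient contributes 30% of the running average instead of 10%, so the optimizer forgets old gradients more quickly.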

Fast.ai uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through Dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below — you just have to experiment…

However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning.

In [30]:
learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=0.05,
                       dropout=0.05, wdrop=0.1, dropoute=0.02, 
                       dropouth=0.05)


# to avoid over fitting
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

learner.clip=0.3 : when you look at your gradients and multiply them by the learning rate to decide how much to update your weights by, clipping will not allow the gradients to be more than 0.3. This is a cool little trick to prevent us from taking too big a step.
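One common form of clipping, assumed here for illustration, rescales the whole gradient vector whenever its norm exceeds the threshold. The notebook does not show which form the library applies, and `clip_by_norm` is a hypothetical helper.

```python
import math

def clip_by_norm(grads, max_norm):
    # Overall size (L2 norm) of the gradient vector
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        # Shrink every component by the same factor, keeping the direction
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

clipped = clip_by_norm([3.0, 4.0], max_norm=0.3)
small = clip_by_norm([0.1, 0.1], max_norm=0.3)
print(clipped, small)
```

Gradients that are already small pass through unchanged; only unusually large ones are scaled down.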

In [30]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)



epoch      trn_loss   val_loss                                 
    0      4.842432   4.716591  
    1      4.650561   4.520028                                 
    2      4.525492   4.437697                                 
    3      4.601942   4.465507                                 
    4      4.509101   4.392377                                 
    5      4.44412    4.337345                                 
    6      4.408764   4.31964                                  
    7      4.545642   4.410581                                 
    8      4.499626   4.382818                                 
    9      4.478222   4.353855                                 
    10     4.43455    4.323576                                 
    11     4.390431   4.293987                                 
    12     4.341019   4.268963                                 
    13     4.344233   4.256771                                 
    14     4.32065    4.253503                                 



[array([4.2535])]
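The 15 epochs follow from the cycle arguments passed to `fit`. The sketch below assumes that the second positional argument is the number of cycles, that each cycle is `cycle_mult` times longer than the previous one, and that the learning rate is cosine-annealed within a cycle (SGDR); the helper names are hypothetical.

```python
import math

def cycle_lengths(n_cycle, cycle_len, cycle_mult=1):
    # Each cycle is cycle_mult times longer than the one before it
    return [cycle_len * cycle_mult ** i for i in range(n_cycle)]

def cosine_lr(lr_max, progress):
    # progress runs from 0.0 (start of a cycle) to 1.0 (end of a cycle)
    return lr_max / 2 * (1 + math.cos(math.pi * progress))

lengths = cycle_lengths(n_cycle=4, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)
print(lengths, total_epochs)
print(cosine_lr(3e-3, 0.0), cosine_lr(3e-3, 0.5), cosine_lr(3e-3, 1.0))
```

Under these assumptions the learning rate jumps back to its maximum at the start of epochs 1, 3, and 7, which would be consistent with the loss rising briefly at epochs 3 and 7 in the table above.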

In [31]:
learner.save_encoder('adam1_enc')


In [None]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=10, 
            cycle_save_name='adam3_10')


In [None]:
learner.save_encoder('adam3_10_enc')


In [None]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, 
            cycle_save_name='adam3_20')


In [None]:
learner.load_cycle('adam3_20',0)

In the sentiment analysis section, we'll just need half of the language model - the encoder, so we save that part.

learner.save_encoder('adam3_20_enc')

learner.load_encoder('adam3_20_enc')

In [35]:
#learner.save_encoder('adam1_enc')
learner.load_encoder('adam1_enc')

# Testing

To test the language model, we create a short bit of text to ‘prime’ a set of predictions.

We then apply the torchtext field to numericalize it so we can feed it to our language model.

In [36]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [TEXT.preprocess(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

". so , it was n't quite was i was expecting , but i really liked it anyway ! the best"

The steps below test the language model manually:

In [37]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

Here are the top 10 predictions for the next word after our short text:

In [38]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['part',
 'thing',
 'parts',
 'scene',
 'way',
 'of',
 'aspect',
 'scenes',
 'moment',
 'line']

Let's see if our model can generate more text all by itself

In [40]:
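print(ss,"\n")
for i in range(50):
    # Take the two most likely next tokens
    n=res[-1].topk(2)[1]
    # If the top choice is <unk> (index 0), fall back to the second choice
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    # Feed the chosen token back in to predict the one after it
    res,*_ = m(n[0].unsqueeze(0))
print('...')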

. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 

part of the movie . the movie is a bit of a mess , but it 's not a bad movie . it 's a very good movie , and i would recommend it to anyone who likes a good laugh . <eos> i have seen this movie several times ...


# Sentiment Classification

Fine-tune a pre-trained language model to do sentiment classification.

To use a pre-trained model, we will need to use the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

In [41]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))
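A toy example shows why the saved vocab matters. If a vocabulary were rebuilt from a different corpus, the same word could receive a different id, and the pre-trained embedding rows would no longer line up with the words. `make_stoi` is a hypothetical helper and the two tiny corpora are made up for illustration.

```python
from collections import Counter

def make_stoi(tokens):
    counts = Counter(tokens)
    # Ids are assigned by frequency, so they depend on the corpus
    itos = ["<unk>", "<pad>"] + sorted(counts, key=lambda t: (-counts[t], t))
    return {t: i for i, t in enumerate(itos)}

lm_stoi = make_stoi("the film was great and the cast was great".split())
clf_stoi = make_stoi("great film".split())
print(lm_stoi["great"], clf_stoi["great"])
print(lm_stoi["film"], clf_stoi["film"])
```

Reloading the pickled field avoids this mismatch, because the classifier then uses exactly the mapping the language model was trained with.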

`sequential=False` tells torchtext that a text field should not be tokenized (in this case, we just want to store the 'positive' or 'negative' single label).

In [43]:
IMDB_LABEL = data.Field(sequential=False)

`splits` is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at `lang_model-arxiv.ipynb` to see how to define fastai/torchtext datasets.

In [45]:
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

In [46]:
t = splits[0].examples[0]

In [47]:
t.label, ' '.join(t.text[:16])

('pos',
 "fantastic documentary of 1924 . this early 20th century geography of today 's iraq was powerful")

fastai can create a ModelData object directly from torchtext splits.

In [48]:
md2 = TextData.from_splits(PATH, splits, bs)

We call get_model to get our learner, and then load the pre-trained language model's encoder into it (load_encoder).

In [50]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam1_enc')

Because we’re fine-tuning a pretrained model, we’ll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

In [52]:
m3.clip=25.
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

In [53]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)


epoch      trn_loss   val_loss   accuracy                    
    0      0.656721   0.362944   0.851096  




epoch      trn_loss   val_loss   accuracy                    
    0      0.454559   0.275274   0.889759  



[array([0.27527]), 0.8897590984970833]

In [55]:
m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')


epoch      trn_loss   val_loss   accuracy                    
    0      0.404522   0.26587    0.896754  
    1      0.379996   0.269141   0.898528                    
    2      0.355631   0.264314   0.899814                    
    3      0.345975   0.267545   0.901531                    
    4      0.353508   0.233287   0.909041                    
    5      0.331418   0.25991    0.904934                    
    6      0.310227   0.244432   0.913912                    
    7      0.306935   0.263987   0.906014                    
    8      0.296198   0.285577   0.90715                     
    9      0.294248   0.259355   0.910861                    
    10     0.300742   0.280246   0.907542                    
    11     0.288196   0.252279   0.915359                    
    12     0.280466   0.30317    0.906701                    
    13     0.26416    0.268972   0.914784                    



[array([0.26897]), 0.9147843049661689]

We make sure that everything except the last layer is frozen. Then we train a bit, unfreeze, and train a bit more. The nice thing is that once you have a pre-trained language model, it actually trains really fast.
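The combination of freezing and differential learning rates can be sketched with a toy update rule. This uses plain gradient descent rather than Adam, one made-up weight per layer group, and a hypothetical `sgd_step` helper; how the real model is divided into five groups is not shown in this notebook.

```python
def sgd_step(groups, grads, lrs, trainable):
    # groups and grads hold one list of floats per layer group
    new_groups = []
    for params, g, lr, train in zip(groups, grads, lrs, trainable):
        if train:
            # Each group moves at its own learning rate
            params = [p - lr * gi for p, gi in zip(params, g)]
        new_groups.append(params)
    return new_groups

lrs = [1e-4, 1e-4, 1e-4, 1e-3, 1e-2]
groups = [[1.0] for _ in lrs]
grads = [[1.0] for _ in lrs]

# Like freeze_to(-1): only the last group is updated
frozen = [False] * (len(lrs) - 1) + [True]
step1 = sgd_step(groups, grads, [lr / 2 for lr in lrs], frozen)

# Like unfreeze(): every group is updated, early groups more gently
step2 = sgd_step(step1, grads, lrs, [True] * len(lrs))
print(step1)
print(step2)
```

The early groups, which hold what the language model already learned, barely move, while the newly added final group is free to change much faster.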

In [56]:
m3.load_cycle('imdb2', 4)

In [57]:
accuracy_np(*m3.predict_with_targs())

0.91344