# Sentiment classification and Language Modeling
* We will leverage the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment)
* Dataset contains 50K reviews from IMDB rated as positive or negative (neutral reviews not included in data)
* Test and training sets have 25K reviews each
* Code and approach are adapted from Jeremy Howard's fast.ai

### We will start with building a language model (predicts next word)
* We will use this as a pre-trained model for the next task of predicting sentiment
* While predicting sentiment our goal will be to determine positive vs negative (a binary classification task)
* We achieve a high accuracy of **>93%**

### For the language model we will leverage a variant of the [AWD LSTM language model](https://arxiv.org/abs/1708.02182) which provides regularization features through Dropout

In [1]:
cd ../fastai

/home/paperspace/fastai


In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

In [3]:
cd ~

/home/paperspace


In [4]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


In [21]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

### Example review

In [22]:
review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

### Determine the number of words in the train and test datasets

In [7]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581


In [8]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

5686719


### Tokenization
* Tokenization is the process of splitting a sentence into an array of words (or more generally, into an array of *tokens*)
* We will leverage the [spaCy](https://spacy.io) tokenizer

In [23]:
spacy_tok = spacy.load('en')

In [24]:
# how it looks after tokenization
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they see

### Leverage [torchtext](https://github.com/pytorch/text) library to preprocess our data
* Need to create a torchtext *field*, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

In [6]:
TEXT = data.Field(lower=True, tokenize="spacy")

### fastai works closely with torchtext and we can create a ModelData object for language modeling
* `bptt` defines how many words are processed at a time in each row of the mini-batch, which is also the number of (unrolled) layers we backprop through
* The longer the bptt, the better the model's ability to handle long sentences
* `min_freq=10` means that tokens which occur fewer than 10 times are treated as 'unknown'

In [7]:
bs=64; bptt=70

In [8]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

### Vocabulary

* Once we build our model data object, `TEXT.vocab` is created, which stores the tokens and their integer mapping
* It can be saved for future use with dill (pickle is not supported by spaCy)
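
To illustrate what a vocabulary with a `min_freq` cutoff does, here is a minimal sketch in plain Python. It is not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, the toy sentence is made up, and the tie-breaking order is an assumption chosen only to make the result deterministic.

```python
from collections import Counter

def build_vocab(tokens, min_freq):
    """Return (itos, stoi), keeping only tokens seen at least min_freq times."""
    counts = Counter(tokens)
    itos = ['<unk>', '<pad>']  # special tokens come first
    # Most frequent first; ties broken alphabetically (an assumption for this sketch).
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to integers; anything not in the vocab becomes <unk> (index 0)."""
    return [stoi.get(tok, 0) for tok in tokens]

tokens = "the film was good , the film was long , the end".split()
itos, stoi = build_vocab(tokens, min_freq=2)
ids = numericalize(['the', 'film', 'was', 'nous'], stoi)
print(itos)
print(ids)
```

Tokens below the cutoff ('good', 'long', 'end') and unseen tokens ('nous') all map to index 0.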


In [17]:
pickle.dump(TEXT, open('language/models/TEXT.pkl','wb'))

### How this works
* Imagine you have 64M tokens in the entire training text. All reviews are concatenated into one long stream for the purposes of language modeling
* If the batch size (bs) is 64, the stream is split into 64 groups of 1M tokens each. Order is maintained within each group, since context is important
* With bptt=70, the first batch contains the first 70 tokens from each of the 64 groups
* Each batch also contains the same data as labels, shifted one word later in the text, since we are always trying to predict the next word. The labels are flattened into a 1d array
* The data loader varies the bptt slightly from batch to batch (the first batch printed further down has 69 rows rather than 70). This is a proxy for shuffling, which we cannot do directly because word order matters
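
The layout described in this list can be sketched in plain Python. This is a minimal illustration, not the actual fastai or torchtext loader: `batchify` and `get_batch` are hypothetical helper names, and a toy stream of 24 integer tokens stands in for the real corpus.

```python
def batchify(tokens, bs):
    """Split the stream into bs contiguous groups and lay them out as columns."""
    n = len(tokens) // bs  # tokens per group; any leftover tokens are dropped
    columns = [tokens[c * n:(c + 1) * n] for c in range(bs)]
    # rows[r][c] is the r-th token of group c
    return [[col[r] for col in columns] for r in range(n)]

def get_batch(rows, i, bptt):
    """Inputs: bptt rows starting at i. Labels: the same rows shifted by one, flattened."""
    seq_len = min(bptt, len(rows) - 1 - i)
    x = rows[i:i + seq_len]
    y = [tok for row in rows[i + 1:i + 1 + seq_len] for tok in row]
    return x, y

stream = list(range(24))        # toy "corpus" of 24 tokens
rows = batchify(stream, bs=4)   # 6 rows x 4 columns
x, y = get_batch(rows, 0, bptt=3)
print(x)
print(y)
```

Each column of `x` is a run of consecutive tokens from one group, and `y` holds the token that follows each input position.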

* `len(md.trn_dl)` is the number of batches: roughly the tokens per group divided by bptt (1M/70 in the example above)
* `md.nt` is the number of unique tokens in the vocab
* `len(md.trn_ds)` is the number of sentences in the dataset, which is 1 because all of the reviews are concatenated
* `len(md.trn_ds[0].text)` is the total number of tokens (not just unique ones) in the data

In [18]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4583, 37392, 1, 20540756)
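
As a rough sanity check on these numbers, the batch count can be estimated from the token count, batch size, and bptt. The arithmetic below is only an approximation; the loader's exact count may differ by a batch or so depending on how it handles the end of the stream.

```python
total_tokens = 20540756   # len(md.trn_ds[0].text) from the output above
bs, bptt = 64, 70

tokens_per_group = total_tokens // bs       # length of each of the 64 groups
approx_batches = tokens_per_group // bptt   # full bptt-sized chunks per group

print(tokens_per_group, approx_batches)
```

This gives about 4,584 batches, close to the 4,583 reported by `len(md.trn_dl)`.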

In [21]:
type(md.trn_ds[0].text)

list

In [22]:
# 'itos': 'int-to-string', ordered by freq
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

In [23]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2

In [24]:
# first 12 tokens 
md.trn_ds[0].text[:12]

['at',
 'first',
 ',',
 'i',
 'thought',
 'this',
 'was',
 'a',
 'sequel',
 'to',
 'entre',
 'nous']

In [30]:
# Torchtext will automatically turn words into unique integers which is stored in TEXT.vocab
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
    40
   102
     3
    12
   213
    13
    19
     6
   701
     8
 36172
     0
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

In [31]:
next(iter(md.trn_dl))

(Variable containing:
     40     20     11  ...      20     11   2519
    102      6     16  ...    9324     27      4
      3   8852     31  ...      20      2      8
         ...            ⋱           ...         
    588     53    113  ...      31   1404      2
     30    228    234  ...       6     84    491
      4   8453     68  ...     906      4      7
 [torch.cuda.LongTensor of size 69x64 (GPU 0)], Variable containing:
    102
      6
     16
   ⋮   
   3859
     20
      0
 [torch.cuda.LongTensor of size 4416 (GPU 0)])

### TRAINING

* The embedding matrix holds one vector of size 200 for each unique token, so its shape is (number of unique tokens) x 200
* Embedding sizes for text are usually between 50 and 600. They tend to be larger than the embeddings used for structured data, because more dimensions are needed to accurately capture the complexity of human language

In [9]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers
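
For a sense of scale, the size of the embedding matrix follows directly from the vocabulary size reported by `md.nt` earlier and the embedding size chosen here. This is plain arithmetic, not a call into the library.

```python
n_tokens = 37392   # md.nt, the number of unique tokens in the vocab
em_sz = 200        # size of each embedding vector

embedding_params = n_tokens * em_sz
print(embedding_params)
```

That is roughly 7.5 million weights in the embedding layer alone.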

#### Researchers have found that large amounts of *momentum* (which we'll learn about later) don't work well with these kinds of *RNN* models, so we create a version of the *Adam* optimizer with less momentum than its default of `0.9`.

In [10]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [11]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [96]:
learner.summary

<bound method Learner.summary of SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(37392, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(37392, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=200, out_features=37392)
    (dropout): LockedDropout(
    )
  )
)>

#### Optimal learning rate of .003 found with lr_find 

In [35]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                                
    0      4.927479   4.80254   
    1      4.685541   4.554786                                
    2      4.570946   4.469886                                
    3      4.620267   4.485871                                
    4      4.527002   4.409289                                
    5      4.474805   4.356156                                
    6      4.430837   4.33783                                 
    7      4.553546   4.427456                                
    8      4.515164   4.397457                                
    9      4.486022   4.368983                                
    10     4.436023   4.333825                                
    11     4.405849   4.306923                                
    12     4.378236   4.284108                                
    13     4.353143   4.27009                                 
    14     4.342908   4.266654                                



[array([ 4.26665])]
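
With `cycle_len=1` and `cycle_mult=2`, each restart cycle is twice as long as the previous one, which is why four cycles produce the 15 epochs shown above. The sketch below works this out and shows a cosine-annealed learning rate within a cycle, as in SGDR. `cycle_lengths`, `cycle_starts`, and `cosine_lr` are hypothetical helpers written for illustration; the exact per-iteration schedule inside fastai may differ.

```python
import math

def cycle_lengths(n_cycles, cycle_len, cycle_mult=1):
    """Number of epochs in each restart cycle."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

def cycle_starts(lengths):
    """Epoch index at which each cycle begins."""
    starts, epoch = [], 0
    for length in lengths:
        starts.append(epoch)
        epoch += length
    return starts

def cosine_lr(lr_max, progress):
    """Learning rate at a fraction `progress` (0 to 1) of the way through a cycle."""
    return lr_max * (1 + math.cos(math.pi * progress)) / 2

lengths = cycle_lengths(4, cycle_len=1, cycle_mult=2)
starts = cycle_starts(lengths)
print(lengths, sum(lengths), starts)
print(cosine_lr(3e-3, 0.0), cosine_lr(3e-3, 0.5))
```

Cycles begin at epochs 0, 1, 3, and 7, which is consistent with the small jumps in training loss at epochs 3 and 7 in the table above: the learning rate resets to its maximum at each restart.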

#### The encoder is essentially tasked with building a mathematical representation of the language by learning to predict the next word. It can be saved and reused for transfer learning

#### The decoder can be thrown away

In [36]:
learner.save_encoder('adam1_enc')

In [15]:
learner.load_encoder('adam1_enc')

In [16]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

epoch      trn_loss   val_loss                                
    0      4.495713   4.379872  
    1      4.48735    4.365586                                
    2      4.465033   4.347426                                
    3      4.438891   4.324449                                
    4      4.407966   4.302997                                
    5      4.368289   4.279553                                
    6      4.341055   4.25881                                 
    7      4.331617   4.245235                                
    8      4.297399   4.236853                                
    9      4.302721   4.235112                                



[array([ 4.23511])]

In [17]:
learner.save_encoder('adam3_10_enc')

In [12]:
learner.load_encoder('adam3_10_enc')

#### Perplexity is the inverse of the probability of the test set (as assigned by the language model), normalized by the number of word tokens in the test set.

#### Minimizing perplexity = maximizing probability!
* Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used.
* Language model probabilities are very small, so multiplying them together often leads to underflow. We therefore work with log probabilities (the loss) and apply `exp()` at the end to get perplexity
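
To make the relationship concrete, the sketch below computes perplexity two ways for a made-up list of per-token probabilities: directly from the definition (inverse probability of the sequence, normalized by the number of tokens), and as `exp()` of the mean negative log-likelihood. The probabilities are invented for illustration only.

```python
import math

# Hypothetical probabilities a model assigned to each of 4 actual next tokens.
probs = [0.5, 0.25, 0.125, 0.125]
n = len(probs)

# Route 1: the definition. Multiply the probabilities, then take the -1/n power.
joint = 1.0
for p in probs:
    joint *= p
ppl_direct = joint ** (-1 / n)

# Route 2: average the per-token negative log-likelihoods (the loss), then exp().
mean_nll = sum(-math.log(p) for p in probs) / n
ppl_from_loss = math.exp(mean_nll)

print(ppl_direct, ppl_from_loss)
```

Both routes give the same value, but on a long text the product in route 1 would underflow to zero while the sum of logs in route 2 stays well behaved.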

In [19]:
math.exp(4.235112)

69.06941407470735

In [20]:
pickle.dump(TEXT, open('language/models/TEXT.pkl','wb'))

### Testing our Language Model

* Let's create a short bit of text to 'prime' a set of predictions. 
* We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [81]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join([sent.string.strip() for sent in s[0]])

". So , it was n't quite was I was expecting , but I really liked it anyway ! The best"

In [82]:
print(type(s[0])); print(type(s))

<class 'spacy.tokens.doc.Doc'>
<class 'list'>


In [83]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

* Let's see what the top 10 predictions were for the next word
* `res` has shape 21 (the number of tokens in our sentence) x 37392 (the number of unique tokens) and holds a prediction score for every token at every position
* `res[-1]` picks out the last row, since we want the next word after the end of the sentence
* topk gives us the top n predictions and indices
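
As a toy illustration of the top-k step, the sketch below picks the highest-scoring entries from a single made-up row of scores and maps them back to words. `topk_indices`, the six-word vocabulary, and the scores are all hypothetical; the notebook itself uses `torch.topk` on the model output.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

itos = ['<unk>', '<pad>', 'the', 'movie', 'was', 'great']   # toy vocab
last_row = [2.1, -5.0, 0.3, 1.7, 0.9, 1.2]                  # toy scores for the next token

top = topk_indices(last_row, 3)
words = [itos[i] for i in top]

# If <unk> (index 0) ranks first, fall back to the runner-up.
best = top[1] if top[0] == 0 else top[0]
print(words, itos[best])
```

The fallback on `<unk>` mirrors the check used in the text generation loop later in this notebook.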

In [84]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['<unk>', ')', '"', '.', ',', '/><br', '<eos>', 'and', '!', '(']

* let's see if our model can generate a bit more text all by itself!

In [85]:
print(ss,"\n")
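for i in range(50):
    # Take the two most likely next tokens from the last time step
    n=res[-1].topk(2)[1]
    # Skip <unk> (index 0): use the second choice if <unk> ranks first
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    # Feed the chosen token back in to predict the one after it
    res,*_ = m(n[0].unsqueeze(0))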
print('...')

. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 

) . the film is a bit of a mess , but it is a good film . <eos> i saw this movie at the toronto film festival and i was very disappointed . i was n't expecting much , but i was wrong . the acting was terrible , ...


### Predicting Sentiment

* load up the saved vocab

In [86]:
TEXT = pickle.load(open('language/models/TEXT.pkl','rb'))

* `sequential=False` tells torchtext that the field should not be tokenized (in this case, we just want to store the single 'positive' or 'negative' label).

* `splits` is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that

In [87]:
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

downloading aclImdb_v1.tar.gz


In [88]:
t = splits[0].examples[0]

In [89]:
t.label, ' '.join(t.text[:16])

('pos',
 "fantastic documentary of 1924 . this early 20th century geography of today 's iraq was powerful")

* fastai can create a ModelData object directly from torchtext splits.

In [93]:
md2 = TextData.from_splits(PATH, splits, bs)

In [94]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam3_10_enc')

In [95]:
m3.summary

<bound method Learner.summary of SequentialRNN(
  (0): MultiBatchRNN(
    (encoder): Embedding(37392, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(37392, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): ModuleList(
      (0): LinearBlock(
        (lin): Linear(in_features=600, out_features=3)
        (drop): Dropout(p=0.1)
        (bn): BatchNorm1d(600, eps=1e-05, momentum=0.1, affine=True)
      )
    )
  )
)>

* Because we're fine-tuning a pretrained model, we'll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.
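
The idea behind differential learning rates is that each group of layers gets its own rate, with the earliest (most general) layers changing the least. The sketch below pairs the five rates used in the next cell with five layer groups. The group names are hypothetical labels chosen for illustration, and the helper assumes that `freeze_to(-1)` leaves only the last group trainable.

```python
import math

# Hypothetical names for the five layer groups, earliest layers first.
layer_groups = ['embedding', 'lstm_1', 'lstm_2', 'lstm_3', 'classifier_head']
lrs = [1e-4, 1e-4, 1e-4, 1e-3, 1e-2]

def scaled(rates, factor):
    """Scale every group's learning rate by the same factor."""
    return [lr * factor for lr in rates]

def trainable_after_freeze_to(groups, n):
    """Groups that remain trainable when everything before index n is frozen (assumed semantics)."""
    return groups[n:]

for name, lr in zip(layer_groups, lrs):
    print(name, lr)

first_stage_lrs = scaled(lrs, 0.5)                                   # like lrs/2
first_stage_groups = trainable_after_freeze_to(layer_groups, -1)     # like freeze_to(-1)
print(first_stage_groups, first_stage_lrs)
```

Under these assumptions, the first stage trains only the new classifier head at half the listed rates, and the later stages unfreeze everything and use the full rates.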

In [98]:
m3.clip=25.
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

In [99]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])

epoch      trn_loss   val_loss   accuracy                    
    0      0.421861   0.264352   0.890739  



[array([ 0.26435]), 0.89073854366225536]

In [100]:
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

epoch      trn_loss   val_loss   accuracy                    
    0      0.380894   0.263654   0.890831  



[array([ 0.26365]), 0.89083108331426153]

In [101]:
m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')

epoch      trn_loss   val_loss   accuracy                    
    0      0.35561    0.22741    0.911225  
    1      0.331502   0.243099   0.901143                    
    2      0.330561   0.222233   0.912783                    
    3      0.317785   0.227395   0.910048                    
    4      0.303724   0.203919   0.919699                    
    5      0.294606   0.203136   0.919817                    
    6      0.302173   0.206323   0.917787                    
    7      0.266207   0.199648   0.92279                     
    8      0.276916   0.197874   0.922243                    
    9      0.279578   0.200999   0.921772                    
    10     0.274344   0.189317   0.929168                    
    11     0.24929    0.190016   0.928052                    
    12     0.286644   0.194475   0.92438                     
    13     0.257254   0.188184   0.928792                    



[array([ 0.18818]), 0.92879192783453901]

In [105]:
m3.load_cycle('imdb2', 5)

In [106]:
accuracy_np(*m3.predict_with_targs())

0.93091999999999997