In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy



## Language modeling

### Data

We have used the top 15 books by Arthur Conan Doyle available on [Project Gutenberg](http://www.gutenberg.org/ebooks/author/69). 

In [2]:
PATH='/home/paperspace/data/arthur/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

[0m[01;34mmodels[0m/  [01;34mtest[0m/  [01;34mtmp[0m/  [01;34mtrain[0m/


Let's look inside the training folder...

In [3]:
trn_files = !ls {TRN}
trn_files[:5]

['108-0.txt', '244-0.txt', '2852-0.txt', '3289-0.txt', '834-0.txt']

...and at the beginning of one of the books.

In [4]:
f'{TRN}{trn_files[1]}'

'/home/paperspace/data/arthur/train/all/244-0.txt'

In [5]:
text = !cat {TRN}{trn_files[1]}
text = ' '.join(text)
text[2500:4000]

'cers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.  The campaign brought honours and promotion to many, but for me it had nothing but misfortune and disaster. I was removed from my brigade and attached to the Berkshires, with whom I served at the fatal battle of Maiwand. There I was struck on the shoulder by a Jezail bullet, which shattered the bone and grazed the subclavian artery. I should have fallen into the hands of the murderous Ghazis had it not been for the devotion and courage shown by Murray, my orderly, who threw me across a pack-horse, and succeeded in bringing me safely to the British lines.  Worn with pain, and weak from the prolonged hardships which I had undergone, I was removed, with a great train of wounded sufferers, to the base hospital at Peshawar. Here I rallied, and had already improved so far as to be able to walk about the wards, and even to bask a litt

Total number of words in our training data set:

In [6]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

911319


Total number of words in our validation data set:

In [7]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

32232


Before we can analyze text, we must first *tokenize* it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of *tokens*). We will be using the [spacy tokenizer](https://spacy.io/).
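As a rough illustration of what tokenization does, here is a minimal sketch that splits a sentence with a regular expression. It is not spaCy's algorithm, and `simple_tokenize` is a hypothetical helper written only for this example. It does show the basic idea of separating words from punctuation.

```python
import re

# Hypothetical helper: a crude regex tokenizer, NOT what spaCy does internally.
# \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
def simple_tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize(
    "I was struck on the shoulder by a Jezail bullet, which shattered the bone."
)
print(tokens)
```

A real tokenizer such as spaCy's also handles contractions, abbreviations and other language-specific cases that this sketch ignores.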

In [8]:
spacy_tok = spacy.load('en')

In [9]:
text[2500:4000]

'cers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.  The campaign brought honours and promotion to many, but for me it had nothing but misfortune and disaster. I was removed from my brigade and attached to the Berkshires, with whom I served at the fatal battle of Maiwand. There I was struck on the shoulder by a Jezail bullet, which shattered the bone and grazed the subclavian artery. I should have fallen into the hands of the murderous Ghazis had it not been for the devotion and courage shown by Murray, my orderly, who threw me across a pack-horse, and succeeded in bringing me safely to the British lines.  Worn with pain, and weak from the prolonged hardships which I had undergone, I was removed, with a great train of wounded sufferers, to the base hospital at Peshawar. Here I rallied, and had already improved so far as to be able to walk about the wards, and even to bask a litt

The tokenized version of the excerpt...

In [10]:
' '.join([tok.string.strip() for tok in spacy_tok(text[2500:4000])])  # iterating over a spaCy doc yields tokens

'cers who were in the same situation as myself , and succeeded in reaching Candahar in safety , where I found my regiment , and at once entered upon my new duties .  The campaign brought honours and promotion to many , but for me it had nothing but misfortune and disaster . I was removed from my brigade and attached to the Berkshires , with whom I served at the fatal battle of Maiwand . There I was struck on the shoulder by a Jezail bullet , which shattered the bone and grazed the subclavian artery . I should have fallen into the hands of the murderous Ghazis had it not been for the devotion and courage shown by Murray , my orderly , who threw me across a pack - horse , and succeeded in bringing me safely to the British lines .  Worn with pain , and weak from the prolonged hardships which I had undergone , I was removed , with a great train of wounded sufferers , to the base hospital at Peshawar . Here I rallied , and had already improved so far as to be able to walk about the wards , 

We use PyTorch's [torchtext](https://github.com/pytorch/text) library to preprocess our data, telling it to use the wonderful [spaCy](https://spacy.io/) library to handle tokenization.

First, we create a torchtext *field*, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

In [11]:
%time TEXT = data.Field(lower=True, tokenize="spacy")

CPU times: user 3.14 s, sys: 364 ms, total: 3.5 s
Wall time: 567 ms


fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of `LanguageModelData`, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use `VAL_PATH` for that too.

As well as the usual `bs` (batch size) parameter, we now also have `bptt`; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.

In [12]:
bs=64; bptt=70

In [13]:
%time md = LanguageModelData.from_text_files(PATH, field = TEXT, train=TRN_PATH, validation=VAL_PATH, \
                                             test=VAL_PATH, bs=bs, bptt=bptt, min_freq=10)

CPU times: user 11.8 s, sys: 684 ms, total: 12.5 s
Wall time: 11.5 s


Building our `ModelData` object automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab`. This is a *vocabulary*, which stores which words (or *tokens*) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

*(Technical note: Python's standard `pickle` library can't handle this correctly, so at the top of this notebook we imported the `dill` library instead, under the name `pickle`.)*

In [14]:
%%time
with open(f'{PATH}models/TEXT.pkl', 'wb') as f:  # 'with' closes the file after writing
    pickle.dump(TEXT, f)

CPU times: user 1.72 s, sys: 248 ms, total: 1.96 s
Wall time: 2.16 s


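Here are the number of batches, the number of unique tokens in the vocab, the number of items in the training dataset (a single item holding the whole corpus), and the number of tokens in that item.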

In [15]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(246, 5922, 1, 1107486)

This is the start of the mapping from integer IDs to unique tokens.

In [16]:
# 'itos': 'int-to-string', sorted by freq except first two 
TEXT.vocab.itos[:12]

['<unk>', '<pad>', ',', 'the', '.', 'and', 'of', 'to', 'a', 'i', 'in', 'that']

In [17]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

3

In [18]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['first']

143
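The `itos` and `stoi` mappings shown above can be sketched in plain Python. This is a simplified illustration, not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the alphabetical tie-breaking is an assumption made for this example. It shows the two special tokens coming first, the remaining tokens sorted by frequency, and anything rarer than `min_freq` mapping to `<unk>`.

```python
from collections import Counter

# Hypothetical helpers for illustration; torchtext's real Vocab class differs in detail.
def build_vocab(tokens, min_freq=1):
    counts = Counter(tokens)
    # Most frequent tokens first; ties broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = ['<unk>', '<pad>'] + [tok for tok, c in ordered if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens missing from the vocab fall back to the '<unk>' id.
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

toy = "the dog saw the cat and the dog ran".split()
itos, stoi = build_vocab(toy, min_freq=2)
ids = numericalize(toy, stoi)
print(itos)
print(ids)
```

With `min_freq=2`, only `the` and `dog` survive in this toy corpus, so every other word is replaced by the id of `<unk>`.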

Note that in a `LanguageModelData` object there is only one item in each dataset: all the words of the text joined together.

In [19]:
md.trn_ds[0].text[140:150]

['deduction',
 'sherlock',
 'holmes',
 'took',
 'his',
 'bottle',
 'from',
 'the',
 'corner',
 'of']

torchtext will handle turning these words into integer IDs for us automatically.

In [20]:
# convert the 10 words shown above to integer IDs using torchtext, as an example
TEXT.numericalize([md.trn_ds[0].text[140:150]])

Variable containing:
 2815
  269
   71
  210
   19
 2050
   37
    3
  509
    6
[torch.cuda.LongTensor of size 10x1 (GPU 0)]

Our `LanguageModelData` object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our `bptt` parameter - *backprop through time*); the example batch below happens to have 80 rows.

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

To understand the concept of embeddings, watch the [lesson 4 video](https://course.fast.ai/lessons/lesson4.html) from 1:48:19.

In [21]:
next(iter(md.trn_dl))

(Variable containing:
     0  2134   225  ...     23     4     3
   120    25  1013  ...     54    12     0
   118     0    99  ...   1772    62    30
        ...          ⋱          ...       
   790  4242     6  ...      8   169    12
   188   380    35  ...    653    11    80
  5434     4  3034  ...   1121     3   302
 [torch.cuda.LongTensor of size 80x64 (GPU 0)], Variable containing:
   120
    25
  1013
   ⋮  
     2
   497
    15
 [torch.cuda.LongTensor of size 5120 (GPU 0)])
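The batch layout above can be reproduced in miniature with NumPy. This is a sketch under stated assumptions, not fastai's loader: `batchify` and `get_batch` are hypothetical helpers, the sequence length is fixed rather than randomly varied, and the token stream is just the integers 0 to 19. It shows the input as a `(seq_len, bs)` matrix and the labels as the same positions shifted one token later, flattened to 1d.

```python
import numpy as np

# Hypothetical helpers for illustration only.
def batchify(ids, bs):
    # Drop the tail so the stream divides evenly, then lay it out as bs columns,
    # each column holding one contiguous slice of the text.
    n = len(ids) // bs
    return np.array(ids[:n * bs]).reshape(bs, n).T   # shape (n, bs)

def get_batch(stream, start, seq_len):
    # Leave room for the label, which is one row further down.
    seq_len = min(seq_len, len(stream) - 1 - start)
    x = stream[start:start + seq_len]                      # inputs, (seq_len, bs)
    y = stream[start + 1:start + 1 + seq_len].reshape(-1)  # next tokens, flattened
    return x, y

stream = batchify(list(range(20)), bs=4)   # 5 rows, 4 columns
x, y = get_batch(stream, start=0, seq_len=3)
print(x)
print(y)
```

As in the real batch above, the first `bs` labels equal the second row of the input.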

### Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems.

In [22]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of *momentum* (which we'll learn about later) don't work well with these kinds of *RNN* models, so we create a version of the *Adam* optimizer with less momentum than its default of `0.9`.

In [23]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
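To get a feel for what the first `betas` value controls, the sketch below tracks an exponential moving average of a gradient that jumps from 0 to 1. It is a simplification: it leaves out Adam's bias correction and its second-moment estimate, and `steps_to_track` is a hypothetical helper. It only illustrates that a smaller decay rate lets the running average react to a change sooner.

```python
# Hypothetical helper: counts updates until the running average reaches `target`.
def steps_to_track(beta, target=0.9):
    avg, steps = 0.0, 0
    while avg < target:
        # Exponential moving average update with a constant gradient of 1.0
        avg = beta * avg + (1 - beta) * 1.0
        steps += 1
    return steps

print(steps_to_track(0.9), steps_to_track(0.7))
```

In this toy setting the average with `0.7` catches up in roughly a third of the steps needed with `0.9`.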

fastai uses a variant of the state-of-the-art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout). There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (`alpha`, `beta`, and `clip`) shouldn't generally need tuning.

In [24]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn't yet overfitting), but I didn't have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used `lr_find` to find a good learning rate, but didn't save the output in this notebook. Feel free to try running it yourself now.)

In [25]:
learner.fit(3e-3, 2, wds=1e-6, cycle_len=1, cycle_mult=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=3, style=ProgressStyle(description_width='initial…

epoch      trn_loss   val_loss                              
    0      5.37474    5.223078  
    1      4.732445   4.560068                              
    2      4.513744   4.437626                              



[array([4.43763])]

In [26]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=2, style=ProgressStyle(description_width='initial…

epoch      trn_loss   val_loss                              
    0      4.357313   4.213681  
    1      4.184102   4.110161                              



[array([4.11016])]

In [27]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

HBox(children=(IntProgress(value=0, description='Epoch', max=15, style=ProgressStyle(description_width='initia…

epoch      trn_loss   val_loss                              
    0      4.176712   4.092642  
    1      3.984548   3.85947                               
    2      3.812103   3.74814                               
    3      3.700207   3.685311                              
    4      3.617175   3.648627                              
    5      3.531819   3.626706                              
    6      3.457802   3.618157                              
    7      3.41637    3.595262                              
    8      3.330064   3.611231                              
    9      3.311842   3.587546                              
    10     3.258124   3.608652                              
    11     3.238087   3.600522                              
    12     3.204517   3.602803                              
    13     3.212457   3.602623                              
    14     3.244695   3.601785                              



[array([3.60178])]

Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used.

In [28]:
math.exp(3.601785)

36.663620631971746
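The relationship between loss and perplexity can be checked on a toy case. The sketch assumes the loss is the mean cross-entropy in nats, and `perplexity` is a hypothetical helper that takes the probability the model assigned to each correct token. A model that spreads its probability evenly over 4 choices ends up with a perplexity of 4.

```python
import math

# Hypothetical helper: perplexity from the probabilities given to the true tokens.
def perplexity(probs_of_true_tokens):
    # Mean negative log-likelihood (cross-entropy in nats), then exponentiate.
    nll = -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)
    return math.exp(nll)

uniform = perplexity([0.25] * 8)   # model is always choosing among 4 equally likely tokens
print(uniform)
```

Read this way, a perplexity of about 36.7 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 37 tokens.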

### Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [29]:
def proc_str(s): return TEXT.preprocess(TEXT.tokenize(s))
def num_str(s): return TEXT.numericalize([proc_str(s)])
m=learner.model
s="""Sherlock Holmes"""

In [30]:
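def sample_model(m, s, l=50):
    t = num_str(s)                # numericalize the priming text
    m[0].bs=1                     # predict one sequence at a time
    m.eval()                      # evaluation mode: turn off dropout
    m.reset()                     # clear the RNN's hidden state
    res,*_ = m(t)
    print('...', end='')

    for i in range(l):
        n=res[-1].topk(2)[1]      # indices of the two most likely next tokens
        n = n[1] if n.data[0]==0 else n[0]   # skip '<unk>' (index 0) if it ranks first
        word = TEXT.vocab.itos[n.data[0]]
        print(word, end=' ')
        if word=='<eos>': break
        res,*_ = m(n[0].unsqueeze(0))   # feed the chosen token back into the model

    m[0].bs=bs                    # restore the training batch size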
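The token-selection step inside `sample_model` can be looked at in isolation. The sketch below uses NumPy rather than the model's tensors, and `pick_token` and `UNK_ID` are hypothetical names. It takes the highest-scoring token unless that token is `<unk>` (index 0 in our vocab), in which case it falls back to the runner-up.

```python
import numpy as np

UNK_ID = 0  # index of '<unk>' in the vocabulary, as in TEXT.vocab.itos

# Hypothetical helper mirroring the top-2 selection in sample_model.
def pick_token(scores):
    top2 = np.argsort(scores)[::-1][:2]   # indices of the two highest scores
    return int(top2[1] if top2[0] == UNK_ID else top2[0])

print(pick_token(np.array([0.90, 0.05, 0.03, 0.02])))  # '<unk>' ranks first, so use the runner-up
print(pick_token(np.array([0.10, 0.20, 0.60, 0.05])))  # best token is not '<unk>'
```

Because the choice is always greedy, the generated text tends to fall into repetitive phrases; sampling from the predicted distribution would be one alternative to try.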

Let's see what our model predicts for the following strings.

In [32]:
sample_model(m,"Sherlock Holmes")

..., who was the last man , and that he was a man of a most extraordinary type . he was a man of a most extraordinary type , and a man of a most dangerous and dangerous nature . he was a very tall , handsome , clean - 

In [33]:
sample_model(m,"As we made our way")

..., the whole train was open , and the door was open . i had been in the room , and i was able to see the scene of the man who had been so much as to see the house . i had a little time to see the 

Seems like the model works! Future work could build on this model and use it for sentiment analysis.