In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

## Language modeling

### Data

#### Description of Data

In [2]:
PATH='/home/wk/myProjects/data/Enron/sign/tag/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

[0m[01;34mtest[0m/  [01;34mtrain[0m/


Let's look inside the training folder...

In [3]:
TRN

'/home/wk/myProjects/data/Enron/sign/tag/train/all/'

In [4]:
trn_files = !ls {TRN}
#trn_files = !dir /w {TRN}
trn_files[7:17]

['allen_p_deleted_items_145.txt',
 'allen_p_deleted_items_146.txt',
 'allen_p_deleted_items_149.txt',
 'allen_p_deleted_items_157.txt',
 'allen_p_deleted_items_165.txt',
 'allen_p_deleted_items_166.txt',
 'allen_p_deleted_items_173.txt',
 'allen_p_deleted_items_188.txt',
 'allen_p_deleted_items_193.txt',
 'allen_p_deleted_items_199.txt']

In [5]:
trn_files[6]

'allen_p_deleted_items_12.txt'

...and at an example email.

In [6]:
review = !cat {TRN}{trn_files[6]}
#review = !type {TRN}{trn_files[9]}
review[0]

''

The first line of this file is empty, so there is nothing to read yet.

Now we'll check how many words are in the dataset.

In [7]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

7959586


In [8]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

1228878


Before we can analyze text, we must first *tokenize* it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of *tokens*).

*Note:* If you get an error like:

    Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
    
then you need to install the Spacy language model by running this command on the command-line:

    $ python -m spacy download en

In [9]:
# conda install -c spacy spacy 
# python -m spacy download en
import spacy
spacy_tok = spacy.load('en')

In [10]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

''
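To make the idea of tokenization concrete, here is a minimal sketch using only the standard library. It is not spacy's algorithm; `whitespace_tokenize` and `regex_tokenize` are hypothetical helpers written for illustration, and the sample sentence is taken from the start of the training text shown later in this notebook.

```python
import re

def whitespace_tokenize(text):
    # Lowercase, then split on runs of whitespace; punctuation stays attached.
    return text.lower().split()

def regex_tokenize(text):
    # Illustrative alternative: words (keeping inner hyphens/apostrophes)
    # and single punctuation marks become separate tokens.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text.lower())

sample = "Please do not reply to this e-mail."
print(whitespace_tokenize(sample))
print(regex_tokenize(sample))
```

The first function leaves the full stop glued to `e-mail.`, while the second splits it off as its own token. Which behaviour you want depends on how much you care about punctuation as a signal.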

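We use PyTorch's [torchtext](https://github.com/pytorch/text) library to preprocess our data. It can be told to use the wonderful [spacy](https://spacy.io/) library to handle tokenization.

First, we create a torchtext *field*, which describes how to preprocess a piece of text. In this case, we tell torchtext to make everything lowercase. The `tokenize="spacy"` option is commented out in the cell below, so the field falls back to torchtext's default tokenizer, which simply splits on whitespace (this is why tokens such as `e-mail.` keep their punctuation later on).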

In [11]:
TEXT = data.Field(lower=True)
#tokenize="spacy"

In [12]:
TEXT

<torchtext.data.field.Field at 0x7efde556a4e0>

fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of `LanguageModelData`, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use `VAL_PATH` for that too.

As well as the usual `bs` (batch size) parameter, we also now have `bptt`; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.
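As a rough sketch of what `bs` and `bptt` control, the token stream can be laid out as `bs` parallel columns and then cut into chunks of `bptt` rows. This uses plain Python lists and a fixed chunk length; it is not fastai's loader (which varies the length slightly), and `batchify` and `bptt_chunks` are illustrative names.

```python
def batchify(ids, bs):
    # Trim the stream so it divides evenly, then lay it out as bs columns:
    # column j holds the j-th contiguous slice of the text.
    rows = len(ids) // bs
    cols = [ids[j * rows:(j + 1) * rows] for j in range(bs)]
    # Transpose into a list of rows, each with bs entries.
    return [list(r) for r in zip(*cols)]

def bptt_chunks(grid, bptt):
    # Walk down the rows in steps of bptt; the last row is held back so that
    # every chunk still has a "next token" available as its label.
    for start in range(0, len(grid) - 1, bptt):
        yield grid[start:start + bptt]

ids = list(range(20))           # stand-in for a numericalized corpus
grid = batchify(ids, bs=4)      # 5 rows x 4 columns
chunks = list(bptt_chunks(grid, bptt=2))
print(grid[0])                  # first row: one token from each column
print(len(chunks))
```

Each chunk is what the model backpropagates through in one step, which is why a larger `bptt` costs more time and memory.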

In [19]:
bs=32; bptt=500

In [20]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=200)

After building our `ModelData` object, it automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab`. This is a *vocabulary*, which stores which words (or *tokens*) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

*(Technical note: Python's standard `pickle` library can't handle this correctly, so at the top of this notebook we used the `dill` library instead and imported it as `pickle`)*.

In [21]:
f'{PATH}models/TEXT.pkl'

'/home/wk/myProjects/data/Enron/sign/tag/models/TEXT.pkl'

In [22]:
# Use a context manager so the file is closed after writing
with open(f'{PATH}models/TEXT.pkl','wb') as f: pickle.dump(TEXT, f)

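Here are the number of batches, the number of unique tokens in the vocab, the number of items in the training dataset (just one: the whole corpus joined together), and the number of tokens in that item: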

In [23]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(497, 4139, 1, 7974911)
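These numbers are consistent with the following arithmetic. Treat it as an assumption about how the loader counts batches, not something this notebook states: the stream is trimmed to a multiple of `bs`, split into `bs` columns, and one `bptt`-sized step is held back for the labels.

```python
n_tokens, bs, bptt = 7974911, 32, 500

# Rows per column after trimming the stream to a multiple of bs
rows = n_tokens // bs

# Number of bptt-sized steps, minus one held back (assumed)
n_batches = rows // bptt - 1

print(rows, n_batches)
```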

This is the start of the mapping from integer IDs to unique tokens.

In [24]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>',
 '<pad>',
 'the',
 'to',
 'and',
 'of',
 'a',
 'in',
 'for',
 '@@othr_ws@@',
 'is',
 'on']

In [25]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2
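For intuition, here is a minimal sketch of how such a vocabulary could be built with a `min_freq` cut-off. It uses only the standard library and is not torchtext's implementation; `build_vocab` and `numericalize` are hypothetical helper names, and the special tokens simply mirror the first two entries shown above.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    # Count tokens and drop those seen fewer than min_freq times.
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically for determinism.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens missing from the vocab map to the id of "<unk>".
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]

tokens = "the cat sat on the mat the end".split()
itos, stoi = build_vocab(tokens, min_freq=2)
print(itos)
print(numericalize(["the", "cat"], stoi))
```

With `min_freq=2` only `the` survives, and `cat` falls back to `<unk>`. The same effect explains why a high cut-off such as `min_freq=200` yields a small vocabulary and many `<unk>` ids.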

Note that in a `LanguageModelData` object there is only one item in each dataset: all the words of the text joined together.

In [26]:
md.trn_ds[0].text[:12]

['please',
 'do',
 'not',
 'reply',
 'to',
 'this',
 'e-mail.',
 'you',
 'are',
 'receiving',
 'this',
 'message']

torchtext will handle turning these words into integer IDs for us automatically.

In [27]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
   46
   88
   34
  573
    3
   17
 2288
   16
   26
  717
   17
  218
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our `LanguageModelData` object will create batches with 32 columns (that's our batch size), and varying sequence lengths of around 500 tokens (that's our `bptt` parameter - *backprop through time*).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

In [28]:
next(iter(md.trn_dl))

(Variable containing:
    46     0     5  ...    145     0    17
    88     2     2  ...    142   380  1779
    34  2278     0  ...   1911   107    17
        ...          ⋱          ...       
     0   401     7  ...   3200     0   990
     0   585  2649  ...    510   239    68
    41     5   203  ...     40   223     0
 [torch.cuda.LongTensor of size 505x32 (GPU 0)], Variable containing:
    88
     2
     2
   ⋮  
  2528
     0
    28
 [torch.cuda.LongTensor of size 16160 (GPU 0)])
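The relationship between the two tensors above can be sketched with plain lists: the labels are the inputs shifted down by one row, then flattened row by row. The toy `grid` reuses the top-left corner of the batch printed above; `inputs_and_targets` is an illustrative name, not a fastai function.

```python
def inputs_and_targets(grid, start, seq_len):
    # Inputs: seq_len rows starting at `start`.
    x = grid[start:start + seq_len]
    # Targets: the same positions one row (one token) later,
    # flattened row by row into a 1d list.
    y_rows = grid[start + 1:start + 1 + seq_len]
    y = [tok for row in y_rows for tok in row]
    return x, y

grid = [[46, 0, 5],
        [88, 2, 2],
        [34, 2278, 0]]
x, y = inputs_and_targets(grid, start=0, seq_len=2)
print(x)
print(y)
```

The first three labels are `88, 2, 2`, which is exactly the second row of the inputs, matching the printed output.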

### Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems.

In [29]:
em_sz = 300  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of *momentum* (which we'll learn about later) don't work well with these kinds of *RNN* models, so we create a version of the *Adam* optimizer with less momentum than its default of `0.9`.

In [30]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
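To get a feel for what lowering the first beta does, recall that Adam keeps an exponential moving average of past gradients. The sketch below shows how much weight that average gives to recent gradients. It ignores Adam's bias correction and second-moment scaling, and `ema_weights` is a hypothetical helper.

```python
def ema_weights(beta1, steps):
    # Weight the moving average gives to the gradient from k steps ago,
    # for k = 0 (most recent) up to steps - 1.
    return [(1 - beta1) * beta1 ** k for k in range(steps)]

for beta1 in (0.9, 0.7):
    recent = sum(ema_weights(beta1, 3))
    print(f"beta1={beta1}: share of the 3 most recent gradients = {recent:.3f}")
```

With `0.7` about two thirds of the average comes from the last three gradients, against roughly a quarter with `0.9`, so the optimizer reacts faster and carries less momentum.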

fastai uses a variant of the state-of-the-art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout). There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (`alpha`, `beta`, and `clip`) shouldn't generally need tuning.

In [31]:
#learner = md.get_model(opt_fn, em_sz, nh, nl,
#               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.07, dropout=0.07, wdrop=0.15, dropoute=0.025, dropouth=0.07)

learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3
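As a sketch of what a setting like `clip=0.3` typically means, here is gradient clipping by global norm in plain Python. That `learner.clip` works this way is an assumption on my part, and `clip_by_global_norm` is an illustrative helper rather than a fastai or PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Euclidean norm over all gradient components together.
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)          # already small enough: leave unchanged
    scale = max_norm / total        # shrink every component by the same factor
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=0.3)
print(clipped)
```

The direction of the gradient is preserved and only its length is capped, which helps keep RNN training stable when occasional gradients explode.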

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn't yet overfitting), but I didn't have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used `lr_find` to find a good learning rate, but didn't save the output in this notebook. Feel free to try running it yourself now.)

In [32]:
learner.fit(6e-3, 2, wds=1e-6, cycle_len=2, cycle_mult=3)


epoch      trn_loss   val_loss                              
    0      5.770652   5.731839  
    1      5.753467   5.717794                              
    2      4.635405   4.350101                              
    3      3.774546   3.625579                              
    4      3.468385   3.328551                              
    5      3.311843   3.170504                              
    6      3.208021   3.092442                              
    7      3.182492   3.070893                              



[array([3.07089])]
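The run above lasts 8 epochs (rows 0 to 7 in the table) even though we asked for 2 cycles. Assuming the restart schedule makes each cycle `cycle_mult` times longer than the previous one, the arithmetic can be sketched as follows (`total_epochs` is a hypothetical helper):

```python
def total_epochs(n_cycles, cycle_len, cycle_mult=1):
    # Cycle i lasts cycle_len * cycle_mult**i epochs.
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycles))

print(total_epochs(2, cycle_len=2, cycle_mult=3))   # cycles of 2 and 6 epochs
print(total_epochs(1, cycle_len=4))                 # a single 4-epoch cycle
```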

In [33]:
learner.save_encoder('a_adam1_enc')

In [34]:
learner.load_encoder('a_adam1_enc')

In [35]:
learner.fit(1e-3, 1, wds=1e-6, cycle_len=4)


epoch      trn_loss   val_loss                              
    0      3.169754   3.04839   
    1      3.1173     2.994132                              
    2      3.091632   2.969568                              
    3      3.068434   2.964501                              



[array([2.9645])]

In [36]:
#learner.save("rnn_enron")


In [37]:
#learner.load("rnn_enron")

In the sentiment analysis section, we'll just need half of the language model - the *encoder*, so we save that part.

In [38]:
learner.save_encoder('a_adam3_10_enc')

In [39]:
learner.load_encoder('a_adam3_10_enc')

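Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used. Note that the value `4.165` in the next cell does not correspond to any loss logged above; the final validation loss of this run was about 2.9645.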

In [40]:
math.exp(4.165)

64.3926824434624
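As a small sketch, the same formula can be applied to the final validation loss logged earlier in this notebook (2.9645), which gives a perplexity of about 19.4. `perplexity` is just a hypothetical wrapper around `math.exp`.

```python
import math

def perplexity(loss):
    # Perplexity is the exponential of the (natural-log) cross-entropy loss.
    return math.exp(loss)

final_val_loss = 2.9645   # last val_loss printed by the second fit above
print(round(perplexity(final_val_loss), 2))
```

A lower perplexity means the model is, on average, less "surprised" by the next token.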

In [41]:
# Use a context manager so the file is closed after writing
with open(f'{PATH}models/TEXT.pkl','wb') as f: pickle.dump(TEXT, f)

### Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [42]:
m=learner.model

ss=r"""
It makes one wonder whether one can be in any position in these deals and not 
be in a lease.  Maybe we should ask Herman to describe one.  I was under the 
assumption that if the PPA is not plant specific, there is less of a concern 
on who the EPC contractor is.  
"""
s = [TEXT.preprocess(ss)] 
t=TEXT.numericalize(s)
' '.join(s[0])

'it makes one wonder whether one can be in any position in these deals and not be in a lease. maybe we should ask herman to describe one. i was under the assumption that if the ppa is not plant specific, there is less of a concern on who the epc contractor is.'

We haven't yet added methods to make it easy to test a language model, so we'll need to manually go through the steps.

In [43]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

Let's see what the top 10 predictions were for the next word after our short text:

In [44]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['<unk>', 'is', 'and', 'the', 'of', 'in', 'has', 'that', 'to', 'will']

...and let's see if our model can generate a bit more text all by itself!

In [45]:
print(ss,"\n")
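for i in range(200):
    # Take the two highest-scoring candidates for the next token
    n=res[-1].topk(2)[1]
    # Skip '<unk>' (id 0): fall back to the second-best candidate
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    # Feed the chosen token back in to get the next prediction
    res,*_ = m(n[0].unsqueeze(0))
print('...')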


It makes one wonder whether one can be in any position in these deals and not 
be in a lease.  Maybe we should ask Herman to describe one.  I was under the 
assumption that if the PPA is not plant specific, there is less of a concern 
on who the EPC contractor is.  
 

is the enron employee and should not be able to view the enron employee by clicking @@othr_ws@@ @@othr_ph@@ x @@othr_ph@@ if you are not a current or former employee and can't attend the meeting, or if you are located in london, and the new york times on the web and other events you can contact us at @@othr_em@@ or by phone at @@othr_ph@@ . <eos> the following items were sent to you as an enron employee and request that your membership rewards be used to be made by the internet is not included. all users must be directed to the resolution center at @@othr_ph@@ ets customers should direct inquiries to the ets solution center at @@othr_ph@@ we appreciate your participation in this email. <eos> dear power outage database c
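The generation loop above is greedy: at each step it takes the best-scoring token, unless that token is `<unk>`, in which case it takes the runner-up. The same logic can be sketched without a GPU or a trained model. Here `toy_scores` is a made-up lookup table standing in for the network, and `top_k`, `pick_next`, and `generate` are illustrative names.

```python
def top_k(scores, k):
    # Indices of the k largest scores, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def pick_next(scores, unk_id=0):
    best, second = top_k(scores, 2)
    # Never emit <unk>: use the second-best candidate instead.
    return second if best == unk_id else best

itos = ["<unk>", "the", "enron", "employee"]
# Made-up stand-in for the model: row i scores the token that follows token i.
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.0, 0.4, 0.1],    # after "the", <unk> scores highest and is skipped
    [0.3, 0.1, 0.0, 0.6],
    [0.2, 0.7, 0.05, 0.05],
]

def generate(start_id, n):
    out, cur = [], start_id
    for _ in range(n):
        cur = pick_next(toy_scores[cur])   # feed the chosen token back in
        out.append(itos[cur])
    return out

print(" ".join(generate(1, 4)))
```

Greedy decoding like this tends to fall into repetitive loops, which is visible both in the toy output and in the generated email text above. Sampling from the predicted distribution is a common alternative.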

In [46]:
??res[-1].topk

#### Subject Analysis

In [47]:
??m