<a href="https://colab.research.google.com/github/dpressel/dlss-tutorial/blob/master/2_context_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part II: Contextualized embeddings


In this section, we are going to learn how to train an LSTM-based word-level language model.  Then we will load a pre-trained language model checkpoint and use everything below the output layers as the lower layers of our previously defined classification model.  We don't really need to change anything else; we just need to pass this whole network as the `embeddings` parameter to the model.

## LSTM Language Models

We are going to quickly build an LSTM language model so that we can see how the training works.  For both our objectives and our metrics, we are interested in the perplexity, which is the exponentiated cross-entropy loss.
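To make the relationship between cross-entropy and perplexity concrete, here is a minimal sketch in plain Python (no PyTorch needed). The function name `perplexity` and the toy probabilities are illustrative assumptions, not part of the tutorial code.

```python
import math

def perplexity(correct_token_probs):
    """Exponentiated average negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    cross_entropy = sum(nll) / len(nll)  # mean loss per token, in nats
    return math.exp(cross_entropy)

# A model that gives every correct token probability 1/4 is as uncertain
# as a uniform choice among 4 words.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 6))
```

Lower perplexity means the model spreads its probability mass over fewer plausible next words.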

In [1]:
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
!unzip wikitext-2-v1.zip

--2019-06-30 19:10:48--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.134.253
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.134.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4475746 (4.3M) [application/zip]
Saving to: ‘wikitext-2-v1.zip.5’


2019-06-30 19:10:49 (18.2 MB/s) - ‘wikitext-2-v1.zip.5’ saved [4475746/4475746]

Archive:  wikitext-2-v1.zip
replace wikitext-2/wiki.test.tokens? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

Our LSTM model will be a word-based model.  We will start from a randomly initialized embedding, put each timestep through our LSTM layers, and then project to the output vocabulary size.  At every step of training, we will detach our hidden states, which prevents backpropagation from reaching earlier batches, but we will still initialize the new batch from the old hidden state.  This is called "truncated backpropagation through time".  We will also create a function that resets the hidden state, which we will use at the start of each epoch to zero out the hidden states.
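The effect of detaching the hidden state can be seen on a toy problem. The sketch below is not an LSTM: it assumes a scalar recurrence `h_t = w * h_{t-1} + x_t` and tracks the gradient of the final state with respect to `w` by hand. The helper `run_chunk` is hypothetical and only meant to show that the forward value is carried across chunks while the gradient is not.

```python
def run_chunk(w, h0, xs):
    """Run h_t = w * h_{t-1} + x_t over one chunk.

    Returns the final state and d(final state)/dw, treating h0 as a
    constant. That mirrors calling .detach() on the carried hidden state.
    """
    h, dh_dw = h0, 0.0
    for x in xs:
        dh_dw = h + w * dh_dw  # product rule on w * h_{t-1}
        h = w * h + x
    return h, dh_dw

w, xs = 0.5, [1.0, 1.0, 1.0, 1.0]

# Full backprop: one chunk covering the whole sequence.
h_full, grad_full = run_chunk(w, 0.0, xs)

# Truncated backprop: two chunks of 2 steps; the state is carried over,
# but the gradient does not flow back across the chunk boundary.
h_mid, _ = run_chunk(w, 0.0, xs[:2])
h_trunc, grad_trunc = run_chunk(w, h_mid, xs[2:])

print(h_full, h_trunc)        # identical forward values
print(grad_full, grad_trunc)  # the truncated gradient ignores the first chunk
```

The forward pass is unchanged by truncation; only the gradient loses the contribution from steps before the boundary, which is the price paid for bounded memory and compute.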

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List, Tuple
import os
import io
import re
import codecs
import numpy as np
from collections import Counter
import math
import time

class LSTMLanguageModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_dim, dropout=0.5, layers=2):
        super().__init__()
        self.layers = layers
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.LSTM(embed_dim,
                                 hidden_dim,
                                 layers,
                                 dropout=dropout,
                                 bidirectional=False,
                                 batch_first=True)
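        # The LSTM emits hidden_dim features, and the tied projection below shares the
        # (vocab_size, embed_dim) embedding matrix, so this requires hidden_dim == embed_dim
        self.proj = nn.Linear(embed_dim, vocab_size)
        self.proj.bias.data.zero_()

        # Tie weights
        self.proj.weight = self.embed.weight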

    def forward(self, x, hidden):
        emb = self.embed(x)
        decoded, hidden = self.rnn(emb, hidden)
        return self.proj(decoded), hidden
        
    def init_hidden(self, batchsz):
        weight = next(self.parameters())
        # Zero-filled (h, c) states on the same device and dtype as the model weights
        return (weight.new_zeros(self.layers, batchsz, self.hidden_dim),
                weight.new_zeros(self.layers, batchsz, self.hidden_dim))


Our dataset reader will read in a sequence of words and vectorize them.  We would like this to be a long sequence of text (like maybe a book), and we will read this in contiguously.  Our task is to learn to predict the next word, so we will end up using this same sequence for both input and output, with the output shifted by one position.
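As a small illustration of using one sequence for both input and output, the sketch below flattens a few lines into a single token stream and pairs each token with its successor. The helper name `make_lm_pairs` and the two example sentences are assumptions made for this example; only the `<EOS>` marker comes from the reader.

```python
def make_lm_pairs(lines):
    """Flatten lines into one token stream and pair each token with its successor."""
    tokens = []
    for line in lines:
        tokens += line.split() + ['<EOS>']  # same end-of-line marker as the reader
    inputs = tokens[:-1]   # every token except the last
    targets = tokens[1:]   # the same stream shifted left by one
    return inputs, targets

inputs, targets = make_lm_pairs(["the cat sat", "it purred"])
for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```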

In [0]:



class WordDatasetReader(object):
    """Provide a base-class to do operations to read words to tensors
    """
    def __init__(self, nctx, vectorizer=None):
        self.nctx = nctx
        self.num_words = {}
        self.vectorizer = vectorizer if vectorizer else self._vectorizer

    def build_vocab(self, files, min_freq=0):
        x = Counter()

        for file in files:
            if file is None:
                continue
            self.num_words[file] = 0
            with codecs.open(file, encoding='utf-8', mode='r') as f:
                sentences = []
                for line in f:
                    split_sentence = line.split() + ['<EOS>']
                    self.num_words[file] += len(split_sentence)
                    sentences += split_sentence
                x.update(Counter(sentences))
        x = dict(filter(lambda cnt: cnt[1] >= min_freq, x.items()))
        alpha = list(x.keys())
        alpha.sort()
        self.vocab = {w: i+1 for i, w in enumerate(alpha)}
        self.vocab['[PAD]'] = 0
    
    def _vectorizer(self, words: List[str]) -> List[int]:
        return [self.vocab.get(w, 0) for w in words]

      
    def load_features(self, filename):

        with codecs.open(filename, encoding='utf-8', mode='r') as f:
            sentences = []
            for line in f:
                sentences += line.strip().split() + ['<EOS>']
            return torch.tensor(self.vectorizer(sentences), dtype=torch.long)

    def load(self, filename, batch_size):
        x_tensor = self.load_features(filename)
        rest = x_tensor.shape[0]//batch_size
        num_steps = rest // self.nctx
        # if num_examples is divisible by batchsz * nctx (equivalent to rest is divisible by nctx), we
        # have a problem. reduce rest in that case.

        if rest % self.nctx == 0:
            rest = rest-1
        trunc = batch_size * rest
        
        x_tensor = x_tensor.narrow(0, 0, trunc)
        x_tensor = x_tensor.view(batch_size, -1).contiguous()
        return x_tensor
     
    

This class will keep track of our running average as we go, so we don't have to remember to average things in our loops.

In [0]:
class Average(object):
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


We are going to train on batches of contiguous text. Our batches will have been pre-created by the loader.  Each batch will be `BxT` where `B` is the batch size we specified, and `T` is the number of backprop steps through time.
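The `BxT` layout can be sketched with plain lists. This is a simplified stand-in for the loader and training loop, not a copy of them: the helper names `batchify` and `windows` are hypothetical, and the step count here simply leaves one token of room for the shifted target.

```python
def batchify(token_ids, batch_size):
    """Split one long stream into `batch_size` contiguous rows (dropping the remainder)."""
    row_len = len(token_ids) // batch_size
    return [token_ids[r * row_len:(r + 1) * row_len] for r in range(batch_size)]

def windows(rows, nctx):
    """Yield (x, y) pairs of B x T windows, with y shifted one step ahead of x."""
    num_steps = (len(rows[0]) - 1) // nctx  # leave room for the shifted target
    for i in range(num_steps):
        x = [row[i * nctx:(i + 1) * nctx] for row in rows]
        y = [row[i * nctx + 1:(i + 1) * nctx + 1] for row in rows]
        yield x, y

rows = batchify(list(range(1, 21)), batch_size=2)  # 2 rows of 10 tokens
batches = list(windows(rows, nctx=3))
for x, y in batches:
    print(x, y)
```

Because each row is contiguous text, the hidden state at the end of one window is a sensible starting state for the same row in the next window.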

In [0]:
class SequenceCriterion(nn.Module):

    def __init__(self):
        super().__init__()
        self.crit = nn.CrossEntropyLoss(ignore_index=0, reduction='mean')
          
    def forward(self, inputs, targets):
        """Evaluate some loss over a sequence.

        :param inputs: torch.FloatTensor, [B, .., C] The scores from the model. Batch First
        :param targets: torch.LongTensor, The labels.

        :returns: torch.FloatTensor, The loss.
        """
        total_sz = targets.nelement()
        loss = self.crit(inputs.view(total_sz, -1), targets.view(total_sz))
        return loss

class LMTrainer:
  
    def __init__(self, optimizer: torch.optim.Optimizer, nctx):
        self.optimizer = optimizer
        self.nctx = nctx
    
    def run(self, model, train_data, loss_function, batch_size=20, clip=0.25):
        avg_loss = Average('average_train_loss')
        metrics = {}
        self.optimizer.zero_grad()
        start = time.time()
        model.train()
        hidden = model.init_hidden(batch_size)
        num_steps = train_data.shape[1]//self.nctx
        for i in range(num_steps):
            x = train_data[:,i*self.nctx:(i + 1) * self.nctx]
            y = train_data[:, i*self.nctx+1:(i + 1)*self.nctx + 1]
            labels = y.to('cuda:0').transpose(0, 1).contiguous()
            inputs = x.to('cuda:0')
            logits, (h, c) = model(inputs, hidden)
            hidden = (h.detach(), c.detach())
            logits = logits.transpose(0, 1).contiguous()
            loss = loss_function(logits, labels)
            loss.backward()

            avg_loss.update(loss.item())

            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            self.optimizer.step()
            self.optimizer.zero_grad()
            if (i + 1) % 100 == 0:
                print(avg_loss)

        # How much time elapsed in minutes
        elapsed = (time.time() - start)/60
        train_token_loss = avg_loss.avg
        train_token_ppl = math.exp(train_token_loss)
        metrics['train_elapsed_min'] = elapsed
        metrics['average_train_loss'] = train_token_loss
        metrics['train_ppl'] = train_token_ppl
        return metrics

class LMEvaluator:
    def __init__(self, nctx):
        self.nctx = nctx
        
    def run(self, model, valid_data, loss_function, batch_size=20):
        avg_valid_loss = Average('average_valid_loss')
        start = time.time()
        model.eval()
        hidden = model.init_hidden(batch_size)
        metrics = {}
        num_steps = valid_data.shape[1]//self.nctx
        for i in range(num_steps):

            with torch.no_grad():
                x = valid_data[:,i*self.nctx:(i + 1) * self.nctx]
                y = valid_data[:, i*self.nctx+1:(i + 1)*self.nctx + 1]
                labels = y.to('cuda:0').transpose(0, 1).contiguous()
                inputs = x.to('cuda:0')
                
                logits, hidden = model(inputs, hidden)
                logits = logits.transpose(0, 1).contiguous()
                loss = loss_function(logits, labels)
                avg_valid_loss.update(loss.item())

        valid_token_loss = avg_valid_loss.avg
        valid_token_ppl = math.exp(valid_token_loss)

        elapsed = (time.time() - start)/60
        metrics['valid_elapsed_min'] = elapsed

        metrics['average_valid_loss'] = valid_token_loss
        metrics['average_valid_word_ppl'] = valid_token_ppl
        return metrics
      
def fit_lm(model, optimizer, epochs, batch_size, nctx, train_data, valid_data):

    loss = SequenceCriterion()
    trainer = LMTrainer(optimizer, nctx)
    evaluator = LMEvaluator(nctx)
    best_acc = 0.0

    metrics = evaluator.run(model, valid_data, loss, batch_size)

    for epoch in range(epochs):

        print('EPOCH {}'.format(epoch + 1))
        print('=================================')
        print('Training Results')
        metrics = trainer.run(model, train_data, loss, batch_size)
        print(metrics)
        print('Validation Results')
        metrics = evaluator.run(model, valid_data, loss, batch_size)
        print(metrics)

Now we will train it on [Wikitext-2, Merity et al. 2016](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).  We will use 35 steps of backprop.

In [0]:
BASE = 'wikitext-2'
TRAIN = os.path.join(BASE, 'wiki.train.tokens')
VALID = os.path.join(BASE, 'wiki.valid.tokens')

batch_size = 20
nctx = 35
reader = WordDatasetReader(nctx)
reader.build_vocab((TRAIN,))

train_set = reader.load(TRAIN, batch_size)
valid_set = reader.load(VALID, batch_size)

Let's start with 1 epoch.

In [7]:

model = LSTMLanguageModel(len(reader.vocab), 512, 512)
model.to('cuda:0')

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(learnable_params, lr=0.001)
fit_lm(model, optimizer, 1, batch_size, nctx, train_set, valid_set)


Model has 21274623 parameters




EPOCH 1
Training Results
average_train_loss 7.130287 (7.630262)
average_train_loss 6.948112 (7.174242)
average_train_loss 6.429612 (6.957250)
average_train_loss 6.706582 (6.817735)
average_train_loss 6.480259 (6.716661)
average_train_loss 6.253604 (6.639940)
average_train_loss 6.250593 (6.584427)
average_train_loss 6.086081 (6.535446)
average_train_loss 6.046218 (6.491021)
average_train_loss 5.840661 (6.455732)
average_train_loss 6.127773 (6.425025)
average_train_loss 5.766460 (6.398616)
average_train_loss 6.137995 (6.376816)
average_train_loss 6.115303 (6.351095)
average_train_loss 6.203366 (6.333509)
average_train_loss 6.009459 (6.318195)
average_train_loss 6.126120 (6.297565)
average_train_loss 5.796104 (6.276726)
average_train_loss 5.737082 (6.260670)
average_train_loss 5.954897 (6.243683)
average_train_loss 5.674878 (6.226430)
average_train_loss 5.613625 (6.207307)
average_train_loss 5.878324 (6.189868)
average_train_loss 5.824322 (6.178013)
average_train_loss 5.932457 (6.164326)
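The parameter count printed above can be checked by hand. The sketch below assumes the standard PyTorch LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors) and a vocabulary of 33279 entries. That vocabulary size is inferred from the printed count, not something the notebook prints, and the function name is hypothetical.

```python
def lstm_lm_param_count(vocab_size, embed_dim, hidden_dim, layers=2):
    """Count trainable parameters of an embedding + LSTM + tied projection model."""
    embedding = vocab_size * embed_dim
    lstm = 0
    for layer in range(layers):
        in_dim = embed_dim if layer == 0 else hidden_dim
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        lstm += 4 * (in_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    # The projection weight is tied to the embedding, so only its bias is new
    proj_bias = vocab_size
    return embedding + lstm + proj_bias

total = lstm_lm_param_count(vocab_size=33279, embed_dim=512, hidden_dim=512)
print(total)
```

Most of the parameters sit in the embedding matrix, which is why tying it to the output projection saves so much.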


We can sample out of our language model using the code below.

In [8]:
def sample(model, index2word, start_word='the', maxlen=20):
  

    model.eval() 
    words = [start_word]
    x = torch.tensor(reader.vocab.get(start_word)).long().reshape(1, 1).to('cuda:0')
    hidden = model.init_hidden(1)

    with torch.no_grad():
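        for i in range(maxlen):
            output, hidden = model(x, hidden)
            # exp() of the scores gives unnormalized weights; torch.multinomial does not need them to sum to 1
            word_softmax = output.squeeze().exp().cpu()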
            selected = torch.multinomial(word_softmax, 1)[0]
            x.fill_(selected)
            word = index2word[selected.item()]
            words.append(word)
    words.append('...')
    return words

index2word = {i: w for w, i in reader.vocab.items()}
words = sample(model, index2word)
print(' '.join(words))


the latter story pass that would be in Park or Ireland . Like Liam Stuart illustrator , NC apologize and livestock ...
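The sampling step draws the next word with probability proportional to the exponentiated score. A plain-Python sketch of that single step follows; the helper `sample_index`, the toy scores, and the fixed seed are all assumptions for illustration, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def sample_index(logits, rng):
    """Draw one index with probability proportional to exp(logit)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]  # unnormalized softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return index, probs

rng = random.Random(0)  # fixed seed so the draws are repeatable
logits = [2.0, 1.0, 0.1]
draws = [sample_index(logits, rng)[0] for _ in range(1000)]
_, probs = sample_index(logits, rng)
counts = [draws.count(i) for i in range(len(logits))]
print(probs, counts)
```

Sampling rather than always taking the highest-scoring word is what gives the generated text its variety from run to run.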


Let's train a few more epochs and try again.

In [9]:
fit_lm(model, optimizer, 3, batch_size, 35, train_set, valid_set)



EPOCH 1
Training Results
average_train_loss 5.888901 (5.598958)
average_train_loss 5.818975 (5.584962)
average_train_loss 5.503317 (5.580568)
average_train_loss 5.877174 (5.584086)
average_train_loss 5.637257 (5.563949)
average_train_loss 5.447403 (5.540543)
average_train_loss 5.460862 (5.531339)
average_train_loss 5.488514 (5.525078)
average_train_loss 5.359737 (5.517013)
average_train_loss 5.121772 (5.509718)
average_train_loss 5.441720 (5.503375)
average_train_loss 5.280029 (5.499962)
average_train_loss 5.543726 (5.500008)
average_train_loss 5.556562 (5.494267)
average_train_loss 5.593565 (5.495319)
average_train_loss 5.347257 (5.496266)
average_train_loss 5.519910 (5.489749)
average_train_loss 5.264927 (5.483263)
average_train_loss 5.207999 (5.481013)
average_train_loss 5.434073 (5.476722)
average_train_loss 5.112748 (5.471222)
average_train_loss 5.142471 (5.463090)
average_train_loss 5.362827 (5.455768)
average_train_loss 5.287307 (5.454580)
average_train_loss 5.420770 (5.449693)


In [10]:
index2word = {i: w for w, i in reader.vocab.items()}
words = sample(model, index2word)
print(' '.join(words))

the Supreme Court did not introduce any contact with its way and Grosser Davies of the chance of the country . ...



## ELMo

For the rest of this section, we will focus on ELMo ([Peters et al 2018](https://export.arxiv.org/pdf/1802.05365)), a bidirectional language model with an embedding layer followed by 2 LSTM layers, each of which is run in both the forward and the backward direction.  The losses for the forward and reverse directions are combined into a single training objective.
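To see what the two directions are asked to predict, the sketch below lists the forward and backward targets for each position of a short sentence. The boundary markers `<S>` and `</S>` and both helper names are hypothetical choices for this example. The loss combination shown is a plain average, which differs from a sum only by a constant factor.

```python
def bilm_targets(tokens):
    """Return (token, next_token, previous_token) for each position."""
    bos, eos = '<S>', '</S>'  # hypothetical boundary markers
    padded = [bos] + tokens + [eos]
    forward = padded[2:]    # what the forward LM predicts at each position
    backward = padded[:-2]  # what the backward LM predicts at each position
    return list(zip(tokens, forward, backward))

def combined_loss(forward_loss, backward_loss):
    """Average of the two directional losses."""
    return 0.5 * (forward_loss + backward_loss)

for token, nxt, prev in bilm_targets(['the', 'cat', 'sat']):
    print(f"{token}: forward target={nxt}, backward target={prev}")
print(combined_loss(2.0, 4.0))
```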

In our example, we created a word-based LM with an embedding layer and 2 subsequent LSTM layers.  You might have been wondering what to do about words that we haven't seen yet -- and that's a valid concern!  Instead of using a word embedding layer like our example above, what if we had a model that used a character compositional approach, taking each character in a word and applying a pooling operation to yield a word representation?  This would mean that the model can handle words that it has never seen in the input before.

This is exactly what ELMo does -- it's based on the research of [Kim et al. 2015](https://arxiv.org/abs/1508.06615).
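The difference for unseen words shows up already at the lookup stage. In the sketch below, a word-level vocabulary collapses an unseen word to index 0, while a character-level encoding still produces a distinct, informative input. The byte-plus-one scheme and the `max_chars=8` limit are simplifying assumptions for this example, not ELMo's exact encoding.

```python
def word_ids(words, vocab):
    """Word-level lookup: every unseen word collapses to index 0."""
    return [vocab.get(w, 0) for w in words]

def char_ids(word, max_chars=8):
    """Character-level lookup: UTF-8 byte values plus 1, right-padded with 0."""
    ids = [b + 1 for b in word.encode('utf-8')][:max_chars]
    return ids + [0] * (max_chars - len(ids))

vocab = {'the': 1, 'cat': 2}
print(word_ids(['the', 'cats'], vocab))  # 'cats' is out of vocabulary
print(char_ids('cat'))
print(char_ids('cats'))                  # shares a prefix with 'cat'
```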

There is a nice [slide deck by the authors here](http://www.people.fas.harvard.edu/~yoonkim/data/char-nlm-slides.pdf), but the key high-level points are listed here:

### Kim Language Model

* **Goal**: predict the next word in the sentence (causal LM) but account for unseen words by using a character compositional approach that relies on the letters within the pre-segmented words.  This also has the important effect of drastically reducing the number of parameters required compared with word-level models.

* **Using**: LSTM layers that take in a word representation for each position.  Each word is fed in and used to predict the next word over a context.

* **The Twist**: use the embeddings approach from [dos Santos & Zadrozny 2014](http://proceedings.mlr.press/v32/santos14.pdf) to represent words, but add parallel filters as in [Kim 2014](https://www.aclweb.org/anthology/D14-1181).  Also, add highway layers on top of the base model.
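The character compositional step with parallel filters can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: each filter slides over the character embeddings and only its strongest response is kept (max-over-time pooling), while the nonlinearity and the highway layers are left out. The helper names and all numbers are made up for illustration.

```python
def conv_max_pool(char_vecs, kernel):
    """Slide one filter over the characters and keep its strongest response."""
    width = len(kernel)
    responses = []
    for start in range(len(char_vecs) - width + 1):
        window = char_vecs[start:start + width]
        responses.append(sum(k * c
                             for k_row, c_row in zip(kernel, window)
                             for k, c in zip(k_row, c_row)))
    return max(responses)

def word_representation(char_vecs, kernels):
    """Concatenate the max-pooled response of every filter (parallel filters)."""
    return [conv_max_pool(char_vecs, k) for k in kernels]

# 4 characters, each a 2-dimensional embedding (toy numbers).
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0]],              # width-1 filter
    [[1.0, 0.0], [0.0, 1.0]],  # width-2 filter
]
rep = word_representation(chars, kernels)
print(rep)
```

The output has one entry per filter regardless of word length, which is what lets words of any length feed a fixed-size LSTM input.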



### ELMo Language Model

* **Goal**: predict the next word in the sentence (causal LM) on the forward sequence **and** predict the previous word in the sentence conditioned on the following context.

* **Using**: LSTM layers as before, but bidirectional; sum the forward and backward losses to make one big loss.

* **The Twist**: Potentially use all layers of the model (except that we don't need the head with the big softmax over the words at the end).  After the fact, we can freeze our biLM embeddings but still provide useful information by learning a linear combination of the layers during downstream training.  During the biLM training, these scalars don't exist.
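The learned linear combination of layers can be written down directly. The sketch below assumes the form used in the ELMo paper, a softmax over one scalar per layer followed by a global scale `gamma`, and applies it to toy vectors for a single token. The function name `scalar_mix` and the values are illustrative assumptions.

```python
import math

def scalar_mix(layers, scalars, gamma=1.0):
    """Combine per-layer vectors with softmax-normalized weights, scaled by gamma."""
    m = max(scalars)
    exps = [math.exp(s - m) for s in scalars]
    weights = [e / sum(exps) for e in exps]
    dim = len(layers[0])
    return [gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
            for d in range(dim)]

# Three layer outputs for one token (embedding layer + 2 LSTM layers), toy values.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, scalars=[0.0, 0.0, 0.0])  # equal weights
print(mixed)
```

During downstream training only `scalars` and `gamma` need gradients, so the pre-trained biLM weights can stay frozen.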



### ELMo with AllenNLP

Even though ELMo is just a network like the one described above, there are a lot of details involved in setting it up and reloading the provided pre-trained checkpoints, and these details are not really important for demonstration purposes.  So, we will just install [AllenNLP](https://github.com/allenai/allennlp) and use it as the basis for a new embedding layer.

If you are interested in learning more about using ELMo with AllenNLP, they have provided a [tutorial here](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md).

#### TensorFlow and ELMo

ELMo was originally trained with TensorFlow.  You can find the code to train and use it in the [bilm-tf repository](https://github.com/allenai/bilm-tf/tree/master/bilm).

TF-Hub contains the [pre-trained ELMo model](https://tfhub.dev/google/elmo/2) and is very easy to integrate if you are using TensorFlow already.  The model takes a sequence of words (mixed-case) as inputs and can just be "glued" in to your existing models as a sub-graph of your own.


In [11]:
!pip install allennlp

Collecting allennlp
  Using cached https://files.pythonhosted.org/packages/30/8c/72b14d20c9cbb0306939ea41109fc599302634fd5c59ccba1a659b7d0360/allennlp-0.8.4-py3-none-any.whl
Collecting jsonnet>=0.10.0; sys_platform != "win32" (from allennlp)
  Downloading https://files.pythonhosted.org/packages/a9/a8/adba6cd0f84ee6ab064e7f70cd03a2836cefd2e063fd565180ec13beae93/jsonnet-0.13.0.tar.gz (255kB)
Collecting numpydoc>=0.8.0 (from allennlp)
  Downloading https://files.pythonhosted.org/packages/6a/f3/7cfe4c616e4b9fe05540256cc9c6661c052c8a4cec2915732793b36e1843/numpydoc-0.9.1.tar.gz
Collecting pytorch-pretrained-bert>=0.6.0 (from allennlp)
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)
Collecting parsimonious>=0.8.0 (from allennlp)
  Usi

### Approach

We will use mostly the same code as before.  For brevity, I have compacted it all here and omitted parts that aren't required for this section.  For more information, see the previous section.

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List, Tuple
import os
import io
import re
import codecs
import numpy as np
import csv
from collections import Counter, OrderedDict
from torch.utils.data import DataLoader, TensorDataset

class LSTMClassifier(nn.Module):

    def __init__(self, embeddings, num_classes, embed_dims, rnn_units, rnn_layers=1, dropout=0.5, hidden_units=[]):
        super().__init__()
        self.embeddings = embeddings
        self.dropout = nn.Dropout(dropout)
        self.rnn = torch.nn.LSTM(embed_dims,
                                 rnn_units,
                                 rnn_layers,
                                 dropout=dropout,
                                 bidirectional=False,
                                 batch_first=False)
        nn.init.orthogonal_(self.rnn.weight_hh_l0)
        nn.init.orthogonal_(self.rnn.weight_ih_l0)
        sequence = []
        input_units = rnn_units
        output_units = rnn_units
        for h in hidden_units:
            sequence.append(nn.Linear(input_units, h))
            input_units = h
            output_units = h
            
        sequence.append(nn.Linear(output_units, num_classes))
        self.outputs = nn.Sequential(*sequence)
        
        
    def forward(self, inputs):
        one_hots, lengths = inputs
        embed = self.dropout(self.embeddings(one_hots))
        embed = embed.transpose(0, 1)
        packed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.tolist())
        _, hidden = self.rnn(packed)
        hidden = hidden[0].view(hidden[0].shape[1:])
        linear = self.outputs(hidden)
        return F.log_softmax(linear, dim=-1)

class ConfusionMatrix:
    """Confusion matrix with metrics

    This class accumulates classification output, and tracks it in a confusion matrix.
    Metrics are available that use the confusion matrix
    """
    def __init__(self, labels):
        """Constructor with input labels

        :param labels: Either a dictionary (`k=int,v=str`) or an array of labels
        """
        if type(labels) is dict:
            self.labels = []
            for i in range(len(labels)):
                self.labels.append(labels[i])
        else:
            self.labels = labels
        nc = len(self.labels)
        self._cm = np.zeros((nc, nc), dtype=int)

    def add(self, truth, guess):
        """Add a single value to the confusion matrix based off `truth` and `guess`

        :param truth: The real `y` value (or ground truth label)
        :param guess: The guess for `y` value (or assertion)
        """

        self._cm[truth, guess] += 1

    def __str__(self):
        values = []
        width = max(8, max(len(x) for x in self.labels) + 1)
        for i, label in enumerate([''] + self.labels):
            values += ["{:>{width}}".format(label, width=width+1)]
        values += ['\n']
        for i, label in enumerate(self.labels):
            values += ["{:>{width}}".format(label, width=width+1)]
            for j in range(len(self.labels)):
                values += ["{:{width}d}".format(self._cm[i, j], width=width + 1)]
            values += ['\n']
        values += ['\n']
        return ''.join(values)

    def save(self, outfile):
        ordered_fieldnames = OrderedDict([("labels", None)] + [(l, None) for l in self.labels])
        with open(outfile, 'w') as f:
            dw = csv.DictWriter(f, delimiter=',', fieldnames=ordered_fieldnames)
            dw.writeheader()
            for index, row in enumerate(self._cm):
                row_dict = {l: row[i] for i, l in enumerate(self.labels)}
                row_dict.update({"labels": self.labels[index]})
                dw.writerow(row_dict)

    def reset(self):
        """Reset the matrix
        """
        self._cm *= 0

    def get_correct(self):
        """Get the diagonals of the confusion matrix

        :return: (``int``) Number of correct classifications
        """
        return self._cm.diagonal().sum()

    def get_total(self):
        """Get total classifications

        :return: (``int``) total classifications
        """
        return self._cm.sum()

    def get_acc(self):
        """Get the accuracy

        :return: (``float``) accuracy
        """
        return float(self.get_correct())/self.get_total()

    def get_recall(self):
        """Get the recall

        :return: (``float``) recall
        """
        total = np.sum(self._cm, axis=1)
        total = (total == 0) + total
        return np.diag(self._cm) / total.astype(float)

    def get_support(self):
        return np.sum(self._cm, axis=1)

    def get_precision(self):
        """Get the precision
        :return: (``float``) precision
        """

        total = np.sum(self._cm, axis=0)
        total = (total == 0) + total
        return np.diag(self._cm) / total.astype(float)

    def get_mean_precision(self):
        """Get the mean precision across labels

        :return: (``float``) mean precision
        """
        return np.mean(self.get_precision())

    def get_weighted_precision(self):
        return np.sum(self.get_precision() * self.get_support())/float(self.get_total())

    def get_mean_recall(self):
        """Get the mean recall across labels

        :return: (``float``) mean recall
        """
        return np.mean(self.get_recall())

    def get_weighted_recall(self):
        return np.sum(self.get_recall() * self.get_support())/float(self.get_total())

    def get_weighted_f(self, beta=1):
        return np.sum(self.get_class_f(beta) * self.get_support())/float(self.get_total())

    def get_macro_f(self, beta=1):
        """Get the macro F_b, with adjustable beta (defaulting to F1)

        :param beta: (``float``) defaults to 1 (F1)
        :return: (``float``) macro F_b
        """
        if beta < 0:
            raise Exception('Beta must be non-negative')
        return np.mean(self.get_class_f(beta))

    def get_class_f(self, beta=1):
        p = self.get_precision()
        r = self.get_recall()

        b = beta*beta
        d = (b * p + r)
        d = (d == 0) + d

        return (b + 1) * p * r / d

    def get_f(self, beta=1):
        """Get 2 class F_b, with adjustable beta (defaulting to F1)

        :param beta: (``float``) defaults to 1 (F1)
        :return: (``float``) 2-class F_b
        """
        p = self.get_precision()[1]
        r = self.get_recall()[1]
        if beta < 0:
            raise Exception('Beta must be non-negative')
        d = (beta*beta * p + r)
        if d == 0:
            return 0
        return (beta*beta + 1) * p * r / d

    def get_all_metrics(self):
        """Make a map of metrics suitable for reporting, keyed by metric name

        :return: (``dict``) Map of metrics keyed by metric names
        """
        metrics = {'acc': self.get_acc()}
        # If 2 class, assume second class is positive AKA 1
        if len(self.labels) == 2:
            metrics['precision'] = self.get_precision()[1]
            metrics['recall'] = self.get_recall()[1]
            metrics['f1'] = self.get_f(1)
        else:
            metrics['mean_precision'] = self.get_mean_precision()
            metrics['mean_recall'] = self.get_mean_recall()
            metrics['macro_f1'] = self.get_macro_f(1)
            metrics['weighted_precision'] = self.get_weighted_precision()
            metrics['weighted_recall'] = self.get_weighted_recall()
            metrics['weighted_f1'] = self.get_weighted_f(1)
        return metrics

    def add_batch(self, truth, guess):
        """Add a batch of data to the confusion matrix

        :param truth: The truth tensor
        :param guess: The guess tensor
        :return:
        """
        for truth_i, guess_i in zip(truth, guess):
            self.add(truth_i, guess_i)

class Trainer:
    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer

    def run(self, model, labels, train, loss, batch_size): 
        model.train()       
        train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)

        cm = ConfusionMatrix(labels)

        for batch in train_loader:
            loss_value, y_pred, y_actual = self.update(model, loss, batch)
            _, best = y_pred.max(1)
            yt = y_actual.cpu().int().numpy()
            yp = best.cpu().int().numpy()
            cm.add_batch(yt, yp)

        print(cm.get_all_metrics())
        return cm
    
    def update(self, model, loss, batch):
        self.optimizer.zero_grad()
        x, lengths, y = batch
        lengths, perm_idx = lengths.sort(0, descending=True)
        x_sorted = x[perm_idx]
        y_sorted = y[perm_idx]
        y_sorted = y_sorted.to('cuda:0')
        inputs = (x_sorted.to('cuda:0'), lengths)
        y_pred = model(inputs)
        loss_value = loss(y_pred, y_sorted)
        loss_value.backward()
        self.optimizer.step()
        return loss_value.item(), y_pred, y_sorted

class Evaluator:
    def __init__(self):
        pass

    def run(self, model, labels, dataset, batch_size=1):
        model.eval()
        valid_loader = DataLoader(dataset, batch_size=batch_size)
        cm = ConfusionMatrix(labels)
        for batch in valid_loader:
            y_pred, y_actual = self.inference(model, batch)
            _, best = y_pred.max(1)
            yt = y_actual.cpu().int().numpy()
            yp = best.cpu().int().numpy()
            cm.add_batch(yt, yp)
        return cm

    def inference(self, model, batch):
        with torch.no_grad():
            x, lengths, y = batch
            lengths, perm_idx = lengths.sort(0, descending=True)
            x_sorted = x[perm_idx]
            y_sorted = y[perm_idx]
            y_sorted = y_sorted.to('cuda:0')
            inputs = (x_sorted.to('cuda:0'), lengths)
            y_pred = model(inputs)
            return y_pred, y_sorted

def fit(model, labels, optimizer, loss, epochs, batch_size, train, valid, test):

    trainer = Trainer(optimizer)
    evaluator = Evaluator()
    best_acc = 0.0
    
    for epoch in range(epochs):
        print('EPOCH {}'.format(epoch + 1))
        print('=================================')
        print('Training Results')
        cm = trainer.run(model, labels, train, loss, batch_size)
        print('Validation Results')
        cm = evaluator.run(model, labels, valid)
        print(cm.get_all_metrics())
        if cm.get_acc() > best_acc:
            print('New best model {:.2f}'.format(cm.get_acc()))
            best_acc = cm.get_acc()
            torch.save(model.state_dict(), './checkpoint.pth')
    if test:
        model.load_state_dict(torch.load('./checkpoint.pth'))
        cm = evaluator.run(model, labels, test)
        print('Final result')
        print(cm.get_all_metrics())
    return cm.get_acc()

def whitespace_tokenizer(words: str) -> List[str]:
    return words.split() 

def sst2_tokenizer(words: str) -> List[str]:
    REPLACE = { "'s": " 's ",
                "'ve": " 've ",
                "n't": " n't ",
                "'re": " 're ",
                "'d": " 'd ",
                "'ll": " 'll ",
                ",": " , ",
                "!": " ! ",
                }
    words = words.lower()
    words = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", words)
    for k, v in REPLACE.items():
            words = words.replace(k, v)
    return [w.strip() for w in words.split()]


class Reader:

    def __init__(self, files, lowercase=True, min_freq=0,
                 tokenizer=sst2_tokenizer, vectorizer=None):
        self.lowercase = lowercase
        self.tokenizer = tokenizer
        build_vocab = vectorizer is None
        self.vectorizer = vectorizer if vectorizer else self._vectorizer
        x = Counter()
        y = Counter()
        for file_name in files:
            if file_name is None:
                continue
            with codecs.open(file_name, encoding='utf-8', mode='r') as f:
                for line in f:
                    words = line.split()
                    # Wrap the label in a list so a multi-character label counts as one item
                    y.update([words[0]])

                    if build_vocab:
                        words = self.tokenizer(' '.join(words[1:]))
                        words = words if not self.lowercase else [w.lower() for w in words]
                        x.update(words)
        self.labels = list(y.keys())

        if build_vocab:
            x = dict(filter(lambda cnt: cnt[1] >= min_freq, x.items()))
            alpha = list(x.keys())
            alpha.sort()
            self.vocab = {w: i+1 for i, w in enumerate(alpha)}
            self.vocab['[PAD]'] = 0

        self.labels.sort()

    def _vectorizer(self, words: List[str]) -> List[int]:
        return [self.vocab.get(w, 0) for w in words]

    def load(self, filename: str) -> TensorDataset:
        label2index = {l: i for i, l in enumerate(self.labels)}
        xs = []
        lengths = []
        ys = []
        with codecs.open(filename, encoding='utf-8', mode='r') as f:
            for line in f:
                words = line.split()
                ys.append(label2index[words[0]])
                words = self.tokenizer(' '.join(words[1:]))
                words = words if not self.lowercase else [w.lower() for w in words]
                vec = self.vectorizer(words)
                lengths.append(len(vec))
                xs.append(torch.tensor(vec, dtype=torch.long))
        x_tensor = torch.nn.utils.rnn.pad_sequence(xs, batch_first=True)
        lengths_tensor = torch.tensor(lengths, dtype=torch.long)
        y_tensor = torch.tensor(ys, dtype=torch.long)
        return TensorDataset(x_tensor, lengths_tensor, y_tensor)

### The new thing: set up to use ELMo

In [0]:
from allennlp.modules.elmo import Elmo, batch_to_ids


def elmo_vectorizer(sentence):
    character_ids = batch_to_ids([sentence])
    return character_ids.squeeze(0)

  
class ElmoEmbedding(nn.Module):
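    def __init__(self, options_file, weight_file, dropout=0.5):
        super().__init__()
        # Request 2 output representations from the pre-trained ELMo model
        self.elmo = Elmo(options_file, weight_file, 2, dropout=dropout)

    def forward(self, xch):
        # xch holds the character ids produced by `batch_to_ids`
        elmo = self.elmo(xch)
        e1, e2 = elmo['elmo_representations']
        mask = elmo['mask']
        # Sum the two representations and zero out the padded timesteps
        embeddings = (e1 + e2) * mask.float().unsqueeze(-1)
        return embeddings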


As before, we are going to load up our data with a reader.  This time, though, we will provide a vectorizer for ELMo.  Our simple example `Reader` only allows a single feature as the input to our classifier, so when a vectorizer is supplied it skips counting up a vocab.  In real life, you probably want to support both word vector features and context vector features, so you might want to modify the code to support both.  This is a very common approach -- just using ELMo to augment an existing setup.  Here, we just look at using ELMo features by themselves.
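
If you did want to support both feature types, one possible shape for it is a small module that runs two embedding layers side by side and concatenates their outputs along the feature dimension.  The sketch below is only an illustration: `ConcatEmbeddings` and `FakeContextEmbedding` are hypothetical names that do not appear elsewhere in this tutorial, and the fake context layer stands in for `ElmoEmbedding` so the example runs without downloading any weights.  The sizes (a 100-word vocab, 300-dimensional word vectors) are arbitrary.

```python
import torch
import torch.nn as nn


class ConcatEmbeddings(nn.Module):
    """Hypothetical helper: concatenate word-level and contextual features."""

    def __init__(self, word_embeddings: nn.Module, context_embeddings: nn.Module):
        super().__init__()
        self.word_embeddings = word_embeddings
        self.context_embeddings = context_embeddings

    def forward(self, word_ids: torch.Tensor, context_inputs: torch.Tensor) -> torch.Tensor:
        word_vecs = self.word_embeddings(word_ids)               # [B, T, word_dims]
        context_vecs = self.context_embeddings(context_inputs)   # [B, T, context_dims]
        return torch.cat([word_vecs, context_vecs], dim=-1)      # [B, T, word_dims + context_dims]


class FakeContextEmbedding(nn.Module):
    """Stand-in for ElmoEmbedding: returns zeros of the right shape."""

    def __init__(self, dims: int = 1024):
        super().__init__()
        self.dims = dims

    def forward(self, xch: torch.Tensor) -> torch.Tensor:
        # xch: [B, T, chars_per_token] character ids
        return torch.zeros(xch.shape[0], xch.shape[1], self.dims)


word_embeddings = nn.Embedding(100, 300, padding_idx=0)
combined = ConcatEmbeddings(word_embeddings, FakeContextEmbedding(1024))

word_ids = torch.randint(1, 100, (2, 7))       # 2 sentences, 7 tokens each
char_ids = torch.randint(1, 262, (2, 7, 50))   # matching character ids
features = combined(word_ids, char_ids)
print(features.shape)  # torch.Size([2, 7, 1324])
```

To use something like this, the `Reader` would also have to return both the word ids and the character ids for each example, and the classifier's `embed_dims` would become the sum of the two feature sizes.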


In [14]:
!wget https://www.dropbox.com/s/08km2ean8bkt7p3/trec.tar.gz?dl=1
!tar -xzf 'trec.tar.gz?dl=1'


We will set up our reader slightly differently than in the last experiment.  Here we will pass our `elmo_vectorizer` and keep the original casing (`lowercase=False`).

In [15]:
BASE = 'trec'
TRAIN = os.path.join(BASE, 'trec.nodev.utf8')
VALID = os.path.join(BASE, 'trec.dev.utf8')
TEST = os.path.join(BASE, 'trec.test.utf8')



reader = Reader((TRAIN, VALID, TEST,), lowercase=False, vectorizer=elmo_vectorizer)
train = reader.load(TRAIN)
valid = reader.load(VALID)
test = reader.load(TEST)
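
It may help to see what the tensors look like once the ELMo vectorizer is plugged in.  With word vectors, each sentence is a 1D tensor of word ids; with ELMo, each token becomes a row of character ids (50 per token in AllenNLP's encoding), so each sentence is a 2D tensor and `pad_sequence` pads along the time axis.  The sketch below uses random ids as stand-ins for real `elmo_vectorizer` output, so it runs without AllenNLP; the upper bound of 262 on the ids is just an assumption for the illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_CHARS = 50  # character ids per token in ELMo's encoding

# Stand-ins for elmo_vectorizer output: one [T, 50] tensor per sentence
shorter = torch.randint(1, 262, (3, MAX_CHARS))  # a 3-token sentence
longer = torch.randint(1, 262, (5, MAX_CHARS))   # a 5-token sentence

# Same call that Reader.load makes: pad every sentence to the longest one
batch = pad_sequence([shorter, longer], batch_first=True)
print(batch.shape)  # torch.Size([2, 5, 50])

# Padded timesteps are all zeros, so a mask can be recovered from the ids alone
mask = (batch > 0).sum(dim=-1) > 0
print(mask.long().tolist())  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

This is why `ElmoEmbedding.forward` can read a mask back out of the ELMo output rather than needing the lengths tensor: the zero padding added by `pad_sequence` is enough to tell real tokens from padding.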



Building the network is basically the same as before, but we are using ELMo instead of word vectors.  The command below will take a few minutes -- this is a much larger (forward) network than before, even though the number of learnable parameters hasn't really changed.

In [16]:
options_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
embeddings = ElmoEmbedding(options_file, weight_file)
model = LSTMClassifier(embeddings, len(reader.labels), embed_dims=1024, rnn_units=100, hidden_units=[100])

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


model.to('cuda:0')
loss = torch.nn.NLLLoss()
loss = loss.to('cuda:0')

learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adadelta(learnable_params, lr=1.0)

fit(model, reader.labels, optimizer, loss, 12, 50, train, valid, test)

100%|██████████| 336/336 [00:00<00:00, 192499.13B/s]
100%|██████████| 374434792/374434792 [00:07<00:00, 47927932.74B/s]
  "num_layers={}".format(dropout, num_layers))


Model has 461114 parameters
EPOCH 1
Training Results
{'acc': 0.5608, 'mean_precision': 0.6483439079531595, 'mean_recall': 0.48504498768062404, 'macro_f1': 0.49849106627634976, 'weighted_precision': 0.5726962123308302, 'weighted_recall': 0.5608, 'weighted_f1': 0.5554788741148295}
Validation Results
{'acc': 0.7942477876106194, 'mean_precision': 0.8454765420711672, 'mean_recall': 0.7033276693176708, 'macro_f1': 0.7144610587048233, 'weighted_precision': 0.8102895316760798, 'weighted_recall': 0.7942477876106194, 'weighted_f1': 0.7887777126414355}
New best model 0.79
EPOCH 2
Training Results
{'acc': 0.806, 'mean_precision': 0.799350329535837, 'mean_recall': 0.7675872074813431, 'macro_f1': 0.780728542640896, 'weighted_precision': 0.8062829252605372, 'weighted_recall': 0.806, 'weighted_f1': 0.8058397968891035}
Validation Results
{'acc': 0.8628318584070797, 'mean_precision': 0.8566120843164245, 'mean_recall': 0.7974693543452065, 'macro_f1': 0.8182667932069821, 'weighted_precision': 0.8675313347

0.944

Let's see how this number compares against a randomly initialized baseline model that is otherwise identical.  We don't really need to use such a huge embedding size in this case -- we are using word vectors instead of character-compositional vectors and we don't really have enough data to train a huge word embedding from scratch.  Also, since we don't have much data, we will use lowercased features.  Note that with these word embedding features, our model has **6x more trainable parameters than before**.  Also, we train it longer: 48 epochs instead of 12.

In [17]:

r = Reader((TRAIN, VALID, TEST,), lowercase=True)
train = r.load(TRAIN)
valid = r.load(VALID)
test = r.load(TEST)

embeddings = nn.Embedding(len(r.vocab), 300)
model = LSTMClassifier(embeddings, len(r.labels), embeddings.weight.shape[1], rnn_units=100, hidden_units=[100])

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


model.to('cuda:0')
loss = torch.nn.NLLLoss()
loss = loss.to('cuda:0')

learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adadelta(learnable_params, lr=1.0)

fit(model, r.labels, optimizer, loss, 48, 50, train, valid, test)

  "num_layers={}".format(dropout, num_layers))


Model has 2801306 parameters
EPOCH 1
Training Results
{'acc': 0.2542, 'mean_precision': 0.31261376969825, 'mean_recall': 0.2142077519956256, 'macro_f1': 0.21100023435767312, 'weighted_precision': 0.24503296037652494, 'weighted_recall': 0.2542, 'weighted_f1': 0.22759490197394544}
Validation Results
{'acc': 0.31194690265486724, 'mean_precision': 0.30219607432329726, 'mean_recall': 0.2697651533307747, 'macro_f1': 0.23033350512330675, 'weighted_precision': 0.3308620779538906, 'weighted_recall': 0.31194690265486724, 'weighted_f1': 0.25984647084205925}
New best model 0.31
EPOCH 2
Training Results
{'acc': 0.3776, 'mean_precision': 0.4211727454667504, 'mean_recall': 0.3614489295644783, 'macro_f1': 0.3731848509532312, 'weighted_precision': 0.37235350929169814, 'weighted_recall': 0.3776, 'weighted_f1': 0.3631077775707718}
Validation Results
{'acc': 0.4557522123893805, 'mean_precision': 0.486368442230124, 'mean_recall': 0.44059437145396335, 'macro_f1': 0.4116212305386946, 'weighted_precision': 0.

0.882
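
The two parameter counts printed above (461114 and 2801306) can be reproduced by hand, which makes it clearer where the 6x difference comes from.  The arithmetic below is a sketch under several assumptions about the `LSTMClassifier` defined earlier: a single-layer unidirectional LSTM with PyTorch's two bias vectors per gate, one hidden linear layer, and a 6-way output layer.  It also assumes that the only trainable ELMo parameters are the scalar-mix weights (3 layer weights plus 1 scale for each of the 2 output representations).  The vocab size of 8766 is not stated anywhere in the output; it is inferred here because it is the value that makes the total match.

```python
def lstm_params(input_dims: int, hidden: int) -> int:
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * (hidden * input_dims + hidden * hidden + 2 * hidden)


def linear_params(in_dims: int, out_dims: int) -> int:
    # Weight matrix plus bias
    return in_dims * out_dims + out_dims


NUM_LABELS = 6   # assumption: six classes
RNN_UNITS = 100
HIDDEN_UNITS = 100


def classifier_params(embed_dims: int) -> int:
    # Everything above the embedding layer
    return (lstm_params(embed_dims, RNN_UNITS)
            + linear_params(RNN_UNITS, HIDDEN_UNITS)
            + linear_params(HIDDEN_UNITS, NUM_LABELS))


# ELMo model: frozen biLM, only the scalar mixes are trainable (assumption)
ELMO_SCALAR_MIX = 2 * (3 + 1)
elmo_total = classifier_params(1024) + ELMO_SCALAR_MIX

# Baseline: a trainable 300-dimensional embedding table
VOCAB_SIZE = 8766  # inferred from the printed total, not stated in the source
baseline_total = VOCAB_SIZE * 300 + classifier_params(300)

print(elmo_total, baseline_total, round(baseline_total / elmo_total, 2))
# 461114 2801306 6.08
```

Under these assumptions, almost all of the baseline's extra parameters sit in the embedding table (about 2.6M of the 2.8M), while the ELMo model spends its trainable parameters on the wider LSTM input.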

## Conclusion

Without even concatenating word features, our ELMo model, with far fewer trainable parameters, surpasses the performance of the randomly initialized baseline (0.944 vs. 0.882 test accuracy), which we would expect.  It also significantly outperforms our CNN baseline with pre-trained, fine-tuned word embeddings from the last section -- that model's max performance is around 93% accuracy.  Note that this dataset is tiny, and the variance is large between datasets, but this model consistently outperforms both CNN and LSTM baselines.

Contextual embeddings consistently outperform non-contextual embeddings on almost every task in NLP, not just in text classification.  This method is becoming so commonly used that some papers have even started reporting this approach as a baseline.

### Some more references

- The PyTorch examples actually contain a [nice word-language model](https://github.com/pytorch/examples/tree/master/word_language_model)

- There is a [TensorFlow tutorial](https://www.tensorflow.org/tutorials/sequences/recurrent) as well

- The original source code for training [ELMo's bilm is here](https://github.com/allenai/bilm-tf/tree/master/bilm)

- [A succinct implementation](https://github.com/dpressel/baseline/blob/master/python/baseline/pytorch/embeddings.py#L63) of character-compositional embeddings in Baseline for PyTorch




