<a href="https://colab.research.google.com/github/dpressel/dlss-tutorial/blob/master/2_context_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part II: Contextualized embeddings


In this section, we are going to load a pre-trained language model (LM) checkpoint and use everything below the LM output as the lower layers of our previously defined classification model.  We don't really need to change anything else; we just need to pass this whole network as the `embeddings` parameter to the model.


## ELMo

For this demo, we will focus on ELMo ([Peters et al 2018](https://export.arxiv.org/pdf/1802.05365)), a bidirectional LM, with an embedding layer and 2 subsequent biLSTM layers.  ELMo is a bidirectional generalization of the character-aware LM from [Kim et al 2015](https://arxiv.org/abs/1508.06615).  There is a nice [slide deck by the authors here](http://www.people.fas.harvard.edu/~yoonkim/data/char-nlm-slides.pdf), but the key high-level points are listed here:

### Kim Language Model

* **Goal**: predict the next word in the sentence (causal LM) but account for unseen words by using a character-compositional approach that relies on the letters within the pre-segmented words.  This also drastically reduces the number of parameters compared with word-level models.

* **Using**: LSTM layers that take in a word representation for each position.  Each word is fed in and, together with the preceding context, used to predict the next word.

* **The Twist**: use the embeddings approach from [dos Santos & Zadrozny 2014](http://proceedings.mlr.press/v32/santos14.pdf) to represent words, but add parallel filters as in [Kim 2014](https://www.aclweb.org/anthology/D14-1181).  Also, add highway layers on top of the base model.
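
To make the character-compositional idea concrete, here is a minimal sketch in plain Python (not the authors' code) of what parallel filters of different widths "see" when they slide over a word.  The boundary markers `{` and `}` and the helper names `char_windows` and `word_features` are made up for illustration.  In the real model each window is turned into character embeddings, scored by a learned filter, and reduced by max-over-time pooling.

```python
from typing import Dict, List, Sequence

# Hypothetical boundary markers; the real model reserves its own
# start-of-word and end-of-word symbols.
BOW, EOW = "{", "}"


def char_windows(word: str, width: int) -> List[str]:
    """Return every character window a filter of the given width slides over."""
    chars = BOW + word + EOW
    return [chars[i:i + width] for i in range(len(chars) - width + 1)]


def word_features(word: str, widths: Sequence[int] = (2, 3)) -> Dict[int, List[str]]:
    """One list of windows per filter width; the widths are applied in parallel."""
    return {w: char_windows(word, w) for w in widths}


print(word_features("cats"))

# A word never seen in training still shares windows with known words,
# which is why the model can build a useful representation for it.
shared = set(char_windows("cats", 3)) & set(char_windows("cat", 3))
print(sorted(shared))
```

Because the model's parameters live in the character embeddings and filters rather than in one row per word, the vocabulary size no longer drives the size of the input layer.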



### ELMo Language Model

* **Goal**: predict the next word in the sentence (causal LM) on the forward sequence **and** predict the previous word in the sentence conditioned on the following context.

* **Using**: LSTM layers as before, but bidirectional; sum the forward and backward losses to make one big loss.

* **The Twist**: potentially use all layers of the model (except that we don't need the head with the big softmax over the words at the end). After the fact, we can freeze our biLM embeddings but still provide useful information by learning a linear combination of the layers during downstream training.  During the biLM training, these scalars don't exist.
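
The two ideas above can be sketched in a few lines of plain Python.  This is an illustration with toy numbers, not AllenNLP's implementation: `bilm_loss` adds the forward and backward negative log-likelihoods, and `scalar_mix` computes the downstream combination $\gamma \sum_j \mathrm{softmax}(s)_j \, h_j$ over the layer outputs $h_j$ for one token.  The function names and the toy vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence


def bilm_loss(forward_probs: Sequence[float], backward_probs: Sequence[float]) -> float:
    """Joint biLM loss: forward NLL plus backward NLL.

    forward_probs[k] is the probability the forward LM assigned to the true
    token k given the tokens before it; backward_probs[k] is the same for the
    backward LM given the tokens after it.
    """
    forward_nll = -sum(math.log(p) for p in forward_probs)
    backward_nll = -sum(math.log(p) for p in backward_probs)
    return forward_nll + backward_nll


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers: Sequence[Sequence[float]], scalars: Sequence[float],
               gamma: float = 1.0) -> List[float]:
    """Combine per-layer vectors for one token with learned scalar weights."""
    weights = softmax(scalars)
    dim = len(layers[0])
    return [gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
            for d in range(dim)]


# Three layers (character-based embedding + 2 biLSTM layers), toy 2-d vectors.
layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mixed = scalar_mix(layers, scalars=[0.0, 0.0, 0.0])
print(mixed)  # all-zero scalars weight every layer equally
print(bilm_loss([0.5, 0.5], [0.5, 0.5]))
```

Only `scalars` and `gamma` would be trained downstream; the layer vectors themselves come from the frozen biLM.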

### ELMo with AllenNLP

Even though ELMo is just a network like the one described above, there are a lot of details involved in setting it up and reloading the pre-trained checkpoints that are provided, and these details are not really important for demonstration purposes.  So, we will just install [AllenNLP](https://github.com/allenai/allennlp) and use it as the basis for a new embedding layer.

If you are interested in learning more about using ELMo with AllenNLP, they have provided a [tutorial here](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md).

#### TensorFlow and ELMo

ELMo was originally trained with TensorFlow.  You can find the code to train and use it in the [bilm-tf repository](https://github.com/allenai/bilm-tf/tree/master/bilm)

TF-Hub contains the [pre-trained ELMo model](https://tfhub.dev/google/elmo/2) and is very easy to integrate if you are using TensorFlow already.  The model takes a sequence of words (mixed-case) as inputs and can just be "glued" in to your existing models as a sub-graph of your own.


In [1]:
!pip install allennlp

Collecting allennlp
  Downloading https://files.pythonhosted.org/packages/30/8c/72b14d20c9cbb0306939ea41109fc599302634fd5c59ccba1a659b7d0360/allennlp-0.8.4-py3-none-any.whl (5.7MB)
Collecting tensorboardX>=1.2 (from allennlp)
  Downloading https://files.pythonhosted.org/packages/a2/57/2f0a46538295b8e7f09625da6dd24c23f9d0d7ef119ca1c33528660130d5/tensorboardX-1.7-py2.py3-none-any.whl (238kB)
Collecting flaky (from allennlp)
  Downloading https://files.pythonhosted.org/packages/ae/09/94d623dda1adacd51722f3e3e0f88ba08dd030ac2b2662bfb4383096340d/flaky-3.6.0-py2.py3-none-any.whl
Collecting numpydoc>=0.8.0 (from allennlp)
  Downloading https://files.pythonhosted.org/packages/6a/f3/7cfe4c616e4b9fe05540256cc9c6661c052c8a4cec2915732793b36e1843/numpydoc-0.9.1.tar.gz
Collecting awscli>=1.11.91 (from allennlp)
  Downloading https://files.pythonhosted.org/pack

### Approach

We will use mostly the same code as before.  For brevity, I have compacted it all here and omitted parts that aren't required for this section.  For more information, see the previous section.

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List, Tuple
import os
import io
import re
import codecs
import numpy as np
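import csv  # used by ConfusionMatrix.save
from collections import Counter, OrderedDict  # OrderedDict is used by ConfusionMatrix.save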
from torch.utils.data import DataLoader, TensorDataset

class LSTMClassifier(nn.Module):

    def __init__(self, embeddings, num_classes, embed_dims, rnn_units, rnn_layers=1, dropout=0.5, hidden_units=[]):
        super().__init__()
        self.embeddings = embeddings
        self.dropout = nn.Dropout(dropout)
        self.rnn = torch.nn.LSTM(embed_dims,
                                 rnn_units,
                                 rnn_layers,
                                 dropout=dropout,
                                 bidirectional=False,
                                 batch_first=False)
        nn.init.orthogonal_(self.rnn.weight_hh_l0)
        nn.init.orthogonal_(self.rnn.weight_ih_l0)
        sequence = []
        input_units = rnn_units
        output_units = rnn_units
        for h in hidden_units:
            sequence.append(nn.Linear(input_units, h))
            input_units = h
            output_units = h
            
        sequence.append(nn.Linear(output_units, num_classes))
        self.outputs = nn.Sequential(*sequence)
        
        
    def forward(self, inputs):
        one_hots, lengths = inputs
        embed = self.dropout(self.embeddings(one_hots))
        embed = embed.transpose(0, 1)
        packed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.tolist())
        _, hidden = self.rnn(packed)
        # hidden is (h_n, c_n); take the last layer's final hidden state -> (batch, rnn_units)
        hidden = hidden[0][-1]
        linear = self.outputs(hidden)
        return F.log_softmax(linear, dim=-1)

class ConfusionMatrix:
    """Confusion matrix with metrics

    This class accumulates classification output, and tracks it in a confusion matrix.
    Metrics are available that use the confusion matrix
    """
    def __init__(self, labels):
        """Constructor with input labels

        :param labels: Either a dictionary (`k=int,v=str`) or an array of labels
        """
        if type(labels) is dict:
            self.labels = []
            for i in range(len(labels)):
                self.labels.append(labels[i])
        else:
            self.labels = labels
        nc = len(self.labels)
        # np.int was removed from recent NumPy releases; the builtin int is equivalent here
        self._cm = np.zeros((nc, nc), dtype=int)

    def add(self, truth, guess):
        """Add a single value to the confusion matrix based off `truth` and `guess`

        :param truth: The real `y` value (or ground truth label)
        :param guess: The guess for `y` value (or assertion)
        """

        self._cm[truth, guess] += 1

    def __str__(self):
        values = []
        width = max(8, max(len(x) for x in self.labels) + 1)
        for i, label in enumerate([''] + self.labels):
            values += ["{:>{width}}".format(label, width=width+1)]
        values += ['\n']
        for i, label in enumerate(self.labels):
            values += ["{:>{width}}".format(label, width=width+1)]
            for j in range(len(self.labels)):
                values += ["{:{width}d}".format(self._cm[i, j], width=width + 1)]
            values += ['\n']
        values += ['\n']
        return ''.join(values)

    def save(self, outfile):
        ordered_fieldnames = OrderedDict([("labels", None)] + [(l, None) for l in self.labels])
        with open(outfile, 'w') as f:
            dw = csv.DictWriter(f, delimiter=',', fieldnames=ordered_fieldnames)
            dw.writeheader()
            for index, row in enumerate(self._cm):
                row_dict = {l: row[i] for i, l in enumerate(self.labels)}
                row_dict.update({"labels": self.labels[index]})
                dw.writerow(row_dict)

    def reset(self):
        """Reset the matrix
        """
        self._cm *= 0

    def get_correct(self):
        """Get the diagonals of the confusion matrix

        :return: (``int``) Number of correct classifications
        """
        return self._cm.diagonal().sum()

    def get_total(self):
        """Get total classifications

        :return: (``int``) total classifications
        """
        return self._cm.sum()

    def get_acc(self):
        """Get the accuracy

        :return: (``float``) accuracy
        """
        return float(self.get_correct())/self.get_total()

    def get_recall(self):
        """Get the recall

        :return: (``float``) recall
        """
        total = np.sum(self._cm, axis=1)
        total = (total == 0) + total
        return np.diag(self._cm) / total.astype(float)

    def get_support(self):
        return np.sum(self._cm, axis=1)

    def get_precision(self):
        """Get the precision
        :return: (``float``) precision
        """

        total = np.sum(self._cm, axis=0)
        total = (total == 0) + total
        return np.diag(self._cm) / total.astype(float)

    def get_mean_precision(self):
        """Get the mean precision across labels

        :return: (``float``) mean precision
        """
        return np.mean(self.get_precision())

    def get_weighted_precision(self):
        return np.sum(self.get_precision() * self.get_support())/float(self.get_total())

    def get_mean_recall(self):
        """Get the mean recall across labels

        :return: (``float``) mean recall
        """
        return np.mean(self.get_recall())

    def get_weighted_recall(self):
        return np.sum(self.get_recall() * self.get_support())/float(self.get_total())

    def get_weighted_f(self, beta=1):
        return np.sum(self.get_class_f(beta) * self.get_support())/float(self.get_total())

    def get_macro_f(self, beta=1):
        """Get the macro F_b, with adjustable beta (defaulting to F1)

        :param beta: (``float``) defaults to 1 (F1)
        :return: (``float``) macro F_b
        """
        if beta < 0:
            raise ValueError('Beta must be non-negative')
        return np.mean(self.get_class_f(beta))

    def get_class_f(self, beta=1):
        p = self.get_precision()
        r = self.get_recall()

        b = beta*beta
        d = (b * p + r)
        d = (d == 0) + d

        return (b + 1) * p * r / d

    def get_f(self, beta=1):
        """Get 2 class F_b, with adjustable beta (defaulting to F1)

        :param beta: (``float``) defaults to 1 (F1)
        :return: (``float``) 2-class F_b
        """
        p = self.get_precision()[1]
        r = self.get_recall()[1]
        if beta < 0:
            raise ValueError('Beta must be non-negative')
        d = (beta*beta * p + r)
        if d == 0:
            return 0
        return (beta*beta + 1) * p * r / d

    def get_all_metrics(self):
        """Make a map of metrics suitable for reporting, keyed by metric name

        :return: (``dict``) Map of metrics keyed by metric names
        """
        metrics = {'acc': self.get_acc()}
        # If 2 class, assume second class is positive AKA 1
        if len(self.labels) == 2:
            metrics['precision'] = self.get_precision()[1]
            metrics['recall'] = self.get_recall()[1]
            metrics['f1'] = self.get_f(1)
        else:
            metrics['mean_precision'] = self.get_mean_precision()
            metrics['mean_recall'] = self.get_mean_recall()
            metrics['macro_f1'] = self.get_macro_f(1)
            metrics['weighted_precision'] = self.get_weighted_precision()
            metrics['weighted_recall'] = self.get_weighted_recall()
            metrics['weighted_f1'] = self.get_weighted_f(1)
        return metrics

    def add_batch(self, truth, guess):
        """Add a batch of data to the confusion matrix

        :param truth: The truth tensor
        :param guess: The guess tensor
        :return:
        """
        for truth_i, guess_i in zip(truth, guess):
            self.add(truth_i, guess_i)

class Trainer:
    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer

    def run(self, model, labels, train, loss, batch_size): 
        model.train()       
        train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)

        cm = ConfusionMatrix(labels)

        for batch in train_loader:
            loss_value, y_pred, y_actual = self.update(model, loss, batch)
            _, best = y_pred.max(1)
            yt = y_actual.cpu().int().numpy()
            yp = best.cpu().int().numpy()
            cm.add_batch(yt, yp)

        print(cm.get_all_metrics())
        return cm
    
    def update(self, model, loss, batch):
        self.optimizer.zero_grad()
        x, lengths, y = batch
        lengths, perm_idx = lengths.sort(0, descending=True)
        x_sorted = x[perm_idx]
        y_sorted = y[perm_idx]
        y_sorted = y_sorted.to('cuda:0')
        inputs = (x_sorted.to('cuda:0'), lengths)
        y_pred = model(inputs)
        loss_value = loss(y_pred, y_sorted)
        loss_value.backward()
        self.optimizer.step()
        return loss_value.item(), y_pred, y_sorted

class Evaluator:
    def __init__(self):
        pass

    def run(self, model, labels, dataset, batch_size=1):
        model.eval()
        valid_loader = DataLoader(dataset, batch_size=batch_size)
        cm = ConfusionMatrix(labels)
        for batch in valid_loader:
            y_pred, y_actual = self.inference(model, batch)
            _, best = y_pred.max(1)
            yt = y_actual.cpu().int().numpy()
            yp = best.cpu().int().numpy()
            cm.add_batch(yt, yp)
        return cm

    def inference(self, model, batch):
        with torch.no_grad():
            x, lengths, y = batch
            lengths, perm_idx = lengths.sort(0, descending=True)
            x_sorted = x[perm_idx]
            y_sorted = y[perm_idx]
            y_sorted = y_sorted.to('cuda:0')
            inputs = (x_sorted.to('cuda:0'), lengths)
            y_pred = model(inputs)
            return y_pred, y_sorted

def fit(model, labels, optimizer, loss, epochs, batch_size, train, valid, test):

    trainer = Trainer(optimizer)
    evaluator = Evaluator()
    best_acc = 0.0
    
    for epoch in range(epochs):
        print('EPOCH {}'.format(epoch + 1))
        print('=================================')
        print('Training Results')
        cm = trainer.run(model, labels, train, loss, batch_size)
        print('Validation Results')
        cm = evaluator.run(model, labels, valid)
        print(cm.get_all_metrics())
        if cm.get_acc() > best_acc:
            print('New best model {:.2f}'.format(cm.get_acc()))
            best_acc = cm.get_acc()
            torch.save(model.state_dict(), './checkpoint.pth')
    if test:
        model.load_state_dict(torch.load('./checkpoint.pth'))
        cm = evaluator.run(model, labels, test)
        print('Final result')
        print(cm.get_all_metrics())
    return cm.get_acc()

def whitespace_tokenizer(words: str) -> List[str]:
    return words.split() 

def sst2_tokenizer(words: str) -> List[str]:
    REPLACE = { "'s": " 's ",
                "'ve": " 've ",
                "n't": " n't ",
                "'re": " 're ",
                "'d": " 'd ",
                "'ll": " 'll ",
                ",": " , ",
                "!": " ! ",
                }
    words = words.lower()
    words = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", words)
    for k, v in REPLACE.items():
        words = words.replace(k, v)
    return [w.strip() for w in words.split()]


class Reader:

    def __init__(self, files, lowercase=True, min_freq=0,
                 tokenizer=sst2_tokenizer, vectorizer=None):
        self.lowercase = lowercase
        self.tokenizer = tokenizer
        build_vocab = vectorizer is None
        self.vectorizer = vectorizer if vectorizer else self._vectorizer
        x = Counter()
        y = Counter()
        for file_name in files:
            if file_name is None:
                continue
            with codecs.open(file_name, encoding='utf-8', mode='r') as f:
                for line in f:
                    words = line.split()
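                    # Wrap the label in a list so multi-character labels are counted whole,
                    # not character by character
                    y.update([words[0]])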

                    if build_vocab:
                        words = self.tokenizer(' '.join(words[1:]))
                        words = words if not self.lowercase else [w.lower() for w in words]
                        x.update(words)
        self.labels = list(y.keys())

        if build_vocab:
            x = dict(filter(lambda cnt: cnt[1] >= min_freq, x.items()))
            alpha = list(x.keys())
            alpha.sort()
            self.vocab = {w: i+1 for i, w in enumerate(alpha)}
            self.vocab['[PAD]'] = 0

        self.labels.sort()

    def _vectorizer(self, words: List[str]) -> List[int]:
        return [self.vocab.get(w, 0) for w in words]

    def load(self, filename: str) -> TensorDataset:
        label2index = {l: i for i, l in enumerate(self.labels)}
        xs = []
        lengths = []
        ys = []
        with codecs.open(filename, encoding='utf-8', mode='r') as f:
            for line in f:
                words = line.split()
                ys.append(label2index[words[0]])
                words = self.tokenizer(' '.join(words[1:]))
                words = words if not self.lowercase else [w.lower() for w in words]
                vec = self.vectorizer(words)
                lengths.append(len(vec))
                xs.append(torch.tensor(vec, dtype=torch.long))
        x_tensor = torch.nn.utils.rnn.pad_sequence(xs, batch_first=True)
        lengths_tensor = torch.tensor(lengths, dtype=torch.long)
        y_tensor = torch.tensor(ys, dtype=torch.long)
        return TensorDataset(x_tensor, lengths_tensor, y_tensor)

### The new thing: set up to use ELMo

In [0]:
from allennlp.modules.elmo import Elmo, batch_to_ids


def elmo_vectorizer(sentence):
    character_ids = batch_to_ids([sentence])
    return character_ids.squeeze(0)

  
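class ElmoEmbedding(nn.Module):
    def __init__(self, options_file, weight_file, dropout=0.5):
        super().__init__()
        # Ask for 2 output representations, each with its own learned mix of the biLM layers
        self.elmo = Elmo(options_file, weight_file, 2, dropout=dropout)

    def forward(self, xch):
        # xch holds character ids; the output has one vector per token
        elmo = self.elmo(xch)
        e1, e2 = elmo['elmo_representations']
        mask = elmo['mask']
        # Sum the two representations and zero out the padded positions
        embeddings = (e1 + e2) * mask.float().unsqueeze(-1)
        return embeddings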


As before, we are going to load up our data with a reader.  This time, though, we will provide a vectorizer for ELMo.  In our simple example `Reader`, we only allow a single feature as our input vector to our classifier, so we can stop counting up our vocab.  In real life, you probably want to support both word vector features and context vector features so you might want to modify the code to support both.  This is a very common approach -- just using ELMo to augment an existing setup.  Here, we just look at using ELMo features by themselves.
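
To illustrate why the `vectorizer` hook is enough to switch feature types, here is a simplified, plain-Python stand-in (not AllenNLP's `batch_to_ids`).  `vocab_vectorizer` mirrors the `Reader._vectorizer` lookup, while `char_vectorizer` is a hypothetical character-level version with a made-up width of 8 codes per word; AllenNLP's real mapping uses a wider row and its own special symbols.

```python
from typing import Dict, List

MAX_CHARS = 8  # hypothetical fixed width per word, chosen only for this example
PAD = 0


def vocab_vectorizer(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """One integer per word; unseen words collapse to index 0."""
    return [vocab.get(w, 0) for w in words]


def char_vectorizer(words: List[str]) -> List[List[int]]:
    """One fixed-width row of character codes per word (simplified stand-in)."""
    rows = []
    for w in words:
        # Shift every byte by 1 so that 0 stays reserved for padding
        codes = [b + 1 for b in w.encode("utf-8")][:MAX_CHARS]
        rows.append(codes + [PAD] * (MAX_CHARS - len(codes)))
    return rows


vocab = {"[PAD]": 0, "what": 1, "is": 2}
sentence = ["what", "is", "ELMo"]
word_ids = vocab_vectorizer(sentence, vocab)
char_ids = char_vectorizer(sentence)
print(word_ids)   # "ELMo" is out of vocabulary and maps to 0
print(char_ids)   # every word still gets its own row, and case is preserved
```

The word-level output is a flat list per sentence, while the character-level output has one extra dimension, which is why the reader no longer needs a word vocabulary when a character-based vectorizer is supplied.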


In [5]:
!wget https://www.dropbox.com/s/08km2ean8bkt7p3/trec.tar.gz?dl=1
!tar -xzf 'trec.tar.gz?dl=1'

--2019-06-26 12:27:10--  https://www.dropbox.com/s/08km2ean8bkt7p3/trec.tar.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.1, 2620:100:6031:1::a27d:5101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/08km2ean8bkt7p3/trec.tar.gz [following]
--2019-06-26 12:27:11--  https://www.dropbox.com/s/dl/08km2ean8bkt7p3/trec.tar.gz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc9655c06af3554607eeabe74f7f.dl.dropboxusercontent.com/cd/0/get/AjiDtTF_BsnqtxDcMPgw8EML45xXVCAbpq9FMPNTVGdqlFPIfMP1jWTNxY0pgYwGL4g5yDcpd9shPv7nxmqdzWj7ekJBLzFY8DZkCVe_YQrNPg/file?dl=1# [following]
--2019-06-26 12:27:11--  https://uc9655c06af3554607eeabe74f7f.dl.dropboxusercontent.com/cd/0/get/AjiDtTF_BsnqtxDcMPgw8EML45xXVCAbpq9FMPNTVGdqlFPIfMP1jWTNxY0pgYwGL4g5yDcpd9shPv7nxmqdzWj7ekJBLzFY8DZkCVe_YQrNPg/file?dl=1
Resolving uc

We will set up our reader slightly differently than in the last experiment.  Here we will pass `elmo_vectorizer` as the `vectorizer` and turn off lowercasing.

In [7]:
BASE = 'trec'
TRAIN = os.path.join(BASE, 'trec.nodev.utf8')
VALID = os.path.join(BASE, 'trec.dev.utf8')
TEST = os.path.join(BASE, 'trec.test.utf8')



r = Reader((TRAIN, VALID, TEST,), lowercase=False, vectorizer=elmo_vectorizer)
train = r.load(TRAIN)
valid = r.load(VALID)
test = r.load(TEST)



Building the network is basically the same as before, but we are using ELMo instead of word vectors.  The command below will take a few minutes -- this is a much larger (forward) network than before, even though the learnable parameters haven't really changed.

In [16]:
options_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
embeddings = ElmoEmbedding(options_file, weight_file)
model = LSTMClassifier(embeddings, len(r.labels), embed_dims=1024, rnn_units=100, hidden_units=[100])

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


model.to('cuda:0')
loss = torch.nn.NLLLoss()
loss = loss.to('cuda:0')

learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adadelta(learnable_params, lr=1.0)

fit(model, r.labels, optimizer, loss, 12, 50, train, valid, test)

Model has 461114 parameters
EPOCH 1
Training Results
{'acc': 0.52, 'mean_precision': 0.5709996982500932, 'mean_recall': 0.4607179688562917, 'macro_f1': 0.47822641943140026, 'weighted_precision': 0.5272261561285185, 'weighted_recall': 0.52, 'weighted_f1': 0.5146276073296864}
Validation Results
{'acc': 0.7787610619469026, 'mean_precision': 0.7408372302325149, 'mean_recall': 0.7422682612647518, 'macro_f1': 0.732661412122254, 'weighted_precision': 0.7809378800113048, 'weighted_recall': 0.7787610619469026, 'weighted_f1': 0.7688030535979937}
New best model 0.78
EPOCH 2
Training Results
{'acc': 0.8088, 'mean_precision': 0.8157138470419919, 'mean_recall': 0.762659844349094, 'macro_f1': 0.7817567233499173, 'weighted_precision': 0.8096565274550026, 'weighted_recall': 0.8088, 'weighted_f1': 0.808481917833408}
Validation Results
{'acc': 0.8783185840707964, 'mean_precision': 0.8377755875012601, 'mean_recall': 0.8108450321404167, 'macro_f1': 0.8216311841445192, 'weighted_precision': 0.87680638269738

0.946

Let's see how this number compares against a randomly initialized baseline model that is otherwise identical.  We don't really need to use such a huge embedding size in this case -- we are using word vectors instead of character-compositional vectors, and we don't really have enough information to train a huge word embedding from scratch.  Also, since we don't have much information, we will use lowercased features.  Note that with these word embedding features, our model has **6x more parameters than before**.  Also, we might want to train it longer.
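
The parameter counts printed by the two runs (461114 and 2801306) can be reproduced with a little arithmetic.  The sketch below is a back-of-the-envelope check rather than anything taken from the training code; it assumes 6 output classes, a vocabulary of 8766 words, and 8 trainable mixing scalars in the ELMo layer, all of which are inferred from the printed totals and not stated in the text.

```python
def lstm_params(input_dim: int, hidden: int) -> int:
    """Single-layer unidirectional LSTM: 4 gates, each with input weights,
    recurrent weights, and two bias vectors (PyTorch keeps both b_ih and b_hh)."""
    return 4 * (input_dim * hidden + hidden * hidden + 2 * hidden)


def linear_params(n_in: int, n_out: int) -> int:
    """Fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out


NUM_CLASSES = 6    # assumption: inferred from the totals
VOCAB_SIZE = 8766  # assumption: inferred from the baseline total

# Shared classifier head: hidden layer of 100 units, then the output layer
head = linear_params(100, 100) + linear_params(100, NUM_CLASSES)

# ELMo model: the biLM is frozen, so only the (assumed) 2 scalar mixes,
# each with 3 layer weights and 1 gamma, count as trainable
elmo_total = lstm_params(1024, 100) + head + 2 * (3 + 1)

# Baseline: a trainable 300-dimensional embedding table feeds the LSTM
baseline_total = VOCAB_SIZE * 300 + lstm_params(300, 100) + head

ratio = baseline_total / elmo_total
print(elmo_total, baseline_total, round(ratio, 2))
```

Almost all of the baseline's extra parameters sit in the embedding table, which has to be learned from this small dataset alone.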

In [23]:

r = Reader((TRAIN, VALID, TEST,), lowercase=True)
train = r.load(TRAIN)
valid = r.load(VALID)
test = r.load(TEST)

embeddings = nn.Embedding(len(r.vocab), 300)
model = LSTMClassifier(embeddings, len(r.labels), embeddings.weight.shape[1], rnn_units=100, hidden_units=[100])

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


model.to('cuda:0')
loss = torch.nn.NLLLoss()
loss = loss.to('cuda:0')

learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adadelta(learnable_params, lr=1.0)

fit(model, r.labels, optimizer, loss, 48, 50, train, valid, test)

Model has 2801306 parameters
EPOCH 1
Training Results
{'acc': 0.2542, 'mean_precision': 0.2711038695766469, 'mean_recall': 0.22218244347567548, 'macro_f1': 0.21425818700631108, 'weighted_precision': 0.2564855224194423, 'weighted_recall': 0.2542, 'weighted_f1': 0.22703312209459242}
Validation Results
{'acc': 0.3163716814159292, 'mean_precision': 0.38163668256669087, 'mean_recall': 0.2838794512309889, 'macro_f1': 0.254044176761413, 'weighted_precision': 0.3912014117742859, 'weighted_recall': 0.3163716814159292, 'weighted_f1': 0.2676486277978532}
New best model 0.32
EPOCH 2
Training Results
{'acc': 0.3816, 'mean_precision': 0.4291024620399562, 'mean_recall': 0.3640440077585961, 'macro_f1': 0.3785247578651236, 'weighted_precision': 0.3791969668224335, 'weighted_recall': 0.3816, 'weighted_f1': 0.3700027369634422}
Validation Results
{'acc': 0.5154867256637168, 'mean_precision': 0.4800233611955989, 'mean_recall': 0.46191577496493413, 'macro_f1': 0.44646430059374564, 'weighted_precision': 0.54

0.882

## Conclusion

Without even concatenating word features, our ELMo model, with far fewer trainable parameters, surpasses the randomly initialized baseline (0.946 vs. 0.882 test accuracy in the runs above), which we would expect.  It also significantly outperforms our CNN baseline with pre-trained, fine-tuned word embeddings from the last section -- that model's best accuracy is around 93%.  Note that this dataset is tiny, and the variance is large between datasets, but this model consistently outperforms both CNN and LSTM baselines.

Contextual embeddings consistently outperform non-contextual embeddings on almost every task in NLP, not just in text classification.  This method is becoming so commonly used that some papers have even started reporting this approach as a baseline.








