
# Tagging for Named Entity Recognition with MEAD

In this example, we will use [mead-baseline](https://github.com/dpressel/mead-baseline) with PyTorch as a backend to build a deep learning tagger with a `CNN-BiLSTM-CRF` architecture. We will look at the idiomatic way this model is defined in Baseline and at how other models can be built by swapping out individual components.

We will train a tagger on the CONLL2003 dataset, then use `baseline.services` to reload the model and tag a sample sentence.

In [2]:
!pip install wheel
!pip install torch
!pip install mead-baseline[yaml]

Collecting mead-baseline[yaml]
  Using cached https://files.pythonhosted.org/packages/bf/fb/f81e8cfd141c900729f609db9ae77699f6c684a5c91205a9f5bcbaa9fdca/mead_baseline-2.0.1-py3-none-any.whl
Installing collected packages: mead-baseline
Successfully installed mead-baseline-2.0.1


To begin, we will train a basic tagger model using a JSON configuration file.  In the previous tutorial we showed YAML examples, which are a bit easier to digest, but the example we want to run here is in JSON.  Other than the format, it is laid out exactly as before.  The example we will use is [from the repository](https://github.com/dpressel/mead-baseline/blob/master/mead/config/conll.json).

First, we will look at the "features" definition.

## Features
```
"features": [
    {
      "name": "word",
      "vectorizer": {
        "type": "dict1d",
        "fields": "text",
        "transform": "baseline.lowercase"
      },
      "embeddings": {
        "label": "glove-6B-100"
      }
    },
    {
      "name": "senna",
      "vectorizer": {
        "type": "dict1d",
        "fields": "text",
        "transform": "baseline.lowercase"
      },
      "embeddings": {
        "label": "senna"
      }
    },
    {
      "name": "char",
      "vectorizer": {
        "type": "dict2d"
      },
      "embeddings": { "dsz": 30, "wsz": 30, "type": "char-conv" }
    }
],
```

### Vectorizers

You might have noticed that the vectorizers are defined with a different type than in our previous tutorial.  Two vectorizer types are referenced here, `dict1d` and `dict2d`.  To understand what is going on, let's first take a glance at our training data:

```
EU NNP I-NP S-ORG
rejects VBZ I-VP O
German JJ I-NP S-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP S-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP B-PER
Blackburn NNP I-NP E-PER

BRUSSELS NNP I-NP S-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP B-ORG
Commission NNP I-NP E-ORG
said VBD I-VP O

```

This is a common file format for tagging, parsing and some other tasks.  It can be space- or tab-delimited with one feature per column, and the last column holds the target label we want to learn to predict.  Blank lines separate sentences.
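As a rough sketch of how a file like this can be turned into per-token records, the snippet below groups lines into sentences and maps each column to a named field. This is not the reader MEAD uses; the function `read_conll` and the field names (`text`, `pos`, `chunk`, `y`) are assumptions made for illustration. The last two lines mimic, in spirit only, what a `dict1d` vectorizer configured with `"fields": "text"` and a lowercasing transform would consume.

```python
def read_conll(lines, field_names=("text", "pos", "chunk", "y")):
    """Group CoNLL-style lines into sentences of per-token dicts.

    `field_names` is an illustrative assumption, one name per column.
    Blank lines mark sentence boundaries.
    """
    sentences, current = [], []
    for line in lines:
        columns = line.split()  # handles both spaces and tabs
        if not columns:
            # A blank line ends the current sentence (if any).
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(field_names, columns)))
    if current:
        sentences.append(current)
    return sentences


sample = """EU NNP I-NP S-ORG
rejects VBZ I-VP O

Peter NNP I-NP B-PER
Blackburn NNP I-NP E-PER
""".splitlines()

sentences = read_conll(sample)

# Select one field per token and lowercase it.
surface = [[token["text"].lower() for token in sentence] for sentence in sentences]
print(surface)  # [['eu', 'rejects'], ['peter', 'blackburn']]
```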

The second column contains part-of-speech tags, and the third contains syntactic chunk tags in IOB1 format.  For our example run, neither is important.  Since this is deep learning and we wish to learn good representations from just the surface data, we are going to ignore these features entirely and learn to predict the last column from the surface text alone.  The final column is in IOBES format, a formulation of tags known to work well for NER.  Single-word spans are annotated `S-<label>`, two-word spans are annotated `B-<label> E-<label>`, and longer spans are annotated `B-<label> I-<label>+ E-<label>`.  Anything that is not one of the labels of interest is annotated `O` (short for Outside).
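To make the IOBES scheme concrete, here is a minimal sketch that produces IOBES tags from labeled spans. The helper `spans_to_iobes` and its `(start, end, label)` span format (with `end` exclusive) are assumptions made for this example and are not part of MEAD.

```python
def spans_to_iobes(length, spans):
    """Return IOBES tags for a sentence of `length` tokens.

    `spans` holds (start, end, label) triples with `end` exclusive.
    """
    tags = ["O"] * length
    for start, end, label in spans:
        if end - start == 1:
            # Single-token span
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


tokens = "The European Commission said".split()
print(spans_to_iobes(len(tokens), [(1, 3, "ORG")]))
# ['O', 'B-ORG', 'E-ORG', 'O']
```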

### Embeddings

The first two features use two sets of pretrained word embeddings.  They are named `word` and `senna`, and reference the `glove-6B-100` and `senna` embeddings, respectively.  Both are well known and perform quite well on the CONLL2003 Named Entity Recognition task.  The way we have defined them, they will be concatenated together for each word.  The `senna` embeddings are 50-dimensional and the `glove-6B-100` embeddings are 100-dimensional, so together they yield a 150-dimensional word vector.

In addition, we have another feature, `char`, whose embeddings are of type `char-conv`.  This is where the `CNN` in `CNN-BiLSTM-CRF` comes from.  The `char-conv` embedding is a convolution over the characters in a word, followed by max-over-time pooling and then one or more "gating" layers, each either a highway layer or a residual connection (the latter is the default).  This produces a fixed-size vector of dimension `wsz` (30 in this example).  We concatenate this with the `word` and `senna` features as well, yielding a 180-dimensional vector for each token.  Remember from the last tutorial that these features are placed into an `EmbeddingsStack` and the default `reduction`, `concat`, is applied to them.
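The dimension bookkeeping can be checked with a small pure-Python sketch. It is not MEAD's implementation: random numbers stand in for learned character embeddings and filters, the activation and gating layers are omitted, zero-padding at the word edges is an assumption, and `char_conv_pool` and `embed_chars` are hypothetical helpers. The point is only that convolution plus max-over-time pooling gives a `wsz`-sized vector regardless of word length, and that concatenation gives 180 dimensions.

```python
import random

DSZ, WSZ, WIDTH = 30, 30, 3  # char embedding size, number of filters, filter width
rng = random.Random(0)


def char_conv_pool(char_vectors, filters):
    """Convolve each filter over a word's characters, then max-pool over time."""
    width = len(filters[0])
    dsz = len(char_vectors[0])
    # Zero-pad both ends so even one-character words yield at least one window.
    pad = [[0.0] * dsz for _ in range(width // 2)]
    padded = pad + char_vectors + pad
    pooled = []
    for filt in filters:
        responses = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            responses.append(
                sum(w * x for fv, cv in zip(filt, window) for w, x in zip(fv, cv))
            )
        pooled.append(max(responses))  # max-over-time pooling
    return pooled


def embed_chars(word):
    """Stand-in for a learned character embedding lookup."""
    return [[rng.uniform(-1, 1) for _ in range(DSZ)] for _ in word]


filters = [
    [[rng.uniform(-1, 1) for _ in range(DSZ)] for _ in range(WIDTH)]
    for _ in range(WSZ)
]

char_vec = char_conv_pool(embed_chars("Blackburn"), filters)
word_vec = [0.0] * 100   # placeholder for a glove-6B-100 vector
senna_vec = [0.0] * 50   # placeholder for a senna vector

# The `concat` reduction: join the per-feature vectors for one token.
token_vec = word_vec + senna_vec + char_vec
print(len(char_vec), len(token_vec))  # 30 180
```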

## Model

Here is the `model` block of the JSON file:

```
  "model": {
    "type": "default",
    "cfiltsz": [
      3
    ],
    "hsz": 400,
    "dropout": 0.5,
    "dropin": {"word": 0.1,"senna": 0.1},
    "rnntype": "blstm",
    "layers": 1,
    "constrain_decode": true,
    "crf": 1
  },
```

This defines the second and third parts of the architecture name: `BiLSTM` is given as `blstm` under `rnntype`, and the `crf` parameter enables the CRF.  Once the input features are projected through the embeddings and concatenated, the result is sent into an encoder layer, defined here as a `BiLSTM`.  The output of this stage is in the label space; in other words, the number of output units equals the number of tags: the four span prefixes (`B`, `I`, `E`, `S`) times the number of entity types, plus `O`.

The `hsz` specifies the total number of hidden units that will be used for the `BiLSTM` encoder -- 200 for the forward direction, and 200 for the backward direction.

It turns out that a rather flat model works just fine for the `CONLL2003` NER dataset -- a single layer of `BiLSTM`s is all we need!

Once we have encoded or "transduced" our input data into our label space, we decode it to form an output.

The simplest form of decoding is greedy: we are given the label scores from our encoder, so we just pick the `argmax` at each timestep.  A linear-chain CRF is an alternative approach, where we learn a transition matrix from the current label state to the next label state and combine it with the encoder output to yield a final, globally coherent output: the most likely path through the sequence.  The `"crf": 1` setting above is treated as a boolean and specifies that we would like to use a `CRF` to decode.
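The difference between the two decoders can be sketched in a few lines of plain Python. This is a toy illustration, not MEAD's CRF: the per-step scores and the transition scores are hand-picked rather than learned, and training (which requires the forward algorithm) is left out. The Viterbi routine finds the highest-scoring path given both score sources.

```python
def greedy_decode(scores):
    """Pick the best label independently at each timestep."""
    return [max(range(len(step)), key=step.__getitem__) for step in scores]


def viterbi_decode(scores, transitions):
    """Best path where scores[t][j] rates label j at step t and
    transitions[i][j] rates moving from label i to label j."""
    n_labels = len(scores[0])
    best = list(scores[0])
    backpointers = []
    for step in scores[1:]:
        new_best, pointers = [], []
        for j in range(n_labels):
            i_best = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            pointers.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + step[j])
        best = new_best
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    path = [max(range(n_labels), key=best.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-PER", "E-PER"]
scores = [[0.1, 2.0, 0.0],   # step 0 prefers B-PER
          [1.0, 0.0, 0.9]]   # step 1 slightly prefers O over E-PER
transitions = [[0.0, 0.0, 0.0],
               [-5.0, 0.0, 1.0],  # B-PER -> O is penalized, B-PER -> E-PER is favored
               [0.0, 0.0, 0.0]]

print([labels[i] for i in greedy_decode(scores)])                # ['B-PER', 'O']
print([labels[i] for i in viterbi_decode(scores, transitions)])  # ['B-PER', 'E-PER']
```

Greedy decoding leaves the `B-PER` span unclosed, while the path-level decoder trades a slightly lower per-step score for a coherent sequence.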

One thing is worth noticing here.  Previously, we described how the labels are annotated in IOBES format.  Some transitions through the label space are not possible, and we can place a hard constraint on our model to prevent it from generating them.  For example, if a single token of some label is surrounded by tokens of a different label, it is impossible for the prefix of that token to be `B-`.  We can see in the example above, for instance, that the first token is a single-token span, and so it is prefixed with `S-`:

```
EU NNP I-NP S-ORG
```

Setting `constrain_decode` to `true` (or `1` or anything "truthy") ensures that our model cannot generate invalid sequences.  In fact, it turns out that we can usually get nearly the same performance from a model with greedy, constrained decoding as we can from using a CRF layer for the output!
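The rules behind such a constraint are simple enough to write down. The sketch below is an assumption about what a reasonable IOBES constraint looks like, not MEAD's code: `is_valid_transition` is a hypothetical helper, the tag set is built from the four CONLL2003 entity types only (a real label vocabulary may contain extra special symbols), and start-of-sentence and end-of-sentence constraints are ignored.

```python
def is_valid_transition(prev_tag, next_tag):
    """Return True if `next_tag` may follow `prev_tag` under IOBES."""
    prev_prefix, _, prev_label = prev_tag.partition("-")
    next_prefix, _, next_label = next_tag.partition("-")
    if prev_prefix in ("B", "I"):
        # An open span must continue or end, and keep the same label.
        return next_prefix in ("I", "E") and next_label == prev_label
    # After O, E- or S- no span is open: only O or a new span start is allowed.
    return next_prefix in ("O", "B", "S")


entity_types = ["PER", "LOC", "ORG", "MISC"]
tags = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in "BIES"]

# A boolean mask over all tag pairs; a decoder can forbid the False entries.
mask = {(a, b): is_valid_transition(a, b) for a in tags for b in tags}

print(len(tags))  # 17
print(mask[("B-PER", "E-PER")], mask[("B-PER", "O")], mask[("O", "I-LOC")])
# True False False
```

In practice, a mask like this is typically applied by giving the forbidden transitions a very large negative score before decoding.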

## Training

The training portion is quite similar to our previous examples.  A key difference is that `CNN-BiLSTM-CRF` models are often trained with relatively small batches, using SGD with momentum for a large number of epochs with early stopping on F1, which is the target metric typically used for NER:

```
  "train": {
    "batchsz": 10,
    "epochs": 100,
    "optim": "sgd",
    "eta": 0.015,
    "mom": 0.9,
    "patience": 40,
    "early_stopping_metric": "f1",
    "clip": 5.0,
    "span_type": "iobes"
  }
```

This training procedure will run for up to 100 epochs (stopping early if the F1 score does not improve for `40` epochs, per the `patience` criterion), so it will take quite a bit of time.

Since the `epochs` parameter is inside the `train` block, we can override its value with `--x:train.epochs`.  For instance, if you do not feel like waiting, modify the command line below to pass in `--x:train.epochs 10` so that it runs for 10 epochs, which should be good enough for the purposes of the tutorial.
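Conceptually, a dotted override addresses a nested key in the configuration. The following sketch shows that idea on a plain dictionary, using the `epochs` key from the `train` block shown above. It is purely illustrative: `apply_override` is a hypothetical helper and says nothing about how `mead-train` parses its arguments internally.

```python
import copy
import json


def apply_override(config, dotted_key, raw_value):
    """Return a copy of `config` with the nested key `dotted_key` set."""
    updated = copy.deepcopy(config)
    *parents, leaf = dotted_key.split(".")
    node = updated
    for key in parents:
        node = node.setdefault(key, {})
    try:
        # Interpret numbers, booleans, lists, etc. when possible.
        node[leaf] = json.loads(raw_value)
    except json.JSONDecodeError:
        # Otherwise keep the raw string.
        node[leaf] = raw_value
    return updated


config = {"train": {"batchsz": 10, "epochs": 100, "optim": "sgd"}}
short_run = apply_override(config, "train.epochs", "10")
print(short_run["train"]["epochs"], config["train"]["epochs"])  # 10 100
```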

In [3]:
!mead-train --config https://raw.githubusercontent.com/dpressel/mead-baseline/master/mead/config/conll.json

Reading config file '/tmp/tmp7ks3__py'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/logging.json'
No file found '/usr/local/l...', loading as string
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/datasets.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/embeddings.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/vecs.json'
Task: [tagger]
using /root/.bl-data as data/embeddings cache
extracting file..
downloaded data saved in /root/.bl-data/80b0a839a7edd4c99f54537aa83327340592a4e8
[train file]: /root/.bl-data/80b0a839a7edd4c99f54537aa83327340592a4e8/eng.train.iobes
[valid file]: /root/.bl-data/80b0a839a7edd4c99f54537aa83327340592a4e8/eng.testa.iobes
[test file]: /root/.bl-data/80b0a839a7edd4c99f54537aa83327340592a4e8/eng.testb.iobes
extracting file..
downloaded data saved in /root/.bl-data/a483a44d4414a18c7b10b36dd6daa59195eb292b
embedding file location: /root/.bl-data/a483a44d4

Our model got a final score (F1) of **~91.55**, which is a pretty strong result on this dataset and within [the expected results for this model](https://github.com/dpressel/mead-baseline/blob/master/docs/tagging.md#model-performance).  The resulting model checkpoint, along with the vectorizers, was stored in `tagger-1199.zip`, which appears in the directory listing below.

In [5]:
!ls

conllresults.conll  info.log		sample_data  tagger-1199.zip
errors.log	    reporting-1199.log	tagger	     timing-1199.log


## Using our trained model

Now that we have created an NER tagger, let's use it to tag some sentences.

`mead-baseline` has the concept of a "service", which is something that orchestrates underlying components like a `model` and a `vectorizer` to perform inference.  Each `mead` Task has a corresponding service object. 

These services can be wired up in many different ways, for example to support remote execution (where the `model` itself is part of a remote gRPC service like TensorFlow Serving), or to load a local model checkpoint like the one we trained in the previous section.

There is an example of how to use `baseline.services` for each Task type in the `api-examples` in `mead-baseline`, but in this example, we will create the code ourselves and run inference right in the notebook.

We just trained a `tagger`, and now we will reload the model with the `TaggerService` locally in Colab.

For this example, we are not going to use any special tokenizer; let's just assume that the tokens are whitespace-delimited.  Note that for a real use case, you want the tokenization used at inference to match the one used for training as closely as possible; otherwise you might find that the model does worse at inference than your metrics led you to believe.



In [6]:
from baseline.services import TaggerService
CHECKPOINT_PATH = './tagger-1199.zip'
model = TaggerService.load(CHECKPOINT_PATH, backend='pytorch')

EXAMPLE_SENTENCE = "Mr. Jones thinks Las Vegas is not as fun as NYC !".split()

model.predict(EXAMPLE_SENTENCE)



unzipping model
/tmp/39751301005946022bfb0cd6d0ff42a657ccc198/tagger/tagger-model-1199.pyt
/tmp/39751301005946022bfb0cd6d0ff42a657ccc198/tagger/vocabs-char-1199.json
/tmp/39751301005946022bfb0cd6d0ff42a657ccc198/tagger/vocabs-senna-1199.json
/tmp/39751301005946022bfb0cd6d0ff42a657ccc198/tagger/vocabs-word-1199.json
Calling model <function TaggerModelBase.load at 0x7fa7829018c8>


[[{'label': 'O', 'text': 'Mr.'},
  {'label': 'S-PER', 'text': 'Jones'},
  {'label': 'O', 'text': 'thinks'},
  {'label': 'B-LOC', 'text': 'Las'},
  {'label': 'E-LOC', 'text': 'Vegas'},
  {'label': 'O', 'text': 'is'},
  {'label': 'O', 'text': 'not'},
  {'label': 'O', 'text': 'as'},
  {'label': 'O', 'text': 'fun'},
  {'label': 'O', 'text': 'as'},
  {'label': 'S-LOC', 'text': 'NYC'},
  {'label': 'O', 'text': '!'}]]
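
The IOBES labels returned above can be collapsed into entity spans with a little post-processing. The helper below is a minimal sketch written for this tutorial, not part of `baseline.services`; it assumes each token is a dict with `label` and `text` keys, as in the output shown, and it silently drops malformed spans.

```python
def extract_entities(tagged_tokens):
    """Collect (text, label) entities from IOBES-tagged tokens."""
    entities, current, current_label = [], [], None
    for token in tagged_tokens:
        prefix, _, label = token["label"].partition("-")
        if prefix == "S":
            entities.append((token["text"], label))
            current, current_label = [], None
        elif prefix == "B":
            current, current_label = [token["text"]], label
        elif prefix in ("I", "E") and current and label == current_label:
            current.append(token["text"])
            if prefix == "E":
                entities.append((" ".join(current), label))
                current, current_label = [], None
        else:
            # "O" or an inconsistent tag: drop any unfinished span.
            current, current_label = [], None
    return entities


prediction = [
    {"label": "O", "text": "Mr."},
    {"label": "S-PER", "text": "Jones"},
    {"label": "O", "text": "thinks"},
    {"label": "B-LOC", "text": "Las"},
    {"label": "E-LOC", "text": "Vegas"},
    {"label": "O", "text": "is"},
    {"label": "S-LOC", "text": "NYC"},
]
print(extract_entities(prediction))
# [('Jones', 'PER'), ('Las Vegas', 'LOC'), ('NYC', 'LOC')]
```

Since `predict` returned a list containing one list of tokens for the single sentence above, this helper would be applied to each inner list in turn.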