<a href="https://colab.research.google.com/github/dpressel/mead-tutorials/blob/master/mead_1_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with MEAD Using PyTorch

In this example, we will download and test drive [mead-baseline](https://github.com/dpressel/mead-baseline) using the PyTorch backend. Then we will show how to use a custom model as an addon.  It's recommended to be familiar with the basic problems we are trying to solve here, which are discussed in detail in my [Deep Learning Summer School Tutorial from Gdansk, 2019](https://github.com/dpressel/dliss-tutorial), especially as background for the implementation details.

This document is written in Colab and uses free GPUs from Google, which is really awesome, but it also means that the examples will run at a fraction of the speed of my laptop with its RTX 2080 or even a GTX 1070.  In particular, our BERT fine-tuning example at the end will run slowly.  Later in the tutorial, I will provide an alternate configuration that will run much faster.

To get started, the first thing we need to do is install the latest version of the code:

In [2]:
!pip install wheel
!pip install torch
!pip install mead-baseline[yaml]

Collecting mead-baseline[yaml]
  Using cached https://files.pythonhosted.org/packages/7d/bf/42b54c0d418341fdb81acc3c8e661bc6cae98d5650594292b7502d559456/mead_baseline-2.0.0-py3-none-any.whl
Installing collected packages: mead-baseline
Successfully installed mead-baseline-2.0.0


The `mead-baseline` project has very few dependencies.  The basic install requires only:

- `numpy`
- `six`
- `torch` or `tensorflow`

Above, we installed with an optional `YAML` dependency.  If you do not install this dependency on your system, the program will run just fine, but will only accept JSON configuration files.  For our purposes in this tutorial, `YAML` is a little easier to digest.

For starters, we are just going to try to train a basic model using a YAML configuration file.  The example we will use is [from the repository](https://github.com/dpressel/mead-baseline/blob/master/mead/config/sst2-pyt.yml).  We will break it down and describe it piece-by-piece here:

```yaml
task: classify
basedir: sst2-pyt
preproc:
  mxlen: 100
  clean: true
backend: pytorch
dataset: SST2
unif: 0.25
```

The `basedir` param names the directory where our output is stored.  Below that is an optional "block" for preprocessing.  This is a way of defining overarching pre-processing params that should apply to the data globally.  In future examples we will see that these same transformations can be applied at the vectorization layer, which is a bit cleaner.

The `mxlen` argument tells the training program that we wish to enforce a hard limit on the length of any features in the program.  By default, the maximum length of the features is determined in pre-processing by the maximum length observed in the data, but in this example `100` is more than adequate.
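
As a rough illustration of what a hard length limit means, the sketch below truncates or right-pads a token list to `mxlen`.  The helper name `fit_to_mxlen` and the `<PAD>` token are assumptions made for this example; they are not part of the `mead-baseline` API, which handles this inside its vectorizers.

```python
from typing import List


def fit_to_mxlen(tokens: List[str], mxlen: int, pad: str = "<PAD>") -> List[str]:
    """Truncate to `mxlen` tokens, or right-pad with `pad` up to `mxlen`."""
    clipped = tokens[:mxlen]
    return clipped + [pad] * (mxlen - len(clipped))


# A short sentence is padded, a long one is cut off
short = fit_to_mxlen("a fine film".split(), mxlen=5)
long = fit_to_mxlen("one two three four five six seven".split(), mxlen=5)
```

Either way, every example ends up with exactly `mxlen` positions, which is what lets a batch be packed into a single `[B, T]` tensor.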

The `clean` option tells the program to apply data-specific transformations to clean up the input.  As a general rule, this should not be applied -- the best practice in MEAD is to have fully pre-processed the data ahead of time.  However, in a few cases, such as for this dataset, published work has applied a fairly standard set of transformations that we wish to recreate after downloading the original data files.

The `backend` parameter tells `mead-baseline` what backend we are using.  It can also be supplied or overridden on the command line.  So for example, even though this configuration specifies its default `backend` as `pytorch`, we can override this on the command line by providing a `--backend tf` option.
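
The precedence described here (command line wins over the config file) can be sketched with `argparse`.  This is only an illustration of the idea: `resolve_backend` is a hypothetical helper, not how `mead-train` is actually implemented.

```python
import argparse
from typing import List, Optional


def resolve_backend(config: dict, argv: List[str]) -> Optional[str]:
    """Return the command-line backend if one was given, else the config value."""
    parser = argparse.ArgumentParser()
    # default=None lets us detect that the flag was not supplied
    parser.add_argument("--backend", default=None)
    args = parser.parse_args(argv)
    return args.backend or config.get("backend")


config = {"task": "classify", "backend": "pytorch"}
from_config = resolve_backend(config, [])
from_cli = resolve_backend(config, ["--backend", "tf"])
```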

The `unif` option is also supported with local scoping within each feature, which we can see later -- here we are just being lazy and using the global property to override any and all features to tell them to initialize random weights between `-0.25` and `0.25`.

```yaml
loader:
  reader_type: default
```

The block above can be titled `loader` or `reader`; they are interchangeable.  The `reader_type` here is given as `default`.  The software has a registered `default` handler for each task and, in the case of classification, the [reader](https://github.com/dpressel/mead-baseline/blob/master/baseline/reader.py) is defined as a tab-separated value reader where the first column is the label, and the second column is some text.
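
To make the file format concrete, here is a minimal sketch of parsing `label<TAB>text` lines.  The function `read_tsv_examples` and the sample sentences are made up for illustration; the real reader linked above does considerably more (vocabulary building, vectorization, batching).

```python
from typing import Iterable, List, Tuple


def read_tsv_examples(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Parse `label<TAB>text` lines into (label, tokens) pairs."""
    examples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only on the first tab so the text may itself contain tabs
        label, text = line.split("\t", 1)
        examples.append((label, text.split()))
    return examples


sample = ["1\ta stirring , funny film\n", "0\tdull and lifeless\n"]
examples = read_tsv_examples(sample)
```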

You might have more complicated classification features, in which case, you might override this default with your own reader that can provide a set of features.  However, we are doing deep-learning and one of its supposed strengths is its ability to learn good feature representations from (nearly) raw input, so our single input reader should be sufficient for many examples.

```yaml
model:
  model_type: default
  filtsz: [3,4,5]
  cmotsz: 100
  dropout: 0.5
```

The `model` block describes our basic model architecture.  Our `default` model for classification is a parallel-filter convolutional neural network with a max-over-time pooling applied to the convolutional output.

This model is described in more detail, along with a from-scratch implementation, in my [Deep Learning Summer School Tutorial from Gdansk, 2019](https://colab.research.google.com/github/dpressel/dlss-tutorial/blob/master/1_pretrained_vectors.ipynb).
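
To make "parallel filters with max-over-time pooling" concrete, here is a tiny pure-Python sketch.  It is not the `mead-baseline` implementation (the real model uses the backend's convolution layers); the function names and toy numbers are made up for illustration, and bias terms and the nonlinearity are left out.

```python
from typing import List, Sequence

Vector = List[float]


def max_over_time(seq: Sequence[Vector], filt: Sequence[Vector]) -> float:
    """Slide one filter of shape [width, dsz] over seq [T, dsz]; keep the max response."""
    width = len(filt)
    responses = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Dot product between the filter and the current window
        responses.append(
            sum(f * x for f_row, x_row in zip(filt, window) for f, x in zip(f_row, x_row))
        )
    return max(responses)


def parallel_conv_pool(seq: Sequence[Vector], filters: Sequence[Sequence[Vector]]) -> Vector:
    """Apply every filter independently and concatenate the pooled scalars."""
    return [max_over_time(seq, filt) for filt in filters]


# Toy input: T=5 tokens, each a 2-dimensional embedding
seq = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
# One filter of width 3 and one of width 4 (the config above lists widths 3, 4 and 5)
filters = [
    [[1.0, 0.0]] * 3,  # sums the first coordinate over 3 tokens
    [[0.0, 1.0]] * 4,  # sums the second coordinate over 4 tokens
]
pooled = parallel_conv_pool(seq, filters)
```

The key property is that the output length depends only on the number of filters, not on the sentence length, so sentences of any length map to a fixed-size vector.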


```yaml
features:
  - name: word
    vectorizer:
      type: token1d
      transform: baseline.lowercase
    embeddings:
      label: w2v-gn
```
A strength of `mead-baseline` is its ability to incorporate multiple features into the model.  These features may be different representations of the same surface text, with optionally different pre-processing, or they may incorporate different features from the input altogether (like the Part-of-Speech tags).

Each feature in `mead-baseline` is incorporated into the model with a corresponding `embedding` which provides a dense representation of the feature (`embeddings` are also covered in my talk listed above).

The `default` embeddings are just basic word or character vectors (anything that can be instantiated with a lookup table).   Additionally, MEAD supports a lot of different pre-trained word embeddings out of the box, including `GloVe`, `Word2Vec` and `fastText`, as well as just random initializations for each word.  In the example above, we use a `label: w2v-gn` to indicate that we want the training program to look for a key named `w2v-gn` inside its `embeddings` "index" and access or download those embeddings and use them.

You might be wondering where this index is provided.  It turns out that this is an optional argument to our trainer program [mead-train](https://github.com/dpressel/mead-baseline/blob/master/mead/trainer.py), and that it defaults to an installed index that does contain this entry (among others):

```json
 {
    "label": "w2v-gn",
    "file": "https://www.dropbox.com/s/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz?dl=1",
    "dsz": 300
  },
```
The [default embeddings index](https://github.com/dpressel/mead-baseline/blob/master/mead/config/embeddings.json) is installed when we install the package as a local file, or it can be expressed as a URL.  Using local files in the default installed embeddings index would have little utility, so all of the examples in the installed `config/embeddings.json` use URLs and download the embeddings to a local cache (which defaults to `~/.bl-data`).  The `dsz` parameter above tells us the hidden (or embedding) dimensionality of the vectors (in this case, `300`).  This means if you look for the word `egg` in our word2vec embeddings above, you will get back a 300-dimensional tensor.
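
Conceptually, resolving a `label` is just a lookup over the entries of the index.  The sketch below shows the idea; `find_embedding` is a hypothetical helper written for this tutorial, not a function from `mead-baseline`, and the index holds only the single entry shown above.

```python
from typing import Dict, List


def find_embedding(index: List[Dict], label: str) -> Dict:
    """Return the index entry whose `label` matches, or raise KeyError."""
    for entry in index:
        if entry["label"] == label:
            return entry
    raise KeyError(f"no embeddings registered under label '{label}'")


index = [
    {
        "label": "w2v-gn",
        "file": "https://www.dropbox.com/s/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz?dl=1",
        "dsz": 300,
    },
]
entry = find_embedding(index, "w2v-gn")
# A URL means the file has to be downloaded to the cache before use
is_remote = entry["file"].startswith(("http://", "https://"))
```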

```yaml
train:
  batchsz: 50
  epochs: 2
  optim: adadelta
  eta: 1.0
  early_stopping_metric: acc
  verbose:
    console: True
    file: sst2-cm.csv
```
The `train` block above defines key training information like the batch size (`batchsz`), the number of times to iterate the full training set (`epochs`), and the optimizer (`optim`) and its learning rate (`lr` or `eta` can be used interchangeably).  Most commonly defined optimizers are supported.
The `early_stopping_metric`, along with the `do_early_stopping` boolean (which defaults to `true`), determines whether we should do early stopping using the validation set and, if so, what metric to use -- here `acc`.  Once training is completed, the best performing epoch in terms of accuracy will be persisted and used for final testing (and future inference).
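
The selection rule behind early stopping can be sketched in a few lines.  This assumes a metric where higher is better (like `acc`); the function `best_epoch` and the accuracy values are made up for illustration and are not results from any run in this tutorial.

```python
from typing import List, Tuple


def best_epoch(valid_scores: List[float]) -> Tuple[int, float]:
    """Return (epoch_index, score) of the best validation score.

    Ties keep the earliest epoch.
    """
    best_idx, best_score = 0, valid_scores[0]
    for epoch, score in enumerate(valid_scores):
        if score > best_score:
            best_idx, best_score = epoch, score
    return best_idx, best_score


valid_acc = [0.842, 0.861, 0.857]  # made-up validation accuracies, one per epoch
epoch, acc = best_epoch(valid_acc)
```

Here the checkpoint from the second epoch (index `1`) would be the one kept, even though training continued for another epoch.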

*Note*: when you have `mead-baseline` installed on your local machine, it's very common to refer to a local file using the `--config` option:

```
mead-train --config config/sst2.json
```

For convenience and to keep the examples up-to-date, we will provide our driver program with a URL instead for the configuration location, and we will do this consistently throughout the tutorial.

In [0]:
!mead-train --config https://raw.githubusercontent.com/dpressel/mead-baseline/master/mead/config/sst2-pyt.yml

Reading config file '/tmp/tmpg2la4mq1'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/logging.json'
No file found '/usr/local/l...', loading as string
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/datasets.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/embeddings.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/vecs.json'
Task: [classify]
using /root/.bl-data as data/embeddings cache
Clean
extracting file..
downloaded data saved in /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917
[train file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.phrases.train
[valid file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.dev
[test file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.test
tcmalloc: large alloc 1743568896 bytes == 0x3f88000 @  0x7f363e3261e7 0x59203c 0x4ca610 0x56697a 0x5a4be1 0x5a5cda 0x4ce182 0x50a

To recap, the `mead-train` program trained our deep learning model on the SST2 dataset for 2 epochs, using the `adadelta` optimizer, following [Kim 2014](http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf), and we can see that our results successfully replicate the performance in that work.

The `mead-baseline` package provides us with a few different options for builtin baselines.  What if we wanted to run the same dataset, using the LSTM model described and implemented in my [Deep Learning Summer School Tutorial from Gdansk, 2019](https://colab.research.google.com/github/dpressel/dlss-tutorial/blob/master/1_pretrained_vectors.ipynb)?

To do this, we really don't need to change much!  In fact, while we are at it, let's add a few more feature representations!  First, our model updates:

```yaml
model:
  model_type: lstm
  rnnsz: 100
  dropout: 0.5
```
Not much changed here.  We are telling it to use our built-in `lstm` classifier, with a hidden size (`rnnsz`) for the LSTM of `100`.

Now let's make the features section a little more interesting.  Here we are going to stack some pretrained embeddings:

```yaml
features:
  - name: word
    vectorizer:
      type: token1d
    embeddings:
      label: glove-840B
  - name: word2
    vectorizer:
      type: token1d
    embeddings:
      label: w2v-gn
```
Above, we told the trainer to use 2 sets of pre-trained word embeddings, a large GloVe model as well as the word2vec GoogleNews vectors we used in the last example, and concatenate them together as 2 different features with 2 `token1d` vectorizers.  A more common case would be to provide different vectorizers or transformations on the surface data (for instance, maybe providing mixed-case to one feature and lower-case to another).  Notice in this example that we have actually eliminated the `transform` parameter from our previous run, and that we are using the same `vectorizer` for each feature.

In this case, because the `vectorizer`s are actually the same, `MEAD` also allows us to "stack" them as a single feature as follows:

```yaml
features:
  - name: word
    vectorizer:
      type: token1d
    embeddings:
      label: [glove-840B, w2v-gn]

```
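
As a rough mental model of what stacking does to a single token, the sketch below looks the token up in each table and concatenates the results.  It assumes a concatenating reduction; the helper `embed_concat` and the toy 3-d and 2-d tables are invented for illustration (the real vectors are far larger).

```python
from typing import Dict, List


def embed_concat(token: str, tables: List[Dict[str, List[float]]]) -> List[float]:
    """Look the token up in every table and concatenate the vectors."""
    out: List[float] = []
    for table in tables:
        out.extend(table[token])
    return out


glove_like = {"egg": [0.1, 0.2, 0.3]}  # toy stand-in for the GloVe table
w2v_like = {"egg": [0.4, 0.5]}  # toy stand-in for the word2vec table
vec = embed_concat("egg", [glove_like, w2v_like])
```

The output dimensionality is the sum of the individual embedding sizes, so downstream layers see one wider vector per token.
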
Finally, let's update the optimizer block and use [AdamW](https://www.fast.ai/2018/07/02/adam-weight-decay/#adamw) with a slightly different learning rate, and some weight decay:
```yaml
train:
  epochs: 2
  optim: adamw
  eta: 0.0008
  weight_decay: 1.0e-5
  early_stopping_metric: acc

```




In [0]:
!mead-train --config https://raw.githubusercontent.com/dpressel/mead-baseline/master/mead/config/sst2-lstm-pyt.yml

Reading config file '/tmp/tmpaz_5pt1t'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/logging.json'
No file found '/usr/local/l...', loading as string
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/datasets.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/embeddings.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/vecs.json'
Task: [classify]
using /root/.bl-data as data/embeddings cache
Clean
files for https://www.dropbox.com/s/7jyi4pi894bh2qh/sst2.tar.gz?dl=1 found in cache, not downloading
[train file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.phrases.train
[valid file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.dev
[test file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.test
tcmalloc: large alloc 2176770048 bytes == 0x3a0a000 @  0x7f5001bd91e7 0x59203c 0x4ca610 0x56697a 0x5a4be1 0x5a5cda 0x4ce182 0x5

# Transformers are the new craze in NLP!

I covered them in my [Deep Learning Summer School Talk in Gdansk, 2019](https://docs.google.com/presentation/d/1DJI1yX4U5IgApGwavt0AmOCLWwso7ou1Un93sMuAWmA/edit#slide=id.p) including [how they are implemented and how to fine-tune BERT from scratch in this colab](https://github.com/dpressel/dliss-tutorial/blob/master/3_finetuning.ipynb).  In fact, the source code in the tutorial is based on the source code inside of the [mead-layers 8-mile API](https://github.com/dpressel/mead-baseline/tree/master/layers/eight_mile) within `mead-baseline`.

As you probably expect then, it should be very easy to use `mead-baseline` to fine-tune [BERT](https://arxiv.org/abs/1810.04805), the now ubiquitous bidirectional Transformer encoder (and you would be correct!).

The 8-mile API provides a full implementation of Transformers in both TensorFlow and PyTorch, and we provide deep-learning platform-independent checkpoints in the `numpy NPZ` format so you can easily use BERT from either framework.  This also means that you can run the exact same configuration no matter which backend you are using, TensorFlow or PyTorch by just changing the `backend` either through the command-line or in the config file!

The configuration looks a bit different; we will cover each difference in detail:

```yaml
model:
  model_type: fine-tune

```
The model block is quite simple here -- we just tell the driver we want to run a type of "fine-tune" model.  We will get into the details of how the code is organized later, but for now, think of a "fine-tune" model as the pre-training language model graph minus the final output to the vocabulary, grafted onto a single linear layer that projects to the logit space.  In `mead-baseline`, for fine-tuning on downstream tasks, the entire BERT model is considered as an embedding feature followed by the final projection layer.

```yaml
train:
  early_stopping_metric: acc
  epochs: 5
  eta: 4.0e-5
  optim: adamw
  weight_decay: 1.0e-3
```
The training routine uses a smaller learning rate and a larger weight decay than our previous LSTM example, and runs for more epochs, but otherwise it's basically the same.

The next section gets a bit more complicated, but it will make sense when we explain the details of BERT.  

```yaml
features:
- embeddings:
    word_embed_type: learned-positional-w-bias
    label: bert-base-uncased-npz
    type: tlm-words-embed-pooled
    reduction: sum-layer-norm
    layer_norms_after: true
    finetune: true
    dropout: 0.1
    mlm: true
  name: bert
  vectorizer:
    label: bert-base-uncased

```
First, the checkpoint we are referencing is defined in the `label`.  Just as with our word embeddings, if you check the embeddings index, you will find an entry for this:

```
  {
    "label": "bert-base-uncased-npz",
    "file": "https://www.dropbox.com/s/3ivk6npc6e0bgyk/bert-base-uncased.npz?dl=1",
    "sha1": "54bef0c84ce29a7729c5e1fc3509a01f8c579891",
    "unzip": false,
    "dsz": 768
  },

```

This is nice -- `mead-train` doesn't really know, nor care, what type of embeddings you have or how they are orchestrated.  It just knows that embeddings are defined in the `features` block, under the `embeddings` section, and that they may be downloaded from the internet to our cache if they are listed as a URL.

Next, the `vectorizer` entry also uses a `label` to reference a vectorizer.  This is defined in the `vectorizer` index which can be passed into `mead-train` or defaulted to the installed [vecs.json](https://github.com/dpressel/mead-baseline/blob/master/mead/config/vecs.json).  Here is the entry for the `bert-base-uncased` vectorizer:

```
  {
    "label": "bert-base-uncased",
    "type": "wordpiece1d",
    "vocab_file": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
    "transform": "baseline.lowercase"
  },
```
You might be wondering why it's referencing the HuggingFace repository if there are no dependencies from `mead-baseline`.  The good people at HuggingFace have made the official BERT vocabularies available for download, primarily for their excellent `transformers` library to be able to use, so there is no need to upload it elsewhere.  And although we are not using HuggingFace `transformers` in this example, we do provide `addon` integration that makes it possible to use their implementation if preferred.

## A Deep Dive into the BERT Fine-Tuning Config

BERT is just another Transformer, and so we can use MEAD's `Transformer` layers to implement BERT.  However, if it was just the original `Transformer` it wouldn't have such a cool name, right?

### What Kind of Transformer is BERT Exactly?

BERT is based very closely on the original Transformer paper [Attention is all you Need, Vaswani et al 2017](https://arxiv.org/abs/1706.03762) and the source repository [Tensor2Tensor, AKA T2T](https://github.com/tensorflow/tensor2tensor), mostly via [GPT, a Transformer language model (LM) with a causal objective](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf).

After the original Transformer paper came out, a bunch of researchers studied the location of the layer normalization in each Transformer layer, and realized it was [sub-optimally located](https://arxiv.org/abs/2002.04745).  Since then, most implementations have moved the layer norm to the beginning of the block, but BERT doesn't -- it follows T2T!

See this tweet from Colin Raffel, first author on the [T5 paper](https://arxiv.org/abs/1910.10683), and the follow-on thread for a recent reference to the layer norm location: https://twitter.com/colinraffel/status/1250853464864784384

There are a few differences in BERT's architecture from T2T:

1. It has no Decoder!  It basically follows the GPT implementation, treating the Encoder as an LM.
1. It uses learned positional embeddings instead of the sinusoidal positional embeddings (again following GPT).
1. It has an additional feature known as token-type embeddings, which are typically intended to mark which sentence each token belongs to.

BERT additionally differs from GPT in that it does not use a causal LM objective.  Instead it is trained on a Masked Language Model objective and a Next Sentence Prediction (NSP) objective.  More recent flavors of BERT like [XLM](https://arxiv.org/abs/1901.07291) and [RoBERTa](https://arxiv.org/abs/1907.11692) have abandoned the NSP objective, finding that it actually hurts downstream performance on the common benchmarks.
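
To make the Masked Language Model objective a bit more tangible, here is a deliberately simplified sketch of how training inputs and targets could be built.  The function `mask_tokens` is hypothetical, and it only ever substitutes `[MASK]`; BERT's actual recipe also sometimes keeps the chosen token or swaps in a random one.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace a random subset of tokens with [MASK].

    Returns (inputs, targets), where targets holds the original token at
    masked positions and None everywhere else (no loss at those positions).
    """
    rng = random.Random(seed)
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


tokens = "the movie was surprisingly good".split()
inputs, targets = mask_tokens(tokens, mask_prob=0.5)
```

Unlike a causal LM, the model sees context on both sides of each masked position, which is why the encoder can be bidirectional.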

Armed with all this info, let's take another glance at the `embeddings` sub-block where we define our BERT representation:

```
- embeddings:
    word_embed_type: learned-positional-w-bias
    label: bert-base-uncased-npz
    type: tlm-words-embed-pooled
    reduction: sum-layer-norm
    layer_norms_after: true
    finetune: true
    dropout: 0.1
    mlm: true
```

The `type` field identifies a Transformer LM embedding that uses a pooled representation of words (well, sub-words really, via [WordPiece](https://github.com/google/sentencepiece) or [Byte-Pair Encoding](https://www.aclweb.org/anthology/P16-1162/)), and we are `finetune`-ing an `mlm` model.  As discussed above, we want to place the `layer_norms_after` the block, not before it.

Because it does layer norms after the attention and at the end of each block, it makes sense that BERT places a layer norm right at the end of the embeddings.  In MEAD, a single object manages all of the embeddings, and it's called an [EmbeddingsStack](https://github.com/dpressel/mead-baseline/blob/master/layers/eight_mile/pytorch/layers.py).  It has a `reduction` operator that tells it how to combine multiple features.  The API also has an object called a `LearnedPositionalLookupTableEmbeddings` which combines positional embeddings with regular word embeddings (by summing them together).  If we were using the previously described token-type embeddings, we would construct another feature, maybe named `token-type`, and assign a vectorizer that would create 0s for the first sentence and 1s for the second sentence.  We could then combine these using the `sum-layer-norm` `reduction`, which would reduce to:

```
LayerNorm(PositionEmbed(IndexOf(x)) + WordEmbed(x) + TokenTypeEmbed(SentIndexOf(x)))
```
That's exactly the same operation BERT performs; MEAD just provides a simple composition, making it easier to define models without code.  A twist: if the TokenType is almost always ignored, we could provide a 0-tensor and still use a LookupTable embedding underneath.

In PyTorch this would be implemented as an `nn.Embedding(2, 768)` and we would always pass `zeros_like(x)`, which would cause a memory allocation every time we have different input.  This is inefficient and a bit complicated for simple fine-tuning.

If we think about it a bit differently, in the case where we don't care about the token type ID, we would really just be using the `embedding.weight[0]` and adding it to all input.  Well, that's just a bias, and if we had a `LearnedPositionalLookupTableEmbeddingsWithBias` object that allowed this, we could just use that... and of course, we do, which is the embedding type referred to in the block above.
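
The "it's just a bias" argument is easy to check numerically.  The sketch below uses plain Python lists with toy 3-dimensional vectors (all values invented) and leaves out the final layer norm, since it is applied identically in both cases.

```python
from typing import List


def add(*vecs: List[float]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


# Toy stand-in for an nn.Embedding(2, 3) token-type table
token_type_table = [[0.5, -0.5, 0.25], [9.0, 9.0, 9.0]]
word = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # word embeddings for T=2 tokens
pos = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]  # positional embeddings for T=2

# Option 1: look up token type 0 for every position
token_type_ids = [0, 0]
with_lookup = [
    add(w, p, token_type_table[t]) for w, p, t in zip(word, pos, token_type_ids)
]

# Option 2: treat row 0 of the table as a bias added to every position
bias = token_type_table[0]
with_bias = [add(w, p, bias) for w, p in zip(word, pos)]
```

Both options produce the same values, but the second needs no all-zeros index tensor and no lookup.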



For consistency and readability (YAML is easier to read IMO), up to this point, I have been showing the details of fine-tuning BERT for SST2 using this config:

https://github.com/dpressel/mead-baseline/blob/master/mead/config/sst2-bert-base-uncased.yml

And, if you have Colab Pro or are running this on a machine with some GPU horsepower, you can run the example above like this:

```
!mead-train --config https://raw.githubusercontent.com/dpressel/mead-baseline/master/mead/config/sst2-bert-base-uncased.yml
```
Because by default Colab is pretty slow, my example below is taken from the TREC dataset, not SST2.  On the plus side, this is the exact same dataset that I fine-tuned in my DLSS Tutorial, and it runs in a couple of minutes!

In [0]:
!mead-train --config https://raw.githubusercontent.com/dpressel/mead-baseline/master/mead/config/trec-bert-base-uncased.json

Reading config file '/tmp/tmp69__0rgw'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/logging.json'
No file found '/usr/local/l...', loading as string
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/datasets.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/embeddings.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/vecs.json'
Task: [classify]
using /root/.bl-data as data/embeddings cache
Downloading https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
extracting file..
downloaded data saved in /root/.bl-data/b52dd4b6cfe6ec51ea6cfa56baef8128b1785e3b
[train file]: /root/.bl-data/b52dd4b6cfe6ec51ea6cfa56baef8128b1785e3b/trec.nodev.utf8
[valid file]: /root/.bl-data/b52dd4b6cfe6ec51ea6cfa56baef8128b1785e3b/trec.dev.utf8
[test file]: /root/.bl-data/b52dd4b6cfe6ec51ea6cfa56baef8128b1785e3b/trec.test.utf8
files for https://www.dropbox.com/s/3ivk6npc6e0bgyk/bert-

The results above are quite good.  In fact, that particular run did quite a bit better than our final results in the [DLSS Tutorial](https://github.com/dpressel/dliss-tutorial/blob/master/3_finetuning.ipynb).  Now we have seen examples of using `MEAD`'s built-in Baselines to do really well on a couple of datasets.

# But How Does It Work Under the Hood?

You might be wondering, "how does it work, and how can I use this for my own research or model development?"  So far, our introduction has been quite high level -- all we have done is download a Python package that installed a magic command `mead-train` that can run some configuration files.

The `mead-baseline` software can do quite a few things to streamline your research. It provides strong baselines on publicly available, automatically downloadable datasets out of the box, and provides software to help you track and persist your results and models.

It also allows you, via the `datasets`, `embeddings` and `vecs` optional indices to easily define your own components that can improve on the existing models or run new models.  The `mead-train` driver follows the pattern of [Inversion of Control](https://en.wikipedia.org/wiki/Inversion_of_control) and allows you, the user, to define your own components for the framework including your own custom:

- readers
- vectorizers
- embeddings
- models
- reporting and logging
- trainer
- fit function
- task

The details are out of scope for this tutorial, but if you want to know more, there is [documentation](https://github.com/dpressel/mead-baseline/blob/master/docs/addons.md) on how `addons` are supported in the software.

In this tutorial, we will focus on a single `addon` type for the purposes of classification -- we will be creating our own classification models, and in doing so, we can explain the idioms that are built into `MEAD`.

## Tasks in MEAD

Under the hood, `mead-train` delegates all of its work to somebody else.  It basically does nothing except know whom to call and when, and that information comes entirely from its configuration file.  If the `task` is `classify`, it knows to proxy the information to a `mead.ClassifyTask`, which is a type of `mead.Task` that is registered to listen for those `classify` requests and train them.  The base `mead.Task` can be thought of as a recipe for creating a classifier.  It knows how to load `vectorizers`, `readers` and `trainer`s, and how to register the backend deep learning framework, but it still knows nothing about the actual classifier it is running -- it actually just calls the `baseline.train.fit()` method, which is defined like this:

```python
@export
def fit(model_params, ts, vs, es, **kwargs):
    """This method delegates to the registered fit function for each DL framework.  It is possible to provide a by-pass
    to our defined fit functions for each method (this is considered advanced usage).  In cases where the user wishes
    to provide their own fit hook, they need to decorate the bypass hook with @register_training_func(name='myname'),
    and then pass in the `fit_func='myname'` to this.  MEAD handles this automatically -- just pass fit_func: myname
    in the mead config if you want your own bypass, in which case training is entirely delegated to the 3rd party code.

    This use-case is expected to be extremely uncommon.  More common behavior would be to override the Trainer and use
    the provided fit function.

    :param model:
    :param ts:
    :param vs:
    :param es:
    :param kwargs:
    :return:
    """
    if type(model_params) is dict:
        task_name = model_params['task']
    else:
        task_name = model_params.task_name
    fit_func_name = kwargs.get('fit_func', 'default')
    return BASELINE_FIT_FUNC[task_name][fit_func_name](model_params, ts, vs, es, **kwargs)

```
Each task has a defined `default` fit function -- with at least one implementation per framework.  In fact, for the TensorFlow backend, we define 4 handlers including 2 defaults, one for eager mode and one for declarative mode:

- declarative mode `default` (using tf.datasets)
- eager mode `default` (using tf eager execution)
- `distributed` mode (using tf eager execution)
- `feed_dict` declarative mode using placeholders

The `fit_func` normally should be left alone, but in some cases, you may want to define it to one of the other approaches, or you might even want to make your own (though this is fairly uncommon).
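
If the registry idea is unfamiliar, the following self-contained sketch shows the general pattern: a decorator stores functions in a nested dict keyed by task and name, and a dispatcher looks them up at call time.  The names `FIT_FUNCS` and `register_fit` are hypothetical stand-ins chosen for this example; they mirror the shape of the code above but are not the `mead-baseline` API.

```python
from collections import defaultdict
from typing import Callable, Dict

# Hypothetical registry: task name -> fit function name -> callable
FIT_FUNCS: Dict[str, Dict[str, Callable]] = defaultdict(dict)


def register_fit(task: str, name: str = "default") -> Callable:
    """Decorator that records a fit function under (task, name)."""
    def decorator(fn: Callable) -> Callable:
        FIT_FUNCS[task][name] = fn
        return fn
    return decorator


@register_fit("classify")
def default_fit(model_params, ts, vs, es, **kwargs):
    return f"default fit on {len(ts)} training batches"


@register_fit("classify", name="myname")
def my_fit(model_params, ts, vs, es, **kwargs):
    return "custom fit"


def fit(model_params, ts, vs, es, **kwargs):
    """Dispatch to whichever fit function the caller asked for."""
    fit_func_name = kwargs.get("fit_func", "default")
    return FIT_FUNCS[model_params["task"]][fit_func_name](model_params, ts, vs, es, **kwargs)


result_default = fit({"task": "classify"}, [1, 2, 3], [], [])
result_custom = fit({"task": "classify"}, [], [], [], fit_func="myname")
```

The caller never imports `my_fit` directly; naming it in the configuration is enough, which is what makes this style of extension possible without touching the driver.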

### Framework Implementations

So far, notice that every software package we have used is either under `mead` or `baseline`.  The actual code that implements a backend like PyTorch lives in a sub-module following the naming:

```
baseline.{framework}.{task}
```

In our case, for this tutorial, we are interested in the details of the PyTorch classification package, which is: [baseline.pytorch.classify](https://github.com/dpressel/mead-baseline/tree/master/baseline/pytorch/classify).  Thankfully, the PyTorch implementation is quite simple, containing only 3 files:

```
(mead) dpressel@dpressel-CORSAIR-ONE:~/dev/work/baseline/baseline/pytorch/classify$ ls -l
total 32
-rw-r--r-- 1 dpressel dpressel    92 Apr 29 21:19 __init__.py
-rw-r--r-- 1 dpressel dpressel 15831 May  6 16:17 model.py
-rw-r--r-- 1 dpressel dpressel  8499 May  6 16:17 train.py

```

The [model.py](https://github.com/dpressel/mead-baseline/tree/master/baseline/pytorch/classify/model.py) contains the PyTorch abstract base classes that extend `baseline.ClassifierModel`: 

```
class ClassifierModelBase(nn.Module, ClassifierModel)
class EmbedPoolStackClassifier(ClassifierModelBase)
```

These are also implemented in TensorFlow under [baseline.tf.classify.model](https://github.com/dpressel/mead-baseline/tree/master/baseline/tf/classify/model.py).

Additionally, there are several concrete classes that inherit either `EmbedPoolStackClassifier` or `ClassifierModelBase`.  Each of these models uses `@register_model()` to declare itself to the model registry.

#### A Deep-Dive Into the LSTM and Convolutional Models

Both the convolutional and LSTM classifiers that we used inherit the `EmbedPoolStackClassifier`.  It has a funny but pretty precise name, named after the following stages:

- *embed* the input, one or more temporal tensors, typically of shape `[B, T]`, where `B` is the batch size (AKA `batchsz`) and `T` is the temporal length (e.g. the number of words in a sentence), with an `EmbeddingsStack` followed by a `reduction` if more than one, to a tensor of shape `[B, T, H]` where `H` is the number of hidden units
- *pool* the `[B, T, H]` tensor to a fixed length representation, with possibly a different number of hidden units (let's call it `P`), of shape `[B, P]`
- Use a *stack* of hidden layers to transform the input `[B, P]` into some output `[B, S]`.  This is typically one or more `Dense` layers with activations in between (possibly also with residual connections, dropout and layer norms)
- *output* project the input `[B, S]` to a tensor of shape `[B, L]` where `L` is the number of output labels.
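
The four stages and their shapes can be traced end to end with a toy example.  This is a pure-Python sketch with made-up weights; it uses a plain max-over-time as the pooling step (the real models pool with convolutions or an LSTM) and a single dense layer with ReLU as the stack.

```python
from typing import List

Matrix = List[List[float]]


def linear(x: List[float], w: Matrix) -> List[float]:
    """Multiply a vector by a weight matrix of shape [in, out]."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def embed(batch: List[List[int]], table: Matrix) -> List[Matrix]:
    """[B, T] token indices -> [B, T, H] via table lookup."""
    return [[table[idx] for idx in sent] for sent in batch]


def pool(embedded: List[Matrix]) -> Matrix:
    """[B, T, H] -> [B, P] by taking the max over time (so P == H here)."""
    return [[max(col) for col in zip(*sent)] for sent in embedded]


def stack(pooled: Matrix, w: Matrix) -> Matrix:
    """[B, P] -> [B, S] with one dense layer followed by ReLU."""
    return [[max(0.0, v) for v in linear(vec, w)] for vec in pooled]


def output(stacked: Matrix, w: Matrix) -> Matrix:
    """[B, S] -> [B, L], one logit per label."""
    return [linear(vec, w) for vec in stacked]


table = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]  # vocab of 3 words, H=2
w_stack = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # P=2 -> S=3
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # S=3 -> L=2
batch = [[1, 2, 0], [2, 2, 1]]  # B=2, T=3

logits = output(stack(pool(embed(batch, table)), w_stack), w_out)
```

Note that only `pool` removes the time dimension; everything after it works on one fixed-size vector per example, which is why swapping the pooling layer is enough to get a very different model.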

Not surprisingly, the `EmbedPoolStackClassifier` provides hooks for each step, but most sub-classes only need to implement the `init_pool` method:

```python
class EmbedPoolStackClassifier(ClassifierModelBase):

    # Signatures only; the method bodies are elided here
    def init_embed(self, embeddings: Dict[str, TensorDef], **kwargs) -> BaseLayer: ...
    def init_pool(self, input_dim: int, **kwargs) -> BaseLayer: ...
    def init_stacked(self, input_dim: int, **kwargs) -> BaseLayer: ...
    def init_output(self, input_dim: int, **kwargs) -> BaseLayer: ...
```
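
The hook pattern itself can be pictured with a small, framework-free sketch.  This is not the `mead-baseline` implementation: `Layer`, `EmbedPoolStackSketch` and `MaxPoolSketch` are hypothetical stand-ins, and the hook signatures are simplified.  The point is only that the base class wires the stages together, so a sub-class can get away with overriding a single hook.

```python
class Layer:
    """Stand-in for a real layer: it only records a name and an output width."""

    def __init__(self, name: str, output_dim: int):
        self.name = name
        self.output_dim = output_dim


class EmbedPoolStackSketch:
    """Hypothetical base class whose constructor chains the four hooks."""

    def __init__(self, embed_dim: int, num_labels: int, **kwargs):
        # Each stage is sized from the output width of the previous one
        self.embed = self.init_embed(embed_dim, **kwargs)
        self.pool = self.init_pool(self.embed.output_dim, **kwargs)
        self.stack = self.init_stacked(self.pool.output_dim, **kwargs)
        self.output = self.init_output(self.stack.output_dim, num_labels, **kwargs)

    def init_embed(self, embed_dim: int, **kwargs) -> Layer:
        return Layer("embed", embed_dim)

    def init_pool(self, input_dim: int, **kwargs) -> Layer:
        raise NotImplementedError("sub-classes provide the pooling layer")

    def init_stacked(self, input_dim: int, **kwargs) -> Layer:
        return Layer("stack", kwargs.get("hsz", input_dim))

    def init_output(self, input_dim: int, num_labels: int, **kwargs) -> Layer:
        return Layer("output", num_labels)


class MaxPoolSketch(EmbedPoolStackSketch):
    """Only the pooling hook is overridden; everything else is inherited."""

    def init_pool(self, input_dim: int, **kwargs) -> Layer:
        return Layer("max-over-time", input_dim)


model = MaxPoolSketch(embed_dim=300, num_labels=2, hsz=100)
widths = [layer.output_dim for layer in (model.embed, model.pool, model.stack, model.output)]
print(widths)
```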

In the case of our convolutional model, we simply inherit the base functionality from this idiomatic base-class, and provide a pooling operation using parallel convolutions, followed by a max-over-time pooling.
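
As a rough illustration of that pooling operation, the sketch below runs several convolutions with different filter widths in parallel, takes the max over time of each, and concatenates the results.  It is written in plain PyTorch rather than with the library's own layers, and the class name, filter sizes and dimensions are assumptions.

```python
import torch
import torch.nn as nn


class ParallelConvMaxPool(nn.Module):
    """Hypothetical pooling layer: parallel 1D convolutions + max-over-time."""

    def __init__(self, input_dim: int, num_filters: int, filter_sizes):
        super().__init__()
        # One convolution per filter width, all reading the same input
        self.convs = nn.ModuleList(
            [nn.Conv1d(input_dim, num_filters, fsz, padding=fsz // 2) for fsz in filter_sizes]
        )
        self.output_dim = num_filters * len(filter_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Conv1d expects channels before time: [B, T, H] -> [B, H, T]
        x = x.transpose(1, 2)
        # Max over the time axis collapses each feature map to [B, num_filters]
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)


pool = ParallelConvMaxPool(input_dim=300, num_filters=100, filter_sizes=[3, 4, 5])
out = pool(torch.randn(2, 11, 300))  # batch of 2 sentences, 11 tokens each
print(tuple(out.shape))
```

Because the max is taken over time, the output width depends only on the number of filters, not on the sentence length.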

For the LSTM, it's the same thing, but in this case our pooling layer uses the final LSTM hidden state.  In fact, here is the entire RNN model:


```python

@register_model(task='classify', name='lstm')
class LSTMModel(EmbedPoolStackClassifier):
    """A simple single-directional single-layer LSTM. No layer-stacking.
    """

    def init_pool(self, input_dim: int, **kwargs) -> BaseLayer:
        """LSTM with dropout yielding a final-state as output

        :param input_dim: The input word embedding depth
        :param kwargs: See below

        :Keyword Arguments:
        * *rnnsz* -- (``int``) The number of hidden units (defaults to `hsz`)
        * *rnntype/rnn_type* -- (``str``) The RNN type, defaults to `lstm`, other valid values: `blstm`
        * *hsz* -- (``int``) backoff for `rnnsz`, typically a result of stacking params.  This keeps things simple so
          it's easy to do things like residual connections between LSTM and post-LSTM stacking layers

        :return: A pooling layer
        """
        unif = kwargs.get('unif')
        hsz = kwargs.get('rnnsz', kwargs.get('hsz', 100))
        if type(hsz) is list:
            hsz = hsz[0]
        weight_init = kwargs.get('weight_init', 'uniform')
        rnntype = kwargs.get('rnn_type', kwargs.get('rnntype', 'lstm'))
        if rnntype == 'blstm':
            return BiLSTMEncoderHidden(input_dim, hsz, 1, self.pdrop, unif=unif, batch_first=True, initializer=weight_init)
        return LSTMEncoderHidden(input_dim, hsz, 1, self.pdrop, unif=unif, batch_first=True, initializer=weight_init)
```

The layers returned by this model are defined in the `mead-layers` package, which provides TensorFlow and PyTorch NLP-specific utilities, but you could easily define your own layers.

Note that we use `register_model` to identify what `task` this class is serving as well as a unique identifier that we give in the mead configuration file's `model` block under `type`.
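
To give a feel for what a registry like this does, here is a minimal, framework-free sketch.  It is not the code inside `mead-baseline`: `MODEL_REGISTRY`, `register_model_sketch` and `create_model_sketch` are hypothetical names, and the real registry may be organized differently.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping (task, name) -> model class
MODEL_REGISTRY: Dict[Tuple[str, str], type] = {}


def register_model_sketch(task: str, name: str) -> Callable[[type], type]:
    """Return a class decorator that files the class under (task, name)."""

    def decorator(cls: type) -> type:
        key = (task, name)
        if key in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered for task {task!r}")
        MODEL_REGISTRY[key] = cls
        return cls  # the class itself is returned unchanged

    return decorator


@register_model_sketch(task="classify", name="lstm")
class LSTMSketch:
    pass


def create_model_sketch(task: str, model_type: str, **kwargs):
    """Look up the class named in the config and instantiate it."""
    return MODEL_REGISTRY[(task, model_type)](**kwargs)


model = create_model_sketch("classify", "lstm")
print(type(model).__name__)
```

The useful property is that the trainer never imports a model class by name; it only needs the string from the configuration file.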

#### A Deep-Dive Into the Fine-Tuning Model Implementation

The Fine-Tuning Model is pretty robust, but most often it's just the simplest thing possible: a single output layer.  However, the same hooks are provided as in the `EmbedPoolStackClassifier`, minus the pooling, as we assume that the embedding is Bring Your Own Pooler (BYOP).  Unlike the previous example, we typically don't need to override this class, since it gives us everything we need.

```python
@register_model(task='classify', name='fine-tune')
class FineTuneModelClassifier(ClassifierModelBase):
    """Fine-tune based on pre-pooled representations"""

    # Signatures only; the method bodies are elided here
    def init_embed(self, embeddings: Dict[str, TensorDef], **kwargs) -> BaseLayer: ...
    def init_stacked(self, input_dim: int, **kwargs) -> BaseLayer: ...
    def init_output(self, input_dim: int, **kwargs) -> BaseLayer: ...
```

#### Defining Your Own Models

Just to demonstrate, let's make a pretty simple example.  We will create a layered convolutional net, gradually broadening the receptive field.  Finally, we will take the max-over-time and the mean-over-time of the last layer and concatenate them.

With PyTorch we could define this quite simply, but the `mead-layers` package, AKA `8 mile`, makes this pretty trivial.  Our model is defined like this:

```python
import torch
import torch.nn as nn
from typing import Tuple
from baseline.pytorch.classify import EmbedPoolStackClassifier
from baseline.model import register_model
from eight_mile.pytorch.layers import ConvEncoderStack, MeanPool1D, MaxPool1D

class ResidualConvPool(nn.Module):
    def __init__(self, input_dim: int, **kwargs):
        super().__init__()
        # Hyperparameters arrive as keyword arguments; `filtsz` is required
        filtsz = kwargs['filtsz']
        dropout = kwargs.get('dropout', 0.1)
        hsz = kwargs.get('hsz', 300)
        activation = kwargs.get('activation', 'relu')
        nlayers = kwargs.get('layers', 3)
        # A stack of `nlayers` convolutions over the embedded input
        self.convs = ConvEncoderStack(input_dim, hsz, filtsz, nlayers=nlayers, pdrop=dropout, activation=activation)
        # The max and mean pools are concatenated, so the pooled width doubles
        self.output_dim = 2 * hsz
        self.mean_pool = MeanPool1D(hsz)
        self.max_pool = MaxPool1D(hsz)

    def forward(self, inputs: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
        # The input is a pair of (embedded tensor, sequence lengths)
        x, lengths = inputs
        outputs = self.convs(x)
        mx = self.max_pool(outputs)
        mu = self.mean_pool((outputs, lengths))
        output = torch.cat([mx, mu], -1)
        return output

@register_model(task='classify', name='resconv')
class ConvClassifier(EmbedPoolStackClassifier):

    def init_pool(self, input_dim: int, **kwargs) -> nn.Module:
        return ResidualConvPool(input_dim, **kwargs)

```
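
For readers who want to see what the pooling step amounts to without the helper layers, here is a plain PyTorch sketch of a length-aware mean and max over time.  It is not the `eight_mile` implementation: `masked_mean_and_max` is a hypothetical helper, and it masks padding for both pools, which may differ from what the library layers do.

```python
import torch


def masked_mean_and_max(outputs: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Concatenate max-over-time and mean-over-time, ignoring padded steps.

    outputs: [B, T, H] float tensor, lengths: [B] number of valid steps per row.
    """
    T = outputs.shape[1]
    # mask[b, t, 0] is True when step t is a real (non-padded) position
    steps = torch.arange(T, device=outputs.device).unsqueeze(0)  # [1, T]
    mask = (steps < lengths.unsqueeze(1)).unsqueeze(-1)          # [B, T, 1]
    # Mean: zero out padding, then divide by the true length
    mean = (outputs * mask).sum(dim=1) / lengths.unsqueeze(1).to(outputs.dtype)
    # Max: push padded positions to -inf so they can never win
    mx = outputs.masked_fill(~mask, float("-inf")).max(dim=1).values
    return torch.cat([mx, mean], dim=-1)                         # [B, 2 * H]


# One sequence of length 2, padded out to T=3; the last step must be ignored
outputs = torch.tensor([[[1.0, 2.0], [3.0, 0.0], [9.0, 9.0]]])
lengths = torch.tensor([2])
pooled = masked_mean_and_max(outputs, lengths)
print(pooled.tolist())
```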

Our config looks very similar to our previous models, with just a few differences:

```yaml
batchsz: 50
basedir: sst2-pyt
modules: [https://raw.githubusercontent.com/mead-ml/hub/master/v1/addons/resconv.py]
```
We can tell `mead-train` to go find our addon (it can be a full file system path, a URL, or anything in the Python path).  Here we are referencing an online location.

```yaml
model:
  model_type: resconv
  filtsz: 3
  dropout: 0.5

```
This section looks almost like the previous examples, but our model is named `resconv`, the same name we registered for our class.

This time, when we run it, we will use a `--x:` argument to override a field within the YAML.  In this case, we pass in an override for the number of epochs to train, replacing the value in the file.  You can override any field within the YAML file by giving its `.`-delimited name:

In [0]:
!mead-train --config https://raw.githubusercontent.com/dpressel/mead-baseline/master/mead/config/sst2-resconv.yml --x:train.epochs 2

Reading config file '/tmp/tmpagu0d5by'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/logging.json'
No file found '/usr/local/l...', loading as string
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/datasets.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/embeddings.json'
Reading config file '/usr/local/lib/python3.6/dist-packages/mead/config/vecs.json'
Task: [classify]
using /root/.bl-data as data/embeddings cache
Clean
using /root/.bl-data as data/addons cache
files for https://www.dropbox.com/s/7jyi4pi894bh2qh/sst2.tar.gz?dl=1 found in cache, not downloading
[train file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.phrases.train
[valid file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.dev
[test file]: /root/.bl-data/31eb669609c65af3aa68a381fc760c4eaf801917/stsa.binary.test
files for https://www.dropbox.com/s/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bi
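
The `.`-delimited override used in the command above can be pictured as a walk through a nested dictionary.  The sketch below is only an illustration of the idea, not the parsing code in `mead-train`: `apply_override` is a hypothetical helper, and the starting values for `train.epochs` and `train.optim` are made up.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config['a']['b']['c'] = value given the key 'a.b.c'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Descend one level, creating the intermediate block if it is missing
        node = node.setdefault(part, {})
    node[leaf] = value


# Assumed starting config; the epochs and optim values are placeholders
config = {"batchsz": 50, "train": {"epochs": 10, "optim": "adadelta"}}
apply_override(config, "train.epochs", 2)
print(config["train"])
```

Only the named field changes; its siblings in the `train` block keep the values from the file.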

If you looked at the config file, you might also have noticed that we didn't actually do what we said and pass a URL for the addon.  Instead we did this:

```yaml
modules: [hub:v1:addons:resconv]
```

This is just an alias for `addons` that live in [mead-hub](https://github.com/mead-ml/hub).  Think of `hub` as an alias for the GitHub root, and the remaining `:` characters as path delimiters up to the file (we don't need to add the suffix, since it is expected to be a `.py` file).
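
Assuming the alias expands exactly as described, the mapping back to the raw URL shown earlier could be sketched like this.  `HUB_ROOT` and `resolve_hub_alias` are hypothetical names for illustration; the actual resolution logic inside `mead-baseline` may differ.

```python
# Assumed root, taken from the raw URL used in the earlier `modules` example
HUB_ROOT = "https://raw.githubusercontent.com/mead-ml/hub/master"


def resolve_hub_alias(alias: str, suffix: str = ".py") -> str:
    """Expand 'hub:v1:addons:resconv' into a full URL under HUB_ROOT."""
    prefix, *parts = alias.split(":")
    if prefix != "hub" or not parts:
        raise ValueError(f"not a hub alias: {alias!r}")
    # The remaining ':'-separated pieces become path segments
    return f"{HUB_ROOT}/{'/'.join(parts)}{suffix}"


url = resolve_hub_alias("hub:v1:addons:resconv")
print(url)
```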

### Wrap-up

In this tutorial, we have demonstrated how to use the baseline models from `mead-baseline` including an LSTM-based model, a convolutional model, and a BERT-fine-tuned model.

We also explored how `mead-train` delegates its responsibilities through Inversion of Control to deep-learning framework-specific packages to provide a rich set of models for PyTorch and TensorFlow.  Finally, we explored how to override the existing packages to make our own models, with some assistance from the 8-mile layers provided in `mead-layers`.
