<div style="font-size:250%; font-weight:bold">Train NER with spaCy</div>

This notebook shows how to train a new named-entity recognition (NER) model from scratch with the spaCy library on Amazon SageMaker.

In [None]:
!pip install --upgrade s3fs

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

import os
import s3fs
from sagemaker import get_execution_role
from sagemaker.mxnet import MXNet
from sagemaker.session import s3_input

from gtner_blog.util import split, write_split

<details>
    <summary>Note</summary>
    <blockquote>We reuse the existing SageMaker MXNet container to run the training script, so we don't have to build a new container image.</blockquote>
</details>

# Prepare data channels

Split the whole corpus into train and test sets in a 3:1 ratio, then upload both splits to S3.
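
The 3:1 split itself is done by `split` from `gtner_blog.util`, whose implementation is not shown in this notebook. As a rough illustration only, a minimal sketch of such a split could look like the following. It assumes one sentence per line in the `.iob` file, and the function name `split_3_to_1` and the sample corpus are hypothetical.

```python
import random
from typing import Iterable, List, Tuple


def split_3_to_1(lines: Iterable[str], seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle non-empty lines, then cut them into 75% train and 25% test."""
    # Assumption: each non-empty line holds one complete sentence.
    sentences = [line.rstrip("\n") for line in lines if line.strip()]
    random.Random(seed).shuffle(sentences)  # seeded, so the split is repeatable
    cut = (len(sentences) * 3) // 4
    return sentences[:cut], sentences[cut:]


# Hypothetical toy corpus of eight sentences.
corpus = [f"token{i}|O London|B-GPE" for i in range(8)]
train_part, test_part = split_3_to_1(corpus)
print(len(train_part), len(test_part))
```

Shuffling before cutting avoids a test set that only contains the tail of the corpus, at the cost of losing the original sentence order.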

In [None]:
bucket = 'gtner-blog'                # Change me as necessary
gt_jobname = 'test-gtner-blog-004'   # Change me as necessary

iob_file = f's3://{bucket}/gt/{gt_jobname}/manifests/output/output.iob'
train = f's3://{bucket}/spacy-data/train'
test = f's3://{bucket}/spacy-data/test'

fs = s3fs.S3FileSystem(anon=False)
with fs.open(iob_file, 'r') as f:
    # Build the S3 URIs by hand: os.path.join would insert backslashes on Windows.
    train_split = f'{train}/data.iob'
    test_split = f'{test}/data.iob'

    # Chain of functions: .iob -> split -> write_split.
    write_split(split(f), train_split, test_split)

display(iob_file, train, test)

# Start training

We create an MXNet estimator with our entry point script `spacy-train.py`, a thin wrapper over the `spacy train ...` CLI that does the following:

1. Parses SageMaker's entry-point protocol, namely the model and channel directories.
2. Pre-defines a few arguments to the `spacy train ...` CLI: `{"--lang", "--pipeline", "--output_path", "--train_path", "--dev_path"}`.
3. Passes the estimator's hyperparameters as arguments to `spacy train ...`.
   1. Each hyperparameter `abcd` is passed down as `--abcd`.
   2. The hyperparameters must not conflict with the arguments pre-defined in step 2.
   3. The entry point supports only the `--abcd SOME_VALUE` form of arguments.
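
The wrapper's source is not reproduced here, so the following is only a sketch of how the three steps above might be implemented, not the actual `spacy-train.py`. It assumes the script runs in a SageMaker framework container in script mode, where the model and channel directories are exposed through the `SM_MODEL_DIR` and `SM_CHANNEL_<NAME>` environment variables and hyperparameters arrive as command-line arguments. The names `RESERVED`, `sagemaker_dirs` and `check_passthrough` are hypothetical.

```python
import os
from typing import Dict, List, Mapping

# Arguments the wrapper sets itself (step 2), so hyperparameters may not reuse them.
RESERVED = {"--lang", "--pipeline", "--output_path", "--train_path", "--dev_path"}


def sagemaker_dirs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Step 1: read the model and channel directories from the environment."""
    return {
        "model": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        "train": env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
        "test": env.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    }


def check_passthrough(argv: List[str]) -> List[str]:
    """Step 3: accept only `--abcd SOME_VALUE` pairs that avoid reserved names."""
    if len(argv) % 2 != 0:
        raise ValueError("expected `--abcd SOME_VALUE` pairs only")
    for flag, value in zip(argv[::2], argv[1::2]):
        if not flag.startswith("--") or value.startswith("--"):
            raise ValueError(f"not a `--abcd SOME_VALUE` pair: {flag} {value}")
        if flag in RESERVED:
            raise ValueError(f"{flag} is set by the wrapper and cannot be overridden")
    return list(argv)


# Example: the hyperparameter n_iter=10 arrives as `--n_iter 10`.
extra = check_passthrough(["--n_iter", "10"])
dirs = sagemaker_dirs({"SM_MODEL_DIR": "/tmp/model"})
```

Under this sketch, a flag that takes no value, such as `--gold-preproc` in the help text below, would be rejected, which matches the `--abcd SOME_VALUE` restriction.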

<details>
    <summary><code>spacy train --help</code></summary>
    <blockquote><pre>
usage: spacy train [-h] [-rt None] [-b None] [-p tagger,parser,ner] [-v None]
                   [-n 30] [-ne None] [-ns 0] [-g -1] [-V 0.0.0] [-m None]
                   [-t2v None] [-pt] [-et] [-nl 0.0] [-ovl 0.0] [-bw] [-G]
                   [-T] [-TML] [-ta bow] [-tpl None] [-VV] [-D]
                   lang output_path train_path dev_path

    Train or update a spaCy model. Requires data to be formatted in spaCy's
    JSON format. To convert data from other formats, use the `spacy convert`
    command.


positional arguments:
  lang                  Model language
  output_path           Output directory to store model in
  train_path            Location of JSON-formatted training data
  dev_path              Location of JSON-formatted development data

optional arguments:
  -h, --help            show this help message and exit
  -rt None, --raw-text None
                        Path to jsonl file with unlabelled text documents.
  -b None, --base-model None
                        Name of model to update (optional)
  -p tagger,parser,ner, --pipeline tagger,parser,ner
                        Comma-separated names of pipeline components
  -v None, --vectors None
                        Model to load vectors from
  -n 30, --n-iter 30    Number of iterations
  -ne None, --n-early-stopping None
                        Maximum number of training epochs without dev accuracy
                        improvement
  -ns 0, --n-examples 0
                        Number of examples
  -g -1, --use-gpu -1   Use GPU
  -V 0.0.0, --version 0.0.0
                        Model version
  -m None, --meta-path None
                        Optional path to meta.json to use as base.
  -t2v None, --init-tok2vec None
                        Path to pretrained weights for the token-to-vector
                        parts of the models. See 'spacy pretrain'.
                        Experimental.
  -pt , --parser-multitasks
                        Side objectives for parser CNN, e.g. 'dep' or
                        'dep,tag'
  -et , --entity-multitasks
                        Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'
  -nl 0.0, --noise-level 0.0
                        Amount of corruption for data augmentation
  -ovl 0.0, --orth-variant-level 0.0
                        Amount of orthography variation for data augmentation
  -bw , --eval-beam-widths
                        Beam widths to evaluate, e.g. 4,8
  -G, --gold-preproc    Use gold preprocessing
  -T, --learn-tokens    Make parser learn gold-standard tokenization
  -TML, --textcat-multilabel
                        Textcat classes aren't mutually exclusive (multilabel)
  -ta bow, --textcat-arch bow
                        Textcat model architecture
  -tpl None, --textcat-positive-label None
                        Textcat positive label for binary classes with two
                        labels
  -VV, --verbose        Display more information for debug
  -D, --debug           Run data diagnostics before training
</pre></blockquote>
</details>

In [None]:
estimator = MXNet(entry_point='spacy-train.py',
                  source_dir='./spacy-scripts',
                  role=get_execution_role(),
                  train_instance_count=1,
                  train_instance_type='ml.m5.large',
                  framework_version='1.6.0',
                  py_version='py3',
                  debugger_hook_config=False,
                  hyperparameters={'n_iter': 10})  # Passed to the entry point as `--n_iter 10`

In [None]:
# Each key names a data channel that SageMaker makes available to the entry point.
estimator.fit({'train': s3_input(train), 'test': s3_input(test)})