# Semantic parsing with PyText

[PyText](https://engineering.fb.com/ai-research/pytext-open-source-nlp-framework/) is a modeling framework that blurs the boundaries between experimentation and large-scale deployment.  Portal by Facebook uses PyText for production natural language processing tasks, including [semantic parsing](https://en.wikipedia.org/wiki/Semantic_parsing) of user queries.  Semantic parsing converts natural language input into a logical form that a machine can process easily.  In this notebook, we will use PyText to train a semantic parser on the freely available [Facebook Task Oriented Parsing dataset](https://fb.me/semanticparsingdialog) using the newly open-sourced sequence-to-sequence framework.  We will then export the resulting parser to a TorchScript file suitable for production deployment.
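
To make "logical form" concrete, here is a minimal sketch that turns a bracketed annotation into a nested tree.  It assumes the bracket-and-label style (`[IN:...`, `[SL:...`, `]`) used by the TOP dataset; the function name `parse_logical_form` and the example annotation are hypothetical and written only for illustration.

```python
def parse_logical_form(text):
    """Parse a bracketed logical form into nested (label, children) tuples."""
    root = ("ROOT", [])
    stack = [root]
    for token in text.split():
        if token.startswith("["):
            # "[IN:..." or "[SL:..." opens a new intent or slot node.
            node = (token[1:], [])
            stack[-1][1].append(node)
            stack.append(node)
        elif token == "]":
            # "]" closes the most recently opened node.
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            stack.pop()
        else:
            # Any other token is a word of the utterance.
            stack[-1][1].append(token)
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return root[1]


# Hypothetical annotation written for illustration, not copied from the dataset.
example = "[IN:GET_DIRECTIONS what is the shortest way [SL:DESTINATION home ] ]"
tree = parse_logical_form(example)
print(tree)
```

The intent node (`IN:`) wraps the whole utterance and the slot node (`SL:`) wraps the span it labels, which is the structure the parser has to predict.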

## The PyText sequence-to-sequence framework

We have recently open-sourced our production sequence-to-sequence (Seq2Seq) framework in PyText ([framework](https://github.com/facebookresearch/pytext/commit/ff053d3388161917b189fabaa0e3058273ed4314), [TorchScript export](https://github.com/facebookresearch/pytext/commit/8dab0aec0e0456fdeb10ffac110f50e6a1382e6c)).  This framework provides an encoder-decoder architecture that is suitable for any task that requires mapping a sequence of input tokens to a sequence of output tokens.  Our existing implementation is based on recurrent neural networks (RNNs), which have been shown to be [unreasonably effective](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) at sequence processing tasks.  The model we will train includes three major components:
  1. A bidirectional LSTM sequence encoder
  2. An LSTM sequence decoder
  3. A sequence generator that supports incremental decoding and beam search

All of these components are TorchScript-friendly, so the trained model can be exported as-is.
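
The sequence generator in the list above relies on beam search.  The following is a minimal, framework-free sketch of the general technique, not PyText's implementation: `beam_search`, `toy_model`, and the `"</s>"` end marker are hypothetical names, and the toy probability table is made up for illustration.

```python
import math


def beam_search(step_log_probs, beam_size, max_len, eos="</s>"):
    """Return the best (tokens, log-probability) pair found by beam search.

    step_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the tuple `prefix`.
    """
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Incremental decoding: extend each hypothesis by one token.
            for token, log_prob in step_log_probs(tokens).items():
                candidates.append((tokens + (token,), score + log_prob))
        # Keep only the `beam_size` best extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached the end marker.
    return max(finished or beams, key=lambda item: item[1])


def toy_model(prefix):
    """Made-up next-token distribution; sequences end after two tokens."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
        ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
    }
    return table.get(prefix, {"</s>": 0.0})


greedy_tokens, _ = beam_search(toy_model, beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy_tokens, beam_tokens, math.exp(beam_score))
```

With a beam of 1 the search commits to `a` and ends with probability 0.3, while a beam of 2 keeps `b` alive long enough to find the sequence with probability 0.36.  That trade of extra computation for better sequences is the reason to decode with a beam rather than greedily.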

# Instructions

The remainder of this notebook installs PyText with its dependencies, downloads the training data to the local VM, trains the model, and verifies the exported TorchScript model.  It should run in 15-20 minutes on a GPU and trains a reasonably accurate semantic parser on the Facebook TOP dataset.  As detailed below, increasing the number of epochs yields a competitive result in about an hour of training time.  The notebook also exports a TorchScript model that can be used for runtime inference from Python, C++ or Java.

It is *strongly recommended* that this notebook be run on a GPU.

# Installing PyText

As of this writing, semantic parsing requires a bleeding-edge version of PyText.  The following cell will download the master branch from GitHub.

If the cell below has to install PyText, the installed packages change and the cell stops the runtime so that the notebook restarts.  Rerun the notebook after the restart and everything should work.

In [0]:
try:
  from pytext.models.seq_models.seq2seq_model import Seq2SeqModel
  print("Detected compatible version of PyText.  Skipping install.")
  print("Run the code in the except block to force PyText installation.")
except ImportError:
  !git clone https://github.com/facebookresearch/pytext.git
  %cd pytext
  !pip install -e .
  print("Stopping RUNTIME because we installed new dependencies.")
  print("Rerun the notebook and everything should now work.")
  import os
  os.kill(os.getpid(), 9)

# Downloading the data

We'll use the Facebook Task Oriented Parsing (TOP) dataset for our example.  This dataset is publicly available and can be downloaded into the notebook with the following cells.

In [0]:
!curl -o semanticparsingdialog.zip -L https://fb.me/semanticparsingdialog
!unzip -o semanticparsingdialog.zip

In [0]:
TOP_PATH = "/content/top-dataset-semantic-parsing/"

## Preprocessing the data

In the interest of time, we're going to simplify the data a bit.  Comments in the cells below explain how to use the full dataset if you prefer.

Following [Gupta et al.](https://arxiv.org/pdf/1810.07942.pdf), we'll use a single-valued output vocabulary so that our model focuses on predicting semantic structure.
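
To see what the single-valued vocabulary looks like, the sketch below applies the same regular expression used in this section's preprocessing cell to one annotation.  The helper name `to_limited_vocab` and the example annotation are hypothetical and written only for illustration.

```python
import re

# Same pattern as in the preprocessing cell of this notebook.
LOTV_PATTERN = r"(?!((IN|SL):[A-Z_]+(?<!\S))|\[|\])(?<!\S)((\w)|[^\w\s])+"


def to_limited_vocab(target_seq):
    """Replace every utterance token with "0", keeping labels and brackets."""
    return re.sub(LOTV_PATTERN, "0", target_seq)


# Hypothetical annotation written for illustration.
example = "[IN:GET_DIRECTIONS what is the shortest way [SL:DESTINATION home ] ]"
print(to_limited_vocab(example))
# -> [IN:GET_DIRECTIONS 0 0 0 0 0 [SL:DESTINATION 0 ] ]
```

Only the intent labels, slot labels, and brackets survive, so the decoder has to learn the tree structure rather than copy words from the input.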

In [0]:
import random
import re
from contextlib import ExitStack

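def make_lotv(input_path, output_path, sample_rate=1.0):
  """Write a limited-output-vocabulary copy of a TOP TSV file.

  Each input line is kept with probability `sample_rate`.
  """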
  with ExitStack() as ctx:
    input_file = ctx.enter_context(open(input_path, "r"))
    output_file = ctx.enter_context(open(output_path, "w"))
    for line in input_file:
      if random.random() > sample_rate:
        continue
      raw_seq, tokenized_seq, target_seq = line.split("\t")
      output_file.write(
          "\t".join(
              [
                raw_seq,
                tokenized_seq,
                # Change everything but IN:*, SL:*, [ and ] to 0
                re.sub(
                    r"(?!((IN|SL):[A-Z_]+(?<!\S))|\[|\])(?<!\S)((\w)|[^\w\s])+",
                    "0",
                    target_seq
                )
              ]
          )
      )

# Running on the full test set takes around 40 minutes, so using a reduced set
# here.  Change to ("test.tsv", 1.0) to evaluate on the full test set.
for f, r in [("train.tsv", 1.0), ("eval.tsv", 1.0), ("test.tsv", 0.1)]:
  make_lotv(f"{TOP_PATH}{f}", f"{TOP_PATH}lotv_{f}", r)

In [0]:
!head /content/top-dataset-semantic-parsing/lotv_train.tsv

In [0]:
# Comment out the next line to train on the original target instead of the
# limited output vocabulary version.
TOP_PATH = f"{TOP_PATH}lotv_"

# Preparing the PyText configuration

## Data configuration

PyText includes components to iterate through and preprocess the data.

In [0]:
from pytext.data import Data, PoolingBatcher
from pytext.data.sources import TSVDataSource

top_data_conf = Data.Config(
    sort_key="src_seq_tokens",
    source=TSVDataSource.Config(
        # Columns in the TSV.  These names will be used by the model.
        field_names=["raw_sequence", "source_sequence", "target_sequence"],
        train_filename=f"{TOP_PATH}train.tsv",
        eval_filename=f"{TOP_PATH}eval.tsv",
        test_filename=f"{TOP_PATH}test.tsv",
    ),
    batcher=PoolingBatcher.Config(
        num_shuffled_pools=10000,
        pool_num_batches=1,
        train_batch_size=64,
        eval_batch_size=100,
        # Testing relies on ScriptedSequenceGenerator, which
        # does not support batch sizes > 1
        test_batch_size=1,
    ),
)
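
The `field_names` above give each TSV column a name that the model configuration can refer to.  As a rough illustration of that mapping, and not of how `TSVDataSource` is implemented, the sketch below reads tab-separated lines into dicts keyed by those names.  The helper `read_rows`, the sample row, and its `IN:GREETING` label are hypothetical.

```python
import csv
import io

FIELD_NAMES = ["raw_sequence", "source_sequence", "target_sequence"]


def read_rows(tsv_file, field_names):
    """Yield one dict per TSV line, keyed by the given column names."""
    # QUOTE_NONE keeps quote characters in the text as literal characters.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for values in reader:
        if len(values) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} columns, got {len(values)}"
            )
        yield dict(zip(field_names, values))


# Hypothetical single-row file standing in for train.tsv.
sample = io.StringIO("What's up\twhat 's up\t[IN:GREETING 0 0 0 ]\n")
rows = list(read_rows(sample, FIELD_NAMES))
print(rows[0]["source_sequence"])
```

The `column` values in the model configuration must match these names exactly, otherwise the tensorizers have nothing to read.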

## Model configuration

We can use the object below to specify the architecture of the Seq2Seq model.

In [0]:
from pytext.data.tensorizers import TokenTensorizer
from pytext.loss import LabelSmoothedCrossEntropyLoss
from pytext.models.embeddings import WordEmbedding
from pytext.models.seq_models.rnn_decoder import RNNDecoder
from pytext.models.seq_models.rnn_encoder import LSTMSequenceEncoder
from pytext.models.seq_models.rnn_encoder_decoder import RNNModel
from pytext.models.seq_models.seq2seq_model import Seq2SeqModel
from pytext.models.seq_models.seq2seq_output_layer import Seq2SeqOutputLayer
from pytext.torchscript.seq2seq.scripted_seq2seq_generator import (
    ScriptedSequenceGenerator
)

seq2seq_model_conf=Seq2SeqModel.Config(
    # Source and target embedding configuration
    source_embedding=WordEmbedding.Config(embed_dim=200),
    target_embedding=WordEmbedding.Config(embed_dim=512),
    
    # Configuration for the tensorizers that transform the 
    # raw data to the model inputs
    inputs=Seq2SeqModel.Config.ModelInput(
        src_seq_tokens=TokenTensorizer.Config(
            # Output from the data handling.  Must match one of the column
            # names in TSVDataSource.Config, above
            column="source_sequence",
            # Add begin/end of sequence markers to the model input
            add_bos_token=True,
            add_eos_token=True,
        ),
        trg_seq_tokens=TokenTensorizer.Config(
            column="target_sequence",
            add_bos_token=True,
            add_eos_token=True,
        ),
    ),
    # Encoder-decoder configuration
    encoder_decoder=RNNModel.Config(
        # Bi-LSTM encoder
        encoder=LSTMSequenceEncoder.Config(
            hidden_dim=1024,
            bidirectional=True,
            dropout_in=0.0,
            embed_dim=200,
            num_layers=2,
            dropout_out=0.2,
        ),
        # LSTM + Multi-headed attention decoder
        decoder=RNNDecoder.Config(
            # Needs to match hidden dimension of encoder
            encoder_hidden_dim=1024,
            dropout_in=0.2,
            dropout_out=0.2,
            embed_dim=512,
            hidden_dim=256,
            num_layers=1,
            out_embed_dim=256,
            attention_type="dot",
            attention_heads=1,
        ),
    ),
    # Sequence generation via beam search.  TorchScript is used for
    # runtime performance
    sequence_generator=ScriptedSequenceGenerator.Config(
        beam_size=5,
        targetlen_b=3.76,
        targetlen_c=72,
        quantize=False,
        nbest=1,
    ),
    output_layer=Seq2SeqOutputLayer.Config(
        loss=LabelSmoothedCrossEntropyLoss.Config()
    ),
)
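
The decoder above is configured with `attention_type="dot"` and a single attention head.  The sketch below shows generic single-head dot-product attention in plain Python.  It is not PyText's code, and it assumes the query and the encoder states share one dimension, whereas a real decoder with differing encoder and decoder sizes would typically project one onto the other first.  The function name `dot_attention` is hypothetical.

```python
import math


def dot_attention(query, encoder_states):
    """Return (weights, context) for single-head dot-product attention."""
    # Score each encoder state by its dot product with the decoder query.
    scores = [
        sum(q * h for q, h in zip(query, state)) for state in encoder_states
    ]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


weights, context = dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

The encoder state that points in the same direction as the query receives the larger weight, so the context vector leans toward the source positions most relevant to the current decoding step.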

## Task configuration

Given the data and model configurations, it is straightforward to configure the PyText task.  

In [0]:
from pytext.optimizer.optimizers import Adam
from pytext.optimizer.scheduler import ReduceLROnPlateau
from pytext.task.tasks import SequenceLabelingTask
from pytext.trainers import TaskTrainer

seq2seq_on_top_task_conf = SequenceLabelingTask.Config(
    data=top_data_conf,
    model=seq2seq_model_conf,
    # Training configuration
    trainer=TaskTrainer.Config(
        # Setting a small number of epochs so the notebook executes more 
        # quickly.  Set epochs=40 to get a converged model.  Expect to see 
        # about 4 more points of frame accuracy in a converged model.
        epochs=10,
        # Clip gradient norm to 5
        max_clip_norm=5,
        # Stop if eval loss does not decrease for 5 consecutive epochs
        early_stop_after=5,
        # Optimizer and learning rate
        optimizer=Adam.Config(lr=0.001),
        # Learning rate scheduler: reduce LR if no progress made over 3 epochs
        scheduler=ReduceLROnPlateau.Config(patience=3),
    ),
)
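
The `early_stop_after=5` setting above stops training once the eval loss has failed to improve for 5 consecutive epochs.  The sketch below mirrors that rule as the config comment describes it; PyText's own bookkeeping may differ in detail.  The function `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(eval_losses, early_stop_after):
    """Return how many epochs run before early stopping triggers."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # New best eval loss: reset the patience counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= early_stop_after:
            return epoch
    # Patience never ran out: every epoch was run.
    return len(eval_losses)


# Made-up eval losses: the best value so far is reached in epoch 2.
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.82, 0.83, 0.7]
print(epochs_until_stop(losses, early_stop_after=5))
```

In this made-up run, training stops after epoch 7 and never reaches the better loss in epoch 8, which is the usual cost of a short patience window.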

## Complete the PyText config

PyText bundles the task along with several environment settings into a single config object that's used for training.

In [0]:
from pytext.config import LATEST_VERSION as PytextConfigVersion, PyTextConfig

pytext_conf = PyTextConfig(
    task=seq2seq_on_top_task_conf,
    # Export a TorchScript model for runtime prediction
    export_torchscript_path="/content/pytext_seq2seq_top.pt1",
    # PyText configs are versioned so that configs saved by older versions can
    # still be used by later versions.  Since we're constructing the config
    # on the fly, we can just use the latest version.
    version=PytextConfigVersion,
)

# Train the model

PyText uses the configuration object for training and testing.

In [0]:
from pytext.workflow import train_model

model, best_metric = train_model(pytext_conf)

# Test the model

PyText training saves a snapshot with the model and training state.  We can use the snapshot for testing as well.

Testing is slow: sequence generation presently runs on CPU only and with a batch size of 1, which is why the test set was downsampled during preprocessing.  Evaluating on the full test set takes around 40 minutes.

In [0]:
from pytext.workflow import test_model_from_snapshot_path

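test_model_from_snapshot_path(
    # Snapshot written during training
    "/tmp/model.pt",
    # Do not use CUDA: sequence generation is presently CPU-only
    False,
    # No alternative test file: use the one from the training config
    None,
    # No additional metric channels
    None,
    # File to write the test results to
    "/content/pytext_seq2seq_top_results.txt"
)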
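
The task configuration mentions frame accuracy as the quality measure.  The sketch below assumes frame accuracy means the fraction of examples whose predicted logical form matches the target exactly, as is common for the TOP dataset.  It is not PyText's metric code and does not read the results file written above.  The function name and the sample sequences are hypothetical.

```python
def frame_accuracy(predictions, targets):
    """Fraction of predictions that match their target token for token."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Compare token lists so differences in whitespace do not matter.
    matches = sum(
        pred.split() == target.split()
        for pred, target in zip(predictions, targets)
    )
    return matches / len(targets)


# Made-up predictions and targets for illustration.
predicted = ["[IN:A 0 ]", "[IN:A 0 [SL:B 0 ] ]"]
expected = ["[IN:A 0 ]", "[IN:A 0 [SL:C 0 ] ]"]
print(frame_accuracy(predicted, expected))
```

A single wrong slot label makes the whole frame count as incorrect, so this measure is stricter than per-token accuracy.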

# Using the model at runtime

The exported TorchScript model can be used for efficient runtime inference.  We demonstrate the API in Python here, but the file can also be loaded in [Java](https://pytorch.org/javadoc/) and [C++](https://pytorch.org/cppdocs/).

In [0]:
import torch
scripted_model = torch.jit.load("/content/pytext_seq2seq_top.pt1")

In [0]:
# To use the exported model, we need to manually add begin/end of sequence
# markers.
BOS = "__BEGIN_OF_SENTENCE__"
EOS = "__END_OF_SENTENCE__"
scripted_model(f"{BOS} what is the shortest way home {EOS}".split())
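
Because the exported model expects the markers to be present already, it can be convenient to wrap the step above in a small helper.  The sketch below is one way to do that; `to_model_input` is a hypothetical name, and whitespace tokenization is an assumption that mirrors the `.split()` call in the cell above.

```python
BOS = "__BEGIN_OF_SENTENCE__"
EOS = "__END_OF_SENTENCE__"


def to_model_input(utterance):
    """Split on whitespace and wrap the tokens with the BOS/EOS markers."""
    return [BOS, *utterance.split(), EOS]


tokens = to_model_input("what is the shortest way home")
print(tokens)
```

The resulting list can be passed to the loaded model in place of the inline f-string, for example `scripted_model(to_model_input("what is the shortest way home"))`.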