[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/facebookresearch/fairo/blob/master/tutorials/semantic_parser_onboarding.ipynb)

# Semantic Parser Onboarding

The **semantic parser** is a seq-to-seq model built on the Hugging Face Transformers library. The input to the parser is a chat command, e.g. "build a red cube". The output is a linearized parse tree (see the [Action Dictionary Spec Doc](https://github.com/facebookresearch/fairo/blob/main/base_agent/documents/Action_Dictionary_Spec.md) for the grammar specification).

The encoder uses a pretrained DistilBERT model, followed by a highway transformation. For the default model, encoder parameters are frozen during training. The decoder is a 6-layer Transformer with a **Language Modeling** head, **span** beginning and end heads, and **text span** beginning and end heads. The Language Modeling head predicts the next node in the linearized tree. The span heads predict the span range, which provides the value for the span node. For more details, see the [CraftAssist paper](https://www.aclweb.org/anthology/2020.acl-main.427.pdf).
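
For reference, a highway layer in its standard form mixes a nonlinear transform of its input with the input itself through a learned gate. The formula below is a sketch of that general form. Whether the parser's highway transformation uses exactly this parameterization is an assumption, and the symbols $x$, $H$, $T$, $W_T$, and $b_T$ are introduced here for illustration only.

$$
y = T(x) \odot H(x) + \bigl(1 - T(x)\bigr) \odot x, \qquad T(x) = \sigma(W_T x + b_T)
$$

Here $x$ would be the DistilBERT output for one token, $H$ a learned nonlinear transform, $T$ the gate with values in $(0, 1)$, and $\odot$ elementwise multiplication. When the gate is near zero the layer passes the pretrained representation through almost unchanged.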

This tutorial covers the end-to-end process of training a semantic parser model and using it in the CraftAssist agent:

* Generating and preparing datasets
* Training models
* Evaluating models
* Using models in the agent


## Set Up

### Downloading Pre-Trained Models and Datasets

When you run the CraftAssist agent for the first time, the pre-trained models and data files required by the project are downloaded automatically from S3.

```
cd ~/minecraft/craftassist
python ./agent/craftassist_agent.py
```

You can also do this manually:

```
cd ~/minecraft
./tools/data_scripts/try_download.sh
```

This script checks your local paths `craftassist/agent/models` and `craftassist/agent/datasets` for updates, and downloads the files from S3 if your local copies are missing or outdated. Running it manually is optional, since the agent performs the same download on first launch.

### Conda Env

You may need to upgrade or downgrade your PyTorch and CUDA versions to match your GPU driver.

For a list of compatible PyTorch and CUDA versions, see: https://pytorch.org/get-started/previous-versions/

## Datasets

The datasets we use to train the semantic parsing models consist of:
* **Templated**: This file has 800K (dialogue, action dictionary) pairs generated using our generation script.
    * **Templated Modify**: This file has 100K (dialogue, action dictionary) pairs generated in the same way as `templated.txt`, except that it covers modify-type commands, e.g. "make this hole larger".
* **Annotated**: This file contains 7K (dialogue, action dictionary) pairs. These are human-labelled examples obtained from crowd-sourced tasks and in-game interactions.

See the CraftAssist paper for more information on how datasets are collected.

We provide all the dialogue datasets we use in the CraftAssist project in a public S3 folder: 
https://craftassist.s3-us-west-2.amazonaws.com/pubr/dialogue_data.tar.gz

In addition to the datasets used to train the model, this folder also contains greetings and short commands that the agent queries during gameplay.

### Generating Datasets

This section describes how to use our tools to generate and process training data.

To generate some templated data to train the model on, run ``generate_dialogue.py``. This script generates language commands and their corresponding logical forms using heuristic rules and publicly available dialogue datasets. 

Provide the number of examples you want to generate, e.g. for 500K examples:

In [None]:
# Each `!` line runs in its own subshell, so `cd` must be chained with the command it applies to.
! cd ~/minecraft/base_agent/ttad/generation_dialogues && python generate_dialogue.py -n 500000 > generated_dialogues.txt

This creates a text file. We next pre-process the data into the format required by the training script:

In [None]:
! cd ~/minecraft/base_agent/ttad/ttad_transformer_model/ && python ~/droidlet/tools/nsp_scripts/data_processing_scripts/preprocess_templated.py \
--raw_data_path ../generation_dialogues/generated_dialogues.txt \
--output_path [OUTPUT_PATH (file must be named templated.txt)]

The format of each row is:
```
[TEXT]|[ACTION DICTIONARY]
```
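
To make the row format concrete, here is a minimal sketch of reading one row back into Python. It assumes the action dictionary is serialized as JSON and that the command text itself contains no `|` character. Neither is confirmed by this tutorial, and `parse_row` is a hypothetical helper, not part of the repository.

```python
import json


def parse_row(row: str):
    """Split one '[TEXT]|[ACTION DICTIONARY]' row into (text, action_dict)."""
    # Split on the first "|" only, so any "|" inside the dictionary is preserved.
    text, _, raw_dict = row.rstrip("\n").partition("|")
    # Assumption: the action dictionary part is valid JSON.
    return text, json.loads(raw_dict)


example = (
    'build a house|{"dialogue_type": "HUMAN_GIVE_COMMAND", '
    '"action_sequence": [{"action_type": "BUILD", '
    '"schematic": {"has_name": [0, [2, 2]], "text_span": [0, [2, 2]]}}]}'
)
text, action_dict = parse_row(example)
print(text)
print(action_dict["action_sequence"][0]["action_type"])
```

Checking a few rows this way before training is a cheap way to catch a malformed output file.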

To create train/test/valid splits of the data, run:

In [None]:
! python ~/droidlet/tools/nsp_scripts/data_processing_scripts/create_annotated_split.py \
--raw_data_path [PATH_TO_DATA_DIR] \
--output_path [PATH_TO_SPLIT_FOLDERS] \
--filename "templated.txt" \
--split_ratio "0.7:0.2:0.1"


To split the annotated data as well, run the same command with `--filename "annotated.txt"`.
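
As a rough illustration of what a `0.7:0.2:0.1` split means, the sketch below shuffles a list of rows and slices it by ratio. It assumes the three numbers are in train:test:valid order, matching the sentence above. `split_rows` and its `seed` parameter are hypothetical and do not reproduce the logic of `create_annotated_split.py`.

```python
import random


def split_rows(rows, split_ratio="0.7:0.2:0.1", seed=0):
    """Shuffle rows and slice them into (train, test, valid) lists."""
    # Assumption: the ratio string is ordered train:test:valid.
    train_frac, test_frac, _ = (float(part) for part in split_ratio.split(":"))
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(train_frac * len(shuffled))
    n_test = round(test_frac * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]  # whatever remains goes to validation
    return train, test, valid


rows = [f"command {i}|{{}}" for i in range(100)]
train, test, valid = split_rows(rows)
print(len(train), len(test), len(valid))
```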

## Training Models

We are now ready to train the model:

In [None]:
! cd ~/minecraft && python base_agent/ttad/ttad_transformer_model/train_model.py \
--data_dir craftassist/agent/models/ttad_bert_updated/annotated_data/ \
--dtype_samples '[["templated", 0.35], ["templated_modify", 0.05], ["annotated", 0.6]]' \
--tree_voc_file craftassist/agent/models/ttad_bert_updated/models/caip_test_model_tree.json \
--output_dir $CHECKPOINT_PATH

Feel free to experiment with the model parameters. Note that ``dtype_samples`` sets the sampling proportions of the different data types. ``templated`` and ``templated_modify`` are generated using the ``generate_dialogue`` script as described above, whereas ``annotated`` is obtained from human labellers.
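
To see what these proportions imply, the sketch below parses the same ``dtype_samples`` string and draws data types according to the weights. It assumes the proportions act as per-example sampling probabilities. How ``train_model.py`` actually applies them is not shown in this tutorial.

```python
import json
import random

# Same value as passed to --dtype_samples above.
dtype_samples = json.loads(
    '[["templated", 0.35], ["templated_modify", 0.05], ["annotated", 0.6]]'
)
names = [name for name, _ in dtype_samples]
weights = [weight for _, weight in dtype_samples]

# Assumption: each training example's data type is drawn with these weights.
rng = random.Random(0)
draws = rng.choices(names, weights=weights, k=10000)
counts = {name: draws.count(name) for name in names}
print(counts)
```

Under this reading, the 7K annotated examples are sampled far more often per example than the much larger templated set.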

With a single NVIDIA Quadro GP100 GPU, one training epoch typically takes about 30 minutes.

The models and tree vocabulary files are saved under ``$CHECKPOINT_PATH``, along with a log that contains training and validation accuracies after every epoch. Once you're done, you can choose which epoch you want the parameters for, and use that model.

To use the best model in the agent, copy its parameters into the agent's models directory:

In [None]:
! cp $PATH_TO_BEST_CHECKPOINT_MODEL ~/minecraft/craftassist/agent/models/caip_test_model.pth

## Testing Models

During training, validation accuracy after every epoch is calculated and logged. You can access the log file in the output directory, where the checkpointed models are also saved.

You can test the model using our inference script:

In [None]:
! python3 -i ~/droidlet/tools/nsp_scripts/data_processing_scripts/test_model_script.py
>>> get_beam_tree("build a house")

This outputs the logical form for the command:

In [2]:
from pprint import pprint

pprint({'dialogue_type': 'HUMAN_GIVE_COMMAND', 'action_sequence': [{'action_type': 'BUILD', 'schematic': {'has_name': [0, [2, 2]], 'text_span': [0, [2, 2]]}}]})
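
The span values in this output can be mapped back to words of the command. The sketch below assumes a span has the form `[sentence_index, [first_word, last_word]]` with inclusive, whitespace-delimited word indices. `resolve_span` is a hypothetical helper written for illustration, not a function from the repository.

```python
def resolve_span(chat_sentences, span):
    """Return the words of a chat that a span points to."""
    # Assumption: span is [sentence_index, [first_word, last_word]], both inclusive.
    sentence_index, (start, end) = span
    words = chat_sentences[sentence_index].split()
    return " ".join(words[start:end + 1])


logical_form = {
    "dialogue_type": "HUMAN_GIVE_COMMAND",
    "action_sequence": [
        {
            "action_type": "BUILD",
            "schematic": {"has_name": [0, [2, 2]], "text_span": [0, [2, 2]]},
        }
    ],
}
schematic = logical_form["action_sequence"][0]["schematic"]
print(resolve_span(["build a house"], schematic["has_name"]))
```

Under these assumptions, `[0, [2, 2]]` selects the third word of the first sentence.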

To calculate accuracy on a test dataset, e.g. annotated:

In [None]:
>>> model_trainer = ModelTrainer(args)
>>> full_tree_voc = (full_tree, tree_i2w)
>>> model_trainer.eval_model_on_dataset(encoder_decoder, "annotated", full_tree_voc, tokenizer)

## Using Models in the Agent

You can now use this model to run the agents. Some command-line parameters to note:

`--dev`: Disables automatic model/dataset downloads.

`--ground_truth_data_dir`: Path to folder of ground truth short commands and templated commands. When given a command, the agent first queries this set for an exact match. If it exists, the agent returns the action dictionary from ground truth. Otherwise, the agent queries the semantic parsing model. Defaults to `~/minecraft/craftassist/agent/datasets/ground_truth/`. You can write your own templated examples and add them to `~/minecraft/craftassist/agent/datasets/ground_truth/datasets/`.

`--nsp_models_dir`: Path to binarized models and vocabulary files. Defaults to `~/minecraft/craftassist/agent/models/semantic_parser/`.

`--nsp_data_dir`: Path to semantic parser datasets. Defaults to `~/minecraft/craftassist/agent/datasets/annotated_data/`.
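
The lookup order described for `--ground_truth_data_dir` can be sketched as follows. This is a simplified illustration, not the agent's code: `parse_command`, the in-memory `ground_truth` dictionary, and `stub_model` are all hypothetical stand-ins.

```python
def parse_command(chat, ground_truth, query_model):
    """Return the ground-truth parse on an exact match, else ask the model."""
    if chat in ground_truth:
        return ground_truth[chat]
    return query_model(chat)


# Hypothetical ground-truth table, keyed by the exact command text.
ground_truth = {
    "build a house": {
        "dialogue_type": "HUMAN_GIVE_COMMAND",
        "action_sequence": [
            {
                "action_type": "BUILD",
                "schematic": {"has_name": [0, [2, 2]], "text_span": [0, [2, 2]]},
            }
        ],
    }
}


def stub_model(chat):
    # Placeholder for the semantic parsing model; returns a marker, not a real parse.
    return {"source": "model", "chat": chat}


print(parse_command("build a house", ground_truth, stub_model)["dialogue_type"])
print(parse_command("dig a hole", ground_truth, stub_model)["source"])
```

Because only exact matches hit the ground-truth set, any rephrasing of a command falls through to the model.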

You can now plug your own parsing models into the craftassist or locobot agents.