[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/facebookresearch/droidlet/blob/master/tutorials/semantic_parser_onboarding.ipynb)

# Semantic Parser Onboarding

The **semantic parser** is a seq-to-seq model built on the Hugging Face Transformers library. The input to the parser is a chat command, e.g. "build a red cube". The output is a linearized parse tree (see the [Action Dictionary Spec Doc](https://github.com/fairinternal/minecraft/blob/master/craftassist/agent/documents/Action_Dictionary_Spec.md) for the grammar specification).

The encoder uses a pretrained DistilBERT model, followed by a highway transformation. For the default model, encoder parameters are frozen during training. The decoder is a 6-layer Transformer with a **Language Modeling** head, **span** beginning and end heads, and **text span** beginning and end heads. The Language Modeling head predicts the next node in the linearized tree. The span heads predict the span range, which provides the value for the span node. For more details, see the [CraftAssist paper](https://www.aclweb.org/anthology/2020.acl-main.427.pdf).
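To make the "highway transformation" mentioned above concrete, here is a minimal sketch of a generic highway layer in plain Python. It uses the standard formulation `y = t * H(x) + (1 - t) * x` with a sigmoid gate `t`. The choice of ReLU for `H`, the function names, and the weight shapes are assumptions for illustration, not the exact layer used in this project.

```python
import math


def _affine(weights, bias, x):
    """Return weights @ x + bias for a square weight matrix given as nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway(x, w_h, b_h, w_t, b_t):
    """Generic highway layer (assumed form): y = t * relu(W_h x + b_h) + (1 - t) * x."""
    h = [max(0.0, v) for v in _affine(w_h, b_h, x)]                  # candidate transform
    t = [1.0 / (1.0 + math.exp(-v)) for v in _affine(w_t, b_t, x)]   # sigmoid gate
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# A strongly negative gate bias closes the gate, so the input is carried through.
carried = highway(x, zeros, [0.0, 0.0], zeros, [-20.0, -20.0])
# A zero gate bias gives t = 0.5, mixing the (zero) transform and the input equally.
half = highway(x, zeros, [0.0, 0.0], zeros, [0.0, 0.0])
print(carried, half)
```

The gate lets the layer interpolate between transforming the encoder output and passing it through unchanged.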

This tutorial covers the end-to-end process of training a semantic parser model and using it in the CraftAssist agent:

* Generating and preparing datasets
* Training models
* Evaluating models
* Using models in the agent


## Set Up

### Downloading Pre-Trained Models and Datasets

When you run the CraftAssist agent for the first time, the pre-trained models and data files required by the project are downloaded automatically from S3.

```
cd ~/minecraft/craftassist
python ./agent/craftassist_agent.py
```

You can also do this manually:

```
cd ~/minecraft
./tools/data_scripts/compare_directory_hash.sh
```

This script checks your local paths `craftassist/agent/models` and `craftassist/agent/datasets` for updates, and downloads the files from S3 if your local copies are missing or outdated. Running it manually is optional, since the agent performs the same download on its first run.

### Conda Env

First, make sure you have set up your conda environment as per the [README instructions](https://github.com/fairinternal/minecraft/blob/master/README.md).

Depending on your GPU driver version, you may need to downgrade your pytorch and CUDA versions. As of this writing, FAIR machines have NVIDIA driver version 10010 installed, which is compatible with pytorch 1.5.1 and cudatoolkit 10.1. To update your conda env with these versions, run:
```
conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch
```

For a list of pytorch and CUDA compatible versions, see: https://pytorch.org/get-started/previous-versions/
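After installing, it can help to confirm which versions your environment actually picked up. The snippet below is a small sketch that reports the installed pytorch version, the CUDA version it was built against, and whether a GPU is visible. The helper name `describe_torch_install` is made up for this example; it returns `None` if pytorch is not installed.

```python
def describe_torch_install():
    """Return version info for the local pytorch install, or None if it is missing."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,                    # e.g. the pinned 1.5.1
        "cuda": torch.version.cuda,                    # CUDA version pytorch was built with
        "cuda_available": torch.cuda.is_available(),   # True if a usable GPU is visible
    }


info = describe_torch_install()
print(info)
```

If `cuda_available` is `False` on a GPU machine, a driver and cudatoolkit mismatch is one likely cause.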

## Datasets

The datasets we use to train the semantic parsing models consist of:
* **Templated**: This file has 800K (dialogue, action dictionary) pairs generated using our generation script.
    * **Templated Modify**: This file has 100K (dialogue, action dictionary) pairs generated in the same way as templated.txt, except that it covers modify-type commands, e.g. "make this hole larger".
* **Annotated**: This file contains 7K (dialogue, action dictionary) pairs. These are human-labeled examples obtained from crowdsourced tasks and in-game interactions.

See the CraftAssist paper for more information on how datasets are collected.

We provide all the dialogue datasets we use in the CraftAssist project in a public S3 folder: 
https://craftassist.s3-us-west-2.amazonaws.com/pubr/dialogue_data.tar.gz

In addition to the datasets used to train the model, this folder also contains greetings and short commands that the agent queries during gameplay.
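If you prefer to fetch the archive from Python rather than a browser, the following sketch downloads and unpacks it using only the standard library. The helper names `download` and `extract` and the destination paths are assumptions made for this example; only the URL comes from the text above.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "https://craftassist.s3-us-west-2.amazonaws.com/pubr/dialogue_data.tar.gz"


def download(url, archive_path):
    """Download `url` to `archive_path` and return the path."""
    urllib.request.urlretrieve(url, archive_path)
    return Path(archive_path)


def extract(archive_path, dest_dir):
    """Unpack a .tar.gz archive into `dest_dir` and return the top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

A typical call would be `extract(download(DATA_URL, "dialogue_data.tar.gz"), "dialogue_data")`, where both local paths are placeholders you can change.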

### Generating Datasets

This section describes how to use our tools to generate and process training data.

To generate templated data to train the model on, run this script with the number of examples you want to generate, e.g. 500K examples:



In [None]:
# Use %cd so the directory change persists; each `!` command runs in its own subshell.
%cd ~/minecraft/base_agent/ttad/generation_dialogues
! python generate_dialogue.py -n 500000 > generated_dialogues.txt



This creates a text file. We next pre-process the data into the format required by the training script:

In [None]:
# Use %cd so the directory change persists; each `!` command runs in its own subshell.
%cd ../ttad_transformer_model/
! python data_scripts/preprocess_templated.py \
--raw_data_path ../generation_dialogues/generated_dialogues.txt \
--output_path [OUTPUT_PATH (file must be named templated.txt)]



The format of each row is:
```
[TEXT]|[ACTION DICTIONARY]
```
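As an illustration of this row format, the sketch below splits a row on the first `|` and decodes the right-hand side. It assumes the action dictionary is serialized as JSON, which this tutorial does not state explicitly. The helper name `parse_row` and the dictionary contents in the example are made-up placeholders, not a valid parse from the grammar.

```python
import json


def parse_row(row):
    """Split a '[TEXT]|[ACTION DICTIONARY]' row into (text, dict).

    Assumes the action dictionary is JSON; adjust if your files differ.
    """
    text, sep, action_dict = row.rstrip("\n").partition("|")
    if not sep:
        raise ValueError(f"expected '[TEXT]|[ACTION DICTIONARY]', got: {row!r}")
    return text, json.loads(action_dict)


# Placeholder dictionary for illustration only, not a real action dictionary.
example = 'build a red cube|{"example_key": "example_value"}'
text, tree = parse_row(example)
print(text)
print(tree)
```

Splitting on the first `|` only keeps any `|` characters inside the dictionary intact.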

To create train/test/valid splits of the data, run:


In [None]:
! python data_scripts/create_annotated_split.py \
--raw_data_path [PATH_TO_DATA_DIR] \
--output_path [PATH_TO_SPLIT_FOLDERS] \
--filename "templated.txt" \
--split_ratio "0.7:0.2:0.1"




To create a split of annotated data too, simply run the above, but with filename "annotated.txt".
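To show what a ratio string such as `"0.7:0.2:0.1"` means in practice, here is a minimal sketch of a shuffled three-way split. It is not the logic of `create_annotated_split.py`; the function name `split_rows`, the fixed seed, and the rounding behavior are assumptions, and the parts are returned in the same order as the ratio, so check the script to see which position maps to train, test, and valid.

```python
import random


def split_rows(rows, ratio="0.7:0.2:0.1", seed=0):
    """Shuffle rows and cut them into three parts, in the order given by `ratio`."""
    fractions = [float(part) for part in ratio.split(":")]
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError(f"ratio must be three fractions summing to 1, got {ratio!r}")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    first_end = round(len(shuffled) * fractions[0])
    second_end = first_end + round(len(shuffled) * fractions[1])
    # The last part takes whatever remains, so no row is dropped.
    return shuffled[:first_end], shuffled[first_end:second_end], shuffled[second_end:]


rows = [f"row {i}" for i in range(100)]
parts = split_rows(rows)
print([len(part) for part in parts])
```

Shuffling before cutting avoids splits that mirror the order in which examples were generated.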

We are now ready to train the model:

In [None]:
# Use %cd so the directory change persists; each `!` command runs in its own subshell.
%cd ~/minecraft
! python base_agent/ttad/ttad_transformer_model/train_model.py \
--data_dir craftassist/agent/models/ttad_bert_updated/annotated_data/ \
--dtype_samples '[["templated", 0.35], ["templated_modify", 0.05], ["annotated", 0.6]]' \
--tree_voc_file craftassist/agent/models/ttad_bert_updated/models/caip_test_model_tree.json \
--output_dir $CHECKPOINT_PATH



Feel free to experiment with the model parameters. The models and tree vocabulary files are saved under `$CHECKPOINT_PATH`, along with a log that records training and validation accuracy after every epoch. Once training is done, pick the epoch whose parameters you want and use that checkpoint.
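The `--dtype_samples` argument in the training command above is a JSON list of `[dataset type, weight]` pairs. One plausible reading, assumed here rather than taken from `train_model.py`, is that each weight is the probability of drawing a training example from that dataset type. The sketch below simulates that interpretation; the function name `sample_dataset_types` is made up for this example.

```python
import json
import random
from collections import Counter

DTYPE_SAMPLES = '[["templated", 0.35], ["templated_modify", 0.05], ["annotated", 0.6]]'


def sample_dataset_types(dtype_samples, n, seed=0):
    """Draw n dataset types with probability proportional to their weights."""
    pairs = json.loads(dtype_samples)
    names = [name for name, _ in pairs]
    weights = [weight for _, weight in pairs]
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_dataset_types(DTYPE_SAMPLES, n=10000)
print(counts)
```

Under this reading, raising the weight of `annotated` makes the model see human-labeled examples more often, even though that file is much smaller than the templated ones.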

Copy the parameters of the best model into the agent's models directory:

In [None]:
! cp $PATH_TO_BEST_CHECKPOINT_MODEL craftassist/agent/models/caip_test_model.pth



## Testing Models

During training, validation accuracy after every epoch is calculated and logged. You can access the log file in the output directory, where the checkpointed models are also saved.

To calculate accuracy on the test set,

In [None]:
# eval_model code

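You can test the model interactively using our inference script. The `-i` flag keeps the Python interpreter open after the script finishes, so run the command from a terminal (without the leading `!`) and call `get_beam_tree` at the resulting prompt: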



In [None]:
! python3 -i base_agent/ttad/ttad_transformer_model/test_model_script.py

In [None]:
get_beam_tree("dig a hole")


In [None]:
## how to load into agent, ground truth

You can now use that model to run the agent.