[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/facebookresearch/fairo/blob/master/tutorials/make_new_dataset.ipynb)

# Datasets for Semantic Parsing Model
The semantic parsing model converts plain text to a nested action dictionary that is partially executable. 

The training dataset for this model was collected through a crowdsourcing procedure that we describe in detail here: https://www.aclweb.org/anthology/2020.acl-main.427.pdf

For more details on the existing dataset used to train the semantic parsing model, see: https://colab.research.google.com/github/facebookresearch/fairo/blob/master/tutorials/semantic_parser_onboarding.ipynb



## Creating your own dataset
If you want to create your own dataset, follow this high-level procedure:

1. Come up with new natural language command examples.
2. Get annotations for these commands.
3. Create the training data in the format explained here: 
https://colab.research.google.com/github/facebookresearch/fairo/blob/master/tutorials/semantic_parser_onboarding.ipynb

### Getting annotations for new examples

We designed and open-sourced a multi-step, web-based tool that asks users a series of multiple-choice questions to determine the semantic content of a sentence. The responses to some questions prompt other, more specific questions, in a process that mirrors the hierarchical structure of the grammar (described here: https://github.com/facebookresearch/fairo/blob/main/base_agent/documents/Action_Dictionary_Spec.md).

These tools are available here: https://github.com/facebookresearch/fairo/tree/main/tools/annotation_tools/text_to_tree_tool


#### **Annotating the data**

If the new commands in your dataset are supported by our grammar (defined here: https://github.com/facebookresearch/fairo/blob/main/base_agent/documents/Action_Dictionary_Spec.md), then you can directly use the tools above with Mechanical Turk to get final annotations.


However, if some commands in your dataset do not fall under the existing actions (Build, Destroy, Dig, etc.) or dialogue types (human_give_command, get_memory, put_memory), or are otherwise new additions to what we released, then the existing tool and grammar will need to be extended.

The proposed way to extend them, together with an intuitive walkthrough of the annotation process, is as follows:


1. Example command: **"touch the green chair"**

  1. What kind of dialogue is it? The human is giving a command, so `"dialogue_type" : "HUMAN_GIVE_COMMAND"`
  2. Given the dialogue type, what are the other semantic details of the sentence?
    - Are there multiple actions? No, just a single action, so the `"action_sequence"` list will have one dictionary in it.
    - Is there a specific kind of action? Can any of the existing actions be reused? No, we need a new `"action_type" : "TOUCH"`
    - What needs to be touched? `"reference_object" : {"text_span" : "the green chair"}`
      - More details about the object (these go under `"filters"`):
        - `"has_name" : "chair"` and `"has_colour": "green"`
    - Does the action need to be repeated? No.

  Based on the above, the full dictionary looks like:
  ```
  { "dialogue_type": "HUMAN_GIVE_COMMAND",
    "action_sequence" : [
    { "action_type" :  "TOUCH",
      "reference_object": {
          "filters" : {
            "has_name" : [0, [3, 3]],
            "has_colour": [0, [2, 2]]
          }
      }
    }]
  }
  ```
  where each value for the keys `has_name` and `has_colour` has the form `[sentence_index, [start, end]]`: the 0-indexed positions of the first and last word of the span in the 0th (most recent) sentence in the dialogue.
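
The span values above can be computed rather than counted by hand. The following is a minimal sketch, assuming words are separated by whitespace and indexed from 0 as in the example; `span_for` is a hypothetical helper written for illustration and is not part of the annotation tools.

```python
def span_for(sentence, phrase, sentence_index=0):
    """Return [sentence_index, [start, end]] for the first match of phrase.

    Assumption: words are split on whitespace and indexed from 0,
    and the end index is inclusive.
    """
    words = sentence.split()
    target = phrase.split()
    if not target:
        raise ValueError("phrase must contain at least one word")
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            return [sentence_index, [start, start + len(target) - 1]]
    raise ValueError(f"{phrase!r} not found in {sentence!r}")


command = "touch the green chair"
print(span_for(command, "chair"))  # [0, [3, 3]]
print(span_for(command, "green"))  # [0, [2, 2]]
```

Only the first occurrence is returned, so a command that repeats a word would still need a manual check.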



2. Example command: **"dig three tiny holes there"**
  1. What kind of dialogue is it? The human is giving a command, so `"dialogue_type" : "HUMAN_GIVE_COMMAND"`
  2. Given the dialogue type, what are the other semantic details of the sentence?
    - Are there multiple actions? No, just a single action, so the `"action_sequence"` list will have one dictionary in it.
    - Is there a specific kind of action? Can any of the existing actions be reused? Yes, the existing `"action_type" : "DIG"`
    - What needs to be dug? `"schematic" : {"text_span" : "three tiny holes"}`
      - More details about the thing that needs to be dug:
        - `"has_name" : "holes"` and `"has_size": "tiny"`
    - Is there a location where the holes will be dug? Yes:
      - More details about the location:
        - The location is given only by a reference to context (`there`), so `"location" : {"contains_coreference" : "yes"}`
    - Does the action need to be repeated? Yes, three times: `"repeat" : {"repeat_key" : "FOR", "repeat_count" : "three"}`

  Based on the above, the full dictionary looks like:
  ```
  { "dialogue_type": "HUMAN_GIVE_COMMAND",
    "action_sequence" : [
    { "action_type" :  "DIG",
      "schematic": {
          "has_name" : [0, [3, 3]],
          "has_size": [0, [2, 2]]
          },
      "location" : {
        "contains_coreference" : "yes" 
      },
      "repeat" : {
        "repeat_key" : "FOR",
        "repeat_count" : [0, [1, 1]]
      }
    }]
  }
  ```
  where the values for the keys `has_name`, `has_size` and `repeat_count` are the word spans (first and last word index, counted from 0) in the 0th (most recent) sentence in the dialogue.
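
To sanity-check an annotation like the one above, it can help to turn the spans back into words. The sketch below assumes every span is shaped like `[sentence_index, [start, end]]` with an inclusive end index and whitespace-separated words; `is_span` and `resolve_spans` are hypothetical helpers, not functions from the released code.

```python
def is_span(value):
    """True for values shaped like [sentence_index, [start, end]]."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], int)
        and isinstance(value[1], list)
        and len(value[1]) == 2
        and all(isinstance(i, int) for i in value[1])
    )


def resolve_spans(node, sentences):
    """Return a copy of node with every span replaced by the words it covers."""
    if is_span(node):
        sentence_index, (start, end) = node
        return " ".join(sentences[sentence_index].split()[start:end + 1])
    if isinstance(node, dict):
        return {key: resolve_spans(value, sentences) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_spans(item, sentences) for item in node]
    return node


# sentences[0] is the most recent sentence in the dialogue.
sentences = ["dig three tiny holes there"]
annotation = {
    "dialogue_type": "HUMAN_GIVE_COMMAND",
    "action_sequence": [{
        "action_type": "DIG",
        "schematic": {"has_name": [0, [3, 3]], "has_size": [0, [2, 2]]},
        "location": {"contains_coreference": "yes"},
        "repeat": {"repeat_key": "FOR", "repeat_count": [0, [1, 1]]},
    }],
}

action = resolve_spans(annotation, sentences)["action_sequence"][0]
print(action["schematic"])  # {'has_name': 'holes', 'has_size': 'tiny'}
print(action["repeat"])     # {'repeat_key': 'FOR', 'repeat_count': 'three'}
```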

3. Example command: **"what colour do you think is that ?"**
  1. What kind of dialogue is it? The human is asking a question about something in the environment, so `"dialogue_type" : "GET_MEMORY"`
  2. Given the dialogue type, what are the other semantic details of the sentence?
    - What is the question about? This is denoted by the `filters` key in the dictionary. The question is about a reference object in the agent's memory, so `"memory_type": "REFERENCE_OBJECT"`
      - More details about the reference object:
        - It is referred to with a demonstrative pronoun (`that`) rather than a name, so `"contains_coreference" : "yes"`
    - Is the question about a specific property of the object, or is it a yes/no question? It asks for the colour, so the expected output is `"output": {"attribute": "has_colour"}`

  Based on the above, the full dictionary looks like:
  ```
  {"dialogue_type": "GET_MEMORY", 
  "filters": {
    "memory_type": "REFERENCE_OBJECT", 
    "contains_coreference": "yes", 
    "output": {"attribute": "has_colour"}}}
  ```

4. Example command: **"is this a house ?"**
  1. What kind of dialogue is it? The human is asking a question about something in the environment, so `"dialogue_type" : "GET_MEMORY"`
  2. Given the dialogue type, what are the other semantic details of the sentence?
    - What is the question about? This is denoted by the `filters` key in the dictionary. The question is about a reference object in the agent's memory, so `"memory_type": "REFERENCE_OBJECT"`
      - More details about the reference object:
        - It has a name, "house", so `"has_name": "house"`
        - It is referred to with a demonstrative pronoun (`this`), so `"contains_coreference" : "yes"`
    - Is the question about a specific property of the object, or is it a yes/no question? It is a yes/no confirmation question, so the output reports whether a matching memory exists: `"output": "memory"`

  Based on the above, the full dictionary looks like:
  ```
  {"dialogue_type": "GET_MEMORY", 
  "filters": {
    "memory_type": "REFERENCE_OBJECT", 
    "has_name": [0, [3, 3]], 
    "contains_coreference": "yes", 
    "output": "memory"
    }
  }
  ```
  where the value for the key `has_name` is the span of the word "house" (word index 3, counted from 0) in the 0th (most recent) sentence in the dialogue.
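
The two `GET_MEMORY` examples differ only in the reference-object details and in the `output` value. As an illustration of that pattern, here is a minimal sketch that assembles either dictionary from the answers to the questions; `get_memory_annotation` and its parameters are hypothetical names chosen for this example and do not exist in the annotation tools.

```python
def get_memory_annotation(attribute=None, name_span=None, coreference=False):
    """Build a GET_MEMORY dictionary about a reference object.

    attribute:   property being asked for (e.g. "has_colour"); None means a
                 yes/no question, which uses "output": "memory".
    name_span:   optional [sentence_index, [start, end]] span for has_name.
    coreference: True when the object is referred to by "this", "that", etc.
    """
    filters = {"memory_type": "REFERENCE_OBJECT"}
    if name_span is not None:
        filters["has_name"] = name_span
    if coreference:
        filters["contains_coreference"] = "yes"
    filters["output"] = {"attribute": attribute} if attribute else "memory"
    return {"dialogue_type": "GET_MEMORY", "filters": filters}


# "what colour do you think is that ?"
colour_question = get_memory_annotation(attribute="has_colour", coreference=True)

# "is this a house ?"
yes_no_question = get_memory_annotation(name_span=[0, [3, 3]], coreference=True)

print(colour_question["filters"]["output"])
print(yes_no_question["filters"]["output"])
```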

5. Example command: **"label that very gigantic thing as yellow"**
  1. What kind of dialogue is it? The human is giving a label to something in the bot's environment, so `"dialogue_type" : "PUT_MEMORY"`
  2. Given the dialogue type, what are the other semantic details of the sentence?
    - What is being tagged? This is denoted by the `filters` key in the dictionary. The label is about a reference object in the agent's memory, so `"memory_type": "REFERENCE_OBJECT"`
      - More details about the reference object:
        - It has a size, "very gigantic", so `"has_size": "very gigantic"`
        - It is referred to with a demonstrative (`that`), so `"contains_coreference" : "yes"`
    - What information is being given to the agent? The colour of the object needs to be tagged as "yellow", so update the memory with: `"memory_type": "TRIPLE", "has_colour": [0, [6, 6]]`

  Based on the above, the full dictionary looks like:
  ```
  {"dialogue_type": "PUT_MEMORY", 
  "filters": {
    "has_size": [0, [2, 3]], 
    "contains_coreference": "yes"}, 
  "upsert": {
    "memory_data": {
      "memory_type": "TRIPLE", 
      "has_colour": [0, [6, 6]]}}}
  ```
  where the values for the keys `has_size` and `has_colour` are the word spans (first and last word index, counted from 0) in the 0th (most recent) sentence in the dialogue.
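
Off-by-one mistakes in span indices are easy to make when annotating by hand. The following sketch checks that every span in an annotation fits inside its sentence, assuming spans are shaped like `[sentence_index, [start, end]]` with whitespace-separated, 0-indexed words; `iter_spans` and `check_spans` are hypothetical helpers written for this tutorial.

```python
def iter_spans(node, path=()):
    """Yield (path, span) for every [sentence_index, [start, end]] value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_spans(value, path + (key,))
    elif isinstance(node, list):
        looks_like_span = (
            len(node) == 2
            and isinstance(node[0], int)
            and isinstance(node[1], list)
            and len(node[1]) == 2
            and all(isinstance(i, int) for i in node[1])
        )
        if looks_like_span:
            yield path, node
        else:
            for index, item in enumerate(node):
                yield from iter_spans(item, path + (index,))


def check_spans(annotation, sentences):
    """Return a list of problems; an empty list means every span fits."""
    problems = []
    for path, (sentence_index, (start, end)) in iter_spans(annotation):
        if not 0 <= sentence_index < len(sentences):
            problems.append(f"{path}: no sentence {sentence_index}")
            continue
        n_words = len(sentences[sentence_index].split())
        if not 0 <= start <= end < n_words:
            problems.append(f"{path}: span [{start}, {end}] outside 0..{n_words - 1}")
    return problems


sentences = ["label that very gigantic thing as yellow"]
put_memory = {
    "dialogue_type": "PUT_MEMORY",
    "filters": {"has_size": [0, [2, 3]], "contains_coreference": "yes"},
    "upsert": {"memory_data": {"memory_type": "TRIPLE", "has_colour": [0, [6, 6]]}},
}
print(check_spans(put_memory, sentences))  # []
```

A span such as `[0, [7, 7]]` for `has_colour` would be reported, because the sentence only has words 0 through 6.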


