Open-Retrieval Conversational Machine Reading

OR-ShARC is an Open-Retrieval Conversational Machine Reading dataset that focuses on answering high-level questions from natural language rule texts. It is adapted from the ShARC dataset, with all initial questions rewritten into complete and unambiguous form (redistributed under CC BY-SA 3.0).

More details can be found in our technical report.

Dataset

id2snippet.json contains all rule texts in OR-ShARC as our knowledge base.

Each sample in train|dev|test has the following attributes (Changes to the original ShARC dataset are in bold):

utterance_id: Unique identification code for an instance in the dataset.
tree_id: A tree_id specifies a unique combination of a snippet and a question. Several instances can share the same tree_id, because the path of the conversation or the final answer can vary depending on the answer that a user provides to a follow-up question.
source_url: The URL of the document containing the rule snippet.
snippet: In ShARC, this is the input support document, typically a paragraph that contains some rules. We remove it in our dataset for the open-retrieval setting; refer to gole_snippet_id to find the gold snippet.
gole_snippet_id: The id of the gold snippet that this sample refers to in the knowledge base id2snippet.json.
question: Our rewriting of the original incomplete and ambiguous question in ShARC.
scenario: Describes the context of the question.
history: The conversation history, i.e. a set of follow-up questions and their corresponding answers.
evidence: A list of relevant information that the system should extract from the user's scenario. This information should not be included in the input.
answer: The desired output of a prediction model.
snippet_seen: For the dev and test sets only. Indicates whether the rule text (snippet) that this sample asks about was seen during training.
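
As a minimal sketch of how these attributes fit together, the example below resolves a sample's gold rule text through gole_snippet_id, groups instances that share a tree_id, and splits evaluation samples by snippet_seen. It assumes that id2snippet.json is a JSON object mapping snippet ids to rule texts and that snippet_seen is a boolean; neither is confirmed above. The inline records are made-up stand-ins for real data, and the helper names are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for the contents of id2snippet.json
# (assumed layout: snippet id -> rule text).
ID2SNIPPET_JSON = (
    '{"s1": "You can claim if you are over 18.",'
    ' "s2": "Loans are available to small businesses."}'
)

# Made-up samples carrying only the attributes used below.
SAMPLES = [
    {"utterance_id": "u1", "tree_id": "t1", "gole_snippet_id": "s1", "snippet_seen": True},
    {"utterance_id": "u2", "tree_id": "t1", "gole_snippet_id": "s1", "snippet_seen": True},
    {"utterance_id": "u3", "tree_id": "t2", "gole_snippet_id": "s2", "snippet_seen": False},
]


def gold_snippet(sample, id2snippet):
    """Return the gold rule text that a sample points to."""
    return id2snippet[sample["gole_snippet_id"]]


def group_by_tree(samples):
    """Group utterance ids by tree_id (one snippet/question pair, several paths)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["tree_id"]].append(sample["utterance_id"])
    return dict(groups)


def split_by_seen(samples):
    """Split dev/test samples into seen and unseen subsets (assumes a boolean flag)."""
    seen = [s for s in samples if s["snippet_seen"]]
    unseen = [s for s in samples if not s["snippet_seen"]]
    return seen, unseen


id2snippet = json.loads(ID2SNIPPET_JSON)
trees = group_by_tree(SAMPLES)
seen, unseen = split_by_seen(SAMPLES)
print(gold_snippet(SAMPLES[2], id2snippet))
print(trees)
print(len(seen), len(unseen))
```

With real data, ID2SNIPPET_JSON would be replaced by the contents of id2snippet.json and SAMPLES by the records of one split.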

Code

We release the implementation of MUDERN as well as baselines as follows:

Retriever

TF-IDF: ./retriever_tfidf/
DPR: ./retriever_DPR/
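
For readers new to the open-retrieval setting, the following is a minimal, pure-Python sketch of TF-IDF retrieval over a snippet knowledge base. It is not the code in ./retriever_tfidf/. The tokenizer, the smoothed IDF formula, the use of the question text alone as the query, and all function names are assumptions made for illustration, and the two snippets are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(id2snippet):
    """Compute smoothed IDF weights and one TF-IDF vector per snippet."""
    term_counts = {sid: Counter(tokenize(text)) for sid, text in id2snippet.items()}
    n_docs = len(term_counts)
    doc_freq = Counter(term for tf in term_counts.values() for term in tf)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {
        sid: {t: f * idf[t] for t, f in tf.items()} for sid, tf in term_counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, idf, vectors, k=1):
    """Return the ids of the k snippets most similar to the query."""
    tf = Counter(tokenize(query))
    q = {t: f * idf[t] for t, f in tf.items() if t in idf}  # drop unknown terms
    ranked = sorted(vectors, key=lambda sid: cosine(q, vectors[sid]), reverse=True)
    return ranked[:k]


# Invented knowledge base in the assumed id -> rule text layout.
id2snippet = {
    "s1": "You can claim child benefit if you are responsible for a child under 16.",
    "s2": "Small businesses can apply for a loan to buy equipment.",
}
idf, vectors = build_index(id2snippet)
print(retrieve("Can I get a loan for my business?", idf, vectors, k=1))
```

A retriever evaluated on OR-ShARC would compare the returned ids against each sample's gole_snippet_id.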

Reader

MUDERN: ./MUDERN/
MP-RoBERTa: ./reader_baseline/MP-RoBERTa/
DISCERN: ./reader_baseline/DISCERN/
Explicit Memory Tracker: ./reader_baseline/explicit_memory_tracker/
e3: ./reader_baseline/e3/

Acknowledgement

We use the implementations of E3, DPR, TF-IDF, DISCERN, and Explicit Memory Tracker in this work for baseline experiments.

Yifan-Gao/open_retrieval_conversational_machine_reading
