This is the GitHub repo for the ACL 2020 paper "WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge".
Environment: Python 3.6, PyTorch 1.1.
This repo includes the original Winograd Schema Challenge (WSC) dataset and 4095 WinoWhy reasons (15 for each WSC question) that could justify the pronoun coreference choices in WSC.
WinoWhy contains 3 sources of reasons: (1) Human; (2) Human Reverse; (3) Generation Model. Each WSC question has 5 reasons from each source.
Below are descriptions and examples of reasons from each source. The examples are based on the WSC question: "The city councilmen refused the demonstrators a permit because they feared violence. Does the 'they' refer to 'the city councilmen' or 'the demonstrators'?". Each reason completes the statement "The 'they' refers to the city councilmen because...". The paired question of this WSC replaces "feared" with "advocated".
Source | Description | Example |
---|---|---|
Human | Reasons provided by human annotators. | city councilmen are administrative so they are more likely to fear. |
Human Reverse | Human reasons for the paired WSC question. | the demonstrators were the ones who needed a permit. |
Generation Model | Reasons generated by GPT-2 for the same question. | they are under the command of Mayor James B. Gray. |
Based on the collected human reasons and a second round of annotation on their plausibility, valid reasons (those for which at least 80% of the annotators agree that the reason justifies the answer to the WSC question) are used to categorize the types of commonsense knowledge needed to solve each WSC question. The selected knowledge types are as follows (note that a question may require knowledge from multiple categories):
Name (# of questions) | Definition | Example |
---|---|---|
Property (32) | Knowledge about properties of objects. | ice is cold. |
Object (82) | Knowledge about objects. | cats have ears. |
Eventuality (88) | Knowledge about eventualities. | 'wake up' happens before 'open eyes'. |
Spatial (64) | Knowledge about spatial position. | object at the back can be blocked. |
Quantity (20) | Knowledge about numbers. | 2 is smaller than 10. |
Others (48) | All other knowledge. | NA |
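The 80% agreement criterion used to select valid reasons is simple arithmetic over annotator judgments. The sketch below assumes a hypothetical input of one boolean vote per annotator (True meaning "this reason justifies the answer"); the released files do not necessarily store raw votes in this form, and `agreement_ratio` and `is_valid_reason` are illustrative names, not functions from this repo.

```python
from typing import Sequence

# Criterion from this README: a reason is valid when at least 80% of
# annotators agree that it justifies the answer to the WSC question.
VALID_AGREEMENT = 0.8


def agreement_ratio(votes: Sequence[bool]) -> float:
    """Fraction of annotators who judged the reason a valid justification.

    `votes` is a hypothetical format: one bool per annotator.
    """
    if not votes:
        raise ValueError("need at least one annotator vote")
    return sum(votes) / len(votes)


def is_valid_reason(votes: Sequence[bool], threshold: float = VALID_AGREEMENT) -> bool:
    """True when the agreement ratio meets the threshold (80% by default)."""
    return agreement_ratio(votes) >= threshold


print(is_valid_reason([True, True, True, True, False]))  # 4/5 = 0.8 -> True
print(is_valid_reason([True, True, True, False, False]))  # 3/5 = 0.6 -> False
```

Note that the comparison is inclusive (`>=`), so exactly 4 of 5 annotators agreeing is enough.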
In general, WinoWhy provides interesting and broad-coverage reasons for the WSC questions. Human reasons for solving these pronoun coreference questions are creative: human annotators may answer a question by giving a specific definition of a concept in the question, a general and abstract explanation, or an indirect trick. GPT-2 reasons are usually valid English sentences but invalid justifications. However, the number of reasons may be limited, because WSC itself is a small dataset of carefully crafted questions. How to make careful use of the reasons is also worth studying, since it poses another challenge for commonsense understanding. We will keep working on improving the dataset quality.
There are two data files in the repo:
- winowhy.json: the WSC dataset and the corresponding WinoWhy reasons.
- cat_ref.json: the knowledge categories and the indexes of the corresponding WSC questions.
Format of winowhy.json: a list of 273 WSC questions. Each WSC question is a dictionary with the following keys:
- "text": a dictionary holding the original WSC text, with keys:
  - "txt1": the text before the pronoun;
  - "pron": the target pronoun;
  - "txt2": the text after the pronoun;
- "answers": a list of the candidate answer spans (strings);
- "correctAnswer": A or B;
- "source": the original WSC source;
- "reasons": a list of WinoWhy reasons, where each reason is itself a list:
  - reason[0]: the reason text;
  - reason[1]: the reason source (human, gpt, reverse);
  - reason[2]: the reason plausibility;
  - reason[3]: the reason label (Valid, Invalid, Undecided).
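A minimal sketch of reading winowhy.json with the standard library, based only on the fields listed above. The file path, the helper names, the whitespace handling when joining txt1, pron, and txt2, and the tolerance for values like "A." in correctAnswer are assumptions for illustration; none of this is taken from the repo's own code.

```python
import json
from typing import Any, Dict, List


def load_winowhy(path: str = "winowhy.json") -> List[Dict[str, Any]]:
    """Load the list of WSC questions (the default path is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def full_sentence(question: Dict[str, Any]) -> str:
    """Rebuild the WSC sentence as txt1 + pron + txt2.

    Parts are joined with single spaces and extra whitespace is collapsed;
    this spacing convention is an assumption about the stored strings.
    """
    t = question["text"]
    return " ".join(f"{t['txt1']} {t['pron']} {t['txt2']}".split())


def correct_answer(question: Dict[str, Any]) -> str:
    """Return the candidate span selected by correctAnswer (A or B).

    Stray spaces or a trailing period (e.g. "A.") are tolerated as a precaution.
    """
    key = question["correctAnswer"].strip().rstrip(".").upper()
    return question["answers"][{"A": 0, "B": 1}[key]].strip()


def reasons_by_source(question: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group reason texts (reason[0]) by their source (reason[1])."""
    groups: Dict[str, List[str]] = {}
    for reason in question["reasons"]:
        groups.setdefault(reason[1], []).append(reason[0])
    return groups
```

Since each question has 5 reasons from each of the three sources, `reasons_by_source` on a full question should return three groups of five reasons.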
Format of cat_ref.json: a dictionary whose keys are "Property", "Object", "Eventuality", "Spatial", "Quantity", and "Others"; each value is a list of the indexes of the WSC questions in that category.
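Because one question can belong to several categories, it is often handy to invert cat_ref.json into a per-question view. The sketch below assumes the stored indexes are positions in the winowhy.json list; whether they are 0- or 1-based is not stated here, so check before indexing. The function names are hypothetical.

```python
import json
from typing import Dict, List, Set


def load_cat_ref(path: str = "cat_ref.json") -> Dict[str, List[int]]:
    """Load the category -> question-index mapping (path is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def category_counts(cat_ref: Dict[str, List[int]]) -> Dict[str, int]:
    """Number of questions listed under each knowledge category."""
    return {category: len(indexes) for category, indexes in cat_ref.items()}


def categories_per_question(cat_ref: Dict[str, List[int]]) -> Dict[int, Set[str]]:
    """Invert category -> indexes into index -> set of categories.

    A question may need several knowledge types, so it can appear in
    more than one list; questions absent from every list are not keys.
    """
    inverted: Dict[int, Set[str]] = {}
    for category, indexes in cat_ref.items():
        for idx in indexes:
            inverted.setdefault(idx, set()).add(category)
    return inverted
```

Running `category_counts` on the released file is a quick way to compare against the question counts in the knowledge-type table (e.g. 32 for Property).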
To evaluate a model on WinoWhy, we can first connect the question and the reason into a single sentence by adding a few words between them (e.g., WSC question + " The 'they' refers to the city councilmen because " + reason). We can then feed the sentence into the model and take the returned probability as the prediction.
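The sentence construction described above can be sketched as follows. The connecting template follows the example in the text; `sentence_logprob` is a hypothetical stand-in for whatever language model scorer you use (for instance, summed GPT-2 token log-probabilities), and the threshold decision rule is an assumption rather than the paper's exact evaluation protocol.

```python
from typing import Callable


def build_input(sentence: str, pronoun: str, answer: str, reason: str) -> str:
    """Connect a WSC sentence and a reason with a short template, e.g.
    "<sentence> The 'they' refers to the city councilmen because <reason>".
    """
    return f"{sentence} The '{pronoun}' refers to {answer} because {reason}"


def score_reason(
    sentence: str,
    pronoun: str,
    answer: str,
    reason: str,
    sentence_logprob: Callable[[str], float],
) -> float:
    """Score the connected text with a language model.

    `sentence_logprob` is a hypothetical callable (text -> log-probability);
    plug in any LM scorer, such as summed GPT-2 token log-probabilities.
    """
    return sentence_logprob(build_input(sentence, pronoun, answer, reason))


def predict_valid(score: float, threshold: float) -> bool:
    """Assumed decision rule: treat the reason as valid above a tuned threshold."""
    return score > threshold
```

A summed log-probability tends to favour shorter reasons, so normalising by token count inside the scorer is a reasonable design choice; the sketch leaves that decision to `sentence_logprob`.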
Similarly, WinoWhy can be treated as a binary classification problem, in which a model learns through supervised learning to distinguish valid reasons from invalid ones. You can run the supervised learning code with `python supervised_winowhy.py`. A processed dataset for classification, with the Undecided reasons removed, is available in `./dataset/`.
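Preparing binary examples for the supervised setting mainly means mapping reason labels to targets and dropping Undecided reasons, as in the processed classification dataset. The sketch below is one assumed way to do this directly from winowhy.json; it is not the code in supervised_winowhy.py, and `to_classification_examples` and `LABEL_TO_ID` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

# Binary targets; reasons labeled Undecided are dropped, mirroring the
# processed classification dataset in ./dataset/.
LABEL_TO_ID = {"Valid": 1, "Invalid": 0}


def to_classification_examples(questions: Sequence[Dict]) -> List[Tuple[str, str, int]]:
    """Flatten winowhy.json questions into (sentence, reason, label) triples."""
    examples: List[Tuple[str, str, int]] = []
    for q in questions:
        t = q["text"]
        # Rebuild the WSC sentence from its three parts (assumed spacing).
        sentence = " ".join(f"{t['txt1']} {t['pron']} {t['txt2']}".split())
        for reason in q["reasons"]:
            # Normalise case in case labels are stored as e.g. "valid" (assumption).
            label = str(reason[3]).strip().capitalize()
            if label not in LABEL_TO_ID:  # skips Undecided and anything unexpected
                continue
            examples.append((sentence, reason[0], LABEL_TO_ID[label]))
    return examples
```

When splitting into training and test sets, grouping by question rather than by reason keeps all reasons of one question on the same side; this is a suggestion, not necessarily what supervised_winowhy.py does.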
TODO: include the categorical annotation in the main WinoWhy dataset.
@inproceedings{zhang2020WinoWhy,
author = {Hongming Zhang* and Xinran Zhao* and Yangqiu Song},
title = {WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge},
booktitle = {Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL) 2020},
year = {2020}
}
If you have any other questions about this repo, you are welcome to open an issue or send me an email; I will respond as soon as possible.