From ab6ad982db1e47f0b29485c2b6409fb4fdcff8a2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Nekvinda?=
Date: Thu, 24 Jun 2021 14:20:10 +0200
Subject: [PATCH 1/4] Update README.md

---
 README.md | 173 ++++++++++++++++++++++++++----------------------------
 1 file changed, 83 insertions(+), 90 deletions(-)

diff --git a/README.md b/README.md
index d2b70e8..bf04755 100644
--- a/README.md
+++ b/README.md
@@ -1,32 +1,31 @@
 # MultiWOZ
 Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora.
+## Versions
-The newest, corrected version of the dataset is available at [MultiWOZ_2.2](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.2) thanks to [the Google crew](https://arxiv.org/abs/2007.12720).
+ - **The newest, corrected version of the dataset is available at [MultiWOZ_2.2](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.2) thanks to [the Google crew](https://arxiv.org/abs/2007.12720).**
+ - The new, corrected version of the dataset is available at [MultiWOZ_2.1](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip) thanks to [the Amazon crew](https://arxiv.org/abs/1907.01669).
+ - The dataset used in the EMNLP publication can be accessed at: [MultiWOZ_2.0](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.0.zip)
+ - The dataset used in the ACL publication can be accessed at: [MultiWOZ_1.0](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_1.0.zip)
-The new, corrected version of the dataset is available at [MultiWOZ_2.1](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip) thanks to [the Amazon crew](https://arxiv.org/abs/1907.01669).
-
-The dataset used in the EMNLP publication can be accessed at: [MultiWOZ_2.0](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.0.zip)
-
-The dataset used in the ACL publication can be accessed at: [MultiWOZ_1.0](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_1.0.zip)
-
-# Data structure
+## Data structure
 There are 3,406 single-domain dialogues that include booking if the domain allows for that and 7,032 multi-domain dialogues consisting of at least 2 up to 5 domains. To enforce reproducibility of results, the corpus was randomly split into a train, test and development set. The test and development sets contain 1k examples each. Even though all dialogues are coherent, some of them were not finished in terms of task description. Therefore, the validation and test sets only contain fully successful dialogues thus enabling a fair comparison of models. There are no dialogues from hospital and police domains in validation and testing sets.
 Each dialogue consists of a goal, multiple user and system utterances as well as a belief state. Additionally, the task description in natural language presented to turkers working from the visitor’s side is added. Dialogues with MUL in the name refers to multi-domain dialogues. Dialogues with SNG refers to single-domain dialogues (but a booking sub-domain is possible). The booking might not have been possible to complete if fail_book option is not empty in goal specifications – turkers did not know about that.
 The belief state have three sections: semi, book and booked. Semi refers to slots from a particular domain. Book refers to booking slots for a particular domain and booked is a sub-list of book dictionary with information about the booked entity (once the booking has been made). The goal sometimes was wrongly followed by the turkers which may results in the wrong belief state. The joint accuracy metrics includes ALL slots.
-# FAQ
-1. File names refer to two types of dialogues. The MUL and PMUL names refer to strictly multi domain dialogues (at least 2 main domains are involved) while the SNG, SSNG and WOZ names refer to single domain dialogues with potentially sub-domains like booking.
-2. Only system utterances are annotated with dialogue acts – there are no annotations from the user side.
-3. There is no 1-to-1 mapping between dialogue acts and sentences.
-4. There is no dialogue state tracking labels for police and hospital as these domains are very simple. However, there are no dialogues with these domains in validation and testing sets either.
-5. For the dialogue state tracking experiments please follow the datat processing and scoring scripts from the [TRADE](https://github.com/jasonwu0731/trade-dst) model (Wu et al. 2019).
-6. For the response generation evaluation please consider using the scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation).
+# :grey_question: FAQ
+- File names refer to two types of dialogues. The `MUL` and `PMUL` names refer to strictly multi domain dialogues (at least 2 main domains are involved) while the `SNG`, `SSNG` and `WOZ` names refer to single domain dialogues with potentially sub-domains like booking.
+- Only system utterances are annotated with dialogue acts – there are no annotations from the user side.
+- There is no 1-to-1 mapping between dialogue acts and sentences.
+- There is no dialogue state tracking labels for police and hospital as these domains are very simple. However, there are no dialogues with these domains in validation and testing sets either.
+
+# :trophy: Benchmarks
+## Dialog State Tracking
+
+:bangbang: **For the DST experiments please follow the datat processing and scoring scripts from [TRADE (Wu et al. 2019)](https://github.com/jasonwu0731/trade-dst).**
-

Benchmarks

-

Belief Tracking

@@ -59,9 +58,14 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
MultiWOZ 2.0 | MultiWOZ 2.1 | MultiWOZ 2.2
+## Response Generation + +:bangbang: **For the response generation evaluation please see and use the scoring scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation) (Nekvinda & Dušek 2021).** The following tables show numbers reported in current works. However, these numbers may not be comparable directly because of inconsistencies in the evaluation scripts. + +### Policy Optimization +\* Denotes that the results were obtained with an even earlier version of the evaluator. The performance on these works were underestimated. -

Policy Optimization

@@ -81,29 +85,19 @@ The belief state have three sections: semi, book and booked. Semi refers to slot -
(INFORM + SUCCESS)*0.5 + BLEU | MultiWOZ 2.0 | MultiWOZ 2.1
UBAR (Yang et al. 2020) 94.00 83.60 17.20 92.70 81.00 16.70
HDNO (Wang et al. 2020) 96.40 84.70 18.85 92.80 83.00 18.97
LAVA (Lubis et al. 2020) 97.50 94.80 12.10 96.39 83.57 14.02
-* The results were obtained with a previous version of the evaluator. The performance on these works before the upgrade were underestimated. -

Natural Language Generation

-
- - - - -
Model | SER | BLEU
Baseline (Budzianowski et al. 2018) 2.99 0.632
-
+### End-to-End Modelling -

End-to-End Modelling

- + @@ -113,25 +107,74 @@ The belief state have three sections: semi, book and booked. Semi refers to slot - - + + +
(INFORM + SUCCESS)*0.5 + BLEU | MultiWOZ 2.0 | MultiWOZ 2.1
Model | INFORM | SUCCESS | BLEU | INFORM | SUCCESS | BLEU
DAMD (Zhang et al. 2019) 76.3 60.4 18.6
DAMD (Zhang et al. 2019) 76.3 60.4 16.6
LABES-S2S (Zhang et al. 2020) 78.07 67.06 18.3
SimpleTOD (Hosseini-Asl et al. 2020) 84.4 70.1 15.01
DoTS (Jeon et al. 2021) 86.59 74.14 15.06 86.65 74.18 15.90
NoisyChannel (Liu et al. 2021) 86.90 76.20 20.58
UBAR (Yang et al. 2020) 95.40 80.70 17.00 95.70 81.80 16.50
SUMBT+LaRL (Lee et al. 2020) 92.20 85.40 17.90
+
+ + ### Natural Language Generation + +
+ + +
Model | SER | BLEU
Baseline (Budzianowski et al. 2018) 2.99 0.632
+# :thought_balloon: References
+If you use any source codes or datasets included in this toolkit in your
+work, please cite the corresponding papers. The bibtex are listed below:
+```
+[Budzianowski et al. 2018]
+@inproceedings{budzianowski2018large,
+ Author = {Budzianowski, Pawe{\l} and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I{\~n}igo and Ultes Stefan and Ramadan Osman and Ga{\v{s}}i\'c, Milica},
+ title={MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling},
+ booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
+ year={2018}
+}
-# Requirements
-Python 2 with pip, pytorch==0.4.1
+[Ramadan et al. 2018]
+@inproceedings{ramadan2018large,
+ title={Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing},
+ author={Ramadan, Osman and Budzianowski, Pawe{\l} and Gasic, Milica},
+ booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
+ volume={2},
+ pages={432--437},
+ year={2018}
+}
-# Quick start
-In repo directory:
+[Eric et al. 2019]
+@article{eric2019multiwoz,
+ title={MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines},
+ author={Eric, Mihail and Goel, Rahul and Paul, Shachi and Sethi, Abhishek and Agarwal, Sanchit and Gao, Shuyag and Hakkani-Tur, Dilek},
+ journal={arXiv preprint arXiv:1907.01669},
+ year={2019}
+}
-## Preprocessing
+[Zang et al. 2020]
+@inproceedings{zang2020multiwoz,
+ title={MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines},
+ author={Zang, Xiaoxue and Rastogi, Abhinav and Sunkara, Srinivas and Gupta, Raghav and Zhang, Jianguo and Chen, Jindong},
+ booktitle={Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020},
+ pages={109--117},
+ year={2020}
+}
+```
+
+# Baseline
+
+:bangbang: This part relates to the first version of the dataset and evaluation scripts.
+
+### Requirements
+Python 2 with `pip`, `pytorch==0.4.1`
+
+### Preprocessing
 To download and pre-process the data run:
 ```python create_delex_data.py```
-## Training
+### Training
 To train the model run:
 ```python train.py [--args=value]```
@@ -162,17 +205,8 @@ To evaluate the trained model, run:
 ```python test.py [--args=value]```
-To evaluate the outside model, run:
-
-```python evaluate.py```
-
-where in line 611 you need to load your generation predictions.
-
-
-# Benchmark results
-The following [benchmark results](http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/) were produced by this software.
-We ran a small grid search over various hyperparameter settings
-and reported the performance of the best model on the test set.
+## Results
+We ran a small grid search over various hyperparameter settings and reported the performance of the best model on the test set.
 The selection criterion was 0.5*match + 0.5*success+100*BLEU on the validation set.
 The final parameters were:
@@ -195,47 +229,6 @@ The final parameters were:
 --cell_type : lstm
 ```
-
-# References
-If you use any source codes or datasets included in this toolkit in your
-work, please cite the corresponding papers. The bibtex are listed below:
-```
-[Budzianowski et al. 2018]
-@inproceedings{budzianowski2018large,
- Author = {Budzianowski, Pawe{\l} and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I{\~n}igo and Ultes Stefan and Ramadan Osman and Ga{\v{s}}i\'c, Milica},
- title={MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling},
- booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
- year={2018}
-}
-
-[Ramadan et al. 2018]
-@inproceedings{ramadan2018large,
- title={Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing},
- author={Ramadan, Osman and Budzianowski, Pawe{\l} and Gasic, Milica},
- booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
- volume={2},
- pages={432--437},
- year={2018}
-}
-
-[Eric et al. 2019]
-@article{eric2019multiwoz,
- title={MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines},
- author={Eric, Mihail and Goel, Rahul and Paul, Shachi and Sethi, Abhishek and Agarwal, Sanchit and Gao, Shuyag and Hakkani-Tur, Dilek},
- journal={arXiv preprint arXiv:1907.01669},
- year={2019}
-}
-
-[Zang et al. 2020]
-@inproceedings{zang2020multiwoz,
- title={MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines},
- author={Zang, Xiaoxue and Rastogi, Abhinav and Sunkara, Srinivas and Gupta, Raghav and Zhang, Jianguo and Chen, Jindong},
- booktitle={Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020},
- pages={109--117},
- year={2020}
-}
-```
-
 # License
 MultiWOZ is an open source toolkit for building end-to-end trainable task-oriented dialogue models. It is released by Paweł Budzianowski from Cambridge Dialogue Systems Group under Apache License 2.0.
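
The benchmark table headers in this patch name a combined score, `(INFORM + SUCCESS)*0.5 + BLEU`. The following is a minimal sketch of that formula, assuming all three metrics are on the 0-100 scale used in the tables; the function name `combined_score` is illustrative and does not come from this repository or from the evaluation scripts it links to.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined score as written in the table header:
    (INFORM + SUCCESS) * 0.5 + BLEU.

    Assumption: all three inputs are percentages on a 0-100 scale.
    """
    return (inform + success) * 0.5 + bleu


# UBAR, MultiWOZ 2.0 columns of the policy optimization row quoted in this patch.
ubar_score = combined_score(94.00, 83.60, 17.20)
print(f"{ubar_score:.2f}")  # prints 106.00
```

The baseline's model-selection criterion quoted in the same patch, `0.5*match + 0.5*success+100*BLEU`, has the same shape but scales BLEU by 100, which suggests BLEU is on a 0-1 scale there; the sketch above does not cover that variant.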
From 311ca104c202c449b2444813c7fd2a615225dfbf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Nekvinda?=
Date: Thu, 24 Jun 2021 14:23:04 +0200
Subject: [PATCH 2/4] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index bf04755..65f8926 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
 # :trophy: Benchmarks
 ## Dialog State Tracking
-:bangbang: **For the DST experiments please follow the datat processing and scoring scripts from [TRADE (Wu et al. 2019)](https://github.com/jasonwu0731/trade-dst).**
+:bangbang: **For the DST experiments please follow the data processing and scoring scripts from the [TRADE model](https://github.com/jasonwu0731/trade-dst).**
@@ -60,7 +60,7 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
 ## Response Generation
-:bangbang: **For the response generation evaluation please see and use the scoring scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation) (Nekvinda & Dušek 2021).** The following tables show numbers reported in current works. However, these numbers may not be comparable directly because of inconsistencies in the evaluation scripts.
+:bangbang: **For the response generation evaluation please see and use the scoring scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation).** The following tables show numbers reported in current works. However, these numbers may not be comparable directly because of inconsistencies in the evaluation scripts.
 ### Policy Optimization

From 8a276bff6b61d6e53bac4619833e176763ef0f6e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Nekvinda?=
Date: Mon, 28 Jun 2021 08:36:45 +0200
Subject: [PATCH 3/4] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 65f8926..d30a4c0 100644
--- a/README.md
+++ b/README.md
@@ -60,7 +60,9 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
 ## Response Generation
-:bangbang: **For the response generation evaluation please see and use the scoring scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation).** The following tables show numbers reported in current works. However, these numbers may not be comparable directly because of inconsistencies in the evaluation scripts.
+:bangbang: **For the response generation evaluation please see and use the scoring scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation).**
+
+:bangbang: The following tables show numbers reported in current works. However, these numbers may not be comparable directly because of inconsistencies in the evaluation scripts.
 ### Policy Optimization

From 40caf83332a12023425c4bb03f3460224ac9ffd6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tom=C3=A1=C5=A1=20Nekvinda?=
Date: Mon, 28 Jun 2021 12:41:55 +0200
Subject: [PATCH 4/4] Update README.md

---
 README.md | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index d30a4c0..81aa138 100644
--- a/README.md
+++ b/README.md
@@ -62,12 +62,46 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
 :bangbang: **For the response generation evaluation please see and use the scoring scripts from [this repository](https://github.com/Tomiinek/MultiWOZ_Evaluation).**
-:bangbang: The following tables show numbers reported in current works. However, these numbers may not be comparable directly because of inconsistencies in the evaluation scripts.
+- See [this directory](https://github.com/Tomiinek/MultiWOZ_Evaluation/tree/master/predictions) for details about the raw generated predictions of other models.
+- BLEU reported in these tables is calculated with references obtained from the *MultiWOZ 2.2 span annotations*.
+- CBE stands for *conditional bigram entropy*.
-### Policy Optimization
+| Model | BLEU | Inform | Success | Av. len. | CBE | #uniq. words | #uniq. 3-grams |
+| ------------------ | -----:| -------:| --------:| ---------:| -----------------:| -------------:| -------------:|
+| Reference corpus | - | 93.7 | 90.9 | 14.00 | 3.01 | 1407 | 23877 |
+
+**End-to-end models**, i.e. those that use only the context as input.
+
+| Model | BLEU | Inform | Success | Av. len. | CBE | #uniq. words | #uniq. 3-grams |
+| ------------------ | -----:| -------:| --------:| ---------:| -----------------:| -------------:| -------------:|
+| DAMD ([paper](https://arxiv.org/abs/1911.10484)\|[code](https://github.com/thu-spmi/damd-multiwoz)) | 16.4 | 57.9 | 47.6 | 14.27 | 1.65 | 212 | 1755 |
+| MinTL ([paper](https://arxiv.org/pdf/2009.12005.pdf)\|[code](https://github.com/zlinao/MinTL)) | **19.4** | 73.7 | 65.4 | 14.78 | 1.81 | 297 | 2525 |
+| UBAR ([paper](https://arxiv.org/abs/2012.03539)\|[code](https://github.com/TonyNemo/UBAR-MultiWOZ)) | 17.6 | **83.4** | 70.3 | 13.54 | 2.10 | 478 | 5238 |
+| SOLOIST ([paper](https://arxiv.org/abs/2005.05298)) | 13.6 | 82.3 | **72.4** | 18.45 | **2.41** | **615** | **7923** |
+| AuGPT ([paper](https://arxiv.org/abs/2102.05126)\|[code](https://github.com/ufal/augpt)) | 16.8 | 76.6 | 60.5 | 12.90 | 2.15 | 608 | 5843 |
+| LABES ([paper](https://arxiv.org/pdf/2009.08115v3.pdf)\|[code](https://github.com/thu-spmi/LABES)) | 18.9 | 68.5 | 58.1 | 14.20 | 1.83 | 374 | 3228 |
+| DoTS ([paper](https://arxiv.org/pdf/2103.06648.pdf)) | 16.8 | 80.4 | 68.7 | 14.66 | 2.10 | 411 | 5162 |
+
+**Policy optimization models**, i.e. those that use also the ground-truth dialog states.
+
+| Model | BLEU | Inform | Success | Av. len. | CBE | #uniq. words | #uniq. 3-grams |
+| ------------------ | -----:| -------:| --------:| ---------:| -----------------:| -------------:| -------------:|
+| MarCo ([paper](https://arxiv.org/pdf/2004.12363.pdf)\|[code](https://github.com/InitialBug/MarCo-Dialog)) | 17.3 | 94.5 | 87.2 | 16.01 | **1.94** | 319 | **3002** |
+| HDSA ([paper](https://arxiv.org/pdf/1905.12866.pdf)\|[code](https://github.com/wenhuchen/HDSA-Dialog)) | **20.7** | 87.9 | 79.4 | 14.42 | 1.64 | 259 | 2019 |
+| HDNO ([paper](https://arxiv.org/pdf/2006.06814.pdf)\|[code](https://github.com/mikezhang95/HDNO)) | 17.8 | 93.3 | 83.4 | 14.96 | 0.84 | 103 | 315 |
+| SFN ([paper](https://arxiv.org/pdf/1907.10016.pdf)\|[code](https://github.com/Shikib/structured_fusion_networks)) | 14.1 | 93.4 | 82.3 | 14.93 | 1.63 | 188 | 1218 |
+| UniConv ([paper](https://arxiv.org/pdf/2004.14307.pdf)\|[code](https://github.com/henryhungle/UniConv)) | 18.1 | 66.7 | 58.7 | 14.17 | 1.79 | **338** | 2932 |
+| LAVA ([paper](https://arxiv.org/abs/2011.09378)\|[code](https://gitlab.cs.uni-duesseldorf.de/general/dsml/lava-public/-/tree/master/experiments_woz/sys_config_log_model/2020-05-12-14-51-49-actz_cat)) | 10.8 | **95.9** | **93.5** | 13.28 | 1.27 | 176 | 708 |
+
+
+### Older results
+
+The following tables show older numbers which may not be comparable directly because of inconsistencies in the evaluation scripts used.
 \* Denotes that the results were obtained with an even earlier version of the evaluator. The performance on these works were underestimated.
+**Policy optimization**
+
@@ -92,7 +126,7 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
(INFORM + SUCCESS)*0.5 + BLEU | MultiWOZ 2.0 | MultiWOZ 2.1
-### End-to-End Modelling +**End-to-end modelling**
@@ -113,16 +147,7 @@ The belief state have three sections: semi, book and booked. Semi refers to slot
- - ### Natural Language Generation - -
- - - - -
Model | SER | BLEU
Baseline (Budzianowski et al. 2018) 2.99 0.632
-
+
 # :thought_balloon: References
 If you use any source codes or datasets included in this toolkit in your