The format of Multiwoz dataset #2

Closed
yanzhangnlp opened this issue Apr 7, 2021 · 7 comments

Comments

@yanzhangnlp

Hi Giovanni,

Nice work, and thanks for sharing. I am trying to reproduce the results of the DST task. However, I found that the processed data format of the MultiWOZ 2.1 dataset produced by the script from https://github.com/jasonwu0731/trade-dst does not match what your code expects. May I ask whether you apply any additional preprocessing? If so, would you mind sharing the script?

Sincerely,
Yan

@iambabao

I have the same problem with the ACE 05 NER dataset.

I downloaded the ACE 05 NER dataset from the link provided in datasets.py and renamed the files to {split}.ner.json, but it does not work :(

@Magolor

Magolor commented Jul 5, 2021

@iambabao

> I have the same problem with the ACE 05 NER dataset.
>
> I downloaded the ACE 05 NER dataset from the link provided in datasets.py and renamed the files to {split}.ner.json, but it does not work :(

Yes, but I believe modifying it by simply adding:

    if 'label' not in x:
        x['label'] = {
            x['entity_label']: x['span_position'],
        }

could work.
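For anyone who wants to apply this outside the data-loading loop, here is a minimal sketch that patches a whole file up front. It assumes the downloaded file is a JSON list of dicts with `entity_label` and `span_position` fields, as the snippet above implies. The helper names `add_missing_labels` and `patch_file` and the sample values are hypothetical, not part of tanl.

```python
import json
from pathlib import Path


def add_missing_labels(examples):
    """Apply the fix suggested above to every example, in place.

    Examples that already have a 'label' field are left untouched.
    """
    for x in examples:
        if 'label' not in x:
            x['label'] = {
                x['entity_label']: x['span_position'],
            }
    return examples


def patch_file(src, dst):
    """Read a JSON list of examples from src, patch it, and write it to dst."""
    examples = json.loads(Path(src).read_text(encoding='utf-8'))
    add_missing_labels(examples)
    Path(dst).write_text(json.dumps(examples, ensure_ascii=False), encoding='utf-8')


# Hypothetical example; the field values are made up for illustration.
sample = [{'entity_label': 'PER', 'span_position': ['0;1']}]
patched = add_missing_labels(sample)
```

Whether the resulting `label` structure is exactly what the tanl loader expects is not confirmed in this thread, so treat this as a starting point.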

However, @giove91, please add links to all the datasets used in tanl, if available. Most of the datasets reported in the paper and defined in datasets.py currently come with no acquisition method, preprocessing scripts, or instructions. I would really appreciate it if you could complete the dataset documentation.

@giove91
Contributor

giove91 commented Jul 14, 2021

Hi, thanks for your interest in this project!

@yanzhangnlp We added the instructions to process the Multiwoz dataset (thanks @jasonkrone). Hope this helps!

@iambabao Apparently the version I downloaded from that link is no longer available; the version that can currently be downloaded there is different. Thanks @Magolor for providing a possible fix. I'll check and update the instructions.

@MerrickWang1

Hi,

The data files provided for the ACE2005 dataset are of .test, .train, and .dev file types. @iambabao how did you obtain .json files?

Here is where I am attempting to obtain the ACE2005 data:
https://github.com/ShannonAI/mrc-for-flat-nested-ner/blob/master/ner2mrc/download.md
https://drive.google.com/file/d/1iodaJ92dTAjUWnkMyYm8aLEi5hj3cseY/view

Thanks,

@iambabao

> Hi,
>
> The data files provided for the ACE2005 dataset are of .test, .train, and .dev file types. @iambabao how did you obtain .json files?
>
> Here is where I am attempting to obtain the ACE2005 data:
> https://github.com/ShannonAI/mrc-for-flat-nested-ner/blob/master/ner2mrc/download.md
> https://drive.google.com/file/d/1iodaJ92dTAjUWnkMyYm8aLEi5hj3cseY/view
>
> Thanks,

The files are already in JSON format; you can simply rename them.
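A minimal sketch of that renaming step, which also checks that each file really parses as JSON before copying it. It assumes the split name is the file extension (`.train`, `.dev`, `.test`) and that the target name is `{split}.ner.json`, as described in this thread. The function name `rename_splits` and the directory arguments are hypothetical.

```python
import json
import shutil
from pathlib import Path

SPLITS = ('train', 'dev', 'test')


def rename_splits(src_dir, dst_dir):
    """Copy each '*.<split>' file in src_dir to dst_dir as '<split>.ner.json'.

    Returns a list of the files written. A file that is not valid JSON raises
    json.JSONDecodeError, so a format mismatch is caught before training.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split in SPLITS:
        for src in sorted(src_dir.glob(f'*.{split}')):
            with src.open(encoding='utf-8') as f:
                json.load(f)  # parse only to verify the content is JSON
            dst = dst_dir / f'{split}.ner.json'
            shutil.copyfile(src, dst)
            written.append(dst)
    return written
```

If several files share the same extension, the last one in sorted order overwrites the others, so it is safest to keep only the ACE2005 files in the source directory.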

@David-Lee-1990

@iambabao

>> I have the same problem with the ACE 05 NER dataset.
>> I downloaded the ACE 05 NER dataset from the link provided in datasets.py and renamed the files to {split}.ner.json, but it does not work :(
>
> Yes, but I believe modifying it by simply adding:
>
>     if 'label' not in x:
>         x['label'] = {
>             x['entity_label']: x['span_position'],
>         }
>
> could work.
>
> However, @giove91, please add links to all the datasets used in tanl, if available. Most of the datasets reported in the paper and defined in datasets.py currently come with no acquisition method, preprocessing scripts, or instructions. I would really appreciate it if you could complete the dataset documentation.

Hey guys, after preprocessing the ACE2005 NER dataset following the guidance here and running tanl, I get F1 = 88.3 (the tanl paper reports 84.9). Is there a bug, or is something else going on?

@giove91
Contributor

giove91 commented Jun 8, 2022

Interesting! Are the splits correct and have you used the same hyperparameters as in the paper? (50 epochs, initial learning rate 0.0005, ...)
