
Conversation

@zaidalyafeai
Contributor

Added a custom data loader script for the offline dataset story_cloze. Also, created a candidate template.

@VictorSanh VictorSanh self-assigned this Sep 17, 2021
@VictorSanh
Member

Thanks for opening this PR @zaidalyafeai !

I was initially suggesting that:

  • your story_cloze.py goes to huggingface/datasets via a PR (as a dataset that requires manual download)
  • the datasets library version is installed locally from your PR (in your local version, you would specify the local path from which to load the data, i.e. path_to_manual_folder)
  • ultimately, we would probably have a "data_dir" box in promptsource to specify the data_dir in the tool rather than in the code, but we don't need that now since the priority is to get the eval suite ready

This has multiple advantages:

  • you would still be able to prompt this dataset in promptsource
  • the things that should be in datasets (story_cloze.py, datasets_info.json) would live in datasets and not in promptsource
  • when we do seqio caching, it would work out of the box -> right now I don't think that works
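
The manual-download flow described in the first list could be sketched roughly like this (a hedged sketch: `build_load_args` is a hypothetical helper, `path_to_manual_folder` is the placeholder from this thread, and the actual loading goes through the standard `datasets.load_dataset` API with `data_dir`):

```python
import os

# Hypothetical helper sketching the manual-download flow: story_cloze is
# not auto-downloadable, so the caller must point data_dir at a folder
# they downloaded themselves. "story_cloze" and "2016" are the dataset
# and subset discussed in this PR.
def build_load_args(manual_dir, subset="2016"):
    """Return (path, name, kwargs) for datasets.load_dataset, failing
    early if the manually downloaded folder is missing."""
    if not os.path.isdir(manual_dir):
        raise FileNotFoundError(
            f"story_cloze requires a manual download; expected data in {manual_dir!r}"
        )
    return "story_cloze", subset, {"data_dir": manual_dir}

# With the datasets version from the open PR installed locally:
#   from datasets import load_dataset
#   path, name, kwargs = build_load_args("path_to_manual_folder")
#   ds = load_dataset(path, name, **kwargs)
```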

btw, you shouldn't push the .lock file (I believe it's flagged in the .gitignore in hf datasets)

@zaidalyafeai
Contributor Author

Thanks @VictorSanh, I already opened a PR to add it to huggingface datasets, but given the large number of open PRs it will take some time to review and merge. I was suggesting this approach in case we need to add custom datasets quickly, so that we don't have to wait for PRs to be merged into datasets. Since @awebson says this dataset is a must-have, what do you suggest we do next?

@VictorSanh
Member

> Thanks @VictorSanh, I already opened a PR to add it to huggingface datasets, but given the large number of open PRs it will take some time to review and merge. I was suggesting this approach in case we need to add custom datasets quickly, so that we don't have to wait for PRs to be merged into datasets. Since @awebson says this dataset is a must-have, what do you suggest we do next?

Oh awesome, I didn't find it initially for some reason...

Yes, please do:

  • install the datasets library locally from your PR (in your local version, specify the local path from which to load the data, i.e. path_to_manual_folder)
  • remove the three files (.lock, story_cloze.py and datasets_info.json)
  • remove the CUSTOM_DATASETS-related code in promptsource.py

Once this is done, I will ask you to seqio-cache the dataset and upload it to the hub (I will share the command with you later).

thank you!

@zaidalyafeai
Contributor Author

Sounds good. What about the templates for story_cloze: do I add them as part of this PR or just discard them?

@VictorSanh
Member

Yes, let's have the templates here in this PR!
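
For readers unfamiliar with what a promptsource candidate template does, here is a minimal sketch in plain Python (plain string formatting stands in for promptsource's actual Jinja templates; the field names follow the story_cloze schema, but treat them and the example data as assumptions for illustration):

```python
# Minimal sketch of a story_cloze candidate template: render the four
# context sentences plus both candidate endings as the prompt, and mark
# the gold ending as the target.
def render_story_cloze(example):
    context = " ".join(
        example[f"input_sentence_{i}"] for i in range(1, 5)
    )
    candidates = [example["sentence_quiz1"], example["sentence_quiz2"]]
    prompt = (
        f"{context}\nWhat is a possible continuation for the story?\n"
        f"- {candidates[0]}\n- {candidates[1]}"
    )
    # answer_right_ending is 1-indexed in the dataset
    target = candidates[example["answer_right_ending"] - 1]
    return prompt, target

# Toy example (invented data, not from the real dataset):
example = {
    "input_sentence_1": "Tom bought a kite.",
    "input_sentence_2": "He took it to the park.",
    "input_sentence_3": "The wind was strong.",
    "input_sentence_4": "The kite soared high.",
    "sentence_quiz1": "Tom smiled happily.",
    "sentence_quiz2": "Tom threw the kite away.",
    "answer_right_ending": 1,
}
prompt, target = render_story_cloze(example)
```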

@zaidalyafeai
Contributor Author

@VictorSanh I followed your steps and it seems to work fine. I'd like to test the seqio cache script early on some test templates to make sure it works.

@zaidalyafeai zaidalyafeai changed the title add custom data loaders for story_cloze and a candidate template Add templates for story_cloze Sep 18, 2021
@zaidalyafeai
Contributor Author

Since the dataset requires a manual download and is not part of hf datasets, maybe it should be excluded from the build checks?

@VictorSanh VictorSanh left a comment
Member

From what I can see of the data, LGTM! I corrected a few minor things.

Letting @awebson validate one more time, and then we can cache it.

Note: I see you only did the 2016 subset, while the recommended evaluation subset seems to be 2018. Is that a problem? (FLAN reports numbers on the 2016 subset.)

@VictorSanh VictorSanh merged commit fc5b3bb into bigscience-workshop:main Sep 19, 2021
@zaidalyafeai
Contributor Author

I discussed that with @awebson: the 2018 subset doesn't have labels for the test split, only for validation. We can copy the same templates to the 2018 validation split if you want to include that. Besides, GPT-3 uses the 2016 subset.
