
Conversation

@zaidalyafeai
Contributor

Added a custom data loader script for the offline dataset story_cloze. Also, created a candidate template.

@VictorSanh VictorSanh self-assigned this Sep 17, 2021
@VictorSanh
Member

Thanks for opening this PR @zaidalyafeai !

I was initially suggesting that:

  • your story_cloze.py goes to huggingface/datasets via a PR (as a dataset that requires manual download)
  • the datasets library version is installed locally from your PR (in your local version, you would specify the local path from which to load the data, i.e. path_to_manual_folder)
  • ultimately, we would probably have a "data_dir" box in promptsource to specify the data_dir in the tool rather than in the code, but we don't need that now since the priority is to get the eval suite ready

This has multiple advantages:

  • you would still be able to prompt this dataset in promptsource
  • the things that should be in datasets (story_cloze.py, datasets_info.json) would live in datasets and not in promptsource
  • when we do seqio caching, it would work out of the box -> right now I don't think that works
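
The manual-download flow described in the first list could be sketched roughly like this (a hedged sketch: `build_load_args` is a hypothetical helper, `path_to_manual_folder` is the placeholder from this thread, and the actual loading goes through the standard `datasets.load_dataset` API with `data_dir`):

```python
import os

# Hypothetical helper sketching the manual-download flow: story_cloze is
# not auto-downloadable, so the caller must point data_dir at a folder
# they downloaded themselves. "story_cloze" and "2016" are the dataset
# and subset discussed in this PR.
def build_load_args(manual_dir, subset="2016"):
    """Return (path, name, kwargs) for datasets.load_dataset, failing
    early if the manually downloaded folder is missing."""
    if not os.path.isdir(manual_dir):
        raise FileNotFoundError(
            f"story_cloze requires a manual download; expected data in {manual_dir!r}"
        )
    return "story_cloze", subset, {"data_dir": manual_dir}

# With the datasets version from the open PR installed locally:
#   from datasets import load_dataset
#   path, name, kwargs = build_load_args("path_to_manual_folder")
#   ds = load_dataset(path, name, **kwargs)
```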

btw, you shouldn't push the .lock file (I believe it's flagged in the .gitignore in hf datasets)

@zaidalyafeai
Contributor Author

Thanks @VictorSanh, I already opened a PR to add it to huggingface datasets, but given the large number of open PRs it will take some time to review and merge. I was suggesting this approach in case we need to add custom datasets quickly, so that we don't have to wait for PRs to be merged into datasets. Since @awebson says this dataset is a must-have, what do you suggest we do next?

@VictorSanh
Member

> Thanks @VictorSanh, I already opened a PR to add it to huggingface datasets, but given the large number of open PRs it will take some time to review and merge. I was suggesting this approach in case we need to add custom datasets quickly, so that we don't have to wait for PRs to be merged into datasets. Since @awebson says this dataset is a must-have, what do you suggest we do next?

Oh awesome, I didn't find it initially for some reason...

Yes, please do:

  • install the datasets library locally from your PR (in your local version, specify the local path from which to load the data, i.e. path_to_manual_folder)
  • remove the three files (.lock, story_cloze.py and datasets_info.json)
  • remove the CUSTOM_DATASETS-related code in promptsource.py

Once this is done, I will ask you to seqio-cache the dataset and upload it to the hub (I will share the command with you later).

thank you!

@zaidalyafeai
Contributor Author

Sounds good. What about the templates for story_cloze: do I add them as part of this PR or just discard them?

@VictorSanh
Member

Yes, let's have the templates here in this PR!
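
For readers unfamiliar with what a promptsource candidate template does, here is a minimal sketch in plain Python (plain string formatting stands in for promptsource's actual Jinja templates; the field names follow the story_cloze schema, but treat them and the example data as assumptions for illustration):

```python
# Minimal sketch of a story_cloze candidate template: render the four
# context sentences plus both candidate endings as the prompt, and mark
# the gold ending as the target.
def render_story_cloze(example):
    context = " ".join(
        example[f"input_sentence_{i}"] for i in range(1, 5)
    )
    candidates = [example["sentence_quiz1"], example["sentence_quiz2"]]
    prompt = (
        f"{context}\nWhat is a possible continuation for the story?\n"
        f"- {candidates[0]}\n- {candidates[1]}"
    )
    # answer_right_ending is 1-indexed in the dataset
    target = candidates[example["answer_right_ending"] - 1]
    return prompt, target

# Toy example (invented data, not from the real dataset):
example = {
    "input_sentence_1": "Tom bought a kite.",
    "input_sentence_2": "He took it to the park.",
    "input_sentence_3": "The wind was strong.",
    "input_sentence_4": "The kite soared high.",
    "sentence_quiz1": "Tom smiled happily.",
    "sentence_quiz2": "Tom threw the kite away.",
    "answer_right_ending": 1,
}
prompt, target = render_story_cloze(example)
```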

@zaidalyafeai
Contributor Author

@VictorSanh I followed your steps and it seems to work fine. I'd like to test the seqio cache script early on some test templates to make sure it works.

@zaidalyafeai zaidalyafeai changed the title add custom data loaders for story_cloze and a candidate template Add templates for story_cloze Sep 18, 2021
@zaidalyafeai
Contributor Author

Since the dataset requires a manual download and is not part of hf datasets, maybe it should be excluded from the build checks?

@VictorSanh VictorSanh left a comment
Member

From what I can see of the data, LGTM! I corrected a few minor things.

Letting @awebson validate one more time, and then we can cache it.

Note: I see you only did the 2016 subset, while the recommended evaluation subset seems to be 2018. Is that a problem? (FLAN reports numbers on the 2016 subset.)

@VictorSanh VictorSanh merged commit fc5b3bb into bigscience-workshop:main Sep 19, 2021
@zaidalyafeai
Contributor Author

I discussed that with @awebson: the 2018 subset doesn't have labels for the test split, only for validation. We can copy the same templates to the 2018 validation split if you want to include that. Besides, GPT-3 uses the 2016 subset.
