@SamuelCahyawijaya
Contributor

SamuelCahyawijaya commented Apr 6, 2022

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

Note: You need to specify a subset_id (pqal, pqau, or pqaa) to run the unit test.

@jason-fries jason-fries self-assigned this Apr 7, 2022
Collaborator

@hakunanatasha hakunanatasha left a comment

Hi @SamuelCahyawijaya

Please remove the images.

I'm having trouble running the unit tests for pqaa and pqau, but pqal passes. When I look at Google, I see it limits the download because of size. Can you check whether these work?

@SamuelCahyawijaya
Contributor Author

Hi @hakunanatasha, same as in the other PR, I have deleted the two image files here.
For pqau and pqaa, I think the failure comes from a bug in HF datasets, as mentioned here: huggingface/datasets#3787. The fix has been included in datasets==2.0.0 (https://github.com/huggingface/datasets/releases/tag/2.0.0).

I tested on datasets==2.0.0 and it works just fine. Could we update the datasets requirement in requirements.txt to datasets==2.0.0 to address this problem?

I also tested some other datasets (mediqa_qa, mediqa_rqe, pubmed_qa, paramed, pico_extraction, medhop, scitail, and mqp) using datasets==2.0.0, and all of them seem to work just fine.
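
For anyone reproducing this locally, here is a minimal sketch of how one might check whether the installed datasets package meets the 2.0.0 version discussed above. The helper names (`parse_version`, `meets_minimum`, `installed_datasets_version`) are hypothetical and not part of the repository; the version parsing is deliberately simple and only looks at the leading numeric parts.

```python
from importlib import metadata


def parse_version(version):
    """Turn a version string such as '2.0.0' into a comparable (major, minor, patch) tuple.

    Hypothetical helper: only the leading digits of the first three dot-separated
    pieces are used, so suffixes like 'dev0' or 'rc1' are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="2.0.0"):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def installed_datasets_version():
    """Return the installed version of the 'datasets' package, or None if it is absent."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_datasets_version()
    if found is None:
        print("datasets is not installed")
    elif meets_minimum(found):
        print(f"datasets {found} meets the 2.0.0 requirement")
    else:
        print(f"datasets {found} is older than 2.0.0; pqau and pqaa may fail to download")
```

This only reports the version; pinning itself would still be done in requirements.txt as proposed above.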

@hakunanatasha
Collaborator

@SamuelCahyawijaya very interesting - yes if that's the case, let's update the reqs.

…emove LONG_ANSWER, update question type to yesno
@SamuelCahyawijaya
Contributor Author

@hakunanatasha : I have added the 10-fold configs for source and bigbio schemas. The subset_id for pqal changes from pqal_[source|bigbio] to pqal_fold{k}_[source|bigbio], k in [0..9].
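
As an illustration of the naming pattern described above, the following sketch enumerates the resulting pqal subset_id values. The function name `pqal_subset_ids` is hypothetical, and the output reflects only the pattern as stated in this comment, not necessarily the final naming in the merged dataloader.

```python
def pqal_subset_ids(n_folds=10, schemas=("source", "bigbio")):
    """List subset_ids of the form pqal_fold{k}_{schema} for k in 0..n_folds-1.

    Hypothetical helper for illustration; the pattern is taken from the comment above.
    """
    return [
        f"pqal_fold{k}_{schema}"
        for k in range(n_folds)
        for schema in schemas
    ]


if __name__ == "__main__":
    for subset_id in pqal_subset_ids():
        print(subset_id)
```

With the defaults this yields 20 names, one source and one bigbio config per fold.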

…emove LONG_ANSWER, update question type to yesno
@galtay galtay self-assigned this Apr 14, 2022
@galtay
Collaborator

galtay commented Apr 15, 2022

don't worry about the images (we can just merge this and then add the images back right after)
I'll add some comments to the code though

… naming for the subset_id following bigbio convention, update None to BigBioValues.NULL on the bigbio schema
@galtay
Collaborator

galtay commented Apr 16, 2022

Hi @SamuelCahyawijaya, I think I understand this dataset a bit better now. Are all the BigBio Q/A schemas using the yes/no answers? It seems that we might also be able to support long_answer Q/A and maybe even text classification with the MeSH labels. I think we can wrap it up for now, but let's flag this for later development.

format, remove print, add TODO
@galtay galtay merged commit 8e7c461 into bigscience-workshop:master Apr 16, 2022
galtay added a commit to galtay/biomedical that referenced this pull request Apr 16, 2022
galtay added a commit that referenced this pull request Apr 16, 2022
galtay added a commit that referenced this pull request Apr 16, 2022
* add back images that were removed in #357

* oops! rename images
@SamuelCahyawijaya
Contributor Author

Hi @galtay, yes, you are right. The dataset can be used to support long_answer Q/A and some other possible tasks. Previously, I included the long answer as the label in the bigbio schema, but I then removed it, since we need to maintain the QA type.

Let me know if later we plan to implement a different task / QA type for this. I can help to implement that part.
