
Closes #64 #453

Merged
Merged 7 commits into bigscience-workshop:master on May 1, 2022

Conversation

mapama247
Contributor

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
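The config-naming convention the checklist relies on (config names of the form `{subset_id}_{schema}`, with at least one source and one bigbio config) can be sketched as follows. The `BigBioConfig` stand-in below is a hypothetical simplification for illustration; the real one in the repo subclasses `datasets.BuilderConfig`:

```python
from dataclasses import dataclass


# Hypothetical stand-in for the repo's BigBioConfig (the real class
# subclasses datasets.BuilderConfig); shown only to illustrate naming.
@dataclass
class BigBioConfig:
    name: str
    version: str
    description: str
    schema: str
    subset_id: str


def make_configs(dataset_name, source_version, bigbio_version, bigbio_schema):
    """Build the minimal pair of configs the checklist asks for:
    one for the source schema and one for a bigbio schema."""
    return [
        BigBioConfig(
            name=f"{dataset_name}_source",
            version=source_version,
            description=f"{dataset_name} source schema",
            schema="source",
            subset_id=dataset_name,
        ),
        BigBioConfig(
            name=f"{dataset_name}_bigbio_{bigbio_schema}",
            version=bigbio_version,
            description=f"{dataset_name} BigBio schema",
            schema=f"bigbio_{bigbio_schema}",
            subset_id=dataset_name,
        ),
    ]


configs = make_configs("my_dataset", "1.0.0", "1.0.0", "text")
print([c.name for c in configs])  # → ['my_dataset_source', 'my_dataset_bigbio_text']
```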

Collaborator

@sg-wbi sg-wbi left a comment


Thank you for your contribution @mapama247 ! Sorry for the delay. Unfortunately, I could not run the tests because the download link does not seem to work. I tried on two different occasions/networks, but still nothing. Is there any chance we can get it somewhere else? Or maybe you could create a GitHub repo with the data if you still have it. This is just to double-check that the script works fine!

Review threads on biodatasets/codiesp/codiesp.py (resolved)
@mapama247
Contributor Author

> Thank you for your contribution @mapama247 ! Sorry for the delay. Unfortunately I could not run the tests because the download link does not seem to work. I tried in two different occasion/networks but still nothing. Is there any chance we can get it somewhere else? Or maybe you can create a github repo with the data if you still have it. This is just to double check that the script works fine!

Oh, that's weird, I had no problems downloading the data last week... I just tried calling the script (after making sure to clear my cache) and the first time the download stopped at around 86% and threw a timeout error. But then I tried again and the entire dataset downloaded smoothly. I don't know if this is the same error you experienced, but if so, could you please give it another try? I have also tried downloading the data using the wget command and it works without problems... but if it doesn't for you, I guess I could simply upload it to a repo as you suggest.

@sg-wbi
Collaborator

sg-wbi commented Apr 20, 2022

I can now download the dataset w/o problems! Who knows what the problem was... thanks for double-checking though!

@mapama247
Contributor Author

> I can now download the dataset w/o problems! Who knows what was the problem... thanks for double checking though!

Glad to hear that! It must have something to do with Zenodo... I have just updated the script with a different source config for each task; let me know if everything is fine now.

Collaborator

@sg-wbi sg-wbi left a comment


@mapama247 thank you for applying the changes! Now that I could finally take a look at the dataset, I have some more in-depth comments. Could you please check them out? Thank you!

Review threads on biodatasets/codiesp/codiesp.py (resolved)
@sg-wbi
Collaborator

sg-wbi commented Apr 21, 2022

Ah! One more thing: here (see bottom of the page) they provide an "Additional dataset" to extend the train/dev splits. Could you please add these as well?
Or maybe it is worth creating a new Issue and a separate dataloader script? What do you think?

@mapama247
Contributor Author

> Ah! One more thing: here (see bottom of the page ) they provide an "Additional dataset" to extend train/dev splits. Could you please add these as well? Or maybe it is worth to create a new Issue and a separate dataloader script? What do you think?

I'm not sure... I guess that if we add it in the same script, this would simply require an additional BigBioConfig. What schema should I use, and in what fields would you want me to store the data (title, abstract, pmid, mesh...)?

@sg-wbi
Collaborator

sg-wbi commented Apr 22, 2022

Regarding the additional split:
The data here is in the following format:

{
  "articles": [
    {
      "title": "Influencia: un problema Político-Terapéutico en la Genealogía del Psicoanálisis",
      "pmid": "biblio-994981",
      "abstractText": "El artículo tiene por objectivo trazar una genealogía possible de la técnica psicoanalítica marcando la importancia de las dimenciones históricas, política y sociales en la construcción de la psicoanálisis. A partir del fenômeno del magnetismo animal, propuesto por Franz Anton Mesmer, en siglo XVIII, pasando por la hipnosis y por la sugestión, hasta encuentrarmos la transferencia psicoanalítica, intentamos remontar la influencia como pressupuesto ético-político que atraviesa y possibilita todas esas practicas.(AU)",
      "Mesh": [
        {"Code": "D006801", "Word": "Humanos"},
        {"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]},
        {"Code": "D011572", "Word": "Psicoanálisis"},
        {"Code": "D006990", "Word": "Hipnosis", "CIE": ["GZFZZZZ"]},
        {"Code": "D011057", "Word": "Política"}
      ]
    }
  ]
}

This would turn out to be a Tasks.TEXT_CLASSIFICATION with examples in the form:

{
    "id": "<unique_id>",
    "document_id": "biblio-994981",  # "pmid" key
    "text": "TEXT",                  # "title" + "abstractText" keys
    "labels": ["D006801", ...],      # "Mesh" key
}

I would keep this data attached to this dataloader, since it was designed as an extra training split. A key aspect here is that the labels sometimes contain both a MeSH ID (e.g. D006801) and a CIE ID (e.g. R68.82), and sometimes only a MeSH ID. So, the way I see it, this should result in two (four counting source) extra configs with a train split:

        BigBioConfig(
            name="codiesp_extra_mesh_bigbio_text",
            version=SOURCE_VERSION,
            description="Abstracts from Lilacs and Ibecs with MESH Codes",
            schema="bigbio",
            subset_id="codiesp_extra_mesh",
        ),
        BigBioConfig(
            name="codiesp_extra_cie_bigbio_text",
            version=SOURCE_VERSION,
            description="Abstracts from Lilacs and Ibecs with CIE10 Codes",
            schema="bigbio",
            subset_id="codiesp_extra_cie",
        ),
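A minimal sketch of how _generate_examples could map one record of this JSON into the bigbio_text form above, with the subset_id suffix selecting between MeSH and CIE labels. The helper name and the inline record are illustrative, not the merged code:

```python
import json

# One trimmed record of the "Additional dataset" JSON (illustrative).
record = json.loads(
    '{"title": "T", "pmid": "biblio-994981", "abstractText": "A", '
    '"Mesh": [{"Code": "D006801", "Word": "Humanos"}, '
    '{"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]}]}'
)


def to_text_example(rec, uid, subset_id):
    """Hypothetical converter to a bigbio_text example."""
    if subset_id.endswith("mesh"):
        # codiesp_extra_mesh: every entry carries a MeSH code.
        labels = [m["Code"] for m in rec["Mesh"]]
    else:
        # codiesp_extra_cie: entries without a "CIE" key are skipped.
        labels = [c for m in rec["Mesh"] for c in m.get("CIE", [])]
    return {
        "id": str(uid),
        "document_id": rec["pmid"],
        "text": rec["title"] + " " + rec["abstractText"],
        "labels": labels,
    }


print(to_text_example(record, 0, "codiesp_extra_mesh")["labels"])  # ['D006801', 'D007989']
print(to_text_example(record, 0, "codiesp_extra_cie")["labels"])   # ['R68.82']
```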

Does it make sense?

@mapama247
Contributor Author

> Does it make sense?

Sure! I followed your suggestions and added four extra configs to hold the abstracts dataset :)

@sg-wbi sg-wbi mentioned this pull request Apr 25, 2022
@sg-wbi
Collaborator

sg-wbi commented Apr 25, 2022

Great work @mapama247 ! The dataset is now complete! Thank you very much for this contribution!
I'll wait to merge this since there seems to be an issue w/ testing, i.e. if I run

python -m tests.test_bigbio biodatasets/codiesp/codiesp.py --subset_id codiesp_p

I get this error

Traceback (most recent call last):
  File "/home/ele/Desktop/projects/tm/bigbio/tests/test_bigbio.py", line 187, in setUp
    self.datasets_bigbio[schema] = datasets.load_dataset(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1675, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1533, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 1020, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 258, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 349, in _create_builder_config
    raise ValueError(f"BuilderConfig {name} not found. Available: {list(self.builder_configs.keys())}")
ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']

However, to me it's the error that's wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the subset_id only supports bigbio_text.

As soon as we figure this out (see #398) I'll merge this.
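The filtering described above could look like this hypothetical helper (not the actual tests.test_bigbio code): derive the testable schemas from the builder's own config names rather than from _SUPPORTED_TASKS:

```python
# Config names as reported in the ValueError above.
AVAILABLE = [
    "codiesp_d_source", "codiesp_p_source", "codiesp_x_source",
    "codiesp_extra_mesh_source", "codiesp_extra_cie_source",
    "codiesp_d_bigbio_text", "codiesp_p_bigbio_text", "codiesp_x_bigbio_kb",
    "codiesp_extra_mesh_bigbio_text", "codiesp_extra_cie_bigbio_text",
]


def schemas_for(subset_id, config_names):
    """Return only the schemas actually registered for this subset_id,
    so the test suite never asks for e.g. codiesp_p_bigbio_kb."""
    prefix = subset_id + "_"
    return sorted(n[len(prefix):] for n in config_names if n.startswith(prefix))


print(schemas_for("codiesp_p", AVAILABLE))  # ['bigbio_text', 'source']
```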

@hakunanatasha
Collaborator

> Great work @mapama247 ! The dataset is now complete! Thank you very much for this contribution! [...] However, to me it's the error being wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the subset_id only supports bigbio_text. As soon as we figure this out (see #398) I'll merge this.

@sg-wbi the issue is that we inspect the tasks: the dataset's tasks require relation extraction and NER; there is no 'text' task, which is why it's throwing this error. Either the name of the view is incorrect (it should be _kb) or there is a task missing in _SUPPORTED_TASKS.

@sg-wbi
Collaborator

sg-wbi commented Apr 27, 2022

Hey @hakunanatasha thank you for taking a look at this.

> Either the name of the view is incorrect (should be _kb) or there is a task missing in _SUPPORTED_TASKS

I think the name is correct, because the "split" codiesp_p is a text classification task and _SUPPORTED_TASKS contains Tasks.TEXT_CLASSIFICATION. The way I see it, we should run tests based on the schema provided by the subset_id, or have a bypass arg. I'm seeing the same issue here

@hakunanatasha
Collaborator

@sg-wbi great call - I will fix this in the refactor of the unit-tests.

@hakunanatasha
Collaborator

@sg-wbi can we just run --subset_id and temporarily remove the other _SUPPORTED_TASKS that relate to the kb dataset? This would be a hack to check whether the unit-tests work, and would work around the fact that they fail on this unique case even though the structure is correct.

@hakunanatasha
Collaborator

@sg-wbi the new unit-tests take care of this consideration. I will merge.

@hakunanatasha hakunanatasha merged commit 985a389 into bigscience-workshop:master May 1, 2022