Conversation
sg-wbi
left a comment
Thank you for your contribution @mapama247! Sorry for the delay. Unfortunately I could not run the tests because the download link does not seem to work. I tried on two different occasions/networks but still nothing. Is there any chance we can get it somewhere else? Or maybe you could create a GitHub repo with the data, if you still have it. This is just to double-check that the script works fine!
Oh that's weird, I had no problems downloading the data last week. I just tried calling the script (after making sure to clear my cache) and the first time the download stopped at around 86% and threw a timeout error. But then I tried again and the entire dataset downloaded smoothly. I don't know if this is the same error you experienced, but if it is, could you please give it another try? I have also tried downloading the data using the wget command and it works without problems, but if it doesn't for you, I guess I could simply upload it to a repo as you suggest.
I can now download the dataset w/o problems! Who knows what the problem was... thanks for double-checking though!
Glad to hear that! It must have something to do with Zenodo... I have just updated the script with a different source config for each task, let me know if everything is fine now.
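Since the Zenodo download apparently fails intermittently with timeouts, a small retry wrapper can make the fetch more robust. This is a minimal sketch with a hypothetical `download_with_retries` helper (not part of the bigbio codebase); the `fetch` callable stands in for whatever actually performs the HTTP request, e.g. a `urllib` or `requests` call:

```python
import time

def download_with_retries(fetch, attempts=3, backoff=2.0):
    """Call a flaky zero-argument `fetch` callable, retrying on timeouts.

    Hypothetical helper: `fetch` performs the actual download and is
    expected to raise TimeoutError on failure.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
```

A transient 86%-then-timeout failure like the one described above would then succeed on the second or third attempt instead of aborting the whole run.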
sg-wbi
left a comment
@mapama247 thank you for applying the changes! Now that I could finally take a look at the dataset, I have some more in-depth comments. Could you please check them out? Thank you!
Ah! One more thing: here (see the bottom of the page) they provide an "Additional dataset" to extend the train/dev splits. Could you please add these as well?
I'm not sure... I guess that if we add it in the same script this would simply require an additional BigBioConfig. What schema should I use, and in what fields would you want me to store the data (title, abstract, pmid, mesh...)?
Regarding the additional split:

```json
{"articles": [{"title": "Influencia: un problema Pol\u00edtico-Terap\u00e9utico en la Genealog\u00eda del Psicoan\u00e1lisis", "pmid": "biblio-994981", "abstractText": "El art\u00edculo tiene por objectivo trazar una genealog\u00eda possible de la t\u00e9cnica psicoanal\u00edtica marcando la importancia de las dimenciones hist\u00f3ricas, pol\u00edtica y sociales en la construcci\u00f3n de la psicoan\u00e1lisis. A partir del fen\u00f4meno del magnetismo animal, propuesto por Franz Anton Mesmer, en siglo XVIII, pasando por la hipnosis y por la sugesti\u00f3n, hasta encuentrarmos la transferencia psicoanal\u00edtica, intentamos remontar la influencia como pressupuesto \u00e9tico-pol\u00edtico que atraviesa y possibilita todas esas practicas.(AU)", "Mesh": [{"Code": "D006801", "Word": "Humanos"}, {"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]}, {"Code": "D011572", "Word": "Psicoan\u00e1lisis"}, {"Code": "D006990", "Word": "Hipnosis", "CIE": ["GZFZZZZ"]}, {"Code": "D011057", "Word": "Pol\u00edtica"}]}
```

This would turn out to be:

```
"id": "<unique_id>",
"document_id": "biblio-994981",  # "pmid" key
"text": "TEXT",                  # "title" + "abstractText" keys
"labels": ["D006801", ...],      # "Mesh" key
```

I would keep this data attached to this dataloader since it was designed to be an extra training split. A key aspect here is that the labels sometimes contain both a MeSH code and a CIE10 code, so I would add something like:

```python
BigBioConfig(
    name="codiesp_extra_mesh_bigbio_text",
    version=SOURCE_VERSION,
    description="Abstracts from Lilacs and Ibecs with MESH Codes",
    schema="bigbio",
    subset_id="codiesp_extra_mesh",
),
BigBioConfig(
    name="codiesp_extra_cie_bigbio_text",
    version=SOURCE_VERSION,
    description="Abstracts from Lilacs and Ibecs with CIE10 Codes",
    schema="bigbio",
    subset_id="codiesp_extra_cie",
),
```

Does it make sense?
Sure! I followed your suggestions and added four extra configs to hold the abstracts dataset :)
Great work @mapama247! The dataset is now complete! Thank you very much for this contribution!

When I run

```
python -m tests.test_bigbio biodatasets/codiesp/codiesp.py --subset_id codiesp_p
```

I get this error:

```
Traceback (most recent call last):
  File "/home/ele/Desktop/projects/tm/bigbio/tests/test_bigbio.py", line 187, in setUp
    self.datasets_bigbio[schema] = datasets.load_dataset(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1675, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1533, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 1020, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 258, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 349, in _create_builder_config
    raise ValueError(f"BuilderConfig {name} not found. Available: {list(self.builder_configs.keys())}")
ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']
```

However, to me it is the error that is wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the dataloader does not define that config. As soon as we figure this out (see #398) I'll merge this.
@sg-wbi the issue is that we inspect the tasks: the dataset tasks require relation extraction and NER; there is no 'text' task, which is why it's throwing this error. Either the name of the view is incorrect (it should be _kb) or there is a task missing in _SUPPORTED_TASKS.
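The mismatch can be illustrated by how the expected config names are derived from the declared tasks. This is a rough sketch, assuming a hypothetical task-to-schema mapping and helper (the real logic lives in the bigbio test utilities):

```python
# Hypothetical mapping from a supported task to the bigbio schema suffix
# the tests expect a config for.
TASK_TO_SCHEMA = {
    "NAMED_ENTITY_RECOGNITION": "kb",
    "RELATION_EXTRACTION": "kb",
    "TEXT_CLASSIFICATION": "text",
}

def expected_config_names(subset_id, tasks):
    """Config names the tests would look for, given the dataset's tasks."""
    suffixes = {TASK_TO_SCHEMA[task] for task in tasks}
    return sorted(f"{subset_id}_bigbio_{suffix}" for suffix in suffixes)
```

Under this sketch, listing NER and relation extraction as the tasks makes the tests look for codiesp_p_bigbio_kb, while the dataloader only defines codiesp_p_bigbio_text, which reproduces the ValueError above.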
Hey @hakunanatasha, thank you for taking a look at this.
I think that the name is correct, because the "split"
@sg-wbi great call, I will fix this in the refactor of the unit tests.
@sg-wbi can we just run --subset_id and temporarily remove the entries in _SUPPORTED_TASKS that relate to the kb dataset? This would be a hack to check whether the unit tests work, and a workaround for the fact that they fail on this unique case even though the structure is correct.
@sg-wbi the new unit tests take care of this consideration. I will merge.
Checklist
- [ ] File saved as biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscores for dataset naming).
- [ ] Defined the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
- [ ] Implemented _info(), _split_generators() and _generate_examples() in the dataloader script.
- [ ] The BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
- [ ] Confirmed the dataloader works with the datasets.load_dataset function.
- [ ] Confirmed the dataloader passes python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.