
Closes #64 #453

Merged
Merged 7 commits into bigscience-workshop:master on May 1, 2022

Conversation

mapama247
Contributor

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
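The config-naming convention the checklist relies on (config names of the form `{subset_id}_{schema}`, with at least one source and one bigbio config) can be sketched as follows. The `BigBioConfig` stand-in below is a hypothetical simplification for illustration; the real one in the repo subclasses `datasets.BuilderConfig`:

```python
from dataclasses import dataclass


# Hypothetical stand-in for the repo's BigBioConfig (the real class
# subclasses datasets.BuilderConfig); shown only to illustrate naming.
@dataclass
class BigBioConfig:
    name: str
    version: str
    description: str
    schema: str
    subset_id: str


def make_configs(dataset_name, source_version, bigbio_version, bigbio_schema):
    """Build the minimal pair of configs the checklist asks for:
    one for the source schema and one for a bigbio schema."""
    return [
        BigBioConfig(
            name=f"{dataset_name}_source",
            version=source_version,
            description=f"{dataset_name} source schema",
            schema="source",
            subset_id=dataset_name,
        ),
        BigBioConfig(
            name=f"{dataset_name}_bigbio_{bigbio_schema}",
            version=bigbio_version,
            description=f"{dataset_name} BigBio schema",
            schema=f"bigbio_{bigbio_schema}",
            subset_id=dataset_name,
        ),
    ]


configs = make_configs("my_dataset", "1.0.0", "1.0.0", "text")
print([c.name for c in configs])  # → ['my_dataset_source', 'my_dataset_bigbio_text']
```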

Collaborator

@sg-wbi sg-wbi left a comment


Thank you for your contribution @mapama247 ! Sorry for the delay. Unfortunately, I could not run the tests because the download link does not seem to work. I tried on two different occasions/networks, but still nothing. Is there any chance we can get it somewhere else? Or maybe you could create a GitHub repo with the data if you still have it. This is just to double-check that the script works fine!

Review threads on biodatasets/codiesp/codiesp.py (resolved)
@mapama247
Contributor Author

> Thank you for your contribution @mapama247 ! Sorry for the delay. Unfortunately I could not run the tests because the download link does not seem to work. I tried in two different occasion/networks but still nothing. Is there any chance we can get it somewhere else? Or maybe you can create a github repo with the data if you still have it. This is just to double check that the script works fine!

Oh, that's weird, I had no problems downloading the data last week... I just tried calling the script (after making sure to clear my cache) and the first time the download stopped at around 86% and threw a timeout error. But then I tried again and the entire dataset downloaded smoothly. I don't know if this is the same error you experienced, but if so, could you please give it another try? I have also tried downloading the data using the wget command and it works without problems... but if it doesn't for you, I guess I could simply upload it to a repo as you suggest.

@sg-wbi
Collaborator

sg-wbi commented Apr 20, 2022

I can now download the dataset w/o problems! Who knows what the problem was... thanks for double-checking though!

@mapama247
Contributor Author

> I can now download the dataset w/o problems! Who knows what was the problem... thanks for double checking though!

Glad to hear that! It must have something to do with Zenodo... I have just updated the script with a different source config for each task; let me know if everything is fine now.

Collaborator

@sg-wbi sg-wbi left a comment


@mapama247 thank you for applying the changes! Now that I could finally take a look at the dataset, I have some more in-depth comments. Could you please check them out? Thank you!

Review threads on biodatasets/codiesp/codiesp.py (resolved)
@sg-wbi
Collaborator

sg-wbi commented Apr 21, 2022

Ah! One more thing: here (see bottom of the page) they provide an "Additional dataset" to extend the train/dev splits. Could you please add these as well?
Or maybe it is worth creating a new Issue and a separate dataloader script? What do you think?

@mapama247
Contributor Author

> Ah! One more thing: here (see bottom of the page ) they provide an "Additional dataset" to extend train/dev splits. Could you please add these as well? Or maybe it is worth to create a new Issue and a separate dataloader script? What do you think?

I'm not sure... I guess that if we add it in the same script, this would simply require an additional BigBioConfig. What schema should I use, and in what fields would you want me to store the data (title, abstract, pmid, mesh...)?

@sg-wbi
Collaborator

sg-wbi commented Apr 22, 2022

Regarding the additional split:
The data here is in the following format:

{
  "articles": [
    {
      "title": "Influencia: un problema Político-Terapéutico en la Genealogía del Psicoanálisis",
      "pmid": "biblio-994981",
      "abstractText": "El artículo tiene por objectivo trazar una genealogía possible de la técnica psicoanalítica marcando la importancia de las dimenciones históricas, política y sociales en la construcción de la psicoanálisis. A partir del fenômeno del magnetismo animal, propuesto por Franz Anton Mesmer, en siglo XVIII, pasando por la hipnosis y por la sugestión, hasta encuentrarmos la transferencia psicoanalítica, intentamos remontar la influencia como pressupuesto ético-político que atraviesa y possibilita todas esas practicas.(AU)",
      "Mesh": [
        {"Code": "D006801", "Word": "Humanos"},
        {"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]},
        {"Code": "D011572", "Word": "Psicoanálisis"},
        {"Code": "D006990", "Word": "Hipnosis", "CIE": ["GZFZZZZ"]},
        {"Code": "D011057", "Word": "Política"}
      ]
    }
  ]
}

This would turn out to be a Tasks.TEXT_CLASSIFICATION with examples in the form:

{
    "id": "<unique_id>",
    "document_id": "biblio-994981",  # "pmid" key
    "text": "TEXT",                  # "title" + "abstractText" keys
    "labels": ["D006801", ...],      # "Mesh" key
}

I would keep this data attached to this dataloader, since it was designed as an extra training split. A key aspect here is that the labels sometimes contain both a MeSH ID (e.g. D006801) and a CIE ID (e.g. R68.82), and sometimes only a MeSH ID. So, the way I see it, this should result in two (four counting source) extra configs with a train split:

        BigBioConfig(
            name="codiesp_extra_mesh_bigbio_text",
            version=SOURCE_VERSION,
            description="Abstracts from Lilacs and Ibecs with MESH Codes",
            schema="bigbio",
            subset_id="codiesp_extra_mesh",
        ),
        BigBioConfig(
            name="codiesp_extra_cie_bigbio_text",
            version=SOURCE_VERSION,
            description="Abstracts from Lilacs and Ibecs with CIE10 Codes",
            schema="bigbio",
            subset_id="codiesp_extra_cie",
        ),
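A minimal sketch of how _generate_examples could map one record of this JSON into the bigbio_text form above, with the subset_id suffix selecting between MeSH and CIE labels. The helper name and the inline record are illustrative, not the merged code:

```python
import json

# One trimmed record of the "Additional dataset" JSON (illustrative).
record = json.loads(
    '{"title": "T", "pmid": "biblio-994981", "abstractText": "A", '
    '"Mesh": [{"Code": "D006801", "Word": "Humanos"}, '
    '{"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]}]}'
)


def to_text_example(rec, uid, subset_id):
    """Hypothetical converter to a bigbio_text example."""
    if subset_id.endswith("mesh"):
        # codiesp_extra_mesh: every entry carries a MeSH code.
        labels = [m["Code"] for m in rec["Mesh"]]
    else:
        # codiesp_extra_cie: entries without a "CIE" key are skipped.
        labels = [c for m in rec["Mesh"] for c in m.get("CIE", [])]
    return {
        "id": str(uid),
        "document_id": rec["pmid"],
        "text": rec["title"] + " " + rec["abstractText"],
        "labels": labels,
    }


print(to_text_example(record, 0, "codiesp_extra_mesh")["labels"])  # ['D006801', 'D007989']
print(to_text_example(record, 0, "codiesp_extra_cie")["labels"])   # ['R68.82']
```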

Does it make sense?

@mapama247
Contributor Author

> Does it make sense?

Sure! I followed your suggestions and added four extra configs to hold the abstracts dataset :)

@sg-wbi sg-wbi mentioned this pull request Apr 25, 2022
@sg-wbi
Collaborator

sg-wbi commented Apr 25, 2022

Great work @mapama247 ! The dataset is now complete! Thank you very much for this contribution!
I'll wait to merge this since there seems to be an issue w/ testing, i.e. if I run

python -m tests.test_bigbio biodatasets/codiesp/codiesp.py --subset_id codiesp_p

I get this error

Traceback (most recent call last):
  File "/home/ele/Desktop/projects/tm/bigbio/tests/test_bigbio.py", line 187, in setUp
    self.datasets_bigbio[schema] = datasets.load_dataset(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1675, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1533, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 1020, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 258, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 349, in _create_builder_config
    raise ValueError(f"BuilderConfig {name} not found. Available: {list(self.builder_configs.keys())}")
ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']

However, to me it's the error that's wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the subset_id only supports bigbio_text.

As soon as we figure this out (see #398) I'll merge this.
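The filtering described above could look like this hypothetical helper (not the actual tests.test_bigbio code): derive the testable schemas from the builder's own config names rather than from _SUPPORTED_TASKS:

```python
# Config names as reported in the ValueError above.
AVAILABLE = [
    "codiesp_d_source", "codiesp_p_source", "codiesp_x_source",
    "codiesp_extra_mesh_source", "codiesp_extra_cie_source",
    "codiesp_d_bigbio_text", "codiesp_p_bigbio_text", "codiesp_x_bigbio_kb",
    "codiesp_extra_mesh_bigbio_text", "codiesp_extra_cie_bigbio_text",
]


def schemas_for(subset_id, config_names):
    """Return only the schemas actually registered for this subset_id,
    so the test suite never asks for e.g. codiesp_p_bigbio_kb."""
    prefix = subset_id + "_"
    return sorted(n[len(prefix):] for n in config_names if n.startswith(prefix))


print(schemas_for("codiesp_p", AVAILABLE))  # ['bigbio_text', 'source']
```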

@hakunanatasha
Collaborator

> Great work @mapama247 ! The dataset is now complete! Thank you very much for this contribution! [...] However, to me it's the error being wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the subset_id only supports bigbio_text. As soon as we figure this out (see #398) I'll merge this.

@sg-wbi the issue is that we inspect the tasks: the dataset's tasks require relation extraction and NER; there is no 'text' task, which is why it's throwing this error. Either the name of the view is incorrect (it should be _kb) or there is a task missing in _SUPPORTED_TASKS.

@sg-wbi
Collaborator

sg-wbi commented Apr 27, 2022

Hey @hakunanatasha thank you for taking a look at this.

> Either the name of the view is incorrect (should be _kb) or there is a task missing in _SUPPORTED_TASKS

I think the name is correct, because the "split" codiesp_p is a text classification task and _SUPPORTED_TASKS contains Tasks.TEXT_CLASSIFICATION. The way I see it, we should run tests based on the schema provided by the subset_id, or have a bypass arg. I'm seeing the same issue here

@hakunanatasha
Collaborator

@sg-wbi great call - I will fix this in the refactor of the unit-tests.

@hakunanatasha
Collaborator

@sg-wbi can we just run --subset_id and temporarily remove the other _SUPPORTED_TASKS that relate to the kb dataset? This would be a hack to check whether the unit-tests work, and would work around the fact that they fail on this unique case even though the structure is correct.

@hakunanatasha
Collaborator

@sg-wbi the new unit-tests take care of this consideration. I will merge.

@hakunanatasha hakunanatasha merged commit 985a389 into bigscience-workshop:master May 1, 2022