Conversation
sg-wbi
left a comment
Thank you for your contribution @mapama247! Sorry for the delay. Unfortunately I could not run the tests because the download link does not seem to work. I tried on two different occasions/networks but still nothing. Is there any chance we can get it somewhere else? Or maybe you could create a GitHub repo with the data, if you still have it. This is just to double-check that the script works fine!
Oh that's weird, I had no problems downloading the data last week. I just tried calling the script (after making sure to clear my cache) and the first time the download stopped at around 86% and threw a timeout error. But then I tried again and the entire dataset downloaded smoothly. I don't know if this is the same error you experienced, but if it is, could you please give it another try? I have also tried downloading the data using the wget command and it works without problems, but if it doesn't for you, I guess I could simply upload it to a repo as you suggest.
I can now download the dataset w/o problems! Who knows what the problem was... thanks for double-checking though!
Glad to hear that! It must have something to do with Zenodo... I have just updated the script with a different source config for each task, let me know if everything is fine now.
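Since the Zenodo download apparently fails intermittently with timeouts, a small retry wrapper can make the fetch more robust. This is a minimal sketch with a hypothetical `download_with_retries` helper (not part of the bigbio codebase); the `fetch` callable stands in for whatever actually performs the HTTP request, e.g. a `urllib` or `requests` call:

```python
import time

def download_with_retries(fetch, attempts=3, backoff=2.0):
    """Call a flaky zero-argument `fetch` callable, retrying on timeouts.

    Hypothetical helper: `fetch` performs the actual download and is
    expected to raise TimeoutError on failure.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
```

A transient 86%-then-timeout failure like the one described above would then succeed on the second or third attempt instead of aborting the whole run.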
sg-wbi
left a comment
@mapama247 thank you for applying the changes! Now that I could finally take a look at the dataset, I have some more in-depth comments. Could you please check them out? Thank you!
Ah! One more thing: here (see the bottom of the page) they provide an "Additional dataset" to extend the train/dev splits. Could you please add these as well?
I'm not sure... I guess that if we add it in the same script this would simply require an additional BigBioConfig. What schema should I use, and in what fields would you want me to store the data (title, abstract, pmid, mesh...)?
Regarding the additional split:

```json
{"articles": [{"title": "Influencia: un problema Pol\u00edtico-Terap\u00e9utico en la Genealog\u00eda del Psicoan\u00e1lisis", "pmid": "biblio-994981", "abstractText": "El art\u00edculo tiene por objectivo trazar una genealog\u00eda possible de la t\u00e9cnica psicoanal\u00edtica marcando la importancia de las dimenciones hist\u00f3ricas, pol\u00edtica y sociales en la construcci\u00f3n de la psicoan\u00e1lisis. A partir del fen\u00f4meno del magnetismo animal, propuesto por Franz Anton Mesmer, en siglo XVIII, pasando por la hipnosis y por la sugesti\u00f3n, hasta encuentrarmos la transferencia psicoanal\u00edtica, intentamos remontar la influencia como pressupuesto \u00e9tico-pol\u00edtico que atraviesa y possibilita todas esas practicas.(AU)", "Mesh": [{"Code": "D006801", "Word": "Humanos"}, {"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]}, {"Code": "D011572", "Word": "Psicoan\u00e1lisis"}, {"Code": "D006990", "Word": "Hipnosis", "CIE": ["GZFZZZZ"]}, {"Code": "D011057", "Word": "Pol\u00edtica"}]}
```

This would turn out to be:

```
"id": "<unique_id>",
"document_id": "biblio-994981",  # "pmid" key
"text": "TEXT",                  # "title" + "abstractText" keys
"labels": ["D006801", ...],      # "Mesh" key
```

I would keep this data attached to this dataloader since it was designed to be an extra training split. A key aspect here is that the labels sometimes contain both a MeSH code and a CIE10 code, so I would add something like:

```python
BigBioConfig(
    name="codiesp_extra_mesh_bigbio_text",
    version=SOURCE_VERSION,
    description="Abstracts from Lilacs and Ibecs with MESH Codes",
    schema="bigbio",
    subset_id="codiesp_extra_mesh",
),
BigBioConfig(
    name="codiesp_extra_cie_bigbio_text",
    version=SOURCE_VERSION,
    description="Abstracts from Lilacs and Ibecs with CIE10 Codes",
    schema="bigbio",
    subset_id="codiesp_extra_cie",
),
```

Does it make sense?
Sure! I followed your suggestions and added four extra configs to hold the abstracts dataset :)
Great work @mapama247! The dataset is now complete! Thank you very much for this contribution!

When I run

```
python -m tests.test_bigbio biodatasets/codiesp/codiesp.py --subset_id codiesp_p
```

I get this error:

```
Traceback (most recent call last):
  File "/home/ele/Desktop/projects/tm/bigbio/tests/test_bigbio.py", line 187, in setUp
    self.datasets_bigbio[schema] = datasets.load_dataset(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1675, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1533, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 1020, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 258, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 349, in _create_builder_config
    raise ValueError(f"BuilderConfig {name} not found. Available: {list(self.builder_configs.keys())}")
ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']
```

However, to me it is the error that is wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the dataloader does not define that config. As soon as we figure this out (see #398) I'll merge this.
@sg-wbi the issue is that we inspect the tasks: the dataset tasks require relation extraction and NER; there is no 'text' task, which is why it's throwing this error. Either the name of the view is incorrect (it should be _kb) or there is a task missing in _SUPPORTED_TASKS.
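The mismatch can be illustrated by how the expected config names are derived from the declared tasks. This is a rough sketch, assuming a hypothetical task-to-schema mapping and helper (the real logic lives in the bigbio test utilities):

```python
# Hypothetical mapping from a supported task to the bigbio schema suffix
# the tests expect a config for.
TASK_TO_SCHEMA = {
    "NAMED_ENTITY_RECOGNITION": "kb",
    "RELATION_EXTRACTION": "kb",
    "TEXT_CLASSIFICATION": "text",
}

def expected_config_names(subset_id, tasks):
    """Config names the tests would look for, given the dataset's tasks."""
    suffixes = {TASK_TO_SCHEMA[task] for task in tasks}
    return sorted(f"{subset_id}_bigbio_{suffix}" for suffix in suffixes)
```

Under this sketch, listing NER and relation extraction as the tasks makes the tests look for codiesp_p_bigbio_kb, while the dataloader only defines codiesp_p_bigbio_text, which reproduces the ValueError above.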
Hey @hakunanatasha, thank you for taking a look at this.
I think that the name is correct, because the "split"
@sg-wbi great call, I will fix this in the refactor of the unit tests.
@sg-wbi can we just run --subset_id and temporarily remove the entries in _SUPPORTED_TASKS that relate to the kb dataset? This would be a hack to check whether the unit tests work, and a workaround for the fact that they fail on this unique case even though the structure is correct.
@sg-wbi the new unit tests take care of this consideration. I will merge.
Checklist
- [ ] File saved as biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscores for dataset naming).
- [ ] Defined the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
- [ ] Implemented _info(), _split_generators() and _generate_examples() in the dataloader script.
- [ ] The BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
- [ ] Confirmed the dataloader works with the datasets.load_dataset function.
- [ ] Confirmed the dataloader passes python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.