Closes #64 #453
Conversation
Thank you for your contribution @mapama247! Sorry for the delay. Unfortunately I could not run the tests because the download link does not seem to work. I tried on two different occasions/networks but still nothing. Is there any chance we can get it somewhere else? Or maybe you can create a GitHub repo with the data if you still have it. This is just to double check that the script works fine!
Oh that's weird, I had no problems downloading the data last week. I just tried calling the script (after making sure to clear my cache) and the first time the download stopped at around 86% and threw a timeout error. But then I tried again and the entire dataset was downloaded smoothly. I don't know if this is the same error you experienced, but if so, could you please give it another try? I have also tried downloading the data using the wget command and it works without problems, but if it doesn't for you, I guess I could simply upload it to a repo as you suggest.
I can now download the dataset without problems! Who knows what the problem was... thanks for double checking though!
Glad to hear that! It must have something to do with Zenodo... I have just updated the script with a different source config for each task; let me know if everything is fine now.
@mapama247 thank you for applying the changes! Now that I could finally take a look at the dataset, I have some more in-depth comments. Could you please check them out? Thank you!
Ah! One more thing: here (see the bottom of the page) they provide an "Additional dataset" to extend the train/dev splits. Could you please add these as well?
I'm not sure... I guess that if we add it in the same script this would simply require an additional BigBioConfig. What schema should I use, and in what fields would you want me to store the data (title, abstract, pmid, mesh...)?
Regarding the additional split, a sample record looks like this:

{"articles": [{"title": "Influencia: un problema Pol\u00edtico-Terap\u00e9utico en la Genealog\u00eda del Psicoan\u00e1lisis", "pmid": "biblio-994981", "abstractText": "El art\u00edculo tiene por objectivo trazar una genealog\u00eda possible de la t\u00e9cnica psicoanal\u00edtica marcando la importancia de las dimenciones hist\u00f3ricas, pol\u00edtica y sociales en la construcci\u00f3n de la psicoan\u00e1lisis. A partir del fen\u00f4meno del magnetismo animal, propuesto por Franz Anton Mesmer, en siglo XVIII, pasando por la hipnosis y por la sugesti\u00f3n, hasta encuentrarmos la transferencia psicoanal\u00edtica, intentamos remontar la influencia como pressupuesto \u00e9tico-pol\u00edtico que atraviesa y possibilita todas esas practicas.(AU)", "Mesh": [{"Code": "D006801", "Word": "Humanos"}, {"Code": "D007989", "Word": "Libido", "CIE": ["R68.82"]}, {"Code": "D011572", "Word": "Psicoan\u00e1lisis"}, {"Code": "D006990", "Word": "Hipnosis", "CIE": ["GZFZZZZ"]}, {"Code": "D011057", "Word": "Pol\u00edtica"}]}

In the bigbio_text schema this would turn out to be:

"id": "<unique_id>",
"document_id": "biblio-994981",  # "pmid" key
"text": "TEXT",                  # "title" + "abstractText" keys
"labels": ["D006801", ...],      # "Mesh" key

I would keep this data attached to this dataloader since it was designed to be an extra training split. A key aspect here is that the labels sometimes contain both a MeSH code and a CIE10 code, so I would add one config per label type:

BigBioConfig(
    name="codiesp_extra_mesh_bigbio_text",
    version=SOURCE_VERSION,
    description="Abstracts from Lilacs and Ibecs with MESH Codes",
    schema="bigbio",
    subset_id="codiesp_extra_mesh",
),
BigBioConfig(
    name="codiesp_extra_cie_bigbio_text",
    version=SOURCE_VERSION,
    description="Abstracts from Lilacs and Ibecs with CIE10 Codes",
    schema="bigbio",
    subset_id="codiesp_extra_cie",
),

Does it make sense?
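As an aside, a minimal sketch (not the actual dataloader code) of how a record like the one above could be mapped to bigbio_text examples; the helper name, the file handling, and the whitespace joining of title and abstract are assumptions for illustration only:

import json

def generate_extra_examples(filepath, code_key="Code"):
    """Yield (key, example) pairs in the bigbio_text layout for the extra split.

    code_key="Code" collects MeSH codes; a CIE10 variant would instead flatten
    the "CIE" lists of each Mesh entry.
    """
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    for uid, article in enumerate(data["articles"]):
        labels = [m[code_key] for m in article.get("Mesh", []) if code_key in m]
        yield uid, {
            "id": str(uid),
            "document_id": article["pmid"],                             # "pmid" key
            "text": article["title"] + " " + article["abstractText"],  # "title" + "abstractText" keys
            "labels": labels,                                           # codes taken from "Mesh"
        }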
Sure! I followed your suggestions and added four extra configs to hold the abstracts dataset :)
Great work @mapama247! The dataset is now complete! Thank you very much for this contribution!

However, when I run

python -m tests.test_bigbio biodatasets/codiesp/codiesp.py --subset_id codiesp_p

I get this error:

Traceback (most recent call last):
File "/home/ele/Desktop/projects/tm/bigbio/tests/test_bigbio.py", line 187, in setUp
self.datasets_bigbio[schema] = datasets.load_dataset(
File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1675, in load_dataset
builder_instance = load_dataset_builder(
File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/load.py", line 1533, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 1020, in __init__
super().__init__(*args, **kwargs)
File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 258, in __init__
self.config, self.config_id = self._create_builder_config(
File "/home/ele/.venv/bigbio/lib/python3.9/site-packages/datasets/builder.py", line 349, in _create_builder_config
raise ValueError(f"BuilderConfig {name} not found. Available: {list(self.builder_configs.keys())}")
ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']

However, to me it is the error that is wrong, i.e. we are not supposed to run tests for "codiesp_p_bigbio_kb" if the dataloader does not define that config. As soon as we figure this out (see #398) I'll merge this.
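For reference, a minimal sketch of loading one of the configs that the ValueError lists as available (the script path is the one from the test command above):

import datasets

# "codiesp_p" only exposes a *_bigbio_text view, so that is the name to request;
# asking for "codiesp_p_bigbio_kb" raises the ValueError shown above.
dataset = datasets.load_dataset(
    "biodatasets/codiesp/codiesp.py",
    name="codiesp_p_bigbio_text",
)
print(dataset)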
@sg-wbi the issue is that we inspect the tasks: the dataset's tasks require relation extraction and NER, and there is no 'text' task, which is why it's throwing this error. Either the name of the view is incorrect (should be _kb) or there is a task missing in _SUPPORTED_TASKS.
Hey @hakunanatasha thank you for taking a look at this.
I think that the name is correct, because the "split" codiesp_p only provides a bigbio_text config, so the tests should not be looking for a _kb view of it.
@sg-wbi great call - I will fix this in the refactor of the unit-tests.
@sg-wbi can we just run --subset_id and temporarily remove the other _SUPPORTED_TASKS that relate to the kb dataset? This would be a hack to check whether the unit-tests work and to work around the fact that it's failing on this unique case even though the structure is correct.
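To make the point about task inspection concrete, here is a purely hypothetical sketch (none of these names come from the actual bigbio test code) of how deriving the expected schema views from _SUPPORTED_TASKS can end up requesting a config that a subset does not ship:

# Assumed task-to-schema mapping, for illustration only.
TASK_TO_SCHEMA = {
    "NAMED_ENTITY_RECOGNITION": "kb",
    "RELATION_EXTRACTION": "kb",
    "TEXT_CLASSIFICATION": "text",
}

def expected_configs(subset_id, supported_tasks):
    """Return the config names a naive harness would try to test for a subset."""
    schemas = {TASK_TO_SCHEMA[task] for task in supported_tasks}
    return [f"{subset_id}_bigbio_{schema}" for schema in sorted(schemas)]

# expected_configs("codiesp_p", ["NAMED_ENTITY_RECOGNITION"])
# -> ["codiesp_p_bigbio_kb"], which this subset does not define, hence the ValueError.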
@sg-wbi the new unit-tests take care of this consideration. I will merge.
Checkbox
- Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in the dataloader script.
- Make sure the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
- Confirm the dataloader script works with the datasets.load_dataset function.
- Confirm the dataloader script passes the tests run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.