Add Chemdner dataset loader #326
Issue #28 |
Commit 526e7d9 |
Thank you for implementing the changes! Everything looks fine. You made me realize that we could use the "MeSH_Indexing_Chemical" annotations as labels for a |
Commit 6251db8 |
@sg-wbi thanks for self assigning this one! I'll jump to some of the other PRs |
just a quick note, we have some extra files in this PR besides just the |
Fixed! I removed the other files in commit 7796654. |
During the tests of the Text Classification task, I crashed during the The error :
My code:

    def _get_textcls_example(self, d: bioc.BioCDocument) -> Dict:
        example = {"document_id": d.id, "text": [], "labels": []}
        for p in d.passages:
            example["text"].append(p.text)
            for a in p.annotations:
                if a.infons.get("type") == "MeSH_Indexing_Chemical":
                    example["labels"].append(a.infons.get("identifier"))
        example["text"] = " ".join(example["text"])
        return example

The Have you passed the tests with the loading script of |
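The helper quoted above can be tried without the bioc package. The following is a minimal sketch, not the loader's actual code: the `Annotation`, `Passage`, and `Document` dataclasses are hypothetical stand-ins that only mirror the attributes the snippet reads (`id`, `passages`, `text`, `annotations`, `infons`), and the identifier values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical stand-ins for the bioc objects; only the attributes
# used by the quoted helper are modelled.
@dataclass
class Annotation:
    infons: Dict[str, str] = field(default_factory=dict)


@dataclass
class Passage:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Document:
    id: str
    passages: List[Passage] = field(default_factory=list)


def get_textcls_example(d: Document) -> Dict:
    """Join all passage texts and collect MeSH indexing identifiers as labels."""
    texts: List[str] = []
    labels: List[str] = []
    for p in d.passages:
        texts.append(p.text)
        for a in p.annotations:
            # Only document-level MeSH indexing annotations become labels.
            if a.infons.get("type") == "MeSH_Indexing_Chemical":
                labels.append(a.infons.get("identifier"))
    return {"document_id": d.id, "text": " ".join(texts), "labels": labels}


# Placeholder data: the identifiers below are made up for illustration.
doc = Document(
    id="1",
    passages=[
        Passage(
            "Title.",
            [Annotation({"type": "MeSH_Indexing_Chemical", "identifier": "D000001"})],
        ),
        Passage("Abstract.", [Annotation({"type": "Chemical", "identifier": "X"})]),
    ],
)
example = get_textcls_example(doc)
```

Building the text and label lists separately and joining at the end avoids reusing the `"text"` key first as a list and then as a string, which is the only structural difference from the quoted version.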
The test for git pull upstream master |
Thanks, It works. |
Commit 540901f |
Thank you very much for implementing the additional view! One last thing and then it should be good to go, for real this time. If I run the tests I get these warnings:
It seems that somehow some entities without text get into the dataset. Could you please check again your script and fix this? |
Nice job! I hadn't seen them because the output is very compact. It was due to
I fixed it using a condition: biomedical/biodatasets/chemdner/chemdner.py Lines 293 to 294 in ac8fb5e
I hope it's ok! Commit ac8fb5e |
Like this?

    if self.config.schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical":
        continue
    if (a.text == None or a.text == "") and self.config.schema == "bigbio_kb":
        continue

Or like this?

    if (self.config.schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical") or (
        (a.text == None or a.text == "") and self.config.schema == "bigbio_kb"
    ):
        continue
|
The first one is easier to read |
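The two variants discussed above should skip exactly the same annotations. The sketch below checks that by brute force; it is an illustration under assumptions, not the loader's code. The function names are hypothetical, `"bigbio_kb"` and `"MeSH_Indexing_Chemical"` come from the thread, and the other schema names and the `"Chemical"` type are assumed values chosen only to exercise the branches.

```python
import itertools
from typing import Optional


def skip_two_checks(schema: str, a_type: str, text: Optional[str]) -> bool:
    """First variant: two separate conditions, one reason to skip per line."""
    if schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical":
        return True
    if schema == "bigbio_kb" and (text is None or text == ""):
        return True
    return False


def skip_one_check(schema: str, a_type: str, text: Optional[str]) -> bool:
    """Second variant: the same logic folded into a single boolean expression."""
    return (schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical") or (
        (text is None or text == "") and schema == "bigbio_kb"
    )


# Assumed sample values; only "bigbio_kb" and "MeSH_Indexing_Chemical"
# appear in the discussion itself.
schemas = ["source", "bigbio_kb", "bigbio_text"]
types = ["MeSH_Indexing_Chemical", "Chemical"]
texts = [None, "", "aspirin"]

# Collect every input combination on which the two variants disagree.
mismatches = [
    case
    for case in itertools.product(schemas, types, texts)
    if skip_two_checks(*case) != skip_one_check(*case)
]
```

Since both variants agree on every combination, the choice between them is purely about readability, which is the point made in the reply above.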
Commit 9cf9100 |
Great! Thank you for bearing with all the changes, especially w/ the schema change. Thanks again for your contribution! |
Sorry! I got confused and closed it! How do I change the status of #28? Can I simply close it? |
I think so, like Gabriel did on issue #53 |
Now? |
Perfect! 🎉 |
* Add the TEXT_CLASSIFICATION task
* Fix empty entities
* Apply entity removal only on the bigbio_kb schema
* initial commit, data loader ready
* fix issues
* add normalized fields
* remove duplicate datasets, pin aiohttp (#354)
* Medhop (#322)
* Reset branch
* Edit the source schemas to be the same as the raw data
* Closes #237 (#344)
* implement scifact_bigbio_entailment_rationale
* fix source dataset builds remove comments
* create _bigbio_labelprediction_generate_examples erase todo put config name in beginning rather than end change check in _generate_examples to look for source only
* improve claims description
* correct rationale description
* import BigBioConfig fix subset_id and name convention
* remove name
* Update biodatasets/scifact/scifact.py Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* Update biodatasets/scifact/scifact.py Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* Update biodatasets/scifact/scifact.py Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* formatting delete __main__ method Co-authored-by: Nicholas Broad <nicholas@nmbroad.com> Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* Update progress bars
* Closes #96 (#346)
* Add dataloader for SETH-corpus
* exclude wrong annotation events
* add support task RELATION_EXTRACTION Co-authored-by: ChienVM <chien_vm@detomo.co.jp>
* Osiris (#334)
* Changed name of dataset to make it consistent with other datasets, changed copyright and deleted TODO
* Added loader script for OSIRIS
* Changed website, added v_norm, changed offsets, text and normalized for variants and genes
* Delete multi_xscience.py
* Changed db_name for variants
* Fixed a minor bug
* Add Chemdner dataset loader (#326)
* Add the TEXT_CLASSIFICATION task
* Fix empty entities
* Apply entity removal only on the bigbio_kb schema
* fix entity names for discontiguous entities (ref PR #373)
* formatting
* initial commit, data loader ready
* fix issues
* add normalized fields
* fix entity names for discontiguous entities (ref PR #373)
* formatting
Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
Co-authored-by: Labrak Yanis <yanis.labrak@alumni.univ-avignon.fr>
Co-authored-by: Nicholas Broad <nbroad94@gmail.com>
Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
Co-authored-by: Leon Weber <leonweber@users.noreply.github.com>
Co-authored-by: Minh Chien Vu <31467068+vumichien@users.noreply.github.com>
Co-authored-by: ChienVM <chien_vm@detomo.co.jp>
Co-authored-by: Marianna Nezhurina <43296932+marianna13@users.noreply.github.com>
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
If the following information is NOT present in the issue, please populate:
Checkbox
* biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
* _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
* _info(), _split_generators() and _generate_examples() in dataloader script.
* BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
* datasets.load_dataset function.
* python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
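Before running the test suite, one could check that a dataloader script defines the module-level variables the checklist names. The sketch below is a hypothetical pre-flight helper, not part of the project's tooling: `missing_variables` and the sample source string are assumptions, and it uses only the standard-library `ast` module, so the script under inspection is parsed and never executed.

```python
import ast
from typing import List

# Variable names taken from the checklist above.
REQUIRED_VARIABLES = [
    "_CITATION",
    "_DATASETNAME",
    "_DESCRIPTION",
    "_HOMEPAGE",
    "_LICENSE",
    "_URLs",
    "_SUPPORTED_TASKS",
    "_SOURCE_VERSION",
    "_BIGBIO_VERSION",
]


def missing_variables(source: str) -> List[str]:
    """Return the required names that are not assigned at module top level."""
    tree = ast.parse(source)
    defined = set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            # Plain assignment, e.g. `_LICENSE = "..."`.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    defined.add(target.id)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # Annotated assignment, e.g. `_LICENSE: str = "..."`.
            defined.add(node.target.id)
    return [name for name in REQUIRED_VARIABLES if name not in defined]


# Hypothetical, deliberately incomplete script contents.
sample_source = '_CITATION = ""\n_DATASETNAME = "my_dataset"\n'
missing = missing_variables(sample_source)
```

For a real script, the source text would be read from the file under biodatasets/ and an empty result would mean every listed variable is present. This only checks that the names exist, not that their values are sensible.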