
Add Chemdner dataset loader #326

Merged (3 commits, Apr 7, 2022)

Conversation

@qanastek (Contributor) commented Apr 4, 2022

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

  • Name: name of the dataset
  • Description: short description of the dataset (or link to social media or blog post)
  • Paper: link to the dataset paper if available
  • Data: link to the online home of the dataset

Checklist

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

@qanastek (Contributor, Author) commented Apr 4, 2022

Issue #28

4 review threads on biodatasets/chemdner/chemdner.py (outdated, resolved)
@qanastek (Contributor, Author) commented Apr 4, 2022

Commit 526e7d9

@sg-wbi (Collaborator) commented Apr 5, 2022

Thank you for implementing the changes! Everything looks fine. You made me realize that we could use the "MeSH_Indexing_Chemical" annotations as labels for a text_classification schema. I asked internally about this and will get back to you asap.

@qanastek (Contributor, Author) commented Apr 5, 2022

Commit 6251db8

@sg-wbi sg-wbi self-assigned this Apr 5, 2022
@galtay (Collaborator) commented Apr 5, 2022

@sg-wbi thanks for self-assigning this one! I'll jump to some of the other PRs.

@galtay (Collaborator) commented Apr 6, 2022

Just a quick note: we have some extra files in this PR besides the chemdner loader.

@qanastek (Contributor, Author) commented Apr 6, 2022

Fixed! I removed the other files in commit 7796654.

@sg-wbi (Collaborator) commented Apr 6, 2022

Thanks @galtay! There is an update to the schema for text_classification which allows multiple labels. See the "#general" channel in Discord.

@qanastek Could you please add this view to the dataset? You can use PR #348 as an example.

Then we can close this!

@qanastek (Contributor, Author) commented Apr 6, 2022

During the tests of the Text Classification task, the datasets.load_dataset() function crashed. Do you have any ideas? I am using your code base.

The error:

======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/d/Projects/LIA/huggingface/BigScience BioMedical/pull_resquests/Final_ChemdNER/biomedical/tests/test_bigbio.py", line 164, in setUp
    self.datasets_bigbio[schema] = datasets.load_dataset(
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/builder.py", line 1080, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/features/features.py", line 1287, in encode_example
    return encode_nested_example(self, example)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/features/features.py", line 971, in encode_nested_example
    return {
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/features/features.py", line 971, in <dictcomp>
    return {
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 152, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 152, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'label'

----------------------------------------------------------------------
Ran 1 test in 15.950s

FAILED (errors=1)

My code:

    def _get_textcls_example(self, d: bioc.BioCDocument) -> Dict:
        example = {"document_id": d.id, "text": [], "labels": []}
        for p in d.passages:
            example["text"].append(p.text)
            for a in p.annotations:
                if a.infons.get("type") == "MeSH_Indexing_Chemical":
                    example["labels"].append(a.infons.get("identifier"))
        example["text"] = " ".join(example["text"])
        return example

The datasets.load_dataset function is looking for a label field, but we are giving it a labels one.
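The mismatch can be reproduced without the datasets library: the zip_dict helper shown in the traceback iterates the keys of the first dict and indexes every dict with them, so an example that lacks a key declared in the schema raises KeyError. A minimal stdlib-only sketch (the schema and example dicts here are illustrative stand-ins, not the real feature definitions):

```python
# Mimic of the zip_dict helper from the traceback
# (datasets/utils/py_utils.py): iterate keys of the first
# dict and index all dicts with each of those keys.
def zip_dict(*dicts):
    for key in dicts[0]:
        yield key, tuple(d[key] for d in dicts)

# Hypothetical old schema expecting "label" vs. an example providing "labels"
schema = {"document_id": "string", "text": "string", "label": "list"}
example = {"document_id": "d1", "text": "an abstract", "labels": ["MeSH:D000001"]}

try:
    dict(zip_dict(schema, example))
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'label'
```

This is exactly the shape of the failure above: an outdated feature schema on one side, a renamed field on the other.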

Did the tests pass with the nlmchem loading script?

@sg-wbi (Collaborator) commented Apr 6, 2022

The test for nlmchem passes. Are you working on the latest version of the repo?

git pull upstream master

@qanastek (Contributor, Author) commented Apr 6, 2022

Thanks, it works.

@qanastek (Contributor, Author) commented Apr 6, 2022

Commit 540901f

@sg-wbi (Collaborator) commented Apr 6, 2022

Thank you very much for implementing the additional view! One last thing, and then it should be good to go for real this time. If I run the tests I get these warnings:

WARNING:__main__:Example:11379 - entity:11371  text:`None` != text_by_offset:``
Example:17134 - entity:17129  text:`None` != text_by_offset:``
Example:21504 - entity:21503  text:`None` != text_by_offset:``
Example:23227 - entity:23213  text:`None` != text_by_offset:``

It seems that somehow some entities without text get into the dataset. Could you please check your script again and fix this?
Thank you!

@qanastek (Contributor, Author) commented Apr 6, 2022

Nice job! I hadn't seen them since the output is very compact.

It was due to None values present in the data:

  • {'id': '21503', 'type': 'Chemical', 'text': [None], 'offsets': [[1607, 1607]], 'normalized': []}
  • {'id': '11371', 'type': 'Chemical', 'text': [None], 'offsets': [[179, 179]], 'normalized': []}
  • {'id': '17129', 'type': 'Chemical', 'text': [None], 'offsets': [[55, 55]], 'normalized': []}

I fixed it using a condition:

if a.text is None or a.text == "":
    continue

I hope it's ok!

Commit ac8fb5e
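Applied to the serialized entities listed above, that condition drops all three offending records. A quick stdlib check (the first three annotation dicts are copied from the list above; the fourth, with id "1", is a made-up non-empty entity for contrast):

```python
annotations = [
    {"id": "21503", "type": "Chemical", "text": [None], "offsets": [[1607, 1607]], "normalized": []},
    {"id": "11371", "type": "Chemical", "text": [None], "offsets": [[179, 179]], "normalized": []},
    {"id": "17129", "type": "Chemical", "text": [None], "offsets": [[55, 55]], "normalized": []},
    {"id": "1", "type": "Chemical", "text": ["aspirin"], "offsets": [[0, 7]], "normalized": []},
]

# Keep only entities whose text list is non-empty and whose first
# span is neither None nor "" (same effect as the condition above,
# applied to the serialized form where `text` is a list of spans).
kept = [a for a in annotations if a["text"] and a["text"][0]]
print([a["id"] for a in kept])  # ['1']
```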

@sg-wbi (Collaborator) commented Apr 7, 2022

@qanastek

@qanastek (Contributor, Author) commented Apr 7, 2022

Like this ?

if self.config.schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical":
    continue

if (a.text is None or a.text == "") and self.config.schema == "bigbio_kb":
    continue

Or like this ?

if (self.config.schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical") or ((a.text is None or a.text == "") and self.config.schema == "bigbio_kb"):
    continue
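For readability, both checks could also be factored into a small predicate; a hedged sketch, where should_skip and its arguments are hypothetical stand-ins for the config schema, annotation type, and annotation text (not code from this PR):

```python
def should_skip(schema: str, a_type: str, text) -> bool:
    """Skip MeSH indexing annotations and empty-text entities,
    but only when building the bigbio_kb view."""
    if schema != "bigbio_kb":
        return False
    return a_type == "MeSH_Indexing_Chemical" or text is None or text == ""

print(should_skip("bigbio_kb", "MeSH_Indexing_Chemical", "aspirin"))  # True
print(should_skip("bigbio_kb", "Chemical", None))                     # True
print(should_skip("source", "Chemical", None))                        # False
```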

@sg-wbi (Collaborator) commented Apr 7, 2022

The first one is easier to read

@qanastek (Contributor, Author) commented Apr 7, 2022

Commit 9cf9100

@sg-wbi (Collaborator) commented Apr 7, 2022

Great! Thank you for bearing with all the changes, especially with the schema change. Thanks again for your contribution!

@sg-wbi sg-wbi closed this Apr 7, 2022
@sg-wbi sg-wbi reopened this Apr 7, 2022
@sg-wbi sg-wbi merged commit 3f280a4 into bigscience-workshop:master Apr 7, 2022
@qanastek (Contributor, Author) commented Apr 7, 2022

@sg-wbi Can you switch the status of issue #28 to Done to keep track of progress?

@sg-wbi (Collaborator) commented Apr 7, 2022

Sorry! I got confused and closed it! How do I change the status of #28? Can I simply close it?

@qanastek (Contributor, Author) commented Apr 7, 2022

I think so, like Gabriel did on issue #53.

@sg-wbi (Collaborator) commented Apr 7, 2022

Now?

@qanastek (Contributor, Author) commented Apr 7, 2022

Perfect! 🎉

whoisjones pushed a commit to whoisjones/biomedical that referenced this pull request Apr 8, 2022
* Add the TEXT_CLASSIFICATION task

* Fix empty entities

* Apply entity removal only on the bigbio_kb schema
hakunanatasha pushed a commit that referenced this pull request Apr 9, 2022
* initial commit, data loader ready

* fix issues

* add normalized fields

* remove duplicate datasets, pin aiohttp (#354)

remove duplicate datasets, pin aiohttp

* Medhop (#322)

* Reset branch

* Edit the source schemas to be the same as the raw data

* Closes #237 (#344)

* implement scifact_bigbio_entailment_rationale

* fix source dataset builds

remove comments

* create _bigbio_labelprediction_generate_examples

erase todo
put config name in beginning rather than end
change check in _generate_examples to look for source only

* improve claims description

* correct rationale description

* import BigBioConfig

fix subset_id and name convention

* remove name

* Update biodatasets/scifact/scifact.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update biodatasets/scifact/scifact.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update biodatasets/scifact/scifact.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* formatting

delete __main__ method

Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update progress bars

* Closes #96 (#346)

* Add dataloader for SETH-corpus

* exclude wrong annotation events

* add support task RELATION_EXTRACTION

Co-authored-by: ChienVM <chien_vm@detomo.co.jp>

* Osiris (#334)

* Changed name of dataset to make it consistent with other datasets, changed copyright and deleted TODO

* Added loader script for OSIRIS

* Changed website, added v_norm, changed offsets, text and normalized for variants and genes

* Delete multi_xscience.py

* Changed db_name for variants

* Fixed a minor bug

* Add Chemdner dataset loader (#326)

* Add the TEXT_CLASSIFICATION task

* Fix empty entities

* Apply entity removal only on the bigbio_kb schema

* fix entity names for discontiguous entities (ref PR #373)

* formatting

* initial commit, data loader ready

* fix issues

* add normalized fields

* fix entity names for discontiguous entities (ref PR #373)

* formatting

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
Co-authored-by: Labrak Yanis <yanis.labrak@alumni.univ-avignon.fr>
Co-authored-by: Nicholas Broad <nbroad94@gmail.com>
Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
Co-authored-by: Leon Weber <leonweber@users.noreply.github.com>
Co-authored-by: Minh Chien Vu <31467068+vumichien@users.noreply.github.com>
Co-authored-by: ChienVM <chien_vm@detomo.co.jp>
Co-authored-by: Marianna Nezhurina <43296932+marianna13@users.noreply.github.com>