
Add Chemdner dataset loader #326

Merged (3 commits, Apr 7, 2022)

Conversation

@qanastek (Contributor) commented Apr 4, 2022

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

  • Name: name of the dataset
  • Description: short description of the dataset (or link to social media or blog post)
  • Paper: link to the dataset paper if available
  • Data: link to the online home of the dataset

Checklist

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

@qanastek (Contributor, Author) commented Apr 4, 2022

Issue #28

4 review threads on biodatasets/chemdner/chemdner.py (outdated, resolved)
@qanastek (Contributor, Author) commented Apr 4, 2022

Commit 526e7d9

@sg-wbi (Collaborator) commented Apr 5, 2022

Thank you for implementing the changes! Everything looks fine. You made me realize that we could use the "MeSH_Indexing_Chemical" annotations as labels for a text_classification schema. I asked internally about this and will get back to you asap.

@qanastek (Contributor, Author) commented Apr 5, 2022

Commit 6251db8

@sg-wbi sg-wbi self-assigned this Apr 5, 2022
@galtay (Collaborator) commented Apr 5, 2022

@sg-wbi thanks for self-assigning this one! I'll jump to some of the other PRs.

@galtay (Collaborator) commented Apr 6, 2022

Just a quick note: we have some extra files in this PR besides the chemdner loader.

@qanastek (Contributor, Author) commented Apr 6, 2022

Fixed! I removed the other files in commit 7796654.

@sg-wbi (Collaborator) commented Apr 6, 2022

Thanks @galtay! There is an update to the schema for text_classification which allows multiple labels. See the "#general" channel in Discord.

@qanastek Could you please add this view to the dataset? You can use PR #348 as an example.

Then we can close this!

@qanastek (Contributor, Author) commented Apr 6, 2022

During the tests of the Text Classification task, the datasets.load_dataset() function crashed. Do you have any ideas? I am using your code base.

The error:

======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/d/Projects/LIA/huggingface/BigScience BioMedical/pull_resquests/Final_ChemdNER/biomedical/tests/test_bigbio.py", line 164, in setUp
    self.datasets_bigbio[schema] = datasets.load_dataset(
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/builder.py", line 1080, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/features/features.py", line 1287, in encode_example
    return encode_nested_example(self, example)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/features/features.py", line 971, in encode_nested_example
    return {
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/features/features.py", line 971, in <dictcomp>
    return {
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 152, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/home/home/miniconda3/envs/bigscience-biomedical/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 152, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'label'

----------------------------------------------------------------------
Ran 1 test in 15.950s

FAILED (errors=1)

My code:

    def _get_textcls_example(self, d: bioc.BioCDocument) -> Dict:
        example = {"document_id": d.id, "text": [], "labels": []}
        for p in d.passages:
            example["text"].append(p.text)
            for a in p.annotations:
                if a.infons.get("type") == "MeSH_Indexing_Chemical":
                    example["labels"].append(a.infons.get("identifier"))
        example["text"] = " ".join(example["text"])
        return example

The datasets.load_dataset function is looking for a label field, but we are giving it a labels one.
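The mismatch can be reproduced without the datasets library: the zip_dict helper shown in the traceback iterates the keys of the first dict and indexes every dict with them, so an example that lacks a key declared in the schema raises KeyError. A minimal stdlib-only sketch (the schema and example dicts here are illustrative stand-ins, not the real feature definitions):

```python
# Mimic of the zip_dict helper from the traceback
# (datasets/utils/py_utils.py): iterate keys of the first
# dict and index all dicts with each of those keys.
def zip_dict(*dicts):
    for key in dicts[0]:
        yield key, tuple(d[key] for d in dicts)

# Hypothetical old schema expecting "label" vs. an example providing "labels"
schema = {"document_id": "string", "text": "string", "label": "list"}
example = {"document_id": "d1", "text": "an abstract", "labels": ["MeSH:D000001"]}

try:
    dict(zip_dict(schema, example))
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'label'
```

This is exactly the shape of the failure above: an outdated feature schema on one side, a renamed field on the other.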

Did the tests pass with the nlmchem loading script?

@sg-wbi (Collaborator) commented Apr 6, 2022

The test for nlmchem passes. Are you working on the latest version of the repo?

git pull upstream master

@qanastek (Contributor, Author) commented Apr 6, 2022

Thanks, it works.

@qanastek (Contributor, Author) commented Apr 6, 2022

Commit 540901f

@sg-wbi (Collaborator) commented Apr 6, 2022

Thank you very much for implementing the additional view! One last thing, and then it should be good to go for real this time. If I run the tests I get these warnings:

WARNING:__main__:Example:11379 - entity:11371  text:`None` != text_by_offset:``
Example:17134 - entity:17129  text:`None` != text_by_offset:``
Example:21504 - entity:21503  text:`None` != text_by_offset:``
Example:23227 - entity:23213  text:`None` != text_by_offset:``

It seems that somehow some entities without text get into the dataset. Could you please check your script again and fix this?
Thank you!

@qanastek (Contributor, Author) commented Apr 6, 2022

Nice job! I hadn't seen them since the output is very compact.

It was due to None values present in the data:

  • {'id': '21503', 'type': 'Chemical', 'text': [None], 'offsets': [[1607, 1607]], 'normalized': []}
  • {'id': '11371', 'type': 'Chemical', 'text': [None], 'offsets': [[179, 179]], 'normalized': []}
  • {'id': '17129', 'type': 'Chemical', 'text': [None], 'offsets': [[55, 55]], 'normalized': []}

I fixed it using a condition:

if a.text is None or a.text == "":
    continue

I hope it's ok!

Commit ac8fb5e
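Applied to the serialized entities listed above, that condition drops all three offending records. A quick stdlib check (the first three annotation dicts are copied from the list above; the fourth, with id "1", is a made-up non-empty entity for contrast):

```python
annotations = [
    {"id": "21503", "type": "Chemical", "text": [None], "offsets": [[1607, 1607]], "normalized": []},
    {"id": "11371", "type": "Chemical", "text": [None], "offsets": [[179, 179]], "normalized": []},
    {"id": "17129", "type": "Chemical", "text": [None], "offsets": [[55, 55]], "normalized": []},
    {"id": "1", "type": "Chemical", "text": ["aspirin"], "offsets": [[0, 7]], "normalized": []},
]

# Keep only entities whose text list is non-empty and whose first
# span is neither None nor "" (same effect as the condition above,
# applied to the serialized form where `text` is a list of spans).
kept = [a for a in annotations if a["text"] and a["text"][0]]
print([a["id"] for a in kept])  # ['1']
```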

@sg-wbi (Collaborator) commented Apr 7, 2022

@qanastek

@qanastek (Contributor, Author) commented Apr 7, 2022

Like this ?

if self.config.schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical":
    continue

if (a.text is None or a.text == "") and self.config.schema == "bigbio_kb":
    continue

Or like this ?

if (self.config.schema == "bigbio_kb" and a_type == "MeSH_Indexing_Chemical") or ((a.text is None or a.text == "") and self.config.schema == "bigbio_kb"):
    continue
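For readability, both checks could also be factored into a small predicate; a hedged sketch, where should_skip and its arguments are hypothetical stand-ins for the config schema, annotation type, and annotation text (not code from this PR):

```python
def should_skip(schema: str, a_type: str, text) -> bool:
    """Skip MeSH indexing annotations and empty-text entities,
    but only when building the bigbio_kb view."""
    if schema != "bigbio_kb":
        return False
    return a_type == "MeSH_Indexing_Chemical" or text is None or text == ""

print(should_skip("bigbio_kb", "MeSH_Indexing_Chemical", "aspirin"))  # True
print(should_skip("bigbio_kb", "Chemical", None))                     # True
print(should_skip("source", "Chemical", None))                        # False
```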

@sg-wbi (Collaborator) commented Apr 7, 2022

The first one is easier to read

@qanastek (Contributor, Author) commented Apr 7, 2022

Commit 9cf9100

@sg-wbi (Collaborator) commented Apr 7, 2022

Great! Thank you for bearing with all the changes, especially with the schema change. Thanks again for your contribution!

@sg-wbi sg-wbi closed this Apr 7, 2022
@sg-wbi sg-wbi reopened this Apr 7, 2022
@sg-wbi sg-wbi merged commit 3f280a4 into bigscience-workshop:master Apr 7, 2022
@qanastek (Contributor, Author) commented Apr 7, 2022

@sg-wbi Can you switch the status of issue #28 to Done to keep track of progress?

@sg-wbi (Collaborator) commented Apr 7, 2022

Sorry! I got confused and closed it! How do I change the status of #28? Can I simply close it?

@qanastek (Contributor, Author) commented Apr 7, 2022

I think so, like Gabriel did on issue #53.

@sg-wbi (Collaborator) commented Apr 7, 2022

Now?

@qanastek (Contributor, Author) commented Apr 7, 2022

Perfect! 🎉

whoisjones pushed a commit to whoisjones/biomedical that referenced this pull request Apr 8, 2022
* Add the TEXT_CLASSIFICATION task

* Fix empty entities

* Apply entity removal only on the bigbio_kb schema
hakunanatasha pushed a commit that referenced this pull request Apr 9, 2022
* initial commit, data loader ready

* fix issues

* add normalized fields

* remove duplicate datasets, pin aiohttp (#354)

remove duplicate datasets, pin aiohttp

* Medhop (#322)

* Reset branch

* Edit the source schemas to be the same as the raw data

* Closes #237 (#344)

* implement scifact_bigbio_entailment_rationale

* fix source dataset builds

remove comments

* create _bigbio_labelprediction_generate_examples

erase todo
put config name in beginning rather than end
change check in _generate_examples to look for source only

* improve claims description

* correct rationale description

* import BigBioConfig

fix subset_id and name convention

* remove name

* Update biodatasets/scifact/scifact.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update biodatasets/scifact/scifact.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update biodatasets/scifact/scifact.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* formatting

delete __main__ method

Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update progress bars

* Closes #96 (#346)

* Add dataloader for SETH-corpus

* exclude wrong annotation events

* add support task RELATION_EXTRACTION

Co-authored-by: ChienVM <chien_vm@detomo.co.jp>

* Osiris (#334)

* Changed name of dataset to make it consistent with other datasets, changed copyright and deleted TODO

* Added loader script for OSIRIS

* Changed website, added v_norm, changed offsets, text and normalized for variants and genes

* Delete multi_xscience.py

* Changed db_name for variants

* Fixed a minor bug

* Add Chemdner dataset loader (#326)

* Add the TEXT_CLASSIFICATION task

* Fix empty entities

* Apply entity removal only on the bigbio_kb schema

* fix entity names for discontiguous entities (ref PR #373)

* formatting

* initial commit, data loader ready

* fix issues

* add normalized fields

* fix entity names for discontiguous entities (ref PR #373)

* formatting

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
Co-authored-by: Labrak Yanis <yanis.labrak@alumni.univ-avignon.fr>
Co-authored-by: Nicholas Broad <nbroad94@gmail.com>
Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
Co-authored-by: Leon Weber <leonweber@users.noreply.github.com>
Co-authored-by: Minh Chien Vu <31467068+vumichien@users.noreply.github.com>
Co-authored-by: ChienVM <chien_vm@detomo.co.jp>
Co-authored-by: Marianna Nezhurina <43296932+marianna13@users.noreply.github.com>