Closes #61 [DEPRECATED] by MFreidank · Pull Request #422 · bigscience-workshop/biomedical

MFreidank · 2022-04-11T16:12:58Z

Closes issue #61

Relevant details:

Name: SPL ADR 200 DB
Description: SPL ADR 200 DB is a dataset of adverse event mentions annotated at the entity level in the Structured Product Labels of 200 FDA-approved drugs. The annotations were done as part of a partnership between the United States Food and Drug Administration (FDA) and the National Library of Medicine. These data were used for the adverse event challenge of TAC (Text Analysis Conference) 2017; see also: https://bionlp.nlm.nih.gov/tac2017adversereactions/
Paper: https://www.researchgate.net/publication/322810855_A_dataset_of_200_structured_product_labels_annotated_for_adverse_drug_reactions
Data: https://bionlp.nlm.nih.gov/tac2017adversereactions/

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Please let me know in case any changes are required. Looking forward to your comments!

WojciechKusa

Hi @MFreidank. The code looks great, thank you for your contribution! 🎉

I noticed that the ADR dataset also contains an unannotated test data split.

If its schema is similar to that of the train set, it would be nice to add it to the loader as a split called "unannotated". What do you think?

MFreidank · 2022-04-19T09:30:28Z

Hi @WojciechKusa

Thank you for your review and encouraging feedback.

I also noticed the additional split, but when I tried implementing it, I got an error at unit-test time for the subset spl_adr_200db_unannotated_bigbio_kb, because the test requires at least one entity (and since the data is unlabeled, I don't have any entity labels).
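The failure mode described above can be illustrated with a minimal sketch. This is not the code of the project's test suite: the helper names `count_entities` and `has_entities` are hypothetical, and the example records are made-up toy data whose field names are assumptions loosely modeled on a knowledge-base style schema.

```python
from typing import Any, Dict, List

Example = Dict[str, Any]


def count_entities(examples: List[Example]) -> int:
    """Count entity annotations across all examples (hypothetical helper)."""
    return sum(len(example.get("entities", [])) for example in examples)


def has_entities(examples: List[Example]) -> bool:
    """Mimic a check that requires at least one entity in a subset."""
    return count_entities(examples) > 0


# Toy data for illustration only; not taken from SPL ADR 200 DB.
train_examples: List[Example] = [
    {
        "id": "0",
        "document_id": "doc_a",
        "entities": [
            {
                "id": "0_T1",
                "type": "example_type",
                "text": ["example mention"],
                "offsets": [[0, 15]],
            }
        ],
    }
]

# An unlabeled split has documents but no entity annotations.
unannotated_examples: List[Example] = [
    {"id": "0", "document_id": "doc_b", "entities": []}
]

print("train has entities:", has_entities(train_examples))
print("unannotated has entities:", has_entities(unannotated_examples))
```

Under these assumptions, any check of this shape passes for the annotated split and fails for the unlabeled one, regardless of how correct the loader is.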

At a high level I thought of these options:

(1) Have two source subsets, one for the train split and one for the unannotated split, and a single bigbio_kb schema that uses only the train split data.
(2) Implement a source and a bigbio subset for train and unannotated respectively, and accept that the test for bigbio_unannotated will fail.
(3) Use a different bigbio schema than bigbio_kb for the unannotated split (could you suggest one? I had a quick look but nothing stood out immediately).
(4) Implement a single train subset for both (current implementation).

Please let me know if there's a better solution or which option you'd like me to implement and I'll make the necessary changes right away.
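As a rough sketch of what "a source and bigbio subset for train and unannotated respectively" could look like in terms of configuration names, assuming the naming pattern suggested by spl_adr_200db_unannotated_bigbio_kb: the `SubsetConfig` class below is a hypothetical stand-in written with the standard library only, not the project's BigBioConfig, and the `build_configs` helper is likewise an assumption.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

DATASET_NAME = "spl_adr_200db"
SPLITS = ("train", "unannotated")
SCHEMAS = ("source", "bigbio_kb")


@dataclass(frozen=True)
class SubsetConfig:
    """Hypothetical stand-in for a builder config (not BigBioConfig)."""

    name: str
    subset_id: str
    schema: str


def build_configs() -> List[SubsetConfig]:
    """Create one config per (split, schema) pair: four configs in total."""
    return [
        SubsetConfig(
            name=f"{DATASET_NAME}_{split}_{schema}",
            subset_id=f"{DATASET_NAME}_{split}",
            schema=schema,
        )
        for split, schema in product(SPLITS, SCHEMAS)
    ]


CONFIGS = build_configs()
for config in CONFIGS:
    print(config.name, "->", config.subset_id)
```

With this layout, the train and unannotated subsets can be tested separately by subset id, so a failure in the unannotated bigbio_kb config does not mask the result for the train subset.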

WojciechKusa · 2022-04-19T11:39:58Z

Thanks for the quick reply @MFreidank.

I would vote for option (2) and allow the unit test to fail for the unannotated split, as we might relax this requirement after the hackathon. Alternatively, I would implement the unannotated subset only for the source schema (option 1).

However, it would be great to get a second opinion @galtay @hakunanatasha?

WojciechKusa · 2022-04-20T16:16:39Z

Hi @MFreidank,
We have a fix coming to the unit tests that will allow bypassing specific arguments. This will help with the case of empty entities.

So we agreed that the best way would be to go with option (2) and, for the moment, let bigbio_unannotated fail:

Implement a source and bigbio subset for train and unannotated respectively and accept that the test for bigbio_unannotated will fail.

Please let us know if you have time to implement this :)

MFreidank · 2022-04-20T17:21:15Z

Hi @WojciechKusa
Thank you for this update!

I'll start working on the changes to implement option (2) right away.

Update: The changes are now implemented (this turned out to be a pretty small change, as the schemas and generator code are unaffected).

I've run the unit tests. This one passes for me:

python -m tests.test_bigbio --subset_id "spl_adr_200db_train" biodatasets/spl_adr_200db/spl_adr_200db.py

This one fails, as expected, because the unannotated data does not return entities:

python -m tests.test_bigbio --subset_id "spl_adr_200db_unannotated" biodatasets/spl_adr_200db/spl_adr_200db.py

Please let me know if the changes look good to you.

Hmm, I may have rushed and it seems I accidentally merged in some commits that weren't intended. Should I close this PR and reopen with a single clean commit?

WojciechKusa · 2022-04-21T10:23:17Z

Thanks @MFreidank, looks good to me, I am happy to merge it!

I agree, let's open a new PR just to make sure that we commit only this dataset.

MFreidank · 2022-04-21T10:31:14Z

Okay, I'm closing this PR and opening a separate one with only this dataset.
I will tag you directly once it is open so we can proceed to merging, @WojciechKusa.

MFreidank requested review from galtay, hakunanatasha, jason-fries, leonweber, ruisi-su, sg-wbi and sunnnymskang as code owners April 11, 2022 16:12

mart1nro and others added 4 commits April 11, 2022 18:19

Closes #114 (#414)

bd39946

* implemented chebi corpus * removed __main__ * cleanup * implementation with utils.parsing.parse_brat_file

Closes #235 (#402)

46f7672

* Implemented ehr_rel pairs similarity * fixed document id * fixed source config * updated doc_id and formatting * added subset ids and fixed homepage

Update progress bars

02d0671

Closes #415 (#417)

64c0c4f

* Initial meqsum commit * Add short description Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com> Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>

WojciechKusa self-assigned this Apr 11, 2022

HallerPatrick and others added 10 commits April 12, 2022 09:41

Closes #48 (#407)

7884390

Update progress bars

bc6ea7d

add dataloader for Phoner Dataset (#431)

74fcd19

added loading script for essai dataset (#423)

b906850

Closes #138 (#409)

f7e3b5d

* add pharmaconer data loader * update pharmaconer.py to include subtrack 2 * sort lists of files

Update progress bars

efd5574

Update progress bars

4527815

BRAT: Add support for annotator notes to brat parser (#443)

a67eca0

* Add support for annotator notes to brat parser * Make note parsing optional

WojciechKusa linked an issue Apr 13, 2022 that may be closed by this pull request

Create a dataset loader for SPL-ADR-200db - Adverse Drug Reactions #61

Closed

hakunanatasha and others added 6 commits April 13, 2022 15:02

add deba to admins

7fb0a09

Closes #154 (#406)

e54eb88

* added bio_simlex * style fixes * made changes as per comments * fix: remove the main call * fix: docstring to reflect source schema * fix: source schema represented as float * fix: source schema switched to float Co-authored-by: Natasha Seelam <nseelam1@gmail.com>

Closes #153 (#404)

890c038

* adding bio_sim_verb * Delete bio_sim_verb.py.lock * float for source, string for bigbio float for source, string for bigbio Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

Fix parsing of entities in bigbio schema (#444)

c18e5eb

galtay and others added 5 commits April 17, 2022 16:15

erasing BigBioValues (#485)

1a1fc5b

Closes #51 (#483)

f739779

* WIP scai_disease.py. * Fix file download. * Bug fixes. * Apply suggestions from code review * Update biodatasets/scai_disease/scai_disease.py * drop stale comments drop stale comments Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

addressing PR cmments from PR 460 (#479)

b359c23

follow-up from PR 461 - casting src labels as floats (#482)

b043b35

Closes #155 (#464)

675d2ea

* umnsrs dataset loader added * casting src labels as floats

WojciechKusa reviewed Apr 19, 2022

leonweber and others added 10 commits April 19, 2022 13:40

Update README.md

dc05c38

Closes #212 (#439)

157be2c

* Add GENETAG dataloader * Add POS tags and tokenized text

Closes #52 (#484)

3429394

* scai_chemical.py. * Fix dataset name. * Apply suggestions from code review Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com> Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>

Closes #441 (#445)

4a2482d

* Initial biorelex * Initial biorelex * Fixes - Change db_name entrez to NCBI Gene - Remove unused code Co-authored-by: nomisto <you@example.com>

Changed BioASQ Task B dataset name to match that of other BioASQ task…

d0bfa78

…s. (#489) Changes link in task_schemas.md docs.

Fix dataloader for Chia dataset (#492)

f9917b0

* Fix dataset commment * Fix entity references Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de>

Update progress

3a58aef

check for multi-lable type and db_id (#487)

40cfa87

* check for multi-lable `type` and `db_id` * fix small typos fix small typos Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

MFreidank added 3 commits April 20, 2022 19:32

add unannotated subsets to SPL ADR 200 DB dataset

4d27f00

Merge branch 'spl_adr_200db' of https://github.com/MFreidank/biomedical…

5e7248f

… into spl_adr_200db

MFreidank requested a review from debajyotidatta as a code owner April 20, 2022 17:34

MFreidank changed the title ~~Closes #61~~ Closes #61 [DEPRECATED] Apr 21, 2022

MFreidank closed this Apr 21, 2022

MFreidank deleted the spl_adr_200db branch April 21, 2022 10:31

MFreidank mentioned this pull request Apr 21, 2022

Closes #61 #497

Merged

8 tasks

MFreidank wants to merge 66 commits into bigscience-workshop:master from MFreidank:spl_adr_200db
