Closes #61 [DEPRECATED] #422

Closed
MFreidank wants to merge 66 commits into bigscience-workshop:master from MFreidank:spl_adr_200db
Conversation

@MFreidank
Contributor

@MFreidank MFreidank commented Apr 11, 2022

Closes issue #61

Relevant details:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
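As a rough illustration of the checklist above, the required names could be verified before running the real test suite. This is only a sketch: `missing_checklist_items`, `ToyBuilder`, and `toy_namespace` are hypothetical names made up for this example and are not part of the repository or of tests.test_bigbio.

```python
# Hypothetical helper, NOT part of the bigbio test suite: it only checks
# that the names listed in the PR checklist are present.
REQUIRED_VARIABLES = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_BIGBIO_VERSION",
]
REQUIRED_METHODS = ["_info", "_split_generators", "_generate_examples"]


def missing_checklist_items(module_namespace, builder_class):
    """Return checklist names that the dataloader script does not define."""
    missing = [name for name in REQUIRED_VARIABLES if name not in module_namespace]
    missing += [
        name for name in REQUIRED_METHODS
        if not callable(getattr(builder_class, name, None))
    ]
    return missing


# Toy stand-in for a dataloader script: _LICENSE and _generate_examples
# are deliberately left out so the helper has something to report.
class ToyBuilder:
    def _info(self):
        pass

    def _split_generators(self, dl_manager):
        pass


toy_namespace = {name: "placeholder" for name in REQUIRED_VARIABLES if name != "_LICENSE"}
print(missing_checklist_items(toy_namespace, ToyBuilder))
```

For a real script, the namespace could be obtained with `vars()` on the imported module, which is left out here to keep the sketch free of external dependencies.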

Please let me know in case any changes are required. Looking forward to your comments!

SPL ADR 200 DB is a dataset for adverse event
mentions annotated at entity level in Structured Product Labels
of 200 FDA-approved drugs.
Annotations were done as part of a partnership between
the United States Food and Drug Administration (FDA)
and the National Library of Medicine.

These data were used for the adverse event challenge
of TAC (Text Analysis Conference) 2017, see also:
https://bionlp.nlm.nih.gov/tac2017adversereactions/
mart1nro and others added 4 commits April 11, 2022 18:19
* implemented chebi corpus

* removed __main__

* cleanup

* implementation with utils.parsing.parse_brat_file
* Implemented ehr_rel pairs similarity

* fixed document id

* fixed source config

* updated doc_id and formatting

* added subset ids and fixed homepage
* Initial meqsum commit

* Add short description

Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>

Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>
@WojciechKusa WojciechKusa self-assigned this Apr 11, 2022
HallerPatrick and others added 10 commits April 12, 2022 09:41
* _split_generators includes filepath

* add pubmed_id, fix authors

data_dir as dict
pass ner_filepath, corpus_filepath

use _corpus_to_dict function

* modified schema

need to fix entity offsets
fix entities to match schema

* bioasq_2021_mesinesp

* delete comments

add citation, description

change year to string because some values are 'Not Available'
check for tracks to get top level folder

* change subset_id

fix formatting

* add all/only_articles configs for bigbio

* update subset ids

Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
* add pharmaconer data loader

* update pharmaconer.py to include subtrack 2

* sort lists of files
* adding quaero dataset loading script

* adding _DATASET_NAME to quaero.py

* Changed the dataloading script according to the reviews :)

* split between 2 subsets (EMEA/ MEDLINE)

* remove empty normalized

Co-authored-by: sg-wbi <87170658+sg-wbi@users.noreply.github.com>
* Add support for annotator notes to brat parser

* Make note parsing optional
@WojciechKusa WojciechKusa linked an issue Apr 13, 2022 that may be closed by this pull request
hakunanatasha and others added 6 commits April 13, 2022 15:02
* feat(scielo): add scielo dataset loader

* refactor(scielo): refactor Scielo loader

* docs(scielo): fill documentation for Scielo loader

* feat(scielo): add scielo dataset loader

* refactor(scielo): refactor Scielo loader

* docs(scielo): fill documentation for Scielo loader

* fix(scielo): update scielo dataset config

* Update biodatasets/scielo/scielo.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Update biodatasets/scielo/scielo.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* make everything underscore

there were 1 or 2 lingering inconsistencies so just converted everything to underscore

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* Add pubtator central dataloader

* Remove coreference as a task

* Handle mentions with no type

* Add dataset loader for Chia (#282) (#323)

* Add initial version of Chia data set loader

* Revert changes in utils.parsing and move them to separate implementation in chia

* Make fixing of offsets depending on used schema

Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de>

* doc: updated etiquette rules

* Fix existence check of required keys and make all tests subtests (#336)

* Fix existence check of required keys and make all tests subtests

* fix: aiohttp not provided in the requirements file

* fix: missing datasets req omg

* Warn when there are KB features not covered by any supported task

* Add statistics and checks for normalized/disambiguation

* Remove extra setUp

* fix typo
* make task_to_features global constant

Co-authored-by: Natasha Seelam <nseelam1@gmail.com>

* Add ability to load from pubtator API

* Passed is_filepath to wrong function

* Fix several bugs in pubtator to bigbio_kb conversion

* Add docstring to _parse_pubtator_file

* Add license for PubTator

* Final draft of pubtator central data loader

* Fix typo in license

* Update biodatasets/pubtator_central/pubtator_central.py

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>

* Correct parsing of entities in bigbio schema

* Load sample or full depending on subset_id

Co-authored-by: Mario Sänger <40803339+mariosaenger@users.noreply.github.com>
Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de>
Co-authored-by: Natasha Seelam <nseelam1@gmail.com>
Co-authored-by: Leon Weber <leonweber@users.noreply.github.com>
Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* added bio_simlex

* style fixes

* made changes as per comments

* fix: remove the main call

* fix: docstring to reflect source schema

* fix: source schema represented as float

* fix: source schema switched to float

Co-authored-by: Natasha Seelam <nseelam1@gmail.com>
* adding bio_sim_verb

* Delete bio_sim_verb.py.lock

* float for source, string for bigbio

float for source, string for bigbio

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
galtay and others added 5 commits April 17, 2022 16:15
* WIP scai_disease.py.

* Fix file download.

* Bug fixes.

* Apply suggestions from code review

* Update biodatasets/scai_disease/scai_disease.py

* drop stale comments

drop stale comments

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* umnsrs dataset loader added

* casting src labels as floats
Collaborator

@WojciechKusa WojciechKusa left a comment

Hi @MFreidank. The code looks great, thank you for your contribution! 🎉

I have noticed that the ADR dataset also contains an unannotated test data split.

If its schema is similar to that of the training set, it would be nice to add it to the loader as a split called "unannotated". What do you think?

@MFreidank
Contributor Author

MFreidank commented Apr 19, 2022

Hi @WojciechKusa

Thank you for your review and encouraging feedback.

I also noticed the additional split, but when I tried implementing it, I got an error at unit-test time for the subset spl_adr_200db_unannotated_bigbio_kb, as the test requires at least one entity (and since the data is unlabeled, I don't have any entity labels).

At a high level I thought of these options:

  1. Have two source subsets, one for the train split and one for the unannotated split, and one single bigbio_kb schema that uses only the train split data.
  2. Implement a source and bigbio subset for train and unannotated respectively and accept that the test for bigbio_unannotated will fail.
  3. Use a different bigbio schema than bigbio_kb for the unannotated split (could you suggest one? I had a quick look but nothing stood out immediately).
  4. Implement a single train subset for both (current implementation).
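To make the second option listed above more concrete, here is a minimal sketch of how the resulting subsets might be laid out. `SubsetConfig` is a hypothetical stand-in for BigBioConfig (the real class may take different fields), and the `<dataset>_<split>_<schema>` naming pattern is an assumption inferred from the subset name spl_adr_200db_unannotated_bigbio_kb mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetConfig:
    """Hypothetical stand-in for BigBioConfig; fields are assumptions."""
    name: str
    subset_id: str
    schema: str


DATASETNAME = "spl_adr_200db"
SPLITS = ["train", "unannotated"]
SCHEMAS = ["source", "bigbio_kb"]


def build_configs():
    """One config per (split, schema) pair, i.e. four configs in total."""
    configs = []
    for split in SPLITS:
        subset_id = f"{DATASETNAME}_{split}"
        for schema in SCHEMAS:
            configs.append(
                SubsetConfig(
                    name=f"{subset_id}_{schema}",
                    subset_id=subset_id,
                    schema=schema,
                )
            )
    return configs


for config in build_configs():
    print(config.name)
```

Under this layout the schemas and the example generator can stay shared; only the data files selected per subset would differ.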

Please let me know if there's a better solution or which option you'd like me to implement and I'll make the necessary changes right away.

@WojciechKusa
Collaborator

Thanks for the quick reply @MFreidank.

I would vote for option (2) and accept a failing unit test for the unannotated split, as we might relax this requirement after the hackathon. Alternatively, I would implement the unannotated split only for the source schema (option 1).

However, it would be great to get a second opinion @galtay @hakunanatasha?

leonweber and others added 10 commits April 19, 2022 13:40
* Add GENETAG dataloader

* Add POS tags and tokenized text
* Adding some information about citing and dataset

* Remove template file

* Minor work

* Working dataloader

* Cleanup

* Parsing PubMed XML and only include raw text

* Formatting

* local dataset: subclassing BigBioConfig

Co-authored-by: sg-wbi <87170658+sg-wbi@users.noreply.github.com>
* scai_chemical.py.

* Fix dataset name.

* Apply suggestions from code review

Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>

Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>
* Initial biorelex

* Initial biorelex

* Fixes

- Change db_name entrez to NCBI Gene
- Remove unused code

Co-authored-by: nomisto <you@example.com>
* Adding some information for dataset

* Attempts of retrieving all informations for a abstract

* Minor Work

* Finish source schema

* Finished data loader for kb

* style: fix latex

Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>

* fix: style change url per wk suggestion

* fix: line 221 for backwards compatibility with py3.6

* fix: line 266 backwards compatibility py2.6

Co-authored-by: HallerPatrick
Co-authored-by: Natasha Seelam <nseelam1@gmail.com>
Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>
* Fix dataset commment

* Fix entity references

Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de>
* check for multi-lable `type` and `db_id`

* fix small typos

fix small typos

Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
@WojciechKusa
Collaborator

Hi @MFreidank,
We have a fix coming to the unit-tests that will allow for bypassing specific arguments. This will help with the case of empty entities.

So we agreed that the best way would be to go with option (2) and, for the moment, let bigbio_unannotated fail:

Implement a source and bigbio subset for train and unannotated respectively and accept that the test for bigbio_unannotated will fail.

Please let us know if you have time to implement this :)

@MFreidank
Contributor Author

MFreidank commented Apr 20, 2022

Hi @WojciechKusa
Thank you for this update!

I'll get to working on the changes to implement option (2) right away.

Update: Changes are now implemented (this turned out to be a pretty small change, as schemas and generator code are unaffected).

I've run the unit tests. This one passes for me:

python -m tests.test_bigbio --subset_id "spl_adr_200db_train" biodatasets/spl_adr_200db/spl_adr_200db.py

This one fails as expected, since the unannotated data does not contain any entities:

python -m tests.test_bigbio --subset_id "spl_adr_200db_unannotated" biodatasets/spl_adr_200db/spl_adr_200db.py
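The difference between the two runs can be illustrated with a toy check. This is only a sketch of the "at least one entity" requirement described earlier in the thread; `check_has_entities` is a hypothetical function, the example records are made-up placeholders, and the actual logic in tests.test_bigbio may differ.

```python
def count_entities(examples):
    """Total number of entity annotations across all examples."""
    return sum(len(example.get("entities", [])) for example in examples)


def check_has_entities(examples):
    """Hypothetical check: a knowledge-base style subset needs >= 1 entity."""
    if count_entities(examples) == 0:
        raise AssertionError("no entities found in this subset")


# Placeholder records, not real dataset content.
train_examples = [{"id": "0", "entities": [{"id": "0_T1", "type": "placeholder_type"}]}]
unannotated_examples = [{"id": "0", "entities": []}]

check_has_entities(train_examples)  # passes silently

try:
    check_has_entities(unannotated_examples)
except AssertionError as error:
    print(f"unannotated subset fails as expected: {error}")
```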

Please let me know if the changes look good to you.

Hmm, I may have rushed and it seems I accidentally merged in some commits that weren't intended. Should I close this PR and reopen with a single clean commit?

MFreidank added 3 commits April 20, 2022 19:32
SPL ADR 200 DB is a dataset for adverse event
mentions annotated at entity level in Structured Product Labels
of 200 FDA-approved drugs.
Annotations were done as part of a partnership between
the United States Food and Drug Administration (FDA)
and the National Library of Medicine.

These data were used for the adverse event challenge
of TAC (Text Analysis Conference) 2017, see also:
https://bionlp.nlm.nih.gov/tac2017adversereactions/
@WojciechKusa
Collaborator

Thanks @MFreidank, looks good to me, I am happy to merge it!

I agree, let's open a new PR just to make sure that we commit only this dataset.

@MFreidank MFreidank changed the title from "Closes #61" to "Closes #61 [DEPRECATED]" Apr 21, 2022
@MFreidank
Contributor Author

Okay, closing this PR and re-opening a separate one with only this dataset.
Will tag you directly when it is open so we can proceed to merging @WojciechKusa

@MFreidank MFreidank closed this Apr 21, 2022
@MFreidank MFreidank deleted the spl_adr_200db branch April 21, 2022 10:31
@MFreidank MFreidank mentioned this pull request Apr 21, 2022
Development

Successfully merging this pull request may close these issues.

Create a dataset loader for SPL-ADR-200db - Adverse Drug Reactions