Closes #61 [DEPRECATED] #422
MFreidank wants to merge 66 commits into bigscience-workshop:master from MFreidank:spl_adr_200db
SPL ADR 200 DB is a dataset for adverse event mentions annotated at entity level in Structured Product Labels of 200 FDA-approved drugs. Annotations were done as part of a partnership between the United States Food and Drug Administration (FDA) and the National Library of Medicine. These data were used for the adverse event challenge of TAC (Text Analysis Conference) 2017, see also: https://bionlp.nlm.nih.gov/tac2017adversereactions/
* _split_generators includes filepath * add pubmed_id, fix authors data_dir as dict pass ner_filepath, corpus_filepath use _corpus_to_dict function * modified schema need to fix entity offsets fix entities to match schema * bioasq_2021_mesinesp * delete comments add citation, description change year to string because some values are 'Not Available' check for tracks to get top level folder * change subset_id fix formatting * add all/only_articles configs for bigbio * update subset ids Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
* Add support for annotator notes to brat parser * Make note parsing optional
* feat(scielo): add scielo dataset loader * refactor(scielo): refactor Scielo loader * docs(scielo): fill documentation for Scielo loader * feat(scielo): add scielo dataset loader * refactor(scielo): refactor Scielo loader * docs(scielo): fill documentation for Scielo loader * fix(scielo): update scielo dataset config * Update biodatasets/scielo/scielo.py Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com> * Update biodatasets/scielo/scielo.py Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com> * make everything underscore there were 1 or 2 lingering inconsistencies so just converted everything to underscore Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
* Add pubtator central dataloader * Remove coreference as a task * Handle mentions with no type * Add dataset loader for Chia (#282) (#323) * Add initial version of Chia data set loader * Revert changes in utils.parsing and move them to separate implementation in chia * Make fixing of offsets depending on used schema Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de> * doc: updated etiquette rules * Fix existence check of required keys and make all tests subtests (#336) * Fix existence check of required keys and make all tests subtests * fix: aiohttp not provided in the requirements file * fix: missing datasets req omg * Warn when there are KB features not covered by any supported task * Add statistics and checks for normalized/disambiguation * Remove extra setUp * fix typo * make task_to_features global constant Co-authored-by: Natasha Seelam <nseelam1@gmail.com> * Add ability to load from pubtator API * Passed is_filepath to wrong function * Fix several bugs in pubtator to bigbio_kb conversion * Add docstring to _parse_pubtator_file * Add license for PubTator * Final draft of pubtator central data loader * Fix typo in license * Update biodatasets/pubtator_central/pubtator_central.py Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com> * Correct parsing of entities in bigbio schema * Load sample or full depending on subset_id Co-authored-by: Mario Sänger <40803339+mariosaenger@users.noreply.github.com> Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de> Co-authored-by: Natasha Seelam <nseelam1@gmail.com> Co-authored-by: Leon Weber <leonweber@users.noreply.github.com> Co-authored-by: Gabriel Altay <gabriel.altay@gmail.com>
WojciechKusa
left a comment
Hi @MFreidank. The code looks great, thank you for your contribution! 🎉
I have noticed that ADR dataset also contains unannotated test data split.
If the schema is similar to the trainset, it would be nice to add it to the loader as data in a split called "unannotated". What do you think?
Thank you for your review and encouraging feedback. I also noticed the additional split, but when I tried implementing it, I got a unittest-time error for the subset
At a high level, I thought of these options:
Please let me know if there's a better solution or which option you'd like me to implement, and I'll make the necessary changes right away.
Thanks for the quick reply @MFreidank. I would vote for option (2) and allow the unit test to fail for the unannotated split, as we might relax this requirement after the hackathon. Alternatively, I would only implement the unannotated schema for
However, it would be great to get a second opinion, @galtay @hakunanatasha.
* Adding some information about citing and dataset * Remove template file * Minor work * Working dataloader * Cleanup * Parsing PubMed XML and only include raw text * Formatting * local dataset: subclassing BigBioConfig Co-authored-by: sg-wbi <87170658+sg-wbi@users.noreply.github.com>
* Adding some information for dataset * Attempts of retrieving all informations for a abstract * Minor Work * Finish source schema * Finished data loader for kb * style: fix latex Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com> * fix: style change url per wk suggestion * fix: line 221 for backwards compatibility with py3.6 * fix: line 266 backwards compatibility py2.6 Co-authored-by: HallerPatrick Co-authored-by: Natasha Seelam <nseelam1@gmail.com> Co-authored-by: Wojciech Kusa <WojciechKusa@users.noreply.github.com>
…s. (#489) Changes link in task_schemas.md docs.
* Fix dataset commment * Fix entity references Co-authored-by: Mario Sänger <saengema@informatik.hu-berlin.de>
Hi @MFreidank, so we agreed that the best way would be to go with option (2) and, for the moment, let
Please let us know if you have time to implement this :)
Hi @WojciechKusa, I'll get to working on the changes to implement option (2) right away.
Update: Changes are now implemented (this turned out to be a pretty small change, as schemas and generator code are unaffected). I've run the unit tests, and this test passes for me:
python -m tests.test_bigbio --subset_id "spl_adr_200db_train" biodatasets/spl_adr_200db/spl_adr_200db.py
While this one fails as expected, since the unannotated data does not return entities:
Please let me know if the changes look good to you.
Hmm, I may have rushed, and it seems I accidentally merged in some commits that weren't intended. Should I close this PR and reopen with a single clean commit?
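To illustrate why the unannotated subset is expected to fail, here is a minimal sketch of an entity-presence check. It is not the actual tests.test_bigbio code: the function names, the record layout, and the toy records (including the "AdverseReaction" label) are assumptions made purely for illustration.

```python
# Hypothetical stand-in for a "does this split contain entities?" check.
# This is NOT the real tests.test_bigbio implementation.

def count_entities(examples):
    """Return the total number of entity annotations across all examples."""
    return sum(len(example.get("entities", [])) for example in examples)


def check_has_entities(examples):
    """Raise AssertionError when a split carries no entity annotations."""
    assert count_entities(examples) > 0, "no entities found in this split"


# Toy records; ids, texts, and the entity type are illustrative only.
train = [
    {
        "id": "0",
        "text": "Drug X may cause nausea.",
        "entities": [
            {"type": "AdverseReaction", "text": "nausea", "offsets": [17, 23]}
        ],
    }
]
unannotated = [{"id": "1", "text": "Drug Y label text.", "entities": []}]

check_has_entities(train)  # passes: the annotated split has entities

try:
    check_has_entities(unannotated)
except AssertionError as err:
    # The unannotated split has text but no annotations, so the check fails.
    print(f"unannotated split fails as expected: {err}")
```

Under these assumptions, relaxing the requirement for a split would amount to skipping such a check when the split is known to be unannotated.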
… into spl_adr_200db
Thanks @MFreidank, looks good to me, I am happy to merge it! I agree, let's open a new PR just to make sure that we commit only this dataset.
Okay, closing this PR and re-opening a separate one with only this dataset.
Closes issue #61
Relevant details:
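SPL ADR 200 DB is a dataset for adverse event mentions annotated at entity level in Structured Product Labels of 200 FDA-approved drugs.
Annotations were done as part of a partnership between the United States Food and Drug Administration (FDA) and the National Library of Medicine.
These data were used for the adverse event challenge of TAC (Text Analysis Conference) 2017, see also:
https://bionlp.nlm.nih.gov/tac2017adversereactions/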
Checkbox
* biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
* _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
* _info(), _split_generators() and _generate_examples() in dataloader script.
* BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
* datasets.load_dataset function.
* python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
Please let me know in case any changes are required. Looking forward to your comments!
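As a rough illustration of the BUILDER_CONFIGS item in the checklist, the sketch below builds one source-schema and one bigbio-schema config per subset. ConfigSketch is a hypothetical stand-in for BigBioConfig (the real class is defined in the repository and is not reproduced here), and the config names and the "unannotated" subset id are assumptions; only the subset id spl_adr_200db_train and the bigbio_kb schema name appear in this thread.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for BigBioConfig, reduced to three fields."""
    name: str
    schema: str
    subset_id: str


_DATASETNAME = "spl_adr_200db"


def make_configs(subsets=("train", "unannotated")):
    """Build one source and one bigbio_kb config for each subset."""
    configs = []
    for subset in subsets:
        subset_id = f"{_DATASETNAME}_{subset}"
        for schema in ("source", "bigbio_kb"):
            configs.append(
                ConfigSketch(
                    name=f"{subset_id}_{schema}",  # naming pattern is an assumption
                    schema=schema,
                    subset_id=subset_id,
                )
            )
    return configs


# In a real loader this list would be a class attribute of the dataset builder.
BUILDER_CONFIGS = make_configs()

for config in BUILDER_CONFIGS:
    print(config.name, config.schema, config.subset_id)
```

Keeping the subset id separate from the schema is what lets a single test invocation target one subset, as with the --subset_id flag used earlier in this thread.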