
Closes #209 + PubMed Abstract Loader #418

Merged
sg-wbi merged 9 commits into bigscience-workshop:master from HallerPatrick:bioasq_task_c_2017
Apr 19, 2022

Conversation

@HallerPatrick
Contributor

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Unit test output:

INFO:__main__:args: Namespace(path='biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py', schema='TEXT', subset_id=None, data_dir='data/', use_auth_token=None)
INFO:__main__:self.PATH: biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py
INFO:__main__:self.SUBSET_ID: bioasq_task_c_2017
INFO:__main__:self.SCHEMA: TEXT
INFO:__main__:self.DATA_DIR: data/
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.TEXT_CLASSIFICATION: 'TXTCLASS'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'TEXT'}
INFO:__main__:schemas_to_check: ['TEXT']
INFO:__main__:Checking load_dataset with config name bioasq_task_c_2017_source
WARNING:datasets.builder:Using custom data configuration bioasq_task_c_2017_source-data_dir=data%2F
Downloading and preparing dataset bio_asq_task_c2017/bioasq_task_c_2017_source to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_source-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17...
Dataset bio_asq_task_c2017 downloaded and prepared to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_source-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.54it/s]
INFO:__main__:Checking load_dataset with config name bioasq_task_c_2017_bigbio_text
WARNING:datasets.builder:Using custom data configuration bioasq_task_c_2017_bigbio_text-data_dir=data%2F
Downloading and preparing dataset bio_asq_task_c2017/bioasq_task_c_2017_bigbio_text to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_bigbio_text-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17...
Dataset bio_asq_task_c2017 downloaded and prepared to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_bigbio_text-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.12it/s]
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 22610 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 62952
document_id: 62952
text: 62952
labels: 128329

test
==========
id: 22610
document_id: 22610
text: 22610
labels: 47266

.
----------------------------------------------------------------------
Ran 1 test in 310.715s

OK

As discussed with @jason-fries, this PR also includes the data downloader for PubMed articles, taken from Jason's repo. I wrote a short explanation into the function, which can be removed :)

@sg-wbi sg-wbi self-assigned this Apr 13, 2022
@sg-wbi sg-wbi added the local dataset label (dataset requires local files to run) Apr 13, 2022
Collaborator

@sg-wbi sg-wbi left a comment


Great contribution @HallerPatrick! Sorry for the delay...
Everything LGTM, however I do have a question. Why do we need a PubMed Abstract Loader at all? I see you had a conversation with @jason-fries, however, if the dataset is local and the folks at BioASQ kindly provide a link to download all the articles I do not see the use of the fetcher. Or am I misunderstanding something?

@HallerPatrick
Contributor Author

> Great contribution @HallerPatrick! Sorry for the delay...
> Everything LGTM, however I do have a question. Why do we need a PubMed Abstract Loader at all? I see you had a conversation with @jason-fries, however, if the dataset is local and the folks at BioASQ kindly provide a link to download all the articles I do not see the use of the fetcher. Or am I misunderstanding something?

@sg-wbi Hey! You are correct that the abstracts are part of the local dataset. @jason-fries's idea was more or less to check whether a general PubMed abstract data loader would work for this project, and to test it on this dataset so that it can be used in similar cases in the future.
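As a rough illustration of what such a general loader could build on (this is not the code in this PR): NCBI's documented E-utilities `efetch` endpoint serves PubMed records by PMID. The helper name and structure below are hypothetical; a real fetcher would also batch requests and respect NCBI's rate limits.

```python
from urllib.parse import urlencode
# To actually perform the request: from urllib.request import urlopen

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(pmids, retmode="xml"):
    """Build an NCBI E-utilities efetch URL for a batch of PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": retmode,
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"

url = build_efetch_url([28546095, 28546094])
# A real loader would then fetch this URL and parse the returned XML.
```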

@sg-wbi
Collaborator

sg-wbi commented Apr 14, 2022

I see! Thank you for the clarification! A general utility to fetch data from PubMed/PMC would indeed be very valuable. However, my suggestion would be to open a separate PR for it. Here you could simply add a note to the script documenting that the user also needs to download the articles from the website.

A second thing I noticed is that the text field is the raw, unparsed XML string. While this is fine for the source schema (the XML contains a lot of metadata which might be useful for the task), I am not so sure about the bigbio schema. As far as I know, the text classification examples I have seen so far provide the pure "text" (i.e. what humans would read), so my guess is that we want that here too. Would it be much trouble to extract the "text" (i.e. passages) from the XML?

@HallerPatrick
Contributor Author

> I see! Thank you for the clarification! A general utility to fetch data from PubMed/PMC would indeed be very valuable. However, my suggestion would be to open a separate PR for it. Here you could simply add a note to the script documenting that the user also needs to download the articles from the website.
>
> A second thing I noticed is that the text field is the raw, unparsed XML string. While this is fine for the source schema (the XML contains a lot of metadata which might be useful for the task), I am not so sure about the bigbio schema. As far as I know, the text classification examples I have seen so far provide the pure "text" (i.e. what humans would read), so my guess is that we want that here too. Would it be much trouble to extract the "text" (i.e. passages) from the XML?

I can do both. No problem!

@HallerPatrick
Contributor Author

@sg-wbi The XML does not contain the text in raw form, but:

      <sec>
        <title>2.3. Outcome measures</title>
        <p>Measures of meditation compliance, smoking, and stress were taken at each meeting, including one day, eight days and six weeks post quit. Compliance with meditation was tested via 7-day meditation calendars provided weekly. Smoking abstinence was tested via 7-day smoking calendars and verified via carbon monoxide breath test (abstinence defined as a carbon monoxide level under10 ppm) [<xref ref-type="bibr" rid="B22">22</xref>,<xref ref-type="bibr" rid="B23">23</xref>]. Questionnaires used to test changes in reported stress and affective distress included respectively, the Perceived Stress Scale (PSS) and the Symptoms Check List (SCL-90-R). The Perceived Stress Scale is a questionnaire designed to provide an assessment of symptoms of stress over the previous week. The PSS has robust reliability and validity [<xref ref-type="bibr" rid="B34">34</xref>] and has been used in multiple studies to measure the effect of mindfulness training on stress [<xref ref-type="bibr" rid="B18">18</xref>-<xref ref-type="bibr" rid="B21">21</xref>]. SCL-90-R is a 90 item questionnaire designed to test self-reported affective distress associated with nine categories: Somatization, Obsessive Compulsive Disorder, Interpersonal Sensitivity, Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid Ideation and Psychoticism. The SCL-90-R has been used in numerous studies on mindfulness [<xref ref-type="bibr" rid="B11">11</xref>,<xref ref-type="bibr" rid="B13">13</xref>,<xref ref-type="bibr" rid="B19">19</xref>], and reliability and validity has been tested in multiple populations [<xref ref-type="bibr" rid="B24">24</xref>-<xref ref-type="bibr" rid="B31">31</xref>].</p>
        <p>.</p>
      </sec>

It's XML-annotated. Should I just extract the raw text from each tag?

@sg-wbi
Collaborator

sg-wbi commented Apr 14, 2022

Yeah, usually you do that when parsing PMC articles. Ideally you should keep the text inside <title> and <p>.
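For illustration, extracting the text of `<title>` and `<p>` elements from JATS-style markup like the snippet above can be sketched with Python's stdlib `xml.etree.ElementTree`; the function name is hypothetical and the actual loader may differ.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<sec>
  <title>2.3. Outcome measures</title>
  <p>Smoking abstinence was verified via carbon monoxide breath test
     [<xref ref-type="bibr" rid="B22">22</xref>].</p>
</sec>
"""

def section_text(xml_string):
    """Concatenate the text of <title> and <p> elements, dropping markup
    such as <xref> but keeping the text it wraps."""
    root = ET.fromstring(xml_string)
    parts = []
    for elem in root.iter():
        if elem.tag in ("title", "p"):
            # itertext() yields text from the element and all descendants,
            # so inline tags like <xref> contribute their text content.
            parts.append(" ".join("".join(elem.itertext()).split()))
    return "\n".join(parts)

out = section_text(SAMPLE)
```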

@HallerPatrick
Contributor Author

@sg-wbi Alright, I finally got it to work. I now parse the XML and use only the raw text (no XML anymore) from the body section, which contains the actual text of the article. I had to fiddle with namespaces and "bad" XML files, but it now runs through cleanly. The tests look exactly as before. I also removed the abstract data loader and will open a second PR :)
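The namespace fiddling mentioned here typically arises because PMC/JATS files may declare a default namespace, which makes ElementTree report tags as `{uri}body` rather than `body`. A minimal sketch of matching elements by local name regardless of namespace (function names hypothetical; truly malformed files would additionally need a recovering parser such as lxml with `recover=True`):

```python
import xml.etree.ElementTree as ET

NAMESPACED = """
<article xmlns="https://jats.nlm.nih.gov">
  <body>
    <sec>
      <title>Methods</title>
      <p>See the <ext-link>protocol</ext-link>.</p>
    </sec>
  </body>
</article>
"""

def local_name(tag):
    """Strip a '{namespace-uri}' prefix, if present, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def body_text(xml_string):
    """Extract plain text of <title>/<p> inside <body>, ignoring namespaces."""
    root = ET.fromstring(xml_string)
    body = next(e for e in root.iter() if local_name(e.tag) == "body")
    parts = []
    for elem in body.iter():
        if local_name(elem.tag) in ("title", "p"):
            parts.append(" ".join("".join(elem.itertext()).split()))
    return "\n".join(parts)

out = body_text(NAMESPACED)
```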

Collaborator

@sg-wbi sg-wbi left a comment


@HallerPatrick thank you very much for adding the xml parsing: it's very valuable! Great job!

@sg-wbi sg-wbi merged commit f1007c9 into bigscience-workshop:master Apr 19, 2022
@HallerPatrick HallerPatrick deleted the bioasq_task_c_2017 branch April 19, 2022 19:08