
Closes #209 + PubMed Abstract Loader #418

Merged
sg-wbi merged 9 commits into bigscience-workshop:master from HallerPatrick:bioasq_task_c_2017
Apr 19, 2022

Conversation

@HallerPatrick
Contributor

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Unit test output:

INFO:__main__:args: Namespace(path='biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py', schema='TEXT', subset_id=None, data_dir='data/', use_auth_token=None)
INFO:__main__:self.PATH: biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py
INFO:__main__:self.SUBSET_ID: bioasq_task_c_2017
INFO:__main__:self.SCHEMA: TEXT
INFO:__main__:self.DATA_DIR: data/
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.TEXT_CLASSIFICATION: 'TXTCLASS'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'TEXT'}
INFO:__main__:schemas_to_check: ['TEXT']
INFO:__main__:Checking load_dataset with config name bioasq_task_c_2017_source
WARNING:datasets.builder:Using custom data configuration bioasq_task_c_2017_source-data_dir=data%2F
Downloading and preparing dataset bio_asq_task_c2017/bioasq_task_c_2017_source to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_source-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17...
Dataset bio_asq_task_c2017 downloaded and prepared to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_source-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.54it/s]
INFO:__main__:Checking load_dataset with config name bioasq_task_c_2017_bigbio_text
WARNING:datasets.builder:Using custom data configuration bioasq_task_c_2017_bigbio_text-data_dir=data%2F
Downloading and preparing dataset bio_asq_task_c2017/bioasq_task_c_2017_bigbio_text to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_bigbio_text-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17...
Dataset bio_asq_task_c2017 downloaded and prepared to /Users/patrickhaller/.cache/huggingface/datasets/bio_asq_task_c2017/bioasq_task_c_2017_bigbio_text-data_dir=data%2F/1.0.0/5aec061d13981e5e98cba22764c6da6b188d9b0c59471c9f6c721d78fcea6c17. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.12it/s]
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 22610 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 62952
document_id: 62952
text: 62952
labels: 128329

test
==========
id: 22610
document_id: 22610
text: 22610
labels: 47266

.
----------------------------------------------------------------------
Ran 1 test in 310.715s

OK

As discussed with @jason-fries, this PR also includes the data downloader for PubMed articles, taken from Jason's repo. I wrote a short explanation into the function, which can be removed :)

@sg-wbi sg-wbi self-assigned this Apr 13, 2022
@sg-wbi sg-wbi added the local dataset label (dataset requires local files to run) Apr 13, 2022
Collaborator

@sg-wbi sg-wbi left a comment


Great contribution @HallerPatrick! Sorry for the delay...
Everything LGTM, however I do have a question. Why do we need a PubMed Abstract Loader at all? I see you had a conversation with @jason-fries, however, if the dataset is local and the folks at BioASQ kindly provide a link to download all the articles I do not see the use of the fetcher. Or am I misunderstanding something?

@HallerPatrick
Contributor Author

> Great contribution @HallerPatrick! Sorry for the delay...
> Everything LGTM, however I do have a question. Why do we need a PubMed Abstract Loader at all? I see you had a conversation with @jason-fries, however, if the dataset is local and the folks at BioASQ kindly provide a link to download all the articles I do not see the use of the fetcher. Or am I misunderstanding something?

@sg-wbi Hey! You are correct that the abstracts are part of the local dataset. @jason-fries's idea was more or less to check whether a general PubMed abstract data loader would work for this project, and to test it on this dataset so that it can be used in similar cases in the future.
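As a rough illustration of what such a general loader could build on (this is not the code in this PR): NCBI's documented E-utilities `efetch` endpoint serves PubMed records by PMID. The helper name and structure below are hypothetical; a real fetcher would also batch requests and respect NCBI's rate limits.

```python
from urllib.parse import urlencode
# To actually perform the request: from urllib.request import urlopen

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(pmids, retmode="xml"):
    """Build an NCBI E-utilities efetch URL for a batch of PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": retmode,
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"

url = build_efetch_url([28546095, 28546094])
# A real loader would then fetch this URL and parse the returned XML.
```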

@sg-wbi
Collaborator

sg-wbi commented Apr 14, 2022

I see! Thank you for the clarification! A general utility to fetch data from PubMed/PMC would indeed be very valuable. However, my suggestion would be to open a separate PR for it. Here you could simply add a note to the script documenting that the user also needs to download the articles from the website.

A second thing I noticed is that the text field is the raw, unparsed XML string. While this is fine for the source schema (the XML contains a lot of metadata which might be useful for the task), I am not so sure about the bigbio schema. As far as I know, the text classification examples I have seen so far provide the pure "text" (i.e. what humans would read), so my guess is that we want that here too. Would it be much trouble to extract the "text" (i.e. passages) from the XML?

@HallerPatrick
Contributor Author

> I see! Thank you for the clarification! A general utility to fetch data from PubMed/PMC would indeed be very valuable. However, my suggestion would be to open a separate PR for it. Here you could simply add a note to the script documenting that the user also needs to download the articles from the website.
>
> A second thing I noticed is that the text field is the raw, unparsed XML string. While this is fine for the source schema (the XML contains a lot of metadata which might be useful for the task), I am not so sure about the bigbio schema. As far as I know, the text classification examples I have seen so far provide the pure "text" (i.e. what humans would read), so my guess is that we want that here too. Would it be much trouble to extract the "text" (i.e. passages) from the XML?

I can do both. No problem!

@HallerPatrick
Contributor Author

@sg-wbi The XML does not contain the text in raw form, but:

      <sec>
        <title>2.3. Outcome measures</title>
        <p>Measures of meditation compliance, smoking, and stress were taken at each meeting, including one day, eight days and six weeks post quit. Compliance with meditation was tested via 7-day meditation calendars provided weekly. Smoking abstinence was tested via 7-day smoking calendars and verified via carbon monoxide breath test (abstinence defined as a carbon monoxide level under10 ppm) [<xref ref-type="bibr" rid="B22">22</xref>,<xref ref-type="bibr" rid="B23">23</xref>]. Questionnaires used to test changes in reported stress and affective distress included respectively, the Perceived Stress Scale (PSS) and the Symptoms Check List (SCL-90-R). The Perceived Stress Scale is a questionnaire designed to provide an assessment of symptoms of stress over the previous week. The PSS has robust reliability and validity [<xref ref-type="bibr" rid="B34">34</xref>] and has been used in multiple studies to measure the effect of mindfulness training on stress [<xref ref-type="bibr" rid="B18">18</xref>-<xref ref-type="bibr" rid="B21">21</xref>]. SCL-90-R is a 90 item questionnaire designed to test self-reported affective distress associated with nine categories: Somatization, Obsessive Compulsive Disorder, Interpersonal Sensitivity, Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid Ideation and Psychoticism. The SCL-90-R has been used in numerous studies on mindfulness [<xref ref-type="bibr" rid="B11">11</xref>,<xref ref-type="bibr" rid="B13">13</xref>,<xref ref-type="bibr" rid="B19">19</xref>], and reliability and validity has been tested in multiple populations [<xref ref-type="bibr" rid="B24">24</xref>-<xref ref-type="bibr" rid="B31">31</xref>].</p>
        <p>.</p>
      </sec>

It's XML-annotated. Should I just extract the raw text from each tag?

@sg-wbi
Collaborator

sg-wbi commented Apr 14, 2022

Yeah, usually you do that when parsing PMC articles. Ideally you should keep the text inside <title> and <p>.
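For illustration, extracting the text of `<title>` and `<p>` elements from JATS-style markup like the snippet above can be sketched with Python's stdlib `xml.etree.ElementTree`; the function name is hypothetical and the actual loader may differ.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<sec>
  <title>2.3. Outcome measures</title>
  <p>Smoking abstinence was verified via carbon monoxide breath test
     [<xref ref-type="bibr" rid="B22">22</xref>].</p>
</sec>
"""

def section_text(xml_string):
    """Concatenate the text of <title> and <p> elements, dropping markup
    such as <xref> but keeping the text it wraps."""
    root = ET.fromstring(xml_string)
    parts = []
    for elem in root.iter():
        if elem.tag in ("title", "p"):
            # itertext() yields text from the element and all descendants,
            # so inline tags like <xref> contribute their text content.
            parts.append(" ".join("".join(elem.itertext()).split()))
    return "\n".join(parts)

out = section_text(SAMPLE)
```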

@HallerPatrick
Contributor Author

@sg-wbi Alright, I finally got it to work. I now parse the XML and use only the raw text (no XML anymore) from the body section, which contains the actual text of the article. I had to fiddle with namespaces and "bad" XML files, but it now runs through cleanly. The tests look exactly as before. I also removed the abstract data loader and will open a second PR :)
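The namespace fiddling mentioned here typically arises because PMC/JATS files may declare a default namespace, which makes ElementTree report tags as `{uri}body` rather than `body`. A minimal sketch of matching elements by local name regardless of namespace (function names hypothetical; truly malformed files would additionally need a recovering parser such as lxml with `recover=True`):

```python
import xml.etree.ElementTree as ET

NAMESPACED = """
<article xmlns="https://jats.nlm.nih.gov">
  <body>
    <sec>
      <title>Methods</title>
      <p>See the <ext-link>protocol</ext-link>.</p>
    </sec>
  </body>
</article>
"""

def local_name(tag):
    """Strip a '{namespace-uri}' prefix, if present, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def body_text(xml_string):
    """Extract plain text of <title>/<p> inside <body>, ignoring namespaces."""
    root = ET.fromstring(xml_string)
    body = next(e for e in root.iter() if local_name(e.tag) == "body")
    parts = []
    for elem in body.iter():
        if local_name(elem.tag) in ("title", "p"):
            parts.append(" ".join("".join(elem.itertext()).split()))
    return "\n".join(parts)

out = body_text(NAMESPACED)
```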

Collaborator

@sg-wbi sg-wbi left a comment


@HallerPatrick thank you very much for adding the xml parsing: it's very valuable! Great job!

@sg-wbi sg-wbi merged commit f1007c9 into bigscience-workshop:master Apr 19, 2022
@HallerPatrick HallerPatrick deleted the bioasq_task_c_2017 branch April 19, 2022 19:08