Closes #209 + PubMed Abstract Loader #418
sg-wbi
left a comment
Great contribution @HallerPatrick! Sorry for the delay...
Everything LGTM; however, I do have a question. Why do we need a PubMed Abstract Loader at all? I see you had a conversation with @jason-fries, but if the dataset is local and the folks at BioASQ kindly provide a link to download all the articles, I do not see the use of the fetcher. Or am I misunderstanding something?
@sg-wbi Hey! You are correct that the abstracts are part of the local dataset. @jason-fries's idea was more or less to check whether a general PubMed abstract data loader would work for this project, and maybe to test it on this dataset, so that it can be used for similar cases in the future.
I see! Thank you for the clarification! A general utility to fetch data from PubMed/PMC would indeed be very valuable. However, my suggestion would be to open a separate PR for this. Here you could simply add a note to the script documenting that the user also needs to download the articles from the website. A second thing that I noticed is that the
I can do both. No problem!
@sg-wbi The XML does not contain the text in raw form, but: It's XML-annotated. Should I just extract the raw text from each tag?
Yeah, usually you do that when you parse PMC articles. Ideally you should keep the text inside
@sg-wbi Alright, I finally got it to work. I now parse the XML and only use the raw text (no XML anymore) from the
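For readers following the thread, stripping XML annotations down to raw text can be sketched roughly as below. This is only an illustration using Python's standard `xml.etree.ElementTree`, not the code from this PR. The sample document and the tag names (`article`, `section`, `title`, `p`, `b`, `i`) are placeholders assumed for the example and are not the actual BioASQ/PMC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-annotated snippet. The tag names are placeholders
# chosen for illustration, not the schema used by the dataset.
SAMPLE = """<article>
  <section>
    <title>Funding</title>
    <p>This work was supported by <b>Grant 123</b> and <i>Agency X</i>.</p>
  </section>
</article>"""


def raw_text(element):
    """Return the text below `element` with all markup removed."""
    # itertext() yields the inner text of the element and all of its
    # descendants in document order, so inline tags simply disappear.
    joined = "".join(element.itertext())
    # Collapse newlines and indentation left over from pretty-printing.
    return " ".join(joined.split())


root = ET.fromstring(SAMPLE)
for section in root.iter("section"):
    print(raw_text(section))
```

Joining before normalizing whitespace keeps words that are split across inline tags intact, while the final `split`/`join` removes layout whitespace between block-level elements.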
sg-wbi
left a comment
@HallerPatrick thank you very much for adding the xml parsing: it's very valuable! Great job!
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
If the following information is NOT present in the issue, please populate:
Checkbox
- `biodatasets/bioasq_task_c_2017/bioasq_task_c_2017.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- `datasets.load_dataset` function.
- `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.

Output Unittest:
As discussed with @jason-fries, this PR also includes the data downloader for PubMed articles, taken from Jason's repo. I added a short explanation to the function, which can be removed.
:)