
How is CS and BioMed corpus filtered from S2ORC dataset #4

Closed
stevezheng23 opened this issue Apr 25, 2020 · 24 comments

Labels
question Further information is requested

stevezheng23 commented Apr 25, 2020

Hi Team, I'm wondering how the CS/BioMed corpus is filtered from the S2ORC dataset? I didn't find details on this in the original paper; could you shed some light on it? Thanks!

stevezheng23 changed the title from "How is CS and BioMed corpus filtered from S2GCR dataset" to "How is CS and BioMed corpus filtered from S2ORC dataset" on Apr 26, 2020
@stevezheng23 (Author)

@kernelmachine / @amarasovic, I tried to filter the docs with 'grobid_parse' and 'latex_parse' fields based on 'mag_fos' in the metadata, but got more than 2.68M docs for the 'BioMed' domain and only ~600K docs for the 'CS' domain, which is less than the 2.22M mentioned in the paper. Is there any additional information I'm missing, or do you have other tricks for generating the unlabeled corpus?

@kernelmachine (Contributor)

Thanks for submitting this! @kyleclo, can you help?

kernelmachine added the "question" label Apr 26, 2020

kyleclo commented Apr 27, 2020

Hey @stevezheng23, sorry about the lack of clarity here. The current public release of S2ORC only contains papers that could be released under strict copyright regulations. Unfortunately, before learning about this, we had finished the LM pretraining experiments with an earlier version of S2ORC that contained substantially more papers. Many of those papers are not available in the current S2ORC release -- it'll take more negotiation, and we can't promise when/if it'll happen. Thanks for catching this -- we'll update the paper to make this clearer.

@stevezheng23 (Author)

@kyleclo thanks a lot for the explanation!

@shizhediao

> @kyleclo thanks a lot for the explanation!

Hi, would you be willing to share the dataset after preprocessing?
Thanks!


kyleclo commented Aug 29, 2020

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

shizhediao commented Aug 30, 2020

> @shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

Hi, which script do you mean? Is it the example provided at https://github.com/allenai/s2orc?


```python
import json
# ...
if paper_id in paper_id_to_pdf_parse:
    # (1) get the full pdf parse from the previously computed lookup dict
    pdf_parse = paper_id_to_pdf_parse[paper_id]

    # (2) pull out fields we need from the pdf parse, including bibliography & text
    bib_entries = pdf_parse['bib_entries']
    paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']

    # (3) loop over paragraphs, grabbing citation contexts
    for paragraph in paragraphs:

        # (4) loop over each inline citation in this paragraph
        for cite_span in paragraph['cite_spans']:

            # (5) each inline citation can be resolved to a bib entry
            cited_bib_entry = bib_entries[cite_span['ref_id']]

            # (6) that bib entry *may* be linked to a S2ORC paper. if so, grab paragraph
            linked_paper_id = cited_bib_entry['link']
            if linked_paper_id:
                citation_contexts.append({
                    'citing_paper_id': paper_id,
                    'cited_paper_id': linked_paper_id,
                    'context': paragraph['text'],
                    'citation_mention_start': cite_span['start'],
                    'citation_mention_end': cite_span['end'],
                })
```

shizhediao commented Aug 30, 2020

> @shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

If so: actually, I have checked this example, and I tried to filter the dataset into a pretraining corpus by adding these conditions:

```python
if not ("Medicine" in metadata_dict['mag_field_of_study'] and "Biology" in metadata_dict['mag_field_of_study']):
    continue
if metadata_dict["has_pdf_parse"] == False:
    continue
if metadata_dict["has_pdf_parse"] == True and metadata_dict["has_pdf_parsed_body_text"] == False:
    continue
```

Am I correct?


kyleclo commented Aug 30, 2020

That's right. And if you want the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".

As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text, as long as they also have abstracts.
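
A minimal sketch of that filter for the CS subset, looping over an S2ORC metadata file (the file path and surrounding loop are illustrative; the field names are the ones discussed above):

```python
import json

# illustrative path to one S2ORC metadata shard (jsonl, one paper per line)
with open('metadata.jsonl') as f:
    for line in f:
        metadata_dict = json.loads(line)
        # keep only Computer Science papers that have a PDF parse
        if "Computer Science" not in (metadata_dict.get('mag_field_of_study') or []):
            continue
        if not metadata_dict.get("has_pdf_parse"):
            continue
        # ... this paper passes the filter
```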

shizhediao commented Aug 30, 2020

> That's right. And if you want the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".
>
> As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text, as long as they also have abstracts.

OK, thanks!
One last question: does BioMed mean that both "bio" and "med" are in the mag_field at the same time? I think it needs "bio and med" rather than "bio or med"?
Just want to make sure.


kyleclo commented Aug 30, 2020

The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"].

A paper can be Biology or Medicine or both.

For the pretraining experiment, we allowed papers that were Biology-only, Medicine-only, or both.
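
A minimal sketch of an OR-style check matching this description (the helper name is illustrative):

```python
def is_biomed(metadata_dict):
    """Keep papers tagged Biology, Medicine, or both."""
    fields = metadata_dict.get('mag_field_of_study') or []
    return "Biology" in fields or "Medicine" in fields
```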

@shizhediao

Got it.
Thanks!

@shizhediao

> The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"].
>
> A paper can be Biology or Medicine or both.
>
> For the pretraining experiment, we allowed papers that were Biology-only, Medicine-only, or both.

Hi Kyle,
May I ask whether there are any extra data cleaning steps performed?
Or do you just use the raw text from S2ORC?


kyleclo commented Aug 30, 2020

We pulled text from the abstract and body_text fields when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up from sentences in separate paragraphs. The abstract is its own paragraph, and the body_text is already divided into paragraphs. Since some paragraphs are longer than RoBERTa's max sequence length, we used ScispaCy to pre-sentencize everything so there is a single sentence per line. This made it possible to use the --line-by-line flag in Huggingface.

We didn't do any other processing besides this. RoBERTa pretraining is pretty robust to even really poorly-formatted text parsed from paper PDFs.
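
A minimal sketch of that sentencization step, assuming the pdf_parse structure from the snippet earlier in this thread (the helper name is illustrative):

```python
import spacy

nlp = spacy.load('en_core_sci_sm')  # ScispaCy's small scientific-text model

def paper_to_lines(pdf_parse):
    """Yield one sentence per line for a paper's abstract + body text."""
    paragraphs = pdf_parse.get('abstract', []) + pdf_parse.get('body_text', [])
    for paragraph in paragraphs:
        for sentence in nlp(paragraph['text']).sents:
            yield sentence.text
```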

shizhediao commented Aug 30, 2020

OK, thanks!
Have you encountered the MemoryError problem?
Because there is no lazy DataLoader in the Huggingface library right now, my 128 GB of RAM cannot load all 48 GB of data.
Could you provide any hints about how to deal with such a large dataset?
Thanks!

shizhediao commented Aug 30, 2020

> Since some paragraphs are longer than RoBERTa's max sequence length, we used ScispaCy to pre-sentencize everything so there is a single sentence per line.

By the way, I don't quite understand what that means. I have checked the ScispaCy repo but couldn't find anything related to this operation.
Could you provide an example and point out which ScispaCy function performs it?

For example, if the paragraph is

"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."

What output do you want to get via ScispaCy and how?


kyleclo commented Aug 30, 2020

Sentencization means the output will look like:

```
# first sentence
"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR)."

# second sentence
"SBMA can be caused by this easily."
```

The code looks something like:

```python
import spacy

nlp = spacy.load('en_core_sci_sm')
text = "..."
for sentence in nlp(text).sents:
    print(sentence.text)  # or do something else with each sentence
```
As for debugging your memory issues with Huggingface code, that might be best handled by opening an Issue on their library's GitHub repo.

shizhediao commented Aug 31, 2020

Thanks!
So you treat a sentence as the unit, right?
After sentencization, every sentence is on its own line. Are there blank lines between sentences from different paragraphs?

shizhediao commented Sep 2, 2020

> ... when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up from sentences in separate paragraphs.

Hi Kyle,
I didn't get why you preserve paragraph breaks, or how you do it.
Why: in my understanding, RoBERTa does not use the NSP task, so I think we do not need to preserve paragraph breaks.
How: from your comments, I understand that you split a paragraph into lines, each line being one sentence from the paragraph. I was wondering whether there is a blank line between sentences from different paragraphs? In my understanding, if we want to preserve the breaks, we need to add a blank line?

Thanks!


kyleclo commented Sep 2, 2020

@shizhediao To clarify, the pretraining corpus is formatted like:

```
Sent1 from Paper1
Sent2 from Paper1
...
Sent500 from Paper1

Sent1 from Paper2
Sent2 from Paper2
...
```

with a blank line separating individual documents.

Sorry I wasn't being clear about paragraphs. Please ignore that comment; it's a very minor detail pertaining to how S2ORC distributes paragraphs, and in hindsight it's more confusing than just following the format I've pasted above.
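
A minimal sketch of writing the corpus in this format, reusing the illustrative paper_to_lines helper from above (parsed_papers is an assumed iterable of pdf parses that passed the filters):

```python
# One sentence per line, with a blank line separating individual papers.
with open('pretrain_corpus.txt', 'w') as f:
    for pdf_parse in parsed_papers:  # assumed: filtered pdf parses
        for line in paper_to_lines(pdf_parse):
            f.write(line + '\n')
        f.write('\n')  # document separator
```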

@shizhediao

OK, got it!
Thanks for your time and patience.
I agree with you that it's a very minor detail. Now I fully understand it.
Thanks again!

@shizhediao

Hi @kyleclo,
I was wondering whether you will use the unlabelled valid/test sets in TAPT?


kyleclo commented Sep 7, 2020

@shizhediao can you create a new issue for this? It makes it easier for others to search for answered questions. Thanks!

@shizhediao

Yes, sure. Thanks for pointing that out; here is the link:
#17
