
How is CS and BioMed corpus filtered from S2ORC dataset #4

Closed
stevezheng23 opened this issue Apr 25, 2020 · 24 comments

Labels
question Further information is requested

stevezheng23 commented Apr 25, 2020

Hi Team, I'm wondering how the CS/BioMed corpus is filtered from the S2ORC dataset? I didn't find details on this in the original paper; could you shed some light on it? Thanks!

stevezheng23 changed the title from "How is CS and BioMed corpus filtered from S2GCR dataset" to "How is CS and BioMed corpus filtered from S2ORC dataset" on Apr 26, 2020
@stevezheng23 (Author)

@kernelmachine / @amarasovic, I tried to filter the docs with 'grobid_parse' and 'latex_parse' fields based on 'mag_fos' in the metadata, but got more than 2.68M docs for the 'BioMed' domain and only ~600K docs for the 'CS' domain, which is less than the 2.22M mentioned in the paper. Is there any additional information I'm missing, or do you have other tricks for generating the unlabeled corpus?

@kernelmachine (Contributor)

Thanks for submitting this! @kyleclo, can you help?

kernelmachine added the "question" label Apr 26, 2020

kyleclo commented Apr 27, 2020

Hey @stevezheng23, sorry about the lack of clarity here. The current public release of S2ORC only contains papers that could be released under strict copyright regulations. Unfortunately, before learning about this, we had finished the LM pretraining experiments with an earlier version of S2ORC that contained substantially more papers. Many of those papers are not available in the current S2ORC release -- it'll take more negotiation, and we can't promise when/if it'll happen. Thanks for catching this -- we'll update the paper to make this clearer.

@stevezheng23 (Author)

@kyleclo thanks a lot for the explanation!

@shizhediao

> @kyleclo thanks a lot for the explanation!

Hi, would you be willing to share the dataset after preprocessing?
Thanks!


kyleclo commented Aug 29, 2020

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

shizhediao commented Aug 30, 2020

> @shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

Hi, which script do you mean? Is it the example provided at https://github.com/allenai/s2orc?


```python
import json
# ...
if paper_id in paper_id_to_pdf_parse:
    # (1) get the full pdf parse from the previously computed lookup dict
    pdf_parse = paper_id_to_pdf_parse[paper_id]

    # (2) pull out fields we need from the pdf parse, including bibliography & text
    bib_entries = pdf_parse['bib_entries']
    paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']

    # (3) loop over paragraphs, grabbing citation contexts
    for paragraph in paragraphs:

        # (4) loop over each inline citation in this paragraph
        for cite_span in paragraph['cite_spans']:

            # (5) each inline citation can be resolved to a bib entry
            cited_bib_entry = bib_entries[cite_span['ref_id']]

            # (6) that bib entry *may* be linked to a S2ORC paper. if so, grab paragraph
            linked_paper_id = cited_bib_entry['link']
            if linked_paper_id:
                citation_contexts.append({
                    'citing_paper_id': paper_id,
                    'cited_paper_id': linked_paper_id,
                    'context': paragraph['text'],
                    'citation_mention_start': cite_span['start'],
                    'citation_mention_end': cite_span['end'],
                })
```

shizhediao commented Aug 30, 2020

> @shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

If so: actually, I have checked this example, and I tried to filter the dataset into a pretraining corpus by adding these conditions:

```python
if not ("Medicine" in metadata_dict['mag_field_of_study'] and "Biology" in metadata_dict['mag_field_of_study']):
    continue
if metadata_dict["has_pdf_parse"] == False:
    continue
if metadata_dict["has_pdf_parse"] == True and metadata_dict["has_pdf_parsed_body_text"] == False:
    continue
```

Am I correct?


kyleclo commented Aug 30, 2020

That's right. And if you want the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".

As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text, as long as they also have abstracts.
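
A minimal sketch of that filter for the CS subset, looping over an S2ORC metadata file (the file path and surrounding loop are illustrative; the field names are the ones discussed above):

```python
import json

# illustrative path to one S2ORC metadata shard (jsonl, one paper per line)
with open('metadata.jsonl') as f:
    for line in f:
        metadata_dict = json.loads(line)
        # keep only Computer Science papers that have a PDF parse
        if "Computer Science" not in (metadata_dict.get('mag_field_of_study') or []):
            continue
        if not metadata_dict.get("has_pdf_parse"):
            continue
        # ... this paper passes the filter
```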

shizhediao commented Aug 30, 2020

> That's right. And if you want the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".
>
> As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text, as long as they also have abstracts.

OK, thanks!
One last question: does BioMed mean that both "bio" and "med" are in the mag_field at the same time? I think it needs "bio and med" rather than "bio or med"?
Just want to make sure.


kyleclo commented Aug 30, 2020

The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"].

A paper can be Biology or Medicine or both.

For the pretraining experiment, we allowed papers that were Biology-only, Medicine-only, or both.
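
A minimal sketch of an OR-style check matching this description (the helper name is illustrative):

```python
def is_biomed(metadata_dict):
    """Keep papers tagged Biology, Medicine, or both."""
    fields = metadata_dict.get('mag_field_of_study') or []
    return "Biology" in fields or "Medicine" in fields
```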

@shizhediao

Got it.
Thanks!

@shizhediao

> The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"].
>
> A paper can be Biology or Medicine or both.
>
> For the pretraining experiment, we allowed papers that were Biology-only, Medicine-only, or both.

Hi Kyle,
May I ask whether there are any extra data cleaning steps performed?
Or do you just use the raw text from S2ORC?


kyleclo commented Aug 30, 2020

We pulled text from the abstract and body_text fields when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up from sentences in separate paragraphs. The abstract is its own paragraph, and the body_text is already divided into paragraphs. Since some paragraphs are longer than RoBERTa's max sequence length, we used ScispaCy to pre-sentencize everything so there is a single sentence per line. This made it possible to use the --line-by-line flag in Huggingface.

We didn't do any other processing besides this. RoBERTa pretraining is pretty robust to even really poorly-formatted text parsed from paper PDFs.
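
A minimal sketch of that sentencization step, assuming the pdf_parse structure from the snippet earlier in this thread (the helper name is illustrative):

```python
import spacy

nlp = spacy.load('en_core_sci_sm')  # ScispaCy's small scientific-text model

def paper_to_lines(pdf_parse):
    """Yield one sentence per line for a paper's abstract + body text."""
    paragraphs = pdf_parse.get('abstract', []) + pdf_parse.get('body_text', [])
    for paragraph in paragraphs:
        for sentence in nlp(paragraph['text']).sents:
            yield sentence.text
```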

shizhediao commented Aug 30, 2020

OK, thanks!
Have you encountered the MemoryError problem?
Because there is no lazy DataLoader in the Huggingface library right now, my 128 GB of RAM cannot load all 48 GB of data.
Could you provide any hints about how to deal with such a large dataset?
Thanks!

shizhediao commented Aug 30, 2020

> Since some paragraphs are longer than RoBERTa's max sequence length, we used ScispaCy to pre-sentencize everything so there is a single sentence per line.

By the way, I don't quite understand what that means. I have checked the ScispaCy repo but couldn't find anything related to this operation.
Could you provide an example and point out which ScispaCy function performs it?

For example, if the paragraph is

"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."

What output do you want to get via ScispaCy and how?


kyleclo commented Aug 30, 2020

Sentencization means the output will look like:

```
# first sentence
"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR)."

# second sentence
"SBMA can be caused by this easily."
```

The code looks something like:

```python
import spacy

nlp = spacy.load('en_core_sci_sm')
text = "..."
for sentence in nlp(text).sents:
    print(sentence.text)  # or do something else with each sentence
```
As for debugging your memory issues with Huggingface code, that might be best handled by opening an Issue on their library's GitHub repo.

shizhediao commented Aug 31, 2020

Thanks!
So you treat a sentence as the unit, right?
After sentencization, every sentence is on its own line. Are there blank lines between sentences from different paragraphs?

shizhediao commented Sep 2, 2020

> ... when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up from sentences in separate paragraphs.

Hi Kyle,
I didn't get why you preserve paragraph breaks, or how you do it.
Why: in my understanding, RoBERTa does not use the NSP task, so I think we do not need to preserve paragraph breaks.
How: from your comments, I understand that you split a paragraph into lines, each line being one sentence from the paragraph. I was wondering whether there is a blank line between sentences from different paragraphs? In my understanding, if we want to preserve the breaks, we need to add a blank line?

Thanks!


kyleclo commented Sep 2, 2020

@shizhediao To clarify, the pretraining corpus is formatted like:

```
Sent1 from Paper1
Sent2 from Paper1
...
Sent500 from Paper1

Sent1 from Paper2
Sent2 from Paper2
...
```

with a blank line separating individual documents.

Sorry I wasn't being clear about paragraphs. Please ignore that comment; it's a very minor detail pertaining to how S2ORC distributes paragraphs, and in hindsight it's more confusing than just following the format I've pasted above.
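
A minimal sketch of writing the corpus in this format, reusing the illustrative paper_to_lines helper from above (parsed_papers is an assumed iterable of pdf parses that passed the filters):

```python
# One sentence per line, with a blank line separating individual papers.
with open('pretrain_corpus.txt', 'w') as f:
    for pdf_parse in parsed_papers:  # assumed: filtered pdf parses
        for line in paper_to_lines(pdf_parse):
            f.write(line + '\n')
        f.write('\n')  # document separator
```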

@shizhediao

OK, got it!
Thanks for your time and patience.
I agree with you that it's a very minor detail. Now I fully understand it.
Thanks again!

@shizhediao

Hi @kyleclo,
I was wondering whether you will use the unlabelled valid/test sets in TAPT?


kyleclo commented Sep 7, 2020

@shizhediao can you create a new issue for this? It makes it easier for others to search for answered questions. Thanks!

@shizhediao

Yes, sure. Thanks for pointing that out; here is the link:
#17
