How is CS and BioMed corpus filtered from S2ORC dataset #4
@kernelmachine / @amarasovic, I tried to filter the docs with 'grobid_parse' and 'latex_parse' fields based on 'mag_fos' in metadata, but got ~2.68M docs for the 'BioMed' domain and only ~600K docs for the 'CS' domain, which is less than the 2.22M mentioned in the paper. Is there any additional information I'm missing, or do you have other tricks for generating the unlabeled corpus?
Thanks for submitting this! @kyleclo, can you help?
Hey @stevezheng23, sorry about the lack of clarity here. The current public release of S2ORC only contains papers that could be released while adhering to strict copyright regulations. Unfortunately, we had finished LM pretraining experiments with an earlier version of S2ORC that contained substantially more papers before learning about this. Many of those papers are unfortunately not available in S2ORC currently; it'll take more negotiation, and we can't promise when/if it'll happen. Thanks for catching this. We'll update the paper to make this more clear.
@kyleclo thanks a lot for the explanation!
Hi, would you like to share the dataset after preprocessing?
@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?
Hi, what does the script here mean? Does that mean the example provided at https://github.com/allenai/s2orc (the one starting with `import json`)?
If so, I have actually checked this example, and I tried to filter the dataset into a pretraining corpus by adding some conditions. Am I correct?
That's right. And if you wanted the computer science subset, you can switch the 'mag_fos' filter value accordingly.
OK, Thanks! |
A paper can be Bio or Medicine or both. For the pretraining experiment, we allowed papers that were Bio-only, Medicine-only, or both.
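That membership check can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the field name `mag_fos` follows the naming used earlier in this thread, and `metadata.jsonl` is a hypothetical filename.

```python
import json

# Hypothetical sketch: keep papers whose 'mag_fos' list contains Biology
# or Medicine (or both), mirroring the BioMed filter discussed above.
BIOMED_FOS = {"Biology", "Medicine"}

def is_biomed(record):
    """Return True if the paper's fields of study overlap Biology/Medicine."""
    fos = record.get("mag_fos") or []
    return bool(BIOMED_FOS & set(fos))

def filter_biomed(path):
    """Yield records from a JSONL metadata file that pass the BioMed filter."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if is_biomed(record):
                yield record
```

For a computer science subset, the same check would use a set like `{"Computer Science"}` instead.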
Got it. |
Hi Kyle, |
We pulled text from the parses. Didn't do any other processing besides this. RoBERTa pretraining is pretty robust to even really poorly-formatted text parsed from paper PDFs.
OK, Thanks! |
By the way, I don't quite understand what that means. Actually, I have checked the scispaCy repo but I did not find anything related to this operation. For example, if the paragraph is:
What output do you want to get via ScispaCy and how? |
Sentencization means the output will look like:
The code looks something like:
As for debugging your memory issues with Huggingface code, that might be best handled by opening an Issue on their library's GitHub repo. |
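The elided snippet presumably splits each paragraph into one sentence per line. The thread uses scispaCy for this; as a library-free stand-in, here is a rough stdlib-only approximation (the regex splitter is purely illustrative and much cruder than a model-based sentencizer):

```python
import re

def sentencize(paragraph):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace.
    scispaCy's model-based sentencizer handles abbreviations and inline
    citations far better; this is only an illustrative approximation."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

paragraph = "We pretrain RoBERTa on S2ORC. The corpus covers CS and BioMed."
for sentence in sentencize(paragraph):
    print(sentence)
```

Running this prints each sentence of the paragraph on its own line.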
Thanks! |
Hi Kyle, Thanks! |
@shizhediao To clarify, the pretraining corpus is formatted like:
with a newline separating individual documents. Sorry I wasn't being clear about paragraphs; please ignore that comment. It's a very minor detail pertaining to how S2ORC is distributed with paragraphs, and in hindsight more confusing than just following the format I've pasted above.
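That layout (each document's sentences on their own lines, with a blank line between documents) can be sketched with a small hypothetical helper, where `docs` is a list of documents and each document is a list of sentence strings:

```python
def format_corpus(docs):
    """Join each document's sentences with newlines, and separate
    consecutive documents with a blank line, per the format above."""
    return "\n\n".join("\n".join(sentences) for sentences in docs)

docs = [
    ["First sentence of doc one.", "Second sentence."],
    ["Only sentence of doc two."],
]
print(format_corpus(docs))
```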
OK, Got it! |
Hi @kyleclo |
@shizhediao can you create a new issue for this? It makes it easier for others to search for answered questions. Thanks!
Yes, sure. Thanks for pointing that out; here is the link:
Hi Team, I'm wondering how the CS/BioMed corpus is filtered from the S2ORC dataset? I didn't find details on this in the original paper; could you shed some light on this? Thanks!