S2ORC: The Semantic Scholar Open Research Corpus
S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.
- Download instructions.
- S2ORC was developed by Kyle Lo and Lucy Lu Wang at the Allen Institute for AI. It is currently being maintained by the Semantic Scholar API team led by Rodney Kinney and Waleed Ammar. Feel free to email us.
- S2ORC is only for non-commercial use, and is released under the ODC-By 1.0 license. By using S2ORC, you agree to the terms in the license.
- Please cite our ACL 2020 paper if you use S2ORC for your project. See the BibTeX. You can also watch our 12 min ACL 2020 talk.
News and Releases
It's Jan 2023; happy new year! After years of managing S2ORC as a research project, it has now been adopted as a core dataset offering through the Semantic Scholar Public API. Please look for the instructions under "Bulk Dataset" for download!
S2ORC is now available through the Semantic Scholar Public API as a "Bulk Dataset". It is continuously being rebuilt so if you access it through there, you'll get access to new papers as well!
Software Release: 2021-02-01
- Released s2orc-doc2json to support parsing of PDF and LaTeX to JSON format.
S2ORC Release: 2020-07-05
- Released a new version of S2ORC containing papers up until 2020-04-14, bringing full text coverage from 8M to 12M.
- Lifted some paper filters to be more lenient toward papers that don't have a sufficient amount of text. This brought the total paper count from 81M to 136M.
- Updated the schema to keep paper metadata and parsed paper text separate.
- Fixed major bugs such as (i) missing section names, (ii) inline citation mention links that don't resolve to bibliographies, and (iii) unpredictable typing in certain metadata fields.
- Omitted LaTeX parses from this release. They will be added in a subsequent release. Part of the dataset schema change is to accommodate incremental releases (e.g. a LaTeX-only release without having to re-run PDF parsing).
- Feb 2023 update: We are no longer supporting access to this version & recommend everyone use the latest way of accessing S2ORC through the Semantic Scholar Public API. If you must use this version and need assistance, please contact Kyle and Lucy.
Project Status: 2020-04-07
- S2ORC has been accepted to ACL 2020!
- We've changed the name of the project to S2ORC. We will update the preprint shortly with the new name.
- The BibTeX citation has also been changed to reflect this.
- Feb 2023 update: We are no longer supporting access to this version & recommend everyone use the latest way of accessing S2ORC through the Semantic Scholar Public API. If you must use this version and need assistance, please contact Kyle and Lucy.
S2ORC Release: 2019-09-28
- Statistics: 81M+ paper nodes; 73M+ gold abstracts; 8M+ full text papers
- Due to release bugs (e.g. missing section names), we no longer recommend usage of this version. If you must use this version and need assistance, please contact Kyle and Lucy.
Download instructions
Please request access to S2ORC by:
- Requesting a Semantic Scholar API key here
- It may take us up to a week to get back to you. If it has been longer than one week since you have completed the form and you have not heard from us, please send us an email -- your request may have slipped through the cracks.
Contact us
The best way to contact us is through email. Don't hesitate to reach out about anything; we've helped a lot of people get started with the dataset, which can be a bit daunting given its size.
Email: Please include {kylel, lucyw, rodneyk, waleeda}@allenai.org on all correspondence.
Twitter: @kylelostat, @lucyluwang
IRC: #s2orc at irc.oftc.net
Give us Feedback: Totally optional, but we'd love to hear how you're using this dataset & any feedback for improving it. Send us an email or open a GitHub Issue.
Report issues: Use GitHub Issues to report bugs or issues! We'll try to fix them for the next release.
FAQ
How is this related to the Semantic Scholar Academic Graph (S2AG)?
S2ORC and S2AG should be viewed as separate datasets, but accessed through the same public API. In short:
- S2AG is everything that is covered in the literature graph, including Nodes (i.e. papers, authors) and Edges (i.e. citations, authorship). A paper in S2AG is represented by a bundle of Metadata, such as the Title, Authors, Year, Venue, Abstract, etc.
- S2ORC is everything that is machine-readable full text of the paper, which we derive using models run on the paper's PDF.
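As a rough illustration of the distinction above, here is a minimal Python sketch that models the two kinds of records and pairs them up. The class names, the field names (`paper_id`, `body_text`), and the idea that both records share a common paper identifier are assumptions made for this example only; they are not the actual S2AG or S2ORC schema, so please check the Semantic Scholar Public API documentation for the real field names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class GraphPaper:
    """Hypothetical S2AG-style paper node: a bundle of metadata only."""
    paper_id: str                      # assumed shared identifier
    title: str
    authors: List[str]
    year: Optional[int] = None
    venue: Optional[str] = None
    abstract: Optional[str] = None


@dataclass
class FullTextPaper:
    """Hypothetical S2ORC-style record: machine-readable full text."""
    paper_id: str                      # assumed shared identifier
    body_text: str                     # text derived from the paper's PDF


def join_full_text(
    papers: List[GraphPaper], full_texts: List[FullTextPaper]
) -> List[Tuple[GraphPaper, Optional[FullTextPaper]]]:
    """Pair each metadata record with its full text (or None) by paper_id."""
    text_by_id: Dict[str, FullTextPaper] = {ft.paper_id: ft for ft in full_texts}
    return [(paper, text_by_id.get(paper.paper_id)) for paper in papers]
```

In this sketch, a paper that has metadata but no full text is paired with `None`, which mirrors the fact that not every paper in the graph has a full-text parse.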
If you're unsure what to use, please email us and we'd be happy to discuss your project with you.
License
S2ORC is currently released through the Semantic Scholar Public API under the ODC-By 1.0 license. By using S2ORC, you are agreeing to its usage terms.
Please note this means that S2ORC is currently only available for non-commercial use.
Citation
If using this dataset, please cite:
@inproceedings{lo-wang-2020-s2orc,
title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
author = "Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Daniel",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.447",
doi = "10.18653/v1/2020.acl-main.447",
pages = "4969--4983"
}