This repository provides metadata to papers from DBLP (> 5.9m articles, > 3.8m authors as of September 2022) crawled with the cs-insights-crawler.
As of version 2.1 of the dataset and version 1.3.0 of the crawler, the dataset adds the Computer Science Ontology with information about sub-fields. Also as of version 2.1 of the dataset, we support 🤗 Hugging Face Datasets.
As of version 2.0 of the dataset and version 1.0.2 of the crawler, the dataset uses SemanticScholar data. We will release a blog post soon about this update and how it affects porting from version 1.0.
The goal is to keep this corpus monthly updated and provide a comprehensive repository of the full DBLP collection.
This repository provides the following exports:
- 📖 All paper entries with metadata as jsonl: size 2.7G (gz) download here.
- 🙋 All author entries with metadata as jsonl: size 188M (gz) download here.
>>> from datasets import load_dataset
>>> load_dataset("jpwahle/dblp-discovery-dataset, "papers")['train'][0]
{
'corpusid': 26,
'externalids': ['ACL',
'DBLP',
'ArXiv',
'MAG',
'CorpusId',
'PubMed',
'DOI',
'PubMedCentral'],
'title': 'FPGA-based design and implementation of an approximate polynomial matrix EVD algorithm',
'authors': {'authorId': [12653318, 144237481],
'name': ['Server Kasap', 'Soydan Redif']},
'venue': '2012 International Conference on Field-Programmable Technology',
'year': 2012,
'publicationdate': '2012-12-01',
'abstract': 'In this paper, we introduce a field-programmable gate array (FPGA) hardware architecture for the realization of an algorithm for computing the eigenvalue decomposition (EVD) of para-Hermitian polynomial matrices. Specifically, we develop a parallelized version of the second-order sequential best rotation (SBR2) algorithm for polynomial matrix EVD (PEVD). The proposed algorithm is an extension of the parallel Jacobi method to para-Hermitian polynomial matrices, as such it is the first architecture devoted to PEVD. Hardware implementation of the algorithm is achieved via a highly pipelined, non-systolic FPGA architecture. The proposed architecture is scalable in terms of the size of the input para-Hermitian matrix. We demonstrate the decomposition accuracy of the architecture through FPGA-in-the-loop hardware co-simulations. Results confirm that the proposed solution gives low execution times while reducing the number of resources required from the FPGA.',
'referencecount': 16,
'citationcount': 1,
'isopenaccess': False,
'influentialcitationcount': 0,
's2fieldsofstudy': {'category': ['Computer Science', 'Computer Science'],
'source': ['s2-fos-model', 'external']},
'publicationtypes': ['JournalArticle', 'Conference'],
'journal': "{'name': '2012 International Conference on Field-Programmable Technology', 'volume': None, 'pages': '135-140'}",
'updated': '2022-02-13T16:00:07.412Z',
'url': 'https://www.semanticscholar.org/paper/7011b84b03f1d992962c4a6c87459f7742bc3165'
}
The code to crawl this dataset was migrated to the CS-Insights Project and now lives here.
We are hoping that this corpus can be helpful for analysis relevant to the computer science community.
Please cite/star 🌟 this page if you use this corpus
The exports contain the following features:
Feature | Description |
---|---|
corpusid |
The unique identifier of the paper. |
externalids |
The same paper in other repositories (e.g., DOI, ACL). |
title |
The title of the paper. |
authors |
The authors of the paper with their authorid and name . |
venue |
The venue of the paper. |
year |
The year of the paper publication. |
publicationdate |
A more precise publication date of the paper. |
abstract |
The abstract of the paper. |
referencecount |
The number of references of the paper. |
citationcount |
The number of citations of the paper. |
isopenaccess |
Whether the paper is open access. |
influentialcitationcount |
The number of influential citations of the paper according to SemanticScholar. |
s2fieldsofstudy |
The fields of study of the paper according to SemanticScholar. |
publicationtypes |
The publication types of the paper. |
journal |
The journal of the paper. |
updated |
The last time the paper was updated. |
url |
A url to the paper in SemanticScholar. |
Feature | Description |
---|---|
authorid |
The unique identifier of the author. |
externalids |
The same author in other repositories (e.g., ACL, PubMed). This can include ORCID |
name |
The name of the author. |
affiliations |
The affiliations of the author. |
homepage |
The homepage of the author. |
papercount |
The number of papers the author has written. |
citationcount |
The number of citations the author has received. |
hindex |
The h-index of the author. |
updated |
The last time the author was updated. |
email |
The email of the author. |
s2url |
A url to the author in SemanticScholar. |
If you use the dataset in any way, please cite:
@inproceedings{Wahle2022c,
title = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
author = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
year = {2022},
month = {July},
booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
address = {Marseille, France},
doi = {},
}
Also make sure to cite the following papers if you use SemanticScholar data:
@inproceedings{ammar-etal-2018-construction,
title = "Construction of the Literature Graph in Semantic Scholar",
author = "Ammar, Waleed and
Groeneveld, Dirk and
Bhagavatula, Chandra and
Beltagy, Iz",
booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
month = jun,
year = "2018",
address = "New Orleans - Louisiana",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N18-3011",
doi = "10.18653/v1/N18-3011",
pages = "84--91",
}
@inproceedings{lo-wang-2020-s2orc,
title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
author = "Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Daniel",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.447",
doi = "10.18653/v1/2020.acl-main.447",
pages = "4969--4983"
}
The DBLP Discovery Dataset is released under the CC BY-NC 4.0. By using this corpus, you are agreeing to its usage terms.
We adapted this readme from the awesome ACL Anthology Corpus. We thank Semantic Scholar for the metadata they provided for this work.