The official repository for the LREC 2022 paper "D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research"

jpwahle/lrec22-d3-dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 

Repository files navigation

📖 The DBLP Discovery Dataset (D3)

This repository provides metadata for papers from DBLP (> 5.9M articles, > 3.8M authors as of September 2022), crawled with the cs-insights-crawler.

As of version 2.1 of the dataset and version 1.3.0 of the crawler, the dataset adds the Computer Science Ontology with information about sub-fields. Also as of version 2.1 of the dataset, we support 🤗 Hugging Face Datasets.

As of version 2.0 of the dataset and version 1.0.2 of the crawler, the dataset uses SemanticScholar data. We will release a blog post soon about this update and how it affects porting from version 1.0.

The goal is to update this corpus monthly and provide a comprehensive repository of the full DBLP collection.

This repository provides the following exports:

  1. 📖 All paper entries with metadata as JSONL (2.7 GB, gzip-compressed): download here.
  2. 🙋 All author entries with metadata as JSONL (188 MB, gzip-compressed): download here.

Loading the papers with 🤗 Hugging Face Datasets:

>>> from datasets import load_dataset
>>> load_dataset("jpwahle/dblp-discovery-dataset", "papers")['train'][0]
{
 'corpusid': 26,
 'externalids': ['ACL',
  'DBLP',
  'ArXiv',
  'MAG',
  'CorpusId',
  'PubMed',
  'DOI',
  'PubMedCentral'],
 'title': 'FPGA-based design and implementation of an approximate polynomial matrix EVD algorithm',
 'authors': {'authorId': [12653318, 144237481],
  'name': ['Server Kasap', 'Soydan Redif']},
 'venue': '2012 International Conference on Field-Programmable Technology',
 'year': 2012,
 'publicationdate': '2012-12-01',
 'abstract': 'In this paper, we introduce a field-programmable gate array (FPGA) hardware architecture for the realization of an algorithm for computing the eigenvalue decomposition (EVD) of para-Hermitian polynomial matrices. Specifically, we develop a parallelized version of the second-order sequential best rotation (SBR2) algorithm for polynomial matrix EVD (PEVD). The proposed algorithm is an extension of the parallel Jacobi method to para-Hermitian polynomial matrices, as such it is the first architecture devoted to PEVD. Hardware implementation of the algorithm is achieved via a highly pipelined, non-systolic FPGA architecture. The proposed architecture is scalable in terms of the size of the input para-Hermitian matrix. We demonstrate the decomposition accuracy of the architecture through FPGA-in-the-loop hardware co-simulations. Results confirm that the proposed solution gives low execution times while reducing the number of resources required from the FPGA.',
 'referencecount': 16,
 'citationcount': 1,
 'isopenaccess': False,
 'influentialcitationcount': 0,
 's2fieldsofstudy': {'category': ['Computer Science', 'Computer Science'],
  'source': ['s2-fos-model', 'external']},
 'publicationtypes': ['JournalArticle', 'Conference'],
 'journal': "{'name': '2012 International Conference on Field-Programmable Technology', 'volume': None, 'pages': '135-140'}",
 'updated': '2022-02-13T16:00:07.412Z',
 'url': 'https://www.semanticscholar.org/paper/7011b84b03f1d992962c4a6c87459f7742bc3165'
}
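
For the downloadable exports, the following is a minimal sketch of streaming a gzip-compressed JSONL file record by record instead of loading it into memory at once. The file name `papers.jsonl.gz` is a hypothetical local path, and the sketch assumes each line holds one JSON object with the same keys as the record shown above (for example `year`); this README does not confirm either detail.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_by_year(records: Iterable[dict]) -> Dict[int, int]:
    """Count records per publication year, skipping records without a year.

    Assumes the export uses a 'year' key, as in the sample record.
    """
    counts: Dict[int, int] = {}
    for record in records:
        year = record.get("year")
        if year is not None:
            counts[year] = counts.get(year, 0) + 1
    return counts
```

Under those assumptions, `count_by_year(iter_jsonl_gz("papers.jsonl.gz"))` would return a mapping from year to number of papers while keeping only one record in memory at a time.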

The code to crawl this dataset was migrated to the CS-Insights Project and now lives here.

We hope this corpus proves helpful for analyses relevant to the computer science community.

Please cite and star 🌟 this repository if you use this corpus.

💻 Features

The exports contain the following features:

Papers

| Feature | Description |
| --- | --- |
| corpusid | The unique identifier of the paper. |
| externalids | The same paper in other repositories (e.g., DOI, ACL). |
| title | The title of the paper. |
| authors | The authors of the paper with their authorId and name. |
| venue | The venue of the paper. |
| year | The year of the paper publication. |
| publicationdate | A more precise publication date of the paper. |
| abstract | The abstract of the paper. |
| referencecount | The number of references of the paper. |
| citationcount | The number of citations of the paper. |
| isopenaccess | Whether the paper is open access. |
| influentialcitationcount | The number of influential citations of the paper according to SemanticScholar. |
| s2fieldsofstudy | The fields of study of the paper according to SemanticScholar. |
| publicationtypes | The publication types of the paper. |
| journal | The journal of the paper. |
| updated | The last time the paper was updated. |
| url | A URL to the paper in SemanticScholar. |
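
In the sample record shown earlier, `authors` and `s2fieldsofstudy` are stored column-wise (parallel lists under each key), and `journal` appears as a string holding a Python-style dict. The helpers below sketch one way to reshape these fields. The function names are illustrative and not part of the dataset or any library, and the `journal` handling assumes other records use the same string format as that sample.

```python
import ast
from typing import List, Optional


def columns_to_rows(columns: dict) -> List[dict]:
    """Turn {'a': [1, 2], 'b': ['x', 'y']} into [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


def parse_journal(journal) -> Optional[dict]:
    """Parse the 'journal' field into a dict, or return None if that is not possible.

    ast.literal_eval only evaluates Python literals, so it is safer than eval().
    """
    if isinstance(journal, dict):
        return journal
    if not isinstance(journal, str) or not journal:
        return None
    try:
        parsed = ast.literal_eval(journal)
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, dict) else None


# Values copied from the sample record above.
authors = {"authorId": [12653318, 144237481], "name": ["Server Kasap", "Soydan Redif"]}
journal = "{'name': '2012 International Conference on Field-Programmable Technology', 'volume': None, 'pages': '135-140'}"

author_rows = columns_to_rows(authors)
journal_info = parse_journal(journal)
```

Row-wise author records are usually easier to join against the author export by `authorId`, which is why the sketch converts them instead of working on the parallel lists directly.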

Authors

| Feature | Description |
| --- | --- |
| authorid | The unique identifier of the author. |
| externalids | The same author in other repositories (e.g., ACL, PubMed). This can include ORCID. |
| name | The name of the author. |
| affiliations | The affiliations of the author. |
| homepage | The homepage of the author. |
| papercount | The number of papers the author has written. |
| citationcount | The number of citations the author has received. |
| hindex | The h-index of the author. |
| updated | The last time the author was updated. |
| email | The email of the author. |
| s2url | A URL to the author in SemanticScholar. |
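
As a reminder of what `hindex` measures, the standard h-index is the largest h such that the author has at least h papers with at least h citations each. The sketch below computes it from a list of per-paper citation counts. It illustrates the textbook definition only; how SemanticScholar derives the stored value (for example, which papers it counts) is not described in this README, so recomputed values may differ from the `hindex` field.

```python
from typing import Iterable


def h_index(citation_counts: Iterable[int]) -> int:
    """Return the largest h such that at least h papers have >= h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so h cannot grow further
    return h
```

For instance, an author whose papers have 10, 8, 5, 4, and 3 citations has four papers with at least four citations each, but not five with at least five, so the function returns 4.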

📖 Citation

If you use the dataset in any way, please cite:

@inproceedings{Wahle2022c,
  title        = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
  author       = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
  year         = {2022},
  month        = {July},
  booktitle    = {Proceedings of The 13th Language Resources and Evaluation Conference},
  publisher    = {European Language Resources Association},
  address      = {Marseille, France},
  doi          = {},
}

Also make sure to cite the following papers if you use SemanticScholar data:

@inproceedings{ammar-etal-2018-construction,
    title = "Construction of the Literature Graph in Semantic Scholar",
    author = "Ammar, Waleed  and
      Groeneveld, Dirk  and
      Bhagavatula, Chandra  and
      Beltagy, Iz",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans - Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-3011",
    doi = "10.18653/v1/N18-3011",
    pages = "84--91",
}
@inproceedings{lo-wang-2020-s2orc,
    title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
    author = "Lo, Kyle  and Wang, Lucy Lu  and Neumann, Mark  and Kinney, Rodney  and Weld, Daniel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.447",
    doi = "10.18653/v1/2020.acl-main.447",
    pages = "4969--4983"
}

🧑‍⚖️ License

The DBLP Discovery Dataset is released under the CC BY-NC 4.0 license. By using this corpus, you agree to its usage terms.

🙏 Acknowledgements

We adapted this readme from the awesome ACL Anthology Corpus. We thank Semantic Scholar for the metadata they provided for this work.
