Dainis Boumber edited this page Oct 24, 2017 · 4 revisions

Description

A corpus of 400 publications by the 20 most-cited authors in Machine Learning, designed for multi-label authorship attribution tasks.

Creation

  1. Using Google Scholar as the source, we compiled a list of the top 20 authors in Machine Learning, ranked by number of citations.
  2. For each author, 20 papers were downloaded for a total of 400 publications for the entire dataset.
  3. The text was extracted from the PDF files using pdfminer and pre-processed. The original data is also available. The papers were stripped of title, authorship information, and bibliography so that a classifier has to work under blind-review conditions instead of simply reading the author list. Formulas, tables, and figure information were retained, as they may carry valuable style and topic information.
  4. Each work is assigned 20 binary labels. The labels indicate which of the authors contributed to the paper's creation.

Contents

Labels.csv contains the ground truths in the following format: ,<author_1>,...,<author_20>\n where the leading field (before the first comma) is plain text and <author_n> is a digit, 0 or 1, indicating whether this author is one of the co-authors. The first row is the header row.
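As a rough illustration of the format above, Labels.csv could be read along these lines. This is only a sketch: the function name parse_labels and the sample rows are made up, and it assumes the leading plain-text field identifies the paper, which the description above does not state explicitly.

```python
import csv
import io

def parse_labels(csv_text):
    """Parse Labels.csv-style text into (author_names, {leading_field: [0/1, ...]})."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)      # the first row is the header row
    authors = header[1:]       # skip the header cell of the leading field
    labels = {}
    for row in reader:
        if not row:            # tolerate a trailing blank line
            continue
        key, flags = row[0], row[1:]
        if len(flags) != len(authors):
            raise ValueError(
                f"expected {len(authors)} labels, got {len(flags)} for {key!r}"
            )
        labels[key] = [int(flag) for flag in flags]
    return authors, labels

# Made-up sample with 3 authors instead of 20, for illustration only.
sample = ",author_a,author_b,author_c\npaper_1,1,0,0\npaper_2,0,1,1\n"
authors, labels = parse_labels(sample)
```

For the real file, the text can be obtained with open(path, newline="").read() before calling the function.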

There are 20 directories named after the authors. Each contains 20 papers in .txt format. Encoding is Unicode, line breaks are \n (Unix format).
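The directory layout described above can be loaded with a few lines of standard-library Python. The sketch below is an assumption-laden example, not part of the dataset tooling: load_corpus is a hypothetical helper, UTF-8 is assumed as the concrete Unicode encoding, and the demo runs on a throwaway directory with made-up names instead of the real data.

```python
import tempfile
from pathlib import Path

def load_corpus(root, encoding="utf-8"):
    """Return {author_dir_name: {file_name: text}} for every .txt file under root."""
    corpus = {}
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():    # skip plain files in the root directory
            continue
        corpus[author_dir.name] = {
            path.name: path.read_text(encoding=encoding)
            for path in sorted(author_dir.glob("*.txt"))
        }
    return corpus

# Demo on a temporary directory tree (made-up names), not the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for author in ("author_a", "author_b"):
        author_dir = Path(tmp) / author
        author_dir.mkdir()
        (author_dir / "paper_1.txt").write_text("some text\n", encoding="utf-8")
    demo = load_corpus(tmp)
```

On the real data, load_corpus("./ml_dataset") should yield 20 keys with 20 texts each if the layout matches the description.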

Some statistics

Of the 400 papers, 100 have more than one author from the 20 listed. The number of listed authors per paper ranges from 1 to 3, with a mean of 1.2925. For 5-fold cross-validation using our CNN, $p=0.0828$.
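A mean of 1.2925 listed authors over 400 papers corresponds to 517 positive labels in total. Figures of this kind can be recomputed from the binary label rows; the following is a minimal sketch in which label_stats is a hypothetical helper and the toy rows are made up (3 authors, 4 papers), not taken from the dataset.

```python
def label_stats(label_rows):
    """Summarize how many listed authors each paper has, given rows of 0/1 labels."""
    counts = [sum(row) for row in label_rows]   # listed authors per paper
    return {
        "papers": len(counts),
        "multi_author": sum(1 for c in counts if c > 1),
        "min": min(counts),
        "max": max(counts),
        "mean": sum(counts) / len(counts),
    }

# Toy label matrix for illustration only.
toy_rows = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]
stats = label_stats(toy_rows)
```

Applied to the 400 rows of the real label file, the same function should reproduce the counts quoted above, provided the file matches the description.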

Download

The data is located in the ./ml_dataset directory of this repository. You can also obtain it as a tarball from:

https://drive.google.com/open?id=0B_LjdXSWGw1YR2dIek95bGFfZEE

Full data dump, including the PDFs and the extracted but unparsed and uncleaned text:

https://drive.google.com/file/d/0B_LjdXSWGw1YNDJ4ZUlBSk45ZWc/view?usp=sharing
