MLPA 400 Dataset

Description

A corpus of 400 publications by the 20 most-cited authors in Machine Learning, designed for multi-label authorship tasks.

Creation

Using Google Scholar as the source, we compiled a list of the top 20 authors in Machine Learning, ranked by number of citations.
For each author, 20 papers were downloaded, giving 400 publications in total.
The text was extracted from the PDF files using pdfminer~\cite{pdfminer} and pre-processed. The original data is also available. The title, authorship information, and bibliography were stripped from each paper so that a classifier must work under blind-review conditions rather than simply reading the author list. Formulas, tables, and figure information were retained, as they may carry valuable style and topic information.
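As an illustration of the stripping step, here is a minimal sketch of removing a bibliography from extracted text. It is not the cleaning code used to build this dataset: the heading pattern and the function name `strip_bibliography` are assumptions made for the example, and real papers would likely need more careful handling.

```python
import re

# Hypothetical heading pattern: a line containing only "References" or
# "Bibliography". The dataset's actual cleaning rules are not documented here.
_BIB_HEADING = re.compile(
    r"^[ \t]*(references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_bibliography(text: str) -> str:
    """Drop everything from the last bibliography heading onward."""
    matches = list(_BIB_HEADING.finditer(text))
    if not matches:
        # No heading found: leave the text untouched.
        return text
    # Use the last match so an early mention of "References" is not cut.
    return text[: matches[-1].start()].rstrip() + "\n"


sample = (
    "1 Introduction\n"
    "We study kernels.\n"
    "\n"
    "References\n"
    "[1] A. Author. Some paper.\n"
)
cleaned = strip_bibliography(sample)
print(cleaned)
```

Cutting at the last matching heading is a deliberate choice in this sketch, since the word "References" can also appear alone on a line earlier in a paper.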
Each paper is assigned 20 binary labels, one per author, indicating which of the listed authors contributed to it.
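The labelling scheme can be sketched as a multi-hot vector with one position per author. The roster below is hypothetical and shortened to four placeholder names for readability; the real dataset has 20 authors, and its on-disk label format may differ from this sketch.

```python
# Hypothetical roster of placeholder names; the real dataset lists 20 authors.
AUTHORS = ["author_a", "author_b", "author_c", "author_d"]


def encode_labels(paper_authors, roster=AUTHORS):
    """Return one binary label per roster entry (1 = author contributed)."""
    present = set(paper_authors)
    return [1 if name in present else 0 for name in roster]


def decode_labels(labels, roster=AUTHORS):
    """Recover the contributing author names from a binary label vector."""
    return [name for name, flag in zip(roster, labels) if flag]


vector = encode_labels(["author_c", "author_a"])
print(vector)                 # [1, 0, 1, 0]
print(decode_labels(vector))  # ['author_a', 'author_c']
```

Because a paper can have several of the listed authors, more than one position may be set to 1, which is what makes the task multi-label rather than multi-class.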

Some statistics

100 of the 400 papers have more than one author from the list of 20. The number of listed authors per paper ranges from 1 to 3, with a mean of 1.2925. For 5-fold cross-validation using our CNN, $p=0.0828$.
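The figures above also fix how many papers carry two versus three labels. The short calculation below is a sketch that assumes the reported mean counts only the 20 listed authors and that every paper has at least one of them; the variable names are chosen for this example.

```python
n_papers = 400       # size of the corpus
n_multi = 100        # papers with more than one listed author
mean_labels = 1.2925  # mean number of listed authors per paper

# Total number of positive labels across the corpus.
total_labels = round(mean_labels * n_papers)

# Papers with exactly one listed author contribute one label each.
n_single = n_papers - n_multi
labels_in_multi = total_labels - n_single

# Solve n_two + n_three = n_multi and 2*n_two + 3*n_three = labels_in_multi.
n_three = labels_in_multi - 2 * n_multi
n_two = n_multi - n_three

print(total_labels, n_single, n_two, n_three)
```

Under those assumptions the corpus holds 517 positive labels: 300 single-author papers, 83 papers with two listed authors, and 17 with three.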

Download

The data is located in the ./ml_dataset directory of this repository. You can also obtain it as a tarball from:

https://drive.google.com/open?id=0B_LjdXSWGw1YR2dIek95bGFfZEE

A full data dump, including the PDFs and the extracted but unparsed, uncleaned text, is available at:

https://drive.google.com/file/d/0B_LjdXSWGw1YNDJ4ZUlBSk45ZWc/view?usp=sharing
