MLPA 400 Dataset
A corpus of 400 publications by the 20 most cited authors in Machine Learning, designed for multi-label authorship tasks.
- Using Google Scholar as a source, we created a list of the top 20 authors in Machine Learning, ranked by number of citations.
- For each author, 20 papers were downloaded, for a total of 400 publications in the entire dataset.
- The text was extracted from the PDF files using pdfminer and pre-processed. The original, unprocessed data is also available. The papers were stripped of title, authorship information, and bibliography so that a classifier works under blind-review conditions rather than simply reading the author list. Formulas, tables, and figure information were retained, as they may contain valuable style and topic information.
- Each work is assigned 20 binary labels. The labels indicate which of the authors contributed to the paper's creation.
Labels.csv
contains the ground truths. Each row starts with a plain-text field, followed by 20 comma-separated fields: ,<author_1>,...,<author_20>\n
Each <author_n> is a digit, 0 or 1, indicating whether that author is one of the co-authors. The first row is the header row.
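As a minimal sketch of reading this format, the snippet below parses a Labels.csv-style file with Python's standard `csv` module. The function names `load_labels` and `mean_authors_per_paper` are hypothetical helpers, not part of this repository, and the code assumes UTF-8 encoding, plain comma separation without quoting, and that the leading plain-text field is unique per row.

```python
import csv


def load_labels(path):
    """Return (author_names, labels) from a Labels.csv-style file.

    `labels` maps the leading plain-text field of each row to a list
    of 0/1 integers, one per author column.
    Assumption: the file is UTF-8 and the leading field is unique.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)      # the first row is the header row
        authors = header[1:]       # skip the leading-field column
        labels = {}
        for row in reader:
            if not row:            # tolerate a trailing blank line
                continue
            labels[row[0]] = [int(value) for value in row[1:]]
    return authors, labels


def mean_authors_per_paper(labels):
    """Average number of positive labels per row."""
    return sum(sum(row) for row in labels.values()) / len(labels)
```

Calling `load_labels("Labels.csv")` and passing the resulting dictionary to `mean_authors_per_paper` gives the average number of listed authors per paper, which can be compared against the figure reported for this dataset.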
There are 20 directories named after the authors, each containing 20 papers in .txt format. The encoding is Unicode and line breaks are \n (Unix format).
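The directory layout described above can be walked with a few lines of Python. This is only a sketch: `iter_papers` is a hypothetical helper, the root path is whatever directory holds the 20 author folders, and UTF-8 is assumed as the concrete Unicode encoding, which the description does not specify.

```python
from pathlib import Path


def iter_papers(root):
    """Yield (author_directory_name, file_name, text) for every .txt paper.

    Assumption: `root` contains one sub-directory per author, and the
    files are encoded as UTF-8.
    """
    root = Path(root)
    author_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    for author_dir in author_dirs:
        for paper in sorted(author_dir.glob("*.txt")):
            text = paper.read_text(encoding="utf-8")
            yield author_dir.name, paper.name, text
```

For example, `sum(1 for _ in iter_papers("./ml_dataset"))` counts the papers found under the dataset directory.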
100 papers out of 400 have more than one author from the 20 listed, with the number of authors ranging from 1 to 3 and the average being 1.2925. For 5-fold cross-validation using our CNN,
The data is located in the ./ml_dataset
directory of this repository. You can also obtain it as a tarball from
https://drive.google.com/open?id=0B_LjdXSWGw1YR2dIek95bGFfZEE
Full data dump, including the PDFs and the extracted but unparsed and uncleaned text:
https://drive.google.com/file/d/0B_LjdXSWGw1YNDJ4ZUlBSk45ZWc/view?usp=sharing