This repository contains the code and the datasets (or dataset descriptions) used to obtain the results reported in:
- [1] Alon Kipnis, "Higher Criticism for Discriminating Word-Frequency Tables and Authorship Attribution", 2020
Content:
- AuthAttLib -- library to facilitate the use of the HC-based similarity measure in authorship attribution challenges. See the project at https://github.com/alonkipnis/AuthorshipAttribution for more details.
- AuthorshipChallenge -- data and code (IPython notebook) for using HC-based similarity in the "PAN 2018 Cross-domain authorship attribution" challenge.
- Federalists -- data and code (IPython notebook) for using HC to attribute authorship in the Federalist Papers.
- Gutenberg -- code for attributing authorship using HC in a collection of more than 11,000 titles from Project Gutenberg. This folder contains the list of titles and authors in this collection, code for downloading the titles, and the results of the attribution procedure obtained via several cluster computations.
- var_analysis -- code (R notebook) and data for conducting an analysis of the variation of words within a corpus and the degree to which they affect the HC calculation.
- compare_HC_types -- code (IPython notebook) for comparing two variants of HC.
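
For orientation, the Higher Criticism (HC) statistic that the folders above rely on can be sketched as follows. This is a minimal sketch of the standard form of HC computed from a list of p-values; it is not the code in AuthAttLib, and the function name `higher_criticism` and the `gamma` parameter (the fraction of smallest p-values searched) are assumptions made for illustration. How per-word p-values are obtained from two word-frequency tables is described in [1] and is not reproduced here.

```python
import math


def higher_criticism(pvalues, gamma=0.2):
    """Illustrative HC statistic of a collection of p-values (hypothetical helper)."""
    n = len(pvalues)
    if n < 2:
        raise ValueError("need at least two p-values")
    sorted_p = sorted(pvalues)
    # Search only the smallest gamma-fraction of p-values; stop before i == n,
    # where the normalizing term u * (1 - u) would be zero.
    limit = min(n - 1, max(1, int(gamma * n)))
    best = -math.inf
    for i in range(1, limit + 1):
        u = i / n  # uniform quantile that the i-th smallest p-value is compared to
        z = math.sqrt(n) * (u - sorted_p[i - 1]) / math.sqrt(u * (1 - u))
        best = max(best, z)
    return best


# Two very small p-values among ten yield a larger HC than ten unremarkable ones.
hc_sparse = higher_criticism([1e-6, 1e-6] + [0.5] * 8)
hc_null = higher_criticism([0.5] * 10)
```

A larger HC value indicates that a few p-values are much smaller than expected under the null, which is the kind of sparse departure the statistic is designed to detect.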