Skip to content

alonkipnis/AuthAttLib

Repository files navigation

HC-based test to discriminate word-frequency tables and attribute authorship.

Files:

  • AuthAttrib.py -- 2 models for authorship attribution: - AuthorshipAttributionMulti -- comparision of disputed text to each author - AuthorshipAttributionMultiBinary -- head to head comparison of each author against another
  • DocTermHC -- model for constructing large-sacle word-frequency table and HC testing against it.
  • HC_aux.py -- auxiliary functions to evaluate Higher Criticism tests

To use AuthorshipAttributionMulti and AuthorshipAttributionMultiBinary, arrange your datase in a pandas dataframe with columns author, doc_id, and text

  • author is the name of the class the document is assoicated with.
  • doc_id is a unique document identifyer.
  • text is a string representing the content of the document.

See AuthorshipAttribution_example.ipynb for a use case in authorship attribution challenges. Here is the Binder link: Binder

This code was used to get the results and figures reported in the paper:

Alon Kipnis, ``Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship'', 2019

About

Models for text classification and authorship attribution based on the Higher Criticism

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages