Higher Criticism (HC) based test to discriminate between word-frequency tables and attribute authorship.

Files:

AuthAttrib.py -- 2 models for authorship attribution:
- AuthorshipAttributionMulti -- comparison of the disputed text to each author
- AuthorshipAttributionMultiBinary -- head-to-head comparison of each author against another
DocTermHC -- model for constructing a large-scale word-frequency table and HC testing against it.
HC_aux.py -- auxiliary functions to evaluate Higher Criticism tests

To use AuthorshipAttributionMulti and AuthorshipAttributionMultiBinary, arrange your dataset in a pandas DataFrame with the columns author, doc_id, and text:

author is the name of the class the document is associated with.
doc_id is a unique document identifier.
text is a string representing the content of the document.
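
The expected layout can be sketched as follows. The author names, document identifiers, and texts are placeholders invented for this example, and only the three column names come from the description above. The sketch stops at building and checking the DataFrame; see the example notebook for how the models are actually constructed.

```python
import pandas as pd

# Placeholder records: one row per document.
records = [
    {"author": "Author A", "doc_id": "a-001", "text": "The quick brown fox jumps over the lazy dog."},
    {"author": "Author A", "doc_id": "a-002", "text": "A second document by the same author."},
    {"author": "Author B", "doc_id": "b-001", "text": "A document written by a different author."},
]

# Column order matches the description: author, doc_id, text.
data = pd.DataFrame(records, columns=["author", "doc_id", "text"])

# Basic sanity checks before passing the table to a model.
assert data["doc_id"].is_unique, "doc_id must identify each document uniquely"
assert data["text"].map(lambda t: isinstance(t, str)).all(), "text must be a string"
```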

See AuthorshipAttribution_example.ipynb for a use case in authorship attribution challenges. Here is the Binder link:

This code was used to get the results and figures reported in the paper:

Alon Kipnis, "Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship", 2019
