No description, website, or topics provided.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
classifiers
.gitignore
README.md
__init__.py
cleaners.py
data_handler.py
plotting.py
wrapper.py

README.md

Authorship Attribution Helper

Tools to make authorship attribution faster.

Dependencies

Libraries required:

numpy
sklearn
joblib
keras
matplotlib

How to Use

Currently available classifiers are SVM and CNN. To use SVM:

from classifiers import SVM
svm = SVM(C=1, analyzer="word", ngram_range=(1,2))

SVM takes 3 arguments.

  • C: The original C value to use.
  • analyzer: The type of the analyzer. Either "word" or "char".
  • ngram_range: The range of ngrams to use.

To use CNN:

from classifeirs import CNN
cnn = CNN(max_length=500, vector_size=150, minibatch_size=50, ngram=4, epochs=10) 

CNN takes 5 arguments.

  • max_length: How many input nodes.
  • vector_size: the embedding dimensions.
  • minibatch_size: The size of the minibatch to use.
  • ngram: Which ngram size to use. Currently can only take one number, not a range.
  • epochs: How many epochs to train.

After the classifer has been imported and created, the next step is to create a Handler to handle the rest.

from wrapper import Handler
attributer = Handler(train_data=train_data, test_data=test_data, positive_class=["Author 1"], split_size=500, split_type="char", data_type="book_split", classifier=svm, threads=10, iteration=-1)

Here the Handler takes in the data and multiple different arguments.

  • train_data: The train data. Needs to be a JSON dictionary, where the key is the name of the author, and the value is a dictionary of books / manuscripts from the author. e.g. train_data = {"auth1": {"book1": "book text"}}
  • test_data: The test data. Same format as train_data.
  • positive_class: List of authors to consider the positive class.
  • split_size: The size of splits to do, if any. -1 if no splits.
  • split_type: Whether to split by charcters or words.
  • data_type: How to do the splitting. "book": No splitting, one book is one training sample. "single": Books are split into parts and they are classified independent of each other. "book_split": Books are split into parts, but they are classified together.
  • classifier: The classifier to use. SVM or CNN.
  • threads: How many threads to use with SVM. CNN uses all by default.
  • iteration: Current iteration. If -1, do everything in one go, but otherwise only perform one classification. Useful if classifying in parallel.

Handler has multiple functions that can be called to perform different tasks.

attributer.optimize_C([2,4,6,8])

Optimizes the SVM C value using Leave-one-out Cross Validation. C values to test are give as argument.

attributer.cross_validate()

Performs Cross Validation to get values for the train data samples as well.

attributer.attribute.test_data()

Attributes the test data samples.

attributer.print_results(normalize=True)

Print results. Normalizing scales the values to be between 1 and -1, where 0 is the threshold.

attributer.plot_values(scale=True, title="Title")

Plots values. Scale scales the values. Title for the plot.

attributer.get_best_features()

Extract best features (Requires SVM).