Skip to content

avjves/AuthAttHelper

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Authorship Attribution Helper

Tools to make authorship attribution faster.

Dependencies

Libraries required:

numpy
sklearn
joblib
keras
matplotlib

How to Use

Currently available classifiers are SVM and CNN. To use SVM:

from classifiers import SVM
svm = SVM(C=1, analyzer="word", ngram_range=(1,2))

SVM takes 3 arguments.

  • C: The original C value to use.
  • analyzer: The type of the analyzer. Either "word" or "char".
  • ngram_range: The range of ngrams to use.

To use CNN:

from classifeirs import CNN
cnn = CNN(max_length=500, vector_size=150, minibatch_size=50, ngram=4, epochs=10) 

CNN takes 5 arguments.

  • max_length: How many input nodes.
  • vector_size: the embedding dimensions.
  • minibatch_size: The size of the minibatch to use.
  • ngram: Which ngram size to use. Currently can only take one number, not a range.
  • epochs: How many epochs to train.

After the classifer has been imported and created, the next step is to create a Handler to handle the rest.

from wrapper import Handler
attributer = Handler(train_data=train_data, test_data=test_data, positive_class=["Author 1"], split_size=500, split_type="char", data_type="book_split", classifier=svm, threads=10, iteration=-1)

Here the Handler takes in the data and multiple different arguments.

  • train_data: The train data. Needs to be a JSON dictionary, where the key is the name of the author, and the value is a dictionary of books / manuscripts from the author. e.g. train_data = {"auth1": {"book1": "book text"}}
  • test_data: The test data. Same format as train_data.
  • positive_class: List of authors to consider the positive class.
  • split_size: The size of splits to do, if any. -1 if no splits.
  • split_type: Whether to split by charcters or words.
  • data_type: How to do the splitting. "book": No splitting, one book is one training sample. "single": Books are split into parts and they are classified independent of each other. "book_split": Books are split into parts, but they are classified together.
  • classifier: The classifier to use. SVM or CNN.
  • threads: How many threads to use with SVM. CNN uses all by default.
  • iteration: Current iteration. If -1, do everything in one go, but otherwise only perform one classification. Useful if classifying in parallel.

Handler has multiple functions that can be called to perform different tasks.

attributer.optimize_C([2,4,6,8])

Optimizes the SVM C value using Leave-one-out Cross Validation. C values to test are give as argument.

attributer.cross_validate()

Performs Cross Validation to get values for the train data samples as well.

attributer.attribute.test_data()

Attributes the test data samples.

attributer.print_results(normalize=True)

Print results. Normalizing scales the values to be between 1 and -1, where 0 is the threshold.

attributer.plot_values(scale=True, title="Title")

Plots values. Scale scales the values. Title for the plot.

attributer.get_best_features()

Extract best features (Requires SVM).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages