# Ružička: Authorship Verification in Python

In this notebook, we offer a quick tutorial on how you could use the code in this repository. While the package is very much geared towards our own work in authorship verification, you might find some of the more general functions useful. All feedback and comments are welcome. This code assumes Python 2.7+ (Python 3 has not been tested). You do not need to install the library to run the code below, but please note that it depends on a number of well-known third-party Python libraries, including:
+ numpy
+ scipy
+ scikit-learn
+ matplotlib
+ seaborn
+ numba

and preferably (for GPU acceleration and/or JIT-compilation):
+ theano
+ numbapro

We recommend installing Continuum's excellent [Anaconda Python framework](https://www.continuum.io/downloads), which comes bundled with most of these dependencies.


## Walk through

By default, we assume that your data sets are stored in the format of the PAN 2014 track on authorship attribution: a directory should minimally include one folder per verification problem (containing an `unknown.txt` and at least one known document, e.g. `known01.txt`) and a `truth.txt`. E.g. for the corpus of Dutch essays (`../data/2014/du_essays/train`), `truth.txt` has a tab-separated line with the ground truth for each problem:

```
DE001 Y
DE002 Y
DE003 N
DE004 N
DE005 N
DE006 N
DE007 N
DE008 Y
...
```
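
To make the file format concrete, here is a minimal sketch of how such a `truth.txt` could be read into a dictionary. The helper name `parse_truth` is hypothetical and is not part of the repository (the notebook itself relies on `load_ground_truth` further down); the sketch only assumes one problem ID and one `Y`/`N` answer per line, separated by whitespace.

```python
def parse_truth(text):
    """Map each problem ID to True ('Y', same author) or False ('N').

    Hypothetical helper for illustration only, not the repository's loader.
    """
    truth = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        problem_id, answer = line.split()  # splits on tabs or spaces
        truth[problem_id] = (answer == "Y")
    return truth


sample = "DE001\tY\nDE002\tY\nDE003\tN\n"
truth = parse_truth(sample)
print(truth)
```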

To inspect the problems:

In [1]:
! ls ../data/2014/du_essays/train

DE001         DE021         DE041         DE061         DE081
DE002         DE022         DE042         DE062         DE082
DE003         DE023         DE043         DE063         DE083
DE004         DE024         DE044         DE064         DE084
DE005         DE025         DE045         DE065         DE085
DE006         DE026         DE046         DE066         DE086
DE007         DE027         DE047         DE067         DE087
DE008         DE028         DE048         DE068         DE088
DE009         DE029         DE049

Let us now load the set of development problems for the Dutch essays:

In [2]:
from ruzicka.utilities import *

D = "../data/2014/du_essays/"
dev_train_data, dev_test_data = load_pan_dataset(D + "train")

This function loads all documents and splits the development data into a training part (the known documents) and a testing part (the unknown documents). We can unpack these as follows:

In [3]:
dev_train_labels, dev_train_documents = zip(*dev_train_data)
dev_test_labels, dev_test_documents = zip(*dev_test_data)

Let us have a look at the actual test texts:

In [4]:
from __future__ import print_function

for doc in dev_test_documents[:10]:
    print("+ ", doc[:70])

+  ﻿Dankzij het internet zijn we een grote bron aan informatie rijker . A
+  ﻿Het is dus begrijpelijk dat de commerciële zenders meer reclame moete
+  ﻿" Hey , vuile nicht ! Hangt er nog stront aan je lul ? " . Dergelijke
+  ﻿Gelijkheid tussen man en vrouw is iets dat ons al eeuwen in de ban ho
+  ﻿Gisteren was er opnieuw een protest tegen homofilie in de grootstad P
+  ﻿Voetbal is vandaag de dag zonder twijfel de populairste sport in Belg
+  ﻿Door de ongekende groei van nieuwsbronnen en de opkomst van het inter
+  ﻿Woordenboekgebruik uit interesse De categorie woordenboekgebruikers d
+  ﻿Ze bouwden een tegencultuur op die alles verwierp waar hun ouders alt
+  ﻿Als we hier in België op straat rondlopen , merken we dat er zeer vee


For each of these documents we need to decide whether or not they were in fact written by the target authors proposed:

In [5]:
for label in dev_test_labels[:10]:
    print("+ ", label)

+  DE001
+  DE002
+  DE003
+  DE004
+  DE005
+  DE006
+  DE007
+  DE008
+  DE009
+  DE010


The first and crucial step is to vectorize the documents using a vector space model. Below, we use a generic example, based on the 10,000 most common word unigrams and a plain *tf* model:

In [6]:
from ruzicka.vectorization import Vectorizer

vectorizer = Vectorizer(mfi=10000, vector_space="tf", ngram_type="word", ngram_size=1)

dev_train_X = vectorizer.fit_transform(dev_train_documents)
dev_test_X = vectorizer.transform(dev_test_documents)



In [7]:
dev_test_X.__class__

numpy.ndarray

Note that we use `sklearn` conventions here: we fit the vectorizer only on the vocabulary of the known documents and apply it later to the unknown documents (since in real life too, we will not necessarily know the unknown documents in advance). This gives us two compatible corpus matrices:

In [8]:
print(dev_train_X.shape)
print(dev_test_X.shape)

(172, 9347)
(96, 9347)
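
The fit/transform split can be illustrated with a small standard-library sketch. It is not the repository's `Vectorizer`: the function names are hypothetical, tokens are split on whitespace, and "plain tf" is taken to mean relative frequency, which is an assumption. Because the vocabulary is capped at the `mfi` most frequent items but cannot exceed the number of distinct items in the known documents, a matrix may end up with fewer columns than `mfi`.

```python
from collections import Counter


def fit_vocabulary(documents, mfi):
    """Return the `mfi` most frequent tokens of the known documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:mfi]]


def transform(documents, vocabulary):
    """Represent each document as relative frequencies over `vocabulary`."""
    rows = []
    for doc in documents:
        tokens = doc.lower().split()
        counts = Counter(tokens)
        total = float(len(tokens)) or 1.0  # avoid division by zero
        rows.append([counts[token] / total for token in vocabulary])
    return rows


known = ["the cat sat on the mat", "the dog sat"]
unknown = ["the bird sat on the roof"]

vocab = fit_vocabulary(known, mfi=3)   # fitted on known documents only
train_rows = transform(known, vocab)
test_rows = transform(unknown, vocab)  # unseen words such as 'bird' are ignored
print(vocab)
```

Both matrices share the same columns, which is what makes the known and unknown documents comparable.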


We now encode the author labels in the development problem sets as integers, using sklearn's convenient `LabelEncoder`:

In [9]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(dev_train_labels + dev_test_labels)
dev_train_y = np.array(label_encoder.transform(dev_train_labels))
dev_test_y = np.array(label_encoder.transform(dev_test_labels))
print(dev_test_y)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95]
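
Since the labels here are problem IDs, every unknown document receives the integer of the problem it belongs to, which is why the array above simply counts from 0 to 95. The idea behind the encoding can be sketched without sklearn; `fit_labels` is a hypothetical helper that mirrors `LabelEncoder`'s behaviour of numbering the sorted unique labels.

```python
def fit_labels(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


known_labels = ["DE002", "DE001", "DE001", "DE003"]  # several known docs per problem
unknown_labels = ["DE003", "DE001"]

mapping = fit_labels(known_labels + unknown_labels)
encoded_known = [mapping[label] for label in known_labels]
encoded_unknown = [mapping[label] for label in unknown_labels]
print(encoded_known, encoded_unknown)
```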


We now construct and fit an 'O2' verifier: this extrinsic verification technique is based on the General Imposters framework. We apply it with the minmax metric and a profile base, meaning that the known documents for each author will be represented as a mean centroid:

In [10]:
from ruzicka.Order2Verifier import Order2Verifier

dev_verifier = Order2Verifier(
    metric="minmax", base="profile", nb_bootstrap_iter=100, rnd_prop=0.5
)
dev_verifier.fit(dev_train_X, dev_train_y)
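
For readers unfamiliar with the two ingredients named above, the following sketch shows the minmax (Ružička) distance and a mean centroid on plain Python lists. The function names are hypothetical and the code is only an illustration of the definitions, not the repository's (accelerated) implementation; it assumes non-negative feature values.

```python
def minmax_distance(a, b):
    """Ruzicka distance: 1 - sum(min(a_i, b_i)) / sum(max(a_i, b_i))."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 0.0  # two all-zero vectors are treated as identical
    return 1.0 - float(numerator) / denominator


def mean_centroid(vectors):
    """Average the known documents of one author into a single profile."""
    n = float(len(vectors))
    return [sum(column) / n for column in zip(*vectors)]


known_docs = [[0.2, 0.0, 0.4], [0.4, 0.2, 0.0]]
profile = mean_centroid(known_docs)
distance = minmax_distance([0.3, 0.1, 0.2], profile)
print(profile, distance)
```

Identical vectors are at distance 0 and vectors without any overlap are at distance 1.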

We can now obtain the probability which this O2 verifier would assign to each combination of an unknown document and the target author suggested in the problem:

In [11]:
dev_test_scores = dev_verifier.predict_proba(
    test_X=dev_test_X, test_y=dev_test_y, nb_imposters=30
)

	 - # test documents processed: 10 out of 96
	 - # test documents processed: 20 out of 96
	 - # test documents processed: 30 out of 96
	 - # test documents processed: 40 out of 96
	 - # test documents processed: 50 out of 96
	 - # test documents processed: 60 out of 96
	 - # test documents processed: 70 out of 96
	 - # test documents processed: 80 out of 96
	 - # test documents processed: 90 out of 96


This gives us an array of probability scores, one for each problem, corresponding to the proportion of iterations in which the target author's profile was closer to the anonymous document than the imposters were:

In [12]:
print(dev_test_scores)

[0.79 0.69 0.01 0.   0.08 0.04 0.02 1.   0.99 0.69 0.51 0.28 0.88 0.97
 0.01 0.58 0.62 0.   0.01 0.34 0.   0.   0.59 0.43 0.   0.76 0.03 0.53
 0.03 0.01 0.01 0.   0.28 0.   0.   0.26 0.   0.04 0.04 0.   0.12 0.81
 0.   0.74 0.99 0.   0.34 0.03 0.   0.2  0.02 0.45 0.07 0.47 0.65 0.
 0.01 0.71 0.02 0.69 0.98 0.65 0.92 0.46 0.54 0.98 0.81 0.12 0.5  0.91
 0.28 0.01 0.01 0.4  0.07 0.15 0.24 0.   0.97 0.   0.89 0.26 0.   0.
 0.99 0.   1.   0.62 0.   0.   0.   0.6  0.   0.   0.45 0.71]
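
The bootstrap behind these scores can be sketched as follows. This is a simplified reading of the General Imposters idea under stated assumptions, not the code of `Order2Verifier`: the function name is hypothetical, a "hit" is counted only when the target is strictly closer than every sampled imposter, and the parameters merely echo `nb_bootstrap_iter`, `rnd_prop` and `nb_imposters` from the cells above.

```python
import random


def minmax_distance(a, b):
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return 0.0 if denominator == 0 else 1.0 - float(numerator) / denominator


def subset(vector, indices):
    return [vector[i] for i in indices]


def imposters_score(unknown, target, imposters,
                    nb_iter=100, rnd_prop=0.5, nb_imposters=30, seed=0):
    """Proportion of iterations in which the target beats all sampled imposters."""
    rng = random.Random(seed)
    n_features = len(unknown)
    k = max(1, int(n_features * rnd_prop))
    hits = 0
    for _ in range(nb_iter):
        features = rng.sample(range(n_features), k)  # random feature subset
        pool = rng.sample(imposters, min(nb_imposters, len(imposters)))
        u = subset(unknown, features)
        d_target = minmax_distance(u, subset(target, features))
        d_imposter = min(minmax_distance(u, subset(imp, features)) for imp in pool)
        if d_target < d_imposter:
            hits += 1
    return hits / float(nb_iter)


target = [0.5, 0.3, 0.2, 0.0]
imposters = [[0.0, 0.0, 0.1, 0.9], [0.1, 0.0, 0.0, 0.9]]

score_same = imposters_score(target, target, imposters)
score_diff = imposters_score(imposters[0], target, imposters)
print(score_same, score_diff)
```

With 100 iterations, a score such as 0.79 would mean that the target won in 79 of the 100 random trials.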


Let us now load the ground truth to check how well we did:

In [13]:
import os

dev_gt_scores = load_ground_truth(
    filepath=os.sep.join((D, "train", "truth.txt")), labels=dev_test_labels
)
print(dev_gt_scores)

[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]


There is one final step needed: the PAN evaluation measures allow systems to leave a number of difficult problems unanswered, by setting the probability at exactly 0.5. To account for this strict threshold, we fit a score shifter, which will attempt to rectify mid-range scores to 0.5. We can tune these parameters as follows:

In [14]:
from ruzicka.score_shifting import ScoreShifter

shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores, ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)

p1 for optimal combo: 0.08000000000000002
p2 for optimal combo: 0.29000000000000004
AUC for optimal combo: 0.9461805555555556
c@1 for optimal combo: 0.931640625


As you can see, this shifter optimizes 2 parameters using a grid search: all values in between *p1* and *p2* will be rectified to 0.5:

In [15]:
print(dev_test_scores)

[0.8509, 0.7799, 0.0007999999999999996, 0.0, 0.006399999999999997, 0.0031999999999999984, 0.0015999999999999992, 1.0, 0.9929, 0.7799, 0.6520999999999999, 0.5, 0.9148000000000001, 0.9787, 0.0007999999999999996, 0.7018, 0.7302, 0.0, 0.0007999999999999996, 0.5314, 0.0, 0.0, 0.7088999999999999, 0.5952999999999999, 0.0, 0.8295999999999999, 0.002399999999999999, 0.6662999999999999, 0.002399999999999999, 0.0007999999999999996, 0.0007999999999999996, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0031999999999999984, 0.0031999999999999984, 0.0, 0.5, 0.8651, 0.0, 0.8154, 0.9929, 0.0, 0.5314, 0.002399999999999999, 0.0, 0.5, 0.0015999999999999992, 0.6094999999999999, 0.005599999999999998, 0.6236999999999999, 0.7515, 0.0, 0.0007999999999999996, 0.7941, 0.0015999999999999992, 0.7799, 0.9858, 0.7515, 0.9431999999999999, 0.6166, 0.6734, 0.9858, 0.8651, 0.5, 0.645, 0.9361, 0.5, 0.0007999999999999996, 0.0007999999999999996, 0.574, 0.005599999999999998, 0.5, 0.5, 0.0, 0.9787, 0.0, 0.9219, 0.5, 0.0, 0.0, 0.9929, 0.0, 1
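
The printed values are consistent with a simple piecewise mapping, sketched below. The function `shift_score` is hypothetical and was inferred from the numbers shown here rather than taken from `ScoreShifter`; it also assumes that the raw scores already span the range 0 to 1.

```python
def shift_score(score, p1, p2):
    """Piecewise rescaling inferred from the printed output (an assumption)."""
    if score <= p1:
        return score * p1               # squeeze low scores into [0, p1]
    if score >= p2:
        return p2 + score * (1.0 - p2)  # squeeze high scores into [p2, 1]
    return 0.5                          # mid-range: leave the problem unanswered


raw = [0.79, 0.28, 0.01]
shifted = [shift_score(s, p1=0.08, p2=0.29) for s in raw]
print(shifted)
```

For instance, the raw score 0.79 becomes 0.29 + 0.79 × 0.71 = 0.8509, while 0.28 falls between *p1* and *p2* and is set to 0.5.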

We can later apply this optimized score shifter to the test problems. Now the main question: how well would our O2 verifier perform on the development problems, given the optimal *p1* and *p2* found? We answer this question using the three evaluation measures used in the PAN competition.

In [16]:
from ruzicka.evaluation import pan_metrics

dev_acc_score, dev_auc_score, dev_c_at_1_score = pan_metrics(
    prediction_scores=dev_test_scores, ground_truth_scores=dev_gt_scores
)
print("Accuracy: ", dev_acc_score)
print("AUC: ", dev_auc_score)
print("c@1: ", dev_c_at_1_score)
print("AUC x c@1: ", dev_auc_score * dev_c_at_1_score)

Accuracy:  0.875
AUC:  0.9461805555555556
c@1:  0.931640625
AUC x c@1:  0.881500244140625


Our score shifting approach clearly pays off, since we are able to leave difficult problems unanswered, yielding a higher c@1 than pure accuracy. We can now proceed to the test problems. The following code block runs entirely parallel to the approach above, except that the score shifter isn't retrained:

In [17]:
train_data, test_data = load_pan_dataset(D + "test")
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)

# vectorize:
vectorizer = Vectorizer(mfi=10000, vector_space="tf", ngram_type="word", ngram_size=1)
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)

# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels + test_labels)
train_y = np.array(label_encoder.transform(train_labels), dtype="int")
test_y = np.array(label_encoder.transform(test_labels), dtype="int")

# fit and predict a verifier on the test data:
test_verifier = Order2Verifier(
    metric="minmax", base="profile", nb_bootstrap_iter=100, rnd_prop=0.5
)
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(
    test_X=test_X, test_y=np.array(test_y), nb_imposters=30
)

# load the ground truth:
test_gt_scores = load_ground_truth(
    filepath=os.sep.join((D, "test", "truth.txt")), labels=test_labels
)

# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)

test_acc_score, test_auc_score, test_c_at_1_score = pan_metrics(
    prediction_scores=test_scores, ground_truth_scores=test_gt_scores
)

print("Accuracy: ", test_acc_score)
print("AUC: ", test_auc_score)
print("c@1: ", test_c_at_1_score)
print("AUC x c@1: ", test_auc_score * test_c_at_1_score)



	 - # test documents processed: 10 out of 96
	 - # test documents processed: 20 out of 96
	 - # test documents processed: 30 out of 96
	 - # test documents processed: 40 out of 96
	 - # test documents processed: 50 out of 96
	 - # test documents processed: 60 out of 96
	 - # test documents processed: 70 out of 96
	 - # test documents processed: 80 out of 96
	 - # test documents processed: 90 out of 96
Accuracy:  0.9166666666666666
AUC:  0.9696180555555556
c@1:  0.931640625
AUC x c@1:  0.9033355712890625


Our final test results are on par with, and for accuracy and AUC even slightly higher than, the development results, so the verifier seems to scale reasonably well to the unseen verification problems in the test set.

## First Order Verification

It is interesting now to compare the GI approach to a first-order verification system, which often yields very competitive results too. Our implementation closely resembles the system proposed by Potha and Stamatatos in 2014 (A Profile-based Method for Authorship Verification). We import and fit this O1 verifier:

In [18]:
from ruzicka.Order1Verifier import Order1Verifier

dev_verifier = Order1Verifier(metric="minmax", base="profile")
dev_verifier.fit(dev_train_X, dev_train_y)
dev_test_scores = dev_verifier.predict_proba(test_X=dev_test_X, test_y=dev_test_y)
print(dev_test_scores)

[0.70170134 0.7270711  0.37633175 0.28879446 0.49040943 0.42713267
 0.38902158 0.8981929  0.8610844  0.634507   0.5582054  0.5116984
 0.78004104 0.7290129  0.23379725 0.6144281  0.30090123 0.15461487
 0.37763566 0.50886804 0.3619166  0.20511049 0.56136984 0.43511206
 0.3470927  0.47622222 0.2909667  0.42521244 0.33016068 0.28262693
 0.29360873 0.2508412  0.40444332 0.22095674 0.27538615 0.49316198
 0.1846866  0.4193539  0.1801241  0.3017376  0.4689892  0.6709575
 0.34283203 0.6015002  0.8864545  0.10788292 0.4630553  0.39002222
 0.2620769  0.2199009  0.3541656  0.3327778  0.38411123 0.23165005
 0.6096738  0.17532307 0.44518214 0.6944714  0.37893575 0.62718385
 0.5080187  0.6302462  0.8078237  0.581901   0.53879637 0.72632724
 0.61635345 0.36859185 0.6126836  0.76257414 0.11920172 0.
 0.01431888 0.4648263  0.20724541 0.05414206 0.32293087 0.27691644
 0.78255326 0.253038   0.49909264 0.4017859  0.24249572 0.2609076
 0.85132855 0.23575276 1.         0.62689704 0.22979039 0.10448259
 0.266

Note that in this case, the 'probabilities' returned are only distance-based pseudo-probabilities and don't lie in the range of 0-1. Applying the score shifter is therefore essential with O1, since it will scale the distances to a more useful range:

In [19]:
shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores, ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)
print(dev_test_scores)

p1 for optimal combo: 0.4000000000000001
p2 for optimal combo: 0.4700000000000001
AUC for optimal combo: 0.904513888888889
c@1 for optimal combo: 0.8741319444444444
[0.8419017118215562, 0.8553476864099503, 0.15053269863128665, 0.11551778316497804, 0.7299169999361039, 0.5, 0.15560863018035892, 0.9460422277450562, 0.9263747328519821, 0.8062887102365495, 0.7658488756418229, 0.7412001651525497, 0.8834217506647111, 0.8563768404722214, 0.09351890087127687, 0.7956468945741654, 0.12036049365997317, 0.06184594631195069, 0.15105426311492923, 0.7397000604867936, 0.144766640663147, 0.08204419612884523, 0.7675260132551194, 0.5, 0.1388370752334595, 0.7223977750539781, 0.11638667583465578, 0.5, 0.13206427097320558, 0.11305077075958254, 0.11744349002838136, 0.10033648014068605, 0.5, 0.08838269710540773, 0.11015446186065676, 0.7313758474588394, 0.07387464046478273, 0.5, 0.0720496416091919, 0.12069504261016847, 0.5, 0.8256074780225754, 0.1371328115463257, 0.7887951129674912, 0.9398208969831467, 0.043153

And again, we are now ready to test the performance of O1 on the test problems.

In [20]:
train_data, test_data = load_pan_dataset(D + "test")
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)

# vectorize:
vectorizer = Vectorizer(mfi=10000, vector_space="tf", ngram_type="word", ngram_size=1)
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)

# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels + test_labels)
train_y = np.array(label_encoder.transform(train_labels), dtype="int")
test_y = np.array(label_encoder.transform(test_labels), dtype="int")

# fit and predict a verifier on the test data:
test_verifier = Order1Verifier(metric="minmax", base="profile")
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(test_X=test_X, test_y=test_y)

# load the ground truth:
test_gt_scores = load_ground_truth(
    filepath=os.sep.join((D, "test", "truth.txt")), labels=test_labels
)

# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)

test_acc_score, test_auc_score, test_c_at_1_score = pan_metrics(
    prediction_scores=test_scores, ground_truth_scores=test_gt_scores
)

print("Accuracy: ", test_acc_score)
print("AUC: ", test_auc_score)
print("c@1: ", test_c_at_1_score)
print("AUC x c@1: ", test_auc_score * test_c_at_1_score)



Accuracy:  0.8125
AUC:  0.8899739583333334
c@1:  0.830078125
AUC x c@1:  0.7387479146321615
