# Ružička: Authorship Verification in Python

In this notebook, we offer a quick tutorial on how you could use the code in this repository. While the package is very much geared towards our own work in authorship verification, you might find some of the more general functions useful. All feedback and comments are welcome. The package was written for Python 2.7+, but the cells in this notebook use f-strings and therefore require Python 3.6 or later. You do not need to install the library to run the code below, but please note that it depends on a number of well-known third-party Python libraries, including:
+ numpy
+ scipy
+ scikit-learn
+ matplotlib
+ seaborn
+ numba

and preferably (for GPU acceleration and/or JIT-compilation):
+ theano
+ numbapro

We recommend installing Continuum's excellent [Anaconda Python framework](https://www.continuum.io/downloads), which comes bundled with most of these dependencies.


In [1]:
import logging

logging.basicConfig(level="INFO")

In [2]:
from ruzicka.test_metrics import nini
import numpy as np
from scipy.stats import pearsonr
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
a = np.array([1, 1, 0, 1, 1, 0, 0])
b = np.array([1, 0, 0, 1, 1, 1, 0])

In [4]:
%%timeit

nini(a, b)

195 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [5]:
import ruzicka.test_metrics

ruzicka.test_metrics.manhattan

CPUDispatcher(<function manhattan at 0x10a040790>)

In [6]:
import timeit

setup = """
import ruzicka.test_metrics
import numpy as np

a = np.array([1, 1, 0, 1, 1, 0, 0]*100)
b = np.array([1, 0, 0, 1, 1, 1, 0]*100)
"""
CPU_METRICS = ["manhattan", "euclidean", "minmax", "cosine", "nini"]
for m in CPU_METRICS:
    s = f"ruzicka.test_metrics.{m}(a,b)"
    print(timeit.timeit(s, setup=setup))

0.9011635
0.8937567080000006
0.8986360000000007
0.9022851250000006
1.0776017500000012


In [7]:
1.0 - pearsonr(a, b)[0]

0.5833333333333333
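
The package takes its name from the Ružička (or minmax) distance, one of the metrics timed above. As a point of reference, here is a minimal pure-Python sketch of the textbook definition. The name `ruzicka_distance` is illustrative and not a function of this repository; the compiled `minmax` in `ruzicka.test_metrics` may differ in details such as how all-zero vectors are handled.

```python
def ruzicka_distance(a, b):
    """Textbook minmax distance: 1 - sum(min(a_i, b_i)) / sum(max(a_i, b_i))."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        # Assumption for this sketch: two all-zero vectors count as identical.
        return 0.0
    return 1.0 - numerator / denominator


# Same toy vectors as in the cells above, as plain lists.
a = [1, 1, 0, 1, 1, 0, 0]
b = [1, 0, 0, 1, 1, 1, 0]
print(ruzicka_distance(a, b))
```

For these two vectors the element-wise minima sum to 3 and the maxima to 5, so the distance is 1 - 3/5.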

## Walk through

By default, we assume that your data sets are stored in the format of the PAN 2014 track on authorship attribution: a directory should minimally include one folder per verification problem (containing an `unknown.txt` and at least one known document such as `known01.txt`) and a `truth.txt`. E.g. for the corpus of Dutch essays (`../data/2014/du_essays/train`), `truth.txt` contains a tab-separated line with the ground truth for each problem:

```
DE001 Y
DE002 Y
DE003 N
DE004 N
DE005 N
DE006 N
DE007 N
DE008 Y
...
```
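
To make the file layout concrete, the following sketch parses text in this format into a dictionary. It is only an illustration: `parse_truth` is a hypothetical helper, not the repository's `load_ground_truth`, and the mapping of `Y`/`N` to `1.0`/`0.0` is an assumption that mirrors the floats printed further down in this notebook.

```python
def parse_truth(text):
    """Map each problem label to 1.0 (same author, 'Y') or 0.0 ('N')."""
    truth = {}
    for line in text.splitlines():
        parts = line.split()  # splits on tabs as well as spaces
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, answer = parts
        truth[label] = 1.0 if answer == "Y" else 0.0
    return truth


example = "DE001\tY\nDE002\tY\nDE003\tN\n"
print(parse_truth(example))
```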

To inspect the problems:

In [8]:
! ls ../data/2014/du_essays/train

DE001         DE021         DE041         DE061         DE081
DE002         DE022         DE042         DE062         DE082
DE003         DE023         DE043         DE063         DE083
DE004         DE024         DE044         DE064         DE084
DE005         DE025         DE045         DE065         DE085
DE006         DE026         DE046         DE066         DE086
DE007         DE027         DE047         DE067         DE087
DE008         DE0

Let us now load the set of development problems for the Dutch essays:

In [9]:
from ruzicka.utilities import *

D = "../data/2014/du_essays/"
dev_train_data, dev_test_data = load_pan_dataset(D + "train")

This function loads all documents and splits the development data into a training part (the known documents) and a testing part (the unknown documents). We can unpack these as follows:

In [10]:
dev_train_labels, dev_train_documents = zip(*dev_train_data)
dev_test_labels, dev_test_documents = zip(*dev_test_data)
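
The `zip(*...)` idiom used here turns a sequence of pairs into two parallel tuples. A tiny stand-alone illustration, assuming (as the unpacking suggests) that each item is a `(label, text)` pair; the toy data below is invented:

```python
# Toy (label, text) pairs standing in for the loaded data.
pairs = [("DE001", "first known text"), ("DE002", "second known text")]

# zip(*pairs) transposes the list of pairs into two tuples.
labels, documents = zip(*pairs)
print(labels)
print(documents)
```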

Let us have a look at the actual test texts:

In [11]:
for doc in dev_test_documents[:10]:
    print("+ ", doc[:70])

+  ﻿Dankzij het internet zijn we een grote bron aan informatie rijker . A
+  ﻿Het is dus begrijpelijk dat de commerciële zenders meer reclame moete
+  ﻿" Hey , vuile nicht ! Hangt er nog stront aan je lul ? " . Dergelijke
+  ﻿Gelijkheid tussen man en vrouw is iets dat ons al eeuwen in de ban ho
+  ﻿Gisteren was er opnieuw een protest tegen homofilie in de grootstad P
+  ﻿Voetbal is vandaag de dag zonder twijfel de populairste sport in Belg
+  ﻿Door de ongekende groei van nieuwsbronnen en de opkomst van het inter
+  ﻿Woordenboekgebruik uit interesse De categorie woordenboekgebruikers d
+  ﻿Ze bouwden een tegencultuur op die alles verwierp waar hun ouders alt
+  ﻿Als we hier in België op straat rondlopen , merken we dat er zeer vee


For each of these documents we need to decide whether or not they were in fact written by the target authors proposed:

In [12]:
for doc in dev_test_labels[:10]:
    print("+ ", doc[:70])

+  DE001
+  DE002
+  DE003
+  DE004
+  DE005
+  DE006
+  DE007
+  DE008
+  DE009
+  DE010


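The first and crucial step is to vectorize the documents using a vector space model. Below, we use a generic example: the 10,000 most frequent character 9-grams, weighted with a sublinear *tf* model (no *idf*) and L2-normalised. The commented-out lines show the equivalent call to the repository's own `Vectorizer`, configured for the 10,000 most common word unigrams and a plain *tf* model: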

In [13]:
# from ruzicka.vectorization import Vectorizer
vectorizer = make_pipeline(
    TfidfVectorizer(
        sublinear_tf=True,
        use_idf=False,
        norm="l2",
        analyzer="char",
        ngram_range=(9, 9),
        max_features=10000,
    ),
    FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
)
# vectorizer = Vectorizer(mfi=10000, vector_space="tf", ngram_type="word", ngram_size=1)

dev_train_X = vectorizer.fit_transform(dev_train_documents)
dev_test_X = vectorizer.transform(dev_test_documents)

In [14]:
dev_test_X.__class__

numpy.matrix

Note that we use `sklearn` conventions here: we fit the vectorizer only on the vocabulary of the known documents and apply it later to the unknown documents (since in real life too, we will not necessarily know the unknown documents in advance). This gives us two compatible corpus matrices:

In [15]:
print(dev_train_X.shape)
print(dev_test_X.shape)

(172, 10000)
(96, 10000)
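
The fit-on-known, transform-on-unknown convention can be illustrated without any third-party library. The sketch below is a deliberately simplified word-count vectorizer, not the `TfidfVectorizer` pipeline used above; `fit_vocabulary` and `transform` are hypothetical helper names. The point is only that the vocabulary is fixed by the known documents, so both matrices share the same columns and unseen words in the unknown documents are ignored.

```python
from collections import Counter


def fit_vocabulary(known_docs, max_features):
    """Pick the most frequent tokens of the known documents as columns."""
    counts = Counter(tok for doc in known_docs for tok in doc.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:max_features]]


def transform(docs, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[tok] for tok in vocabulary])
    return rows


known = ["a b a", "b c"]
unknown = ["a d d b b"]

vocabulary = fit_vocabulary(known, max_features=2)
known_X = transform(known, vocabulary)
unknown_X = transform(unknown, vocabulary)
print(vocabulary, known_X, unknown_X)
```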


We now encode the author labels in the development problem sets as integers, using sklearn's convenient `LabelEncoder`:

In [16]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(dev_train_labels + dev_test_labels)
dev_train_y = np.array(label_encoder.transform(dev_train_labels))
dev_test_y = np.array(label_encoder.transform(dev_test_labels))
print(dev_test_y)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95]
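
`LabelEncoder` numbers the distinct labels in sorted order, which is why the problems above map to 0 through 95. A plain-Python sketch of the same idea follows; `encode_labels` is an illustrative name and the labels are toy values.

```python
def encode_labels(labels):
    """Map each label to its index among the sorted distinct labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


codes, classes = encode_labels(["DE003", "DE001", "DE003", "DE002"])
print(codes)
print(classes)
```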


We now construct and fit an 'O2' verifier: this extrinsic verification technique is based on the General Imposters framework. We apply it with the `nini` metric and a profile base, meaning that the known documents for each author will be represented as a mean centroid:

In [17]:
from ruzicka.Order2Verifier import Order2Verifier

dev_verifier = Order2Verifier(
    metric="nini", base="profile", rank=True, nb_bootstrap_iter=100, rnd_prop=0.35
)
dev_verifier.fit(dev_train_X, dev_train_y)
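
The General Imposters idea itself can be sketched in a few lines of plain Python. This is a simplified illustration and not the code of `Order2Verifier`: it uses a hard "closer than every sampled imposter" vote rather than ranking, it uses the minmax distance, and the names `imposters_score`, `feature_prop` and `n_imposters` are made up for the example.

```python
import random


def minmax(a, b):
    """Minmax (Ruzicka) distance between two non-negative vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 - num / den if den else 0.0


def imposters_score(unknown, target, imposters, n_iter=100,
                    feature_prop=0.5, n_imposters=2, seed=0):
    """Fraction of iterations in which the target beats all sampled imposters."""
    rng = random.Random(seed)
    n_feats = len(unknown)
    k = max(1, int(n_feats * feature_prop))
    hits = 0
    for _ in range(n_iter):
        # Random feature subset and random imposter subset per iteration.
        idx = rng.sample(range(n_feats), k)
        sampled = rng.sample(imposters, min(n_imposters, len(imposters)))

        def sub(vec):
            return [vec[i] for i in idx]

        d_target = minmax(sub(unknown), sub(target))
        d_imposter = min(minmax(sub(unknown), sub(imp)) for imp in sampled)
        if d_target < d_imposter:
            hits += 1
    return hits / n_iter


unknown = [3, 1, 4, 1, 5, 9]
imposters = [[9, 5, 1, 4, 1, 3], [2, 2, 2, 2, 2, 2]]
print(imposters_score(unknown, target=unknown, imposters=imposters))
```

A score near 1 means the target profile was almost always the nearest candidate; a score near 0 means some imposter was usually at least as close.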

In [18]:
import time

CPU_METRICS = ["manhattan", "euclidean", "minmax", "cng", "cosine", "nini"]
for d in CPU_METRICS[0:]:
    print(f"Starting {d}")
    dev_verifier = Order2Verifier(
        metric=d, base="profile", rank=True, nb_bootstrap_iter=100, rnd_prop=0.35
    )
    dev_verifier.fit(dev_train_X, dev_train_y)
    t = time.time()
    dev_test_scores = dev_verifier.predict_proba(
        test_X=dev_test_X, test_y=dev_test_y, nb_imposters=30
    )
    print(f"Time: {time.time()-t}")

Starting manhattan


08/07/2023 03:10:35 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:36 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:36 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:36 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:36 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:37 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:37 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:37 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:38 [ruzicka:INFO] # test documents processed: 90 out of 96


Time: 2.9504449367523193
Starting euclidean


08/07/2023 03:10:38 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:38 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:39 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:39 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:39 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:40 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:40 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:40 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:41 [ruzicka:INFO] # test documents processed: 90 out of 96


Time: 2.8919949531555176
Starting minmax


08/07/2023 03:10:41 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:41 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:42 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:42 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:42 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:43 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:43 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:43 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:43 [ruzicka:INFO] # test documents processed: 90 out of 96


Time: 2.9051671028137207
Starting cng


08/07/2023 03:10:44 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:44 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:45 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:45 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:45 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:45 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:46 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:46 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:46 [ruzicka:INFO] # test documents processed: 90 out of 96


Time: 2.6542868614196777
Starting cosine


08/07/2023 03:10:47 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:47 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:47 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:48 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:48 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:48 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:48 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:49 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:49 [ruzicka:INFO] # test documents processed: 90 out of 96


Time: 2.8551769256591797
Starting nini


08/07/2023 03:10:50 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:50 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:50 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:50 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:51 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:51 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:51 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:52 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:52 [ruzicka:INFO] # test documents processed: 90 out of 96


Time: 3.1775009632110596


We can now obtain the probability which this O2 verifier would assign to each combination of an unknown document and the target author suggested in the problem:

In [19]:
dev_test_scores = dev_verifier.predict_proba(
    test_X=dev_test_X, test_y=dev_test_y, nb_imposters=30
)

08/07/2023 03:10:53 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:10:53 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:10:53 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:10:54 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:10:54 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:10:54 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:10:55 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:10:55 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:10:55 [ruzicka:INFO] # test documents processed: 90 out of 96


In [20]:
dev_test_X[:, np.array([1, 2, 3, 4])][0].shape

(1, 4)

In [21]:
dev_test_X[1].shape

(1, 10000)

This gives us an array of probability scores, one for each problem, corresponding to the proportion of iterations in which the anonymous document was closer to the target author's profile than to any of the imposters:

In [22]:
print(dev_test_scores)

[0.54956746 0.925      0.28086257 0.07564954 0.06891756 0.23719246
 0.122734   1.         1.         0.60127778 0.97833333 0.45436759
 0.96333333 1.         0.21615095 0.689      0.97833333 0.0414371
 0.069098   0.84866667 0.11956691 0.07060362 0.86666667 0.7212619
 0.05059532 0.99       0.21897205 0.695      0.12017579 0.05483521
 0.07915433 0.06255912 0.85061111 0.04185901 0.22043407 0.40113709
 0.05339389 0.09989206 0.51962302 0.09920503 0.27286903 0.99
 0.03941991 0.995      1.         0.06135734 0.66108766 0.1044389
 0.12977856 0.81884524 0.24402215 0.96666667 0.19907155 0.915
 0.98       0.05783703 0.10885008 0.94666667 0.05815845 0.93
 0.995      0.93166667 0.98       0.62833333 0.7275     1.
 0.79483333 0.48370379 0.91833333 0.82916667 0.06986484 0.05800062
 0.14851908 0.647      0.10542088 0.19622162 0.71210354 0.07445719
 1.         0.06655279 0.82159524 0.95916667 0.06963815 0.04883471
 1.         0.05591854 1.         0.985      0.13055905 0.05810575
 0.05151996 0.83916667 

Let us now load the ground truth to check how well we did:

In [23]:
import os  # os.sep is used below, so import os explicitly

dev_gt_scores = load_ground_truth(
    filepath=os.sep.join((D, "train", "truth.txt")), labels=dev_test_labels
)
print(dev_gt_scores)

[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]


There is one final step needed: the PAN evaluation measures allow systems to leave a number of difficult problems unanswered by setting the probability to exactly 0.5. To account for this strict threshold, we fit a score shifter, which will attempt to rectify mid-range scores to 0.5. We can tune its parameters as follows:

In [24]:
from ruzicka.score_shifting import ScoreShifter

shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores, ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)

08/07/2023 03:11:00 [ruzicka:INFO] p1 for optimal combo: 0.2
08/07/2023 03:11:00 [ruzicka:INFO] p2 for optimal combo: 0.218
08/07/2023 03:11:00 [ruzicka:INFO] AUC for optimal combo: 0.962890625
08/07/2023 03:11:00 [ruzicka:INFO] c@1 for optimal combo: 0.9683159722222221


As you can see, this shifter optimizes 2 parameters using a grid search: all values in between *p1* and *p2* will be rectified to 0.5:

In [25]:
print(dev_test_scores)

[0.6333063069678532, 0.9389423938313132, 0.4145561892030386, 0.007543273439418252, 0.006141627545432334, 0.37900477067800603, 0.017346602387567946, 0.9999991859094389, 0.9999991859094389, 0.6754031883207662, 0.9823605570868693, 0.5558049875730467, 0.9701491986712442, 0.9999991859094389, 0.5, 0.7468170214254776, 0.9823605570868693, 0.0004199942627066528, 0.006179194947041658, 0.8768001476717985, 0.016687191400970063, 0.006492676739468902, 0.8914537777705488, 0.7730811335733696, 0.002326802986245756, 0.9918582802990222, 0.364171706362838, 0.7517015647917276, 0.01681396364137888, 0.0032095791377580694, 0.008272996909436831, 0.004817752828334939, 0.8783831015404907, 0.0005078389681925847, 0.365361922151632, 0.512470539561635, 0.00290948541323467, 0.01259074394288072, 0.608928817389994, 0.012447699530589027, 0.40804872547783666, 0.9918582802990222, 0.0, 0.9959287331042306, 0.9999991859094389, 0.004567533254188501, 0.7240938507979603, 0.013537429719709207, 0.018813331625102847, 0.85252280415
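
The core of the rectification step can be sketched as follows. This is a simplification under stated assumptions: `shift_scores` is a hypothetical function, the band is treated as inclusive at both ends, and scores outside the band are left untouched, whereas the output above shows that the library's `ScoreShifter` also rescales them.

```python
def shift_scores(scores, p1, p2):
    """Set every score inside the band [p1, p2] to exactly 0.5."""
    return [0.5 if p1 <= s <= p2 else s for s in scores]


# Band taken from the log output above; the scores are toy values.
shifted = shift_scores([0.1, 0.21, 0.9], p1=0.2, p2=0.218)
print(shifted)
```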

We can later apply this optimized score shifter to the test problems. Now the main question: how well would our O2 verifier perform on the development problems, given the optimal *p1* and *p2* found? We answer this question using the three evaluation measures used in the PAN competition.

In [26]:
from ruzicka.evaluation import pan_metrics

dev_acc_score, dev_auc_score, dev_c_at_1_score = pan_metrics(
    prediction_scores=dev_test_scores, ground_truth_scores=dev_gt_scores
)
print("Accuracy: ", dev_acc_score)
print("AUC: ", dev_auc_score)
print("c@1: ", dev_c_at_1_score)
print("AUC x c@1: ", dev_auc_score * dev_c_at_1_score)

Accuracy:  0.96875
AUC:  0.962890625
c@1:  0.9683159722222221
AUC x c@1:  0.9323823716905381
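
Of these measures, c@1 is the one that gives credit for unanswered problems. A minimal sketch of its usual definition, c@1 = (n_c + n_u · n_c / n) / n, with n_c correct answers and n_u unanswered problems out of n, is given below. `c_at_1` is an illustrative name and not the repository's `pan_metrics`; treating a score above 0.5 as a "same author" answer is an assumption of the sketch.

```python
def c_at_1(scores, truths):
    """c@1: accuracy that credits unanswered problems (score == 0.5)."""
    n = len(scores)
    n_correct = 0
    n_unanswered = 0
    for score, truth in zip(scores, truths):
        if score == 0.5:
            n_unanswered += 1
        elif (score > 0.5) == (truth == 1.0):
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


# Toy data: 92 correct answers, 1 unanswered, 3 wrong, out of 96 problems.
toy_scores = [0.9] * 92 + [0.5] + [0.1] * 3
toy_truths = [1.0] * 96
print(c_at_1(toy_scores, toy_truths))
```

With 92 correct answers and one unanswered problem out of 96, the formula gives roughly 0.9683, which is consistent with the c@1 value printed above.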


In [27]:
np.array([]).size

0

Score shifting lets us leave difficult problems unanswered, which the c@1 measure rewards; in the run shown above, c@1 and accuracy end up nearly identical on the development set. We can now proceed to the test problems. The following code block runs entirely parallel to the approach above; only the score shifter is not retrained:

In [28]:
train_data, test_data = load_pan_dataset(D + "test")
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)

# vectorize (re-using the pipeline defined above):
# vectorizer = Vectorizer(mfi=10000, vector_space="tf", ngram_type="word", ngram_size=1)
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)

# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels + test_labels)
train_y = np.array(label_encoder.transform(train_labels), dtype="int")
test_y = np.array(label_encoder.transform(test_labels), dtype="int")

# fit and predict a verifier on the test data:
test_verifier = Order2Verifier(
    metric="nini", base="profile", rank=True, nb_bootstrap_iter=100, rnd_prop=0.35
)
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(
    test_X=test_X, test_y=np.array(test_y), nb_imposters=30
)

# load the ground truth:
test_gt_scores = load_ground_truth(
    filepath=os.sep.join((D, "test", "truth.txt")), labels=test_labels
)

# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)

test_acc_score, test_auc_score, test_c_at_1_score = pan_metrics(
    prediction_scores=test_scores, ground_truth_scores=test_gt_scores
)

print("Accuracy: ", test_acc_score)
print("AUC: ", test_auc_score)
print("c@1: ", test_c_at_1_score)
print("AUC x c@1: ", test_auc_score * test_c_at_1_score)

08/07/2023 03:11:01 [ruzicka:INFO] # test documents processed: 10 out of 96
08/07/2023 03:11:01 [ruzicka:INFO] # test documents processed: 20 out of 96
08/07/2023 03:11:01 [ruzicka:INFO] # test documents processed: 30 out of 96
08/07/2023 03:11:02 [ruzicka:INFO] # test documents processed: 40 out of 96
08/07/2023 03:11:02 [ruzicka:INFO] # test documents processed: 50 out of 96
08/07/2023 03:11:02 [ruzicka:INFO] # test documents processed: 60 out of 96
08/07/2023 03:11:03 [ruzicka:INFO] # test documents processed: 70 out of 96
08/07/2023 03:11:03 [ruzicka:INFO] # test documents processed: 80 out of 96
08/07/2023 03:11:04 [ruzicka:INFO] # test documents processed: 90 out of 96


Accuracy:  0.9270833333333334
AUC:  0.9661458333333333
c@1:  0.9367404513888888
AUC x c@1:  0.9050278840241608


While our final test results are a bit lower, the verifier seems to scale reasonably well to the unseen verification problems in the test set.

## First Order Verification

It is interesting now to compare the GI approach to a first-order verification system, which often yields very competitive results too. Our implementation closely resembles the system proposed by Potha and Stamatatos in 2014 (A Profile-based Method for Authorship Verification). We import and fit this O1 verifier:

In [29]:
from ruzicka.Order1Verifier import Order1Verifier

dev_verifier = Order1Verifier(metric="minmax", base="profile")
dev_verifier.fit(dev_train_X, dev_train_y)
dev_test_scores = dev_verifier.predict_proba(test_X=dev_test_X, test_y=dev_test_y)
print(dev_test_scores)

[0.42535614 0.74927379 0.26695359 0.16275048 0.21467687 0.29455667
 0.26252537 1.         0.80831416 0.45999645 0.52716197 0.36116755
 0.7796528  0.55581659 0.21243646 0.54392007 0.48089283 0.07698859
 0.14965419 0.4574793  0.24245035 0.10823839 0.45582588 0.28738832
 0.13099702 0.4190554  0.16569552 0.32288118 0.17656823 0.13955606
 0.16316343 0.16393878 0.38605962 0.06464104 0.26145505 0.32362532
 0.13787526 0.21472127 0.39039325 0.21638494 0.36993046 0.63283866
 0.07829127 0.57549803 0.81738412 0.13671046 0.21176318 0.17422687
 0.19448074 0.32706191 0.242668   0.41381086 0.17834924 0.18266943
 0.63603349 0.05323041 0.27473804 0.74735545 0.15256908 0.50023748
 0.31438291 0.45844207 0.53372577 0.29979233 0.56232487 0.68896497
 0.46943581 0.26551732 0.49437108 0.64728991 0.         0.0435419
 0.07987962 0.3401858  0.08148759 0.0793854  0.25232955 0.21950873
 0.52490162 0.18992932 0.27613654 0.31032119 0.20150882 0.12103311
 0.90724384 0.15537835 0.83401041 0.50631333 0.18294779 0.16173
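
To make the first-order idea concrete: such a verifier compares the unknown document directly with the target author's profile, without any imposters. The sketch below assumes mean-centroid profiles and turns distances into scores by min-max scaling over all problems and inverting. That scaling step is an assumption made for illustration, not a description of `Order1Verifier`, and all function names are hypothetical.

```python
def mean_profile(docs):
    """Mean centroid of an author's known document vectors."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]


def manhattan(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def first_order_scores(unknowns, profiles, dist=manhattan):
    """Score each (unknown, profile) pair: small distance -> high score."""
    distances = [dist(u, p) for u, p in zip(unknowns, profiles)]
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All problems equally distant: no basis for ranking them.
        return [0.5] * len(distances)
    return [1.0 - (d - lo) / (hi - lo) for d in distances]


profile = mean_profile([[1, 3], [3, 5]])  # centroid of two known documents
unknowns = [[2, 4], [0, 0], [2, 2]]
scores = first_order_scores(unknowns, [profile] * 3)
print(profile, scores)
```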

Note that in this case, the 'probabilities' returned are only distance-based pseudo-probabilities and should not be read as calibrated probabilities, even though the values printed above fall between 0 and 1. Applying the score shifter is therefore essential with O1, since it will scale the scores to a more useful range:

In [30]:
shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores, ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)
print(dev_test_scores)

08/07/2023 03:11:07 [ruzicka:INFO] p1 for optimal combo: 0.248
08/07/2023 03:11:07 [ruzicka:INFO] p2 for optimal combo: 0.29
08/07/2023 03:11:07 [ruzicka:INFO] AUC for optimal combo: 0.9713541666666666
08/07/2023 03:11:07 [ruzicka:INFO] c@1 for optimal combo: 0.9366319444444444


[0.5920025550340389, 0.8219838618068396, 0.5, 0.04036207822825101, 0.0532398102057862, 0.49913502323345105, 0.5, 0.99999929000071, 0.8639024805791422, 0.6165971533568956, 0.6642846233794497, 0.5464287059225732, 0.8435529354502258, 0.6846293852389597, 0.05268418897992613, 0.6761828624206632, 0.6314335711870445, 0.019093150452799728, 0.03711420225720802, 0.6148099804920136, 0.06012762543291783, 0.026843094921742344, 0.6136360520144619, 0.5, 0.03248722772127578, 0.5875290384741624, 0.041092447436328984, 0.5192454062817186, 0.043788878278152414, 0.03460986751881078, 0.04046448954483097, 0.040656775591436996, 0.5641020588604513, 0.016030961112914004, 0.5, 0.5197737497148793, 0.03419303019619392, 0.05325082165415214, 0.5671789322275539, 0.05366341082910204, 0.5526503662354869, 0.7393149966467563, 0.019416214796527935, 0.6986031934616808, 0.8703421448698241, 0.03390416087433113, 0.052517215227772945, 0.043208221072078175, 0.04823117466306837, 0.5222137226660875, 0.06018160488743134, 0.5838054

And again, we are now ready to test the performance of O1 on the test problems.

In [31]:
train_data, test_data = load_pan_dataset(D + "test")
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)

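# vectorize (note: this cell uses the repository's own Vectorizer with
# 5,000 word unigrams rather than the TfidfVectorizer pipeline from above):
from ruzicka.vectorization import Vectorizer

vectorizer = Vectorizer(mfi=5000, vector_space="tf", ngram_type="word", ngram_size=1)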
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)

# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels + test_labels)
train_y = np.array(label_encoder.transform(train_labels), dtype="int")
test_y = np.array(label_encoder.transform(test_labels), dtype="int")

# fit and predict a verifier on the test data:
test_verifier = Order1Verifier(metric="minmax", base="profile")
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(test_X=test_X, test_y=test_y)

# load the ground truth:
test_gt_scores = load_ground_truth(
    filepath=os.sep.join((D, "test", "truth.txt")), labels=test_labels
)

# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)

test_acc_score, test_auc_score, test_c_at_1_score = pan_metrics(
    prediction_scores=test_scores, ground_truth_scores=test_gt_scores
)

print("Accuracy: ", test_acc_score)
print("AUC: ", test_auc_score)
print("c@1: ", test_c_at_1_score)
print("AUC x c@1: ", test_auc_score * test_c_at_1_score)

Accuracy:  0.71875
AUC:  0.890842013888889
c@1:  0.7591145833333333
AUC x c@1:  0.6762511641890914
