# General Imposters: Poetic Style

The 'General Imposters' (GI) method for authorship attribution is a bootstrap-based ensemble classifier and one of the state-of-the-art approaches. It provides a bootstrap likelihood in answer to one precise question: 'is this document _more similar_ to the style of a _candidate author_ than to any of the _distractor authors_ (imposters)?' A useful property is that the classifier is allowed to express no opinion, usually taken to mean 'the true author is none of the above'. This overcomes a limitation of many categorical machine-learning classifiers, which are obliged to suggest a 'best match' author.
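
To make the question above concrete, here is a minimal sketch of the bootstrap idea in plain NumPy. It is not the `ruzicka` implementation: the function names, the toy data and the parameters `n_iter`, `feature_prop` and `n_imposters` are hypothetical, and only loosely correspond to the `nb_bootstrap_iter`, `rnd_prop` and `nb_imposters` arguments used later in this notebook. In each round it samples a random subset of features and imposters, then records whether the nearest candidate document is closer to the test document than the nearest imposter.

```python
import numpy as np


def minmax_distance(a, b):
    """Ruzicka (MinMax) distance between two non-negative vectors."""
    return 1.0 - np.minimum(a, b).sum() / np.maximum(a, b).sum()


def imposters_score(test_doc, candidate_docs, imposter_docs,
                    n_iter=100, feature_prop=0.35, n_imposters=30, seed=0):
    """Proportion of bootstrap rounds in which the candidate beats every imposter."""
    rng = np.random.default_rng(seed)
    n_features = test_doc.shape[0]
    n_keep = max(1, int(feature_prop * n_features))
    k = min(n_imposters, len(imposter_docs))
    wins = 0
    for _ in range(n_iter):
        # Fresh random feature subset and imposter subset for this round
        feats = rng.choice(n_features, size=n_keep, replace=False)
        imps = imposter_docs[rng.choice(len(imposter_docs), size=k, replace=False)]
        d_cand = min(minmax_distance(test_doc[feats], c[feats]) for c in candidate_docs)
        d_imp = min(minmax_distance(test_doc[feats], i[feats]) for i in imps)
        wins += int(d_cand < d_imp)
    return wins / n_iter


# Toy data (hypothetical): candidates are near-copies of the test document,
# imposters are unrelated random vectors.
rng = np.random.default_rng(1)
test_doc = rng.uniform(1.0, 2.0, size=20)
candidate_docs = test_doc * (1.0 + 0.01 * rng.uniform(-1, 1, size=(5, 20)))
imposter_docs = rng.uniform(1.0, 2.0, size=(40, 20))
score = imposters_score(test_doc, candidate_docs, imposter_docs)
print(f"Bootstrap score for the candidate author: {score:.2f}")
```

A score near 1 means the candidate won almost every round, a score near 0 means an imposter almost always won, and intermediate values are where a 'no opinion' band becomes useful.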

In general, this follows the methods and updates the code from this paper:

`Kestemont, M., Stover, J., Koppel, M., Karsdorp, F., & Daelemans, W. (2016). Authenticating the writings of Julius Caesar. Expert Systems with Applications, 63, 86-96.`

GitHub for that code is at: https://github.com/mikekestemont/ruzicka

My (many) changes are at: https://github.com/bnagy/ruzicka

The Kestemont code in turn is based on:

`Koppel, M., & Winter, Y. (2014). Determining if two documents are written by the same author. Journal of the Association for Information Science and Technology, 65(1), 178-187.`

But I have implemented _some_ of the additional ideas (particularly ranked scoring) from:

`Potha, N., & Stamatatos, E. (2017). An improved impostors method for authorship verification. In CLEF 2017, Dublin, Ireland, September 11–14, 2017, Proceedings 8 (pp. 138-144)`


# NOTE
This notebook is mainly archival. I don't think the original GI formulation should lack power to the extent it appears to here, but I have not spent much time debugging since I moved to the newer BDI method.

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit

from ruzicka.Order2Verifier import Order2Verifier
from ruzicka import utilities
from ruzicka.score_shifting import ScoreShifter

In [2]:
import warnings

warnings.filterwarnings("ignore")

import logging

logging.basicConfig(level="INFO")

## Corpus

See [this notebook](build_corpus.ipynb) for corpus creation details. I use Augustan 'short elegy' as elsewhere, excluding any poem shorter than twenty lines. For the poetic corpus, I use the vectorisation I created for previous work on Augustan elegy, which considers the following features:

<img src="./es_poetics_summary.png" alt="Drawing" style="width: 800px;"/>

In [3]:
elegy_vecs = pd.read_csv("elegy_poetic.csv", index_col=0)
elegy_corpus = elegy_vecs[elegy_vecs.LEN >= 20].reset_index(drop=True)
elegy_corpus

Unnamed: 0,Author,Work,Poem,H1SP,H2SP,H3SP,H4SP,H1CF,H2CF,H3CF,...,P4SC,P1WC,P2WC,P3WC,P4WC,ELC,RS,LEO,LEN,PFSD
0,Ovid,Ep.,Ep. 1,0.086207,0.500000,0.500000,0.448276,0.241379,0.706897,0.810345,...,0.0,0.206897,0.068966,0.396552,1.000000,0.094828,4.393948,0.739842,116,0.000000
1,Ovid,Ep.,Ep. 2,0.189189,0.527027,0.581081,0.391892,0.283784,0.743243,0.878378,...,0.0,0.202703,0.067568,0.337838,1.000000,0.114865,4.071062,1.027448,148,0.000000
2,Ovid,Ep.,Ep. 3,0.220779,0.493506,0.519481,0.480519,0.181818,0.597403,0.818182,...,0.0,0.116883,0.025974,0.324675,1.000000,0.090909,3.845700,0.484285,154,0.000000
3,Ovid,Ep.,Ep. 4,0.102273,0.511364,0.545455,0.465909,0.147727,0.659091,0.829545,...,0.0,0.215909,0.045455,0.329545,1.000000,0.073864,3.822098,0.893575,176,0.000000
4,Ovid,Ep.,Ep. 5,0.215190,0.455696,0.632911,0.417722,0.164557,0.658228,0.911392,...,0.0,0.202532,0.037975,0.341772,1.000000,0.056962,3.727347,0.713715,158,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293,ps-Ovid,Consolatio,Consolatio 3,0.329114,0.506329,0.658228,0.582278,0.291139,0.594937,0.772152,...,0.0,0.151899,0.037975,0.240506,0.987342,0.202532,4.590044,1.062847,158,0.225018
294,ps-Ovid,Ibis,Ibis 1,0.156250,0.718750,0.562500,0.593750,0.156250,0.562500,0.906250,...,0.0,0.187500,0.000000,0.218750,1.000000,0.109375,3.986751,1.053890,64,0.000000
295,ps-Ovid,Ibis,Ibis 2,0.160000,0.530000,0.620000,0.440000,0.100000,0.580000,0.960000,...,0.0,0.230000,0.060000,0.360000,1.000000,0.130000,4.683774,0.994626,200,0.000000
296,ps-Ovid,Ibis,Ibis 3,0.190000,0.450000,0.730000,0.550000,0.180000,0.730000,0.950000,...,0.0,0.240000,0.050000,0.260000,1.000000,0.060000,4.070276,0.787213,200,0.000000


In [4]:
X = np.array(elegy_corpus.iloc[:, 3:])

In [5]:
lenc = LabelEncoder()
labels = lenc.fit_transform(elegy_corpus.Author)

In [6]:
logger = logging.getLogger("ruzicka")

In [7]:
# set to logging.WARNING or higher for less noise (logging.DEBUG gives more)

for handler in logger.handlers:
    handler.setLevel(logging.INFO)

In [8]:
# Verifier options

verifier_cosine = Order2Verifier(
    metric="cosine", base="instance", nb_bootstrap_iter=500, rnd_prop=0.35
)

verifier_minmax = Order2Verifier(
    metric="minmax", base="instance", nb_bootstrap_iter=500, rnd_prop=0.35
)

In [9]:
# Splitter

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

## Methods

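I scale the poetic features to unit variance and then test with the Cosine and MinMax metrics. The scaler uses `with_mean=False`, so the features are not centred (this is not a full z-score) and non-negative features stay non-negative, which the MinMax metric assumes. In the run recorded below the two metrics perform comparably (mean C@1 of 95.94% for Cosine and 95.71% for MinMax). MinMax is the metric used for the final results, and it scores at least 1% better than the best performing text-based classifier.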
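
As a rough illustration of the two metrics and of why scaling matters, the sketch below defines them directly in NumPy and applies them to a small made-up matrix. The helper names and the `toy` data are hypothetical, and the definitions are the standard textbook ones rather than code taken from `ruzicka`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def minmax_distance(a, b):
    """Ruzicka (MinMax) distance: 1 - sum(min) / sum(max), for non-negative vectors."""
    return 1.0 - np.minimum(a, b).sum() / np.maximum(a, b).sum()


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine of the angle between the two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Hypothetical features on very different scales (two proportions and a line count)
toy = np.array([
    [0.1, 0.5, 100.0],
    [0.2, 0.4, 150.0],
    [0.3, 0.6, 200.0],
])

# Unit variance without centring, so the values remain non-negative
scaled = StandardScaler(with_mean=False).fit_transform(toy)

raw_d = minmax_distance(toy[0], toy[1])
scaled_d = minmax_distance(scaled[0], scaled[1])
print(f"MinMax distance, raw features:    {raw_d:.3f}")
print(f"MinMax distance, scaled features: {scaled_d:.3f}")
print(f"Cosine distance, scaled features: {cosine_distance(scaled[0], scaled[1]):.3f}")
```

On the raw data the large-valued column dominates the sums inside the MinMax distance. After scaling, each feature contributes on a comparable footing.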

In [10]:
# Already vectorized, but not scaled.
# with_mean=False scales each feature to unit variance without centring it.

scaler = StandardScaler(with_mean=False)

## Comparison / Evaluation

In each case we fit a 'score shifter' on a random 20% subsample and then apply that shifting to the GI Verifier. The final metric is C@1 Accuracy:

`Peñas, A., & Rodrigo, A. (2011). A simple measure to assess nonresponse. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Vol. 1 (pp. 1415-1424).`

This measure is useful because it allows the model to refuse to classify (to say 'I don't know') without unduly penalising it, which helps with regularisation and interpretability.
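
For reference, the measure defined by Peñas and Rodrigo is $c@1 = \frac{1}{n}\left(n_c + n_u \frac{n_c}{n}\right)$, where $n$ is the number of problems, $n_c$ the number answered correctly and $n_u$ the number left unanswered. The sketch below is a small standalone version, assuming that a score of exactly 0.5 means 'no answer'. The function name and the example values are hypothetical, and this is not the scoring code inside `ruzicka`.

```python
import numpy as np


def c_at_1(y_true, scores, unanswered=0.5):
    """c@1: unanswered problems earn credit in proportion to accuracy on the rest."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    answered = scores != unanswered
    # An answered problem is correct when its side of the threshold matches the truth
    correct = answered & ((scores > unanswered) == y_true)
    n_c = correct.sum()
    n_u = (~answered).sum()
    return (n_c + n_u * n_c / n) / n


# Five hypothetical verification problems: three correct, one wrong, one unanswered
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.5, 0.1, 0.7, 0.8]
print(f"c@1 = {c_at_1(y_true, scores):.2f}")  # (3 + 1 * 3/5) / 5 = 0.72
```

Counting the unanswered problem as simply wrong would give 3/5 = 0.60 here, so declining to answer costs less than a wrong answer but still earns less than a correct one.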

In [11]:
cosine_shifter = utilities.fit_shifter(
    X,
    labels,
    test_size=0.2,
    vectorizer=scaler,
    verifier=verifier_cosine,
    shifter=ScoreShifter(min_spread=0.2),
)
aa, cc = utilities.benchmark_imposters(
    X, labels, sss, scaler, verifier_cosine, cosine_shifter
)
print()
print(f"{'Splits: ':>11} {sss.n_splits}")
print(f"{'Test %: ':>11} {sss.test_size:.0%}")
print(f"{'Accuracy: ':>11} Mean {np.mean(aa):.2%}, SD {np.std(aa):.2f}")
print(f"{'C@1: ':>11} Mean {np.mean(cc):.2%}, SD {np.std(cc):.2f}")

01/20/2025 08:16:59 [ruzicka:INFO] Fitting the provided score shifter on a 20.0% sample
01/20/2025 08:16:59 [ruzicka:INFO] Fitting on 238 documents in instance mode...
01/20/2025 08:16:59 [ruzicka:INFO] Running verifier on sub-sample
01/20/2025 08:16:59 [ruzicka:INFO] Predicting on 120 documents
01/20/2025 08:17:04 [ruzicka:INFO] Actually fitting...
01/20/2025 08:17:05 [ruzicka:INFO] p1 for optimal combo: 0.536
01/20/2025 08:17:05 [ruzicka:INFO] p2 for optimal combo: 0.740
01/20/2025 08:17:05 [ruzicka:INFO] Objective function result for optimal combo: 96.66%
01/20/2025 08:17:05 [ruzicka:INFO] Starting benchmark: 10 splits, test size 10%
01/20/2025 08:17:05 [ruzicka:INFO] Fitting on 268 documents in instance mode...
01/20/2025 08:17:05 [ruzicka:INFO] Predicting on 60 documents
01/20/2025 08:17:08 [ruzicka:INFO] Accuracy: 88.33% AUC: 99.11% c@1: 95.69% AUC x c@1: 94.84%
01/20/2025 08:17:08 [ruzicka:INFO] Fitting on 268 documents in instance mode...
01/20/2025 08:17:08 [ruzicka:INFO] Pred


   Splits:  10
   Test %:  10%
 Accuracy:  Mean 90.33%, SD 0.03
      C@1:  Mean 95.94%, SD 0.03


In [12]:
minmax_shifter = utilities.fit_shifter(
    X,
    labels,
    test_size=0.2,
    vectorizer=scaler,
    verifier=verifier_minmax,
    shifter=ScoreShifter(min_spread=0.2),
)
aa, cc = utilities.benchmark_imposters(
    X, labels, sss, scaler, verifier_minmax, minmax_shifter
)
print()
print(f"{'Splits: ':>11} {sss.n_splits}")
print(f"{'Test %: ':>11} {sss.test_size:.0%}")
print(f"{'Accuracy: ':>11} Mean {np.mean(aa):.2%}, SD {np.std(aa):.2f}")
print(f"{'C@1: ':>11} Mean {np.mean(cc):.2%}, SD {np.std(cc):.2f}")

01/20/2025 08:17:32 [ruzicka:INFO] Fitting the provided score shifter on a 20.0% sample
01/20/2025 08:17:32 [ruzicka:INFO] Fitting on 238 documents in instance mode...
01/20/2025 08:17:32 [ruzicka:INFO] Running verifier on sub-sample
01/20/2025 08:17:32 [ruzicka:INFO] Predicting on 120 documents
01/20/2025 08:17:37 [ruzicka:INFO] Actually fitting...
01/20/2025 08:17:38 [ruzicka:INFO] p1 for optimal combo: 0.578
01/20/2025 08:17:38 [ruzicka:INFO] p2 for optimal combo: 0.782
01/20/2025 08:17:38 [ruzicka:INFO] Objective function result for optimal combo: 97.88%
01/20/2025 08:17:38 [ruzicka:INFO] Starting benchmark: 10 splits, test size 10%
01/20/2025 08:17:38 [ruzicka:INFO] Fitting on 268 documents in instance mode...
01/20/2025 08:17:38 [ruzicka:INFO] Predicting on 60 documents
01/20/2025 08:17:41 [ruzicka:INFO] Accuracy: 95.00% AUC: 99.11% c@1: 96.58% AUC x c@1: 95.72%
01/20/2025 08:17:41 [ruzicka:INFO] Fitting on 268 documents in instance mode...
01/20/2025 08:17:41 [ruzicka:INFO] Pred


   Splits:  10
   Test %:  10%
 Accuracy:  Mean 90.50%, SD 0.03
      C@1:  Mean 95.71%, SD 0.01


## Actually apply the method

Finally, we apply the method to the problem texts. Each score is the shifted likelihood that a text is closer to Ovidian style than to any other author (1 is '100% confidence'). Apart from _Pamphilus_ and the last third of the _Consolatio_, every text is placed on the Ovidian side, with scores in a narrow range. This suggests that this method, particularly when applied to purely textual input, does not have the statistical power to distinguish between Ovidian imitation (_Consolatio_) and genuine work (_Ibis_, _Medicamina_). This leaves the situation unclear with respect to the _Nux_.

In [13]:
real_verifier = Order2Verifier(
    metric="minmax", base="instance", nb_bootstrap_iter=500, rnd_prop=0.35
)

In [14]:
# Fit on the whole solidly-attributed corpus now.

real_verifier.fit(scaler.fit_transform(X), np.array(labels))

01/20/2025 08:18:05 [ruzicka:INFO] Fitting on 298 documents in instance mode...


In [15]:
problems = elegy_corpus[elegy_corpus.Author == "ps-Ovid"]
problems

Unnamed: 0,Author,Work,Poem,H1SP,H2SP,H3SP,H4SP,H1CF,H2CF,H3CF,...,P4SC,P1WC,P2WC,P3WC,P4WC,ELC,RS,LEO,LEN,PFSD
288,ps-Ovid,Nux,Nux,0.153846,0.450549,0.626374,0.626374,0.175824,0.604396,0.868132,...,0.0,0.197802,0.043956,0.285714,1.0,0.082418,3.09536,0.524756,182,0.0
289,ps-Ovid,Medicamina,Medicamina,0.28,0.48,0.52,0.54,0.18,0.62,0.88,...,0.0,0.2,0.04,0.3,1.0,0.08,4.901116,0.909967,100,0.0
290,ps-Ovid,Pamphilus,Pamphilus,0.343434,0.505051,0.656566,0.616162,0.282828,0.636364,0.929293,...,0.010101,0.080808,0.121212,0.141414,0.959596,0.0,4.120489,0.683937,198,0.357215
291,ps-Ovid,Consolatio,Consolatio 1,0.240506,0.481013,0.64557,0.531646,0.164557,0.582278,0.924051,...,0.0,0.088608,0.037975,0.278481,1.0,0.246835,4.619877,0.606677,158,0.0
292,ps-Ovid,Consolatio,Consolatio 2,0.253165,0.556962,0.556962,0.493671,0.240506,0.696203,0.810127,...,0.0,0.088608,0.025316,0.240506,1.0,0.278481,3.608988,0.824542,158,0.0
293,ps-Ovid,Consolatio,Consolatio 3,0.329114,0.506329,0.658228,0.582278,0.291139,0.594937,0.772152,...,0.0,0.151899,0.037975,0.240506,0.987342,0.202532,4.590044,1.062847,158,0.225018
294,ps-Ovid,Ibis,Ibis 1,0.15625,0.71875,0.5625,0.59375,0.15625,0.5625,0.90625,...,0.0,0.1875,0.0,0.21875,1.0,0.109375,3.986751,1.05389,64,0.0
295,ps-Ovid,Ibis,Ibis 2,0.16,0.53,0.62,0.44,0.1,0.58,0.96,...,0.0,0.23,0.06,0.36,1.0,0.13,4.683774,0.994626,200,0.0
296,ps-Ovid,Ibis,Ibis 3,0.19,0.45,0.73,0.55,0.18,0.73,0.95,...,0.0,0.24,0.05,0.26,1.0,0.06,4.070276,0.787213,200,0.0
297,ps-Ovid,Ibis,Ibis 4,0.123596,0.438202,0.617978,0.52809,0.179775,0.685393,0.988764,...,0.0,0.258427,0.05618,0.213483,0.977528,0.033708,4.358413,0.791811,178,0.471886


## Results

Here we can _start_ to see the difference in the _Consolatio_ (and the last third of the poem is actually placed in the 'not sure' band by the classifier). However, the summary statistic output of the GI Verifier is still not satisfyingly powerful. A better result can be seen in the [full bootstrapping notebook](nux_boot_poet.ipynb), which displays full distributions of differences instead of a voting-based ensemble summary.
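
The 'not sure' band can be pictured with a small sketch of score shifting. This is an assumption about the general scheme (raw scores between two fitted thresholds `p1` and `p2` collapse to exactly 0.5, and the regions on either side are rescaled linearly), not a copy of what `ScoreShifter` does internally. The function name is hypothetical, and the thresholds are the values reported in the MinMax fitting log above.

```python
def shift_score(score, p1, p2):
    """Map a raw score in [0, 1] so that the band [p1, p2] becomes exactly 0.5.

    Assumes 0 < p1 <= p2 < 1. Scores below p1 are rescaled into [0, 0.5),
    scores above p2 into (0.5, 1].
    """
    if score < p1:
        return 0.5 * score / p1
    if score > p2:
        return 0.5 + 0.5 * (score - p2) / (1.0 - p2)
    return 0.5


# Thresholds taken from the MinMax log output in this notebook
p1, p2 = 0.578, 0.782

for raw in (0.30, 0.60, 0.75, 0.85, 1.00):
    print(f"raw {raw:.2f} -> shifted {shift_score(raw, p1, p2):.3f}")
```

Under this scheme a shifted score of 0.5 does not mean 'fifty-fifty'. It means the raw bootstrap proportion fell inside the band where the classifier declines to answer.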

In [16]:
t = minmax_shifter.transform(
    real_verifier.predict_proba(
        np.array(scaler.transform(problems.iloc[:, 3:])),
        np.array(lenc.transform(["Ovid"] * len(problems))),
        nb_imposters=30,
    )
)

01/20/2025 08:18:05 [ruzicka:INFO] Predicting on 10 documents


In [17]:
pd.DataFrame(zip(problems.Poem, t), columns=["Poem", "GI %"])

Unnamed: 0,Poem,GI %
0,Nux,0.732187
1,Medicamina,0.704719
2,Pamphilus,0.5
3,Consolatio 1,0.655277
4,Consolatio 2,0.672063
5,Consolatio 3,0.5
6,Ibis 1,0.704719
7,Ibis 2,0.681066
8,Ibis 3,0.719216
9,Ibis 4,0.672826


In [18]:
%load_ext watermark
%watermark -n -u -v -iv -w

Last updated: Mon Jan 20 2025

Python implementation: CPython
Python version       : 3.12.3
IPython version      : 8.20.0

sklearn: 1.4.2
pandas : 2.2.2
ruzicka: 1.1.0
logging: 0.5.1.2
numpy  : 1.26.4

Watermark: 2.5.0

