# General Imposters: Lexico-Grammatical Style

The 'General Imposters' (GI) method for authorship attribution is a bootstrap-based ensemble classifier and one of the state-of-the-art approaches. It provides a bootstrap likelihood in answer to one precise question: 'is this document _more similar_ to the style of a _candidate author_ than to any of the _distractor authors_ (imposters)?' A useful property is that the classifier is allowed to express no opinion, usually taken to mean 'the true author is none of the above'. This overcomes a limitation of many categorical machine-learning classifiers, which are obliged to suggest a 'best match' author.

In general, this follows the methods and updates the code from this paper:

`Kestemont, M., Stover, J., Koppel, M., Karsdorp, F., & Daelemans, W. (2016). Authenticating the writings of Julius Caesar. Expert Systems with Applications, 63, 86-96.`

The GitHub repository for that code is at: https://github.com/mikekestemont/ruzicka

The Kestemont code in turn is based on:

`Koppel, M., & Winter, Y. (2014). Determining if two documents are written by the same author. Journal of the Association for Information Science and Technology, 65(1), 178-187.`

But I have implemented _some_ of the additional ideas (particularly ranked scoring) from:

`Potha, N., & Stamatatos, E. (2017). An improved impostors method for authorship verification. In CLEF 2017, Dublin, Ireland, September 11–14, 2017, Proceedings 8 (pp. 138-144)`
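
The core question described at the top of this notebook (is the document closer to the candidate author than to any imposter?) can be illustrated with a small bootstrap loop. This is a minimal sketch of the general idea only, not the `ruzicka` implementation: the names `imposters_score` and `manhattan` are hypothetical, the Manhattan distance is a stand-in for whichever metric is actually used, and the ranked scoring of Potha & Stamatatos is not included.

```python
import numpy as np


def manhattan(a, b):
    """Plain L1 distance, used here only as a stand-in metric."""
    return float(np.abs(a - b).sum())


def imposters_score(target, candidates, imposters, distance=manhattan,
                    n_iter=100, feat_prop=0.35, n_imposters=30, seed=0):
    """Proportion of bootstrap rounds in which the nearest candidate-author
    document is strictly closer to `target` than the nearest sampled imposter."""
    rng = np.random.default_rng(seed)
    n_feats = target.shape[0]
    k = max(1, int(feat_prop * n_feats))      # features kept per round
    m = min(n_imposters, len(imposters))      # imposters sampled per round
    wins = 0
    for _ in range(n_iter):
        feats = rng.choice(n_feats, size=k, replace=False)
        imp_rows = rng.choice(len(imposters), size=m, replace=False)
        best_cand = min(distance(target[feats], c[feats]) for c in candidates)
        best_imp = min(distance(target[feats], imposters[i][feats]) for i in imp_rows)
        wins += best_cand < best_imp
    return wins / n_iter


# Toy data: the candidate documents sit close to the target, the imposters far away.
target = np.zeros(20)
candidates = np.zeros((3, 20)) + 0.1
imposters = np.ones((10, 20))
score = imposters_score(target, candidates, imposters, n_imposters=5)
print(score)  # 1.0, since the candidate wins every round on this toy data
```

Because each round uses a random subset of features and imposters, a score near 1 means the candidate wins regardless of which features are in view, while middling scores are natural candidates for a 'no opinion' answer.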


# NOTE
This is mainly archival. I don't think the original GI formulation should lack as much power as it appears to here, but I have not spent much time debugging since I moved to the newer BDI method.

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import Normalizer, LabelEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import StratifiedShuffleSplit

from ruzicka.Order2Verifier import Order2Verifier
from ruzicka import utilities
from ruzicka.score_shifting import ScoreShifter

In [2]:
import warnings

warnings.filterwarnings("ignore")

import logging

logging.basicConfig(level="INFO")

## Corpus

See [this notebook](build_corpus.ipynb) for corpus creation details. I use Augustan 'short elegy' as elsewhere, but exclude any poem shorter than twenty lines. In addition, I include 200 samples with length $\sim N(80,20)$ drawn from assorted works of epic hexameter.
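
The actual sampling lives in the corpus notebook linked above. Purely as an illustration of what 'samples with length $\sim N(80,20)$' could look like, here is a hypothetical sketch: the function `sample_chunks` and its parameters are invented for this example, it assumes that 20 is the standard deviation (not the variance), that lengths are counted in lines, and that chunks are contiguous.

```python
import numpy as np


def sample_chunks(lines, n_samples=200, mean_len=80, sd_len=20, min_len=1, seed=0):
    """Draw contiguous chunks whose line counts are roughly N(mean_len, sd_len)."""
    rng = np.random.default_rng(seed)
    chunks = []
    for _ in range(n_samples):
        # Round the normal draw to a whole number of lines and keep it in range
        length = int(round(rng.normal(mean_len, sd_len)))
        length = max(min_len, min(length, len(lines)))
        # Pick a start so that the chunk fits inside the work
        start = rng.integers(0, len(lines) - length + 1)
        chunks.append("\n".join(lines[start:start + length]))
    return chunks


# Toy 'work' of 1000 numbered lines standing in for a real poem
work = [f"line {i}" for i in range(1000)]
chunks = sample_chunks(work, n_samples=50)
lengths = [chunk.count("\n") + 1 for chunk in chunks]
print(len(chunks), min(lengths), max(lengths))
```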

In [3]:
elegy_vecs = pd.read_csv("elegy_corpus.csv", index_col=0)
elegy_corpus = elegy_vecs[elegy_vecs.LEN >= 20]
hexameter_vecs = pd.read_csv("non_elegy_corpus.csv", index_col=0)

In [4]:
test_corpus = pd.concat(
    [elegy_corpus[elegy_corpus.Author != "ps-Ovid"], hexameter_vecs]
).reset_index(drop=True)
test_corpus

Unnamed: 0,Author,Work,Poem,LEN,Chunk
0,Ovid,Ep.,Ep. 1,116,hank tua penelope lento tibi mittit ulikse\nni...
1,Ovid,Ep.,Ep. 2,148,hospita demopoon tua te rodopeia pyllis\nultra...
2,Ovid,Ep.,Ep. 3,154,kwam legis a rapta briseide littera wenit\nwik...
3,Ovid,Ep.,Ep. 4,176,kwam nisi tu dederis karitura_st ipsa salutem\...
4,Ovid,Ep.,Ep. 5,158,perlegis an konjunks prohibet nowa perlege non...
...,...,...,...,...,...
483,V.Flaccus,195-Argonautica,195-Argonautica,62,eumenidumkwe komae non tristis ab aetere gorgo...
484,Lucretius,196-DRN,196-DRN,92,kernere koeperunt kontendere _satkwe parare\nn...
485,Horace,197-Hor.,197-Hor. Sat.,72,kwod plakui tibi kwi turpi sekernis honestum\n...
486,Vergil,198-Aeneid,198-Aeneid,79,insinjem gemmis tum fumida lumine fulwo\ninwol...


In [5]:
lenc = LabelEncoder()
labels = lenc.fit_transform(test_corpus.Author)

In [6]:
logger = logging.getLogger("ruzicka")

In [7]:
# set to logging.WARNING or higher for less noise (logging.DEBUG for more)

for handler in logger.handlers:
    handler.setLevel(logging.INFO)

In [8]:
# Verifier options

verifier_nini = Order2Verifier(
    metric="nini", base="instance", nb_bootstrap_iter=500, rnd_prop=0.35
)

verifier_minmax = Order2Verifier(
    metric="minmax", base="instance", nb_bootstrap_iter=500, rnd_prop=0.35
)

In [9]:
# Splitter

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

## Methods

I test two methods (which turn out to perform similarly).

First, I use _z_-style scaling (to unit variance, but without centring) and the MinMax (Ružička) metric as recommended by the Kestemont paper, but apply it to character 2-, 3-, and 4-grams instead of most frequent words (MFW). This is my general approach when dealing with small poetic samples.

Second, I use the approach described elsewhere from _Nini, A. (2023). A Theory of Linguistic Individuality for Authorship Analysis (Elements in Forensic Linguistics). Cambridge: Cambridge University Press. doi:10.1017/9781108974851_. Here the vector space is the 5000 most frequent character 5-grams and the metric is the Pearson correlation between set-inclusion indicator (binary) vectors.
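
To make the second metric concrete: the Pearson correlation between two binary presence/absence vectors (also known as the phi coefficient) can be computed directly with NumPy. This is a small sketch of the idea, not the code path used by the `nini` metric in `ruzicka`; the helper name `binary_correlation` is hypothetical, and how the library turns the correlation into a distance is not shown here.

```python
import numpy as np


def binary_correlation(a, b):
    """Pearson correlation between set-inclusion indicators of two vectors.

    Any positive weight counts as 'n-gram present'. The result is undefined
    (nan) if either document contains all of the n-grams or none of them.
    """
    a = (np.asarray(a) > 0).astype(float)
    b = (np.asarray(b) > 0).astype(float)
    return float(np.corrcoef(a, b)[0, 1])


# Two toy documents over a six-n-gram vocabulary; only presence matters
doc_a = np.array([0.4, 0.0, 0.2, 0.0, 0.1, 0.0])
doc_b = np.array([0.1, 0.0, 0.9, 0.0, 0.0, 0.3])

r = binary_correlation(doc_a, doc_b)
print(round(r, 2))  # about 0.33: the documents share 2 of their 3 n-grams
```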

In [10]:
# Vectorizer options

vec_ngrams_std = make_pipeline(
    TfidfVectorizer(
        sublinear_tf=True,
        use_idf=False,
        norm="l2",
        analyzer="char",
        ngram_range=(2, 4),
        max_features=5000,
    ),
    StandardScaler(
        with_mean=False
    ),  # never centre frequency data for the minmax metric!
    FunctionTransformer(lambda x: np.asarray(x.todense()), accept_sparse=True),
    Normalizer(),
)

vec_5grams = make_pipeline(
    TfidfVectorizer(
        sublinear_tf=True,
        use_idf=False,
        norm="l2",
        analyzer="char",
        ngram_range=(5, 5),
        max_features=5000,
    ),
    FunctionTransformer(lambda x: np.asarray(x.todense()), accept_sparse=True),
)
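
The comment in the first pipeline about never centring frequency data follows from the definition of the MinMax (Ružička) distance, $1 - \sum_i \min(a_i, b_i) / \sum_i \max(a_i, b_i)$, which only stays within $[0, 1]$ for non-negative input. The toy example below is an illustrative sketch (the helper `minmax_distance` is written here for the demonstration and is not taken from `ruzicka`):

```python
import numpy as np


def minmax_distance(a, b):
    """Ružička (MinMax) distance: 1 - sum(min(a, b)) / sum(max(a, b))."""
    return 1.0 - np.minimum(a, b).sum() / np.maximum(a, b).sum()


x = np.array([3.0, 1.0, 0.0, 2.0])
y = np.array([1.0, 1.0, 2.0, 2.0])

# Non-negative frequencies: sum(min) = 4, sum(max) = 8
d_raw = minmax_distance(x, y)

# Centre each feature over this two-document 'corpus': values become negative
col_mean = (x + y) / 2
d_centred = minmax_distance(x - col_mean, y - col_mean)

print(d_raw, d_centred)  # 0.5 2.0
```

After centring, the 'distance' leaves the $[0, 1]$ range, which is why the scaler is used with `with_mean=False`.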

## Comparison / Evaluation

In each case we fit a 'score shifter' on a random 20% subsample and then apply that shifting to the GI Verifier. The final metric is C@1 Accuracy:

`Peñas, A., & Rodrigo, A. (2011). A simple measure to assess nonresponse. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Vol. 1 (pp. 1415-1424).`

This measure is useful because it allows the model to refuse to classify (to say 'I don't know') without unduly penalising it, which helps with regularisation and interpretability.
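
As defined by Peñas and Rodrigo, with $n$ problems, $n_c$ correct answers and $n_u$ unanswered problems, the measure is $c@1 = \frac{1}{n}\left(n_c + n_u \cdot \frac{n_c}{n}\right)$. The sketch below shows how it compares with plain accuracy; the function name `c_at_1` and the boolean `answered` mask are choices made for this example, not the `ruzicka` API.

```python
def c_at_1(y_true, y_pred, answered):
    """c@1: unanswered problems are credited at the overall rate of correct answers."""
    n = len(y_true)
    n_correct = sum(
        1 for t, p, a in zip(y_true, y_pred, answered) if a and t == p
    )
    n_unanswered = sum(1 for a in answered if not a)
    return (n_correct + n_unanswered * n_correct / n) / n


# Ten problems: eight answered (all correctly), two left unanswered
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # the last two would have been wrong
answered = [True] * 8 + [False] * 2

score = c_at_1(y_true, y_pred, answered)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"c@1 = {score:.2f}, forced-answer accuracy = {accuracy:.2f}")
```

Here declining the two doubtful cases yields 0.96, against 0.80 if the model were forced to answer them (wrongly). If every problem is answered, c@1 reduces to ordinary accuracy.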

Both methods perform strongly given the small samples: over ten splits, mean accuracy is about 90% and mean C@1 is between 95% and 96% in each case.

In [11]:
nini_shifter = utilities.fit_shifter(
    test_corpus.Chunk,
    labels,
    test_size=0.2,
    vectorizer=vec_5grams,
    verifier=verifier_nini,
    shifter=ScoreShifter(min_spread=0.2),
)
aa, cc = utilities.benchmark_imposters(
    test_corpus.Chunk, labels, sss, vec_5grams, verifier_nini, nini_shifter
)
print()
print(f"{'Splits: ':>11} {sss.n_splits}")
print(f"{'Test %: ':>11} {sss.test_size:.0%}")
print(f"{'Accuracy: ':>11} Mean {np.mean(aa):.2%}, SD {np.std(aa):.2f}")
print(f"{'C@1: ':>11} Mean {np.mean(cc):.2%}, SD {np.std(cc):.2f}")

01/20/2025 07:53:29 [ruzicka:INFO] Fitting the provided score shifter on a 20.0% sample
01/20/2025 07:53:30 [ruzicka:INFO] Fitting on 390 documents in instance mode...
01/20/2025 07:53:30 [ruzicka:INFO] Running verifier on sub-sample
01/20/2025 07:53:30 [ruzicka:INFO] Predicting on 196 documents
01/20/2025 07:54:12 [ruzicka:INFO] Actually fitting...
01/20/2025 07:54:13 [ruzicka:INFO] p1 for optimal combo: 0.566
01/20/2025 07:54:13 [ruzicka:INFO] p2 for optimal combo: 0.782
01/20/2025 07:54:13 [ruzicka:INFO] Objective function result for optimal combo: 94.86%
01/20/2025 07:54:13 [ruzicka:INFO] Starting benchmark: 10 splits, test size 10%
01/20/2025 07:54:14 [ruzicka:INFO] Fitting on 439 documents in instance mode...
01/20/2025 07:54:14 [ruzicka:INFO] Predicting on 98 documents
01/20/2025 07:54:36 [ruzicka:INFO] Accuracy: 90.82% AUC: 99.92% c@1: 98.23% AUC x c@1: 98.15%
01/20/2025 07:54:37 [ruzicka:INFO] Fitting on 439 documents in instance mode...
01/20/2025 07:54:37 [ruzicka:INFO] Pred


   Splits:  10
   Test %:  10%
 Accuracy:  Mean 90.10%, SD 0.01
      C@1:  Mean 96.21%, SD 0.02


In [12]:
ngram_shifter = utilities.fit_shifter(
    test_corpus.Chunk,
    labels,
    test_size=0.2,
    vectorizer=vec_ngrams_std,
    verifier=verifier_minmax,
    shifter=ScoreShifter(min_spread=0.2),
)
aa, cc = utilities.benchmark_imposters(
    test_corpus.Chunk, labels, sss, vec_ngrams_std, verifier_minmax, ngram_shifter
)
print()
print(f"{'Splits: ':>11} {sss.n_splits}")
print(f"{'Test %: ':>11} {sss.test_size:.0%}")
print(f"{'Accuracy: ':>11} Mean {np.mean(aa):.2%}, SD {np.std(aa):.2f}")
print(f"{'C@1: ':>11} Mean {np.mean(cc):.2%}, SD {np.std(cc):.2f}")

01/20/2025 07:58:04 [ruzicka:INFO] Fitting the provided score shifter on a 20.0% sample
01/20/2025 07:58:05 [ruzicka:INFO] Fitting on 390 documents in instance mode...
01/20/2025 07:58:05 [ruzicka:INFO] Running verifier on sub-sample
01/20/2025 07:58:05 [ruzicka:INFO] Predicting on 196 documents
01/20/2025 07:58:41 [ruzicka:INFO] Actually fitting...
01/20/2025 07:58:42 [ruzicka:INFO] p1 for optimal combo: 0.578
01/20/2025 07:58:42 [ruzicka:INFO] p2 for optimal combo: 0.782
01/20/2025 07:58:42 [ruzicka:INFO] Objective function result for optimal combo: 93.59%
01/20/2025 07:58:42 [ruzicka:INFO] Starting benchmark: 10 splits, test size 10%
01/20/2025 07:58:43 [ruzicka:INFO] Fitting on 439 documents in instance mode...
01/20/2025 07:58:43 [ruzicka:INFO] Predicting on 98 documents
01/20/2025 07:59:02 [ruzicka:INFO] Accuracy: 91.84% AUC: 99.29% c@1: 95.45% AUC x c@1: 94.77%
01/20/2025 07:59:03 [ruzicka:INFO] Fitting on 439 documents in instance mode...
01/20/2025 07:59:03 [ruzicka:INFO] Pred


   Splits:  10
   Test %:  10%
 Accuracy:  Mean 90.61%, SD 0.03
      C@1:  Mean 95.04%, SD 0.01


## Actually apply the method

Finally, we apply the method to the problem texts. All the texts are classified as certainly closer to Ovidian style than to that of any other author (a score of 1 means '100% confidence'). This suggests that the method, particularly when applied to purely textual input, does not have the statistical power to distinguish between Ovidian imitation (_Consolatio_) and genuine work (_Ibis_, _Medicamina_). This leaves the situation unclear with respect to the _Nux_.

In [13]:
real_verifier = Order2Verifier(
    metric="minmax", base="instance", nb_bootstrap_iter=1000, rnd_prop=0.35
)

In [14]:
real_verifier.fit(vec_ngrams_std.fit_transform(test_corpus.Chunk), np.array(labels))

01/20/2025 08:02:08 [ruzicka:INFO] Fitting on 488 documents in instance mode...


In [15]:
problems = elegy_corpus[elegy_corpus.Author == "ps-Ovid"]
problems

Unnamed: 0,Author,Work,Poem,LEN,Chunk
296,ps-Ovid,Nux,Nux,182,nuks ego junkta wiae kum sim sine krimine wita...
297,ps-Ovid,Medicamina,Medicamina,100,diskite kwae fakiem kommendet kura puellae\net...
298,ps-Ovid,Pamphilus,Pamphilus,198,postkwam pampileas rumor perwenit ad aures\ngl...
299,ps-Ovid,Consolatio,Consolatio 1,158,wisa diu feliks mater modo dikta neronum\njam ...
300,ps-Ovid,Consolatio,Consolatio 2,158,at_kwutinam drusi manus alte_ret altera fratri...
301,ps-Ovid,Consolatio,Consolatio 3,158,kwo raperis laniata komas similiskwe furenti\n...
302,ps-Ovid,Ibis,Ibis 1,64,tempus ad hok lustris bis jam mihi kwinkwe per...
303,ps-Ovid,Ibis,Ibis 2,200,di maris et terrae kwi_kwis meliora tenetis\ni...
304,ps-Ovid,Ibis,Ibis 3,200,kwi_kwokulis karuit per kwos male widerat auru...
305,ps-Ovid,Ibis,Ibis 4,178,aut te dewoweat kertis abdera diebus\nsaksakwe...


In [16]:
t = real_verifier.predict_proba(
    vec_ngrams_std.transform(problems.Chunk),
    np.array(lenc.transform(["Ovid"] * len(problems))),
    nb_imposters=60,
)

01/20/2025 08:02:08 [ruzicka:INFO] Predicting on 10 documents


In [17]:
pd.DataFrame(zip(problems.Poem, t), columns=["Poem", "GI %"])

Unnamed: 0,Poem,GI %
0,Nux,1.0
1,Medicamina,0.9965
2,Pamphilus,0.999
3,Consolatio 1,1.0
4,Consolatio 2,1.0
5,Consolatio 3,1.0
6,Ibis 1,1.0
7,Ibis 2,1.0
8,Ibis 3,1.0
9,Ibis 4,1.0


In [18]:
%load_ext watermark
%watermark -n -u -v -iv -w

Last updated: Mon Jan 20 2025

Python implementation: CPython
Python version       : 3.12.3
IPython version      : 8.20.0

numpy  : 1.26.4
ruzicka: 1.1.0
logging: 0.5.1.2
sklearn: 1.4.2
pandas : 2.2.2

Watermark: 2.5.0

