## Burrows's Delta

Attempts to attribute authorship are typically undertaken in scenarios where there is a large (enough) number of texts securely attributable to a known author, and a text, or at most a small number of texts, of unknown authorship. The attempt is then made to attribute the unknown text to the known author, or to rule out such an attribution. Take *The Federalist* as an example. There are numbers of *The Federalist* of disputed or unknown attribution, a small and well-defined number of candidates for authorship --- Hamilton, Jay, Madison --- to whom those numbers might be attributed, and securely attributed samples from each of the candidates, from the same work no less. Such an approach is not possible in the case of the *dicta* from Gratian’s *Decretum*. As the survey in Chapter 3 above indicated, near-contemporaries knew next to nothing about Gratian. Perhaps most notably, although Gratian was thought to have been a teacher, no one in the generation following made an unambiguous claim to have been his student. **There are no other writings securely, or even insecurely, attributed to him.**

This does not mean that we are unable to apply the established techniques of authorship attribution to Gratian’s *dicta*. It simply means that we will have to adapt the traditional assumptions behind the design of authorship attribution experiments to a different set of circumstances. Every authorship attribution experiment starts from a hypothesis (sometimes implicit rather than explicitly stated) concerning the authorship of the text under consideration. The hypothesis traditional in Gratian studies has been that the *dicta* from the *Decretum* --- defined as the hypothetical case statements (*themata*) plus the first- and second-recension *dicta*, including the *dicta* from *de Pen*. --- are the work of a unitary author, the eponymous Gratian. **In this view, the *dicta* represent a complete, closed corpus: the total population of words attributable to the author Gratian.**

In [1]:
import re

def get_tokens(filename):
    '''open text file and return list of lower-cased word tokens'''
    with open(filename, 'r') as f: # file is closed automatically
        text = f.read().lower() # read file and lower-case text
    # split on non-word characters, dropping punctuation and empty strings
    tokens = [word for word in re.split(r'\W', text) if word != '']
    return tokens
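
As a quick illustration of what the tokenizer keeps and discards, the sketch below applies the same split-and-filter logic to an in-memory string. The helper name `tokenize` and the sample sentence are made up for this example and are not part of the notebook.

```python
import re

def tokenize(text):
    '''hypothetical in-memory variant of get_tokens(), for illustration only'''
    # same logic as get_tokens(): lower-case, split on non-word characters,
    # and drop the empty strings left behind by punctuation and spaces
    return [word for word in re.split(r'\W', text.lower()) if word != '']

sample = 'Quod autem non liceat, probatur.'  # invented sample sentence
tokens = tokenize(sample)
print(tokens)
```

Because `\W` matches anything that is not a letter, digit, or underscore, punctuation disappears but digits and underscores would survive as parts of tokens.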

In [2]:
import pandas as pd

def get_lengths(samples):
    '''return one-row dataframe of word counts, one column per sample'''
    lengths = {}
    for sample in samples:
        lengths[sample] = len(get_tokens(sample + '.txt'))
    return pd.DataFrame(lengths, index = ['words'])

samples = ['Gratian0', 'Gratian1', 'dePen', 'Gratian2']
lengths = get_lengths(samples)
lengths

||Gratian0|Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|-:|
|**words**|3605|56713|10081|14255|


In [3]:
def occurrences(tokens):
    '''create and return token occurrence dictionary, most frequent first'''
    tmp = {}
    for token in tokens:
        tmp[token] = tmp.get(token, 0) + 1
    # sort by descending count, breaking ties alphabetically
    return dict(sorted(tmp.items(), key = lambda item: (-item[1], item[0])))

def get_features(texts, n):
    '''return the n most frequent words across all texts combined'''
    corpus = []
    for text in texts:
        corpus += get_tokens(text + '.txt')
    features = list(occurrences(corpus).keys())[:n]
    return features

# hold the test sample out of feature selection for now;
# it can't be removed from samples until after get_counts()
ignore = 'Gratian0'
mfws = get_features([sample for sample in samples if sample != ignore], 4)
mfws

['in', 'non', 'et', 'est']

In [4]:
def get_counts(features, subcorpora):
    '''return dataframe of feature counts, one column per subcorpus'''
    columns = {}
    for subcorpus in subcorpora:
        tokens = get_tokens(subcorpus + '.txt')
        token_counts = occurrences(tokens) # renamed to avoid shadowing built-in all()
        columns[subcorpus] = [token_counts.get(feature, 0) for feature in features]
    return pd.DataFrame(columns, index = features)

counts = get_counts(mfws, samples)
counts

||Gratian0|Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|-:|
|**in**|74|1450|252|411|
|**non**|24|1360|270|306|
|**et**|70|1293|260|345|
|**est**|13|965|182|167|


**Once we've gotten to this point, we've gathered all the preliminary information we need, and are ready to move the analysis into Pandas dataframes.**

**Explain use of occurrences per 1,000 words instead of percent here. Using occurrences per 1,000 words is more convenient than using percentages, because at that scale the word frequency values we are concerned with (or at least most of them) are greater than 1.0.**
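
A minimal sketch of the scaling argument, using the count of *est* in Gratian0 (13) and the length of Gratian0 (3605 words) from the tables above. The variable names are chosen for this illustration only.

```python
# count and sample length copied from the Gratian0 column of the tables above
count_est = 13   # occurrences of 'est' in Gratian0
length = 3605    # total words in Gratian0

as_percent = count_est / length * 100      # falls below 1.0
per_thousand = count_est / length * 1000   # falls above 1.0

print(f'{as_percent:.6f} percent')
print(f'{per_thousand:.6f} per 1,000 words')
```

The per-1,000 figure should agree with the 3.606103 reported for *est* in Gratian0, while the same frequency expressed as a percentage is a fraction smaller than 1.0.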

In [5]:
frequencies = (counts / lengths.values) * 1000
frequencies

||Gratian0|Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|-:|
|**in**|20.527046|25.56733|24.99752|28.831989|
|**non**|6.65742|23.980393|26.783057|21.466152|
|**et**|19.417476|22.799006|25.791092|24.202034|
|**est**|3.606103|17.015499|18.053765|11.715188|


This is the point where we need to temporarily drop the Gratian0 column. For now, we are only interested in calculating the mean and sample standard deviation of the values in the three columns to which we are comparing the candidate: Gratian1, dePen, and Gratian2.

In [6]:
samples.remove(ignore)
selected = frequencies[samples]
selected

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**in**|25.56733|24.99752|28.831989|
|**non**|23.980393|26.783057|21.466152|
|**et**|22.799006|25.791092|24.202034|
|**est**|17.015499|18.053765|11.715188|


In [7]:
means = selected.mean(axis = 1).to_frame('mean')
means

||mean|
|-:|-:|
|**in**|26.465613|
|**non**|24.076534|
|**et**|24.264044|
|**est**|15.594817|


$s=\sqrt{\frac{1}{N - 1}\sum_{i=1}^N(x_i-\bar{x})^2}$

![Standard Deviation](stdev.jpg)
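
To make the formula concrete, the sketch below recomputes the mean and sample standard deviation for the *in* row by hand, using the three comparison frequencies listed above. It relies on the standard library's `statistics.stdev`, which divides by $N - 1$ as in the formula (pandas `std()` does the same by default); the variable names are illustrative.

```python
import math
import statistics

# frequencies of 'in' per 1,000 words in Gratian1, dePen, and Gratian2
in_frequencies = [25.56733, 24.99752, 28.831989]

mean_in = statistics.mean(in_frequencies)

# the formula spelled out: squared deviations summed, divided by N - 1
squared_deviations = [(x - mean_in) ** 2 for x in in_frequencies]
std_by_hand = math.sqrt(sum(squared_deviations) / (len(in_frequencies) - 1))

# the same quantity from the standard library
std_in = statistics.stdev(in_frequencies)

print(round(mean_in, 6), round(std_by_hand, 6), round(std_in, 6))
```

Both routes should agree with each other and with the values the notebook reports for *in* (a mean of about 26.465613 and a standard deviation of about 2.069051).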

In [8]:
standard_deviations = selected.std(axis = 1).to_frame('std')
standard_deviations

||std|
|-:|-:|
|**in**|2.069051|
|**non**|2.659756|
|**et**|1.497007|
|**est**|3.399727|


$z=\frac{x - \bar{x}}{s}$

![z-score](z-score.jpg)
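
As a hand check of the z-score formula, the sketch below standardizes the four Gratian0 frequencies against the means and standard deviations computed from the comparison columns. All numbers are copied from the tables above; the dictionary names are illustrative.

```python
# Gratian0 frequencies per 1,000 words
gratian0 = {'in': 20.527046, 'non': 6.65742, 'et': 19.417476, 'est': 3.606103}
# means and sample standard deviations of Gratian1, dePen, and Gratian2
means = {'in': 26.465613, 'non': 24.076534, 'et': 24.264044, 'est': 15.594817}
stds = {'in': 2.069051, 'non': 2.659756, 'et': 1.497007, 'est': 3.399727}

# z = (x - mean) / s, feature by feature
z = {word: (gratian0[word] - means[word]) / stds[word] for word in gratian0}

for word, value in z.items():
    print(f'{word}: {value:.6f}')
```

Every value is strongly negative: for each of the four words, the case statements use it much less often than the comparison samples do on average.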

In [9]:
z_scores = (frequencies - means.values) / standard_deviations.values
z_scores

||Gratian0|Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|-:|
|**in**|-2.870189|-0.434152|-0.709549|1.143701|
|**non**|-6.54914|-0.036147|1.017583|-0.981437|
|**et**|-3.237506|-0.978645|1.020068|-0.041422|
|**est**|-3.526376|0.417881|0.723278|-1.141159|


**Again, remember that the means and standard deviations have been computed from the values in the Gratian1, dePen, and Gratian2 columns *only*!**

Now, break the consolidated z-scores dataframe into two dataframes: one for the hypothetical case statements (*themata*), the other for the first- and second-recension *dicta* (including the *dicta* from *de Pen*.) with which we want to compare the case statements.

In [10]:
test = z_scores[[ignore]]
corpus = z_scores[samples]
test

||Gratian0|
|-:|-:|
|**in**|-2.870189|
|**non**|-6.54914|
|**et**|-3.237506|
|**est**|-3.526376|


$\Delta_B = \frac{1}{N}\sum_{i = 1}^N|z_i(t) - z_i(c)|$

![Burrows's Delta](Burrows.jpg)
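
The formula can be checked without pandas. The sketch below takes the z-scores from the table above and computes $\Delta_B$ as the mean absolute difference between the test column (Gratian0) and each comparison column. The function name `burrows_delta` is an illustrative choice, not something defined in the notebook.

```python
def burrows_delta(test_z, corpus_z):
    '''mean absolute difference between two equal-length lists of z-scores'''
    differences = [abs(t - c) for t, c in zip(test_z, corpus_z)]
    return sum(differences) / len(differences)

# z-scores for 'in', 'non', 'et', 'est', copied from the table above
z_gratian0 = [-2.870189, -6.54914, -3.237506, -3.526376]
z_corpus = {
    'Gratian1': [-0.434152, -0.036147, -0.978645, 0.417881],
    'dePen': [-0.709549, 1.017583, 1.020068, 0.723278],
    'Gratian2': [1.143701, -0.981437, -0.041422, -1.141159],
}

deltas = {name: burrows_delta(z_gratian0, zs) for name, zs in z_corpus.items()}
for name, delta in deltas.items():
    print(f'{name}: {delta:.6f}')
```

Since the absolute value is symmetric, the order of subtraction (test minus corpus, or corpus minus test) makes no difference to the result.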

In [11]:
# tmp = (corpus - test.values).abs()
tmp = (test.values - corpus).abs()
tmp

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**in**|2.436037|2.16064|4.01389|
|**non**|6.512993|7.566723|5.567703|
|**et**|2.258861|4.257573|3.196083|
|**est**|3.944257|4.249654|2.385217|


In [12]:
# is there a better way to do this?
deltas = (tmp.mean(axis = 0)).to_frame(ignore).transpose()
deltas

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**Gratian0**|3.788037|4.558648|3.790723|


In [13]:
%%html
<style>
    table {margin-left: 0 !important;}
</style>

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**Gratian0**|3.788037|4.558648|3.790723|

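Calculating Burrows's Delta ($\Delta_B$) for Gratian0 with respect to Gratian1 and Gratian2 alone has limited value. The value of $\Delta_B$ for Gratian0 with respect to both Gratian1 and Gratian2 is 5.120042. This seems to be related to the problem of comparing two samples using a mean-of-means: with only two comparison samples, each feature's pair of z-scores is forced to be equal in magnitude and opposite in sign (1.0 and -1.0 with the population standard deviation, about 0.707 and -0.707 with the sample standard deviation), whatever the underlying frequencies.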
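
The two-sample problem can be seen in isolation. The sketch below standardizes pairs of values against their own mean and standard deviation. The first two pairs are the Gratian1 and Gratian2 frequencies of *in* and *est* from the tables above, and the third is an arbitrary pair added for contrast. The helper name `pair_z_scores` is hypothetical.

```python
import math
import statistics

def pair_z_scores(a, b, population=False):
    '''z-scores of two values against their own mean and standard deviation'''
    mean = statistics.mean([a, b])
    # pstdev divides by N, stdev divides by N - 1
    std = statistics.pstdev([a, b]) if population else statistics.stdev([a, b])
    return ((a - mean) / std, (b - mean) / std)

pairs = [(25.56733, 28.831989), (17.015499, 11.715188), (1.0, 1000.0)]

sample_z = [pair_z_scores(a, b) for a, b in pairs]
population_z = [pair_z_scores(a, b, population=True) for a, b in pairs]

for pair, s, p in zip(pairs, sample_z, population_z):
    print(pair, [round(v, 4) for v in s], [round(v, 4) for v in p])
```

However far apart the two values are, the z-scores carry no information beyond which of the two is larger, which is why a comparison set of only two samples makes for a poor demonstration.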

**This requires explanation of and justification for the fact that we're breaking out dePen1 from Gratian1, but *not* breaking out dePen2 from Gratian2.**

Adding a third comparison, to the first-recension *dicta* in *de Pen*., makes for a better demo. The values of $\Delta_B$ for Gratian0 with respect to Gratian1, dePen1, and Gratian2 are 3.432577, 4.15664, and 3.36245. To put those values of $\Delta_B$ in context, if we swap out the Gratian0 sample (3605 words) for a similarly-sized pseudo-Augustine sample (3917 words), the values of $\Delta_B$ for pseudo-Augustine with respect to Gratian1, dePen1, and Gratian2 are 2.389431, 1.502354, and 3.176282.

In each case, the value of $\Delta_B$ for pseudo-Augustine is *lower* than the corresponding value for Gratian0. Keep in mind that the lowest value of Burrows's Delta ($\Delta_B$) indicates the most likely attribution of authorship. Therefore, by this measure, pseudo-Augustine is *more* likely than the author of Gratian0 to be the author of Gratian1, dePen1, and Gratian2 (considerably more likely in the cases of Gratian1 and dePen1).
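
The comparison can be laid out side by side. The sketch below uses only the $\Delta_B$ values quoted in the preceding paragraphs and reports, for each comparison sample, which candidate is closer and by how much. The variable names are illustrative.

```python
# Delta values quoted in the text (lower means stylistically closer)
comparisons = ['Gratian1', 'dePen1', 'Gratian2']
delta_gratian0 = [3.432577, 4.15664, 3.36245]
delta_pseudo_augustine = [2.389431, 1.502354, 3.176282]

# positive margin: pseudo-Augustine is closer than Gratian0
margins = [g0 - pa for g0, pa in zip(delta_gratian0, delta_pseudo_augustine)]

for name, margin in zip(comparisons, margins):
    closer = 'pseudo-Augustine' if margin > 0 else 'Gratian0'
    print(f'{name}: closer candidate = {closer}, margin = {abs(margin):.6f}')
```

The margin is large for Gratian1 and dePen1 and small for Gratian2, which is the pattern the parenthetical remark above describes.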