## Burrows's Delta

Attempts to attribute authorship are typically undertaken in scenarios where there is a sufficiently large number of texts securely attributable to a known author, and a text, or at most a small number of texts, of unknown authorship. The attempt is then made to attribute the unknown text to the known author, or to rule out such an attribution. Take *The Federalist* as an example. There are numbers of *The Federalist* of disputed or unknown attribution; a small and well-defined set of candidates for authorship (Hamilton, Jay, Madison) to whom those numbers might be attributed; and securely attributed samples from each of the candidates, from the same work no less. Such an approach is not possible in the case of the *dicta* from Gratian’s *Decretum*. As the survey in Chapter 3 above indicated, near-contemporaries knew next to nothing about Gratian. Perhaps most notably, although Gratian was thought to have been a teacher, no one in the generation following made an unambiguous claim to have been his student. **There are no other writings securely, or even insecurely, attributed to him.**

This does not mean that we are unable to apply the established techniques of authorship attribution to Gratian’s *dicta*. It simply means that we will have to adapt the traditional assumptions behind the design of authorship attribution experiments to a different set of circumstances. Every authorship attribution experiment starts from a hypothesis (sometimes stated implicitly rather than explicitly) concerning the authorship of the text under consideration. The hypothesis traditional in Gratian studies has been that the *dicta* from the *Decretum* are the work of a unitary author, the eponymous Gratian, where the *dicta* are defined as the hypothetical case statements (*themata*) plus the first- and second-recension *dicta*, including the *dicta* from *de Pen*. **In this view, the *dicta* represent a complete, closed corpus: the total population of words attributable to the author Gratian.**

In [1]:
import re

def get_tokens(filename):
    '''open text file and return list of lower-cased word tokens'''
    # the context manager closes the file once it has been read
    with open(filename, 'r') as f:
        text = f.read().lower()
    # split on non-word characters (raw string, so '\W' is not an invalid escape)
    # and drop the empty strings left behind by punctuation and whitespace
    tokens = [word for word in re.split(r'\W', text) if word != '']
    return tokens

In [2]:
import pandas as pd

def get_lengths(samples):
    '''return a one-row dataframe of word counts, one column per sample'''
    lengths = {}
    for sample in samples:
        lengths[sample] = len(get_tokens(sample + '.txt'))
    return pd.DataFrame(lengths, index = ['words'])

samples = ['Gratian0', 'Gratian1', 'Gratian2']
lengths = get_lengths(samples)
lengths

| | Gratian0 | Gratian1 | Gratian2 |
|---|---|---|---|
| words | 3605 | 66238 | 14811 |


In [3]:
def occurrences(tokens):
    '''create and return token occurrence dictionary'''
    types = list(set(tokens))
    tmp = dict.fromkeys(types, 0)
    for token in tokens: tmp[token] += 1
    occurrences = {
        key: value for key, value in sorted(tmp.items(),
        key = lambda item: (-item[1], item[0]))
    }
    return occurrences

def get_features(texts, n):
    corpus = []
    for text in texts:
        corpus += get_tokens(text + '.txt')
    features = list(occurrences(corpus).keys())[:n]
    return features

# mfws = get_features(samples, 4)
mfws = get_features(['Gratian1', 'Gratian2'], 4)
mfws

['in', 'non', 'et', 'est']
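The same kind of ranked list can also be produced with `collections.Counter` from the standard library. The following is only a sketch under stated assumptions: the helper name `most_frequent` and the toy Latin sentence are illustrative and not part of the notebook, and ties are broken alphabetically to mirror the sort key used in `occurrences`.

```python
import re
from collections import Counter

def most_frequent(text, n):
    '''return the n most frequent word tokens in a string (hypothetical helper)'''
    tokens = [word for word in re.split(r'\W', text.lower()) if word != '']
    counts = Counter(tokens)
    # sort by descending count, then alphabetically, as occurrences() does
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# toy input, not drawn from the Decretum samples
toy = 'Non est in lege, et non est in canone; non in uerbis.'
print(most_frequent(toy, 3))
```

Sorting on `(-count, word)` keeps the result deterministic when two words share a count, which matters when the feature list is cut off at a fixed `n`.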

In [4]:
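def get_counts(features, subcorpora):
    '''return a dataframe of feature counts, one column per subcorpus'''
    columns = {}
    for subcorpus in subcorpora:
        columns[subcorpus] = []
        tokens = get_tokens(subcorpus + '.txt')
        # renamed from `all`, which shadowed the Python builtin
        type_counts = occurrences(tokens)
        for feature in features:
            # a feature absent from this subcorpus counts as 0
            columns[subcorpus].append(type_counts.get(feature, 0))
    return pd.DataFrame(columns, index = features)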

counts = get_counts(mfws, samples)
counts

| | Gratian0 | Gratian1 | Gratian2 |
|---|---|---|---|
| in | 74 | 1682 | 431 |
| non | 24 | 1622 | 314 |
| et | 70 | 1542 | 356 |
| est | 13 | 1136 | 178 |


**Once we've gotten to this point, we've gathered all the preliminary information we need, and are ready to move the analysis into Pandas dataframes.**

**Occurrences per 1,000 words are more convenient here than percentages, because at that scale most of the word frequency values we are concerned with are greater than 1.0.**
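As a small worked illustration of the difference in scale, the sketch below uses two Gratian0 counts from the table above. The variable names are illustrative only.

```python
# counts and sample length copied from the tables above
length = 3605      # total words in Gratian0
count_in = 74      # occurrences of 'in' in Gratian0
count_est = 13     # occurrences of 'est' in Gratian0

in_percent = count_in / length * 100
in_per_thousand = count_in / length * 1000

est_percent = count_est / length * 100        # falls below 1.0
est_per_thousand = count_est / length * 1000  # rises above 1.0

print(round(in_percent, 2), round(in_per_thousand, 2))
print(round(est_percent, 2), round(est_per_thousand, 2))
```

Expressed as a percentage, the frequency of *est* in Gratian0 is a fraction smaller than 1; per 1,000 words it is a number between 3 and 4.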

In [5]:
frequencies = (counts / lengths.values) * 1000
frequencies


| | Gratian0 | Gratian1 | Gratian2 |
|---|---|---|---|
| in | 20.527046 | 25.393279 | 29.099993 |
| non | 6.65742 | 24.487454 | 21.200459 |
| et | 19.417476 | 23.279688 | 24.036189 |
| est | 3.606103 | 17.150276 | 12.018095 |


This is the point where we need to temporarily drop the Gratian0 column. We're only interested at this point in calculating the mean and sample standard deviation of the values in the two columns we're comparing the candidate to: Gratian1 and Gratian2.

In [6]:
selected = frequencies[['Gratian1', 'Gratian2']]
selected

| | Gratian1 | Gratian2 |
|---|---|---|
| in | 25.393279 | 29.099993 |
| non | 24.487454 | 21.200459 |
| et | 23.279688 | 24.036189 |
| est | 17.150276 | 12.018095 |


In [7]:
means = selected.mean(axis = 1).to_frame('mean')
means

| | mean |
|---|---|
| in | 27.246636 |
| non | 22.843957 |
| et | 23.657939 |
| est | 14.584185 |


$s=\sqrt{\frac{1}{N - 1}\sum_{i=1}^N(x_i-\bar{x})^2}$
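To make the formula concrete, here is a minimal sketch that applies it to the *in* row of the Gratian1 and Gratian2 frequencies, using only the standard library. `statistics.stdev` divides by $N - 1$, which is also the default behaviour of the pandas `std` method (`ddof=1`).

```python
import statistics

# frequencies of 'in' per 1,000 words in Gratian1 and Gratian2 (from the table above)
values = [25.393279, 29.099993]

mean = statistics.mean(values)
# statistics.stdev uses the N - 1 denominator, i.e. the sample standard deviation
s = statistics.stdev(values)

print(round(mean, 6), round(s, 6))
```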

In [8]:
standard_deviations = selected.std(axis = 1).to_frame('std')
standard_deviations

| | std |
|---|---|
| in | 2.621043 |
| non | 2.324257 |
| et | 0.534927 |
| est | 3.629 |


$z=\frac{x - \bar{x}}{s}$

In [9]:
z_scores = (frequencies - means.values) / standard_deviations.values
z_scores

| | Gratian0 | Gratian1 | Gratian2 |
|---|---|---|---|
| in | -2.563709 | -0.707107 | 0.707107 |
| non | -6.964178 | 0.707107 | -0.707107 |
| et | -7.927182 | -0.707107 | 0.707107 |
| est | -3.025098 | 0.707107 | -0.707107 |


**Again, remember that the means and standard deviations have been computed from the values in the Gratian1 and Gratian2 columns *only*!**
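One consequence is worth making explicit. When the mean and sample standard deviation are computed from exactly two values, the z-scores of those two values are always $\pm 1/\sqrt{2} \approx \pm 0.707107$, whatever the values themselves are. This is why the Gratian1 and Gratian2 columns contain nothing else. A minimal sketch, with a hypothetical helper `two_sample_z` and one arbitrary made-up pair alongside the *in* row:

```python
import math
import statistics

def two_sample_z(a, b):
    '''z-scores of a and b relative to their own mean and sample std (hypothetical helper)'''
    mean = statistics.mean([a, b])
    s = statistics.stdev([a, b])
    return (a - mean) / s, (b - mean) / s

# first pair is arbitrary; second pair is the 'in' row of the frequency table
for a, b in [(1.0, 9.0), (25.393279, 29.099993)]:
    za, zb = two_sample_z(a, b)
    print(round(za, 6), round(zb, 6))
```

With only two comparison samples, then, all of the information about distance is carried by the z-scores of the test sample.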

Now, break the consolidated z-scores dataframe into two dataframes: one for the hypothetical case statements (*themata*), the other for the first- and second-recension *dicta* (including the *dicta* from *de Pen*.) with which we want to compare the case statements.

In [10]:
test = z_scores[['Gratian0']]
corpus = z_scores[['Gratian1', 'Gratian2']]
test

| | Gratian0 |
|---|---|
| in | -2.563709 |
| non | -6.964178 |
| et | -7.927182 |
| est | -3.025098 |


$\Delta = \frac{\sum_{i=1}^n|z_{t(i)} - z_{c(i)}|}{n}$

Here $z_{t(i)}$ is the z-score of feature $i$ in the test sample, $z_{c(i)}$ is the z-score of the same feature in a comparison sample, and $n$ is the number of features.

![Burrows's Delta](Burrows2_150.jpg)

$\Delta_c = \frac{\sum_{i=1}^n|z_{t(i)} - z_{c(i)}|}{n}$
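The formula amounts to a mean absolute difference between two columns of z-scores. As a sketch only, using plain Python lists rather than dataframes, with a hypothetical function name `burrows_delta` and z-scores copied from the tables above:

```python
def burrows_delta(z_test, z_candidate):
    '''mean absolute difference between two equal-length lists of z-scores'''
    n = len(z_test)
    return sum(abs(t - c) for t, c in zip(z_test, z_candidate)) / n

# z-scores copied from the tables above, in the order in, non, et, est
z_gratian0 = [-2.563709, -6.964178, -7.927182, -3.025098]
z_gratian1 = [-0.707107, 0.707107, -0.707107, 0.707107]
z_gratian2 = [0.707107, -0.707107, 0.707107, -0.707107]

print(round(burrows_delta(z_gratian0, z_gratian1), 6))
print(round(burrows_delta(z_gratian0, z_gratian2), 6))
```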

In [11]:
# tmp = (corpus - test.values).abs()
tmp = (test.values - corpus).abs()
tmp

| | Gratian1 | Gratian2 |
|---|---|---|
| in | 1.856602 | 3.270815 |
| non | 7.671285 | 6.257071 |
| et | 7.220075 | 8.634289 |
| est | 3.732205 | 2.317991 |


In [12]:
# better way to do this?
deltas = (tmp.mean(axis = 0)).to_frame('Gratian0').transpose()
deltas

| | Gratian1 | Gratian2 |
|---|---|---|
| Gratian0 | 5.120042 | 5.120042 |
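As a cross-check, the whole chain from raw counts to Delta can be retraced without pandas. The sketch below is not the notebook's method: it hard-codes the counts and lengths from the tables above, uses only the standard library, and its helper name `z_scores_for` is an assumption made for illustration.

```python
import statistics

# raw counts and sample lengths copied from the tables above
lengths = {'Gratian0': 3605, 'Gratian1': 66238, 'Gratian2': 14811}
counts = {
    'in':  {'Gratian0': 74, 'Gratian1': 1682, 'Gratian2': 431},
    'non': {'Gratian0': 24, 'Gratian1': 1622, 'Gratian2': 314},
    'et':  {'Gratian0': 70, 'Gratian1': 1542, 'Gratian2': 356},
    'est': {'Gratian0': 13, 'Gratian1': 1136, 'Gratian2': 178},
}
test, corpus = 'Gratian0', ['Gratian1', 'Gratian2']

def z_scores_for(row):
    '''counts -> per-1,000 frequencies -> z-scores (mean and std from corpus samples only)'''
    freq = {name: row[name] / lengths[name] * 1000 for name in row}
    corpus_freqs = [freq[name] for name in corpus]
    mean = statistics.mean(corpus_freqs)
    s = statistics.stdev(corpus_freqs)
    return {name: (freq[name] - mean) / s for name in freq}

z = {word: z_scores_for(row) for word, row in counts.items()}

# Delta of the test sample against each corpus sample: mean absolute z-score difference
deltas = {
    name: sum(abs(z[word][test] - z[word][name]) for word in z) / len(z)
    for name in corpus
}
print({name: round(value, 6) for name, value in deltas.items()})
```

Working from the unrounded frequencies, this should agree with the dataframe result to the precision shown in the table.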
