### Burrows's Delta

In [1]:
path = './corpus/a/'

In [2]:
import re

def get_tokens(filename):
    '''open text file and return list of tokens'''
    # read the whole file and lower-case it; the with-block closes the file
    with open(filename, 'r') as f:
        text = f.read().lower()
    # split on non-word characters and drop empty strings (removes punctuation)
    tokens = [word for word in re.split(r'\W', text) if word != '']
    return tokens
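
To see what this tokenizer does, the same lower-case, split, and filter steps can be applied to a string instead of a file. This is only a sketch: the helper name `tokenize_text` and the sample sentence are illustrative assumptions, not part of the notebook or the corpus.

```python
import re

def tokenize_text(text):
    '''hypothetical helper: the same steps as get_tokens, applied to a string'''
    return [word for word in re.split(r'\W', text.lower()) if word != '']

sample = 'In nomine Domini: non est, et est.'  # made-up sentence for illustration
tokens = tokenize_text(sample)
print(tokens)
```

Note that `\W` treats digits and the underscore as word characters, so numerals in a text would survive as tokens.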

In [3]:
import nltk

def get_tokens_fdl(filename):
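    '''open text file and return list of NLTK tokens containing at least one letter'''
    with open(filename, 'r') as f: # open file; closed automatically
        text = f.read().lower() # read and lower-case text
    tokens = nltk.word_tokenize(text) # requires the NLTK 'punkt' tokenizer data
    # keep only tokens with at least one alphabetic character (drops bare punctuation and numbers)
    return [token for token in tokens if any(c.isalpha() for c in token)]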

In [4]:
def get_features(samples):
    tokens = []
    for sample in samples:
        tokens += get_tokens(path + sample + '.txt')
    types = list(set(tokens)) # create unordered list of unique words
    tmp = dict.fromkeys(types, 0) # create temporary dictionary, initialize counts to 0
    for token in tokens: tmp[token] += 1 # count words
    # re-order words in temporary dictionary numerically by descending frequency
    # re-order words with same frequency alphabetically
    features = { 
        key: value for key, value in sorted(tmp.items(),
        key = lambda item: (-item[1], item[0]))
    }
    return features
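
The ordering rule in `get_features` (descending frequency, ties broken alphabetically) can be checked on a small list. The sketch below uses `collections.Counter` in place of the hand-built dictionary; the token list is made up for illustration.

```python
from collections import Counter

tokens = ['et', 'in', 'et', 'non', 'in', 'est']  # made-up token list
counts = Counter(tokens)

# sort by descending count, then alphabetically for words with equal counts
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
features = dict(ranked)  # dicts preserve insertion order in Python 3.7+
mfws = list(features.keys())[:2]  # the two most frequent words

print(ranked)
print(mfws)
```

The alphabetical tie-break matters because it makes the choice of MFWs deterministic when two words have the same count at the cut-off.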

In [5]:
import pandas as pd

def get_counts(features, samples):
    columns = {}
    for sample in samples:
        columns[sample] = []
        tmp = get_features([sample])
        for feature in features:
            columns[sample].append(tmp.get(feature, 0))
    return pd.DataFrame(columns, index = features)

In [6]:
def get_lengths(samples):
    lengths = {}
    for sample in samples:
        # total number of tokens (not unique words) in each sample
        lengths[sample] = len(get_tokens(path + sample + '.txt'))
    return pd.DataFrame(lengths, index = ['words'])

In [7]:
limit = 4 # 4 most frequent words (MFWs)
samples = ['Gratian1', 'dePen', 'Gratian2']
unknown = 'Gratian0'
features = get_features(samples)
mfws = list(features.keys())[:limit]
counts = get_counts(mfws, [unknown] + samples)
lengths = get_lengths([unknown] + samples)
frequencies = (counts / lengths.values) * 1000
means = frequencies[samples].mean(axis = 1).to_frame('mean')
standard_deviations = frequencies[samples].std(axis = 1).to_frame('std')
z_scores = (frequencies - means.values) / standard_deviations.values
pd.options.display.float_format = '{:,.4f}'.format
z_scores

| MFW | Gratian0 | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|---:|
| in | -2.8702 | -0.4342 | -0.7095 | 1.1437 |
| non | -6.5491 | -0.0361 | 1.0176 | -0.9814 |
| et | -3.2375 | -0.9786 | 1.0201 | -0.0414 |
| est | -3.5264 | 0.4179 | 0.7233 | -1.1412 |
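
The normalization in the cell above takes the mean and standard deviation from the known samples only, and then applies them to every column, including the unknown. As a result, the z-scores of the known samples sum to zero for each word (for *in*: -0.4342 - 0.7095 + 1.1437 = 0), while the unknown sample is free to lie far from zero. A minimal sketch for a single word, using hypothetical counts and lengths rather than corpus data:

```python
from statistics import mean, stdev

# hypothetical counts of one word and hypothetical sample lengths (not corpus data)
counts = {'unknown': 30, 'A': 50, 'B': 66, 'C': 70}
lengths = {'unknown': 2000, 'A': 2500, 'B': 3000, 'C': 3500}
known = ['A', 'B', 'C']

# occurrences per 1,000 words
freqs = {name: counts[name] / lengths[name] * 1000 for name in counts}

# mean and sample standard deviation (n - 1) over the known samples only;
# pandas .std() also uses n - 1 by default
mu = mean([freqs[name] for name in known])
sigma = stdev([freqs[name] for name in known])

z = {name: (freqs[name] - mu) / sigma for name in freqs}
for name, value in z.items():
    print(f'{name}: {value:.4f}')
```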


---

Authorship attribution is typically attempted when there is a sufficiently large number of texts securely attributable to a known author, and a text, or at most a small number of texts, of unknown authorship. The goal is then either to attribute the unknown text to a known author or to rule out such an attribution. Take the *Federalist* as an example. Some numbers of the *Federalist* are of disputed or unknown attribution; there is a small and well-defined set of candidates for authorship (Hamilton, Jay, and Madison) to whom those numbers might be attributed; and there are securely attributed samples from each of the candidates, conveniently from the same work.

Such an approach is obviously not possible in the case of the *dicta* from Gratian's *Decretum*. As the survey in Chapter 3 above indicated, near-contemporaries knew next to nothing about Gratian. Perhaps most notably, although Gratian was thought to have been a teacher, no one in the generation following made an unambiguous claim to have been his student. There are no other writings securely, or even insecurely, attributed to him. That does not mean that we cannot apply the established techniques of authorship attribution to Gratian's *dicta*, but it does mean that we have to make careful decisions about experimental design.

Introducing Burrows's Delta at this point advances the argument in two ways. As a technical matter, Burrows's Delta removes the restriction to the two or, at best, three word-frequency dimensions that the human mind can visualize: it collapses distance measurements across an arbitrary number of dimensions into a single metric, the 'Delta'. It can also be adapted fairly straightforwardly to our particular situation, in which there are no other texts securely attributed to Gratian with which to compare, for example, the hypothetical case statements (*themata*) or the second-recension *dicta*.

Two experiments follow that demonstrate how Burrows's Delta can be applied to meet both of these objectives. The first compares four subcorpora: Gratian0 (the hypothetical case statements or *themata*), Gratian1 (the first-recension *dicta* excluding the *dicta* from *de Penitentia*), dePen (first- and second-recension *dicta* from *de Penitentia*), and Gratian2 (the second-recension *dicta* excluding the *dicta* from *de Penitentia*). The basis for comparison is the frequency of occurrence of the four most frequent words (MFWs) in Gratian's *dicta*. We will hypothesize that the subcorpus containing the hypothetical case statements (*themata*) is the work of an unknown author, and will treat the other three subcorpora as making up a corpus of works by a known author. Using four subcorpora and four dimensions keeps the solution compact enough to show all of the intermediate steps.

The second experiment will compare the thirty most frequent words (MFWs) across fourteen subcorpora: cases (C.1-36 d.init.), laws (D.1-20 R1 *dicta*), orders1 (D.21-80 R1 *dicta*), orders2 (D.81-101 R1 *dicta*), simony (C.1 R1 *dicta*), procedure (C.2-6 R1 *dicta*), other1 (C.7-10 R1 *dicta*), other2 (C.11-15 R1 *dicta*), monastic (C.16-20 R1 *dicta*), other3 (C.21-22 R1 *dicta*), heresy (C.23-26 R1 *dicta*), marriage (C.27-36 R1 *dicta*), penance (R1 and R2 *dicta* from *de Penitentia*), and second (all R2 *dicta*, excluding those from *de Penitentia*).[<sup>1</sup>](#fn1) We will hypothesize each of the fourteen subcorpora in turn to be the work of an unknown author, and will treat the other thirteen subcorpora as composing a corpus of works by a known author. The scale of the second experiment is closer to that of the experiments carried out by John Burrows and David Hoover, the pioneers of the technique, but that scale makes it impractical to show the results at every intermediate step in the process.
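
The rotation described above, in which each subcorpus is held out in turn as the unknown, can be sketched as follows. The subcorpus names are those listed in the paragraph; the variable `rounds` is an illustrative name only.

```python
candidates = ['cases', 'laws', 'orders1', 'orders2', 'simony', 'procedure',
              'other1', 'other2', 'monastic', 'other3', 'heresy', 'marriage',
              'penance', 'second']

rounds = []
for unknown in candidates:
    # every subcorpus except the held-out one forms the "known author" corpus
    samples = [candidate for candidate in candidates if candidate != unknown]
    rounds.append((unknown, samples))

print(len(rounds), len(rounds[0][1]))
```

Each round needs its own MFW list, means, and standard deviations, because the composition of the known corpus changes whenever a different subcorpus is held out.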

<span id="fn1">The division of the first-recension (R1) *dicta* into twelve sections follows the division of Gratian’s *Decretum* proposed by Alfred Beyer in *Lokale Abbreviationen des Decretum Gratiani: Analyse und Vergleich der Dekretabbreviationen "Omnes leges aut divine" (Bamberg), "Humanum genus duobus regitur" (Pommersfelden) und "De his qui intra claustra monasterii consistunt" (Lichtenthal, Baden-Baden)*, Bamberger Theologische Studien 6 (Frankfurt am Main: P. Lang, 1998), 17-18.</span>

In [8]:
test = z_scores[[unknown]]
corpus = z_scores[samples]
test

| MFW | Gratian0 |
|---|---:|
| in | -2.8702 |
| non | -6.5491 |
| et | -3.2375 |
| est | -3.5264 |


$\Delta_B = \frac{1}{N}\sum_{i = 1}^N|z_i(t) - z_i(c)|$

![Burrows's Delta](JPGs/Burrows.jpg)
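
In the formula, $z_i(t)$ is the z-score of the $i$-th most frequent word in the test (unknown) sample, $z_i(c)$ is the corresponding z-score in a candidate sample, and $N$ is the number of words compared. As a worked example, the sketch below applies the formula to the four-word z-scores printed in the output of cell 7. The function name `burrows_delta` is an illustrative assumption, and because the inputs are already rounded to four decimal places, the results agree with the notebook's own values only to about three decimal places.

```python
# z-scores copied from the notebook output for cell 7 (rounded to 4 places)
z_scores = {
    'Gratian0': {'in': -2.8702, 'non': -6.5491, 'et': -3.2375, 'est': -3.5264},
    'Gratian1': {'in': -0.4342, 'non': -0.0361, 'et': -0.9786, 'est': 0.4179},
    'dePen':    {'in': -0.7095, 'non': 1.0176, 'et': 1.0201, 'est': 0.7233},
    'Gratian2': {'in': 1.1437, 'non': -0.9814, 'et': -0.0414, 'est': -1.1412},
}

def burrows_delta(test, candidate):
    '''mean absolute difference between two sets of z-scores'''
    return sum(abs(test[word] - candidate[word]) for word in test) / len(test)

deltas = {
    name: burrows_delta(z_scores['Gratian0'], z_scores[name])
    for name in ['Gratian1', 'dePen', 'Gratian2']
}
for name, delta in deltas.items():
    print(f'{name}: {delta:.3f}')
```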

In [9]:
differences = (test.values - corpus).abs()
differences

| MFW | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|
| in | 2.436 | 2.1606 | 4.0139 |
| non | 6.513 | 7.5667 | 5.5677 |
| et | 2.2589 | 4.2576 | 3.1961 |
| est | 3.9443 | 4.2497 | 2.3852 |


In [10]:
row = (differences.mean(axis = 0)).to_frame(unknown).transpose()
row

| unknown | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|
| Gratian0 | 3.788 | 4.5586 | 3.7907 |


In [11]:
path = './corpus/a/'

# author candidates, e.g. Gratian 1, the Master of Penance, Gratian 2, etc.
candidates = ['Gratian0', 'Gratian1', 'dePen', 'Gratian2']
# candidates = ['cases', 'laws', 'orders1', 'orders2', 'simony', 'procedure', 'other1', 'other2', 'monastic', 'other3', 'heresy', 'marriage', 'penance', 'second']
deltas = pd.DataFrame(columns = candidates)
limit = 4 # 4 most frequent words (MFWs)
for candidate in candidates:
    unknown = candidate
    samples = candidates[:]
    samples.remove(unknown)
    features = get_features(samples)
    mfws = list(features.keys())[:limit]
    counts = get_counts(mfws, [unknown] + samples)
    lengths = get_lengths([unknown] + samples)
    frequencies = (counts / lengths.values) * 1000
    means = frequencies[samples].mean(axis = 1).to_frame('mean')
    standard_deviations = frequencies[samples].std(axis = 1).to_frame('std')
    z_scores = (frequencies - means.values) / standard_deviations.values
    test = z_scores[[unknown]]
    corpus = z_scores[samples]
    differences = (test.values - corpus).abs()
    row = (differences.mean(axis = 0)).to_frame(unknown).transpose()
    deltas = pd.concat([deltas, row]) # DataFrame.append was removed in pandas 2.0
# csv = open('./CSVs/deltas.csv', 'w')
# csv.write(deltas.to_csv())
# csv.close()
deltas


| unknown | Gratian0 | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|---:|
| Gratian0 | NaN | 3.788 | 4.5586 | 3.7907 |
| Gratian1 | 1.4361 | NaN | 0.3628 | 0.5453 |
| dePen | 1.9873 | 0.4515 | NaN | 0.7673 |
| Gratian2 | 1.7185 | 0.6278 | 0.7905 | NaN |
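
Each row of this table holds the Deltas obtained when the row's subcorpus is treated as the unknown. In Burrows's method the candidate with the smallest Delta is the one least unlike the test sample. A minimal sketch of reading the table that way, with the values copied from the output above and the diagonal omitted:

```python
# Deltas copied from the table above; keys of the outer dict are the unknowns
deltas = {
    'Gratian0': {'Gratian1': 3.788, 'dePen': 4.5586, 'Gratian2': 3.7907},
    'Gratian1': {'Gratian0': 1.4361, 'dePen': 0.3628, 'Gratian2': 0.5453},
    'dePen':    {'Gratian0': 1.9873, 'Gratian1': 0.4515, 'Gratian2': 0.7673},
    'Gratian2': {'Gratian0': 1.7185, 'Gratian1': 0.6278, 'dePen': 0.7905},
}

# for each unknown, pick the candidate with the smallest Delta
nearest = {unknown: min(row, key=row.get) for unknown, row in deltas.items()}
for unknown, candidate in nearest.items():
    print(f'{unknown}: nearest candidate is {candidate} ({deltas[unknown][candidate]})')
```

The table is not symmetric: the Delta for Gratian0 against Gratian1 (3.788) differs from the Delta for Gratian1 against Gratian0 (1.4361), because the means and standard deviations are recomputed from a different known corpus in each round. Every value in the Gratian0 row exceeds 3.7, while no value in any other row exceeds 2.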


Saved results from running this notebook with the **PLE** tokenizer:

In [12]:
saved = pd.read_csv('CSVs/saved1.csv', index_col=0)
saved

| unknown | Gratian0 | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|---:|
| Gratian0 | NaN | 3.788 | 4.5586 | 3.7907 |
| Gratian1 | 1.4361 | NaN | 0.3628 | 0.5453 |
| dePen | 1.9873 | 0.4515 | NaN | 0.7673 |
| Gratian2 | 1.7185 | 0.6278 | 0.7905 | NaN |


Output predicted by [burrows2.py](https://github.com/decretist/Delta/blob/master/burrows/burrows2.py):

In [13]:
predicted = pd.read_csv('CSVs/predicted0.csv', index_col=0)
predicted

| unknown | Gratian0 | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|---:|
| Gratian0 | 0.0 | 3.788 | 4.5586 | 3.7907 |
| Gratian1 | 1.4361 | 0.0 | 0.3628 | 0.5453 |
| dePen | 1.9873 | 0.4515 | 0.0 | 0.7673 |
| Gratian2 | 1.7185 | 0.6278 | 0.7905 | 0.0 |
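
The saved and predicted tables differ only on the diagonal, which is `NaN` in the notebook output and `0.0` in the output of burrows2.py. One way to confirm that the two agree everywhere else is to compare the off-diagonal cells within a tolerance. This is a sketch using plain lists with the values copied from the two tables; the function name `off_diagonal_match` and the tolerance are assumptions.

```python
import math

nan = float('nan')
# rows and columns in the order Gratian0, Gratian1, dePen, Gratian2
saved = [
    [nan, 3.788, 4.5586, 3.7907],
    [1.4361, nan, 0.3628, 0.5453],
    [1.9873, 0.4515, nan, 0.7673],
    [1.7185, 0.6278, 0.7905, nan],
]
predicted = [
    [0.0, 3.788, 4.5586, 3.7907],
    [1.4361, 0.0, 0.3628, 0.5453],
    [1.9873, 0.4515, 0.0, 0.7673],
    [1.7185, 0.6278, 0.7905, 0.0],
]

def off_diagonal_match(a, b, tol=1e-4):
    '''True if every off-diagonal cell of a and b agrees within tol'''
    size = len(a)
    return all(
        math.isclose(a[i][j], b[i][j], abs_tol=tol)
        for i in range(size) for j in range(size) if i != j
    )

print(off_diagonal_match(saved, predicted))  # prints True
```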


---

Saved results from running this notebook with the **FDL** tokenizer:

In [14]:
saved = pd.read_csv('CSVs/saved0.csv', index_col=0)
saved

| unknown | Gratian0 | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|---:|
| Gratian0 | NaN | 3.7871 | 4.5482 | 3.7864 |
| Gratian1 | 1.4397 | NaN | 0.3589 | 0.5487 |
| dePen | 1.98 | 0.4457 | NaN | 0.7621 |
| Gratian2 | 1.7222 | 0.6314 | 0.7889 | NaN |


```
Delta score for candidate Gratian1 is 3.7870825924138742
Delta score for candidate dePen is 4.54819908193029
Delta score for candidate Gratian2 is 3.7863551467454863

Delta score for candidate Gratian0 is 1.4397470876402583
Delta score for candidate dePen is 0.35894332284182573
Delta score for candidate Gratian2 is 0.5486849685416524

Delta score for candidate Gratian0 is 1.9800015811392637
Delta score for candidate Gratian1 is 0.44572269950080656
Delta score for candidate Gratian2 is 0.762058096218929

Delta score for candidate Gratian0 is 1.7221751707509894
Delta score for candidate Gratian1 is 0.6313943206230836
Delta score for candidate dePen is 0.7888802410102747
```

---

In [15]:
path = './corpus/a/'

# author candidates, e.g. pseudo-Augustine, Gratian 1, the Master of Penance, Gratian 2, etc.
candidates = ['psAug', 'Gratian1', 'dePen', 'Gratian2']
# candidates = ['cases', 'laws', 'orders1', 'orders2', 'simony', 'procedure', 'other1', 'other2', 'monastic', 'other3', 'heresy', 'marriage', 'penance', 'second']
deltas = pd.DataFrame(columns = candidates)
limit = 4 # 4 most frequent words (MFWs)
for candidate in candidates:
    unknown = candidate
    samples = candidates[:]
    samples.remove(unknown)
    features = get_features(samples)
    mfws = list(features.keys())[:limit]
    counts = get_counts(mfws, [unknown] + samples)
    lengths = get_lengths([unknown] + samples)
    frequencies = (counts / lengths.values) * 1000
    means = frequencies[samples].mean(axis = 1).to_frame('mean')
    standard_deviations = frequencies[samples].std(axis = 1).to_frame('std')
    z_scores = (frequencies - means.values) / standard_deviations.values
    test = z_scores[[unknown]]
    corpus = z_scores[samples]
    differences = (test.values - corpus).abs()
    row = (differences.mean(axis = 0)).to_frame(unknown).transpose()
    deltas = pd.concat([deltas, row]) # DataFrame.append was removed in pandas 2.0
# csv = open('./CSVs/deltas.csv', 'w')
# csv.write(deltas.to_csv())
# csv.close()
deltas


| unknown | psAug | Gratian1 | dePen | Gratian2 |
|---|---:|---:|---:|---:|
| psAug | NaN | 2.6456 | 1.7373 | 3.4318 |
| Gratian1 | 1.0228 | NaN | 0.4653 | 0.9325 |
| dePen | 0.5178 | 0.4733 | NaN | 1.3453 |
| Gratian2 | 5.2005 | 3.3574 | 4.2857 | NaN |


---

In [16]:
path = './corpus/b/'

candidates = ['cases', 'laws', 'orders1', 'orders2', 'simony', 'procedure', 'other1', 'other2', 'monastic', 'other3', 'heresy', 'marriage', 'penance', 'second']
deltas = pd.DataFrame(columns = candidates)
limit = 30 # 30 most frequent words (MFWs)
for candidate in candidates:
    unknown = candidate
    samples = candidates[:]
    samples.remove(unknown)
    features = get_features(samples)
    mfws = list(features.keys())[:limit]
    counts = get_counts(mfws, [unknown] + samples)
    lengths = get_lengths([unknown] + samples)
    frequencies = (counts / lengths.values) * 1000
    means = frequencies[samples].mean(axis = 1).to_frame('mean')
    standard_deviations = frequencies[samples].std(axis = 1).to_frame('std')
    z_scores = (frequencies - means.values) / standard_deviations.values
    test = z_scores[[unknown]]
    corpus = z_scores[samples]
    differences = (test.values - corpus).abs()
    row = (differences.mean(axis = 0)).to_frame(unknown).transpose()
    deltas = pd.concat([deltas, row]) # DataFrame.append was removed in pandas 2.0
# csv = open('./CSVs/deltas.csv', 'w')
# csv.write(deltas.to_csv())
# csv.close()
deltas


| unknown | cases | laws | orders1 | orders2 | simony | procedure | other1 | other2 | monastic | other3 | heresy | marriage | penance | second |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| cases | NaN | 2.2765 | 1.9247 | 2.0252 | 1.9637 | 1.9545 | 1.5714 | 2.2782 | 1.7622 | 2.3628 | 1.8717 | 1.8923 | 1.8589 | 1.6334 |
| laws | 2.141 | NaN | 1.249 | 1.502 | 1.4633 | 1.3147 | 1.4223 | 1.4369 | 1.1931 | 1.4345 | 1.1875 | 1.1924 | 1.6218 | 1.2323 |
| orders1 | 1.6184 | 1.0949 | NaN | 1.1223 | 0.9685 | 0.8843 | 1.0499 | 1.1109 | 0.8693 | 1.2397 | 0.8267 | 1.0124 | 0.7505 | 0.7777 |
| orders2 | 1.8982 | 1.5244 | 1.2686 | NaN | 1.382 | 1.684 | 1.4149 | 1.6873 | 1.4492 | 1.6208 | 1.4198 | 1.4526 | 1.5523 | 1.3195 |
| simony | 1.6667 | 1.3491 | 0.9772 | 1.2195 | NaN | 0.8878 | 1.1304 | 1.1287 | 1.0413 | 1.1711 | 0.59 | 0.9166 | 0.9059 | 1.0863 |
| procedure | 1.6187 | 1.1991 | 0.892 | 1.5095 | 0.8789 | NaN | 1.079 | 1.1223 | 0.821 | 1.0726 | 0.6569 | 0.9993 | 0.8818 | 0.9852 |
| other1 | 1.3353 | 1.3 | 1.0619 | 1.2722 | 1.1383 | 1.0753 | NaN | 1.2792 | 0.9649 | 1.3054 | 0.996 | 1.0853 | 1.3272 | 0.8152 |
| other2 | 1.9416 | 1.3233 | 1.0913 | 1.6291 | 1.1386 | 1.109 | 1.2963 | NaN | 0.7979 | 1.0346 | 1.0592 | 0.654 | 0.8633 | 1.0961 |
| monastic | 1.4555 | 1.0451 | 0.8554 | 1.2676 | 1.0114 | 0.7986 | 0.93 | 0.7429 | NaN | 1.0578 | 0.7602 | 0.6611 | 0.7999 | 0.7799 |
| other3 | 2.0705 | 1.3388 | 1.289 | 1.5146 | 1.1997 | 1.1057 | 1.3497 | 0.9505 | 1.1229 | NaN | 1.1209 | 0.7121 | 1.1521 | 1.3067 |


Output predicted by [burrows2.py](https://github.com/decretist/Delta/blob/master/burrows/burrows2.py):

In [17]:
predicted = pd.read_csv('CSVs/predicted3.csv', index_col=0)
predicted

| unknown | cases | laws | orders1 | orders2 | simony | procedure | other1 | other2 | monastic | other3 | heresy | marriage | penance | second |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| cases | 0.0 | 2.2765 | 1.9247 | 2.0252 | 1.9637 | 1.9545 | 1.5714 | 2.2782 | 1.7622 | 2.3628 | 1.8717 | 1.8923 | 1.8589 | 1.6334 |
| laws | 2.141 | 0.0 | 1.249 | 1.502 | 1.4633 | 1.3147 | 1.4223 | 1.4369 | 1.1931 | 1.4345 | 1.1875 | 1.1924 | 1.6218 | 1.2323 |
| orders1 | 1.6184 | 1.0949 | 0.0 | 1.1223 | 0.9685 | 0.8843 | 1.0499 | 1.1109 | 0.8693 | 1.2397 | 0.8267 | 1.0124 | 0.7505 | 0.7777 |
| orders2 | 1.8982 | 1.5244 | 1.2686 | 0.0 | 1.382 | 1.684 | 1.4149 | 1.6873 | 1.4492 | 1.6208 | 1.4198 | 1.4526 | 1.5523 | 1.3195 |
| simony | 1.6667 | 1.3491 | 0.9772 | 1.2195 | 0.0 | 0.8878 | 1.1304 | 1.1287 | 1.0413 | 1.1711 | 0.59 | 0.9166 | 0.9059 | 1.0863 |
| procedure | 1.6187 | 1.1991 | 0.892 | 1.5095 | 0.8789 | 0.0 | 1.079 | 1.1223 | 0.821 | 1.0726 | 0.6569 | 0.9993 | 0.8818 | 0.9852 |
| other1 | 1.3353 | 1.3 | 1.0619 | 1.2722 | 1.1383 | 1.0753 | 0.0 | 1.2792 | 0.9649 | 1.3054 | 0.996 | 1.0853 | 1.3272 | 0.8152 |
| other2 | 1.9416 | 1.3233 | 1.0913 | 1.6291 | 1.1386 | 1.109 | 1.2963 | 0.0 | 0.7979 | 1.0346 | 1.0592 | 0.654 | 0.8633 | 1.0961 |
| monastic | 1.4555 | 1.0451 | 0.8554 | 1.2676 | 1.0114 | 0.7986 | 0.93 | 0.7429 | 0.0 | 1.0578 | 0.7602 | 0.6611 | 0.7999 | 0.7799 |
| other3 | 2.0705 | 1.3388 | 1.289 | 1.5146 | 1.1997 | 1.1057 | 1.3497 | 0.9505 | 1.1229 | 0.0 | 1.1209 | 0.7121 | 1.1521 | 1.3067 |
