## Burrows's Delta

Attempts to attribute authorship are typically undertaken in scenarios where there is a sufficiently large number of texts securely attributable to a known author, and a text, or at most a small number of texts, of unknown authorship. The attempt is then made to attribute the unknown text to the known author, or to rule out such an attribution. Take *The Federalist* as an example. There are numbers of *The Federalist* of disputed or unknown attribution, a small and well-defined set of candidates for authorship --- Hamilton, Jay, Madison --- to whom those numbers might be attributed, and securely attributed samples from each of the candidates, from the same work no less. Such an approach is not possible in the case of the *dicta* from Gratian’s *Decretum*. As the survey in Chapter 3 above indicated, near-contemporaries knew next to nothing about Gratian. Perhaps most notably, although Gratian was thought to have been a teacher, no one in the generation following made an unambiguous claim to have been his student. **There are no other writings securely, or even insecurely, attributed to him.**

This does not mean that we are unable to apply the established techniques of authorship attribution to Gratian’s *dicta*. It simply means that we will have to adapt the traditional assumptions behind the design of authorship attribution experiments to a different set of circumstances. Every authorship attribution experiment starts from a hypothesis (sometimes implicit rather than explicitly stated) concerning authorship of the text under consideration. The hypothesis traditional in Gratian studies has been that the *dicta* from the *Decretum* --- defined as the hypothetical case statements (*themata*) plus the first- and second-recension *dicta* including the *dicta* from *de Pen*. --- are the work of a unitary author, the eponymous Gratian. **In this view, the *dicta* represent a complete, closed corpus: the total population of words attributable to the author Gratian.**

Now, break the consolidated z-scores dataframe into two dataframes: one for the hypothetical case statements (*themata*), the other for the first- and second-recension *dicta* (including the *dicta* from *de Pen*.) with which we want to compare the case statements.

In [1]:
import pandas as pd

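# Column holding the sample of unknown authorship (the case statements)
unknown = 'Gratian0'
# Columns holding the comparison samples
samples = ['Gratian1', 'dePen', 'Gratian2']
z_scores = pd.read_csv('CSVs/z-scores.csv', index_col=0)
# Markdown log of the intermediate tables; closed in the final cell
f = open('burrows.md', 'w')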

In [2]:
test = z_scores[[unknown]]
corpus = z_scores[samples]
f.write(test.to_markdown() + '\n\n')
test

||Gratian0|
|-:|-:|
|**in**|-2.870189|
|**non**|-6.54914|
|**et**|-3.237506|
|**est**|-3.526376|


$\Delta_B = \frac{1}{N}\sum_{i = 1}^N|z_i(t) - z_i(c)|$

![Burrows's Delta](JPGs/Burrows.jpg)
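
The formula can be sketched as a small function. This is a minimal illustration rather than the code used for the results in this section: the function name `burrows_delta`, the column labels `A` and `B`, and the z-score values are all hypothetical, chosen so that the arithmetic is easy to check by hand.

```python
import pandas as pd


def burrows_delta(test: pd.Series, corpus: pd.DataFrame) -> pd.Series:
    """Mean absolute difference between the test z-scores and each corpus column."""
    # Subtract the test z-scores from every corpus column, aligned on the word index.
    differences = corpus.sub(test, axis=0).abs()
    # Average over the N words, giving one Delta per comparison sample.
    return differences.mean(axis=0)


# Hypothetical z-scores for three function words (illustrative values only).
words = ["in", "non", "et"]
test = pd.Series([-1.0, 0.5, 2.0], index=words, name="unknown")
corpus = pd.DataFrame(
    {"A": [0.0, 0.5, 1.0], "B": [1.0, -1.5, 2.0]},
    index=words,
)

result = burrows_delta(test, corpus)
print(result)
```

For sample `A` the absolute differences are 1, 0, and 1, so the Delta is 2/3; for sample `B` they are 2, 2, and 0, so the Delta is 4/3. The lower value marks `A` as the closer match.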

In [3]:
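# Absolute difference, word by word, between the unknown sample's z-score and
# each comparison sample's z-score (test.values broadcasts across the columns)
tmp = (test.values - corpus).abs()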
f.write(tmp.to_markdown() + '\n\n')
tmp

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**in**|2.436037|2.16064|4.01389|
|**non**|6.512993|7.566723|5.567703|
|**et**|2.258861|4.257573|3.196083|
|**est**|3.944257|4.249654|2.385217|


In [4]:
# Average the absolute differences in each column, then reshape the result
# into a one-row dataframe labeled with the unknown sample
deltas = tmp.mean(axis=0).to_frame(unknown).transpose()
f.write(deltas.to_markdown() + '\n\n')
f.close()
deltas

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**Gratian0**|3.788037|4.558648|3.790723|
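
As a check on the arithmetic, the Gratian1 entry is the mean of the four absolute differences listed for Gratian1 in the previous table (*in*, *non*, *et*, *est*), so $N = 4$ here:

$\Delta_B = \frac{2.436037 + 6.512993 + 2.258861 + 3.944257}{4} = \frac{15.152148}{4} = 3.788037$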


Calculating Burrows's Delta ($\Delta_B$) for Gratian0 with respect to Gratian1 and Gratian2 alone has limited value, because the two values of $\Delta_B$ come out identical:

||Gratian1|Gratian2|
|-:|-:|-:|
|**Gratian0**|5.120042|5.120042|

This appears to be related to the problem of mean-of-means comparisons between two samples yielding z-scores of 1.0 and -1.0, that is, values exactly one standard deviation above and below the mean.
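
The two-sample effect can be illustrated with a minimal sketch. It assumes, which the text above does not confirm, that each word's mean and standard deviation are computed from only the two comparison samples. The function name `two_sample_z` and the frequencies are hypothetical. Under that assumption the two z-scores are always equal and opposite, whatever the underlying frequencies.

```python
import numpy as np


def two_sample_z(a, b, ddof=0):
    """z-scores of two values relative to their own mean and standard deviation."""
    values = np.array([a, b], dtype=float)
    # Undefined (division by zero) when a == b.
    return (values - values.mean()) / values.std(ddof=ddof)


# Hypothetical relative frequencies of one word in two samples.
population = two_sample_z(0.031, 0.027)      # population standard deviation (ddof=0)
sample = two_sample_z(0.031, 0.027, ddof=1)  # sample standard deviation (ddof=1, the pandas default)

print(population)  # magnitudes are 1 with ddof=0
print(sample)      # magnitudes are 1/sqrt(2) with ddof=1
```

With only two samples, then, the z-scores record which sample has the higher frequency but nothing about the size of the difference.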

Adding a third comparison, to the first- and second-recension *dicta* in *de Pen*., makes for a more informative demonstration. **Explain and justify the fact that I am separating the first- and second-recension *dicta* in Gratian1 and Gratian2, but keeping them together in dePen.** The values of $\Delta_B$ for Gratian0 with respect to Gratian1, dePen, and Gratian2 are:

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**Gratian0**|3.788037|4.558648|3.790723|

To put those values of $\Delta_B$ in context, if we substitute a 3917-word sample from the pseudo-Augustinian *De vera et falsa penitentia* in place of Gratian0, the values of $\Delta_B$ for pseudo-Augustine with respect to Gratian1, dePen, and Gratian2 are:

||Gratian1|dePen|Gratian2|
|-:|-:|-:|-:|
|**ps-Aug**|2.691456|1.783147|3.477697|

\[Gratian quotes *De vera et falsa penitentia* extensively in *de Pen*., and the sample was created by concatenating *de Pen*. D.1 c.88, D.3 c.32, D.3 c.42, D.3 c.45, D.3 c.49, D.5 c.1, D.6 c.1, and D.7 c.6. (This has the incidental advantage of guaranteeing orthographic consistency with the other samples derived from the Friedberg edition.) As noted in Chapter 0 above, *De vera et falsa penitentia* is extremely unlikely to have been written by Gratian. Gratian and pseudo-Augustine had markedly different preferences in postpositive conjunctions: pseudo-Augustine strongly preferred *enim*, while Gratian preferred *autem*. John Wei, *Gratian the Theologian*, 84, describes Gratian's *Decretum* as "the first work to draw on *De vera et falsa penitentia*." **wc -l reports 3918 words, tokenizer reports 3917. Verify that these are all the pseudo-Augustine quotes in *de Pen*. To reproduce March 2014 enim/autem boxplots, use oppose.**\]

In each case, the value of $\Delta_B$ for pseudo-Augustine is *lower* than the corresponding value for Gratian0. Keep in mind that the lowest value of Burrows's Delta ($\Delta_B$) indicates the most likely attribution of authorship. Therefore, pseudo-Augustine is *more* likely than the author of Gratian0 to be the author of Gratian1, dePen, and Gratian2 (markedly more likely in the cases of Gratian1 and dePen).
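
The comparison can be made explicit with a short sketch that uses only the two rows of $\Delta_B$ values reported above. The variable names `closest` and `gap` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Delta values copied from the two tables above.
deltas = pd.DataFrame(
    {
        "Gratian1": [3.788037, 2.691456],
        "dePen": [4.558648, 1.783147],
        "Gratian2": [3.790723, 3.477697],
    },
    index=["Gratian0", "ps-Aug"],
)

# For each comparison sample, the candidate with the lowest Delta.
closest = deltas.idxmin(axis=0)

# How much lower the pseudo-Augustine Delta is than the Gratian0 Delta.
gap = deltas.loc["Gratian0"] - deltas.loc["ps-Aug"]

print(closest)
print(gap)
```

The gap is largest for dePen and smallest for Gratian2, which matches the qualification made in the text.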

