## Burrows's Delta

In [1]:
print('Humanum genus duobus regitur')


Humanum genus duobus regitur


$\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2}$
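
The formula above is the population standard deviation, which is needed to standardize word frequencies as z-scores. As a minimal sketch, it can be computed by hand and checked against the standard library. The function name `population_stdev` and the values in `xs` are made up for illustration; they are not frequencies from the Gratian samples.

```python
import math
import statistics

def population_stdev(xs):
    """Population standard deviation: square root of the mean squared deviation."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

# Hypothetical values, chosen only so the arithmetic is easy to check by hand.
xs = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = population_stdev(xs)
print(sigma)                   # mean is 5, mean squared deviation is 4
print(statistics.pstdev(xs))   # standard-library equivalent
```

Note that `statistics.stdev` (without the `p`) divides by N-1 instead of N and would not match the formula as written.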

Demonstrate Burrows's Delta using the 4 most frequent words (MFWs) and 3 samples (Gratian[012].txt). Keeping the example this small lets us show all the intermediate results, so the reader can follow along. Using 4 MFWs rather than 3 also makes the point that we are not limited to the 3 dimensions the human mind can visualize.

Make sure that we're using the correct versions of the sample files by verifying their SHA-1 cryptographic checksums:

In [2]:
import hashlib
assert(hashlib.sha1(open('Gratian0.txt', 'rb').read()).hexdigest() == '80b33a5ede195049ebb39d76a9b52c9314be2452')
assert(hashlib.sha1(open('Gratian1.txt', 'rb').read()).hexdigest() == '5dc6dd1a83efbf85ea63e15539f1714db43bb84c')
assert(hashlib.sha1(open('Gratian2.txt', 'rb').read()).hexdigest() == '62c0e13d63466caafa343bc5ece76fb2f24a8697')
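
The checks above read each file in one call and leave the file handles to be closed by the garbage collector. A possible alternative, sketched here under the assumption that a reusable helper is wanted, reads the file in chunks inside a `with` block. The name `sha1_of_file` and the `chunk_size` default are hypothetical, not part of the original notebook.

```python
import hashlib

def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

With the sample files present, each check would then read `assert sha1_of_file('Gratian0.txt') == '80b33a5ede195049ebb39d76a9b52c9314be2452'`, and likewise for the other two files.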


Verify the word counts. There are 3,605 words in the hypothetical case statements (*themata*), 66,238 words in the first-recension *dicta* (including *de Pen.*), and 14,811 words in the second-recension *dicta* (including *de Pen.*). Together these make up the 84,654-word corpus of extended text attributable to Gratian. (The qualifier *extended* indicates that the samples do not include the short caption-like rubrics that Gratian prepended to the canons.)

In [3]:
import re
def tokenize(filename):
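    '''Open a text file and return its lowercased word tokens.'''
    with open(filename, 'r') as f:
        text = f.read().lower()
    # Split on non-word characters and drop the empty strings between them.
    tokens = [word for word in re.split(r'\W', text) if word != '']
    return tokens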
assert(len(tokenize('Gratian0.txt')) == 3605)
assert(len(tokenize('Gratian1.txt')) == 66238)
assert(len(tokenize('Gratian2.txt')) == 14811)
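
To make the tokenization rule concrete, here is a small sketch that applies the same lowercase-and-split logic to a string instead of a file. The helper name `tokenize_text` is hypothetical, and the trailing period was added to the sample phrase only to show that punctuation is dropped.

```python
import re

def tokenize_text(text):
    """Lowercase a string and split it into word tokens on non-word characters."""
    return [word for word in re.split(r'\W', text.lower()) if word != '']

tokens = tokenize_text('Humanum genus duobus regitur.')
print(tokens)
```

Because `\W` matches every character that is not a letter, digit, or underscore, consecutive separators produce empty strings, which the `if word != ''` filter removes.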


Extract lists of the 4 most frequent words (MFWs) from the sample files Gratian[012].txt:

In [4]:
def frequencies(tokens):
    '''Return a dict of token counts, ordered by descending count (ties alphabetical).'''
    types = list(set(tokens))
    tmp = dict.fromkeys(types, 0)
    for token in tokens: tmp[token] += 1
    token_frequencies = {
        key: value for key, value in sorted(tmp.items(),
        key = lambda item: (-item[1], item[0]))
    }
    return token_frequencies

def get_features(texts, n):
    '''
    Pool the tokens of the named sample files into a single corpus and
    return its n most frequent words, with their counts, as features.
    '''
    # Assemble the corpus from the given sample texts.
    corpus = []
    for text in texts:
        corpus += tokenize(text + '.txt')
    # Find the n most frequent words in the corpus to use as features.
    features = list(frequencies(corpus).items())[:n]
    return features

print(get_features(['Gratian0', 'Gratian1', 'Gratian2'], 4))
print(get_features(['Gratian0', 'Gratian1'], 4))
print(get_features(['Gratian0', 'Gratian2'], 4))
print(get_features(['Gratian1', 'Gratian2'], 4))


[('in', 2187), ('et', 1968), ('non', 1960), ('est', 1327)]
[('in', 1756), ('non', 1646), ('et', 1612), ('est', 1149)]
[('in', 505), ('et', 426), ('non', 338), ('unde', 234)]
[('in', 2113), ('non', 1936), ('et', 1898), ('est', 1314)]
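
The counting and ranking step can also be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the notebook's own code: the name `ranked_counts` and the toy token list are hypothetical. The explicit sort key reproduces the ordering used above (descending count, ties broken alphabetically), whereas `Counter.most_common` orders ties by first occurrence.

```python
from collections import Counter

def ranked_counts(tokens):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    return sorted(Counter(tokens).items(), key=lambda item: (-item[1], item[0]))

# Hypothetical tokens: 'et' and 'non' are tied, so the tie-break decides their order.
tokens = ['non', 'in', 'et', 'in', 'non', 'in', 'et']
top_two = ranked_counts(tokens)[:2]
print(top_two)
```

A deterministic tie-break matters here because the feature list is cut off at a fixed length: without one, which of two equally frequent words makes the cut would depend on the order in which the files were read.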


In [5]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

|in|non|et|est|
|:-|:-|:-|:-|
|2113|1936|1898|1314|

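Note that different combinations of the sample files yield different lists of MFWs! Pooling Gratian0 and Gratian2 puts *unde* rather than *est* in fourth place, and the relative rank of *et* and *non* changes from one combination to the next. The order in which the files are pooled does not matter, only which files are included.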
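
The effect can be reproduced without the sample files. The sketch below pools every pair, and then all three, of some toy token lists and prints the top 2 words of each pool. The sample names `A`, `B`, `C`, their contents, and the helper `top_words` are all hypothetical and chosen only to make the lists differ.

```python
from collections import Counter
from itertools import combinations

def top_words(token_lists, n):
    """Pool several token lists and return the n most frequent words."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Hypothetical samples, not drawn from the Gratian files.
samples = {
    'A': ['in', 'in', 'et', 'unde'],
    'B': ['non', 'non', 'non', 'in', 'est', 'est'],
    'C': ['et', 'et', 'in', 'unde', 'unde'],
}

results = {}
for size in (2, 3):
    for names in combinations(sorted(samples), size):
        results[names] = top_words([samples[name] for name in names], 2)
        print(names, results[names])
```

Each pool produces a different top-2 list, which is why the choice of which samples define the feature set has to be made explicitly before computing Delta.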