## Burrows's Delta

$\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2}$
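The formula above is the population standard deviation: the square root of the mean squared deviation of $N$ values $x_i$ from their mean $\mu$. As a minimal sketch (the function name `population_std` and the sample values are illustrative and not part of the Gratian data), it can be written directly in Python and checked against the standard library's `statistics.pstdev`:

```python
import math
import statistics

def population_std(values):
    '''population standard deviation: divide by N, not N - 1'''
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values only
print(population_std(sample))        # 2.0
print(statistics.pstdev(sample))     # 2.0
```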

This example demonstrates Burrows's Delta using the 4 most frequent words (MFWs) and 3 samples (Gratian[012].txt). Keeping the example this small lets us show every intermediate result, so the reader can follow along. Using 4 MFWs rather than 3 makes the point that the method is not limited to the 3 dimensions the human mind can visualize.

Make sure that we're using the correct versions of the sample files by verifying their SHA1 cryptographic checksums:

In [1]:
import hashlib
def sha1(filename):
    '''return the SHA-1 hex digest of a file's contents'''
    with open(filename, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()
assert sha1('Gratian0.txt') == '80b33a5ede195049ebb39d76a9b52c9314be2452'
assert sha1('Gratian1.txt') == '5dc6dd1a83efbf85ea63e15539f1714db43bb84c'
assert sha1('Gratian2.txt') == '62c0e13d63466caafa343bc5ece76fb2f24a8697'
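A bare `assert` stops at the first mismatch without saying which file failed. A minimal sketch of a more informative check follows; the helper names `sha1_of_file` and `mismatched_files` are hypothetical, and the demonstration uses a temporary file rather than the Gratian samples so that it runs anywhere:

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=65536):
    '''return the SHA-1 hex digest of a file, read in chunks'''
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def mismatched_files(expected):
    '''return the paths whose digest differs from the expected one'''
    return [path for path, digest in expected.items()
            if sha1_of_file(path) != digest]

# demonstration on a throwaway file containing the bytes b'abc'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'abc')
demo_path = tmp.name

ok = mismatched_files({demo_path: 'a9993e364706816aba3e25717850c26c9cd0d89d'})
failed = mismatched_files({demo_path: '0' * 40})   # deliberately wrong digest
os.remove(demo_path)

print(ok)                     # []
print(failed == [demo_path])  # True
```

With the real samples, the `expected` dictionary would map `'Gratian0.txt'`, `'Gratian1.txt'`, and `'Gratian2.txt'` to the digests listed above.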


Verify word counts. There are 3,605 words in the hypothetical case statements (*themata*), 66,238 words in the first-recension *dicta* (including *de Pen.*), and 14,811 words in the second-recension *dicta* (including *de Pen.*). Together these make up the 84,654-word corpus of extended text attributable to Gratian. (The qualifier *extended* indicates that the samples do not include the short caption-like rubrics that Gratian prepended to the canons.)

In [2]:
import re
def tokenize(filename):
    '''open text file and return list of lowercased word tokens'''
    with open(filename, 'r') as f:
        text = f.read().lower()
    # split on non-word characters; drop the empty strings left between adjacent separators
    tokens = [word for word in re.split(r'\W', text) if word != '']
    return tokens
assert len(tokenize('Gratian0.txt')) == 3605
assert len(tokenize('Gratian1.txt')) == 66238
assert len(tokenize('Gratian2.txt')) == 14811
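To make the tokenizer's behavior concrete, here is a small sketch applied to an in-memory string instead of a file (the helper `tokenize_text` and the made-up sample sentence are illustrative assumptions). Splitting on `\W` treats every non-word character as a separator, so punctuation never ends up inside a token, and `re.findall(r'\w+', ...)` is an equivalent formulation:

```python
import re

def tokenize_text(text):
    '''lowercase text and split it on non-word characters'''
    return [word for word in re.split(r'\W', text.lower()) if word != '']

sample = 'Unde Augustinus: "Non est in potestate nostra."'   # made-up sentence
tokens = tokenize_text(sample)
print(tokens)
# ['unde', 'augustinus', 'non', 'est', 'in', 'potestate', 'nostra']
print(tokens == re.findall(r'\w+', sample.lower()))   # True
```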


There are {{len(tokenize('Gratian0.txt'))}} words in the hypothetical case statements (*themata*), {{len(tokenize('Gratian1.txt'))}} words in the first-recension *dicta* (including *de Pen.*), and {{len(tokenize('Gratian2.txt'))}} words in the second-recension *dicta* (including *de Pen.*).

In [None]:
from IPython.display import Markdown as md
md('There are {} words in the hypothetical case statements (themata).'.format(len(tokenize('Gratian0.txt'))))

Extract lists of the 4 most frequent words (MFWs) from the sample files Gratian[012].txt:

In [3]:
def occurrences(tokens):
    '''create and return token occurrence dictionary, most frequent first'''
    counts = dict.fromkeys(set(tokens), 0)
    for token in tokens:
        counts[token] += 1
    # sort by descending count, breaking ties alphabetically
    return dict(sorted(counts.items(), key=lambda item: (-item[1], item[0])))

def features(texts, n):
    '''return the n most frequent words in the combined texts'''
    # Assemble a large corpus made up of texts written by an arbitrary
    # number of authors; let's say that number of authors is x.
    corpus = []
    for text in texts:
        corpus += tokenize(text + '.txt')
    # Find the n most frequent words in the corpus to use as features.
    return list(occurrences(corpus))[:n]

features(['Gratian0', 'Gratian1', 'Gratian2'], 4)


['in', 'et', 'non', 'est']
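The same ranking can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the code used above; the helper name `top_n` and the toy token list are assumptions. Note that `Counter.most_common` orders tied words by first occurrence, so the explicit sort key is kept to reproduce the alphabetical tie-breaking of `occurrences`:

```python
from collections import Counter

def top_n(tokens, n):
    '''return the n most frequent tokens, ties broken alphabetically'''
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

toy = ['non', 'in', 'et', 'in', 'non', 'est', 'in', 'et']   # toy data
print(top_n(toy, 3))                                # ['in', 'et', 'non']
print([w for w, _ in Counter(toy).most_common(3)])  # ['in', 'non', 'et']
```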

In [4]:
features(['Gratian0', 'Gratian1'], 4)

['in', 'non', 'et', 'est']

In [5]:
features(['Gratian0', 'Gratian2'], 4)

['in', 'et', 'non', 'unde']

In [6]:
features(['Gratian1', 'Gratian2'], 4)

['in', 'non', 'et', 'est']

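Note that different combinations of the sample files yield different lists of MFWs: *unde* displaces *est* when only Gratian0 and Gratian2 are combined, and the relative rank of *et* and *non* changes from one combination to another. (The order in which the files are listed does not matter, because the counts are pooled before ranking.)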
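The effect can be reproduced without the sample files from raw counts alone. The sketch below is restricted to the five words that appear in the lists above, and its counts are reconstructed by multiplying the per-1,000-word frequencies reported later in this notebook by each sample's word count, so treat them as derived values rather than independent data. The names `COUNTS` and `pooled_mfws` are hypothetical.

```python
from itertools import combinations

# counts reconstructed from the reported frequencies (assumption: rounded to integers)
COUNTS = {
    'Gratian0': {'in': 74, 'et': 70, 'non': 24, 'est': 13, 'unde': 0},
    'Gratian1': {'in': 1682, 'et': 1542, 'non': 1622, 'est': 1136, 'unde': 498},
    'Gratian2': {'in': 431, 'et': 356, 'non': 314, 'est': 178, 'unde': 234},
}

def pooled_mfws(samples, n):
    '''rank words by their summed counts across the given samples'''
    pooled = {}
    for sample in samples:
        for word, count in COUNTS[sample].items():
            pooled[word] = pooled.get(word, 0) + count
    ranked = sorted(pooled.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# all three samples together, then every pair
for r in (3, 2):
    for combo in combinations(sorted(COUNTS), r):
        print(combo, pooled_mfws(combo, 4))
```

Because the counts are summed before ranking, listing the samples in a different order gives the same result; only the choice of samples matters.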

In [7]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

Raw counts of the 4 MFWs in Gratian1 and Gratian2 combined:

|in|non|et|est|
|:-|:-|:-|:-|
|2113|1936|1898|1314|

Cross-check the Python tokenizer against standard Unix tools by counting the occurrences of *in* in Gratian1.txt:

In [8]:
!cat Gratian1.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr '[:space:]' '\n' | egrep -v '^$' | egrep "\bin\b" | wc -l


    1682
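The shell pipeline and `tokenize` do not treat punctuation identically: `tr -d '[:punct:]'` deletes punctuation, fusing the text on either side of it, whereas splitting on `\W` breaks the text at that point. The two can therefore disagree in principle. The sketch below shows a made-up string on which they do; the function names are hypothetical, and `string.punctuation` covers ASCII punctuation only, which is an approximation of `[:punct:]`.

```python
import re
import string

def count_python_style(text, word):
    '''count a word after splitting on non-word characters (as tokenize does)'''
    tokens = [t for t in re.split(r'\W', text.lower()) if t != '']
    return tokens.count(word)

def count_shell_style(text, word):
    '''count a word after deleting ASCII punctuation and splitting on whitespace'''
    stripped = text.lower().translate(str.maketrans('', '', string.punctuation))
    return stripped.split().count(word)

sample = 'In principio, in-de in.'   # made-up string with a hyphenated form
print(count_python_style(sample, 'in'))   # 3
print(count_shell_style(sample, 'in'))    # 2
```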


In [9]:
def frequencies(features, subcorpora):
    '''
    For each of these n features, calculate the share of each of
    the x authors’ subcorpora represented by this feature, expressed
    as occurrences per 1,000 words.
    '''
    frequencies = {}
    for subcorpus in subcorpora:
        tokens = tokenize(subcorpus + '.txt')
        counts = occurrences(tokens)
        # relative frequency per 1,000 words; a feature absent from a subcorpus gets 0.0
        frequencies[subcorpus] = {
            feature: (counts.get(feature, 0) / len(tokens)) * 1000
            for feature in features
        }
    return frequencies

frequencies(['in', 'et', 'non', 'est', 'unde'], ['Gratian0', 'Gratian1', 'Gratian2'])


{'Gratian0': {'in': 20.527045769764214,
  'et': 19.41747572815534,
  'non': 6.65742024965326,
  'est': 3.6061026352288486,
  'unde': 0.0},
 'Gratian1': {'in': 25.393278782571937,
  'et': 23.279688396388778,
  'non': 24.487454331350584,
  'est': 17.150276276457625,
  'unde': 7.518342945137232},
 'Gratian2': {'in': 29.099993248261427,
  'et': 24.036189318749578,
  'non': 21.200459118222945,
  'est': 12.018094659374789,
  'unde': 15.79906826007697}}
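A natural next step in Burrows's Delta is to standardize these frequencies feature by feature. The following is a minimal sketch, assuming that the mean and the population standard deviation (the formula at the head of this section) are taken over all three samples; that choice, like the names `FREQS` and `feature_names`, is an assumption made for illustration. The frequencies are copied from the output above, rounded to four decimal places.

```python
import statistics

# per-1,000-word frequencies from the output above, rounded to 4 decimal places
FREQS = {
    'Gratian0': {'in': 20.5270, 'et': 19.4175, 'non': 6.6574, 'est': 3.6061, 'unde': 0.0},
    'Gratian1': {'in': 25.3933, 'et': 23.2797, 'non': 24.4875, 'est': 17.1503, 'unde': 7.5183},
    'Gratian2': {'in': 29.1000, 'et': 24.0362, 'non': 21.2005, 'est': 12.0181, 'unde': 15.7991},
}

feature_names = ['in', 'et', 'non', 'est', 'unde']
means = {f: statistics.mean([FREQS[s][f] for s in FREQS]) for f in feature_names}
stdevs = {f: statistics.pstdev([FREQS[s][f] for s in FREQS]) for f in feature_names}

# z-score: how many standard deviations a sample lies from the mean for a feature
z_scores = {
    s: {f: (FREQS[s][f] - means[f]) / stdevs[f] for f in feature_names}
    for s in FREQS
}

for s in FREQS:
    print(s, {f: round(z, 3) for f, z in z_scores[s].items()})
```

By construction, the z-scores for each feature sum to zero across the samples, which is a quick way to check the arithmetic.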