## Burrows's Delta

This section demonstrates Burrows's Delta using the 4 most frequent words (MFWs) and 3 samples (Gratian[012].txt). Keeping the example this small makes it possible to show all the intermediate results, so the reader can follow along. Using 4 MFWs also makes the point that the method is not limited to the 3 dimensions the human mind can visualize.

There are 3,605 words in the hypothetical cases statements (*themata*), 66,238 words in the first-recension *dicta* (including *de Pen.*), and 14,811 words in the second-recension *dicta* (including *de Pen.*). Together these make up the 84,654-word corpus of extended text attributable to Gratian. (The qualifier *extended* indicates that the samples do not include the short caption-like rubrics that Gratian prepended to the canons.)

In [1]:
import re

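def tokenize(filename):
    '''open text file and return list of lowercased tokens'''
    # Use a context manager so the file is closed after reading.
    with open(filename, 'r') as f:
        text = f.read().lower()
    # Raw string avoids the invalid '\W' escape; split on non-word characters.
    tokens = [word for word in re.split(r'\W', text) if word != '']
    return tokens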
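
To make the tokenization rule concrete, here is a minimal sketch that applies the same lowercase-and-split logic to a string instead of a file. The helper name `tokenize_text` and the sample sentence are invented for illustration and are not taken from the corpus.

```python
import re

def tokenize_text(text):
    """Hypothetical string-based variant of tokenize(), for illustration only."""
    # Lowercase, split on every non-word character, and drop empty strings.
    return [word for word in re.split(r'\W', text.lower()) if word != '']

# Invented sample string, not a quotation from the Gratian files.
sample = 'Unde Augustinus ait: "Non est in nobis."'
print(tokenize_text(sample))
```

Punctuation and whitespace both act as separators, so adjacent separators (such as the colon followed by a space) produce empty strings that the list comprehension filters out.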



In [2]:
from IPython.display import Markdown as md

g0 = len(tokenize('Gratian0.txt'))
g1 = len(tokenize('Gratian1.txt'))
g2 = len(tokenize('Gratian2.txt'))
md(
    '''There are {} words in the hypothetical cases statements
    (*themata*), {} words in the first-recension *dicta* (including
    *de Pen*.), and {} words in the second-recension *dicta*
    (including *de Pen*.).  This represents the {} word corpus of
    extended text attributable to Gratian.  (The qualifier *extended*
    indicating that the samples do not include the short caption-like
    rubrics that Gratian prepended to the
    canons.)'''.format(g0, g1, g2, g0 + g1 + g2)
)
    

There are 3605 words in the hypothetical cases statements
    (*themata*), 66238 words in the first-recension *dicta* (including
    *de Pen*.), and 14811 words in the second-recension *dicta*
    (including *de Pen*.).  This represents the 84654 word corpus of
    extended text attributable to Gratian.  (The qualifier *extended*
    indicating that the samples do not include the short caption-like
    rubrics that Gratian prepended to the 
    canons.)

Extract lists of the 4 most frequent words (MFWs) from the sample files Gratian[012].txt:

In [3]:
from collections import Counter

def occurrences(tokens):
    '''create and return token occurrence dictionary'''
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically.
    return dict(sorted(counts.items(), key=lambda item: (-item[1], item[0])))

def features(texts, n):
    '''return the n most frequent words in the combined texts'''
    # Assemble a large corpus made up of texts written by an arbitrary
    # number of authors; let’s say that number of authors is x.
    corpus = []
    for text in texts:
        corpus += tokenize(text + '.txt')
    # Find the n most frequent words in the corpus to use as features.
    return list(occurrences(corpus).keys())[:n]

samples = ['Gratian1', 'Gratian2']
features(samples, 4)


['in', 'non', 'et', 'est']

In [4]:
features(['Gratian0', 'Gratian1'], 4)

['in', 'non', 'et', 'est']

In [5]:
features(['Gratian0', 'Gratian2'], 4)

['in', 'et', 'non', 'unde']

In [6]:
features(['Gratian1', 'Gratian2'], 4)

['in', 'non', 'et', 'est']

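Note that different combinations of the sample files yield different lists of MFWs. The order in which the files are combined does not matter, because the words are counted over the concatenated corpus and ties are broken alphabetically.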
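
The effect can be sketched with toy data. The helper `top_words` and the three token lists below are invented for illustration; the ranking rule (descending count, then alphabetical) mirrors `occurrences()`. In this sketch, changing which lists are included changes the result, while changing their order does not.

```python
from collections import Counter

def top_words(token_lists, n):
    """Hypothetical helper: n most frequent words over concatenated token lists."""
    corpus = [token for tokens in token_lists for token in tokens]
    counts = Counter(corpus)
    # Same ordering rule as occurrences(): descending count, then alphabetical.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Toy token lists, invented for illustration only.
a = ['in', 'in', 'non', 'et']
b = ['et', 'et', 'est', 'in']
c = ['unde', 'unde', 'unde', 'et']

print(top_words([a, b], 2))  # a combined with b
print(top_words([b, a], 2))  # same lists, reversed order
print(top_words([a, c], 2))  # a different combination
```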

In [7]:
def frequencies(features, subcorpora):
    '''
    For each of these n features, calculate the share of each of
    the x authors’ subcorpora represented by this feature, expressed
    as occurrences per 1,000 words.
    '''
    result = {}
    for subcorpus in subcorpora:
        tokens = tokenize(subcorpus + '.txt')
        counts = occurrences(tokens)
        # Relative frequency: (count / total tokens) * 1000.
        result[subcorpus] = {
            feature: (counts.get(feature, 0) / len(tokens)) * 1000
            for feature in features
        }
    return result

mfws = features(samples, 4)
frequencies(mfws, samples)


{'Gratian1': {'in': 25.393278782571937,
  'non': 24.487454331350584,
  'et': 23.279688396388778,
  'est': 17.150276276457625},
 'Gratian2': {'in': 29.099993248261427,
  'non': 21.200459118222945,
  'et': 24.036189318749578,
  'est': 12.018094659374789}}
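
These values are occurrences per 1,000 words, so they can be converted back to raw counts using the sample sizes reported earlier (66,238 and 14,811 words). A minimal sketch of that arithmetic, using a hypothetical helper `raw_count` and the values for *in* copied from the output above:

```python
# Values copied from the output above and from the word counts reported earlier.
per_thousand_g1 = 25.393278782571937   # 'in' in Gratian1
per_thousand_g2 = 29.099993248261427   # 'in' in Gratian2
total_g1 = 66238
total_g2 = 14811

def raw_count(per_thousand, total_tokens):
    """Hypothetical helper: invert (count / total) * 1000 and round to an integer."""
    return round(per_thousand * total_tokens / 1000)

print(raw_count(per_thousand_g1, total_g1))
print(raw_count(per_thousand_g2, total_g2))
```

Normalizing by sample length is what makes the two samples comparable, since the first-recension *dicta* are more than four times as long as the second-recension *dicta*.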

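The spread of each feature across the samples is measured with the sample standard deviation (the $N - 1$ form, which is what `statistics.stdev` computes):

$s=\sqrt{\frac{1}{N - 1}\sum_{i=1}^N(x_i-\bar{x})^2}$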
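
As a check on the formula, the sketch below evaluates it by hand for the two relative frequencies of *in* (copied from the output above) and compares the result with `statistics.stdev`. The variable names are illustrative only.

```python
import math
import statistics

# Relative frequencies of 'in' (per 1,000 words), copied from the output above.
values = [25.393278782571937, 29.099993248261427]

n = len(values)
mean = sum(values) / n
# Sum of squared deviations from the mean, divided by N - 1 (not N).
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
s = math.sqrt(variance)

print(s)
print(statistics.stdev(values))
```

Both lines should print the same value up to floating-point rounding, confirming that `statistics.stdev` uses the $N - 1$ denominator.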


In [8]:
import statistics

def means(frequencies):
    '''
    Then, calculate the mean and the standard deviation of these x
    values and use them as the official mean and standard deviation
    for this feature over the whole corpus. In other words, we will
    be using a mean of means instead of calculating a single value
    representing the share of the entire corpus represented by each
    word.
    '''
    subcorpora = list(frequencies.keys())
    features = list(frequencies[subcorpora[0]].keys())
    frequencies['means'] = dict.fromkeys(features, 0)
    frequencies['stdevs'] = dict.fromkeys(features, 0)
    for feature in features:
        frequency_list = []
        for subcorpus in subcorpora:
            frequency_list.append(frequencies[subcorpus][feature])
        frequencies['means'][feature] = statistics.mean(frequency_list)
        frequencies['stdevs'][feature] = statistics.stdev(frequency_list)
    return frequencies

means(frequencies(mfws, samples))


{'Gratian1': {'in': 25.393278782571937,
  'non': 24.487454331350584,
  'et': 23.279688396388778,
  'est': 17.150276276457625},
 'Gratian2': {'in': 29.099993248261427,
  'non': 21.200459118222945,
  'et': 24.036189318749578,
  'est': 12.018094659374789},
 'means': {'in': 27.246636015416684,
  'non': 22.843956724786764,
  'et': 23.657938857569178,
  'est': 14.584185467916207},
 'stdevs': {'in': 2.621042934611309,
  'non': 2.324256604930275,
  'et': 0.5349269321751997,
  'est': 3.6290004237202145}}

In [9]:
def z_scores(subcorpora, frequencies):
    '''
    For each of the n features and x subcorpora, calculate a z-score
    describing how far away from the corpus norm the usage of this
    particular feature in this particular subcorpus happens to be.
    To do this, subtract the "mean of means" for the feature from
    the feature’s frequency in the subcorpus and divide the result
    by the feature’s standard deviation.
    '''
    z_scores = {}
    features = list(frequencies[subcorpora[0]].keys())
    for subcorpus in subcorpora:
        z_scores[subcorpus] = dict.fromkeys(features, 0)
        for feature in features:
            z_scores[subcorpus][feature] = (frequencies[subcorpus][feature] - frequencies['means'][feature]) / frequencies['stdevs'][feature]
    return z_scores

z_scores(samples, means(frequencies(mfws, samples)))


{'Gratian1': {'in': -0.7071067811865482,
  'non': 0.7071067811865475,
  'et': -0.7071067811865475,
  'est': 0.7071067811865475},
 'Gratian2': {'in': 0.7071067811865468,
  'non': -0.7071067811865475,
  'et': 0.7071067811865475,
  'est': -0.7071067811865475}}
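
Every z-score in this output has magnitude 0.7071, and that follows from having only two samples. A short derivation, assuming exactly two samples with distinct relative frequencies $a$ and $b$ for a feature (symbols introduced here for illustration) and the $N - 1$ standard deviation:

$$
\begin{aligned}
\bar{x} &= \frac{a + b}{2} \\
s &= \sqrt{\frac{1}{2 - 1}\left[\left(a - \bar{x}\right)^2 + \left(b - \bar{x}\right)^2\right]} = \sqrt{2\left(\frac{a - b}{2}\right)^2} = \frac{|a - b|}{\sqrt{2}} \\
z_a &= \frac{a - \bar{x}}{s} = \frac{(a - b)/2}{|a - b|/\sqrt{2}} = \pm\frac{1}{\sqrt{2}} \approx \pm 0.7071
\end{aligned}
$$

With two samples, then, only the sign of each z-score carries information about the samples themselves. A third sample that is not used to compute the means and standard deviations is not constrained in this way.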

In [10]:
def testcase(test, features, frequencies, z_scores):
    '''add the test sample's relative frequencies and z-scores to the corpus dictionaries'''
    test_tokens = tokenize(test + '.txt')
    test_counts = occurrences(test_tokens)
    frequencies[test] = dict.fromkeys(features, 0)
    z_scores[test] = dict.fromkeys(features, 0)
    for feature in features:
        # Same per-1,000-word scaling and corpus means/stdevs as for the samples.
        frequencies[test][feature] = (test_counts.get(feature, 0) / len(test_tokens)) * 1000
        z_scores[test][feature] = (frequencies[test][feature] - frequencies['means'][feature]) / frequencies['stdevs'][feature]
    return (frequencies, z_scores)

# Compute the corpus statistics once and reuse them for both arguments.
corpus_frequencies = means(frequencies(mfws, samples))
f, z = testcase('Gratian0', mfws, corpus_frequencies, z_scores(samples, corpus_frequencies))
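
The remaining step is the Delta score itself: the mean of the absolute differences between two samples' z-scores across the n features. The sketch below is one way to write that step, not necessarily how this notebook goes on to do it. The helper name `delta` is hypothetical, and because the z-scores for Gratian0 are not shown above, the example uses the Gratian1 and Gratian2 z-scores copied from the earlier output.

```python
def delta(z_test, z_candidate):
    """Hypothetical helper: mean absolute difference of z-scores over the features."""
    features = list(z_test.keys())
    return sum(abs(z_test[f] - z_candidate[f]) for f in features) / len(features)

# z-scores copied from the output of z_scores() above.
z_gratian1 = {'in': -0.7071067811865482, 'non': 0.7071067811865475,
              'et': -0.7071067811865475, 'est': 0.7071067811865475}
z_gratian2 = {'in': 0.7071067811865468, 'non': -0.7071067811865475,
              'et': 0.7071067811865475, 'est': -0.7071067811865475}

print(delta(z_gratian1, z_gratian2))  # distance between the two samples
print(delta(z_gratian1, z_gratian1))  # a sample compared with itself
```

In the notebook, the same helper could be applied to the dictionary returned by `testcase`, for example `delta(z['Gratian0'], z['Gratian1'])` and `delta(z['Gratian0'], z['Gratian2'])`; the candidate with the smaller Delta is the closer stylistic match.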
