## Burrows's Delta

In [1]:
import re

def tokenize(filename):
    '''open text file and return list of tokens'''
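    # context manager closes the file once it has been read
    with open(filename, 'r') as f:
        text = f.read().lower()
    # raw string so that \W reaches the regex engine as a character class
    tokens = [word for word in re.split(r'\W', text) if word != '']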
    return tokens

In [2]:
lengths = {}
samples = ['Gratian0', 'Gratian1', 'Gratian2']
filenames = [sample + '.txt' for sample in samples]
for sample, filename in zip(samples, filenames):
    lengths[sample] = len(tokenize(filename))
lengths

{'Gratian0': 3605, 'Gratian1': 66238, 'Gratian2': 14811}

In [3]:
def occurrences(tokens):
    '''create and return token occurrence dictionary'''
    types = list(set(tokens))
    tmp = dict.fromkeys(types, 0)
    for token in tokens:
        tmp[token] += 1
    # sort by descending count, breaking ties alphabetically
    ordered = {
        key: value for key, value in sorted(tmp.items(),
        key = lambda item: (-item[1], item[0]))
    }
    return ordered

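def features(texts, n):
    '''return the n most frequent words across all texts'''
    corpus = []
    for text in texts:
        corpus += tokenize(text + '.txt')
    # occurrences() is sorted by descending count, so the first n keys are the most frequent words
    return list(occurrences(corpus).keys())[:n]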

mfws = features(samples, 4)
mfws

['in', 'et', 'non', 'est']

In [4]:
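def counts(features, subcorpora):
    '''return per-subcorpus occurrence counts for each feature'''
    columns = {}
    for subcorpus in subcorpora:
        tokens = tokenize(subcorpus + '.txt')
        totals = occurrences(tokens)  # named to avoid shadowing the built-in all()
        # a feature absent from the subcorpus counts as 0
        columns[subcorpus] = [totals.get(feature, 0) for feature in features]
    return columns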

counts(mfws, samples)

{'Gratian0': [74, 70, 24, 13],
 'Gratian1': [1682, 1542, 1622, 1136],
 'Gratian2': [431, 356, 314, 178]}

**At this point we have gathered all the preliminary information we need (the sample lengths, the most frequent words, and their counts in each sample) and are ready to move the analysis into pandas dataframes.**

In [5]:
import pandas as pd

df_counts = pd.DataFrame(counts(mfws, samples), index = mfws)
df_counts

| | Gratian0 | Gratian1 | Gratian2 |
|---|---:|---:|---:|
| in | 74 | 1682 | 431 |
| et | 70 | 1542 | 356 |
| non | 24 | 1622 | 314 |
| est | 13 | 1136 | 178 |


In [6]:
df_lengths = pd.DataFrame(lengths, index = ['words'])
df_lengths

| | Gratian0 | Gratian1 | Gratian2 |
|---|---:|---:|---:|
| words | 3605 | 66238 | 14811 |


**Frequencies are expressed here as occurrences per 1,000 words rather than as percentages. This is more convenient because at that scale most of the word frequency values we are concerned with are greater than 1.0.**
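To make the scale argument concrete, the sketch below expresses the same counts both ways. It uses the Gratian0 figures from the tables above (74 occurrences of "in" and 13 of "est" in 3,605 words). The helper names `percent` and `per_thousand` are illustrative only and are not part of the notebook.

```python
def percent(count, length):
    '''occurrences per 100 words'''
    return count / length * 100

def per_thousand(count, length):
    '''occurrences per 1,000 words'''
    return count / length * 1000

# Gratian0 values taken from the counts and lengths tables above
length = 3605
for word, count in [('in', 74), ('est', 13)]:
    print(word, round(percent(count, length), 4), round(per_thousand(count, length), 4))
```

For "est" the percentage falls below 1.0 while the per-1,000 value does not, which is the convenience described above.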

In [7]:
frequencies = (df_counts / df_lengths.values) * 1000
frequencies


| | Gratian0 | Gratian1 | Gratian2 |
|---|---:|---:|---:|
| in | 20.527046 | 25.393279 | 29.099993 |
| et | 19.417476 | 23.279688 | 24.036189 |
| non | 6.65742 | 24.487454 | 21.200459 |
| est | 3.606103 | 17.150276 | 12.018095 |


At this point we need to temporarily set aside the Gratian0 column, which holds the candidate sample. For now we are only interested in calculating the mean and sample standard deviation of the values in the two columns we are comparing the candidate to: Gratian1 and Gratian2.

In [8]:
selected = frequencies[['Gratian1', 'Gratian2']]
selected

| | Gratian1 | Gratian2 |
|---|---:|---:|
| in | 25.393279 | 29.099993 |
| et | 23.279688 | 24.036189 |
| non | 24.487454 | 21.200459 |
| est | 17.150276 | 12.018095 |


In [9]:
means = selected.mean(axis = 1).to_frame('mean')
means

| | mean |
|---|---:|
| in | 27.246636 |
| et | 23.657939 |
| non | 22.843957 |
| est | 14.584185 |


$s=\sqrt{\frac{1}{N - 1}\sum_{i=1}^N(x_i-\bar{x})^2}$
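As a check on the formula, the sketch below computes the sample standard deviation by hand for a single row, using the "in" frequencies for Gratian1 and Gratian2 from the table above, and compares it with `statistics.stdev` from the standard library. pandas' `DataFrame.std` likewise divides by N - 1 by default (`ddof=1`). The variable names are illustrative.

```python
import math
import statistics

# 'in' frequencies per 1,000 words for Gratian1 and Gratian2 (from the table above)
values = [25.393279, 29.099993]

mean = sum(values) / len(values)
# sample standard deviation: divide by N - 1, not N
s = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))

print(round(s, 6))
print(round(statistics.stdev(values), 6))
```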

In [10]:
stds = selected.std(axis = 1).to_frame('std')
stds

| | std |
|---|---:|
| in | 2.621043 |
| et | 0.534927 |
| non | 2.324257 |
| est | 3.629 |


$z=\frac{x - \bar{x}}{s}$
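For a single cell, the formula can be worked through by hand. The sketch below takes the Gratian0 frequency of "in" together with the mean and standard deviation for that row, all copied from the tables above. The variable names are illustrative.

```python
x = 20.527046      # Gratian0 frequency of 'in' per 1,000 words
mean = 27.246636   # mean of the Gratian1 and Gratian2 frequencies of 'in'
std = 2.621043     # sample standard deviation of those two frequencies

z = (x - mean) / std
print(round(z, 6))
```

The negative result says that "in" is less frequent in Gratian0 than in the two comparison samples, measured in units of their standard deviation.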

In [11]:
zs = (frequencies - means.values) / stds.values
zs

| | Gratian0 | Gratian1 | Gratian2 |
|---|---:|---:|---:|
| in | -2.563709 | -0.707107 | 0.707107 |
| et | -7.927182 | -0.707107 | 0.707107 |
| non | -6.964178 | 0.707107 | -0.707107 |
| est | -3.025098 | 0.707107 | -0.707107 |


**Again, remember that the means and standard deviations have been computed from the values in the Gratian1 and Gratian2 columns *only*!**
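This also accounts for the ±0.707107 values in the Gratian1 and Gratian2 columns. With only two reference values whose difference is d, each lies d/2 from their mean and the sample standard deviation is |d|/√2, so the z-scores are always ±1/√2 whatever the frequencies are. The sketch below illustrates this with the "est" row. The helper `reference_zs` is hypothetical and not part of the notebook.

```python
import math
import statistics

def reference_zs(a, b):
    '''z-scores of two values against their own mean and sample std'''
    mean = (a + b) / 2
    std = statistics.stdev([a, b])
    return (a - mean) / std, (b - mean) / std

# 'est' frequencies for Gratian1 and Gratian2, from the table above
z1, z2 = reference_zs(17.150276, 12.018095)
print(round(z1, 6), round(z2, 6))
print(round(1 / math.sqrt(2), 6))
```

Only the Gratian0 column therefore carries information at this stage, since the comparison samples' own z-scores are fixed by construction.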