# Word Distributions in the Simulated Text

As noted in the paper, the rhyme baselines at the various positions match well between the simulated texts and the real texts, but there might be a concern that the words used in the simulation are distributed differently overall (e.g. some words might never appear because of the algorithm, affecting the rhyme possibilities). Here we carry out a rough-and-ready sanity check on the final words of the Aeneid versus the simulated Aeneid-like lines. The final words are chosen because they are the ones most constrained by my fairly simplistic greedy line-building algorithm.

In [1]:
from bs4 import BeautifulSoup

from mqdq import rhyme, utils, rhyme_classes, babble

import random
import string
import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded so sp.stats works
import pandas as pd
from collections import Counter

In [2]:
# Build the Babbler that will create our simulated lines

aen_single_bab = babble.Babbler.from_file('mqdq/VERG-aene.xml', name='Aeneid')

In [3]:
print("%d lines in the Aeneid." % len(aen_single_bab._syl_source()))

9840 lines in the Aeneid.


## We create 5x more simulated text than real

As a natural consequence of random sampling, a simulated text of the same length as the original would miss many words. In general, for $N$ equally likely items, a sample with replacement of size $n=N$ will hit about $1 - 1/e \approx 63\%$ of the items (roughly two-thirds). To see whether the full range of words is able to appear we need to sample more than $N$. The factor of five is arbitrary, but should be enough.
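The coverage figure can be checked with a quick simulation. This is only a sketch under a simplifying assumption: the items are drawn uniformly, whereas real word frequencies are heavily skewed, so rare words in the actual text are harder to hit. The function name `fraction_hit` and the item count are hypothetical and chosen for illustration.

```python
import math
import random

def fraction_hit(n_items, n_draws, seed=0):
    """Fraction of n_items seen at least once in n_draws uniform draws
    with replacement (illustrative helper, not part of mqdq)."""
    rng = random.Random(seed)
    seen = {rng.randrange(n_items) for _ in range(n_draws)}
    return len(seen) / n_items

N = 10_000  # hypothetical number of unique items

same_size = fraction_hit(N, N)      # sample as large as the item set
five_times = fraction_hit(N, 5 * N)  # sample five times larger

print("n = N:  %.3f (theory: %.3f)" % (same_size, 1 - math.exp(-1)))
print("n = 5N: %.3f (theory: %.3f)" % (five_times, 1 - math.exp(-5)))
```

Under the uniform assumption a fivefold sample leaves well under 1% of the items unseen, which is why the factor of five is comfortable here.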

In [4]:
random.seed(42)
sim = rhyme.syllabify(aen_single_bab.hexameter(9840*5))

In [5]:
strip_punc = lambda s: s.translate(str.maketrans('', '', string.punctuation))

In [6]:
# Set up Counter objects to get counts by unique word

sim_ctr = Counter([strip_punc(x[-1].mqdq.text.lower()) for x in sim])
aen_ctr = Counter([strip_punc(x[-1].mqdq.text.lower()) for x in aen_single_bab._syl_source()])

## The *range* of words that appear is more or less the same

This shows that almost any final word in the real Aeneid might be selected as a final word for the simulated text. There is an apparent discrepancy in the numbers here: there are 3636 unique words in the simulated text, but only 3625 that appear in both, i.e. there are 11 final words in the simulated text that do not appear in the original. This is a feature of the algorithm, which can (very rarely) pick a word from the penultimate position if certain coincidences of elision occur, and isn't something to worry about.

In [7]:
print("%d unique final words in Aeneid, %d in simulated text." % (len(aen_ctr),len(sim_ctr)))

3640 unique final words in Aeneid, 3636 in simulated text.


In [8]:
# For each word that appears in both texts, store the difference in counts.
# Divide the simulated counts by 5, since the simulated text is 5x longer.

diffs = []
for (word, count) in aen_ctr.items():
    if sim_ctr[word]:
        diffs.append(count - (sim_ctr[word]/5))

In [9]:
print("%d unique words appear in both (%.2f%% of Aeneid final words are represented)" %(len(diffs), len(diffs)/len(aen_ctr)*100))

3625 unique words appear in both (99.59% of Aeneid final words are represented)
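The bookkeeping behind these numbers (words in both texts, words in only one, and the scaled count differences) can be expressed with set operations on the `Counter` keys. The sketch below uses small made-up counters rather than the real data; the words and counts are purely illustrative.

```python
from collections import Counter

# Toy stand-ins for aen_ctr and sim_ctr (made-up counts, for illustration only)
real = Counter({"armis": 4, "auras": 2, "urbem": 1})
simulated = Counter({"armis": 19, "auras": 11, "tecta": 1})
scale = 5  # the simulated text is five times longer than the real one

# dict key views support set operations directly
both = real.keys() & simulated.keys()
only_sim = simulated.keys() - real.keys()   # appear only in the simulation
only_real = real.keys() - simulated.keys()  # never produced by the simulation

# Difference between the real count and the rescaled simulated count
diffs_by_word = {w: real[w] - simulated[w] / scale for w in both}

print(sorted(both), sorted(only_sim), sorted(only_real))
print(diffs_by_word)
```

Applied to the real counters, `only_sim` would list the 11 simulated-only words discussed above, and `only_real` the Aeneid final words that the simulation never produced.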


## The *distribution* of words is not significantly different

We use the Wilcoxon signed-rank test (more information and references in the [SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html#scipy.stats.wilcoxon)), which describes it as follows:

<blockquote>
The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero. It is a non-parametric version of the paired T-test.
</blockquote>

In the tests below we run the test with and without Pratt's adjustment for zero differences (which is more conservative). In both cases the p-value (0.34 and 0.19) is well above the conventional 0.05 threshold, so we retain $H_0$, which is that the two samples are drawn from the same distribution. In other words, we find no evidence that words appear in the simulated text with a different frequency than in the real text.

In [10]:
sp.stats.wilcoxon(diffs)

WilcoxonResult(statistic=2394540.5, pvalue=0.34391369159103546)

In [11]:
sp.stats.wilcoxon(diffs, zero_method='pratt')

WilcoxonResult(statistic=3141540.5, pvalue=0.1925915136016152)
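To make the difference between the two zero-handling options concrete, here is a minimal pure-Python sketch of the signed-rank statistic. It is not SciPy's implementation and computes no p-value; it assumes the usual definitions, in which the default method discards zero differences before ranking, while Pratt's method ranks the zeros together with everything else and then drops their ranks. The function name and the toy data are hypothetical.

```python
def signed_rank_statistic(diffs, zero_method="wilcox"):
    """Return min(W+, W-) for a list of paired differences (illustrative)."""
    if zero_method == "wilcox":
        # Zero differences are discarded before ranking.
        d = [x for x in diffs if x != 0]
    elif zero_method == "pratt":
        # Zeros take part in the ranking; their ranks are ignored afterwards.
        d = list(diffs)
    else:
        raise ValueError("zero_method must be 'wilcox' or 'pratt'")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for x, r in zip(d, ranks) if x > 0)
    w_minus = sum(r for x, r in zip(d, ranks) if x < 0)
    return min(w_plus, w_minus)

toy = [0.0, 0.0, 1.5, -0.5, 2.0, -1.0, 0.5]  # made-up differences
print(signed_rank_statistic(toy))                       # 4.5
print(signed_rank_statistic(toy, zero_method="pratt"))  # 8.5
```

Because the zeros occupy the lowest ranks under Pratt's method, every non-zero difference is pushed to a higher rank, which is consistent with the larger statistic reported for `zero_method='pratt'` above.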

## The most common words are a good match

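Finally, here are the 15 most common words from both sets. Because the simulated text is five times longer, its counts should be roughly five times the real ones. The match is fairly good: fourteen of the fifteen words are shared, with only 'est' (real) and 'apollo' (simulated) appearing in one list but not the other.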

In [14]:
aen_ctr.most_common(15)

[('armis', 100),
 ('auras', 60),
 ('urbem', 49),
 ('alto', 42),
 ('omnis', 41),
 ('arma', 37),
 ('circum', 35),
 ('est', 34),
 ('oris', 30),
 ('undis', 30),
 ('fatur', 29),
 ('altis', 27),
 ('hostis', 27),
 ('ferro', 26),
 ('bello', 26)]

In [15]:
sim_ctr.most_common(15)

[('armis', 494),
 ('auras', 303),
 ('urbem', 229),
 ('alto', 211),
 ('omnis', 204),
 ('circum', 185),
 ('arma', 172),
 ('oris', 168),
 ('ferro', 146),
 ('apollo', 145),
 ('hostis', 145),
 ('bello', 143),
 ('fatur', 136),
 ('altis', 132),
 ('undis', 131)]
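The two lists can also be compared programmatically. The sketch below copies the counts printed above into plain dictionaries, reports which words are shared, and rescales the simulated counts by the factor of five for a side-by-side look. The variable names are illustrative.

```python
# Top-15 final words, copied from the outputs above
aen_top = dict([('armis', 100), ('auras', 60), ('urbem', 49), ('alto', 42),
                ('omnis', 41), ('arma', 37), ('circum', 35), ('est', 34),
                ('oris', 30), ('undis', 30), ('fatur', 29), ('altis', 27),
                ('hostis', 27), ('ferro', 26), ('bello', 26)])
sim_top = dict([('armis', 494), ('auras', 303), ('urbem', 229), ('alto', 211),
                ('omnis', 204), ('circum', 185), ('arma', 172), ('oris', 168),
                ('ferro', 146), ('apollo', 145), ('hostis', 145),
                ('bello', 143), ('fatur', 136), ('altis', 132), ('undis', 131)])

shared = aen_top.keys() & sim_top.keys()
only_aen = aen_top.keys() - sim_top.keys()
only_sim = sim_top.keys() - aen_top.keys()

print("%d shared; only in Aeneid: %s; only in simulation: %s"
      % (len(shared), sorted(only_aen), sorted(only_sim)))

# Real count next to the simulated count rescaled to the same text length
for word in sorted(shared, key=aen_top.get, reverse=True):
    print("%-8s real=%3d  simulated/5=%6.1f"
          % (word, aen_top[word], sim_top[word] / 5))
```

A word missing from one top-15 list is not necessarily absent from that text; it may simply rank just outside the cut-off.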

## General Conclusions

This seems to indicate that the lexical palette is a good statistical match for the real text (good enough to fool a computer, at least). There are other meta-lexical features (double disyllables, etc.) that a human would notice as non-Vergilian, but the algorithm is good enough for its purpose.

Any word in a given position might appear in the simulated text, and will appear with roughly the same frequency as in the real text (rare words remain rare, common words remain common).