# Love and War: Stylometry in Latin Poetry

In this lab, we're going to use word counts to look at the difference between two genres of Latin poetry, Epic and Elegy.
<div class="alert alert-info" style="margin:2em;">
<h5>Epic</h5>
<ul>
    <li>Treats grand events from mythology and history: e.g., the Trojan War, Jason and the Argonauts, the founding of Rome.</li>
    <li>Long poems (often around 10,000 lines), divided into chapter-like "books."</li>
    <li>Written as a continuous series of hexameter lines.</li>
    <li>Elevated style and diction.</li>
</ul>
</div>
<div class="alert alert-info" style="margin: 2em;">
<h5>Elegy</h5>
<ul>
    <li>Treats personal issues—especially love, but also friendships and grudges.</li>
    <li>Shorter poems (tens to hundreds of lines), arranged loosely in groups.</li>
    <li>Lines are grouped in couplets: one hexameter and one pentameter.</li>
    <li>Diction is often lower—includes conversational and vulgar language.</li>
</ul>
</div>

### The corpus

These are the texts I've chosen to represent each genre:

#### Epics

<table class="table table-striped">
    <thead>
        <tr>
            <th>Author</th>
            <th>Title</th>
            <th>Date</th>
            <th>Lines</th>
            <th>Divided into</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Vergil</td>
            <td><em>Aeneid</em></td>
            <td>19 BCE</td>
            <td>9896</td>
            <td>12 Books</td>
        </tr>
        <tr>
            <td>Lucan</td>
            <td><em>Civil War</em></td>
            <td>65 CE</td>
            <td>8061</td>
            <td>10 Books</td>
        </tr>
        <tr>
            <td>Statius</td>
            <td><em>Thebaid</em></td>
            <td>90s CE</td>
            <td>9731</td>
            <td>12 Books</td>
        </tr>
        <tr>
            <td>Valerius Flaccus</td>
            <td><em>Argonautica</em></td>
            <td>90s CE</td>
            <td>5592</td>
            <td>8 Books</td>
        </tr>
    </tbody>
</table>

#### Elegies

<table class="table table-striped">
    <thead>
        <tr>
            <th>Author</th>
            <th>Title</th>
            <th>Date</th>
            <th>Lines</th>
            <th>Divided into</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Catullus</td>
            <td>Poems 65–116</td>
            <td>50s BCE</td>
            <td>643</td>
            <td>—</td>
        </tr>
        <tr>
            <td>Tibullus</td>
            <td><em>Elegies</em> (Books 1 and 2)</td>
            <td>30–20 BCE</td>
            <td>1241</td>
            <td>2 Books</td>
        </tr>
        <tr>
            <td>Propertius</td>
            <td><em>Elegies</em></td>
            <td>20s–10s BCE</td>
            <td>3982</td>
            <td>4 Books</td>
        </tr>
        <tr>
            <td>Ovid</td>
            <td><em>Amores</em></td>
            <td>10s (?) BCE</td>
            <td>2458</td>
            <td>3 Books</td>
        </tr>
        <tr>
            <td>Pseudo-Tibullus (perhaps several authors)</td>
            <td>Transmitted with Tibullus as <em>Elegies</em> Book 3</td>
            <td>1st century CE (??)</td>
            <td>684</td>
            <td>—</td>
        </tr>
    </tbody>
</table>

### Data and sources

The texts we're using today were all drawn from online, public-domain or open-source editions. 

The Latin texts I occasionally refer to are all from the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/); Perseus in turn digitized and hand-corrected them from out-of-copyright editions of the late 19th and early 20th centuries. You can browse Perseus' library [here](http://scaife.perseus.org/), or download their entire Latin collection (in XML format) [here](https://github.com/PerseusDL/canonical-latinLit).

**The English texts** we'll use for most of the lab are from a couple of places, but all essentially go back to out-of-copyright volumes from the [Loeb Classical Library](https://www.loebclassics.com/page/history). Catullus is from Perseus. Most of the epic authors are from the [Theoi Texts Library](http://theoi.com/Library.html). The rest I got by searching for specific Loebs at the [Internet Archive](https://archive.org/).

I've modified all of these texts to make them easier for us to use. Depending on the source, digital texts can be easy to work with or very difficult. Perseus' texts are formatted so that you can download and manipulate them easily using scripts; the Internet Archive texts, on the other hand, are just raw text, often uncorrected OCR, which means they have lots of weird "typos."

You can browse the texts using Jupyter's dashboard. Click on the `texts` folder, and then the `latin` or `english` subfolder, and then pick a text to examine. If you have trouble finding the files, just let me know and I can help out.

<div class="alert alert-warning">
<h5>✏️ Analysis</h5>
<p>Before moving on, take a moment with your group to <strong>look at the data summary</strong> above. Then <strong>skim through one or two of the text files.</strong></p>
<ul>
    <li>Do you notice any patterns, trends, or systematic differences between the groups?</li>
    <li>Do any of the texts stand out as anomalous?</li>
    <li>Does anyone in your group know something interesting about these authors or works?</li>
    <li>Can you suggest any possible sources of bias we might want to watch out for?</li>
</ul>
<p>Spend a few minutes talking it over, then <strong>write a brief response</strong> in the cell below.</p>
</div>

## Preliminaries

For this lab, I've written a lot of the code we'll need ahead of time. Please run the following cell to get started. 

In [None]:
# for dealing with files and directories
import os

# for string processing
import re

# for dictionary-like counters
from collections import Counter

# for graphing
import numpy
from matplotlib import pyplot
%matplotlib inline

### Read the directory

The following code snippet reads the contents of the `texts/english` directory.
   - files ending with `.txt` are saved to a list called `files`

<div class="alert alert-info" style="margin:0.5em 2em;">
👉🏻 If you want to try processing the Latin files later, just change <code>english</code> to <code>latin</code> and re-run this code.
</div>
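
As a rough illustration of the filtering step, the sketch below builds a throwaway directory and keeps only the `.txt` files, sorted alphabetically. The file names `b.txt`, `a.txt`, and `notes.md` are made up for the demo and are not part of the lab's corpus. It uses a list-style expression, which should be equivalent to the `for` loop in the next cell.

```python
import os
import tempfile

# build a temporary folder with made-up file names (not the lab's real files)
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ['b.txt', 'a.txt', 'notes.md']:
        with open(os.path.join(demo_folder, name), 'w') as handle:
            handle.write('placeholder\n')

    # keep only names ending in .txt, in alphabetical order
    demo_files = sorted(f for f in os.listdir(demo_folder) if f.endswith('.txt'))

print(demo_files)
```

Sorting matters because `os.listdir()` makes no promise about the order in which it returns names.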

In [None]:
# choose a folder to work with
folder = os.path.join('texts', 'english')

# create a list of files
files = []
for f in os.listdir(folder):
    if f.endswith('.txt'):
        files.append(f)

# reorder the list alphabetically
files.sort()

### Create a list of genre labels

To help keep our data organized, it's useful to have another list, *in the same order*, that gives the genre for each of the files. The code below assigns each file to a genre based on the author's name, which is built into the file name.
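
To see how the author's name can be pulled out of a file name, here is a small sketch. It assumes the names follow an `author.title_book.txt` pattern, which is inferred from examples such as `lucan.civil_war_07.txt`; the variable names `demo_names` and `demo_authors` are just for illustration.

```python
# two file names in the style used in this lab
demo_names = ['lucan.civil_war_07.txt', 'vergil.aeneid_01.txt']

# everything before the first full stop is taken to be the author
demo_authors = [name.split('.')[0] for name in demo_names]

print(demo_authors)
```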

In [None]:
# start with an empty list
genres = []

for file in files:
    
    # author is first element of filename
    auth = file.split('.')[0]
    
    # assign genre
    if auth in ['lucan', 'statius', 'valerius_flaccus', 'vergil']:
        genre = 'epic'
    else:
        genre = 'elegy'
        
    # add genre to list
    genres.append(genre)

<div class="alert alert-warning">
<h5>✏️ Double check</h5>
<p>
Before moving on, check the contents of the two lists against the names of the files in the <code>texts/english</code> folder.
</p>
<ul>
    <li>In the cell below, <strong>write a <code>for</code> loop</strong> to print out the items of <code>files</code> and <code>genres</code>.</li>
</ul>
<p><strong>Hint:</strong> Begin like this...</p>
<pre>
for file, genre in zip(files, genres):
</pre>
</div>

## Part I: Counting words

The following code defines a new function, `wordCount()`:
  - it takes the path to a file as its argument
  - it returns a dictionary (actually, a dictionary-like object called a [*Counter*](https://docs.python.org/3.6/library/collections.html#counter-objects)).

<div class="alert alert-info">
<strong>Dictionaries</strong> are containers that store multiple <strong>values</strong>. Each value is accessed using a unique <strong>key</strong>. In this case, the values are counts, and the keys are the words to be counted. You access the count for a given word by passing that word as a key to the dictionary.
</div>
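
A quick sketch of how a Counter behaves, using a made-up sentence rather than one of our texts. The name `demo_counts` is only for illustration.

```python
from collections import Counter

# count the words in a made-up sentence
demo_counts = Counter('arms and the man and the sea'.split())

# a word that occurs twice
print(demo_counts['and'])

# a missing word: a Counter gives 0, where a plain dict would raise KeyError
print(demo_counts['love'])
```

That zero-for-missing behaviour is one reason a Counter is convenient for word counts: you can ask about any word without checking first whether it occurs.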

Run the cell below, then proceed with the example that follows.

In [None]:
def wordCount(path, normalize=False):
    '''Read a text file and produce a set of word counts as a Counter.'''
    
    # a dictionary to collect the word counts
    wc = Counter()
    
    # open the file
    file = open(path)
    
    # process each line in turn:
    #  - cut off the line number tags
    #  - make the line lowercase
    #  - remove all non-letter characters
    #  - break into words
    #  - make sure there are no empty strings
    
    for line in file:

        # remove locus tags if present (only in latin)
        if '\t' in line:
            text = line.split('\t')[-1]
        else:
            text = line
        
        # lowercase
        text = text.lower()
        
        # scrub non-letter chars
        text = re.sub(pattern='[^a-z ]', repl='', string=text)
        
        # split into words
        words = text.strip().split()
        
        # add words to counter
        wc.update(words)

    # close the file now that every line has been read
    file.close()
          
    # optional normalization to freq per thousand words
    if normalize:
        total = sum(wc.values())
        for word in wc:
            wc[word] = round(1000 * wc[word] / total, 2)

    return wc

#### Example

```python
# pick a file to examine; must include the enclosing folder
file = os.path.join(folder, 'lucan.civil_war_07.txt')

# call the function and assign the result to a variable
wc = wordCount(file)

# check a specific word count
print(wc['caesar'])
```
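
If you're curious what `wordCount()` does to each line, the cleaning steps can be tried on a single string. The sketch below uses the first line of the *Aeneid*; the locus tag before the tab (`verg. aen. 1.1`) is a made-up stand-in, since the exact tag format in the Latin files isn't shown here.

```python
import re

# a line in tab-separated style: (hypothetical) locus tag, tab, text
demo_line = 'verg. aen. 1.1\tArma virumque cano, Troiae qui primus ab oris\n'

# keep only the text after the last tab
demo_text = demo_line.split('\t')[-1]

# lowercase, then delete everything except a-z and spaces
demo_text = demo_text.lower()
demo_text = re.sub('[^a-z ]', '', demo_text)

# split on whitespace; split() with no argument never returns empty strings
demo_words = demo_text.strip().split()

print(demo_words)
```

Note that deleting non-letter characters also joins up words that contain apostrophes or hyphens, so "don't" would be counted as `dont`.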

<div class="alert alert-warning">
<h5>✏️ Try it out</h5>
<p>
Test the function on book 1 of Vergil's <em>Aeneid</em>. How many times does this book refer to . . .
</p>
<ul>
    <li>Aeneas</li>
    <li>Dido</li>
    <li>Troy</li>
    <li>Rome</li>
</ul> 
<p><strong>Show your code</strong> in the cell below. Remember to use square brackets around the word you want to look up.</p>
<p>🤔 Can you use a loop to check all these words?</p>
<p>🤔 If you find you're getting the hang of it, then try some other words, or maybe a different book for comparison.</p> 
</div>

## Part II: A toy model of genre

Let's start with a comparison of word counts for a couple of important thematic words for these genres: "arms" (Latin *arma*) and "love" (Latin *amor*).

<div class="alert alert-info" style="margin: 0.5em 2em;">
<p><strong><em>arma</em></strong>, famously the first word of Vergil's <em>Aeneid</em>, is a key word for epic. Its meaning takes in both weapons and armour: swords and spears used for attack, but also the divine shield made for Aeneas. It's not completely absent from elegy, though. A common elegiac trope compares the lover to a soldier on campaign; and Ovid uses the word to open his <em>Amores</em> as a kind of literary joke on Vergil.</p>
<p><strong><em>amor</em></strong> means "love," either in the abstract or personified as Cupid. In the plural it means "love affairs." Obviously a principal theme of elegy, it also has an important place in epic: think of the affairs of Aeneas and Dido, Jason and Medea, among others.</p>
</div>

### Tallying

In the code below, `wordCount()` is called on each of the files in turn. A summary table is produced comparing counts for 'arms' and 'love', listed alongside the file and genre labels.

In [None]:
# iterate over file and genre tags together
for file, genre in zip(files, genres):
    
    # get file path and call wordCount
    path = os.path.join(folder, file)
    wc = wordCount(path)
    
    # look up each word's count (0 if the word never occurs)
    arms = wc.get('arms', 0)
    love = wc.get('love', 0)

    print('\t'.join([genre, str(arms), str(love), file]))

<div class="alert alert-warning">
<h5>✏️ Analysis</h5>
<p>Before moving on, take a moment with your group to <strong>look at the data</strong> above.</p>
<ul>
    <li>What range of values does the 'love' tally cover? What about 'arms'?</li>
    <li>Is there a noticeable difference between the genres?</li>
    <li>What kind of variation exists <em>within</em> a given text (i.e., from book to book)? Do single books stand out?</li>
    <li>Is there a more useful way to organize this data?</li>
</ul>
<p>Spend a few minutes talking it over, then <strong>write a brief response</strong> in the cell below.</p>
</div>

### Normalization

One issue that I expect will have come up in our discussions is that the books are of different lengths. Raw word counts are therefore not as meaningful as word use **rates**. A common measure of word frequency is **count per thousand words**. The process of turning observed counts into frequencies like this is called **normalization**.

I've already built an **optional parameter** called `normalize` into the `wordCount()` function. By default, it's set to `False`; to use it, you have to manually pass a value of `True` when you call the function.

**Example:**
```python
wc = wordCount('texts/english/vergil.aeneid_01.txt', normalize=True)
```
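
The arithmetic behind normalization is simple enough to check by hand. Here is a sketch on made-up counts for a tiny eight-word "text"; the names `demo_counts`, `demo_total`, and `demo_rates` are just for illustration.

```python
from collections import Counter

# made-up counts for a tiny "text" of 8 words
demo_counts = Counter({'arms': 3, 'love': 1, 'the': 4})
demo_total = sum(demo_counts.values())

# frequency per thousand words = 1000 * count / total
demo_rates = {word: round(1000 * count / demo_total, 2)
              for word, count in demo_counts.items()}

print(demo_rates)
```

Because every count is divided by the same total, normalization changes the scale but not the ranking of words within a single file; what it changes is how files of different lengths compare with one another.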

<div class="alert alert-warning">
<h5>✏️ You try it</h5>

<p>Redo the table above with normalized values.</p>
<ul>
    <li>Copy-paste the previous code block to the cell below.</li>
    <li>Modify the call to <code>wordCount()</code> so that results are normalized.</li>
</ul>    
</div>


<div class="alert alert-warning">
<h5>✏️ Analysis</h5>
<p>Very briefly, use the cell below to respond:</p>
<ul>
    <li>What are the new ranges like?</li>
    <li>Are the normalized values harder to read, or easier?</li>
    <li>Does the broad epic/elegy relationship still hold?</li>
</ul>
</div>

### Plotting

As you likely guessed, the next step is to plot the values; often we notice things visually that might be difficult to pick out from the numbers. The code below defines a new function, `plotWords()`:

 - arguments are two strings: `x_word` and `y_word`
 - the function plots all the files on a graph, using the frequencies of these two words
 - pass the optional argument `normalize=True` to plot frequency per thousand words instead of raw counts

**Example:**
```python
plotWords('arms', 'love')
```
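
The function defined in the next cell separates the two genres using numpy's boolean indexing. If that idiom is new to you, this sketch shows it in isolation, on made-up values for four imaginary files.

```python
import numpy

# made-up values for four imaginary files
demo_x = numpy.array([12, 3, 9, 1])
demo_genres = numpy.array(['epic', 'elegy', 'epic', 'elegy'])

# comparing an array to a string gives one True/False per element
demo_mask = demo_genres == 'epic'

# indexing with the mask keeps only the elements where it is True
print(demo_x[demo_mask])
```

This only works if the two arrays are in the same order, which is why we were careful to build `files` and `genres` in parallel.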

In [None]:
from matplotlib import pyplot
%matplotlib inline
import numpy

#
# define custom plotting function
#

def plotWords(x_word, y_word, normalize=False):
    '''Count frequencies of given words and draw a graph'''

    # create a graph
    fig = pyplot.figure(figsize=(8,5))
    ax = fig.add_axes([.1, .1, .8, .8])
    ax.set_xlabel(x_word)
    ax.set_ylabel(y_word)

    # collect word counts
    x_vals = []
    y_vals = []

    for file in files:
        path = os.path.join(folder, file)
        wc = wordCount(path, normalize=normalize)

        x_vals.append(wc.get(x_word, 0))
        y_vals.append(wc.get(y_word, 0))

    # use numpy for easier array slicing
    x = numpy.array(x_vals)
    y = numpy.array(y_vals)
    g = numpy.array(genres)

    # plot epic and elegy as two series
    ax.plot(x[g=='epic'], y[g=='epic'], marker='o', linestyle='', label='epic')
    ax.plot(x[g=='elegy'], y[g=='elegy'], marker='o', linestyle='', label='elegy')
    ax.legend()

<div class="alert alert-warning">
<h5>✏️ You try it</h5>

<p>Run the function to plot <code>arms</code> versus <code>love</code> for all the files. Compare the raw and normalized results.</p>
</div>

<div class="alert alert-warning">
<h5>✏️ Analysis</h5>

<p>With your group, take a few minutes to look at the plots and discuss.</p>
<ul>
    <li>What patterns are apparent?</li>
    <li>Can you suggest a simple rule for predicting the genre of a work? How many would it get right?</li>
    <li>Are both word counts equally useful in distinguishing epic and elegy?</li>
    <li>Are there outliers among the texts? Can you figure out which they are?</li>
</ul>

<p>Respond briefly in the cell below.</p>
</div>

## Part III: Expanding the model

Having seen what we can do with the toy model, let's broaden our approach to consider other words. First, we'll get an overview of the lexicon for the entire corpus by creating one master word count. Run the code below to get started.

In [None]:
# new count for the whole corpus
wc_total = Counter()

# add up individual counts
for file in files:
    path = os.path.join(folder, file)
    wc_total.update(wordCount(path))

You can check the frequencies of individual words, as you did above. You can also get the top <var>N</var> words in descending order with a special Counter method called `.most_common()`.

**Example:**
```python
wc_total.most_common(25)
```
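
To make clear what `.update()` and `.most_common()` are doing, here is a sketch with made-up counts standing in for two files.

```python
from collections import Counter

# made-up counts standing in for two files
demo_a = Counter({'the': 5, 'arms': 2})
demo_b = Counter({'the': 4, 'love': 3})

# update() adds counts together rather than replacing them
demo_total = Counter()
demo_total.update(demo_a)
demo_total.update(demo_b)

# most_common(n) returns (word, count) pairs, highest count first
print(demo_total.most_common(2))
```

Note that the master count is built from raw counts, not normalized ones, so longer books contribute more to it than shorter ones.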

<div class="alert alert-success">
<h4>Try it out</h4>
<p>🤔 Investigate the most common words in our corpus.</p>
<p>🤔 Try using plotWords() to explore other word pairs.</p>
<p>Then proceed to the analysis questions below.</p>
</div>

<div class="alert alert-warning">

<h5>✏️ Analysis</h5>
<ul>
    <li>Do you see an approximate division by frequency into function / content words?</li>
    <li>Among the content words, which seem most prominent?</li>
    <li>Do function/content words behave differently in <code>plotWords()</code>?</li>
    <li>What words in particular get the best separation between the genres?</li>
    </ul>
    <p>Respond briefly in the cell below.</p>
</div>

### Bonus: Principal Components Analysis

<p>In the Hildegard study, the authors used Principal Components Analysis to distil a large number of word frequencies down to <var>x</var> and <var>y</var> parameters for their graphs.</p>

<p>The code below will do the same for our own corpus. It defines a new function, <code>plotPCA()</code>, which takes an optional parameter <code>n</code>, the number of word frequencies to use. By default, <code>n</code> is 500.</p>

**Example**:

```python
# with default n
plotPCA()

# with custom n
plotPCA(250)
```
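
For a feel of what PCA does under the hood, here is a minimal sketch on made-up numbers. It uses plain numpy (centring plus a singular value decomposition) as a stand-in for scikit-learn's `decomposition.PCA`; as far as I know the two should give the same coordinates up to a sign flip, but treat this as an illustration and not as the library's exact procedure.

```python
import numpy

# made-up frequencies: 4 imaginary files (rows) by 3 words (columns)
demo_data = numpy.array([
    [9.0, 1.0, 5.0],
    [8.0, 2.0, 5.5],
    [2.0, 8.0, 4.5],
    [1.0, 9.0, 5.0],
])

# centre each column on its mean
demo_centred = demo_data - demo_data.mean(axis=0)

# singular value decomposition; rows of vt are the principal directions
u, s, vt = numpy.linalg.svd(demo_centred, full_matrices=False)

# project each file onto the first two directions to get its x and y
demo_pcs = demo_centred @ vt[:2].T

print(demo_pcs.shape)
```

In this toy data the first two files resemble each other, as do the last two, so the first component should place the two pairs on opposite sides of zero.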

In [None]:
%matplotlib inline
from sklearn import decomposition

def plotPCA(n=500):
    '''Create PCA chart from top n word frequencies'''

    features = [w for w, c in wc_total.most_common(n)]

    # build feature vectors
    data = []

    for file in files:
        path = os.path.join(folder, file)
        wc = wordCount(path, True)

        this_vec = [wc.get(w, 0) for w in features]
        data.append(this_vec)

    data = numpy.array(data)

    # create author labels
    authors = []
    for f in files:
        author = f.split('.')[0]
        authors.append(author)

    # reduce dimensionality with PCA
    npcs = 2
    pcmodel = decomposition.PCA(npcs)
    pca = pcmodel.fit_transform(data)

    # create a graph
    fig, ax = pyplot.subplots()
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.set_title('Top {} words'.format(n))

    # use numpy for easier array slicing
    x = pca[:,0]
    y = pca[:,1]
    a = numpy.array(authors)

    # plot each author as a separate series
    for auth in sorted(set(a)):
        ax.plot(x[a==auth], y[a==auth], marker='o', linestyle='', label=auth)

    # draw the legend once, after all series are plotted
    ax.legend(loc="upper left", bbox_to_anchor=(1,1))

<div class="alert alert-success">
<h4>You try it</h4>
<p>🤔 Run the code and look at the results.</p>
<p>🤔 Try adjusting the number of words considered by passing a different value of <code>n</code> (e.g., <code>plotPCA(250)</code>) and re-running it.</p>
<p>Then proceed to the analysis below (if there's time).</p>
</div>

<div class="alert alert-warning">
<h5>✏️ Analysis</h5>
<ul>
<li>What do you notice in the graphs?</li>
<li>Any final thoughts on this lab?</li>
</ul>
<p>Record your reflections in the cell below.</p>
</div>