# [MEDST-250] Stylometry

---
<img src="http://www.cleargoals.com/wp-content/uploads/2017/04/data-science-methods-and-algorithms-for-big-data.jpg" style="width: 500px; height: 275px;" />

This notebook reproduces several findings from Emily Thornbury's chapter "The Poet Alone" in her book *Becoming a Poet in Anglo-Saxon England*, in particular Fig. 4.5 on page 170.

---

### Topics Covered
- Python `str` and `list` Basics
- Analyzing Text
- Basic Visualization

### Table of Contents

1 - [Python Lists](#section 1)<br>

2 - [List Comprehensions](#section 2)<br>

3 - [Word Frequencies](#section 3)<br>

4 - [Counting](#section 4) <br>

5 - [Ad Hoc Stylometry](#section 5)<br>

6 - [Super Challenge: Acrostics](#section 6)<br>

## 1. Python Lists<a id='section 1'></a>

First, we're going to learn how to work with Python lists. We've already seen lists a bit in previous lessons. Lists allow us to store words for easier manipulation later on. After all, how can we count features of a string unless we first make a list of items out of it?

Here's an example of a list:

In [None]:
["þæt", "wearð", "underne"]

How do I know?

In [None]:
type(["þæt", "wearð", "underne"])

We can assign these to variables too!

In [None]:
first_hemistich = ["þæt", "wearð", "underne"]
second_hemistich = ["eorðbuendum"]
print(first_hemistich)
print(second_hemistich)

We can also combine lists with `+`, which concatenates them:


In [None]:
print(first_hemistich + second_hemistich)

Let's assign that to `first_line`:

In [None]:
first_line = first_hemistich + second_hemistich
print(first_line)

You can get the length of a list using the `len` function:


In [None]:
len(first_line)

You can index into lists with brackets `[ ]`. Let's get the first word of the first line:

#### NOTE: Python (and many other languages) start counting from 0!

In [None]:
print(first_line[0]) # returns the first word 

In [None]:
print(first_line[1]) # returns the second word 

You can get a range of elements using a colon. Slicing a list this way returns another list.

In [None]:
print(first_line[:2])
print(type(first_line[:2])) 
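To make the slice notation concrete, here is a small sketch on a made-up list (the name `words` is only for illustration). A slice `[start:stop]` includes the element at `start` and stops just before `stop`.

```python
# Illustrative list; the variable name is an assumption, not part of the notebook
words = ["þæt", "wearð", "underne", "eorðbuendum"]

first_two = words[:2]   # indices 0 and 1
middle = words[1:3]     # indices 1 and 2 (index 3 is excluded)
last_two = words[-2:]   # negative indices count from the end

print(first_two)
print(middle)
print(last_two)
```

Leaving out `start` means "from the beginning", and leaving out `stop` means "to the end".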

Here's a small exercise to test your knowledge of Python lists.

Below are the first three lines of *Christ and Satan* assigned to three `list`s:

In [None]:
first_line = ['þæt', 'wearð', 'underne', 'eorðbuendum,']
second_line = ['þæt', 'meotod', 'hæfde', 'miht', 'and', 'strengðo']
third_line = ['ða', 'he', 'gefestnade', 'foldan', 'sceatas.']

### Challenge 1:
Concatenate the first three lines of *Christ and Satan*.

In [None]:
## YOUR CODE HERE

### Challenge 2:
Retrieve the third element from the combined list.

In [None]:
## YOUR CODE HERE

### Challenge 3:
Retrieve the fourth through sixth elements from the combined list.

In [None]:
## YOUR CODE HERE

### Challenge 4:
`print` the number of words in the first three lines.

In [None]:
## YOUR CODE HERE

## 2. List Comprehensions <a id='section 2'></a>

List comprehensions let us build a new list by transforming or filtering the elements of an existing one, without writing out a loop. For example, here's our first line again:

In [None]:
print(first_line)

We can subset that to those words containing an "e":

In [None]:
[word for word in first_line if "e" in word] #Using List Comprehension

##### INSTEAD OF:

In [None]:
has_e = []
for word in first_line:
    if "e" in word:
        has_e.append(word)
has_e

Now you know why list comprehensions are one of the best parts of Python!
In relation to text analysis, list comprehensions will come in handy when we want to parse and sift through text.
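A comprehension can also transform each element rather than filter the list. The sketch below is only an illustration on a made-up `line` variable: the expression before `for` decides what goes into the new list.

```python
# Illustrative list; `line` is an assumed name, not part of the notebook
line = ["þæt", "wearð", "underne", "eorðbuendum"]

# Transform every word: the expression before `for` is applied to each element
shouted = [word.upper() for word in line]

# The expression can be any function of the element, e.g. its length
lengths = [len(word) for word in line]

print(shouted)
print(lengths)
```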

### Challenge 5:

Create a new list from the first three lines of *Christ and Satan* that contains the first letter of each word.

In [None]:
## YOUR CODE HERE

### Challenge 6:

Create a new list from the first three lines that contains only words longer than three letters.

In [None]:
## YOUR CODE HERE

## 3. Word Frequencies  <a id='section 3'></a>

Let's get started with analyzing the different word frequencies in our text. Run the cell below to open the text file and read it into our notebook:

In [None]:
with open('christ-and-satan.txt', 'r') as f:
    christ_and_satan = f.read()

print(christ_and_satan)

In [None]:
tokens = christ_and_satan.split()
print(tokens)

Looks like a decent start. But we still have verse numbering in there, as well as some punctuation. What if we just want the words?

### Challenge 7

Get a new `list` of words without numbers or punctuation. Try using `punctuation` and `digits`:

In [None]:
from string import punctuation, digits

In [None]:
punctuation

In [None]:
digits

Does it feel like time for a list comprehension? It should.

In [None]:
## YOUR CODE HERE

## 4. Counting <a id='section 4'></a>

Python comes with the convenient `Counter` class in the `collections` module. Given a list, it builds a dictionary-like object that maps each distinct item to the number of times it occurs.

In [None]:
from collections import Counter
cs_dict = Counter(tokens)

In [None]:
cs_dict

In [None]:
cs_dict.keys()

In [None]:
cs_dict.values()

In [None]:
cs_dict.most_common()

Believe it or not, even 1,000 years ago "and" was used all the time :)
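If the full output is hard to read, it can help to try `Counter` on a tiny list first. The tokens below are made up for illustration (`toy_tokens` and `toy_counts` are assumed names).

```python
from collections import Counter

# Made-up tokens, just to see how Counter behaves
toy_tokens = ["and", "þæt", "and", "he", "þæt", "and"]
toy_counts = Counter(toy_tokens)

print(toy_counts["and"])          # the count stored for one key
print(toy_counts["meotod"])       # a missing key gives 0 instead of raising KeyError
print(toy_counts.most_common(2))  # the two most frequent items as (item, count) pairs
```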

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame(cs_dict.most_common()[:10])
df = df.set_index(0)
df.plot.barh()
plt.xlabel('Count')
plt.ylabel('Word')
plt.title('Word Frequency')

### Challenge 8:

A common measure of lexical diversity for a given text is its *Type-Token Ratio*: the ratio of the number of unique words (types) to the total number of words (tokens) in the text. Calculate the Type-Token Ratio for *Christ and Satan*. ***HINT: Try `set`***

In [None]:
## YOUR CODE HERE

## 5. Ad Hoc Stylometry  <a id='section 5'></a>

We can now put together our knowledge of strings, list comprehensions, and frequency plots to look at how often each letter alliterates. Remember: alliteration is the repetition of a sound at the beginning of two or more words in the same line. In Old English verse any vowel alliterates with any other vowel, so the code below counts every vowel-initial word under `'a'`.

Let's start by looking at the first letter of every word in the whole text:

In [None]:
cs_tokens = christ_and_satan.lower().split()
first_letters = [x[0] if x[0] not in ['a','e','i','o','u','y'] else 'a' for x in cs_tokens]
first_l_dict = Counter(first_letters)
print(first_l_dict.most_common())
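The vowel handling in that comprehension is easier to follow when pulled out into a small helper. This is only a sketch: `alliteration_key` and `VOWELS` are hypothetical names, and the vowel list simply mirrors the one in the cell above (so `æ` is not treated as a vowel here either).

```python
# Same vowel list as the cell above; an assumption carried over, not a linguistic claim
VOWELS = set("aeiouy")

def alliteration_key(word):
    """Return the word's first letter, with every vowel mapped to 'a'."""
    first = word[0]
    return "a" if first in VOWELS else first

line = ["þæt", "wearð", "underne", "eorðbuendum"]
keys = [alliteration_key(w) for w in line]
print(keys)
```

Here the two vowel-initial words end up under the same key, which is what lets them count as alliterating with each other.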

We can plot this too:

In [None]:
df = pd.DataFrame(first_l_dict.most_common()[:10])
df = df.set_index(0)
df.plot.barh()
plt.xlabel('Count')
plt.ylabel('Letter')
plt.title('First Letter Frequency')

Cool! But alliteration happens within a line, so we need these counts per line, and Thornbury reports them for each fitt. What's a "fitt"? It's a further division in poetry constituted by a group of lines. Luckily this is nicely delimited by double line breaks (`\n\n`) in our text. If we had a nice XML corpus, it's likely it would be noted there as well!

In [None]:
cs_fitts = christ_and_satan.split('\n\n')  # splits up our text based on the location of double line breaks
print(cs_fitts[0])  # let's just look at the first element for now
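To see what the two levels of splitting do, here is a sketch on a made-up string (`toy_text` is not from the poem): `'\n\n'` separates the groups, and a single `'\n'` separates the lines inside each group.

```python
# Made-up text with two "fitts" of two lines each
toy_text = "line one\nline two\n\nline three\nline four"

toy_fitts = toy_text.split("\n\n")                     # one string per group
toy_lines = [fitt.split("\n") for fitt in toy_fitts]   # one list of lines per group

print(toy_fitts)
print(toy_lines)
```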

Now we need to iterate through each fitt. In each fitt, we'll clean the text and get a list of words *for each line*. Then we'll cycle through *each line* and count the first letters of its words. We'll append the most frequent letter to a list, ultimately collecting the most frequent first letter for each line. Once we have that for every line, we'll count which letters are most frequent for that particular fitt. We'll normalize the counts into proportions so we can compare fitts against one another. Then we'll plot the cumulative proportion of the *four most common* alliterations for that fitt, as Thornbury does. We'll do this for each fitt, and `show` the plot.
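The normalizing and cumulative steps are easier to see on toy numbers first. The sketch below assumes we already have the most frequent first letter for each of 15 made-up lines (`line_letters` is hypothetical data, not from the poem). It uses `itertools.accumulate` for the running totals, which plays the same role as `np.cumsum` in the full cell.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data: the alliterating letter already chosen for each of 15 lines
line_letters = ["a"] * 5 + ["s"] * 4 + ["h"] * 3 + ["w"] * 2 + ["f"]

# (letter, count) pairs, most frequent first
counts = Counter(line_letters).most_common()

# Turn counts into proportions of all lines
total = sum(count for _, count in counts)
proportions = [count / total for _, count in counts]

# Running total over the four most common letters only
cumulative = list(accumulate(proportions))[:4]
print(cumulative)
```

The last value is the share of lines covered by the four most common letters. Here it stays below 1 because a fifth letter accounts for one of the lines.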

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize = (10,10))

# iterate through fitts
for i in range(len(cs_fitts)):
    
    # lowercase the string and get the tokens for each line back
    fitt_tokens = [l.split() for l in cs_fitts[i].lower().split('\n')]
    
    # collect letter of most freq alliteration
    most_freq_allit = []
    
    # cycle through lines
    for l in fitt_tokens:
        
        # get first letter of all words in line
        first_letters = [x[0] if x[0] not in ['a','e','i','o','u','y'] else 'a' for x in l]
        
        # count freq of all first letters
        allit_freq = Counter(first_letters).most_common()
        try:
            # append most freq letter (alliterated letter) to list for all lines
            most_freq_allit.append(allit_freq[0][0])
        except IndexError:
            # a blank line has no words, so there is nothing to append
            pass
    
    # use Counter to get the most common alliterations
    allit_freq = Counter(most_freq_allit).most_common()

    # need keys for x axis
    common_keys = [x[0] for x in allit_freq]
    
    # need values for y axis
    common_values = [x[1] for x in allit_freq]
    
    # normalize so we can compare across fitts despite different numbers of lines
    normed_values = [x[1]/sum(common_values) for x in allit_freq]
    
    # add up to get cumulative alliteration of the four most preferred patterns
    cumulative_values = np.cumsum(normed_values)

    # add the Fitt to the plot
    plt.xticks(range(4), ['1st','2nd','3rd','4th'], rotation='vertical')
    plt.plot(cumulative_values[:4], color = plt.cm.bwr(i*.085), lw=3)

plt.legend(labels=['Fitt '+str(i+1) for i in range(12)], loc=0)
plt.show()

How does this compare to Thornbury's plot?

![img](thornbury-4-5.png)

> Tables 4.1-4.3 list the test poems' four most common alliterative patterns
by fitt. These patterns are graphed cumulatively in Figures 4.1—4.3, so that
the rightmost point in the series indicates the total percentage of the fitt's
lines occupied by the four most-favoured patterns. We can see from these
figures that, within poems, the correspondence between fitts in overall
patterns of variation is relatively close. (p. 164)

> What is particularly noteworthy for our purposes is the gap visible between the upper and lower clusters: the average (seen in Figure 4.5) divides Fitts 1-5 plus 8 from Fitts 6,7,
and 9—12. These two clusters have significant shared characteristics. The upper
set (1-5, 8), which we will call A, has extremely high rates of vowel alliteration
and a restricted number of alliterative patterns overall. On average, 65 per cent
of the lines of fitts in A alliterate on one of their four preferred letters, while
only 53 per cent of those in the lower cluster B do. The average difference
between A and B is greater than that between *Christ III*'s most distant outliers,
and the gap between *Christ and Satan*'s outliers is more than 20 points. It
seems, therefore, that the variation in poetic technique within *Christ and
Satan* is substantial enough to - at very least - require explanation. (p. 169)

---

## 6. Super Challenge: Acrostics  <a id='section 6'></a>

In poetry, an acrostic is a message spelled out by taking certain letters in a pattern across lines. One 9th-century German writer, Otfrid of Weissenburg, was notorious for his early use of acrostics, one instance of which is in the text below: Salomoni episcopo Otfridus. His message can be found by taking the first character of every other line. Print Otfrid's message!

Source: http://titus.uni-frankfurt.de/texte/etcs/germ/ahd/otfrid/otfri.htm
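Two built-ins are handy for picking out items by position. The sketch below shows them separately on a made-up list (`toy_lines` is an assumed name): `enumerate` pairs each item with its index, and `%` gives the remainder after division, so `i % 2` alternates between 0 and 1.

```python
# Made-up list, only to show what enumerate and % return
toy_lines = ["alpha", "bravo", "charlie", "delta"]

# enumerate yields (index, item) pairs
indexed = list(enumerate(toy_lines))
print(indexed)

# i % 2 is 0 for even indices and 1 for odd indices
remainders = [i % 2 for i, _ in indexed]
print(remainders)
```

How to combine the two for the acrostic is left for the challenge.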

In [None]:
text = '''si sálida gimúati      sálomones gúati, 
     ther bíscof ist nu édiles      kóstinzero sédales; 
     allo gúati gidúe thio sín,      thio bíscofa er thar hábetin, 
     ther ínan zi thiu giládota,      in hóubit sinaz zuívalta! 
     lékza ih therera búachi      iu sentu in suábo richi, 
     thaz ir irkíaset ubar ál,      oba siu frúma wesan scal; 
     oba ir hiar fíndet iawiht thés      thaz wírdig ist thes lésannes: 
     iz iuer húgu irwállo,      wísduames fóllo. 
     mir wárun thio iuo wízzi      ju ófto filu núzzi, 
     íueraz wísduam;      thes duan ih míhilan ruam. 
     ófto irhugg ih múates      thes mánagfalten gúates, 
     thaz ír mih lértut hárto      íues selbes wórto. 
     ni thaz míno dohti      giwérkon thaz io móhti, 
     odo in thén thingon      thio húldi so gilángon; 
     iz datun gómaheiti,      thio íues selbes gúati, 
     íueraz giráti,      nales míno dati. 
     emmizen nu ubar ál      ih druhtin férgon scal, 
     mit lón er iu iz firgélte      joh sínes selbes wórte; 
     páradyses résti      gébe iu zi gilústi; 
     ungilónot ni biléip      ther gotes wízzode kleip. 
     in hímilriches scóne      so wérde iz iu zi lóne 
     mit géltes ginúhti,      thaz ír mir datut zúhti. 
     sínt in thesemo búache,      thes gómo theheiner rúache; 
     wórtes odo gúates,      thaz lích iu iues múates: 
     chéret thaz in múate      bi thia zúhti iu zi gúate, 
     joh zellet tház ana wánc      al in íuweran thanc. 
     ofto wírdit, oba gúat      thes mannes júngoro giduat, 
     thaz es líwit thráto      ther zúhtari gúato. 
     pétrus ther rícho      lono iu es blídlicho, 
     themo zi rómu druhtin gráp      joh hús inti hóf gap; 
     óbana fon hímile      sént iu io zi gámane 
     sálida gimýato      selbo kríst ther gúato! 
     oba ih irbálden es gidár,      ni scal ih firlázan iz ouh ál, 
     nub ih ío bi iuih gerno      gináda sina férgo, 
     thaz hóh er iuo wírdi      mit sínes selbes húldi, 
     joh iu féstino in thaz múat      thaz sinaz mánagfalta gúat; 
     firlíhe iu sines ríches,      thes hohen hímilriches, 
     bi thaz ther gúato hiar io wíaf      joh émmizen zi góte riaf; 
     rihte íue pédi thara frúa      joh míh gifúage tharazúa, 
     tház wir unsih fréwen thar      thaz gotes éwiniga jár, 
     in hímile unsih blíden,      thaz wízi wir bimíden; 
     joh dúe uns thaz gimúati      thúruh thio síno guati! 
     dúe uns thaz zi gúate      blídemo múate! 
     mit héilu er gibóran ward,      ther io thia sálida thar fand, 
     uuanta es ni brístit furdir      (thes gilóube man mír), 
     nirfréwe sih mit múatu      íamer thar mit gúatu. 
     sélbo krist ther guato      firlíhe uns hiar gimúato, 
     wir íamer fro sin múates      thes éwinigen gúates!'''

In [None]:
# HINT: remember what % does, (maybe) lookup enumerate
## YOUR CODE HERE

Otfrid was too skillful to settle for only the first letter of every other line. What happens if you extract the last letter of the last word of each line, for every other line starting on the second line?

In [None]:
# HINT: first remove punctuation, tab is represented by \t
from string import punctuation

## YOUR CODE HERE

---
Notebook developed by: Shubham Gupta

Data Science Modules: http://data.berkeley.edu/education/modules
