# Plotting Generic Diction in Latin Poetry with Scattertext

This notebook plots keyness in the generic diction of Latin poetry, specifically Latin love elegy (Propertius, Tibullus, early Ovid) and epic (Virgil's *Aeneid*), using the very attractive interactive plots produced by Jason Kessler's [Scattertext](https://github.com/JasonKessler/scattertext). Kessler's Scattertext demo plot visualizes the relative keyness of Democratic and Republican diction in the corpus of 2012 political convention addresses. I have created a binary model of my own, following the work on elegiac diction in Latin epic poetry in my dissertation, ["*Amor belli*: Elegiac Diction and the Theme of Love in Lucan's *Bellum civile*"](https://fordham.bepress.com/dissertations/AAI10125245/), and in a related article in the *Journal of Data Mining and Digital Humanities*, ["Measuring and Mapping Intergeneric Allusion using *Tesserae*"](https://hal.archives-ouvertes.fr/hal-01282568). Here I create a chunked corpus from the authors listed above; the texts are preprocessed, lemmatized, and then plotted using Scattertext. As I mentioned above, the results are attractive.

<a href='Lemmatized-Genre-Visualization.html' target='_blank'>![Scattertext visualization of elegy/epic](./images/scattertext.png "Scattertext visualization of elegy/epic")</a>

In a single, easy-to-take-in plot, we get an excellent sense of which words are 'elegiac' and which words are 'epic', as well as their relative frequency. (The fact that we can easily search for terms in context is a nice bonus, as is the summary information.) Readers of Latin poetry should not be surprised by the upper-left and lower-right corners, that is, frequent-elegiac and frequent-epic, respectively. *Puella*, *domina*, and *formosus* are decidedly elegiac; *clipeus*, *arduus*, and *immanis* are epic. Cynthia finds a place among the elegiac words, while Turnus and Anchises rest in epic diction. (Aeneas is more epic than not, but [*Heroides* 7](http://www.thelatinlibrary.com/ovid/ovid.her7.shtml) alone brings up the elegiac value.) But what is much more interesting to me, and what has been a central idea in my research on generic diction, is the set of words in the middle of the plot.

This plot makes clear that generic diction does not have to be thought of in discrete, binary terms. Instead, we can create language models, like the simplified elegy-epic model presented here, that show generic diction as continuous. For example, I can say not only that *cupio* is more elegiac than epic, but more specifically that, in this model at least, *cupio* is 79% elegiac / 21% epic. In future notebooks, I will expand on this kind of thinking, that is, on observing literary features as continuous variables in texts. For example, by assigning generic weights to the words in these texts, we can map epicness or elegiacness in [narrative space](https://github.com/diyclassics/literature-experiments/blob/master/plot-narrative-space.ipynb).
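
One simple way to arrive at a split of this kind is to compare a lemma's relative frequency in each genre. The sketch below is only an illustration: the counts are invented, `genre_share` is a hypothetical helper, and this is not necessarily the calculation behind the percentages quoted above.

```python
def genre_share(count_a, total_a, count_b, total_b):
    """Return the share of a term's combined relative frequency that falls in genre A."""
    rate_a = count_a / total_a  # occurrences per token in genre A
    rate_b = count_b / total_b  # occurrences per token in genre B
    if rate_a + rate_b == 0:
        raise ValueError("term does not occur in either genre")
    return rate_a / (rate_a + rate_b)


# Invented counts, for illustration only
elegy_count, elegy_tokens = 30, 50_000
epic_count, epic_tokens = 10, 60_000

share = genre_share(elegy_count, elegy_tokens, epic_count, epic_tokens)
print("{:.0%} elegiac / {:.0%} epic".format(share, 1 - share))
```

Normalizing by the number of tokens in each genre matters here because the two halves of the corpus are not the same size; raw counts alone would favor the larger genre.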

One last point: this notebook uses [spaCy's English NLP tools](https://spacy.io) to prepare the Scattertext plots. I have been working on developing [Latin-specific tools for spaCy](https://github.com/diyclassics/spaCy/tree/latin/spacy/lang), and this experiment further convinces me of the usefulness of that work. Hopefully this is something I can devote more time to soon. [PJB 3.22.18]

In [1]:
# Imports

import os

import pandas as pd

import spacy

from cltk.corpus.latin import latinlibrary
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
from cltk.utils.file_operations import open_pickle

import scattertext as st

from pprint import pprint

In [2]:
# Set up spaCy for Scattertext

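# Note: the 'en' shortcut works in spaCy 2.x; spaCy 3+ requires a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')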

In [3]:
# We need to import a data model to train the lemmatizer.

# Set up training sentences
rel_path = '~/cltk_data/latin/model/latin_models_cltk/lemmata/backoff'
path = os.path.expanduser(rel_path)

# Check for presence of latin_pos_lemmatized_sents
file = 'latin_pos_lemmatized_sents.pickle'      

latin_pos_lemmatized_sents_path = os.path.join(path, file)
if os.path.isfile(latin_pos_lemmatized_sents_path):
    latin_pos_lemmatized_sents = open_pickle(latin_pos_lemmatized_sents_path)
else:
    latin_pos_lemmatized_sents = []
    print('The file %s is not available in cltk_data' % file)

In [4]:
# Set up CLTK tools

word_tokenizer = WordTokenizer('latin')
replacer = JVReplacer()
lemmatizer = BackoffLatinLemmatizer(latin_pos_lemmatized_sents)    

In [5]:
# Set up Latin Library files and build text array
# (the four file lists below are kept for reference; the rest of the notebook uses TextArray)

virgil = [file for file in latinlibrary.fileids() if "vergil/a" in file]
propertius = ['prop2.txt', 'prop3.txt', 'prop4.txt', 'propertius1.txt']
tibullus = ['tibullus1.txt', 'tibullus2.txt']
ovid = [file for file in latinlibrary.fileids() if "ovid.amor" in file]

TextArray = [
    ('epic', 'aeneid 1', 'vergil/aen1.txt'),
    ('epic', 'aeneid 2', 'vergil/aen2.txt'),
    ('epic', 'aeneid 3', 'vergil/aen3.txt'),
    ('epic', 'aeneid 4', 'vergil/aen4.txt'),
    ('epic', 'aeneid 5', 'vergil/aen5.txt'),
    ('epic', 'aeneid 6', 'vergil/aen6.txt'),
    ('epic', 'aeneid 7', 'vergil/aen7.txt'),
    ('epic', 'aeneid 8', 'vergil/aen8.txt'),
    ('epic', 'aeneid 9', 'vergil/aen9.txt'),
    ('epic', 'aeneid 10', 'vergil/aen10.txt'),
    ('epic', 'aeneid 11', 'vergil/aen11.txt'),
    ('epic', 'aeneid 12', 'vergil/aen12.txt'),
    ('elegy', 'propertius 1', 'propertius1.txt'),
    ('elegy', 'propertius 2', 'prop2.txt'),
    ('elegy', 'propertius 3', 'prop3.txt'),
    ('elegy', 'propertius 4', 'prop4.txt'),
    ('elegy', 'tibullus 1', 'tibullus1.txt'),
    ('elegy', 'tibullus 2', 'tibullus2.txt'),
    ('elegy', 'amores 1', 'ovid/ovid.amor1.txt'),
    ('elegy', 'amores 2', 'ovid/ovid.amor2.txt'),
    ('elegy', 'amores 3', 'ovid/ovid.amor3.txt'),
]
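
Since the *Aeneid* file names follow a regular pattern, the epic entries could also be generated rather than typed out. This is a minimal sketch of that alternative, assuming the same `vergil/aenN.txt` naming used in the list above; `aeneid_entries` is a name introduced here for illustration.

```python
# Build the twelve Aeneid entries programmatically instead of listing them by hand
aeneid_entries = [
    ('epic', 'aeneid {}'.format(n), 'vergil/aen{}.txt'.format(n))
    for n in range(1, 13)
]

print(aeneid_entries[0])
print(len(aeneid_entries))
```

The elegy entries are less regular (compare `propertius1.txt` with `prop2.txt`), so listing those by hand is probably still the safer choice.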

In [6]:
# Script for preprocessing texts

def preprocess(text):
    
    import html
    import re
    
    text = html.unescape(text) # Handle html entities
    
    text = text.lower()
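    text = replacer.replace(text)  # Normalize u/v & i/j

    # Characters to replace with spaces; the backslash is escaped explicitly to avoid an invalid "\]" escape sequence
    punctuation = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~.?!«»—"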
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)
    
    return text

# Script for getting & preprocessing LL texts

def get_ll_text(fileid):
    text = latinlibrary.raw(fileid)
    text = preprocess(text)
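    text = text[95:-95]  # Empirical offset: trims what is presumably Latin Library header/footer text
    text = text[text.find(' '):]  # Drop the partial word left at the start by the slice above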
    return text

# Script for chunking text

def make_text_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
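
To make the chunking step concrete, here is a small usage sketch. It repeats `make_text_chunks` so that it runs on its own, and it uses a chunk size of 3 on a single line of text purely for illustration (the notebook itself uses 250-token chunks).

```python
def make_text_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


tokens = "arma uirumque cano troiae qui primus ab oris".split()
chunks = [" ".join(chunk) for chunk in make_text_chunks(tokens, 3)]

for chunk in chunks:
    print(chunk)
```

Note that the last chunk holds whatever tokens are left over, so the final chunk of each book in the corpus can be shorter than the others.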

In [7]:
# Build chunked TextArray

TextArrayChunks = []

for item in TextArray:
    genre, work, fileid = item
    tokens = get_ll_text(fileid).split() 
    chunk_text = make_text_chunks(tokens, 250)
    chunk_text = [" ".join(chunk) for chunk in chunk_text]
    for i, chunk in enumerate(chunk_text):
        chunk_name = '{}_{}'.format(work, i)
        TextArrayChunks.append((genre, work, chunk_name, fileid, chunk))
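
Because keyness depends on how much text each genre contributes, it can be worth checking the number of chunks per genre before plotting. The following is a sketch on toy rows that mimic the shape of `TextArrayChunks`; the rows and the name `toy_chunks` are made up for the example.

```python
from collections import Counter

# Toy rows in the same (genre, work, chunk_name, fileid, chunk) shape as TextArrayChunks
toy_chunks = [
    ('epic', 'aeneid 1', 'aeneid 1_0', 'vergil/aen1.txt', 'arma uirumque cano'),
    ('epic', 'aeneid 1', 'aeneid 1_1', 'vergil/aen1.txt', 'troiae qui primus'),
    ('elegy', 'propertius 1', 'propertius 1_0', 'propertius1.txt', 'cynthia prima suis'),
]

# Count chunks by the first field of each row (the genre label)
chunks_per_genre = Counter(genre for genre, *_ in toy_chunks)
print(chunks_per_genre)
```

Running the same `Counter` over the real `TextArrayChunks` would show how balanced the epic and elegy halves of the corpus are.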

In [8]:
# Create and populate text dataframe 

columns = ['genre', 'work', 'chunk_name', 'fileid', 'text']
df = pd.DataFrame(TextArrayChunks, columns=columns)

In [9]:
# Example from dataframe

df[:10]

| | genre | work | chunk_name | fileid | text |
|---|---|---|---|---|---|
| 0 | epic | aeneid 1 | aeneid 1_0 | vergil/aen1.txt | arma uirumque cano troiae qui primus ab oris i... |
| 1 | epic | aeneid 1 | aeneid 1_1 | vergil/aen1.txt | exurere classem argiuom atque ipsos potuit sub... |
| 2 | epic | aeneid 1 | aeneid 1_2 | vergil/aen1.txt | capessere fas est tu mihi quodcumque hoc regni... |
| 3 | epic | aeneid 1 | aeneid 1_3 | vergil/aen1.txt | puppim ferit excutitur pronusque magister uolu... |
| 4 | epic | aeneid 1 | aeneid 1_4 | vergil/aen1.txt | auribus adstant ille regit dictis animos et pe... |
| 5 | epic | aeneid 1 | aeneid 1_5 | vergil/aen1.txt | telis nemora inter frondea turbam nec prius ab... |
| 6 | epic | aeneid 1 | aeneid 1_6 | vergil/aen1.txt | deumque aeternis regis imperiis et fulmine ter... |
| 7 | epic | aeneid 1 | aeneid 1_7 | vergil/aen1.txt | stetit ilia regno triginta magnos uoluendis me... |
| 8 | epic | aeneid 1 | aeneid 1_8 | vergil/aen1.txt | exire locosque explorare nouos quas uento acce... |
| 9 | epic | aeneid 1 | aeneid 1_9 | vergil/aen1.txt | cui pater intactam dederat primisque iugarat o... |


In [10]:
# Example from dataframe

df[-10:]

| | genre | work | chunk_name | fileid | text |
|---|---|---|---|---|---|
| 450 | elegy | amores 3 | amores 3_12 | ovid/ovid.amor3.txt | captus et ante tuis tu dominum fallis per te d... |
| 451 | elegy | amores 3 | amores 3_13 | ovid/ovid.amor3.txt | castra sequi proque bono uersu primum deducite... |
| 452 | elegy | amores 3 | amores 3_14 | ovid/ovid.amor3.txt | male quaesitas puluere mutet opes ix memnona s... |
| 453 | elegy | amores 3 | amores 3_15 | ovid/ovid.amor3.txt | e toto parua quod urna capit tene sacer uates ... |
| 454 | elegy | amores 3 | amores 3_16 | ovid/ovid.amor3.txt | in agris falce coloratas subsecuitque comas pr... |
| 455 | elegy | amores 3 | amores 3_17 | ovid/ovid.amor3.txt | et quae non puduit ferre tulisse pudet uicimus... |
| 456 | elegy | amores 3 | amores 3_18 | ovid/ovid.amor3.txt | facit ad mores tam bona forma malos facta mere... |
| 457 | elegy | amores 3 | amores 3_19 | ovid/ovid.amor3.txt | aeolios ithacis inclusimus utribus euros prodi... |
| 458 | elegy | amores 3 | amores 3_20 | ovid/ovid.amor3.txt | superba pedes more patrum graio uelatae uestib... |
| 459 | elegy | amores 3 | amores 3_21 | ovid/ovid.amor3.txt | uideo mitti recipique tabellas cur pressus pri... |


In [11]:
# Build Scattertext corpus

corpus = st.CorpusFromPandas(df,
                            category_col='genre',
                            text_col='text',
                            nlp=nlp).build()

In [12]:
# Keyness example: top ten terms by scaled F-score against Scattertext's background term frequencies

print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

['haec', 'mihi', 'atque', 'tibi', 'quae', 'nunc', 'saepe', 'armis', 'ipse', 'tamen']


In [13]:
# Create Scattertext HTML file

html_doc = st.produce_scattertext_explorer(corpus,
          category='elegy',
          category_name='Elegy',
          not_category_name='Epic',
          width_in_pixels=1000,
          metadata=df['work'])

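# Write the plot inside a context manager so the file is closed; the last line displays the byte count
with open("Genre-Visualization.html", 'wb') as f:
    bytes_written = f.write(html_doc.encode('utf-8'))
bytes_written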

1709648
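
The number shown above is the return value of `write()`, that is, the number of bytes written to the HTML file. The sketch below illustrates this with a tiny stand-in string in place of the real plot HTML; the file name and temporary directory are assumptions made for the example.

```python
import os
import tempfile

html_doc = "<html><body>puella «domina»</body></html>"  # stand-in for the real plot HTML
encoded = html_doc.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.html")
    with open(out_path, 'wb') as f:
        bytes_written = f.write(encoded)  # write() returns the number of bytes written
    size_on_disk = os.path.getsize(out_path)

print(bytes_written == len(encoded) == size_on_disk)
```

The byte count can exceed the character count whenever the text contains non-ASCII characters, since UTF-8 encodes those with more than one byte.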

In [14]:
# Script for lemmatizing text

def lemmatize_text(text):
    tokens = word_tokenizer.tokenize(text)
    lemma_pairs = lemmatizer.lemmatize(tokens)
    lemmas = [lemma[1] for lemma in lemma_pairs]
    lemmatized_text = " ".join(lemmas)
    return lemmatized_text
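
The function above relies on the lemmatizer returning a list of `(token, lemma)` pairs, from which only the second element is kept. The following sketch shows the same unpacking with a toy dictionary lookup standing in for the CLTK lemmatizer; `toy_lemma_table` and `toy_lemmatize` are hypothetical and only cover the words in the example.

```python
# Hypothetical stand-in for the CLTK lemmatizer: a small token-to-lemma lookup
toy_lemma_table = {'arma': 'arma', 'uirum': 'uir', 'cano': 'cano', 'oris': 'ora'}


def toy_lemmatize(tokens):
    """Return (token, lemma) pairs, falling back to the token itself when no lemma is known."""
    return [(token, toy_lemma_table.get(token, token)) for token in tokens]


lemma_pairs = toy_lemmatize(['arma', 'uirum', 'cano', 'oris'])
lemmas = [lemma for _, lemma in lemma_pairs]  # keep only the lemma from each pair
lemmatized_text = " ".join(lemmas)
print(lemmatized_text)
```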

In [15]:
# Lemmatize text column

df['text'] = df['text'].apply(lemmatize_text)
df[:10]

| | genre | work | chunk_name | fileid | text |
|---|---|---|---|---|---|
| 0 | epic | aeneid 1 | aeneid 1_0 | vergil/aen1.txt | arma uir -que cano troia qui primus ab ora ita... |
| 1 | epic | aeneid 1 | aeneid 1_1 | vergil/aen1.txt | exuro classis argiuom atque ipse possum sub-me... |
| 2 | epic | aeneid 1 | aeneid 1_2 | vergil/aen1.txt | capesso fas sum tu ego quicumque hic regnum tu... |
| 3 | epic | aeneid 1 | aeneid 1_3 | vergil/aen1.txt | puppis ferio excutio pronus -que magister uolu... |
| 4 | epic | aeneid 1 | aeneid 1_4 | vergil/aen1.txt | auris asto ille rego dico animus et pectus mul... |
| 5 | epic | aeneid 1 | aeneid 1_5 | vergil/aen1.txt | telum nemus inter frondeus turba neque prior a... |
| 6 | epic | aeneid 1 | aeneid 1_6 | vergil/aen1.txt | deus -que aeternus rex imperium et fulmen terr... |
| 7 | epic | aeneid 1 | aeneid 1_7 | vergil/aen1.txt | sto ilia regnum triginta magnus volvo mensis o... |
| 8 | epic | aeneid 1 | aeneid 1_8 | vergil/aen1.txt | exeo locus -que exploraris nouus qui uentus ac... |
| 9 | epic | aeneid 1 | aeneid 1_9 | vergil/aen1.txt | qui pater intactus1 do primus -que jugo1 omen ... |


In [16]:
# Build Scattertext corpus with lemmatized texts

corpus = st.CorpusFromPandas(df,
                            category_col='genre',
                            text_col='text',
                            nlp=nlp).build()

In [17]:
# Get words with elegiac keyness

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Elegiac Score'] = corpus.get_scaled_f_scores('elegy')
pprint(list(term_freq_df.sort_values(by='Elegiac Score', ascending=False).index[:10]))

['formosus',
 'cynthia',
 'puella',
 'ocellus',
 'tu sum',
 'scribo',
 'ingenium',
 'in amor',
 'domina',
 'bene']
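
Entries such as 'tu sum' and 'in amor' appear in this list because, as far as I understand its defaults, Scattertext indexes two-word sequences (bigrams) alongside single words. Here is a simplified sketch of that kind of term extraction; it splits on whitespace only and ignores the sentence boundaries and spaCy tokenization that the real pipeline uses, and `unigrams_and_bigrams` is a name made up for the example.

```python
def unigrams_and_bigrams(text):
    """Return the single tokens of a text followed by its adjacent token pairs."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms = unigrams_and_bigrams("tu sum in amor")
print(terms)
```

Since the texts have already been lemmatized, the bigrams are pairs of lemmas rather than pairs of inflected forms, which is why they read oddly as Latin.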


In [18]:
# Get words with epic keyness

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Epic Score'] = corpus.get_scaled_f_scores('epic')
pprint(list(term_freq_df.sort_values(by='Epic Score', ascending=False).index[:10]))

['teucer',
 'clamor',
 'turnus',
 'aether',
 'immanis',
 'telis',
 'ingens',
 'clipeus',
 'latinus',
 'fremo']
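
The elegiac and epic scores above are Scattertext's scaled F-scores. As I understand the idea from the Scattertext documentation, a term scores highly when it is both frequent within a category and mostly confined to it; each of these two measures is rescaled with a normal CDF and the pair is then combined with a harmonic mean. The sketch below is a simplified illustration of that idea on invented counts, not the library's actual implementation, which includes further steps. The function names are made up for the example.

```python
from statistics import NormalDist, mean, pstdev


def normcdf_scale(values):
    """Map raw values into (0, 1) using the CDF of a normal distribution fitted to them."""
    sigma = pstdev(values)
    if sigma == 0:
        return [0.5 for _ in values]  # all values identical: no term stands out
    dist = NormalDist(mean(values), sigma)
    return [dist.cdf(v) for v in values]


def simplified_scaled_f_scores(cat_counts, other_counts):
    """Score each term in cat_counts (assumed to hold positive counts) for one category."""
    terms = sorted(cat_counts)
    cat_total = sum(cat_counts.values())
    # Precision: how much of a term's usage falls in this category
    precision = [cat_counts[t] / (cat_counts[t] + other_counts.get(t, 0)) for t in terms]
    # Frequency: how much of this category's text the term accounts for
    frequency = [cat_counts[t] / cat_total for t in terms]
    scores = {}
    for term, p, f in zip(terms, normcdf_scale(precision), normcdf_scale(frequency)):
        scores[term] = 2 * p * f / (p + f)  # harmonic mean of the two rescaled values
    return scores


# Invented counts, for illustration only
elegy_counts = {'puella': 40, 'amor': 30, 'arma': 5, 'sum': 100}
epic_counts = {'puella': 2, 'amor': 10, 'arma': 50, 'sum': 120}

scores = simplified_scaled_f_scores(elegy_counts, epic_counts)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy example a very common but genre-neutral word like *sum* does not outrank *puella*, because high frequency alone is not enough without high precision.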


In [19]:
# Create Scattertext HTML file with lemmatized text

html_doc = st.produce_scattertext_explorer(corpus,
          category='elegy',
          category_name='Elegy',
          not_category_name='Epic',
          width_in_pixels=1000,
          metadata=df['chunk_name'])

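# Write the plot inside a context manager so the file is closed; the last line displays the byte count
with open("Lemmatized-Genre-Visualization.html", 'wb') as f:
    bytes_written = f.write(html_doc.encode('utf-8'))
bytes_written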

1648681

Click [here](./Lemmatized-Genre-Visualization.html) to view the interactive Scattertext plot.