# Plotting Generic Diction in Latin Poetry with Scattertext

This notebook plots keyness in generic diction in Latin poetry, specifically Latin love elegy (Propertius, Tibullus, early Ovid) and epic (Virgil's *Aeneid*), using the very attractive interactive plots produced by Jason Kessler's [Scattertext](https://github.com/JasonKessler/scattertext). Kessler's Scattertext demo plot visualized the relative keyness of Democratic/Republican diction using the corpus of 2012 political convention addresses. I have created a binary model of my own, following my work on elegiac diction in Latin epic poetry from my dissertation ["*Amor belli*: Elegiac Diction and the Theme of Love in Lucan's *Bellum civile*"](https://fordham.bepress.com/dissertations/AAI10125245/) as well as a related article in the *Journal of Data Mining and Digital Humanities* titled ["Measuring and Mapping Intergeneric Allusion using *Tesserae*"](https://hal.archives-ouvertes.fr/hal-01282568). Here I create a chunked corpus from the authors listed above. The texts are preprocessed, lemmatized, and then plotted using Scattertext. As I mentioned above, the results are attractive.

<a href='https://cdn.rawgit.com/diyclassics/literature-experiments/21f004b3af0cac0eb08bbd64166983e3c2090a97/Lemmatized-Genre-Visualization.html' target='_blank'><img src="./images/scattertext.png" alt="Scattertext visualization of elegy/epic"></a>

In a single, easy-to-take-in plot, we get an excellent sense of which words are 'elegiac' and which words are 'epic' as well as their relative frequency. (The fact that we can easily search for terms in context is a nice bonus, as is the summary information.) Readers of Latin poetry should not be surprised by the upper-left and lower-right corners, that is, frequent-elegiac and frequent-epic, respectively. *Puella*, *domina*, *formosus* are decidedly elegiac; *clipeus*, *arduus*, *immanis* are epic. Cynthia finds a place among the elegiac words, while Turnus and Anchises rest in epic diction. (Aeneas is more epic than not, but [*Heroides* 7](http://www.thelatinlibrary.com/ovid/ovid.her7.shtml) alone brings up the elegiac value.) But what is much more interesting to me, and what has been a central idea in my research on generic diction, are the words in the middle of the plot.

This plot makes clear that generic diction does not have to be thought of in discrete, binary terms. Rather, we can create language models, like the simplified elegy-epic model presented here, that show generic diction as continuous. For example, I can say not only that *cupio* is more elegiac than epic, but I can be more specific and say that (in this model at least) *cupio* is 79% elegiac / 21% epic. In future notebooks, I will expand on this kind of thinking, that is, observing literary features as continuous variables in texts. For example, by assigning generic weights to words in these texts we can map epicness or elegiacness in [narrative space](https://github.com/diyclassics/literature-experiments/blob/master/plot-narrative-space.ipynb).
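
One simple way to arrive at a split like this is to compare a lemma's relative frequency in each genre. The sketch below is an assumption for illustration, not necessarily the calculation behind the figure quoted above; the function name `genre_share` and all of the counts are hypothetical.

```python
def genre_share(count_a, total_a, count_b, total_b):
    """Split a term's combined relative frequency between two genres.

    Counts are normalized by corpus size first, so a larger corpus
    does not automatically claim a larger share of the term.
    """
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    combined = rate_a + rate_b
    if combined == 0:
        raise ValueError("term does not occur in either corpus")
    return rate_a / combined, rate_b / combined


# Hypothetical counts: 158 hits in 200,000 elegiac tokens,
# 21 hits in 100,000 epic tokens.
elegy_share, epic_share = genre_share(158, 200_000, 21, 100_000)
print(f"{elegy_share:.0%} elegiac / {epic_share:.0%} epic")
```

Normalizing by corpus size matters here because the elegiac and epic collections need not be the same length.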

One last point—this notebook uses [spaCy's English NLP tools](https://spacy.io) for preparing the Scattertext plots. I have been working on developing [Latin-specific tools for spaCy](https://github.com/diyclassics/spaCy/tree/latin/spacy/lang) and this experiment further convinces me of the usefulness of this work. Hopefully this is something I can devote more time to soon. [PJB 3.22.18]

In [12]:
# Imports

import os

import pandas as pd

import spacy

from cltk.corpus.latin import latinlibrary
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
from cltk.utils.file_operations import open_pickle

import scattertext as st

from pprint import pprint

In [13]:
# Set up spaCy for Scattertext

nlp = spacy.load('en')

In [14]:
import pickle
# Load the pickled lemmatizer, closing the file handle afterwards
with open("./tools/lemmatizer.p", "rb") as f:
    Lemmatizer = pickle.load(f)

In [20]:
# Load the pre-chunked texts, closing the file handle afterwards
with open("./data/text_array_chunks.p", "rb") as f:
    TextArrayChunks = pickle.load(f)

In [28]:
import html

# Helpers used below; they are not defined in the cells shown,
# so they are instantiated here from the CLTK imports above
replacer = JVReplacer()                  # Normalizes u/v & i/j
word_tokenizer = WordTokenizer('latin')  # Latin word tokenizer

# Characters removed during preprocessing (backslash escaped explicitly)
PUNCTUATION = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~.?!«»—"
DIGITS = '0123456789'

def preprocess(text):
    text = html.unescape(text)  # Handle html entities
    text = text.lower()
    text = replacer.replace(text)  # Normalize u/v & i/j

    # Delete punctuation and digits in a single pass
    translator = str.maketrans('', '', PUNCTUATION + DIGITS)
    return text.translate(translator)

# Script for lemmatizing text

def lemmatize_text(text):
    tokens = word_tokenizer.tokenize(text)
    lemma_pairs = Lemmatizer.lemmatize(tokens)  # (token, lemma) pairs
    lemmas = [lemma for _, lemma in lemma_pairs]
    lemmatized_text = preprocess(" ".join(lemmas))
    return lemmatized_text.strip()
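
The normalization performed by `preprocess` can be tried on its own, without CLTK. In this minimal sketch a plain character mapping stands in for `JVReplacer` (an assumption about its effect on lowercased text), and `normalize` is a hypothetical name rather than a function from the notebook.

```python
import html
import string

# Assumption: mapping j -> i and v -> u approximates JVReplacer on lowercase text.
JV_TABLE = str.maketrans({"j": "i", "v": "u"})

# Delete ASCII punctuation, digits, guillemets, and the em dash.
STRIP_TABLE = str.maketrans(
    "", "", string.punctuation + string.digits + "\u00ab\u00bb\u2014"
)


def normalize(text):
    """Lowercase, normalize j/v, and strip punctuation and digits."""
    text = html.unescape(text).lower()
    text = text.translate(JV_TABLE)
    text = text.translate(STRIP_TABLE)
    return " ".join(text.split())  # collapse leftover whitespace


sample = "Arma virumque cano, Troiae qui primus ab oris (1.1)"
normalized = normalize(sample)
print(normalized)
```

Unlike the cell above, this version also collapses runs of whitespace left behind by deleted characters.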

In [30]:
df = pd.DataFrame(TextArrayChunks, columns = ['type', 'work', 'chunk', 'text'])
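
The pickled `TextArrayChunks` is not built in this notebook. As a rough sketch of how rows in the `type`, `work`, `chunk`, `text` layout could be produced, the following splits a text into fixed-size word chunks. The function name `chunk_work`, the default chunk size, and the sample input are all assumptions for illustration, not the procedure actually used to create the pickle.

```python
def chunk_work(text, work, text_type, size=500):
    """Split a text into (type, work, chunk id, chunk text) rows."""
    words = text.split()
    rows = []
    for start in range(0, len(words), size):
        # Chunk ids follow the pattern "<work>_<n>"
        chunk_id = f"{work}_{start // size}"
        chunk_text = " ".join(words[start:start + size])
        rows.append((text_type, work, chunk_id, chunk_text))
    return rows


# Hypothetical input; a tiny chunk size keeps the output readable.
demo_rows = chunk_work(
    "arma uirumque cano troiae qui primus ab", "demo.txt", "verse", size=3
)
for row in demo_rows:
    print(row)
```

A list of such tuples can be passed to `pd.DataFrame` with the same `columns` argument as in the cell above.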

In [33]:
# Lemmatize text column

df['text'] = df['text'].apply(lemmatize_text)
df[:10]

Unnamed: 0,type,work,chunk,text
0,prose,caesar/alex.txt,caesar/alex.txt_0,locus in tantus munitio profero nam incendium ...
1,prose,caesar/alex.txt,caesar/alex.txt_1,acuo qui ab nos fio uideo is sollertia efficio...
2,prose,caesar/alex.txt,caesar/alex.txt_2,ex priuatus aedificium specus atque puteus ext...
3,prose,caesar/alex.txt,caesar/alex.txt_3,habeo qui si alius sum litor aegypti natura at...
4,prose,caesar/alex.txt,caesar/alex.txt_4,nauigium actuarius caesar facio certus caesar ...
5,prose,caesar/alex.txt,caesar/alex.txt_5,ne qui suus culpa detrimentum accipio uideo it...
6,prose,caesar/alex.txt,caesar/alex.txt_6,in ipse portus confligo uideo itaque paucus di...
7,prose,caesar/alex.txt,caesar/alex.txt_7,graecus comparo hic ob nosco scientia atque an...
8,prose,caesar/alex.txt,caesar/alex.txt_8,superus dies saepenumero caesar suus euerat ut...
9,prose,caesar/alex.txt,caesar/alex.txt_9,custodia portus relinquo nauis ad litus et uic...


In [34]:
# Build Scattertext corpus with lemmatized texts

corpus = st.CorpusFromPandas(df,
                            category_col='type',
                            text_col='text',
                            nlp=nlp).build()

In [35]:
# Get words with prosaic keyness

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Prose Score'] = corpus.get_scaled_f_scores('prose')
pprint(list(term_freq_df.sort_values(by='Prose Score', ascending=False).index[:10]))

['m',
 'proelium',
 'ciuitas',
 'multitudo',
 'res publicus',
 'populus romanus',
 'sed etiam',
 'altitudo',
 'existimo',
 'hispania']


In [36]:
# Get words with poetic keyness

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Verse Score'] = corpus.get_scaled_f_scores('verse')
pprint(list(term_freq_df.sort_values(by='Verse Score', ascending=False).index[:10]))

['aequor',
 'phoebus',
 'como',
 'osculum',
 'genitor',
 'teucer',
 'penna',
 'domina',
 'ensis',
 'aether']


In [39]:
# Create Scattertext HTML file with lemmatized text

html_doc = st.produce_scattertext_explorer(corpus,
          category='prose',
          category_name='Prose',
          not_category_name='Verse',
          width_in_pixels=1000,
          minimum_term_frequency=5,  
          metadata=df['chunk'])

open("Type-Visualization.html", 'wb').write(html_doc.encode('utf-8'))        

8122491

Click [here](https://cdn.rawgit.com/diyclassics/literature-experiments/21f004b3af0cac0eb08bbd64166983e3c2090a97/Lemmatized-Genre-Visualization.html) to view the interactive Scattertext plot.