# Concordances & N-Gram Analysis

This is going to be quite the undertaking, but I believe that we can manage it. The first box below imports all the necessary packages that we've discussed in the previous sessions and also defines a preprocessing method so we can easily work with our texts. Let's take a look at this before we go further: 

In [1]:
# Now that we have our texts and models downloaded, we need to import them.
# We need to define a processing function to regulate the texts
# Imports
import string
import re
from collections import Counter
from pprint import pprint
#from cltk.corpus.latin import latinlibrary
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer

# These statements set up tools that help us normalize the texts.
# They will be discussed in more detail in the next session.
word_tokenizer = WordTokenizer('latin')
replacer = JVReplacer()

# This function uses the previously defined tools to preprocess the texts.
# This comes directly w/o modification from Patrick Burns
def preprocess(text):    

    text = re.sub(r'&aelig;','ae',text)
    text = re.sub(r'&AElig;','AE',text)
    text = re.sub(r'&oelig;','oe',text)
    text = re.sub(r'&OElig;','OE',text)
    
    text = re.sub('\x00',' ',text)
    
    text = text.lower()
    
    text = replacer.replace(text)
    

    text= re.sub(r'&lt;','<',text)
    text= re.sub(r'&gt;','>',text)    
    
    punctuation ="\"#$%&\'()*+,-/:;<=>@[\]^_`{|}~.?!"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)

    remove_list = [r'\bthe latin library\b',
                   r'\bthe classics page\b',
                   r'\bneo-latin\b', 
                   r'\bmedieval latin\b',
                   r'\bchristian latin\b',
                   r'\bthe miscellany\b'
                  ]

    for pattern in remove_list:
        text = re.sub(pattern, '', text)
    
    text = re.sub('[ ]+',' ', text) # Remove double spaces
    text = re.sub('\s+\n+\s+','\n', text) # Remove double lines and trim spaces around new lines
    text = re.sub('p uergili maronis aeneidos liber ', '', text)
    
    return text

## Importing the texts

For this tutorial, we'll be working with a few different texts. To begin, let's import Vergil's *Aeneid* into a list data-structure so it's ready to go when we need it. 

In [29]:
# You will have to navigate to the location on your machine where the texts are downloaded
# to find the filenames. All of the data that CLTK uses on your machine will be in a folder
# entitled 'cltk_data' in your home directory. This is the same directory that holds folders
# such as 'Documents' and 'Desktop'. We need to summon this textual data.

# Some of the texts are split-up by book, like Vergil's Aeneid, so we need a good way of
# gathering all that information into the system. Below, I have created a list of the filenames
# without the '.txt' appended.
filenames = ['aen1','aen2','aen3','aen5','aen6','aen7','aen8','aen9','aen10','aen11','aen12']

# Let's create a list called 'aeneid' that will hold all the text. It will have 12 entries, one
# for each book/file. We can leave it blank for now because we're about to read in all the files.
aeneid = []

# Now we will use a for loop to iterate over all the filenames we just created.
# This basically says: for each file in filenames, open the file and read in it's context to the
# proper place in the list 'aeneid'. 
for filename in filenames:
    with open(f'/Users/chaduhl/cltk_data/latin/text/latin_text_latin_library/vergil/{filename}.txt') as f:
        aeneid.append(preprocess(f.read()))
        
# Let's make sure we did that right. The line below should print out the first book of the
# Aeneid. Remember that counting in python starts at 0.
print(aeneid[0])

uergil aeneid i
primus
arma uirumque cano troiae qui primus ab oris 
italiam fato profugus lauiniaque uenit 
litora multum ille et terris iactatus et alto 
ui superum saeuae memorem iunonis ob iram 
multa quoque et bello passus dum conderet urbem 
inferretque deos latio genus unde latinum 
albanique patres atque altae moenia romae
musa mihi causas memora quo numine laeso 
quidue dolens regina deum tot uoluere casus 
insignem pietate uirum tot adire labores 
impulerit tantaene animis caelestibus irae
urbs antiqua fuit tyrii tenuere coloni 
karthago italiam contra tiberinaque longe 
ostia diues opum studiisque asperrima belli 
quam iuno fertur terris magis omnibus unam 
posthabita coluisse samo hic illius arma 
hic currus fuit hoc regnum dea gentibus esse 
si qua fata sinant iam tum tenditque fouetque 
progeniem sed enim troiano a sanguine duci 
audierat tyrias olim quae uerteret arces 
hinc populum late regem belloque superbum 
uenturum excidio libyae sic uoluere parcas 
id metuens uete

In [35]:
aeneid_str = ""
for book in range(len(aeneid)):
    aeneid_str += (aeneid[book])

## Creating a Concordance

The CLTK has made it enormously easy to create concordances of any given text. There are two main methods to accomplish this, both of which are illustrated below:

In [None]:
# Now we can create a Concordance from the string of all the books from the Facta et Dicta
# This Concordance will be saved as a .txt file in a folder called user_data inside cltk_data in 
# your root directory
from cltk.utils import philology
philology.write_concordance_from_string(aeneid, 'aeneid-concordance-from-string')

# We can also create a concordance directly from a file. Let's try this with Catullus, since
# all of his poems are saved in a single text file.
catullus = '/Users/chaduhl/cltk_data/latin/text/latin_text_latin_library/catullus.txt'
philology.write_concordance_from_file(catullus, 'catullus-concordance-from-file')

## Onto N-Gram Analysis...



In [36]:
from nltk.tokenize.punkt import PunktLanguageVars
from nltk.util import bigrams
from nltk.util import trigrams
from nltk.util import ngrams

p = PunktLanguageVars()
tokens = p.word_tokenize(aeneid_str)
b = bigrams(tokens)
b_list = [x for x in b]
print(b_list)

[('uergil', 'aeneid'), ('aeneid', 'i'), ('i', 'primus'), ('primus', 'arma'), ('arma', 'uirumque'), ('uirumque', 'cano'), ('cano', 'troiae'), ('troiae', 'qui'), ('qui', 'primus'), ('primus', 'ab'), ('ab', 'oris'), ('oris', 'italiam'), ('italiam', 'fato'), ('fato', 'profugus'), ('profugus', 'lauiniaque'), ('lauiniaque', 'uenit'), ('uenit', 'litora'), ('litora', 'multum'), ('multum', 'ille'), ('ille', 'et'), ('et', 'terris'), ('terris', 'iactatus'), ('iactatus', 'et'), ('et', 'alto'), ('alto', 'ui'), ('ui', 'superum'), ('superum', 'saeuae'), ('saeuae', 'memorem'), ('memorem', 'iunonis'), ('iunonis', 'ob'), ('ob', 'iram'), ('iram', 'multa'), ('multa', 'quoque'), ('quoque', 'et'), ('et', 'bello'), ('bello', 'passus'), ('passus', 'dum'), ('dum', 'conderet'), ('conderet', 'urbem'), ('urbem', 'inferretque'), ('inferretque', 'deos'), ('deos', 'latio'), ('latio', 'genus'), ('genus', 'unde'), ('unde', 'latinum'), ('latinum', 'albanique'), ('albanique', 'patres'), ('patres', 'atque'), ('atque', 'a

In [37]:
from nltk.probability import FreqDist
fdist_b = FreqDist(b_list)
fdist_b.most_common(25)

[(('si', 'qua'), 33),
 (('ab', 'alto'), 21),
 (('tum', 'uero'), 21),
 (('in', 'proelia'), 21),
 (('nec', 'non'), 20),
 (('in', 'armis'), 20),
 (('ad', 'litora'), 18),
 (('ad', 'sidera'), 17),
 (('pius', 'aeneas'), 17),
 (('inter', 'se'), 17),
 (('pater', 'aeneas'), 17),
 (('in', 'litore'), 16),
 (('et', 'in'), 16),
 (('neque', 'enim'), 16),
 (('haec', 'ubi'), 15),
 (('et', 'iam'), 15),
 (('de', 'gente'), 15),
 (('in', 'arma'), 15),
 (('et', 'nunc'), 15),
 (('atque', 'haec'), 15),
 (('ait', 'et'), 14),
 (('dixit', 'et'), 14),
 (('nec', 'te'), 14),
 (('quae', 'nunc'), 14),
 (('in', 'limine'), 14)]

In [38]:
t = trigrams(tokens)
t_list = [x for x in t]
fdist_t = FreqDist(t_list)
fdist_t.most_common(25)

[(('haec', 'ubi', 'dicta'), 12),
 (('nec', 'non', 'et'), 9),
 (('uix', 'ea', 'fatus'), 8),
 (('si', 'qua', 'est'), 8),
 (('ubi', 'dicta', 'dedit'), 8),
 (('ac', 'talia', 'fatur'), 8),
 (('sic', 'ait', 'et'), 6),
 (('ea', 'fatus', 'erat'), 6),
 (('hinc', 'atque', 'hinc'), 5),
 (('pater', 'atque', 'hominum'), 4),
 (('atque', 'hominum', 'rex'), 4),
 (('haec', 'ait', 'et'), 4),
 (('at', 'pius', 'aeneas'), 4),
 (('nec', 'minus', 'interea'), 4),
 (('pater', 'aeneas', 'et'), 4),
 (('et', 'quae', 'sit'), 4),
 (('et', 'pater', 'anchises'), 4),
 (('nec', 'iam', 'amplius'), 4),
 (('tum', 'pius', 'aeneas'), 4),
 (('sed', 'non', 'et'), 4),
 (('rex', 'ipse', 'latinus'), 4),
 (('messapus', 'equum', 'domitor'), 4),
 (('ad', 'sidera', 'palmas'), 3),
 (('troiae', 'sub', 'moenibus'), 3),
 (('sub', 'moenibus', 'altis'), 3)]