# Latin Text Prep Aid - Vocabulary Builder
This program takes in a Latin text and returns an Excel file with multiple tabs of useful vocabulary lists to aid in preparing texts to be taught. The tabs will be:
 1. Full parsed text, which will give each word from the text as a line of data with its lemma, frequency, definition, etc.
 2. A 'priority vocab' list, which will give each word that occurs more than five times in the text, is not a proper noun, and does *not* appear in the [DCC Latin Core vocabulary](https://dcc.dickinson.edu/vocab/core-vocabulary) (top 1000 frequency list)
 3. The tagged core list, to compare

In order to run this notebook, ensure that you have installed/downloaded:
- [pandas](https://pandas.pydata.org/)
- [spaCy](https://spacy.io/)
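- [latinCy](https://diyclassics.github.io/latincy/)
- [scikit-learn](https://scikit-learn.org/), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), [lxml](https://lxml.de/), and [PyYAML](https://pyyaml.org/), which are imported in the cells below
- [openpyxl](https://openpyxl.readthedocs.io/), which pandas uses to read and write xlsx files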

In addition, download the 'data' folder and save it in the working directory with this notebook. It includes:
- lat_core.xlsx, a modified Latin core list based off [DCC Latin Core vocabulary](https://dcc.dickinson.edu/vocab/core-vocabulary)
- 'lat_lewis_elementary_lexicon', a modified Lewis & Short dictionary used to get definitions, from [Perseus Project](https://www.perseus.tufts.edu/hopper/opensource/download).
- a few texts downloaded from [the Latin Library](https://thelatinlibrary.com/)

To run this program on your own text, rather than the Livy sample used here, put your Latin text as a txt file in the data folder and change the `file_path` variable in *1. Text selection* below.

NOTE: Works on **Ubuntu**. Does not work on Windows, where the pandas merge function throws errors; the cause is unknown.
  

## 0. Imports

In [1]:
# Import the following to run the program:

import spacy
import pandas as pd
import numpy as np
import json
import os
from sklearn.manifold import TSNE
from spacy import displacy
from pprint import pprint

## 1. Text selection 
Put any Latin text you'd like in the folder 'data', which lives in the folder with this notebook. Replace *'lat-livy.txt'* with your text of choice.


The process below will print the first line of your text, to verify it's what you've put in, and then the length by characters and by words.

In [2]:
# Set your input. Replace *lat-livy.txt* with your text file. You can uncomment the caesar or vergil for testing.
working_directory = os.getcwd()
# file_path = working_directory + '/data/lat_text_latin_library/caesar/gall1.txt'
# file_path = working_directory + '/data/lat_text_latin_library/vergil/aen1.txt'
file_path = working_directory + '/data/lat-livy.txt'

# read in your text
with open(file_path) as f:
    text_full = f.read()

# "replace" command gets rid of line breaks, changes v's -> u's, and em-dashes to spaces
text_full = text_full.replace("\n", " ").replace("\t", " ").replace("v","u").replace("V","U").replace("—"," ")

# Check that this is the correct file, and see the length
print(f"First line: {text_full[:100]}")
print(f"Character count: {len(text_full)}")
print(f"Approximate token count total: {len(text_full.split())}")

First line: Iam primum omnium satis constat Troia capta in ceteros saeuitum esse Troianos, duobus, Aeneae Anteno
Character count: 920884
Approximate token count total: 130066
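
The cleanup step in the cell above can also be written as a single translation table. This is only a minimal sketch, assuming the same substitutions as the chained `replace()` calls; the names `NORMALIZE_TABLE` and `normalize_latin` are hypothetical and not part of the notebook.

```python
# Hypothetical helper: same substitutions as the chained replace() calls above
# (line breaks, tabs, and em-dashes become spaces; v/V become u/U).
NORMALIZE_TABLE = str.maketrans({
    "\n": " ",
    "\t": " ",
    "\u2014": " ",  # em-dash
    "v": "u",
    "V": "U",
})


def normalize_latin(raw_text):
    """Return raw_text with line breaks flattened and v/V rewritten as u/U."""
    return raw_text.translate(NORMALIZE_TABLE)


sample = "Arma virumque cano,\nTroiae qui primus\u2014ab oris"
print(normalize_latin(sample))
```

A translation table makes one pass over the text, which may matter little here but keeps all the substitutions in one place if more are added later.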


### text length problems
On my little old laptop, any text above ~20,000 words crashes Jupyter Notebook. Solving this problem (via multithreading, or increasing memory allowed to Jupyter, or something) is not currently a priority. I will fix this someday, but not today.

So, the process below truncates the text to roughly 20,000 tokens, keeping only the beginning.

In [3]:
# shortening for example purposes (and for speed) to a doc of ~20,000 tokens
if len(text_full.split()) < 20000:
    text=text_full
    print(f"your text is short already")
else:
    divisor = (len(text_full.split())) // 20000
    text = text_full[:len(text_full) // divisor]
    print(f"Approximate token count after cutting down: {len(text.split())}")

Approximate token count after cutting down: 21890
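
The cell above cuts by character position, so the last word can be cut in half and the final count lands near, not at, 20,000. As an alternative, here is a minimal sketch that cuts on word boundaries instead. The function name `truncate_to_tokens` is hypothetical, and "token" here just means a whitespace-separated word, which is the same approximation used in the counts above.

```python
def truncate_to_tokens(full_text, max_tokens=20000):
    """Keep at most max_tokens whitespace-separated words of full_text."""
    words = full_text.split()
    if len(words) <= max_tokens:
        # short texts are returned unchanged
        return full_text
    # rejoin the first max_tokens words with single spaces
    return " ".join(words[:max_tokens])


sample = "Iam primum omnium satis constat Troia capta"
print(truncate_to_tokens(sample, max_tokens=3))
```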


## 2. Run NLP on the text, validate things look right, and convert to a dataframe

### run NLP and validate

In [4]:
# Set up spaCy NLP
model = 'la_core_web_lg'
nlp = spacy.load(model)

In [5]:
# Create spacy Doc object - this might take a minute
doc = nlp(text)

In [6]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# prints the text's first word
# print(doc[0])
# print(type(doc[0]))

In [7]:
# # (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# # All the attributes LatinCy has tagged for the first word
# print([item for item in dir(doc[0]) if not item.startswith("_")])

In [8]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# Run this to get sentences from text to verify parsing 
# sents = doc.sents
# for i, x in enumerate(sents, 1):
#         print(f"{i}: {x}")

In [9]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# number = 0
# for token in doc[:]:
#     if token.text == ' ':
#         number = number+1
#         print(token.lemma_)

# number

### make dataframe

In [10]:
# Make dataframe with token attributes. first, make a list for all the tokens
data = []
for token in doc[:]:
    if token.text != ' ': 
        data.append(
            [
                token.text,
                token.norm_,
                token.lower_,
                token.lemma_,
                token.pos_,
                token.tag_,
                token.text in nlp.vocab,
                
            ]
        )

In [11]:
# then make that [] into a df with the proper columns
df = pd.DataFrame(
    data,
    columns=[
        "text",
        "norm",
        "lower",
        "lemma",
        "pos",
        "tag",
        "in_vocab",
    ],
)

In [12]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# info about the first word
# print(f' text: {doc[0].text}, norm: {doc[0].norm_}, lemma: {doc[0].lemma_}, POS: {doc[0].pos_}')

### clean the dataframe: get rid of punctuation, nulls, common errors

In [13]:
# I don't care about numbers, so remove all the rows for numbers
print(f"before: {df['text'].count()}")
df = df[(df['tag'] != "number")]
print(f"after: {df['text'].count()}")

before: 26045
after: 25821


In [14]:
# I don't care about punctuation, so remove all the rows for punctuation
print(f"before: {df['text'].count()}")
df = df[(df['tag'] != "punc")]
print(f"after: {df['text'].count()}")

before: 25821
after: 22250


In [15]:
# Runs of extra spaces come through as whitespace-only tokens of varying length.
# These are non-empty strings rather than nulls, which is why no version of "dropna()" gets rid of them.
# str.strip() reduces any whitespace-only string to '', so one comparison catches every length.

print(f"before: {df['text'].count()}")
df = df[df['text'].str.strip() != '']
print(f"after: {df['text'].count()}")

before: 22250
after: 22236


In [16]:
# This process checks if there were any unknown words, because these will not have been parsed correctly.
# Unknown words are removed from the frequency / comparative analyses and kept in df_unknownwords
# (they are not currently written to the excel export)
print(f"before: {df['text'].count()}")
df_unknownwords = df[(df['in_vocab'] != True)]
df = df[(df['in_vocab'] != False)]
print(f"after: {df['text'].count()}")

before: 22236
after: 22236


In [17]:
# The parser makes some errors detecting lemmata. This fixes many of them.
# I'm sure it's gonna make a lot more than this, but these are what I've found.
df.loc[df['lemma']=='ab', 'lemma'] = "a"
df.loc[df['lemma']=='eum', 'lemma'] = "is"
df.loc[df['lemma']=='eam', 'lemma'] = "is"
df.loc[df['lemma']=='(h)asta', 'lemma'] = "hasta"
df.loc[df['lemma']=='abi', 'lemma'] = "abeo"
df.loc[df['lemma']=='Patres', 'lemma'] = "pater"
df.loc[df['lemma']=='patr', 'lemma'] = "pater"
df.loc[df['lemma']=='omnium', 'lemma'] = "omnis"
df.loc[df['lemma']=='Romana', 'lemma'] = "romanus"
df.loc[df['lemma']=='magne', 'lemma'] = "magnus"
df.loc[df['lemma']=='sua', 'lemma'] = "suus"
df.loc[df['lemma']=='ui', 'lemma'] = "uis"
df.loc[df['lemma']=='Ui', 'lemma'] = "uis"
df.loc[df['lemma']=='Feo', 'lemma'] = "sum"
df.loc[df['lemma']=='ror', 'lemma'] = "reor"
df.loc[df['lemma']=='domi', 'lemma'] = "domus"
df.loc[df['lemma']=='Consules', 'lemma'] = "consul"
df.loc[df['lemma']=='bonum', 'lemma'] = "bonus"
df.loc[df['lemma']=='virs', 'lemma'] = "uir"
df.loc[df['lemma']=='uirs', 'lemma'] = "uir"
df.loc[df['lemma']=='fuerit', 'lemma'] = "sum"
df.loc[df['lemma']=='primor', 'lemma'] = "primus"
df.loc[df['lemma']=='uido', 'lemma'] = "uideo"
df.loc[df['lemma']=='iuuenus', 'lemma'] = "iuuenis"
df.loc[df['lemma']=='coepi', 'lemma'] = "coepio"
df.loc[df['lemma']=='coipio', 'lemma'] = "coepio"
df.loc[df['lemma']=='Iouis', 'lemma'] = "Iuppiter"
df.loc[df['lemma']=='tanto', 'lemma'] = "tantus"
df.loc[df['lemma']=='quique', 'lemma'] = "quisque"
df.loc[df['lemma']=='auco', 'lemma'] = "augeo"
df.loc[df['lemma']=='uico', 'lemma'] = "uinco"
df.loc[df['lemma']=='aeneae', 'lemma'] = "Aeneas"
df.loc[df['lemma']=='Aenea', 'lemma'] = "Aeneas"
df.loc[df['lemma']=='Aenee', 'lemma'] = "Aeneas"
df.loc[df['text']=='capta', 'lemma'] = "capio"
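
As the list of corrections grows, it may be easier to maintain as a dictionary of wrong lemma to right lemma. The following is a minimal sketch on toy data, not the notebook's own method; `LEMMA_FIXES` and `toy` are hypothetical names, and only a few of the corrections above are copied in.

```python
import pandas as pd

# A few of the corrections from the cell above, as wrong-lemma -> right-lemma pairs
LEMMA_FIXES = {
    "ab": "a",
    "eum": "is",
    "uido": "uideo",
    "Iouis": "Iuppiter",
}

# Toy stand-in for the parsed dataframe
toy = pd.DataFrame({
    "text": ["ab", "eum", "uidit", "Iouis", "urbem"],
    "lemma": ["ab", "eum", "uido", "Iouis", "urbs"],
})

# Series.replace swaps exact matches and leaves every other lemma untouched
toy["lemma"] = toy["lemma"].replace(LEMMA_FIXES)
print(toy)
```

The last correction in the cell above keys on the `text` column rather than the `lemma` column, so it would still need its own line.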

## 3. Frequency, Comparative Analyses, English Definitions

- First, get most common forms in the text and most common lemmas in the text.
  - Add those as columns to the dataframe
- After, compare those lemmas to the top 1000 most common words.
  - This is done by reading in an excel file I've prepared, based on [DCC Latin Core vocabulary](https://dcc.dickinson.edu/vocab/core-vocabulary)
- (Aspirational: compare to the rest of the author/corpus/etc -- not currently implemented)
- Finally, read in Lewis and Short to get definitions. Do this last b/c it takes a while

### get frequencies

In [18]:
df_form_frequency = df['lower'].value_counts()
df_lemma_frequency = df['lemma'].value_counts()

In [19]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# print(df_form_frequency.head(5))
# print(' ')
# print(df_lemma_frequency.head(5))

In [20]:
df = df.merge(df_form_frequency,
                  left_on= 'lower',
                  right_on = 'lower',
                  how= 'left')
df.rename(columns={'count':'form_frequency'},inplace=True)

In [21]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# df.head(3)

In [22]:
df = df.merge(df_lemma_frequency,
                  left_on= 'lemma',
                  right_on = 'lemma',
                  how= 'left')
df.rename(columns={'count':'lemma_frequency'},inplace=True)
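
The merge-and-rename cells above rely on `value_counts()` naming its result `count`. As far as I know that naming arrived with pandas 2.0, and earlier releases gave the result the name of the counted column instead, so the cells may behave differently depending on the installed version. Below is a minimal sketch of an alternative that does not depend on the name at all, shown on toy data (`toy` is a hypothetical stand-in for the parsed dataframe).

```python
import pandas as pd

toy = pd.DataFrame({
    "lower": ["urbem", "urbs", "roma", "urbem"],
    "lemma": ["urbs", "urbs", "Roma", "urbs"],
})

# value_counts() is indexed by the counted values, so map() can look each row up
toy["form_frequency"] = toy["lower"].map(toy["lower"].value_counts())
toy["lemma_frequency"] = toy["lemma"].map(toy["lemma"].value_counts())
print(toy)
```

Because `map()` adds a column in place of a merge, the row order and index of the dataframe stay exactly as they were.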

In [23]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# df.head(3)

## 3.5. get all definitions

Whew, this was a doozy. The code below will:
- Read in Lewis & Short as an xml file ([downloaded from Perseus Project](https://www.perseus.tufts.edu/hopper/opensource/download)).
- Parse your text list against the Lewis & Short entries.
- For words with more than one definition, concatenate all the different L&S definitions. Many words have up to four different definitions, plus more when capitalization varies (e.g., *dis* 'wealthy' vs. *Dis* 'Pluto').


### get dictionary

In [24]:
import codecs
from bs4 import BeautifulSoup
from lxml import etree
from yaml import dump

In [25]:
working_directory = os.getcwd()
dictionary_file_path = working_directory + '/data/lat_lewis_elementary_lexicon/lewis_mod.xml'
yml_file_path = working_directory + '/data/lat_lewis_elementary_lexicon/lewis.yaml'

def get_root(filename):
    parser = etree.XMLParser(load_dtd=True, no_network=False)
    tree = etree.parse(filename, parser=parser)
    return tree.getroot()


def get_entries(filename):
    # maps each entry's "key" attribute to the plain text of that entry
    root = get_root(filename)
    d = {}
    for entry in root.findall(".//entry"):
        lemma = entry.get("key", "")
        # BeautifulSoup strips the xml tags, leaving only the text of the entry
        entry_bs = BeautifulSoup(etree.tostring(entry), features="lxml")
        d[lemma] = entry_bs.text.strip()
    return d


def save_yaml(data, filename):
    with open(filename, "w") as f:
        dump(data, f)


if __name__ == "__main__":
    entries = get_entries(dictionary_file_path)
    # save next to the dictionary file, at the yml_file_path defined above
    save_yaml(entries, yml_file_path)
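
To see what the key-to-text dictionary looks like without loading the full file, here is a minimal sketch on a tiny inline XML string. It swaps in the standard library's `xml.etree.ElementTree` for the lxml and BeautifulSoup combination used above, and the `<orth>` and `<sense>` tags in `SAMPLE_XML` are made up for illustration; the real file's markup is richer. `entries_from_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the dictionary file (tag names are assumptions)
SAMPLE_XML = """
<dictionary>
  <entry key="dis1"><orth>dis</orth> <sense>wealthy $ longer discussion</sense></entry>
  <entry key="Dis2"><orth>Dis</orth> <sense>Pluto $ longer discussion</sense></entry>
</dictionary>
"""


def entries_from_xml(xml_string):
    """Map each entry's key attribute to the entry's text with tags removed."""
    root = ET.fromstring(xml_string)
    result = {}
    for entry in root.findall(".//entry"):
        key = entry.get("key", "")
        # itertext() walks every text fragment inside the entry, in order
        result[key] = "".join(entry.itertext()).strip()
    return result


toy_entries = entries_from_xml(SAMPLE_XML)
print(toy_entries)
```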


### get dictionary entries for our parsed text, merge on the earlier dataframe

In [26]:
# helper function to get a definition from the dictionary
# returns only the 'simple' version of the definition, which ends at a "$" in the xml text
def getdefinition(lemma):
    definition = ""
    # keys to try: lower-case and capitalized, bare and with the numbers 1-4
    # that distinguish separate entries for the same spelling
    multiple_defs_testers = [
        lemma.lower(), 
        lemma.capitalize(),
        lemma.lower()+"1", 
        lemma.capitalize()+"1",
        lemma.lower()+"2", 
        lemma.capitalize()+"2", 
        lemma.lower()+"3",
        lemma.capitalize()+"3",
        lemma.lower()+"4",
        lemma.capitalize()+"4",
    ]

    for key in multiple_defs_testers:
        if key in entries:
            # partition() keeps the whole entry when it has no "$";
            # find() would return -1 there and cut off the last character
            simple = entries[key].partition("$")[0]
            definition = definition + simple.replace("\n", " ").replace("  "," ").replace("  "," ").replace("  "," ")
    if definition == "":           
        return "error getting definition"
    else:
        return definition
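
The lookup is easier to follow on a toy dictionary. This is a minimal sketch of the same idea, not the notebook's function: `TOY_ENTRIES` and `simple_definitions` are hypothetical, the entry texts are invented, and unlike the cell above it joins multiple definitions with "; " so they stay visually separate.

```python
# Invented entries in the same shape as the parsed dictionary:
# a numbered key per homograph, and "$" ending the simple definition
TOY_ENTRIES = {
    "dis1": "dis, adj.   wealthy $ further notes",
    "Dis2": "Dis, m.\nPluto $ further notes",
    "uir": "uir, m. a man",
}


def simple_definitions(lemma, entries):
    """Collect the text before "$" from every entry that matches the lemma."""
    found = []
    for suffix in ("", "1", "2", "3", "4"):
        for key in (lemma.lower() + suffix, lemma.capitalize() + suffix):
            if key in entries:
                # partition() keeps the whole entry when no "$" is present
                head = entries[key].partition("$")[0]
                # split/join collapses line breaks and repeated spaces
                found.append(" ".join(head.split()))
    return "; ".join(found) if found else "error getting definition"


print(simple_definitions("dis", TOY_ENTRIES))
print(simple_definitions("uir", TOY_ENTRIES))
print(simple_definitions("saeuio", TOY_ENTRIES))
```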

In [27]:
dict_defs = []
for lemma in df['lemma']:
    dict_defs.append(
        [
            lemma,
            getdefinition(lemma)
        ]
    )

In [28]:
df_defs = pd.DataFrame(
    dict_defs,
    columns=["lemma","def"],
)
df_defs.drop_duplicates(inplace=True)

In [29]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# df_defs.head(10)

In [30]:
df = df.merge(df_defs,
                  left_on= 'lemma',
                  right_on = 'lemma',
                  how= 'left')

In [31]:
df.rename(columns={'def':'definition'},inplace=True)

In [32]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# df.head(20)
# df[(df["tag"] == 'proper_noun') & (df['definition'] == 'error getting definition')]

### clean up proper nouns
Many proper names give definition errors, so this sets the 'definition' to be the lemma (i.e., the proper name).

In [33]:
df.loc[(df['definition'] == "error getting definition") & (df['pos'] == 'PROPN'), 'definition'] = df['lemma']
df.loc[(df['definition'] == "error getting definition") & (df['tag']=='proper_noun'), 'definition'] = df['lemma']
df.loc[(df['definition'] == "error getting definition") & (df['tag']=='proper_noun_particle'), 'definition'] = df['lemma']

## 3.75. compare to top used words chart

In [34]:
working_directory = os.getcwd()
core_file_path = working_directory + '/data/core_lat.xlsx'
df_core = pd.read_excel(core_file_path)

In [35]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# testword = 'abs'
# df_core[(df_core['l1'] == testword) | (df_core['l2'] == testword) | (df_core['l3'] == testword) | (df_core['l4'] == testword)]

In [36]:
# checks the lemma against each of the four lemma columns (l1-l4) of the core list
# any() stops at the first column that contains the lemma

def isfrequent(lemma):
    return any(lemma in df_core[column].values for column in ['l1', 'l2', 'l3', 'l4'])

In [37]:
# check each lemma in the text against the core list
checking_against_top = []
for lemma in df['lemma']:
    checking_against_top.append(
        [
            lemma,
            isfrequent(lemma)
        ]
    )

In [38]:
# creates a data frame with the new 'is in the core' tag
df_freq = pd.DataFrame(
    checking_against_top,
    columns=["lemma","tops"],
)
df_freq.drop_duplicates(inplace=True)

In [39]:
# (FOR DEBUGGING, UNNECESSARY FOR THE FINAL PRODUCT)
# df_freq.head(10)

In [40]:
df = df.merge(df_freq,
                  left_on= 'lemma',
                  right_on = 'lemma',
                  how= 'left')
df.rename(columns={'tops':'in_core'},inplace=True)
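
The core check can also be done in one vectorized step, without the helper function, the loop, and the merge. Below is a minimal sketch on toy data. It assumes that the `l1` through `l4` columns hold alternative forms of each headword, which is a guess from how they are used above; `toy_core`, `toy`, and `core_lemmas` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the core spreadsheet (contents invented)
toy_core = pd.DataFrame({
    "l1": ["a", "uir", "sum"],
    "l2": ["ab", None, None],
    "l3": ["abs", None, None],
    "l4": [None, None, None],
})
toy = pd.DataFrame({"lemma": ["ab", "uir", "saeuio", "sum"]})

# Gather every non-empty form from the four columns into one set
core_lemmas = set()
for column in ["l1", "l2", "l3", "l4"]:
    core_lemmas.update(toy_core[column].dropna())

# isin() tests every row against the set at once
toy["in_core"] = toy["lemma"].isin(core_lemmas)
print(toy)
```

Building the set once avoids scanning the four columns again for every token in the text.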

## 4. Exporting
Here, I create a few slices that will be useful and export them to an xlsx file, with which you can make your vocab lists / charts as you please.

Create all the views that I want as different data frames and export them all

In [41]:
# df.head(1)

In [42]:
df = df[['text', 'lemma', 'pos', 'tag', 'form_frequency', 'lemma_frequency', 'definition', 'in_core']]
df_core = df_core[['Headword', 'Definition', 'Part of Speech', 'Semantic Group', 'Frequency Rank']]

In [43]:
# New words to be learned immediately: words that are not in DCC top 1000, occur more than 5 times in the text, and are not names
df_priority_list = df[(df["lemma_frequency"] > 5) & ~(df['in_core']) & (df['tag'] != 'proper_noun') & (df['definition'] != 'error getting definition')]
df_priority_list = df_priority_list[['lemma','lemma_frequency','definition']]
df_priority_list = df_priority_list.sort_values('lemma_frequency', ascending=False).drop_duplicates()

In [44]:
# creates an ExcelWriter object, necessary for writing multiple dfs to the same file
working_directory = os.getcwd()
export_directory = working_directory + '/exports'
# make sure the exports folder exists before writing to it
os.makedirs(export_directory, exist_ok=True)
file_path = export_directory + '/vocab_builder_output.xlsx'
# the with-block saves and closes the file when the writes are done
with pd.ExcelWriter(path=file_path) as xlwriter:
    df.to_excel(xlwriter, sheet_name='full_text',index=True)
    df_priority_list.to_excel(xlwriter, sheet_name='priority_words',index=False)
    df_core.to_excel(xlwriter, sheet_name='core',index=False)