# Basic Vocabulary for Latin

The first task is to take *Paul B. Diederich's Basic Vocabulary*, as found on [Carolus Raeticus's page](http://hiberna-cr.wikidot.com/downloads), and convert it into a `.tsv` file suitable for import into the [Anki](http://ankisrs.net/) flashcard program.
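
As a rough picture of the target format, here is a minimal sketch that writes two entries from the word list as tab-separated text, one note per line. It assumes a two-field note (Latin entry, English gloss) and no header row; the names `notes` and `tsv_text` are illustrative, and the real field layout depends on the Anki note type chosen at import time.

```python
import csv
import io

# Illustrative sample notes: (Latin entry, English gloss).
notes = [
    ("deus, -ī, m.", "god"),
    ("dea, -ae, f.", "goddess"),
]

# Write tab-separated rows with no header line; each row is one note.
# Commas inside a field need no quoting because the delimiter is a tab.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(notes)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; swapping the buffer for an open file handle would produce the actual `.tsv`.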

There are a few things to keep in mind. First, we need to separate verbs from the remaining vocabulary, because I treat verbs differently: they have principal parts, which come in varying numbers. By contrast, nouns and adjectives fit a more general "vocabulary" type, generally defined by three forms: nominative and genitive singular plus nominative plural for nouns, and the nominative singular in each of the three genders for adjectives.

## Setup

Let's first import the packages we need and load the raw data into memory.

In [1]:
import pandas as pd
import os

In [2]:
homedir   = os.path.expanduser("~")       # user's home directory
curdir    = os.curdir                     # current working directory ('.')
datadir   = os.path.join(curdir, 'data')  # raw input and saved output
tmpdir    = os.path.join(curdir, 'tmp')   # temporary files
rawfile   = 'diederich_lodge_practice_raeticus20121219_macrons.txt'
outfile   = 'basic_vocabulary.tsv'
rawdata   = os.path.join(datadir, rawfile)
saveddata = os.path.join(datadir, outfile)

In [3]:
# Tab-separated input; skip the 79 lines that precede the header row.
rawdf = pd.read_csv(rawdata, sep='\t', skiprows=79)

In [4]:
rawdf.head(20)

| | Struct | d. | Sect. | Sub-Sect. | Latin | English (Lodge) |
|---|---|---|---|---|---|---|
| 0 | x | | 1 | 1 | deus, -ī, m. | god |
| 1 | xx | | 1 | 1 | dea, -ae, f. | goddess |
| 2 | xx | | 1 | 1 | dīvīnus, -a, -um | divine, godlike, inspired |
| 3 | xx | | 1 | 1 | dīvus, -a, -um | divine, godlike; (noun) god, goddess |
| 4 | xx | d. | 1 | 1 | dīves, -itis | rich |
| 5 | x | | 1 | 1 | nympha, -ae, f. | nymph |
| 6 | x | | 1 | 1 | religiō, -ōnis, f. | conscientiousness, sense of right; scruples; (... |
| 7 | x | | 1 | 1 | templum, -ī, n. | place marked off for augury; holy ground; shri... |
| 8 | x | | 1 | 1 | āra, -ae, f. | altar |
| 9 | x | | 1 | 1 | vātēs, -is, mf. | prophet, soothsayer, seer, bard |


Looking at the output, we need to split the "Latin" column into pieces, primarily on the commas. The problem is that the entries are a mix of nouns, adjectives, and verbs, and even within one part of speech the number of elements varies (e.g. 1st/2nd-declension vs. 3rd-declension adjectives, or regular vs. deponent verbs).
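
To make the varying element counts concrete, here is a minimal sketch of the comma split, run on three values copied from the "Latin" column above. The helper `split_entry` and the list `latin_entries` are hypothetical names used only for illustration, and the sketch uses plain Python rather than pandas so that it runs without the data file.

```python
# Sample values copied from the "Latin" column of the output above.
latin_entries = [
    "deus, -ī, m.",       # noun: nominative, genitive ending, gender
    "dīvīnus, -a, -um",   # 1st/2nd-declension adjective: three endings
    "dīves, -itis",       # 3rd-declension adjective: only two elements
]


def split_entry(entry):
    """Split a dictionary entry on commas and strip surrounding whitespace."""
    return [part.strip() for part in entry.split(",")]


for entry in latin_entries:
    parts = split_entry(entry)
    print(len(parts), parts)
```

The first two entries split into three elements and the last into two, so any fixed-width layout has to allow for missing pieces. On the full DataFrame, something like `rawdf["Latin"].str.split(",", expand=True)` should give a comparable result, with shorter entries padded by missing values.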

In [5]:
rawdf.tail()

| | Struct | d. | Sect. | Sub-Sect. | Latin | English (Lodge) |
|---|---|---|---|---|---|---|
| 1513 | xx | | 8 | 1 | postrēmus, -a, -um | hindmost, the last; end or last part of any th... |
| 1514 | x | | 8 | 1 | con- (com-) | (prefix) together |
| 1515 | x | | 8 | 1 | dis- | (prefix) apart, not |
| 1516 | x | | 8 | 1 | re- (red-) | (prefix) back, again |
| 1517 | x | | 8 | 1 | trāns- (trā-) | (prefix) across |
