# Latin Vocabulary from the Bridge

I found some Latin and Greek vocabulary tools at [The Bridge](http://bridge.haverford.edu/), a project hosted by Haverford College.  You can select from a range of texts (not only classical texts like Vergil's *Aeneid*, but also textbooks like Moreland & Fleischer's *Latin: An Intensive Course*) and download the vocabulary as a TSV or Excel file.  Since the TSV files don't seem to include column headers, I'll work with the Excel files for now.

The files are in principle set up for direct import into flashcard programs like Anki (the site mentions this specifically), but only for naive import.  All forms for a single vocabulary entry (e.g. all principal parts of a given verb) are stored as a single string in a single column.  If your Anki cards have a separate field for each principal part, say, then the data format won't work as it stands.  So this notebook starts by separating the individual data elements from one another for easier manipulation.
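To make the goal concrete, here is a minimal sketch of the kind of separation meant above. It assumes the forms within an entry are separated by whitespace; the helper name `split_entry` and the choice of four fields are hypothetical, not something The Bridge prescribes.

```python
def split_entry(entry, n_fields=4):
    """Split a whitespace-separated dictionary entry into separate fields.

    Hypothetical helper: n_fields=4 is an assumption that suits verbs
    with four principal parts.
    """
    parts = entry.split()
    # Pad short entries (e.g. the conjunction "ac") with empty strings
    # so that every row has at least n_fields cells.
    parts += [""] * (n_fields - len(parts))
    return parts


rows = ["abeō abīre abiī abitum", "ac", "ācer ācris ācre"]
fields = [split_entry(row) for row in rows]
```

Entries with alternative forms or gender markers (such as `f.`) would need extra handling that this sketch does not attempt.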

In [1]:
import pandas as pd
#import numpy as np
#import math
#import re
#import subprocess
#import os
#import string
#from nltk.corpus import stopwords

Since the data is in an Excel file (we could use a TSV file, but that doesn't come with column names), you'll need to make sure you've installed `xlrd`:
```
> pyenv activate nlp3 # an environment with Python 3.x and pandas
> pip install xlrd
```
Make sure to re-import `pandas` in the setup cell above so that it picks up the new dependency.
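For comparison, the headerless TSV route could be handled by supplying column names by hand. The sketch below uses the standard-library `csv` module on an inline sample. The two column names and the sample rows are assumptions made for illustration; the real TSV export may have more columns or a different order.

```python
import csv
import io

# Hypothetical two-column sample standing in for a headerless TSV export.
sample_tsv = "abeō abīre abiī abitum\tVerb\nac\tConjunction\n"

# Assumed column names; the TSV itself does not supply any.
column_names = ["Dictionary Entry", "Part of Speech"]

reader = csv.DictReader(
    io.StringIO(sample_tsv),
    fieldnames=column_names,  # used in place of a header row
    delimiter="\t",
)
records = list(reader)
```

The drawback is that the names have to be guessed and kept in sync with the export, which is why the Excel download looks like the easier starting point.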

In [2]:
#xl = pd.ExcelFile("data/dcc_core_latin_vocabulary_bridge20150730_all.xls")

In [3]:
#xl.sheet_names

In [4]:
#df = xl.parse("Sheet1")

As it turns out, trying to read the data file as a true Excel file gives an error.  It's similar to what you find in [this thread about unsupported formats in `xlrd`](http://stackoverflow.com/questions/9623029/python-xlrd-unsupported-format-or-corrupt-file).  Judging from that thread, the likely cause is that the file is really an HTML table saved with an `.xls` extension.  According to that thread, and also [this one](http://stackoverflow.com/questions/28710618/pandas-read-html-equivalent-for-a-lxml-table), the thing to do is install the dependency `lxml` and use the `pandas` function `read_html()`.  So make sure you do
```
> pyenv activate nlp3
> pip install lxml
```
and then proceed as follows.
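As an aside, one way to check that diagnosis is to look at the first few bytes of the file. The sketch below relies on the fact that legacy binary `.xls` files begin with the OLE2 signature and `.xlsx` files are ZIP archives beginning with `PK`. The function name `sniff_spreadsheet` and the list of HTML prefixes are hypothetical choices, not an exhaustive detector.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls container
HTML_PREFIXES = (b"<html", b"<!doctype", b"<table", b"<meta")  # assumed, not exhaustive


def sniff_spreadsheet(head):
    """Guess a file's real format from its first bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(b"PK"):
        return "xlsx"  # ZIP container
    # Skip a UTF-8 byte-order mark and leading whitespace before checking for markup.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if stripped.startswith(HTML_PREFIXES):
        return "html"
    return "unknown"


# Usage on a real file would look like:
#     with open(path, "rb") as f:
#         kind = sniff_spreadsheet(f.read(64))
```

If this reports `"html"` for a file with an `.xls` extension, parsing it as an HTML table is the natural next step.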

In [5]:
# read_html() returns a list of DataFrames, one per HTML <table> it finds
xl = pd.read_html("data/dcc_core_latin_vocabulary_bridge20150730_all.xls")
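One thing to watch with files like this is character encoding, since the entries contain macrons. If accented characters come out garbled after parsing, a common cause is UTF-8 bytes being decoded as Latin-1. `read_html()` accepts an `encoding` argument, so passing `encoding="utf-8"` is worth trying first (that the file is UTF-8 is an assumption here). The sketch below shows the mismatch and a way to undo it after the fact; `fix_mojibake` is a hypothetical helper name.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Hypothetical helper; returns the input unchanged if it does not
    look like this particular kind of mis-decoding.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the problem: correct text, encoded as UTF-8, decoded as Latin-1.
garbled = "abeō abīre".encode("utf-8").decode("latin-1")
repaired = fix_mojibake(garbled)
```

The repair only works if no bytes were dropped along the way, so fixing the encoding at read time is the more robust option.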

In [6]:
# xl is a list, so this slices the list of tables, not the rows of a table
xl[:5]

[                            Dictionary Entry Part of Speech  \
 0                     abeō abīre abiī abitum           Verb   
 1                 absum abesse āfuī āfutūrus           Verb   
 2                                         ac    Conjunction   
 3           accēdō accēdere accessī accessum           Verb   
 4                 accidō accidere accidī ———           Verb   
 5           accipiō accipere accēpī acceptus           Verb   
 6                            ācer ācris ācre      Adjective   
 7                             aciēs aciēī f.           Noun   
 8                                         ad    Preposition   
 9                 addō addere addidī additus           Verb   
 10           addūcō adducere addūxī adductus           Verb   
 11              adeō adīre adīvī/adiī aditus           Verb   
 12                                      adeō         Adverb   
 13        adhibeō adhibēre adhibuī adhibitus           Verb   
 14                                    a