# Making an *index locorum* from the Perseus edition of Smyth

Patrick J. Burns 1.10.20 (updated 1.11.20)

Here is an XML parsing project that came out of a discussion on the 'gltreebank' listserv. There was a request for "an Index Locorum to Smyth's 'Greek Grammar'" and one was identified quickly: W.A. Schumann's 1961 "Index of passages cited in Herbert Weir Smyth, Greek grammar" (GRBS Scholarly Aids 1). [Here](https://catalog.hathitrust.org/Record/001811341) is a link to the item, with full text, in HathiTrust. I often use the Perseus Digital Library edition of [Smyth](http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.04.0007%3Asmythp%3D1) and remembered that citations in this text are linked to their Perseus sources. With this in mind, I decided to recompile something like Schumann's index by parsing the Perseus XML. The code is in the notebook below and a csv with results can be found [here](https://github.com/diyclassics/perseus-experiments/blob/master/data/smyth_citations.csv).

The workflow is as follows: 1. compile a list of Smyth xml docs from the Perseus TOC (provided in the left column of the Smyth html); 2. parse the docs in the TOC for ```<bibl>``` elements, i.e. the element used for the citations, and, where applicable, the ```<author>``` and ```<title>``` annotations; 3. put this data into more user-friendly formats, e.g. a Pandas DataFrame and .csv. The result is a compilation of passages per Smyth chapter. This can be sorted and reindexed to more closely approximate Schumann's index (cf. the ```df_auth``` DataFrame and corresponding .csv file).
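
Step 2 can be illustrated on a small fragment. The sketch below is not the notebook's parser: it uses the standard library's `xml.etree.ElementTree` in place of `lxml`, and the XML fragment is an invented sample written to resemble the markup described above, not text copied from Perseus. It shows how the `n` attribute, the `<author>` and `<title>` children, and the trailing text (`.tail`) could map onto a citation record.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (an assumption, not copied from Perseus)
SAMPLE = """
<div>
  <p>Example <bibl n="Xen. Anab. 1.2.2"><author>X.</author> <title>A.</title> 1.2.2</bibl>.</p>
  <p>Example <bibl n="Thuc. 4.47"><author>T.</author> 4.47</bibl>.</p>
</div>
"""

def parse_bibl(bibl):
    """Return (citation, author, work, loc) for one <bibl> element."""
    author = bibl.find("author")
    title = bibl.find("title")
    # The passage reference is the text trailing the last child element
    last = title if title is not None else author
    loc = last.tail.strip() if last is not None and last.tail else None
    return (
        bibl.get("n"),
        author.text if author is not None else None,
        title.text if title is not None else None,
        loc,
    )

records = [parse_bibl(b) for b in ET.fromstring(SAMPLE).iter("bibl")]
for record in records:
    print(record)
```

A work with no title (as in the second sample element) simply yields `None` in the work slot, and the location is read from the text following `<author>` instead.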

Note a little data cleanup trick here—the encoding for Homeric references in this edition 1. does not include the author, and 2. refers to the books by their Greek letter indices, e.g. 'Α' for Iliad 1 and 'α' for Odyssey 1. These are corrected both in the parser and with some helper functions below. Similar corrections will need to be made for the CIA and IGA references as well at some point.
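
The Homeric convention can be stated compactly: the position of the letter in the 24-letter alphabet gives the book number, and its case gives the poem. A minimal sketch of that mapping follows; the name `homeric_ref` is illustrative and is not one of the notebook's helpers.

```python
GREEK_UPPER = "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ"  # 24 letters, one per book

def homeric_ref(letter, line):
    """Map a Greek book letter and a line reference to (work, 'book.line').

    Uppercase letters denote the Iliad, lowercase the Odyssey.
    Hypothetical helper, for illustration only.
    """
    # str.index raises ValueError if the letter is not a book letter
    book = GREEK_UPPER.index(letter.upper()) + 1
    work = "Il." if letter.isupper() else "Od."
    return work, f"{book}.{line.strip()}"

print(homeric_ref("Α", " 1"))   # Iliad, book 1
print(homeric_ref("δ", " 62"))  # Odyssey, book 4
```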

In [1]:
# Imports
import urllib.request
from lxml import etree

from collections import defaultdict
import natsort as ns

import pandas as pd

import time

import pickle
from pprint import pprint

In [2]:
# Constants
perseus_xml_base_url = "http://www.perseus.tufts.edu/hopper/xmlchunk?doc="

smyth_toc_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.04.0007%3Asmythp%3D1"

## Getting Smyth xml docs from Perseus TOC

In [3]:
# Get all Smyth xml docs from Perseus TOC
with urllib.request.urlopen(smyth_toc_url) as f:
    perseus_toc_xml = f.read()

root = etree.fromstring(perseus_toc_xml)

In [4]:
# Get list of refs from <chunk> elements

chapters = root.findall(".//chunk")
refs = [chapter.attrib['ref'] for chapter in chapters]
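
To make the shape of this step concrete, here is a small offline sketch. The TOC fragment is invented for illustration: only the `<chunk>` element and its `ref` attribute are taken from the code above, while the nesting and the `ref` values are assumptions. It uses the standard library's `xml.etree.ElementTree`, whose `findall` accepts the same `.//chunk` path.

```python
import xml.etree.ElementTree as ET

# Invented TOC fragment; the real Perseus TOC may nest and label chunks differently
TOC_SAMPLE = """
<contents>
  <chunk ref="example:part=1">
    <chunk ref="example:part=1:chapter=1"/>
  </chunk>
  <chunk ref="example:part=2"/>
</contents>
"""

sample_root = ET.fromstring(TOC_SAMPLE)
# './/chunk' matches chunks at any depth, in document order
sample_refs = [chunk.attrib["ref"] for chunk in sample_root.findall(".//chunk")]
print(sample_refs)
```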

## Parse Smyth xml docs for citations

In [5]:
# Helper functions

def get_smyth_xml(ref):
#     time.sleep(.1)
    with urllib.request.urlopen(perseus_xml_base_url+ref) as f:
        smyth_xml = f.read()
    return smyth_xml

def get_smyth_xmls(refs):
    for ref in refs:
        yield get_smyth_xml(ref)

In [6]:
# citations = []

# for i, xml in enumerate(get_smyth_xmls(refs)):
#     root = etree.fromstring(xml)
#     milestone = root.find('.//milestone')
#     smyth_id = milestone.attrib['id']
#     print(f'Processing file {i+1}... Smyth ch. {smyth_id}')
#     bibls = root.findall('.//bibl')
#     for bibl in bibls:
#         cit, author_, title_, loc = None, None, None, None
#         if 'n' in bibl.attrib.keys():
#             cit = bibl.attrib['n']
#         else:
#             cit = None
#         if bibl.find('author') is not None:
#             author = bibl.find('author')
#             author_ = author.text
#         else:
#             author = None
#         if bibl.find('title') is not None:
#             title = bibl.find('title')
#             if title.xpath("foreign"): # Handle Homer
#                 author_ = "H."
#                 title_ = title.find('foreign').text
#             else:
#                 title_ = title.text
#             loc = title.tail
#         else:
#             title = None
#             # Guard against a <bibl> that has neither <author> nor <title>
#             loc = author.tail if author is not None else None
#         print(cit, author_, title_, loc)
#         citations.append((smyth_id, cit, author_, title_, loc))
       
# pickle.dump(citations, open("data/smyth_citations.p", "wb"))

# Load the cached citations; the context manager closes the file handle
with open("data/smyth_citations.p", "rb") as f:
    citations = pickle.load(f)
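
The commented-out loop and the `pickle.load` call above amount to a build-once, reload-later cache, which avoids re-downloading every chapter on each run. One way to package that pattern is sketched below; `load_or_build` and its arguments are hypothetical names, not part of the notebook, and the builder here is a stand-in for the network-bound parser.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at `path`, building and saving it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data

# Demonstration with a stand-in builder instead of the real parser
calls = []

def fake_build():
    calls.append(1)
    return [("s188", "Xen. Anab. 1.2.2", "X.", "A.", "1.2.2")]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "citations.p")
    first = load_or_build(cache, fake_build)   # builds and writes the cache
    second = load_or_build(cache, fake_build)  # reads the cache, no rebuild

print(first == second, len(calls))
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted data.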

## Organize citations in dataframe

In [7]:
# Make DataFrame from citations

df = pd.DataFrame(citations, columns=['smyth-id', 'citation', 'author', 'work', 'loc'])

In [8]:
# Helper functions for fixing Homeric citations

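# Greek capitals index the 24 books: uppercase = Iliad, lowercase = Odyssey
GREEK_LETTERS = list('ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ')

def fix_homeric_citation(letter, cit):
    """Prefix the line reference with the book number implied by the Greek letter."""
    if letter and letter.upper() in GREEK_LETTERS:
        return f'{GREEK_LETTERS.index(letter.upper())+1}.{cit.strip()}'
    return cit

def get_homeric_work(letter):
    """Return 'Il.' or 'Od.' for a Homeric book letter; pass other titles through."""
    if letter and letter.upper() in GREEK_LETTERS:
        return 'Il.' if letter.isupper() else 'Od.'
    return letter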

In [9]:
# Natsort df by Smyth ids, then normalize Homeric citations

df['smyth-id'] = pd.Categorical(df['smyth-id'], ordered=True, categories=ns.natsorted(df['smyth-id'].unique()))
df = df.sort_values('smyth-id')
# Fix 'loc' before 'work': fix_homeric_citation needs the original Greek letter
df['loc'] = df[['work', 'loc']].apply(lambda x: fix_homeric_citation(*x), axis=1)
df['work'] = df['work'].apply(get_homeric_work)
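
Natural sorting matters here because Smyth section ids mix digits and letters: a plain string sort would place `s3045` before `s325D`. The sketch below shows the idea using only the standard library; `natural_key` is a hypothetical stand-in and does not reproduce everything `natsort` handles.

```python
import re

def natural_key(s):
    """Split a string into text and integer chunks so digit runs compare numerically."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", s)]

ids = ["s3045", "s325D", "s44a.D", "s180", "s188"]
print(sorted(ids))                   # lexicographic order
print(sorted(ids, key=natural_key))  # numeric-aware order
```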

In [10]:
df

| | smyth-id | citation | author | work | loc |
|---|---|---|---|---|---|
| 0 | s44a.D | Hom. Il. 21.567 | H. | Il. | 21.567 |
| 1 | s180 | Hom. Il. 14.472 | H. | Il. | 14.472 |
| 2 | s185 | Thuc. 4.47 | T. | | 4.47 |
| 3 | s188 | Xen. Anab. 1.2.2 | X. | A. | 1.2.2 |
| 4 | s325D | Hom. Od. 4.62 | H. | Od. | 4.62 |
| ... | ... | ... | ... | ... | ... |
| 5187 | s3045 | Plat. Apol. 20e | P. | A. | 20e |
| 5189 | s3046 | Aeschin. 3.202 | Aes. | | 3.202 |
| 5191 | s3048 | Soph. El. 435 | S. | El. | 435 |
| 5190 | s3048 | Aesch. PB 21 | A. | Pr. | 21 |


In [11]:
# Copy so that re-typing 'loc' below does not also modify df
df_auth = df.copy()
df_auth['loc'] = pd.Categorical(df_auth['loc'], ordered=True, categories= ns.natsorted(df_auth['loc'].unique()))
df_auth = df_auth.sort_values(['author', 'work', 'loc'])
df_auth = df_auth.reset_index(drop=True)
df_auth

| | smyth-id | citation | author | work | loc |
|---|---|---|---|---|---|
| 0 | s2329 | Aesch. Ag. 37 | A. | Ag. | 37 |
| 1 | s1882 | Aesch. Ag. 126 | A. | Ag. | 126 |
| 2 | s2104 | Aesch. Ag. 161 | A. | Ag. | 161 |
| 3 | s2328 | Aesch. Ag. 208 | A. | Ag. | 208 |
| 4 | s2033 | Aesch. Ag. 252 | A. | Ag. | 252 |
| ... | ... | ... | ... | ... | ... |
| 5188 | s1473 | | | C.I.A. | /lref> |
| 5189 | s1488 | | | C.I.A. | /lref> |
| 5190 | s1527 | | | C.I.A. | /lref> |
| 5191 | s1923 | | | C.I.A. | /lref> |


## Export DataFrames to csv

In [12]:
df.to_csv('data/smyth_citations.csv', index=False)
df_auth.to_csv('data/smyth_citations_by_author.csv', index=False)

## Get Stats on Perseus-Smyth citations

In [13]:
total_citations = len(df['smyth-id'])
unique_authors = len(set(df['author']))
unique_works = len(set([''.join(filter(None, (item[0], item[1]))) 
                        for item in zip(df['author'],df['work'])]))
freq_author = df['author'].value_counts().keys()[0]
freq_work = " ".join(df.groupby(['author','work']).size().idxmax())

print(f'Perseus-Smyth stats')
print(f'-'*20)
print(f'Total citations: {total_citations}')
print(f'Unique authors: {unique_authors}')
print(f'Unique works: {unique_works}')
print(f'Most frequent author: {freq_author}')
print(f'Most frequent work: {freq_work}')

Perseus-Smyth stats
--------------------
Total citations: 5193
Unique authors: 21
Unique works: 99
Most frequent author: X.
Most frequent work: X. A.


In [14]:
print(f'Perseus-Smyth top authors')
print(f'-'*20)
for item in list(df['author'].value_counts().items())[:10]:
    print(f'{item[0]} ({item[1]})')

Perseus-Smyth top authors
--------------------
X. (1532)
P. (881)
T. (698)
D. (542)
S. (312)
H. (223)
L. (179)
E. (164)
I. (130)
Ar. (127)


In [15]:
print(f'Perseus-Smyth top works')
print(f'-'*20)
df['work_'] = df['work'].astype(str)
for item in list(df.groupby(['author','work_']).size().nlargest(10).items()):
    if item[0][1] == 'None':
        print(f'{item[0][0]} ({item[1]})')
    else:
        print(f'{item[0][0]}, {item[0][1]} ({item[1]})')

Perseus-Smyth top works
--------------------
X., A. (701)
T. (698)
D. (542)
X., C. (399)
L. (179)
X., H. (178)
P., A. (165)
P., R. (150)
X., M. (138)
H., Od. (136)
