# Making an *index locorum* from the Perseus edition of Smyth

Patrick J. Burns 1.10.20

Here is an XML parsing project that came out of a discussion on the 'gltreebank' listserv. There was a request for "an Index Locorum to Smyth's 'Greek Grammar'" and one was identified quickly: W.A. Schumann's 1961 "Index of passages cited in Herbert Weir Smyth, Greek grammar." (GRBS Scholarly Aids 1). [Here](https://catalog.hathitrust.org/Record/001811341) is a link to the item, with full text, in HathiTrust. I often use the Perseus Digital Library edition of [Smyth](http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.04.0007%3Asmythp%3D1) and remembered that citations in this text are linked to their Perseus sources. With this in mind, I decided to recompile something like Schumann's index by parsing the Perseus XML. The code is in the notebook below and a csv with results can be found [here](https://github.com/diyclassics/perseus-experiments/blob/master/data/smyth_citations.csv).

The workflow is as follows:

1. Compile a list of Smyth xml docs from the Perseus TOC (provided in the left column of the Smyth html).
2. Parse the docs in the TOC for ```<bibl>``` elements, i.e. the element used for the citations, and add them to a dictionary where the Smyth chapter is the key and the value is a list of citations, e.g. ```{'s910': ['Lys. 13.10']}```.
3. Put this data into more user-friendly formats, e.g. a Pandas DataFrame and a .csv file.

The result is a compilation of passages per Smyth chapter. Schumann's index is actually the reverse, Smyth chapters per author/work citation; this could be done easily enough (and I probably will do it soon) by changing the parsing script to extract the ```<author>``` and ```<title>``` elements in addition to ```<bibl>```. But for now, we have a searchable version of Schumann, built up from the Perseus data.
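As a rough illustration of the reversal described above, here is a minimal sketch that turns a chapter-to-passages dictionary into a passage-to-chapters one. The sample entries are copied from the dataframe output shown later in this notebook; `invert_index` is a hypothetical helper name, not part of the notebook. It only inverts on the full citation string, so grouping by author and work would still need the ```<author>``` and ```<title>``` extraction mentioned above.

```python
from collections import defaultdict

# A few entries in the shape of the notebook's `citations` dictionary
# (Smyth id -> list of citations), copied from the dataframe output.
citations = {
    "s180": ["Hom. Il. 14.472"],
    "s185": ["Thuc. 4.47"],
    "s3048": ["Soph. El. 435", "Aesch. PB 21"],
}


def invert_index(citations):
    """Hypothetical helper: map each citation to the Smyth ids that cite it."""
    inverted = defaultdict(list)
    for smyth_id, passages in citations.items():
        for passage in passages:
            inverted[passage].append(smyth_id)
    return dict(inverted)


by_passage = invert_index(citations)
for passage in sorted(by_passage):
    print(passage, "->", ", ".join(by_passage[passage]))
```

A passage cited in several Smyth chapters would simply collect several ids in its list.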

In [1]:
# Imports
import urllib.request
from lxml import etree

from collections import defaultdict
import natsort as ns

import pandas as pd

import time

import pickle
from pprint import pprint

In [2]:
# Constants
perseus_xml_base_url = "http://www.perseus.tufts.edu/hopper/xmlchunk?doc="

smyth_toc_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.04.0007%3Asmythp%3D1"

## Getting Smyth xml docs from Perseus TOC

In [3]:
# Get all Smyth xml docs from Perseus TOC
with urllib.request.urlopen(smyth_toc_url) as f:
    perseus_toc_xml = f.read()

root = etree.fromstring(perseus_toc_xml)

In [4]:
# Get list of refs from <chunk> elements

chapters = root.findall(".//chunk")
refs = [chapter.attrib['ref'] for chapter in chapters]
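
To see what the ```findall(".//chunk")``` step does without calling the Perseus server, here is a small offline sketch. The XML fragment is a hypothetical, simplified stand-in for the TOC (the real file has more elements and attributes, and the `ref` values here are made up). It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra installs; the `findall` call behaves the same way for this path.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TOC fragment; only <chunk ref="..."> matters here.
sample_toc = """
<contents>
  <chunk ref="example-ref-1"><head>Part I</head>
    <chunk ref="example-ref-2"><head>Section 1</head></chunk>
  </chunk>
  <chunk ref="example-ref-3"><head>Part II</head></chunk>
</contents>
"""

toc_root = ET.fromstring(sample_toc)

# ".//chunk" matches <chunk> elements at any depth, in document order,
# so nested chunks are collected along with top-level ones.
sample_refs = [chunk.attrib["ref"] for chunk in toc_root.findall(".//chunk")]
print(sample_refs)
```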

## Parse Smyth xml docs for citations

In [5]:
# Helper functions

def get_smyth_xml(ref):
#     time.sleep(.1)
    with urllib.request.urlopen(perseus_xml_base_url+ref) as f:
        smyth_xml = f.read()
    return smyth_xml

def get_smyth_xmls(refs):
    for ref in refs:
        yield get_smyth_xml(ref)

In [6]:
# Results are pickled to avoid repeated server calls; uncomment the block below to rebuild them

# citations = defaultdict(list)

# for i, xml in enumerate(get_smyth_xmls(refs)):
#     root = etree.fromstring(xml)
#     milestone = root.find(".//milestone")
#     smyth_id = milestone.attrib['id']
#     print(f'Processing file {i+1}... Smyth ch. {smyth_id}')  
#     bibls = []
#     bibls_ = root.findall('.//bibl')
#     for bibl in bibls_:
#         if 'n' in bibl.attrib.keys(): # Add check for elements like Smyth 904
#             bibls.append(bibl.attrib['n'])
# #     bibls = [bibl.attrib['n'] for bibl in root.findall('.//bibl')]
#     if bibls:
#         citations[smyth_id] = bibls
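# with open("data/smyth_citations.p", "wb") as f:
#     pickle.dump(citations, f)

# Load the cached citations (the pickle file must already exist at this path)
with open("data/smyth_citations.p", "rb") as f:
    citations = pickle.load(f)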
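
Since the parsing loop above is commented out, here is a minimal offline sketch of the same extraction logic on an inline sample. The XML fragment is hypothetical and much simpler than a real Perseus chunk; `extract_citations` is an illustrative name, and the standard library's `xml.etree.ElementTree` stands in for `lxml`. The third ```<bibl>``` has no `n` attribute, which is the case the check for "elements like Smyth 904" guards against.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment in the shape the parsing loop expects:
# one <milestone id="..."> plus several <bibl> elements.
sample_chunk = """
<div>
  <milestone id="s3048"/>
  <p>Example text <bibl n="Soph. El. 435">S. El. 435</bibl> and
  <bibl n="Aesch. PB 21">A. Pr. 21</bibl>; compare <bibl>no n attribute</bibl>.</p>
</div>
"""


def extract_citations(xml_text):
    """Return (smyth_id, list of `n` attributes from <bibl> elements)."""
    root = ET.fromstring(xml_text)
    smyth_id = root.find(".//milestone").attrib["id"]
    # Skip <bibl> elements that lack an `n` attribute.
    bibls = [b.attrib["n"] for b in root.findall(".//bibl") if "n" in b.attrib]
    return smyth_id, bibls


sample_citations = defaultdict(list)
smyth_id, bibls = extract_citations(sample_chunk)
if bibls:
    sample_citations[smyth_id] = bibls
print(dict(sample_citations))
```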

## Organize citations in dataframe

In [7]:
# Put citations in first normal form; make dataframe
# cf. https://stackoverflow.com/a/54368505/1816347

df = (
    pd.DataFrame.from_dict(citations, orient='index')
    .stack()
    .reset_index()
    .drop('level_1', axis=1)
    .rename(columns={'level_0': 'smyth-id', 0: 'citation'})
)

In [8]:
# Natsort df by Smyth ids

df['smyth-id'] = pd.Categorical(
    df['smyth-id'],
    ordered=True,
    categories=ns.natsorted(df['smyth-id'].unique()),
)
df = df.sort_values('smyth-id')
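
To show why a natural sort is needed here, the sketch below compares plain string sorting with a simple natural-sort key on a few ids from the output. This is a standard-library approximation, not how `natsort` is implemented; `natural_key` is a hypothetical helper and assumes every id starts with a letter followed by digits, as the Smyth ids do.

```python
import re


def natural_key(smyth_id):
    """Hypothetical helper: split an id into text runs and integer runs."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", smyth_id)]


ids = ["s3045", "s180", "s44a.D", "s325D", "s185"]

# Plain string sorting compares character by character, so "s3045" < "s44a.D".
plain_sorted = sorted(ids)

# The natural key compares the digit runs as numbers, so 44 < 180 < 3045.
natural_sorted = sorted(ids, key=natural_key)

print(plain_sorted)
print(natural_sorted)
```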

In [9]:
df

| | smyth-id | citation |
|---|---|---|
| 0 | s44a.D | Hom. Il. 21.567 |
| 1 | s180 | Hom. Il. 14.472 |
| 2 | s185 | Thuc. 4.47 |
| 3 | s188 | Xen. Anab. 1.2.2 |
| 4 | s325D | Hom. Od. 4.62 |
| ... | ... | ... |
| 5156 | s3045 | Plat. Apol. 20e |
| 5158 | s3046 | Aeschin. 3.202 |
| 5160 | s3048 | Soph. El. 435 |
| 5159 | s3048 | Aesch. PB 21 |


## Export df to csv

In [10]:
df.to_csv('data/smyth_citations.csv', index=False)