Parser for the Harvard Library Bibliographic Dataset
carlos@bueno.org
In late April 2012, Harvard University released 12 million records from their
library catalogs, including photos, journals, books, recordings, and manuscripts.
This data is in the public domain, but the format is wonky.
http://openmetadata.lib.harvard.edu/bibdata
This is a parser for the "MARC21" data contained in the dump. It builds on
Nathan Denny's MARC21 library and adds a lot on top, including friendly names
for fields and many heuristics to determine the content type of each item.
Knowing the content type opens up even more metadata encoded in the infamous
"Record 008" (the fixed-length MARC 008 control field).
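To give a rough idea of what content-type detection and 008 decoding involve, here is a minimal sketch. It is not taken from marc.py: the function names and the lookup table are hypothetical, and the real parser relies on many more heuristics. The offsets follow the published MARC 21 bibliographic format, where leader position 06 holds the type of record and 008 positions 00-17 and 35-39 are shared by all material types.

```python
# Illustrative sketch only; names here are hypothetical, not marc.py's API.

# Leader position 06 ("type of record") mapped to a rough content type.
# This table is a partial subset of the MARC 21 codes.
LEADER_TYPES = {
    "a": "language material",
    "c": "notated music",
    "e": "cartographic material",
    "g": "projected medium",
    "i": "nonmusical sound recording",
    "j": "musical sound recording",
    "k": "two-dimensional graphic",
    "m": "computer file",
    "t": "manuscript language material",
}


def content_type(leader):
    """Guess a content type from the 24-character MARC leader."""
    if len(leader) != 24:
        raise ValueError("MARC leader must be exactly 24 characters")
    return LEADER_TYPES.get(leader[6], "unknown")


def parse_008_common(field):
    """Decode the 008 positions that are shared by every material type.

    Positions 18-34 depend on the content type and are skipped here.
    """
    if len(field) != 40:
        raise ValueError("field 008 must be exactly 40 characters")
    return {
        "date_entered": field[0:6],   # yymmdd the record was created
        "date_type": field[6],        # e.g. "s" for a single known date
        "date1": field[7:11],
        "date2": field[11:15],
        "place": field[15:18],        # MARC country code
        "language": field[35:38],     # MARC language code
    }
```

Real records in the dump may carry short or malformed 008 fields, so a production parser would likely need to be more forgiving than the strict length checks used here.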
To use it, download the Harvard set and unzip it into the same directory as
marc.py. Then run:
python marc.py [sql|json] > your_file.txt
Also included are samples of the JSON and SQL output.
Released under the "MIT" or "BSD" license scheme. See LICENSE file.
TODO:
The SQL schema is terrible!
More metadata parsing
Better detection of music and audio
Translate more fields
alt_glyph support