Given a word (a "great idea"), search OCLC, cache a set of OCR files from the HathiTrust, and submit the result to the Distant Reader.
More specifically, given an SRU query written for the WorldCat Search API, this system of files will:
- search WorldCat
- cache the results as XML files
- extract the OCLC numbers from the XML files
- search the HathiTrust for matching records
- cache the results as JSON files
- extract the HathiTrust identifiers from the JSON files
- cache the OCR associated with the HathiTrust identifiers
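The two extraction steps above can be sketched in Python. This is only an illustration, not the code the scripts actually use. It assumes the cached WorldCat results are MARCXML records whose 001 control field holds the OCLC number, and that the cached HathiTrust results come from the HathiTrust Bibliographic API, whose JSON lists volumes under "items" with an "htid" key. The function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: the cached search results are MARCXML (MARC21 slim namespace).
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Assumption: HathiTrust is queried through its Bibliographic API.
HATHI_API = "https://catalog.hathitrust.org/api/volumes/brief/oclc/{}.json"


def extract_oclc_numbers(xml_text):
    """Return the unique OCLC numbers found in one cached XML file."""
    root = ET.fromstring(xml_text)
    numbers = []
    for field in root.iter(MARC_NS + "controlfield"):
        if field.get("tag") != "001" or not field.text:
            continue
        # Keep only the digits, dropping prefixes such as "ocm" or "ocn".
        number = "".join(ch for ch in field.text if ch.isdigit())
        if number and number not in numbers:
            numbers.append(number)
    return numbers


def hathitrust_url(oclc_number):
    """Build the URL used to look up one OCLC number in the HathiTrust."""
    return HATHI_API.format(oclc_number)


def extract_htids(json_text):
    """Return the HathiTrust identifiers found in one cached JSON file."""
    data = json.loads(json_text)
    return [item["htid"] for item in data.get("items", []) if "htid" in item]
```

Each function takes the text of a single cached file, so the same cache can be reprocessed without searching WorldCat or the HathiTrust again.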
The previous steps are sufficient for many purposes, but the following steps go further and analyze/index the OCR using the Distant Reader:
- create a simple metadata (CSV) file
- compress the cache along with the metadata file into a zip file
- submit the result to the Distant Reader
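The metadata and compression steps above might look something like the following sketch. The column names (author, title, date, file), the file name metadata.csv, and both function names are assumptions made for illustration; check what the Distant Reader expects before relying on them.

```python
import csv
import zipfile
from pathlib import Path

# Assumption: the Reader wants one row per file with these four columns.
FIELDS = ["author", "title", "date", "file"]


def write_metadata(cache_dir, records, name="metadata.csv"):
    """Write one CSV row per cached OCR file and return the CSV's path."""
    path = Path(cache_dir) / name
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return path


def zip_cache(cache_dir, zip_path):
    """Compress every file in the cache (metadata included) into a zip file."""
    cache_dir = Path(cache_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(cache_dir.iterdir()):
            # Skip subdirectories, and the archive itself if it lives in the cache.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, arcname=path.name)
    return zip_path
```

Files are stored flat, without the cache directory's own name, so the metadata file sits beside the plain text files it describes.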
The really big goal is to create substantial collections of plain text files, where each collection centers on one of the 102 Great Ideas. These collections are wonderful fodder for demonstrating the functionality of the Reader.
Use the following sequence to make this system work for you:
./bin/pre-search.sh
- see what is available
./bin/build.sh
- initialize a collection
./bin/search.sh
- search OCLC and cache batches of search results
./bin/make-metadata.sh
- loop through search results, join them with HathiTrust, and output rudimentary metadata
./bin/make-cache.sh
- loop through the metadata and cache OCR; very not fast
./bin/make-ready.sh
- create a metadata file apropos to the Reader, zip the cache, and get ready for processing
./bin/clean.sh
- optionally remove the temporary and staging files from the collection
./bin/read.sh
- submit the result of all the above to the Distant Reader; "Cook until done."
If you want to do everything in one go, then:
./bin/make-all.sh
- one script to rule them all
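To make the order of operations concrete, here is a small Python sketch that runs the individual scripts in the sequence listed above and stops at the first failure. It is not a description of what make-all.sh does internally. The arguments passed to each script are hypothetical, since this document does not say what the scripts expect, and the function names are made up for illustration.

```python
import subprocess

# The scripts, in the order this README lists them.
SCRIPTS = [
    "pre-search.sh",
    "build.sh",
    "search.sh",
    "make-metadata.sh",
    "make-cache.sh",
    "make-ready.sh",
    "clean.sh",
    "read.sh",
]

# clean.sh is described as optional, so it is skipped unless requested.
OPTIONAL = {"clean.sh"}


def build_commands(args, include_optional=False, bin_dir="./bin"):
    """Return one command (a list of strings) per script, in order."""
    commands = []
    for script in SCRIPTS:
        if script in OPTIONAL and not include_optional:
            continue
        # Assumption: every script accepts the same arguments.
        commands.append([f"{bin_dir}/{script}", *args])
    return commands


def run_all(args, include_optional=False, dry_run=True):
    """Print the commands (dry run) or execute them, halting on an error."""
    commands = build_commands(args, include_optional)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a script fails.
            subprocess.run(command, check=True)
    return commands
```

Calling run_all(["love"]) only prints the commands; pass dry_run=False to execute them from the root of the collection.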
Eric Lease Morgan <emorgan@nd.edu>
March 13, 2020 -- "While coronavirus is still happening"