Worldcat2Texts

Given a word (a "great idea"), search OCLC, cache a set of OCR files from the HathiTrust, and submit the result to the Distant Reader.

More specifically, given an SRU query written for the WorldCat Search API, this set of scripts will:

search WorldCat
cache the results as XML files
extract the OCLC numbers from the XML files
search the HathiTrust for matching records
cache the results as JSON files
extract the HathiTrust identifiers from the JSON files
cache the OCR associated with the HathiTrust identifiers
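
The extraction steps above can be sketched in a few lines of shell. This is a minimal illustration, not the code in ./bin: it assumes the cached WorldCat results are MARCXML with the OCLC number in controlfield 001, and that HathiTrust is queried through its Bibliographic API, whose JSON lists each volume under an "htid" key. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: the field names and URL pattern are assumptions, and the
# function names are hypothetical; none are taken from the scripts in ./bin.

# Print the unique OCLC numbers found in cached MARCXML files (controlfield 001).
extract_oclc_numbers() {
  grep -ho '<controlfield tag="001">[^<]*</controlfield>' "$@" \
    | sed -e 's/<[^>]*>//g' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//' \
    | sort -u
}

# Build a HathiTrust Bibliographic API URL for one OCLC number.
hathitrust_url() {
  printf 'https://catalog.hathitrust.org/api/volumes/brief/oclc/%s.json\n' "$1"
}

# Print the unique HathiTrust identifiers ("htid" values) found in cached JSON files.
extract_htids() {
  grep -ho '"htid" *: *"[^"]*"' "$@" \
    | sed -e 's/.*"\([^"]*\)"$/\1/' \
    | sort -u
}
```

Pattern matching with grep and sed keeps the sketch dependency-free, but it is fragile; a real XML or JSON parser would be the sturdier choice.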

The previous steps are sufficient for many purposes, but the following steps go further and analyze/index the OCR using the Distant Reader:

create a simple metadata (CSV) file
compress the cache along with the metadata file into a zip file
submit the result to the Distant Reader
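
As a rough sketch of the packaging steps above, the following shell functions write a bare-bones CSV and zip a cache directory. The single "file" column, the function names, and the directory layout are assumptions made for illustration; the metadata the Reader actually expects may include more columns, and the real work is done by ./bin/make-ready.sh.

```shell
#!/bin/sh
# Sketch only: the column name and layout are assumptions, and the function
# names are hypothetical; see ./bin/make-ready.sh for the real process.

# Write a minimal CSV listing every .txt file in a cache directory.
# Usage: make_metadata CACHE_DIR CSV_FILE
make_metadata() {
  mm_cache=$1
  mm_csv=$2
  printf 'file\n' > "$mm_csv"
  for mm_path in "$mm_cache"/*.txt; do
    [ -e "$mm_path" ] || continue   # the glob matched nothing
    printf '%s\n' "$(basename "$mm_path")" >> "$mm_csv"
  done
}

# Zip the contents of a cache directory (requires the zip command).
# Usage: make_zip CACHE_DIR ZIP_FILE, where ZIP_FILE is an absolute path
# outside CACHE_DIR.
make_zip() {
  ( cd "$1" && zip -q -r "$2" . )
}
```

File names containing commas or quotes would need CSV quoting, which this sketch does not attempt.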

The larger goal is to create substantial collections of plain text files, each centered on one of the 102 Great Ideas. These collections are wonderful fodder for demonstrating the functionality of the Distant Reader.

Cookbook

Use the following sequence to make this system work for you:

./bin/pre-search.sh - see what is available
./bin/build.sh - initialize a collection
./bin/search.sh - search OCLC and cache batches of search results
./bin/make-metadata.sh - loop through search results, join them with HathiTrust, and output rudimentary metadata
./bin/make-cache.sh - loop through the metadata and cache OCR; very not fast
./bin/make-ready.sh - create a metadata file suited to the Reader, zip the cache, and get ready for processing
./bin/clean.sh - optionally remove the temporary and staging files from the collection
./bin/read.sh - submit the result of all the above to the Distant Reader; "Cook until done."

If you want to do everything in one go, then:

./bin/make-all.sh - one script to rule them all
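
For readers who want to see what running the steps in one go might look like, here is a hedged sketch. It is not the contents of make-all.sh: the function name run_all is hypothetical, the choice to skip pre-search.sh and clean.sh is an assumption, and so is the idea that every script accepts the same arguments.

```shell
#!/bin/sh
# Sketch only: which steps make-all.sh really runs, and what arguments the
# scripts take, are assumptions. run_all is a hypothetical name.

# Run the cookbook scripts in order from a bin directory, passing any
# remaining arguments to each one, and stop at the first failure.
# Usage: run_all BIN_DIR [ARGS...]
run_all() {
  ra_bin=$1
  shift
  for ra_step in build search make-metadata make-cache make-ready read; do
    echo "Running $ra_step.sh" >&2
    "$ra_bin/$ra_step.sh" "$@" || {
      echo "$ra_step.sh failed; stopping" >&2
      return 1
    }
  done
}
```

Stopping at the first failure matters here because each step reads what the previous one cached.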

Eric Lease Morgan <emorgan@nd.edu>
March 13, 2020 -- "While coronavirus is still happening"
