ArchivesSpace, ArcLight, and Hyrax Workflow
This repo contains documentation and scripts for how the M.E. Grenander Department of Special Collections & Archives connects ArchivesSpace, ArcLight, and Hyrax and keeps them in sync. It contains:
- Documentation for uploading digital objects in Hyrax using existing description
- Overnight exporting and indexing scripts that update data between each service
Updated documentation for this repo is on our documentation site:
Uploading Digital Objects to Hyrax with Existing Description
Uploading Digital Objects to Hyrax
- Go to Hyrax and log in, or create an account and request upload access.
- Let Greg know when you create an account, and return when you have upload permissions.
- Once you have upload permissions, go to ArcLight and find the record that represents the digital object you want to upload. From the URI, copy the long string of letters and numbers right after “aspace_”. This is the unique ArchivesSpace ID for that record.
- Notice that the collection ID is in the URI as well.
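The ID-copying step above can be sketched in Python. The URL layout here (collection ID followed immediately by `aspace_` and a hex ref_id) and the example hostname are assumptions based on the description above, not the actual ArcLight routing:

```python
import re

# Hypothetical ArcLight record URL; the real host and path layout may differ.
EXAMPLE_URI = (
    "https://archives.example.edu/catalog/"
    "apap101aspace_3b1c5f2e9d8a7b6c5d4e3f2a1b0c9d8e"
)

def parse_arclight_uri(uri):
    """Return the collection ID and the ArchivesSpace ref_id found
    right after "aspace_" in an ArcLight record URI."""
    match = re.search(r"/catalog/([a-z]+\d+)aspace_([0-9a-f]+)", uri)
    if match is None:
        raise ValueError("URI does not look like an ArcLight record URL")
    return {"collection": match.group(1), "ref_id": match.group(2)}

print(parse_arclight_uri(EXAMPLE_URI))
# {'collection': 'apap101', 'ref_id': '3b1c5f2e9d8a7b6c5d4e3f2a1b0c9d8e'}
```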
- In your Dashboard, select “Works” on the left side menu
- Select the “Add new work” button on the right side
- For most cases, select “Digital Archival Objects” and then the “Create Work” button.
- In the “Descriptions” tab, enter only the ArchivesSpace ID and the collection number
- Add additional metadata. Resource Type and Rights Statement are required, while the “Additional fields” are not
- In the “Files” tab, browse and upload any files represented by the ArcLight record. These can be PDFs, Office documents (doc, docx, ppt, xlsx, etc.), or any image file.
- Select the Visibility of the work on the right side, and Save the work.
Overnight Export and Indexing Scripts
What Each Script Does
- Each night, `exportPublicData.py` uses ArchivesSnake to query ArchivesSpace for resources updated since the last run.
- For collections with the complete set of DACS-minimum elements, it exports EAD 2002 files; for collections with only abstracts and extents, it saves the data to pipe-delimited CSVs.
- It also builds a CSV of local subjects and collection IDs.
- All of this data is pushed to GitHub.
- When it finishes, it runs `staticPages.py`, which builds static browse pages for all collections, including a complete A-Z list, alphabetical lists for each collecting area, and pages for each local subject.
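The EAD-versus-CSV routing described above can be sketched as a simple completeness check. The element names and the record-as-dict shape here are illustrative assumptions, not `exportPublicData.py`'s actual data model:

```python
# Hypothetical field names standing in for the DACS single-level minimum
# elements; the script's real checks may differ.
DACS_MINIMUM = {
    "reference_code", "title", "date", "extent",
    "creator", "scope_and_content", "access_conditions", "language",
}

def export_format(collection):
    """Pick EAD 2002 for fully described collections, and a
    pipe-delimited CSV for minimally described ones."""
    present = {key for key, value in collection.items() if value}
    return "ead2002" if DACS_MINIMUM <= present else "csv"

full = {key: "..." for key in DACS_MINIMUM}
minimal = {"reference_code": "apap999", "title": "...",
           "extent": "2 boxes", "abstract": "..."}
print(export_format(full))     # ead2002
print(export_format(minimal))  # csv
```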
Indexing Shell Scripts
- Later, collection data is updated with `indexNewEAD.sh`, which indexes EAD files updated in the past day (found with `find -mtime -1`) into the ArcLight Solr instance.
- There are also additional indexing shell scripts for ad hoc updates.
- `indexAllEAD.sh` reindexes all EAD files
- `indexOneEAD.sh` indexes a single EAD file by collection ID
- `indexOneNDPA.sh` indexes one NDPA EAD file; this is necessary because NDPA files share the same collection ID prefixes
- `indexNewNoLog.sh` indexes one EAD file, but logs to stdout instead of a log file
- `indexOneURL.sh` indexes from a URL instead of from disk (not actively used)
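The `find -mtime -1` selection that `indexNewEAD.sh` relies on looks roughly like this in Python. This is only a sketch of the file-selection step, not of the Solr indexing itself, and the directory layout is an assumption:

```python
import time
from pathlib import Path

def eads_modified_since(ead_dir, hours=24):
    """Return EAD (.xml) files under ead_dir modified in the last
    `hours`, mirroring what `find <dir> -mtime -1` selects."""
    cutoff = time.time() - hours * 3600
    return sorted(
        path for path in Path(ead_dir).rglob("*.xml")
        if path.stat().st_mtime >= cutoff
    )
```

Each selected file would then be handed to the ArcLight indexer, which the shell scripts handle.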
- `processNewUploads.py` queries the Hyrax Solr index for new uploads that are connected to ArchivesSpace ref_ids but do not yet have accession numbers.
- It downloads the new binaries and metadata and creates basic Archival Information Packages (AIPs) using bagit-python
- It then uses ArchivesSnake to add a new digital object record in ArchivesSpace that links to the object in Hyrax
- Last, it adds a new accession ID in Hyrax
- (Also check out Noah Huffman's talk, which probably does this better.)
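The script creates its AIPs with bagit-python; for illustration, here is a stdlib-only sketch of the basic BagIt layout that produces (a `data/` payload directory, a `bagit.txt` declaration, and an MD5 manifest). This is a simplification of what `bagit.make_bag()` actually writes, and the function name is mine:

```python
import hashlib
from pathlib import Path

def make_minimal_bag(bag_dir, files):
    """Lay out a bare-bones BagIt bag from a dict of filename -> bytes.
    Simplified: real bags from bagit-python also carry bag-info.txt
    and tag manifests."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in files.items():
        (data / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag
```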
- A simple library that converts POSIX timestamps and ISO 8601 dates to DACS-compliant display dates.
- `exportPublicData.py` uses this to make dates for the static browse pages.
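The conversion it performs might look like the sketch below. The display form shown (e.g. "2019 March 4") follows the DACS year-month-day order the library aims for, but the function names and exact formatting choices here are my assumptions:

```python
from datetime import datetime, timezone

def iso_to_dacs(iso_date):
    """Convert an ISO 8601 date (YYYY, YYYY-MM, or YYYY-MM-DD)
    to a DACS-style display date such as "2019 March 4"."""
    parts = iso_date.split("-")
    if len(parts) == 1:
        return parts[0]
    day = int(parts[2]) if len(parts) > 2 else 1
    dt = datetime(int(parts[0]), int(parts[1]), day)
    if len(parts) == 2:
        return f"{dt.year} {dt.strftime('%B')}"
    return f"{dt.year} {dt.strftime('%B')} {dt.day}"

def posix_to_dacs(timestamp):
    """Convert a POSIX timestamp to the same display form (UTC)."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return f"{dt.year} {dt.strftime('%B')} {dt.day}"

print(iso_to_dacs("2019-03-04"))  # 2019 March 4
print(iso_to_dacs("1945-05"))     # 1945 May
```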
- Queries the Bing background image API each night to display new background images for ArchivesSpace and Find-It, just for fun.
```
# get new image from Bing
0 2 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/image_a_day.py 1>> /media/SPE/indexing-logs/image_a_day.log 2>&1 && pyenv deactivate

# export data from ASpace
0 0 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/exportPublicData.py 1>> /media/SPE/indexing-logs/export.log 2>&1 && pyenv deactivate

# pull new EADs from GitHub
30 0 * * * echo "$(date) $line git pull" >> /media/SPE/indexing-logs/git.log && git --git-dir=/opt/lib/collections/.git --work-tree=/opt/lib/collections pull 1>> /media/SPE/indexing-logs/git.log 2>&1

# Index modified apap collections
5 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "apap"

# Index modified ua collections
15 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ua"

# Index modified ndpa collections
25 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ndpa"

# Index modified ger collections
35 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ger"

# Index modified mss collections
45 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "mss"

# Download new Hyrax uploads and create new ASpace digital objects
0 2 * * * source /home/user/.bashrc; pyenv activate processNewUploads && python /opt/lib/ArchivesSpace-ArcLight-Workflow/processNewUploads.py 1>> /media/SPE/indexing-logs/processNewUploads.log 2>&1 && pyenv deactivate
```