Collecting and Parsing Flickr Metadata
A one-off procedure used to collect metadata from the CHS Flickr Commons account and reformat it as MODS XML for Islandora ingest.
Make sure you have Python installed
Install Python 2.7 for Windows by following the directions here: http://www.howtogeek.com/197947/how-to-install-python-on-windows/
Download CHS Flickr Commons Metadata
The digital images in CHS's Flickr Commons account have descriptive metadata of generally good quality. We were able to scrape and parse this data using the following procedure:
Download the Flickr metadata using the Python script found here, and follow these instructions to obtain a Flickr API key, API secret, and user ID: http://huntertrek.com/wp/2009/07/27/flickr-metadata-downloader-in-python/
See also: http://drewtarvin.com/business/export-flickr-metadata-csv-file/
See this solution to an authentication error that may occur: https://github.com/jmahmood/Flickr-Cli/issues/7
Run the script:
- For simplicity's sake, put the Python script in the same folder where you want the resulting database file and CSV to end up.
- In the Windows command line, navigate to that directory and run:
python.exe flickr_download_photo_metadata.py
- The script will run, creating a DB file in that folder.
- Once that's complete, run the script again, appending the export flag:
python.exe flickr_download_photo_metadata.py --export
- This creates a CSV file
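For orientation, the export step amounts to dumping the rows of the script's SQLite database into a CSV file. Here is a rough sketch of that step in Python; the table name `photos` and `export_db_to_csv` are assumptions for illustration, not the actual internals of the download script (check the schema of the DB file it creates):

```python
import csv
import sqlite3

def export_db_to_csv(db_path, csv_path, table="photos"):
    # Read every row from the (assumed) photos table and write it
    # out as CSV, with the column names as the header row.
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT * FROM %s" % table)
    headers = [col[0] for col in cur.description]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)          # header row
        writer.writerows(cur.fetchall())  # one row per photo
    conn.close()
```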
Parse the Flickr Data in OpenRefine
Open the CSV in OpenRefine, creating a new project. Make sure to select the correct character encoding! This will save you lots of headaches.
Almost all of the relevant metadata is in a single column called Description. Transform the Description column by replacing all line breaks with a character not used in the data itself -- pipes are a good choice -- using the GREL expression value.replace("\n", "|")
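The equivalent of that transform, sketched in plain Python (the pipe separator is an assumption carried over from the step above; verify it never appears in your data):

```python
def flatten_description(value, sep="|"):
    # Normalize Windows line endings first, then replace every
    # remaining line break so each record sits on a single line.
    return value.replace("\r\n", "\n").replace("\n", sep)
```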
Begin to parse the Description column by choosing "Add column based on this column..."
The GREL implementation of regex was not giving us the results we expected, so we instead used this basic Jython function, editing the regex as needed:
import re
m = re.match(r".*publisher:(.*?);", value, re.I)
return m.group(1) if m else None
Repeat this step, with edits where necessary, for all of the metadata fields contained in the Description column:
- Call Number
- Digital Object ID
- Preferred Citation
- Online Finding Aid
- General note(s)
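Taken together, these repeated steps pull each labeled field out of the flattened Description string. The same logic can be sketched as a standalone Python function; the field labels below mirror the list above, and the assumption (as in the Jython snippet) is that each value runs from "label:" to the next semicolon:

```python
import re

# Field labels as they appear in the Description column.
FIELDS = ["publisher", "call number", "digital object id",
          "preferred citation", "online finding aid", "general note"]

def parse_description(value):
    """Return a dict mapping each field label to its value (or None)."""
    out = {}
    for field in FIELDS:
        m = re.match(r".*%s:(.*?);" % re.escape(field), value, re.I)
        out[field] = m.group(1).strip() if m else None
    return out
```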
Flesh out the rest of the spreadsheet by adding columns for all remaining MODS tags and attributes. Here's a CSV file with our column headings: https://github.com/calhist/mods_xml/blob/master/mods_column_headers.csv
Make sure the metadata is complete! There may be good metadata in PastPerfect, the CHS OPAC, finding aids in OAC, etc.
You may opt to do some item-level cataloging at this point.
To export your data as MODS XML, follow the steps outlined here.
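Conceptually, that export maps each spreadsheet row (keyed by the MODS column headers) to one MODS record. A minimal sketch, assuming a small illustrative mapping of three columns; the real mapping follows the full set of column headers in the CSV linked above, and the element choices here (titleInfo, originInfo, a local identifier) are examples only:

```python
import xml.etree.ElementTree as ET

# The MODS namespace as defined by the Library of Congress.
MODS_NS = "http://www.loc.gov/mods/v3"

def row_to_mods(row):
    """Turn one spreadsheet row (a dict) into a MODS XML string."""
    ET.register_namespace("mods", MODS_NS)
    mods = ET.Element("{%s}mods" % MODS_NS)
    title_info = ET.SubElement(mods, "{%s}titleInfo" % MODS_NS)
    ET.SubElement(title_info, "{%s}title" % MODS_NS).text = row.get("title")
    origin = ET.SubElement(mods, "{%s}originInfo" % MODS_NS)
    ET.SubElement(origin, "{%s}publisher" % MODS_NS).text = row.get("publisher")
    ident = ET.SubElement(mods, "{%s}identifier" % MODS_NS, type="local")
    ident.text = row.get("call number")
    return ET.tostring(mods, encoding="unicode")
```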
To kick off the digital preservation workflow and create batches for Islandora ingest, follow the steps here.