Collecting and Parsing Flickr Metadata

California Historical Society edited this page May 5, 2017 · 4 revisions

This page documents a one-off procedure used to collect metadata from the CHS Flickr Commons account and reformat it as MODS XML for Islandora ingest.

Make sure you have Python installed

Install Python 2.7 for Windows by following the directions here: http://www.howtogeek.com/197947/how-to-install-python-on-windows/

Download CHS Flickr Commons Metadata

The digital images in CHS's Flickr Commons account have descriptive metadata of generally good quality. We were able to scrape and parse this data using the following procedure:

Download the Flickr metadata using the Python script found here, and follow these instructions on getting a Flickr API key, API secret, and userID: http://huntertrek.com/wp/2009/07/27/flickr-metadata-downloader-in-python/

See also: http://drewtarvin.com/business/export-flickr-metadata-csv-file/
See this solution to an authentication error that may occur: https://github.com/jmahmood/Flickr-Cli/issues/7

Run the script:

  • For simplicity's sake, put the Python script in the same folder where you want the resulting database file and CSV to be created.
  • In the Windows command line, navigate to that directory.
  • Run python.exe flickr_download_photo_metadata.py
  • The script will run, creating a DB file in the folder.
  • Once that's complete, run the script again with the export flag appended: python.exe flickr_download_photo_metadata.py --export
  • This creates a CSV file.
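The two runs above can also be chained from a small wrapper. This is only a sketch, not part of the original procedure: build_commands and run_all are hypothetical helper names, and it assumes the downloader script sits in the current directory, that python.exe is on the PATH, and that --export behaves as described above.

```python
import subprocess

SCRIPT = "flickr_download_photo_metadata.py"  # script name taken from the steps above


def build_commands(script=SCRIPT, python="python.exe"):
    """Return the two command lines: first the download, then the CSV export."""
    return [
        [python, script],              # pass 1: creates the DB file
        [python, script, "--export"],  # pass 2: writes the CSV from the DB
    ]


def run_all(commands):
    """Run each command in order, stopping at the first one that fails."""
    for cmd in commands:
        subprocess.check_call(cmd)
```

Calling run_all(build_commands()) from the script's folder would perform both passes; check_call raises an error if the first pass fails, so the export is never attempted against a missing DB file.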

Parse the Flickr Data in OpenRefine

Open the CSV in OpenRefine, creating a new project. Make sure to select the correct character encoding when importing! This will save you lots of headaches.
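If you are unsure which encoding the exported CSV uses, a quick check before importing may help. The sketch below is an assumption-laden aid, not part of the original workflow: first_working_encoding is a hypothetical helper, and the candidate list (UTF-8, then Windows-1252) is only a guess at what a Windows export is likely to produce.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `raw` without error.

    A successful decode does not prove the encoding is right, only that it
    is possible, so still spot-check accented characters in OpenRefine.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None  # none of the candidates decoded cleanly
```

To use it, read the CSV in binary mode (for example open(path, "rb").read(), where path is wherever your export landed) and pass the bytes in. UTF-8 is tried first because it rejects most text that was saved in another encoding.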

Almost all of the relevant metadata is in a single column called Description. Transform the Description column by replacing all line breaks with a character that is not used in the data itself (a pipe is a good choice), using this GREL expression:

value.replace("\n", "|")
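Before settling on a replacement character, it can be worth confirming that it really is absent from the data. The following sketch does that check outside OpenRefine; unused_delimiters and flatten_line_breaks are hypothetical names, and the candidate characters are just examples. Unlike the GREL expression above, flatten_line_breaks also handles Windows-style \r\n line breaks, which is an addition of this sketch.

```python
def unused_delimiters(values, candidates="|~^"):
    """Return the candidate characters that appear in none of the values."""
    return [c for c in candidates if not any(c in v for v in values)]


def flatten_line_breaks(value, delimiter="|"):
    """Replace Windows (\\r\\n) and Unix (\\n) line breaks with the delimiter."""
    return value.replace("\r\n", delimiter).replace("\n", delimiter)
```

Passing every Description value to unused_delimiters returns the characters that are safe to use; if the pipe is missing from the result, pick one of the others and adjust the GREL expression to match.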

Begin to parse the Description column by choosing "Add column based on this column..."

The GREL implementation of regex was not giving us the results we expected, so we instead used this basic Jython function, editing the regex as needed:

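import re
# Capture the text between "publisher:" and the next semicolon, ignoring case.
m = re.match(r".*publisher:(.*?);", value, re.I)
# Return None (a blank cell) when the field is absent, instead of raising an error.
return m.group(1) if m else None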

Repeat this step, with edits where necessary, for all of the metadata fields contained in the Description column:

  • Repository
  • Collection
  • Date
  • Call Number
  • Digital Object ID
  • Preferred Citation
  • Photographer
  • Publisher
  • Format
  • Online Finding Aid
  • General note(s)
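To try the regex against sample text before running it in OpenRefine, the same pattern can be looped over all of the field labels listed above. This is a sketch under a stated assumption: that each field appears in the description as "Label: value;", which is inferred from the publisher regex and may not hold for every record. extract_field and extract_all are hypothetical names, and any sample description you feed in is your own test data.

```python
import re

# Field labels copied from the list above.
FIELDS = [
    "Repository", "Collection", "Date", "Call Number", "Digital Object ID",
    "Preferred Citation", "Photographer", "Publisher", "Format",
    "Online Finding Aid", "General note(s)",
]


def extract_field(description, label):
    """Return the text between 'label:' and the next semicolon, or None."""
    # re.escape keeps characters such as the parentheses in "note(s)" literal.
    pattern = r".*" + re.escape(label) + r":(.*?);"
    m = re.match(pattern, description, re.I)
    return m.group(1).strip() if m else None


def extract_all(description):
    """Map every field label to its extracted value (None when absent)."""
    return dict((label, extract_field(description, label)) for label in FIELDS)
```

Fields that come back as None for most records are a hint that the pattern needs editing for that label, for instance because the value ends at a pipe rather than a semicolon.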

Flesh out the rest of the spreadsheet by adding columns for all remaining MODS tags and attributes. Here's a CSV file with our column headings: https://github.com/calhist/mods_xml/blob/master/mods_column_headers.csv
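Adding the remaining columns can also be done outside OpenRefine. The sketch below appends any missing headers as blank columns; it is written for Python 3, add_missing_columns is a hypothetical helper, and the header names used to try it out are placeholders rather than the actual headings in the CSV file linked above.

```python
import csv
import io


def add_missing_columns(csv_text, required_headers):
    """Return CSV text with any missing required columns appended, left blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    existing = list(reader.fieldnames or [])
    # Keep the existing column order, then add whatever is still missing.
    headers = existing + [h for h in required_headers if h not in existing]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=headers, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # restval fills the new columns with ""
    return out.getvalue()
```

Existing columns and their values are left untouched, so the function can be rerun safely as the list of required headers grows.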

Make sure the metadata is complete! There may be good metadata in PastPerfect, the CHS OPAC, finding aids in OAC, etc.

You may opt to do some item-level cataloging at this point.

Next Steps

To export your data as MODS XML, follow the steps outlined here.

To kick off the digital preservation workflow and create batches for Islandora ingest, follow the steps here.
