# Extracting and Transforming Metadata

This notebook involves many steps, but each of the main steps is signposted with headings that align with the initial outline below. As previously, we will walk through the steps in class, but at this point you have most of the tools to read through and reconstruct a process like this on your own, so this work will be more self-guided. The general process here follows the generalized "Extract - Transform - Load" process, which is frequently the abstract model for pulling data from one system, transporting, cleaning, and outputting to another system, which is the goal here: extract the metadata from the Library of Congress, change it into a structure that makes sense to Omeka, then ingest that data and associated content.

## Learning objectives

After completing the assignment associated with this notebook, you should: 

* Have a conceptual and a practical understanding of how collection metadata is made available by a REST API.
* Be able to explain the concept of metadata extraction and transformation.
* Create a structure for documenting metadata practices in a collection or repository (a Metadata Application Profile) and implement that structure for transformations. 
* Use programming to work with data supplied by an API in JSON format, to manage and transform useful parts of that data into CSV format.
* Create ingest-ready collection metadata that conforms to Dublin Core and other digital collection metadata standards, which can be used to load content into another site (in this case, an Omeka S site). 

## Introduction

The main steps outlined in this notebook are as follows:

* **Extract the metadata.** This may be done in whatever way works for you. As illustrated here, there are two main steps that involve requesting JSON data from the Library of Congress: 
  1. Get collections list - using the requests library, make a request to the library of congress API to get the list of items in the "Free to Use" libraries collection. Write this to a local file (here called `collection_items_list.csv` and in the `data` directory). 
  1. Get item metadata - using the list from the previous step, use that a source to query each item in the collection to get details about it. Save the JSON responses locally so we can extract information from them in the next steps. (In this example, you will have around 60 files, but a maximum of 62 as of September 2022. This number may vary when you run this code yourself since the website may have different response rates.)
* **Transform the metadata.** As illustrated here, there are three substeps: develop the conceptual model for your transformation (expressed in a Metadata Application Profile and an implementation of the MAP in a crosswalk), test the implementation on a small subset, then run your transformation on the entire set.
  1. Draft a metadata crosswalk - this is an exploratory activity and you will need to take some time examining one or two sample responses from the previous step to identify the attributes that you want to extract (the goal is to identify the information that you want to import to your Omeka site collection, essentially we are going to recreate the collection), to see how to extract these from the JSON, and to write a test transformation in the next step. This is largely conceptual and, although it is sketched out in this notebook will not use python like the other steps here. That said, the next step does require this step. 
  1. Develop your transformation script with a small subset of the metadata. In this case, one record.
  1. Transform the data you've gathered in JSON into a CSV file according to the metadata crosswalk you've developed. The goal in this step is to create a CSV that we can use to import items into your Omeka site (using the CSV Import module). Note that the code outlined here suggests how all of these data elements may be extracted and transformed, but it does not necessarily output all of the elements that you will need to complete your assignment. In other words, there is still work to do to complete this code, but you are welcome to adopt or reuse the code here.  
* **Load the metadata** into your target system, in this case Omeka which we are using as a display platform. This step is not described in this notebook, because it requires the use of the CSV developed here to be ingested to your Omeka site. Without the above steps, however, you wouldn't be able to directly display these items. 

# Get collection list

In [22]:
import csv
import json
import requests

# for later, when working with local files
import glob
import os
from os.path import join

In [23]:
endpoint = 'https://www.loc.gov/free-to-use'
parameters = {
    'fo' : 'json'
}

In [24]:
collection = 'lighthouses'

In [25]:
collection_list_response = requests.get(endpoint + '/' + collection, params=parameters)

In [26]:
collection_list_response.url

'https://www.loc.gov/free-to-use/lighthouses?fo=json'

In [27]:
collection_json = collection_list_response.json()

Take a moment to look around in the JSON response. Where would you look for the data about the items in the collection of free to use library images? 

_Hint: At this point we're not really looking for the information about the images, but the pointers to them (such as headings, links, etc)._ 

In [28]:
# .keys() is a helpful function to see what the data elements are
collection_json.keys()

dict_keys(['breadcrumbs', 'content', 'content_is_post', 'description', 'expert_resources', 'next', 'next_sibling', 'options', 'pages', 'portal', 'previous', 'previous_sibling', 'site_type', 'timestamp', 'title', 'type'])

Looking further into the dictionary, it seems that you can get a list of the items in the set by looking into `content`, then `set`, then the `items` element:

In [29]:
for k in collection_json['content']['set']['items']:
    print(k)

{'alt': 'Color photo shows a white tower and black lantern attached to a low building with a red roof and surrounded by green grass.', 'image': '/static/portals/free-to-use/public-domain/lighthouses/lighthouses-1.jpg', 'link': '/resource/highsm.12127/', 'title': 'Heceta Head Lighthouse, Pacific Ocean, Oregon'}
{'alt': 'Color photo shows a  very tall tower with spiral bands in black and white. The base is red and surrounded by green grass.', 'image': '/static/portals/free-to-use/public-domain/lighthouses/lighthouses-2.jpg', 'link': '/resource/highsm.44760/', 'title': 'Cape Hatteras Lighthouse, Outer Banks, North Carolina'}
{'alt': 'Black-and-white photo shows details of the lighting mechanism inside the glass lens.', 'image': '/static/portals/free-to-use/public-domain/lighthouses/lighthouses-3.jpg', 'link': '/resource/hhh.nc0497.photos/?sp=24', 'title': 'Bodie Island Lighthouse, Outer Banks, North Carolina. Lamp inside the lens'}
{'alt': 'Color photo shows a red-painted square tower at 

How many items are there in the set?

In [37]:
len(collection_json['content']['set']['items'])

50

 Now that you can find the list of items in the collection, note that each of these "items" has 3 elements: `image`, `link`, and `title`. 

In [38]:
collection_json['content']['set']['items'][0].keys()

dict_keys(['alt', 'image', 'link', 'title'])

In a more fully automated environment, you might want to make a function that can return and save the collection list, then reuse it in other code, but for this task, it is useful to save the information. So, extract these and save them locally to a CSV. 

In [39]:
collection_set_list = os.path.join('data','lh_collection_set_list.csv')
headers = ['alt','image','link','title']

with open(collection_set_list, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    for item in collection_json['content']['set']['items']:
        
        # clean up errant spaces in the title fields
        item['title'] = item['title'].rstrip()
        writer.writerow(item)
    print('wrote',collection_set_list)

wrote data/lh_collection_set_list.csv


# Get metadata for individual items 

Now that you have the list of what is in the set, this can serve as your baseline collection information. Next, you want to get more complete information about each item. Details about these items are available on individual item pages, so now we have to look at a different location, as specified in the `'link'` fields of the item list.

In [40]:
# update endpoint info
endpoint = 'https://www.loc.gov'
parameters = {
    'fo' : 'json'
}

The task now is to request metadata for each item. So that the data is reusable, save it locally as a JSON file. In the next blocks, you will create individual files for each item, which will save to a directory named `ftu_libs_metadata` in the `data` directory. 

If you don't have that directory, you will first need to create it. 

In [41]:
# run this cell to confirm that you have a location for the JSON files
item_metadata_directory = os.path.join('data','ftu_lh_metadata')

if os.path.isdir(item_metadata_directory):
    print(item_metadata_directory,'exists')
else:
    os.mkdir(item_metadata_directory)
    print('created',item_metadata_directory)

data/ftu_lh_metadata exists


Now, with the `collection_set_list`, use the included links to query the API for metadata for each item:

In [42]:
item_count = 0
error_count = 0
file_count = 0

data_directory = 'data'
item_metadata_directory = 'ftu_lh_metadata'
item_metadata_file_start = 'item_metadata'
json_suffix = '.json'

collection_set_list = os.path.join('data','collection_set_list.csv')

with open(collection_set_list, 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f, fieldnames=headers)
    for item in reader:
        if item['link'] == 'link':
            continue
        # these resource links could redirect to item pages, but currently don't work
        if '?' in item['link']:
            resource_ID = item['link']
            short_ID = item['link'].split('/')[2]
            item_metadata = requests.get(endpoint + resource_ID + '&fo=json')
            print('requested',item_metadata.url,item_metadata.status_code)
            if item_metadata.status_code != 200:
                print('requested',item_metadata.url,item_metadata.status_code)
                error_count += 1
                continue
            try:
                item_metadata.json()
            except: #basically this catches all of the highsmith photos with hhh in the ID
                error_count += 1
                print('no json found')
                continue
            fout = os.path.join(data_directory, item_metadata_directory, str(item_metadata_file_start + '-' + short_ID + json_suffix))
            with open(fout, 'w', encoding='utf-8') as json_file:
                json_file.write(json.dumps(item_metadata.json()['item']))
                file_count += 1
                print('wrote', fout)
            item_count += 1
        else:
            resource_ID = item['link']
            short_ID = item['link'].split('/')[2]
            item_metadata = requests.get(endpoint + resource_ID, params=parameters)
            print('requested',item_metadata.url,item_metadata.status_code)
            if item_metadata.status_code != 200:
                print('requested',item_metadata.url,item_metadata.status_code)
                error_count += 1
                continue
            try:
                item_metadata.json()
            except:
                error_count += 1
                print('no json found')
                continue
            fout = os.path.join(data_directory, item_metadata_directory, str(item_metadata_file_start + '-' + short_ID + json_suffix))
            with open(fout, 'w', encoding='utf-8') as json_file:
                json_file.write(json.dumps(item_metadata.json()['item']))
                file_count += 1
                print('wrote', fout)
            item_count += 1

print('--- mini LOG ---')
print('items requested:',item_count)
print('errors:',error_count)
print('files written:',file_count)

IndexError: list index out of range

# Write a metadata crosswalk

Below is a start. This is going to get a bit complicated, but identify at least 10 fields that you want to move into the new site. Consider using DublinCore, but also at least one field from another schema, I would suggest MODS (more of a bibliographic schema and allows for more granularity than DublinCore), which is also supported by Omeka. Plus, you should be able to find MODS information for most (if not all) items in any of these sets. For example, looking at resource `highsm.20336`, note the last field in the item metadata is a URL to an `item` page: https://www.loc.gov/item/2012630017/. That item page links to MODS and DublinCore records.


| source field name | source field path/dict name | target        | target namespace | notes |
|-------------------|-----------------------------|---------------|------------------|-------|
| title | item['title'] | dc:title | DC Element | Title provided by the orginal metadata, could also be mapped to MODS:titleInfo:title or other fields in other namespaces | 
| date              | item['date']                | dc:date       | DC Element | This is a 4-digit year, corresponds to date of creation in most cases   |
| LC call number    | item['item']['call_number']  | dc:identifier | DC Element | Alphanumeric string. A Library of Congress number, should record for source/provenance reasons.|
| LC control number | item['item']['control_number'] | dc:identifier @type=lccn | DC Element with attribute | Corresponds to the Library of Congress Control Number (can be checked at http://lccn.loc.gov/ |
| creator           | item['creator']             | dc:creator    | DC Element | Should be a name. May be repeated. If possible, are various roles needed? Such as 'photographer', 'author', etc |
| description | item['description'] / item['summary'] | mods:physicaldescription / dc:abstract | MODS | In the source data, this seems most like physical description, although it might correspond to dc:format or dc:type. Content in the record may come from a controlled vocabulary, such as LC Genre & Form Thesaurus. |
| mime_type | | | DC |
| notes (may be multiple) | item['notes'] (array) | dcterms:abstract | DC Terms | This appears to be closest to a "summary" or description of the content of the items. |
| source_collection | | | | |
| rights | | | | |
| place | | | | |
| image (link to the full image) | | | | |
| languages | | | | |
| subject_heading | | mods:subject | mods | | 
| format, physical | item['formats'][0]['title'] / also look at item['type'] | mods:physicalDescription:form | Description of the original physical format of this item (photograph, book, poster) | Note: this may not be present or in the same place for the different types of objects in the collection |
| format | item['format'] | dc:format | DC Element | The basic type of the digital surrogate (e.g., 'image' or 'text' | |

# Transformation Part 1: Testing

Now, it's time to implement the second step, which is to accomplish the transformation. At a high level, this step involves the creation of code or another implementation workflow, which will search the item metadata files downloaded previously, extract the target fields identified in the MAP, then write that information to a CSV for later import to Omeka.

First, develop a search pattern for identifying the desired JSON files. Here, you create a list of the files that you want to transform, called `list_of_item_metadata_files`. 

**Reminder:** This step builds on your regular expression and shell skills! (Note, however, these are technically file path expansions, not actual regular expressions, but the general idea of creating a pattern and asking the computer to respond with a list of results that meet your criteria, is similar.)

In [34]:
current_loc = os.getcwd()

print(current_loc)

/Users/emmarooney/Desktop/networked-services-labs-2023


In [35]:
metadata_file_path = os.path.join('data','ftu_lh_metadata')

print(metadata_file_path)

data/ftu_lh_metadata


The next cell uses the `glob` library, which supports the use of file path expanders
to look for patterns in file paths. In this case, the previous item metadata exraction
wrote files that had the pattern `item_metadata-[item-identifier].json`. 
So, to match any pattern for the `item-identifier` section, `glob` allows
the use of the `*` (asterisk) character to match any pattern:

In [43]:
file_count = 0

for file in glob.glob('data/ftu_lh_metadata/item_metadata-*.json'):
    file_count += 1
    print(file)
    
print('found',file_count)

data/ftu_lh_metadata/item_metadata-ppmsca.09037.json
data/ftu_lh_metadata/item_metadata-highsm.05517.json
data/ftu_lh_metadata/item_metadata-mrg.02624.json
data/ftu_lh_metadata/item_metadata-hhh.hi0156.photos.json
data/ftu_lh_metadata/item_metadata-ppmsca.18187.json
data/ftu_lh_metadata/item_metadata-ppmsca.09094.json
data/ftu_lh_metadata/item_metadata-hhh.mi0225.sheet.json
data/ftu_lh_metadata/item_metadata-g3311p.fi000169a.json
data/ftu_lh_metadata/item_metadata-highsm.12154.json
data/ftu_lh_metadata/item_metadata-highsm.50327.json
data/ftu_lh_metadata/item_metadata-det.4a24366.json
data/ftu_lh_metadata/item_metadata-hhh.ny1264.photos.json
data/ftu_lh_metadata/item_metadata-highsm.15337.json
data/ftu_lh_metadata/item_metadata-highsm.12124.json
data/ftu_lh_metadata/item_metadata-highsm.41229.json
data/ftu_lh_metadata/item_metadata-ppmsca.18159.json
data/ftu_lh_metadata/item_metadata-ppmsca.09089.json
data/ftu_lh_metadata/item_metadata-highsm.45698.json
data/ftu_lh_metadata/item_metada

In [44]:
list_of_item_metadata_files = list() 
for file in glob.glob('data/ftu_lh_metadata/item_metadata-*.json'):
    list_of_item_metadata_files.append(file)

In [45]:
len(list_of_item_metadata_files)

49

In [46]:
# quick duplicate check
list_of_item_metadata_files.sort()

for file in list_of_item_metadata_files:
    print(file)

data/ftu_lh_metadata/item_metadata-cph.3c36378.json
data/ftu_lh_metadata/item_metadata-det.4a24366.json
data/ftu_lh_metadata/item_metadata-g3311p.fi000169a.json
data/ftu_lh_metadata/item_metadata-g3721p.fi000169cr.json
data/ftu_lh_metadata/item_metadata-hhh.ca0581.sheet.json
data/ftu_lh_metadata/item_metadata-hhh.fl0421.photos.json
data/ftu_lh_metadata/item_metadata-hhh.hi0156.photos.json
data/ftu_lh_metadata/item_metadata-hhh.me0226.photos.json
data/ftu_lh_metadata/item_metadata-hhh.mi0225.sheet.json
data/ftu_lh_metadata/item_metadata-hhh.mn0184.photos.json
data/ftu_lh_metadata/item_metadata-hhh.nc0497.photos.json
data/ftu_lh_metadata/item_metadata-hhh.nc0610.sheet.json
data/ftu_lh_metadata/item_metadata-hhh.ny1264.photos.json
data/ftu_lh_metadata/item_metadata-hhh.pa0452.sheet.json
data/ftu_lh_metadata/item_metadata-hhh.pr0047.photos.json
data/ftu_lh_metadata/item_metadata-hhh.wi0305.sheet.json
data/ftu_lh_metadata/item_metadata-highsm.05517.json
data/ftu_lh_metadata/item_metadata-hi

To develop your data transformation and metadata profile, 
first you need to explore the information that you have about each item. 
To do this, explore one item to understand how the information is structured.
How do you open the json? How is it structured? Where is the information you want?

In [47]:
# try first with one file, can you open the json, can you see what elements are in the json?
with open(list_of_item_metadata_files[0], 'r', encoding='utf-8') as item:
    # what are we looking at?
    print('file:',list_of_item_metadata_files[0],'\n')
    
    # load the item data
    item_data = json.load(item)
    
    for element in item_data.keys():
        print(element,':',item_data[element])

file: data/ftu_lh_metadata/item_metadata-cph.3c36378.json 

_version_ : 1731681821433790464
access_restricted : False
aka : ['https://www.loc.gov/pictures/item/2006678920/', 'http://www.loc.gov/item/2006678920/', 'http://www.loc.gov/pictures/item/2006678920/', 'https://www.loc.gov/pictures/collection/cph/item/2006678920/', 'http://www.loc.gov/pictures/collection/cph/item/2006678920/', 'http://www.loc.gov/resource/cph.3c36378/', 'http://lccn.loc.gov/2006678920', 'https://hdl.loc.gov/loc.pnp/cph.3c36378']
call_number : NYWTS - SUBJ/GEOG-- Lighthouses--Seagate Light--New York Harbor [item] [P&P]
campaigns : []
contributor_names : ['Higgins, Roger, photographer']
contributors : [{'higgins, roger': 'https://www.loc.gov/search/?fa=contributor:higgins,+roger&fo=json'}]
control_number : 
created : 2016-04-20T11:40:32Z
created_published : ['1961 June 12.']
created_published_date : 1961 June 12.
date : 1961-01-01
dates : [{'1961': 'https://www.loc.gov/search/?dates=1961/1961&fo=json'}]
descripti

Look around in the dictionary a bit more:

In [48]:
item_data.keys()

dict_keys(['_version_', 'access_restricted', 'aka', 'call_number', 'campaigns', 'contributor_names', 'contributors', 'control_number', 'created', 'created_published', 'created_published_date', 'date', 'dates', 'description', 'digital_id', 'digitized', 'display_offsite', 'extract_timestamp', 'extract_urls', 'format', 'format_headings', 'genre', 'group', 'hassegments', 'id', 'image_url', 'index', 'item', 'language', 'languages', 'library_of_congress_control_number', 'link', 'location', 'locations', 'marc', 'medium', 'medium_brief', 'mime_type', 'modified', 'notes', 'number', 'number_former_id', 'number_lccn', 'number_source_modified', 'online_format', 'original_format', 'other_control_numbers', 'other_formats', 'other_title', 'partof', 'place', 'related', 'repository', 'reproduction_number', 'reproductions', 'resource_links', 'resources', 'rights', 'rights_advisory', 'rights_information', 'score', 'shard', 'shelf_id', 'site', 'sort_date', 'source_collection', 'source_created', 'source_mo

For the development of your metadata transformation, you're looking for 
how to extract the elements identified in the MAP table. For example, which date fields do you want and where are they? Where will you find the format information?

In [49]:
    # can you get the date?
    print('\ndate:',item_data['date'], type(item_data['date']))
    # can you get the format?
    print('\nformat:',item_data['format'][0], type(item_data['format']))


date: 1961-01-01 <class 'str'>

format: {'photo, print, drawing': 'https://www.loc.gov/search/?fa=original_format:photo,+print,+drawing&fo=json'} <class 'list'>


## Test: Try it with one example

First, try to set up the extract process with one example. This may get more complicated later since you don't know yet if every item has the same metadata attributes in the JSON. But start with some basics and build up from there. 

For a first pass, look out for these items, and find where in the JSON you can locate them:

* 'item_id'
* 'title'
* 'date' 
* 'source_url'
* 'phys_format'
* 'dig_format'
* 'rights'

_Hint: use the JSON viewer in JupyterLab, use an extension in VSCode, or use a browser to look through sample JSON. The block below uses item `cph.3b41963`._

You may need to use try/except patterns to create workarounds for cases where some items may not have exactly the same attributes that you've identified in your test cases.

In [50]:
# set up the containers to create the csv of all the item fields
# file for csv to read out
collection_info_csv = 'collection_items_data.csv'

# set up a list for the columns in your csv; 
# your goal should be to automate this, but . . . 
# it works for demonstration as you set up the crosswalk
headers = ['source_file', 'item_id', 'title', 'date', 'source_url', 'phys_format', 'dig_format', 'rights']

# try first with one file
with open(list_of_item_metadata_files[0], 'r', encoding='utf-8') as data:
    # load the item data
    item_data = json.load(data)
    
    # extract the data you want
    # for checking purposes, add in the source of the info
    source_file = str(file)
    # make sure there's some unique and stable identifier
    try:
        item_id = item_data['library_of_congress_control_number']
    except:
        item_id = item_data['url'].split('/')[-2]
    title = item_data['title']
    date = item_data['date']
    source_url = item_data['url']
    try:
        phys_format = item_data['format'][0]
    except:
        phys_format = 'Not found'
    try:
        dig_format = item_data['online_format'][0]
    except:
        dig_format = 'Not found'
    mime_type = item_data['mime_type']
    try:
        rights = item_data['rights_information']
    except:
        rights = 'Undetermined'


    # dictionary for the rows
    row_dict = dict()
    
    # look for the item metadata, assign it to the dictionary; 
    # start with some basic elements likely (already enumerated in the headers list) :
    # source file
    row_dict['source_file'] = source_file
    # identifier
    row_dict['item_id'] = item_id
    # title
    row_dict['title'] = title
    # date
    row_dict['date'] = date
    # link
    row_dict['source_url'] = source_url
    # format
    row_dict['phys_format'] = phys_format
    # digital format
    row_dict['dig_format'] = dig_format
    #rights
    row_dict['rights'] = rights 
    print('created row dictionary:',row_dict)

    # write to the csv
    with open(collection_info_csv, 'w', encoding='utf-8') as fout:
        writer = csv.DictWriter(fout, fieldnames=headers)
        writer.writeheader()
        writer.writerow(row_dict)
        print('wrote',collection_info_csv)

created row dictionary: {'source_file': 'data/ftu_lh_metadata/item_metadata-ppmsca.58709.json', 'item_id': '2006678920', 'title': "Frank Schubert polishes the station's lens once a week / World Telegram & Sun photo by Higgins.", 'date': '1961-01-01', 'source_url': 'https://www.loc.gov/item/2006678920/', 'phys_format': {'photo, print, drawing': 'https://www.loc.gov/search/?fa=original_format:photo,+print,+drawing&fo=json'}, 'dig_format': 'image', 'rights': 'No known copyright restriction. For information see "New York World-Telegram & ...," https://www.loc.gov/rr/print/res/076_nyw.html'}
wrote collection_items_data.csv


You're now developing the structure of the CSV file that will import items into your Omeka S site. The CSV import module supports the loading of item files via a URL. This provides the location of a file (in this case, an image), which Omeka will copy into its database and attach to your item. This means that it isn't necessary to upload individual files after or during metadata creation. 

To allow this, you need to find a direct url to a good image file for the item. There are multiple options, and the code below demonstrates looking for the url to a medium-sized image of an item:

In [51]:
collection_info_csv = 'collection_items_data.csv'

# set up a list for the columns in your csv; in future, this should be more automated but this works for now as you set up the crosswalk
headers = ['source_file', 'item_id', 'title', 'date', 'source_url', 'contributor', 'description', 'medium', 'isPartOf', 'rights', 'subjects', 'type']

# try first with one file
with open(list_of_item_metadata_files[0], 'r', encoding='utf-8') as data:
    # load the item data
    item_data = json.load(data)
    
    print(item_data['image_url'][3])

https://tile.loc.gov/storage-services/service/pnp/cph/3c30000/3c36000/3c36300/3c36378v.jpg#h=1024&w=828


# Transformation Part 2: Write your CSV

The goal of this final step is to create a CSV file, which will be possible to import into your Omeka site. It may seem like it's taken a long time to get to this point... but remember, when this works you will be importing around 60 items into the site at one time, so if you can get all of this to work for an even larger set of materials, you will be saving quite a lot of time in the future when you need to import items. Even if you were to collect the items piecemeal, which would need a different workflow than illustrated here, you can accomplish similar goals by recording metadata for each item consistently and in a spreadsheet, which you can then use to import the items in batch.

So now that your transformation script is tested, the goal is to extend this to the whole set by looping through each of the desired JSON files:

In [52]:
# for purposes of demonstration, use this block to make sure there isn't already a list file:

items_data_file = os.path.join(data_directory, 'collection_items_data.csv')

if os.path.isfile(items_data_file):
    os.unlink(items_data_file)
    print('removed',items_data_file)

# clear row_dict
row_dict = ()

removed data/collection_items_data.csv


In [53]:
from datetime import date

date_string_for_today = date.today().strftime('%Y-%m-%d') # see https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

print(date_string_for_today)

2023-11-20


In [56]:
# set up the containers to create the csv & counters 
# file for csv to read out
collection_info_csv = os.path.join('data','collection_items_data.csv')
file_count = 0
items_written = 0
error_count = 0

# add in a couple of extras for Omeka, including item type and date uploaded

# set up a list for the columns in your csv; in future, this should be more automated but this works for now as you set up the crosswalk
headers = ['item_type', 'date_uploaded','source_file', 'item_id', 'title', 'date', 'source_url', 'contributors', 'description', 'medium', 'partof', 'rights', 'subjects', 'type']


# now, adapt the previous loop to open each file:
for file in list_of_item_metadata_files:
    file_count += 1
    print('opening',file)
    with open(file, 'r', encoding='utf-8') as item:
        # load the item data
        try:
            item_data = json.load(item)
        except:
            print('error loading',file)
            error_count += 1
            continue

        # extract/name the data you want
        # item type
        item_type = 'Item'
        # date uplaoded
        date_uploaded = date_string_for_today
        # for checking purposes, add in the source of the info
        source_file = str(file)
        # make sure there's some unique and stable identifier
        try:
            item_id = item_data['library_of_congress_control_number']
        except:
            item_id = item_data['url'].split('/')[-2]
        try:
            title = item_data['item']['title']
        except:
                title = item_data['title']
        #date
        try:
            date = item_data['created_published_date']
        except:
            # date = item_data['item']['date']
            date = item_data['date']
        #url
        source_url = item_data['url']
        #contrib
        try:
            contributors = item_data['contributors']
        except:
            contributors = "Not known."

        #contrib
        description = item_data['description'][0]
        #medium
        try:
            medium = item_data['item']['medium']
        except:
            medium = item_data['item']['mediums']
        #partof
        partof = item_data['partof']
        # try:
        #     phys_format = item_data['format'][0]
        # except:
        #     phys_format = 'Not found'
        # try:
        #     dig_format = item_data['online_format'][0]
        # except:
        #     dig_format = 'Not found'
        # mime_type = item_data['mime_type']
        try:
            rights = item_data['rights']
        except:
            rights = 'Undetermined'
        #subjects
        try:
            subjects = item_data['item']['subjects']
        except:
            subjects = item_data['item']['subject_headings']
        #url
        try:
            image_url = item_data['image_url'][3]
        except:
            image_url = 'Did not identify a URL.'
        #type
        type = item_data['type']

        # dictionary for the rows
        row_dict = dict()

        # look for the item metadata, assign it to the dictionary; 
        # start with some basic elements likely (already enumerated in the headers list) :
        # item type
        row_dict['item_type'] = item_type
        # date uploaded
        row_dict['date_uploaded'] = date_uploaded
        # source filename
        row_dict['source_file'] = source_file
        # identifier
        row_dict['item_id'] = item_id
        # title
        row_dict['title'] = title
        # date
        row_dict['date'] = date
        # link
        row_dict['source_url'] = source_url
        # format
        row_dict['contributors'] = contributors
        # description
        row_dict['description'] = description
        #medium
        row_dict['medium'] = medium
        #part of
        row_dict['partof'] = partof
        #rights
        row_dict['rights'] = rights
        #subject
        row_dict['subjects'] = subjects
        #type
        row_dict['type'] = type

        # write to the csv
        with open(collection_info_csv, 'a', encoding='utf-8') as fout:
            writer = csv.DictWriter(fout, fieldnames=headers)
            if items_written == 0:
                writer.writeheader()
            writer.writerow(row_dict)
            items_written += 1
            print('adding',item_id)

print('\n\n--- LOG ---')
print('wrote',collection_info_csv)
print('with',items_written,'items')
print(error_count,'errors (info not written)')

opening data/ftu_lh_metadata/item_metadata-cph.3c36378.json
adding 2006678920
opening data/ftu_lh_metadata/item_metadata-det.4a24366.json
adding 2016812260
opening data/ftu_lh_metadata/item_metadata-g3311p.fi000169a.json
adding 2018588031
opening data/ftu_lh_metadata/item_metadata-g3721p.fi000169cr.json
adding 2018588023
opening data/ftu_lh_metadata/item_metadata-hhh.ca0581.sheet.json
adding ca0581
opening data/ftu_lh_metadata/item_metadata-hhh.fl0421.photos.json
adding fl0421
opening data/ftu_lh_metadata/item_metadata-hhh.hi0156.photos.json
adding hi0156
opening data/ftu_lh_metadata/item_metadata-hhh.me0226.photos.json
adding me0226
opening data/ftu_lh_metadata/item_metadata-hhh.mi0225.sheet.json
adding mi0225
opening data/ftu_lh_metadata/item_metadata-hhh.mn0184.photos.json
adding mn0184
opening data/ftu_lh_metadata/item_metadata-hhh.nc0497.photos.json
adding nc0497
opening data/ftu_lh_metadata/item_metadata-hhh.nc0610.sheet.json
adding nc0610
opening data/ftu_lh_metadata/item_metada

Now, you should have a well-formed, complete CSV file at `data/collection_items_data.csv`. This file should ahve all the information to import the 59 items that you were able to identify, ready for import to Omeka. 