# Step 1, Part 2: Metadata Collection and Organization

In this section, we extract, save, and structure the relevant metadata from the DC image collection created in the previous section, and write it to a single JSON file.

First, we import the libraries, modules, and functions we need from the workflow_helpers file.

In [76]:
from workflow_helpers import *
import json

# Load in Collection CSV and Access Collection Links

As in the previous notebook, let's load the collection CSV, access the collection links, and fetch the JSON data.

In [77]:
#assigning the filepath to the csv file with the collection information
file = "jfp-collections-starter-collections.csv"
#using the function from the helper file to read in the csv
collection = read_in_collection_csv_for_links(file)

#making sure the data is as expected
for row in collection:
    print(row)

['\ufeffCollection Name', 'Collection Link', 'Notes', 'Objects of Interest']
['National Photo Company Collection', 'https://www.loc.gov/collections/national-photo-company/', 'Filter by restriction: "right_information" and "rights_advisory"', 'People, Animals, Landmarks, Vehicles']
['Highsmith (Carol M.) Archive', 'https://www.loc.gov/collections/carol-m-highsmith/', 'Would need to be filtered by place since it captures projects outside of D.C. Given how Highsmith organized/labeled these images, we could (for the most part) safely do a "if D.C. in XXX:" from the "subject_headings" field or "title" field (the latter seems easier as it\'s less nested).\n\nFilter by restriction: "right_information" and "rights_advisory"', 'Landscape, Landmarks, Roads']
['Free to Use', 'https://www.loc.gov/free-to-use/', 'Filter for D.C. by "title" and "subhect_headings"\nFilter by restriction: "right_information" and "rights_advisory"\n\n', 'Animals, People']
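
The `\ufeff` at the start of the first header cell is a byte order mark (BOM), which typically shows up when a CSV saved with a BOM is read as plain UTF-8. The body of `read_in_collection_csv_for_links` is not shown in this notebook, so the following is only a sketch of a hypothetical stand-in reader, `read_csv_rows`, that opens the file with `encoding="utf-8-sig"` so the mark is stripped. The temporary file and its two rows are illustrative, not the real collection CSV.

```python
import csv
import os
import tempfile

def read_csv_rows(path):
    """Hypothetical stand-in for the helper: read a CSV into a list of rows.

    encoding="utf-8-sig" strips a leading byte order mark if one is present.
    """
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.reader(f))

# Write a small temporary CSV that starts with a BOM, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8-sig", newline="") as tmp:
    csv.writer(tmp).writerows([
        ["Collection Name", "Collection Link"],
        ["Free to Use", "https://www.loc.gov/free-to-use/"],
    ])
    demo_path = tmp.name

rows = read_csv_rows(demo_path)
os.remove(demo_path)
print(rows[0])
```

With this encoding, the first header cell should come back as `Collection Name` with no leading `\ufeff`.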


In [78]:
#isolating and saving the collection links
collection_links = []

for link in collection[1:]:
    collection_links.append(link[1])

for link in collection_links:
    print(link)

https://www.loc.gov/collections/national-photo-company/
https://www.loc.gov/collections/carol-m-highsmith/
https://www.loc.gov/free-to-use/


Now that we've saved the collection links, let's access the JSON information.

In [79]:
#list stores the json response for each collection
json_responses = []

for link in collection_links:
    #using the access_and_store_json function from the helper file
    json_response = access_and_store_json(link)
    json_responses.append(json_response)


{'aka': ['npco', 'http://lccn.loc.gov/2005684470'], 'breadcrumbs': [{'Library of Congress': 'https://www.loc.gov'}, {'Digital Collections': 'https://www.loc.gov/collections/'}, {'National Photo Company Collection': 'https://www.loc.gov/collections/national-photo-company/'}], 'categories': ['about-this-collection', 'articles-and-essays'], 'content': {'active': True, 'link': 'https://www.loc.gov/collections/national-photo-company/', 'markup': None, 'pagination': '2 of 3', 'partof': [], 'results': [{'access_restricted': False, 'aka': ['https://www.loc.gov/pictures/item/93511941/', 'http://www.loc.gov/item/93511941/', 'http://www.loc.gov/pictures/item/93511941/', 'https://www.loc.gov/pictures/collection/npco/item/93511941/', 'http://www.loc.gov/pictures/collection/npco/item/93511941/', 'http://www.loc.gov/resource/cph.3c08418/', 'http://lccn.loc.gov/93511941', 'https://hdl.loc.gov/loc.pnp/cph.3c08418'], 'campaigns': [], 'date': '1927-01-01', 'dates': ['1927'], 'description': ['1 photograph

In this next step, we will optionally download the JSON files for each collection for easier inspection of the structure and organization.

In [80]:
for index in range(len(collection_links)):
    name = collection[index+1][0].lower().replace(" ", "_")
    with open(f"{name}.json", 'w') as f:
        json.dump(json_responses[index], f, indent=4)
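
Replacing only spaces leaves punctuation in the file names: "Highsmith (Carol M.) Archive" becomes `highsmith_(carol_m.)_archive.json`. If plainer names are preferred, one option is a small helper that collapses every run of non-alphanumeric characters into a single underscore. The helper name `slugify` is an assumption made for this sketch and is not part of workflow_helpers.

```python
import re

def slugify(name):
    """Lowercase a collection name and collapse non-alphanumeric runs to "_"."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

names = [
    "National Photo Company Collection",
    "Highsmith (Carol M.) Archive",
    "Free to Use",
]
filenames = [f"{slugify(name)}.json" for name in names]
for filename in filenames:
    print(filename)
```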

# The Metadata

Now we will create a new JSON file with the metadata we want to save about each image. To demonstrate this, we will filter our chosen collections for images related to Washington, D.C., and save selected metadata for each image for reference.

For our purposes, we've chosen to save the following: the resource ID, the title, the image URL, the subjects, the date, the contributor names, the description, the alt text (if available), the collection name (source_collection), and the original format.

As noted in the last notebook, the structure of the JSON for collections with alt text is different from that of collections without alt text, so we will have to extract the metadata slightly differently.
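
To make the difference concrete, here is a minimal sketch using mock responses. The mocks mirror only the keys this notebook reads (`content` → `results` for regular collections, `pages` → `children` → `set` → `items` for the free-to-use pages); real responses contain many more fields, and the sample values and the `iter_items` helper are invented for illustration.

```python
# Mock responses containing only the keys this notebook relies on.
collection_style = {
    "site_type": "collections",
    "content": {"results": [
        {"title": "Example item", "image_url": ["https://example.org/a.jpg"]},
    ]},
}
free_to_use_style = {
    "site_type": "free-to-use",
    "pages": [{"children": [
        {"placeholder": "first child, skipped with [1:] as in this notebook"},
        {"set": {"items": [
            {"link": "/resource/highsm.12695/", "alt": "Example alt text"},
        ]}},
    ]}],
}

def iter_items(response):
    """Yield item dictionaries from either response shape."""
    if response.get("site_type") == "free-to-use":
        for page in response["pages"]:
            for group in page["children"][1:]:
                yield from group["set"]["items"]
    else:
        yield from response["content"]["results"]

collection_items = list(iter_items(collection_style))
free_items = list(iter_items(free_to_use_style))
print(collection_items[0]["title"])
print(free_items[0]["alt"])
```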

In [81]:
json_responses_no_alt = [] #A list to store the JSON information from collections with no alt text
json_responses_alt = [] #A list to store the JSON information from collections with alt text

for link in collection_links:
    try:
        response = access_and_store_json(link)
        if response["site_type"] == "collections":
            json_responses_no_alt.append(response)
        elif response["site_type"] == "free-to-use":
            #the free-to-use collection has been specially curated and given alt text
            json_responses_alt.append(response)
    except Exception as error:
        #report which link failed instead of silently discarding the error
        print(f"Could not process {link}: {error}")


{'aka': ['npco', 'http://lccn.loc.gov/2005684470'], 'breadcrumbs': [{'Library of Congress': 'https://www.loc.gov'}, {'Digital Collections': 'https://www.loc.gov/collections/'}, {'National Photo Company Collection': 'https://www.loc.gov/collections/national-photo-company/'}], 'categories': ['about-this-collection', 'articles-and-essays'], 'content': {'active': True, 'link': 'https://www.loc.gov/collections/national-photo-company/', 'markup': None, 'pagination': '2 of 3', 'partof': [], 'results': [{'access_restricted': False, 'aka': ['https://www.loc.gov/pictures/item/93511941/', 'http://www.loc.gov/item/93511941/', 'http://www.loc.gov/pictures/item/93511941/', 'https://www.loc.gov/pictures/collection/npco/item/93511941/', 'http://www.loc.gov/pictures/collection/npco/item/93511941/', 'http://www.loc.gov/resource/cph.3c08418/', 'http://lccn.loc.gov/93511941', 'https://hdl.loc.gov/loc.pnp/cph.3c08418'], 'campaigns': [], 'date': '1927-01-01', 'dates': ['1927'], 'description': ['1 photograph

Let's start with the collections with alt text. As in the last notebook, we'll begin by collecting the URL of each image in these collections so we can send GET requests for the necessary metadata.

In [82]:
#A list of partial URLs extracted from the JSON data, structured similarly
#to the following: '/resource/highsm.12695/'.
unformatted_links = []

#A list of URLs without the JSON filter parameter, but with the root of the URL: 'https://www.loc.gov'.
partially_formatted_links = []

#A list of fully formatted urls, with root and JSON filter applied.
formatted_links = []

#A holding list for using try and except.
error_links = []

#A list of URLs from the data that already had an 'https://' construction and cannot have the
#JSON filter parameter applied.
research_guide_not_image_collection = []

for collection_data in json_responses_alt:
    results = collection_data['pages']
    for result in results:
        get_children = result['children'][1:]
        for group in get_children:
            #named items rather than set, to avoid shadowing the built-in set type
            items = group['set']['items']
            for item in items:
                link = item['link']
                #saving alt text for easier retrieval to build the JSON file later
                alt = item.get('alt', "NA")
                unformatted_links.append([link, alt])

root = 'https://www.loc.gov'

for link in unformatted_links:
    try:
        if 'https://' not in link[0]:
            #partial URL: prepend the root and keep the alt text alongside it
            partially_formatted_links.append([root + link[0], link[1]])
        else:
            research_guide_not_image_collection.append(link)
    except Exception:
        error_links.append(link)

for link in partially_formatted_links:
    try:
        if '?' not in link[0]:
            x = link[0] + '?fo=json'
        else:
            #Image URL may already have a filter, so we need to append the JSON filter to an existing filter.
            x = link[0] + '&fo=json'
        formatted_links.append([x, link[1]])
    except Exception:
        error_links.append(link)
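
As an alternative to checking for `?` by hand, the standard library's `urllib.parse` can join the root and add the parameter in one step. This is a sketch rather than the approach used in this notebook: `add_json_param` is a hypothetical helper, the `?st=gallery` query is an invented example, and unlike the loops above it does not set aside links that already begin with `https://`.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

ROOT = "https://www.loc.gov"

def add_json_param(link):
    """Return an absolute URL with fo=json added to any existing query string."""
    absolute = urljoin(ROOT, link)
    parts = urlsplit(absolute)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if ("fo", "json") not in query:
        query.append(("fo", "json"))
    return urlunsplit(parts._replace(query=urlencode(query)))

plain = add_json_param("/resource/highsm.12695/")
with_query = add_json_param("/resource/highsm.12695/?st=gallery")
print(plain)
print(with_query)
```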

Now that we have all the links, we can perform get requests, filter for relevant DC images, and store the relevant metadata.

In the collections with no alt text, we'll be able to do this directly from the json we already have, without having to use get requests on individual images.
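
One thing to watch when filtering on text: Python reads `'D.C.' or 'District of Columbia' in title` as `'D.C.' or ('District of Columbia' in title)`, and a non-empty string is always truthy, so that condition passes for every image. Each term has to be tested on its own. The sketch below shows the difference; `mentions_dc` is a hypothetical helper and the sample titles are made up.

```python
def mentions_dc(*fields):
    """Return True if any of the given fields mentions Washington, D.C."""
    terms = ("D.C.", "District of Columbia")
    # str() lets the helper accept lists as well as plain strings.
    text = " ".join(str(field) for field in fields)
    return any(term in text for term in terms)

title = "Golden Gate Bridge, San Francisco, California"

# The left operand is a non-empty string, so this is True for any title.
always_true = bool("D.C." or "District of Columbia" in title)

print(always_true)
print(mentions_dc(title))
print(mentions_dc("Capitol, Washington, D.C."))
```

The first check reports a match for a San Francisco title, while `mentions_dc` only matches when one of the terms actually appears.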

In [83]:
#Confirming the structure of json for a single image
image_request = request_link(formatted_links[0][0])
print(image_request['item'].keys())


dict_keys(['_version_', 'access_restricted', 'aka', 'call_number', 'campaigns', 'contributor_names', 'contributors', 'control_number', 'created', 'created_published', 'created_published_date', 'date', 'dates', 'description', 'digital_id', 'digitized', 'display_offsite', 'extract_timestamp', 'extract_urls', 'format', 'format_headings', 'genre', 'group', 'hassegments', 'id', 'image_url', 'index', 'item', 'language', 'languages', 'library_of_congress_control_number', 'link', 'location', 'location_city', 'location_country', 'location_county', 'location_state', 'locations', 'locations_city', 'locations_country', 'locations_county', 'locations_state', 'marc', 'medium', 'medium_brief', 'mime_type', 'modified', 'notes', 'number', 'number_carrier_type', 'number_former_id', 'number_lccn', 'number_source_modified', 'online_format', 'original_format', 'other_control_numbers', 'other_formats', 'other_title', 'partof', 'place', 'related', 'repository', 'reproduction_number', 'reproductions', 'resour

In [84]:
metadata_dict = {} #Our dictionary for storing metadata, to be converted into a JSON file

counter = 0 #Each image will be given an index for organization purposes -- IN PROGRESS, MIGHT USE LCCN INSTEAD

dc_terms = ('D.C.', 'District of Columbia')

#We'll limit the amount of links we process here for a faster runtime
for link in formatted_links[:20]:
    try:
        json_results = request_link(link[0])
        item = json_results['item']

        #Filter by Washington, D.C. and images that can be freely distributed and used.
        if 'No known restrictions' in item['item']['rights_advisory'] and item['item']['rights_information']:
            #Each term is tested separately; "'D.C.' or ... in title" would always be True.
            searchable = f"{item['item'].get('title', '')} {item['item'].get('location', '')}"
            if any(term in searchable for term in dc_terms):
                #Building the record directly avoids overwriting the collection
                #list from earlier cells and shadowing the built-in id.
                metadata_dict[counter] = {
                    "Resource ID": item['number_lccn'][0],
                    "Item Title": item['title'],
                    "Item URL": item['image_url'][-1],
                    "Subjects": item['subject_headings'],
                    "Date": item['date'],
                    "Contributors": item['contributor_names'],
                    "Description": item['description'],
                    "Alt Text": link[1],
                    "Collection": item['source_collection'],
                    "Original Format": item['original_format']
                }
                counter += 1

    except Exception:
        #Any link that fails to load or lacks an expected field is set aside.
        error_links.append(link[0])


Now, let's move on to the collections without alt text.

In [86]:
dc_terms = ('D.C.', 'District of Columbia')

for collection_data in json_responses_no_alt:
    results = collection_data['content']['results']
    # Again, limiting the data returned for faster processing.
    for result in results[:20]:
        item = result.get('item', {})

        #Filter by Washington, D.C. and images that can be freely distributed and used.
        #The comparison is done in lowercase so capitalization differences do not matter.
        rights_advisory = str(item.get('rights_advisory', '')).lower()
        if 'no known restrictions' in rights_advisory or item.get('rights_information'):
            #Each term is tested separately; "'D.C.' or ... in title" would always be True.
            searchable = f"{item.get('title', '')} {item.get('notes', '')}"
            if any(term in searchable for term in dc_terms):

                #As the fields of collections can be varied, optional fields
                #fall back to "NA" when they are missing from an image.
                metadata_dict[counter] = {
                    "Resource ID": result['number_lccn'][0],
                    "Item Title": result['title'],
                    "Item URL": result['image_url'][0],
                    "Subjects": result.get('subject', "NA"),
                    "Date": result.get('date', "NA"),
                    "Contributors": result.get('contributor', "NA"),
                    "Description": result.get('description', "NA"),
                    "Alt Text": "NA", #these collections have no alt text
                    "Collection": result.get('partof', "NA"),
                    "Original Format": result['original_format']
                }
                #Without this increment, every image would overwrite the same key.
                counter += 1


# Saving the JSON File

Now that we've collected and organized all the metadata into Python dictionaries, we can convert metadata_dict to a JSON file and save it.

In [87]:
with open("DC_Set.json", 'w') as f:
    json.dump(metadata_dict, f, indent=4)
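
A quick way to confirm the file was written correctly is to load it back. One detail to expect: JSON object keys are always strings, so the integer indices used as keys in `metadata_dict` come back as `'0'`, `'1'`, and so on. The sketch below demonstrates this with a small made-up dictionary and a temporary file, so it does not touch DC_Set.json.

```python
import json
import os
import tempfile

# A small made-up dictionary with integer keys, like metadata_dict.
metadata = {0: {"Item Title": "Example"}, 1: {"Item Title": "Another example"}}

path = os.path.join(tempfile.mkdtemp(), "example_set.json")
with open(path, "w") as f:
    json.dump(metadata, f, indent=4)

with open(path) as f:
    reloaded = json.load(f)

# Keys are strings after the round trip.
print(list(reloaded.keys()))

# Convert the keys back to integers if the original indexing is needed.
restored = {int(key): value for key, value in reloaded.items()}
print(restored == metadata)
```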

The process is complete!