# Step 1: Metadata Collection and Image Download

In this section, we will extract, structure, and save the relevant metadata from the DC image collection created in the last section as a JSON file. Using that metadata, we will also download and save the images.

# Imports

Before we get started, let's import some libraries and modules that are specific to this step in the workflow. We'll also import the libraries and modules important to the overall process from workflow_helpers.py.

In [15]:
from PIL import Image
from io import BytesIO
from workflow_helpers import *
import json
import requests

In [16]:
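#maximum number of links or results processed in each of the metadata loops below (keeps the runtime short)
number_of_instances = 80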

# Load in Collection CSV and Access Collection Links

In this step, we'll assign the filepath of the collection data we want to work with to a variable. For simplicity, we'll call this variable "file". 

Please note that you may use any data storage format that you are comfortable with, such as JSON. However, our helper function is written based on the data storage format that our team is most likely to start with.

In [17]:
#assigning the filepath to the csv file with the collection information
file = "jfp-collections-starter-collections.csv"
#using the function from the helper file to read in the csv
collection = read_in_collection_csv_for_links(file)
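
The helper `read_in_collection_csv_for_links` lives in workflow_helpers.py and is not reproduced in this notebook. As a rough idea of what such a reader could look like, here is a minimal sketch built on the standard `csv` module. The function name `read_csv_rows`, the sample file, and the example URL are assumptions made for illustration, not the helper's actual code.

```python
import csv
import os
import tempfile

def read_csv_rows(filepath):
    """Hypothetical stand-in for the helper: return every CSV row as a list of strings."""
    with open(filepath, newline="", encoding="utf-8") as csv_file:
        return [row for row in csv.reader(csv_file)]

# Write a tiny made-up CSV so the sketch can run without the real collection file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-collections.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Collection Name", "Collection Link"])
    writer.writerow(["Example Collection", "https://www.loc.gov/collections/example/"])

sample_rows = read_csv_rows(sample_path)
print(sample_rows)
```

If the first header cell comes back with a leading `\ufeff`, opening the file with `encoding="utf-8-sig"` instead strips that byte order mark.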


Let's iterate through the data to become familiar with it, and to verify that it is in a format we can work with and has all the information we expect.

If your collection data is larger than what we have here, you can limit the output by slicing the list before looping over it, for example `collection[:n]`, where `n` is the number of rows you want to return from the collection list.
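
As a small illustration of limiting the output, the sketch below slices a made-up list of rows before printing it. The list `sample_collection` and the value of `n` are placeholders, not the real collection data.

```python
# Made-up rows standing in for the real "collection" list.
sample_collection = [
    ["Collection Name", "Collection Link"],
    ["Collection A", "link-a"],
    ["Collection B", "link-b"],
    ["Collection C", "link-c"],
]

n = 2  # number of rows to print, header row included

limited_rows = sample_collection[:n]
for row in limited_rows:
    print(row)
```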


To append each link without the CSV header, we'll iterate through our "collection" list variable from row 1, where row 0 is the header row.

In [18]:
#making sure the data is as expected
for row in collection:
    print(row)

#isolating and saving the collection links
collection_links = []

for link in collection[1:]:
    collection_links.append(link[1])


['\ufeffCollection Name', 'Collection Link', 'Notes', 'Objects of Interest']
['National Photo Company Collection', 'https://www.loc.gov/collections/national-photo-company/', 'Filter by restriction: "right_information" and "rights_advisory"', 'People, Animals, Landmarks, Vehicles']
['Highsmith (Carol M.) Archive', 'https://www.loc.gov/collections/carol-m-highsmith/', 'Would need to be filtered by place since it captures projects outside of D.C. Given how Highsmith organized/labeled these images, we could (for the most part) safely do a "if D.C. in XXX:" from the "subject_headings" field or "title" field (the latter seems easier as it\'s less nested).\n\nFilter by restriction: "right_information" and "rights_advisory"', 'Landscape, Landmarks, Roads']
['Free to Use', 'https://www.loc.gov/free-to-use/', 'Filter for D.C. by "title" and "subhect_headings"\nFilter by restriction: "right_information" and "rights_advisory"\n\n', 'Animals, People']


Now that we've saved the collection links, let's access the JSON information.

In [19]:
#list stores the json response for each collection
json_responses = []

for link in collection_links:
    #using the access_and_store_json function from the helper file
    json_response = access_and_store_json(link)
    json_responses.append(json_response)


In this next step, we will optionally download the JSON files for each collection for easier inspection of the structure and organization.

In [20]:
for index in range(len(collection_links)):
    name = collection[index+1][0].lower().replace(" ", "_")
    with open(f"{name}.json", 'w') as f:
        json.dump(json_responses[index], f, indent=4)
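
Collection names can contain characters such as parentheses and periods (for example "Highsmith (Carol M.) Archive"), and with the cell above these end up in the file name. If simpler file names are preferred, one option is to collapse everything that is not a letter or digit into underscores. The helper name `slugify_collection_name` is hypothetical and is not part of workflow_helpers.py.

```python
import re

def slugify_collection_name(name):
    """Hypothetical helper: lowercase the name and collapse non-alphanumeric runs into underscores."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower())
    return slug.strip("_")

print(slugify_collection_name("Highsmith (Carol M.) Archive"))  # highsmith_carol_m_archive
print(slugify_collection_name("Free to Use"))                   # free_to_use
```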

# The Metadata

Now, we will be creating a new JSON file with the relevant metadata we want to save about each image. To demonstrate this, we will be filtering out and collecting images from our chosen collections that include "DC" as a topic, and saving select metadata for reference for each image.

For our purposes, we've selected to save the following: the resource ID, the title, the image URL, the subjects, the date, the contributor names, the description, the alt text (if available), the collection name (source_collection), and the original format.

**The structure of the JSON for collections with alt text is different from that of collections without alt text, so we will have to extract the metadata slightly differently.**

In [21]:
json_responses_no_alt = [] #A list to store the JSON information from collections with no alt text
json_responses_alt = [] #A list to store the JSON information from collections with alt text

for link in collection_links:
    try:
        response = access_and_store_json(link)
        if response["site_type"] == "collections":
            json_responses_no_alt.append(response)
        elif response["site_type"] == "free-to-use":
            #the free-to-use collection has been specially curated and given alt text
            json_responses_alt.append(response)
    except Exception as error:
        #report which link failed so it can be retried
        print(f"try again: {link} ({error})")


Let's start with the collections with alt text. We'll begin by getting the URL of each image in those collections so we can use them in GET requests for the necessary metadata.

The Free to Use and Reuse collection is specifically curated and has alt text, whereas most other Library collections do not.

As such, we cannot access the image URLs in the same way for both kinds of collection, which is why the previous cell separated the collections with alt text from those without and stored their JSON responses in separate variables.

There are many ways to do this at scale, but some customization will be needed regardless.

We opted to sort the collections via key lookup on `['site_type']`, since only one of these collections has alt text.
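
The key lookup can also be sketched as a small function that additionally keeps track of responses with a missing or unexpected `site_type`, rather than letting them raise a `KeyError`. The function name `split_by_site_type` and the sample dictionaries are made up for illustration; only the two `site_type` values come from the cell above.

```python
def split_by_site_type(responses):
    """Hypothetical helper: separate responses into (no_alt, alt, unrecognized) lists."""
    no_alt, alt, unrecognized = [], [], []
    for response in responses:
        site_type = response.get("site_type")  # None if the key is missing
        if site_type == "collections":
            no_alt.append(response)
        elif site_type == "free-to-use":
            alt.append(response)
        else:
            unrecognized.append(response)
    return no_alt, alt, unrecognized

# Made-up responses standing in for the real JSON.
sample_responses = [
    {"site_type": "collections", "title": "made-up collection"},
    {"site_type": "free-to-use", "title": "made-up curated set"},
    {"title": "made-up response without a site_type"},
]
no_alt, alt, unrecognized = split_by_site_type(sample_responses)
print(len(no_alt), len(alt), len(unrecognized))
```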

In [22]:
#A list of partial URLs extracted from the JSON data, structured like
#the following: '/resource/highsm.12695/'.
unformatted_links = []

#A list of URLs without the JSON filter parameter, but with the root of the URL: 'https://www.loc.gov'.
partially_formatted_links = []

#A list of fully formatted urls, with root and JSON filter applied.
formatted_links = []

#A holding list for using try and except.
error_links = []

#A list of URLs from the data that already had an 'https://' construction and cannot have the
#JSON filter parameter applied.
research_guide_not_image_collection = []

for collection_data in json_responses_alt:
    results = collection_data['pages']
    for result in results:
        get_children = result['children'][1:]
        for group in get_children:
            items = group['set']['items'] #named "items" to avoid shadowing Python's built-in set
            for item in items:
                link = item['link']
                if 'alt' in item.keys():
                    alt = item['alt']
                    #saving alt text for easier retrieval to build the JSON file later
                else:
                    alt = "NA"
                to_append = []
                to_append.append(link)
                to_append.append(alt)
                unformatted_links.append(to_append)

for link in unformatted_links:
    root = 'https://www.loc.gov'
    try:
        if 'https://' not in link[0]:
            y = root + link[0]
            to_append = []
            to_append.append(y)
            to_append.append(link[1])
            partially_formatted_links.append(to_append)
        if 'https://' in link[0]:
            research_guide_not_image_collection.append(link)
    except:
        error_links.append(link)

for link in partially_formatted_links:
    try:
        if '?' not in link[0]:
            x = link[0] + '?fo=json'
            to_append = []
            to_append.append(x)
            to_append.append(link[1])
            formatted_links.append(to_append)
        if '?' in link[0]:
            #Image URL may already have a filter, so we need to append the JSON filter to an existing filter.
            x =  link[0] + '&fo=json'
            to_append = []
            to_append.append(x)
            to_append.append(link[1])
            formatted_links.append(to_append)
    except:
        error_links.append(link)
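
The two formatting loops above can also be expressed with the standard library's `urllib.parse`, which decides between `?` and `&` on its own. This is a minimal sketch under a few assumptions: the helper name `to_json_url` is hypothetical, the `st=image` parameter is a made-up example of an existing filter, and, unlike the cell above, the sketch does not set aside links that already start with `https://`.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

LOC_ROOT = "https://www.loc.gov"

def to_json_url(link, root=LOC_ROOT):
    """Hypothetical helper: make the link absolute and add the fo=json query parameter."""
    absolute = urljoin(root, link)
    parts = urlsplit(absolute)
    # Keep any existing query parameters and add the JSON filter at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("fo", "json"))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(to_json_url("/resource/highsm.12695/"))
print(to_json_url("/resource/highsm.12695/?st=image"))
```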

Now that we have all the links, we can perform GET requests, filter for relevant DC images, and store the relevant metadata.

In the collections with no alt text, we'll be able to do this directly from the JSON we already have, without having to use GET requests on individual images.

In [23]:
#Confirming the structure of json for a single image
image_request = request_link(formatted_links[0][0])
print(image_request['item'].keys())


dict_keys(['_version_', 'access_restricted', 'aka', 'call_number', 'campaigns', 'contributor_names', 'contributors', 'control_number', 'created', 'created_published', 'created_published_date', 'date', 'dates', 'description', 'digital_id', 'digitized', 'display_offsite', 'extract_timestamp', 'extract_urls', 'format', 'format_headings', 'genre', 'group', 'hassegments', 'id', 'image_url', 'index', 'item', 'language', 'languages', 'library_of_congress_control_number', 'link', 'location', 'location_city', 'location_country', 'location_county', 'location_state', 'locations', 'locations_city', 'locations_country', 'locations_county', 'locations_state', 'marc', 'medium', 'medium_brief', 'mime_type', 'modified', 'notes', 'number', 'number_carrier_type', 'number_former_id', 'number_lccn', 'number_source_modified', 'online_format', 'original_format', 'other_control_numbers', 'other_formats', 'other_title', 'partof', 'place', 'related', 'repository', 'reproduction_number', 'reproductions', 'resour

In [24]:
metadata_dict = {} #Our dictionary for storing metadata, to be converted into a JSON file


# counter = 0 #Each image will be given an index for organization purposes -- IN PROGRESS, MIGHT USE LCCN INSTEAD

#We'll limit the number of links we process here for a faster runtime
for i,link in enumerate(formatted_links[:number_of_instances]):
    # if i%10 == 0 and i !=0:
    #     sleep(10)

    try:
        json_results = request_link(link[0])
        #Filter by Washington, D.C. and images that can be freely distributed and used.

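        item_fields = json_results['item']['item']
        #Combine title and location into one string so each place name can be checked against it.
        #(Writing 'D.C.' or 'District of Columbia' in ... is always true, because 'D.C.' is a non-empty string.)
        place_text = f"{item_fields.get('title', '')} {item_fields.get('location', '')}"
        if 'No known restrictions' in item_fields['rights_advisory'] and item_fields['rights_information']:
            if any(place in place_text for place in ('D.C.', 'District of Columbia')):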
                #ID:
                id = json_results['item']['number_lccn']
                #Title:
                title = json_results['item']['title']
                #URL:
                url = json_results['item']['image_url']
                #Subjects:
                subjects = json_results['item']['subject_headings']
                #Date:
                date = json_results['item']['date']
                #Contributors:
                contributors = json_results['item']['contributor_names']
                #Description:
                description = json_results['item']['description']
                #Collection:
                collection = json_results['item']['source_collection']
                #Original Format:
                original_format = json_results['item']['original_format']

                image_name = f"image_{id[0]}.jpg"
                metadata_dict.update({
                    image_name: {
                    "resource_id": id[0],
                    "title": title,
                    "item_url" : url[-1],
                    "subjects": subjects,
                    "dates": date,
                    "contributors": contributors,
                    "description": description,
                    "alt_text": link[1],
                    "collection": collection,
                    "original_format": original_format
                    }
                
                })
                # counter += 1

    except Exception:
        #keep the failed link so it can be inspected or retried later
        error_links.append(link[0])
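
When checking several place names against several fields, note that an expression of the form `'D.C.' or 'District of Columbia' in title` is always truthy in Python, because the non-empty string `'D.C.'` is evaluated on its own. A minimal sketch of a check that tests every term explicitly follows. The helper name `mentions_dc` and the sample titles are made up for illustration.

```python
DC_TERMS = ("D.C.", "District of Columbia")

def mentions_dc(*fields):
    """Hypothetical helper: True if any DC term appears in any of the given fields."""
    # Fields may be strings or lists, so convert each one to text before searching.
    combined = " ".join(str(field) for field in fields if field)
    return any(term in combined for term in DC_TERMS)

print(mentions_dc("Lincoln Memorial, Washington, D.C.", []))             # True
print(mentions_dc("Golden Gate Bridge, San Francisco", ["california"]))  # False

# The pitfall: this is True even though the title does not mention DC.
print(bool("D.C." or "District of Columbia" in "Golden Gate Bridge"))    # True
```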


Now, let's move on to the collections without alt text.

In [25]:
for collection_data in json_responses_no_alt:
    results = collection_data['content']['results']
    for result in results[:number_of_instances]:
        # Again, limiting the data returned for faster processing.

        #Filter by Washington, D.C. and images that can be freely distributed and used.
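        item_fields = result['item']
        #Check the rights phrase in both fields, ignoring case; .get() avoids a KeyError when a field is missing.
        rights_text = f"{item_fields.get('rights_advisory', '')} {item_fields.get('rights_information', '')}"
        place_text = f"{item_fields.get('title', '')} {item_fields.get('notes', '')}"
        if 'no known restrictions' in rights_text.lower():
            if any(place in place_text for place in ('D.C.', 'District of Columbia')):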

                #As the fields of collections can be varied, we have to check which fields are included
                #with certain images
                #ID:
                id = result['number_lccn']
                #Title:
                title = result['title']
                #URL:
                url = result['image_url']
                #Subjects:
                if 'subject' in result.keys(): #check the same key that is read on the next line
                    subjects = result['subject']
                else:
                    subjects = "NA"
                #Date:
                if 'date' in result.keys():
                    date = result['date']
                else:
                    date = "NA"
                #Contributors:
                if 'contributor' in result.keys():
                    contributors = result['contributor']
                else:
                    contributors = "NA"
                #Description:
                if 'description' in result.keys():
                    description = result['description']
                else:
                    description = "NA"
                #Collection:
                if 'partof' in result.keys():
                    collection = result['partof']
                else:
                    collection = "NA"
                #Original Format:
                original_format = result['original_format']
            
                image_name = f"image_{id[0]}.jpg"
                metadata_dict.update({
                    image_name: {
                        "resource_id": id[0],
                        "title": title,
                        "item_url": url[-1],
                        "subjects": subjects,
                        "dates": date,
                        "contributors": contributors,
                        "description": description,
                        #these collections have no alt text; link[1] would reuse a value left over from an earlier loop
                        "alt_text": "NA",
                        "collection": collection,
                        "original_format": original_format
                    }
                })
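
The repeated `if key in result.keys()` checks above can also be written with `dict.get`, which returns a default when a key is missing. The sketch below shows the idea on a made-up result. The helper name `pick_fields` is hypothetical.

```python
def pick_fields(result, field_names, default="NA"):
    """Hypothetical helper: copy the named fields, using a default for any that are missing."""
    return {name: result.get(name, default) for name in field_names}

# Made-up result with no "description" key.
sample_result = {"title": "made-up title", "date": "1920"}
picked = pick_fields(sample_result, ["title", "date", "description"])
print(picked)
```

This keeps the "NA" convention used in the cell above while making sure the key that is checked is always the key that is read.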
                

Before we continue, let's create a folder to store the images after we download them.

In [26]:
import os

#exist_ok=True makes this a no-op if the folder is already there
os.makedirs('image-collection-output', exist_ok=True)

Now, using the metadata dictionary, let's download the images.

In [27]:


for item in metadata_dict:
    try:
        item_data = metadata_dict[item]
        id = item_data["resource_id"]
        image = requests.get(item_data["item_url"])
        image.raise_for_status() #treat HTTP errors as failures
        image_filename = f"image-collection-output/image_{id}.jpg"
        img_bytes_io = BytesIO(image.content)
        #save() returns None, so there is nothing to assign here
        Image.open(img_bytes_io).convert('RGB').save(image_filename)
        print(f"Saved: {item}")
    except Exception as error:
        print(f"Failed to Save: {item} ({error})")

Saved: image_2017686730.jpg
Saved: image_2011630889.jpg
Saved: image_2017687007.jpg
Saved: image_2018703447.jpg
Saved: image_2023632670.jpg
Saved: image_2019689231.jpg
Saved: image_2020742127.jpg
Saved: image_2017879462.jpg
Saved: image_2016826637.jpg
Saved: image_2011631485.jpg
Saved: image_2020742358.jpg
Saved: image_2010648441.jpg
Saved: image_2017661007.jpg
Saved: image_2020714546.jpg
Saved: image_2018698633.jpg
Saved: image_2014633196.jpg
Saved: image_2018700461.jpg
Saved: image_2010630073.jpg
Saved: image_2017647455.jpg
Saved: image_2017702122.jpg
Saved: image_2013630622.jpg
Saved: image_2010637045.jpg
Saved: image_2020721404.jpg
Saved: image_2014630613.jpg
Saved: image_2004661541.jpg
Saved: image_2017732234.jpg
Saved: image_2020734020.jpg
Saved: image_2010630399.jpg
Saved: image_2017882233.jpg
Saved: image_2008680192.jpg
Saved: image_2015652321.jpg
Saved: image_2019708497.jpg
Saved: image_2021643419.jpg
Saved: image_2020732637.jpg
Saved: image_2020733354.jpg
Saved: image_2021638

# Saving the JSON File

Now that we've collected and organized all the metadata into Python dictionaries, we can convert metadata_dict to a JSON file and save it.

In [28]:
with open("items_metadata.json", 'w') as f:
    json.dump(metadata_dict, f, indent=4)
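
To confirm that a file written this way can be read back unchanged, a quick round trip with `json.load` is one option. The sketch below uses a made-up record and a temporary folder so it does not touch the real items_metadata.json.

```python
import json
import os
import tempfile

# Made-up entry shaped like one record in metadata_dict.
sample_metadata = {
    "image_0000000000.jpg": {"resource_id": "0000000000", "title": "made-up title"}
}

sample_path = os.path.join(tempfile.mkdtemp(), "items_metadata.json")
with open(sample_path, "w") as f:
    json.dump(sample_metadata, f, indent=4)

with open(sample_path) as f:
    reloaded = json.load(f)

print(len(reloaded), "items reloaded")
print(reloaded == sample_metadata)
```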

The process is complete!