# I. Import Models and Libraries From "helper.ipynb"

Before we get started, let's import some libraries and modules that that are specific to this step in the workflow. We'll also import the libraries and modules important to the overall process from workflow_helpers.py.

In [135]:

"""general imports"""
from PIL import Image
from io import BytesIO

"""import functions from helper.py"""
from workflow_helpers import *

# Load in Collection CSV and Access Collection Links

In this step, we'll assign the filepath of the collection data we want to work with to a variable. For simplicity, we;ll call this variable "file".

In [136]:
file = "jfp-collections-starter-collections.csv"

Use the imported read_in_collection_csv_for_links() function from workflow_helpers.py to read in the CSV.

Please note that you may use any data storage format that you are comfortable with, such as JSON. However, our helper function is written based on the data storage format that our team is most likely to start with.

In [137]:
collection = read_in_collection_csv_for_links(file)

Now that we've read in the CSV, we can manipulate the data to extract the links of the Library's collections that we want to work with.

Let's iterate through the data to become familar with it, verify that it is in a format we can work with, and that it has all the information we expect.

If your collection data is larger than what we have here, you can limit the output from the collection by tweaking our code as shown below, where 'n' is the number of items you want to return from the collection list:

```

for row in collection[:n]:
    print(row)

```

In [138]:
for row in collection:
    print(row)

['\ufeffCollection Name', 'Collection Link', 'Notes', 'Objects of Interest']
['National Photo Company Collection', 'https://www.loc.gov/collections/national-photo-company/', 'Filter by restriction: "right_information" and "rights_advisory"', 'People, Animals, Landmarks, Vehicles']
['Highsmith (Carol M.) Archive', 'https://www.loc.gov/collections/carol-m-highsmith/', 'Would need to be filtered by place since it captures projects outside of D.C. Given how Highsmith organized/labeled these images, we could (for the most part) safely do a "if D.C. in XXX:" from the "subject_headings" field or "title" field (the latter seems easier as it\'s less nested).\n\nFilter by restriction: "right_information" and "rights_advisory"', 'Landscape, Landmarks, Roads']
['Free to Use', 'https://www.loc.gov/free-to-use/', 'Filter for D.C. by "title" and "subhect_headings"\nFilter by restriction: "right_information" and "rights_advisory"\n\n', 'Animals, People']


Now that we're familiar with the collection data, let's use indexing to access the collection URL in each  row. We'll assign these links to a list variable called "collection_links."

To append each link without the CSV header, we'll iterate through our "collection" list variable from row 1, where row 0 is the header row.

In [139]:
collection_links = []

for link in collection[1:]:
    collection_links.append(link[1])


### Do some test prints to test the data structure.

Now that we have the links stored in "collection_links," let's do a test print to verify that we're getting the information we want.

In [140]:
"""Keep in mind the actual data structure: a list of strings..."""
# print(collection_links)

"""but, for easier viewing:"""
for link in collection_links:
    print(link)

https://www.loc.gov/collections/national-photo-company/
https://www.loc.gov/collections/carol-m-highsmith/
https://www.loc.gov/free-to-use/


Now that the data structure is tested and verified, use the access_and_store_json() function from workflow_helpers.py to loop through the "collection_links" variable and access the Library's API. Load this data as a JSON and store it in a variable for easy access and reuse.

In [143]:
json_response = []

# for collection in collection_links[:2]:
json = access_and_store_json(collection_links[2])
json_response.append(json)

"""Do test print to verify data structure. The output will be a list of dictionaries."""
        # print(json)

'Do test print to verify data structure. The output will be a list of dictionaries.'

From the previous test print, we can see that the data structure for collections with alt-text and collections without alt-text vary, so we will need to account for this as we access the image_urls for download.

Let's test the data structure of the JSON response by accessing it's keys to get started.

In [144]:
for key in json_response:
    print(key.keys())

dict_keys(['breadcrumbs', 'content', 'content_is_post', 'description', 'expert_resources', 'next', 'next_sibling', 'options', 'pages', 'portal', 'previous', 'previous_sibling', 'site_type', 'timestamp', 'title', 'type'])


The Free to Use and Reuse collection is specifically curated and has alt-text, whereas most other Library collections do not.

As such, we cannot access the image URLs in the same way. We need to find a way to separate collections with alt-text and those without, and store their JSON responses into separate variables.

There are many ways to do this at scale, but some customization will be needed regardless.

We opted to sort the collection via key lookup, using ``` ['site_type'] ``` as a way to sort the collections since only one of these collections have alt-text.

In [154]:
"""Store json data with alt-text field."""
alt_json = []

"""Store json data without alt-text field."""
no_alt_json = []

# for link in collection_links[2]:
try:
    response = access_and_store_json(link)
    if response["site_type"] == "collections":
        no_alt_json.append(response)
    elif response["site_type"] == "free-to-use":
        alt_json.append(response)
except:
    print('try again')

"""You can some test prints here to verify that the information was sorted and appended according to
your expectations."""
for x in alt_json:
    print(x)

for y in no_alt_json:
    print(y)

try again


Now that we've found a way to seperate the different data structures, let's initialize the list that we'll be storing the links from each collection.

In [146]:
urls = []

Let's write a script that captures the image URL from each data structure and assign it to a variable.

Let's start by getting the image URLs from the "alt_json" data, which necessitates a more legnthy and custom script.

Unlike the collections in "no_alt_json" which has the downloadable image URL listed, we must perform multiple iterations on the collection in "alt_json." In other words, we must perform multiple get requests.

In [155]:
"""A list of fully formatted urls, with root and JSON filter applied."""
formatted_links = []


for collection_data in alt_json:
    results = collection_data['pages']
    for result in results:
        get_children = result['children'][1:]
        for group in get_children:
            set = group['set']['items']
            for item in set:
                link = item['link']
                root = 'https://www.loc.gov'
                if 'https://' not in link:
                    partially_formatted_link = root + link
                    if '?' not in partially_formatted_link:
                        x = partially_formatted_link + '?fo=json'
                        formatted_links.append(x)
                    elif '?' in partially_formatted_link:
                        """Image URL may already have a filter, so we need to append the JSON filter to a existing filter."""
                        x =  partially_formatted_link + '&fo=json'
                        formatted_links.append(x)
                else:
                    continue

In [156]:
print(len(formatted_links))

0


In [45]:
# import pickle

# class PickledResponse:
#     def __init__(self, resource_id, title, url, subject_list, date, contributor_names, locations, collection_name, alt_text=None, collection_set=None):
#         self.dict = {
#             f'image_{resource_id}.jpg': {
#                 'resource_id': resource_id,
#                 'title': title,
#                 'url': url,
#                 'subject_list': subject_list,
#                 'date': date,
#                 'alt_text': alt_text,
#                 'contributor_names': contributor_names,
#                 'locations': locations,
#                 'collection_name': collection_name,
#                 'set': collection_set
#             }
#         }


#     def save_to_file(self, filename):
#         filename = os.path.join('response_pickles', filename)
#         with open(filename, 'wb') as file:
#             pickle.dump(self, file)

#     @staticmethod
#     def load_from_file(filename):
#         filename = os.path.join('response_pickles', filename)
#         with open(filename, 'rb') as file:
#             return pickle.load(file)


In [151]:
from time import sleep
def main_dict(url, main_dictionary):
    try: 
        json_results = request_link(url)
        item_data = json_results['item']['item']
        rights = json_results['item']['item'].get('rights_advisory', None)

        if rights == None:
            pass

        else:

            if 'No known restrictions' in json_results['item']['item']['rights_advisory'] or 'No known restrictions' in json_results['item']['item']['rights_information']:
                if 'D.C.' or 'District of Columbia' in json_results['item']['item']['title'] or json_results['item']['item']['location']:
                    image_url = json_results['item']['image_url'][-1]
                    title = item_data['title']
                    resource_id = item_data['id']
                    subjects = item_data.get('Subject', 'None') 
                    date = item_data.get('date', 'None')  # Need to specify the creation date type
                    locations = json_results.get('locations', 'None') 
                    contributors = item_data.get('contributors', 'None') 
                    collection_name = item_data.get('source_collection', 'None') 

                    # Cannot find set or alt_text, setting as 'None'
                    main_dictionary[f'image_{resource_id}.jpg'] = {
                        'resource_id': resource_id,
                        'title': title,
                        'url': url,
                        'image_url': image_url,
                        'subject_list': subjects,
                        'date': date,
                        'alt_text': 'None',
                        'contributor_names': contributors,
                        'locations': locations,
                        'collection_name': collection_name,
                        'set': 'None'
                    }
    except Exception as e:
        return


image_metadata = {}

for i,link in enumerate(formatted_links[:100]):
    if i%10 == 0 and i != 0:
        print('Resting in compliance with Rate limits', i)
        sleep(10)
    main_dict(link, image_metadata)

Resting in compliance with Rate limits 10
Resting in compliance with Rate limits 20
Resting in compliance with Rate limits 30
Resting in compliance with Rate limits 40
Resting in compliance with Rate limits 50
Resting in compliance with Rate limits 60
Resting in compliance with Rate limits 70
Resting in compliance with Rate limits 80
Resting in compliance with Rate limits 90


In [152]:
print(len(image_metadata))

49


In [None]:
# object_1 = PickledResponse(resource_id,title,image_url,subjects,date,contributors,locations,collection_name)
# object_1.load_from_file('pickle_1.bin').dict

In [153]:
import json
with open('request_sample_bulk', 'w') as json_file:
    json.dump(image_metadata, json_file)

In [None]:

for image in formatted_links[:20]:
    """Limit the data returned for faster processing."""

    try:
        json_results = request_link(image)
        """Filter by Washington, D.C. and images that can be freely distributed and used."""

        if 'No known restrictions' in json_results['item']['item']['rights_advisory'] and json_results['item']['item']['rights_information']:
            if 'D.C.' or 'District of Columbia' in json_results['item']['item']['title'] or json_results['item']['item']['location']:
                results = json_results['item']['image_url'][-1]
                """Note that [-1] here allows us to get the highest image resolution at all times.
                The amount of options for image resolution varies. The Library for consistently puts the
                highest resolution at the end of the list."""

                urls.append(results)
    except:
        error_links.append(results)

Now, let's write a script to get the URL from the "no_alt_json" data. This script will be more straightforward and scalable.

In [102]:
for collection_data in no_alt_json:
    results = collection_data['content']['results']
    for result in results[:20]:
        """Limit the data returned for faster processing."""

        """Filter by Washington, D.C. and images that can be freely distributed and used."""
        if 'No Known restrictions' in result['item']['rights_advisory'] or result['item']['rights_information']:
            if 'D.C.' or 'District of Columbia' in result['item']['title'] or result['item']['notes'][:]:
                image_url = result['image_url'][-1]
                """Note that [-1] here allows us to get the highest image resolution at all times.
                The amount of options for image resolution varies. The Library for consistently puts the
                highest resolution at the end of the list."""

                urls.append(image_url)

Now, let's verify that the "urls" variable has the information we want.

In [103]:
for x in urls[:5]:
    print(x)

https://tile.loc.gov/image-services/iiif/service:pnp:highsm:36200:36247/full/pct:50/0/default.jpg#h=2812&w=2450
https://tile.loc.gov/storage-services/service/pnp/ppmsca/34500/34513v.jpg#h=1024&w=628
https://tile.loc.gov/storage-services/service/pnp/pga/01600/01637v.jpg#h=693&w=1024
https://tile.loc.gov/image-services/iiif/service:pnp:highsm:12600:12695/full/pct:50/0/default.jpg#h=2145&w=1710
https://tile.loc.gov/image-services/iiif/service:pnp:highsm:36500:36524/full/pct:25/0/default.jpg#h=1448&w=2172


Before we continue, let's create a folder to store the images after we download them.

In [104]:
if not os.path.exists('image-collection-output'):
    os.mkdir('image-collection-output')

Now, let's download the images.

In [105]:
for index, image in enumerate(urls):
        try:
                print(image)
                image_to_detect = requests.get(image)
                image_filename = f"image-collection-output/image_{index + 1}.jpg"
                img_bytes_io = BytesIO(image_to_detect.content)
                converted_file = Image.open(img_bytes_io).convert('RGB').save(image_filename)
        except:
                print(image)

https://tile.loc.gov/image-services/iiif/service:pnp:highsm:36200:36247/full/pct:50/0/default.jpg#h=2812&w=2450
https://tile.loc.gov/storage-services/service/pnp/ppmsca/34500/34513v.jpg#h=1024&w=628
https://tile.loc.gov/storage-services/service/pnp/pga/01600/01637v.jpg#h=693&w=1024
https://tile.loc.gov/image-services/iiif/service:pnp:highsm:12600:12695/full/pct:50/0/default.jpg#h=2145&w=1710
https://tile.loc.gov/image-services/iiif/service:pnp:highsm:36500:36524/full/pct:25/0/default.jpg#h=1448&w=2172
https://tile.loc.gov/storage-services/service/pnp/highsm/55500/55524v.jpg#h=683&w=1024
https://tile.loc.gov/storage-services/service/pnp/ppmsca/85300/85335v.jpg#h=996&w=1024
https://tile.loc.gov/storage-services/service/pnp/ppmsca/85600/85616v.jpg#h=832&w=1024
https://tile.loc.gov/storage-services/service/pnp/highsm/56800/56807v.jpg#h=683&w=1024
https://tile.loc.gov/storage-services/service/pnp/ppmsca/13400/13482v.jpg#h=1024&w=699
https://tile.loc.gov/storage-services/service/pnp/ppmsc/00