This script is based on instructions given in [this lesson](https://github.com/HeardLibrary/digital-scholarship/blob/master/code/scrape/pylesson/lesson2-api.ipynb). 

## Import libraries and load API key from file

The API key should be the only item in a text file located in the user's home directory (the code below expects the file to be named `flickr-api-keys-tang-song.txt`). The file should have no trailing newline and should not include the "secret".

In [7]:
from pathlib import Path
import requests
import json
import csv
from time import sleep
import webbrowser

# define some canned functions we need to use

# write a list of dictionaries to a CSV file
def write_dicts_to_csv(table, filename, fieldnames):
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file_object:
        writer = csv.DictWriter(csv_file_object, fieldnames=fieldnames)
        writer.writeheader()
        for row in table:
            writer.writerow(row)

home = Path.home() # path to the user's home directory; works on Windows, macOS, and Linux
key_filename = 'flickr-api-keys-tang-song.txt'
api_key_path = home / key_filename

try:
    with open(api_key_path, 'rt', encoding='utf-8') as file_object:
        api_key = file_object.read().strip() # strip() guards against a stray trailing newline
        # print(api_key) # delete this line once the script is working; don't want the key as part of the notebook
except FileNotFoundError:
    print(key_filename + ' file not found - is it in your home directory?')

## Make a test API call to the account

We need to know the user ID. Go to flickr.com and search for vutheatre. The result is https://www.flickr.com/photos/123262983@N05, which tells us that the ID is 123262983@N05. There are many kinds of searches we can do; a list is [here](https://www.flickr.com/services/api/). One option is `flickr.people.getPhotos` (described [here](https://www.flickr.com/services/api/flickr.people.getPhotos.html)), which returns metadata for an account's photos. The cell below calls the related `flickr.groups.pools.getPhotos` method and passes its ID as `group_id`. These methods don't actually get the photos; they get metadata about them.

The main purpose of this query is to find out the number of photos that are available so that we can know how to set up the next part. The number of photos is in `['photos']['total']`, so we can extract that from the response data.

In [21]:
user_id = '14665661@N20' # ID passed as group_id below (variable name kept from the user-ID example)
endpoint_url = 'https://www.flickr.com/services/rest'
method = 'flickr.groups.pools.getPhotos'
filename = 'miller-metadata.csv'

param_dict = {
    'method' : method,
#    'tags' : 'kangaroo',
#    'extras' : 'url_o',
    'per_page' : '1',  # default is 100, maximum is 500. Use paging to retrieve more than 500.
    'page' : '1',
    'group_id' : user_id,
    'oauth_consumer_key' : api_key,
    'nojsoncallback' : '1', # this parameter causes the API to return actual JSON instead of its weird default string
    'format' : 'json' # overrides the default XML serialization for the search results
    }

metadata_response = requests.get(endpoint_url, params = param_dict)

# print(metadata_response.url) # uncomment this if testing is needed, again don't reveal key in notebook
data = metadata_response.json()
print(json.dumps(data, indent=4))
print()

number_photos = int(data['photos']['total']) # need to convert string to number
print('Number of photos: ', number_photos)

{
    "photos": {
        "page": 1,
        "pages": 205,
        "perpage": 1,
        "total": 205,
        "photo": [
            {
                "id": "52938328347",
                "owner": "30365320@N04",
                "secret": "e3341fdd7a",
                "server": "65535",
                "farm": 66,
                "title": "jiashan \u5047\u5c71",
                "ispublic": 1,
                "isfriend": 0,
                "isfamily": 0,
                "ownername": "tgmill",
                "dateadded": "1685494158"
            }
        ]
    },
    "stat": "ok"
}

Number of photos:  205
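
The lookup `data['photos']['total']` only works when the call succeeded. Below is a minimal sketch of a guard, assuming that a failed call returns `"stat": "fail"` with a `message` field instead of a `photos` object. The helper name `photo_total` and the sample dictionary are hypothetical, not part of the original script.

```python
def photo_total(data):
    """Return the total photo count from a parsed response, or raise if the call failed."""
    if data.get('stat') != 'ok':
        # 'message' is assumed to be present on failures; fall back to a generic text
        raise RuntimeError('Flickr API call failed: ' + str(data.get('message', 'unknown error')))
    return int(data['photos']['total'])  # int() covers both string and numeric totals

# a cut-down copy of the response printed above
ok_response = {'photos': {'page': 1, 'pages': 205, 'perpage': 1, 'total': 205, 'photo': []}, 'stat': 'ok'}
print(photo_total(ok_response))
```

Raising an exception here stops the notebook before the paging loop runs on a response that has no `photos` key.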


## Test to see what kinds of useful metadata we can get

The instructions for the [method](https://www.flickr.com/services/api/flickr.people.getPhotos.html) say what kinds of "extras" you can request metadata about. Let's ask for everything that we care about and don't already know:

`description,license,original_format,date_taken,geo,tags,machine_tags,media,url_t,url_o`

`url_t` is the URL for a thumbnail of the image and `url_o` is the URL to retrieve the original photo. The dimensions of these images will be given automatically when we request the URLs, so we don't need `o_dims`. There isn't any place to request the title, since it's automatically returned.
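
As a rough sketch of how the `extras` value could be assembled without accidental repeats, the list below is joined into the comma-separated string the API expects. The list deliberately contains one repeated entry to show the effect. The second mapping records which response keys the later `extract_data` function reads for some of the extras; it is inferred from that function, not from the API documentation. The names `wanted_extras` and `response_keys` are hypothetical.

```python
wanted_extras = ['description', 'license', 'original_format', 'date_taken', 'original_format',
                 'geo', 'tags', 'machine_tags', 'media', 'url_t', 'url_o']

# dict.fromkeys keeps the first occurrence of each item and preserves order
extras = ','.join(dict.fromkeys(wanted_extras))
print(extras)

# response keys read by extract_data for extras whose key names differ from the request
response_keys = {
    'date_taken': ['datetaken'],
    'original_format': ['originalformat'],
    'geo': ['latitude', 'longitude'],
    'url_o': ['url_o', 'height_o', 'width_o'],
}
```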

In [13]:
param_dict = {
    'method' : method,
    'extras' : 'description,license,original_format,date_taken,geo,tags,machine_tags,media,url_t,url_o',
    'per_page' : '1',  # default is 100, maximum is 500. Use paging to retrieve more than 500.
    'page' : '1',
    'group_id' : user_id,
    'oauth_consumer_key' : api_key,
    'nojsoncallback' : '1', # this parameter causes the API to return actual JSON instead of its weird default string
    'format' : 'json' # overrides the default XML serialization for the search results
    }

metadata_response = requests.get(endpoint_url, params = param_dict)
# print(metadata_response.url) # uncomment this if testing is needed, again don't reveal key in notebook

data = metadata_response.json()
print(json.dumps(data, indent=4))
print()

{
    "photos": {
        "page": 1,
        "pages": 205,
        "perpage": 1,
        "total": 205,
        "photo": [
            {
                "id": "52938328347",
                "owner": "30365320@N04",
                "secret": "e3341fdd7a",
                "server": "65535",
                "farm": 66,
                "title": "jiashan \u5047\u5c71",
                "ispublic": 1,
                "isfriend": 0,
                "isfamily": 0,
                "ownername": "tgmill",
                "dateadded": "1685494158",
                "license": "0",
                "description": {
                    "_content": "jiashan of yellow stone, &quot;Yungang,&quot;  Wangshiyuan (Master of the Nets Garden), Qing dynasty, Suzhou, Jiangsu Province, detail showing interior &quot;grotto&quot;  \u7db2\u5e2b\u5712\u96f2\u5d17\u5047\u5c71\uff0c \u5c40\u90e8\uff0c\u9ec3\u77f3\u4e0b\u6709\u6d1e\uff0c\u6c5f\u8607\u7701\u8607\u5dde\uff08photo: Jin Yinuo \u91d1\u4e00\u8afe, 2023\uff09"
 

## Create and test the function to extract the data we want



In [15]:
def extract_data(photo_number, data):
    photo = data['photos']['photo'][photo_number] # the record for one photo
    dictionary = {} # create an empty dictionary

    # load the response data into a dictionary
    dictionary['id'] = photo['id']
    dictionary['title'] = photo['title']
    dictionary['license'] = photo['license']
    dictionary['description'] = photo['description']['_content']

    # convert the space-separated date format to ISO 8601 dateTime; don't know the time zone - maybe add later?
    temp_time = photo['datetaken']
    dictionary['date_taken'] = temp_time.replace(' ', 'T')

    dictionary['tags'] = photo['tags']
    dictionary['machine_tags'] = photo['machine_tags']
    dictionary['original_format'] = photo['originalformat']
    dictionary['latitude'] = photo['latitude']
    dictionary['longitude'] = photo['longitude']
    dictionary['thumbnail_url'] = photo['url_t']
    dictionary['original_url'] = photo['url_o']
    dictionary['original_height'] = photo['height_o']
    dictionary['original_width'] = photo['width_o']
    
    return dictionary

# test the function with a single row
table = []

photo_number = 0
photo_dictionary = extract_data(photo_number, data)
table.append(photo_dictionary)

# write the data to a file
fieldnames = photo_dictionary.keys() # use the keys from the last dictionary for column headers; assume all are the same
write_dicts_to_csv(table, filename, fieldnames)

print('Done')

Done
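
`extract_data` raises a `KeyError` if any expected key is absent from a photo record. A hedged sketch of a more tolerant variant follows, on the assumption (not confirmed by the output above) that some records may lack fields such as `url_o`. The function name `extract_data_tolerant` and the sample record are made up for illustration, and only a subset of the columns is shown.

```python
def extract_data_tolerant(photo):
    """Build one CSV row from a single photo record, using '' for missing fields."""
    row = {
        'id': photo.get('id', ''),
        'title': photo.get('title', ''),
        'license': photo.get('license', ''),
        # description is nested one level down under '_content'
        'description': photo.get('description', {}).get('_content', ''),
        # replace the space between date and time with 'T' (time zone still unknown)
        'date_taken': photo.get('datetaken', '').replace(' ', 'T'),
        'original_url': photo.get('url_o', ''),
    }
    return row

# a made-up record with no 'url_o' key
sample = {'id': '1', 'title': 'example', 'license': '0',
          'description': {'_content': 'a made-up record'},
          'datetaken': '2023-05-30 10:15:00'}
print(extract_data_tolerant(sample))
```

An empty `original_url` cell then marks a photo that the download step should skip, instead of the whole run stopping part-way.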


## Create the loops to do the paging

Flickr limits the number of photos that can be requested in a single call to 500. When there are more photos than that, we need to request the data 500 photos at a time. (With the 205 photos found above, a single page is enough.)
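
The number of requests needed is the photo count divided by the page size, rounded up. A small worked sketch of that arithmetic (the helper name `page_count` is hypothetical):

```python
def page_count(number_photos, per_page=500):
    """Number of pages needed to cover number_photos at per_page photos per request."""
    return (number_photos + per_page - 1) // per_page  # ceiling division with integers only

for total in (205, 500, 501, 1200):
    print(total, 'photos ->', page_count(total), 'page(s)')
```

Rounding up this way avoids requesting an extra, empty page when the photo count is an exact multiple of the page size.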

In [24]:
per_page = 500   # use 500 for full download, use smaller number like 5 for testing
pages = (number_photos + per_page - 1) // per_page   # ceiling division: counts the final partial page without adding an empty extra page
table = []

for page_number in range(0, pages):
#for page_number in range(0, 1):  # use this to do only one page for testing
    print('retrieving page ', page_number + 1)
    page_string = str(page_number + 1)
    param_dict = {
        'method' : method,
        'extras' : 'description,license,original_format,date_taken,geo,tags,machine_tags,media,url_t,url_o',
        'per_page' : str(per_page),  # default is 100, maximum is 500.
        'page' : page_string,
        'group_id' : user_id,
        'oauth_consumer_key' : api_key,
        'nojsoncallback' : '1', # this parameter causes the API to return actual JSON instead of its weird default string
        'format' : 'json' # overrides the default XML serialization for the search results
        }
    metadata_response = requests.get(endpoint_url, params = param_dict)
    data = metadata_response.json()
#    print(json.dumps(data, indent=4))  # uncomment this line for testing
    
    # len(data['photos']['photo']) is the number of photos for which data was returned
    for image_number in range(0, len(data['photos']['photo'])):
        photo_dictionary = extract_data(image_number, data)
        table.append(photo_dictionary)

    # write the data to a file
    # We could just do this for all the data at the end.
    # But if the search fails in the middle, we will at least get partial results
    fieldnames = photo_dictionary.keys() # use the keys from the last dictionary for column headers; assume all are the same
    write_dicts_to_csv(table, filename, fieldnames)

    sleep(1) # wait a second to avoid getting blocked for hitting the API too rapidly

print('Done')

retrieving page  1
Done


## Download the images to a local directory

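Before running this cell, add an empty column called `image_name` to the CSV file. The loop skips any row where that column is already filled in, so the cell can be rerun after a crash without downloading the same images again.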
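
One way to add that column without opening a spreadsheet is sketched below using only the standard `csv` module. The function name `add_empty_column` is hypothetical, and the sketch assumes the file is small enough to read into memory.

```python
import csv

def add_empty_column(csv_path, column_name):
    """Append column_name (filled with '') to a CSV file unless it already exists."""
    with open(csv_path, 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    if column_name in fieldnames:
        return False  # nothing to do; keeps already-recorded image names intact
    fieldnames.append(column_name)
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        # restval='' fills the new column for every existing row
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        writer.writerows(rows)
    return True
```

With the variables defined earlier, the call would be `add_empty_column(filename, 'image_name')`. It returns `False` and leaves the file alone if the column is already there.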

In [25]:
# Open the CSV file with the saved list of images with empty cells as empty strings
import pandas as pd
import requests
import time

STORAGE_DIR = '/Users/baskausj/Downloads/miller/'

# Read the CSV file
images_df = pd.read_csv(filename, na_filter=False, dtype=str)

# For testing, just use the first two rows
#images_df = images_df.head(2)

# Step through each row and download the image using the URL from the original_url column
for index, row in images_df.iterrows():
    # If the image_name column is not empty, skip this row
    if row['image_name'] != '':
        continue

    # Get the original URL from the original_url column
    url = row['original_url']
    
    # Get the filename from the id column
    image_name = row['id'] + '.' + row['original_format']
    print(image_name)
    
    # Download the image
    r = requests.get(url)

    # Save the image as a bytes file object
    with open(STORAGE_DIR + image_name, 'wb') as f:
        f.write(r.content)

    # Add the image name to a column in the dataframe
    images_df.loc[index, 'image_name'] = image_name

    # Save the dataframe to the same CSV file after each image is downloaded in case the script crashes
    images_df.to_csv(filename, index=False)

    # Wait 1 second before downloading the next image
    time.sleep(1)

    

    



52938328347.jpg
52909252587.jpg
52908806535.jpg
52905463137.jpg
52763297178.jpg
52762775821.jpg
52762312780.jpg
52762312795.jpg
52762157619.jpg
52761361362.jpg
52761361382.jpg
52760165567.jpg
52749647023.jpg
52739995797.jpg
52740739089.jpg
52737347115.jpg
52736784674.jpg
52736527921.jpg
52736761524.jpg
52735974457.jpg
52729746371.jpg
52729641751.jpg
52726033636.jpg
52711796280.jpg
52705293679.jpg
52699855437.jpg
52698644285.jpg
52270792221.jpg
52247309391.jpg
52244240167.jpg
52245212591.jpg
51734052350.jpg
52243339731.jpg
52243231095.jpg
52225051557.jpg
52225990416.jpg
52220742225.jpg
52185457598.jpg
52184610640.jpg
52183425289.jpg
52182248038.jpg
52182252156.png
52182240896.jpg
52182613380.jpg
52182607685.png
52182118053.png
52180012568.jpg
52168682808.png
52166485361.png
52165321614.jpg
52164247456.jpg
52164482254.jpg
52163211072.jpg
52164721640.png
51312063806.jpg
51277669993.jpg
51273946031.jpg
51273155892.jpg
51273831066.jpg
51274177544.jpg
51272659777.jpg
51272443557.jpg
51273884