Created: 18 July 2019  
Updated: 31 July 2019

Bulk Downloading Images from iNaturalist
====================================

This notebook outlines a procedure for bulk downloading observation photos from iNaturalist using Python. It is tailored to the GGI-Gardens (Global Genome Initiative Gardens) project, but the same procedure can be adapted to any iNaturalist project.

Individual specimens are recorded in iNaturalist as `observation` records. Each observation record contains taxonomic information, collection event information, etc. Multiple images of the specimen can be attached to each observation as `photo` records. Photo records contain their own data: the time and date the photo was taken, licensing information, etc.

Observation data for collections can be exported from iNaturalist using the Export Observation tool. This notebook uses exported observation data to download the associated images and image metadata via the iNaturalist API.

iNaturalist API
---------------------

The [iNaturalist API](https://www.inaturalist.org/pages/api+reference) includes an endpoint that allows users to retrieve information about an observation with the observation's ID. Among other data, the API will return static URLs for all the photos attached to an observation. 

Here is an example of the API data returned for one observation:

https://www.inaturalist.org/observations/2646623.json

iNaturalist requests that users limit querying rates to no more than 60 queries/minute. The code in this notebook stays within that limit by pausing for one second after each request.
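
As a rough illustration of the rate limit, 60 queries per minute works out to at least one second between requests. The sketch below is not part of the original workflow: `Throttle` and its parameters are hypothetical names chosen for this example. It waits only as long as needed to keep consecutive calls spaced apart.

```python
import time


class Throttle:
    """Space out calls so that no more than `max_per_minute` happen per minute.

    Hypothetical helper for illustration. `clock` and `sleep` are injectable
    so the logic can be checked without actually waiting.
    """

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # seconds between calls
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `wait()` on a single `Throttle` instance before each request would be one alternative to a fixed pause, since time already spent on the request itself counts toward the interval.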

Download Observation Data
------------------------------------------

1. Log in to iNaturalist.


2. Once logged in, you should be able to use this link to get to the **Export Observations** page for the Global Genome Initiative Gardens project: https://www.inaturalist.org/observations/export?projects%5B%5D=global-genome-initiative-gardens (Alternatively, click the **Projects** tab, then **Global Genome Initiative Gardens**, then scroll down and click **Export Observations**.)


3. In Section 1 (Create a Query), find the **Filter** options and select the box for **w/ photos**. 


4. In Section 3 (Choose columns), ensure that the **id** box in the **Basic** section is selected.


5. Add additional filters and select/deselect additional columns based on the observation data you would like to export.


6. Scroll down to the bottom and click **Create export**. You can either wait a few minutes for the report to generate, or choose to receive a notification email.  


7. Once the report is generated, download it and extract the .csv file from the zip archive. Rename the extracted file `observations.csv`.
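
The last step above can also be done in Python. This is a minimal sketch, assuming the downloaded archive contains a single `.csv` file; the function name `extract_observations` and the archive file name in the usage note are hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def extract_observations(zip_path, dest="observations.csv"):
    """Copy the first .csv member of the export archive to `dest`."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_members = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_members:
            raise ValueError("No .csv file found in {}".format(zip_path))
        # Stream the member straight to the new file name
        with archive.open(csv_members[0]) as src, open(dest, "wb") as out:
            shutil.copyfileobj(src, out)
    return Path(dest)
```

For example, `extract_observations("export.zip")` would write `observations.csv` to the working directory, where `export.zip` stands in for whatever name the downloaded archive has.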

Retrieve image URLs and metadata
---------------------------------------------------

Import the requisite Python libraries.

In [1]:
import os, requests, time, urllib.request
import pandas as pd

Read `observations.csv` and print its column names to confirm that the `id` column is present.

In [2]:
try:
    observations = pd.read_csv('observations.csv')
    print(observations.columns)
except FileNotFoundError:
    print("Could not find observations.csv.")

Index(['id', 'observed_on_string', 'observed_on', 'time_observed_at',
       'time_zone', 'out_of_range', 'user_id', 'user_login', 'created_at',
       'updated_at', 'quality_grade', 'license', 'url', 'image_url',
       'sound_url', 'tag_list', 'description', 'id_please',
       'num_identification_agreements', 'num_identification_disagreements',
       'captive_cultivated', 'oauth_application_id', 'place_guess', 'latitude',
       'longitude', 'positional_accuracy', 'private_place_guess',
       'private_latitude', 'private_longitude', 'private_positional_accuracy',
       'geoprivacy', 'taxon_geoprivacy', 'coordinates_obscured',
       'positioning_method', 'positioning_device', 'species_guess',
       'scientific_name', 'common_name', 'iconic_taxon_name', 'taxon_id'],
      dtype='object')


Extract a list of all observation `id`s.

In [3]:
obs_ids = list(observations['id'])
print("Number of observations: {}".format(len(obs_ids)))

Number of observations: 1032


For each observation, use the API to retrieve the associated image data. 

In [4]:
# Demo run: only the first 30 observations. Use obs_ids to process them all.
sample_ids = obs_ids[:30]
print("Retrieving photo data for {} observations".format(len(sample_ids)))
images = []
obs_counter = 0
for idno in sample_ids:
    url = "https://www.inaturalist.org/observations/{}.json".format(idno)
    r = requests.get(url).json()
    photos = r.get('observation_photos', [])
    for photo in photos:
        # Flatten the nested photo record and keep the parent observation ID
        photo_data = {'observation_id': photo.get('observation_id', None)}
        photo_data.update(photo.get('photo', {}))
        # Assumes the original-size URL differs from the large URL only in the size name
        photo_data['original_size_url'] = photo_data.get('large_url', '').replace('large', 'original')
        images.append(photo_data)
    time.sleep(1) # Rate limit 1 request per second
    
    obs_counter += 1
    if obs_counter % 10 == 0:
        print("{} observations processed, {} total photo data retrieved".format(obs_counter, len(images)))

Retrieving photo data for 30 observations
10 observations processed, 32 total photo data retrieved
20 observations processed, 61 total photo data retrieved
30 observations processed, 98 total photo data retrieved


In [5]:
images[0]

{'observation_id': 2646623,
 'id': 2970895,
 'square_url': 'https://static.inaturalist.org/photos/2970895/square.jpg?1454686317',
 'thumb_url': 'https://static.inaturalist.org/photos/2970895/thumb.jpg?1454686317',
 'small_url': 'https://static.inaturalist.org/photos/2970895/small.jpg?1454686317',
 'medium_url': 'https://static.inaturalist.org/photos/2970895/medium.jpg?1454686317',
 'large_url': 'https://static.inaturalist.org/photos/2970895/large.jpg?1454686317',
 'created_at': '2016-02-05T15:31:41.383Z',
 'updated_at': '2016-02-05T15:32:02.023Z',
 'native_page_url': 'http://www.inaturalist.org/photos/2970895',
 'native_username': 'morgangostel',
 'license': 0,
 'subtype': None,
 'native_original_image_url': None,
 'license_code': 'C',
 'attribution': '(c) Morgan Gostel, all rights reserved',
 'license_name': 'Copyright',
 'license_url': 'http://en.wikipedia.org/wiki/Copyright',
 'type': 'LocalPhoto',
 'original_size_url': 'https://static.inaturalist.org/photos/2970895/original.jpg?145
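
The `original_size_url` value is not returned by the API; it is derived from `large_url` by swapping the size name. A plain string replacement would also alter any other occurrence of `large` in the URL, so a slightly stricter variant may be worth considering. The sketch below is an illustration only (the function name `original_url_from` is hypothetical) and assumes the file name in the URL path has the form `large.<ext>`, as in the record above.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def original_url_from(large_url):
    """Return the URL with the file stem `large` swapped for `original`.

    Returns None when the URL path does not end in `large.<ext>`.
    """
    parts = urlsplit(large_url)
    directory, filename = posixpath.split(parts.path)
    stem, ext = posixpath.splitext(filename)
    if stem != "large":
        return None
    new_path = posixpath.join(directory, "original" + ext)
    # The query string (the cache-busting number) is carried over unchanged
    return urlunsplit(parts._replace(path=new_path))


print(original_url_from(
    "https://static.inaturalist.org/photos/2970895/large.jpg?1454686317"))
```

Returning `None` for URLs that do not match the pattern makes it easy to skip such photos later instead of requesting a URL that may not exist.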

Convert the retrieved image data to a Pandas DataFrame, keeping a subset of the columns and renaming some of them for clarity.

In [6]:
keep_columns = [
    'observation_id', 'id', 'created_at', 'updated_at', 
    'native_page_url', 'native_username', 'license', 'subtype', 
    'native_original_image_url', 
    'license_code', 'attribution', 'license_name', 'license_url', 
    'type', 'original_size_url'
]

rename_columns = {
    "id": "photo_id", 
    "native_page_url": "inaturalist_page_url",
    "native_username": "inaturalist_username", 
    "native_original_image_url": "original_image_url",
    "original_size_url": "image_url"
}

images_df = pd.DataFrame(images)
images_df = images_df[keep_columns]
images_df = images_df.rename(columns=rename_columns)

Export image metadata as a comma-delimited (CSV) file.

In [7]:
images_df.to_csv('image_metadata.csv', index=False)
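
`to_csv` writes comma-separated values by default. If a tab-delimited file is preferred, the `sep` argument can be set instead. The sketch below uses a made-up one-row DataFrame and a hypothetical file name, `image_metadata.tsv`, purely for illustration.

```python
import pandas as pd

# Made-up stand-in for images_df, for illustration only
example_df = pd.DataFrame(
    [{"observation_id": 1, "photo_id": 10, "license_code": "C"}]
)

# sep="\t" switches the delimiter from commas to tabs
example_df.to_csv("image_metadata.tsv", sep="\t", index=False)

# Read the file back with the same delimiter
round_trip = pd.read_csv("image_metadata.tsv", sep="\t")
```

Whichever delimiter is used, the same `sep` value has to be passed when the file is read back.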

Download image files
--------------------------------

Make an `images` directory in the working directory to store photos, if one does not already exist.

In [8]:
try:
    os.mkdir('images')
    print("Created images directory.")
except FileExistsError:
    print('Images directory already exists.')

Created images directory.


For every photo retrieved, download the photo in its original size to the `images` folder.

In [10]:
image_counter = 0  # Counts images that were actually downloaded
print("Retrieving {} images".format(len(images)))
for image in images:
    obs_id = image.get('observation_id', 'Unknown')
    photo_id = image.get('id', 'Unknown')
    image_name = "images/observation{}_image{}.jpg".format(obs_id, photo_id)
    image_url = image.get('original_size_url', None)
    if not image_url:
        continue  # No URL for this photo, so nothing to download
    urllib.request.urlretrieve(image_url, image_name)
    
    image_counter += 1
    if image_counter % 10 == 0:
        print("Retrieved {} of {} images".format(image_counter, len(images)))
    time.sleep(1) # Rate limit 1 request per second
print("Completed. {} of {} images have been retrieved".format(image_counter, len(images)))

Retrieving 98 images
Retrieved 10 of 98 images
Retrieved 20 of 98 images
Retrieved 30 of 98 images
Retrieved 40 of 98 images
Retrieved 50 of 98 images
Retrieved 60 of 98 images
Retrieved 70 of 98 images
Retrieved 80 of 98 images
Retrieved 90 of 98 images
Completed. 98 of 98 images have been retrieved
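
For larger projects, a download run may be interrupted partway through. One possible way to make the step resumable is to skip any image whose file already exists. The sketch below is an illustration, not the notebook's method: `download_missing`, `fetch`, and `pause` are hypothetical names, and `fetch` defaults to the same `urllib.request.urlretrieve` call used above.

```python
import os
import time
import urllib.request


def download_missing(images, folder="images",
                     fetch=urllib.request.urlretrieve, pause=1.0):
    """Download each image unless its target file already exists.

    `fetch(url, path)` is a parameter so the function can be exercised
    without network access. Returns (downloaded, skipped) counts.
    """
    os.makedirs(folder, exist_ok=True)
    downloaded = skipped = 0
    for image in images:
        url = image.get("original_size_url")
        if not url:
            skipped += 1  # no URL recorded for this photo
            continue
        name = "observation{}_image{}.jpg".format(
            image.get("observation_id", "Unknown"), image.get("id", "Unknown"))
        path = os.path.join(folder, name)
        if os.path.exists(path):
            skipped += 1  # already on disk from an earlier run
            continue
        fetch(url, path)
        downloaded += 1
        time.sleep(pause)  # keep to roughly one request per second
    return downloaded, skipped
```

Running `download_missing(images)` a second time would then fetch only the files that are still missing.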


Match images to observation data and image metadata
-------------------------------------------------------------------------------

Downloaded images are named in the format `observation<OBSERVATION ID>_image<PHOTO ID>.jpg`.

To match images to observation data, use the observation ID number of the image (the first number in the file name) and the `id` column of `observations.csv`.

To match images to image metadata, use the photo ID number of the image (the second number in the file name) and the `photo_id` column of `image_metadata.csv`.

To match observation data to image metadata, use the `id` column of `observations.csv` and the `observation_id` column of `image_metadata.csv`.
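
The matching rules above can be sketched with pandas and a regular expression. The two small DataFrames below are made-up stand-ins for `observations.csv` and `image_metadata.csv`, and the names `example_observations`, `example_metadata`, and `ids_from_filename` are hypothetical.

```python
import re

import pandas as pd

# Made-up stand-ins for observations.csv and image_metadata.csv
example_observations = pd.DataFrame(
    {"id": [101, 102], "scientific_name": ["Quercus alba", "Acer rubrum"]}
)
example_metadata = pd.DataFrame(
    {"observation_id": [101, 101, 102], "photo_id": [1, 2, 3]}
)

# One row per photo, with the parent observation's columns attached
merged = example_metadata.merge(
    example_observations, left_on="observation_id", right_on="id", how="left"
)

FILENAME_PATTERN = re.compile(r"observation(\d+)_image(\d+)\.jpg$")


def ids_from_filename(filename):
    """Return (observation_id, photo_id) parsed from a downloaded image name."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError("Unexpected file name: {}".format(filename))
    return int(match.group(1)), int(match.group(2))
```

A left merge keeps every photo row even if its observation is missing from the export, which makes gaps easy to spot as empty observation columns.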