<a href="https://colab.research.google.com/github/emcdona1/field_classification/blob/master/utilities/image_processing/Download_Image_Files_from_Pteridoportal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Setup -- download necessary files:
1. Go to [The Pteridophyte Collections Consortium](https://www.pteridoportal.org/portal/index.php).
1. In the top menu, navigate to *Search* > *Search Collections*.
1. Click *Select/Deselect All* to deselect everything, then choose your source (e.g. 'Field Museum of Natural History Pteridophyte Collection').
1. Click *Search >* in the upper right.
1. Fill in your search parameters and click *List Display*.
1. In the top right, click the *download button* (it looks like a down arrow into an open box).
1. In the pop-up window, choose the following settings:
    * **Structure:** Darwin Core
    * **Data Extensions:** Keep both boxes checked
    * **File Format:** Comma Delimited (CSV)
    * **Character Set:** ISO-8859-1 (western). **Note:** this setting differs from earlier versions of these instructions, and it must match the `encoding='ISO-8859-1'` used in the code below.
    * **Compression:** Check this box
1. Click *Download Data*.



---
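
The character set chosen in the download dialog matters because the code in this notebook reads both CSV files with `encoding='ISO-8859-1'`. The short sketch below illustrates what goes wrong when the two do not agree. The sample name is made up for illustration and does not come from the portal.

```python
# A sample value containing a non-ASCII character (made up for illustration).
collector = 'M\u00fcller'

# The bytes as they would appear in a file exported as ISO-8859-1.
raw = collector.encode('iso-8859-1')

# Decoding with the matching character set round-trips cleanly.
round_trip = raw.decode('iso-8859-1')

# Decoding the same bytes as UTF-8 fails, because the byte 0xFC is not valid UTF-8.
try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(round_trip == collector, utf8_failed)
```

If collector names or localities look garbled after loading, a mismatch between the export setting and the `encoding` argument is a likely cause.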



### How to download the images:
1.    Press play on the code cell below to set up Google Colab.  (Hover over the `[ ]` and a play button will appear.  While the cell is running, it shows a spinning circle around a stop button.  Once it finishes, you can move on to the next step.)

In [None]:
import io
import os
import shutil
from zipfile import ZipFile

import pandas as pd
import requests
from google.colab import files

2.    Once the ZIP file from the setup steps has downloaded to your computer, press play on the next cell and upload it to Google Colab (click *Choose Files*).

In [None]:
uploaded = files.upload()

---
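
`files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes, and the download cell later reads `occurrences.csv` and `images.csv` from the first uploaded file. As a rough sketch, the upload could be checked before starting a long download. The helper names `missing_members` and `fake_upload` are hypothetical and not part of this notebook.

```python
import io
from zipfile import ZipFile, is_zipfile

# File names that the notebook's code reads from the archive.
REQUIRED_MEMBERS = ('occurrences.csv', 'images.csv')


def missing_members(uploaded, required=REQUIRED_MEMBERS):
    """Return the required files missing from the first uploaded archive.

    `uploaded` is a dict of file name -> bytes, like the result of files.upload().
    """
    if not uploaded:
        raise ValueError('No file was uploaded.')
    first_name = list(uploaded.keys())[0]
    data = io.BytesIO(uploaded[first_name])
    if not is_zipfile(data):
        raise ValueError('%s is not a ZIP file.' % first_name)
    with ZipFile(data, 'r') as zipped:
        present = set(zipped.namelist())
    return [member for member in required if member not in present]


def fake_upload(member_names):
    """Build a stand-in for files.upload() output (for demonstration only)."""
    buffer = io.BytesIO()
    with ZipFile(buffer, 'w') as zipped:
        for member in member_names:
            zipped.writestr(member, 'id\n')
    return {'download.zip': buffer.getvalue()}


# An archive without images.csv is reported as incomplete.
print(missing_members(fake_upload(['occurrences.csv'])))
```

In Colab, calling `missing_members(uploaded)` right after the upload cell should return an empty list when the archive was exported with both data extensions checked.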

2.    Press play on the cell below and type in a name for your dataset.  (Pick something descriptive, since it becomes the name of the final ZIP file.)

In [None]:
name = input('Type in the name of your dataset, then hit Enter: ')


---
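
The download cell lowercases the dataset name and replaces spaces with underscores before using it as the ZIP file name. If a name might contain characters that are awkward in file names, a stricter cleanup along the following lines could be used instead. This is only a sketch: `safe_dataset_name` and its `default` fallback are hypothetical and not part of this notebook.

```python
import re


def safe_dataset_name(raw_name, default='dataset'):
    """Turn free-form input into a conservative file name stem."""
    cleaned = raw_name.strip().lower()
    # Collapse any run of whitespace into a single underscore.
    cleaned = re.sub(r'\s+', '_', cleaned)
    # Keep only letters, digits, underscores, dots, and hyphens.
    cleaned = re.sub(r'[^a-z0-9_.-]', '', cleaned)
    # Avoid names that start or end with punctuation.
    cleaned = cleaned.strip('._-')
    return cleaned or default


print(safe_dataset_name('Field Museum Ferns 2021'))
```

For plain names made of letters, digits, and single spaces, this gives the same result as the notebook's own `name.lower().replace(' ', '_')`.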

3.    Press play on the cell below.  If there are a lot of images, this may take a while.
4.    Scroll to the bottom of the cell's output to follow the progress, which is reported every 25 images.  When the program finishes, it shows the name of the saved ZIP file.

In [None]:
def unpack_zip_file() -> tuple[pd.DataFrame, pd.DataFrame]:
    """Read images.csv and occurrences.csv from the first uploaded ZIP file."""
    uploaded_filename = list(uploaded.keys())[0]
    with ZipFile(io.BytesIO(uploaded[uploaded_filename]), 'r') as zipped:
        # Only the columns needed to download and name the images are loaded.
        occur_bytes = zipped.read('occurrences.csv')
        occur = pd.read_csv(io.BytesIO(occur_bytes), encoding='ISO-8859-1',
                            usecols=['id', 'catalogNumber'])
        images_bytes = zipped.read('images.csv')
        images = pd.read_csv(io.BytesIO(images_bytes), encoding='ISO-8859-1',
                             usecols=['coreid', 'identifier', 'goodQualityAccessURI', 'format'])
    return images, occur


def download_images_from_csv(image_rows, occ_df):
    download_location = 'images'
    os.makedirs(download_location, exist_ok=True)
    # The 'format' column is not needed. Drop exact duplicate rows and rows
    # that have no image URL in either column, then reindex.
    image_rows = image_rows.drop('format', axis=1).drop_duplicates()
    image_rows = image_rows.dropna(subset=['identifier', 'goodQualityAccessURI'], how='all')
    image_rows = image_rows.reset_index(drop=True)
    print('Filtered duplicates.')

    # Map each occurrence id to its catalog number (barcode).
    barcode_dict = occ_df.set_index('id')['catalogNumber'].to_dict()
    not_found = []
    for i in range(len(image_rows)):
        coreid = image_rows.coreid[i]
        # 'unknown' is a placeholder for images with no matching occurrence row.
        barcode = barcode_dict.get(coreid, 'unknown')
        content = None
        # Try the identifier link first; if it fails, try goodQualityAccessURI.
        for image_url in (image_rows.identifier[i], image_rows.goodQualityAccessURI[i]):
            if pd.isna(image_url):
                continue
            try:
                # The 60-second timeout is an arbitrary choice to avoid hanging forever.
                result = requests.get(image_url, timeout=60)
            except requests.exceptions.RequestException:
                continue
            if result.status_code == 200:
                content = result.content
                break
        if content is not None:
            filename = str(barcode) + '_' + str(coreid) + '.jpg'
            with open(os.path.join(download_location, filename), 'wb') as download_image:
                download_image.write(content)
        else:
            not_found.append([barcode, coreid])

        if i > 0 and i % 25 == 0:
            print('%i images processed.' % i)
    print('Finished: %i of %i images downloaded.'
          % (len(image_rows) - len(not_found), len(image_rows)))
    if not_found:
        print('Some images could not be downloaded:')
        print('Barcode\t\tCore ID')
        for row in not_found:
            print(str(row[0]) + '\t\t' + str(row[1]))


images, occur = unpack_zip_file()
download_images_from_csv(images, occur)
name = name.lower().replace(' ', '_')
save_location = shutil.make_archive(name, 'zip', 'images')
print('Your zip file is saved as: %s' % save_location)

---
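
For each image, the download cell above tries the `identifier` URL first and falls back to `goodQualityAccessURI` if that request fails. The same idea can be sketched in isolation, with the network call passed in as a function so the fallback logic can be tried without downloading anything. The names `fetch_first_available` and `fetch_with_requests`, the 60-second timeout, and the `example.org` URLs are assumptions made for illustration.

```python
def fetch_first_available(urls, fetch):
    """Return (url, content) for the first URL that `fetch` can retrieve, else None.

    `fetch(url)` is expected to return the body as bytes, or None on failure.
    """
    for url in urls:
        if not url:
            continue  # skip empty or missing URLs
        content = fetch(url)
        if content is not None:
            return url, content
    return None


def fetch_with_requests(url, timeout=60):
    """Fetch a URL with the requests library; return None on any failure."""
    import requests  # third-party library, already used by this notebook
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        return None
    return response.content if response.status_code == 200 else None


# Offline demonstration: a dict stands in for the image server.
fake_server = {'https://example.org/b.jpg': b'JPEGDATA'}
result = fetch_first_available(
    ['https://example.org/a.jpg', 'https://example.org/b.jpg'],
    fake_server.get,
)
print(result)
```

Against real URLs, `fetch_first_available(urls, fetch_with_requests)` would perform the actual requests. Separating the two pieces keeps the fallback order easy to check on its own.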

5.    In the left menu, click the folder icon (called *Files*).  You will see the ZIP file that was generated.  Right-click the file and choose *Download* to save it to your computer.
6.    Celebrate!