In [None]:
import requests
import pandas as pd
import os

from PIL import Image
from tqdm.notebook import tqdm

# Download Kew specimens from iDigBio

This notebook downloads all images associated with an [iDigBio](https://www.idigbio.org/portal/search) search.

To run it, you'll first have to generate the search on iDigBio and download the associated `multimedia` and `occurrence` files.

For applications 2 and 3, we used Kew specimens from distinct and similar genera. We found these specimen images with search settings such as:

```
[x] Must have media
Genus: Dendrobium
Institution Code: K
Basis of Record: PreservedSpecimen
```

## Load metadata

We need both the `multimedia` and `occurrence` files from the search.

In [None]:
urls = pd.read_csv("../data/kew-dendrobium-multimedia.csv")
occ = pd.read_csv("../data/kew-dendrobium-occurrence.csv")

We merge the two tables on `coreid` so that each image URL is labelled with its specimen barcode for future reference.

In [None]:
d = urls.merge(occ, on="coreid").loc[:, ["ac:accessURI", "dwc:occurrenceID"]]
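
To make the merge concrete, here is a minimal sketch on made-up data. Only the column names come from the iDigBio files used above; the `coreid` values, URLs and occurrence IDs are invented for illustration, and the `toy_` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the multimedia and occurrence tables (all values invented)
toy_media = pd.DataFrame({
    "coreid": ["a1", "b2"],
    "ac:accessURI": ["http://example.org/1.jpg", "http://example.org/2.jpg"],
})
toy_occ = pd.DataFrame({
    "coreid": ["a1", "b2", "c3"],
    "dwc:occurrenceID": [
        "http://example.org/herbcat/K000001",
        "http://example.org/herbcat/K000002",
        "http://example.org/herbcat/K000003",
    ],
})

# The default inner join keeps only rows whose coreid appears in both tables,
# so the occurrence "c3", which has no media record, is dropped
toy_d = toy_media.merge(toy_occ, on="coreid").loc[:, ["ac:accessURI", "dwc:occurrenceID"]]
print(toy_d)
```

Each remaining row pairs one image URL with the occurrence ID of the specimen it belongs to.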

## Set up download folder

In [None]:
outpath = "../data/kew-images/dendrobium"

In [None]:
# Create the folder (and any missing parent folders); do nothing if it already exists
os.makedirs(outpath, exist_ok=True)

## Download images

We'll loop over the barcodes and URLs, saving each image with its barcode as the filename.

In [None]:
barcodes = d["dwc:occurrenceID"].str.split("/").str[-1].values
urls = d["ac:accessURI"].values
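
As a small illustration of the barcode extraction, the sketch below applies the same string split to invented occurrence IDs and also flags repeated barcodes. The repeat check is an addition rather than part of the original workflow: because the barcode is used as the filename, a specimen with more than one image would map to a single file.

```python
import pandas as pd

# Invented occurrence IDs; the real ones come from the occurrence file
toy_ids = pd.Series([
    "http://example.org/herbcat/K000001",
    "http://example.org/herbcat/K000002",
    "http://example.org/herbcat/K000001",
])

# The barcode is the text after the last "/"
toy_barcodes = toy_ids.str.split("/").str[-1]

# Barcodes that occur more than once would all share one filename
repeated = toy_barcodes[toy_barcodes.duplicated()].unique()
print(list(toy_barcodes))
print(list(repeated))
```

If repeats turn up in real data, one option is to append a counter to the filename, although that changes the naming scheme used here.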

In [None]:
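for barcode, url in tqdm(zip(barcodes, urls), total=len(urls)):
    img_path = os.path.join(outpath, f"{barcode}.jpg")
    # Skip images that are already on disk so the loop can be re-run safely
    if not os.path.exists(img_path):
        response = requests.get(url)
        # Only save successful responses, so error pages are not stored as images
        if response.ok:
            with open(img_path, "wb") as handler:
                handler.write(response.content)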
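
A more defensive version of the download step could be sketched as a helper like the one below. The function name `download_image`, its parameters and the `.part` temporary suffix are all hypothetical choices, not part of the original notebook. It sets a timeout, treats HTTP errors and network exceptions as failures, and writes to a temporary name first so that an interrupted write does not leave a partial file under the final name.

```python
import os
import requests


def download_image(url, img_path, session=None, timeout=30):
    """Save the content at url to img_path; return True on success.

    Hypothetical helper: `session` may be a requests.Session (or any
    object with a compatible .get method); by default the requests
    module itself is used.
    """
    if session is None:
        session = requests
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx as failures
    except requests.RequestException:
        return False
    # Write under a temporary name, then rename once the write is complete
    tmp_path = img_path + ".part"
    with open(tmp_path, "wb") as handler:
        handler.write(response.content)
    os.replace(tmp_path, img_path)
    return True
```

Inside the loop, this would replace the `requests.get` call and the file write, for example `download_image(url, img_path)`, and the returned flag could be collected to list the URLs that failed.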

Some of the images don't download properly, so we'll delete any file that fails verification to avoid problems later.

In [None]:
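img_files = [f for f in os.listdir(outpath) if f.endswith(".jpg")]

for f in tqdm(img_files):
    img_file = os.path.join(outpath, f)
    try:
        # verify() checks the file for corruption without decoding the pixel data;
        # the with block closes the file before we try to delete it
        with Image.open(img_file) as img:
            img.verify()
    except (IOError, SyntaxError):
        # Report and delete files that PIL cannot read
        print(f)
        os.remove(img_file)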
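
To show what the verification step accepts and rejects, here is a small self-contained sketch. The helper name `is_valid_image` and the two temporary files are hypothetical: one file is a tiny real JPEG, and the other is an HTML error page saved with a `.jpg` extension, which is one way a failed download can end up on disk.

```python
import os
import tempfile

from PIL import Image


def is_valid_image(path):
    """Return True if PIL can open and verify the file at path."""
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except (IOError, SyntaxError):
        return False


# Build two temporary test files
tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.jpg")
bad = os.path.join(tmp_dir, "bad.jpg")

Image.new("RGB", (8, 8), "white").save(good)  # a real, tiny JPEG
with open(bad, "wb") as handler:              # not an image at all
    handler.write(b"<html>404 Not Found</html>")

print(is_valid_image(good), is_valid_image(bad))
```

Note that `verify()` does not decode the image data, so some truncated files may still pass; opening the file again and calling `load()` would be a stricter, slower check.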