# Downloading Original Data

This notebook is provided for reproducibility and historical reference.

Domino employees setting up this project on a new Domino instance should download the images from the relevant Domino S3 bucket, as described in `Download-Workshop-Data.ipynb`, rather than from the original source as this notebook does.

## Overview 

This information is pulled from http://lila.science/datasets/snapshot-serengeti

The data set that we are pulling images from contains approximately 2.65M sequences of camera trap images, totaling 7.1M images, from the Snapshot Safari network:

*Using the same camera trapping protocols at every site, Snapshot Safari members are collecting standardized data from many protected areas in Africa, which allows for cross-site comparisons to assess the efficacy of conservation and restoration programs.*

Labels are provided for 61 categories, primarily at the species level (for example, the most common labels are wildebeest, zebra, and Thomson’s gazelle). Approximately 76% of images are labeled as empty. A full list of species and associated image counts is available on the LILA page linked above.

## Download Guidelines
This notebook roughly follows the guidelines provided at http://lila.science/image-access

All files and images will be downloaded into the default [Domino Dataset](https://docs.dominodatalab.com/en/latest/user_guide/0a8d11/datasets-overview/) location, assuming this notebook is run inside a Domino project.
Adjust the paths as needed if running this notebook elsewhere.

In [1]:
import os
import json
import pandas as pd

In [2]:
DATASET_ROOT_PATH = f"/domino/datasets/local/{os.environ['DOMINO_PROJECT_NAME']}"
ORIGINAL_DATA_PATH = os.path.join(DATASET_ROOT_PATH, 'original_data')
# Create the target folder (and any missing parents) if it does not exist yet
os.makedirs(ORIGINAL_DATA_PATH, exist_ok=True)
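
When running this notebook outside a Domino project, `DOMINO_PROJECT_NAME` is not set and the path setup above raises a `KeyError`. A minimal sketch of one way to adjust the paths is shown here; the helper name `resolve_dataset_root` and the `./data` fallback are assumptions for illustration, not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_dataset_root(env=os.environ, fallback="./data"):
    """Return the Domino dataset folder when inside a Domino project, else a local fallback.

    `fallback` is a hypothetical default; pick any writable folder.
    """
    project = env.get("DOMINO_PROJECT_NAME")
    if project:
        return Path("/domino/datasets/local") / project
    return Path(fallback)
```

With this helper, `DATASET_ROOT_PATH = resolve_dataset_root()` could replace the f-string above, and the rest of the notebook would stay unchanged.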

In [3]:
!echo "Downloading to $ORIGINAL_DATA_PATH"

Downloading to /domino/datasets/local/Deep-Learning-Tutorial/original_data


## Download and process metadata
The season 11 metadata file provided at https://lilablobssc.blob.core.windows.net/snapshotserengeti-v-2-0/SnapshotSerengetiS11.json.zip contains the information needed to download a selection of files from the dataset.

### Download metadata file and inspect contents

In [4]:
!wget "https://lilablobssc.blob.core.windows.net/snapshotserengeti-v-2-0/SnapshotSerengetiS11.json.zip" -P $ORIGINAL_DATA_PATH

--2022-08-20 22:31:01--  https://lilablobssc.blob.core.windows.net/snapshotserengeti-v-2-0/SnapshotSerengetiS11.json.zip
Resolving lilablobssc.blob.core.windows.net (lilablobssc.blob.core.windows.net)... 52.239.159.84
Connecting to lilablobssc.blob.core.windows.net (lilablobssc.blob.core.windows.net)|52.239.159.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24463908 (23M) [application/x-zip-compressed]
Saving to: ‘/domino/datasets/local/Deep-Learning-Tutorial/original_data/SnapshotSerengetiS11.json.zip’


2022-08-20 22:31:03 (19.2 MB/s) - ‘/domino/datasets/local/Deep-Learning-Tutorial/original_data/SnapshotSerengetiS11.json.zip’ saved [24463908/24463908]



In [5]:
!unzip $ORIGINAL_DATA_PATH/"SnapshotSerengetiS11.json.zip" -d $ORIGINAL_DATA_PATH

Archive:  /domino/datasets/local/Deep-Learning-Tutorial/original_data/SnapshotSerengetiS11.json.zip
  inflating: /domino/datasets/local/Deep-Learning-Tutorial/original_data/SnapshotSerengetiS11.json  


In [6]:
# Note the unzipped file
!ls $ORIGINAL_DATA_PATH

images	SnapshotSerengetiS11.json  SnapshotSerengetiS11.json.zip
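
The `wget` and `unzip` shell steps above could also be done in pure Python, which may help where those tools are not installed. This is a sketch using only the standard library; the function names `download_file` and `extract_zip` are hypothetical helpers, not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url, dest_dir):
    """Download `url` into `dest_dir` and return the local path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last URL segment as the local file name
    local_path = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(local_path))
    return local_path


def extract_zip(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling `extract_zip(download_file(url, ORIGINAL_DATA_PATH), ORIGINAL_DATA_PATH)` with the metadata URL would mirror the two shell cells.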


In [7]:
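# Use a context manager so the file handle is closed after loading
with open(os.path.join(ORIGINAL_DATA_PATH, "SnapshotSerengetiS11.json")) as metadata_file:
    metadata = json.load(metadata_file)
metadata.keys()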

dict_keys(['info', 'categories', 'images', 'annotations'])

In [8]:
metadata['info']

{'version': '2.0',
 'description': 'Camera trap data from the Snapshot Serengeti program, season 11',
 'date_created': '2019',
 'contributor': 'University of Minnesota Lion Center'}

In [9]:
print(f"There are {len(metadata['images'])} images total. Sample image entry:")
metadata['images'][0]

There are 499401 images total. Sample image entry:


{'id': 'SER_S11/L13/L13_R2/SER_S11_L13_R2_IMAG4128',
 'file_name': 'SER_S11/L13/L13_R2/SER_S11_L13_R2_IMAG4128.JPG',
 'frame_num': 3,
 'seq_id': 'SER_S11#L13#2#1377',
 'width': 2560,
 'height': 1920,
 'corrupt': False,
 'location': 'L13',
 'seq_num_frames': 3,
 'datetime': '2015-09-29 11:23:01'}

In [10]:
all_images = pd.DataFrame.from_dict(metadata['images'])
all_images.head()

Unnamed: 0,id,file_name,frame_num,seq_id,width,height,corrupt,location,seq_num_frames,datetime
0,SER_S11/L13/L13_R2/SER_S11_L13_R2_IMAG4128,SER_S11/L13/L13_R2/SER_S11_L13_R2_IMAG4128.JPG,3,SER_S11#L13#2#1377,2560,1920,False,L13,3,2015-09-29 11:23:01
1,SER_S11/M09/M09_R1/SER_S11_M09_R1_IMAG3285,SER_S11/M09/M09_R1/SER_S11_M09_R1_IMAG3285.JPG,1,SER_S11#M09#1#1213,2560,1920,False,M09,3,2015-08-10 17:49:13
2,SER_S11/M11/M11_R1/SER_S11_M11_R1_IMAG3321,SER_S11/M11/M11_R1/SER_S11_M11_R1_IMAG3321.JPG,2,SER_S11#M11#1#1111,2560,1920,False,M11,3,2015-07-20 11:03:54
3,SER_S11/O07/O07_R1/SER_S11_O07_R1_IMAG08918,SER_S11/O07/O07_R1/SER_S11_O07_R1_IMAG08918.JPG,2,SER_S11#O07#1#3062,2048,1536,False,O07,3,2015-09-12 17:56:02
4,SER_S11/M11/M11_R1/SER_S11_M11_R1_IMAG2478,SER_S11/M11/M11_R1/SER_S11_M11_R1_IMAG2478.JPG,1,SER_S11#M11#1#829,2560,1920,False,M11,3,2015-07-19 14:05:09


In [11]:
print(f"There are {len(metadata['annotations'])} annotations total. Sample annotation entry:")
metadata['annotations'][0]

There are 506455 annotations total. Sample annotation entry:


{'sequence_level_annotation': True,
 'id': '53b021e9-955a-11e9-98aa-000d3a198845',
 'category_id': 0,
 'seq_id': 'SER_S11#B03#1#1',
 'season': 'SER_S11',
 'datetime': '2015-08-05 10:41:29',
 'subject_id': 18837479,
 'count': nan,
 'standing': 0.07,
 'resting': 0.0,
 'moving': 0.07,
 'interacting': 0.0,
 'young_present': 0.0,
 'image_id': 'SER_S11/B03/B03_R1/SER_S11_B03_R1_IMAG0001',
 'location': 'B03'}

In [12]:
all_annotations = pd.DataFrame.from_dict(metadata['annotations'])
all_annotations.head()

Unnamed: 0,sequence_level_annotation,id,category_id,seq_id,season,datetime,subject_id,count,standing,resting,moving,interacting,young_present,image_id,location
0,True,53b021e9-955a-11e9-98aa-000d3a198845,0,SER_S11#B03#1#1,SER_S11,2015-08-05 10:41:29,18837479,,0.07,0.0,0.07,0.0,0.0,SER_S11/B03/B03_R1/SER_S11_B03_R1_IMAG0001,B03
1,True,53b021ea-955a-11e9-a2b7-000d3a198845,0,SER_S11#B03#1#2,SER_S11,2015-08-05 10:41:40,18837481,,,,,,,SER_S11/B03/B03_R1/SER_S11_B03_R1_IMAG0002,B03
2,True,53b021eb-955a-11e9-8e4f-000d3a198845,0,SER_S11#B03#1#3,SER_S11,2015-08-05 10:41:45,18837482,,,,,,,SER_S11/B03/B03_R1/SER_S11_B03_R1_IMAG0003,B03
3,True,53b021ec-955a-11e9-a5bb-000d3a198845,1,SER_S11#B03#1#4,SER_S11,2015-08-05 10:42:11,18837483,,0.62,0.0,0.38,0.23,0.0,SER_S11/B03/B03_R1/SER_S11_B03_R1_IMAG0004,B03
4,True,53b021ed-955a-11e9-96a3-000d3a198845,1,SER_S11#B03#1#4,SER_S11,2015-08-05 10:42:11,18837483,,0.62,0.0,0.38,0.23,0.0,SER_S11/B03/B03_R1/SER_S11_B03_R1_IMAG0005,B03
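
The annotation count (506455) is somewhat higher than the image count (499401), which suggests that some images carry more than one annotation. The sketch below shows one way to check this; it uses a small toy table rather than the real data, and the toy values are made up for illustration.

```python
import pandas as pd

# Toy annotations; the real table would be `all_annotations`
toy_annotations = pd.DataFrame({
    "image_id": ["img_a", "img_b", "img_b", "img_c"],
    "category_id": [0, 5, 18, 7],
})

# Number of annotations attached to each image
annotations_per_image = toy_annotations.groupby("image_id").size()

# Images that carry more than one annotation
multi_label_images = annotations_per_image[annotations_per_image > 1].index.tolist()
```

Running the same `groupby` on `all_annotations` would show how many images are affected in the real metadata.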


In [13]:
print(f"There are {len(metadata['categories'])} categories total. Sample categories entry:")
metadata['categories'][0]

There are 61 categories total. Sample categories entry:


{'id': 0, 'name': 'empty'}

In [14]:
all_categories = pd.DataFrame.from_dict(metadata['categories'])
all_categories.set_index('id', inplace=True)
all_categories.head()

id,name
0,empty
1,human
2,gazellegrants
3,reedbuck
4,dikdik


### Pick out 4 categories of interest

We will pick out four of the categories to use in our sample.

In [15]:
our_categories = {
    category: all_categories.index[all_categories['name'] == category].values[0]
    for category in ['gazellethomsons', 'giraffe', 'wildebeest', 'zebra']
}
our_categories

{'gazellethomsons': 7, 'giraffe': 13, 'wildebeest': 18, 'zebra': 5}
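
The lookup above raises a bare `IndexError` if a category name is misspelled, because `.values[0]` is taken from an empty array. As a sketch of a more explicit alternative, a plain name-to-id dictionary can report which names are missing. The helper `select_categories` is hypothetical, and the toy list holds only a few of the real categories.

```python
# A few entries in the same shape as metadata['categories']
toy_categories = [
    {"id": 0, "name": "empty"},
    {"id": 5, "name": "zebra"},
    {"id": 7, "name": "gazellethomsons"},
]

# Map each category name to its id
name_to_id = {category["name"]: category["id"] for category in toy_categories}


def select_categories(names, lookup):
    """Return {name: id} for the requested names, naming any that are unknown."""
    missing = [name for name in names if name not in lookup]
    if missing:
        raise KeyError(f"Unknown category names: {missing}")
    return {name: lookup[name] for name in names}
```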

In [16]:
our_annotations = all_annotations[all_annotations['category_id'].isin(our_categories.values())]
print(f"There are {len(our_annotations)} annotations matching our 4 categories of interest.")

There are 112264 annotations matching our 4 categories of interest.


In [17]:
our_annotations.value_counts('category_id')

category_id
7     64706
18    26308
5     17533
13     3717
dtype: int64

### Take a balanced sample and save metadata files for our reduced list

Most of the work happens in the first cell below, which chains pandas operations inside parentheses so that each step can sit on its own line.
Note the fixed random seed (`random_state=42`) in the sampling step, which makes the sample reproducible.

In [18]:
our_metadata = (
    our_annotations
    .groupby('category_id')
    .apply(pd.DataFrame.sample, n=1000, random_state=42)
    .reset_index(drop=True)
    [['category_id', 'image_id']]
    .merge(all_images, left_on = 'image_id', right_on = 'id')
    [['category_id', 'file_name', 'location', 'datetime']]
    .merge(
        (
            pd.DataFrame.from_dict(our_categories, orient='index', columns = ['category_id'])
            .reset_index()
            .rename({'index': 'category_name'}, axis=1)
        ),
        on='category_id'
    )
    .drop('category_id', axis=1)
    .rename({'file_name': 'file_path'}, axis=1)
)
our_metadata['file_name'] = our_metadata['file_path'].apply(os.path.basename)
our_metadata.head()

Unnamed: 0,file_path,location,datetime,category_name,file_name
0,SER_S11/D03/D03_R2/SER_S11_D03_R2_IMAG0047.JPG,D03,2015-10-16 13:27:27,zebra,SER_S11_D03_R2_IMAG0047.JPG
1,SER_S11/C02/C02_R1/SER_S11_C02_R1_IMAG0616.JPG,C02,2015-09-09 18:53:19,zebra,SER_S11_C02_R1_IMAG0616.JPG
2,SER_S11/C12/C12_R1/SER_S11_C12_R1_IMAG0538.JPG,C12,2015-10-05 17:59:57,zebra,SER_S11_C12_R1_IMAG0538.JPG
3,SER_S11/P11/P11_R2/SER_S11_P11_R2_IMAG1279.JPG,P11,2015-11-21 20:00:52,zebra,SER_S11_P11_R2_IMAG1279.JPG
4,SER_S11/E04/E04_R1/SER_S11_E04_R1_IMAG0516.JPG,E04,2015-09-07 08:53:21,zebra,SER_S11_E04_R1_IMAG0516.JPG


In [19]:
# Make sure this came out like we expect, 1000 of each
our_metadata.value_counts('category_name')

category_name
zebra              1000
wildebeest         1000
giraffe            1000
gazellethomsons    1000
dtype: int64
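
The balanced-sampling step can be illustrated in isolation on toy data. This sketch assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available, and adds a guard (not in the original notebook) for the case where a class has fewer rows than requested.

```python
import pandas as pd

# Toy table with unequal class sizes: 6, 10, and 4 rows
toy = pd.DataFrame({
    "category_id": [5] * 6 + [7] * 10 + [13] * 4,
    "image_id": [f"img_{i}" for i in range(20)],
})

n_per_class = 3

# Sampling without replacement fails if any class is smaller than n_per_class
smallest_class = toy["category_id"].value_counts().min()
if n_per_class > smallest_class:
    raise ValueError(f"Smallest class has only {smallest_class} rows")

# Draw the same number of rows from every class, with a fixed seed
balanced = toy.groupby("category_id").sample(n=n_per_class, random_state=42)
counts = balanced["category_id"].value_counts().to_dict()
```

In the real data the smallest of the four classes has 3717 annotations, so sampling 1000 per class is safe.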

In [20]:
# file_path is only needed for downloading; it is kept in this file so the download section can read it back
reduced_metadata_path = os.path.join(ORIGINAL_DATA_PATH, 'reduced_metadata.csv')
our_metadata.to_csv(
    reduced_metadata_path, 
    index=False
)

In [21]:
!head $ORIGINAL_DATA_PATH/'reduced_metadata.csv'

file_path,location,datetime,category_name,file_name
SER_S11/D03/D03_R2/SER_S11_D03_R2_IMAG0047.JPG,D03,2015-10-16 13:27:27,zebra,SER_S11_D03_R2_IMAG0047.JPG
SER_S11/C02/C02_R1/SER_S11_C02_R1_IMAG0616.JPG,C02,2015-09-09 18:53:19,zebra,SER_S11_C02_R1_IMAG0616.JPG
SER_S11/C12/C12_R1/SER_S11_C12_R1_IMAG0538.JPG,C12,2015-10-05 17:59:57,zebra,SER_S11_C12_R1_IMAG0538.JPG
SER_S11/P11/P11_R2/SER_S11_P11_R2_IMAG1279.JPG,P11,2015-11-21 20:00:52,zebra,SER_S11_P11_R2_IMAG1279.JPG
SER_S11/E04/E04_R1/SER_S11_E04_R1_IMAG0516.JPG,E04,2015-09-07 08:53:21,zebra,SER_S11_E04_R1_IMAG0516.JPG
SER_S11/H01/H01_R1/SER_S11_H01_R1_IMAG0172.JPG,H01,2015-08-08 14:23:33,zebra,SER_S11_H01_R1_IMAG0172.JPG
SER_S11/F04/F04_R1/SER_S11_F04_R1_IMAG0146.JPG,F04,2015-08-14 18:42:27,zebra,SER_S11_F04_R1_IMAG0146.JPG
SER_S11/E04/E04_R1/SER_S11_E04_R1_IMAG0386.JPG,E04,2015-08-29 13:24:07,zebra,SER_S11_E04_R1_IMAG0386.JPG
SER_S11/K03/K03_R1/SER_S11_K03_R1_IMAG1514.JPG,K03,2015-09-15 18:34:30,zebra,SER_S11_K03_R1_IMAG1514.JPG
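
Because the download section saves each image under its file name alone, two rows with the same `file_name` would map to the same local file. A quick check for this is sketched below on toy data; `find_duplicate_file_names` is a hypothetical helper, and whether the real sample contains such duplicates is not shown in this notebook.

```python
import pandas as pd


def find_duplicate_file_names(metadata):
    """Return all rows whose file_name occurs more than once."""
    mask = metadata["file_name"].duplicated(keep=False)
    return metadata[mask]


# Toy metadata: two different paths share the base name x.JPG
toy_metadata = pd.DataFrame({
    "file_path": ["A/x.JPG", "B/x.JPG", "A/y.JPG"],
    "file_name": ["x.JPG", "x.JPG", "y.JPG"],
})
duplicates = find_duplicate_file_names(toy_metadata)
```

An empty result from `find_duplicate_file_names(our_metadata)` would indicate that every sampled row maps to a distinct local file.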


## Download individual images

### Download azcopy

In [22]:
!wget "https://aka.ms/downloadazcopy-v10-linux" -P $ORIGINAL_DATA_PATH

--2022-08-20 22:31:57--  https://aka.ms/downloadazcopy-v10-linux
Resolving aka.ms (aka.ms)... 104.86.5.150
Connecting to aka.ms (aka.ms)|104.86.5.150|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://azcopyvnext.azureedge.net/release20220721/azcopy_linux_amd64_10.16.0.tar.gz [following]
--2022-08-20 22:31:58--  https://azcopyvnext.azureedge.net/release20220721/azcopy_linux_amd64_10.16.0.tar.gz
Resolving azcopyvnext.azureedge.net (azcopyvnext.azureedge.net)... 23.213.34.169, 23.213.34.191, 2600:1405:800::6864:a831, ...
Connecting to azcopyvnext.azureedge.net (azcopyvnext.azureedge.net)|23.213.34.169|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12590201 (12M) [application/gzip]
Saving to: ‘/domino/datasets/local/Deep-Learning-Tutorial/original_data/downloadazcopy-v10-linux’


2022-08-20 22:31:58 (109 MB/s) - ‘/domino/datasets/local/Deep-Learning-Tutorial/original_data/downloadazcopy-v10-linux’ saved [12590201/1

In [23]:
!cd $ORIGINAL_DATA_PATH; tar -xvf $ORIGINAL_DATA_PATH/downloadazcopy-v10-linux

azcopy_linux_amd64_10.16.0/
azcopy_linux_amd64_10.16.0/NOTICE.txt
azcopy_linux_amd64_10.16.0/azcopy


### Download the files
This is the main step.
Note that the source URL uses the full path, while the destination path uses only the file name.
This avoids reproducing the nested folder structure locally and keeps things a bit neater.

The naive download loop is slow but simple, and it skips files that already exist, so it can be restarted if interrupted.
Even if the kernel is restarted, the notebook can be resumed from this section.

In [1]:
# Repeat necessary stuff from earlier sections - be careful if any of these paths get edited
import os
import pandas as pd

DATASET_ROOT_PATH = f"/domino/datasets/local/{os.environ['DOMINO_PROJECT_NAME']}"
ORIGINAL_DATA_PATH = os.path.join(DATASET_ROOT_PATH, 'original_data')
reduced_metadata_path = os.path.join(ORIGINAL_DATA_PATH, 'reduced_metadata.csv')

In [2]:
AZCOPY_PATH = f"{ORIGINAL_DATA_PATH}/azcopy_linux_amd64_10.16.0/azcopy"
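
The path above hard-codes the azcopy version that was current when the notebook was run (10.16.0). If the download link serves a newer release, the extracted folder name will differ. A sketch of locating the binary by pattern instead is shown here; `find_azcopy` is a hypothetical helper, and it assumes the extracted folder keeps the `azcopy_linux_amd64_*` naming seen above.

```python
import glob
import os


def find_azcopy(base_dir):
    """Return the path of an extracted azcopy binary under `base_dir`."""
    pattern = os.path.join(base_dir, "azcopy_linux_amd64_*", "azcopy")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No azcopy binary found matching {pattern}")
    # If several versions are present, take the last one in lexicographic order
    # (this is not a true version comparison)
    return matches[-1]
```

`AZCOPY_PATH = find_azcopy(ORIGINAL_DATA_PATH)` would then work regardless of the exact version number.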

In [3]:
IMAGES_FOLDER = os.path.join(ORIGINAL_DATA_PATH, 'images')
# Create the images folder if it does not exist yet
os.makedirs(IMAGES_FOLDER, exist_ok=True)

In [4]:
our_metadata_for_downloads = pd.read_csv(reduced_metadata_path)
N_IMAGES = len(our_metadata_for_downloads)
our_metadata_for_downloads.head()

Unnamed: 0,file_path,location,datetime,category_name,file_name
0,SER_S11/D03/D03_R2/SER_S11_D03_R2_IMAG0047.JPG,D03,2015-10-16 13:27:27,zebra,SER_S11_D03_R2_IMAG0047.JPG
1,SER_S11/C02/C02_R1/SER_S11_C02_R1_IMAG0616.JPG,C02,2015-09-09 18:53:19,zebra,SER_S11_C02_R1_IMAG0616.JPG
2,SER_S11/C12/C12_R1/SER_S11_C12_R1_IMAG0538.JPG,C12,2015-10-05 17:59:57,zebra,SER_S11_C12_R1_IMAG0538.JPG
3,SER_S11/P11/P11_R2/SER_S11_P11_R2_IMAG1279.JPG,P11,2015-11-21 20:00:52,zebra,SER_S11_P11_R2_IMAG1279.JPG
4,SER_S11/E04/E04_R1/SER_S11_E04_R1_IMAG0516.JPG,E04,2015-09-07 08:53:21,zebra,SER_S11_E04_R1_IMAG0516.JPG


In [5]:
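import subprocess

def download_image(metadata_row):
    """Download one image with azcopy unless it already exists locally."""
    url = f"https://lilablobssc.blob.core.windows.net/snapshotserengeti-unzipped/{metadata_row['file_path']}"
    dest_path = os.path.join(IMAGES_FOLDER, metadata_row['file_name'])
    print(f"Downloading {metadata_row.name + 1}/{N_IMAGES} ({metadata_row['file_name']})\r", end='')
    # Skip files that are already present so an interrupted run can be resumed
    if not os.path.isfile(dest_path):
        # Pass arguments as a list so no shell quoting is needed
        subprocess.run([AZCOPY_PATH, "cp", url, dest_path], check=False)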
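
Calling `download_image` once per row runs the downloads one at a time. If throughput matters, a thread pool is one possible way to overlap them. The sketch below is generic: `run_in_parallel` and its `max_workers` default are assumptions, and whether the storage host tolerates many concurrent requests has not been verified here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(download_one, items, max_workers=8):
    """Call download_one(item) for every item on a thread pool.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, items))
```

For this notebook, the items could be the rows themselves, for example `run_in_parallel(download_image, [row for _, row in our_metadata_for_downloads.iterrows()])`. The progress messages would then interleave, since several rows print at once.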

In [6]:
our_metadata_for_downloads.apply(download_image, axis=1)
print('\nFINISHED') 

Downloading 4000/4000 (SER_S11_O07_R1_IMAG02515.JPG)
FINISHED
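
The download function does not check the exit status of azcopy, so the `FINISHED` message alone does not guarantee that every file arrived. A small check like the following sketch can list any files that are still missing; `find_missing_files` is a hypothetical helper, not part of the original notebook.

```python
import os


def find_missing_files(file_names, images_folder):
    """Return the expected file names that are not present in `images_folder`."""
    present = set(os.listdir(images_folder))
    return [name for name in file_names if name not in present]
```

Calling `find_missing_files(our_metadata_for_downloads['file_name'], IMAGES_FOLDER)` should return an empty list after a complete run. If it does not, re-running the download cell retries only the missing files.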
