## Overview 

This information is pulled from http://lila.science/datasets/snapshot-serengeti

The data set that we are pulling images from contains approximately 2.65M sequences of camera trap images, totaling 7.1M images, from the Snapshot Safari network:

*Using the same camera trapping protocols at every site, Snapshot Safari members are collecting standardized data from many protected areas in Africa, which allows for cross-site comparisons to assess the efficacy of conservation and restoration programs.*

Labels are provided for 61 categories, primarily at the species level (for example, the most common labels are wildebeest, zebra, and Thomson’s gazelle). Approximately 76% of images are labeled as empty. You can find a full list of species and associated image counts at their website.

### Metadata
This project includes the two metadata files in CSV format. The code that uses those files to download images is adapted from the LILA instructions, found here:
http://lila.science/image-access

These files are `SnapshotSerengeti_v2_1_annotations.csv` and `SnapshotSerengeti_v2_1_images.csv`, which we will explore below.

In [1]:
import pandas as pd

In [2]:
#read in the doc with the names of all images in the set and format the image name
all_images = pd.read_csv('/mnt/SnapshotSerengeti_v2_1_images.csv', index_col=0)
all_images['image_name'] = all_images['image_path_rel'].str.split('/').str[-1]
all_images.head()


Unnamed: 0,capture_id,image_rank_in_capture,image_path_rel,image_name
0,SER_S1#B04#1#1,1,S1/B04/B04_R1/S1_B04_R1_PICT0001.JPG,S1_B04_R1_PICT0001.JPG
1,SER_S1#B04#1#2,1,S1/B04/B04_R1/S1_B04_R1_PICT0002.JPG,S1_B04_R1_PICT0002.JPG
2,SER_S1#B04#1#3,1,S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG,S1_B04_R1_PICT0003.JPG
3,SER_S1#B04#1#4,1,S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG,S1_B04_R1_PICT0004.JPG
4,SER_S1#B04#1#5,1,S1/B04/B04_R1/S1_B04_R1_PICT0005.JPG,S1_B04_R1_PICT0005.JPG


In [3]:
#read in annotations file
annotations = pd.read_csv('/mnt/SnapshotSerengeti_v2_1_annotations.csv', index_col=0, low_memory=False)
annotations.head()

Unnamed: 0,capture_id,season,site,roll,capture,capture_date_local,capture_time_local,subject_id,question__species,question__count_max,...,question__count_min,question__standing,question__resting,question__moving,question__eating,question__interacting,question__young_present,p_users_identified_this_species,pielous_evenness_index,question__horns_visible
0,SER_S1#B04#1#1,S1,B04,1,1,2010-07-18,16:26:14,ASG0002kjh,human,2.0,...,1.0,0.62,0.06,0.0,0.0,0.5,0.0,1.0,0.0,
1,SER_S1#B04#1#2,S1,B04,1,2,2010-07-18,16:26:30,ASG0002kji,human,2.0,...,1.0,0.1,0.62,0.0,0.05,0.33,0.0,1.0,0.0,
2,SER_S1#B04#1#3,S1,B04,1,3,2010-07-20,06:14:06,ASG0002kjj,blank,,...,,,,,,,,1.0,0.0,
3,SER_S1#B04#1#4,S1,B04,1,4,2010-07-22,08:56:06,ASG0002kjk,blank,,...,,,,,,,,1.0,0.0,
4,SER_S1#B04#1#5,S1,B04,1,5,2010-07-24,01:16:28,ASG0002kjl,blank,,...,,,,,,,,1.0,0.0,
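
As a quick way to explore the annotations, the share of each label can be tabulated with `value_counts`. The sketch below is illustrative only: the `sample` frame is a toy stand-in holding the `question__species` values from the five rows shown above, and the shares it prints describe those five captures, not the full data set.

```python
import pandas as pd

# Toy stand-in (assumption): the question__species values from annotations.head() above.
sample = pd.DataFrame(
    {"question__species": ["human", "human", "blank", "blank", "blank"]}
)

# Fraction of captures carrying each label; on the real file this would be
# annotations["question__species"].value_counts(normalize=True).
shares = sample["question__species"].value_counts(normalize=True)
print(shares)
```

Note that the annotations file has one row per capture (sequence), so shares computed this way are per capture rather than per image.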


### Downloading Data

To download a whole data set from the Labeled Information Library of Alexandria (LILA) without using the giant zip files from the browser, you can use AzCopy with the Serengeti URL: https://lilablobssc.blob.core.windows.net/snapshotserengeti-unzipped?st=2020-01-01T00%3A00%3A00Z&se=2034-01-01T00%3A00%3A00Z&sp=rl&sv=2019-07-07&sr=c&sig=/DGPd%2B9WGFt6HgkemDFpo2n0M1htEXvTq9WoHlaH7L4%3D

To download the entire data set to the folder c:\myfolder, first download and install AzCopy, then run the following in a terminal:

`azcopy cp "https://lilablobssc.blob.core.windows.net/snapshotserengeti-unzipped?st=2020-01-01T00%3A00%3A00Z&se=2034-01-01T00%3A00%3A00Z&sp=rl&sv=2019-07-07&sr=c&sig=/DGPd%2B9WGFt6HgkemDFpo2n0M1htEXvTq9WoHlaH7L4%3D" "c:\myfolder" --recursive`

**Note that this is a very large data set; make sure you have enough storage space if you decide to do this.**


### Downloading images from a list of file names
If you want to download specific images, e.g., all the images for a particular species from a data set, this is supported too, but it requires a little code. 

We used AzCopy. Note that this option is 'not-officially-supported' and could theoretically cease to exist. See AzCopy: Listing specific files to transfer.

First, make a text file of the image names that you would like to download. See the `1-Data-Prep` notebook for how we generated our list. Assuming that list is saved as `images_for_dataset.txt` and you would like to save the images in the folder `images`, you would run the following in a terminal:

`azcopy cp "https://lilablobssc.blob.core.windows.net/snapshotserengeti-unzipped?st=2020-01-01T00%3A00%3A00Z&se=2034-01-01T00%3A00%3A00Z&sp=rl&sv=2019-07-07&sr=c&sig=/DGPd%2B9WGFt6HgkemDFpo2n0M1htEXvTq9WoHlaH7L4%3D" "/images" --list-of-files images_for_dataset.txt`
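
For the "all the images for a particular species" case, one way to build such a list is to filter the annotations by `question__species` and look up the matching `capture_id` values in the images file. This is a minimal sketch, not the method used in `1-Data-Prep`: the function name `paths_for_species` is hypothetical, and the two small data frames are toy stand-ins that reuse a few rows and the real column names from the metadata files.

```python
import pandas as pd

# Toy stand-ins (assumption) for the annotations and images metadata files.
annotations = pd.DataFrame({
    "capture_id": ["SER_S1#B04#1#1", "SER_S1#B04#1#3", "SER_S1#R10#2#86"],
    "question__species": ["human", "blank", "gazellethomsons"],
})
all_images = pd.DataFrame({
    "capture_id": ["SER_S1#B04#1#1", "SER_S1#B04#1#3", "SER_S1#R10#2#86"],
    "image_path_rel": [
        "S1/B04/B04_R1/S1_B04_R1_PICT0001.JPG",
        "S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG",
        "S1/R10/R10_R2/S1_R10_R2_PICT0255.JPG",
    ],
})


def paths_for_species(annotations, images, species):
    """Return the relative paths of every image in captures labeled `species`."""
    wanted = annotations.loc[
        annotations["question__species"] == species, "capture_id"
    ]
    # A capture can contain several images, so isin() may match multiple rows.
    matched = images[images["capture_id"].isin(wanted)]
    return matched["image_path_rel"].tolist()


paths = paths_for_species(annotations, all_images, "gazellethomsons")

# One relative path per line, which is the layout the list file needs.
with open("images_for_dataset.txt", "w") as f:
    f.write("\n".join(paths))
```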

We also used the following bash script to flatten the nested directories inside the `images` folder (run it from that folder's parent directory):

```
shopt -s dotglob
for d in images/*/
do
        find "$d" -type f -exec mv -i -t "$d" {} +
        find "$d" -mindepth 1 -type d -delete
done
```
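
If bash is not available, roughly the same flattening can be sketched in Python. This is an approximation of the script above, not a drop-in replacement: the function name `flatten_subdirs` is hypothetical, and where `mv -i` would prompt before overwriting, this version raises an error instead. The demo runs on a throwaway temporary directory rather than on real data.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_subdirs(root):
    """Move every file nested under each top-level folder of `root` into that folder."""
    for top in Path(root).iterdir():
        if not top.is_dir():
            continue
        # Collect files first so the tree is not modified while it is being walked.
        nested_files = [p for p in top.rglob("*") if p.is_file() and p.parent != top]
        for src in nested_files:
            target = top / src.name
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            shutil.move(str(src), str(target))
        # Remove the now-empty subdirectories, deepest first.
        subdirs = [p for p in top.rglob("*") if p.is_dir()]
        for d in sorted(subdirs, key=lambda p: len(p.parts), reverse=True):
            d.rmdir()


# Demo on a temporary directory that mimics the downloaded layout.
demo_root = Path(tempfile.mkdtemp())
nested_dir = demo_root / "S1" / "B04" / "B04_R1"
nested_dir.mkdir(parents=True)
(nested_dir / "S1_B04_R1_PICT0001.JPG").write_text("placeholder")

flatten_subdirs(demo_root)
print(sorted(p.relative_to(demo_root).as_posix() for p in demo_root.rglob("*")))
```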

### Downloading images for this tutorial

The `1-Data-Prep` notebook was used to explore and clean the data, and the labels for the resulting images were saved as `labels_reduced_classes.csv`.



In [16]:
#import labels for cleaned images
small_data = pd.read_csv('/mnt/labels_reduced_classes.csv')
print('There are {} images in this dataset'.format(small_data.shape[0]))
small_data.head()

There are 4000 images in this dataset


Unnamed: 0,image_name,question__species
0,S10_P05_R2_IMAG1102.JPG,gazellethomsons
1,S9_M06_R4_IMAG3581.JPG,gazellethomsons
2,S2_O07_R3_IMAG2997.JPG,gazellethomsons
3,S1_R10_R2_PICT0255.JPG,gazellethomsons
4,S9_C11_R1_IMAG0489.JPG,gazellethomsons


In [17]:
small_data_names = small_data.merge(all_images, on = 'image_name', how='inner')
small_data_names.head()

Unnamed: 0,image_name,question__species,capture_id,image_rank_in_capture,image_path_rel
0,S10_P05_R2_IMAG1102.JPG,gazellethomsons,SER_S10#P05#2#420,1,S10/P05/P05_R2/S10_P05_R2_IMAG1102.JPG
1,S9_M06_R4_IMAG3581.JPG,gazellethomsons,SER_S9#M06#4#1282,2,S9/M06/M06_R4/S9_M06_R4_IMAG3581.JPG
2,S2_O07_R3_IMAG2997.JPG,gazellethomsons,SER_S2#O07#3#1022,2,S2/O07/O07_R3/S2_O07_R3_IMAG2997.JPG
3,S1_R10_R2_PICT0255.JPG,gazellethomsons,SER_S1#R10#2#86,2,S1/R10/R10_R2/S1_R10_R2_PICT0255.JPG
4,S9_C11_R1_IMAG0489.JPG,gazellethomsons,SER_S9#C11#1#190,3,S9/C11/C11_R1/S9_C11_R1_IMAG0489.JPG
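
An inner merge silently drops any label whose `image_name` has no match in the images file. One way to check for that is a left merge with `indicator=True`. The sketch below uses toy frames (the `NOT_IN_INDEX.JPG` row is made up to trigger the check), and `validate="many_to_one"` additionally asks pandas to raise an error if an `image_name` appears more than once on the right-hand side.

```python
import pandas as pd

# Toy stand-ins (assumption); the last label row is deliberately unmatched.
labels = pd.DataFrame({
    "image_name": [
        "S10_P05_R2_IMAG1102.JPG",
        "S9_M06_R4_IMAG3581.JPG",
        "NOT_IN_INDEX.JPG",
    ],
    "question__species": ["gazellethomsons"] * 3,
})
image_index = pd.DataFrame({
    "image_name": ["S10_P05_R2_IMAG1102.JPG", "S9_M06_R4_IMAG3581.JPG"],
    "image_path_rel": [
        "S10/P05/P05_R2/S10_P05_R2_IMAG1102.JPG",
        "S9/M06/M06_R4/S9_M06_R4_IMAG3581.JPG",
    ],
})

# Keep every label row and record where each row found its match.
checked = labels.merge(
    image_index,
    on="image_name",
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Rows marked "left_only" have a label but no entry in the images file.
missing = checked.loc[checked["_merge"] == "left_only", "image_name"].tolist()
print(missing)
```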


In [14]:
#save text file with the relative image paths (note that the /mnt directory is specific to Domino)
with open('/mnt/reduced_images_for_dataset.txt', 'w') as f:
    f.write(small_data_names['image_path_rel'].str.cat(sep='\n'))

In [12]:
#if you are using this outside the tutorial you can run the AzCopy code above in a terminal or in this jupyter cell 
#to execute bash code add a ! before the command
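
As an alternative to typing the command by hand, the AzCopy call can be assembled in Python as an argument list, which avoids shell-quoting problems with the `&` and `%` characters in the URL. This is a sketch under assumptions: the helper name `azcopy_list_command` and the destination `/mnt/images` are hypothetical, and the block only prints the command. Actually running it requires AzCopy to be installed and on the `PATH`.

```python
import shlex

# SAS URL for the unzipped Snapshot Serengeti container, as given above.
SAS_URL = (
    "https://lilablobssc.blob.core.windows.net/snapshotserengeti-unzipped"
    "?st=2020-01-01T00%3A00%3A00Z&se=2034-01-01T00%3A00%3A00Z&sp=rl"
    "&sv=2019-07-07&sr=c&sig=/DGPd%2B9WGFt6HgkemDFpo2n0M1htEXvTq9WoHlaH7L4%3D"
)


def azcopy_list_command(source_url, dest_dir, list_file):
    """Build the argument list for `azcopy cp ... --list-of-files`."""
    return ["azcopy", "cp", source_url, dest_dir, "--list-of-files", list_file]


cmd = azcopy_list_command(
    SAS_URL,
    "/mnt/images",  # hypothetical destination folder
    "/mnt/reduced_images_for_dataset.txt",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To start the transfer from Python, uncomment the next two lines:
# import subprocess
# subprocess.run(cmd, check=True)
```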


In [18]:
#save text file with image names (note that the /mnt directory is specific to Domino)
with open('/mnt/image_names.txt', 'w') as f:
    f.write(small_data_names['image_name'].str.cat(sep='\n'))