## Overview

The images used here have already been downloaded; they were selected by the data-cleaning process described below. They are stored at `/domino/datasets/local/serengeti-small-dataset/snapshotserengeti-unzipped`.

### Metadata

To decide which images to download, we use the metadata for the entire dataset, available at http://lila.science/datasets/snapshot-serengeti

The relevant files are `SnapshotSerengeti_v2_1_annotations.csv` and `SnapshotSerengeti_v2_1_images.csv`.

The dataset does not contain any images of humans, for privacy reasons, although the annotations file still has rows labeled `human`. We remove those rows together with the blank images so that we can focus on animal identification. Note also that corrupt images were cleaned out of a larger dataset beforehand; see the download-data folder for the code.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [3]:
#read in annotations file
annotations = pd.read_csv('/mnt/SnapshotSerengeti_v2_1_annotations.csv', index_col=0, low_memory=False)
print('There are {} rows in the dataset'.format(annotations.shape[0]))
annotations.head()

There are 2688077 rows in the dataset


Unnamed: 0,capture_id,season,site,roll,capture,capture_date_local,capture_time_local,subject_id,question__species,question__count_max,...,question__count_min,question__standing,question__resting,question__moving,question__eating,question__interacting,question__young_present,p_users_identified_this_species,pielous_evenness_index,question__horns_visible
0,SER_S1#B04#1#1,S1,B04,1,1,2010-07-18,16:26:14,ASG0002kjh,human,2.0,...,1.0,0.62,0.06,0.0,0.0,0.5,0.0,1.0,0.0,
1,SER_S1#B04#1#2,S1,B04,1,2,2010-07-18,16:26:30,ASG0002kji,human,2.0,...,1.0,0.1,0.62,0.0,0.05,0.33,0.0,1.0,0.0,
2,SER_S1#B04#1#3,S1,B04,1,3,2010-07-20,06:14:06,ASG0002kjj,blank,,...,,,,,,,,1.0,0.0,
3,SER_S1#B04#1#4,S1,B04,1,4,2010-07-22,08:56:06,ASG0002kjk,blank,,...,,,,,,,,1.0,0.0,
4,SER_S1#B04#1#5,S1,B04,1,5,2010-07-24,01:16:28,ASG0002kjl,blank,,...,,,,,,,,1.0,0.0,


In [3]:
#print column names
annotations.columns

Index(['capture_id', 'season', 'site', 'roll', 'capture', 'capture_date_local',
       'capture_time_local', 'subject_id', 'question__species',
       'question__count_max', 'question__count_median', 'question__count_min',
       'question__standing', 'question__resting', 'question__moving',
       'question__eating', 'question__interacting', 'question__young_present',
       'p_users_identified_this_species', 'pielous_evenness_index',
       'question__horns_visible'],
      dtype='object')

In [4]:
#remove all photos with the species labeled as human or blank
img_keep = annotations[~annotations['question__species'].isin(['human', 'blank'])]
print('There are currently {} rows in the dataset'.format(img_keep.shape[0]))

There are currently 672516 rows in the dataset


In [5]:
#print list of species
species = img_keep['question__species'].unique()
print(len(species))
np.sort(species)

74


array(['aardvark', 'aardwolf', 'baboon', 'bat', 'batEaredFox',
       'batearedfox', 'birdother', 'buffalo', 'bushbuck', 'caracal',
       'cattle', 'cheetah', 'civet', 'dikDik', 'dikdik', 'duiker',
       'eland', 'elephant', 'fire', 'gazelleGrants', 'gazelleThomsons',
       'gazellegrants', 'gazellethomsons', 'genet', 'giraffe',
       'guineaFowl', 'guineafowl', 'hare', 'hartebeest', 'hippopotamus',
       'honeyBadger', 'honeybadger', 'hyenaSpotted', 'hyenaStriped',
       'hyenabrown', 'hyenaspotted', 'hyenastriped', 'impala',
       'insectSpider', 'insectspider', 'jackal', 'koriBustard',
       'koribustard', 'kudu', 'leopard', 'lionFemale', 'lionMale',
       'lioncub', 'lionfemale', 'lionmale', 'mongoose', 'monkeyvervet',
       'ostrich', 'otherBird', 'pangolin', 'porcupine', 'reedbuck',
       'reptiles', 'rhinoceros', 'rodents', 'secretaryBird',
       'secretarybird', 'serval', 'steenbok', 'topi', 'vervetMonkey',
       'vulture', 'warthog', 'waterbuck', 'wildcat', 'wildd

In [6]:
#fix duplicates occurring because of capitalization
img_cleaned = img_keep.copy()
img_cleaned.loc[:,'question__species'] = img_cleaned.loc[:,'question__species'].str.lower()

In [7]:
#number of species
species_cleaned = img_cleaned['question__species'].unique()
print(len(species_cleaned))
np.sort(species_cleaned)

61


array(['aardvark', 'aardwolf', 'baboon', 'bat', 'batearedfox',
       'birdother', 'buffalo', 'bushbuck', 'caracal', 'cattle', 'cheetah',
       'civet', 'dikdik', 'duiker', 'eland', 'elephant', 'fire',
       'gazellegrants', 'gazellethomsons', 'genet', 'giraffe',
       'guineafowl', 'hare', 'hartebeest', 'hippopotamus', 'honeybadger',
       'hyenabrown', 'hyenaspotted', 'hyenastriped', 'impala',
       'insectspider', 'jackal', 'koribustard', 'kudu', 'leopard',
       'lioncub', 'lionfemale', 'lionmale', 'mongoose', 'monkeyvervet',
       'ostrich', 'otherbird', 'pangolin', 'porcupine', 'reedbuck',
       'reptiles', 'rhinoceros', 'rodents', 'secretarybird', 'serval',
       'steenbok', 'topi', 'vervetmonkey', 'vulture', 'warthog',
       'waterbuck', 'wildcat', 'wilddog', 'wildebeest', 'zebra',
       'zorilla'], dtype=object)
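
Lower-casing removes the capitalization duplicates, but the list above still contains pairs that look like one class under two names, such as `birdother`/`otherbird` and `monkeyvervet`/`vervetmonkey`. The notebook keeps them as separate classes. A minimal sketch of how they could be merged, assuming each pair really names the same class (the `aliases` mapping and the toy data are illustrative and not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the 'question__species' column after lower-casing.
species = pd.Series(
    ["birdother", "otherbird", "monkeyvervet", "vervetmonkey", "zebra"],
    name="question__species",
)

# Hypothetical alias table: variant label -> canonical label.
# Assumption: both labels in each pair refer to the same class.
aliases = {"birdother": "otherbird", "monkeyvervet": "vervetmonkey"}

# replace() only changes values that appear as keys; all other labels are kept.
canonical = species.replace(aliases)

print(canonical.nunique())
print(sorted(canonical.unique()))
```

Whether such pairs should be merged is a labeling decision; checking a few images from each label first would be a sensible precaution.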

In [8]:
#limit photos to those with exactly 1 animal and a high share of users (> 0.9) who agreed on the species
#rows with NaN in either column fail the comparison and are dropped as well
img_final = img_cleaned.copy()
img_final = img_final[(img_final['question__count_max'] == '1') & (img_final['p_users_identified_this_species'] > 0.9)]

print('There are currently {} rows in the dataset'.format(img_final.shape[0]))
img_final.head()

There are currently 124583 rows in the dataset


Unnamed: 0,capture_id,season,site,roll,capture,capture_date_local,capture_time_local,subject_id,question__species,question__count_max,...,question__count_min,question__standing,question__resting,question__moving,question__eating,question__interacting,question__young_present,p_users_identified_this_species,pielous_evenness_index,question__horns_visible
8,SER_S1#B04#1#9,S1,B04,1,9,2010-07-30,05:20:22,ASG0002kjp,zebra,1,...,1,0.94,0.0,0.06,0.06,0.0,0.0,1.0,0.0,
11,SER_S1#B04#1#12,S1,B04,1,12,2010-07-30,20:57:28,ASG0002kjs,zebra,1,...,1,0.23,0.0,0.0,0.77,0.0,0.0,1.0,0.0,
16,SER_S1#B04#1#17,S1,B04,1,17,2010-08-05,02:24:04,ASG0002kjx,zebra,1,...,1,1.0,0.08,0.0,0.0,0.0,0.0,1.0,0.0,
17,SER_S1#B04#1#18,S1,B04,1,18,2010-08-05,02:29:02,ASG0002kjy,zebra,1,...,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
33,SER_S1#B05#1#4,S1,B05,1,4,2010-07-20,15:26:16,ASG0000004,gazellethomsons,1,...,1,0.21,0.0,0.0,1.0,0.0,0.0,0.93,0.35,
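
The filter above compares `question__count_max` with the string `'1'` rather than the number 1, which works when pandas has read the column as text. The sketch below illustrates the difference between that text comparison and a numeric one on toy data. The mix of formats in the toy column (`'1.0'`, `'11-50'`) is an assumption made for illustration, not something confirmed by the output above.

```python
import pandas as pd

# Toy column read as text; the mix of formats here is assumed for illustration.
counts = pd.Series(["1", "1.0", "2.0", "11-50", None], name="question__count_max")

# Text comparison, as in the cell above: matches only the exact string '1'.
exact_text = counts == "1"

# Numeric alternative: entries that cannot be parsed (e.g. '11-50') and missing
# values become NaN, and NaN never equals 1, so those rows fall out of the mask.
as_number = pd.to_numeric(counts, errors="coerce")
numeric_one = as_number == 1

print(exact_text.tolist())
print(numeric_one.tolist())
```

If the real column contained values such as `'1.0'`, the two masks would select different rows, so it may be worth inspecting `annotations['question__count_max'].unique()` before choosing between them.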


In [9]:
#breakdown of species counts
species_counts = pd.DataFrame(img_final['question__species'].value_counts())
print('Most common species:')
print(species_counts.head())

Most common species:
                 question__species
zebra                        30584
gazellethomsons              17575
wildebeest                   15578
giraffe                      10736
elephant                      9097


In [10]:
print('Least common species:')
print(species_counts.tail(15))

Least common species:
              question__species
mongoose                     53
vulture                      31
bushbuck                     29
caracal                      28
civet                        23
rhinoceros                   22
aardwolf                     17
honeybadger                  14
wildcat                       9
rodents                       8
hyenastriped                  8
zorilla                       6
monkeyvervet                  5
genet                         4
bat                           1


### Choose images for dataset

Next we'll combine the annotations with the image names in `SnapshotSerengeti_v2_1_images.csv`.
We'll then select at most 1000 photos from each class and drop species classes with fewer than 500 photos.

In [11]:
#read in the file listing all images in the set and extract the image name from the image path
all_images = pd.read_csv('/mnt/SnapshotSerengeti_v2_1_images.csv', index_col=0)
all_images['image_name'] = all_images['image_path_rel'].str.split('/').str[-1]
all_images.head()

Unnamed: 0,capture_id,image_rank_in_capture,image_path_rel,image_name
0,SER_S1#B04#1#1,1,S1/B04/B04_R1/S1_B04_R1_PICT0001.JPG,S1_B04_R1_PICT0001.JPG
1,SER_S1#B04#1#2,1,S1/B04/B04_R1/S1_B04_R1_PICT0002.JPG,S1_B04_R1_PICT0002.JPG
2,SER_S1#B04#1#3,1,S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG,S1_B04_R1_PICT0003.JPG
3,SER_S1#B04#1#4,1,S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG,S1_B04_R1_PICT0004.JPG
4,SER_S1#B04#1#5,1,S1/B04/B04_R1/S1_B04_R1_PICT0005.JPG,S1_B04_R1_PICT0005.JPG


In [12]:
#merge all images with the img_final df
image_ids = all_images.merge(img_final, on = 'capture_id', how = 'inner')
print('There are currently {} rows in the dataset'.format(image_ids.shape[0]))
image_ids.head()

There are currently 302101 rows in the dataset


Unnamed: 0,capture_id,image_rank_in_capture,image_path_rel,image_name,season,site,roll,capture,capture_date_local,capture_time_local,...,question__count_min,question__standing,question__resting,question__moving,question__eating,question__interacting,question__young_present,p_users_identified_this_species,pielous_evenness_index,question__horns_visible
0,SER_S1#B04#1#9,1,S1/B04/B04_R1/S1_B04_R1_PICT0009.JPG,S1_B04_R1_PICT0009.JPG,S1,B04,1,9,2010-07-30,05:20:22,...,1,0.94,0.0,0.06,0.06,0.0,0.0,1.0,0.0,
1,SER_S1#B04#1#12,1,S1/B04/B04_R1/S1_B04_R1_PICT0012.JPG,S1_B04_R1_PICT0012.JPG,S1,B04,1,12,2010-07-30,20:57:28,...,1,0.23,0.0,0.0,0.77,0.0,0.0,1.0,0.0,
2,SER_S1#B04#1#17,1,S1/B04/B04_R1/S1_B04_R1_PICT0017.JPG,S1_B04_R1_PICT0017.JPG,S1,B04,1,17,2010-08-05,02:24:04,...,1,1.0,0.08,0.0,0.0,0.0,0.0,1.0,0.0,
3,SER_S1#B04#1#18,1,S1/B04/B04_R1/S1_B04_R1_PICT0018.JPG,S1_B04_R1_PICT0018.JPG,S1,B04,1,18,2010-08-05,02:29:02,...,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
4,SER_S1#B05#1#4,1,S1/B05/B05_R1/S1_B05_R1_PICT0006.JPG,S1_B05_R1_PICT0006.JPG,S1,B05,1,4,2010-07-20,15:26:16,...,1,0.21,0.0,0.0,1.0,0.0,0.0,0.93,0.35,
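
An inner merge silently drops annotation rows without a matching image and image rows without a matching annotation. As a hedged sketch, the match could be audited with an outer merge and `indicator=True`; the frames below are made-up toy data, not the real files.

```python
import pandas as pd

# Toy image list: capture c1 has two frames, c4 has no annotation.
images = pd.DataFrame({
    "capture_id": ["c1", "c1", "c2", "c4"],
    "image_name": ["a.JPG", "b.JPG", "c.JPG", "d.JPG"],
})
# Toy annotations: capture c3 has no image.
annots = pd.DataFrame({
    "capture_id": ["c1", "c2", "c3"],
    "question__species": ["zebra", "giraffe", "topi"],
})

# The '_merge' column records where each row came from:
# 'both', 'left_only' (image only) or 'right_only' (annotation only).
audit = images.merge(annots, on="capture_id", how="outer", indicator=True)
match_counts = audit["_merge"].value_counts()
print(match_counts)

# Rows marked 'both' are exactly what the inner merge would keep.
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(len(matched))
```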


In [13]:
#number of species
species_combined = image_ids['question__species'].unique()
print(len(species_combined))
np.sort(species_combined)

52


array(['aardvark', 'aardwolf', 'baboon', 'bat', 'batearedfox',
       'birdother', 'buffalo', 'bushbuck', 'caracal', 'cheetah', 'civet',
       'dikdik', 'eland', 'elephant', 'gazellegrants', 'gazellethomsons',
       'genet', 'giraffe', 'guineafowl', 'hare', 'hartebeest',
       'hippopotamus', 'honeybadger', 'hyenaspotted', 'hyenastriped',
       'impala', 'insectspider', 'jackal', 'koribustard', 'leopard',
       'lionfemale', 'lionmale', 'mongoose', 'monkeyvervet', 'ostrich',
       'otherbird', 'porcupine', 'reedbuck', 'reptiles', 'rhinoceros',
       'rodents', 'secretarybird', 'serval', 'topi', 'vervetmonkey',
       'vulture', 'warthog', 'waterbuck', 'wildcat', 'wildebeest',
       'zebra', 'zorilla'], dtype=object)

In [14]:
#check for duplicate capture_ids - these occur because a capture can contain more than one image
duplicates = image_ids[image_ids.duplicated(subset='capture_id')]
print(len(duplicates))

177759
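
The notebook keeps every image of a capture. If only one image per capture were wanted instead, a sketch could look like the following. It assumes `image_rank_in_capture` orders the images within a capture, and it uses toy data.

```python
import pandas as pd

# Toy version of the merged frame: c1 has three images, c3 has two.
image_ids = pd.DataFrame({
    "capture_id": ["c1", "c1", "c1", "c2", "c3", "c3"],
    "image_rank_in_capture": [2, 1, 3, 1, 1, 2],
    "image_name": ["a2.JPG", "a1.JPG", "a3.JPG", "b1.JPG", "c1.JPG", "c2.JPG"],
})

# Same check as the cell above: rows whose capture_id has already been seen.
n_repeats = image_ids.duplicated(subset="capture_id").sum()

# Sort so the lowest-ranked image comes first, then keep one row per capture.
first_frames = (
    image_ids.sort_values(["capture_id", "image_rank_in_capture"])
    .drop_duplicates(subset="capture_id", keep="first")
    .reset_index(drop=True)
)

print(n_repeats)
print(first_frames["image_name"].tolist())
```

Keeping all images gives more training data, while keeping one per capture avoids near-identical photos ending up in both training and evaluation sets; which is preferable depends on how the data is split later.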


In [15]:
#pull a set number of photos from each class
num_samples = 1000
min_samples = 500

small_dfs = []
for animal in species_combined:
    small_df = image_ids.loc[image_ids['question__species'] == animal]
    samples = min(num_samples,small_df.shape[0])
    small_df = small_df[['image_name','question__species']].sample(samples, random_state=42)
    if samples >= min_samples:
        small_dfs.append(small_df)

image_df = pd.concat(small_dfs, ignore_index=True)
print('There are currently {} rows in the dataset'.format(image_df.shape[0]))
image_df.head()

There are currently 25929 rows in the dataset


Unnamed: 0,image_name,question__species
0,S8_F02_R3_IMAG0238.JPG,zebra
1,S10_P13_R3_IMAG1998.JPG,zebra
2,S9_L08_R2_IMAG0069.JPG,zebra
3,S5_G07_R1_IMAG2413.JPG,zebra
4,SER_S11_C12_R1_IMAG0526.JPG,zebra
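
The loop above caps each class at `num_samples` rows and drops classes with fewer than `min_samples` rows. The same idea can be expressed with `groupby`. This is a sketch on toy data with smaller thresholds, not a drop-in replacement: the rows it samples will generally differ from those chosen by the loop.

```python
import pandas as pd

# Toy data: 5 zebra, 3 topi, 1 bat.
image_ids = pd.DataFrame({
    "image_name": [f"img_{i}.JPG" for i in range(9)],
    "question__species": ["zebra"] * 5 + ["topi"] * 3 + ["bat"],
})

num_samples = 4  # cap per class (1000 in the notebook)
min_samples = 2  # minimum class size to keep (500 in the notebook)

# Attach each row's class size, then keep only sufficiently large classes.
class_sizes = image_ids.groupby("question__species")["image_name"].transform("count")
eligible = image_ids[class_sizes >= min_samples]

# Shuffle once with a fixed seed, then take the first num_samples rows per class.
image_df = (
    eligible[["image_name", "question__species"]]
    .sample(frac=1, random_state=42)
    .groupby("question__species")
    .head(num_samples)
    .reset_index(drop=True)
)

counts = image_df["question__species"].value_counts().to_dict()
print(counts)
```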


In [16]:
species = image_df['question__species'].unique()
print(len(species))
np.sort(species)

28


array(['aardvark', 'baboon', 'buffalo', 'cheetah', 'dikdik', 'eland',
       'elephant', 'gazellegrants', 'gazellethomsons', 'giraffe',
       'guineafowl', 'hare', 'hartebeest', 'hippopotamus', 'hyenaspotted',
       'impala', 'jackal', 'koribustard', 'lionfemale', 'lionmale',
       'ostrich', 'otherbird', 'porcupine', 'secretarybird', 'topi',
       'warthog', 'wildebeest', 'zebra'], dtype=object)

In [17]:
with pd.option_context('display.max_rows', 999):
    print(image_df['question__species'].value_counts())

lionmale           1000
elephant           1000
gazellethomsons    1000
buffalo            1000
eland              1000
zebra              1000
giraffe            1000
hippopotamus       1000
otherbird          1000
gazellegrants      1000
hartebeest         1000
baboon             1000
cheetah            1000
impala             1000
hyenaspotted       1000
secretarybird      1000
lionfemale         1000
ostrich            1000
guineafowl         1000
wildebeest         1000
warthog            1000
dikdik              887
koribustard         871
hare                710
topi                697
jackal              650
aardvark            583
porcupine           531
Name: question__species, dtype: int64


### Check for Corrupt Images
Note that we checked for corrupt images in a much larger dataset (not shown in this tutorial) before bringing the images over. That process can be found in the `check_corrupt_images.ipynb` notebook in the `download-data-supplement` folder.

For the next section, we'll use the `image_labels.csv` file produced by that process. The steps above can still be used to select images on your local machine.
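
The corrupt-image check itself lives in the separate notebook and is not reproduced here. As a rough, standard-library-only illustration of the idea (not the method used in `check_corrupt_images.ipynb`), the sketch below flags files that are empty or lack the JPEG start and end markers. The function name `looks_like_complete_jpeg` is hypothetical, and a thorough check would normally decode each image with an imaging library.

```python
import os
import tempfile

JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic: True if the file starts and ends with the JPEG markers."""
    with open(path, "rb") as f:
        data = f.read()
    # Ignore trailing padding (null bytes, newlines, spaces) after the end marker.
    data = data.rstrip(b"\x00\r\n ")
    return len(data) >= 4 and data.startswith(JPEG_START) and data.endswith(JPEG_END)


# Demonstration with two fake files: one with both markers, one cut short.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.JPG")
    bad = os.path.join(tmp, "truncated.JPG")
    with open(good, "wb") as f:
        f.write(JPEG_START + b"payload" + JPEG_END)
    with open(bad, "wb") as f:
        f.write(JPEG_START + b"payl")
    results = {os.path.basename(p): looks_like_complete_jpeg(p) for p in (good, bad)}

print(results)
```

A marker check like this catches truncated downloads but not files whose contents are damaged in the middle, which is why decoding the image is the more reliable test.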

In [18]:
#save the labels to csv - not done here since we already have a labels file

#image_df.to_csv('cleaning_example_image_labels.csv', index=False)

In [19]:
labels = pd.read_csv('image_labels.csv')
print('There are currently {} rows in the dataset'.format(labels.shape[0]))
labels.head()

There are currently 19565 rows in the dataset


Unnamed: 0,image_name,question__species
0,S9_H01_R1_IMAG2066.JPG,baboon
1,S7_M05_R2_IMAG1186.JPG,baboon
2,S10_E06_R1_IMAG0427.JPG,baboon
3,S2_E02_R2_PICT0091.JPG,baboon
4,SER_S11_H01_R1_IMAG0013.JPG,baboon


In [20]:
#limit the classes to 4 to save on training time and memory; feel free to add more classes later
reduced_classes = labels[labels['question__species'].isin(['giraffe', 'zebra', 'wildebeest','gazellethomsons'])]
print('There are currently {} rows in the dataset'.format(reduced_classes.shape[0]))
reduced_classes.head()

There are currently 4000 rows in the dataset


Unnamed: 0,image_name,question__species
5415,S10_P05_R2_IMAG1102.JPG,gazellethomsons
5416,S9_M06_R4_IMAG3581.JPG,gazellethomsons
5417,S2_O07_R3_IMAG2997.JPG,gazellethomsons
5418,S1_R10_R2_PICT0255.JPG,gazellethomsons
5419,S9_C11_R1_IMAG0489.JPG,gazellethomsons
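
To load the selected images later, each `image_name` has to be turned into a file path. Below is a minimal sketch, assuming the images sit directly inside a single directory such as the one named in the overview. The flat layout, the `image_path` and `exists` columns, and the temporary directory used for the demonstration are all assumptions.

```python
import os
import tempfile

import pandas as pd

# Toy labels frame with the same two columns as the real one.
labels = pd.DataFrame({
    "image_name": ["a.JPG", "b.JPG", "c.JPG"],
    "question__species": ["zebra", "giraffe", "zebra"],
})

with tempfile.TemporaryDirectory() as image_dir:
    # Create only two of the three files so the check has something to report.
    for name in ["a.JPG", "b.JPG"]:
        open(os.path.join(image_dir, name), "wb").close()

    # Build the full path for each image and record whether the file is present.
    labels["image_path"] = labels["image_name"].map(lambda n: os.path.join(image_dir, n))
    labels["exists"] = labels["image_path"].map(os.path.exists)

# Images listed in the labels file but missing on disk.
missing = labels.loc[~labels["exists"], "image_name"].tolist()
print(missing)
```

Running a check like this before training surfaces label rows that point at files that were never downloaded or were removed as corrupt.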


In [6]:
#this file is already saved
#reduced_classes.to_csv('labels_reduced_classes.csv', index=False)