# Wellington Data Analysis

_Author_: Darshan Mehta

The purpose of this notebook is to explore the CSV metadata file (available [here](https://lilablobssc.blob.core.windows.net/wellingtoncameratraps/wellington_camera_traps.csv.zip)) of the Wellington Dataset ([link](http://lila.science/datasets/wellingtoncameratraps)) in order to identify, understand and, where necessary, correct the information it contains.

In [1]:
import os

import numpy as np
import pandas as pd

In [2]:
# Read the file
file_path = os.path.join(os.getcwd(), '..', 'data', 'wellington_camera_traps.csv')

raw_data = pd.read_csv(file_path)

In [3]:
# Show the first few rows
raw_data.head()

Unnamed: 0,sequence,image_sequence,file,label,site,date,camera
0,2,image1,290716114012001a1116.jpg,BIRD,001a,7/29/2016 11:40,111
1,12,image1,100816090812001a1111.jpg,BIRD,001a,8/10/2016 9:08,111
2,17,image1,180516121622001a1602.jpg,BIRD,001a,5/18/2016 12:16,160
3,18,image1,260416120224001a1601.jpg,BIRD,001a,4/26/2016 12:02,160
4,20,image1,160516023810001a1606.jpg,CAT,001a,5/16/2016 2:38,160


The dataset webpage says that there are 270,450 images from 187 camera locations in the dataset. Each of the 90,150 sequences contains 3 images taken as a burst. Each sequence is classified into one of 15 animal categories, empty, or unclassifiable. Approximately 17% of images are labeled as empty. Let us check these numbers against the CSV to make sure we are not missing any part of the dataset.

In [4]:
print("There are", len(raw_data), "images in the dataset.")

There are 270450 images in the dataset.


In [9]:
print("There are", len(raw_data.camera.unique()), "unique camera locations in the dataset.")

There are 129 unique camera locations in the dataset.


Hmmm... $\large{🐟}$. That looks fishy: the webpage says 187 camera locations, but the `camera` column has only 129 unique values. I'm not sure how much the exact camera locations matter for our problem statement right now, but it is good that we have found and documented the discrepancy so that we can trace back to it if we run into a problem in the future.
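One thing that might be worth checking is whether another column matches the webpage's count better, since the metadata has both a `site` and a `camera` column. The sketch below is only an illustration: the helper `count_unique` and the toy rows are made up for this example and are not taken from the real CSV. On the real data one would pass `raw_data` instead of `toy`.

```python
import pandas as pd


def count_unique(df: pd.DataFrame, columns) -> dict:
    """Return the number of distinct non-null values in each listed column."""
    return {col: int(df[col].nunique()) for col in columns}


# Toy rows for illustration only; these are NOT values from the real CSV.
toy = pd.DataFrame({
    "site": ["001a", "001a", "002b", "003c"],
    "camera": [111, 111, 111, 160],
})

unique_counts = count_unique(toy, ["site", "camera"])
print(unique_counts)
```

Note that `nunique()` ignores missing values, whereas `len(series.unique())` counts `NaN` as one extra value, so the two can differ by one if a column has gaps.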

In [15]:
print("There are", len(raw_data.label.unique()), "unique labels in the dataset.")

There are 17 unique labels in the dataset.
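The 17 labels are consistent with the webpage (15 animal categories plus empty and unclassifiable). The "approximately 17% empty" claim could be checked in a similar way. Below is a minimal sketch, assuming the share of each label is what we want to compare. The helper `label_shares` and the toy labels are hypothetical, and the exact spelling of the empty label in the real file (`"EMPTY"` here) is an assumption.

```python
import pandas as pd


def label_shares(df: pd.DataFrame, label_col: str = "label") -> pd.Series:
    """Return the fraction of rows carrying each label, largest first."""
    return df[label_col].value_counts(normalize=True)


# Toy labels for illustration only; "EMPTY" is an assumed spelling.
toy = pd.DataFrame({"label": ["BIRD", "BIRD", "CAT", "EMPTY", "BIRD", "EMPTY"]})

shares = label_shares(toy)
print(shares.round(3))
```

On the real data, `label_shares(raw_data)` would give one fraction per label, which can then be compared against the figure quoted on the webpage.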


In [10]:
print("There are", len(raw_data.sequence.unique()), "unique sequences in the dataset.")

There are 90478 unique sequences in the dataset.


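Hmmmm... extra $\huge{🐟}$. With 270,450 images at 3 images per sequence we would expect 90,150 sequences, yet we count 90,478, which is 328 more. This means that not every sequence has 3 images. Let's see what we are facing here.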

In [14]:
# Count the images in each sequence, then list the distinct counts
unique_image_counts = raw_data.groupby('sequence')["image_sequence"].transform("count").unique()
print("The unique counts of images per sequence are:", unique_image_counts)

The unique counts of images per sequence are: [3 2 1]
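Knowing that sequences of 1, 2 and 3 images exist does not yet tell us how many of each there are. A minimal sketch of one way to get that breakdown follows. The helper `sequence_size_distribution` and the toy sequence ids are made up for illustration, and on the real data `raw_data` would be passed instead of `toy`.

```python
import pandas as pd


def sequence_size_distribution(df: pd.DataFrame, seq_col: str = "sequence") -> pd.Series:
    """Map each sequence length to the number of sequences of that length."""
    sizes = df.groupby(seq_col).size()        # images per sequence
    return sizes.value_counts().sort_index()  # sequences per image count


# Toy sequence ids for illustration only; NOT values from the real CSV.
toy = pd.DataFrame({"sequence": [2, 2, 2, 12, 12, 17, 18, 18, 18]})

distribution = sequence_size_distribution(toy)
print(distribution)
```

A quick sanity check is that the lengths weighted by their counts add up to the total number of rows, which for the real file should be 270,450.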


Interesting $\large{🤔}$. Let's dive deeper to see which sequences don't have 3 images.