In [1]:
import pathlib
import scipy.io as sio

from matplotlib import pyplot as plt

## General order of events / goals of this notebook
1. Download the Stanford Dogs dataset to a local gitignored directory
1. Understand the size and shape of the data (class frequencies, etc.)
1. Experiment with resolving individual dog breeds to American Kennel Club breed groups
1. Test how to apply bounding boxes to the images to focus on the dogs that need to be classified
1. Implement a benchmark RandomGuess model

### 1. Bringing the data in locally

First, we'll create a directory to hold all the downloaded files.

In [2]:
# Create the data directory; exist_ok=True makes this cell safe to re-run
data_dir = pathlib.Path('data')
data_dir.mkdir(exist_ok=True)

Here's how we download the **lists** provided by the researchers, which map images to the train and test sets:

In [4]:
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar -O data/temp.tar
!tar xofp data/temp.tar -C data
!rm data/temp.tar

--2020-06-17 13:38:01--  http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar
Resolving vision.stanford.edu (vision.stanford.edu)... 171.64.68.10
Connecting to vision.stanford.edu (vision.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 481280 (470K) [application/x-tar]
Saving to: ‘data/temp.tar’


2020-06-17 13:38:01 (607 KB/s) - ‘data/temp.tar’ saved [481280/481280]



In [8]:
files_downloaded = [x for x in data_dir.glob('*') if x.is_file()]
print(files_downloaded)

[PosixPath('data/file_list.mat'), PosixPath('data/test_list.mat'), PosixPath('data/train_list.mat')]


Note that these `_list` files are MATLAB `.mat` files, so we'll need `scipy.io` to read them as NumPy arrays.
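As a quick way to see what one of these files holds before relying on it, here's a minimal sketch. The helper names `user_variables` and `describe_mat` are made up for this notebook; the only library call is `scipy.io.loadmat`, which returns a dict that also carries a few metadata entries whose keys start with a double underscore. I'm not assuming anything about the variable names stored inside the files, so the sketch just reports whatever it finds.

```python
import pathlib


def user_variables(mat_contents):
    """Drop the metadata entries (keys starting with '__') that loadmat adds."""
    return {key: value for key, value in mat_contents.items()
            if not key.startswith("__")}


def describe_mat(path):
    """Map each variable stored in a .mat file to its array shape."""
    # Imported here so user_variables stays usable without SciPy installed.
    import scipy.io as sio

    contents = user_variables(sio.loadmat(str(pathlib.Path(path))))
    # loadmat returns NumPy arrays; fall back to None for anything without a shape.
    return {name: getattr(value, "shape", None) for name, value in contents.items()}
```

Calling `describe_mat(data_dir / 'train_list.mat')` should then show which variables the file contains and how large each one is.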

Here's a similar process for the **annotation** files, which contain the bounding boxes:

In [9]:
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/annotation.tar -O data/temp.tar
!tar xofp data/temp.tar -C data
!rm data/temp.tar

--2020-06-17 13:44:48--  http://vision.stanford.edu/aditya86/ImageNetDogs/annotation.tar
Resolving vision.stanford.edu (vision.stanford.edu)... 171.64.68.10
Connecting to vision.stanford.edu (vision.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21852160 (21M) [application/x-tar]
Saving to: ‘data/temp.tar’


2020-06-17 13:44:54 (3.45 MB/s) - ‘data/temp.tar’ saved [21852160/21852160]



This creates a subdirectory called `data/Annotation` that holds a bunch of really gnarly XML documents containing the bounding boxes, organized into one folder per breed.
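To get a feel for pulling the boxes out of those XML documents, here's a minimal sketch using the standard library's `xml.etree.ElementTree`. It assumes a Pascal VOC-style layout (`object` elements with a `name` and a `bndbox` holding `xmin`, `ymin`, `xmax`, `ymax`); I haven't verified that layout against every file here, and the sample document and its numbers below are made up for illustration. `parse_bounding_boxes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_bounding_boxes(xml_text):
    """Return a list of (name, (xmin, ymin, xmax, ymax)) tuples, one per object."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        if box is None:
            continue  # skip objects that carry no bounding box
        corners = tuple(int(box.findtext(tag))
                        for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), corners))
    return boxes


# Made-up sample in the assumed layout, not copied from the dataset.
SAMPLE_XML = """
<annotation>
  <object>
    <name>Chihuahua</name>
    <bndbox>
      <xmin>25</xmin>
      <ymin>10</ymin>
      <xmax>276</xmax>
      <ymax>498</ymax>
    </bndbox>
  </object>
</annotation>
"""

print(parse_bounding_boxes(SAMPLE_XML))
# [('Chihuahua', (25, 10, 276, 498))]
```

For a real annotation file, passing `some_path.read_text()` to the function should work the same way, provided the assumed layout holds.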

Finally, we can repeat much the same process for the images themselves, which I'd expect to be the longest download.

In [10]:
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar -O data/temp.tar
!tar xofp data/temp.tar -C data
!rm data/temp.tar

--2020-06-17 13:49:23--  http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
Resolving vision.stanford.edu (vision.stanford.edu)... 171.64.68.10
Connecting to vision.stanford.edu (vision.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 793579520 (757M) [application/x-tar]
Saving to: ‘data/temp.tar’


2020-06-17 13:51:14 (6.82 MB/s) - ‘data/temp.tar’ saved [793579520/793579520]



This last step took about two minutes to complete, which is good to keep in mind. It's worth noting that to play nicely with torchvision's `ImageFolder` and PyTorch's `DataLoader` classes, I may eventually want to reshuffle the images into `train/<class-label>` and `test/<class-label>` folders. I'm not going to handle that right now, though.
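For when that reshuffle becomes necessary, here's a rough sketch of one way it could go. `copy_into_split` is a hypothetical helper, and it assumes each entry in a train or test list can be turned into a relative path of the form `<class-label>/<image-file>`. I haven't confirmed that against the `.mat` files yet, so treat it as a starting point rather than the final approach.

```python
import pathlib
import shutil


def copy_into_split(relative_paths, images_dir, out_dir, split):
    """Copy '<class-label>/<file>' images into out_dir/<split>/<class-label>/."""
    images_dir = pathlib.Path(images_dir)
    split_dir = pathlib.Path(out_dir) / split
    copied = []
    for rel in relative_paths:
        rel = pathlib.PurePosixPath(rel)
        label, filename = rel.parent.name, rel.name
        target = split_dir / label / filename
        # Create the class folder on first use.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Copy rather than move, so the extracted originals stay untouched.
        shutil.copy2(images_dir / label / filename, target)
        copied.append(target)
    return copied
```

Copying doubles the disk usage for the images; moving or symlinking would avoid that, at the cost of changing the extracted layout.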