# Project Summary

Tentatively, my project aims to build a web/mobile app that detects and classifies dog breeds in images using deep neural networks. To this end, the datasets must consist of images of various dog breeds with corresponding breed labels. As per the requirements of this mini-project, three datasets are described below.

In [1]:
# Utility function for downloading large files

import requests
import os
from tqdm.auto import tqdm
from pathlib import Path
import math

def download_file(url, filename=None, dirname=None):
    """Download the file at `url` in chunks, showing a progress bar.

    url      - URL of the file to download
    filename - [optional] name to save the file as (defaults to the last URL segment)
    dirname  - [optional] directory in which to save the file (created if missing)
    """
    if filename is None:
        # URLs always use '/' as the separator, regardless of the operating system
        filename = url.split('/')[-1]
    if dirname is not None:
        Path(dirname).mkdir(parents=True, exist_ok=True)
        filename = os.path.join(dirname, filename)

    if Path(filename).exists():
        print(f"{filename} already exists. Skipping")
        return
    # Estimate the number of chunks from the Content-Length header, if the server sends one
    CHUNK_SIZE = 16384
    headers = requests.head(url, allow_redirects=True).headers
    size = headers.get('content-length')
    if size is not None:
        size = math.ceil(int(size) / CHUNK_SIZE)
    # Stream the response so the whole file is never held in memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in tqdm(r.iter_content(chunk_size=CHUNK_SIZE), total=size):
                f.write(chunk)

# Stanford Dogs Dataset
The [Stanford Dogs dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) consists of **20,580 images of 120 different dog breeds**. It includes both bounding boxes (for object detection) and dog breed labels. Based on discussions and leaderboards on Papers with Code, this dataset has become an important benchmark for dog breed classification.
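
To make the pairing of bounding boxes and breed labels concrete, here is a minimal sketch of reading one annotation. It assumes the annotation files are PASCAL VOC-style XML with `object/name` and `object/bndbox` elements, which is not stated above; the sample XML, its values, and the helper `parse_annotation` are hypothetical illustrations rather than part of the dataset tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in PASCAL VOC style (assumed format, made-up values)
SAMPLE_ANNOTATION = """
<annotation>
  <filename>example_dog</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>Chihuahua</name>
    <bndbox><xmin>25</xmin><ymin>10</ymin><xmax>276</xmax><ymax>350</ymax></bndbox>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    """Return a list of (breed, (xmin, ymin, xmax, ymax)) tuples, one per object."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for obj in root.iter("object"):
        breed = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        results.append((breed, coords))
    return results

print(parse_annotation(SAMPLE_ANNOTATION))
```

If the real files follow this layout, the same function could be applied to the text of each annotation file after extraction.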

In [2]:
# Run cell below to download Stanford Dogs data
STANFORD_DOGS_IMAGE_URL = 'http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar'
STANFORD_DOGS_ANNOTATIONS_URL = 'http://vision.stanford.edu/aditya86/ImageNetDogs/annotation.tar'
STANFORD_DOGS_SPLITS_URL = 'http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar'

In [3]:
download_file(STANFORD_DOGS_IMAGE_URL, dirname='stanford_dogs')

  0%|          | 0/48437 [00:00<?, ?it/s]

In [4]:
download_file(STANFORD_DOGS_ANNOTATIONS_URL, dirname='stanford_dogs')

  0%|          | 0/1334 [00:00<?, ?it/s]

In [5]:
download_file(STANFORD_DOGS_SPLITS_URL, dirname='stanford_dogs')

  0%|          | 0/30 [00:00<?, ?it/s]
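
The three downloads are tar archives, so they need to be unpacked before the images can be used. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_tar` and the paths in the usage comment are assumptions for illustration.

```python
import tarfile
from pathlib import Path

def extract_tar(archive_path, dest_dir):
    """Extract a tar archive into dest_dir, refusing members that would escape it."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Guard against path traversal (e.g. "../" entries in the archive)
            if dest != target and dest not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return dest

# Example usage (hypothetical paths; run once the download has finished):
# extract_tar("stanford_dogs/images.tar", "stanford_dogs")
```

Checking member paths before extracting is a precaution for archives fetched over plain HTTP; it can be dropped if the source is fully trusted.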

# Tsinghua Dogs Dataset

The [Tsinghua University Dogs Dataset](https://cg.cs.tsinghua.edu.cn/ThuDogs/) is another important benchmark dataset for dog breed classification and detection. It consists of **70,428 images of 130 different dog breeds**. Each breed is represented by anywhere from 200 to 7,449 images, and the sample sizes are roughly representative of the frequencies of dog breeds found in China.

As with the Stanford Dogs dataset, this dataset includes class labels as well as bounding boxes.
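
Because the per-breed counts range from 200 to 7449 images, a classifier trained on this data may favor the common breeds. One common mitigation, shown here only as a sketch and not as something this project has settled on, is inverse-frequency class weighting. The breed names and the middle count below are made up; only the extremes 200 and 7449 come from the description above.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}

# Hypothetical breed names; 200 and 7449 are the reported extremes
counts = {"breed_a": 200, "breed_b": 1000, "breed_c": 7449}
weights = inverse_frequency_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With this scheme rare breeds receive larger weights, and the weighted sample count still sums to the original total.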

In [6]:
# Run cell below to download Tsinghua Dogs Dataset
TSINGHUA_DOGS_LOW_RES_IMAGES_URL = 'https://cloud.tsinghua.edu.cn/f/80013ef29c5f42728fc8/?dl=1'
TSINGHUA_DOGS_LOW_RES_ANNOTATIONS_URL = 'https://cg.cs.tsinghua.edu.cn/ThuDogs/low-annotations.zip'

In [11]:
download_file(TSINGHUA_DOGS_LOW_RES_IMAGES_URL, filename='low-resolution.zip', dirname='tsinghua_dogs')

low-resolution.zip already exists. Skipping


In [8]:
download_file(TSINGHUA_DOGS_LOW_RES_ANNOTATIONS_URL, dirname='tsinghua_dogs')

  0%|          | 0/2279 [00:00<?, ?it/s]

# Kaggle Dog Breeds Classification Dataset

This is yet another dataset, with roughly **20,000 images of 120 different breeds**. A drawback of this dataset is that it includes only dog breed labels and no bounding boxes. On the plus side, the data is pre-cropped, so each image mostly shows only the dog.
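
Since this dataset carries only breed labels, loading it reduces to reading an image-to-breed mapping. The sketch below assumes the labels come as a CSV file with `id` and `breed` columns, which is not stated above; the sample rows and the helper `breed_counts` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the label file (assumed columns: id, breed)
SAMPLE_LABELS = """id,breed
img001,beagle
img002,pug
img003,beagle
"""

def breed_counts(csv_file):
    """Count how many images carry each breed label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["breed"] for row in reader)

label_counts = breed_counts(io.StringIO(SAMPLE_LABELS))
print(label_counts.most_common())
```

For the real file, an open file handle would replace the `io.StringIO` wrapper.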

Since this dataset is associated with a Kaggle competition, downloading it programmatically requires a Kaggle account. Please follow the instructions [here](https://www.kaggle.com/docs/api) on how to use the Kaggle API.

In [11]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.12.tar.gz (79 kB)
     |████████████████████████████████| 79 kB 2.2 MB/s eta 0:00:01
Collecting certifi>=2023.7.22
  Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
     |████████████████████████████████| 163 kB 4.9 MB/s eta 0:00:01
Collecting python-slugify
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 15.3 MB/s eta 0:00:01
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.12-py3-none-any.whl size=102975 sha256=9be59da15dbe66facfa4be567468b0ed456a208fc781319554268a68f1fcd0a2
  Stored in directory: /Users/deman/Library/Caches/pip/wheels/6d/00/bd/a7b836e7e94f733cef5a0f274b7e991c045a6ab60cfd285fdd
Successfully built kaggle
Installing collected packages: text-unidecode, cer

In [9]:
# Ensure you've downloaded kaggle.json based on instructions above
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/


# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

In [10]:
# Download the dataset for the dog-breed-identification challenge: https://www.kaggle.com/c/dog-breed-identification
!kaggle competitions download -c dog-breed-identification

Downloading dog-breed-identification.zip to /Users/deman/Dev/SpringboardCapstone/CapstoneDatasets
100%|███████████████████████████████████████▉| 690M/691M [00:10<00:00, 75.7MB/s]
100%|████████████████████████████████████████| 691M/691M [00:10<00:00, 66.3MB/s]
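
The competition download arrives as a single zip archive. A minimal sketch of unpacking it with the standard-library `zipfile` module follows; the helper name `extract_zip` and the destination directory `kaggle_dogs` are assumptions for illustration.

```python
import zipfile
from pathlib import Path

def extract_zip(archive_path, dest_dir):
    """Extract a zip archive into dest_dir and return the list of member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
        zf.extractall(dest)
    return names

# Example usage (hypothetical destination directory):
# names = extract_zip("dog-breed-identification.zip", "kaggle_dogs")
# print(names[:5])
```

Listing the member names first is a cheap way to see how the archive is organized before writing any loading code.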
