# Caltech UCSD Birds 200 2011 (CUB-200-2011)

## Part 1 - Downloading and sorting the dataset

This notebook sorts the CUB-200-2011 image and label data from Caltech-UCSD so that it is ready for use with the PyTorch data loading utilities.

This is the first stage of our roadmap for building and understanding a birds image classifier:

![RoadMapImage](../docs/birds_roadmap.png)

To download the data, use the following links:

   1. Images and annotations [Link](http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz)
       
       *(Alternatively these can be downloaded from the releases page of this repository: [ecm200 CalTech Birds](https://github.com/ecm200/caltech_birds))*
       *Two files are available:
               
               CUB_200_2011_data_original.zip - which contains all the images and ancillary data listed below, with exec.
               
               CUB_200_2011_data_original.zip - this contains the sorted images in train and test folders, given by the train_test_split.txt*
               
    
   2. Segmentations (optional, not needed for this work) [Link](http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/segmentations.tgz)
    
Place the files into the root of the cloned caltech_birds repo file structure.

Unzip the downloaded archives into the cloned repository so that the result replicates the following project structure:

    caltech_birds-|
        cub_tools-|
        data-|   ** SEPARATE DOWNLOAD FROM RELEASES **
            attributes-|
            images_orig-|
            images-|
            parts-|
            attributes.txt
            bounding_boxes.txt
            classes.txt
            image_class_labels.txt
            images.txt
            README
            train_test_split.txt
        example_notebooks-|  ** DIRECTORY CONTAINING THIS NOTEBOOK **
        models-|   ** SEPARATE DOWNLOAD FROM RELEASES **
            classification-|
                #modelname1#-|
                #modelname2#-|
        notebooks-|
        scripts-|
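
Before running the rest of the notebook, it can help to confirm that the unpacked data matches the tree above. The following is a minimal sketch, not part of the repository: `missing_entries` is a hypothetical helper, and the lists of expected names are copied from the tree above and cover only the entries this notebook reads.

```python
from pathlib import Path

# Names taken from the directory tree above; only the entries this notebook reads.
EXPECTED_DIRS = ["images_orig"]
EXPECTED_FILES = ["classes.txt", "image_class_labels.txt", "images.txt", "train_test_split.txt"]


def missing_entries(data_root):
    """Return the expected directories and files that are absent under data_root."""
    root = Path(data_root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_entries('../data')` from the **example_notebooks** directory should return an empty list once everything is in place.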
        
This section covers the contents of the directories with a brief description.

**example_notebooks** directory contains the set of walk-through notebooks, of which this is the first. They follow the full workflow of producing a bird classifier with a ResNet152 deep neural network architecture.
        
**notebooks** directory contains the Jupyter notebooks where the majority of the visualisation and high-level code will be maintained.

**scripts** directory contains the computationally intensive functions that take longer to run and are better executed as Python scripts, using some form of terminal persistence to keep the session open while the program is running. In this project, that mostly means training the neural networks, which can typically take from a few hours to a day or so. I prefer to use [TMUX](https://github.com/tmux/tmux/wiki/Getting-Started), a Linux utility that maintains separate terminal sessions which can be attached to and detached from any Linux terminal once opened. This way, you can execute a long-running Python training script inside a TMUX session, detach it, close the terminal and let the process run. You can then start a new terminal session and reattach the running TMUX session to view the progress of the script, including all of its terminal output history.

**cub_tools** directory contains all the utility functions that have been developed to process, visualise, train and evaluate CNN models, as well as to post-process results. It has been converted into a Python package that can be installed in the local environment by running *pip install -e .* in the **cub_tools** directory. The functions can then be accessed with ***import cub_tools***.

**models** directory contains the results from the model training processes, and also any other outputs from the evaluation processes, including model predictions, network feature maps etc. **All model outputs used by the example notebooks can be downloaded from the release folder of the GitHub repo.** The models zip should be placed in the root of the repo directory structure and unzipped to create a models directory with the ResNet152 results contained within. Other models are also available, including PNASNET, Inception V3 and V4, GoogLenet and ResNeXt variants.

## Dataset details


Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations. For detailed information about the dataset, please see the technical report linked below.

Number of categories: 200

Number of images: 11,788

Annotations per image: 15 Part Locations, 312 Binary Attributes, 1 Bounding Box

Some related datasets are Caltech-256, the Oxford Flower Dataset, and Animals with Attributes. More datasets are available at the Caltech Vision Dataset Archive.

## Notebook setup

### Modules and externals

In [1]:
import os
import pandas as pd
import shutil

### Runtime setup

**root_dir** is the path to the directory where the downloaded CUB-200-2011 data resides.

**orig_images_folder** is the name of the folder that all the images are stored in. By default this is "images", and in this example it has been renamed to **images_orig**.

**new_images_folder** is the name of the folder where the images will be sorted into train and test folder structures, ready for the PyTorch data loading utility that creates a data loading object.

Sorting the input images into train and test directories, based on the supplied *train_test_split.txt* file, allows the use of the PyTorch Torchvision data loading utility, [ImageFolder](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder).

This script takes the images as downloaded from the source in **images_orig** and sorts them into train and test sets in the **images** folder. The split is governed by the dataset's own designation, found in the ancillary text file *train_test_split.txt*. This file contains *image_ID* and a binary flag *is_training_image*, where 1 indicates a training image and 0 a test image. The file *images.txt* gives the path to the image for each *image_ID*.
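
To make the motivation concrete: `ImageFolder` expects one sub-directory per class and, according to the torchvision documentation, derives integer class labels from the sorted sub-directory names. The following standard-library sketch mimics that mapping so the effect of the folder layout can be inspected without loading any images. `class_to_index` is a hypothetical helper written for illustration and is not part of torchvision.

```python
import os


def class_to_index(split_dir):
    """Map each class sub-directory of split_dir to an integer label, ordered by name."""
    # Only directories count as classes; stray files in split_dir are ignored.
    classes = sorted(entry.name for entry in os.scandir(split_dir) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}
```

With the data sorted as described, the real loader would then be created with something like `torchvision.datasets.ImageFolder('../data/images/train')`, usually with a `transform` argument added.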

In [2]:
root_dir = '../data'
orig_images_folder = 'images_orig'
new_images_folder = 'images'

In [3]:
data_dir = os.path.join(root_dir,orig_images_folder)
new_data_dir = os.path.join(root_dir,new_images_folder)

### Load the relevant data from text files

This section loads the image file paths with their *image_ID* (*images.txt*) and the *train_test_split.txt* designation into Pandas DataFrames for later use.

In [4]:
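# images.txt: one row per image, '<image_ID> <relative file path>'.
image_fnames = pd.read_csv(filepath_or_buffer=os.path.join(root_dir,'images.txt'),
                          header=None,
                          delimiter=' ',
                          names=['Img ID', 'file path'])

# train_test_split.txt: '<image_ID> <flag>'. The flag column is attached by row position,
# which assumes both files list the images in the same order.
image_fnames['is training image?'] = pd.read_csv(filepath_or_buffer=os.path.join(root_dir,'train_test_split.txt'),
                                                 header=None, delimiter=' ',
                                                 names=['Img ID','is training image?'])['is training image?']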
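
The cell above attaches the split flag by row position, which works as long as both files list the images in the same order. A slightly more defensive alternative is to join the two tables on the image ID. The sketch below is illustrative only: `load_image_table` is a hypothetical helper, and the two in-memory sample files are stand-ins for the real text files, with IDs assumed to start at 1.

```python
import io

import pandas as pd


def load_image_table(images_txt, split_txt):
    """Join image paths and split flags on the image ID rather than on row order."""
    images = pd.read_csv(images_txt, sep=' ', header=None,
                         names=['Img ID', 'file path'])
    split = pd.read_csv(split_txt, sep=' ', header=None,
                        names=['Img ID', 'is training image?'])
    # validate raises an error if an ID appears more than once in either file.
    return images.merge(split, on='Img ID', how='inner', validate='one_to_one')


# Illustrative sample data; the split rows are deliberately out of order.
sample_images = io.StringIO(
    "1 001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg\n"
    "2 001.Black_footed_Albatross/Black_Footed_Albatross_0009_34.jpg\n"
)
sample_split = io.StringIO("2 1\n1 0\n")

table = load_image_table(sample_images, sample_split)
print(table)
```

Because the join is keyed on `Img ID`, each path receives its own flag even though the sample split rows are shuffled. Both arguments can equally be file paths such as `os.path.join(root_dir, 'images.txt')`.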

Create the new directories for the sorted data.

In [5]:
os.makedirs(os.path.join(new_data_dir,'train'), exist_ok=True)
os.makedirs(os.path.join(new_data_dir,'test'), exist_ok=True)

### Sort images into train and test folders based on predefined split

Using the *train_test_split.txt* file, each image is copied to its class folder inside either the train or the test folder.
The resulting directory will have the following structure:

    images-|
        train-|
            #classname1#-|
                image-1.jpg
                image-2.jpg
            #classname2#-|
                image-1.jpg
                image-2.jpg
            |
            |
            #classnameN#-|
                image-1.jpg
                image-2.jpg
        test-|
            #classname1#-|
                image-1.jpg
                image-2.jpg
            #classname2#-|
                image-1.jpg
                image-2.jpg
            |
            |
            #classnameN#-|
                image-1.jpg
                image-2.jpg

In [6]:
for i_image, image_fname in enumerate(image_fnames['file path']):
    # image_fname has the form '<class folder>/<file name>'.
    # The flag is 1 for a training image and 0 for a test image.
    is_train = bool(image_fnames['is training image?'].iloc[i_image])
    if is_train:
        # Copy into images/train/<class folder>/, creating the class folder on first use.
        new_dir = os.path.join(new_data_dir,'train',image_fname.split('/')[0])
        os.makedirs(new_dir, exist_ok=True)
        shutil.copy(src=os.path.join(data_dir,image_fname), dst=os.path.join(new_dir, image_fname.split('/')[1]))
        print(i_image, ':: Image is in training set. [', is_train,']')
        print('Image:: ', image_fname)
        print('Destination:: ', new_dir)
    else:
        # Copy into images/test/<class folder>/ in the same way.
        new_dir = os.path.join(new_data_dir,'test',image_fname.split('/')[0])
        os.makedirs(new_dir, exist_ok=True)
        shutil.copy(src=os.path.join(data_dir,image_fname), dst=os.path.join(new_dir, image_fname.split('/')[1]))
        print(i_image, ':: Image is in testing set. [', is_train,']')
        print('Source Image:: ', image_fname)
        print('Destination:: ', new_dir)

0 :: Image is in testing set. [ False ]
Source Image::  001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg
Destination::  ../data/images/test/001.Black_footed_Albatross
1 :: Image is in training set. [ True ]
Image::  001.Black_footed_Albatross/Black_Footed_Albatross_0009_34.jpg
Destination::  ../data/images/train/001.Black_footed_Albatross
2 :: Image is in testing set. [ False ]
Source Image::  001.Black_footed_Albatross/Black_Footed_Albatross_0002_55.jpg
Destination::  ../data/images/test/001.Black_footed_Albatross
3 :: Image is in training set. [ True ]
Image::  001.Black_footed_Albatross/Black_Footed_Albatross_0074_59.jpg
Destination::  ../data/images/train/001.Black_footed_Albatross
4 :: Image is in training set. [ True ]
Image::  001.Black_footed_Albatross/Black_Footed_Albatross_0014_89.jpg
Destination::  ../data/images/train/001.Black_footed_Albatross
5 :: Image is in testing set. [ False ]
Source Image::  001.Black_footed_Albatross/Black_Footed_Albatross_0085_92.jpg
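
Once the copy loop has finished, a quick sanity check is to count the files that ended up in each split. The following is a minimal sketch under the assumption that every image has a `.jpg` extension, as in the paths shown above; `count_images` is a hypothetical helper, not part of **cub_tools**.

```python
import os


def count_images(images_root, splits=('train', 'test')):
    """Count the .jpg files found under each split folder of images_root."""
    counts = {}
    for split in splits:
        total = 0
        # os.walk yields nothing for a missing folder, so that split counts as 0.
        for _dirpath, _dirnames, filenames in os.walk(os.path.join(images_root, split)):
            total += sum(name.lower().endswith('.jpg') for name in filenames)
        counts[split] = total
    return counts
```

For the full dataset, the train and test counts returned by `count_images('../data/images')` should add up to the 11,788 images listed in the dataset details.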
