# Pre-processing and image augmentation for object detection model training and testing datasets
---
*Last Updated 1 April 2020*   
Train and test datasets (images and cropping dimensions) exported from [split_train_test.ipynb](https://github.com/aubricot/computer_vision_with_eol_images/tree/master/object_detection_for_image_cropping/split_train_test.ipynb) are pre-processed and converted to the formats required by YOLO (via Darkflow) and by the SSD and R-FCN object detection models implemented in TensorFlow. All train and test images are also downloaded to Google Drive for later use in training and testing.

Before reformatting to object detection model standards, training data for each taxon (Coleoptera, Anura, Squamata and Carnivora) is augmented using the [imgaug library](https://github.com/aleju/imgaug). Image augmentation is used to increase training data sample size and diversity to reduce overfitting when training object detection models. Both images and cropping coordinates are augmented. Augmented and original training datasets are then combined before being transformed to object detection model formatting standards.
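To illustrate why cropping coordinates have to be augmented together with the image, here is a minimal sketch of what happens to a box when pixels are cropped from the left and top edges. The function name and the example numbers are hypothetical and are not part of this notebook; imgaug performs the equivalent bookkeeping itself when bounding boxes are passed to the augmenter alongside the image.

```python
def shift_box_for_crop(box, crop_left, crop_top):
    """Translate (xmin, ymin, xmax, ymax) into the coordinate frame of an image
    that had crop_left / crop_top pixels removed from its left and top edges."""
    xmin, ymin, xmax, ymax = box
    return (xmin - crop_left, ymin - crop_top, xmax - crop_left, ymax - crop_top)

# Hypothetical example: 10 px removed from the left and top of the image
box = (50, 40, 300, 200)
box_aug = shift_box_for_crop(box, crop_left=10, crop_top=10)
print(box_aug)  # (40, 30, 290, 190)
```

If only the image were cropped, the original box would sit 10 px too far right and down relative to the object it was drawn around.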

After exporting augmented box coordinates from this notebook, test displaying them using [coordinates_display_test.ipynb](https://github.com/aubricot/computer_vision_with_eol_images/tree/master/object_detection_for_image_cropping/coordinates_display_test.ipynb). If they are not as expected, modify data cleaning steps in the section **Remove out of bounds values from train crops and export results for use with object detection models** for train and test images below until the desired results are achieved. 
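Rotation and cropping can push box corners outside the image, which is why a cleaning step is needed. The following is a minimal sketch of one way to handle this, assuming boxes are stored as `(xmin, ymin, xmax, ymax)` in pixels; `clip_box` is a hypothetical helper and not necessarily the cleaning logic used later in this notebook.

```python
def clip_box(box, img_width, img_height):
    """Clamp (xmin, ymin, xmax, ymax) to the image bounds.
    Returns None when no area of the box remains inside the image."""
    xmin, ymin, xmax, ymax = box
    xmin = max(0, min(xmin, img_width))
    xmax = max(0, min(xmax, img_width))
    ymin = max(0, min(ymin, img_height))
    ymax = max(0, min(ymax, img_height))
    if xmax <= xmin or ymax <= ymin:
        return None
    return (xmin, ymin, xmax, ymax)

# Hypothetical 640x480 image
print(clip_box((-12, 30, 700, 200), 640, 480))  # (0, 30, 640, 200)
print(clip_box((650, 10, 700, 50), 640, 480))   # None
```

Boxes that come back as `None` lie entirely outside the image and would be dropped rather than exported.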

## Installs
---
Install required libraries directly to this Colab notebook.

In [0]:
# Install libraries for augmenting and displaying images
!pip install imgaug
!pip install pillow
# Pin SciPy below 1.2: scipy.misc.imread and scipy.misc.imsave (used below) were removed in 1.2
!pip install scipy==1.1.0

In [0]:
# Mount google drive to import/export files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

## Imports   
---

In [0]:
# Change to your training directory within Google Drive
%cd drive/My Drive/fall19_smithsonian_informatics/train

# For importing/exporting files, working with arrays, etc
import pathlib
import os
import imageio
import time
import csv
import numpy as np
import pandas as pd
from urllib.request import urlopen
from scipy.misc import imread

# For augmenting the images and bounding boxes
import imgaug as ia
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# For drawing onto and plotting the images
import matplotlib.pyplot as plt
import cv2
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

### Train images - Run once for each taxon
---
Run all steps once for each taxon (Coleoptera, Anura, Squamata and Carnivora).
Change the names wherever you see '# TO-DO' (i.e. find "carnivora" and replace it with "coleoptera").
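As an alternative to find-and-replace, the taxon name could be set once and every taxon-specific name derived from it. This is only a sketch: `output_names`, the `out` directory default, and the file-name patterns are assumptions made for illustration, not names used by this notebook.

```python
from pathlib import PurePosixPath

def output_names(taxon, base_dir="out"):
    """Build hypothetical per-taxon output names from a single taxon string."""
    taxon = taxon.lower()
    base = PurePosixPath(base_dir)
    return {
        "crops_tsv": str(base / (taxon + "_crops_train_aug.tsv")),
        "image_dir": str(base / (taxon + "_aug_ims")),
    }

TAXON = "carnivora"  # TO-DO: change once per taxon
names = output_names(TAXON)
print(names["crops_tsv"])  # out/carnivora_crops_train_aug.tsv
```

With this layout, switching taxa means editing one line rather than every cell marked '# TO-DO'.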

#### Augment & download train images to Google Drive  
  

In [0]:
urls = 'https://editors.eol.org/other_files/bundle_images/files/images_for_Lepidoptera_20K_breakdown_000001.txt'
df1 = pd.read_csv(urls, sep='\t')
df = df1[['eolMediaURL', 'dataObjectVersionID']]
print(df.head())

In [0]:
# Augment train images and bounding boxes
# Then download train images to Google Drive and write new df with updated filenames and paths
# Saved train images will be used with bounding box dimensions for future use with the object detection models

# Set-up augmentation parameters and write the header of output file crops_train_aug.tsv generated in the next step
from imgaug import augmenters as iaa
from scipy import misc
# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# Define image augmentation pipeline
# modified from https://github.com/aleju/imgaug
seq = iaa.Sequential([
    iaa.Crop(px=(1, 16), keep_size=False), # crop by 1-16px; keep_size=False leaves the image at its cropped size (no resize to orig dims)
    iaa.Affine(rotate=(-25, 25)), # rotate -25 to 25 degrees
    iaa.GaussianBlur(sigma=(0, 3.0)), # blur using gaussian kernel with sigma of 0-3
    iaa.AddToHueAndSaturation((-50, 50), per_channel=True) # shift hue and saturation by -50 to 50, sampled independently for each
])

# Optional: set seed to make augmentations reproducible across runs, otherwise will be random each time
ia.seed(1) 

# Loop to perform image augmentation for each image in df
# First test on 5 images from df
#for i, row in df.head(5).iterrows():
# Next run on a subset of rows (here rows 100-109); use df.iterrows() to run on all rows
for i, row in df.iloc[100:110].iterrows():

  try:
    # Import image from url
    # Use scipy.misc.imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = df.at[i, "eolMediaURL"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Augment image using settings defined above in seq
    image_aug = seq.augment(image=image)
    
    # Define augmentation results needed in exported dataset
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/out/aug_ims/'
    path_aug = pathbase + str(df.dataObjectVersionID[i]) + '_aug' + '.jpg'
    filename_aug = str(df.dataObjectVersionID[i]) + '_aug' + '.jpg'
    
    # Export augmented images to Google Drive
    misc.imsave(path_aug, image_aug)
    
    # Draw augmented bounding box and image
    # Only use this for 20-30 images, otherwise comment out
    #imagewbox = cv2.rectangle(image_aug, (xmin_aug, ymin_aug), 
                      #(xmax_aug, ymax_aug), 
                      #(255, 0, 157), 3) # change box color and thickness
    _, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(image_aug)
    plt.title('{}) Successfully augmented image from {}'.format(format(i+1, '.0f'), url))
        
    # Display message to track augmentation process by image
    print('{}) Successfully augmented image from {}'.format(format(i+1, '.0f'), url))
  
  except Exception: # avoid a bare except so that KeyboardInterrupt can still stop the loop
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))