<a href="https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/object_detection_for_image_cropping/multitaxa/multitaxa_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-processing and image augmentation for object detection model training and testing datasets
---
*Last Updated 22 April 2020*   
Test and train datasets (images and cropping dimensions) exported from [split_train_test.ipynb](https://github.com/aubricot/computer_vision_with_eol_images/tree/master/object_detection_for_image_cropping/split_train_test.ipynb) are pre-processed and transformed to the formatting standards required by YOLO (via Darkflow) and by the SSD and R-FCN object detection models implemented in Tensorflow. All train and test images are also downloaded to Google Drive for later use in training and testing.

Before reformatting to object detection model standards, training data for each taxon (Coleoptera, Anura, Squamata and Carnivora) is augmented using the [imgaug library](https://github.com/aleju/imgaug). Image augmentation is used to increase training data sample size and diversity to reduce overfitting when training object detection models. Both images and cropping coordinates are augmented. Augmented and original datasets are combined to make the final training dataset pooled for all taxa before being transformed to object detection model formatting standards.
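
To illustrate why cropping coordinates have to be augmented together with the image, here is a minimal sketch of the geometry behind rotating a box. It is not imgaug's implementation; the helper name `rotate_box` and its arguments are assumptions made for this example. It rotates the four box corners about the image center and takes the axis-aligned envelope of the result.

```python
import math

def rotate_box(xmin, ymin, xmax, ymax, width, height, degrees):
    """Rotate a box about the image center and return its axis-aligned envelope.

    Hypothetical helper for illustration only; imgaug handles this internally.
    """
    cx, cy = width / 2.0, height / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
    rotated = []
    for x, y in corners:
        dx, dy = x - cx, y - cy
        # Standard 2D rotation of the corner around the image center
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    return min(xs), min(ys), max(xs), max(ys)

# A box in the top-left quadrant of a 100 x 100 image, rotated by 25 degrees
print(rotate_box(0, 0, 50, 50, 100, 100, 25))
```

A rotated box can extend past the image edge, which is one reason the out-of-bounds cleaning step later in this notebook is needed.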

**Note: Train image sections need to be run once for each taxon. Change taxon names where you see '# TO DO' (i.e., find "squamata" and replace it with "coleoptera").**

After exporting augmented box coordinates from this notebook, test displaying them using [coordinates_display_test.ipynb](https://github.com/aubricot/computer_vision_with_eol_images/tree/master/object_detection_for_image_cropping/coordinates_display_test.ipynb). If they are not as expected, modify data cleaning steps in the section **Remove out of bounds values from train crops and export results for use with object detection models** for train and test images below until the desired results are achieved. 
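
Before inspecting boxes visually, a quick numeric check can flag rows that are likely to display incorrectly. The sketch below is an assumption about what "not as expected" might mean (inverted, empty, or out-of-bounds boxes); `find_bad_boxes` is a hypothetical helper, and the rows are plain dictionaries using the column names written by this notebook.

```python
def find_bad_boxes(rows):
    """Return (row index, reason) pairs for boxes that look malformed.

    Hypothetical helper; each row is a dict with xmin, ymin, xmax, ymax,
    width and height keys.
    """
    problems = []
    for i, r in enumerate(rows):
        if r['xmin'] >= r['xmax'] or r['ymin'] >= r['ymax']:
            problems.append((i, 'empty or inverted box'))
        elif (r['xmin'] < 0 or r['ymin'] < 0
              or r['xmax'] > r['width'] or r['ymax'] > r['height']):
            problems.append((i, 'out of bounds'))
    return problems

# Made-up rows for illustration only
sample = [
    {'xmin': 10, 'ymin': 10, 'xmax': 50, 'ymax': 60, 'width': 100, 'height': 80},
    {'xmin': -4, 'ymin': 10, 'xmax': 50, 'ymax': 60, 'width': 100, 'height': 80},
]
print(find_bad_boxes(sample))
```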

## Installs
---
Install required libraries directly to this Colab notebook.

In [0]:
# Install libraries for augmenting and displaying images
!pip install imgaug
!pip install pillow
# Pin SciPy: scipy.misc.imread and scipy.misc.imsave (used below) were removed in later releases
!pip install scipy==1.1.0

In [0]:
# Mount google drive to import/export files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

## Imports   
---

In [0]:
# Change to your training directory within Google Drive
%cd drive/My Drive/fall19_smithsonian_informatics/train

# For importing/exporting files, working with arrays, etc
import pathlib
import os
import imageio
import time
import csv
import numpy as np
import pandas as pd
from urllib.request import urlopen
from scipy.misc import imread

# For augmenting the images and bounding boxes
import imgaug as ia
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# For drawing onto and plotting the images
import matplotlib.pyplot as plt
import cv2
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

### Train images - Run once for each taxon
---
Run all steps once for each taxon (Coleoptera, Anura, Squamata and Carnivora).
You must change names where you see '# TO DO' (i.e., find "carnivora" and replace it with "coleoptera").
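
As an alternative to find-and-replace, the per-taxon names could be derived from a single variable. This is only a sketch: `taxon_paths` and `BASE_DIR` are hypothetical names, and the file suffixes are copied from the paths used in the cells below.

```python
import posixpath

# Assumed base directory, copied from the paths used in this notebook
BASE_DIR = '/content/drive/My Drive/fall19_smithsonian_informatics/train'

def taxon_paths(taxon, base_dir=BASE_DIR):
    """Build the per-taxon file paths and class name (hypothetical helper)."""
    prefix = posixpath.join(base_dir, 'preprocessing',
                            taxon.lower() + '_crops_train')
    return {
        'class_name': taxon.capitalize(),
        'crops_train': prefix + '.tsv',
        'crops_train_aug': prefix + '_aug.tsv',
        'crops_train_notaug': prefix + '_notaug.tsv',
        'crops_train_aug_all': prefix + '_aug_all.tsv',
    }

# Change this one value per run instead of editing every '# TO DO' line
paths = taxon_paths('anura')
print(paths['class_name'], paths['crops_train_aug'])
```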

#### Augment & download train images to Google Drive  
  

In [0]:
# Set-up augmentation parameters and write the header of output file crops_train_aug.tsv generated in the next step

# Read in EOL images and user-generated cropping coordinate training data
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train.tsv
crops = pd.read_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train.tsv', sep='\t', header=0)
print(crops.head())

# For image augmentation
from imgaug import augmenters as iaa

# For saving images to Google Drive
from scipy import misc

# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# Define image augmentation pipeline
# modified from https://github.com/aleju/imgaug
seq = iaa.Sequential([
    iaa.Crop(px=(1, 16), keep_size=False), # crop each side by 1-16px; keep_size=False leaves the cropped image at its new, smaller size
    iaa.Affine(rotate=(-25, 25)), # rotate -25 to 25 degrees
    iaa.GaussianBlur(sigma=(0, 3.0)), # blur using gaussian kernel with sigma of 0-3
    iaa.AddToHueAndSaturation((-50, 50), per_channel=True) # shift hue and saturation by a value from -50 to 50, sampled per channel
])

# Write header of crops_aug.tsv before looping through crops for remaining data
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug.tsv
if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow(["data_object_id",	"obj_url",	"height",	"width",	"xmin",
                                 "ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

In [0]:
# Augment train images and bounding boxes
# Then download train images to Google Drive and write new df with updated filenames and paths
# Saved train images will be used with bounding box dimensions for future use with the object detection models

# Optional: set seed to make augmentations reproducible across runs, otherwise will be random each time
ia.seed(1) 

# Loop to perform image augmentation for each image in crops
# First test on 5 images from crops
#for i, row in crops.head(5).iterrows():
# Next run on all rows
for i, row in crops.iterrows():
  try:
    # Import image from url
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Import bounding box coordinates
    bb = BoundingBox(x1=int(crops.xmin[i]), y1=int(crops.ymin[i]),
        x2=int(crops.xmax[i]), y2=int(crops.ymax[i]))
    bb = BoundingBoxesOnImage([bb], shape=image.shape)
    
    # Augment image using settings defined above in seq
    image_aug, bb_aug = seq.augment(image=image, bounding_boxes=bb)
    
    # Define augmentation results needed in exported dataset
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/images/'
    path_aug = pathbase + str(crops.data_object_id[i]) + '_aug' + '.jpg'
    filename_aug = str(crops.data_object_id[i]) + '_aug' + '.jpg'
    obj_id = crops.data_object_id[i]
    height, width, depth = image_aug.shape
    # Use int() so the coordinates are plain Python ints regardless of their numeric type
    xmin_aug = int(bb_aug.bounding_boxes[0].x1)
    ymin_aug = int(bb_aug.bounding_boxes[0].y1)
    xmax_aug = int(bb_aug.bounding_boxes[0].x2)
    ymax_aug = int(bb_aug.bounding_boxes[0].y2)
    # TO DO: Change to Anura, Coleoptera, Squamata, and Carnivora
    name = str("Anura")

    # Export augmented images to Google Drive
    misc.imsave(path_aug, image_aug)
    
    # Draw augmented bounding box and image
    # Only use this for 20-30 images, otherwise comment out
    #imagewbox = cv2.rectangle(image_aug, (xmin_aug, ymin_aug), 
                      #(xmax_aug, ymax_aug), 
                      #(255, 0, 157), 3) # change box color and thickness
    #_, ax = plt.subplots(figsize=(10, 10))
    #ax.imshow(imagewbox)
    #plt.title('{}) Successfully augmented image from {}'.format(format(i+1, '.0f'), url))
    
    # Export augmentation results to crops_aug.tsv
    # TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug.tsv
    if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow([crops.data_object_id[i], crops.obj_url[i], height, width,
                                 xmin_aug, ymin_aug, xmax_aug, ymax_aug, filename_aug, path_aug, name])
    
    # Display message to track augmentation process by image
    print('{}) Successfully augmented image from {}'.format(format(i+1, '.0f'), url))
  
  except Exception: # a bare except would also catch KeyboardInterrupt, making the loop hard to stop
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))
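
Because the header cell and the loop both open the output file in append mode (`'a'`), re-running them adds a second header row and duplicate data rows to the same TSV. One way to guard against the duplicate header, sketched here with the standard library only, is to write it only when the file is missing or empty. `append_row` is a hypothetical helper, not part of this notebook.

```python
import csv
import os

HEADER = ["data_object_id", "obj_url", "height", "width", "xmin",
          "ymin", "xmax", "ymax", "filename", "path", "class"]

def append_row(tsv_path, row, header=HEADER):
    """Append one row to a TSV, writing the header first if the file is new.

    Hypothetical helper for illustration only.
    """
    needs_header = (not os.path.exists(tsv_path)
                    or os.path.getsize(tsv_path) == 0)
    with open(tsv_path, 'a', newline='') as out_file:
        writer = csv.writer(out_file, delimiter='\t')
        if needs_header:
            writer.writerow(header)
        writer.writerow(row)
```

This does not remove duplicate data rows from an interrupted run; deleting the old output file before re-running remains the simplest way to start clean.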

#### Make full training dataset by combining augmented and un-augmented bounding boxes and images   

In [0]:
# Download original (un-augmented) training images to Google Drive 

# For saving images to Google Drive
from scipy import misc

# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

for i, row in crops.iterrows():
  try:
    # Import image from url
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Define path and filename for the original (un-augmented) image
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/images/'
    path = pathbase + str(crops.data_object_id[i]) + '.jpg'
    filename = str(crops.data_object_id[i]) + '.jpg'
     
    # Export original (un-augmented) image to Google Drive
    misc.imsave(path, image)
  
    # Display message to track download progress by image
    print('{}) Successfully downloaded original (un-augmented) image from {} to Google Drive'.format(format(i+1, '.0f'), url))
  
  except Exception: # a bare except would also catch KeyboardInterrupt, making the loop hard to stop
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))

In [0]:
# Create new df with original (un-augmented) bounding boxes and images that is formatted the same as the augmented data

# Write header of crops_notaug.tsv before looping through crops for other data
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_notaug.tsv
if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow(["data_object_id",	"obj_url",	"height",	"width",	"xmin",
                                 "ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

# Loop through crops to get images and bounding boxes
for i, row in crops.iterrows():
  try:
    # Import images from crops
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')
    height, width, depth = image.shape
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/images/'
    path = pathbase + str(crops.data_object_id[i]) + '.jpg'
    filename = str(crops.data_object_id[i]) + '.jpg'
    # TO DO: Change to Anura, Coleoptera, Squamata, and Carnivora
    name = str("Anura")
    
    # Write results to crops_notaug.tsv
    # TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_notaug.tsv
    if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow([crops.data_object_id[i], crops.obj_url[i], height, width, 
                                 crops.xmin[i], crops.ymin[i], crops.xmax[i], crops.ymax[i], filename, path, name])
    
    # Display message to track progress by image
    print('{}) Successfully loaded image from {}'.format(format(i+1, '.0f'), url))
  
  except Exception: # a bare except would also catch KeyboardInterrupt, making the loop hard to stop
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))

In [0]:
# Combine augmented and un-augmented datasets to make one full training dataset

# File names to be combined
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug.tsv and _crops_train_notaug.tsv
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_notaug.tsv"]

# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep='\t') for f in all_filenames])

# Export to tsv
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug_all.tsv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_all.tsv", index=False, sep='\t')
print(combined.head())

#### Remove out of bounds values from train crops and export results for use with object detection models

In [0]:
# Read in crops_all.tsv from above
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug_all.tsv
allcrops = pd.read_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_all.tsv", sep='\t')
print(allcrops.head())

# Set negative values to 0
# Use .loc to avoid chained assignment, which may silently fail to update the dataframe
allcrops.loc[allcrops.xmin < 0, 'xmin'] = 0
allcrops.loc[allcrops.ymin < 0, 'ymin'] = 0

# Remove out of bounds cropping dimensions
# When crop extends past the image height, set crop to span the full image height:
too_tall = allcrops.ymax > allcrops.height
allcrops.loc[too_tall, 'ymin'] = 0
allcrops.loc[too_tall, 'ymax'] = allcrops.loc[too_tall, 'height']

# When crop extends past the image width, set crop to span the full image width:
too_wide = allcrops.xmax > allcrops.width
allcrops.loc[too_wide, 'xmin'] = 0
allcrops.loc[too_wide, 'xmax'] = allcrops.loc[too_wide, 'width']

# Write results to tsv for records with all info
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug_all_transf.tsv
allcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_all_transf.tsv', sep='\t', index=False)

# Write relevant results to csv formatted for training and annotations needed by tensorflow and yolo
df1 = allcrops.iloc[:, 4:8]
df2 = allcrops[['filename', 'width', 'height', 'class']]
traincrops = pd.concat([df2, df1], axis=1)
traincrops.insert(0, 'folder', 'images')
print(traincrops.head())
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug_fin.csv
traincrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_fin.csv', sep=',', index=False)

# Write relevant results to tsv formatted for training and annotations needed by yolo
# Use .copy() so the edits below do not act on a view of allcrops
traincrops = allcrops[['filename', 'path', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax', 'class']].copy()
traincrops.rename(columns={'class': 'name'}, inplace=True)
traincrops.insert(0, 'folder', 'images')
# Remove leading '/' from filepaths (format needed for xmls)
traincrops['path'] = traincrops['path'].str.lstrip('/')
print(traincrops.head())
# TO DO: Change to anura, coleoptera, squamata, and carnivora _crops_train_aug_fin_foreli.tsv
traincrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_fin_foreli.tsv', sep='\t', index=False)
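
The `_foreli.tsv` export carries the fields (folder, filename, path, width, height, name, and box corners) that a Pascal VOC-style XML annotation typically holds. The conversion to XML is not shown in this notebook, so the following is only a sketch of how one row might map to such a file, assuming the usual VOC element names and three-channel images. `row_to_voc_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def row_to_voc_xml(row):
    """Build a Pascal VOC-style annotation string from one row (a dict).

    Hypothetical helper; assumes VOC element names and 3-channel images.
    """
    ann = ET.Element('annotation')
    ET.SubElement(ann, 'folder').text = str(row['folder'])
    ET.SubElement(ann, 'filename').text = str(row['filename'])
    ET.SubElement(ann, 'path').text = str(row['path'])
    size = ET.SubElement(ann, 'size')
    ET.SubElement(size, 'width').text = str(row['width'])
    ET.SubElement(size, 'height').text = str(row['height'])
    ET.SubElement(size, 'depth').text = '3'
    obj = ET.SubElement(ann, 'object')
    ET.SubElement(obj, 'name').text = str(row['name'])
    box = ET.SubElement(obj, 'bndbox')
    for key in ('xmin', 'ymin', 'xmax', 'ymax'):
        # Box corners are written as whole pixels
        ET.SubElement(box, key).text = str(int(row[key]))
    return ET.tostring(ann, encoding='unicode')

# Made-up row for illustration only
example = {'folder': 'images', 'filename': '123.jpg', 'path': 'content/123.jpg',
           'width': 640, 'height': 480, 'name': 'Anura',
           'xmin': 10, 'ymin': 20, 'xmax': 300, 'ymax': 400}
print(row_to_voc_xml(example))
```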

#### Make multitaxa training dataset by combining train files for all taxa
---

In [0]:
# Combine original train crops for all taxa to make pooled multitaxa training dataset
# Files to be combined
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_train.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_train.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_train.tsv"]
# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep='\t') for f in all_filenames])
# Write results to tsv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_train.tsv", index=False, sep='\t')
print(combined.head())

# Combine final, transformed datasets to make pooled multitaxa training dataset for Tensorflow models
# Files to be combined
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_train_aug_fin.csv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_train_aug_fin.csv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_fin.csv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_train_aug_fin.csv"]
# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep=',') for f in all_filenames])
# Write results to csv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_train_aug_all_fin.csv", index=False, sep=',')
print(combined.head())

# Combine final, transformed datasets to make pooled multitaxa training dataset for YOLO
# Files to be combined
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_train_aug_fin_foreli.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_train_aug_fin_foreli.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_train_aug_fin_foreli.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_train_aug_fin_foreli.tsv"]
# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep='\t') for f in all_filenames])
# Write results to tsv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_train_aug_all_fin_foreli.tsv", index=False, sep='\t')
print(combined.head())
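
After pooling, a per-class row count is a quick way to confirm that all four taxa made it into the combined file. This sketch uses only the standard library; `count_classes` is a hypothetical helper, and the column name and delimiter must match the file being checked (`class` for the `.csv` exports, `name` for the `_foreli.tsv` exports).

```python
import csv
from collections import Counter

def count_classes(table_path, column='class', delimiter=','):
    """Count rows per class label in a delimited file (hypothetical helper)."""
    with open(table_path, newline='') as in_file:
        reader = csv.DictReader(in_file, delimiter=delimiter)
        return Counter(row[column] for row in reader)

# Example call for the pooled YOLO file written above (path copied from this notebook):
# counts = count_classes('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_train_aug_all_fin_foreli.tsv',
#                        column='name', delimiter='\t')
# print(counts)
```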

### Test Images
---
Run blocks to prepare testing datasets for each taxon separately and to make one pooled taxa dataset.


#### Squamata (lizards, snakes)
---
Download test images to Google Drive and write new dataframe with image filenames and paths used to prepare image annotation files

In [0]:
# Saved test images will be used with bounding box dimensions for future use with the object detection models

from scipy import misc
# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# Read in EOL images and user-generated cropping coordinate testing data
crops_test = pd.read_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test.tsv', sep='\t', header=0)
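print(crops_test.head())

# Write header of crops_test_notaug.tsv before looping through crops for other data
# Note: the header and row writes in this cell are commented out; uncomment them if
# squamata_crops_test_notaug.tsv, which the next cell reads, has not been created yet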
#if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        #with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug.tsv', 'a') as out_file:
            #tsv_writer = csv.writer(out_file, delimiter='\t')
            #tsv_writer.writerow(["data_object_id",	"obj_url",	"height",	"width",	"xmin",
                                 #"ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

# Loop through crop testing data
for i, row in crops_test.iterrows():
  try:
    # Import image from url
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops_test.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Define variables needed in exported dataset
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images_squamata/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    filename = str(crops_test.data_object_id[i]) + '.jpg'
    obj_id = crops_test.data_object_id[i]
    height, width, depth = image.shape
    name = str("Squamata")
    
    # Export image to Google Drive test_images_taxon/ for testing by each taxon
    #misc.imsave(path, image)
    
    # Point path at the pooled test_images folder; if testing with only squamata, test_images_squamata can be changed to test_images
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    
    # Export image to Google Drive test_images/ for testing with multitaxa images (all taxa pooled)
    misc.imsave(path, image)

    # Export to crops_test.tsv
    #if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        #with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug.tsv', 'a') as out_file:
            #tsv_writer = csv.writer(out_file, delimiter='\t')
            #tsv_writer.writerow([crops_test.data_object_id[i], crops_test.obj_url[i], height, width, 
                                 #crops_test.xmin[i], crops_test.ymin[i], crops_test.xmax[i], crops_test.ymax[i], filename, path, name])
    
    # Display message to track download process by image
    print('{}) Successfully downloaded image from {}'.format(format(i+1, '.0f'), url))
  
  except Exception: # a bare except would also catch KeyboardInterrupt, making the loop hard to stop
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))

In [0]:
# Read in crops_test_notaug.tsv from above
crops = pd.read_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug.tsv", sep='\t')
print(crops.head())

# Remove out of bounds (OOB) cropping dimensions
# Use .loc to avoid chained assignment, which may silently fail to update the dataframe
# Set negative values (OOB -) equal to 0
crops.loc[crops.xmin < 0, 'xmin'] = 0
crops.loc[crops.ymin < 0, 'ymin'] = 0
# Set positive out of bounds values (OOB +) equal to image dimensions
# When crop height > image height, set crop height equal to image height:
too_tall = crops.ymax > crops.height
crops.loc[too_tall, 'ymax'] = crops.loc[too_tall, 'height']
# When crop width > image width, set crop width equal to image width:
too_wide = crops.xmax > crops.width
crops.loc[too_wide, 'xmax'] = crops.loc[too_wide, 'width']

# Write results to tsv for records with all info
crops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug_transf.tsv', sep='\t', index=False)

# Write relevant results to csv formatted for training and annotations needed by tensorflow
df1 = crops.iloc[:, 4:8]
df2 = crops[['filename', 'width', 'height', 'class']]
testcrops = pd.concat([df2, df1], axis=1)
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug_fin.csv', sep=',', index=False)

# Write relevant results to tsv formatted for training and annotations needed by yolo
# Use .copy() so the edits below do not act on a view of crops
testcrops = crops[['filename', 'path', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax', 'class']].copy()
testcrops.rename(columns={'class': 'name'}, inplace=True)
# Remove leading '/' from filepaths (format needed for xmls)
testcrops['path'] = testcrops['path'].str.lstrip('/')
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())

# Write results to tsv
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug_fin_foreli.tsv', sep='\t', index=False)
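
The out-of-bounds handling for test crops amounts to clamping each coordinate to the image. A minimal sketch of that idea for a single box follows. `clamp_box` is a hypothetical helper, and it is slightly stricter than the cell above because it clamps every coordinate to the range 0 to the image dimension, not only negative minimums and oversized maximums.

```python
def clamp_box(xmin, ymin, xmax, ymax, width, height):
    """Clamp box corners so they lie inside a width x height image.

    Hypothetical helper for illustration only.
    """
    xmin = max(0, min(xmin, width))
    xmax = max(0, min(xmax, width))
    ymin = max(0, min(ymin, height))
    ymax = max(0, min(ymax, height))
    return xmin, ymin, xmax, ymax

# A box that overhangs the left and right edges of a 100 x 80 image
print(clamp_box(-5, 10, 120, 70, 100, 80))
```

Note that the train cleaning step earlier in this notebook behaves differently: when a maximum exceeds the image dimension, it also resets the matching minimum to 0.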

#### Coleoptera (beetles)
---
Download test images to Google Drive and write new dataframe with image filenames and paths used to prepare image annotation files

In [0]:
# Saved test images will be used with bounding box dimensions for future use with the object detection models

from scipy import misc
# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# Read in EOL images and user-generated cropping coordinate testing data
crops_test = pd.read_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test.tsv', sep='\t', header=0)
print(crops_test.head())

# Write header of crops_test_notaug.tsv before looping through crops for other data
if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow(["data_object_id",	"obj_url",	"height",	"width",	"xmin",
                                 "ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

# Loop through crop testing data
for i, row in crops_test.iterrows():
  try:
    # Import image from url
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops_test.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Define variables needed in exported dataset
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images_coleoptera/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    filename = str(crops_test.data_object_id[i]) + '.jpg'
    obj_id = crops_test.data_object_id[i]
    height, width, depth = image.shape
    name = str("Coleoptera")

    # Export image to Google Drive test_images_taxon/ for testing by each taxon
    misc.imsave(path, image)
    
    # Point path at the pooled test_images folder; if testing with only coleopterans, test_images_coleoptera can be changed to test_images
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    
    # Export image to Google Drive test_images/ for testing with multitaxa images (all taxa pooled)
    misc.imsave(path, image)

    # Export to crops_test.tsv
    if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow([crops_test.data_object_id[i], crops_test.obj_url[i], height, width, 
                                 crops_test.xmin[i], crops_test.ymin[i], crops_test.xmax[i], crops_test.ymax[i], filename, path, name])
    
    # Display message to track download process by image
    print('{}) Successfully downloaded image from {}'.format(format(i+1, '.0f'), url))
  
  except Exception: # a bare except would also catch KeyboardInterrupt, making the loop hard to stop
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))

In [0]:
# Read in crops_test_notaug.tsv from above
crops = pd.read_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug.tsv", sep='\t')
print(crops.head())

# Remove out of bounds (OOB) cropping dimensions
# Use .loc to avoid chained assignment, which may silently fail to update the dataframe
# Set negative values (OOB -) equal to 0
crops.loc[crops.xmin < 0, 'xmin'] = 0
crops.loc[crops.ymin < 0, 'ymin'] = 0
# Set positive out of bounds values (OOB +) equal to image dimensions
# When crop height > image height, set crop height equal to image height:
too_tall = crops.ymax > crops.height
crops.loc[too_tall, 'ymax'] = crops.loc[too_tall, 'height']
# When crop width > image width, set crop width equal to image width:
too_wide = crops.xmax > crops.width
crops.loc[too_wide, 'xmax'] = crops.loc[too_wide, 'width']

# Write results to tsv for records with all info
crops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug_transf.tsv', sep='\t', index=False)

# Write relevant results to csv formatted for training and annotations needed by tensorflow
df1 = crops.iloc[:, 4:8]
df2 = crops[['filename', 'width', 'height', 'class']]
testcrops = pd.concat([df2, df1], axis=1)
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug_fin.csv', sep=',', index=False)

# Write relevant results to tsv formatted for training and annotations needed by yolo
# Use .copy() so the edits below do not act on a view of crops
testcrops = crops[['filename', 'path', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax', 'class']].copy()
testcrops.rename(columns={'class': 'name'}, inplace=True)
# Remove leading '/' from filepaths (format needed for xmls)
testcrops['path'] = testcrops['path'].str.lstrip('/')
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())

# Write results to tsv
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug_fin_foreli.tsv', sep='\t', index=False)

#### Anura (frogs)
---
Download test images to Google Drive and write new dataframe with image filenames and paths used to prepare image annotation files

In [0]:
# Saved test images will be used with bounding box dimensions for future use with the object detection models

from scipy import misc
# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# Read in EOL images and user-generated cropping coordinate testing data
crops_test = pd.read_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test.tsv', sep='\t', header=0)
print(crops_test.head())

# Write header of crops_test_notaug.tsv before looping through crops for other data
if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow(["data_object_id",	"obj_url",	"height",	"width",	"xmin",
                                 "ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

# Loop through crop testing data
for i, row in crops_test.iterrows():
  try:
    # Import image from url
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops_test.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Define variables needed in exported dataset
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images_anura/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    filename = str(crops_test.data_object_id[i]) + '.jpg'
    obj_id = crops_test.data_object_id[i]
    height, width, depth = image.shape
    name = str("Anura")

    # Export image to Google Drive test_images_taxon/ for testing by each taxon 
    misc.imsave(path, image)
    
    # Point path at the pooled test_images folder; if testing with only anurans, test_images_anura can be changed to test_images
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    
    # Export image to Google Drive test_images/ for testing with multitaxa images (all taxa pooled)
    misc.imsave(path, image)

    # Export to crops_test.tsv
    if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow([crops_test.data_object_id[i], crops_test.obj_url[i], height, width, 
                                 crops_test.xmin[i], crops_test.ymin[i], crops_test.xmax[i], crops_test.ymax[i], filename, path, name])
    
    # Display message to track download process by image
    print('{}) Successfully downloaded image from {}'.format(format(i+1, '.0f'), url))
  
  # Catch Exception (not a bare except) so the loop can still be interrupted manually
  except Exception:
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))
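
Because the tsv is opened in append mode, re-running the cells above adds a second header row and duplicate records. The following is a minimal sketch of one way to guard against the repeated header, assuming a hypothetical helper `append_rows` (not part of the original notebook) and a temporary demo file with a made-up record in place of the Drive paths:

```python
import csv
import os
import tempfile

HEADER = ["data_object_id", "obj_url", "height", "width", "xmin",
          "ymin", "xmax", "ymax", "filename", "path", "class"]

def append_rows(tsv_path, rows, header=HEADER):
    """Append rows to a tsv, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(tsv_path) or os.path.getsize(tsv_path) == 0
    with open(tsv_path, "a", newline="") as out_file:
        writer = csv.writer(out_file, delimiter="\t")
        if needs_header:
            writer.writerow(header)
        writer.writerows(rows)

# Demo on a temporary file (hypothetical record, not real EOL data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_crops_test_notaug.tsv")
row = [1, "https://example.org/1.jpg", 100, 200, 0, 0, 50, 60,
       "1.jpg", "/tmp/1.jpg", "Anura"]
append_rows(demo_path, [row])
append_rows(demo_path, [row])  # second call does not repeat the header

with open(demo_path, newline="") as f:
    lines = list(csv.reader(f, delimiter="\t"))
print(len(lines))  # one header row plus two data rows
```

This sketch only prevents duplicate headers; duplicate data rows from a repeated download would still need to be dropped separately, for example with `drop_duplicates` after reading the file back in.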

In [0]:
# Read in crops_test_notaug.tsv from above
crops = pd.read_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug.tsv", sep='\t')
print(crops.head())

# Remove out of bounds (OOB) cropping dimensions
# Set negative values (OOB -) equal to 0
# Use .loc / .at so values are set on crops itself rather than on a temporary copy
crops.loc[crops.xmin < 0, 'xmin'] = 0
crops.loc[crops.ymin < 0, 'ymin'] = 0
# Set positive out of bounds values (OOB +) equal to image dimensions
for i, row in crops.iterrows():
    # When ymax > image height, set ymax equal to image height:
    if crops.at[i, 'ymax'] > crops.at[i, 'height']:
        crops.at[i, 'ymax'] = crops.at[i, 'height']
    # When xmax > image width, set xmax equal to image width:
    if crops.at[i, 'xmax'] > crops.at[i, 'width']:
        crops.at[i, 'xmax'] = crops.at[i, 'width']

# Write results to tsv for records with all info
crops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug_transf.tsv', sep='\t', index=False)

# Write relevant results to csv formatted for the test annotations needed by Tensorflow (SSD and R-FCN)
# Select the box columns by name instead of by position
df1 = crops[['xmin', 'ymin', 'xmax', 'ymax']]
df2 = crops[['filename', 'width', 'height', 'class']]
testcrops = pd.concat([df2, df1], axis=1)
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug_fin.csv', sep=',', index=False)

# Write relevant results to tsv formatted for the test annotations needed by YOLO
# Copy the selection so the rename and path edits below do not act on a slice of crops
testcrops = crops[['filename', 'path', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax', 'class']].copy()
testcrops.rename(columns={'class': 'name'}, inplace=True)
# Remove leading / from filepaths (format needed for xmls)
testcrops['path'] = testcrops['path'].str.lstrip('/')
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())

# Write results to tsv
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug_fin_foreli.tsv', sep='\t', index=False)
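
The out-of-bounds clean-up in the cell above can also be written without an explicit loop. This is a sketch on toy records with made-up values, assuming a hypothetical helper `clamp_boxes` that is not part of the original notebook:

```python
import pandas as pd

# Toy records (hypothetical values) with out-of-bounds coordinates
crops_demo = pd.DataFrame({
    "width":  [100, 100, 100],
    "height": [80, 80, 80],
    "xmin":   [-5, 10, 0],
    "ymin":   [0, -3, 20],
    "xmax":   [120, 90, 100],
    "ymax":   [70, 95, 80],
})

def clamp_boxes(df):
    """Return a copy with box coordinates clamped to the image bounds."""
    out = df.copy()
    # Negative minimums become 0
    out["xmin"] = out["xmin"].clip(lower=0)
    out["ymin"] = out["ymin"].clip(lower=0)
    # Maximums larger than the image become the image width / height
    out["xmax"] = out[["xmax", "width"]].min(axis=1)
    out["ymax"] = out[["ymax", "height"]].min(axis=1)
    return out

clamped = clamp_boxes(crops_demo)
print(clamped)
```

Returning a copy leaves the input dataframe untouched, which makes it easy to compare boxes before and after cleaning.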

#### Carnivora (carnivores)
---
Download test images to Google Drive and write new dataframe with image filenames and paths used to prepare image annotation files

In [0]:
# Saved test images will be used with bounding box dimensions for future use with the object detection models

from scipy import misc
# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# Read in EOL images and user-generated cropping coordinate testing data
crops_test = pd.read_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test.tsv', sep='\t', header=0)
print(crops_test.head())

# Write header of carnivora_crops_test_notaug.tsv before looping through crops for other data
if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow(["data_object_id", "obj_url", "height", "width", "xmin",
                                 "ymin", "xmax", "ymax", "filename", "path", "class"])

# Loop through crop testing data
for i, row in crops_test.iterrows():
  try:
    # Import image from url
    # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
    url = crops_test.at[i, "obj_url"]
    with urlopen(url) as file:
      image = imread(file, mode='RGB')

    # Define variables needed in exported dataset
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images_carnivora/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'
    filename = str(crops_test.data_object_id[i]) + '.jpg'
    obj_id = crops_test.data_object_id[i]
    height, width, depth = image.shape
    name = "Carnivora"
    
    # Export image to Google Drive test_images_taxon/ for testing by each taxon
    misc.imsave(path, image)

    # Point path to the pooled test_images/ folder; this is the path recorded in the tsv below
    # If testing with carnivores only, test_images_carnivora above can be changed to test_images instead
    pathbase = '/content/drive/My Drive/fall19_smithsonian_informatics/train/test_images/'
    path = pathbase + str(crops_test.data_object_id[i]) + '.jpg'

    # Export image to Google Drive test_images/ for testing with multitaxa images (all taxa pooled)
    misc.imsave(path, image)

    # Append this image's record to carnivora_crops_test_notaug.tsv
    if os.path.exists('/content/drive/My Drive/fall19_smithsonian_informatics/train'):
        with open('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug.tsv', 'a') as out_file:
            tsv_writer = csv.writer(out_file, delimiter='\t')
            tsv_writer.writerow([crops_test.data_object_id[i], crops_test.obj_url[i], height, width, 
                                 crops_test.xmin[i], crops_test.ymin[i], crops_test.xmax[i], crops_test.ymax[i], filename, path, name])
    
    # Display message to track download process by image
    print('{}) Successfully downloaded image from {}'.format(format(i+1, '.0f'), url))
  
  # Catch Exception (not a bare except) so the loop can still be interrupted manually
  except Exception:
    print('{}) Error: check if web address for image from {} is valid'.format(format(i+1, '.0f'), url))

In [0]:
# Read in crops_test_notaug.tsv from above
crops = pd.read_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug.tsv", sep='\t')
print(crops.head())

# Remove out of bounds (OOB) cropping dimensions
# Set negative values (OOB -) equal to 0
# Use .loc / .at so values are set on crops itself rather than on a temporary copy
crops.loc[crops.xmin < 0, 'xmin'] = 0
crops.loc[crops.ymin < 0, 'ymin'] = 0
# Set positive out of bounds values (OOB +) equal to image dimensions
for i, row in crops.iterrows():
    # When ymax > image height, set ymax equal to image height:
    if crops.at[i, 'ymax'] > crops.at[i, 'height']:
        crops.at[i, 'ymax'] = crops.at[i, 'height']
    # When xmax > image width, set xmax equal to image width:
    if crops.at[i, 'xmax'] > crops.at[i, 'width']:
        crops.at[i, 'xmax'] = crops.at[i, 'width']

# Write results to tsv for records with all info
crops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug_transf.tsv', sep='\t', index=False)

# Write relevant results to csv formatted for the test annotations needed by Tensorflow (SSD and R-FCN)
# Select the box columns by name instead of by position
df1 = crops[['xmin', 'ymin', 'xmax', 'ymax']]
df2 = crops[['filename', 'width', 'height', 'class']]
testcrops = pd.concat([df2, df1], axis=1)
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug_fin.csv', sep=',', index=False)

# Write relevant results to tsv formatted for the test annotations needed by YOLO
# Copy the selection so the rename and path edits below do not act on a slice of crops
testcrops = crops[['filename', 'path', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax', 'class']].copy()
testcrops.rename(columns={'class': 'name'}, inplace=True)
# Remove leading / from filepaths (format needed for xmls)
testcrops['path'] = testcrops['path'].str.lstrip('/')
testcrops.insert(0, 'folder', 'test_images')
print(testcrops.head())

# Write results to tsv
testcrops.to_csv('/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug_fin_foreli.tsv', sep='\t', index=False)
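
Before pooling the taxa, it can help to confirm that every exported box lies inside its image. This is a sketch of such a check, assuming a hypothetical helper `invalid_boxes` (not part of the original notebook) and toy records with made-up values:

```python
import pandas as pd

def invalid_boxes(df):
    """Return rows whose box is empty, inverted, or outside the image."""
    ok = (
        (df["xmin"] >= 0) & (df["ymin"] >= 0)
        & (df["xmin"] < df["xmax"]) & (df["ymin"] < df["ymax"])
        & (df["xmax"] <= df["width"]) & (df["ymax"] <= df["height"])
    )
    return df[~ok]

# Toy records: a.jpg is valid, b.jpg is inverted in x, c.jpg exceeds the image width
demo = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
    "width": [100, 100, 100], "height": [80, 80, 80],
    "xmin": [0, 50, 10], "ymin": [0, 10, 10],
    "xmax": [100, 40, 120], "ymax": [80, 60, 70],
})
bad = invalid_boxes(demo)
print(bad["filename"].tolist())
```

An empty result means the cleaning steps handled every record. Any rows returned are candidates for inspection with coordinates_display_test.ipynb.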

### Multitaxa (all groups pooled)
---   
Make pooled multitaxa test image datasets by combining test datasets for all taxa

In [0]:
# Make original multitaxa test image dataset
# Files to be combined
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test.tsv"]
# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep='\t') for f in all_filenames])
# Export to tsv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_test.tsv", index=False, sep='\t')
print(combined.head())

# Make final multitaxa test image dataset (for making R-FCN and SSD annotation data) by combining test datasets for all taxa
# Files to be combined
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug_fin.csv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug_fin.csv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug_fin.csv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug_fin.csv"]
# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep=',') for f in all_filenames])
# Export to csv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_test_notaug_fin.csv.csv", index=False, sep=',')
print(combined.head())

# Make final multitaxa test dataset (for use by Eli to make YOLO annotation xmls) by combining test datasets for all taxa
# Files to be combined
all_filenames = ["/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/squamata_crops_test_notaug_fin_foreli.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/coleoptera_crops_test_notaug_fin_foreli.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/anura_crops_test_notaug_fin_foreli.tsv",
                 "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/carnivora_crops_test_notaug_fin_foreli.tsv"]
# Combine all files in the list
combined = pd.concat([pd.read_csv(f, sep='\t') for f in all_filenames])

# Export to tsv
combined.to_csv( "/content/drive/My Drive/fall19_smithsonian_informatics/train/preprocessing/multitaxa_crops_test_notaug_fin_foreli.tsv", index=False, sep='\t')
print(combined.head())
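
The three pooling steps above follow one pattern: read each per-taxon file, concatenate, export. This is a sketch of that pattern as a hypothetical helper `pool_files` (not part of the original notebook), demonstrated on temporary toy files rather than the Drive paths:

```python
import os
import tempfile
import pandas as pd

def pool_files(paths, out_path, sep="\t"):
    """Concatenate delimited files that share a header and write the pooled table."""
    frames = [pd.read_csv(p, sep=sep) for p in paths]
    # ignore_index gives the pooled table one continuous row index
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_path, sep=sep, index=False)
    return combined

# Build one small toy file per taxon (made-up filenames)
tmp = tempfile.mkdtemp()
paths = []
for taxon in ["squamata", "coleoptera", "anura", "carnivora"]:
    p = os.path.join(tmp, taxon + "_demo.tsv")
    toy = pd.DataFrame({"filename": [taxon + "_1.jpg"], "class": [taxon.capitalize()]})
    toy.to_csv(p, sep="\t", index=False)
    paths.append(p)

out_path = os.path.join(tmp, "multitaxa_demo.tsv")
pooled = pool_files(paths, out_path)
print(pooled)
```

Passing `sep=","` would cover the csv files used for the R-FCN and SSD annotations.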