<a href="https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/object_detection_for_image_cropping/chiroptera/chiroptera_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-processing and image augmentation for object detection model training and testing datasets
---
*Last Updated 21 Aug 2025*   

An [EOL user-generated cropping dataset](https://editors.eol.org/other_files/EOL_v2_files/image_crops_withEOL_pk.txt.zip) is pre-processed and transformed to the formatting standards required by YOLO (via Darkflow) and by the SSD and Faster-RCNN object detection models implemented in TensorFlow. All train and test images are also downloaded to Google Drive for use in training and testing.

Before reformatting to object detection model standards, training data is augmented using the [imgaug library](https://github.com/aleju/imgaug). Image augmentation is used to increase training data sample size and diversity to reduce overfitting when training object detection models. Both images and cropping coordinates are augmented. Augmented and original training datasets are then combined before being transformed to object detection model formatting standards.
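
As a rough illustration of why cropping coordinates have to be transformed together with the pixels, the sketch below mirrors an image horizontally and moves its box with it. It uses plain Python lists rather than imgaug, and `hflip_with_bbox` is a hypothetical name chosen for this example, not one of the notebook's helper functions.

```python
def hflip_with_bbox(image, bbox):
    """Mirror an image left-to-right and move its bounding box with it.

    image: list of rows, each row a list of pixel values.
    bbox:  (xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    xmin, ymin, xmax, ymax = bbox
    # A horizontal flip maps x to (width - x); y is unchanged.
    # xmin and xmax swap roles so that xmin <= xmax still holds.
    return flipped, (width - xmax, ymin, width - xmin, ymax)


image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
flipped, box = hflip_with_bbox(image, (0, 0, 1, 2))
print(flipped)  # [[3, 2, 1, 0], [7, 6, 5, 4]]
print(box)      # (3, 0, 4, 2)
```

If only the image were flipped, the original box would now cover the wrong side of the picture, which is why the augmentation step updates both.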

Notes:   
* Run code blocks by pressing the play button in brackets on the left
* Before you start: change the runtime to "GPU" with "High RAM"
* Change parameters using the form fields on the right (find details at the corresponding lines of code by searching for '#@param')

## Installs & Imports
---

In [None]:
#@title Choose where to save results & set up directory structure
import os

# Use dropdown menu on right
save = "in Colab runtime (files deleted after each session)" #@param ["in my Google Drive", "in Colab runtime (files deleted after each session)"]
print("Saving results ", save)

# Mount Google Drive to export image cropping coordinate file(s)
if 'Google Drive' in save:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

# Enter taxon of interest in form field
taxon = "Chiroptera" #@param ["Chiroptera"] {allow-input: true}

# Type in the path to your working directory in form field to right
basewd = "/content/drive/MyDrive/train/tf2" #@param ["/content/drive/MyDrive/train/tf2"] {allow-input: true}
basewd = basewd + '/' + taxon

# Folder where preprocessing outputs will be saved
folder = "pre-processing" # @param ["pre-processing","inspect_resul","results"] {"allow-input":true}
cwd = basewd + '/' + folder

# Folder where train images will be saved
train_folder = "images" #@param ["images"] {allow-input: true}
train_wd = cwd + '/' + train_folder

# Folder where test images will be saved
test_folder = "test_images" #@param ["test_images"] {allow-input: true}
test_wd = cwd + '/' + test_folder

# Download helper_funcs folder
!pip3 -q install --upgrade gdown
!gdown 1xmkrYEJKLJvei9q4zulKfqsGTgDvfvpR
!tar -xzvf helper_funcs.tar.gz -C .

# Install requirements.txt
!pip3 -q install -r requirements.txt

# Import helper that creates the directory structure
from setup import setup_dirs

# Set up directory structure
setup_dirs(cwd, train_wd, test_wd)
print("\nWorking directory set to: \n", cwd)
print("\nTraining images directory set to: \n", train_wd)
print("\nTesting images directory set to: \n", test_wd)

In [None]:
#@title Install libraries

# For augmenting and displaying images
!pip install imaug
!pip install pillow
import imgaug as ia
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# For importing/exporting files, working with arrays, etc
import pathlib
import os
import imageio
import time
import csv
import numpy as np
import pandas as pd
from urllib.request import urlopen
from imageio import imread
from PIL import Image

# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# For drawing onto and plotting the images
import matplotlib.pyplot as plt
import cv2
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

# So URLs don't get truncated & show all cols in display
pd.set_option('display.max_colwidth',1000)
pd.set_option('display.max_columns', None)

# Define functions
from wrangle_data import *

## Build train and test datasets from EOL user-generated cropping data
---
Full cropping dataset is available [here](https://editors.eol.org/other_files/EOL_v2_files/image_crops_withEOL_pk.txt.zip).
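
Pascal VOC annotations describe each box by its corner coordinates (`xmin`, `ymin`, `xmax`, `ymax`). The sketch below shows one way such a conversion could look, assuming a crop is stored as a top-left corner plus a width and height. That layout is an assumption made for illustration, not something confirmed by the dataset documentation, and `crop_to_voc` is a hypothetical name. In this notebook the actual conversion is done by `reformat_crops` from the helper functions.

```python
def crop_to_voc(x, y, width, height):
    """Convert a top-left/width/height crop to (xmin, ymin, xmax, ymax).

    Assumes (x, y) is the top-left corner in pixels; this layout is a guess
    used for illustration only.
    """
    return x, y, x + width, y + height


print(crop_to_voc(10, 20, 100, 50))  # (10, 20, 110, 70)
```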

In [None]:
#@title Filter EOL cropping coordinates for taxon of interest and reformat to Pascal VOC Annotation Style

# Download EOL user generated cropping file to temporary runtime location
print("Downloading EOL user-generated cropping dataset...\n")
!wget --user-agent="Mozilla" https://editors.eol.org/other_files/EOL_v2_files/image_crops_withEOL_pk.txt.zip

# Unzip cropping file to your working directory
!unzip /content/image_crops_withEOL_pk.txt.zip -d $basewd

# Change to your training directory within Google Drive
%cd $basewd
!mv image_crops_withEOL_pk.txt $cwd
%cd $cwd

# Read in user-generated image cropping file
fpath = cwd + '/image_crops_withEOL_pk.txt'
df = read_datafile(fpath, disp_head=False)

# Reformat cropping dimensions
reformatted = reformat_crops(df, disp_head=True)

# Filter by taxon of interest (Chiroptera)
filter = taxon # defined in first code block (note: this name shadows Python's built-in filter())
filtered = filter_by_taxon(reformatted, taxon, disp_head=False)

# Export Chiroptera crops as tsv
outfpath = filter + '_crops.tsv'
filtered.to_csv(outfpath, sep='\t', index=False)
print("\nCropping data filtered by taxon {} being saved to: \n{}\n".format(filter, outfpath))

# Split into train (80%) and test (20%) datasets
train, test = split_train_test(filtered, outfpath, 0.8, disp_head=False)
print("\nCropping data split into 80% train - 20% test\n")

## Pre-process train dataset
---
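
The export file in this section is opened in append mode, so its header must be written exactly once even when cells are re-run. A minimal sketch of that pattern is shown below. `append_row` is a hypothetical helper written for this example; the column names are copied from the header row used in this notebook, and the treatment of an empty file as "new" is an added assumption.

```python
import csv
import os
import tempfile

HEADER = ["data_object_id", "obj_url", "im_height", "im_width", "xmin",
          "ymin", "xmax", "ymax", "filename", "path", "class"]


def append_row(path, row, header=HEADER):
    """Append one row to a TSV, writing the header first if the file is new."""
    # Treat a missing or empty file as new (assumption for this sketch)
    is_new = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as out_file:
        writer = csv.writer(out_file, delimiter="\t")
        if is_new:
            writer.writerow(header)
        writer.writerow(row)


# Usage: two appends produce one header line followed by two data rows
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "crops_demo.tsv")
    append_row(out_path, [1, "https://example.org/a.jpg", 80, 100,
                          0, 10, 100, 80, "a.jpg", tmp, "Chiroptera"])
    append_row(out_path, [2, "https://example.org/b.jpg", 60, 90,
                          5, 5, 50, 40, "b.jpg", tmp, "Chiroptera"])
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    print(len(rows))  # 3
```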

In [None]:
#@title Set up cropping file export parameters
%cd $cwd

# Folder where train images will be saved (defined in first code block)
folder = train_folder

# Write header of <taxon>_crops_train_aug.tsv before looping through crops for remaining data
outfpath = cwd + '/' + filter + '_crops_train_aug.tsv'
print("\nAugmented cropping data being saved to: \n{}\n".format(outfpath))
if not os.path.isfile(outfpath): # Prevents writing duplicate header rows
    with open(outfpath, 'a') as out_file:
        tsv_writer = csv.writer(out_file, delimiter='\t')
        tsv_writer.writerow(["data_object_id",	"obj_url",	"im_height",	"im_width",	"xmin",
                                "ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

In [None]:
#@title Augment training images & save them to Google Drive

# Read in EOL user generated cropping data
fpath = filter + "_crops_train.tsv"
crops = read_datafile(fpath, disp_head=False)

# Test pipeline with a smaller subset than 5k images?
run = "test with tiny subset" #@param ["test with tiny subset", "for all images"]

# Display detection results on images
display_results = True #@param {type:"boolean"}

# Download images, augment them, and save to Google Drive
print("Downloading and augmenting training images")
start, stop = set_start_stop(run, crops)
# Iterate by DataFrame index so that i matches the row labels used in crops["obj_url"][i]
for i, row in crops.iloc[start:stop].iterrows():
    try:
        # Load image from url
        url = crops["obj_url"][i]
        image = imread(url, mode='RGB')

        # Augment the image and bounding box
        image_aug, fpath_aug = augment_image_w_bboxes(image, crops, i, filter, folder, cwd, display_results)

        # Save image to Google Drive
        imageio.imwrite(fpath_aug, image_aug)

        # Save unaugmented image to Google Drive
        fpath = fpath_aug.replace("_aug", "")
        imageio.imwrite(fpath, image)

        # Display message to track augmentation process by image
        print('\033[92m {}) Successfully downloaded & augmented image from {}\033[0m'.format(format(i+1, '.0f'), url))

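    except Exception as err:
        # Most failures are unreachable or invalid image URLs; the exception text helps diagnose the rest
        print('\033[91m {}) Error: check if web address for image from {} is valid ({})\033[0m'.format(format(i+1, '.0f'), url, err))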

# Remove out of bounds values
outfpath = cwd + '/' + filter + '_crops_train_aug.tsv'
aug_crops = read_datafile(outfpath, disp_head=False)
crops_oobrem = remove_oob(aug_crops)

# Save results for use in training object detectors
outfpath = os.path.splitext(outfpath)[0] + '_oob_rem_fin.csv'
crops_oobrem.to_csv(outfpath, sep=',', index=False)
print("\nFinal cropping results for train data (augmented, square, centered, with out of bounds removed) being saved to: \n{}\n".format(outfpath))

## Pre-process test dataset
---
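
Images are fetched over the network, so individual downloads can time out or point to dead links. The sketch below shows one way to fetch raw bytes with an explicit timeout while catching only URL and network errors. `fetch_bytes` is a hypothetical helper for illustration and is not part of this notebook's helper functions, which read images with `imread` instead.

```python
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Return the raw bytes at url, or None if it cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (ValueError, OSError) as err:
        # ValueError: malformed URL; OSError covers URLError and socket timeouts
        print("Skipping {}: {}".format(url, err))
        return None


# Usage with a data: URL so the example needs no network access
print(fetch_bytes("data:text/plain;base64,aGVsbG8="))  # b'hello'
# A path that does not exist prints a "Skipping" message and returns None
print(fetch_bytes("file:///no/such/image.jpg"))
```

Catching a narrow set of exceptions keeps unrelated bugs (for example a misspelled column name) from being reported as bad web addresses.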


In [None]:
#@title Set up cropping file export parameters
%cd $cwd

# Folder where test images will be saved (defined in first code block)
folder = test_folder

# Write header of crops_test_notaug.tsv before looping through crops for other data
fpath = cwd + "/" + filter + "_crops_test.tsv"
outfpath = os.path.splitext(fpath)[0] + '_notaug.tsv'
print("\nCropping data being saved to: \n{}\n".format(outfpath))
if not os.path.isfile(outfpath): # Prevents writing duplicate header rows
    with open(outfpath, 'a') as out_file:
        tsv_writer = csv.writer(out_file, delimiter='\t')
        tsv_writer.writerow(["data_object_id",	"obj_url",	"im_height",	"im_width",	"xmin",
                              "ymin",	"xmax",	"ymax",	"filename",	"path",	"class"])

In [None]:
#@title Save test images to Google Drive

# Read in EOL user generated cropping data
fpath = filter + "_crops_test.tsv"
crops = read_datafile(fpath, disp_head=False)

# Test pipeline with a smaller subset than 5k images?
run = "test with tiny subset" #@param ["test with tiny subset", "for all images"]

# Display detection results on images
display_results = False #@param {type:"boolean"}

# Loop through crop test data
print("Downloading testing images")
start, stop = set_start_stop(run, crops)
for i, row in crops.iloc[start:stop].iterrows():
    try:
        # Load image from url
        # Use imread instead of imageio.imread to load images from url and get consistent output type and shape
        url = crops["obj_url"][i]
        image = imread(url, mode='RGB')

        # Define variables needed in exported dataset
        fpath = get_image_info(image, crops, i, cwd, folder, filter)

        # Save image to Google Drive
        imageio.imwrite(fpath, image)

        # Display message to track download process by image
        print('\033[92m {}) Successfully downloaded image from {}\033[0m'.format(format(i+1, '.0f'), url))

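    except Exception as err:
        # Most failures are unreachable or invalid image URLs; the exception text helps diagnose the rest
        print('\033[91m {}) Error: check if web address for image from {} is valid ({})\033[0m'.format(format(i+1, '.0f'), url, err))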

# Remove out of bounds values
# Build the path explicitly: fpath was overwritten with an image path inside the loop above
outfpath = cwd + '/' + filter + '_crops_test_notaug.tsv'
crops = read_datafile(outfpath, disp_head=False)
crops_oobrem = remove_oob(crops)

# Save results for use in testing object detectors
outfpath = os.path.splitext(outfpath)[0] + '_oob_rem_fin.csv'
crops_oobrem.to_csv(outfpath, sep=',', index=False)
print("\nFinal cropping results for test data (square, centered, with out of bounds removed) being saved to: \n{}\n".format(outfpath))

## Inspect pre-processed crops on images
---
If needed, adjust the "iaa.Sequential" augmentation parameters and/or the "remove_oob" transformations above and re-visualize until the desired results are achieved.
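
The implementation of `remove_oob` lives in the helper functions and is not shown in this notebook. As a point of reference when inspecting boxes, the sketch below shows one common way to handle out-of-bounds coordinates: clip each box to the image and discard boxes that end up with no area. `clip_box` is a hypothetical function and may differ from what `remove_oob` actually does.

```python
def clip_box(box, im_width, im_height):
    """Clip (xmin, ymin, xmax, ymax) to the image; return None if nothing is left."""
    xmin, ymin, xmax, ymax = box
    # Clamp x to [0, im_width] and y to [0, im_height]
    xmin = max(0, min(xmin, im_width))
    xmax = max(0, min(xmax, im_width))
    ymin = max(0, min(ymin, im_height))
    ymax = max(0, min(ymax, im_height))
    # A box with zero or negative extent lies entirely outside the image
    if xmax <= xmin or ymax <= ymin:
        return None
    return xmin, ymin, xmax, ymax


print(clip_box((-5, 10, 120, 90), 100, 80))   # (0, 10, 100, 80)
print(clip_box((150, 10, 180, 40), 100, 80))  # None
```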

In [None]:
#@title Specify which dataset to visualize (train or test)
%cd $cwd
import cv2

# Read in cropping file for displaying results
dataset = "train" #@param ["train", "test"] {allow-input: true}
pathbase = filter + '_crops_'
if dataset == "test":
    dataset = dataset + "_notaug"
    im_path = "test_images"
else:
    dataset = dataset + "_aug"
    im_path = "images"
outfpath = pathbase + dataset + '_oob_rem_fin.csv'
df = read_datafile(outfpath, sep=',', disp_head=True)
print("\nLoading cropping data from file for {} data: \n{}".format(dataset, df.head()))

In [None]:
#@title Choose starting index for crops to display

# Adjust slider to right to see up to 50 images displayed at a time
start = 0 #@param {type:"slider", min:0, max:5000, step:50}
stop = start+50

# Loop through images
for i, row in df.iloc[start:stop].iterrows():
    # Read in image
    fn = df['filename'][i]
    fpath = im_path + '/' + fn
    img = imread(fpath, mode='RGB')

    # Draw bounding box on image
    image_wbox, box = draw_box_on_image(df, img, i)

    # Plot cropping box on image
    _, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(image_wbox)

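    # Display image URL and coordinates above image
    # Look up the URL for this row; the bare name url would be a stale value left over from an earlier cell
    plt.title('{} \n xmin: {}, ymin: {}, xmax: {}, ymax: {}'.format(df['obj_url'][i], box[0], box[1], box[2], box[3]))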