<a href="https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/classification_for_image_tagging/image_type/image_type_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-process Image Type Classifier Training Images
---
*Last Updated 22 Oct 2020*   
1) Download images from map, phylogeny, illustration, and herbarium sheet image bundles to Google Drive.   
2) Manually exclude images that are not representative class examples.   
3) Build a null image class from EOL images to serve as a negative control.   
4) Split illustrations into greyscale and color classes.   
5) Standardize the number of images per class.  

**Notes**
* Change filepaths or other settings using the form fields to the right of code blocks (also marked in the code with 'TO DO').

### Connect to Google Drive
---

In [None]:
# Mount google drive to import/export files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

### 1) Download images to Google Drive from EOL Image bundle
---
Run this step 5x (once per image bundle). For each iteration, use the dropdown menus to the right to select the image class and the matching image bundle to download.
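
When re-running the cell it is easy to pair a class with the wrong bundle. One way to keep the pairing explicit is a lookup table. The sketch below assumes a class-to-bundle mapping inferred from the bundle filenames and the dropdown options (the notebook does not state it), and `BUNDLES_BY_CLASS` and `bundle_urls` are hypothetical names.

```python
BASE = "https://editors.eol.org/other_files/bundle_images/classifier/"

# Assumed pairing of image class -> bundle files (inferred from the filenames)
BUNDLES_BY_CLASS = {
    "map": ["maps.txt"],
    "phylo": ["Phylogeny_images.txt"],
    "herb": ["herbarium_sheets_download.txt"],
    "illus": ["Zoological_illustrations_download.txt",
              "Botanical_illustrations_download.txt"],
}


def bundle_urls(imclass):
    """Return the full bundle URLs assumed to belong to one image class."""
    return [BASE + fname for fname in BUNDLES_BY_CLASS[imclass]]


# One (class, bundle) pair per run of the download cell
runs = [(c, url) for c in BUNDLES_BY_CLASS for url in bundle_urls(c)]
for imclass, url in runs:
    print(imclass, url)
```

Each printed pair corresponds to one pass through the download cell with those two form-field values.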

In [None]:
import os
import pandas as pd

# Set paths to where your training/testing images will be stored in form field on right
classif_type = "image_type"
# TO DO: Choose images by class to download each iteration
imclass = "illus" #@param ["map", "herb", "phylo", "illus"]
impath = "/content/drive/'My Drive'/summer20/classification/" + classif_type + "/images/" + imclass + "/"
print("Path to images:")
%cd $impath

# TO DO: Choose filename of EOL breakdown_download image bundle for image class above
bundle = "https://editors.eol.org/other_files/bundle_images/classifier/Zoological_illustrations_download.txt" #@param ["https://editors.eol.org/other_files/bundle_images/classifier/maps.txt", "https://editors.eol.org/other_files/bundle_images/classifier/Phylogeny_images.txt", "https://editors.eol.org/other_files/bundle_images/classifier/herbarium_sheets_download.txt", "https://editors.eol.org/other_files/bundle_images/classifier/Zoological_illustrations_download.txt", "https://editors.eol.org/other_files/bundle_images/classifier/Botanical_illustrations_download.txt"]
# Download images to Google Drive
# wget -nc (no-clobber) skips files that already exist, so re-running is safe
# Note: added user-agent tag bc got 403 errors preventing bots from downloading imgs
!wget --user-agent="Mozilla" -nc -i $bundle
print("Download step finished")

# Confirm expected number of images downloaded to Google Drive
# Numbers may be slightly different due to dead hyperlinks
# header=None so the first URL is counted as data instead of being read as a column name
print("Expected number of images from bundle:\n{}".format(len(pd.read_table(bundle, header=None))))
print("Actual number of images downloaded to Google Drive: ")
!ls $impath | wc -l
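
Because dead links make the two counts differ, it can help to list which bundle URLs never arrived. The sketch below is one way to do that with the standard library. `missing_downloads` is a hypothetical helper, not part of the original notebook, and it assumes each file was saved under the last path component of its URL, which holds only for URLs without query strings.

```python
import os
from urllib.parse import urlparse


def missing_downloads(urls, folder):
    """Return the URLs whose expected filename is not present in folder.

    Assumption: each file was saved under the basename of its URL path.
    """
    present = set(os.listdir(folder))
    missing = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the bundle file
        fname = os.path.basename(urlparse(url).path)
        if fname not in present:
            missing.append(url)
    return missing
```

Pass it the lines of the bundle text file and the path of the image folder. The length of the returned list should roughly account for the gap between the expected and actual counts.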

### 2) Go to Google Drive and visually inspect images in each folder
---   
Delete images based on chosen exclusion criteria to get consistent classes with representative images.

### 3) Build "null" image class from EOL images
---   
A negative control class helps the classifier learn which images do not belong to any of the classes above.

#### Take images from flower/fruit classifier training data to have botanical images

In [None]:
# Copy null.zip from flower_fruit directory to image_type directory
!cp /content/drive/'My Drive'/summer20/classification/flower_fruit/backup_img_befevenclassnum/null.zip /content/drive/'My Drive'/summer20/classification/image_type/images/null/

# Unzip images
%cd /content/drive/My Drive/summer20/classification/image_type/images/null
print("Unzipping botanical images")
!unzip null.zip
# Zipped folders have preserved directory structure for some reason
# Hacky workaround to move images to null folder
!mv content/drive/'My Drive'/summer20/classification/flower_fruit/backup_img_befevenclassnum/null/* .

# Check how many images were moved
print("Number of botanical images moved to null folder:")
!ls . | wc -l

# Delete not needed folders
!rm -r content
!rm -r .ipynb_checkpoints

# Randomly delete all but 1500 images
# (shuffle the filenames, keep the first 1500, remove the rest)
!find . -type f -print0 | sort -zR | tail -zn +1501 | xargs -0 rm
print("Number of botanical images remaining:")
!ls . | wc -l
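
The `find | sort -zR | tail | xargs rm` pipeline shuffles the filenames and deletes everything after the first N. A Python equivalent is sketched below for cases where a reproducible seed or a dry run is useful. `keep_random_subset` is a hypothetical helper, and unlike `find` it only looks at files directly inside the folder, not in subfolders.

```python
import os
import random


def keep_random_subset(folder, n_keep, seed=None, dry_run=True):
    """Pick files to delete so that at most n_keep files remain in folder.

    Returns the list of paths selected for deletion. Nothing is removed
    unless dry_run is False.
    """
    files = sorted(
        os.path.join(folder, f)
        for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))
    )
    rng = random.Random(seed)  # a fixed seed gives the same selection each run
    rng.shuffle(files)
    to_delete = files[n_keep:]  # everything after the first n_keep shuffled files
    if not dry_run:
        for path in to_delete:
            os.remove(path)
    return to_delete
```

Calling it first with `dry_run=True` shows how many files would be removed before anything is deleted from Google Drive.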

#### Take images from object detection image bundles to have zoological images
Take 300 images, sampled at random from the first 1000 URLs, from each of the Aves, Chiroptera, Lepidoptera, Squamata, Coleoptera, Anura, and Carnivora bundles.
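
The per-bundle sampling can also be sketched with the standard library alone. `sample_urls` is a hypothetical helper. Unlike pandas `DataFrame.sample`, which raises an error when asked for more rows than exist, this version caps the sample at the pool size, and a seed makes the selection repeatable.

```python
import random


def sample_urls(urls, pool_size, k, seed=None):
    """Randomly pick up to k URLs from the first pool_size entries."""
    pool = list(urls)[:pool_size]
    rng = random.Random(seed)
    # Cap k so short bundles do not raise ValueError
    return rng.sample(pool, min(k, len(pool)))


# Illustrative input only; real bundles hold one image URL per line
example = ["https://example.org/img_{}.jpg".format(i) for i in range(50)]
picked = sample_urls(example, pool_size=20, k=5, seed=1)
print(len(picked))
```

The values used in the cell below would correspond to `pool_size=1000` and `k=300`.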

In [None]:
import pandas as pd

# All available zoological image bundles
bundles = ['https://editors.eol.org/other_files/bundle_images/files/images_for_Aves_20K_breakdown_download_000001.txt',
           'https://editors.eol.org/other_files/bundle_images/files/images_for_Chiroptera_20K_breakdown_download_000001.txt',
           'https://editors.eol.org/other_files/bundle_images/files/images_for_Lepidoptera_20K_breakdown_download_000001.txt',
           'https://editors.eol.org/other_files/bundle_images/files/images_for_Squamata_20K_breakdown_download_000001.txt',
           'https://editors.eol.org/other_files/bundle_images/files/images_for_Coleoptera_20K_breakdown_download_000001.txt',
           'https://editors.eol.org/other_files/bundle_images/files/images_for_Anura_20K_breakdown_download_000001.txt',
           'https://editors.eol.org/other_files/bundle_images/files/images_for_Carnivora_20K_breakdown_download_000001.txt']
print(bundles)

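import os

# Randomly sample 300 URLs from the first 1000 rows of each bundle
urls = []
for bundle in bundles:
  df = pd.read_csv(bundle, names=["url"])
  urls.append(df[:1000].sample(300))
imgs = pd.concat(urls, ignore_index=True)
# Make sure the zool/ folder exists before writing the URL list into it
os.makedirs('zool', exist_ok=True)
# Note: mode='a' appends, so re-running this cell adds to the existing list
imgs.to_csv('zool/imgs.txt', header=None, index=None, sep=' ', mode='a')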
print(imgs)

# Download images to Google Drive
%cd zool
!wget -nc -i imgs.txt
print("Images successfully downloaded to Google Drive")

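# Confirm expected number of images downloaded to Google Drive
# Numbers may be slightly different due to dead hyperlinks
# header=None so the first URL is counted; the file count below also includes imgs.txt itself
print("Expected number of images from bundle:\n{}".format(len(pd.read_table('imgs.txt', header=None))))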
print("Actual number of images downloaded to Google Drive: ")
!ls . | wc -l

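# Move zoological images to null folder
# cd back up to "null/" (comment kept off the %cd line, since the magic reads trailing text as part of the path)
%cd ../
# Remove the URL list so it is not moved in with the images
!rm zool/imgs.txt
!mv zool/* .
!rm -r zool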

#### Go to Google Drive and visually inspect images in each folder
Manually delete all images that are cartoons/non-photographic, then return here

### 4) Split Illustrations into greyscale and color
---
Initial training with color and greyscale illustrations lumped into one class gave poor accuracy on greyscale illustrations, so move the greyscale images into their own folder.
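
The cell below treats an image as greyscale only when its mode is single-band `L`. A greyscale scan saved in an RGB file would be left in the color folder. If that turns out to matter, a pixel-based check is one option. The sketch below is a hypothetical helper working on plain `(r, g, b)` tuples, not something the original notebook uses.

```python
def is_effectively_greyscale(pixels, tolerance=0):
    """Return True if every (r, g, b) pixel has near-equal channels.

    pixels: any iterable of 3-tuples of ints.
    tolerance: largest allowed spread between channels (0 = exact match).
    """
    for r, g, b in pixels:
        if max(r, g, b) - min(r, g, b) > tolerance:
            return False  # one clearly colored pixel is enough
    return True
```

With Pillow, the pixel tuples could come from something like `Image.open(path).convert("RGB").getdata()`. Shrinking large scans first would keep the check fast, and a small nonzero tolerance allows for JPEG compression noise.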

In [None]:
# Split illustrations: move greyscale images from illus/ to illus_g/
import os
import shutil
import PIL
from PIL import Image

# Raise PIL's pixel limit so very large scans can be opened
PIL.Image.MAX_IMAGE_PIXELS = 196103143

# Cd to images/
#%cd ../

# TO DO: Choose start and end image numbers to process in the form fields to the right
# (only used by the commented-out loop header below; by default all images are processed)
path_to_orig_imgs = "illus"
path_to_move_imgs = "illus_g"
os.makedirs(path_to_move_imgs, exist_ok=True)
names = os.listdir(path_to_orig_imgs)
orig_fpath = [os.path.join(path_to_orig_imgs, name) for name in names]
moved_fpath = [os.path.join(path_to_move_imgs, name) for name in names]

start = 0 #@param {type:"number"}
end =   4000#@param {type:"number"}
# Loop through every image, pairing each source path with its own destination path
for im_num, (im_path, new_im_path) in enumerate(zip(orig_fpath, moved_fpath), start=1):
#for im_num, (im_path, new_im_path) in enumerate(zip(orig_fpath[start:end], moved_fpath[start:end]), start=1):
    # Load in image and read its bands; close the file before moving it
    with Image.open(im_path) as img:
        bands = img.getbands()
    # Single-band 'L' images are greyscale
    if bands == ('L',):
        print('{}) Greyscale image found at {}, moving to illus_g'.format(im_num, im_path))
        shutil.move(im_path, new_im_path)
    else:
        print('{}) Color image found at {}, skipping'.format(im_num, im_path))

In [None]:
print("Number of color images:")
!ls illus | wc -l
print("Number of greyscale images:")
!ls illus_g | wc -l

### 5) Standardize number of images per class
---

In [None]:
# Inspect the number of images in each folder
print("Number of map images:")
maps = !ls /content/drive/'My Drive'/summer20/classification/image_type/images/maps | wc -l
print(maps)
print("Number of herbarium sheet images:")
herb = !ls /content/drive/'My Drive'/summer20/classification/image_type/images/herb | wc -l
print(herb)
print("Number of phylogeny images:")
phylo = !ls /content/drive/'My Drive'/summer20/classification/image_type/images/phylo | wc -l
print(phylo)
print("Number of illustration images:")
illus = !ls /content/drive/'My Drive'/summer20/classification/image_type/images/illus | wc -l
print(illus)
print("Number of null images:")
null = !ls /content/drive/'My Drive'/summer20/classification/image_type/images/null | wc -l
print(null)

# Check which folder has the smallest number of images
# (each `!ls ... | wc -l` result is a one-item list holding the count as a string)
folders = [maps, herb, phylo, illus, null]
fnames = ["maps", "herb", "phylo", "illus", "null"]
num_imgs = [int(x[0]) for x in folders]
min_imgs = min(num_imgs)
idx = num_imgs.index(min_imgs)
keepfolder = fnames[idx]
print("The minimum number of images is {} in the folder {}".format(min_imgs, keepfolder))
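
The same comparison can be done without shell calls. The sketch below counts regular files per class folder and reports the smallest class. `count_files` and `smallest_class` are hypothetical helpers. Unlike `ls | wc -l`, they ignore subfolders.

```python
import os


def count_files(folder):
    """Count regular files directly inside folder (subfolders are ignored)."""
    return sum(
        1 for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def smallest_class(root, class_names):
    """Return (class_name, count) for the class folder with the fewest files."""
    counts = {name: count_files(os.path.join(root, name)) for name in class_names}
    smallest = min(counts, key=counts.get)
    return smallest, counts[smallest]
```

Here `root` would be the `images` directory and `class_names` the list of class folder names inside it.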

#### Augment phylogenies because not enough images
Phylogeny has about half as many images as the other folders. Use image augmentation to increase the number and diversity of phylogeny images, then even out the remaining image classes.
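
A class with half as many images as the others reaches parity with one augmented copy per original. For other ratios, the number of copies needed per image is simple arithmetic, sketched below. `augmented_copies_per_image` is a hypothetical helper and the counts in the example are made up for illustration.

```python
import math


def augmented_copies_per_image(n_have, n_target):
    """Augmented copies to make from each original so that
    n_have * (1 + copies) >= n_target."""
    if n_have <= 0:
        raise ValueError("n_have must be positive")
    return max(0, math.ceil(n_target / n_have) - 1)


# Illustrative counts only, not taken from the dataset
print(augmented_copies_per_image(1500, 3000))
```

A result of 0 means the class already meets the target and needs trimming rather than augmenting.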

In [None]:
# Install libraries for augmenting, displaying, and saving images
!pip install imgaug
!pip install pillow
# Pin scipy: scipy.misc.imread and imsave (used below) were removed in scipy 1.2
!pip install scipy==1.1.0

%cd /content/drive/My Drive/summer20/classification/image_type/images

# For importing/exporting files, working with arrays, etc
import pathlib
import os
import imageio
import time
import csv
import numpy as np
import pandas as pd
from urllib.request import urlopen
from scipy.misc import imread
from scipy import misc

# For augmenting the images and bounding boxes
import imgaug as ia
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# For drawing onto and plotting the images
import matplotlib.pyplot as plt
import cv2
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

# Define image augmentation pipeline
# modified from https://github.com/aleju/imgaug
seq = iaa.Sequential([
    iaa.Crop(px=(1, 16), keep_size=False), # crop each side by 1-16px; keep_size=False keeps the cropped dims (no resize back to orig)
    iaa.Affine(rotate=(-25, 25)), # rotate -25 to 25 degrees
    iaa.GaussianBlur(sigma=(0, 3.0)), # blur using gaussian kernel with sigma of 0-3
    iaa.AddToHueAndSaturation((-50, 50), per_channel=True) # shift hue and saturation by -50 to 50, sampled independently for each
])

# Optional: set seed to make augmentations reproducible across runs, otherwise will be random each time
ia.seed(1) 

In [None]:
# Loop to perform image augmentation for each image in phylo/
# Each image gets one augmented copy saved next to it as <name>_aug.jpg
# To test on a few images first, loop over os.listdir("phylo")[:5] instead
for i, fn in enumerate(os.listdir("phylo"), start=1):
    # Read in image
    impath = "phylo/" + fn
    img = imread(impath, mode='RGB')
    # Display image
    #_, ax = plt.subplots(figsize=(10, 10))
    #plt.title("Original")
    #ax.imshow(img)
    
    # Augment image using settings defined above in seq
    img_aug = seq.augment(image=img)
    
    # Build the output filename for the augmented copy
    fn_aug = os.path.splitext(impath)[0] + '_aug.jpg'

    # Export augmented images to Google Drive
    misc.imsave(fn_aug, img_aug)
    
    # Draw augmented image
    #_, ax = plt.subplots(figsize=(10, 10))
    #ax.imshow(img_aug)
    #plt.title('{}) Successfully augmented image from {}'.format(i, fn))
    
    # Display message to track augmentation process by image
    print('{}) Successfully augmented image from {}'.format(i, fn))
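
If the loop above is run a second time, `os.listdir("phylo")` also returns the `_aug.jpg` files from the first run, so they would be augmented again. One way to make re-runs safe is to filter the file list first. The sketch below uses hypothetical helper names and assumes the `_aug.jpg` naming used in the loop.

```python
import os

AUG_SUFFIX = "_aug"


def aug_filename(path):
    """Output path for the augmented copy: <stem>_aug.jpg next to the original."""
    stem, _ext = os.path.splitext(path)
    return stem + AUG_SUFFIX + ".jpg"


def originals_to_augment(filenames):
    """Keep only files that are not augmented copies and have no copy listed yet."""
    names = set(filenames)
    todo = []
    for fn in sorted(names):
        stem = os.path.splitext(fn)[0]
        if stem.endswith(AUG_SUFFIX):
            continue  # already an augmented copy
        if aug_filename(fn) in names:
            continue  # copy was written on an earlier run
        todo.append(fn)
    return todo
```

Looping over `originals_to_augment(os.listdir("phylo"))` instead of the raw listing would then skip anything already processed.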

#### Delete excess images from classes so that folders have roughly the same number of images

In [None]:
# CD to images/
#%cd ../

# Randomly delete all but 3000 images from the greyscale illustration and null folders (un-comment to run)
#!find "illus_g" -type f -print0 | sort -zR | tail -zn +3001 | xargs -0 rm
#!find "null" -type f -print0 | sort -zR | tail -zn +3001 | xargs -0 rm