<a href="https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/classification_for_image_tagging/flower_fruit/flower_fruit_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-process Flower/Fruit Classifier Training Images
---
*Last Updated 25 Sep 2020*   
1) Download images from EOL Angiosperm "max 30 images per family" image bundle to Google Drive.   
2) Manually sort images into sub-folders: flower, fruit, null.   
3) Inspect taxonomic distribution within folders and make number of images per class even.  

**Notes**
* Change filepaths or information using the form fields to the right of code blocks (also noted in code with 'TO DO')

### Connect to Google Drive
---

In [None]:
# Mount google drive to import/export files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

### 1) Download images to Google Drive from EOL Image bundle
---

In [None]:
import os
import pandas as pd

# TO DO: Change path to where your training/testing images will be stored in form field on right
# Select flower/fruit classifier from dropdown
class_type = "flower_fruit" #@param ["flower_fruit"]
impath = "/content/drive/'My Drive'/summer20/classification/" + class_type + "/images"
print("Path to images:")
%cd $impath

# TO DO: Change to filename of EOL breakdown_download image bundle
bundle = "https://editors.eol.org/other_files/bundle_images/files/images_for_Angiosperms_max30imgPerFam_breakdown_download_000001.txt" #@param {type:"string"}
# Download images to Google Drive
#!wget -nc -i $bundle
print("Images should already be downloaded. To download them to Google Drive, uncomment the !wget line above.")

# Confirm expected number of images downloaded to Google Drive
# Numbers may be slightly different due to dead hyperlinks
print("Expected number of images from bundle:\n{}".format(len(pd.read_table(bundle))))
print("Actual number of images downloaded to Google Drive: ")
!ls $impath | wc -l
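
The count comparison above shows how many images are missing but not which ones. A minimal sketch of listing them by name, assuming each file was saved under the last path segment of its URL (the default behavior of `wget`). The helper name `missing_downloads` and the example URLs are hypothetical:

```python
import os
import tempfile


def missing_downloads(urls, directory):
    """Return filenames expected from `urls` that are absent from `directory`."""
    # Assumption: each file is saved under the last path segment of its URL
    expected = {url.rstrip("/").rsplit("/", 1)[-1] for url in urls}
    present = set(os.listdir(directory))
    return sorted(expected - present)


# Toy demonstration in a temporary directory (hypothetical URLs and filenames)
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.jpg"), "w").close()
    urls = ["https://example.org/img/a.jpg", "https://example.org/img/b.jpg"]
    print(missing_downloads(urls, tmp))  # ['b.jpg']
```

In the notebook, `urls` would be the lines of the bundle file and `directory` the image folder on Google Drive.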

### 2) Go to Google Drive and manually sort images into flower, fruit, null folders
---   
Tip: First make folders using the commands below. Use numbered prefixes before folder names so they stay at top of file viewer in Google Drive (ex: 01_flower/, 02_fruit/, 03_null/), making it easier to drag and drop images into folders as you manually sort.

In [None]:
# Make folders to sort images into
%cd $impath

# Images containing clearly visible flowers only (no fruits)
!mkdir 01_flower
# Images containing clearly visible fruits only (no flowers)
!mkdir 02_fruit
# Images without any reproductive structures
!mkdir 03_null
# Images that don't clearly fit into 01-03; excluded from training but kept for possible use in training future models (ex: maps, illustrations, text, microscopy, etc.)
!mkdir 04_other

print("Next, go to Google Drive and manually sort images into their respective folders. After you're finished sorting, continue with steps below.")
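
`!mkdir` reports an error if a folder already exists, which happens when this cell is re-run. A minimal sketch of an idempotent alternative using `os.makedirs` follows. The helper name `make_sort_folders` is hypothetical, and the demonstration uses a temporary directory rather than Google Drive:

```python
import os
import tempfile

SORT_FOLDERS = ["01_flower", "02_fruit", "03_null", "04_other"]


def make_sort_folders(base_dir, names=SORT_FOLDERS):
    """Create each sorting folder under `base_dir`; safe to call repeatedly."""
    paths = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths


# Toy demonstration: calling the helper twice does not raise
with tempfile.TemporaryDirectory() as tmp:
    make_sort_folders(tmp)
    make_sort_folders(tmp)
    print(sorted(os.listdir(tmp)))  # ['01_flower', '02_fruit', '03_null', '04_other']
```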

### 3) Resume here after manually sorting images
---   

#### Inspect image content within folders

In [None]:
# Inspect the number of images in each folder

print("Number of flower images:")
flow = !ls /content/drive/'My Drive'/summer20/classification/flower_fruit/images/01_flower | wc -l
print(flow)
print("Number of fruit images:")
fru = !ls /content/drive/'My Drive'/summer20/classification/flower_fruit/images/02_fruit | wc -l
print(fru)
print("Number of null images:")
null = !ls /content/drive/'My Drive'/summer20/classification/flower_fruit/images/03_null | wc -l
print(null)
# 04_other is excluded from training, so it is not counted by default.
# To count it, uncomment the lines below (adjust the path if the folder has not been moved yet)
#print("Number of other/excluded images:")
#other = !ls /content/drive/'My Drive'/summer20/classification/flower_fruit/other_sorted_images/04_other | wc -l
#print(other)

# Check which training folder has the smallest number of images
folders = [flow, fru, null]
fnames = ["01_flower", "02_fruit", "03_null"]
num_imgs = [int(x.list[0]) for x in folders]
min_imgs = min(num_imgs)
idx = num_imgs.index(min_imgs)
keepfolder = fnames[idx]
print("The minimum number of images is {} in the folder {}".format(min_imgs, fnames[idx]))
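
The same check can be done without shell commands. This is a minimal sketch, not the notebook's method: the helper names `count_files` and `smallest_class` are hypothetical, and unlike `ls | wc -l` it counts regular files only, ignoring sub-folders.

```python
import os
import tempfile


def count_files(base_dir, folders):
    """Map each folder name to the number of regular files it contains."""
    counts = {}
    for folder in folders:
        path = os.path.join(base_dir, folder)
        counts[folder] = sum(1 for entry in os.scandir(path) if entry.is_file())
    return counts


def smallest_class(counts):
    """Return (folder, count) for the folder with the fewest files."""
    name = min(counts, key=counts.get)
    return name, counts[name]


# Toy demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for folder, n in [("01_flower", 3), ("02_fruit", 1), ("03_null", 2)]:
        os.makedirs(os.path.join(tmp, folder))
        for i in range(n):
            open(os.path.join(tmp, folder, f"img_{i}.jpg"), "w").close()
    counts = count_files(tmp, ["01_flower", "02_fruit", "03_null"])
    print(counts)                  # {'01_flower': 3, '02_fruit': 1, '03_null': 2}
    print(smallest_class(counts))  # ('02_fruit', 1)
```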

In [None]:
# Inspect the families present within each folder
import pandas as pd
import os

# Make lists of all flower, fruit, null images
## Flower
fpath = "01_flower"
files = os.listdir(fpath)
## Make flower training images dataframe
flowers = pd.DataFrame({'imname':files})
flowers["imclass"] = "flower"

## Fruit
fpath = "02_fruit"
files = os.listdir(fpath)
## Make fruit training images dataframe
fruits = pd.DataFrame({'imname':files})
fruits["imclass"] = "fruit"

## Null
fpath = "03_null"
files = os.listdir(fpath)
## Make null training images dataframe
nulls = pd.DataFrame({'imname':files})
nulls["imclass"] = "null"

# Merge flower, fruit, null training images to train_df
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
train_df = pd.concat([flowers, fruits, nulls])
print("Merged training datasets:")
print(train_df)

# Get ancestry info for training images from EOL breakdown image bundle
## TO DO: Change to filename of EOL breakdown image bundle
bundle = "https://editors.eol.org/other_files/bundle_images/files/images_for_Angiosperms_max30imgPerFam_breakdown_000001.txt" #@param {type:"string"}
bundle = pd.read_table(bundle)
## Get filenames from tail end after slash of eolMediaURLs
f = lambda x: x['eolMediaURL'].rsplit('/', 1)[-1]
bundle['imname'] = bundle.apply(f, axis=1)

## Map train_df to EOL bundle using image names as an index
train_df.set_index('imname', inplace=True, drop=True)
bundle.set_index('imname', inplace=True, drop=True)
df = train_df.merge(bundle, left_index=True, right_index=True)
print("Training images with ancestry info:")
df.to_csv("sorted_train_data_bef_even_classes.tsv", sep="\t")
print(df.head())

# Get number of images per family in training image classes
## Split ancestry column
family = df.copy()
family.ancestry = family.ancestry.str.split("|")
family = family.explode('ancestry')
## Keep ancestry entries that contain 'aceae' (family names, ex: Rosaceae)
family = family[family.ancestry.str.contains('aceae', case=False, na=False)]
## Count family occurrences in training image classes
### Fruit
fruit = family[family.imclass=='fruit']
fruit_counts = fruit.ancestry.value_counts().rename_axis('family').reset_index(name='no_occurrences')
print("Fruit family counts:")
print(fruit_counts.head(10))

### Flower
flower = family[family.imclass=='flower']
flower_counts = flower.ancestry.value_counts().rename_axis('family').reset_index(name='no_occurrences')
print("Flower family counts:")
print(flower_counts.head(10))

### Null
### (named null_fam so the `null` image count from the previous cell is not overwritten)
null_fam = family[family.imclass=='null']
null_counts = null_fam.ancestry.value_counts().rename_axis('family').reset_index(name='no_occurrences')
print("Null family counts:")
print(null_counts.head(10))

#fruit_counts.rbind(flower_counts, null_counts)
print(fruit_counts)
#df.to_csv("sorted_train_data_counts_bef_even_classes.tsv", sep="\t")
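
To compare the taxonomic distribution across classes in one table rather than three separate printouts, the per-class counts can be pivoted into a family-by-class matrix. This is a minimal sketch on toy data: the ancestry strings and image names are made up, and only the `imclass` and `ancestry` column names come from the cell above.

```python
import pandas as pd

# Toy stand-in for the merged training dataframe (hypothetical values)
df = pd.DataFrame(
    {
        "imclass": ["flower", "flower", "fruit", "null"],
        "ancestry": [
            "Plantae|Rosales|Rosaceae",
            "Plantae|Fabales|Fabaceae",
            "Plantae|Rosales|Rosaceae",
            "Plantae|Poales|Poaceae",
        ],
    },
    index=["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
)

# Split ancestry into one row per taxon, then keep family names
family = df.assign(ancestry=df["ancestry"].str.split("|")).explode("ancestry")
family = family[family["ancestry"].str.contains("aceae", case=False, na=False)]

# Count images per (family, class) and pivot classes into columns
counts = (
    family.groupby(["ancestry", "imclass"])
    .size()
    .unstack(fill_value=0)
    .rename_axis("family")
)
print(counts)
```

Families that appear in only one class show up as rows with zeros in the other columns, which makes uneven taxonomic coverage easy to spot.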

#### Make number of images per class even

In [None]:
print("The minimum number of images is {} in the folder {}".format(min_imgs, fnames[idx]))
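# Folders to trim are the training folders other than the smallest one
others = [f for f in fnames[:3] if f != fnames[idx]]
print("In the next steps, all but {} images need to be deleted from the folders {} and {}".format(min_imgs, others[0], others[1]))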

In [None]:
# Check that images are already archived
if not os.path.exists("/content/drive/My Drive/summer20/classification/flower_fruit/backup_img_befevenclassnum/flower.zip"):
  print("Complete image datasets need to be backed up and zipped. Uncomment the cp and zip commands below and run this cell. Then proceed to the next step.")
else:
  print("Complete image datasets have already been backed up and zipped. Proceed to the next step.")

# Make copy of all files within 01_flower and 03_null folders
#!cp -r /content/drive/'My Drive'/summer20/classification/flower_fruit/images/01_flower/. /content/drive/'My Drive'/summer20/classification/flower_fruit/backup_img_befevenclassnum/flower
#!cp -r /content/drive/'My Drive'/summer20/classification/flower_fruit/images/03_null/. /content/drive/'My Drive'/summer20/classification/flower_fruit/backup_img_befevenclassnum/null

# Zip copied folders
#!zip -r "/content/drive/My Drive/summer20/classification/flower_fruit/backup_img_befevenclassnum/flower.zip" "/content/drive/My Drive/summer20/classification/flower_fruit/backup_img_befevenclassnum/flower"
#!zip -r "/content/drive/My Drive/summer20/classification/flower_fruit/backup_img_befevenclassnum/null.zip" "/content/drive/My Drive/summer20/classification/flower_fruit/backup_img_befevenclassnum/null"
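
The copy-then-zip backup can also be sketched in Python with `shutil.make_archive`, which zips a folder directly and creates the backup directory if it is missing. The helper name `backup_folder` is hypothetical, and the demonstration runs in a temporary directory rather than on Google Drive:

```python
import os
import shutil
import tempfile


def backup_folder(src_dir, backup_dir, name):
    """Zip the contents of `src_dir` to `<backup_dir>/<name>.zip` and return the archive path."""
    os.makedirs(backup_dir, exist_ok=True)
    # Keep backup_dir outside src_dir so the archive does not try to include itself
    return shutil.make_archive(os.path.join(backup_dir, name), "zip", root_dir=src_dir)


# Toy demonstration with one empty placeholder image
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "01_flower")
    os.makedirs(src)
    open(os.path.join(src, "a.jpg"), "w").close()
    archive = backup_folder(src, os.path.join(tmp, "backup"), "flower")
    print(os.path.basename(archive))  # flower.zip
```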

In [None]:
# Randomly delete all but 843 images from 01_flower and 03_null folders (number of fruit images = 843)
# If your minimum class size differs, replace 844 below with min_imgs + 1

#!find "/content/drive/My Drive/summer20/classification/flower_fruit/images/01_flower" -type f -print0 | sort -zR | tail -zn +844 | xargs -0 rm
#!find "/content/drive/My Drive/summer20/classification/flower_fruit/images/03_null" -type f -print0 | sort -zR | tail -zn +844 | xargs -0 rm
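
The shell pipeline above deletes a different random subset on every run. A minimal sketch of a seeded alternative that first returns the files to delete, so the selection can be inspected before anything is removed. The helper name `pick_files_to_remove` and the seed value are hypothetical:

```python
import os
import random
import tempfile


def pick_files_to_remove(folder, keep, seed=0):
    """Return paths to delete so that `keep` randomly chosen files remain in `folder`."""
    files = sorted(entry.path for entry in os.scandir(folder) if entry.is_file())
    if len(files) <= keep:
        return []  # nothing to trim
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    kept = set(rng.sample(files, keep))
    return [path for path in files if path not in kept]


# Toy demonstration: trim a folder of 5 placeholder files down to 3
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        open(os.path.join(tmp, f"img_{i}.jpg"), "w").close()
    to_remove = pick_files_to_remove(tmp, keep=3, seed=42)
    print(len(to_remove))  # 2
    for path in to_remove:
        os.remove(path)
    print(len(os.listdir(tmp)))  # 3
```

In the notebook, `keep` would be `min_imgs`. As with the shell version, deletion is irreversible, so the backup step should be completed first.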

In [None]:
# Move 04_other folder out of images because it contains images excluded from the training dataset

#!mkdir "/content/drive/My Drive/summer20/classification/flower_fruit/other_sorted_images"
#!mv "/content/drive/My Drive/summer20/classification/flower_fruit/images/04_other" "/content/drive/My Drive/summer20/classification/flower_fruit/other_sorted_images/04_other"