<a href="https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/classification_for_image_tagging/rating/classify_images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run images through image rating classification pipeline
--- 
Classify images as "bad" or "good" quality.  
*Last Updated 27 October 2021* 

Use trained image classification model (Run 18, MobileNet SSD v2) to add tags for image rating (bad, good) to EOL images for predictions above the chosen confidence threshold. (Confidence value and model selected in [inspect_train_results.ipynb](https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/classification_for_image_tagging/rating/inspect_train_results.ipynb)).

In post-processing, keep only "bad" image quality predictions (model accuracy was high for this class) with confidence > 1.5. "Good" image quality predictions are discarded (model accuracy was low for this class). Then, display tagging results on images to verify that behavior is as expected.
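
As a minimal sketch of this post-processing rule, assume the predictions are already in a pandas DataFrame with `tag_rating` and `confidence` columns. The toy values and the `final_tag` column name are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy predictions (illustrative values, not real model output)
preds = pd.DataFrame({
    "tag_rating": ["bad", "bad", "good", "good"],
    "confidence": [2.1, 0.9, 3.0, 1.2],
})
conf_thresh = 1.5

# Keep 'bad' predictions above the threshold; everything else becomes 'NA'
keep = (preds["tag_rating"] == "bad") & (preds["confidence"] > conf_thresh)
preds["final_tag"] = preds["tag_rating"].where(keep, "NA")
print(preds)
```

Only the first row keeps its tag here: the second "bad" prediction falls below the threshold, and both "good" predictions are discarded regardless of confidence.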

***Models were trained in Python 2 and TF 1 in December 2020: MobileNet SSD v2 (Run 18, trained on 'good' and 'bad' classes) was trained for 12 hours to 10 epochs with Batch Size=16, Lr=0.001, Dropout=0.2.***

Notes:     
* Change parameters using the form fields on the right (or wherever you see 'TO DO' in the code)
* To test the notebook, all code can be run without connecting to your Google Drive. Results will be saved to the Colab runtime and cleared at the end of your session.
* We observed controversy among users assigning ratings to "good" images, and consensus for assigning ratings to "bad" images (users were more conflicted on what they like than on what they don't like). Model behavior matched this observation.

## Installs & Imports
---

In [None]:
# (Optional): Mount google drive to import/export files
# Note: Only run this cell if you want to save results
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
# For downloading and displaying images
!pip install pillow
!pip install scipy==1.1.0
from PIL import Image
import cv2
import scipy
from scipy import misc
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# For working with data
import numpy as np
import pandas as pd
import os
from os import path
import csv
import itertools
from scipy.linalg import norm
from scipy import sum, average
# So URL's don't get truncated in display
pd.set_option('display.max_colwidth',1000)
pd.options.display.max_columns = None

# For measuring inference time
import time

# For image classification
import tensorflow as tf
print('\nTensorflow Version: %s' % tf.__version__)

# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

## Generate tags for images
---
Run EOL 20k image bundles through pre-trained image classification models and save results in 4 batches (A-D). 
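
The batching scheme can be sketched as a mapping from the output filename's letter suffix to a row range in the bundle. The helper `batch_bounds` below is hypothetical (the notebook uses its own `set_start_stop` function) and assumes tag filenames end in `_a` through `_d`, with 5,000 images per batch.

```python
def batch_bounds(tags_file, batch_size=5000):
    """Return (start, stop) row indices for a tag file ending in _a, _b, _c or _d."""
    suffix = tags_file.rsplit("_", 1)[-1].lower()
    batch_num = ["a", "b", "c", "d"].index(suffix)  # ValueError for other suffixes
    start = batch_num * batch_size
    return start, start + batch_size

for name in ["rating_tags_tf2_a", "rating_tags_tf2_d"]:
    print(name, batch_bounds(name))
```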

### Prepare classification functions and settings

In [None]:
# Set working directory
# TO DO: Type in the path to your working directory in form field to right
wd = "/content/drive/MyDrive/train/" #@param {type:"string"}

# Make folder for image tags within base wd
cwd = wd + 'results/'
if not os.path.exists(cwd):
    os.makedirs(cwd)

# Define functions

# To read in EOL formatted data files
def read_datafile(fpath, sep="\t", header=0, disp_head=True, lineterminator='\n', encoding='latin1'):
    """
    Defaults to tab-separated data files with header in row 0
    """
    try:
        df = pd.read_csv(fpath, sep=sep, header=header, lineterminator=lineterminator, encoding=encoding)
        if disp_head:
            print("Data header: \n", df.head())
    except FileNotFoundError as e:
        raise Exception("File not found: Enter the path to your file in form field and re-run").with_traceback(e.__traceback__)
    
    return df

# Define start and stop indices in EOL bundle for running inference   
def set_start_stop():
    # To test with a tiny subset, use 5 random bundle images
    if test_with_tiny_subset:
        start=np.random.choice(a=1000, size=1)[0]
        stop=start+5
    # To run inference on 4 batches of 5k images each
    # (batch is chosen from the suffix of the global output path, outpath)
    elif "_a." in outpath: # batch a is from 0-5000
        start=0
        stop=5000
    elif "_b." in outpath: # batch b is from 5000-10000
        start=5000
        stop=10000
    elif "_c." in outpath: # batch c is from 10000-15000
        start=10000
        stop=15000
    elif "_d." in outpath: # batch d is from 15000-20000
        start=15000
        stop=20000
    else:
        raise ValueError("Output filename must end in _a, _b, _c or _d to select a batch")
    print("Running inference on images {} to {}".format(start, stop))

    return start, stop

# Load in image from URL
# Modified from https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/saved_model.ipynb#scrollTo=JhVecdzJTsKE
def image_from_url(url, fn):
    file = tf.keras.utils.get_file(fn, url) # Filename doesn't matter
    disp_img = tf.keras.preprocessing.image.load_img(file)
    image = tf.keras.preprocessing.image.load_img(file, target_size=[pixels, pixels])
    image = tf.keras.preprocessing.image.img_to_array(image)
    image = tf.keras.applications.mobilenet_v2.preprocess_input(
        image[tf.newaxis,...])

    return image, disp_img

# Get info about trained classification model
def get_model_info(use_EOL_model):
    # Use EOL pre-trained model
    if use_EOL_model:
        # Model metadata
        module_selection =('mobilenet_v2_1.0_224', 224)
        dataset_labels = ['bad', 'good']
        TRAIN_SESS_NUM = '18'
        saved_models_dir = 'saved_models/'
        # If running for the first time, download model
        if not os.path.exists('saved_models/'):
            # Make folder for trained model
            os.makedirs(saved_models_dir)
            %cd $saved_models_dir
            os.makedirs(TRAIN_SESS_NUM)
            # Download saved model files for Run 18 - MobileNet SSD v2
            !gdown --id 1L-WqfuoQtPgqJzU8tDKjgsZC98M-68w9 # 18.zip 404 Mb
            !unzip 18.zip -d .
            !mv -v content/drive/MyDrive/summer20/classification/rating/saved_models/18/* 18
            !rm -r content
            !rm -r 18.zip
            %cd ../
            print("Successfully downloaded pre-trained EOL model to ", (saved_models_dir + '/' + TRAIN_SESS_NUM))
    
    # Use your own trained model
    else:
        # TO DO: Change values to match your trained model
        module_selection = ("inception_v3", 299) #@param ["(\"mobilenet_v2_1.0_224\", 224)", "(\"inception_v3\", 299)"] {type:"raw", allow-input: true}
        dataset_labels = ['bad', 'good'] #@param
        saved_models_dir = "train/saved_models/" #@param {type:"string"}
        TRAIN_SESS_NUM = "18" #@param

    return module_selection, dataset_labels, saved_models_dir, TRAIN_SESS_NUM

# Load saved model from directory
def load_saved_model(saved_models_dir, TRAIN_SESS_NUM, module_selection):
    # Load trained model from path
    saved_model_path = saved_models_dir + TRAIN_SESS_NUM
    model = tf.keras.models.load_model(saved_model_path)
    # Get name and image size for model type
    handle_base, pixels = module_selection

    return model, pixels, handle_base

# Get info from predictions to display on images
def get_predict_info(predictions, url, i, stop, start):
    # Get info from predictions
    label_num = np.argmax(predictions[0], axis=-1)
    conf = predictions[0][label_num]
    im_class = dataset_labels[label_num]
    # Display progress message after each image
    print("Completed for {}, {} of {} files".format(url, i+1, format(stop-start, '.0f')))
    
    return label_num, conf, im_class

# Set filename for saving classification results
def set_outpath(tags_file):
    outpath = wd + 'results/' + tags_file + '.tsv'
    print("Saving results to: \n", outpath)

    return outpath

# Export results
def export_results(df, url, det_imclass, conf):
    # Define variables for export
    if 'ancestry' in df.columns:
        ancestry = df['ancestry'][i]
    else:
        ancestry = "NA"
    identifier = df['identifier'][i]
    dataObjectVersionID = df['dataObjectVersionID'][i] 
    # Write row with results for each image
    results = [url, identifier, dataObjectVersionID, ancestry, det_imclass, conf]
    with open(outpath, 'a') as out_file:
        tsv_writer = csv.writer(out_file, delimiter='\t')
        tsv_writer.writerow(results)
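
The prediction lookup in `get_predict_info` can be tried in isolation. This is a minimal sketch that assumes the model returns one row of per-class scores shaped `(1, num_classes)`; the score values below are made up and do not come from the trained model.

```python
import numpy as np

dataset_labels = ["bad", "good"]
# Hypothetical output for one image, shaped (1, num_classes)
predictions = np.array([[2.3, -0.4]])

label_num = int(np.argmax(predictions[0], axis=-1))  # index of the top score
conf = float(predictions[0][label_num])              # score of the top class
im_class = dataset_labels[label_num]
print(im_class, conf)
```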

In [None]:
# Set current working directory
%cd $cwd

# Read in EOL image bundle dataframe
# TO DO: Choose image bundle address using form field to right
bundle = "https://editors.eol.org/other_files/bundle_images/files/images_for_Squamata_20K_breakdown_000001.txt" #@param ["https://editors.eol.org/other_files/bundle_images/files/images_for_Squamata_20K_breakdown_000001.txt", "https://editors.eol.org/other_files/bundle_images/files/images_for_Coleoptera_20K_breakdown_000001.txt", "https://editors.eol.org/other_files/bundle_images/files/images_for_Anura_20K_breakdown_000001.txt", "https://editors.eol.org/other_files/bundle_images/files/images_for_Carnivora_20K_breakdown_000001.txt"] {allow-input: true}
df = read_datafile(bundle, sep='\t', header=0, disp_head=False)

# Use EOL pre-trained model for image classification?
# TO DO: Check use_EOL_model if "Yes"
use_EOL_model = True #@param {type: "boolean"}

# Load saved model
module_selection, dataset_labels, saved_models_dir, TRAIN_SESS_NUM = get_model_info(use_EOL_model)
model, pixels, handle_base = load_saved_model(saved_models_dir, TRAIN_SESS_NUM, module_selection)

# Set filepath for output tagging file
# TO DO: Change file name for each bundle/run
tags_file = "rating_tags_tf2_d" #@param ["rating_tags_tf2_a", "rating_tags_tf2_b", "rating_tags_tf2_c", "rating_tags_tf2_d"] {allow-input: true}
outpath = set_outpath(tags_file)

# Write header row of tagging file
if not os.path.isfile(outpath): 
    with open(outpath, 'a') as out_file:
        tsv_writer = csv.writer(out_file, delimiter='\t')
        tsv_writer.writerow(["eolMediaURL", "identifier",
                             "dataObjectVersionID", "ancestry",
                             "tag_rating", "confidence"])

### Run images through model for image rating classification 
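
Each classified image is written to the tagging file as one appended row, with the header written only when the file does not exist yet. Below is a minimal, self-contained sketch of that append pattern. The `append_row` helper, the temporary path, and the example URLs are hypothetical and not part of the notebook.

```python
import csv
import os
import tempfile

def append_row(path, row, header):
    """Append one row to a TSV, writing the header first if the file is new."""
    is_new = not os.path.isfile(path)
    with open(path, "a", newline="") as out_file:
        writer = csv.writer(out_file, delimiter="\t")
        if is_new:
            writer.writerow(header)
        writer.writerow(row)

header = ["eolMediaURL", "tag_rating", "confidence"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tags.tsv")
append_row(demo_path, ["https://example.org/1.jpg", "bad", 2.1], header)
append_row(demo_path, ["https://example.org/2.jpg", "good", 0.7], header)

# Read the file back to confirm the header appears only once
with open(demo_path, newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
print(rows)
```

Appending row by row means an interrupted run keeps the results written so far, at the cost of reopening the file for every image.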

In [None]:
# Run inference

# Test with tiny subset (5 images)?
# TO DO: If yes, check test_with_tiny_subset box
test_with_tiny_subset = True #@param {type: "boolean"}

# Run EOL bundle images through classifier to add rating tags
start, stop = set_start_stop()
for n, (i, row) in enumerate(df.iloc[start:stop].iterrows()):
    try:
        # Read in image from url
        # i is the row index in the full bundle; n counts images within this batch
        url = row['eolMediaURL']
        fn = str(i) + '.jpg'
        img, disp_img = image_from_url(url, fn)

        # Image classification
        start_time = time.time() # Record inference time
        predictions = model.predict(img, batch_size=1)
        label_num, conf, det_imclass = get_predict_info(predictions, url, n, stop, start)
        end_time = time.time()
        print("Inference time: {} sec".format(format(end_time-start_time, '.2f')))

        # Export tagging results to tsv (export_results reads the global index i)
        export_results(df, url, det_imclass, conf)

    except Exception as e:
        print('Check if URL from {} is valid ({})'.format(url, e))
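
When looping over a slice of a bundle, the position within the slice and the row label in the full DataFrame differ whenever `start` is greater than 0. The sketch below uses a made-up five-row DataFrame (`df_demo` is hypothetical) to show both values side by side.

```python
import pandas as pd

df_demo = pd.DataFrame({"eolMediaURL": ["u0", "u1", "u2", "u3", "u4"]})
start, stop = 2, 4

labels = []
for n, (idx, row) in enumerate(df_demo.iloc[start:stop].iterrows()):
    # n counts from 0 within the slice; idx is the row's label in the full DataFrame
    labels.append((n, idx, row["eolMediaURL"]))
print(labels)
```

Indexing the full DataFrame with the slice position `n` instead of the label `idx` would read rows 0 and 1 here rather than rows 2 and 3.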

## Post-process classification results
---
A confidence threshold (>1.5) for 'bad' predictions from MobileNet SSD v2 was chosen in inspect_train_results.ipynb to minimize false detections and maximize dataset coverage. All 'good' predictions and any 'bad' predictions at or below the confidence threshold are discarded.

In [None]:
# TO DO: Input and adjust classification confidence thresholds
conf_thresh = 1.5 #@param 

# Combine 4 batches of classification tags
fpath =  os.path.splitext(tags_file)[0] # Get name of one tag file
base = cwd + fpath.rsplit('_',1)[0] + '_' # Remove lettered suffix to get basename
exts = ['a.tsv', 'b.tsv', 'c.tsv', 'd.tsv']
all_filenames = [base + e for e in exts] # List all tag filenames
df = pd.concat([pd.read_csv(f, sep='\t', header=0, na_filter = False) for f in all_filenames], ignore_index=True)
df[['confidence']] = df[['confidence']].apply(pd.to_numeric)

# Summarize combined results
print("Model predictions for Training Attempt {}, {}:".format(TRAIN_SESS_NUM, handle_base))
print("No. Images: {}\n{}".format(len(df), df[['eolMediaURL', 'tag_rating', 'confidence']].head()))

# Discard all predictions for 'good' or below confidence threshold
# (Final tag to keep -> predictions for 'bad' with confidence > 1.5) 
idx_tokeep = df.index[(df.tag_rating == 'bad') & (df.confidence > conf_thresh)]
idx_todiscard = df.index.difference(idx_tokeep)
df.loc[idx_todiscard, 'tag_rating'] = 'NA'

# Write results to tsv
print("\nNew concatenated dataframe with all 4 batches: \n", df[['eolMediaURL', 'tag_rating', 'confidence']].head())
outpath = base + 'finaltags.tsv'
df.to_csv(outpath, sep='\t', index=False)
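
The batch-combining step can be tried on in-memory data. This sketch assumes two tiny made-up batches (`batch_a` and `batch_b` are hypothetical strings standing in for the TSV files) and shows that `na_filter=False` keeps literal "NA" ancestry values as strings rather than converting them to missing values.

```python
import io
import pandas as pd

# Two made-up batches in the same tab-separated layout as the tag files
batch_a = "eolMediaURL\tancestry\ttag_rating\tconfidence\nu1\tNA\tbad\t2.1\n"
batch_b = "eolMediaURL\tancestry\ttag_rating\tconfidence\nu2\tNA\tgood\t0.8\n"

frames = [pd.read_csv(io.StringIO(text), sep="\t", header=0, na_filter=False)
          for text in (batch_a, batch_b)]
combined = pd.concat(frames, ignore_index=True)  # renumber rows from 0
combined["confidence"] = pd.to_numeric(combined["confidence"])
print(combined)
```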

## Display classification results on images
---

In [None]:
# Set number of seconds to timeout if image url taking too long to open
import socket
socket.setdefaulttimeout(10)

# TO DO: Adjust start index and display 50 images with tags
start = 0 #@param {type:"slider", min:0, max:5000, step:50}
stop = start+50

# Loop through EOL image bundle to classify images and generate tags
for i, row in df.iloc[start:stop].iterrows():
    try:
        # Read in image from url
        url = df['eolMediaURL'][i]
        fn = str(i) + '.jpg'
        img, disp_img = image_from_url(url, fn)
    
        # Get quality rating tag
        tag = df['tag_rating'][i]
    
        # Display progress message after each image is loaded
        print('Successfully loaded {} of {} images'.format(i-start+1, (stop-start)))

        # Show classification results for images
        # Only use to view predictions on <50 images at a time
        _, ax = plt.subplots(figsize=(10, 10))
        ax.imshow(disp_img)
        plt.axis('off')
        plt.title("{}) Image quality rating: {} ".format(i+1, tag))

    except Exception as e:
        print('Check if URL from {} is valid ({})'.format(url, e))