# Identifying Edible Plants
## w251 Final Project
## Scott Xu, Divya Babu, Aaron Olson

The intent of this final project is to develop an image recognition program that can accurately identify edible and/or poisonous plants in the wild. Several apps and other programs have attempted this; however, all of them rely on a remote server connection to upload the image and run it through the model. 

This paper explores different modeling options in order to arrive at the best performing model. We then work to reduce the model size so that it fits on an edge device for real-time identification. 

In order to get a baseline model for image recognition, we used transfer learning, applying the model weights and architecture of ResNet50. ResNet50 was chosen for its performance as well as its size. Training from scratch on the volume of images, and for the duration, that ResNet50 was trained on would not be reasonable for this project; we therefore use this pre-trained model as our starting point. On top of it we explore different model architectures in order to determine which performs best.

To get the best performing model we needed to balance model performance against edge device performance. In the case of poisonous plants the consequences of a bad prediction can be severe; however, an app that takes 60 min to make a prediction has little practical use. Therefore, at the end of this notebook we examine the trade-off between training the model on a virtual machine and running inference on the edge device (time to predict vs accuracy). 

This paper can be broken down into the following sections: 
- Exploratory Data Analysis & Understanding of the Training Dataset
- Image Augmentation - impacts of different augmentation techniques and best performing augmentation
- Discussion of Base Model choice
- Model Architecture
- Image Classification on VM - Model Performance
- Binary Classification on VM - Model Performance
- Model Transfer to Edge Device and discussion of performance vs inference time & resource constraints
- Conclusion

We begin by examining the training dataset:

### Training Dataset

Our training datasets were downloaded from Kaggle [https://www.kaggle.com/gverzea/edible-wild-plants, https://www.kaggle.com/nitron/poisonous-plants-images]. The datasets comprise:
- Total of 6962 pictures:
    - 6552 of these pictures are of edible wild plants
    - 410 pictures are of poisonous plants. 
- There are:
    - 62 categories of edible plants
    - 8 categories of poisonous plants
    
From this data it is evident that our dataset is skewed (more edible pictures and categories than poisonous). Additionally, the dataset covers only a rather small subset of wild plants. We will treat this dataset as our primary dataset for training purposes. We do have two other datasets that can be utilized for further training on a larger array of edible plant types. The average number of images per category is 99; however, the plot below shows that most categories hold closer to 50 images, with a few categories having a much larger number of images.  

<table><tr>
    <td> <img src="CountbyCateogry.jpg" alt="Drawing" style="width: 750px;"/> </td>
</tr></table>
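
The gap between the overall average and what most categories contain can be checked from the counts. The sketch below uses the totals reported above; the per-category list `example_counts` is made up purely to illustrate how a few large categories pull the mean above the median, and it is not the real distribution.

```python
from statistics import mean, median

# Totals reported for the two Kaggle datasets
edible_images, poisonous_images = 6552, 410
edible_categories, poisonous_categories = 62, 8

total_images = edible_images + poisonous_images
total_categories = edible_categories + poisonous_categories
print(f"images per category (mean): {total_images / total_categories:.1f}")
print(f"edible : poisonous image ratio: {edible_images / poisonous_images:.1f} : 1")

# Hypothetical per-category counts (NOT the real data): most categories are
# small and a few are very large, so the mean sits well above the median.
example_counts = [50, 45, 55, 48, 52, 50, 400, 350]
print(f"example mean:   {mean(example_counts):.1f}")
print(f"example median: {median(example_counts):.1f}")
```

With the `df` built in the code cell of this section, `df['Count'].mean()` and `df['Count'].median()` would give the real values.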

With roughly 50 images in a typical class, there should be some variability in image orientation, quality, etc. This variety benefits the training process, and we can increase it further by utilizing image augmentation. 

In addition to increasing image variety, augmentation also helps to increase the training dataset size. Because we have a limited dataset (considering we will be training Dense, CNN or Residual layers), we can utilize augmentation to increase both the variety and the total number of images in order to improve our training process. Furthermore, in the latter portion of our analysis, we attempt to predict poisonous or edible (rather than plant category). Because the training set is imbalanced for this binary label problem, we increase the image count of the poisonous class by utilizing image augmentation, explained in the Image Augmentation section. 
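
As a rough check on how much augmentation the binary problem needs, the calculation below assumes that "balanced" means equal image counts for the two labels. The counts are the ones reported above; the variable names are illustrative.

```python
# Image counts reported for the two datasets
edible_images = 6552
poisonous_images = 410

# Augmented poisonous images needed to match the edible count
augmented_needed = edible_images - poisonous_images

# Average number of augmented copies per original poisonous image
copies_per_original = augmented_needed / poisonous_images

print(f"augmented poisonous images needed: {augmented_needed}")
print(f"augmented copies per original image: {copies_per_original:.1f}")
```

Each original poisonous image is therefore reused about 15 times, so the balanced set has the same image count as the edible set but less underlying variety.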

Depicted below are a few sample images taken from the datasets. 

<table><tr>
    <td> <img src="How-To-Grow-Rhubarb.jpg" alt="Drawing" style="width: 250px;"/> </td>
    <td> <img src="Chicory20.jpg" alt="Drawing" style="width: 350px;"/> </td>
    <td> <img src="asparagus8.jpg" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize'] = [10, 5]
# Root folders of the two datasets; each subfolder holds one plant category
DIR = ['C:\\Users\\AOlson\\Documents\\UC Berkeley MIDS\\w251_Scaling\\Final Project\\Data\\EdibleWildPlants\\datasets\\dataset',
       'C:\\Users\\AOlson\\Documents\\UC Berkeley MIDS\\w251_Scaling\\Final Project\\Data\\PoisonousPlants']

# Count the images in every category folder
category = []
image_length = []
for d in DIR:
    for name in os.listdir(d):
        category.append(name)
        # os.path.join builds the path with the correct separator on any OS
        image_length.append(len(os.listdir(os.path.join(d, name))))
df = pd.DataFrame(list(zip(category, image_length)), columns=['Category', 'Count'])
# Bar chart of the number of images per category
df.plot.bar(x='Category', y='Count', rot=90, title='Count of Image by Category')

### ResNet Model Choice

With the problem of image classification we understood that utilizing transfer learning would speed up the training process. We chose to utilize the ResNet50 model due to its: 
- Accuracy proven on the ImageNet dataset
- Overall size (50 layers)

The ResNet50 model was trained on the ImageNet dataset, a large dataset with classes for common everyday items (broccoli, horse, etc). Reviewing the categories the model has already been trained on, none match the classes in our training dataset; however, it has been trained on many plant and food related items. 

ResNet is a model built upon the residual layer structure. It has been noted in the literature that deeper networks can have lower accuracy than their shallower counterparts. This is essentially because it can be hard for a stack of layers to learn the identity mapping y = x once training has saturated. A residual layer has an implementation similar to y = F(x) + x, where the function F(x) can reduce to 0, so the layer passes its input through unchanged and bypasses the degradation problem. 

<table><tr>
    <td> <img src="ResidualLayer.jpg" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>

Pictured above is a single residual layer, which shows how the input can bypass the learned weights and be added directly to the output. This type of layer has shown high accuracy in image classification problems. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
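
The relationship y = F(x) + x can be illustrated with a small sketch in plain Python. It is not the ResNet50 implementation: F is a hypothetical two-layer function built from small weight matrices, and the ReLU that ResNet applies after the addition is left out so the identity behaviour is easy to see.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def residual_block(x, w1, w2):
    """Return y = F(x) + x, where F(x) = w2 . relu(w1 . x)."""
    fx = matvec(w2, relu(matvec(w1, x)))
    return [f + xi for f, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, F(x) = 0 and the block is an identity mapping.
y = residual_block(x, zeros, zeros)
print(y)
```

Because the skip connection carries x forward unchanged, the block only has to learn a correction to the identity rather than the identity itself.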


In addition to a well-trained model, we also needed to control the size and the inference time of the model. The intent of this paper is to train in a data center (where the volume of data and computing power can be large) and then bring the model to an edge device for inference. We therefore need to ensure the model is small enough that its memory requirements won't exhaust the hardware of an edge device, that it fits on a limited storage device, and that it provides image inference in a reasonable time. Because of this constraint we chose ResNet50 as our base model rather than a deeper variant such as ResNet101 or ResNet152. 
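
A back-of-the-envelope estimate helps show why model size matters on an edge device. The sketch below counts the weights only; activations, framework overhead and the added layers come on top. The parameter count is an approximate, assumed figure for ResNet50 without its classification head, and `model_size_mb` is a helper introduced here for illustration.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

# Approximate parameter count for ResNet50 without its classification head
# (assumed value; model.count_params() gives the exact figure).
RESNET50_BASE_PARAMS = 23_600_000

fp32_mb = model_size_mb(RESNET50_BASE_PARAMS, bytes_per_param=4)
fp16_mb = model_size_mb(RESNET50_BASE_PARAMS, bytes_per_param=2)
print(f"float32 weights: ~{fp32_mb:.0f} MB")
print(f"float16 weights: ~{fp16_mb:.0f} MB")
```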

In [None]:
# Load ResNet50 baseline model
from keras.applications.resnet50 import ResNet50, preprocess_input

HEIGHT = 300
WIDTH = 300

base_model = ResNet50(weights='imagenet', 
                      include_top=False, 
                      input_shape=(HEIGHT, WIDTH, 3))

# Write augmented images to disk to balance the binary (poisonous/edible) dataset
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
      preprocessing_function=preprocess_input,
      rotation_range=90,
      horizontal_flip=True,
      vertical_flip=True,
      height_shift_range=0.5,
      width_shift_range=0.5
    )
# batch_size=1 so that each step writes exactly one augmented image to save_to_dir
generator = train_datagen.flow_from_directory('../binary_duplicate',
                                              target_size=(HEIGHT, WIDTH),
                                              class_mode='binary',
                                              batch_size=1,
                                              save_to_dir='../binary_duplicate/1',
                                              save_prefix='augment_',
                                              save_format='jpg')
# 6142 = 6552 edible images - 410 poisonous images
NUM_AUGMENTED = 6142
for _ in range(NUM_AUGMENTED):
    next(generator)

### Image Augmentation

With the baseline model established, we expect a difference between the cleanliness of the images in the training dataset and that of the images taken in the field when a user wants to identify a plant. We therefore utilize image augmentation to achieve two tasks: (1) vary the image quality, orientation, etc. in order to make the model more versatile, and (2) create more training images with which to train the model. 
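
To make the geometric transforms concrete, the sketch below applies a horizontal flip, a vertical flip and a 90 degree rotation to a tiny grid of numbers standing in for an image. It is only an illustration in plain Python: Keras applies random rotations and shifts with interpolation, which this example does not attempt to reproduce, and the helper names are made up for the example.

```python
def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def vertical_flip(img):
    """Reverse the order of the rows (top to bottom)."""
    return img[::-1]

def rotate_90_clockwise(img):
    """Rotate the grid 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

# A 2x3 "image" with distinct pixel values so each transform is easy to follow
image = [[1, 2, 3],
         [4, 5, 6]]

for name, fn in [("horizontal flip", horizontal_flip),
                 ("vertical flip", vertical_flip),
                 ("rotate 90", rotate_90_clockwise)]:
    print(name, fn(image))
```

Each transform keeps the plant in the picture recognisable while changing where its pixels sit, which is what lets one photograph stand in for several viewing angles.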

Below we explore different image augmentation settings and their effect on model performance: 

In [None]:
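# Image augmentation

from keras.preprocessing.image import ImageDataGenerator
import os
from glob import glob

TRAIN_DIR = "train_edible"
# Collect every training image; the parent folder name is the class label
result = [y for x in os.walk(TRAIN_DIR) for y in glob(os.path.join(x[0], '*.jpg'))]
# sorted() gives a stable class order; os.path handles both "/" and "\\" separators
classes = sorted(set(os.path.basename(os.path.dirname(y)) for y in result))
HEIGHT = 300
WIDTH = 300
# 64 matches the batch size used for training and for steps_per_epoch below
BATCH_SIZE = 64

# Random rotations, flips and shifts are applied on the fly to each batch
train_datagen =  ImageDataGenerator(
      preprocessing_function=preprocess_input,
      rotation_range=90,
      horizontal_flip=True,
      vertical_flip=True,
      height_shift_range=0.5,
      width_shift_range=0.5
    )

train_generator = train_datagen.flow_from_directory(TRAIN_DIR, 
                                                    target_size=(HEIGHT, WIDTH), 
                                                    batch_size=BATCH_SIZE)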

### Model Architecture - Predicting Plant Class

Building on the baseline model, we explored adding different layers (size, type, etc) in order to produce the 'best' performing model. See the discussions above and later in this paper for how we define the 'best' model. We explored two tasks: predicting the class of the plant in an image, which would give users more information about the specific plant they have photographed, and predicting whether a plant is poisonous or not without the context of the exact plant type. 

We first kept all the base model layers static (no change to the model weights) and trained only the additional layers. We explored the following layer architectures: 
- Dense layers (different nodes and depths)
- CNN with pooling layers given the image classification problem
- Residual layer based on our chosen base model architecture

In order to narrow model choice we chose to train for 100 epochs on 15,000 images with a batch size of 64. The code below was used during model training, both for image classification and for binary classification (poison yes/no). After the code we detail the performance of the varying model architectures.

In [None]:
# Define model architecture on top of base model
from keras.layers import Dense, Activation, Flatten, Dropout
from keras.models import Sequential, Model

def build_finetune_model(base_model, dropout, fc_layers, num_classes):
    # Prevent the base model (ResNet50) from training
    for layer in base_model.layers:
        layer.trainable = False

    x = base_model.output

    # Optional residual layers (left commented out): five identical blocks
    # applied to the base model output. Uncommenting them also requires
    # `from keras.layers import Conv2D, add`.
    # for _ in range(5):
    #     conv1 = Conv2D(64, (3,3), padding = 'same', activation = 'relu', kernel_initializer='he_normal')(x)
    #     conv2 = Conv2D(2048,(3,3), padding = 'same', activation = 'linear', kernel_initializer='he_normal')(conv1)
    #     layer_out = add([conv2, x])
    #     x = Activation('relu')(layer_out)

    x = Flatten()(x)
    # Cycle through the FC_LAYERS list (defined below) and add dense layers on top of the base ResNet model
    for fc in fc_layers:
        # Can look here if adding different types of layers has an effect
        # Also explore differences in changing activation function
        # Can also iterate on dropout amount
        x = Dense(fc, activation='relu')(x) 
        x = Dropout(dropout)(x)

    # New softmax layer
    predictions = Dense(num_classes, activation='softmax')(x) 
    
    finetune_model = Model(inputs=base_model.input, outputs=predictions)

    return finetune_model

# Can change the model architecture here
FC_LAYERS = [128,64,32,16]
dropout = 0.3

finetune_model = build_finetune_model(base_model, 
                                      dropout=dropout, 
                                      fc_layers=FC_LAYERS, 
                                      num_classes=len(classes))
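
To get a feel for how large the added layers are, the sketch below counts the weights and biases of the dense head. It rests on two assumptions that should be checked against the real model: that a 300x300 input produces a 10x10x2048 feature map (see `base_model.output_shape`), and that there are 62 classes (the edible categories). `dense_head_params` is a helper written for this illustration.

```python
def dense_head_params(input_features, fc_layers, num_classes):
    """Count weights and biases in a stack of Dense layers plus the output layer."""
    total = 0
    previous = input_features
    for units in list(fc_layers) + [num_classes]:
        total += previous * units + units   # weight matrix + bias vector
        previous = units
    return total

# Assumption: Flatten on a 10x10x2048 feature map gives 204,800 features
flattened = 10 * 10 * 2048

# Assumption: 62 classes; replace with len(classes) for the real value
params = dense_head_params(flattened, [128, 64, 32, 16], num_classes=62)
print(f"trainable head parameters: {params:,}")
```

Under these assumptions almost all of the head's parameters sit in the first Dense layer, because Flatten keeps every spatial position of the feature map. Dropout layers add no parameters.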

In [None]:
# Define metrics (in addition to accuracy) in order to track
from keras import backend as K

def recall(y_true, y_pred):
    # Share of actual positives that were predicted positive
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def precision(y_true, y_pred):
    # Share of predicted positives that are actually positive
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1(y_true, y_pred):
    # Harmonic mean of precision and recall
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2*((prec*rec)/(prec+rec+K.epsilon()))

In [None]:
# Train model and save best performing model as defined by highest accuracy
from keras.optimizers import SGD, Adam
from keras.callbacks import ModelCheckpoint
import tensorflow as tf
import datetime
import numpy

# For the baseline model will use 100 epochs and 15000 images to test model performance
# Will then use 'optimized' model parameters to train for longer time and explore
# Size vs Accuracy for edge compute purposes
NUM_EPOCHS = 100
BATCH_SIZE = 64
num_train_images = 15000

adam = Adam(lr=0.00001)
# Can look into whether 
finetune_model.compile(adam, loss='categorical_crossentropy', metrics=['accuracy', recall, precision, f1])
# The checkpoint file is overwritten whenever training accuracy improves - can look at the
# commented lines below where datetime is used to create time based file names
filepath = "checkpoint/" + "edible_default" + "_model_weights.h5"
# monitor takes a single metric name (a string); save_best_only keeps the highest-accuracy weights
checkpoint = ModelCheckpoint(filepath, monitor="acc", verbose=1, save_best_only=True, mode='max')
# log_dir = "C:\\Users\\AOlson\\Documents\\UC Berkeley MIDS\\w251_Scaling\\Final Project\\Data\\log_dir\\" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
# tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
callbacks_list = [checkpoint]

history = finetune_model.fit_generator(train_generator, epochs=NUM_EPOCHS, workers=8, 
                                       steps_per_epoch=num_train_images // BATCH_SIZE, 
                                       shuffle=True, callbacks=callbacks_list)
# Save training history (loss and accuracy per epoch)
numpy.savetxt('edible_loss_history.txt', numpy.array(history.history['loss']), delimiter = ',')
numpy.savetxt('edible_acc_history.txt', numpy.array(history.history['acc']), delimiter = ',')
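
The training settings above translate into a fixed number of weight updates. The short calculation below spells this out, assuming every batch drawn from the generator holds `BATCH_SIZE` images.

```python
NUM_EPOCHS = 100
BATCH_SIZE = 64
num_train_images = 15000

# Integer division drops the remainder, so slightly fewer than 15,000 images are seen per epoch
steps_per_epoch = num_train_images // BATCH_SIZE
images_per_epoch = steps_per_epoch * BATCH_SIZE
total_updates = steps_per_epoch * NUM_EPOCHS

print(steps_per_epoch, images_per_epoch, total_updates)
```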

### Image Classification Model Analysis

### Binary Poison Detection Model Analysis

While plant identification via image classification provides richer context about the plant (name, species, description, etc), we have seen that the model cannot predict with high accuracy (>90%) whether or not a plant is poisonous. From a user's perspective, this level of confidence likely doesn't justify using the model, as the consequences of a wrong prediction are severe. 

In addition to the image classification problem described above, we also explored whether features contained in the images of poisonous and edible plants could be extracted in order to predict whether or not a plant is poisonous (without regard to the specific species or genus of the plant). We again started with ResNet50, as it provided a good starting point for both the model architecture and the image classification problem. 

Similar to image classification - we attempted different model architectures in order to obtain the best performing model. In the first iteration of model performance we examined:
- Training on an imbalanced training set (remember we have >6k images of edible plants and only 410 of poisonous plants)
- Using image augmentation on the poisonous plants dataset in order to balance the training dataset
- Different model depths (baseline model contains layers with [128,64,32,16] nodes). Expanded to six different layers containing up to 1024 nodes. All layers were Dense in the original round of training
- Changing dropout rate - generally performed worse (perhaps needed longer training time) - performance not shown below
- Different activation functions (specifically tanh) 
- Training the entire model (including the ResNet50 base model)
- Different layer architecture (Dense, CNN, Pooling, Residual)

Depicted below is the performance of these different training episodes:
<table><tr>
    <td> <img src="Binary_Accuracy.jpg" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>

In the legend, equal indicates that the training set is balanced, tanh refers to a different activation function (default was relu), and trainentiremodel refers to un-freezing the base ResNet50 model for training. As can be seen, while certain model architecture changes did impact performance in the earlier training epochs, the best performing model by far was the one that trained all layers (including ResNet50) with additional dense layers on top. Presumably this shows that adapting the residual layers to our images improves accuracy on the image classification/detection problem over and above training a series of dense layers alone. 

Surprisingly, training the entire model did not have a large effect on training time. For 100 epochs training only the dense layers, we averaged 24 sec per epoch on a P100 GPU with a batch size of 64. When training the entire model, our average epoch time was 32 sec. 
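
Put into totals, the two epoch times compare as follows. This is simple arithmetic on the averages reported above, and it assumes the per-epoch time stays constant over the 100 epochs.

```python
EPOCHS = 100
head_only_sec = 24     # average seconds per epoch, dense layers only
full_model_sec = 32    # average seconds per epoch, all layers trainable

head_only_min = EPOCHS * head_only_sec / 60
full_model_min = EPOCHS * full_model_sec / 60
relative_increase = (full_model_sec - head_only_sec) / head_only_sec

print(f"head only:  {head_only_min:.1f} min")
print(f"full model: {full_model_min:.1f} min")
print(f"increase:   {relative_increase:.0%}")
```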

As for the other variations, equalizing the dataset gave little performance boost. The blue and green lines show approximately similar performance across the training epochs. This is likely due to the image augmentation that's already taking place during training in order to increase our total image count to 15,000. Increasing the depth of the model didn't show significant performance improvement either. This may be because the overall change in architecture was small, moving from a layer configuration of [128,64,32,16] to [512,256,128,64,32,16]. It may also be due to the aforementioned degradation problem with deeper networks. 

In order to explore different layer architectures - we also varied the types of layers as depicted below. Dense, CNN and Residual layers were all tested. During the early training periods we noticed that these different types of architectures likely required longer training periods to infer the appropriate weights. We therefore expanded our analysis to 500 epochs. 

***INSERT DISCUSSION REGARDING DIFFERENT LAYER ARCHITECTURES

As previously mentioned, accuracy isn't the only metric we need to track for the binary detection problem. While an accuracy of 100% would be phenomenal, it isn't likely. It is, however, very important that the model correctly flags a plant as poisonous whenever it is poisonous. This is because the consequences of predicting a plant is poisonous when it isn't (false positive) are much less severe than those of predicting a plant is edible when it's poisonous (false negative). We therefore also track recall, precision and F1 scores. Of these, recall is likely the most important, since it measures true positives against the number of poisonous plants in the dataset. 
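
A small worked example shows why accuracy alone can mislead on this dataset. The sketch below treats "poisonous" as the positive class. The first case is a classifier that labels every image edible on the unbalanced image counts; the second uses hypothetical confusion counts that are illustrative only and are not measured results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1, with 'poisonous' as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Labelling every image "edible": all 410 poisonous images are false negatives,
# all 6552 edible images are true negatives.
always_edible = binary_metrics(tp=0, fp=0, fn=410, tn=6552)
print("always edible: acc={:.3f} prec={:.3f} rec={:.3f} f1={:.3f}".format(*always_edible))

# Hypothetical confusion counts (illustrative only, not measured results)
example = binary_metrics(tp=380, fp=200, fn=30, tn=6352)
print("example:       acc={:.3f} prec={:.3f} rec={:.3f} f1={:.3f}".format(*example))
```

The always-edible classifier scores about 94% accuracy while catching no poisonous plants at all, which is why recall is tracked alongside accuracy.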

***INSERT DEPICTION OF RECALL/PRECISION/F1 FOR MODEL ARCHITECTURE

### Transferring Model Trained in VM to TX2 (Edge Device)

We chose to train the model in a cloud environment in order to utilize large datasets and train complex models. However, our goal is to transfer the trained model down to an edge device where inference can be run on pictures of plants. This differs from many earlier solutions, in which inference remains in the cloud: if a data connection to the cloud cannot be made, inference cannot be achieved. 

We transferred the model down to the TX2 by copying the model architecture and weights and directly using the pre-trained model for image classification as well as binary classification. We also explored the use of TensorRT in order to optimize inference time on the edge device. The TX2 is a fairly powerful edge device; however, we noticed that certain model architectures required memory or hardware that exceeded the specifications of the TX2. We therefore had to select models that could run on a lower powered device. 
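
One way to compare candidate models on the device is to time repeated predictions. The sketch below is a generic timing harness, not the code that was run on the TX2: `measure_latency` and `fake_predict` are hypothetical names, and `fake_predict` stands in for a real call such as `model.predict` so that the example runs anywhere.

```python
import time
from statistics import mean

def measure_latency(predict_fn, sample, warmup=3, runs=20):
    """Time repeated calls to predict_fn(sample); returns (mean, worst) in seconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        predict_fn(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append(time.perf_counter() - start)
    return mean(timings), max(timings)

# Hypothetical stand-in for model.predict so the sketch runs without a model
def fake_predict(image):
    return sum(image) / len(image)

avg_s, worst_s = measure_latency(fake_predict, [0.0] * 1000)
print(f"mean {avg_s * 1e3:.3f} ms, worst {worst_s * 1e3:.3f} ms")
```

Warm-up calls are excluded because the first predictions on a GPU device typically include one-off setup costs that would distort the average.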

***INSERT CODE FOR MODEL TRANSFER & RUN ON TX2

***INSERT DISCUSSION OF TX2 PERFORMANCE