## David Liu
## BAN 676

## QUESTION 2:

#### Build a CNN Classification Model to classify species from the Orchid Flowers dataset (Use train and validation data for respective purpose, no testing needed)

#### (1) Model 2: Use Transfer learning or fine tuning

In [86]:
import numpy as np
import pandas as pd
import os
import tensorflow as tf
from keras_preprocessing.image import ImageDataGenerator
from tensorflow import keras
import matplotlib.pyplot as plt
from PIL import Image 

print(tf.__version__)

2.3.0


## Preliminary Steps
Checking to see how many images are in the entire orchid data directory

In [87]:
orchid_directory = "Orchid Flowers Dataset-v1.1/Orchid_Images"

file_names = os.listdir(orchid_directory)
print(len(file_names), "images")

7160 images


Putting together a dataframe that shows the image file names and their associated labels. This dataframe will be used later to build an image dataset with associated labels.

In [88]:
img_label = pd.read_csv("Orchid Flowers Dataset-v1.1/Species_Classifier/ClassLabels.txt", header=None)
img_label.columns = ["Images","Labels"]
img_label["Labels"] = img_label["Labels"].astype(str)
img_label

| | Images | Labels |
|---|---|---|
| 0 | 1.jpg | 1 |
| 1 | 2.jpg | 1 |
| 2 | 3.jpg | 1 |
| 3 | 4.jpg | 1 |
| 4 | 5.jpg | 1 |
| ... | ... | ... |
| 7151 | 7152.jpeg | 63 |
| 7152 | 7153.jpeg | 63 |
| 7153 | 7154.jpeg | 63 |
| 7154 | 7155.jpg | 63 |


Identifying the formats of all image files in the image folder. The Keras utilities ImageDataGenerator() and flow_from_dataframe(), which will be used later to create an image dataset, accept only a fixed set of image formats (such as jpg, jpeg, png, and bmp).

In [89]:
file_names_df = pd.DataFrame(file_names)
file_names_df[0] = file_names_df[0].str.extract(r"[0-9]+\.(.*)")
file_names_df_dummies = pd.get_dummies(file_names_df[0])
file_names_df_dummies.sum()

jfif       4
jpeg      34
jpg     7122
dtype: int64

As shown above, there are 4 jfif images, which is not an accepted image format. These images will be converted to jpg.
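
The same format check can be sketched with the standard library alone. This is a minimal sketch, not the notebook's own method: `count_extensions`, `sample_names`, and the `accepted` set are hypothetical names, and the set of accepted extensions is an assumption that should be checked against the Keras version in use.

```python
import os
from collections import Counter


def count_extensions(names):
    """Count files per lowercase extension (without the leading dot)."""
    extensions = (os.path.splitext(name)[1].lstrip(".").lower() for name in names)
    return Counter(extensions)


# Hypothetical sample; in the notebook this would be os.listdir(orchid_directory).
sample_names = ["1.jpg", "2.jpg", "7152.jpeg", "7067.jfif"]
extension_counts = count_extensions(sample_names)
print(extension_counts)

# Assumed whitelist: anything outside it needs converting before loading.
accepted = {"jpg", "jpeg", "png", "bmp"}
needs_conversion = sorted(ext for ext in extension_counts if ext not in accepted)
print(needs_conversion)
```

Using `os.path.splitext` avoids relying on the file names being purely numeric, which the regular expression above assumes.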

In [90]:
deviant_files = []

for each in file_names:
    if each.endswith(".jfif"):
        deviant_files.append(each)

print(deviant_files)
  

['7067.jfif', '7068.jfif', '7069.jfif', '7071.jfif']


The 4 jfif files - '7067.jfif', '7068.jfif', '7069.jfif', and '7071.jfif' - are converted to jpg, and resaved into the image folder.

In [91]:
for each in deviant_files:
    im = Image.open(orchid_directory + "/" + each)
    im.save(orchid_directory + "/" + each.split(".")[0] + ".jpg")

The file names with the .jfif extension in the img_label dataframe are changed to the .jpg extension. The strings in the "Images" column of the img_label dataframe are used as file names that point to image files in the image folder. Therefore, the strings must match the image file names in that folder.

In [92]:
for index in range(len(img_label)):
    if img_label.iloc[index,0] in deviant_files:
        img_label.iloc[index,0] = img_label.iloc[index,0].split(".")[0] + ".jpg"
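
The renaming step can also be written as a small helper that only touches the extension. This is a sketch under assumptions: `to_jpg_name` and `image_names` are hypothetical names, and the sample list stands in for the "Images" column.

```python
from pathlib import PurePath


def to_jpg_name(file_name):
    """Return file_name with a .jfif extension swapped for .jpg; others unchanged."""
    path = PurePath(file_name)
    if path.suffix.lower() == ".jfif":
        return str(path.with_suffix(".jpg"))
    return file_name


# Hypothetical sample of the "Images" column.
image_names = ["7066.jpg", "7067.jfif", "7152.jpeg"]
renamed = [to_jpg_name(name) for name in image_names]
print(renamed)
```

With pandas, the helper could be applied to the whole column in one step, for example `img_label["Images"] = img_label["Images"].map(to_jpg_name)`. Unlike `split(".")[0]`, `with_suffix` replaces only the final extension, so a name containing more than one dot would keep the rest intact.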

Checking the image formats again, this time from the file names listed in the img_label dataframe.

In [57]:
formats = img_label["Images"].str.extract(r"[0-9]+\.(.*)")

formats_dummies = pd.get_dummies(formats[0])
formats_dummies.sum()


jpeg      34
jpg     7122
dtype: int64

The first few rows of the img_label dataframe

In [58]:
img_label.head(20)

| | Images | Labels |
|---|---|---|
| 0 | 1.jpg | 1 |
| 1 | 2.jpg | 1 |
| 2 | 3.jpg | 1 |
| 3 | 4.jpg | 1 |
| 4 | 5.jpg | 1 |
| 5 | 6.jpg | 1 |
| 6 | 7.jpg | 1 |
| 7 | 8.jpg | 1 |
| 8 | 9.jpg | 1 |
| 9 | 10.jpg | 1 |


## Split Training/Validation Data

Extracting the data again, and splitting it into training (70%) and validation (30%) data. Extracted images are resized to 300 x 300 (the target_size used below) in order to accommodate the majority of image sizes in the data directory. The number of classes shown in the outputs below is inferred by the function.

In [98]:
image_data=ImageDataGenerator(rescale=1./255.,validation_split=0.3)

In [99]:
orchid_img_train = image_data.flow_from_dataframe(
    dataframe=img_label,
    directory = "Orchid Flowers Dataset-v1.1/Orchid_Images/",
    subset = "training",
    x_col = "Images",
    y_col = "Labels",
    batch_size = 32,
    seed = 3,
    shuffle = False,
    class_mode = "categorical",
    target_size = (300,300),
)

Found 5010 validated image filenames belonging to 156 classes.
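
With batch_size = 32, the image counts translate into a number of batches per pass over each subset. The arithmetic is sketched below; `batch_layout` is a hypothetical helper, and the validation count of 2146 is taken from the output of the next cell.

```python
import math


def batch_layout(num_images, batch_size):
    """Return (number of batches, size of the final batch)."""
    num_batches = math.ceil(num_images / batch_size)
    last_batch = num_images - (num_batches - 1) * batch_size
    return num_batches, last_batch


train_batches, train_last = batch_layout(5010, 32)
val_batches, val_last = batch_layout(2146, 32)
print(train_batches, train_last)  # 157 batches, the last one holding 18 images
print(val_batches, val_last)      # 68 batches, the last one holding 2 images
```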


In [100]:
orchid_img_val = image_data.flow_from_dataframe(
    dataframe=img_label,
    directory = "Orchid Flowers Dataset-v1.1/Orchid_Images/",
    subset = "validation",
    x_col = "Images",
    y_col = "Labels",
    batch_size = 32,
    seed = 3,
    shuffle = False,
    class_mode = "categorical",
    target_size = (300,300),
)

Found 2146 validated image filenames belonging to 156 classes.
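
Both generators above are created with shuffle = False on a dataframe that is ordered by label. If validation_split cuts the dataframe by row position, which is an assumption here and is not verified in this notebook, the two subsets could end up holding largely different species. One possible safeguard is to split per label before building the generators. The sketch below uses only the standard library; `stratified_split` and `rows` are hypothetical names, and the sample rows are made up.

```python
import random
from collections import defaultdict


def stratified_split(rows, validation_fraction=0.3, seed=3):
    """Split (image, label) rows so each label contributes to both subsets."""
    by_label = defaultdict(list)
    for image, label in rows:
        by_label[label].append(image)

    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_label):
        images = by_label[label][:]
        rng.shuffle(images)
        n_val = int(round(len(images) * validation_fraction))
        # Keep at least one image on each side when a label has two or more.
        if len(images) > 1:
            n_val = min(max(n_val, 1), len(images) - 1)
        validation += [(image, label) for image in images[:n_val]]
        train += [(image, label) for image in images[n_val:]]
    return train, validation


# Hypothetical rows standing in for img_label's (Images, Labels) pairs.
rows = [(f"{i}.jpg", str(1 + i // 10)) for i in range(30)]
train_rows, val_rows = stratified_split(rows)
print(len(train_rows), len(val_rows))
```

The two resulting lists could be turned back into dataframes and passed to separate flow_from_dataframe() calls without the subset argument.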


## Building the Transfer Learning Model

This transfer learning model will be built with VGG16. The pretrained convolutional layers are frozen, and a new classification head is added on top.

In [105]:
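# Standalone input tensor (not wired into the model below; VGG16 creates its own input layer)
orchid_input = tf.keras.Input(shape=(300,300,3))

# VGG16 convolutional base without its original classification head
t_model = tf.keras.applications.VGG16(include_top = False, input_shape=(300,300,3))

# Freeze the pretrained layers so that only the new head is trained
for layer in t_model.layers:
    layer.trainable = False

# New head: flatten the feature maps, one hidden layer, softmax over the 156 classes
flat_layer = tf.keras.layers.Flatten()(t_model.layers[-1].output)
dense_layer = tf.keras.layers.Dense(512, activation="relu")(flat_layer)
output_layer = tf.keras.layers.Dense(156, activation="softmax")(dense_layer)

# Join the frozen base and the new head into a single model
t_model = tf.keras.Model(inputs=t_model.inputs, outputs=output_layer)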
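
The size of the new head follows from the input shape. VGG16 has five 2x2 max-pooling layers and 512 channels in its last block, so the arithmetic can be sketched as below. The helper names `vgg16_feature_map_side` and `dense_params` are hypothetical; the layer sizes (512 and 156) come from the cell above.

```python
def vgg16_feature_map_side(input_side, num_pools=5):
    """Side length after VGG16's five 2x2, stride-2 max-pooling layers."""
    side = input_side
    for _ in range(num_pools):
        side //= 2  # each pooling layer halves the side, rounding down
    return side


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


side = vgg16_feature_map_side(300)      # 300 -> 150 -> 75 -> 37 -> 18 -> 9
flat_units = side * side * 512          # units produced by Flatten()
hidden_params = dense_params(flat_units, 512)
output_params = dense_params(512, 156)
print(side, flat_units, hidden_params, output_params)
```

Under these assumptions the Flatten layer yields 41,472 units, so the 512-unit hidden layer alone holds about 21.2 million trainable parameters.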

In [106]:
t_model.summary()

Model: "functional_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_17 (InputLayer)        [(None, 300, 300, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 300, 300, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 300, 300, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 150, 150, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 150, 150, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 150, 150, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 75, 75, 128)     

In [107]:
label = t_model.predict(orchid_img_val)

KeyboardInterrupt: 

In [None]:
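# decode_predictions() only handles 1000-class ImageNet output, so map indices to orchid labels instead
index_to_label = {index: name for name, index in orchid_img_val.class_indices.items()}
label = [index_to_label[i] for i in np.argmax(label, axis=1)]
label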
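
decode_predictions() is written for the 1000-class ImageNet output of the original VGG16 head, so it does not fit the 156-way softmax built here. A possible alternative is to take the most likely index per image and look it up in the generator's class_indices mapping. The sketch below uses plain Python lists; `decode_with_class_indices` is a hypothetical helper, and the small mapping and probabilities are made up for illustration.

```python
def decode_with_class_indices(probabilities, class_indices):
    """Map each row of class probabilities to its most likely label."""
    index_to_label = {index: label for label, index in class_indices.items()}
    decoded = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)
        decoded.append(index_to_label[best_index])
    return decoded


# Hypothetical stand-ins: three classes instead of 156, and made-up probabilities.
class_indices = {"1": 0, "10": 1, "2": 2}
probabilities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
decoded_labels = decode_with_class_indices(probabilities, class_indices)
print(decoded_labels)
```

Because the labels are stored as strings, the index order need not follow numeric order, which is why the lookup goes through the mapping rather than using the index directly.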

In [104]:
t_orchid_img_val = tf.keras.preprocessing.image_dataset_from_directory(
  directory = "Orchid Flowers Dataset-v1.1/",
  validation_split=0.3,
  subset="validation",
  seed=2,
  image_size=(300, 300),
  batch_size=32)

Found 7156 files belonging to 8 classes.
Using 2146 files for validation.
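
By default, image_dataset_from_directory() infers one class per immediate subdirectory of `directory`. The "8 classes" reported above therefore presumably reflect the top-level folders of the dataset rather than orchid species. The sketch below imitates that inference with the standard library; `inferred_class_names` is a hypothetical helper, and the temporary folder layout is invented for the demonstration (only two folder names are borrowed from the paths used earlier).

```python
import os
import tempfile


def inferred_class_names(root):
    """Return the sorted immediate subdirectory names of root."""
    return sorted(
        entry for entry in os.listdir(root)
        if os.path.isdir(os.path.join(root, entry))
    )


# Invented layout: two folders and one loose file inside a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for folder in ("Orchid_Images", "Species_Classifier"):
        os.makedirs(os.path.join(root, folder))
    open(os.path.join(root, "readme.txt"), "w").close()
    class_names = inferred_class_names(root)

print(class_names)
```

To get species labels from this function, the images would need to sit in one subfolder per species, or the labels would need to be supplied explicitly.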
