
Data pre-processing is modeled on https://www.kaggle.com/vbookshelf/a-simple-keras-solution

In [0]:
# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.
from google.colab import files
files.upload()

In [2]:
# Let's make sure the kaggle.json file is present.
!ls -lha kaggle.json

-rw-r--r-- 1 root root 67 May 20 20:19 kaggle.json


In [0]:
# Next, install the Kaggle API client.
!pip install -q kaggle

In [0]:
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
# Download the V2 Plant Seedlings data set locally.
!kaggle datasets download -d vbookshelf/v2-plant-seedlings-dataset
!ls -l 

Downloading v2-plant-seedlings-dataset.zip to /content
100% 3.18G/3.19G [00:37<00:00, 72.8MB/s]
100% 3.19G/3.19G [00:37<00:00, 90.5MB/s]
total 3347024
-rw-r--r-- 1 root root         67 May 20 20:19 kaggle.json
drwxr-xr-x 1 root root       4096 May 13 16:29 sample_data
-rw-r--r-- 1 root root 3427338216 May 20 20:20 v2-plant-seedlings-dataset.zip


In [0]:
!unzip -q v2-plant-seedlings-dataset.zip 

In [0]:
import pandas as pd
import numpy as np

import tensorflow

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

import os
import cv2

import imageio
import skimage
import skimage.io
import skimage.transform

from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import itertools
import shutil
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
# Number of samples we will have in each class.
SAMPLE_SIZE = 250

# The images will all be resized to this size.
IMAGE_SIZE = 224

In [9]:
os.listdir('nonsegmentedv2')

['Maize',
 'Black-grass',
 'Scentless Mayweed',
 'Sugar beet',
 'Fat Hen',
 'Loose Silky-bent',
 'Shepherd’s Purse',
 'Cleavers',
 'Common wheat',
 'Small-flowered Cranesbill',
 'Common Chickweed',
 'Charlock']

In [0]:
# Create a new directory to store all available images
all_images_dir = 'all_images_dir'
os.mkdir(all_images_dir)


In [0]:
# This code copies all images from their separate folders into a single
# folder called all_images_dir.


folder_list = os.listdir('nonsegmentedv2')

for folder in folder_list:
    
    # create a path to the folder
    path = 'nonsegmentedv2/' + str(folder)

    # create a list of all files in the folder
    file_list = os.listdir(path)

    # copy the images to all_images_dir
    for fname in file_list:

        # source path to image
        src = os.path.join(path, fname)
        
        # Change the file name because many images have the same file name.
        # Add the folder name to the existing file name.
        new_fname = str(folder) + '_' + fname
        
        # destination path to image
        dst = os.path.join(all_images_dir, new_fname)
        # copy the image from the source to the destination
        shutil.copyfile(src, dst)

In [12]:
# Check how many images are in all_images_dir.
# Should be 5539.

len(os.listdir('all_images_dir'))


5539
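The copy step above can also be expressed with `pathlib`. The following is a minimal sketch rather than the notebook's own code: it builds a small throwaway directory tree with placeholder files so that it runs anywhere, and the helper name `flatten_class_folders` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_class_folders(src_root, dst_dir):
    """Copy every file under src_root/<class>/ into dst_dir as '<class>_<file>'."""
    src_root, dst_dir = Path(src_root), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        for src in sorted(p for p in class_dir.iterdir() if p.is_file()):
            # Prefix with the class name because file names repeat across classes.
            shutil.copyfile(src, dst_dir / f"{class_dir.name}_{src.name}")
            copied += 1
    return copied


# Throwaway demo data: two classes that both contain '1.png' and '2.png'.
demo_root = Path(tempfile.mkdtemp())
for class_name in ("Maize", "Fat Hen"):
    (demo_root / "src" / class_name).mkdir(parents=True)
    for fname in ("1.png", "2.png"):
        (demo_root / "src" / class_name / fname).write_bytes(b"placeholder")

n_copied = flatten_class_folders(demo_root / "src", demo_root / "all_images_dir")
flat_names = sorted(p.name for p in (demo_root / "all_images_dir").iterdir())
print(n_copied, flat_names)
```

Because the class name is prefixed, the two `1.png` files do not overwrite each other, which is the same reason the notebook renames files before copying.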

In [13]:
# Get a list of all images in the all_images_dir folder.
image_list = os.listdir('all_images_dir')

# Create the dataframe.
df_data = pd.DataFrame(image_list, columns=['image_id'])

df_data.head()

   image_id
0  Common Chickweed_359.png
1  Sugar beet_177.png
2  Fat Hen_304.png
3  Small-flowered Cranesbill_57.png
4  Common wheat_128.png


In [14]:
# Each file name has this format:
# Loose Silky-bent_377.png

# This function will extract the class name from the file name of each image.
def extract_target(x):
    # split into a list
    a = x.split('_')
    # the target is the first index in the list
    target = a[0]
    
    return target


# create a new column called 'target'
df_data['target'] = df_data['image_id'].apply(extract_target)

df_data.head()

   image_id                          target
0  Common Chickweed_359.png          Common Chickweed
1  Sugar beet_177.png                Sugar beet
2  Fat Hen_304.png                   Fat Hen
3  Small-flowered Cranesbill_57.png  Small-flowered Cranesbill
4  Common wheat_128.png              Common wheat
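Splitting on every underscore works here because none of the twelve class names contains one. As a hedged alternative, the sketch below splits only at the last underscore, so a class name with an underscore in it would still be recovered whole. It redefines `extract_target` purely for illustration.

```python
def extract_target(image_id):
    """Return the class name from a file name shaped like '<class>_<number>.png'."""
    # rsplit with maxsplit=1 cuts only at the last underscore.
    return image_id.rsplit('_', 1)[0]


examples = ['Loose Silky-bent_377.png', 'Common wheat_128.png']
targets = [extract_target(name) for name in examples]
print(targets)
```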


In [15]:
df_data.shape

(5539, 2)

In [16]:
# What is the class distribution?

df_data['target'].value_counts()

Loose Silky-bent             762
Common Chickweed             713
Scentless Mayweed            607
Small-flowered Cranesbill    576
Fat Hen                      538
Sugar beet                   463
Charlock                     452
Cleavers                     335
Black-grass                  309
Shepherd’s Purse             274
Maize                        257
Common wheat                 253
Name: target, dtype: int64

In [0]:
# Get a list of classes
target_list = os.listdir('nonsegmentedv2')

for target in target_list:

    # Filter out a target and take a random sample
    df = df_data[df_data['target'] == target].sample(SAMPLE_SIZE, random_state=101)
    
    # if it's the first item in the list
    if target == target_list[0]:
        df_sample = df
    else:
        # Concat the dataframes
        df_sample = pd.concat([df_sample, df], axis=0).reset_index(drop=True)

In [18]:
# Display the balanced classes.

df_sample['target'].value_counts()

Sugar beet                   250
Fat Hen                      250
Small-flowered Cranesbill    250
Cleavers                     250
Loose Silky-bent             250
Maize                        250
Scentless Mayweed            250
Black-grass                  250
Common Chickweed             250
Common wheat                 250
Charlock                     250
Shepherd’s Purse             250
Name: target, dtype: int64
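The per-class loop above can be condensed into one call. This is a sketch on a toy frame, assuming pandas 1.1 or newer (where `DataFrameGroupBy.sample` is available); the names `toy` and `TOY_SAMPLE_SIZE` are stand-ins for `df_data` and `SAMPLE_SIZE`.

```python
import pandas as pd

# Toy stand-in for df_data: 5 images of class 'A', 3 of class 'B'.
toy = pd.DataFrame({
    'image_id': [f'A_{i}.png' for i in range(5)] + [f'B_{i}.png' for i in range(3)],
    'target': ['A'] * 5 + ['B'] * 3,
})

TOY_SAMPLE_SIZE = 3  # plays the role of SAMPLE_SIZE

# Draw the same number of rows from every class in a single call.
toy_sample = (toy.groupby('target')
                 .sample(n=TOY_SAMPLE_SIZE, random_state=101)
                 .reset_index(drop=True))

counts = toy_sample['target'].value_counts().to_dict()
print(counts)
```

As with the loop version, sampling without replacement fails if a class has fewer rows than requested. In this notebook the smallest class (Common wheat) has 253 images, so a sample size of 250 is safe.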

In [19]:
# train_test_split

# stratify=y creates a balanced validation set.
y = df_sample['target']

df_train, df_val = train_test_split(df_sample, test_size=0.10, random_state=101, stratify=y)

print(df_train.shape)
print(df_val.shape)

(2700, 2)
(300, 2)
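To make the effect of `stratify` concrete, here is a hand-rolled illustration in plain Python. It is not scikit-learn's implementation, only a sketch of the idea: each class is split separately, so both parts keep the same class mix. The function name `stratified_split` and the synthetic labels are assumptions for the demo.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=101):
    """Return (train_idx, val_idx) with the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# 12 classes x 250 samples, mirroring the balanced df_sample above.
labels = [f'class_{c}' for c in range(12) for _ in range(250)]
train_idx, val_idx = stratified_split(labels, test_fraction=0.10)

val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), set(val_counts.values()))
```

The sizes match the shapes printed above: 2700 training rows, 300 validation rows, and 25 validation rows per class.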


In [20]:
# Train set class distribution

df_train['target'].value_counts()


Sugar beet                   225
Fat Hen                      225
Small-flowered Cranesbill    225
Cleavers                     225
Loose Silky-bent             225
Maize                        225
Scentless Mayweed            225
Black-grass                  225
Common Chickweed             225
Common wheat                 225
Charlock                     225
Shepherd’s Purse             225
Name: target, dtype: int64

In [21]:
# Val set class distribution

df_val['target'].value_counts()

Black-grass                  25
Loose Silky-bent             25
Charlock                     25
Common wheat                 25
Fat Hen                      25
Cleavers                     25
Maize                        25
Scentless Mayweed            25
Shepherd’s Purse             25
Sugar beet                   25
Small-flowered Cranesbill    25
Common Chickweed             25
Name: target, dtype: int64

In [22]:
folder_list = os.listdir('nonsegmentedv2')

folder_list

['Maize',
 'Black-grass',
 'Scentless Mayweed',
 'Sugar beet',
 'Fat Hen',
 'Loose Silky-bent',
 'Shepherd’s Purse',
 'Cleavers',
 'Common wheat',
 'Small-flowered Cranesbill',
 'Common Chickweed',
 'Charlock']

In [0]:
# Create a new directory
base_dir = 'base_dir'
os.mkdir(base_dir)


#[CREATE FOLDERS INSIDE THE BASE DIRECTORY]

# now we create 2 folders inside 'base_dir':

# create a path to 'base_dir' to which we will join the names of the new folders
# train_dir
train_dir = os.path.join(base_dir, 'train_dir')
os.mkdir(train_dir)

# val_dir
val_dir = os.path.join(base_dir, 'val_dir')
os.mkdir(val_dir)


# [CREATE FOLDERS INSIDE THE TRAIN AND VALIDATION FOLDERS]

# create new folders inside train_dir

for folder in folder_list:
    
    folder = os.path.join(train_dir, str(folder))
    os.mkdir(folder)


# create new folders inside val_dir

for folder in folder_list:
    
    folder = os.path.join(val_dir, str(folder))
    os.mkdir(folder)

In [24]:
# check that the folders have been created

os.listdir('base_dir/train_dir')


['Maize',
 'Black-grass',
 'Scentless Mayweed',
 'Sugar beet',
 'Fat Hen',
 'Loose Silky-bent',
 'Shepherd’s Purse',
 'Cleavers',
 'Common wheat',
 'Small-flowered Cranesbill',
 'Common Chickweed',
 'Charlock']
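Note that `os.mkdir` raises `FileExistsError` if the cell is run a second time. A re-runnable variant could look like the sketch below, which uses `os.makedirs(..., exist_ok=True)`. The helper name `make_split_dirs` is hypothetical, and the demo writes into a temporary directory so it does not touch the notebook's own `base_dir`.

```python
import os
import tempfile


def make_split_dirs(base_dir, class_names, splits=('train_dir', 'val_dir')):
    """Create base_dir/<split>/<class> for every split and class."""
    for split in splits:
        for name in class_names:
            # makedirs also creates missing parents; exist_ok=True means an
            # already existing folder is not an error.
            os.makedirs(os.path.join(base_dir, split, name), exist_ok=True)


# Temporary root so the sketch does not touch the notebook's own base_dir.
demo_base = os.path.join(tempfile.mkdtemp(), 'base_dir')
demo_classes = ['Maize', 'Black-grass', 'Fat Hen']  # shortened list for the demo

make_split_dirs(demo_base, demo_classes)
make_split_dirs(demo_base, demo_classes)  # second call changes nothing and does not fail

created = sorted(os.listdir(os.path.join(demo_base, 'train_dir')))
print(created)
```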

In [0]:
# Set the id as the index in df_data
df_data.set_index('image_id', inplace=True)

In [26]:
df_data.head()

image_id                          target
Common Chickweed_359.png          Common Chickweed
Sugar beet_177.png                Sugar beet
Fat Hen_304.png                   Fat Hen
Small-flowered Cranesbill_57.png  Small-flowered Cranesbill
Common wheat_128.png              Common wheat


In [0]:
# Get a list of train and val images
train_list = list(df_train['image_id'])
val_list = list(df_val['image_id'])

# Transfer the train images

for image in train_list:
    
    # image_id already includes the .png extension, so it is used as the file name
    fname = image
    # get the label for a certain image
    folder = df_data.loc[image,'target']
    
    
    # source path to image
    src = os.path.join(all_images_dir, fname)
    # destination path to image
    dst = os.path.join(train_dir, folder, fname)
    
    # resize the image and save it at the new location
    image = cv2.imread(src)
    image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    # save the image at the destination
    cv2.imwrite(dst, image)
        
    

# Transfer the val images

for image in val_list:
    
    # image_id already includes the .png extension, so it is used as the file name
    fname = image
    # get the label for a certain image
    folder = df_data.loc[image,'target']
    

    # source path to image
    src = os.path.join(all_images_dir, fname)
    # destination path to image
    dst = os.path.join(val_dir, folder, fname)

    # resize the image and save it at the new location
    image = cv2.imread(src)
    image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    # save the image at the destination
    cv2.imwrite(dst, image)


In [28]:
folder_list = os.listdir('base_dir/train_dir')

total_images = 0

# loop through each folder
for folder in folder_list:
    # set the path to a folder
    path = 'base_dir/train_dir/' + str(folder)
    # get a list of images in that folder
    images_list = os.listdir(path)
    # get the length of the list
    num_images = len(images_list)
    
    total_images = total_images + num_images
    # print the result
    print(str(folder) + ':' + ' ' + str(num_images))
    
print('\n')
# print the total number of images available
print('Total Images: ', total_images)

Maize: 225
Black-grass: 225
Scentless Mayweed: 225
Sugar beet: 225
Fat Hen: 225
Loose Silky-bent: 225
Shepherd’s Purse: 225
Cleavers: 225
Common wheat: 225
Small-flowered Cranesbill: 225
Common Chickweed: 225
Charlock: 225


Total Images:  2700


In [29]:
# get a list of image folders
folder_list = os.listdir('base_dir/val_dir')

total_images = 0

# loop through each folder
for folder in folder_list:
    # set the path to a folder
    path = 'base_dir/val_dir/' + str(folder)
    # get a list of images in that folder
    images_list = os.listdir(path)
    # get the length of the list
    num_images = len(images_list)
    
    total_images = total_images + num_images
    # print the result
    print(str(folder) + ':' + ' ' + str(num_images))
    
print('\n')
# print the total number of images available
print('Total Images: ', total_images)

Maize: 25
Black-grass: 25
Scentless Mayweed: 25
Sugar beet: 25
Fat Hen: 25
Loose Silky-bent: 25
Shepherd’s Purse: 25
Cleavers: 25
Common wheat: 25
Small-flowered Cranesbill: 25
Common Chickweed: 25
Charlock: 25


Total Images:  300


In [0]:
train_path = 'base_dir/train_dir'
valid_path = 'base_dir/val_dir'


num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)
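The step counts are the number of batches needed to see every sample once. The sketch below reproduces the arithmetic with the sizes reported earlier (2700 training and 300 validation images). It uses `math.ceil`, which returns an `int`, whereas `np.ceil` returns a float such as `270.0`. The batch size of 32 in the last line is a hypothetical value, included only to show a partial final batch.

```python
import math

num_train_samples = 2700   # len(df_train) in the notebook
num_val_samples = 300      # len(df_val) in the notebook
batch_size = 10

# Number of batches per epoch, rounded up so no sample is dropped.
train_steps = math.ceil(num_train_samples / batch_size)
val_steps = math.ceil(num_val_samples / batch_size)
print(train_steps, val_steps)

# With a batch size that does not divide evenly, the last batch is partial.
steps_with_32 = math.ceil(num_train_samples / 32)
print(steps_with_32)
```

The `fit` calls in this notebook do not pass these values explicitly; as far as I know, Keras derives the number of steps from the length of the directory iterator when none is given.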

In [31]:
datagen = ImageDataGenerator(rescale=1.0/255)

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='categorical')

val_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=val_batch_size,
                                        class_mode='categorical')

# Note: shuffle=False keeps the test images in a fixed order,
# so predictions can be matched back to the files.
test_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

Found 2700 images belonging to 12 classes.
Found 300 images belonging to 12 classes.
Found 300 images belonging to 12 classes.


Basic neural network model similar to the one developed for homework04.

In [32]:

kernel_size = (3,3)
pool_size= (2,2)

model_basic = Sequential()
model_basic.add(Conv2D(32, kernel_size, activation = 'relu', 
                 input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)))

model_basic.add(Conv2D(32, (3, 3), activation='relu'))
model_basic.add(MaxPooling2D(pool_size=(2, 2)))
model_basic.add(Dropout(0.25))

model_basic.add(Conv2D(128, kernel_size, activation ='relu'))
model_basic.add(MaxPooling2D(pool_size = pool_size))
model_basic.add(Dropout(0.25))

model_basic.add(Flatten())
model_basic.add(Dense(64, activation = "relu"))
model_basic.add(Dense(12, activation = "softmax"))

model_basic.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 222, 222, 32)      896       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 220, 220, 32)      9248      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 110, 110, 32)      0         
_________________________________________________________________
dropout (Dropout)            (None, 110, 110, 32)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 108, 108, 128)     36992     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 54, 54, 128)       0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 54, 54, 128)       0
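The output shapes and parameter counts in the summary can be checked by hand. The sketch below applies the standard formulas for unpadded ('valid') convolution and pooling to the layers of `model_basic`. The helper names `conv_out` and `conv_params` are made up for this example.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of a 'valid' (unpadded) convolution or pooling."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D."""
    return (kernel * kernel * in_channels + 1) * filters


size = 224
shapes = []
for kernel, stride in [(3, 1),   # conv2d
                       (3, 1),   # conv2d_1
                       (2, 2),   # max_pooling2d
                       (3, 1),   # conv2d_2
                       (2, 2)]:  # max_pooling2d_1
    size = conv_out(size, kernel, stride)
    shapes.append(size)

params = [conv_params(3, 3, 32),     # conv2d: RGB input, 32 filters
          conv_params(3, 32, 32),    # conv2d_1
          conv_params(3, 32, 128)]   # conv2d_2

# Number of values handed to Flatten after the last pooling layer.
flatten_units = size * size * 128
print(shapes, params, flatten_units)
```

The computed sizes (222, 220, 110, 108, 54) and parameter counts (896, 9248, 36992) agree with the summary. The summary above is cut off before the `Flatten` layer; by the same arithmetic it would receive 54 × 54 × 128 = 373248 values.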

In [33]:
model_basic.compile(loss='categorical_crossentropy',
             optimizer='adagrad',
             metrics=['accuracy'])

model_basic.fit(train_gen,
         epochs=10,
         validation_data=val_gen)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f1e4cc9c080>

This model uses the AlexNet architecture, adapted from
https://www.mydatahack.com/building-alexnet-with-keras/

In [34]:
# (1) Importing dependency
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten,\
 Conv2D, MaxPooling2D, BatchNormalization
import numpy as np
np.random.seed(1000)


# (3) Create a sequential model
model = Sequential()

# 1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11),\
 strides=(4,4), padding='valid'))
model.add(Activation('relu'))
# Pooling 
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation before passing it to the next layer
model.add(BatchNormalization())

# 2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())

# 3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())

# 4th Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())

# 5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())

# Passing it to a dense layer
model.add(Flatten())
# 1st Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())

# 2nd Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())

# 3rd Dense Layer
model.add(Dense(1000))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())

# Output Layer
model.add(Dense(12, activation = "softmax"))

model.summary()

# (4) Compile 
model.compile(loss='categorical_crossentropy', optimizer='adagrad',\
 metrics=['accuracy'])



Using TensorFlow backend.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 54, 54, 96)        34944     
_________________________________________________________________
activation_1 (Activation)    (None, 54, 54, 96)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 27, 27, 96)        0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 27, 27, 96)        384       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 17, 17, 256)       2973952   
_________________________________________________________________
activation_2 (Activation)    (None, 17, 17, 256)       0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 8, 8, 256)        

In [36]:
# (5) Train
model.fit(train_gen, epochs=10, verbose=1, shuffle=True, validation_data=val_gen )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f1e44231e80>

Overall, the simpler basic architecture reaches the better validation accuracy, 0.58 compared with 0.53 for AlexNet. However, by epoch 10 its progress has slowed sharply, and additional epochs would likely lead to more overfitting rather than higher accuracy.

In contrast, the AlexNet architecture seems to be just getting started. We observe virtually no overfitting, as the validation accuracy is higher than the training accuracy. The model was trained for only 10 epochs to save time, and it still seemed to be improving when training stopped. One article I read on the architecture said that the model was normally trained for 5 to 6 days. Even so, the result of this limited training is not too bad, considering that AlexNet was designed to classify up to 1,000 categories on much larger data sets.
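Deciding when further epochs stop paying off can be automated. The notebook imports `EarlyStopping` near the top but never uses it. The sketch below is a plain-Python illustration of the patience rule that such a callback roughly applies, not Keras code. The function name is hypothetical and the accuracy lists are made-up numbers for illustration only, not results from this notebook.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to beat the
    best validation score seen so far.
    """
    best = float('-inf')
    epochs_without_gain = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return None


# Made-up validation accuracies, for illustration only (not from this notebook).
plateauing = [0.30, 0.45, 0.55, 0.58, 0.57, 0.58, 0.56]
still_improving = [0.20, 0.28, 0.35, 0.41, 0.47, 0.53]

stop_plateau = early_stopping_epoch(plateauing)
stop_improving = early_stopping_epoch(still_improving)
print(stop_plateau, stop_improving)
```

A curve that has flattened triggers a stop, while a curve that is still rising runs to the end, which matches the qualitative difference described between the two models.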

In [0]:
# -f avoids an error when test_images/ does not exist yet
!rm -rf test_images/
!mkdir test_images
!mkdir test_images/Acorn
!mkdir test_images/Fat\ Hen
!mkdir test_images/Maize

!cp base_dir/val_dir/Charlock/Charlock_32.png test_images/Acorn/Acorn_32.png
!cp base_dir/val_dir/Fat\ Hen/'Fat Hen_238.png' test_images/Fat\ Hen/'Fat Hen_238.png'
!cp base_dir/val_dir/Maize/Maize_255.png test_images/Maize/Maize_255.png


In [155]:
# This is how to check what index keras has internally assigned to each class. 
test_gen.class_indices

{'Black-grass': 0,
 'Charlock': 1,
 'Cleavers': 2,
 'Common Chickweed': 3,
 'Common wheat': 4,
 'Fat Hen': 5,
 'Loose Silky-bent': 6,
 'Maize': 7,
 'Scentless Mayweed': 8,
 'Shepherd’s Purse': 9,
 'Small-flowered Cranesbill': 10,
 'Sugar beet': 11}
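To turn predicted indices back into class names, the mapping can be inverted. This is a small sketch using the dictionary printed above and the indices `[1, 7, 5]` that the final cell of this notebook outputs. The variable names are chosen for the example.

```python
class_indices = {
    'Black-grass': 0, 'Charlock': 1, 'Cleavers': 2, 'Common Chickweed': 3,
    'Common wheat': 4, 'Fat Hen': 5, 'Loose Silky-bent': 6, 'Maize': 7,
    'Scentless Mayweed': 8, 'Shepherd’s Purse': 9,
    'Small-flowered Cranesbill': 10, 'Sugar beet': 11,
}

# Invert the mapping: index -> class name.
index_to_class = {index: name for name, index in class_indices.items()}

predicted = [1, 7, 5]  # indices printed by the final cell of this notebook
decoded = [index_to_class[i] for i in predicted]
print(decoded)
```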

Comparing these indices with the prediction output below shows that the model's predicted indices follow the same mapping. The basic model correctly identifies Charlock (1), Maize (7) and Fat Hen (5).

In [169]:
imgPath = 'test_images/'
!ls test_images/

img = datagen.flow_from_directory(imgPath,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        class_mode='categorical'
                                  )

model_basic.predict_classes(img)

 Acorn	'Fat Hen'   Maize
Found 3 images belonging to 3 classes.


array([1, 7, 5])
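`predict_classes` has been removed from newer TensorFlow releases. The usual replacement is to call `predict` and take the index of the largest probability in each row. The sketch below shows only that last step with NumPy, using a made-up probability array (3 images, 4 classes) in place of real model output.

```python
import numpy as np

# Hypothetical softmax outputs for 3 images over 4 classes (each row sums to 1).
# In the notebook these would come from model_basic.predict(img).
probabilities = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.05, 0.05, 0.10, 0.80],
    [0.60, 0.20, 0.10, 0.10],
])

# The predicted class is the column with the highest probability in each row.
predicted_classes = np.argmax(probabilities, axis=1)
print(predicted_classes)
```

Also note that `flow_from_directory` shuffles by default, so the order of the predictions above is not guaranteed to match the order of the folders listed by `ls`. Passing `shuffle=False` when building the generator keeps the order fixed.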