This is a deep learning approach to classifying images of cassava leaves into 5 categories: one of 4 types of disease, or the healthy group.

The dataset is available on kaggle.com and includes 21397 images of cassava crops, taken by farmers or other individuals and labeled by experts. Further details on the dataset and the problem are available on kaggle.com.

*** Important: the code is tailored to run on Colab. To run it on a local computer, the data needs to be downloaded to that system, and the necessary training and test subdirectories need to be created and referenced there.
The code for that is commented out in this file.

In [1]:
# First import the necessary libraries and modules

import os
import pandas as pd
import numpy as np
import zipfile
import random
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import shutil
from shutil import copyfile
import math

In [14]:
# The next few code cells download the dataset from Kaggle onto Colab.

# 1 - Uninstall and reinstall kaggle
!pip uninstall -y kaggle
!pip install kaggle

Unnamed: 0,image_id,label
0,1000015157.jpg,0
1,1000201771.jpg,3
2,100042118.jpg,1
3,1000723321.jpg,1
4,1000812911.jpg,3
...,...,...
21392,999068805.jpg,3
21393,999329392.jpg,3
21394,999474432.jpg,1
21395,999616605.jpg,4


In [None]:
# 2- To use the Kaggle API we need to set environment variables

# Sign up in kaggle if you do not already have an account;
# sign in to Kaggle;
# click on account;
# click the 'Create New API Token' button;
# a .json file will be downloaded on your system;
# open the file using any text editor;
# you'll find your username and key there; copy and paste them into the two following lines and run the cell.

os.environ["KAGGLE_USERNAME"] = 'PUT YOUR USERNAME HERE'
os.environ["KAGGLE_KEY"] = 'PUT YOUR KEY HERE'
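Instead of pasting the values by hand, the two fields could be read straight from the downloaded token file. This is a minimal sketch, assuming the file is a JSON object with `username` and `key` fields; the helper name `load_kaggle_credentials` and the placeholder values are hypothetical.

```python
import json
import os
import tempfile

def load_kaggle_credentials(token_path):
    """Read the API token file and export its two fields as environment variables."""
    with open(token_path) as fh:
        token = json.load(fh)
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]
    return token["username"]

# Demonstration with a throwaway token file holding placeholder values
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({"username": "example-user", "key": "example-key"}, fh)
    demo_token_path = fh.name

demo_user = load_kaggle_credentials(demo_token_path)
print(demo_user)
```

On Colab the real token file would first have to be uploaded to the session, and its path passed in place of `demo_token_path`.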


In [None]:
# 3- Now download the data from Kaggle

raw_data_dir = "input/raw"

!kaggle competitions download -c cassava-leaf-disease-classification -p {raw_data_dir}

In [None]:
# 4- Just to check that the zip file has been downloaded to Colab
!ls {raw_data_dir}

In [None]:
# Well, mission almost accomplished: the dataset is now in our Colab. Now,
# to unzip the downloaded file we'll use fuse-zip. First we need to install it.

!apt-get install -y fuse-zip

In [None]:
# Then, we make the contents of the zip file available under a directory by mounting it with fuse-zip

input_dir = "/tmp/kaggle-data"
!mkdir {input_dir}
!fuse-zip input/raw/cassava-leaf-disease-classification.zip {input_dir}

In [None]:
# Let's check that everything looks normal and the subdirectories are in place
!ls {input_dir}

In [None]:
# The unzipped dataset, as available on Kaggle, is now in the temp directory of Colab.
# It will remain there for 90 min if we do nothing further, and for 12 hours if we keep using it.
# Let's set a variable for the base path of our data.

base_path = '/tmp/kaggle-data/'

os.listdir(base_path)

In [None]:
# It's always a good idea to briefly examine the data to check, for example, whether all labels are valid.
# We'll see that every image is listed by its file name, with a number as its class. Class 4 represents healthy leaves.

# Note: My examination showed that there are some mislabeled images, such as apparently infected leaves that are
# labeled healthy (class 4), or diseased leaves labeled with the wrong disease, though I am not a crop or agriculture expert.

image_labels= pd.read_csv(base_path + 'train.csv')
image_labels

In [2]:
# # This line is only needed on my computer not on the colab

# path = '/Users/apple/Documents/Fereidoun/Projects/kaggle/cassava-leaf-disease-classification/'

In [66]:
# We put the classes of the training and test sets in separate subfolders so that we can use the flow_from_directory method

os.mkdir(base_path + 'subset-train') 
os.mkdir(base_path + 'subset-test') 

for label in range(5):
    os.mkdir(base_path + 'subset-train/label-' + str(label)) # We'll put the training images of each label in one of these subfolders
    os.mkdir(base_path + 'subset-test/label-' + str(label)) # We'll put the test images of each label in one of these subfolders
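The cells that follow rely on `image_labels_shuffled`, `training_size` and `test_size`, none of which is defined in this file. Here is a minimal sketch of how they could be set up. It assumes a random sample of the label table and a 75/25 split; the helper name `shuffle_and_split_sizes`, the seed and the toy table are all hypothetical.

```python
import pandas as pd

def shuffle_and_split_sizes(labels, sample_size, train_fraction=0.75, seed=42):
    """Return a shuffled sample of `labels` plus the training and test sizes."""
    # sample() without replacement shuffles the rows; reset_index keeps slicing positional
    shuffled = labels.sample(n=sample_size, random_state=seed).reset_index(drop=True)
    n_train = int(sample_size * train_fraction)
    return shuffled, n_train, sample_size - n_train

# Toy stand-in for train.csv (hypothetical data, same column names)
toy_labels = pd.DataFrame({
    "image_id": [f"{i}.jpg" for i in range(8)],
    "label": [0, 3, 1, 1, 3, 4, 2, 4],
})
toy_shuffled, toy_train_size, toy_test_size = shuffle_and_split_sizes(toy_labels, sample_size=8)
print(toy_train_size, toy_test_size)
```

On the real table, a call such as `shuffle_and_split_sizes(image_labels, 4000)` would give a 3000/1000 split, which is one way to arrive at the image counts reported by the generators later in this file.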


In [None]:
# After shuffling the list of images, we divide the data into training and test sets by putting them in separate subfolders.
# First we fill the subdirectory for the training set.

subgroup = image_labels_shuffled[:training_size] # pick images for training
files = subgroup.image_id
for f in files:
    shutil.copy(base_path + 'train_images/' + f, base_path + 'subset-train/')

# rearrange the training images by label into separate, named subdirectories
for label in range(5):
    mask = subgroup['label'] == label 
    category = subgroup.loc[mask]
    files = category.image_id
    dest_folder = base_path + 'subset-train/label-' + str(label) +'/'
    for f in files:
        file_path = base_path + 'subset-train/' + f
        if os.path.getsize(file_path) > 0:
            shutil.move(file_path, dest_folder)
        else:
            print(f + " is zero length, so ignoring.")

In [67]:
# Then, we put the rest of the images in another subdirectory as the test set

subgroup = image_labels_shuffled[training_size:] # pick the rest of images for testing
files = subgroup.image_id
for f in files:
    shutil.copy(base_path + 'train_images/' + f, base_path + 'subset-test/')

# rearrange the test images by label into separate, named subdirectories
for label in range(5):
    mask = subgroup['label'] == label 
    category = subgroup.loc[mask]
    files = category.image_id
    dest_folder = base_path + 'subset-test/label-' + str(label) +'/'
    for f in files:
        file_path = base_path + 'subset-test/' + f
        if os.path.getsize(file_path) > 0:
            shutil.move(file_path, dest_folder)
        else:
            print(f + " is zero length, so ignoring.")
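The training and test cells above repeat the same copy-then-move logic. One possible simplification is to copy each image directly into its label folder. The sketch below is not the code used in this notebook: the helper name `copy_into_label_folders` and the throwaway demo files are hypothetical, and it takes plain `(image_id, label)` pairs rather than a DataFrame.

```python
import os
import shutil
import tempfile

def copy_into_label_folders(rows, src_dir, dest_root):
    """Copy each non-empty image into dest_root/label-<label>/ and return the skipped names."""
    skipped = []
    for image_id, label in rows:
        src = os.path.join(src_dir, image_id)
        if os.path.getsize(src) == 0:
            skipped.append(image_id)  # zero-length file, so ignore it
            continue
        dest = os.path.join(dest_root, f"label-{label}")
        os.makedirs(dest, exist_ok=True)
        shutil.copy(src, dest)
    return skipped

# Tiny demonstration on throwaway files (hypothetical names and labels)
demo_root = tempfile.mkdtemp()
demo_src = os.path.join(demo_root, "train_images")
os.makedirs(demo_src)
for name, content in [("a.jpg", b"x"), ("b.jpg", b"y"), ("empty.jpg", b"")]:
    with open(os.path.join(demo_src, name), "wb") as fh:
        fh.write(content)

demo_rows = [("a.jpg", 0), ("b.jpg", 3), ("empty.jpg", 3)]
demo_skipped = copy_into_label_folders(demo_rows, demo_src, os.path.join(demo_root, "subset-train"))
print(demo_skipped)
```

With the DataFrame used here, the rows could be produced with `zip(subgroup.image_id, subgroup.label)`. Because zero-length files are never copied, none are left behind in the parent folder.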

In [None]:
#  To check the number of images of different classes (labels) in the training and test sets

print('No. of images of different labels in the training set.\n')
for i in range(5):
  print(f'Label_{i}: ', len(os.listdir(base_path + 'subset-train/label-' + str(i))))

print('\nNo. of images of different labels in the test set.\n')
for i in range(5):  
  print(f'Label_{i}: ', len(os.listdir(base_path + 'subset-test/label-' + str(i))))

In [3]:
# A deep neural network is going to be the backbone of our algorithm. We'll use Keras and TensorFlow to this end.
# Convolution, pooling, and dropping out a small fraction of the trained neurons are the supplementary techniques
# employed to boost the performance of the method.

# First design the model.

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The next line is a trick to improve the test accuracy even more: during training, dropout
    # randomly switches off a fraction (here 20%) of the incoming activations. 0.5 is too aggressive though.
    tf.keras.layers.Dropout(0.2),   
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])

# And then, set up the parameters of the model.

model.compile(optimizer=RMSprop(learning_rate=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
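To see why the first dense layer dominates the size of this model, the feature-map dimensions can be followed by hand. The sketch below is plain arithmetic, not a Keras call. It assumes the Keras defaults, which are stride-1 convolutions with 'valid' padding and non-overlapping 2x2 pooling; the helper names are hypothetical.

```python
def conv_valid(size, kernel=3):
    """Output width of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output width of a non-overlapping max-pool (floor division)."""
    return size // pool

side = 150  # input images are 150 x 150 x 3
for filters in (16, 32, 64):
    side = max_pool(conv_valid(side))
    print(f"after Conv2D({filters}) + MaxPooling2D: {side} x {side} x {filters}")

flattened = side * side * 64          # values fed into Dense(512)
dense_params = flattened * 512 + 512  # weights plus biases of Dense(512)
print(flattened, dense_params)
```

Under these assumptions the side length goes 150 → 74 → 36 → 17, so the flattened vector has 18496 entries and the first dense layer alone holds 9470464 parameters.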

In [4]:
# Now, we also add augmentation to the algorithm to account for variation in the position and angle of the regions of interest
# (infected areas on leaves) in the images. We also rescale all the images to the 0-1 range (from the 0-255 range).

size_of_batch = 50                                        # the batch size used in the training stage
epoch_steps = math.ceil(training_size/size_of_batch)           # steps per epoch, calculated from the batch size
val_steps = math.ceil(test_size/size_of_batch)

train_dir = os.path.join(base_path, 'subset-train/')
validation_dir = os.path.join(base_path, 'subset-test/')

train_datagen = ImageDataGenerator(rescale=1.0/255,
                                  #  rotation_range = 40,
                                   width_shift_range = 0.3,
                                   height_shift_range = 0.3,
                                   shear_range = 0.3,
                                   zoom_range = 0.7,
                                  #  horizontal_flip = True,
                                   fill_mode = 'nearest')

train_generator = train_datagen.flow_from_directory(train_dir,
                                                    batch_size=size_of_batch,
                                                    class_mode='categorical',
                                                    target_size=(150, 150))

validation_datagen = ImageDataGenerator(rescale=1.0/255)
validation_generator = validation_datagen.flow_from_directory(validation_dir,
                                                              batch_size=size_of_batch,
                                                              class_mode='categorical',
                                                              target_size=(150, 150))

Found 3000 images belonging to 5 classes.
Found 1000 images belonging to 5 classes.
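As a quick check of the step counts, here is the `math.ceil` calculation from the cell above applied to the image counts the generators report. The helper name `steps_for` is hypothetical; the counts 3000 and 1000 are taken from the "Found ..." lines.

```python
import math

def steps_for(n_images, batch_size):
    """Batches needed to see every image once; the last batch may be partial."""
    return math.ceil(n_images / batch_size)

train_steps = steps_for(3000, 50)    # 3000 training images, batch size 50
valid_steps = steps_for(1000, 50)    # 1000 test images, batch size 50
uneven_steps = steps_for(3010, 50)   # a count that is not a multiple of the batch size
print(train_steps, valid_steps, uneven_steps)
```

Rounding up rather than down matters only when the image count is not a multiple of the batch size. In that case the extra step covers the final, smaller batch.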


In [None]:
# We save the performance indicators of every epoch in a variable.
# Note: every epoch with the current configuration and setting will take about 570 seconds.

history = model.fit(train_generator, 
                    epochs = 20,  
                    steps_per_epoch = epoch_steps,
                    validation_data = validation_generator, 
                    validation_steps = val_steps,
                    verbose=2)

Epoch 1/10
100/100 - 2745s - loss: 2.2863 - accuracy: 0.5390 - val_loss: 1.3107 - val_accuracy: 0.6050
Epoch 2/10
100/100 - 863s - loss: 1.1326 - accuracy: 0.6200 - val_loss: 1.1291 - val_accuracy: 0.6090
Epoch 3/10


In [None]:
# Here the progress of the model's performance indicators can be evaluated against the epoch number.

%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.image  as mpimg

#-----------------------------------------------------------
# Retrieve a list of indicators on training and test data
# sets for each epoch
#-----------------------------------------------------------
acc=history.history['accuracy']
val_acc=history.history['val_accuracy']
loss=history.history['loss']
val_loss=history.history['val_loss']

epochs=range(1, len(acc)+1) # Get number of epochs

#------------------------------------------------
# Plot training and validation accuracies versus epoch
#------------------------------------------------
plt.plot(epochs, acc, 'bo', label="Training Accuracy")
plt.plot(epochs, val_acc, 'b', label="Validation Accuracy")
plt.title('Training and validation accuracies')
plt.xlabel('Epoch number')
plt.legend()
plt.figure()

#------------------------------------------------
# Plot training and validation loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'bo', label="Training Loss")
plt.plot(epochs, val_loss, 'b', label="Validation Loss")
plt.title('Training and validation loss')
plt.xlabel('Epoch number')
plt.legend()

plt.show()


In [None]:
# *** Use this code only if, for any reason, you need to remove a directory created above inside the
# data downloaded from Kaggle

# shutil.rmtree(base_path + 'subset-test')