# Image Preprocessing

**Tutorials**

- https://circuitdigest.com/tutorial/image-segmentation-using-opencv
- https://realpython.com/python-opencv-color-spaces/
- https://machinelearningmastery.com/how-to-manually-scale-image-pixel-data-for-deep-learning/

## Exploration

In [1]:
import cv2
import numpy as np

In [2]:
filename = "../data/raw/raw/color/Apple___healthy/0055dd26-23a7-4415-ac61-e0b44ebfaf80___RS_HL 5672.JPG"

In [3]:
image = cv2.imread(filename)

In [4]:
def view_image(image, title="Image"):
    cv2.imshow(title, image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

In [5]:
view_image(image)

In [6]:
image_grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
view_image(image_grayscale, "Grayscale")
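`cv2.COLOR_BGR2GRAY` collapses the three channels into one with a weighted sum, Y = 0.299 R + 0.587 G + 0.114 B. A NumPy-only sketch of the same idea, assuming BGR channel order as returned by `cv2.imread`; the helper name `bgr_to_gray` is made up for illustration, and OpenCV's own rounding may differ by a gray level:

```python
import numpy as np


def bgr_to_gray(image_bgr):
    """Approximate a BGR-to-grayscale conversion with the standard luma weights."""
    b = image_bgr[..., 0]
    g = image_bgr[..., 1]
    r = image_bgr[..., 2]
    gray = 0.114 * b + 0.587 * g + 0.299 * r  # promoted to float64
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


blue_pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)       # pure blue in BGR order
white_pixel = np.array([[[255, 255, 255]]], dtype=np.uint8)
gray_blue = bgr_to_gray(blue_pixel)
gray_white = bgr_to_gray(white_pixel)
```

Blue contributes the least to perceived brightness, which is why a pure blue pixel maps to a dark gray.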

In [7]:
help(cv2.Canny)

Help on built-in function Canny:

Canny(...)
    Canny(image, threshold1, threshold2[, edges[, apertureSize[, L2gradient]]]) -> edges
    .   @brief Finds edges in an image using the Canny algorithm @cite Canny86 .
    .   
    .   The function finds edges in the input image and marks them in the output map edges using the
    .   Canny algorithm. The smallest value between threshold1 and threshold2 is used for edge linking. The
    .   largest value is used to find initial segments of strong edges. See
    .   <http://en.wikipedia.org/wiki/Canny_edge_detector>
    .   
    .   @param image 8-bit input image.
    .   @param edges output edge map; single channels 8-bit image, which has the same size as image .
    .   @param threshold1 first threshold for the hysteresis procedure.
    .   @param threshold2 second threshold for the hysteresis procedure.
    .   @param apertureSize aperture size for the Sobel operator.
    .   @param L2gradient a flag, indicating whether a more accurate \

In [8]:
image_edged = cv2.Canny(image_grayscale, 200, 500)
view_image(image_edged, "Edges")

In [9]:
image_edged_copy = image_edged.copy()

In [10]:
help(cv2.findContours)

Help on built-in function findContours:

findContours(...)
    findContours(image, mode, method[, contours[, hierarchy[, offset]]]) -> contours, hierarchy
    .   @brief Finds contours in a binary image.
    .   
    .   The function retrieves contours from the binary image using the algorithm @cite Suzuki85 . The contours
    .   are a useful tool for shape analysis and object detection and recognition. See squares.cpp in the
    .   OpenCV sample directory.
    .   @note Since opencv 3.2 source image is not modified by this function.
    .   
    .   @param image Source, an 8-bit single-channel image. Non-zero pixels are treated as 1's. Zero
    .   pixels remain 0's, so the image is treated as binary . You can use #compare, #inRange, #threshold ,
    .   #adaptiveThreshold, #Canny, and others to create a binary image out of a grayscale or color one.
    .   If mode equals to #RETR_CCOMP or #RETR_FLOODFILL, the input can also be a 32-bit integer image of labels (CV_32SC1).
    .   @p

In [11]:
contours, hierarchy = cv2.findContours(image_edged_copy, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

In [12]:
contours

[array([[[130, 185]],
 
        [[130, 186]],
 
        [[131, 187]],
 
        [[131, 188]],
 
        [[132, 189]],
 
        [[131, 190]],
 
        [[130, 191]],
 
        [[129, 191]],
 
        [[130, 192]],
 
        [[131, 192]],
 
        [[132, 192]],
 
        [[133, 192]],
 
        [[134, 192]],
 
        [[135, 192]],
 
        [[136, 192]],
 
        [[137, 192]],
 
        [[136, 192]],
 
        [[135, 192]],
 
        [[134, 192]],
 
        [[133, 192]],
 
        [[132, 192]],
 
        [[131, 192]],
 
        [[130, 191]],
 
        [[131, 190]],
 
        [[132, 190]],
 
        [[132, 189]],
 
        [[133, 188]],
 
        [[134, 188]],
 
        [[135, 189]],
 
        [[136, 189]],
 
        [[137, 189]],
 
        [[138, 188]],
 
        [[139, 188]],
 
        [[140, 188]],
 
        [[139, 188]],
 
        [[138, 188]],
 
        [[137, 189]],
 
        [[136, 189]],
 
        [[135, 189]],
 
        [[134, 188]],
 
        [[133, 188]],
 
        [[132, 1

In [13]:
len(contours)

88

In [14]:
view_image(image_edged_copy)

In [15]:
help(cv2.drawContours)

Help on built-in function drawContours:

drawContours(...)
    drawContours(image, contours, contourIdx, color[, thickness[, lineType[, hierarchy[, maxLevel[, offset]]]]]) -> image
    .   @brief Draws contours outlines or filled contours.
    .   
    .   The function draws contour outlines in the image if \f$\texttt{thickness} \ge 0\f$ or fills the area
    .   bounded by the contours if \f$\texttt{thickness}<0\f$ . The example below shows how to retrieve
    .   connected components from the binary image and label them: :
    .   @include snippets/imgproc_drawContours.cpp
    .   
    .   @param image Destination image.
    .   @param contours All the input contours. Each contour is stored as a point vector.
    .   @param contourIdx Parameter indicating a contour to draw. If it is negative, all the contours are drawn.
    .   @param color Color of the contours.
    .   @param thickness Thickness of lines the contours are drawn with. If it is negative (for example,
    .   thickness

In [16]:
# drawContours draws on `image` in place and also returns it, hence the array output below
cv2.drawContours(image, contours, -1, (0, 255, 0), 3)

array([[[138, 138, 156],
        [114, 114, 132],
        [153, 153, 171],
        ...,
        [119, 118, 138],
        [138, 137, 157],
        [134, 133, 153]],

       [[134, 134, 152],
        [108, 108, 126],
        [113, 113, 131],
        ...,
        [113, 112, 132],
        [143, 142, 162],
        [122, 121, 141]],

       [[ 90,  90, 108],
        [119, 119, 137],
        [113, 113, 131],
        ...,
        [ 84,  83, 103],
        [104, 103, 123],
        [109, 108, 128]],

       ...,

       [[192, 195, 200],
        [197, 200, 205],
        [199, 202, 207],
        ...,
        [190, 191, 201],
        [190, 191, 201],
        [190, 191, 201]],

       [[193, 196, 201],
        [199, 202, 207],
        [201, 204, 209],
        ...,
        [190, 191, 201],
        [190, 191, 201],
        [190, 191, 201]],

       [[195, 198, 203],
        [201, 204, 209],
        [201, 204, 209],
        ...,
        [190, 191, 201],
        [190, 191, 201],
        [189, 190, 200]]

In [17]:
view_image(image)

In [18]:
# Bounding rectangles and polygon approximation (no convex hull is computed here)

image_copy = image.copy()
image_gray = cv2.cvtColor(image_copy, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(image_gray, 127, 255, cv2.THRESH_BINARY_INV)
contours, hierarchy = cv2.findContours(thresh.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)

In [20]:
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (0,0,255), 2)
    #view_image(image)
    accuracy = 0.03 * cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, accuracy, True)
    cv2.drawContours(image, [approx], 0, (0,255,0), 2)

In [21]:
view_image(image)

## Preprocessing Images


Based on the Machine Learning Mastery tutorial [How to Manually Scale Image Pixel Data for Deep Learning](https://machinelearningmastery.com/how-to-manually-scale-image-pixel-data-for-deep-learning/).

In [96]:
import glob
import os
import random
import numpy as np
import pandas as pd

from PIL import Image


def find_image_files(data_dir, file_ext="JPG"):
    return glob.glob(os.path.join(data_dir, f"*.{file_ext}"))


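def random_filename(filenames):
    """ Pick one filename uniformly at random """
    # randrange excludes the upper bound, so the index is always valid
    idx = random.randrange(len(filenames))
    return filenames[idx]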


def preprocess_image(filename, normalize=True, standardize=True):
    """ Load an image and return its scaled pixel array """
    image = load_image(filename)
    pixel_values = convert_to_pixel_values(image, normalize, standardize)
    # Keep the (height, width, channels) layout; the batch axis is added when images are stacked
    return pixel_values


def load_image(filename):
    """ Load image and print its format, mode, and size """
    image = Image.open(filename)
    print(f"LOADED {filename}\n{image.format} {image.mode} {image.size}")
    return image


def convert_to_pixel_values(image, normalize, standardize):
    """ Convert image to array of pixel values """
    pixels = np.asarray(image)
    if normalize:
        pixels = normalize_pixel_values(pixels)
    if standardize:
        pixels = standardize_pixel_values(pixels)
    return pixels


def normalize_pixel_values(pixels):
    """ Normalize pixel values to be in the range [0, 1] """
    print('Data Type: %s' % pixels.dtype)
    print('BEFORE NORMALIZATION Min: %.3f, Max: %.3f' % (pixels.min(), pixels.max()))
    pixels = pixels.astype('float32')
    pixels /= 255.0
    print('AFTER NORMALIZATION Min: %.3f, Max: %.3f' % (pixels.min(), pixels.max()))
    return pixels


def standardize_pixel_values(pixels):
    """ Globally standardize pixel values, clip to [-1, 1], then rescale to [0, 1] """
    mean, std = pixels.mean(), pixels.std()
    print('BEFORE STANDARDIZATION Mean: %.3f, Standard Deviation: %.3f' % (mean, std))
    pixels = (pixels - mean) / std
    pixels = np.clip(pixels, -1.0, 1.0)
    pixels = (pixels + 1.0) / 2.0
    mean, std = pixels.mean(), pixels.std()
    print('AFTER STANDARDIZATION Mean: %.3f, Standard Deviation: %.3f' % (mean, std))
    print('AFTER STANDARDIZATION Min: %.3f, Max: %.3f' % (pixels.min(), pixels.max()))
    return pixels

In [75]:
image_files = find_image_files("../data/raw/raw/color/Apple___healthy/", file_ext="JPG")
filename = random_filename(image_files)
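A note on picking a random element: `random.randint(a, b)` includes both endpoints, so an index drawn with `random.randint(0, len(seq))` can equal `len(seq)` and fall off the end of the list. A minimal sketch of two standard-library alternatives; the file names are placeholders:

```python
import random

sample_files = ["a.JPG", "b.JPG", "c.JPG"]  # placeholder names

random.seed(0)  # fixed seed only to make the sketch repeatable

# random.choice picks an element directly, with no index arithmetic.
picked_files = {random.choice(sample_files) for _ in range(100)}

# random.randrange(n) draws from 0..n-1, so every index is valid.
picked_indices = {random.randrange(len(sample_files)) for _ in range(100)}
```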

In [76]:
image = load_image(filename)
pixel_values = convert_to_pixel_values(image, normalize=True, standardize=True)

LOADED ../data/raw/raw/color/Apple___healthy/647ab085-3f32-4ff1-9972-bef63cb4085e___RS_HL 7595.JPG
JPEG RGB (256, 256)
Data Type: uint8
BEFORE NORMALIZATION Min: 0.000, Max: 255.000
AFTER NORMALIZATION Min: 0.000, Max: 1.000
BEFORE STANDARDIZATION Mean: 0.412, Standard Deviation: 0.200
AFTER STANDARDIZATION Mean: 0.492, Standard Deviation: 0.368
AFTER STANDARDIZATION Min: 0.000, Max: 1.000
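The log shows that the final values are not truly standardized: after clipping to [-1, 1] and shifting to [0, 1], the mean lands near 0.5 and the standard deviation shrinks, because every pixel more than one standard deviation from the mean saturates at 0 or 1. The same steps on a four-pixel toy array (values invented for illustration):

```python
import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

scaled = pixels.astype("float32") / 255.0        # normalize to [0, 1]
z = (scaled - scaled.mean()) / scaled.std()      # zero mean, unit standard deviation
clipped = np.clip(z, -1.0, 1.0)                  # saturate beyond one standard deviation
positive = (clipped + 1.0) / 2.0                 # shift and scale into [0, 1]

print(np.round(z, 3))
print(np.round(positive, 3))
```

The darkest and brightest pixels both sit more than one standard deviation from the mean here, so they end up at exactly 0 and 1.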


In [77]:
preprocessed_image = preprocess_image(filename, normalize=True, standardize=True)

LOADED ../data/raw/raw/color/Apple___healthy/647ab085-3f32-4ff1-9972-bef63cb4085e___RS_HL 7595.JPG
JPEG RGB (256, 256)
Data Type: uint8
BEFORE NORMALIZATION Min: 0.000, Max: 255.000
AFTER NORMALIZATION Min: 0.000, Max: 1.000
BEFORE STANDARDIZATION Mean: 0.412, Standard Deviation: 0.200
AFTER STANDARDIZATION Mean: 0.492, Standard Deviation: 0.368
AFTER STANDARDIZATION Min: 0.000, Max: 1.000


In [79]:
image_files

['../data/raw/raw/color/Apple___healthy/603b96cc-3237-4ddf-97b9-11ddbf4840f0___RS_HL 7598.JPG',
 '../data/raw/raw/color/Apple___healthy/7c388477-2953-4b3d-8f22-7f9eed5fff5c___RS_HL 7527.JPG',
 '../data/raw/raw/color/Apple___healthy/24adc938-da71-4d2a-9c7e-f2dc3a73c908___RS_HL 7307.JPG',
 '../data/raw/raw/color/Apple___healthy/ea45b7c8-f1c2-42c0-bfec-96444c58ddd0___RS_HL 6004.JPG',
 '../data/raw/raw/color/Apple___healthy/0adc1c5b-8958-47c0-a152-f28078c214f1___RS_HL 7825.JPG',
 '../data/raw/raw/color/Apple___healthy/ce9a3738-55ea-4974-8539-922ed4f864b8___RS_HL 6175.JPG',
 '../data/raw/raw/color/Apple___healthy/f76ff409-9723-4b89-83f9-1ff33edaffbc___RS_HL 7331.JPG',
 '../data/raw/raw/color/Apple___healthy/a0670cfe-00e2-4c7a-a9f7-29408debdc3a___RS_HL 6061.JPG',
 '../data/raw/raw/color/Apple___healthy/b5ff5168-38d9-4a8b-b5bb-94c310e77b53___RS_HL 6083.JPG',
 '../data/raw/raw/color/Apple___healthy/96717630-6d48-4660-b472-268f5f1304d9___RS_HL 7978.JPG',
 '../data/raw/raw/color/Apple___healthy/

In [91]:
input_data = np.asarray(list(map(preprocess_image, image_files)))

LOADED ../data/raw/raw/color/Apple___healthy/603b96cc-3237-4ddf-97b9-11ddbf4840f0___RS_HL 7598.JPG
JPEG RGB (256, 256)
Data Type: uint8
BEFORE NORMALIZATION Min: 0.000, Max: 251.000
AFTER NORMALIZATION Min: 0.000, Max: 0.984
BEFORE STANDARDIZATION Mean: 0.458, Standard Deviation: 0.181
AFTER STANDARDIZATION Mean: 0.520, Standard Deviation: 0.365
AFTER STANDARDIZATION Min: 0.000, Max: 1.000
LOADED ../data/raw/raw/color/Apple___healthy/7c388477-2953-4b3d-8f22-7f9eed5fff5c___RS_HL 7527.JPG
JPEG RGB (256, 256)
Data Type: uint8
BEFORE NORMALIZATION Min: 0.000, Max: 255.000
AFTER NORMALIZATION Min: 0.000, Max: 1.000
BEFORE STANDARDIZATION Mean: 0.438, Standard Deviation: 0.238
AFTER STANDARDIZATION Mean: 0.489, Standard Deviation: 0.393
AFTER STANDARDIZATION Min: 0.000, Max: 1.000
LOADED ../data/raw/raw/color/Apple___healthy/24adc938-da71-4d2a-9c7e-f2dc3a73c908___RS_HL 7307.JPG
JPEG RGB (256, 256)
Data Type: uint8
BEFORE NORMALIZATION Min: 0.000, Max: 255.000
AFTER NORMALIZATION Min: 0.000, 

In [93]:
input_data.shape

(1645, 256, 256, 3)
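The resulting shape reads as (images, height, width, channels). Building that batch only works when every image has the same shape. A sketch with tiny stand-in arrays (sizes invented for illustration) using `np.stack`, which fails loudly on a mismatch:

```python
import numpy as np

# Three stand-in "images" of identical shape (height, width, channels).
images = [np.full((4, 4, 3), fill, dtype=np.float32) for fill in (0.0, 0.5, 1.0)]
batch = np.stack(images)  # new leading axis: (N, H, W, C)

# A differently sized image cannot join the batch.
odd_one = np.zeros((5, 4, 3), dtype=np.float32)
try:
    np.stack(images + [odd_one])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

If the dataset ever contains images that are not 256 by 256, they would need resizing before this step.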

## Labels

In [99]:
filenames = list(map(os.path.basename, image_files))

In [106]:
apple_df = pd.DataFrame({"Filename": filenames, "Class": ["healthy" for _ in range(len(filenames))]})

In [107]:
apple_df

| | Filename | Class |
|---|---|---|
| 0 | 603b96cc-3237-4ddf-97b9-11ddbf4840f0___RS_HL 7... | healthy |
| 1 | 7c388477-2953-4b3d-8f22-7f9eed5fff5c___RS_HL 7... | healthy |
| 2 | 24adc938-da71-4d2a-9c7e-f2dc3a73c908___RS_HL 7... | healthy |
| 3 | ea45b7c8-f1c2-42c0-bfec-96444c58ddd0___RS_HL 6... | healthy |
| 4 | 0adc1c5b-8958-47c0-a152-f28078c214f1___RS_HL 7... | healthy |
| ... | ... | ... |
| 1640 | 00fca0da-2db3-481b-b98a-9b67bb7b105c___RS_HL 7... | healthy |
| 1641 | ada1699c-0a91-44ea-bc2b-476147394421___RS_HL 5... | healthy |
| 1642 | 5a0aa782-76a8-4de9-89e9-47cb35d222e3___RS_HL 6... | healthy |
| 1643 | 139ff5a7-102b-4434-b540-eb6930e6d968___RS_HL 7... | healthy |


In [115]:
BASE_DIR = "../data/raw/raw/color"

In [135]:
DATA_DIRS = [
    "Apple___Apple_scab",
    "Apple___Black_rot",
    "Apple___Cedar_apple_rust",
    "Apple___healthy",
]

In [142]:
data = {
    "Filename": [],
    "Label": [],
    "Species": [],
}


for data_dir in DATA_DIRS:
    folder_path = os.path.join(BASE_DIR, data_dir)
    
    print(folder_path)
    image_files = list(map(os.path.basename, find_image_files(folder_path)))
    # Folder names follow the pattern "<Species>___<Label>"
    species, label = data_dir.split("___")

    data["Filename"].extend(image_files)
    data["Label"].extend([label] * len(image_files))
    data["Species"].extend([species] * len(image_files))


df = pd.DataFrame(data)

../data/raw/raw/color/Apple___Apple_scab
../data/raw/raw/color/Apple___Black_rot
../data/raw/raw/color/Apple___Cedar_apple_rust
../data/raw/raw/color/Apple___healthy
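Each folder name encodes the species and the label, separated by three underscores. A defensive version of that split, sketched with a hypothetical helper (`parse_class_dir` is not part of the notebook) that rejects names which do not follow the pattern:

```python
def parse_class_dir(dir_name, separator="___"):
    """Split a folder name such as 'Apple___Black_rot' into (species, label)."""
    species, found, label = dir_name.partition(separator)
    if not found or not species or not label:
        raise ValueError(f"Unexpected folder name: {dir_name!r}")
    return species, label


parsed = [
    parse_class_dir(name)
    for name in ("Apple___Apple_scab", "Apple___Cedar_apple_rust", "Apple___healthy")
]
```

`str.partition` splits on the first separator only, so single underscores inside the label (as in `Cedar_apple_rust`) are left alone.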


In [143]:
df

| | Filename | Label | Species |
|---|---|---|---|
| 0 | 95c2cc04-5581-48ca-8001-9644bb488b12___FREC_Sc... | Apple_scab | Apple |
| 1 | 8a212787-77fb-40dc-a028-92d5ff2d6c7c___FREC_Sc... | Apple_scab | Apple |
| 2 | b10dceaa-a764-46b4-bd61-bc9fb13579ca___FREC_Sc... | Apple_scab | Apple |
| 3 | d842b34b-d2ab-48fc-9e78-ef586308c359___FREC_Sc... | Apple_scab | Apple |
| 4 | c270319e-61d4-4376-a3e8-b159ff712dd4___FREC_Sc... | Apple_scab | Apple |
| ... | ... | ... | ... |
| 3166 | 00fca0da-2db3-481b-b98a-9b67bb7b105c___RS_HL 7... | healthy | Apple |
| 3167 | ada1699c-0a91-44ea-bc2b-476147394421___RS_HL 5... | healthy | Apple |
| 3168 | 5a0aa782-76a8-4de9-89e9-47cb35d222e3___RS_HL 6... | healthy | Apple |
| 3169 | 139ff5a7-102b-4434-b540-eb6930e6d968___RS_HL 7... | healthy | Apple |


In [138]:
df.to_csv("../data/raw/labels.csv", index=False)

In [139]:
pd.read_csv("../data/raw/labels.csv")

| | Filename | Label |
|---|---|---|
| 0 | 95c2cc04-5581-48ca-8001-9644bb488b12___FREC_Sc... | Apple_scab |
| 1 | 8a212787-77fb-40dc-a028-92d5ff2d6c7c___FREC_Sc... | Apple_scab |
| 2 | b10dceaa-a764-46b4-bd61-bc9fb13579ca___FREC_Sc... | Apple_scab |
| 3 | d842b34b-d2ab-48fc-9e78-ef586308c359___FREC_Sc... | Apple_scab |
| 4 | c270319e-61d4-4376-a3e8-b159ff712dd4___FREC_Sc... | Apple_scab |
| ... | ... | ... |
| 3166 | 00fca0da-2db3-481b-b98a-9b67bb7b105c___RS_HL 7... | healthy |
| 3167 | ada1699c-0a91-44ea-bc2b-476147394421___RS_HL 5... | healthy |
| 3168 | 5a0aa782-76a8-4de9-89e9-47cb35d222e3___RS_HL 6... | healthy |
| 3169 | 139ff5a7-102b-4434-b540-eb6930e6d968___RS_HL 7... | healthy |
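Passing `index=False` to `to_csv` keeps the row index out of the file. Without it, the index is written as an unnamed first column and comes back from `read_csv` as an extra column. A sketch with an in-memory buffer and two placeholder rows:

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"Filename": ["a.JPG", "b.JPG"], "Label": ["healthy", "Apple_scab"]}  # placeholder rows
)

# Default behaviour: the index is written as a first, unnamed column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
```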


In [141]:
df["Filename"]

0       95c2cc04-5581-48ca-8001-9644bb488b12___FREC_Sc...
1       8a212787-77fb-40dc-a028-92d5ff2d6c7c___FREC_Sc...
2       b10dceaa-a764-46b4-bd61-bc9fb13579ca___FREC_Sc...
3       d842b34b-d2ab-48fc-9e78-ef586308c359___FREC_Sc...
4       c270319e-61d4-4376-a3e8-b159ff712dd4___FREC_Sc...
                              ...                        
3166    00fca0da-2db3-481b-b98a-9b67bb7b105c___RS_HL 7...
3167    ada1699c-0a91-44ea-bc2b-476147394421___RS_HL 5...
3168    5a0aa782-76a8-4de9-89e9-47cb35d222e3___RS_HL 6...
3169    139ff5a7-102b-4434-b540-eb6930e6d968___RS_HL 7...
3170    5ee4cb1c-d933-4e89-b364-9b740b48e077___RS_HL 6...
Name: Filename, Length: 3171, dtype: object