# Perform analysis on images

This is a helper notebook that uses cats/dogs as the example.  In real usage you would copy this notebook to your own subdirectory and change the dataset access as needed.  You can then customize the reporting for the type of problem and images.  Some of my folders contain copies that I've used to analyze other image sets, for example Dog Breeds.

Sample pandas examples:

https://github.com/rasbt/pattern_classification/blob/master/data_viz/matplotlib_viz_gallery.ipynb

### Processing for using Google Drive and normal includes

The notebook uses TensorFlow 2.x.  (Eager execution is enabled by default and we use the newer tf.data API.)

I run notebooks both on Colab and on my local workstation, so I separate some logic to make it easier to run in both locations.

I was going to delete the local path and make a Colab-only version, but that is not "real world."  You usually have multiple environments, and this shows how I accommodate them.  You might need something different.
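One way to avoid hard-coding the environment flag is to detect Colab at runtime. The sketch below is an assumption, not what the cells in this notebook do: it checks whether the `google.colab` module is already loaded, and `pick_path` is a hypothetical helper that chooses between two caller-supplied paths.

```python
import sys


def running_on_colab() -> bool:
    # Assumption: a Colab runtime has the google.colab module loaded at startup.
    return "google.colab" in sys.modules


def pick_path(colab_path: str, local_path: str, using_colab: bool) -> str:
    # Hypothetical helper: return the path that matches the environment.
    return colab_path if using_colab else local_path


lib_path = pick_path(
    "/content/drive/My Drive/GitHub/MachineLearning/lib",
    "/Users/john/Documents/GitHub/MachineLearning/lib",
    running_on_colab(),
)
print(lib_path)
```

Keeping the decision in one small function means only the two path strings need to change per environment.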



In [0]:
#"""
# Google Colab specific stuff....
from google.colab import drive
drive.mount('/content/drive')

import os
!ls "/content/drive/My Drive"

USING_COLLAB = True
# Force to use 2.x version of Tensorflow
%tensorflow_version 2.x
#"""

In [0]:
# Setup sys.path to find MachineLearning lib directory

# Check if "USING_COLLAB" is defined, if yes, then we are using Colab, otherwise set to False
try: USING_COLLAB
except NameError: USING_COLLAB = False

%load_ext autoreload
%autoreload 2

# set path env var
import sys
if "MachineLearning" in sys.path[0]: 
    pass
else:
    print(sys.path)
    if USING_COLLAB:
        sys.path.insert(0, '/content/drive/My Drive/GitHub/MachineLearning/lib')  ###### CHANGE FOR SPECIFIC ENVIRONMENT
    else:
        sys.path.insert(0, '/Users/john/Documents/GitHub/MachineLearning/lib')  ###### CHANGE FOR SPECIFIC ENVIRONMENT
    
    print(sys.path)

In [0]:
# Normal includes...

from __future__ import absolute_import, division, print_function, unicode_literals

import os, sys, random, warnings, time, copy, csv
import numpy as np 

import IPython.display as display
from PIL import Image
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf
print(tf.__version__)

# This allows the runtime to decide how best to optimize CPU/GPU usage
AUTOTUNE = tf.data.experimental.AUTOTUNE

from TrainingUtils import *

#warnings.filterwarnings("ignore", category=DeprecationWarning)
#warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", "(Possibly )?corrupt EXIF data", UserWarning)

## General Setup

- Create a dictionary wrapped by a class for global values.  This is how I manage global vars in my notebooks.
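To illustrate the pattern, here is a minimal sketch of a dictionary wrapped by a class. `SimpleParms` is a hypothetical stand-in written for this example; the real `GlobalParms` comes from `TrainingUtils` and may behave differently. Deriving `TRAIN_PATH` from `ROOT_PATH` and `TRAIN_DIR` is also an assumption.

```python
import os


class SimpleParms:
    """Hypothetical stand-in for GlobalParms; not the TrainingUtils class."""

    def __init__(self, **kwargs):
        # Store every keyword argument in one dictionary of global values.
        self._values = dict(kwargs)
        # Assumption: derive the training path when both pieces are present.
        if "ROOT_PATH" in kwargs and "TRAIN_DIR" in kwargs:
            self._values["TRAIN_PATH"] = os.path.join(
                kwargs["ROOT_PATH"], kwargs["TRAIN_DIR"]
            )

    def __getattr__(self, name):
        # Called only when normal lookup fails; expose dictionary keys as attributes.
        try:
            return self.__dict__["_values"][name]
        except KeyError:
            raise AttributeError(name) from None

    def print_contents(self):
        # Print the stored values in a stable order.
        for key, value in sorted(self._values.items()):
            print("{} = {}".format(key, value))


demo = SimpleParms(ROOT_PATH="data", TRAIN_DIR="CatDogLabeledVerySmall", BATCH_SIZE=1)
demo.print_contents()
```

The benefit of this style is that every cell reads settings from one object, so a copied notebook only needs its constructor call changed.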



In [0]:
# Set root directory path to data
if USING_COLLAB:
    ROOT_PATH = "/content/drive/My Drive/GitHub/MachineLearning/9-LibTest/Data"  ###### CHANGE FOR SPECIFIC ENVIRONMENT
else:
    ROOT_PATH = "/Users/john/Documents/GitHub/MachineLearning/9-LibTest/Data"  ###### CHANGE FOR SPECIFIC ENVIRONMENT
        
# Establish global dictionary
parms = GlobalParms(ROOT_PATH=ROOT_PATH,
                    TRAIN_DIR="CatDogLabeledVerySmall", 
                    NUM_CLASSES=2,
                    IMAGE_ROWS=256,
                    IMAGE_COLS=256,
                    IMAGE_CHANNELS=3,
                    BATCH_SIZE=1,  # must be one if you want to see different image sizes
                    IMAGE_EXT=".jpg")

parms.print_contents()

In [0]:
# Simple helper method to display batches of images with labels....        
def show_batch(image_batch, label_batch, number_to_show=25, r=5, c=5, print_shape=False):
    show_number = min(number_to_show, parms.BATCH_SIZE)

    if show_number < 8: #if small number, then change row, col and figure size
        if parms.IMAGE_COLS > 64 or parms.IMAGE_ROWS > 64:
            plt.figure(figsize=(25,25)) 
        else:
            plt.figure(figsize=(10,10))  
        r = 4
        c = 2 
    else:
        plt.figure(figsize=(10,10))  

    for n in range(show_number):
        if print_shape:
            print("Image shape: {}  Max: {}  Min: {}".format(image_batch[n].shape, np.max(image_batch[n]), np.min(image_batch[n])))

        ax = plt.subplot(r,c,n+1)
        plt.imshow(tf.keras.preprocessing.image.array_to_img(image_batch[n]))
        plt.title(parms.CLASS_NAMES[np.argmax(label_batch[n])])
        plt.axis('off')

### Create dataset and normal mappings

Pipeline Flow:

create dataset -> map "process_path" -> repeat forever -> batch

The mappings open and read an image.  These next cells should be changed based on your specific needs.
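The label mapping in this notebook compares one path component against the array of class names, which yields a boolean one-hot vector. The NumPy sketch below shows the same idea outside of TensorFlow; the class names and the example path are made up for illustration.

```python
import os

import numpy as np

# Assumed class names, in the order the sub-directories were found.
class_names = np.array(["cat", "dog"])


def label_from_path(file_path: str) -> np.ndarray:
    # The second-to-last path component is the class directory.
    parts = file_path.split(os.path.sep)
    # Comparing the array against one string gives a boolean one-hot vector.
    return class_names == parts[-2]


example_path = os.path.join("Data", "CatDogLabeledVerySmall", "dog", "dog.1.jpg")
one_hot = label_from_path(example_path)
# argmax recovers the class index from the one-hot vector.
print(one_hot, class_names[np.argmax(one_hot)])
```

This is why the reporting cells call `np.argmax` on the label: it turns the boolean vector back into an integer class index.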


In [0]:
# Create path list and class list using cat/dog images
# Change for your own dataset 
images_list, sub_directories = load_file_names_labeled_subdir_Util(parms.TRAIN_PATH, 
                                                                   parms.IMAGE_EXT)

images_list_len = len(images_list)
print("Number of images: ", images_list_len)

# Set the class names.
parms.set_class_names(sub_directories)
print("Classes: {}  Labels: {}  {}".format(parms.NUM_CLASSES, len(parms.CLASS_NAMES), parms.CLASS_NAMES) )


In [0]:
# Using the path, show the images that will be used
for image_path in images_list[:2]:
    print(image_path)
    display.display(Image.open(str(image_path)))

In [0]:
# Return a label based on the path of the image
def get_label(file_path: tf.Tensor) -> tf.Tensor:

    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second to last is the class-directory
    return parts[-2] == parms.CLASS_NAMES

# Decode the image, convert to float, normalize by 255 and resize
def decode_img(image: tf.Tensor) -> tf.Tensor:

    # convert the compressed string to a 3D uint8 tensor
    image = tf.image.decode_jpeg(image, channels=parms.IMAGE_CHANNELS)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    image = tf.image.convert_image_dtype(image, parms.IMAGE_DTYPE)

    # uncomment to resize the image to the desired size.
    #image = tf.image.resize(image, [parms.IMAGE_ROWS, parms.IMAGE_COLS])
    return image

# Method mapped over the dataset to load, resize and apply any augmentations
def process_path(file_path: tf.Tensor) -> tuple:
  
    label = get_label(file_path)
    # load the raw data from the file as a string
    image = tf.io.read_file(file_path)
    image = decode_img(image)

    return image, label

### Create dataset from list of images and apply mappings

In [0]:
# Create Dataset from list of images
full_dataset = tf.data.Dataset.from_tensor_slices(np.array(images_list))

# Verify image paths were loaded and save one path for later in "some_image"
for f in full_dataset.take(2):
    some_image = f.numpy().decode("utf-8")
    print(f.numpy())
    
print("Some Image: ", some_image)

In [0]:
# map training images to processing, includes any augmentation
full_dataset = full_dataset.map(process_path, num_parallel_calls=AUTOTUNE)

# Verify the mapping worked
for image, label in full_dataset.take(1):
    print("Image shape: {}  Max: {}  Min: {}".format(image.numpy().shape, np.max(image.numpy()), np.min(image.numpy())))
    print("Label: ", label.numpy())

# Repeat forever
full_dataset = full_dataset.repeat()

# set the batch size
full_dataset = full_dataset.batch(parms.BATCH_SIZE)


In [0]:
#create simple iterator
ds_iter = iter(full_dataset)


In [0]:
# Show the images, execute this cell multiple times to see the images
# Execute at least 4 times if random is applied

image_batch, label_batch = next(ds_iter)
show_batch(image_batch.numpy(), label_batch.numpy())

### Collect image information

This will loop over each image and collect information to be used to create a Pandas dataframe.  The dataframe will then be used to report information.  You can also save the dataframe for future analysis.

This is where you can also customize what information is collected.

Image sizes are left unchanged here, but you can enable the resize so every image matches exactly what will be used for training.  I've found that looking at the raw image information is more helpful than looking at images that have been resized.
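As a rough illustration of the per-image record, the sketch below computes the same kind of fields (label, rows, columns, depth, mean, std, per-channel means) on synthetic arrays. `image_stats` is a hypothetical helper and the random images only stand in for decoded files of differing sizes.

```python
import numpy as np


def image_stats(image: np.ndarray, label: int) -> list:
    # Rows and columns come straight from the array shape.
    rows, cols = image.shape[0], image.shape[1]
    depth = image.shape[2] if image.ndim == 3 else 0
    # Per-channel means are only computed for 3-channel images here.
    if depth == 3:
        channel_means = [float(image[:, :, k].mean()) for k in range(3)]
    else:
        channel_means = [0.0, 0.0, 0.0]
    overall = [float(image.mean()), float(image.std())]
    return [label, rows, cols, depth] + overall + channel_means


# Synthetic stand-ins for decoded images with values in [0, 1].
rng = np.random.default_rng(0)
images = [rng.random((120, 160, 3)), rng.random((200, 100, 3))]
info = [image_stats(img, label) for label, img in enumerate(images)]
for row in info:
    print(row)
```

Each row has nine values, matching the nine dataframe columns used for reporting.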


In [0]:
# Collect various information about an image
from tqdm import tqdm  # explicit import in case TrainingUtils does not re-export it

def dataset_analysis(ds_iter, steps, test=False):
    if test:
        steps = 4

    image_info = []

    for i in tqdm(range(int(steps))):
        image_batch, label_batch = next(ds_iter)
        #show_batch(image_batch.numpy(), label_batch.numpy())

        for j in range(parms.BATCH_SIZE):
            image = image_batch[j].numpy()
            label = label_batch[j].numpy()
            label = np.argmax(label)
            r = image.shape[0]
            c = image.shape[1]
            d = 0
            mean0=0
            mean1=0
            mean2=0
            if parms.IMAGE_CHANNELS == 3:
                d = image.shape[2]
                mean0 = np.mean(image[:,:,0])
                mean1 = np.mean(image[:,:,1])
                mean2 = np.mean(image[:,:,2])
            image_info.append([label, r, c, d, np.mean(image), np.std(image), mean0, mean1, mean2])

            if test:
                print(image_info[-1])
                
    return image_info

In [0]:
# Build image_info list
ds_iter = iter(full_dataset)

# Use true division so np.ceil can round a partial final batch up
steps = np.ceil(images_list_len / parms.BATCH_SIZE)

image_info = dataset_analysis(ds_iter, steps=steps, test=False)

In [0]:
# Build pandas dataframe
image_info_df = pd.DataFrame(image_info, columns =['label', 'row','col', 'dim', 'mean', 'std', "chmean0", "chmean1", "chmean2"])
print(image_info_df.describe())
image_info_df.head()

In [0]:
#https://jamesrledoux.com/code/group-by-aggregate-pandas
image_info_df.groupby('label').agg({'mean': ['count', 'mean', 'min', 'max'], 'std': ['mean', 'min', 'max'], 'row': ['mean', 'min', 'max'],'col': ['mean', 'min', 'max'], 'chmean0':['mean'],'chmean1':['mean'],'chmean2':['mean'] })


In [0]:
image_info_df.agg({'mean': ['mean', 'min', 'max'], 'std': ['mean', 'min', 'max'], 'row': ['mean', 'min', 'max'],'col': ['mean', 'min', 'max'] })


In [0]:
image_mean = image_info_df["mean"]
print("Mean: ", np.mean(image_mean), "  STD: ", np.std(image_mean))

In [0]:
image_info_df["label"].value_counts().plot.bar()

In [0]:
image_info_df["label"].value_counts().plot.pie()

In [0]:
image_info_df.hist(column='mean')

In [0]:
image_info_df.plot.scatter(x='row', y='col', color='Blue', label='Row Col')

In [0]:
# Save results
result_path = os.path.join(parms.ROOT_PATH, "image-info.pkl")
image_info_df.to_pickle(result_path)  

In [0]:
# open and read saved file
image_info_df = pd.read_pickle(result_path)
image_info_df.head()

### Final Thoughts.....

This is a pattern notebook that should be copied and modified as needed for the specific training.

Some of the things to pay close attention to are:

- **class balance** is a big one!!  If the classes are unbalanced and this is not addressed, training will be greatly affected.

- **image size distribution** overall and with respect to the classes.  This can influence your approach to sizing and augmentation.

- **mean and std** have helped me understand when the images were darker or lighter.  These values can also be used for feature-wise zero-centering and std normalization.
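Two of the points above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not part of the notebook: `balanced_class_weights` uses the common inverse-frequency formula, `standardize` applies a dataset-wide mean and std, and the labels and image are made up.

```python
import numpy as np


def balanced_class_weights(labels: np.ndarray) -> dict:
    # Inverse-frequency weights: n_samples / (n_classes * count_per_class).
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


def standardize(image: np.ndarray, mean: float, std: float) -> np.ndarray:
    # Zero-center with the dataset mean, then scale by the dataset std.
    return (image - mean) / std


labels = np.array([0, 0, 0, 1])  # made-up, unbalanced labels
class_weights = balanced_class_weights(labels)
print(class_weights)

image = np.full((2, 2, 3), 0.75)  # made-up image with values in [0, 1]
normalized = standardize(image, mean=0.5, std=0.25)
print(normalized[0, 0])
```

A dictionary shaped like `class_weights` is the form Keras accepts for the `class_weight` argument of `model.fit`, which is one option for compensating for an unbalanced dataset.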
