# Build your custom image classifier using FastAI and OPEN-CE

*This notebook was copied and modified for this lab from this source https://github.com/fastai/course-v3*



Welcome to TechU Deep Learning Lab! 

For those of you who are using a Jupyter Notebook for the first time, you can learn about this useful tool in a tutorial we prepared specially for you; click `File`->`Open` now and click `jupyter_notebook_tutorial.ipynb`. [./jupyter_notebook_tutorial.ipynb](./jupyter_notebook_tutorial.ipynb)

In this lesson we will **build our first image classifier from scratch**, and see if we can achieve world-class results. Let's dive in!

Note: the most recent version of this lab uses a build from Open-CE, with PyTorch 1.6 and fastai version 2.  For more information about how to build the most recent deep learning frameworks for Power, check out this repo:

https://github.com/open-ce/open-ce

In [None]:
%matplotlib inline
import os

#%reload_ext autoreload
#%autoreload 2
# For Using GPU's
# os.environ["CUDA_VISIBLE_DEVICES"]="1"

## About our Class Environment

We import all the necessary packages. We are going to work with the [fastai V2 library](http://www.fast.ai/) which sits on top of [Pytorch 1.6](https://pytorch.org/). The fastai library provides many useful functions that enable us to quickly and easily build neural networks and train our models.

For this class, we are using Open-CE running in the **IBM Garage for Systems Cloud**, or CECC for short.  To get your own environment, simply browse to this website and request a Power8 or Power9 environment.

https://www.ibm.com/it-infrastructure/services/cecc-portal/web/Catalog

* Note: To get FastAI up and running, we have a special setup script in our GitHub repo:

https://github.ibm.com/vanstee/aicoc-ai-immersion/blob/master/FastAI/setup_fastai.sh

## Import FastAI Libraries

In [None]:
from fastai.vision import *
from fastai.metrics import error_rate
from pathlib import Path
import os
import sys

# utility print function
def nprint(mystring) :
    print("**{}** : {}".format(sys._getframe(1).f_code.co_name,mystring))


## Project Configuration

The data structure below holds all the settings for our class.  It's handy to keep all your project settings in a simple dictionary to stay organized.  We have included an override capability for these default settings so that you can customize your project.


In [None]:
def getconfig(cfg_in={},base_dir="."):
    cfg = {}
    cfg["bs"] = 16
    cfg["base_dir"]  = base_dir
    cfg["image_dir"] = base_dir + "/class_images"
    cfg["classes"] = ["cars","busses","trucks"]
    cfg["num_images"] = {"train":200,"valid":0,"test":0}  # only use train for class. FastAI will autosplit
    cfg["d_partitions"] = list(cfg["num_images"].keys())                 
    cfg["jpeginfo"] =base_dir + "/jpeginfo"
   
    # overwrite configs if passed
    for (k,v) in cfg_in.items() :
        nprint("Overriding Config {}:{} with {}".format(k,cfg.get(k),v))  # .get avoids a KeyError for keys without a default
        cfg[k] = v
    return cfg

# bs = 16   # uncomment this line if you run out of memory even after clicking Kernel->Restart
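
The override idea can also be made stricter. The sketch below is not part of the lab code: `merge_config` is a hypothetical helper that copies the defaults, applies the overrides, and rejects any key the defaults do not define, which catches typos such as `"clases"` early.

```python
# Hypothetical helper, not part of the lab code: merge overrides into defaults
# and refuse keys that the defaults do not define.
def merge_config(defaults, overrides=None):
    cfg = dict(defaults)              # copy so the defaults stay untouched
    for key, value in (overrides or {}).items():
        if key not in cfg:
            raise KeyError("Unknown config key: {}".format(key))
        cfg[key] = value
    return cfg

defaults = {"bs": 16, "classes": ["cars", "busses", "trucks"]}
custom = merge_config(defaults, {"bs": 8})
print(custom["bs"])        # 8
print(defaults["bs"])      # 16, the defaults are not modified
```

Whether to reject or accept unknown keys is a design choice; `getconfig` above accepts them, which is more forgiving during a lab.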

## Build a Dataset on the Fly ..

Here we are going to build our own dataset!  Think of a few categories of images you would like to classify.  In this example, we will use five:
* people playing sports
* people holding money 
* people holding cups
* people playing with animals
* people on bikes

We will use an open source tool called **googliser** [https://github.com/teracow/googliser](https://github.com/teracow/googliser) to download our images from Google Images.

In [None]:
#################################################################################################
# @@ Students : Customize this cell with your custom classes for image classification
################################################################################################

# Overrides for lab

mycfg = {
     ## CLASS Enter your search terms below 
    "classes"     : ["people playing sports",
                     "people holding money",
                     "people holding cups",
                     "people playing with animals",
                     "people on bikes"], 
    "d_partitions": ["train"],
    "num_images"  : {"train":300,"valid":0,"test":0}  # only use train for class. FastAI will autosplit
  
}

# Setup Class Configuration
# base_dir=!pwd
# base_dir=base_dir[0]
base_dir = "/home/cecuser/5050/aicoc-ai-immersion/FastAI"
print("Base project directory : {}".format(base_dir))
cfg=getconfig(mycfg, base_dir)


## Make Some Directories to hold our data 

FastAI is very flexible and helps you label your image data in all sorts of ways.

You might have a bunch of images and a csv file with the labels, or you might have your images organized by folder with the folder name being the label.

To see all the supported methods check out this link..
https://docs.fast.ai/vision.data.html#ImageDataBunch.from_folder

For our class, we are going to organize our data by folder into something like this.. 
```
base/class_images/train
   people_holding_cups    people_on_bikes          people_playing_with_animals  
   people_holding_money   people_playing_sports           
```

In each folder we will have a bunch of \*.jpg files

### Build Directory Hierarchy

In [None]:

# Helpers to make directories
def class_folder_name(base,d_part,cls) :
    return base+"/"+d_part+"/"+ cls.replace(" ","_")

def makeDirIfNotExist(directory) :
    if not os.path.exists(directory):  
        nprint("Making directory {}".format(directory))
        os.makedirs(directory) 
    else :
        nprint("Directory {} already exists .. ".format(directory))

# Build directory hierarchy
#   [train|valid|test ]
#    -----------------> [class1 | class2 | class...]
#for d_part in cfg["d_partitions"] :
#    for cls in cfg["classes"] :
#        directory=class_folder_name(cfg['image_dir'],d_part,cls)
#        makeDirIfNotExist(directory)
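
As an alternative sketch of the same folder layout, the standard library's `pathlib` can build and create the class directories in one step. The helper name `class_dir` and the temporary directory are assumptions made for illustration; the naming rule (spaces become underscores) mirrors `class_folder_name` above.

```python
import tempfile
from pathlib import Path

def class_dir(image_dir, partition, cls):
    # Same naming rule as class_folder_name: spaces become underscores
    return Path(image_dir) / partition / cls.replace(" ", "_")

with tempfile.TemporaryDirectory() as tmp:
    for cls in ["people on bikes", "people holding cups"]:
        d = class_dir(tmp, "train", cls)
        d.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    created = sorted(p.name for p in (Path(tmp) / "train").iterdir())

print(created)   # ['people_holding_cups', 'people_on_bikes']
```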


## Install Googliser here.  
This handy utility will download our images from google images.  We will clone the repo from git.

In [None]:
# install googliser
def install_googliser():
    googliser_directory = cfg['base_dir']+"/googliser"
    if not os.path.exists(googliser_directory):  
        nprint("Installing Googliser here : {} ".format(googliser_directory))
        os.chdir(cfg['base_dir'])
        !git clone https://github.com/teracow/googliser
    else :
        nprint("Googliser already installed here : {} ".format(googliser_directory))

    googliser = cfg['base_dir']+"/googliser/googliser.sh"

    return googliser 
googliser = install_googliser()
!ls {googliser}

## How many images should we grab ?


Luckily, not too many. It always depends on the project, but we are going to use a pre-trained deep learning network when we perform training.  Since that network was already trained on over 1 million images, we don't need to supply too many for our task.  This is why transfer learning is so powerful!

In [None]:
cfg['num_images']

## Dask to the rescue!
Here we use python dask to download many images in parallel

In [None]:
from dask import delayed  # needed for delayed(runcmd)(...) below

# The code below will download files to train folder only to avoid duplicate downloads.  
# We then move a few files over.  This can be done manually or programmatically.  For our example
# we will let FastAI do the work for us!
def runcmd(cmd):
    !{cmd}
    return 1

def download_images(cfg, force_download=False):
    # Download if directory does not exist or Force download = True
    results=[]
    
    if not os.path.exists(cfg['image_dir']):
        nprint("Image directory {} does not exist.  Downloading images..".format(cfg['image_dir']))
        force_download = True
    else :
        nprint("image dir exists : {}.  Not downloading again.  \nuse force_download=True to overwrite".format(cfg['image_dir']))
    
    if(force_download) :
        utility_dir = cfg['base_dir']
        for d_p in cfg["d_partitions"] : # train only for now ..
            for cls in cfg["classes"] :
                directory=class_folder_name(cfg['image_dir'],d_p,cls)
                makeDirIfNotExist(directory)

                current_dir =class_folder_name(cfg['image_dir'],d_p,cls)
                #os.chdir(current_dir)
                os.chdir(utility_dir)
                command = googliser + \
                          " --o {}".format(current_dir) +\
                          " --phrase \"{}\"".format(cls) + \
                          " --parallel 50 --upper-size 1000000 --lower-size 2000 " + \
                          " -n {}".format(cfg['num_images'][d_p]) + \
                          " --format jpg --timeout 15 --safesearch-off "
                nprint(command)
                results.append(delayed(runcmd)(command))
        results=sum(results)
        
        print(results.compute())
        nprint("Downloads complete!")
       
download_images(cfg,force_download=True)

# This will take ~5 mins depending on how many classes / images we are grabbing... 
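
The same fan-out pattern can be sketched with only the standard library. This is a stand-in, not the Dask code used in this lab: it swaps `dask.delayed` for `concurrent.futures.ThreadPoolExecutor`, and `fake_download` is a hypothetical placeholder for the real googliser call.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_download(search_phrase):
    # Placeholder for the real googliser call; returns 1 like runcmd does
    return 1

phrases = ["people on bikes", "people holding cups", "people holding money"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() runs the calls concurrently and yields results in input order
    finished = sum(pool.map(fake_download, phrases))

print("{} of {} downloads finished".format(finished, len(phrases)))
```

Threads are a reasonable fit here because each task mostly waits on the network rather than using the CPU.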

## Clean up downloaded images that are not proper jpeg format ..

In [None]:
# clean with jpeginfo! 
def clean_up_bad_images(cfg):
    os.chdir(cfg['image_dir'])
    nprint("Search for Error files in {}......".format(cfg['image_dir']))
    # handle both jpg //jpeg files that are malformed
    for ext in ["jpg","jpeg"] :
        command = "find . -name \"*.{}\"".format(ext) + \
          " | xargs -i {}".format(cfg["jpeginfo"]) + \
          " -c {} | grep ERROR"
        nprint("Running command : {}".format(command))
        !{command}
        nprint("Removing any error files listed above")
        command = command + ' | cut -d " " -f1 | xargs -i rm {} '
        nprint("Running command : {}".format(command))
        !{command}
        nprint("Done")
    # get rid of png /webp
    for ext in ["png", "webp"]:
        command = "find . -name \"*.{}\"".format(ext) + \
            " | xargs -i rm -f {}"
        nprint("Running command : {}".format(command))
        !{command}
        
clean_up_bad_images(cfg)

# find . -name "*.jpg" | xargs -i ./jpeginfo -c {}
# find . -name "*.jpg" | xargs -i ./jpeginfo -c {} | grep ERROR | cut -d " " -f1 | xargs -i rm {}
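
For a rough idea of what "not proper jpeg format" means, here is a minimal sketch in pure Python. It is an assumption-laden stand-in that is much weaker than `jpeginfo -c`: it only checks that a file starts with the JPEG start-of-image marker (`FF D8`) and ends with the end-of-image marker (`FF D9`). The helper name `looks_like_jpeg` is hypothetical, and some valid files with trailing bytes would fail this check.

```python
import tempfile
from pathlib import Path

def looks_like_jpeg(path):
    # JPEG files start with the SOI marker FF D8 and end with the EOI marker FF D9
    data = Path(path).read_bytes()
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.jpg"
    bad = Path(tmp) / "bad.jpg"
    good.write_bytes(b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9")   # fake but well-framed
    bad.write_bytes(b"<html>not an image</html>")
    results = {p.name: looks_like_jpeg(p) for p in (good, bad)}

print(results)   # {'good.jpg': True, 'bad.jpg': False}
```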

In [None]:
#path = untar_data(URLs.PETS); path
path = Path(cfg['image_dir'])

In [None]:
path.ls()
#cfg['image_dir']

## Data Processing and Understanding

The first thing we do when we approach a problem is to take a look at the data. We _always_ need to understand very well what the problem is and what the data looks like before we can figure out how to solve it. Taking a look at the data means understanding how the data directories are structured, what the labels are and what some sample images look like.

The main difference between the handling of image classification datasets is the way labels are stored. In this particular dataset, **labels are stored in the name of the folder containing the file**. We will need to extract them to be able to classify the images into the correct categories. Fortunately, the fastai library has a handy function made exactly for this: `ImageDataLoaders.from_folder` gets the labels from the folder names.

## Create Datasets / Dataloaders
Fastai has a number of methods to create Datasets and Dataloaders.

Datasets and Dataloaders (notice the 's' on the class names) are FastAI abstractions that help us load and process the image data.  In general, these classes support all sorts of data types such as image, tabular, and text.

The main goal of Datasets / Dataloaders is to hold all the data (train / validation / test) in a single data structure and to let us easily transform the data and create batches for training.

The creation of datasets / dataloaders is based on the datablock API.  This is a set of tools to help you build instances of the datasets / dataloaders classes.

For a full treatment of the dataloaders / datasets / datablock API topic, see the documentation here -> https://docs.fast.ai/tutorial.datablock

This notebook was built referencing the documentation link above.

In [None]:
# FastAI makes use of the python Pathlib library.  here are a couple examples of how this works.

# get_image_files is a FastAI utility to grab all images from a provided base path


In [None]:
import fastai.data.all as fda
import fastai.vision.all as fva

fnames = fda.get_image_files(path)

# Some Path Experiments
# This will be useful for our label func ..
# https://docs.python.org/3/library/pathlib.html
fname = fnames[0]
print(type(fname))
print(fname)
print("Name : {}".format(fname.name))
print("Parent : {}".format(fname.parent))
print("Suffix : {}".format(fname.suffix))

In [None]:
## Method 1 use DataBlock API
# References
# https://docs.fast.ai/tutorial.datablock
# https://docs.fast.ai/data.transforms#RandomSplitter

In [None]:
# Here we create our own label function to grab the name of the folder that holds the label.
def label_func(fname):
    # this returns the name of the folder the file is in.  This is the label!
    return str(fname.parent).split('/')[-1]

# Example
print("Full file name  : {}".format(fname))
print("Label extracted : {}".format(label_func(fname)))
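
The same label can be read directly from the path object without converting it to a string. A minimal sketch, assuming a hypothetical file path that follows the folder layout used in this lab:

```python
from pathlib import PurePosixPath

# Hypothetical file path following the folder layout used in this lab
example = PurePosixPath("class_images/train/people_on_bikes/image(0001).jpg")

label = example.parent.name             # name of the folder that holds the file
partition = example.parent.parent.name  # one level up: the data partition

print(label)       # people_on_bikes
print(partition)   # train
```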

In [None]:
# aug+transforms 
# https://docs.fast.ai/vision.augment#aug_transforms
dblockv1 = fda.DataBlock(blocks    = (fva.ImageBlock, fda.CategoryBlock), # Tell datablock api we have images / categories
                   get_items = fda.get_image_files, # this recurses the datablock directory that is provided
                   get_y     = label_func, # applies the label
                   splitter  = fda.RandomSplitter(), # splitting func
                   item_tfms = fva.Resize(400),
                   batch_tfms=[*fva.aug_transforms(size=224, min_scale=0.85), fva.Normalize.from_stats(*fva.imagenet_stats)])


dsets_v1 = dblockv1.datasets(path)
print("Example X,y : {}".format(dsets_v1.train[0]))
print("labels in this dataset : {}".format(dsets_v1.vocab))
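
To make the splitting step less of a black box, here is a minimal sketch of a random train/validation split in plain Python. It illustrates the idea behind a random splitter and is not fastai's implementation; the function name `random_split` and the file names are made up, and the 20% validation share is assumed to match the library default.

```python
import random

def random_split(items, valid_pct=0.2, seed=None):
    # Shuffle indices, then cut off the first valid_pct share for validation
    rng = random.Random(seed)
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]      # (train indices, valid indices)

files = ["img_{:03d}.jpg".format(i) for i in range(50)]
train_idx, valid_idx = random_split(files, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))   # 40 10
```

Fixing the seed makes the split reproducible, which helps when comparing runs.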

In [None]:
# Display a sample of images ...
bs=64
dls = dblockv1.dataloaders(path,bs=bs)
dls.show_batch()

In [None]:
dblockv1.summary(path)

In [None]:
# Alternative Method using FastAI built-in ImageDataLoaders [use above for class]
# bs = 64
# 
# # https://docs.fast.ai/vision.data#ImageDataLoaders.from_folder
# dlsv2 = ImageDataLoaders.from_folder(
#     path, 
#     item_tfms=Resize(460), 
#     bs=bs,
#     valid_pct=0.20,
#     batch_tfms=[*aug_transforms(max_lighting=0.5, size=224, min_scale=0.65), Normalize.from_stats(*imagenet_stats)])
# dlsv2.show_batch(max_n=9, figsize=(9,9))

### Compare Training set vs Validation set Label Distribution

In [None]:
import collections
import numpy as np
import pandas as pd

def get_label_distribution(dls) :
    # *dls.xxx_ds returns tuples split into parts..zip reassembles into x/y vectors ...
    x,y = zip(*dls.train_ds)
    xv,yv = zip(*dls.valid_ds)
    
    # this creates our labels list.  basically transform fastai.tensor object to a simple list of ints 
    y_labels = list(map(lambda a : a.item() ,y))
    yv_labels = list(map(lambda a : a.item() ,yv))
    
    # Create a dataframe of categorical counts
    df=pd.DataFrame([
        pd.Series(y_labels).value_counts(),
        pd.Series(yv_labels).value_counts()
       ]).T
    # Add percentages..
    df.columns = ["train","valid"]
    df["train_pct"] = df["train"]/df["train"].sum()
    df["valid_pct"] = df["valid"]/df["valid"].sum()
    df["labels"] = pd.Series(dls.vocab)
    display(df)
    # If you did want to plot, look at this
    # https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/categorical_variables.html

type(dls.train_ds)
type(dls.valid_ds)

print("Num Images in Training Set : {}".format(len(dls.train_ds)))
print("Num Images in Validation Set : {}".format(len(dls.valid_ds)))

get_label_distribution(dls)


Analysis .. 
Before the run, make sure you are OK with the class balance in the training vs validation sets.  A similar class balance between these 2 sets is important for good results.  I like to look for 2 things:

1. Each class has a similar number of images.
2. The percentage of images for a given class in training is close to the percentage for that class in validation.
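
The two checks above can be sketched without any FastAI objects. The labels below are made up purely for illustration, and the 10 percentage point threshold is an arbitrary assumption, not a recommendation from the library.

```python
from collections import Counter

def label_shares(labels):
    # Fraction of the set that each label accounts for
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Made-up labels purely for illustration
train_labels = ["bikes"] * 80 + ["cups"] * 80 + ["money"] * 40
valid_labels = ["bikes"] * 20 + ["cups"] * 10 + ["money"] * 20

train_share = label_shares(train_labels)
valid_share = label_shares(valid_labels)

# Flag classes whose share differs by more than 10 percentage points (arbitrary threshold)
drifted = sorted(l for l in train_share
                 if abs(train_share[l] - valid_share.get(l, 0.0)) > 0.10)
print(drifted)   # ['cups', 'money']
```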


## Training using Transfer Learning : resnet34

Now we will start training our model. We will use a [convolutional neural network](http://cs231n.github.io/convolutional-networks/) backbone and a fully connected head with a single hidden layer as a classifier. Don't know what these things mean? 

Not to worry, check out the FastAI course videos for a deep dive [https://course.fast.ai/](https://course.fast.ai/). For the moment you need to know that we are building a model which will take images as input and will output the predicted probability for each of the categories (in this case, it will have 5 outputs).
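
To make "predicted probability for each of the categories" concrete, here is a small sketch of how raw network scores are commonly turned into probabilities with a softmax. The scores and short class names are invented for illustration, and the sketch assumes a softmax output, which is the usual choice for single-label classification.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then normalize the exponentials
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["sports", "money", "cups", "animals", "bikes"]
raw_scores = [2.0, 0.5, 0.1, 1.0, -1.0]        # made-up network outputs
probs = softmax(raw_scores)
best = classes[probs.index(max(probs))]
print(best)                  # sports
print(round(sum(probs), 6))  # 1.0
```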


In [None]:
#create learner
learn = fva.cnn_learner(dls, fva.resnet34, metrics=error_rate).to_fp16()

In [None]:
# Let's see a summary of the model
learn.model

In [None]:
# Let's train a little to see where we stand. 
# fit_one_cycle uses cyclical learning rates
learn.fit_one_cycle(4)
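
For intuition about the cyclical schedule, here is a rough sketch of a one-cycle learning rate curve: a cosine warm-up to the peak rate followed by a cosine cool-down. The function `one_cycle_lr` is hypothetical, and the parameter values (`div=25`, `div_final=1e5`, `pct_start=0.25`) are assumed to match the library defaults; check the `fit_one_cycle` documentation for your installed version.

```python
import math

def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    # pct is the training progress, from 0.0 (start) to 1.0 (end)
    def cos_anneal(start, end, p):
        # Smoothly moves from start (p=0) to end (p=1) along a cosine curve
        return end + (start - end) / 2 * (1 + math.cos(math.pi * p))
    if pct < pct_start:
        # Warm-up phase: lr_max/div up to lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Cool-down phase: lr_max down to lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 1.0):
    print(pct, one_cycle_lr(pct))
```

The rate starts low, peaks a quarter of the way through training, and then decays to a very small value by the end.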

In [None]:
# Let's run lr_find to find the optimal learning rates
learn.lr_find()

In [None]:
# Run 5 epochs (can do more if still getting better train / val)
# FILL IN THE RANGE BASED ON YOUR LR_FIND RESULT
learn.fit_one_cycle(5, lr_max=slice(8e-4,2e-2))  # fastai v2 names this argument lr_max (v1 used max_lr)
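
Passing a `slice` asks for different learning rates in different parts of the network: the earliest layers get the smaller rate and the final layers get the larger one. The sketch below shows one way such a range can be spread geometrically across layer groups. The helper `spread_learning_rates` and the assumption of three layer groups are illustrative only.

```python
def spread_learning_rates(start, stop, n_groups):
    # Geometric progression from start to stop, one rate per layer group
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

rates = spread_learning_rates(8e-4, 2e-2, 3)
print(rates)   # roughly 8e-4, 4e-3, 2e-2
```

Smaller rates for the early layers protect the general features learned during pre-training, while the new head can move faster.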

In [None]:
# Save out our current results ..
learn.save('stage-1')

## Results

Let's see what results we have got. 

We will first see which were the categories that the model most confused with one another. We will try to see if what the model predicted was reasonable or not. In this case the mistakes look reasonable (none of the mistakes seems obviously naive). This is an indicator that our classifier is working correctly. 

Furthermore, when we plot the confusion matrix, we can see that the distribution is heavily skewed: the model makes the same mistakes over and over again but it rarely confuses other categories. This suggests that it just finds it difficult to distinguish some specific categories from each other; this is normal behaviour.

In [None]:
interp = fva.ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

len(dls.valid_ds)==len(losses)==len(idxs)

In [None]:
fda.doc(interp.plot_top_losses)

In [None]:
interp.plot_top_losses(9, figsize=(19,11))

In [None]:
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
interp.confusion_matrix()

In [None]:
interp.most_confused(min_val=2)
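
What "most confused" means can be sketched in a few lines of plain Python: count every (actual, predicted) pair where the two differ and keep the pairs that occur at least `min_val` times. This is an illustration of the idea with made-up labels, not the library's implementation.

```python
from collections import Counter

def most_confused(actual, predicted, min_val=1):
    # Count (actual, predicted) pairs where the two differ
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    rows = [(a, p, n) for (a, p), n in pairs.items() if n >= min_val]
    # Most frequent mistakes first, ties broken alphabetically
    return sorted(rows, key=lambda r: (-r[2], r[0], r[1]))

actual    = ["cups", "cups", "cups", "money", "money", "bikes", "bikes"]
predicted = ["money", "money", "cups", "money", "cups", "bikes", "bikes"]

print(most_confused(actual, predicted, min_val=2))   # [('cups', 'money', 2)]
```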

In [None]:
interp.print_classification_report()

## Unfreezing, fine-tuning, and learning rates

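Since our model is working as we expect it to, we will *unfreeze* part of our model and train some more. Here `freeze_to(-2)` keeps the earliest layer group frozen and makes the last two layer groups trainable.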

In [None]:
learn.freeze_to(-2)

In [None]:
learn.summary()

In [None]:
#learn.fit_one_cycle(8)
learn.fit_one_cycle(5, lr_max=slice(8e-4,2e-2))  # lr_max is the fastai v2 argument name

In [None]:
learn.save('stage-2')
#learn.load('stage-1');
#learn.unfreeze(-2)
#learn.fit_one_cycle(2, lr_max=slice(1e-6,1e-3))

In [None]:
interp = fva.ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()
len(dls.valid_ds)==len(losses)==len(idxs)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

In [None]:
interp.print_classification_report()

That's a pretty accurate model!

##  Optional Assignment Training: xresnet50

Now we will train in the same way as before but with one caveat: instead of using resnet34 as our backbone we will use xresnet50, a variant of resnet50 (resnet34 is a 34-layer residual network while resnet50 has 50 layers). The details are in the [resnet paper](https://arxiv.org/pdf/1512.03385.pdf) and this post https://towardsdatascience.com/xresnet-from-scratch-in-pytorch-e64e309af722.

Basically, xresnet50 usually performs better because it is a deeper network with more parameters. Let's see if we can achieve a higher performance here. Recommendations 
* use larger images too, 
* reduce the batch size a bit since otherwise this larger network will require more GPU memory

Can you redo the training above with a different pre-trained model ???  See this page for some ideas

https://docs.fast.ai/vision.models.xresnet

Copy the code above into the cells below, or maybe make a copy of your notebook and try it out...

In [None]:
# example !
learn = fva.cnn_learner(dls, fva.xresnet50, metrics=error_rate).to_fp16()

# Advanced Example :  Integration with Maximo Visual Insights

So you want to train on some data that maybe you labelled in another tool ?  No problem.  Here we show how to read in data exported from Maximo Visual Insights and classify using FastAI.

For our IBM Maximo Visual Insights Example, we can just run a nice utility to reformat the output.  For convenience, we added a **Bananas Dataset** to our repo to play with ..

In [None]:
os.chdir(cfg["base_dir"])
!ls *.zip

## Unzip our bananas data and convert to images in sub-directories

When you export data from Maximo Visual Insights (IVI), you get a single zip file.  When you unzip this file, you get a single directory with a metadata file called prop.json.  We have a utility that reads prop.json and creates a new directory with sorted images. Let's see how this works.

In [None]:
# unzip Bananas.zip into /tmp (the conversion step below reads /tmp/Bananas)
!sudo yum install -y unzip
!unzip -o -d /tmp Bananas.zip 

In [None]:
# Now download our handy Maximo dataset conversion script !
# Maximo Visual Insights exports data in a unique format with image files
# and a json file containing label metadata.  Our script reads this in
# and converts that into a nice directory structure with the folder
# containing the class label
!git clone https://github.com/dustinvanstee/powerai-vision-utils.git

In [None]:
#!conda install -y scikit-learn # dependency to run our utility ..
!python powerai-vision-utils/reorganize_exported_dataset.py --directory_in /tmp/Bananas --directory_out /tmp/bananas_fastai/train

In [None]:
!ls /tmp/bananas_fastai/train

In [None]:
# New project config!
mycfg = {
    "image_dir" : "/tmp/bananas_fastai",
    "classes":["black","green","overripe","yellow","ripe"],  ## <<- CLASS Enter your search terms here 
    "d_partitions":["train"],
}
ivicfg=getconfig(mycfg)


In [None]:
# New Image Path ..
path = Path(ivicfg["image_dir"])
path

In [None]:
!ls /tmp/bananas_fastai/train

## Define DataLoaders

In [None]:
# Define some transforms and set up the dataloaders
# (fastai v1 equivalent, kept for reference)
#tfms = get_transforms(do_flip=True)
#data = ImageDataBunch.from_folder(path,train="train", ds_tfms=tfms, size=100, valid_pct=0.20)
#data

# fastai v2: built-in ImageDataLoaders
bs = 64
# https://docs.fast.ai/vision.data#ImageDataLoaders.from_folder
dls = fva.ImageDataLoaders.from_folder(
    path, 
    item_tfms=fva.Resize(460), 
    bs=bs,
    valid_pct=0.33,
    batch_tfms=[*fva.aug_transforms(max_lighting=0.5, size=224, min_scale=0.65), fva.Normalize.from_stats(*fva.imagenet_stats)])
dls.show_batch(max_n=9, figsize=(9,9))

In [None]:
get_label_distribution(dls)

## Training

In [None]:
# Setup a CNN learner
learn = fva.cnn_learner(dls, fva.resnet50, metrics=fva.accuracy)
learn.fit_one_cycle(5)

In [None]:
# Learning rate search
learn.lr_find()

In [None]:
# Run fit one cycle for 5 epochs
learn.fit_one_cycle(5,slice(1e-4,1e-2))

In [None]:
learn.fit_one_cycle(5,slice(1e-4,1e-2))

In [None]:
learn.freeze_to(-3)

In [None]:
learn.fit_one_cycle(5,slice(1e-4,1e-2))

In [None]:
learn.fit_one_cycle(1,slice(1e-4,1e-2))

In [None]:
# Let's see how well we did ... 
interp = fva.ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
len(dls.valid_ds)==len(losses)==len(idxs)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

In [None]:
interp.print_classification_report()

# Inference

In [None]:
# Download some sample images here .... 
command = googliser + \
                      " --o /tmp/banana_inference "+\
                      " --phrase \"ripe black yellow green bananas\" " + \
                      " --parallel 50 --upper-size 500000 --lower-size 2000 " + \
                      " -n 10 " + \
                      " --format jpg --timeout 15 --safesearch-off "
print(command)
!{command}

In [None]:
# get rid of malformed images manually .

command = cfg["jpeginfo"] + " -c /tmp/banana_inference"
!{command}

#!ls /tmp/banana_inference
#!rm /tmp/banana_inference/image\(0020\).jpg

In [None]:
# Inference on your list of images .. 
import glob
import PIL
from IPython.display import Image 

files = glob.glob("/tmp/banana_inference/image*.jpg")

def single_inference(img_path) :
    
    # img = open_image(img_path)
    #img = PIL.Image.open(img_path)
    prediction=learn.predict(img_path)
    print(prediction)
    #img_cls = data.classes[prediction[1].numpy()]
    #img.show(title=img_cls)
    pil_img = Image(filename=str(img_path))  # str() in case img_path is a Path
    display(pil_img)

for my_img in files :
    single_inference(Path(my_img))

Congratulations, you can now build a pretty darn good classifier in less than one hour!

Credits 
* FastAI Team
* Bob Chesebrough / Dustin VanStee / Clarisse Taffe-Hedglin - AICoC Data science team