Kaggle workbook for personal learning, following along with chapter 2 of fast.ai's book,
found here: https://github.com/fastai/fastbook/blob/master/02_production.ipynb

In [None]:
#NB: Kaggle requires phone verification to use the internet or a GPU. If you haven't done that yet, the cell below will fail
#    This code is only here to check that your internet is enabled. It doesn't do anything else.
#    Here's a help thread on getting your phone number verified: https://www.kaggle.com/product-feedback/135367


import socket,warnings
try:
    socket.setdefaulttimeout(1)
    socket.socket(socket.AF_INET, socket.SOCK_STREAM).connect(('1.1.1.1', 53))
except socket.error as ex: raise Exception("STOP: No internet. If on Kaggle: Click '>|' in top right and set 'Internet' switch to on. Else maybe check that the file is trusted. ")

In [None]:
# It's a good idea to ensure you're running the latest version of any libraries you
# need. If we're on Kaggle, we can do this for the virtual machine environment Kaggle
# provides for us, since it might not be up to date. `!pip install -Uqq <libraries>`
# upgrades to the latest version of <libraries>; each q makes pip quieter.
# NB: You can safely ignore any warnings or errors pip spits out about running as root
# or incompatibilities
import os
on_kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

if on_kaggle:
    # Update fastai if we're on Kaggle as it might be out of date
    !pip install -Uqq fastai
# duckduckgo_search: convenience library for getting images using DuckDuckGo search
#  Github: https://github.com/deedy5/duckduckgo_search
!pip install -Uqq duckduckgo_search

In [None]:
from duckduckgo_search import ddg_images
# fastai.vision.all must be imported before fastcore.all on Intel Macs. Kills kernels
# otherwise, don't know why (as of Aug. 2022, might have a fix later)
from fastai.vision.all import *
from fastcore.all import *
from fastdownload import download_url

In [None]:
import requests
import re
import json
# Can't import fastbook on my Mac, so copying the utils function from fastbook to here
# https://github.com/fastai/fastbook/blob/b7f756b49d4eb0d3ce96c0c29be98f4f293cde9f/utils.py#L45
def search_images_ddg(key,max_n=200):
     """Search for 'key' with DuckDuckGo and return the unique URLs of up to 'max_n' images
        (Adapted from https://github.com/deepanprabhu/duckduckgo-images-api)
     """
     url        = 'https://duckduckgo.com/'
     params     = {'q':key}
     res        = requests.post(url,data=params)
     searchObj  = re.search(r'vqd=([\d-]+)\&',res.text)
     if not searchObj: print('Token Parsing Failed !'); return
     requestUrl = url + 'i.js'
     headers    = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'}
     params     = (('l','us-en'),('o','json'),('q',key),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
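     urls       = []
     failures   = 0      # consecutive failed requests; bounds the retry loop
     while True:
         try:
             res  = requests.get(requestUrl,headers=headers,params=params)
             data = json.loads(res.text)
             failures = 0
             for obj in data['results']:
                 urls.append(obj['image'])
                 max_n = max_n - 1
                 if max_n < 1: return L(set(urls))     # dedupe
             if 'next' not in data: return L(set(urls))
             requestUrl = url + data['next']
         except Exception:
             # Retry transient failures, but give up after 5 in a row instead of looping forever
             failures += 1
             if failures >= 5: return L(set(urls))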
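
A side note on the dedupe step: `set` removes duplicates but does not keep the order in which the search returned the URLs. If that order matters, one option is an order-preserving dedupe. The sketch below is plain Python and not part of fastbook; `dedupe_keep_order` is a made-up helper name and the URLs are placeholders.

```python
def dedupe_keep_order(urls):
    """Drop duplicate URLs, keeping the first occurrence of each in its original position."""
    # dict keys are unique and (since Python 3.7) remember insertion order
    return list(dict.fromkeys(urls))

# Hypothetical example URLs, for illustration only
example_urls = [
    'https://example.com/a.jpg',
    'https://example.com/b.jpg',
    'https://example.com/a.jpg',
]
unique_urls = dedupe_keep_order(example_urls)
print(unique_urls)
```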


Inspect search_images_ddg.

In [None]:
search_images_ddg

Download one bear image.

In [None]:
grizzly_bear_urls = search_images_ddg('grizzly bear', max_n=1)
len(grizzly_bear_urls)

In [None]:
destination = 'data/grizzly.jpg'

download_url(grizzly_bear_urls[0], destination, show_progress=False)

Inspect the image.

In [None]:
image = Image.open(destination)
image.to_thumb(128,128)

Use fastai's `download_images` to download images for our search terms. Put each category
in a folder:

In [None]:
bear_types = 'grizzly','black','teddy'

from pathlib import Path
path = Path('data/bears')

In [None]:
from time import sleep
if not path.exists():
    path.mkdir(parents=True)  # also creates 'data' if it doesn't exist yet
    max_n = 10
    for o in bear_types:
        destination = (path/o)
        destination.mkdir(exist_ok=True)
        print(f'Searching for {max_n} {o} bear images')
        results = search_images_ddg(f'{o} bear', max_n=max_n)
        print(f'Downloading {max_n} {o} bear images')
        download_images(destination, urls=results)
        print(f'Done downloading {max_n} {o} bear images')
        print('Sleeping 10 seconds')
        sleep(10)  # Pause to avoid over-loading server or rate limits
        print('Done sleeping')

Check folders have image files.

In [None]:
fns = get_image_files(path)
fns

Delete images we can't open.

In [None]:
failed = verify_images(fns)
failed

In [None]:
# In Jupyter Notebooks the value of the last expression in a cell is displayed as
# output. If we don't care to see that output, we can suppress it with a semicolon at
# the end of the line
failed.map(Path.unlink);

An example of how to use a `DataBlock`. A `DataBlock` is a template for building
`DataLoaders`: call its `.dataloaders()` method with the data source (here, `path`) to
get them. A `DataLoaders` object typically holds two `DataLoader` objects, one for
training and one for validation, but it can hold more.
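
To make the `splitter` and `get_y` arguments in the next cell less magical, here is a plain-Python sketch of the two ideas behind them: a seeded random train/validation split, and labelling each file by its parent folder. This is an illustration only, not fastai's implementation; `random_split` and `label_from_parent` are made-up names and the file paths are hypothetical.

```python
import random
from pathlib import Path

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle indexes with a fixed seed and hold out `valid_pct` of them for validation."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)   # same seed -> same split on every run
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]       # (train indexes, validation indexes)

def label_from_parent(fn):
    """Use the name of the containing folder as the label."""
    return Path(fn).parent.name

# Hypothetical file names, for illustration only
files = [Path(f'data/bears/{kind}/{i}.jpg')
         for kind in ('grizzly', 'black', 'teddy') for i in range(10)]
train_idxs, valid_idxs = random_split(files)
print(len(train_idxs), len(valid_idxs))
print(label_from_parent(files[0]))
```

Fixing the seed is what makes the validation set stable between runs, so changes in the error rate come from the model rather than from a different split.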

In [None]:
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128, ResizeMethod.Squish))

# Some options for Resize:
# Resize(128)                                       crops to fit (the default)
# Resize(128, ResizeMethod.Squish)                  squishes/stretches to fit
# Resize(128, ResizeMethod.Pad, pad_mode='zeros')   pads with zeros (black)

bear_dataloaders = bears.dataloaders(path)

Inspect some images.

In [None]:
# Note: we're grabbing images from the validation set, seen here by calling
# bear_dataloaders.valid
bear_dataloaders.valid.show_batch(max_n=4, nrows=1)

Another good option for item transforms is `RandomResizedCrop`.

In [None]:
bears = bears.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
bear_dataloaders = bears.dataloaders(path)
# Note: we're grabbing images from the training set, seen here by calling
# bear_dataloaders.train
bear_dataloaders.train.show_batch(max_n=4, nrows=1, unique=True)

When `unique=True`, the same image is shown repeatedly, each time with a different random
`RandomResizedCrop` applied. This is an example of data augmentation. The augmentations
are applied on the fly, each time an image is loaded for training: the image is read from
disk, transformed, passed in for training, and then forgotten. The transformed images are
never written back out or saved anywhere.
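
As a rough sketch of why each draw looks different: a random-resized-crop style transform picks a fresh crop box on every call and never touches the source file. The code below is a simplified stand-in, not fastai's `RandomResizedCrop` (it ignores details such as aspect-ratio jitter and the final resize to the target size); `random_crop_box` is a made-up helper and the 640x480 image size is an assumption.

```python
import math
import random

def random_crop_box(width, height, min_scale=0.3, rng=random):
    """Pick a random crop box covering roughly `min_scale` to 100% of the image area.

    Returns (left, top, crop_width, crop_height) in pixels, keeping the aspect ratio.
    """
    scale = rng.uniform(min_scale, 1.0)      # fraction of the area to keep
    side = math.sqrt(scale)                   # same shrink factor on both sides
    crop_w = max(1, int(width * side))
    crop_h = max(1, int(height * side))
    left = rng.randint(0, width - crop_w)     # random position that still fits
    top = rng.randint(0, height - crop_h)
    return left, top, crop_w, crop_h

rng = random.Random(0)
# Four "augmented views" of the same hypothetical 640x480 image
boxes = [random_crop_box(640, 480, min_scale=0.3, rng=rng) for _ in range(4)]
for box in boxes:
    print(box)
```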

For natural images (real-life photos of things), fast.ai provides a standard set of
augmentations that have been found to work pretty well: `aug_transforms()`. Once all
the images are the same size, you can apply transforms to a whole batch at once on the
GPU, which saves time compared with transforming images one by one and/or on the CPU.

The augmentations below are doubled (`mult=2`) so that their effect is easier to see
on inspection for the lesson.

In [None]:
bears = bears.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
bear_dataloaders = bears.dataloaders(path)
bear_dataloaders.train.show_batch(max_n=8, nrows=2, unique=True)

Suggestion from the book: train your model first, then clean your data, instead of
cleaning first and then training.

Now for the actual training. This uses `RandomResizedCrop` on each item and
`aug_transforms()` on the batch.

In [None]:
bears = bears.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms(mult=10)
    )
dls = bears.dataloaders(path)

Create a learner and train it. In this case, we will just use `fine_tune`, since the
model is pre-trained.

In [None]:
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

To see how well the model is doing, we can look at a confusion matrix.
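
For intuition, a confusion matrix is just a count of every (actual, predicted) label pair. The sketch below builds one in plain Python from made-up labels; it is not how fastai computes or draws its matrix, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels, for illustration only
actual    = ['grizzly', 'grizzly', 'black', 'black', 'teddy', 'teddy']
predicted = ['grizzly', 'black',   'black', 'black', 'teddy', 'teddy']

counts = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# Rows are actual labels, columns are predicted labels
print('actual/predicted'.ljust(20) + ''.join(l.ljust(10) for l in labels))
for a in labels:
    print(a.ljust(20) + ''.join(str(counts[(a, p)]).ljust(10) for p in labels))
```

Correct predictions land on the diagonal; anything off the diagonal is a mistake, and its row and column tell you which classes were confused.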

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

It's useful to see where errors are occurring. One way is to show the items with the
highest loss from the model's predictions.

`plot_top_losses()` labels each image with:
Prediction/Actual/Loss/Probability
- Probability is the confidence, from 0 to 1, that the model has assigned to its
prediction. Read it as a percentage: 0.2 is 20% confidence, 0.98 is 98% confidence.
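
To see how loss and probability relate, here is a small sketch that ranks a few made-up predictions by loss. It assumes a cross-entropy loss (the negative log of the probability given to the actual class); `top_losses` is a hypothetical helper, not fastai's code, and the probabilities are invented.

```python
import math

def top_losses(rows, k=3):
    """Return the `k` rows with the highest cross-entropy loss.

    Each input row is (actual_label, {label: predicted_probability}).
    Each output row is (predicted, actual, loss, probability_of_prediction).
    """
    scored = []
    for actual, probs in rows:
        predicted = max(probs, key=probs.get)     # most likely class
        loss = -math.log(probs[actual])           # cross-entropy for one item
        scored.append((predicted, actual, loss, probs[predicted]))
    return sorted(scored, key=lambda r: r[2], reverse=True)[:k]

# Hypothetical model outputs, for illustration only
rows = [
    ('grizzly', {'grizzly': 0.98, 'black': 0.01, 'teddy': 0.01}),   # confident and right
    ('black',   {'grizzly': 0.90, 'black': 0.08, 'teddy': 0.02}),   # confident and wrong
    ('teddy',   {'grizzly': 0.30, 'black': 0.30, 'teddy': 0.40}),   # right but unsure
]
worst = top_losses(rows, k=3)
for predicted, actual, loss, prob in worst:
    print(f'{predicted}/{actual}/{loss:.2f}/{prob:.2f}')
```

The loss is high both when the model is confidently wrong and, to a lesser degree, when it is right but unsure, which is why the top losses are a good place to look for mislabeled or odd images.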

In [None]:
interp.plot_top_losses(5, nrows=1)

Tying back to the suggestion above (train first, then clean): once you have a model,
you can use it to quickly find the data issues you need to clean up.

fast.ai has a handy GUI for helping to clean (Image) data called
`ImageClassifierCleaner`.

In [None]:
cleaner = ImageClassifierCleaner(learn)
cleaner
# Images are ordered by loss, so the ones at the front of the list
# are more likely to be the ones with issues.

The cleaner doesn't actually delete or change labels directly, it just returns the
indexes of things to change. So use the following code to change and delete things.
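
To try the delete/move logic without touching real data, here is a sketch that runs the same two steps on a throwaway folder. The index lists are hand-written stand-ins for what `cleaner.delete()` and `cleaner.change()` would return, and `apply_cleaning` is a made-up helper.

```python
import shutil
import tempfile
from pathlib import Path

def apply_cleaning(fns, to_delete, to_change, root):
    """Delete the files at `to_delete` indexes and move (index, category) pairs into root/category."""
    for idx in to_delete:
        fns[idx].unlink()
    for idx, cat in to_change:
        (root/cat).mkdir(exist_ok=True)
        shutil.move(str(fns[idx]), str(root/cat))

# Build a throwaway folder tree so nothing real gets touched
root = Path(tempfile.mkdtemp())
fns = []
for cat, name in [('grizzly', 'a.jpg'), ('grizzly', 'b.jpg'), ('black', 'c.jpg')]:
    (root/cat).mkdir(exist_ok=True)
    fn = root/cat/name
    fn.write_bytes(b'')          # empty placeholder file
    fns.append(fn)

# Hypothetical selections: delete index 0, relabel index 1 as 'teddy'
apply_cleaning(fns, to_delete=[0], to_change=[(1, 'teddy')], root=root)
remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.jpg'))
print(remaining)

shutil.rmtree(root)              # clean up the temporary folder
```

Because labels come from the parent folder name, moving a file into another category folder is all it takes to relabel it.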

In [None]:
import shutil  # explicit import; the fastai star import may already provide it
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

After cleaning, rebuild the DataLoaders on the cleaned data, and maybe train again as well.

Export model for inference.

In [None]:
# Export model
model_path = Path('models')
model_path.mkdir(exist_ok=True, parents=True)
learn.export(model_path/'resnet18.pkl')