# Install dependencies and import libraries

In [2]:
!pip install -Uqq fastbook fastai duckduckgo_search

I quite like that you can see which libraries are currently installed. Here is a list of all installed libraries whose names begin with 'fast'.

In [5]:
!pip list | grep "^fast"

fastai                       2.7.13
fastbook                     0.0.29
fastcore                     1.5.29
fastdownload                 0.0.7
fastjsonschema               2.16.2
fastprogress                 1.0.3


In [6]:
from fastbook import *
from duckduckgo_search import DDGS
from fastcore.all import *
from itertools import islice
from IPython.display import HTML
from shutil import move

# Function definitions

In [46]:
# Search for images and return a list of URLs.
def search_images(term, max_images = 250):
    print(f"Searching for {max_images} '{term}' images...")
    keywords = term
    # Instantiate the DDGS client imported from duckduckgo_search above.
    ddgs_images = DDGS().images(keywords, max_results=max_images)
    #limited_images = list(islice(ddgs_images, max_images))
    return L(ddgs_images).itemgot('image')

In [47]:
# Download images for the specified search term.
#
# 1. If the search term contains spaces, hyphenate it for the folder name.
# 2. Skip the search and download if images already exist.
#
def download_image_urls(search_term, max_images, resize=True):
    dest_search_term = search_term.replace(' ', '-')
    dest = Path(path)/dest_search_term
    if dest.exists() and any(dest.iterdir()):
        num_images_before = len(list(dest.iterdir()))
        print(f"{num_images_before} images already downloaded in {dest}, skipping download.")
    else:
        # Only query the search engine when there is something to download.
        urls = search_images(search_term, max_images)
        dest.mkdir(exist_ok=True, parents=True)
        download_images(dest, urls=urls)
        num_images_downloaded = len(list(dest.iterdir()))
        if resize:
            print("Resizing images...")
            resize_images(dest, max_size=400, dest=dest)
            print(f"{num_images_downloaded} images downloaded and resized in {dest}.")
        else:
            print(f"{num_images_downloaded} images downloaded in {dest}.")

In [48]:
# Verify downloaded files.
#
# 1. Move failed images to a 'deleted' folder (a sibling of the data folder) for inspection.
# 2. Print the number of files checked and the number of failed images.
#
def verify_downloaded_images():
    image_files = get_image_files(path)
    num_images_checked = len(image_files)
    print(f"Number of images to be checked: {num_images_checked}")

    failed = verify_images(image_files)

    # Create a 'deleted' folder if it doesn't exist
    deleted_folder = Path(path).parent / 'deleted'
    deleted_folder.mkdir(exist_ok=True)

    for failed_image in failed:
        new_location = deleted_folder / failed_image.name
        print(f"Moving failed image: {failed_image} to {new_location}")
        move(failed_image, new_location)
        
    print(f"Number of failed images moved to 'deleted' folder: {len(failed)}")
    return failed

In [49]:
# Display the images used in the training set, and validation set.
def displayRandomSplitter(v_pct=0.2, sd=42):
    # Get a list of files (or items) in your dataset
    items = get_image_files(path)
    
    # Initialize the RandomSplitter
    splitter = RandomSplitter(valid_pct=v_pct, seed=sd)
    
    # Apply the splitter to your dataset
    train_indices, valid_indices = splitter(items)
    
    # Function to create a DataFrame from file indices
    def create_dataframe(file_indices):
        data = [(items[i].name, items[i].parent.name) for i in file_indices]
        df = pd.DataFrame(data, columns=['Filename', 'Parent Folder'])
        #sorted = df.sort_values(by='Parent Folder')
        return df
    
    # Create DataFrames for training and validation sets
    train_files = create_dataframe(train_indices)
    valid_files = create_dataframe(valid_indices)
    
    # Convert DataFrames to HTML
    train_html = train_files.to_html(index=False)
    valid_html = valid_files.to_html(index=False)
    
    # Display the HTML tables side by side
    html_content = f"""
    <div style="float: left; padding-right: 20px;">
        <h3 style="text-align:center">Training Set Files</h3>
        {train_html}
    </div>
    <div style="float: left;padding-left:50px;">
        <h3 style="text-align:center">Validation Set Files</h3>
        {valid_html}
    </div>
    """
    display(HTML(html_content))

# Notes

Notes from Lesson 2 of the course.

## Data augmentation

It is important to understand that `RandomResizedCrop` generates only **one** cropped (and resized) version of each training image per epoch. Apparently the original, unaltered image is not included at all.

My understanding is that with `RandomResizedCrop` enabled, each epoch takes every image in the training set and applies a fresh random crop and resize, so the original unaltered image is never used in any epoch.

I found that quite interesting, as I had initially thought that `RandomResizedCrop` generated a whole batch of variants of each original image for every epoch, so this clarification is really useful.

I wonder what effect using the original unaltered image for the first epoch would have on a model. That is something to try at some point in the future.
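
To make the "one crop per image per epoch" idea concrete, here is a minimal sketch in plain Python. It is not fastai's implementation: the function names, the file names, the fixed aspect ratio, and the `min_scale=0.5` default are all assumptions for illustration. It only computes crop boxes, not pixels.

```python
import random

def random_crop_box(width, height, min_scale=0.5, rng=random):
    """Pick a random crop box (left, top, right, bottom) inside a width x height image.

    The crop keeps between min_scale and 100% of the original area.
    Simplification: the aspect ratio is kept fixed, unlike a real augmentation.
    """
    scale = rng.uniform(min_scale, 1.0)   # fraction of the original area to keep
    side = scale ** 0.5                   # same shrink factor on both axes
    crop_w, crop_h = int(width * side), int(height * side)
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)

def simulate_epochs(image_sizes, n_epochs, seed=42):
    """Return, for each epoch, exactly one crop box per training image."""
    rng = random.Random(seed)
    return [
        {name: random_crop_box(w, h, rng=rng) for name, (w, h) in image_sizes.items()}
        for _ in range(n_epochs)
    ]

# Hypothetical training images as (width, height).
images = {"bear_01.jpg": (400, 300), "bear_02.jpg": (320, 400)}
epochs = simulate_epochs(images, n_epochs=3)
for i, crops in enumerate(epochs):
    print(f"epoch {i}: {crops}")
```

Each epoch produces one entry per image, so the number of training items per epoch stays the same; only the crop changes from epoch to epoch.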

## Using PKL files on Hugging Face

As I understand it, there are general [security concerns](https://forums.fast.ai/t/lesson-2-official-topic/96033/710) with `*.pkl` files, because loading a pickle can execute arbitrary code. Hugging Face will usually display a warning when you try to use them.

[Safetensors](https://huggingface.co/docs/safetensors/index) seems to be the preferred alternative to `*.pkl` files these days.

It will be interesting to see if fastai adds support for the Safetensors format in the future.
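
The security concern with pickle files can be shown with the standard library alone. The sketch below is a harmless demonstration, not anything from the course: the `Payload` class is hypothetical, and the expression it evaluates is deliberately trivial. A real attack would put something destructive in its place.

```python
import pickle

class Payload:
    """Stand-in for a malicious object hidden inside a .pkl file."""
    def __reduce__(self):
        # pickle records "call eval with this string" instead of the object's data.
        # The expression here is harmless; an attacker could supply any code.
        return (eval, ("'code ran during load: ' + str(6 * 7)",))

blob = pickle.dumps(Payload())   # the bytes that would be written to a .pkl file
result = pickle.loads(blob)      # loading runs the embedded call
print(result)                    # prints: code ran during load: 42
```

Loading the bytes does not give back a `Payload` object at all; it gives back whatever the embedded call returned. My understanding is that Safetensors avoids this by storing only tensor data and metadata, with no executable instructions.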

## Exporting code from a notebook

Just over halfway through the Lesson 2 video lecture, Jeremy discusses how to [author code](https://youtu.be/F4tvM4Vb3A0?t=2905) in a notebook and export it for use in other applications, such as a Gradio application hosted on Hugging Face.
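
The general idea of exporting marked cells can be sketched without any extra libraries. This is only an imitation of the concept, not how nbdev does it: the `export_marked_cells` function and the notebook contents are hypothetical, and I am assuming cells are flagged with an `#|export` comment on their first line. A real notebook would be loaded from its `.ipynb` file with `json.load`.

```python
EXPORT_MARKER = "#|export"

def export_marked_cells(notebook: dict) -> str:
    """Collect the source of every code cell whose first line is the export marker."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # Notebook JSON stores source as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        lines = text.splitlines()
        if lines and lines[0].strip() == EXPORT_MARKER:
            chunks.append("\n".join(lines[1:]).strip())
    return "\n\n".join(chunks) + "\n"

# A tiny in-memory notebook (hypothetical content) standing in for an .ipynb file.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code",
         "source": ["#|export\n", "def greet(name):\n", "    return f'Hello {name}'\n"]},
        {"cell_type": "code", "source": ["greet('test')  # scratch cell, not exported\n"]},
    ]
}

script = export_marked_cells(notebook)
print(script)
```

The resulting string could be written to a `.py` file for a Gradio app; scratch cells used for experimenting in the notebook are left out.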


## Using a PKL model in JavaScript application

Jeremy discusses how to use a trained model exported as a `*.pkl` file in a JavaScript application: host the model on a server (e.g. Hugging Face) and call its API to make predictions and return the results. This is OK for testing, but for production you would probably want to host your model on a dedicated server platform such as Vercel or one of the many others available.


# Homework
My own classifier model.

I needed to do a bit of manual data cleaning. I removed quite a few cartoon-style images, as well as anything that didn't look like it belonged in the dataset.

In Lesson 2 Jeremy recommends training the model before doing any data cleaning, but in this notebook I cleaned the data manually before doing any training.

# Conclusion
I went through the lesson one video of the course and followed along with the bird classifier, and modified a bit of the code to further my knowledge and experience of Python. Overall I was quite happy with my understanding of most of the topics presented.

I think my biggest takeaway is a desire to understand the `fine_tune()` method, and related methods such as `fit_one_cycle()`, at a more fundamental level, as well as the `DataBlock` and `DataLoaders` classes and their methods.