## Image classification from scratch
<b>Author</b>: <a href="https://twitter.com/fchollet">fchollet</a> <br/>
<b>Date created</b>: 2020/04/27 <br/>
<b>Last modified</b>: 2023/11/09 <br/>
<b>Description</b>: Training an image classifier from scratch on the Kaggle Cats vs Dogs dataset.

### Introduction

This example shows how to do image classification from scratch, starting from JPEG image files on disk, without leveraging pre-trained weights or a pre-made Keras Application model. We demonstrate the workflow on the Kaggle Cats vs Dogs binary classification dataset.

We use the <font color="red">image_dataset_from_directory</font> utility to generate the datasets, and we use Keras image preprocessing layers for image standardization and data augmentation.

<hr/>

### Setup

In [1]:
import os
import numpy as np
import keras
from keras import layers
from tensorflow import data as tf_data
import matplotlib.pyplot as plt



<hr/>

### Load the data: the Cats vs Dogs dataset
#### Raw data download
First, let's download the 786M ZIP archive of the raw data:

In [5]:
!mkdir -p downloads
!curl -O --output-dir downloads https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  786M  100  786M    0     0  8817k      0  0:01:31  0:01:31 --:--:--  9.8M


In [8]:
!cd downloads && unzip -q kagglecatsanddogs_5340.zip
!cd downloads && ls

 CDLA-Permissive-2.0.pdf   kagglecatsanddogs_5340.zip
 PetImages		  'readme[1].txt'


Now we have a <font color="red">PetImages</font> folder which contains two subfolders, <font color="red">Cat</font> and <font color="red">Dog</font>. Each subfolder contains the image files for its category.

In [9]:
!ls downloads/PetImages

Cat  Dog
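
Before filtering, it can help to know how many files each class folder holds. The following is a minimal sketch using only the standard library. The helper name `count_files_per_class` is hypothetical, and the `downloads/PetImages` path is assumed from the commands above.

```python
from pathlib import Path


def count_files_per_class(root):
    """Return {subfolder name: number of regular files directly inside it}."""
    root = Path(root)
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts


if __name__ == "__main__":
    # Assumes the archive was unzipped into downloads/ as above.
    data_root = Path("downloads/PetImages")
    if data_root.is_dir():
        for name, n in count_files_per_class(data_root).items():
            print(f"{name}: {n} files")
```

Running the same count again after the filtering step gives a quick check on how many files each class lost.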


#### Filter out corrupted images

When working with lots of real-world image data, corrupted images are a common occurrence. Let's filter out badly-encoded images that do not feature the string "JFIF" in their header.

In [10]:
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("downloads/PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
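        # Use a context manager so the file is closed even if reading fails
        # (and so a failed open() does not leave `fobj` undefined).
        with open(fpath, "rb") as fobj:
            is_jfif = b"JFIF" in fobj.peek(10)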

        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)

print(f"Deleted {num_skipped} images.")

Deleted 1590 images.
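
Because the loop above deletes files in place, a dry run that only reports which files would be removed can be a useful first step. This is a sketch that applies the same `JFIF` check as the cell above. `find_non_jfif` is a hypothetical helper name, not part of Keras.

```python
import os


def find_non_jfif(root, class_names=("Cat", "Dog")):
    """List paths under root/<class> whose leading bytes lack the JFIF marker."""
    flagged = []
    for class_name in class_names:
        folder_path = os.path.join(root, class_name)
        for fname in sorted(os.listdir(folder_path)):
            fpath = os.path.join(folder_path, fname)
            with open(fpath, "rb") as fobj:
                # Same check as the deletion loop: look for b"JFIF" in the
                # bytes that peek() returns from the start of the file.
                if b"JFIF" not in fobj.peek(10):
                    flagged.append(fpath)
    return flagged
```

Calling `find_non_jfif("downloads/PetImages")` returns the list of candidate paths and leaves every file on disk, so the list can be inspected before anything is removed.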


<hr/>

### Generate a <font color="red">Dataset</font>

In [None]:
image_size = (180, 180)
batch_size = 128

train_ds, val_ds = keras.utils.image_dataset_from_directory(
    "downloads/PetImages",
    validation_split=0.2,
    subset="both",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
)
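
With `validation_split=0.2` and `subset="both"`, the call returns a training dataset and a validation dataset drawn from the same set of files. As a rough sketch of the resulting sizes, the arithmetic below assumes that the validation count is `int(validation_split * total)` and that the final partial batch is kept. Both are our understanding of the utility's behavior, not something stated in this example. The function name `split_and_batch_counts` and the total of 1,000 files are purely illustrative.

```python
import math


def split_and_batch_counts(total_files, validation_split, batch_size):
    """Return (train_files, val_files, train_batches, val_batches)."""
    # Assumption: the validation share is truncated to a whole number of files.
    val_files = int(validation_split * total_files)
    train_files = total_files - val_files
    # Assumption: a final, smaller batch is kept rather than dropped.
    train_batches = math.ceil(train_files / batch_size)
    val_batches = math.ceil(val_files / batch_size)
    return train_files, val_files, train_batches, val_batches


# Illustrative numbers only; the real total depends on how many files
# survived the filtering step.
print(split_and_batch_counts(1000, 0.2, 128))  # (800, 200, 7, 2)
```

The actual file counts are reported by `image_dataset_from_directory` when the cell runs, and those reported numbers should be treated as authoritative.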