
Borrowed from Terry's AI-4-Media unit, week 3: [create-classification-dataset.ipynb](https://git.arts.ac.uk/tbroad/AI-4-Media-23-24/blob/main/Week-3-CNNs-and-image-classification/create-classification-dataset.ipynb)


## Make a classification dataset  

In this notebook we are going to look at how to quickly make our own image classification datasets, which we can then use for training an image classifier.

To do this we are going to use [`gallery-dl`](https://github.com/mikf/gallery-dl), a command-line tool written in Python. It lets us download entire image galleries from sites such as [Pinterest](https://www.pinterest.co.uk/), [Tumblr](https://www.tumblr.com/), and [BBC](https://www.bbc.co.uk/).

In this walkthrough we will download from Pinterest, using boards that other users on the platform have already curated.

First let's do some installation:

In [None]:
# this only needs to be done once;
# it will install gallery-dl into your coding3 environment

!pip install -U gallery-dl

Then let's import what we need:

In [2]:
from PIL import Image
import os, os.path
import shutil

from torchvision.transforms.v2.functional import resize

### Step 1: Make a folder for your dataset

Let's make a folder for your dataset. We will call it `my_dataset` and put it in the folder `./data`.

In [3]:
my_dataset_path = './data/my_dataset'

try:
    # makedirs also creates ./data if it does not exist yet
    os.makedirs(my_dataset_path)
except FileExistsError:
    print(f'{my_dataset_path} already existed, no worries, keep going')

### Step 2: Navigate to Pinterest and download a board

Go to https://www.pinterest.co.uk/ and search for a category of image or thing that interests you. It is totally up to you what classes you have in your dataset!

After running a search, click the filter on the side and select the option for **Boards**. See the image below:

<img src="./src/graphics/pinterest-filter-example.png" width="800"></img>

Select a board and look at its URL. It should look something like `https://www.pinterest.co.uk/user_id/board_id/`. Use that URL as the argument for the command below.

For example, `!gallery-dl https://www.pinterest.co.uk/bbuechi/chair-design/ --filter "extension in ('jpg', 'png', 'gif')"`  

You can stop the download early if there are too many images in the board, but you will typically need at least a few hundred images per class.

In [None]:
# make sure to replace the URL with that of the board you found!

!gallery-dl https://www.pinterest.co.uk/______/______/ --filter "extension in ('jpg', 'png', 'gif')"
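
If a board is very large, an alternative to interrupting the download by hand is gallery-dl's `--range` option, which limits the download to a range of file indices. The sketch below only builds and prints such a command. The helper name `build_download_command` and the limit of 300 are illustrative assumptions, not part of the original notebook.

```python
import shlex

def build_download_command(board_url, max_images=300):
    """Return a gallery-dl command (as a list of arguments)
    limited to the first `max_images` files of a board."""
    return [
        "gallery-dl",
        board_url,
        "--range", f"1-{max_images}",
        "--filter", "extension in ('jpg', 'png', 'gif')",
    ]

cmd = build_download_command("https://www.pinterest.co.uk/bbuechi/chair-design/")

# print the command so it can be copied into a cell (prefixed with !)
print(shlex.join(cmd))
```

Because the extension filter may discard some of the files inside that range, you may end up with slightly fewer images than the limit.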

### Step 3: Resize downloaded images and save them into the respective class folder 

After downloading, you will see a folder called `gallery-dl` which contains the subfolder `gallery-dl/pinterest/user-id/board name`. Inside you'll find the images you have just downloaded (replace **user-id** and **board name** with the values for the board you downloaded).

We're going to put these images into a subfolder of your dataset folder. The name of this subfolder will be the name of the class (e.g. put all images of cats into a folder named `cat`).

Some images may have a very high resolution; we can scale them down at this stage to speed up training.
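
If you are not sure what the downloaded folder is called, a small helper like the one below can list every folder under `./gallery-dl` that contains images. This is a minimal sketch; the function name `list_image_folders` is made up for illustration.

```python
import os

def list_image_folders(root, extensions=(".jpg", ".png", ".gif")):
    """Return (folder, image_count) pairs for every folder
    under `root` that contains at least one image file."""
    found = []
    for folder, _subfolders, files in os.walk(root):
        n = sum(1 for name in files
                if os.path.splitext(name)[1].lower() in extensions)
        if n > 0:
            found.append((folder, n))
    return sorted(found)

# prints nothing if ./gallery-dl does not exist yet
for folder, n in list_image_folders("./gallery-dl"):
    print(f"{n:5d} images in {folder}")
```

You can then copy the printed folder path into `source_folder`.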

The following code will check images we downloaded, resize them to a reasonable size, and save them into the dataset folder we created.

Make sure to **edit the following code with the correct path to your image folder**:

In [12]:
# change this to the correct path to the folder of images you downloaded
source_folder = "./gallery-dl/pinterest/______/______"

# change this to a name for this category, it is up to you
# this is going to be the name of the subfolder
class_name = '______' 

# the shorter side of each image will be resized to this many pixels
target_resolution = 128

In [None]:
target_folder = os.path.join(my_dataset_path, class_name)

# create a class folder if it doesn't exist
try:
    os.mkdir(target_folder)
    print(f'{target_folder} created')
except FileExistsError:
    print(f'{target_folder} already existed')
     
print(f'source folder: {source_folder}')
print(f'target folder: {target_folder}')

# continue numbering after any files already in the target folder
count = len(os.listdir(target_folder))
print(f'{count} files already in the target folder')
valid_file_type = [".jpg", ".gif", ".png"]
saved = 0

for f in sorted(os.listdir(source_folder)):
    
    # check the file extension
    ext = os.path.splitext(f)[1]
    if ext.lower() not in valid_file_type:
        continue
    
    # load an image from the source folder,
    # skipping files that cannot be read as images
    try:
        img = Image.open(os.path.join(source_folder, f)).convert('RGB')
    except OSError:
        print(f'skipping unreadable file: {f}')
        continue
    img = resize(img, target_resolution)
    
    # save the resized image into target folder
    img.save(os.path.join(target_folder, f'{count:05}.jpg'))
    
    count += 1
    saved += 1
    
    if saved % 100 == 0:
        print(f'processed {saved} images...')
    
print(f'saved {saved} new images into {target_folder} ({count} files in total)')
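
Note that `resize` with a single number scales the shorter side to that length and keeps the aspect ratio, so the saved images will not all have the same shape. If you would rather have square images, one option (an assumption on our part, not what the cell above does) is Pillow's `ImageOps.fit`, which resizes and centre-crops in one step. A minimal sketch on a stand-in image:

```python
from PIL import Image, ImageOps

target_resolution = 128

# a stand-in for a downloaded image: 300 pixels wide, 200 pixels tall
img = Image.new("RGB", (300, 200), color=(200, 120, 40))

# resize and centre-crop so the result is exactly square
square = ImageOps.fit(img, (target_resolution, target_resolution))

print(square.size)  # (128, 128)
```

Keep in mind that cropping throws away the edges of the image, so anything near the border may be lost.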

### Step 4: Repeat

Repeat steps 2 & 3 until you have **at least two categories**. You may need to download more than one board per category to get enough images. A common rule of thumb is that you need at least a few hundred images per category to train an effective classifier.

**Don't forget to look at the data before training!** Make sure that the dataset you have collected contains what you actually want to be there. It is very easy for junk, or samples that aren't actually of the class you want, to end up in there. You may have to do some manual data cleaning to remove unwanted samples if you are serious about training an effective image classifier.
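
To check how many images each category has so far, you could count the files in each class folder. This is a minimal sketch that assumes the folder layout used in this notebook (`./data/my_dataset/<class name>/00000.jpg`, ...); the function name `count_images_per_class` is made up for illustration.

```python
import os

def count_images_per_class(dataset_path):
    """Return a dict mapping each class folder name
    to the number of .jpg files inside it."""
    counts = {}
    for class_name in sorted(os.listdir(dataset_path)):
        class_folder = os.path.join(dataset_path, class_name)
        if os.path.isdir(class_folder):
            counts[class_name] = sum(
                1 for f in os.listdir(class_folder)
                if f.lower().endswith(".jpg")
            )
    return counts

dataset_path = "./data/my_dataset"
if os.path.isdir(dataset_path):
    for class_name, n in count_images_per_class(dataset_path).items():
        print(f"{class_name}: {n} images")
```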
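
One quick way to look at many images at once is to paste thumbnails into a single grid image (a "contact sheet"). The sketch below uses only Pillow. The function name `make_contact_sheet`, the grid size, and the example class folder `cat` are all illustrative assumptions.

```python
import os
from PIL import Image

def make_contact_sheet(class_folder, columns=8, rows=4, thumb_size=64):
    """Paste the first columns*rows .jpg images of a class folder
    into one grid image, so junk samples are easy to spot."""
    files = sorted(f for f in os.listdir(class_folder)
                   if f.lower().endswith(".jpg"))
    sheet = Image.new("RGB", (columns * thumb_size, rows * thumb_size), "white")
    for index, name in enumerate(files[: columns * rows]):
        with Image.open(os.path.join(class_folder, name)) as img:
            # squash to a square thumbnail; fine for a quick look
            thumb = img.convert("RGB").resize((thumb_size, thumb_size))
        x = (index % columns) * thumb_size
        y = (index // columns) * thumb_size
        sheet.paste(thumb, (x, y))
    return sheet

class_folder = "./data/my_dataset/cat"  # hypothetical class folder
if os.path.isdir(class_folder):
    make_contact_sheet(class_folder).save("contact_sheet_cat.jpg")
```

In a notebook you can also put the returned image on the last line of a cell to display it directly.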

After completion you can go ahead and delete the folder `gallery-dl`.
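
You can delete the folder by hand, or from Python with `shutil.rmtree`. A minimal sketch, assuming the downloads are in `./gallery-dl` next to this notebook (the helper name `delete_folder` is made up for illustration). Be careful: this deletes the folder and everything in it, and cannot be undone.

```python
import os
import shutil

def delete_folder(folder):
    """Delete `folder` and everything inside it.
    Returns True if something was deleted, False if it was not there."""
    if os.path.isdir(folder):
        shutil.rmtree(folder)
        return True
    return False

if delete_folder("./gallery-dl"):
    print("deleted ./gallery-dl")
else:
    print("./gallery-dl not found, nothing to delete")
```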
