# Preprocessing

This notebook pre-processes the data from the aircraft and bird datasets. Image manipulation is done via the `pillow` Python module.

## Aircraft

We are training our model to determine the family of an aircraft. There are 70 unique families in this dataset. The family of each aircraft image is recorded in the `images_family_test.txt`, `images_family_train.txt`, and `images_family_val.txt` files in the dataset.

The aircraft images in this dataset contain a copyright banner at the bottom of the image. This banner is approximately 20 pixels in height and is cropped off each image.

Each aircraft image is resized to be 224x224 pixels, regardless of its original dimensions and aspect ratio.
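
The crop described above comes down to a little box arithmetic. The following is a minimal sketch, assuming the banner is exactly 20 pixels tall (the text only says approximately). `copyright_crop_box` is a hypothetical helper and not part of the notebook.

```python
def copyright_crop_box(width: int, height: int, banner_height: int = 20) -> tuple[int, int, int, int]:
    """Return a (left, upper, right, lower) box that drops the bottom banner."""
    if height <= banner_height:
        raise ValueError("image is not taller than the banner")
    return (0, 0, width, height - banner_height)


# The box has the form Pillow's Image.crop() expects, e.g.
# image.crop(copyright_crop_box(*image.size)).resize((224, 224))
box = copyright_crop_box(1024, 695)
print(box)  # (0, 0, 1024, 675)
```

Because the subsequent resize ignores aspect ratio, a 1024x675 crop like the one above would be squeezed horizontally to fit the 224x224 square.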

## Birds

We are training our model to determine the genus of a bird. The species of each bird is recorded in the `birds.csv` file in the dataset, and the genus is extracted from the species name.
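
As a rough illustration of the extraction step, the sketch below assumes the scientific names in `birds.csv` are binomials with the genus as the first word. `genus_of` is a hypothetical helper introduced only for this example.

```python
def genus_of(scientific_name: str) -> str:
    """Return the genus: the first word of a binomial name, capitalized."""
    parts = scientific_name.split()
    if not parts:
        raise ValueError("empty scientific name")
    return parts[0].capitalize()


print(genus_of("tockus fasciatus"))   # Tockus
print(genus_of(" CHAMAEA FASCIATA"))  # Chamaea
```

With pandas, the same rule could be applied to a whole column via `.apply(genus_of)`.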

To match the number of classes in the aircraft dataset, only the 70 genera with the most images in the dataset are kept. Then, to match the aircraft sample size, 10,000 bird images are sampled from this 70-genus subset. A constant random state makes the sample reproducible across runs (and with the random state used below, all 70 genera are still represented in the final dataset).
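
The selection logic can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: `top_k_then_sample` and the seed value are hypothetical. With pandas, the sampling step would more likely use `DataFrame.sample(n=..., random_state=...)`.

```python
import random
from collections import Counter


def top_k_then_sample(labels, k, n, seed=0):
    """Keep items whose label is among the k most common, then sample n of them.

    Returns sorted indices into `labels`; the same seed gives the same sample.
    """
    top = {label for label, _ in Counter(labels).most_common(k)}
    kept = [i for i, label in enumerate(labels) if label in top]
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    return sorted(rng.sample(kept, min(n, len(kept))))


labels = ["a"] * 5 + ["b"] * 4 + ["c"] * 1
picked = top_k_then_sample(labels, k=2, n=6, seed=42)
print(len(picked), sorted({labels[i] for i in picked}))
```

Random sampling does not guarantee that every class survives, which is why it is worth checking the number of distinct classes after sampling.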

Each bird image is resized to be 224x224 pixels, regardless of its original dimensions and aspect ratio. Despite the dataset claiming all images are already 224x224 pixels, there are a few images that are larger or smaller.

## Output

All pre-processed data ends up in the `preprocessed-data` directory. Aircraft images will be saved to a `preprocessed-data/images/<aircraft-family>` directory. Bird images will be saved to a `preprocessed-data/images/<bird-genus>` directory.
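
A small sketch of how such an output path can be assembled is shown below. The zero-padded five-digit file name and the removal of path separators mirror the implementation in this notebook. `output_image_path` and the example label are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def output_image_path(image_dir: str, label: str, index: int) -> str:
    """Build `<image_dir>/<label>/<index>.jpg` with a zero-padded index."""
    # Strip path separators so a label containing "/" stays a single directory name.
    safe_label = label.replace("/", "").replace("\\", "")
    return str(PurePosixPath(image_dir) / safe_label / f"{index:05d}.jpg")


print(output_image_path("preprocessed-data/images", "F/A-18", 42))
# preprocessed-data/images/FA-18/00042.jpg
```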

A "master" CSV file is created at `preprocessed-data/data.csv`. It contains the following columns:

| Column                     | Description                                                                         |
|----------------------------|-------------------------------------------------------------------------------------|
| aircraft_family            | The family of the aircraft. No data if the row represents a bird image.             |
| bird_genus                 | The genus of the bird. No data if the row represents an aircraft image.             |
| filepath                   | The relative filepath to the associated image.                                      |
| type                       | `0` if an aircraft image, `1` if a bird image.                                      |

There will be 20,000 rows (10,000 each of aircraft and birds). There are 70 unique aircraft families and 70 unique bird genera.
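
These expectations can be checked mechanically once the CSV is loaded. The sketch below assumes each row is available as a dict keyed by the column names in the table above. `summarize` and the sample rows (including their file paths) are hypothetical.

```python
from collections import Counter


def summarize(rows):
    """Count rows per type and distinct class labels per type.

    `rows` is an iterable of dicts with the columns of data.csv.
    """
    rows = list(rows)
    per_type = Counter(row["type"] for row in rows)
    families = {row["aircraft_family"] for row in rows if row["type"] == 0}
    genera = {row["bird_genus"] for row in rows if row["type"] == 1}
    return per_type, len(families), len(genera)


# Illustrative rows only; real rows come from preprocessed-data/data.csv.
rows = [
    {"aircraft_family": "Boeing 737", "bird_genus": "", "filepath": "a/0.jpg", "type": 0},
    {"aircraft_family": "", "bird_genus": "Tockus", "filepath": "b/1.jpg", "type": 1},
    {"aircraft_family": "", "bird_genus": "Chamaea", "filepath": "b/2.jpg", "type": 1},
]
per_type, n_families, n_genera = summarize(rows)
print(per_type[0], per_type[1], n_families, n_genera)  # 1 2 1 2
```

On the full dataset described above, the same call would be expected to report 10,000 rows per type and 70 classes for each.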

## Implementation

Define preprocessing code:

In [3]:
import pandas as pd
import os
from PIL import Image
from tqdm import tqdm
import kaggle

# Download both datasets if no local data directory exists yet
if "data" not in os.listdir():
    kaggle.api.dataset_download_files(dataset="seryouxblaster764/fgvc-aircraft", path="data/aircraft", unzip=True, quiet=False)
    kaggle.api.dataset_download_files(dataset="gpiosenka/100-bird-species", path="data/birds", unzip=True, quiet=False)



AIRCRAFT_TYPE = 0
BIRD_TYPE = 1


def collect_aircraft_image_data() -> pd.DataFrame:
    base_path = 'data/aircraft/fgvc-aircraft-2013b/fgvc-aircraft-2013b/data'
    suffixes = ['test', 'train', 'val']

    family_count = {}
    data = []
    for suffix in suffixes:
        with open(f'{base_path}/images_family_{suffix}.txt') as file:
            for line in file:
                name, family = line.strip().split(sep=' ', maxsplit=1)
                family = family.replace("/", "").replace("\\", "") # remove file system path separators

                if family in family_count:
                    family_count[family] += 1
                else:
                    family_count[family] = 1

                if family_count[family] <= 100:
                    data.append((f'{base_path}/images/{name}.jpg', family))

    df = pd.DataFrame(data=data, columns=['filepath', 'aircraft_family'])
    df['type'] = AIRCRAFT_TYPE
    return df


def collect_bird_image_data() -> pd.DataFrame:
    base_path = 'data/birds'

    df = pd.read_csv(f'{base_path}/birds.csv')
    df = df.drop(columns=['labels', 'data set', 'class id'])
    df = df.rename(columns={'scientific name': 'bird_species', 'filepaths': 'filepath'})

    df.bird_species = df.bird_species.apply(lambda name: name.title())
    df.filepath = df.filepath.apply(lambda path: f'{base_path}/{path}')
    df['type'] = BIRD_TYPE

    # Use only the top 70 species by number of images (this groups by full species name, not genus)
    largest_species = (
        df.groupby('bird_species')
        .agg(total=('filepath', 'count'))
        .nlargest(70, ['total'])
        .reset_index()
        .bird_species
    )
    df = df[df.bird_species.isin(largest_species)]

    species_count = {}
    rows = []
    for _, row in df.iterrows():
        if row.bird_species in species_count:
            species_count[row.bird_species] += 1
        else:
            species_count[row.bird_species] = 1

        if species_count[row.bird_species] <= 100:
            rows.append(row)
    return pd.DataFrame(data=rows, columns=df.columns)


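def remove_copyright_from_image(image: Image.Image) -> Image.Image:
    # Crop off the ~20-pixel copyright banner at the bottom of the image
    width, height = image.size
    return image.crop((0, 0, width, height - 20))


def resize_image(image: Image.Image) -> Image.Image:
    # Resize to 224x224, ignoring the original aspect ratio
    if image.size == (224, 224):
        return image
    return image.resize((224, 224))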


def preprocess(df: pd.DataFrame, image_dir: str) -> pd.DataFrame:
    os.makedirs(image_dir, exist_ok=True)

    new_filepaths = []
    for index, row in tqdm(df.iterrows(), total=df.shape[0], desc='Preprocessing'):
        # One output directory per class label (named class_dir to avoid shadowing the built-in dir)
        if row.type == AIRCRAFT_TYPE:
            class_dir = f'{image_dir}/{row.aircraft_family}'
        else:
            class_dir = f'{image_dir}/{row.bird_species}'

        os.makedirs(class_dir, exist_ok=True)

        path = f'{class_dir}/{index:05d}.jpg'
        new_filepaths.append(path)

        if not os.path.exists(path): # skip if already processed
            with Image.open(row.filepath) as image:
                if row.type == AIRCRAFT_TYPE: # copyright only present in aircraft images
                    image = remove_copyright_from_image(image)
                image = resize_image(image)
                image.save(path)
    
    df.filepath = new_filepaths
    return df

Run preprocessing code:

In [4]:
output_dir = 'preprocessed-data'
os.makedirs(output_dir, exist_ok=True)

# print('Collecting aircraft data...')
# df_aircraft = collect_aircraft_image_data()
print("collecting bird data...")
df_birds = collect_bird_image_data()

# df_all = pd.concat((df_aircraft, df_birds)).reset_index(drop=True)
df_all = df_birds
df_all = preprocess(df_all, f'{output_dir}/images')
df_all.to_csv(f'{output_dir}/data.csv')
display(df_all)

collecting bird data...


Preprocessing: 100%|██████████| 7000/7000 [00:08<00:00, 826.27it/s]


|       | filepath                                          | bird_species     | type |
|-------|---------------------------------------------------|------------------|------|
| 1103  | preprocessed-data/images/Tockus Fasciatus/0110... | Tockus Fasciatus | 1    |
| 1104  | preprocessed-data/images/Tockus Fasciatus/0110... | Tockus Fasciatus | 1    |
| 1105  | preprocessed-data/images/Tockus Fasciatus/0110... | Tockus Fasciatus | 1    |
| 1106  | preprocessed-data/images/Tockus Fasciatus/0110... | Tockus Fasciatus | 1    |
| 1107  | preprocessed-data/images/Tockus Fasciatus/0110... | Tockus Fasciatus | 1    |
| ...   | ...                                               | ...              | ...  |
| 81988 | preprocessed-data/images/Chamaea Fasciata/8198... | Chamaea Fasciata | 1    |
| 81989 | preprocessed-data/images/Chamaea Fasciata/8198... | Chamaea Fasciata | 1    |
| 81990 | preprocessed-data/images/Chamaea Fasciata/8199... | Chamaea Fasciata | 1    |
| 81991 | preprocessed-data/images/Chamaea Fasciata/8199... | Chamaea Fasciata | 1    |
