# Creating the Dataset

The dataset used in this project, called **'Garbage Image Dataset'**, is available [here](https://www.kaggle.com/datasets/farzadnekouei/trash-type-image-dataset/). It contains **2,527 images** distributed among six distinct classes:
* Cardboard: **403 images**;
* Glass: **501 images**;
* Metal: **410 images**;
* Paper: **594 images**;
* Plastic: **482 images**;
* Trash: **137 images**.

Before exploring the dataset, this notebook splits the images into training, validation, and test subsets. The split uses the **hold-out** method (70% train / 20% validation / 10% test), with stratification by class so that each subset keeps roughly the same class proportions as the full dataset.
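As a quick sanity check on the 70/20/10 hold-out, the expected partition sizes can be worked out by hand. The sketch below assumes the split fractions are truncated to whole image counts with `int()`, as the splitting cell in this notebook does; `TOTAL_IMAGES` is an illustrative name, not one defined by the notebook.

```python
# Illustrative arithmetic only: TOTAL_IMAGES is a hypothetical constant,
# set to the image count stated in the introduction.
TOTAL_IMAGES = 2527
VALIDATION_SPLIT = 0.2
TEST_SPLIT = 0.1

# Truncate to whole images, then give the remainder to the training set
test_size = int(TEST_SPLIT * TOTAL_IMAGES)
validation_size = int(VALIDATION_SPLIT * TOTAL_IMAGES)
train_size = TOTAL_IMAGES - validation_size - test_size

for name, size in [("train", train_size),
                   ("validation", validation_size),
                   ("test", test_size)]:
    print(f"{name:<10} {size:>5} images ({size / TOTAL_IMAGES:.1%})")
```

Because the fractions are truncated, the training set absorbs the rounding remainder, so it ends up slightly above 70%.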

The next notebook uses **TensorFlow** and **Keras** to load the training and validation partitions created here and prepare the images for training.

## Import libraries

In [128]:
import os
import kagglehub
import pandas as pd
import shutil

from sklearn.model_selection import train_test_split

## Define constants

In [129]:
KAGGLE_DATASET_TAG = 'farzadnekouei/trash-type-image-dataset'

CURRENT_PATH = os.getcwd()
RAW_DATASET_PATH = os.path.join(CURRENT_PATH, 'datasets', 'RAW_trash_type_dataset')
DATASET_PATH = os.path.join(CURRENT_PATH, 'datasets', 'trash_type_dataset')

VALIDATION_SPLIT = 0.2
TEST_SPLIT = 0.1

## Download the remote dataset to local directory

In [130]:
assert not os.path.isdir(RAW_DATASET_PATH) and not os.path.isdir(DATASET_PATH), "Dataset already exists."

# Downloading the dataset 
cache_path = kagglehub.dataset_download(KAGGLE_DATASET_TAG, force_download=True)
raw_dataset_path = os.path.join(cache_path, os.listdir(cache_path)[0])
print(f'Original dataset path: {raw_dataset_path}')

# Create the local folder that will hold the raw dataset
os.makedirs(RAW_DATASET_PATH)

# Move each class folder from the kagglehub cache into the local folder.
# shutil.move also works when the cache and the project are on different drives.
for folder in os.listdir(raw_dataset_path):
    folder_path = os.path.join(raw_dataset_path, folder)
    target_folder_path = os.path.join(RAW_DATASET_PATH, folder)
    shutil.move(folder_path, target_folder_path)

print(f"Imported dataset: {RAW_DATASET_PATH}")

Downloading from https://www.kaggle.com/api/v1/datasets/download/farzadnekouei/trash-type-image-dataset?dataset_version_number=1...
100%|██████████| 40.9M/40.9M [00:02<00:00, 20.4MB/s]
Extracting model files...
Original dataset path: C:\Users\Gabriel\.cache\kagglehub\datasets\farzadnekouei\trash-type-image-dataset\versions\1\TrashType_Image_Dataset
Imported dataset: c:\Users\Gabriel\OneDrive\Documentos\projetos\gabriel\fine-tuning-cnns-example\datasets\RAW_trash_type_dataset


## Building the Train/Validation/Test partitions

In [131]:
def read_directory(path):
    files = os.listdir(path)
    files_and_paths = [(file, os.path.join(path, file)) for file in files]
    return files_and_paths
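A small usage sketch of the helper on a throwaway directory may make its return value concrete. The class folder names below are hypothetical, and the `sorted()` call is an addition not present in the notebook's version: `os.listdir` returns entries in arbitrary order, so sorting makes the listing deterministic.

```python
import os
import tempfile


def read_directory(path):
    # Variant of the helper above; sorted() is an added assumption
    # so that the listing order does not depend on the filesystem.
    files = sorted(os.listdir(path))
    return [(file, os.path.join(path, file)) for file in files]


with tempfile.TemporaryDirectory() as root:
    # Hypothetical class folders, created only for this demonstration
    for class_name in ("paper", "glass"):
        os.makedirs(os.path.join(root, class_name))

    listing = read_directory(root)
    names = [name for name, _ in listing]
    paths_ok = all(path == os.path.join(root, name) for name, path in listing)

print(names)
```

Each element is a `(name, full_path)` pair, which is why the listing loop can unpack it directly into `class_name, class_path`.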

Listing all the images

In [132]:
dataset = {
    'class_name': [],
    'class_path': [],
    'image_name': [],
    'image_path': []
}

for class_name, class_path in read_directory(RAW_DATASET_PATH):
    for image, image_path in read_directory(class_path):
        dataset['class_name'].append(class_name)
        dataset['class_path'].append(class_path)
        dataset['image_name'].append(image)
        dataset['image_path'].append(image_path)

dataset = pd.DataFrame.from_dict(dataset)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2527 entries, 0 to 2526
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   class_name  2527 non-null   object
 1   class_path  2527 non-null   object
 2   image_name  2527 non-null   object
 3   image_path  2527 non-null   object
dtypes: object(4)
memory usage: 79.1+ KB


How many classes does the dataset have?

In [133]:
classes = dataset['class_name'].unique()
print(f"{len(classes)} classes: {classes}")

6 classes: ['cardboard' 'glass' 'metal' 'paper' 'plastic' 'trash']


Splitting the dataset

In [134]:
total = len(dataset)
train_validation_dataset, test_dataset = train_test_split(dataset, 
                                                          stratify=dataset['class_name'], 
                                                          test_size=int(TEST_SPLIT*total),
                                                          shuffle=True)
train_dataset, validation_dataset = train_test_split(train_validation_dataset, 
                                                     stratify=train_validation_dataset['class_name'],
                                                     test_size=int(VALIDATION_SPLIT*total),
                                                     shuffle=True)
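To illustrate what stratification buys here, the following standard-library sketch splits synthetic labels class by class and compares class shares. It is not scikit-learn's algorithm, only a simplified stand-in; `stratified_holdout`, `CLASS_COUNTS`, and the seed value are hypothetical names and choices, and the counts are the ones listed in the introduction.

```python
import random
from collections import Counter

# Class counts taken from the dataset description (hypothetical constant name)
CLASS_COUNTS = {"cardboard": 403, "glass": 501, "metal": 410,
                "paper": 594, "plastic": 482, "trash": 137}


def stratified_holdout(labels, test_fraction, seed=42):
    """Simplified stratified split: sample test_fraction from each class."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [name for name, count in CLASS_COUNTS.items() for _ in range(count)]
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.1)

test_counts = Counter(labels[i] for i in test_idx)
for name, count in CLASS_COUNTS.items():
    total_share = count / len(labels)
    test_share = test_counts[name] / len(test_idx)
    print(f"{name:<10} total = {total_share:.3f}  test = {test_share:.3f}")
```

The small `trash` class keeps roughly its original share in the test set, which a purely random split would not guarantee. With scikit-learn itself, passing a fixed `random_state` to `train_test_split` plays the same role as the seed above when a repeatable split is wanted.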

In [135]:
train_size = len(train_dataset)
val_size = len(validation_dataset)
test_size = len(test_dataset)

print(f"Total size =      {total} images")
print(f"Train size =      {train_size} images [{(train_size/total)*100:.0f}%]")
print(f"Validation size = {val_size} images [{(val_size/total)*100:.0f}%]")
print(f"Test size =       {test_size} images [{(test_size/total)*100:.0f}%]")

Total size =      2527 images
Train size =      1770 images [70%]
Validation size = 505 images [20%]
Test size =       252 images [10%]


Copying images

In [136]:
def copy_image(source_path: str, target_path: str):
    # Copy rather than move, so the raw dataset stays intact until it is removed at the end
    shutil.copy2(source_path, target_path)

def copying_dataset_to_partition_dir(dataset: pd.DataFrame, 
                                      title: str,
                                      output_path: str):
    output_path = os.path.join(output_path, title)
    assert not os.path.isdir(output_path), f'"{output_path}" directory already exists.'

    # Creating the dataset partition folder
    os.makedirs(output_path)
    
    for _, value in dataset.iterrows():
        image_path = value['image_path']
        class_name = value['class_name']
        image = value['image_name']

        # Create the class folder if it does not exist
        image_folder_path = os.path.join(output_path, class_name)
        os.makedirs(image_folder_path, exist_ok=True)

        # Copying the image
        image_output_path = os.path.join(image_folder_path, image)
        copy_image(image_path, image_output_path)

In [137]:
copying_dataset_to_partition_dir(dataset=train_dataset, 
                                  title="train", 
                                  output_path=DATASET_PATH)
copying_dataset_to_partition_dir(dataset=validation_dataset, 
                                  title="validation", 
                                  output_path=DATASET_PATH)
copying_dataset_to_partition_dir(dataset=test_dataset, 
                                  title="test", 
                                  output_path=DATASET_PATH)
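Before deleting anything, it can be worth confirming that the partition folders on disk contain the expected number of files. The sketch below assumes the `partition/class/image` layout produced above; `count_images` is a hypothetical helper, and the demo builds a tiny fake layout in a temporary directory instead of touching the real dataset.

```python
import os
import tempfile


def count_images(dataset_path):
    """Return {partition: {class_name: number_of_files}} for a dataset folder."""
    counts = {}
    for partition in sorted(os.listdir(dataset_path)):
        partition_path = os.path.join(dataset_path, partition)
        counts[partition] = {
            class_name: len(os.listdir(os.path.join(partition_path, class_name)))
            for class_name in sorted(os.listdir(partition_path))
        }
    return counts


with tempfile.TemporaryDirectory() as root:
    # Hypothetical miniature layout, used only to exercise the helper
    layout = {("train", "glass"): 3, ("train", "paper"): 2, ("test", "glass"): 1}
    for (partition, class_name), n_files in layout.items():
        folder = os.path.join(root, partition, class_name)
        os.makedirs(folder)
        for i in range(n_files):
            open(os.path.join(folder, f"img_{i}.jpg"), "w").close()

    counts = count_images(root)

print(counts)
```

On the real data, calling `count_images(DATASET_PATH)` and summing each partition should reproduce the train, validation, and test sizes printed earlier.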

Finally, remove the raw dataset folder, which is no longer needed once the partitions exist

In [138]:
shutil.rmtree(RAW_DATASET_PATH)