# DataSetDumper Demo
`DataSetDumper` serializes a DataSet to disk, automatically writing each item into a folder named after its class label.
The resulting files provide the list of filenames that serves as the starting point for the pipeline workflow.

# Prerequisites

In [1]:
import numpy as np
from pathlib import Path
from collections import Counter

import torch
from torch.utils.data import DataLoader

import torchvision
from torchvision.datasets import CIFAR10
from torchvision.transforms import Compose, ToTensor


from transights.utils import DataSetDumper
from transights.utils import FolderScanner as fs



random_state = 23

In [2]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Running on device:", DEVICE.upper())

Running on device: CUDA


In [3]:
ROOT_PATH = Path.home() / "Downloads" / "data"

DATA_PATH = ROOT_PATH / "CIFAR10"

DATA_PATH_TEST = Path(DATA_PATH, "test")
DATA_PATH_TEST

WindowsPath('C:/Users/bernh/Downloads/data/CIFAR10/test')

In [5]:
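import ssl
# Workaround for the following error when trying to download the dataset:
# SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1007)
# Caution: this disables HTTPS certificate verification for the standard library's
# HTTPS connections in this session, so use it only for downloads you trust.
ssl._create_default_https_context = ssl._create_unverified_context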

## Create CIFAR10 dataset organized in subfolders indicating class
Note that `torchvision` only provides a DataSet class, not image files organized in folders that correspond to their class labels.
`DataSetDumper` takes care of this automatically.
A simple check for whether the result directory already exists avoids creating the image files again on every run.

In [6]:
transform = Compose(
    [
        ToTensor(),
    ]
)

In [7]:
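# Load the CIFAR10 test split (downloaded on the first run).
dataset = CIFAR10(root=DATA_PATH, train=False, transform=transform, download=True)

# Dump the images into class-label subfolders only if that has not been done yet.
if not DATA_PATH_TEST.exists():
    DataSetDumper(dataset, DATA_PATH_TEST).dump()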

Files already downloaded and verified
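
To make the idea of dumping into class folders concrete, here is a minimal sketch of how such a step could work. It is not the `transights` implementation: the function `dump_by_label`, its parameters, and the `<label>/<index>.png` layout are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, List, Tuple


def dump_by_label(
    samples: Iterable[Tuple[Any, Any]],
    out_dir,
    save_fn: Callable[[Any, Path], None],
    extension: str = ".png",
) -> List[Path]:
    """Write every (item, label) pair to out_dir/<label>/<index><extension>.

    Hypothetical helper, not part of transights. `save_fn` is supplied by
    the caller and decides how a single item is written to disk.
    """
    out_dir = Path(out_dir)
    written = []
    for index, (item, label) in enumerate(samples):
        # One subfolder per class label, created on demand.
        class_dir = out_dir / str(label)
        class_dir.mkdir(parents=True, exist_ok=True)
        target = class_dir / f"{index}{extension}"
        save_fn(item, target)
        written.append(target)
    return written
```

With the `dataset` from the cell above, `torchvision.utils.save_image` could probably serve as `save_fn`, since it accepts an image tensor and a target path. Passing the save step in as a function keeps the folder logic independent of the image format.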


### Take a look at the results

In [10]:
files = fs.get_files(DATA_PATH_TEST, extensions='.png', recursive=True)
len(files)

10000

In [11]:
files[0]

WindowsPath('C:/Users/bernh/Downloads/data/CIFAR10/test/0/10.png')
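
As a further sanity check, the files can be counted per class folder. The sketch below uses only the standard library rather than `FolderScanner`; the helper `count_files_per_class` is a hypothetical name, and it assumes that the parent folder name of each file is its class label, as in the path shown above.

```python
from collections import Counter
from pathlib import Path


def count_files_per_class(root, extension=".png"):
    """Count files below `root`, grouped by the name of their parent folder.

    Hypothetical helper: assumes a <root>/<label>/<file> layout.
    """
    root = Path(root)
    return Counter(
        path.parent.name
        for path in root.rglob(f"*{extension}")
        if path.is_file()
    )
```

Calling `count_files_per_class(DATA_PATH_TEST)` should then return one entry per class folder. For the CIFAR10 test split, one would expect 1000 files for each of the 10 classes.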