# Modern Problem Requires Modern Solution

If you're working on a classification problem with your own dataset, or with a dataset that ships in its native format (jpg, bmp, etc.), and PyTorch is your main weapon, you'll most likely feel that **DatasetFolder** and **ImageFolder** are not good enough. Neither is vanilla **torch.utils.data.Dataset**. This library attempts to bridge that gap by extending **torch.utils.data.Dataset** so that you can Extract, Transform, and Load your data effectively.

---------------------------------------------------------------------------------------------------------

The first step in training a classifier is to prepare the dataset. By default, PyTorch (through torchvision) provides an abstraction for this process in **DatasetFolder** and **ImageFolder**. However, they require us to arrange our folders this way:
##### root/class_x/xxx.ext
##### root/class_x/xxy.ext
##### root/class_x/xxz.ext
##### root/class_y/123.ext
##### root/class_y/nsdf3.ext
##### root/class_y/asd932_.ext
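
As a rough illustration of why this layout matters: torchvision's **ImageFolder** derives class indices from the sorted names of the subdirectories directly under the root. The standard-library sketch below mimics that mapping on a throwaway directory. `class_to_index` is a hypothetical helper written for this example, not part of torchvision or ETL.

```python
import tempfile
from pathlib import Path


def class_to_index(root: Path) -> dict:
    """Map each immediate subdirectory name to an index, in sorted order."""
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    return {name: i for i, name in enumerate(classes)}


# Build a tiny root/class_x/..., root/class_y/... layout in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for cls, name in [("class_x", "xxx.ext"), ("class_y", "123.ext")]:
        (root / cls).mkdir()
        (root / cls / name).touch()
    mapping = class_to_index(root)

print(mapping)
```

Any file that does not sit exactly one level below a class folder falls outside this scheme, which is the limitation the rest of this document works around.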

Most of the time, we don't have that, especially if we're using our own dataset (might be from scraping or a gift from someone). So we have to do something.

The first, naive way is to move folders manually or with scripts (Python, bash, or anything else you're comfortable with).

It's kind of cool until we

1. Have terabytes of data
2. Want to partition it into train, validation, and/or test sets
3. Want to explore/debug our dataset
4. Want others to reproduce our results
    
The second option is to write your own custom Dataset, subclassed from **torch.utils.data.Dataset**. This approach is much simpler and cleaner than the naive way, but we would still have problems 2, 3, and 4:

2. Want to partition it into train, validation, and/or test sets
3. Want to explore/debug our dataset
4. Want others to reproduce our results
    
These problems led to the design of ETL, a library that does Extract, Transform, and Load on the fly. In a nutshell, Extract reads all the images under a parent directory, partitions them into train, validation, and test sets, and stores each set in a csv file with two columns: image path and encoded label. After that, TransformAndLoad feeds those samples efficiently to your classifier.

By the way, the reason I use csv rather than txt is that pandas parses CSV out of the box. Using txt would be a little more complicated, but maybe I'll add that feature in the future.
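
The Extract idea described above can be sketched with the standard library alone. This is not the library's implementation, just a minimal illustration under stated assumptions: `sketch_extract` and `write_splits` are hypothetical names, the label is taken from a directory name in the path, the remainder after the training split is halved into validation and test, and the CSVs are written without a header or index column.

```python
import csv
import random
import tempfile
from pathlib import Path


def sketch_extract(parent, extension, labels, training_size, random_state):
    """Return {'train': rows, 'validation': rows, 'test': rows}.

    Each row is (path relative to parent, label index).
    """
    rows = []
    for path in sorted(parent.rglob(f"*.{extension}")):
        rel = path.relative_to(parent)
        # Assumption: the label is the first entry of `labels` that appears
        # as a component of the relative path.
        for index, label in enumerate(labels):
            if label in rel.parts:
                rows.append((rel.as_posix(), index))
                break
    random.Random(random_state).shuffle(rows)  # seeded, so reproducible
    n_train = int(len(rows) * training_size)
    # Assumption: the remainder is halved into validation and test.
    n_val = (len(rows) - n_train) // 2
    return {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }


def write_splits(splits, prefix, save_path):
    """Write one CSV per split, named <prefix>_<split>.csv."""
    save_path.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(save_path / f"{prefix}_{name}.csv", "w", newline="") as f:
            csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    parent = Path(tmp)
    # Ten empty placeholder images spread over two label folders.
    for i in range(10):
        label = "attack" if i % 2 else "real"
        folder = parent / "ori" / label
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{i}.jpg").touch()
    splits = sketch_extract(parent, "jpg", ["attack", "real"], 0.8, 69)
    write_splits(splits, "combined", parent / "combined")
    written = sorted(p.name for p in (parent / "combined").glob("*.csv"))

print({name: len(rows) for name, rows in splits.items()})
print(written)
```

No image is moved or copied at any point. Only paths and labels are recorded, which is what keeps the approach cheap on large datasets and repeatable for a fixed seed.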

# Pt. 1 Extract

In [1]:
from pathlib import Path
from torchvision import transforms
from etl.etl import Extract, TransformAndLoad
import pandas as pd

In [2]:
parent_directory = Path.cwd() / 'etl' / 'data' 
print(parent_directory)

/Users/jedi/Repo/GitHub/etl/etl/data


In [3]:
combined_dataset = Extract(parent_directory=parent_directory,
                           extension='jpg',
                           labels=['attack', 'real'],
                           training_size=0.8,
                           random_state=69,
                           verbose=True)

In [4]:
help(combined_dataset)

Help on Extract in module etl.etl object:

class Extract(etl.base.dataset.BaseDataset)
 |  Extract(parent_directory: str, extension: str, labels: List[str], training_size: float, random_state: int, verbose: bool) -> None
 |  
 |  Method resolution order:
 |      Extract
 |      etl.base.dataset.BaseDataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, parent_directory: str, extension: str, labels: List[str], training_size: float, random_state: int, verbose: bool) -> None
 |      Class for creating csv files of train, validation, and test
 |      
 |      Parameters
 |      ----------
 |      parent_directory
 |              The parent_directory folder path. It is highly recommended to use Pathlib
 |      extension
 |              The extension we want to include in our search from the parent_directory directory
 |      labels
 |      
 |      Returns
 |      -------
 |      None
 |  
 |  extract(self, file_prefix: str, save_path: str)
 |      Create csv

From the help output above we know that Extract inherits from BaseDataset, which provides the show_files method. Let's print the first 5 files.

In [5]:
combined_dataset.show_files(5)

[PosixPath('/Users/jedi/Repo/GitHub/etl/etl/data/ori/attack/hand/112/4187.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/etl/etl/data/ori/attack/hand/112/10099.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/etl/etl/data/ori/attack/hand/112/3159.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/etl/etl/data/ori/attack/hand/112/6021.jpg'),
 PosixPath('/Users/jedi/Repo/GitHub/etl/etl/data/ori/attack/hand/112/1028.jpg')]

Now that we know everything went well, it's time to save the splits to our desired path.

In [6]:
combined_dataset.extract(file_prefix="combined", save_path="etl/data/combined")

Finished creating whole dataset array
Finished splitting dataset into train, validation, and test
Finished writing combined_train.csv into /Users/jedi/Repo/GitHub/etl/etl/data/combined
Finished writing combined_validation.csv into /Users/jedi/Repo/GitHub/etl/etl/data/combined
Finished writing combined_test.csv into /Users/jedi/Repo/GitHub/etl/etl/data/combined


# Pt. 2 Transform and Load

In [7]:
combined_dataset = Path.cwd() / 'etl' / 'data' / 'combined'
train_dataset_path = combined_dataset / 'combined_trains.csv'

In [8]:
data_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    # ImageNet channel means and standard deviations
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
])

The next cell reports a FileNotFoundError. We slipped up a little: the path in In [7] says combined_trains.csv, but Extract wrote combined_train.csv.

In [9]:
train_dataset = TransformAndLoad(parent_directory=parent_directory, 
                                extension="jpg", 
                                csv_file=train_dataset_path, 
                                transform=data_transform)

/Users/jedi/Repo/GitHub/etl/etl/data/combined/combined_trains.csv does not exist
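
A mistyped file name like this is easy to catch by listing the CSV files that actually exist before constructing the dataset. The sketch below uses only `pathlib`. `resolve_csv` is a hypothetical helper, not part of ETL, and the temporary directory stands in for etl/data/combined.

```python
import tempfile
from pathlib import Path


def resolve_csv(directory: Path, name: str) -> Path:
    """Return directory/name, or raise with the CSVs that do exist."""
    candidate = directory / name
    if not candidate.is_file():
        available = sorted(p.name for p in directory.glob("*.csv"))
        raise FileNotFoundError(f"{candidate} not found; available: {available}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    combined = Path(tmp)
    (combined / "combined_train.csv").touch()
    try:
        resolve_csv(combined, "combined_trains.csv")  # the misspelled name
    except FileNotFoundError as err:
        message = str(err)
    found = resolve_csv(combined, "combined_train.csv").name

print(message)
print(found)
```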


In [10]:
train_dataset_path = combined_dataset / 'combined_train.csv'

train_dataset = TransformAndLoad(parent_directory=parent_directory, 
                                extension="jpg", 
                                csv_file=train_dataset_path, 
                                transform=data_transform)

Finally!!!

Now let's see what we can do with train_dataset.

In [11]:
help(train_dataset)

Help on TransformAndLoad in module etl.etl object:

class TransformAndLoad(torch.utils.data.dataset.Dataset)
 |  TransformAndLoad(parent_directory: str, extension: str, csv_file: str, transform: Callable = None) -> None
 |  
 |  An abstract class representing a Dataset.
 |  
 |  All other datasets should subclass it. All subclasses should override
 |  ``__len__``, that provides the size of the dataset, and ``__getitem__``,
 |  supporting integer indexing in range from 0 to len(self) exclusive.
 |  
 |  Method resolution order:
 |      TransformAndLoad
 |      torch.utils.data.dataset.Dataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, idx: int) -> Tuple[numpy.ndarray, numpy.ndarray]
 |      Return the X and y of a specific instance based on the index
 |      
 |      Parameters
 |      ----------
 |      idx
 |              The index of the instance 
 |      
 |      Returns
 |      -------
 |      Tuple of X and y of a specific instance
 |  
 |  _

So by using the `__getitem__` method we can inspect the X and y of any instance we want.

In [12]:
train_dataset.__getitem__(0)

(tensor([[[-0.5082, -0.5082, -0.5082,  ...,  0.9132,  0.9132,  0.8961],
          [-0.5082, -0.5082, -0.4911,  ...,  0.9132,  0.9132,  0.8961],
          [-0.5424, -0.5253, -0.5082,  ...,  0.9132,  0.9132,  0.8961],
          ...,
          [ 0.0056, -0.0116, -0.0458,  ...,  0.4337,  0.4851,  0.4679],
          [-0.0116, -0.0287, -0.0629,  ...,  0.4166,  0.4679,  0.4508],
          [-0.0458, -0.0629, -0.1143,  ...,  0.3994,  0.4508,  0.4337]],
 
         [[-0.3375, -0.3375, -0.3375,  ...,  1.1155,  1.1155,  1.0980],
          [-0.3375, -0.3375, -0.3200,  ...,  1.1155,  1.1155,  1.0980],
          [-0.3725, -0.3550, -0.3375,  ...,  1.1155,  1.1155,  1.0980],
          ...,
          [ 0.1352,  0.1176,  0.0826,  ...,  0.6954,  0.7479,  0.7304],
          [ 0.1176,  0.1001,  0.0651,  ...,  0.6779,  0.7304,  0.7129],
          [ 0.0826,  0.0651,  0.0126,  ...,  0.6604,  0.7129,  0.6954]],
 
         [[ 0.1302,  0.1302,  0.1302,  ...,  1.5942,  1.5942,  1.5768],
          [ 0.1302,  0.1302,
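
Calling `__getitem__` directly works, but Python's indexing syntax calls the same method, so `train_dataset[0]` is the more usual spelling. The sketch below shows the map-style protocol (`__len__` plus `__getitem__`) on a plain-Python stand-in. `CsvSamples` is hypothetical: it returns the path and label only, whereas TransformAndLoad returns the transformed image.

```python
import csv
import io


class CsvSamples:
    """Plain-Python stand-in for a map-style dataset backed by CSV rows."""

    def __init__(self, csv_text: str):
        # Each row is "relative/path.jpg,label" (no header, no index column).
        reader = csv.reader(io.StringIO(csv_text))
        self.rows = [(path, int(label)) for path, label in reader]

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        return self.rows[idx]


samples = CsvSamples("ori/attack/hand/115/4030.jpg,0\nori/real/115/2079.jpg,1\n")

print(len(samples))
print(samples[0] == samples.__getitem__(0))  # indexing dispatches to __getitem__
print(samples[1])
```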

Since we have the csv file, we could easily inspect our training dataset

In [13]:
import pandas as pd

In [15]:
train_df = pd.read_csv(train_dataset_path, header=None)

In [16]:
train_df

Unnamed: 0,0,1
0,ori/attack/hand/115/4030.jpg,0
1,ori/attack/fixed/112/4030.jpg,0
2,style/attack/hand/112/9111.jpg,0
3,style/attack/fixed/112/4144.jpg,0
4,style/attack/hand/112/2045.jpg,0
5,ori/attack/fixed/112/2131.jpg,0
6,ori/attack/hand/112/9070.jpg,0
7,ori/real/115/2079.jpg,1
8,ori/attack/hand/115/3039.jpg,0
9,style/attack/hand/115/9070.jpg,0


This is very handy. For instance, suppose we want to make sure that every path containing "attack" has a consistent label:

In [17]:
train_df[train_df[0].str.contains('attack')][0] == train_df[train_df[1] == 0][0]

0      True
1      True
2      True
3      True
4      True
5      True
6      True
8      True
9      True
10     True
11     True
14     True
15     True
16     True
19     True
20     True
21     True
23     True
25     True
26     True
27     True
28     True
29     True
30     True
31     True
32     True
33     True
35     True
36     True
37     True
       ... 
324    True
325    True
326    True
328    True
332    True
333    True
334    True
335    True
336    True
339    True
341    True
342    True
345    True
346    True
348    True
349    True
352    True
353    True
355    True
356    True
357    True
358    True
359    True
361    True
363    True
364    True
365    True
366    True
367    True
368    True
Name: 0, Length: 248, dtype: bool

In [18]:
len(train_df[train_df[0].str.contains('attack')][0]) == len(train_df[train_df[1] == 0][0])

True

Now we have confirmed that the files whose paths contain "attack" have a consistent label.
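
The element-wise comparison in In [17] only lines up when both filters select exactly the same rows. A more direct way to state the check is row by row: the path contains "attack" exactly when the label is 0. Here is a standard-library sketch on a few rows taken from the table above. `find_mismatches` is a hypothetical helper, and the sample text assumes a CSV without a header or index column.

```python
import csv
import io

ATTACK_LABEL = 0  # index of 'attack' in labels=['attack', 'real']


def find_mismatches(csv_text: str):
    """Return rows where 'attack' in the path disagrees with the label."""
    mismatches = []
    for path, label in csv.reader(io.StringIO(csv_text)):
        has_attack = "attack" in path
        if has_attack != (int(label) == ATTACK_LABEL):
            mismatches.append((path, int(label)))
    return mismatches


good = "ori/attack/hand/115/4030.jpg,0\nori/real/115/2079.jpg,1\n"
bad = good + "style/attack/hand/112/9111.jpg,1\n"  # deliberately mislabeled

print(find_mismatches(good))
print(find_mismatches(bad))
```

Returning the offending rows, rather than a single boolean, points straight at the files to look at when a label is wrong.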

In [20]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

With this, we have successfully created a train DataLoader. While this may look like a long process, in practice we're saving quite a lot of time because we're not moving any files whatsoever. We also get a more consistent and reproducible dataset. On top of that, debugging the dataset is much easier than with the naive method.