### PyTorch Dataset and DataLoader

- We ideally want our dataset preparation code to be decoupled from our model training code for better readability and modularity. 

- PyTorch provides two data primitives that help us do this:
 
    - torch.utils.data.DataLoader 
    - torch.utils.data.Dataset

What is Dataset?

- <b>Dataset</b> stores the samples and, optionally, their corresponding labels.

What is DataLoader?

- <b>DataLoader</b> wraps an iterable around the Dataset to enable easy access to the samples.
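
To make the division of labour concrete, here is a minimal sketch using a hypothetical toy dataset (the names features and labels and their values are illustrative assumptions, not data from this notebook). The Dataset is queried by length and index, while the DataLoader is simply iterated over.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 6 samples with 2 features each, plus one label per sample.
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

dataset = TensorDataset(features, labels)

# A Dataset answers "how many samples are there?" and "what is sample i?".
num_samples = len(dataset)
first_features, first_label = dataset[0]

# A DataLoader makes the Dataset iterable, here two samples at a time.
loader = DataLoader(dataset, batch_size=2)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]

print(num_samples, first_features, first_label, batch_shapes)
```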

In [53]:
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

In [54]:
x = torch.arange(12, dtype=torch.float16)

In [55]:
x

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.],
       dtype=torch.float16)

A DataLoader is an iterable, which we can use to step through the individual examples in the dataset. With the default batch_size of 1, each iteration yields a single example.

In [65]:
data_loader = DataLoader(x)

In [66]:
for item in data_loader:
    print(item)

tensor([0.], dtype=torch.float16)
tensor([1.], dtype=torch.float16)
tensor([2.], dtype=torch.float16)
tensor([3.], dtype=torch.float16)
tensor([4.], dtype=torch.float16)
tensor([5.], dtype=torch.float16)
tensor([6.], dtype=torch.float16)
tensor([7.], dtype=torch.float16)
tensor([8.], dtype=torch.float16)
tensor([9.], dtype=torch.float16)
tensor([10.], dtype=torch.float16)
tensor([11.], dtype=torch.float16)


### Creating batches of data

In [58]:
data_loader = DataLoader(x, batch_size=4, shuffle=True)

In [59]:
for i, batch in enumerate(data_loader):
    print(f'Batch : {i}', batch)

Batch : 0 tensor([6., 2., 1., 7.], dtype=torch.float16)
Batch : 1 tensor([ 9.,  5.,  8., 11.], dtype=torch.float16)
Batch : 2 tensor([10.,  0.,  3.,  4.], dtype=torch.float16)
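
The 12 samples above split evenly into batches of 4. As a small sketch of what happens when they do not, the example below uses a hypothetical 10-sample tensor (an assumption for illustration only). By default the final batch holds the leftover samples, and the drop_last argument of DataLoader discards it instead.

```python
import torch
from torch.utils.data import DataLoader

x10 = torch.arange(10, dtype=torch.float32)  # 10 samples, not a multiple of 4

# Default behaviour: the last batch holds the leftover samples.
sizes_default = [len(batch) for batch in DataLoader(x10, batch_size=4)]

# drop_last=True discards the incomplete final batch.
sizes_dropped = [len(batch) for batch in DataLoader(x10, batch_size=4, drop_last=True)]

print(sizes_default)  # [4, 4, 2]
print(sizes_dropped)  # [4, 4]
```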


### Creating batches of data with input (X) and target (y)

In [60]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10)

In [67]:
X.shape, y.shape

((10, 20), (10,))

In [69]:
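# combine X and y into a dataset; torch.tensor with an explicit dtype replaces the legacy torch.Tensor constructor
dataset = TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32))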

In [70]:
data_loader = DataLoader(dataset, batch_size=5)

In [71]:
for i, batch in enumerate(data_loader):
    print(f'Batch: {i} \n X: {batch[0]} , \n y: {batch[1]}')

Batch: 0 
 X: tensor([[-0.4712,  1.0072, -0.1915, -0.4781,  0.3216, -0.3398, -0.3595, -1.1020,
         -1.3714, -0.2125, -0.1812, -0.0144,  1.1197,  2.1346,  0.9589, -0.2670,
          0.3318,  0.5929,  0.4351,  0.3562],
        [ 0.0534,  1.3810, -1.5928,  1.1625, -0.9294, -0.0450,  0.0813, -1.3332,
         -0.2898,  0.7793,  0.0826, -0.1925, -0.0047, -1.4658, -1.5462,  0.1317,
          0.2949,  0.0969, -0.5668, -0.1134],
        [ 0.9616,  2.2366,  0.9620,  1.4451, -1.4860, -0.3851,  1.7846,  2.8822,
         -0.5535, -1.1474,  0.0163,  0.7611,  2.6891,  0.3897,  1.6439, -2.6112,
         -0.5665, -1.4221,  3.3794, -1.5652],
        [ 0.6341,  0.9544,  0.0150,  2.6025,  0.4379, -0.5392,  0.1105, -0.9531,
         -1.8435, -0.5067,  0.5749,  0.0611,  1.5391,  0.9683,  0.7006,  0.2990,
          0.8367,  0.9405,  0.8352, -0.2220],
        [ 0.7500, -0.4814, -0.4454,  0.2543, -1.3657, -0.8532, -1.2005, -1.0224,
         -0.2191, -0.2908,  1.5124, -0.1360,  0.0861,  0.1592, -0.6450, -

### Custom Dataset class
Reasons you may want to write a custom Dataset class:
 - The samples need some special handling before they can be used.
 - The data is read from a database or from disk, and you want to keep only a few samples in memory rather than load everything up front.
 - You want to apply augmentation, which is common in image tasks.
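
The sketch below illustrates these points under stated assumptions: LazyFileDataset, add_noise, and the temporary .pt files are all hypothetical stand-ins, not part of this notebook. Only the file paths are kept in memory, each sample is read from disk when it is requested, and an optional transform plays the role of augmentation.

```python
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class LazyFileDataset(Dataset):
    """Hypothetical dataset that keeps only file paths in memory."""

    def __init__(self, paths, labels, transform=None):
        self.paths = paths
        self.labels = labels
        self.transform = transform  # optional per-sample augmentation

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The sample is read from disk only when it is requested.
        sample = torch.load(self.paths[idx])
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, self.labels[idx]


# Write a few small tensors to a temporary directory to stand in for real data.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(6):
    path = os.path.join(tmp_dir, f"sample_{i}.pt")
    torch.save(torch.full((3,), float(i)), path)
    paths.append(path)
labels = [i % 2 for i in range(6)]


def add_noise(sample):
    # Toy augmentation: add a small amount of Gaussian noise.
    return sample + 0.01 * torch.randn_like(sample)


lazy_dataset = LazyFileDataset(paths, labels, transform=add_noise)
lazy_loader = DataLoader(lazy_dataset, batch_size=3)

for xb, yb in lazy_loader:
    print(xb.shape, yb)
```

Because the transform runs inside the indexing method, a different random perturbation is applied every time a sample is fetched, which is the usual pattern for augmentation.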

By default, DataLoader expects its first argument to support len() and integer indexing (dataset[idx]).
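
As a minimal sketch of that requirement, the hypothetical SquaresDataset below (an illustrative name, not a PyTorch class) does not subclass Dataset at all. It only provides a length and integer indexing, which is enough for DataLoader's default sampling and batching in this simple case.

```python
from torch.utils.data import DataLoader


class SquaresDataset:
    """Hypothetical indexable object that does not subclass Dataset."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx * idx


squares_loader = DataLoader(SquaresDataset(6), batch_size=4)
squares_batches = [batch.tolist() for batch in squares_loader]
print(squares_batches)  # [[0, 1, 4, 9], [16, 25]]
```

Subclassing Dataset, as in the next cell, is still the conventional choice because it documents intent and keeps the class compatible with other utilities in torch.utils.data.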

In [47]:
class CustomDataset(Dataset):
    def __init__(self, X, y):
        # convert into PyTorch tensors
        self.X = torch.tensor(X, dtype=torch.float16)
        self.y = torch.tensor(y, dtype=torch.float16)
 
    def __len__(self):
        # this should return the size of the dataset
        return len(self.X)
 
    def __getitem__(self, idx):
        # this should return one (features, target) pair for the given index
        features = self.X[idx]
        target = self.y[idx]
        return features, target

In [48]:
# set up DataLoader using custom Dataset class
dataset = CustomDataset(X, y)
data_loader = DataLoader(dataset, shuffle=True, batch_size=5)

In [49]:
for i, batch in enumerate(data_loader):
    print(f'Batch: {i} \n X: {batch[0]} , \n y: {batch[1]}')

Batch: 0 
 X: tensor([[-1.5010, -2.0703,  0.3979, -2.5938, -0.5840,  0.4141,  0.3630,  0.2273,
         -0.1897,  1.5000,  1.4629,  1.1270, -0.7510, -0.4087, -1.9150, -0.2410,
          0.6543,  0.6611,  0.1081,  0.6118],
        [-0.4355, -0.2539, -0.1873,  1.1973,  1.3535, -0.0685, -0.0800, -2.4082,
         -1.9092,  0.5386,  0.3604, -0.1027,  0.2064,  1.0098,  1.1152,  0.3284,
         -0.5991, -1.1025, -0.5576,  0.1531],
        [ 0.8950,  1.2412,  0.9756,  0.9302, -0.7598, -0.9756,  0.2666,  0.6006,
          0.7017, -0.7534, -0.9595, -0.6294, -0.5112, -1.2471,  0.3879,  1.4053,
          1.1123, -0.8330, -1.0215, -0.3982],
        [ 0.3157,  0.5942,  1.0068,  0.2452,  0.0422,  0.0092, -0.6616,  0.5718,
         -0.0480,  1.3467, -1.3418, -0.3171, -0.7725,  2.3457,  0.7930,  1.0645,
          0.2957, -0.9629,  1.1211, -0.5234],
        [-0.3806,  0.8228, -0.4451,  1.8955,  1.2881, -0.3262, -0.7075,  2.2871,
          0.5083,  0.5308,  0.2773, -0.3442, -0.5596, -0.1313,  0.6572, -