<h1>Dataset and DataLoader classes</h1> 



<h3 span style='color:yellow'>Computing gradients over an entire dataset at once can be slow and memory-intensive, especially when the dataset is large.</h3>

<h3 span style='color:yellow'>To manage a large dataset, a common practice is to divide it into batches of samples.</h3>

<h3 span style='color:yellow'>Accordingly, training runs in a loop over epochs, and within each epoch an inner loop iterates over the batches, computing the gradient on one batch at a time.</h3>

<h3 span style='color:yellow'><span style='color:lightgreen'>Note:</span> One epoch (epoch=1) means one forward and backward pass over all training samples.</h3>

<h3 span style='color:yellow'><span style='color:lightgreen'>Note:</span> The batch size is the number of training samples in one forward and backward pass.</h3>

<h3 span style='color:yellow'><span style='color:lightgreen'>Note:</span> The number of iterations is the number of forward and backward passes, each of which processes one batch of samples.</h3>

<ul>
  <li style="color:lightgreen;"><span style="font-size:18px;">e.g. if we have 200 samples, and the batch size is 20, then 200/20 = 10 iterations are needed to complete one epoch.</span></li>
</ul>
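The arithmetic in the example above can be checked in a few lines. This is a minimal sketch with illustrative variable names; `math.ceil` is used on the assumption that a final partial batch (when the sample count is not a multiple of the batch size) still counts as one iteration.

```python
import math

n_samples = 200   # total training samples (from the example above)
batch_size = 20   # samples per forward/backward pass

# Iterations needed to see every sample once, i.e. one epoch
iterations_per_epoch = math.ceil(n_samples / batch_size)
print(iterations_per_epoch)  # 10

# When the division is not exact, the last batch is smaller but still counts
uneven = math.ceil(150 / 4)
print(uneven)  # 38
```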

<h3 span style='color:yellow'>The PyTorch Dataset and DataLoader classes automate batching and iteration over the batches.</h3>




In [7]:
# Traditional method for reading a dataset
import numpy as np
 
# Column names: Date,Open,High,Low,Close,Volume,Adj
data=np.loadtxt("google_stock_example.csv", delimiter=',', skiprows=1, usecols=(1, 2,4)) # select numeric columns only; the Date column cannot be parsed as a float

# Let's assume we train for 5 epochs to optimize the model parameters on our data
for epoch in range(5):
    x,y,z=data.T

# The loop above processes the entire dataset in one pass per epoch, which becomes slow for large datasets. Therefore, we divide the dataset into smaller batches

total_batches=20
for epoch in range(5):
    for i in range(total_batches):
        #x_batch,y_batch,z_batch=
        ...
#The nested loop above is just for illustration purposes
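The placeholder in the nested loop above could be filled in by slicing the array by row. The following is a sketch only: the array is a made-up stand-in for the CSV data, and `batch_size` and `batch_shapes` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical stand-in for the CSV data: 100 rows, 3 numeric columns
data = np.arange(300, dtype=np.float64).reshape(100, 3)

batch_size = 5
total_batches = int(np.ceil(len(data) / batch_size))  # 20 batches

batch_shapes = []
for epoch in range(2):
    for i in range(total_batches):
        start = i * batch_size
        batch = data[start:start + batch_size]  # slice one batch of rows
        x_batch, y_batch, z_batch = batch.T     # one array per column
        batch_shapes.append(x_batch.shape)
        # forward pass, loss, and parameter update would go here
```

Writing this bookkeeping by hand (plus shuffling and handling a partial last batch) is what `DataLoader` takes care of.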

In [2]:
# Dataset and DataLoader classes
import torch,torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math
import pandas as pd




In [13]:
# Let's generate synthetic data for classification purposes
np.random.seed(1234)
n_samples=150
labels=np.array([0]*50+[1]*50+[2]*50)
np.random.shuffle(labels)

feature1=np.random.rand(n_samples)
feature2=np.random.rand(n_samples)
feature3=np.random.rand(n_samples)

data=pd.DataFrame({"label":labels, "feature1":feature1,"feature2":feature2,"feature3":feature3})
data = data[['label', 'feature1', 'feature2', 'feature3']]
data.to_csv("synthetic_classification_data.csv",index=False)

In [15]:
# In the cell above, we generated data for classification. Our goal is to load this data using NumPy,
# taking into account that the first column contains the class labels and the header row needs to be skipped

class myDataset(Dataset):
    def __init__(self):
        Xy=np.loadtxt("synthetic_classification_data.csv", delimiter=',', skiprows=1, dtype=np.float64)
        self.X=Xy[:,1:]
        self.y=Xy[:,[0]] # indexing with [0] keeps a 2D array of shape [n_samples, 1]
        # convert to torch tensor
        self.X=torch.from_numpy(self.X)
        self.y=torch.from_numpy(self.y)
        self.n_samples=Xy.shape[0]        
        
    def __len__(self): # This returns the number of samples in the dataset
        return self.n_samples
    
    def __getitem__(self, index):  # This allows indexing
        return self.X[index],self.y[index]    # This returns a tuple of tensors for features and labels



In [17]:
dataset=myDataset()
first_sample=dataset[0]
# We can unpack the tuple of tensor as follows
features,labels=first_sample
print(f'Features: {features} & label: {labels}')

Features: tensor([0.3257, 0.6295, 0.2986], dtype=torch.float64) & label: tensor([1.], dtype=torch.float64)
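The same pattern can be tried without a CSV file. The sketch below is a hypothetical in-memory variant of the class above (`ArrayDataset` and the toy rows are made up for illustration); it only shows that a map-style dataset needs `__len__` and `__getitem__`.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ArrayDataset(Dataset):
    """Hypothetical in-memory variant of the CSV-backed dataset above."""

    def __init__(self, Xy):
        # Column 0 holds the label, the remaining columns hold the features
        self.X = torch.from_numpy(Xy[:, 1:])
        self.y = torch.from_numpy(Xy[:, [0]])  # keep shape [n_samples, 1]
        self.n_samples = Xy.shape[0]

    def __len__(self):
        return self.n_samples

    def __getitem__(self, index):
        return self.X[index], self.y[index]

# Made-up rows: label, feature1, feature2, feature3
Xy = np.array([[0.0, 0.1, 0.2, 0.3],
               [1.0, 0.4, 0.5, 0.6],
               [2.0, 0.7, 0.8, 0.9]])
toy = ArrayDataset(Xy)
features, label = toy[1]
print(len(toy), features.shape, label.shape)
```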


In [44]:
# DataLoader
dataloader=DataLoader(dataset=dataset,batch_size=4, shuffle=True,num_workers=2) # num_workers=2 loads batches in two worker subprocesses, which can speed up loading for larger datasets

# To see how the dataloader works, we convert it to an iterator
dataiter=iter(dataloader)
data=next(dataiter)
# We can unpack the features and labels of the first batch. To see the next batch, call next(dataiter) again
features,labels=data
print(f'Features: {features} & label: {labels}') # We observe that there are 4 samples and 4 labels because we set the batch size to 4

Features: tensor([[0.5093, 0.3631, 0.1693],
        [0.1715, 0.7793, 0.3455],
        [0.8024, 0.2242, 0.1454],
        [0.1262, 0.8854, 0.6690]], dtype=torch.float64) & label: tensor([[0.],
        [2.],
        [2.],
        [0.]], dtype=torch.float64)


In [46]:
# We can print the next batch, and so on.
data=next(dataiter)
# Unpack the features and labels of the second batch
features,labels=data
print(f'Features: {features} & label: {labels}') # Again, there are 4 samples and 4 labels because the batch size is 4

Features: tensor([[0.5451, 0.5367, 0.5748],
        [0.9745, 0.2859, 0.5671],
        [0.0420, 0.1787, 0.1749],
        [0.6124, 0.5163, 0.3927]], dtype=torch.float64) & label: tensor([[1.],
        [0.],
        [0.],
        [0.]], dtype=torch.float64)


In [47]:
# Perform a dummy training loop
NUM_EPOCH=5
TOTAL_SAMPLES=len(dataset)
NUM_ITERATION=math.ceil(TOTAL_SAMPLES/4) # 4 is the batch size; math.ceil returns the smallest integer greater than or equal to its argument, e.g. ceil(4.2)=5
print(f'Total samples: {TOTAL_SAMPLES} & Number of iterations: {NUM_ITERATION}')

Total samples: 150 & Number of iterations: 38
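The iteration count does not have to be computed by hand: calling `len()` on a `DataLoader` over a map-style dataset returns the number of batches. The sketch below uses placeholder tensors of the same size as the synthetic data (the `TensorDataset` and its contents are assumptions for illustration) and also shows the effect of `drop_last=True`.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors with the same sizes as the synthetic data: 150 samples, 3 features
toy_dataset = TensorDataset(torch.rand(150, 3), torch.zeros(150, 1))

batch_size = 4
loader = DataLoader(toy_dataset, batch_size=batch_size, shuffle=True)
loader_drop = DataLoader(toy_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

print(len(loader))       # 38 batches, the last one holds only 2 samples
print(len(loader_drop))  # 37 batches, the partial batch is discarded
```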


In [48]:
for epoch in range(NUM_EPOCH):
    for i, (inputs,labels) in enumerate(dataloader): # named inputs to avoid shadowing the built-in input()
        # Suppose we've already implemented the forward propagation, loss calculation, and weight update steps.
        if (i+1) % 2 ==0:
            print(f'epoch: {epoch+1}/{NUM_EPOCH}, step: {i+1}/{NUM_ITERATION}, input: {inputs.shape}')

epoch: 1/5, step: 2/38, input: torch.Size([4, 3])
epoch: 1/5, step: 4/38, input: torch.Size([4, 3])
epoch: 1/5, step: 6/38, input: torch.Size([4, 3])
epoch: 1/5, step: 8/38, input: torch.Size([4, 3])
epoch: 1/5, step: 10/38, input: torch.Size([4, 3])
epoch: 1/5, step: 12/38, input: torch.Size([4, 3])
epoch: 1/5, step: 14/38, input: torch.Size([4, 3])
epoch: 1/5, step: 16/38, input: torch.Size([4, 3])
epoch: 1/5, step: 18/38, input: torch.Size([4, 3])
epoch: 1/5, step: 20/38, input: torch.Size([4, 3])
epoch: 1/5, step: 22/38, input: torch.Size([4, 3])
epoch: 1/5, step: 24/38, input: torch.Size([4, 3])
epoch: 1/5, step: 26/38, input: torch.Size([4, 3])
epoch: 1/5, step: 28/38, input: torch.Size([4, 3])
epoch: 1/5, step: 30/38, input: torch.Size([4, 3])
epoch: 1/5, step: 32/38, input: torch.Size([4, 3])
epoch: 1/5, step: 34/38, input: torch.Size([4, 3])
epoch: 1/5, step: 36/38, input: torch.Size([4, 3])
epoch: 1/5, step: 38/38, input: torch.Size([2, 3])
...
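For reference, the steps that the dummy loop skips could look roughly like the sketch below. Everything here is an assumption for illustration: the data are random placeholders, the model is a single linear layer, and the loss and optimizer are arbitrary choices. Note that `nn.CrossEntropyLoss` expects `float32` features (to match the layer weights) and integer class labels of shape `[batch]`, so the `float64` tensors with labels of shape `[n_samples, 1]` produced by the dataset above would likely need converting first.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder data: 150 samples, 3 features, 3 classes
X = torch.rand(150, 3)
y = torch.randint(0, 3, (150,))            # class indices as int64, shape [150]
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

model = nn.Linear(3, 3)                     # hypothetical model: one linear layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

steps = 0
for epoch in range(2):
    for inputs, labels in loader:
        optimizer.zero_grad()               # clear gradients from the previous step
        outputs = model(inputs)             # forward pass
        loss = criterion(outputs, labels)   # loss for this batch
        loss.backward()                     # backward pass
        optimizer.step()                    # weight update
        steps += 1
print(steps)  # 2 epochs * 38 batches = 76
```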

<h3><span style='color:yellow'>PyTorch provides several built-in datasets. In most cases, however, we have our own dataset and want to use it with a PyTorch model.</span></h3>


In [15]:
# The following example demonstrates how to load a specific image dataset along with its associated annotations
from skimage import io
import os
data_path='./datastes/image data/cats_dogs/data'
labels_path='/home/mohanad/learn/Pytorch/8- Dataset and DataLoader/datastes/image data/cats_dogs/annotations/annotations.csv'


class CatDog(Dataset):
    def __init__(self, annotations, root_dir, transformations=None):
        self.annotations=pd.read_csv(annotations)
        self.root_dir= root_dir
        self.transformations=transformations
        

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        imgs_path=os.path.join(self.root_dir,self.annotations.iloc[index,0]) # We use iloc to access the dataframe rows. We employ the index to switch between rows, and using '0' in iloc ensures we only view the first column.
        images=io.imread(imgs_path)
        labels=torch.tensor(int(self.annotations.iloc[index,1])) 
        
        # This is an optional step that we will cover in later tutorials
        if self.transformations:
            images=self.transformations(images)
    
        return images,labels
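The class above assumes an annotations file whose first column is the image file name and whose second column is an integer label. The real file is not shown here, so the table below is purely hypothetical (made-up file names, labels, and folder); it only illustrates how `iloc[index, 0]` and `iloc[index, 1]` are used inside `__getitem__`.

```python
import os
import pandas as pd

# Hypothetical annotations table: column 0 = file name, column 1 = integer label
annotations = pd.DataFrame({
    "filename": ["cat_001.jpg", "dog_001.jpg"],
    "label": [0, 1],
})
root_dir = "data"  # hypothetical image folder

index = 1
img_path = os.path.join(root_dir, annotations.iloc[index, 0])  # row `index`, first column
label = int(annotations.iloc[index, 1])                        # row `index`, second column
print(img_path, label)
```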
    

In [18]:
import torchvision.transforms as transforms
dataset=CatDog(annotations=labels_path,root_dir=data_path, transformations=transforms.ToTensor())

first_sample=dataset[0]

dataloader= DataLoader(dataset=dataset,batch_size=4,shuffle=True,num_workers=2)
iter_=iter(dataloader)
images,labels=next(iter_)

print(f'Features: {images} & label: {labels}')

Features: tensor([[[[0.9333, 0.8980, 0.8784,  ..., 0.8471, 0.8510, 0.8588],
          [0.9333, 0.9059, 0.8784,  ..., 0.8471, 0.8510, 0.8627],
          [0.9373, 0.9137, 0.8706,  ..., 0.8392, 0.8510, 0.8588],
          ...,
          [0.4627, 0.4784, 0.4980,  ..., 0.6980, 0.6941, 0.6784],
          [0.4510, 0.4667, 0.4863,  ..., 0.6941, 0.6784, 0.6588],
          [0.4431, 0.4588, 0.4784,  ..., 0.6902, 0.6627, 0.6353]],

         [[0.9255, 0.9059, 0.8980,  ..., 0.8510, 0.8588, 0.8667],
          [0.9255, 0.9137, 0.8980,  ..., 0.8510, 0.8588, 0.8706],
          [0.9255, 0.9098, 0.8902,  ..., 0.8431, 0.8588, 0.8667],
          ...,
          [0.4353, 0.4471, 0.4549,  ..., 0.6980, 0.6941, 0.6784],
          [0.4275, 0.4353, 0.4431,  ..., 0.7059, 0.6902, 0.6706],
          [0.4196, 0.4275, 0.4353,  ..., 0.7020, 0.6745, 0.6471]],

         [[0.7647, 0.7216, 0.6784,  ..., 0.6784, 0.7020, 0.7098],
          [0.7647, 0.7294, 0.6784,  ..., 0.6784, 0.7020, 0.7137],
          [0.7569, 0.7216, 0.678

In [37]:
# Remember that PyTorch (through torchvision) provides a set of datasets that can be loaded directly
from torchvision.datasets import FashionMNIST, cifar
from torchvision import transforms
transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,),(0.5,))]) # Normalize(mean, std), one value per channel

# Download and load the training data
#trainset = FashionMNIST('~/learn', download=True, train=True, transform=transform)
#train_loader=DataLoader(trainset,batch_size=64,shuffle=True)

# Download and load the testing data
#testset = FashionMNIST('~/learn', download=True, train=False, transform=transform)
#test_loader=DataLoader(testset,batch_size=64,shuffle=True)