<h1>Dataset and DataLoader classes</h1> 



<h3 span style='color:yellow'>Computing the gradient over all samples at once can be time-consuming and cumbersome, especially when the dataset is large.</h3>

<h3 span style='color:yellow'>A common way to handle a large dataset is to divide it into batches of samples.</h3>

<h3 span style='color:yellow'>Accordingly, training runs over several epochs, and within each epoch an inner loop iterates over the batches, computing the gradient on one batch at a time.</h3>

<h3 span style='color:yellow'><span style='color:lightgreen'>Note:</span> One Epoch (epoch=1) means one forward and backward pass across all training samples.</h3>

<h3 span style='color:yellow'><span style='color:lightgreen'>Note:</span> The batch size is the number of training samples in one forward and backward pass.</h3>

<h3 span style='color:yellow'><span style='color:lightgreen'>Note:</span> The number of iterations is the number of forward and backward passes needed to complete one epoch, where each pass processes one batch of samples.</h3>

<ul>
  <li style="color:lightgreen;"><span style="font-size:18px;">e.g. if we have 200 samples, and the batch size is 20, then 200/20 = 10 iterations are needed to complete one epoch.</span></li>
</ul>
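
<h3><span style='color:yellow'>The arithmetic above can be checked with a short sketch. The helper name iterations_per_epoch is hypothetical, and rounding up with math.ceil assumes that the last, smaller batch is kept when the batch size does not divide the number of samples evenly.</span></h3>

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    # Hypothetical helper: number of batches needed to see every sample once.
    # Rounding up keeps the final, possibly smaller, batch.
    return math.ceil(n_samples / batch_size)

print(iterations_per_epoch(200, 20))  # 10
print(iterations_per_epoch(150, 4))   # 38
```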

<h3 span style='color:yellow'>PyTorch's Dataset and DataLoader classes automate batching and iteration over the batches.</h3>




In [63]:
# Traditional method for reading a dataset
import numpy as np
 
# Column names: Date,Open,High,Low,Close,Volume,Adj
data=np.loadtxt("google_stock_example.csv", delimiter=',', skiprows=1, usecols=(1,2,4)) # select numeric columns only; np.loadtxt cannot parse the Date strings as floats

# Let's assume we train for 5 epochs to optimize the model parameters on our data
for epoch in range(5):
    x,y,z=data.T

# The loop above processes the entire dataset in one pass per epoch, which is slow for large datasets. Therefore, we divide the dataset into smaller batches

total_batches=20
for epoch in range(5):
    for i in range(total_batches):
        #x_batch,y_batch,z_batch=
        ...
#The nested loop above is just for illustration purposes
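
<h3><span style='color:yellow'>A minimal sketch of what the inner loop would have to do by hand, assuming a random array in place of the CSV file. The names data_demo, batch_size and n_batches are illustrative and not part of the original notebook.</span></h3>

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data_demo = rng.random((50, 3))   # stand-in for the loaded CSV: 50 rows, 3 columns
batch_size = 8
n_batches = math.ceil(len(data_demo) / batch_size)

batch_shapes = []
for epoch in range(2):
    for i in range(n_batches):
        # Slice out the rows that belong to batch i; the last batch may be smaller
        batch = data_demo[i * batch_size:(i + 1) * batch_size]
        x_batch, y_batch, z_batch = batch.T
        batch_shapes.append(batch.shape)

print(n_batches, batch_shapes[:n_batches])
```

<h3><span style='color:yellow'>This bookkeeping (slicing, handling the last batch, shuffling) is what DataLoader takes over.</span></h3>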

In [64]:
# Dataset and DataLoader classes
import torch,torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math
import pandas as pd


In [65]:
# Let's generate synthetic data for classification purposes
np.random.seed(1234)
n_samples=150
labels=np.array([0]*50+[1]*50+[2]*50)
np.random.shuffle(labels)

feature1=np.random.rand(n_samples)
feature2=np.random.rand(n_samples)
feature3=np.random.rand(n_samples)

data=pd.DataFrame({"label":labels, "feature1":feature1,"feature2":feature2,"feature3":feature3})
data = data[['label', 'feature1', 'feature2', 'feature3']]
data.to_csv("synthetic_classification_data.csv",index=False)

In [66]:
# In the cell above, we generated data for classification. Our goal is to load this data using NumPy,
# taking into account that the first column contains the class labels and the header row needs to be skipped

class myDataset(Dataset):
    def __init__(self):
        Xy=np.loadtxt("synthetic_classification_data.csv", delimiter=',', skiprows=1, dtype=np.float64)
        self.X=Xy[:,1:]
        self.y=Xy[:,[0]] # indexing with [0] keeps y as a 2D array of shape [n_samples, 1]
        # convert to torch tensors
        self.X=torch.from_numpy(self.X)
        self.y=torch.from_numpy(self.y)
        self.n_samples=Xy.shape[0]

    def __len__(self): # This returns the number of samples in the dataset
        return self.n_samples

    def __getitem__(self, index):  # This allows indexing
        return self.X[index],self.y[index]    # This returns a tuple of tensors for features and labels



In [67]:
dataset=myDataset()
first_sample=dataset[0]
# We can unpack the tuple of tensor as follows
features,labels=first_sample
print(f'Features: {features} & label: {labels}')

Features: tensor([0.3257, 0.6295, 0.2986], dtype=torch.float64) & label: tensor([1.], dtype=torch.float64)


In [68]:
# DataLoader
dataloader=DataLoader(dataset=dataset,batch_size=4, shuffle=True,num_workers=2) # num_workers=2 loads batches in two worker subprocesses, which can speed up loading when reading a sample is expensive

# To see how the dataloader works, we convert it to an iterator
dataiter=iter(dataloader)
data=next(dataiter)
# We can unpack the features and labels of the first batch. To see the next batch, call next(dataiter) again
features,labels=data
print(f'Features: {features} & label: {labels}') # There are 4 samples and 4 labels because we set the batch size to 4

Features: tensor([[0.3063, 0.7150, 0.1311],
        [0.6228, 0.0520, 0.1323],
        [0.8900, 0.1576, 0.9788],
        [0.7167, 0.6060, 0.8343]], dtype=torch.float64) & label: tensor([[2.],
        [2.],
        [1.],
        [1.]], dtype=torch.float64)


In [69]:
# We can print the next batch, and so on.
data=next(dataiter)
# This time we unpack the features and labels of the second batch
features,labels=data
print(f'Features: {features} & label: {labels}') # Again 4 samples and 4 labels, because the batch size is 4

Features: tensor([[0.2018, 0.7140, 0.3228],
        [0.1262, 0.8854, 0.6690],
        [0.6508, 0.2667, 0.8288],
        [0.0227, 0.2260, 0.8162]], dtype=torch.float64) & label: tensor([[0.],
        [0.],
        [2.],
        [2.]], dtype=torch.float64)


In [70]:
# Perform a dummy training loop
NUM_EPOCH=5
TOTAL_SAMPLES=len(dataset)
NUM_ITERATION=math.ceil(TOTAL_SAMPLES/4) # 4 is the batch size; math.ceil returns the smallest integer greater than or equal to its argument, e.g. ceil(4.2)=5
print(f'Total samples: {TOTAL_SAMPLES} & Number of iterations: {NUM_ITERATION}')

Total samples: 150 & Number of iterations: 38
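
<h3><span style='color:yellow'>The iteration count does not have to be computed by hand, because len() of a DataLoader gives the number of batches per epoch. A sketch assuming a random TensorDataset in place of myDataset; the names demo_dataset, keep_last and drop_last are illustrative.</span></h3>

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for myDataset: 150 samples, 3 features, 1 label column
demo_dataset = TensorDataset(torch.rand(150, 3), torch.randint(0, 3, (150, 1)))

keep_last = DataLoader(demo_dataset, batch_size=4, shuffle=True)
drop_last = DataLoader(demo_dataset, batch_size=4, shuffle=True, drop_last=True)

print(len(keep_last))   # 38 batches: 37 full batches plus one batch of 2 samples
print(len(drop_last))   # 37 batches: the incomplete batch is discarded
```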


In [71]:
for epoch in range(NUM_EPOCH):
    for i, (inputs,labels) in enumerate(dataloader): # 'inputs' avoids shadowing the built-in input()
        # Suppose we've already implemented the forward propagation, loss calculation, and weight update steps.
        if (i+1) % 2 ==0:
            print(f'epoch: {epoch+1}/{NUM_EPOCH}, step: {i+1}/{NUM_ITERATION}, input: {inputs.shape}')

epoch: 1/5, step: 2/38, input: torch.Size([4, 3])
epoch: 1/5, step: 4/38, input: torch.Size([4, 3])
epoch: 1/5, step: 6/38, input: torch.Size([4, 3])
epoch: 1/5, step: 8/38, input: torch.Size([4, 3])
epoch: 1/5, step: 10/38, input: torch.Size([4, 3])
epoch: 1/5, step: 12/38, input: torch.Size([4, 3])
epoch: 1/5, step: 14/38, input: torch.Size([4, 3])
epoch: 1/5, step: 16/38, input: torch.Size([4, 3])
epoch: 1/5, step: 18/38, input: torch.Size([4, 3])
epoch: 1/5, step: 20/38, input: torch.Size([4, 3])
epoch: 1/5, step: 22/38, input: torch.Size([4, 3])
epoch: 1/5, step: 24/38, input: torch.Size([4, 3])
epoch: 1/5, step: 26/38, input: torch.Size([4, 3])
epoch: 1/5, step: 28/38, input: torch.Size([4, 3])
epoch: 1/5, step: 30/38, input: torch.Size([4, 3])
epoch: 1/5, step: 32/38, input: torch.Size([4, 3])
epoch: 1/5, step: 34/38, input: torch.Size([4, 3])
epoch: 1/5, step: 36/38, input: torch.Size([4, 3])
epoch: 1/5, step: 38/38, input: torch.Size([2, 3])
epoch: 2/5, step: 2/38, input: torc

epoch: 5/5, step: 2/38, input: torch.Size([4, 3])
epoch: 5/5, step: 4/38, input: torch.Size([4, 3])
epoch: 5/5, step: 6/38, input: torch.Size([4, 3])
epoch: 5/5, step: 8/38, input: torch.Size([4, 3])
epoch: 5/5, step: 10/38, input: torch.Size([4, 3])
epoch: 5/5, step: 12/38, input: torch.Size([4, 3])
epoch: 5/5, step: 14/38, input: torch.Size([4, 3])
epoch: 5/5, step: 16/38, input: torch.Size([4, 3])
epoch: 5/5, step: 18/38, input: torch.Size([4, 3])
epoch: 5/5, step: 20/38, input: torch.Size([4, 3])
epoch: 5/5, step: 22/38, input: torch.Size([4, 3])
epoch: 5/5, step: 24/38, input: torch.Size([4, 3])
epoch: 5/5, step: 26/38, input: torch.Size([4, 3])
epoch: 5/5, step: 28/38, input: torch.Size([4, 3])
epoch: 5/5, step: 30/38, input: torch.Size([4, 3])
epoch: 5/5, step: 32/38, input: torch.Size([4, 3])
epoch: 5/5, step: 34/38, input: torch.Size([4, 3])
epoch: 5/5, step: 36/38, input: torch.Size([4, 3])
epoch: 5/5, step: 38/38, input: torch.Size([2, 3])
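
<h3><span style='color:yellow'>The placeholder comment in the loop can be filled in along the following lines. This is a sketch under assumptions: random data replaces the CSV file, and the linear model, cross-entropy loss, SGD optimizer and learning rate are illustrative choices. Features are float32 and labels are 1D int64 tensors because that is what nn.Linear and nn.CrossEntropyLoss expect by default.</span></h3>

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
features = torch.rand(150, 3)               # float32 features
targets = torch.randint(0, 3, (150,))       # int64 class indices, shape [150]
loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=True)

model = nn.Linear(3, 3)                     # 3 features in, 3 class scores out
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

steps = 0
for epoch in range(2):
    for inputs, labels in loader:
        optimizer.zero_grad()                    # clear gradients from the previous step
        loss = criterion(model(inputs), labels)  # forward pass and loss
        loss.backward()                          # backward pass
        optimizer.step()                         # weight update
        steps += 1
    print(f'epoch: {epoch+1}, last batch loss: {loss.item():.4f}')
```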


<h3><span style='color:yellow'>PyTorch provides several ready-made datasets. In most practical cases, however, we have our own data and need to wrap it in a Dataset so that it can be used with a PyTorch model.</span></h3>


In [72]:
# The following example demonstrates how to load a specific image dataset along with its associated annotations
from skimage import io
import os
data_path='./datastes/image data/cats_dogs/data'
labels_path='/home/mohanad/learn/Pytorch/8- Dataset and DataLoader/datastes/image data/cats_dogs/annotations/annotations.csv'


class CatDog(Dataset):
    def __init__(self, annotations, root_dir, transformations=None):
        self.annotations=pd.read_csv(annotations)
        self.root_dir= root_dir
        self.transformations=transformations
        

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        imgs_path=os.path.join(self.root_dir,self.annotations.iloc[index,0]) # iloc accesses the dataframe by position: 'index' selects the row, and 0 selects the first column (the image file name)
        images=io.imread(imgs_path)
        labels=torch.tensor(int(self.annotations.iloc[index,1])) # column 1 holds the class label

        # This is an optional step that we will cover in the next tutorials
        if self.transformations:
            images=self.transformations(images)

        return images,labels
    

<h3><span style='color:yellow'>What if we have only the data without annotation?</span></h3>
<h3><span style='color:yellow'>We introduce a preprocessing step that lists the image directory, derives each label from the file name, and builds a dataframe of image names and labels.</span></h3>
<h3><span style='color:yellow'>We then build a Dataset class on top of that dataframe.</span></h3>


In [73]:
import os

# Print the image names
data_dir='./datastes/image data/cats_dogs/data/'
images_name=os.listdir(data_dir)
images_name


['cat.6.jpg',
 'cat.4.jpg',
 'dog.0.jpg',
 'cat.2.jpg',
 'cat.3.jpg',
 'cat.1.jpg',
 'dog.1.jpg',
 'cat.7.jpg',
 'cat.0.jpg',
 'cat.5.jpg']

In [74]:
# Label the images
class_labels=[1 if 'dog' in name else 0 for name in images_name]
class_labels

[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
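
<h3><span style='color:yellow'>The same labeling step can be written with an explicit mapping, which extends to more than two classes. A sketch assuming the file names follow the class.index.jpg pattern seen above; class_to_label and the short file list demo_names are illustrative.</span></h3>

```python
# Hypothetical mapping from the file-name prefix to an integer label
class_to_label = {'cat': 0, 'dog': 1}

demo_names = ['cat.6.jpg', 'cat.4.jpg', 'dog.0.jpg', 'cat.2.jpg']

# The class name is the part of the file name before the first dot
demo_labels = [class_to_label[name.split('.')[0]] for name in demo_names]
print(demo_labels)  # [0, 0, 1, 0]
```

<h3><span style='color:yellow'>Note that os.listdir returns names in arbitrary order, so sorting the list first makes the resulting dataframe reproducible across machines.</span></h3>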

In [75]:
# Dataframe construction: it stores the image names and their associated labels
label_name_df=pd.DataFrame({"images_name":images_name,"class_labels":class_labels})
label_name_df.iloc[0,0] # row 0, column 0: the first image name

'cat.6.jpg'

In [76]:
# Verify whether the images can be accurately read along with their corresponding labels and paths
from skimage import io
for i in range(len(label_name_df)):
    image_path=os.path.join(data_dir,label_name_df.iloc[i,0])
    image=io.imread(image_path)
    print(f'Image shape: {image.shape} & class label: {label_name_df.iloc[i,1]} & Path: {image_path}')


Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.6.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.4.jpg
Image shape: (224, 224, 3) & class label: 1 & Path: ./datastes/image data/cats_dogs/data/dog.0.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.2.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.3.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.1.jpg
Image shape: (224, 224, 3) & class label: 1 & Path: ./datastes/image data/cats_dogs/data/dog.1.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.7.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.0.jpg
Image shape: (224, 224, 3) & class label: 0 & Path: ./datastes/image data/cats_dogs/data/cat.5.jpg


In [88]:
class CatDogv1(Dataset):
    def __init__(self,label_name_df,root_dir,transformations=None):
        self.label_name_df=label_name_df
        self.root_dir=root_dir
        self.transformations=transformations
    
    def __len__(self):
        return len(self.label_name_df) # use the instance attribute, not the global dataframe
    
    def __getitem__(self, index):
        image_path=os.path.join(self.root_dir,self.label_name_df.iloc[index,0])
        images=io.imread(image_path)
        labels=torch.tensor(int(self.label_name_df.iloc[index,1]))
    
        if self.transformations:
            images=self.transformations(images)

        return images,labels # returned whether or not a transformation was applied

In [89]:
dataset_v1=CatDogv1(root_dir=data_dir,label_name_df=label_name_df, transformations=torchvision.transforms.ToTensor())
first_sample=dataset_v1[0]
first_sample

(tensor([[[0.7922, 0.7961, 0.8118,  ..., 0.9529, 0.9569, 0.9490],
          [0.7922, 0.7961, 0.8118,  ..., 0.9569, 0.9608, 0.9529],
          [0.7961, 0.8000, 0.8118,  ..., 0.9608, 0.9647, 0.9608],
          ...,
          [0.6039, 0.6078, 0.6196,  ..., 0.0118, 0.0118, 0.0118],
          [0.6039, 0.6039, 0.6078,  ..., 0.0078, 0.0078, 0.0078],
          [0.5961, 0.5961, 0.6000,  ..., 0.0078, 0.0078, 0.0078]],
 
         [[0.6471, 0.6510, 0.6588,  ..., 0.8039, 0.7922, 0.7843],
          [0.6471, 0.6510, 0.6588,  ..., 0.8078, 0.7961, 0.7882],
          [0.6431, 0.6471, 0.6588,  ..., 0.8118, 0.8000, 0.7961],
          ...,
          [0.4824, 0.4863, 0.4902,  ..., 0.0118, 0.0118, 0.0118],
          [0.4824, 0.4824, 0.4863,  ..., 0.0078, 0.0078, 0.0078],
          [0.4745, 0.4745, 0.4784,  ..., 0.0078, 0.0078, 0.0078]],
 
         [[0.3412, 0.3451, 0.3569,  ..., 0.4784, 0.4706, 0.4627],
          [0.3412, 0.3451, 0.3569,  ..., 0.4824, 0.4745, 0.4667],
          [0.3412, 0.3451, 0.3569,  ...,

In [100]:
import torchvision.transforms as transforms
dataset=CatDog(annotations=labels_path,root_dir=data_path, transformations=transforms.ToTensor())

# Splitting data into training and testing sets

total_size = len(dataset)
train_percent = 0.8
train_size = int(train_percent * total_size)
test_size = total_size - train_size


# Create training and testing subsets
training_set, testing_set=torch.utils.data.random_split(dataset,[train_size,test_size])

print(f"Length of training_set: {len(training_set)}")
print('')
sample=training_set[0]
print(sample)
print('')

# Accessing an attribute of the original dataset
attribute = training_set.dataset.annotations
print(attribute)

# Printing indices of the training set
indices = training_set.indices
print(indices)



Length of training_set: 8

(tensor([[[0.5451, 0.5647, 0.5922,  ..., 0.6039, 0.6784, 0.6667],
         [0.5412, 0.5608, 0.5882,  ..., 0.6039, 0.6549, 0.6745],
         [0.5490, 0.5686, 0.5922,  ..., 0.6353, 0.6471, 0.6157],
         ...,
         [0.2118, 0.2275, 0.1137,  ..., 0.0314, 0.0392, 0.0510],
         [0.4863, 0.2078, 0.1412,  ..., 0.0235, 0.0314, 0.0431],
         [0.6078, 0.2588, 0.1451,  ..., 0.0196, 0.0275, 0.0353]],

        [[0.5059, 0.5255, 0.5529,  ..., 0.5647, 0.6392, 0.6314],
         [0.5020, 0.5216, 0.5490,  ..., 0.5647, 0.6196, 0.6392],
         [0.5059, 0.5255, 0.5490,  ..., 0.5961, 0.6118, 0.5804],
         ...,
         [0.1804, 0.1922, 0.0784,  ..., 0.0510, 0.0588, 0.0706],
         [0.4431, 0.1725, 0.1059,  ..., 0.0588, 0.0510, 0.0627],
         [0.5647, 0.2235, 0.1059,  ..., 0.0549, 0.0471, 0.0549]],

        [[0.4706, 0.4902, 0.5176,  ..., 0.5569, 0.6314, 0.6118],
         [0.4667, 0.4863, 0.5137,  ..., 0.5569, 0.6000, 0.6196],
         [0.4824, 0.5020, 0.52
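
<h3><span style='color:yellow'>random_split draws a new random partition on every call. A sketch of making the split reproducible by passing a seeded generator, assuming a small TensorDataset in place of the image dataset; the helper name split and the seed value 42 are illustrative.</span></h3>

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 10 samples
demo_dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

def split(seed):
    # A seeded generator fixes which indices go to each subset
    generator = torch.Generator().manual_seed(seed)
    return random_split(demo_dataset, [8, 2], generator=generator)

train_a, test_a = split(42)
train_b, test_b = split(42)

print(list(train_a.indices) == list(train_b.indices))            # True: same seed, same split
print(sorted(list(train_a.indices) + list(test_a.indices)))      # each index 0..9 appears exactly once
```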

In [80]:
# Remember that PyTorch (through torchvision) provides a set of datasets that can be loaded directly
from torchvision.datasets import FashionMNIST, CIFAR10
from torchvision import transforms
transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,),(0.5,))]) # Normalize((mean,), (std,))

# Download and load the training data
#trainset = FashionMNIST('~/learn', download=True, train=True, transform=transform)
#train_loader=DataLoader(trainset,batch_size=64,shuffle=True)

# Download and load the testing data
#testset = FashionMNIST('~/learn', download=True, train=False, transform=transform)
#test_loader=DataLoader(testset,batch_size=64,shuffle=False) # shuffling is not needed for evaluation
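
<h3><span style='color:yellow'>To see what the Normalize step in the transform does, the sketch below applies the same formula, (x - mean) / std, by hand with plain torch. The pixel values are made up for illustration. With mean 0.5 and std 0.5, values in [0, 1] (the range ToTensor produces for 8-bit images) are mapped to [-1, 1].</span></h3>

```python
import torch

mean, std = 0.5, 0.5
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])   # illustrative values in [0, 1]

# Same formula that Normalize applies to each channel
normalized = (pixels - mean) / std
print(normalized)  # values: -1.0, -0.5, 0.0, 1.0
```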