<a href="https://colab.research.google.com/github/afnanAlgognadi/Introduction-to-Deep-Learning/blob/main/MNIST_DataLoader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Handling in PyTorch

One of the most important parts of any machine learning pipeline is data handling, i.e. providing an interface between the data on the hard drive and the deep learning (DL) algorithm.


We typically can't load all the data into RAM at once, because the datasets used in DL applications are often prohibitively large. We also don't need all the data at the same time: algorithms such as mini-batch/stochastic gradient descent only consume a small batch of samples per update step.
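As a rough illustration of consuming data one batch at a time, the sketch below wraps a small synthetic tensor (a stand-in for a real dataset; `features`, `targets`, and `toy_loader` are names made up for this example) in a `TensorDataset` and iterates over it with a `DataLoader`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for a real dataset: 25 samples with 4 features each.
features = torch.arange(100, dtype=torch.float32).reshape(25, 4)
targets = torch.arange(25) % 2

toy_dataset = TensorDataset(features, targets)
toy_loader = DataLoader(toy_dataset, batch_size=10, shuffle=False)

# Each iteration yields one mini-batch instead of the whole dataset.
batch_shapes = []
for x, y in toy_loader:
    batch_shapes.append((tuple(x.shape), tuple(y.shape)))

print(batch_shapes)
```

With 25 samples and a batch size of 10, the loader yields two full batches followed by a final batch of 5.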

PyTorch provides the `DataLoader` class to load data efficiently and to build a pipeline from the data to the DL algorithm.

Below we will look at an example of a dataloader written in PyTorch.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
import imageio as Image

%matplotlib inline

In [None]:
import torch
from torchvision import transforms
from torchvision.datasets import MNIST


# MNIST
def mnist(batch_sz):
    num_classes = 10
    transform_train = transforms.Compose([
                        transforms.RandomCrop(28, padding=4),
                        transforms.ToTensor(),
                    ])
    transform_test = transforms.Compose([
                        transforms.ToTensor(),
                    ])

    # Training dataset
    train_data = MNIST(root='./datasets', train=True, download=True, transform=transform_train)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_sz, shuffle=True,pin_memory=True)

    # Test dataset
    test_data = MNIST(root='./datasets', train=False, download=True, transform=transform_test)
    test_loader = torch.utils.data.DataLoader(test_data,
                                              batch_size=batch_sz, shuffle=False, pin_memory=True)

    return train_loader, test_loader, num_classes




In [None]:
train_loader, test_loader,_=mnist(10)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./datasets/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting ./datasets/MNIST/raw/train-images-idx3-ubyte.gz to ./datasets/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./datasets/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting ./datasets/MNIST/raw/train-labels-idx1-ubyte.gz to ./datasets/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./datasets/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting ./datasets/MNIST/raw/t10k-images-idx3-ubyte.gz to ./datasets/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting ./datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./datasets/MNIST/raw



In most practical applications, the data will be located in folders on the hard drive and you will have a txt/csv file with the classification labels.

To mimic this real-life situation, we will first write some samples of the MNIST dataset to the hard disk and provide a file with the labels.

We will then load this data with a data loader.

Creating a folder for the images.

In [None]:
!mkdir 'images'

Writing the first 300 images (30 batches of 10) to the images folder.

In [None]:
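from torchvision.utils import save_image

ind = 0        # running index, used as the file name
im_name = []   # file names, in the order they are written
labels = []    # class label of each written image
for i, batch in enumerate(train_loader):
    if i >= 30:  # stop after 30 batches (30 x 10 = 300 images)
        break
    for j in range(batch[0].shape[0]):
        im = batch[0][j, :, :, :]  # j-th image of the batch, shape (1, 28, 28)
        save_image(im, f"./images/{ind}.jpg", normalize=True)
        im_name.append(f"{ind}.jpg")
        labels.append(batch[1][j].item())
        ind += 1

        # Uncomment to display the image with its label:
        # plt.imshow(torch.squeeze(im))
        # plt.title(f"{batch[1][j].item()}")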

In [None]:
!ls images/

0.jpg	 130.jpg  161.jpg  192.jpg  222.jpg  253.jpg  284.jpg  44.jpg  75.jpg
100.jpg  131.jpg  162.jpg  193.jpg  223.jpg  254.jpg  285.jpg  45.jpg  76.jpg
101.jpg  132.jpg  163.jpg  194.jpg  224.jpg  255.jpg  286.jpg  46.jpg  77.jpg
102.jpg  133.jpg  164.jpg  195.jpg  225.jpg  256.jpg  287.jpg  47.jpg  78.jpg
103.jpg  134.jpg  165.jpg  196.jpg  226.jpg  257.jpg  288.jpg  48.jpg  79.jpg
104.jpg  135.jpg  166.jpg  197.jpg  227.jpg  258.jpg  289.jpg  49.jpg  7.jpg
105.jpg  136.jpg  167.jpg  198.jpg  228.jpg  259.jpg  28.jpg   4.jpg   80.jpg
106.jpg  137.jpg  168.jpg  199.jpg  229.jpg  25.jpg   290.jpg  50.jpg  81.jpg
107.jpg  138.jpg  169.jpg  19.jpg   22.jpg   260.jpg  291.jpg  51.jpg  82.jpg
108.jpg  139.jpg  16.jpg   1.jpg    230.jpg  261.jpg  292.jpg  52.jpg  83.jpg
109.jpg  13.jpg   170.jpg  200.jpg  231.jpg  262.jpg  293.jpg  53.jpg  84.jpg
10.jpg	 140.jpg  171.jpg  201.jpg  232.jpg  263.jpg  294.jpg  54.jpg  85.jpg
110.jpg  141.jpg  172.jpg  202.jpg  233.jpg  264.jpg  295.jpg  55.jp

Writing the labels to labels.csv file

In [None]:
import pandas as pd
df = pd.DataFrame(list(zip(im_name, labels)),
               columns =['Name', 'Labels'])

df.head()
df.to_csv('labels.csv')
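By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. This is harmless here because the dataset class only reads the `Name` and `Labels` columns, but `index=False` avoids it. A small sketch on an in-memory buffer (`toy_df` is a made-up two-row frame):

```python
import io

import pandas as pd

toy_df = pd.DataFrame({"Name": ["0.jpg", "1.jpg"], "Labels": [9, 0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False writes only the actual columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```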

Writing a dataset class for our dataset. We have to provide two methods: `__len__()`, which returns the length of the dataset, and `__getitem__(i)`, which returns the i-th element of the dataset.
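Before turning to the image dataset, here is a minimal sketch of this two-method contract on a toy example. `SquaresDataset` is a hypothetical class invented for illustration; it keeps nothing on disk and simply returns a number and its square.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of samples in the dataset.
        return self.n

    def __getitem__(self, idx):
        # Return the idx-th sample as an (input, target) pair.
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2


squares = SquaresDataset(5)
print(len(squares))
print(squares[3])
```

Any object with these two methods can be indexed like a list and passed to a `DataLoader`.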

In [None]:
df

Unnamed: 0,Name,Labels
0,0.jpg,9
1,1.jpg,0
2,2.jpg,4
3,3.jpg,1
4,4.jpg,4
...,...,...
295,295.jpg,5
296,296.jpg,2
297,297.jpg,1
298,298.jpg,0


In [None]:

class mnistDataset(Dataset):
    def __init__(self,imfol,labels_csv):
        super(mnistDataset,self).__init__()
        df=pd.read_csv(labels_csv)
        self.labels=df['Labels'].tolist()
        self.images=df['Name'].tolist()
        self.imfol=imfol
        
        
    def __len__(self):
        return len(self.images)
    
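    def __getitem__(self,idx):
        # Read the image from disk; the saved JPEGs have shape (H, W, 3).
        im=Image.imread(f'{self.imfol}{self.images[idx]}')
        # Reorder to channels-first (3, H, W), the layout PyTorch expects.
        im=np.transpose(im,(2,0,1))
        label=self.labels[idx]
        # Return the sample as [image, label].
        return [np.asarray(im), label]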
        
        

Creating our dataset.

In [None]:
data=mnistDataset('./images/','labels.csv')


In [None]:
sample=data[299]

In [None]:
sample[1]

2

Creating the dataloader.

In [None]:
loader=DataLoader(data,batch_size=4)

Examining the dataloader.

In [None]:
for batch in loader:
    print(batch[0].shape)
    print(batch[1])
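The dataset returns NumPy arrays and plain Python integers, yet the batches are tensors: the `DataLoader`'s default collation stacks the samples and converts them. The sketch below shows this with a hypothetical `FakeImageDataset` that produces constant `uint8` arrays instead of reading files. The final scaling to `[0, 1]` floats is an extra step that is not in the notebook; a model would typically need it because the batches arrive as `uint8`.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class FakeImageDataset(Dataset):
    """Stand-in for the image-folder dataset: no files, constant images."""

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        image = np.full((3, 28, 28), idx, dtype=np.uint8)
        return [image, idx % 10]


fake_loader = DataLoader(FakeImageDataset(), batch_size=4)
images, labels = next(iter(fake_loader))

# NumPy arrays are stacked into one tensor; ints become an int64 tensor.
print(type(images), images.dtype, tuple(images.shape))
print(labels)

# Convert to float in [0, 1] before feeding a model (assumed extra step).
images_float = images.float() / 255.0
```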

In [None]:
import torch
from torchvision import transforms
from torchvision.datasets import MNIST, CIFAR10


# MNIST
def mnist(batch_sz):
    num_classes = 10
    transform_train = transforms.Compose([
                        transforms.RandomCrop(28, padding=4),
                        transforms.ToTensor(),
                    ])
    transform_test = transforms.Compose([
                        transforms.ToTensor(),
                    ])

    # Training dataset
    train_data = MNIST(root='./datasets', train=True, download=True, transform=transform_train)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_sz, shuffle=True,pin_memory=True)

    # Test dataset
    test_data = MNIST(root='./datasets', train=False, download=True, transform=transform_test)
    test_loader = torch.utils.data.DataLoader(test_data,
                                              batch_size=batch_sz, shuffle=False, pin_memory=True)

    return train_loader, test_loader, num_classes


# CIFAR10
def cifar10(batch_sz):
    num_classes = 10
    transform_train = transforms.Compose([
                        transforms.RandomCrop(32, padding=4),
                        transforms.RandomHorizontalFlip(),
                        transforms.ToTensor(),
                    ])
    transform_test = transforms.Compose([
                        transforms.ToTensor(),
                    ])

    # Training dataset
    train_data = CIFAR10(root='./datasets', train=True, download=True, transform=transform_train)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_sz,
                                               shuffle=True, pin_memory=True)

    # Test dataset
    test_data = CIFAR10(root='./datasets', train=False, download=True, transform=transform_test)
    test_loader = torch.utils.data.DataLoader(test_data,
                                              batch_size=batch_sz, shuffle=False, pin_memory=True)

    return train_loader, test_loader, num_classes

In [None]:
train_loader, test_loader,_=mnist(10)

In [None]:
from torchvision.utils import save_image
!mkdir -p 'images/'


In [None]:
idx=0
im_list=[]
label_list=[]
for i,batch in enumerate(train_loader):
  if i>200:
    break
  for j in range(batch[0].shape[0]):
    im=batch[0][j,:,:,:]
    label=batch[1][j].item()
    im_name=f"{idx}.jpg"
    im_list.append(im_name)
    label_list.append(label)
    save_image(im,f"images/{im_name}")
    idx+=1


In [None]:
!ls 'images/' -la


In [None]:
import pandas as pd
df=pd.DataFrame(zip(im_list,label_list),columns=['Name','label'])
print(df.head())

    Name  label
0  0.jpg      8
1  1.jpg      4
2  2.jpg      6
3  3.jpg      7
4  4.jpg      5


In [None]:
df.to_csv('im_list.csv')

In [None]:
!ls

datasets  images  im_list.csv  labels.csv  sample_data


In [None]:
from torch.utils.data import Dataset, DataLoader
import imageio as Image

class MyDataset(Dataset):
  def __init__(self,im_path,im_list_file):
    self.im_path=im_path
    df=pd.read_csv(im_list_file)
    self.im_names=df['Name'].tolist()
    self.labels=df['label'].tolist()


  def __len__(self):
    return len(self.im_names)


  def __getitem__(self,idx):
    im=Image.imread(f"{self.im_path}{self.im_names[idx]}")
    im=np.transpose(im,(2,0,1))
    label=self.labels[idx]
    sample={'im':np.asarray(im), 'label':label}
    return sample



In [None]:
dataset=MyDataset('images/','im_list.csv')

In [None]:
type(dataset[0]['im'])

numpy.ndarray

In [None]:
train_loader=DataLoader(dataset,batch_size=7)
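The number of batches follows from the dataset size. If the export loop above ran to completion it wrote 201 batches of 10 images, i.e. 2010 images, so a batch size of 7 gives ceil(2010 / 7) = 288 batches, the last one holding a single image. The sketch below checks this arithmetic with a hypothetical `FakeDictDataset` that returns dictionary samples like `MyDataset` but reads no files. `drop_last=True` is shown as one way to discard the incomplete final batch.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader


class FakeDictDataset(Dataset):
    """Stand-in for MyDataset: dictionary samples, no files on disk."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {'im': np.zeros((3, 28, 28), dtype=np.uint8), 'label': idx % 10}


fake = FakeDictDataset(2010)  # assumed size: 201 batches x 10 images
loader_keep = DataLoader(fake, batch_size=7)
loader_drop = DataLoader(fake, batch_size=7, drop_last=True)

print(len(loader_keep), len(loader_drop))

# Dictionary samples are collated key by key into a dictionary of tensors.
last = None
for last in loader_keep:
    pass
print(tuple(last['im'].shape), tuple(last['label'].shape))
```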

In [None]:
for batch in train_loader:
  print(batch['im'].shape, batch['label'].shape)

torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size([7, 3, 28, 28]) torch.Size([7])
torch.Size(