# Why You Need to Learn PyTorch's Powerful DataLoader
-------------------------

In case you missed it, in a previous article we went over [batch gradient descent](https://cutt.ly/cg2ai-batch-gradient-descent-medium) in depth and saw how it vastly improved on the [vanilla gradient descent](https://cutt.ly/cg2nn-ch1-medium) approach. In this article, we'll revisit batch gradient descent, but this time we'll take advantage of PyTorch's powerful Dataset and DataLoader classes. By the end of this article, you will be convinced to never go back to a life of deep learning without PyTorch's DataLoader.

Before we begin, we'll rerun the steps we performed in the previous [batch gradient descent article](https://cutt.ly/cg2ai-batch-gradient-descent-medium). Just like last time, we'll use the Pima Indians Diabetes dataset, set aside 33% for testing, standardize it and set the batch size to 64. We'll also keep the same neural network architecture - 1 hidden layer of size 4.

I've labeled the code cells with high-level titles, but if you wish to see an in-depth explanation of the code below, please refer to the [previous article](https://cutt.ly/cg2ai-batch-gradient-descent-medium) where we introduced batch gradient descent.


### Import libraries

In [9]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch.nn as nn

### Import data and standardize

In [10]:
df = pd.read_csv(r'https://raw.githubusercontent.com/a-coders-guide-to-ai/a-coders-guide-to-neural-networks/master/data/diabetes.csv')

X = df[df.columns[:-1]]
y = df['Outcome']
X = X.values
y = torch.tensor(y.values)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = torch.tensor(scaler.transform(X_train))
X_test = torch.tensor(scaler.transform(X_test))

### Create neural network architecture

In [11]:
class Model(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.hidden_linear = nn.Linear(8, 4)
        self.output_linear = nn.Linear(4, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, X):
        hidden_output = self.sigmoid(self.hidden_linear(X))
        output = self.sigmoid(self.output_linear(hidden_output))
        return output

### Make variables which will be reused

In [12]:
def accuracy(y_pred, y):
    return torch.sum((((y_pred>=0.5)+0).reshape(1,-1)==y)+0).item()/y.shape[0]

epochs = 1000+1
print_epoch = 100
lr = 1e-2
batch_size = 64

### Instantiate Model class and set loss and optimizer

In [13]:
model = Model()
BCE = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)

### Run batch gradient descent without PyTorch's DataLoader

In [14]:
# Count only the full batches; the smaller leftover batch at the end is skipped.
train_batches = len(X_train) // batch_size
test_batches = len(X_test) // batch_size

for epoch in range(epochs):
    
    iteration_loss = 0.
    iteration_accuracy = 0.
    
    model.train()
    for i in range(train_batches):
      beg = i*batch_size
      end = (i+1)*batch_size

      y_pred = model(X_train[beg:end].float())
      loss = BCE(y_pred, y_train[beg:end].reshape(-1,1).float())     
      
      iteration_loss += loss.item()  # accumulate a plain float, not a tensor
      iteration_accuracy += accuracy(y_pred, y_train[beg:end])

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

    if(epoch % print_epoch == 0):
      print('Train: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))

    iteration_loss = 0.
    iteration_accuracy = 0.      

    model.eval()
    with torch.no_grad():  # no gradients needed for evaluation
      for i in range(test_batches):
        beg = i*batch_size
        end = (i+1)*batch_size

        y_pred = model(X_test[beg:end].float())
        loss = BCE(y_pred, y_test[beg:end].reshape(-1,1).float())

        iteration_loss += loss.item()
        iteration_accuracy += accuracy(y_pred, y_test[beg:end])
      
    if(epoch % print_epoch == 0):
        print('Test: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))

Train: epoch: 0 - loss: 0.67015; acc: 0.646
Test: epoch: 0 - loss: 0.67036; acc: 0.641
Train: epoch: 100 - loss: 0.63613; acc: 0.646
Test: epoch: 100 - loss: 0.64282; acc: 0.641
Train: epoch: 200 - loss: 0.60189; acc: 0.646
Test: epoch: 200 - loss: 0.61569; acc: 0.646
Train: epoch: 300 - loss: 0.55953; acc: 0.705
Test: epoch: 300 - loss: 0.58368; acc: 0.682
Train: epoch: 400 - loss: 0.52290; acc: 0.752
Test: epoch: 400 - loss: 0.55886; acc: 0.724
Train: epoch: 500 - loss: 0.49801; acc: 0.764
Test: epoch: 500 - loss: 0.54422; acc: 0.719
Train: epoch: 600 - loss: 0.48241; acc: 0.771
Test: epoch: 600 - loss: 0.53619; acc: 0.719
Train: epoch: 700 - loss: 0.47259; acc: 0.777
Test: epoch: 700 - loss: 0.53165; acc: 0.714
Train: epoch: 800 - loss: 0.46626; acc: 0.773
Test: epoch: 800 - loss: 0.52903; acc: 0.719
Train: epoch: 900 - loss: 0.46208; acc: 0.775
Test: epoch: 900 - loss: 0.52757; acc: 0.724
Train: epoch: 1000 - loss: 0.45927; acc: 0.771
Test: epoch: 1000 - loss: 0.52682; acc: 0.724


## Using PyTorch's DataLoader Class

Sorry for the chunks of code above before starting the topic at hand. The above is so that we can compare the results with and without PyTorch's DataLoader class.

Let's get right into it. We're going to import our required classes before moving forward.

In [15]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

You're going to see above that we imported the DataLoader class, but along with it, we also imported the Dataset class. This is because the DataLoader class accepts data in the form of a Dataset object.

To get our data in the form of a Dataset object, we're going to create a custom class which inherits from the Dataset class. Besides defining an \_\_init\_\_ method, inheriting from the Dataset class requires us to override the \_\_getitem\_\_ and \_\_len\_\_ methods. The \_\_len\_\_ method returns the length of our Dataset object and the \_\_getitem\_\_ method returns the (X, y) pair at a given index.

Let's write the code for the class. 

Note: Something you'll notice is that I'm not preprocessing any of my data in the class; rather, I'm reusing all of my preprocessing from above. Because of the standardization and the train/test split, I was unsure how to accomplish those steps in a clean way inside the class. I would love to get some advice in case I've been using the Dataset class the wrong way.

In [16]:
class PimaIndiansDiabetes(Dataset):

  def __init__(self, X, y):
    self.X = X
    self.y = y
    self.len = len(self.X)

  def __getitem__(self, index):
    return self.X[index], self.y[index]

  def __len__(self):
    return self.len

Due to the preprocessing being done above, the class is very straightforward. It's really just acting as a wrapper for our training and testing datasets, allowing them to be in a format acceptable for the DataLoader class.
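To make the "wrapper" idea concrete, here is a minimal sketch on made-up tensors. `ToyDataset` is a hypothetical stand-in that mirrors the class above, and the data values are arbitrary. It shows that `len()` and indexing are the two operations the wrapper provides, and that PyTorch's built-in `TensorDataset` returns the same pairs when the data already lives in tensors.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class ToyDataset(Dataset):
    """Hypothetical stand-in that mirrors the wrapper class above."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __getitem__(self, index):
        # Return the (features, label) pair stored at this index.
        return self.X[index], self.y[index]

    def __len__(self):
        return len(self.X)


# Made-up data: 5 records with 8 features each, plus binary labels.
X = torch.arange(40, dtype=torch.float32).reshape(5, 8)
y = torch.tensor([0, 1, 0, 1, 1])

custom = ToyDataset(X, y)
builtin = TensorDataset(X, y)  # built-in wrapper for tensors of equal length

n_records = len(custom)           # calls __len__
features, label = custom[2]       # calls __getitem__
b_features, b_label = builtin[2]  # the same pair from the built-in class
```

Whether to write a custom class or reach for `TensorDataset` is a judgment call; a custom class leaves room for per-record logic later on.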

We'll continue by making Dataset objects for both the training and testing data.

In [17]:
train_data = PimaIndiansDiabetes(X_train, y_train)
test_data = PimaIndiansDiabetes(X_test, y_test)

Finally, time to use the DataLoader class.

Below, we create 2 DataLoader objects - 1 for the training data and the other for the testing data. The DataLoader class has many parameters which we can pass. Let's talk about the arguments we're passing it.

__batch_size__ - this one is self-explanatory. The DataLoader creates the batches for us so that we can simply iterate through them. We no longer have to care about slicing the data to retrieve batches.

__shuffle__ - this shuffles our data, and more importantly, it reshuffles the data every epoch. Each batch is therefore a different random set of 64 records each time, which helps with generalization.

__drop_last__ - this is something I set as True only when I have shuffle set to True. It drops the last non-full batch. You may have noticed that our training set has a size of 514. When we divide that by 64, we get 8 full batches and a last batch which only contains 2 items. Those 2 items are too small a sample to fit the model on.
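The effect of these three arguments is easy to check on a toy dataset. The sketch below uses made-up tensors (10 records, batch size 4) rather than the diabetes data, so the names and numbers in it are illustrative assumptions only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data: 10 records with 3 features each; labels double as record ids.
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y = torch.arange(10)
toy_data = TensorDataset(X, y)

keep_loader = DataLoader(toy_data, batch_size=4, shuffle=True, drop_last=False)
drop_loader = DataLoader(toy_data, batch_size=4, shuffle=True, drop_last=True)

# Batch sizes seen in one pass: the partial batch always comes last.
keep_sizes = [len(labels) for _, labels in keep_loader]  # [4, 4, 2]
drop_sizes = [len(labels) for _, labels in drop_loader]  # [4, 4]

# Each pass is a fresh shuffle that still visits every record exactly once.
for epoch in range(2):
    seen = torch.cat([labels for _, labels in keep_loader])
    assert sorted(seen.tolist()) == list(range(10))

# The same arithmetic for 514 records and a batch size of 64.
full_batches, leftover = divmod(514, 64)  # (8, 2)
```

With `drop_last=True` the leftover records are skipped in that epoch, but because the data is reshuffled each time, different records end up being the ones left out.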

There's 1 more parameter which I haven't set but do wish to bring up: num_workers. It sets how many subprocesses are used to load the data in parallel; the default of 0 loads everything in the main process. It isn't tied to the number of GPUs. On larger datasets, or when each record needs expensive preprocessing, extra workers can prepare the next batches while the model is busy with the current one, which can noticeably speed up training. Our dataset is tiny and already in memory, so we leave it at the default.

To read more about the DataLoader class and its capabilities, I suggest you head on over to PyTorch's documentation: https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader

In [18]:
train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(dataset=test_data, batch_size=batch_size, shuffle=True, drop_last=True)

Let's reset our model, along with the loss and optimizer.

In [19]:
model = Model()
BCE = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)

Let's run our model again, but this time, using the DataLoader class.

In [20]:
for epoch in range(epochs):
    
    iteration_loss = 0.
    iteration_accuracy = 0.
    
    model.train()
    for i, data in enumerate(train_loader):
      X, y = data
      y_pred = model(X.float())
      loss = BCE(y_pred, y.reshape(-1,1).float())     
      
      iteration_loss += loss.item()  # accumulate a plain float, not a tensor
      iteration_accuracy += accuracy(y_pred, y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

    if(epoch % print_epoch == 0):
        print('Train: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))    

    iteration_loss = 0.
    iteration_accuracy = 0.    

    model.eval()
    with torch.no_grad():  # no gradients needed for evaluation
      for i, data in enumerate(test_loader):
        X, y = data
        y_pred = model(X.float())
        loss = BCE(y_pred, y.reshape(-1,1).float())

        iteration_loss += loss.item()
        iteration_accuracy += accuracy(y_pred, y)

    if(epoch % print_epoch == 0):
        print('Test: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))

Train: epoch: 0 - loss: 0.66246; acc: 0.646
Test: epoch: 0 - loss: 0.62454; acc: 0.698
Train: epoch: 100 - loss: 0.63621; acc: 0.646
Test: epoch: 100 - loss: 0.62233; acc: 0.667
Train: epoch: 200 - loss: 0.60738; acc: 0.646
Test: epoch: 200 - loss: 0.59760; acc: 0.667
Train: epoch: 300 - loss: 0.57139; acc: 0.676
Test: epoch: 300 - loss: 0.58358; acc: 0.646
Train: epoch: 400 - loss: 0.53193; acc: 0.740
Test: epoch: 400 - loss: 0.52303; acc: 0.771
Train: epoch: 500 - loss: 0.50465; acc: 0.773
Test: epoch: 500 - loss: 0.51226; acc: 0.766
Train: epoch: 600 - loss: 0.48522; acc: 0.779
Test: epoch: 600 - loss: 0.49977; acc: 0.776
Train: epoch: 700 - loss: 0.47455; acc: 0.775
Test: epoch: 700 - loss: 0.51014; acc: 0.750
Train: epoch: 800 - loss: 0.46725; acc: 0.781
Test: epoch: 800 - loss: 0.51183; acc: 0.734
Train: epoch: 900 - loss: 0.46253; acc: 0.783
Test: epoch: 900 - loss: 0.51971; acc: 0.729
Train: epoch: 1000 - loss: 0.45799; acc: 0.785
Test: epoch: 1000 - loss: 0.51306; acc: 0.740


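Cool stuff! You can see that our code is much cleaner when using the DataLoader class, and in this run, our results are also slightly better. Both the accuracy and the loss at epoch 1000 are a little better on the training and the testing data when using the DataLoader class. Keep in mind that the differences are small: the model was re-initialized with new random weights, and the test loader shuffles and drops the last batch, so each epoch is scored on a random 192 of the 254 test records. Still, it's the combination of these simple tricks which will allow us to stand out in a crowd.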
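If you want test numbers that are directly comparable between runs, one common option is to score every test record exactly once, with shuffling off and the last batch kept. The sketch below is only an illustration under stated assumptions: the data is made up, the `nn.Sequential` model is an untrained stand-in with the same layer sizes as our network, and the loss is summed per batch so that batches of different sizes are weighted correctly.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # only so the made-up data is repeatable

# Made-up test set: 10 records, 8 features, binary labels.
X = torch.randn(10, 8)
y = torch.randint(0, 2, (10,))
eval_loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False, drop_last=False)

# Hypothetical untrained stand-in with the same layer sizes as the article's network.
model = nn.Sequential(nn.Linear(8, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
bce_sum = nn.BCELoss(reduction='sum')  # summed, so a smaller last batch is not over-weighted

total_loss, total_correct, total_seen = 0.0, 0, 0
model.eval()
with torch.no_grad():  # gradients are not needed for evaluation
    for X_batch, y_batch in eval_loader:
        y_pred = model(X_batch)
        target = y_batch.reshape(-1, 1).float()
        total_loss += bce_sum(y_pred, target).item()
        total_correct += ((y_pred >= 0.5).float() == target).sum().item()
        total_seen += len(y_batch)

mean_loss = total_loss / total_seen  # average loss per record
acc = total_correct / total_seen     # fraction of records classified correctly
```

Dividing by the number of records, rather than by the number of batches, gives the same average you would get from a single pass over the whole test set.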

That concludes our little bit with the DataLoader class. Hopefully, you're convinced as to why you need to add this to your arsenal when implementing neural networks.