# Loading data with PyTorch
In this notebook we will investigate a few different ways to handle data with PyTorch on Alvis.

## Using your own data
In many cases you have a dataset in mind that you've already acquired and are keeping in your home folder or perhaps more probable in a storage project.

When it comes to using datasets in training datasets the most efficient approach that we have found to work on Alvis is to use utilities to directly stream data from uncompressed tar-archives or zip-archives (though highly compressed zip files can also sometimes be slow).

In this section we will use the tiny-ImageNet dataset in `/mimer/NOBACKUP/Datasets` but with the hope that you can adapt it to any dataset that you have in your project storage. First let us take a look at the dataset.

### Investigating the contents
Let's take a look at what is contained in this archive.

In [1]:
from torch.utils.data import DataLoader
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from pytorch_dataset import (
    PATH_TO_DATASET,
    examine_zipfile_structure,
    read_labels_from_txt,
    count_train_images,
    visualize_sample_images,
    TinyImageNetDataset
)

def get_dataloader(path_to_dataset: str, batch_size: int = 64, shuffle=True, num_workers: int = 4):
    """Returns a DataLoader for the TinyImageNet dataset."""
    dataset = TinyImageNetDataset(path_to_dataset, split="train")
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
    )
    return dataloader



In [2]:
# torch.cuda.empty_cache()

In [3]:
# Investigate the contents
# Look at the structure of the zipfile to understand its contents.
# This gives an overview of the file organization within the archive.
examine_zipfile_structure(PATH_TO_DATASET)

Number of entries in the zipfile 120609
tiny-imagenet-200/
tiny-imagenet-200/words.txt
tiny-imagenet-200/wnids.txt
tiny-imagenet-200/test/
tiny-imagenet-200/test/images/
tiny-imagenet-200/test/images/test_1860.JPEG
tiny-imagenet-200/test/images/test_613.JPEG
...
tiny-imagenet-200/val/images/val_9872.JPEG
tiny-imagenet-200/val/images/val_2584.JPEG
tiny-imagenet-200/val/images/val_5908.JPEG


***NOTE:*** Investigating files like this can be quite slow if the archives are very large. Looking at the first few files are fast and can be good to get a sense of the file, but you don't want to have to search through them every time. If there is a README in connection with the dataset it is wise to take a look at it. Furthermore, you might want to note down the structure inside the archive yourself if it isn't in the README.

In [4]:
# Let's take a look at one of the txt files next
labels = read_labels_from_txt(PATH_TO_DATASET)
print("Labels:", labels)

Labels: ['n02124075', 'n04067472', 'n04540053', 'n04099969', 'n07749582', 'n01641577', 'n02802426', 'n09246464', 'n07920052', 'n03970156', 'n03891332', 'n02106662', 'n03201208', 'n02279972', 'n02132136', 'n04146614', 'n07873807', 'n02364673', 'n04507155', 'n03854065', 'n03838899', 'n03733131', 'n01443537', 'n07875152', 'n03544143', 'n09428293', 'n03085013', 'n02437312', 'n07614500', 'n03804744', 'n04265275', 'n02963159', 'n02486410', 'n01944390', 'n09256479', 'n02058221', 'n04275548', 'n02321529', 'n02769748', 'n02099712', 'n07695742', 'n02056570', 'n02281406', 'n01774750', 'n02509815', 'n03983396', 'n07753592', 'n04254777', 'n02233338', 'n04008634', 'n02823428', 'n02236044', 'n03393912', 'n07583066', 'n04074963', 'n01629819', 'n09332890', 'n02481823', 'n03902125', 'n03404251', 'n09193705', 'n03637318', 'n04456115', 'n02666196', 'n03796401', 'n02795169', 'n02123045', 'n01855672', 'n01882714', 'n02917067', 'n02988304', 'n04398044', 'n02843684', 'n02423022', 'n02669723', 'n04465501', 'n0

This will later be used as the labels for our task.

We can also look at the number of train files like this.

In [5]:
num_train_images = count_train_images(PATH_TO_DATASET)
print(f"Number of training images: {num_train_images}")

Number of training images: 100000


Reading file information from zip-files is fast as they have information about all of its members easily retriveable. For a tarfile you would have to traverse the entire archive (with e.g. `tarfile.TarFile.getmembers`).

In [6]:
# Visualize sample images
# Displays a 3x3 grid of sample images from the training set.
visualize_sample_images(PATH_TO_DATASET)

It might be worth noting that the image labels are listed in wnids.txt that can be found in the archive.

### Training a classifier from this data
Now we have some understanding of what the database does and we are ready to do some ML on it.

First we will define our machine learning model.

In [7]:
# We will use torch.hub to load a pretrained model
efficientnet = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_efficientnet_b0', pretrained=True)
#preprocessing_utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_convnets_processing_utils')

# We freeze all parameters except the last layer
freeze_blocks = [efficientnet.stem] + [
    layer for layer in efficientnet.layers if layer != efficientnet.classifier
]
for block in freeze_blocks:
    for parameter in block:
        parameter.requires_grad = False

# Modify the number of output classes to 200
efficientnet.classifier.fc = nn.Linear(efficientnet.classifier.fc.in_features, 200)


Using cache found in /cephyr/users/klim/Alvis/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


In [8]:
model = efficientnet
opt = optim.Adam(model.parameters(), lr=0.003)
loss_func = nn.CrossEntropyLoss()


- Initialize DataLoader using TinyImageNetDataset
- Constructs a DataLoader for the TinyImageNet training dataset.
- This will be used to iterate through the data in batches during training.

In [9]:
dataloader = get_dataloader(PATH_TO_DATASET)

- Define the training function
- Implements the main training loop. It iterates over the dataset for multiple epochs,
- calculates the loss, and updates the model parameters. Tracks loss and accuracy.

In [10]:
def train(dataloader, model, opt, loss_func, n_epochs=3, device=torch.device("cuda:0")):
    model = model.to(device)
    model.train()
    for epoch in range(n_epochs):
        loss_sum = 0.0
        n_correct = 0
        for i_batch, (x, label) in enumerate(dataloader):
            print(f'\rBatch: {i_batch}/{len(dataloader)}', end='')
            x, label = x.to(device), label.to(device)
            opt.zero_grad()
            logits = model(x)
            loss = loss_func(logits, label)
            loss_sum += loss.item()
            n_correct += (logits.argmax(1) == label).long().sum()
            loss.backward()
            opt.step()
        avg_loss = loss_sum / (i_batch + 1)
        accuracy = n_correct / len(dataloader.dataset)
        print(f" Loss: {avg_loss}", f'Accuracy: {accuracy}')


- Run the training
- Executes the training function for 3 epochs.
- This will output the loss and accuracy per epoch to track model performance.

In [11]:
%%time
train(dataloader, model, opt, loss_func, n_epochs=3)
# If you get an error message "Unable to find a valid cuDNN algorithm"
# then, you're probably running out of GPU memory and should kill other
# processes using up memory and/or reduce the batch size


Batch: 1562/1563 Loss: 3.1520847374250396 Accuracy: 0.26718997955322266
Batch: 1562/1563 Loss: 2.4606427144943255 Accuracy: 0.39757999777793884
Batch: 1562/1563 Loss: 2.19449408589764 Accuracy: 0.45181000232696533
CPU times: user 2min 47s, sys: 4.35 s, total: 2min 51s
Wall time: 2min 57s


### Tasks
 1. Make yourself acquainted with the above code.
 2. Take a look at `jobscript-pytorch.sh` to see how you would go about training something non-interactively.

## Using available datasets
Some common public datasets are available at `/mimer/NOBACKUP/Datasets`, if there are some specific dataset you would like to see added you can create a request through [support](https://supr.naiss.se/support/).

In this part we will access the processed MNIST dataset available at `/mimer/NOBACKUP/Datasets/MNIST`

In [13]:
from torchvision import datasets

In [14]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = nn.Sequential(
    nn.Conv2d(1, 10, 3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6760, 10),
)

print(model)

opt = optim.Adam(model.parameters(), lr=0.01)
loss_func = nn.CrossEntropyLoss()

Sequential(
  (0): Conv2d(1, 10, kernel_size=(3, 3), stride=(1, 1))
  (1): ReLU()
  (2): Flatten(start_dim=1, end_dim=-1)
  (3): Linear(in_features=6760, out_features=10, bias=True)
)


In this case it is really simple as this dataset has been processed for use with `torchvision.datasets.MNIST` and all we need to do is supply the correct path.

In [15]:
from torchvision import transforms
dataset = datasets.MNIST("/mimer/NOBACKUP/Datasets", transform=transforms.ToTensor())
dataloader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=0,
)

In [16]:
train(dataloader, model, opt, loss_func)

Batch: 468/469 Loss: 0.19284047612122127 Accuracy: 0.942383348941803
Batch: 468/469 Loss: 0.07343067713518704 Accuracy: 0.9773333668708801
Batch: 468/469 Loss: 0.0519570899280364 Accuracy: 0.9830833673477173


## Loading data through the torchvision API
At `torchvision.datasets`, `torchaudio.datasets` and `torchtext.datasets` all have similar APIs that can be used to download datasets that do not exist in `/mimer/NOBACKUP/Datasets`. However, note that this can take some time and you will have to store them yourself. If you are interested in a dataset that permit us redistributing it, then contact us through the regular support form and we can look at storing it centrally.