# Loading data with PyTorch
In this notebook we will investigate a few different ways to handle data with PyTorch on Alvis.

## Using your own data
In many cases you have a dataset in mind that you have already acquired and are keeping in your home folder or, more likely, in a storage project.

When it comes to using datasets for training, the most efficient approach that we have found on Alvis is to stream data directly from uncompressed tar archives or zip archives (though highly compressed zip files can sometimes be slow).
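Whether a zip archive is stored or compressed can be read from its metadata before you decide to stream from it. The sketch below is only an illustration: `compression_summary` is a hypothetical helper, and the archive is a tiny in-memory stand-in rather than a dataset on Alvis.

```python
import io
import zipfile
from collections import Counter


def compression_summary(archive) -> Counter:
    """Count archive members per compression method (hypothetical helper)."""
    method_names = {
        zipfile.ZIP_STORED: "stored",
        zipfile.ZIP_DEFLATED: "deflated",
        zipfile.ZIP_BZIP2: "bzip2",
        zipfile.ZIP_LZMA: "lzma",
    }
    with zipfile.ZipFile(archive) as zf:
        return Counter(
            method_names.get(info.compress_type, "other")
            for info in zf.infolist()
            if not info.is_dir()
        )


# Build a toy archive with one uncompressed and one deflated member
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("stored.txt", "a" * 1000, compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", "a" * 1000, compress_type=zipfile.ZIP_DEFLATED)

summary = compression_summary(buffer)
print(summary)
```

For a real dataset you would pass the path to the archive instead of the in-memory buffer.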

In this section we will use the tiny-ImageNet dataset in `/mimer/NOBACKUP/Datasets`, in the hope that you can adapt the approach to any dataset that you have in your project storage. First let us take a look at the dataset.

### Investigating the contents
Let's take a look at what is contained in this archive.

In [None]:
from zipfile import ZipFile

# Look at the structure of the zipfile
path_to_dataset = '/mimer/NOBACKUP/Datasets/tiny-imagenet-200/tiny-imagenet-200.zip'
with ZipFile(path_to_dataset, 'r') as datazip:
    print(f"Number of entries in the zipfile {len(datazip.namelist())}")
    print(*datazip.namelist()[:7], "...", *datazip.namelist()[-3:], sep="\n")

***NOTE:*** Investigating files like this can be quite slow if the archives are very large. Looking at the first few files is fast and can be a good way to get a sense of the archive, but you don't want to have to search through it every time. If there is a README accompanying the dataset it is wise to take a look at it. Furthermore, you might want to note down the structure inside the archive yourself if it isn't described in the README.
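One way to note down the structure is to summarise the archive once and keep the result next to the dataset. A minimal sketch, assuming a layout of the form `<root>/<split>/...`; the helper name `summarise_layout` and the toy archive are made up for illustration.

```python
import io
from collections import Counter
from pathlib import PurePosixPath
from zipfile import ZipFile


def summarise_layout(archive, depth: int = 2) -> Counter:
    """Count the files below each directory prefix of at most `depth` levels."""
    counts = Counter()
    with ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            parts = PurePosixPath(name).parts
            # Files that sit higher up than `depth` are grouped under their parent
            prefix = "/".join(parts[:min(depth, len(parts) - 1)])
            counts[prefix] += 1
    return counts


# Toy archive that mimics a train/val layout (not the real dataset)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("data/wnids.txt", "n01\nn02\n")
    for wnid in ("n01", "n02"):
        for i in range(3):
            zf.writestr(f"data/train/{wnid}/images/{wnid}_{i}.JPEG", b"")
    zf.writestr("data/val/images/val_0.JPEG", b"")

layout = summarise_layout(buffer)
for prefix, n_files in sorted(layout.items()):
    print(f"{prefix}: {n_files} files")
```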

In [None]:
# Let's take a look at one of the txt files next
with ZipFile(path_to_dataset, "r") as datazip:
    print(datazip.read("tiny-imagenet-200/wnids.txt").decode("utf8").split())

These IDs will later be used as the labels for our task.

We can also look at the number of train files like this.

In [None]:
with ZipFile(path_to_dataset) as datazip:
    print(len([fn for fn in datazip.namelist() if 'train' in fn and fn.endswith('.JPEG')]))

Reading file information from zip files is fast, as a zip file keeps information about all of its members in an easily retrievable form. For a tar file you would have to traverse the entire archive (with e.g. `tarfile.TarFile.getmembers`).
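To make the difference concrete, the sketch below lists the members of a zip archive and of an uncompressed tar archive holding the same toy files. Since the tar listing has to walk every header, it can pay off to do it once and keep the result; storing `(offset_data, size)` per member as done here is one possible approach under that assumption, not something the rest of this notebook relies on.

```python
import io
import tarfile
from zipfile import ZipFile

# Toy payloads standing in for dataset files
files = {f"data/train/img_{i}.txt": f"sample {i}".encode() for i in range(5)}

# Zip: member names are read from the central directory of the archive
zip_buffer = io.BytesIO()
with ZipFile(zip_buffer, "w") as zf:
    for name, payload in files.items():
        zf.writestr(name, payload)
with ZipFile(zip_buffer) as zf:
    zip_names = zf.namelist()

# Tar: there is no central index, so getmembers() reads every header
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    for name, payload in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
tar_buffer.seek(0)
with tarfile.open(fileobj=tar_buffer, mode="r:") as tf:
    # Traverse once and remember where the data of each member starts
    index = {m.name: (m.offset_data, m.size) for m in tf.getmembers() if m.isfile()}

# With the index a member can be read by seeking directly (uncompressed tar only)
offset, size = index["data/train/img_3.txt"]
tar_buffer.seek(offset)
payload = tar_buffer.read(size)
print(payload)
```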

In [None]:
from io import BytesIO
from fnmatch import fnmatch

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline


# Visualize images
fig, ax_grid = plt.subplots(3, 3, figsize=(15, 15))
with ZipFile(path_to_dataset) as datazip:
    # Get filenames of training images
    filenames = [fn for fn in datazip.namelist() if 'train' in fn and fn.endswith('.JPEG')]
    for ax, fn in zip(ax_grid.flatten(), filenames):
        # Get path to file and label
        label = fn.split("/")[-1].split('_')[0]
    
        # Add to axis
        img = plt.imread(BytesIO(datazip.read(fn)), format="jpg")
        ax.imshow(img)
        ax.set_title(f'Label {label}')
    
fig.tight_layout()

It might be worth noting that the image labels are listed in `wnids.txt`, which can be found in the archive.
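Turning such a file name into an integer class label takes two steps: extract the ID from the name, then look up its position in the list from `wnids.txt`. The sketch below uses a made-up three-entry stand-in for that list and assumes the `<wnid>_<number>.JPEG` naming seen in the plot titles above; `wnid_from_path` is a hypothetical helper.

```python
def wnid_from_path(path: str) -> str:
    """Extract the ID that precedes the underscore in the file name."""
    return path.split("/")[-1].split("_")[0]


# Stand-in for the contents of wnids.txt (the real file has 200 entries)
wnids = ["n02124075", "n04067472", "n01443537"]
wnid2label = {wnid: label for label, wnid in enumerate(wnids)}

# Example path, assumed to follow the layout of the training images
path = "tiny-imagenet-200/train/n01443537/images/n01443537_0.JPEG"
wnid = wnid_from_path(path)
label = wnid2label[wnid]
print(wnid, "->", label)
```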

### Training a classifier from this data
Now we have some understanding of what the dataset contains and we are ready to do some ML on it.

First we will define our machine learning model.

In [None]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
# For performance set precision,
# see https://www.c3se.chalmers.se/documentation/applications/pytorch/#performance-and-precision
torch.set_float32_matmul_precision("high")

In [None]:
# We will use torch.hub to load a pretrained model
efficientnet = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_efficientnet_b0', pretrained=True)
#preprocessing_utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_convnets_processing_utils')

# We freeze all parameters except the last layer
freeze_blocks = [efficientnet.stem] + [
    layer
    for layer in efficientnet.layers
    if layer != efficientnet.classifier
]
for block in freeze_blocks:
    # Iterating over a block yields sub-modules, so go through .parameters()
    for parameter in block.parameters():
        parameter.requires_grad = False

# Modify the number of output classes to 200
efficientnet.classifier.fc = nn.Linear(efficientnet.classifier.fc.in_features, 200)

In [None]:
model = efficientnet
opt = optim.Adam(model.parameters(), lr=0.003)
loss_func = nn.CrossEntropyLoss()

Now we will construct the dataloader from a custom `Dataset`. Compared to the inspection code above, we will also:
 - make it possible to shuffle the data
 - convert each image into a tensor that can be batched
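Shuffling and batching are handled by the `DataLoader` rather than by the dataset itself. As a rough mental model only (a simplified stand-in, not PyTorch's implementation; `iterate_batches` is a hypothetical helper), it amounts to permuting the sample indices and cutting them into chunks:

```python
import random


def iterate_batches(n_samples: int, batch_size: int, shuffle: bool = True, seed: int = 0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        # Seeded here only to make the illustration reproducible
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        # The last batch is smaller if batch_size does not divide n_samples
        yield indices[start:start + batch_size]


batches = list(iterate_batches(n_samples=10, batch_size=4))
print(batches)
```

Each index is then passed to the dataset's `__getitem__`, and the returned tensors are stacked into one batch, which is why every item needs the same shape.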

In [None]:
from io import BytesIO

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader


# Construct a Dataset class for our dataset
class TinyImageNetDataset(Dataset):
    def __init__(self, path_to_dataset: str, split: str):
        if split not in ["train", "val", "test"]:
            raise ValueError("Invalid split, select 'train', 'val' or 'test'.")
        if split in ["val", "test"]:
            raise NotImplementedError("Only train split is currently implemented.")
        
        self.zfpath = path_to_dataset

        # Do not keep the file handle opened here; sharing one handle between
        # DataLoader workers is a known source of errors, see
        # https://discuss.pytorch.org/t/dataloader-with-zipfile-failed/42795
        self.zf = None
        with ZipFile(self.zfpath) as zf:
            # Get images from split
            self.imglist: list[str] = [
                path for path in zf.namelist()
                if split in path
                and path.endswith(".JPEG")
            ]

            # Get look-up dictionary for word name ID to label
            wnids = zf.read("tiny-imagenet-200/wnids.txt").decode("utf8").split()
            self.wnid2label: dict[str, int] = {wnid: label for label, wnid in enumerate(wnids)}

    def get_label(self, path: str) -> int:
        if not path.endswith(".JPEG"):
            raise ValueError(f"Expected path to image, got {path}")
        word_name_id: str = path.split("/")[-1].split('_')[0]
        return self.wnid2label[word_name_id]

    def __len__(self):
        return len(self.imglist)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, int]:
        # Open the archive lazily so that each worker gets its own file handle
        if self.zf is None:
            self.zf = ZipFile(self.zfpath)

        # Convert image to Tensor of size (Channel, Px, Py)
        imgpath = self.imglist[idx]
        img_array = np.array(Image.open(BytesIO(self.zf.read(imgpath))))
        if img_array.ndim < 3:
            # Greyscale to RGB
            img_array = np.repeat(img_array[..., np.newaxis], 3, -1)

        img_tensor = torch.from_numpy(img_array)
        img_tensor = img_tensor.permute(2,0,1)
                       
        # Get label from filename
        label = self.get_label(imgpath)
        return img_tensor.float(), label

In [None]:
# Construct dataloader from dataset
dataset = TinyImageNetDataset(path_to_dataset, split="train")
dataloader = DataLoader(
    dataset,
    batch_size=1024,
    shuffle=True,
    num_workers=4,
)

In [None]:
# Training
def train(
    dataloader,
    model,
    opt,
    loss_func,
    n_epochs=3,
    device=torch.device("cuda:0"),
):
    model = model.to(device)
    model.train()
    
    for epoch in range(n_epochs):
        
        loss_sum = 0.0
        n_correct = 0
        for i_batch, (x, label) in enumerate(dataloader):
            print('\r' + f'Batch: {i_batch}/{len(dataloader)}', end='')
            x, label = x.to(device), label.to(device)

            opt.zero_grad()

            logits = model(x)
            loss = loss_func(logits, label)
            
            loss_sum += loss.item()
            # .item() keeps the running count a plain Python int
            n_correct += (logits.argmax(1) == label).sum().item()
            
            loss.backward()
            opt.step()
        
        avg_loss = loss_sum / (i_batch + 1)
        accuracy = n_correct / len(dataloader.dataset)
        print(f" Loss: {avg_loss}", f'Accuracy: {accuracy}')


In [None]:
%%time
train(dataloader, model, opt, loss_func, n_epochs=3);
# If you get an error message "Unable to find a valid cuDNN algorithm"
# then, you're probably running out of GPU memory and should kill other
# processes using up memory and/or reduce the batch size

### Tasks
 1. Make yourself acquainted with the above code.
 2. Take a look at `jobscript-pytorch.sh` to see how you would go about training something non-interactively.

## Using available datasets
Some common public datasets are available at `/mimer/NOBACKUP/Datasets`. If there is a specific dataset you would like to see added, you can create a request through [support](https://supr.naiss.se/support/).

In this part we will access the processed MNIST dataset available at `/mimer/NOBACKUP/Datasets/MNIST`.

In [None]:
from torchvision import datasets

In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = nn.Sequential(
    nn.Conv2d(1, 10, 3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6760, 10),
)

print(model)

opt = optim.Adam(model.parameters(), lr=0.01)
loss_func = nn.CrossEntropyLoss()
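The `6760` input features of the linear layer follow from the shape of the convolution output. A small sketch of that arithmetic, assuming 28x28 single-channel MNIST images and the default stride, padding and dilation of `nn.Conv2d`; `conv2d_output_size` is a hypothetical helper, not a PyTorch function:

```python
def conv2d_output_size(
    size: int,
    kernel_size: int,
    stride: int = 1,
    padding: int = 0,
    dilation: int = 1,
) -> int:
    """Output length along one spatial dimension of a 2D convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


height = width = 28  # MNIST image size
out_channels = 10    # number of convolutional filters
out_height = conv2d_output_size(height, kernel_size=3)
out_width = conv2d_output_size(width, kernel_size=3)

# nn.Flatten collapses (channels, height, width) into one dimension
n_features = out_channels * out_height * out_width
print(out_height, out_width, n_features)  # 26 26 6760
```

If you change the kernel size, add padding or use images of another size, the first argument of `nn.Linear` has to change accordingly.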

Loading the data is really simple in this case, as the dataset has been processed for use with `torchvision.datasets.MNIST` and all we need to do is supply the correct path.

In [None]:
from torchvision import transforms
dataset = datasets.MNIST("/mimer/NOBACKUP/Datasets", transform=transforms.ToTensor())
dataloader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=0,
)

In [None]:
train(dataloader, model, opt, loss_func)

## Loading data through the torchvision API
`torchvision.datasets`, `torchaudio.datasets` and `torchtext.datasets` all have similar APIs that can be used to download datasets that do not exist in `/mimer/NOBACKUP/Datasets`. However, note that this can take some time and you will have to store the datasets yourself. If you are interested in a dataset whose license permits us to redistribute it, then contact us through the regular support form and we can look at storing it centrally.
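As a sketch of what such a download could look like, the helper below fetches CIFAR10 through `torchvision.datasets`. The choice of CIFAR10, the helper name `download_cifar10` and the target directory are assumptions made for this illustration; the directory is a placeholder that you would replace with a location in your own storage project.

```python
def download_cifar10(root: str):
    """Download CIFAR10 to `root` if it is missing and return the training set."""
    # Imported lazily so that defining the helper does not require torchvision
    from torchvision import datasets, transforms

    return datasets.CIFAR10(
        root=root,
        train=True,
        download=True,  # only downloads if the data is not already in root
        transform=transforms.ToTensor(),
    )


# Placeholder path (hypothetical), point this at your own project storage
example_root = "/path/to/your/project/datasets"

# Uncomment to download; this needs network access and can take a while
# dataset = download_cifar10(example_root)
```

The returned object can be passed to a `DataLoader` in the same way as the MNIST dataset above.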