# Plan for today
Today we will start working with [PyTorch](https://github.com/pytorch/pytorch). It is a Python-based scientific computing package that serves two purposes:
- a replacement for NumPy that can use the power of GPUs
- a deep learning research platform that provides maximum flexibility and speed

In [None]:
import torch

## PyTorch in 5 minutes

NumPy ndarrays -> PyTorch tensors

Tensors are similar to NumPy’s ndarrays, except that tensors can also be placed on a GPU to accelerate computing.
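
To make the ndarray/tensor correspondence concrete, here is a minimal sketch of converting between the two. The variable names are illustrative only; `torch.from_numpy` and `Tensor.numpy()` share memory with the array on the CPU, and the device selection assumes a CUDA GPU may or may not be present.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)   # shares memory with `a` (CPU only)
a[0, 0] = 100.0           # the change is visible through the tensor
back = t.numpy()          # zero-copy view back to NumPy

# Pick a GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)      # copies the data only if the device differs
print(t[0, 0].item(), back.shape, t_dev.device)
```

Because the memory is shared, in-place edits on either side show up on the other; call `.clone()` or `np.copy` when an independent copy is needed.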

In [None]:
x = torch.rand(5, 3)
print(x)

In [None]:
x = torch.zeros(5, 3, dtype=torch.long)
print(x)

In [None]:
x = torch.tensor([5.5, 3])
print(x)

In [None]:
A = torch.rand(5, 5)
# torch.svd is deprecated; torch.linalg.svd returns V already transposed (Vh)
U, S, Vh = torch.linalg.svd(A)
print(U)
print(S)
print(Vh)

Many useful operations are already in core PyTorch! We highly recommend reading the documentation on [tensors](https://pytorch.org/docs/stable/tensors.html) and the [most common operations](https://pytorch.org/docs/stable/torch.html).

### Automatic differentiation in PyTorch

In [None]:
x = torch.rand(5, 3, requires_grad=True)
y = torch.rand(5, 3)
print(x + y)

In [None]:
loss = (x ** 2 + y).mean()
print(loss)
loss.backward()

In [None]:
print(x.grad)

In [None]:
print(y.grad)

In [None]:
z = (x + y)
print(z)

In [None]:
z = z.detach()
print(z)

The general rule: if any input operand requires grad, the output tensor will also require grad. If none of the inputs requires grad, the output will not require grad either.
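
The rule can be checked directly on small tensors. This is a minimal sketch with illustrative names; it also shows that `detach()` and `torch.no_grad()` produce outputs that do not require grad.

```python
import torch

a = torch.rand(2, 2, requires_grad=True)
b = torch.rand(2, 2)  # requires_grad defaults to False
c = torch.rand(2, 2)

mixed = a + b         # one operand tracks gradients
plain = b + c         # no operand tracks gradients
detached = (a + b).detach()

with torch.no_grad():
    untracked = a * 2  # tracking is switched off inside the block

print(mixed.requires_grad)      # True
print(plain.requires_grad)      # False
print(detached.requires_grad)   # False
print(untracked.requires_grad)  # False
```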

## Image classification pytorch starter

What do we need for deep learning?
- dataset
- neural network model
- loss function (criterion)
- optimization algorithm (gradient descent?)
- training loop
- save/load trained model
- (optional) metrics
- (optional) visualizations (will be discussed in future seminars)
- (optional) learning rate schedule (will be discussed in future seminars)
- (optional) results reproducibility (today)

As **Data** Scientists of some kind, we should always start from the **data**.

## Dataset

In [None]:
# colab download link
# !wget https://raw.githubusercontent.com/yandexdataschool/Practical_DL/35c067adcc1ab364c8803830cdb34d0d50eea37e/week01_backprop/mnist.py -O mnist.py
import sys
sys.path.insert(0, '../week1-backprop')

import matplotlib.pyplot as plt
from mnist import load_dataset

X_train, y_train, X_val, y_val, X_test, y_test = load_dataset(flatten=False)
print(X_train.shape, y_train.shape)

plt.imshow(X_train[0], cmap='gray')

A PyTorch [dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) should implement two methods: `__len__()` and `__getitem__()`. Let's implement them.

In [None]:
import numpy as np


class MNISTDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, transform=None):
        super().__init__()
        assert len(X) == len(y)
        self.x = X
        self.y = y.astype(np.int64)
        self.transform = transform
        
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, idx):
        x, y = self.x[idx], self.y[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

In [None]:
from torchvision import transforms
from torchvision.utils import make_grid
import numpy as np

transform = transforms.Compose([
    transforms.ToTensor(), # Normalize to range [0, 1], reshape HWC to CHW
    transforms.Normalize((0.5, ), (0.5, ))
])

train_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(
        size=(28,28),
        scale=(0.9, 1.1),
        ratio=(0.9, 1.1)
    ),
    transforms.ToTensor(),
    transforms.Normalize((0.5, ), (0.5, ))
])

train_dataset = MNISTDataset(X_train, y_train, transform=transform)

In [None]:
print(len(train_dataset))
x, y = train_dataset[0]
print(x.shape)

In [None]:
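def draw_tensor_image(t, normalize=True, value_range=(-1, 1), **kwargs):
    if t.ndim == 4:
        # `value_range` replaced the `range` argument of make_grid in newer torchvision releases
        t = make_grid(t, normalize=normalize, value_range=value_range, **kwargs)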
    assert t.ndim == 3
    img = np.transpose(t.numpy(), (1, 2, 0))
    cmap = None
    if img.shape[-1] == 1:
        img = np.squeeze(img, axis=-1)
        cmap = 'gray'
    plt.imshow(img, cmap=cmap)
    
draw_tensor_image(x)

Training is done on batches, so we need to aggregate several examples into a batch. This is done by [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

In [None]:
batch_size = 32
num_workers = 4  # set 0 for Windows (or enjoy bugs)

train_loader = torch.utils.data.DataLoader(
    train_dataset,  # descendant of torch.utils.data.Dataset
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    collate_fn=None,  # how to create batch from separate examples
    drop_last=False,  # drop incomplete batch
    worker_init_fn=None,  # may be useful to set workers different random seeds
)

In [None]:
print(len(train_loader))
for batch_image, batch_target in train_loader:
    break
draw_tensor_image(batch_image)
print(batch_target)

Wrap the validation and test sets in loaders as well.

In [None]:
valid_loader = torch.utils.data.DataLoader(
    MNISTDataset(X_val, y_val, transform=transform),
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers
)
test_loader = torch.utils.data.DataLoader(
    MNISTDataset(X_test, y_test, transform=transform),
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers
)

## Neural network model

In [None]:
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 5, 2),  #12x12
    nn.ReLU(),
    nn.Conv2d(16, 32, 3),  #10x10
    nn.ReLU(),
    nn.Conv2d(32, 64, 3),  #8x8
    nn.ReLU(),
    nn.Conv2d(64, 128, 3, 2), #3x3
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128*9, 10)
)
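
The spatial sizes in the comments above follow from the standard convolution output formula, `floor((n + 2p - k) / s) + 1`. The helper below is a small sketch (its name is hypothetical, and dilation is assumed to be 1) that reproduces them for a 28x28 input.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output side length of a square convolution (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 28  # MNIST images are 28x28
sizes = []
# (kernel_size, stride) of the four Conv2d layers in the model above
for kernel_size, stride in [(5, 2), (3, 1), (3, 1), (3, 2)]:
    size = conv_out_size(size, kernel_size, stride)
    sizes.append(size)

print(sizes)              # [12, 10, 8, 3]
print(128 * size * size)  # 1152, the input width of the final nn.Linear
```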

## Training loop

In [None]:
import time

In [None]:
epochs = 3
lr = 1e-3

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0)
loss_fn = nn.CrossEntropyLoss()  # combines nn.LogSoftmax() and nn.NLLLoss()

for e in range(epochs):
    # timing
    time_start = time.time()
    # train
    model.train()
    train_losses = []
    for batch_image, batch_target in train_loader:
        # zero_grad optimizer
        optimizer.zero_grad()
        # forward
        logits = model(batch_image)
        loss = loss_fn(logits, batch_target)
        # backward
        loss.backward()
        # update all params
        optimizer.step()
        # update metrics
        train_losses.append(loss.item())
    train_loss = np.mean(train_losses)
    
    # validation (no gradients are needed, so skip building the autograd graph)
    model.eval()
    val_losses = []
    with torch.no_grad():
        for batch_image, batch_target in valid_loader:
            # forward
            logits = model(batch_image)
            loss = loss_fn(logits, batch_target)
            val_losses.append(loss.item())
    val_loss = np.mean(val_losses)
    
    # timing
    epoch_time = time.time() - time_start
    
    print(f"[Epoch {e:2d}]: loss={train_loss:.3f}, val_loss={val_loss:.3f}, epoch_time={epoch_time:.2f}")
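
`nn.CrossEntropyLoss` expects raw logits because it applies log-softmax internally and then the negative log-likelihood loss. The following sketch, on random illustrative data, checks that the two formulations agree numerically.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # a batch of 4 examples, 10 classes
targets = torch.tensor([1, 0, 9, 3])  # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(torch.allclose(ce, nll))
```

This is also why the models in this notebook end with a plain `nn.Linear` layer and no softmax.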

## GPU training

In [None]:
torch.cuda.is_available()

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# model = model.to(device)
# optimizer: nothing to move here, torch.optim optimizers have no .to() method
# loss_fn = loss_fn.to(device)

In [None]:
def run_loader(loader, train=True, device=device):
    losses = []
    for batch_image, batch_target in loader:
        # move inputs to device
        batch_image = batch_image.to(device)
        batch_target = batch_target.to(device)
        # forward
        logits = model(batch_image)
        loss = loss_fn(logits, batch_target)
        # backward
        if train:
            loss.backward()
            # update all params
            optimizer.step()
            optimizer.zero_grad()
        # update metrics
        losses.append(loss.item())
    
    return np.mean(losses)

epochs = 3
lr = 1e-3

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0)
loss_fn = nn.CrossEntropyLoss().to(device)  # combines nn.LogSoftmax() and nn.NLLLoss()
model = model.to(device)

for e in range(epochs):
    # timing
    time_start = time.time()
    # train
    model.train()
    train_loss = run_loader(train_loader, train=True)
    
    # validation
    model.eval()
    val_loss = run_loader(valid_loader, train=False)
    
    # timing
    epoch_time = time.time() - time_start
    
    print(f"[Epoch {e:2d}]: loss={train_loss:.3f}, val_loss={val_loss:.3f}, epoch_time={epoch_time:.2f}")

Is this a good or a bad loss? It is hard to tell from the number alone, so we need **metrics**.
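
For classification the most common metric is accuracy: the fraction of examples whose highest-scoring class matches the target. Below is one possible sketch; the name `batch_accuracy` is hypothetical, and it returns a scalar tensor so that `.item()` can be called on the result.

```python
import torch

def batch_accuracy(logits, targets):
    """Fraction of rows whose argmax over classes equals the target index."""
    preds = logits.argmax(dim=1)
    return (preds == targets).float().mean()

logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.1],
                       [1.0, 0.2, 0.3]])
targets = torch.tensor([0, 2, 0, 0])
print(batch_accuracy(logits, targets).item())  # 0.75
```

Note that averaging per-batch accuracies weights a smaller final batch the same as a full one, so the epoch value is a close approximation rather than the exact dataset accuracy.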

In [None]:
def acc(logits, targets):
    # TODO: implement; should return a scalar tensor (`.item()` is called on the result)
    raise NotImplementedError

def run_loader(loader, train=True, device=device):
    model.train(train)
    
    losses = []
    accs = []
    for batch_image, batch_target in loader:
        # move inputs to device
        batch_image = batch_image.to(device)
        batch_target = batch_target.to(device)
        # forward
        logits = model(batch_image)
        loss = loss_fn(logits, batch_target)
        # backward
        if train:
            loss.backward()
            # update all params
            optimizer.step()
            optimizer.zero_grad()
        # update metrics
        losses.append(loss.item())
        accs.append(acc(logits, batch_target).item())
    
    return np.mean(losses), np.mean(accs)

epochs = 3
lr = 1e-3

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0)
loss_fn = nn.CrossEntropyLoss().to(device)  # combines nn.LogSoftmax() and nn.NLLLoss()
model = model.to(device)

for e in range(epochs):
    # timing
    time_start = time.time()
    # train
    train_loss, train_acc = run_loader(train_loader, train=True)
    
    # validation
    val_loss, val_acc = run_loader(valid_loader, train=False)
    
    # timing
    epoch_time = time.time() - time_start
    
    print(f"[Epoch {e:2d}]: loss={train_loss:.3f}, val_loss={val_loss:.3f}, "
          f"acc={train_acc:.3f}, val_acc={val_acc:.3f}, epoch_time={epoch_time:.2f}")

## Let's use a better & faster model now

In [None]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.fc = nn.Linear(4*4*32, 10)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

In [None]:
def run_loader(model, optimizer, criterion, loader, train=True, device=device):
    model.train(train)
    
    losses = []
    accs = []
    for batch_image, batch_target in loader:
        # move inputs to device
        batch_image = batch_image.to(device)
        batch_target = batch_target.to(device)
        # forward (gradient tracking is only needed during training)
        with torch.set_grad_enabled(train):
            logits = model(batch_image)
            loss = criterion(logits, batch_target)
        # backward
        if train:
            loss.backward()
            # update all params
            optimizer.step()
            optimizer.zero_grad()
        # update metrics
        losses.append(loss.item())
        accs.append(acc(logits, batch_target).item())
    
    return np.mean(losses), np.mean(accs)


def train(model, optimizer, criterion, epochs=3, device=device):
    criterion = criterion.to(device)
    model = model.to(device)

    for e in range(epochs):
        # timing
        time_start = time.time()
        # train
        train_loss, train_acc = run_loader(model, optimizer, criterion, train_loader, train=True, device=device)

        # validation
        val_loss, val_acc = run_loader(model, optimizer, criterion, valid_loader, train=False, device=device)

        # timing
        epoch_time = time.time() - time_start

        print(f"[Epoch {e:2d}]: loss={train_loss:.3f}, val_loss={val_loss:.3f}, "
              f"acc={train_acc:.3f}, val_acc={val_acc:.3f}, epoch_time={epoch_time:.2f}")

    # test (pass `device` explicitly so a non-default device is respected)
    test_loss, test_acc = run_loader(model, optimizer, criterion, test_loader, train=False, device=device)
    print(f"[Test]: loss={test_loss:.3f}, acc={test_acc:.3f}")
    return test_loss, test_acc

In [None]:
epochs = 25

lr = 1e-3
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

train(model, optimizer, criterion, epochs=epochs)

### Save/load pretrained model

In [None]:
PATH = './model.pth'
torch.save(model.state_dict(), PATH)

In [None]:
model = CNN()
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load(PATH, map_location=device))
model.to(device)
test_loss, test_acc = run_loader(model, optimizer, criterion, test_loader, train=False)
print(f"[Test]: loss={test_loss:.3f}, acc={test_acc:.3f}")
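
To see that a `state_dict` round trip restores the exact weights, here is a self-contained sketch. It uses a small stand-in network and an in-memory buffer instead of the `CNN` class and a file on disk; both substitutions are assumptions made only to keep the example short.

```python
import io

import torch
from torch import nn

def make_net():
    # Small stand-in architecture; the real notebook would use CNN() here.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

torch.manual_seed(0)
net = make_net()

buffer = io.BytesIO()                 # in-memory replacement for a .pth file
torch.save(net.state_dict(), buffer)
buffer.seek(0)

restored = make_net()                 # fresh, differently initialized weights
restored.load_state_dict(torch.load(buffer, map_location="cpu"))

net.eval()
restored.eval()
x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.equal(net(x), restored(x))
print(same)
```

The architecture has to be rebuilt in code before loading, because the `state_dict` stores only parameter and buffer tensors, not the model class.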

## Reproducibility

In [None]:
epochs = 2

lr = 1e-3
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

train(model, optimizer, criterion, epochs=epochs)

In [None]:
epochs = 2

lr = 1e-3
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

train(model, optimizer, criterion, epochs=epochs)

In [None]:
import random
import numpy as np
import torch.backends.cudnn as cudnn

def set_random_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    random.seed(seed)
    np.random.seed(seed)
    
def prepare_cudnn(deterministic=True, benchmark=False):
    if torch.cuda.is_available():
        # CuDNN reproducibility
        # https://pytorch.org/docs/stable/notes/randomness.html#cudnn
        cudnn.deterministic = deterministic

        # https://discuss.pytorch.org/t/how-should-i-disable-using-cudnn-in-my-code/38053/4
        cudnn.benchmark = benchmark
        
def set_deterministic_behaviour(seed=42):
    set_random_seed(seed)
    prepare_cudnn(deterministic=True, benchmark=False)
    
# for num_workers > 0 you may also need to pass worker_init_fn=worker_init_fn to the (train) DataLoader
# def worker_init_fn(worker_id):
#     set_random_seed(worker_id)
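
Shuffling in a `DataLoader` draws from its own random stream, so seeding it explicitly helps when runs need to be comparable. The sketch below follows the pattern from the PyTorch randomness notes: a seeded `torch.Generator` plus a `worker_init_fn`. The helper names and the toy dataset are illustrative, and `num_workers=0` is used here only to keep the example lightweight.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Derive NumPy/random seeds from the per-worker torch seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def first_epoch_order(seed):
    dataset = TensorDataset(torch.arange(20))
    generator = torch.Generator()
    generator.manual_seed(seed)  # controls the shuffling order
    loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=0,
                        worker_init_fn=seed_worker, generator=generator)
    return torch.cat([batch[0] for batch in loader]).tolist()

order_a = first_epoch_order(42)
order_b = first_epoch_order(42)
print(order_a == order_b)
```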

In [None]:
set_deterministic_behaviour()
epochs = 2

lr = 1e-3
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

train(model, optimizer, criterion, epochs=epochs)

In [None]:
set_deterministic_behaviour()
epochs = 2

lr = 1e-3
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

train(model, optimizer, criterion, epochs=epochs)

## Implement grid neural architecture search on FashionMNIST

[FashionMNIST](https://github.com/zalandoresearch/fashion-mnist) is a somewhat harder dataset than MNIST, but it has the same size and image format.

In [None]:
from torchvision.datasets import FashionMNIST

train_dataset = FashionMNIST('./fmnist', train=True, download=True, transform=transform)
test_dataset = FashionMNIST('./fmnist', train=False, transform=transform)

In [None]:
indices = np.arange(len(train_dataset))
np.random.seed(42)
np.random.shuffle(indices)

val_size = 10000
train_indices, valid_indices = indices[:-val_size], indices[-val_size:]
valid_dataset = # TODO
train_dataset = # TODO
print(len(train_dataset), len(valid_dataset), len(test_dataset))
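
One common way to carve a validation set out of a dataset by index is `torch.utils.data.Subset`. The sketch below shows the idea on a toy `TensorDataset`, not on FashionMNIST itself; the names and sizes are assumptions for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Subset, TensorDataset

# Toy stand-in: 100 examples with one feature and a label in 0..9.
full_dataset = TensorDataset(torch.arange(100).float().unsqueeze(1),
                             torch.arange(100) % 10)

rng = np.random.RandomState(42)
indices = rng.permutation(len(full_dataset))

val_size = 20
train_indices, valid_indices = indices[:-val_size], indices[-val_size:]
train_subset = Subset(full_dataset, train_indices.tolist())
valid_subset = Subset(full_dataset, valid_indices.tolist())

print(len(train_subset), len(valid_subset))  # 80 20
```

Both subsets read from the same underlying dataset object, so they also share its transform; random augmentations set on the parent would apply to the validation subset too.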

In [None]:
batch_size = 256
num_workers = num_workers

train_loader = # TODO
valid_loader = # TODO
test_loader = # TODO
len(train_loader)

#### Define a model parametrized by its activation function

In [None]:
class CNN(nn.Module):
    def __init__(self, activation_layer=nn.ReLU):
        super(CNN, self).__init__()
        # TODO
        
    def forward(self, x):
        # TODO
        return x

In [None]:
# define model, criterion, optimizer
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

epochs = 2
test_loss, test_acc = train(model, optimizer, criterion, epochs=epochs)

In [None]:
# transform target into appropriate form https://pytorch.org/docs/stable/nn.html#torch.nn.MultiLabelMarginLoss
def target_transform(target):
    # TODO
    return target

class MultiLabelMarginLossCustom(nn.MultiLabelMarginLoss):
    def forward(self, logits, target):
        target = target_transform(target)
        return super().forward(logits, target)

In [None]:
# define model, criterion, optimizer
model = CNN()
criterion = MultiLabelMarginLossCustom()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

epochs = 2
test_loss, test_acc = train(model, optimizer, criterion, epochs=epochs)

In [None]:
# TODO: implement grid search by activations and loss functions
# can we trust the results? e.g. can we say that some activation/loss function is definitely better than another?
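
A grid search is a loop over the Cartesian product of the options. Repeating each configuration with several seeds gives a mean and a spread, which is one way to judge whether a difference between configurations is larger than run-to-run noise. The skeleton below is only a sketch: `grid_search` and `dummy_experiment` are hypothetical names, and the dummy returns a placeholder number, not a real measurement. In the notebook it would be replaced by code that seeds, builds the model, and calls `train`.

```python
import itertools
import statistics

def grid_search(grid, run_experiment, seeds=(0, 1, 2)):
    """Run every combination in `grid` once per seed; return {config: (mean, std)}."""
    keys = sorted(grid)
    results = {}
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        scores = [run_experiment(seed=seed, **config) for seed in seeds]
        results[values] = (statistics.mean(scores), statistics.stdev(scores))
    return results

def dummy_experiment(activation, loss, seed):
    # Placeholder only: a fake score that depends on the seed, NOT a real result.
    return 0.5 + 0.01 * seed

grid = {
    "activation": ["relu", "tanh"],
    "loss": ["cross_entropy", "multilabel_margin"],
}
results = grid_search(grid, dummy_experiment)
for config, (mean, std) in results.items():
    print(config, f"{mean:.3f} +/- {std:.3f}")
```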