# Thinking in tensors, writing in PyTorch

Hands-on training by [Piotr Migdał](https://p.migdal.pl) (2019). Version 0.4 for Uniwersytet Śląski.


## ConvNets: Image classification

Notebook by Piotr Migdał, with some help from [Katarzyna Kańska](https://github.com/kkanska).

<a href="https://colab.research.google.com/github/stared/thinking-in-tensors-writing-in-pytorch/blob/master/convnets/Image%20classification.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

Based on the beautiful [Google Quickdraw](https://quickdraw.withgoogle.com/data) dataset. See also:

* [Machine Learning for Visualization - Let’s Explore the Cutest Big Dataset](https://medium.com/@enjalot/machine-learning-for-visualization-927a9dff1cab) - Ian Johnson
* [Learning neural networks within Jupyter Notebook](https://github.com/stared/keras-interactively-piterpy2018)  (my talk at [PiterPy #5](https://piterpy.com/en), 2-3 Nov 2018, St Petersburg, Russia); essentially the same in Keras
* `pip install livelossplot` - [Live training loss plot in Jupyter Notebook for Keras, PyTorch and others](https://github.com/stared/livelossplot/) 

In [None]:
# if run from colab
!pip install livelossplot --quiet

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import seaborn as sns
import pandas as pd

import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

import urllib.request
from livelossplot import PlotLosses

In [2]:
classes = ["cat", "dog", "spider", "octopus", "snowflake"]

In [4]:
!mkdir -p data


In [3]:
# download the bitmap file for each class (note: this re-downloads files that already exist)
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'
for c in classes:
    path = '{}{}.npy'.format(base_url, c.replace('_', '%20'))
    print(path)
    urllib.request.urlretrieve(path, "data/{}.npy".format(c))

https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/cat.npy
https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/dog.npy
https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/spider.npy
https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/octopus.npy
https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/snowflake.npy


## What's inside?

In [None]:
!ls data/

## Data loading

I.e. the boring part.

In [None]:
limit = 500

X_list = []

for c in classes:
    X_c = np.load("data/{}.npy".format(c))  # or "../data/full_numpy_bitmap_{}.npy"
    print("Loaded {} out of {} {}s".format(limit, X_c.shape[0], c))
    X_list.append(X_c[:limit])

X = np.concatenate(X_list)
Y = np.concatenate([limit * [i] for i in range(len(classes))])

In [None]:
X.dtype

In [None]:
X.shape

In [None]:
X[0]

In [None]:
X[0].reshape(28, 28)[:10, :10]

In [None]:
size = 32

In [None]:
X = X.reshape(-1, 1, 28, 28)
X = X.astype('float32') / 255.

# but it is so much easier to work with 32x32 images
X = np.pad(X, [(0, 0), (0, 0), (2, 2), (2, 2)], mode='constant', constant_values=0)

In [None]:
# (samples, channels, height, width)
X.shape

In [None]:
# answer keys are integers
Y.dtype

## Train-test split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

## First, let's have a look

In [None]:
plt.imshow(X_train[53].reshape(size, size), cmap='Greys');

In [None]:
Y_train[:20]

In [None]:
def draw_examples(X, Y, classes, rows=6, scale=1):
    fig, axs = plt.subplots(rows, len(classes), figsize=(scale * len(classes), scale * rows))
    size = X.shape[-1]
    for class_id in range(len(classes)):
        X_class = X[Y == class_id]
        for i in range(rows):
            ax = axs[i, class_id]
            x = X_class[np.random.randint(len(X_class))].reshape(size, size)
            ax.imshow(x, cmap='Greys', interpolation='none')
            ax.axis('off')

In [None]:
draw_examples(X_train, Y_train, classes, rows=6, scale=2)

## Per-class averages

See [this tweet](https://twitter.com/kcimc/status/902229612666658816)

In [None]:
def draw_class_averages(X, Y, classes, scale=2):
    fig, axs = plt.subplots(1, len(classes), figsize=(scale * len(classes), scale))
    size = X.shape[-1]
    for class_id in range(len(classes)):
        X_class = X[Y == class_id]
        ax = axs[class_id]
        x = X_class.mean(axis=0).reshape(size, size)
        ax.imshow(x, cmap='Greys', interpolation='none')
        ax.axis('off')

In [None]:
draw_class_averages(X_train, Y_train, classes)

## Datasets and data loaders

We need to create data loaders to load the data in batches. We split the data into two parts:
* train - for training,
* validation - not used for training, only to evaluate model performance (here we use the test split for that).
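
To see what a data loader actually yields, here is a minimal sketch. It assumes random stand-in arrays (`X_demo` and `Y_demo` are hypothetical names, not the Quickdraw data) with the same layout as our images.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# hypothetical stand-in data: 100 single-channel 32x32 images, 5 classes
X_demo = np.random.rand(100, 1, 32, 32).astype('float32')
Y_demo = np.random.randint(0, 5, size=100)

demo_dataset = TensorDataset(torch.from_numpy(X_demo),
                             torch.from_numpy(Y_demo).long())
demo_loader = DataLoader(demo_dataset, batch_size=64, shuffle=True)

# each iteration gives one batch of (inputs, labels);
# the last batch is smaller when the dataset size is not a multiple of batch_size
batch_shapes = []
for inputs, labels in demo_loader:
    batch_shapes.append((tuple(inputs.shape), tuple(labels.shape)))
    print(inputs.shape, labels.shape, labels.dtype)
```

With 100 samples and `batch_size=64` the loader gives two batches: one of 64 samples and one of the remaining 36.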

In [None]:
torch.from_numpy(Y_train).long().dtype

In [None]:
# wrap the Quickdraw train and test arrays as tensor datasets

# define data loaders
dataloaders = {
    'train':
    DataLoader(TensorDataset(torch.from_numpy(X_train), torch.from_numpy(Y_train).long()),
               batch_size=64,
               shuffle=True, num_workers=4),
    'validation': 
    DataLoader(TensorDataset(torch.from_numpy(X_test), torch.from_numpy(Y_test).long()),
               batch_size=64,
               shuffle=False, num_workers=4)
}

## Before we start

While training a model, it is important to set the model to `train` or `eval` mode, as some layers (e.g. dropout and batch normalization) behave differently during training and evaluation.
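
A small sketch of why the mode matters, using a standalone `nn.Dropout` layer on a vector of ones (the layer and input here are purely illustrative, not part of the models below):

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

layer.train()
out_train = layer(x)  # random entries are zeroed, the rest scaled by 1 / (1 - p) = 2

layer.eval()
out_eval = layer(x)   # in eval mode dropout does nothing

print(out_train.unique())        # only the values 0 and 2 appear
print(torch.equal(out_eval, x))
```

The same module gives different outputs for the same input depending on the mode, which is why the training loop switches between `model.train()` and `model.eval()`.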

See also: [Keras vs. PyTorch: Alien vs. Predator recognition with transfer learning](https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning) which explains API differences between these frameworks.

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def train_model(model, criterion, optimizer, num_epochs=10):
    liveloss = PlotLosses()
    model = model.to(device)
    
    for epoch in range(num_epochs):
        logs = {}
        for phase in ['train', 'validation']:
            if phase == 'train':
                model.train()
            else:
                model.eval()

            running_loss = 0.0
            running_corrects = 0

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # track gradients only in the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)

                if phase == 'train':
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                _, preds = torch.max(outputs, 1)
                running_loss += loss.item() * inputs.size(0)
                running_corrects += (preds == labels).sum().item()

            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects / len(dataloaders[phase].dataset)
            
            prefix = ''
            if phase == 'validation':
                prefix = 'val_'

            logs[prefix + 'log loss'] = epoch_loss
            logs[prefix + 'accuracy'] = epoch_acc
        
        liveloss.update(logs)
        liveloss.draw()

    # return the model so that `model_trained = train_model(...)` is not None
    return model

## Logistic regression

Multi-class logistic regression can be expressed as a shallow neural network consisting of one linear layer and a softmax activation function.

For binary classification, we can use sigmoid (a.k.a. logistic function):

$$ \sigma(x) = \frac{1}{1+\exp(-x)} $$

The softmax function transforms any vector into a probability distribution (values in the range (0, 1) that sum up to 1):
$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$
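
As a quick check of both formulas, here is a sketch on hand-picked numbers (the values of `logits` and `x` are arbitrary). It also shows that sigmoid is the two-class special case of softmax.

```python
import torch

logits = torch.tensor([2.0, 1.0, -1.0])

# softmax by hand, following the formula above
manual = torch.exp(logits) / torch.exp(logits).sum()
builtin = torch.softmax(logits, dim=0)
print(manual, builtin, builtin.sum())

# sigmoid as a special case: softmax([x, 0])[0] = e^x / (e^x + 1) = sigmoid(x)
x = torch.tensor(0.7)
two_class = torch.softmax(torch.stack([x, torch.tensor(0.0)]), dim=0)[0]
print(two_class, torch.sigmoid(x))
```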

We use a cross-entropy loss function:
$$- \sum_j p_{j, true} \log(p_{j, pred})$$

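Note that we do not explicitly include the softmax function in the model class below: `nn.CrossEntropyLoss` applies log-softmax internally, so the model should output raw scores (logits). For details see [torch.nn.CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss).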
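
To make this concrete, the sketch below compares `nn.CrossEntropyLoss` on raw logits with the same quantity computed by hand. The random `logits` and the `targets` are made up for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)            # 4 samples, 5 classes (raw model outputs)
targets = torch.tensor([0, 3, 1, 4])  # integer class labels

loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# by hand: softmax, take the probability of the true class, -log, average over samples
probs = logits.softmax(dim=1)
loss_manual = -torch.log(probs[torch.arange(4), targets]).mean()

# equivalently: log-softmax followed by the negative log-likelihood loss
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_builtin.item(), loss_manual.item(), loss_nll.item())
```

With one-hot true labels, the sum in the cross-entropy formula reduces to a single term: minus the log of the predicted probability of the true class.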

See also:

* [Cross-entropy vs. mean-squared error loss](https://www.reddit.com/r/MachineLearning/comments/8im9eb/d_crossentropy_vs_meansquared_error_loss/)
* [Understanding binary cross-entropy / log loss: a visual explanation](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a)
* [Cross entropy](https://pandeykartikey.github.io/machine/learning/basics/2018/05/22/cross-entropy.html) - another explanation
* [Softmax function](https://en.wikipedia.org/wiki/Softmax_function)
* [Multiclass logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)

In [None]:
class LogisticRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(size**2, len(classes))
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

In [None]:
model = LogisticRegression()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
# optimizer = optim.SGD(model.parameters(), lr=100.)
# optimizer = optim.Adam(model.parameters(), lr=1e-4)

model_trained = train_model(model, criterion, optimizer, num_epochs=20)

## Predictions

In [None]:
pred_logit = model(torch.from_numpy(X_test[:5]).to(device))
pred_logit

In [None]:
pred_logit.softmax(dim=1)

### Exercise

Make some changes and see how it goes.

Hints:

* Try other `optim.SGD` learning rates (e.g. 0.1x and 10x the current one).
* Use `optim.Adam` instead of `optim.SGD`.

Optimizers are important, see:

* [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder
* [SGD > Adam?? Which One Is The Best Optimizer: Dogs-VS-Cats Toy Experiment | SALu](https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/)

tl;dr: If you don't know what to do, use Adam.
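
To get a feel for how optimizers are swapped in PyTorch, here is a toy sketch (not a benchmark). It minimizes a badly scaled quadratic with both optimizers. The helper `minimize`, the toy loss and the learning rates are all assumptions made for illustration.

```python
import torch
from torch import optim

def minimize(optimizer_cls, lr, steps=200):
    # toy loss: f(w) = w0^2 + 100 * w1^2, starting from w = (1, 1)
    w = torch.tensor([1.0, 1.0], requires_grad=True)
    scales = torch.tensor([1.0, 100.0])
    optimizer = optimizer_cls([w], lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (scales * w ** 2).sum()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

sgd_losses = minimize(optim.SGD, lr=1e-3)
adam_losses = minimize(optim.Adam, lr=1e-2)
print("SGD:  first {:.3f}, last {:.3f}".format(sgd_losses[0], sgd_losses[-1]))
print("Adam: first {:.3f}, last {:.3f}".format(adam_losses[0], adam_losses[-1]))
```

Only the optimizer class and its learning rate change; the rest of the loop stays the same, which makes this kind of experiment cheap to run.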

## Old school neural network

Linear layers are also called dense layers or fully-connected layers. Stacking a few of them gives a model called a multilayer perceptron (MLP). Importantly, we need to use an activation function between them for our network to be a nonlinear transformation; without it, a stack of linear layers is equivalent to a single linear layer. Here we use the sigmoid activation function.
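
A short sketch of why the activation is needed, using two small illustrative layers (`fc1` and `fc2` are hypothetical, not the ones from the model below). Without an activation, the two layers collapse into a single linear map.

```python
import torch
from torch import nn

torch.manual_seed(0)
fc1 = nn.Linear(8, 16)
fc2 = nn.Linear(16, 3)
x = torch.randn(5, 8)

# two linear layers with no activation in between
stacked = fc2(fc1(x))

# the same map as a single linear layer: W = W2 W1, b = W2 b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias
collapsed = x @ W.T + b

# with a sigmoid in between, the result is no longer that linear map
with_activation = fc2(torch.sigmoid(fc1(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))
print(torch.allclose(stacked, with_activation, atol=1e-5))
```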

In [None]:
class MLP(nn.Module):
    def __init__(self, hidden_1=128, activation='sigmoid'):
        super().__init__()
        func = {'sigmoid': nn.Sigmoid(), 
                'relu': nn.ReLU(),
                'tanh': nn.Tanh()}[activation]
        self.fc = nn.Sequential(
            nn.Linear(1 * size * size, hidden_1),
            func,
            #nn.Linear(hidden_1, hidden_1),
            #func,
            nn.Linear(hidden_1, len(classes))
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

In [None]:
model = MLP(hidden_1=256)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

model_trained = train_model(model, criterion, optimizer, num_epochs=20)

### Exercise

Make some changes and see how it goes.

Hints:

* Use Tanh or ReLU instead of Sigmoid.
* Use more than 20 epochs.
* In practice, neural networks use 2-3 dense layers.
* Make big changes to see a difference. In this case change the hidden layer size by 2x or even 10x.

## Convolutional neural network

Treating an image as a flat vector loses its spatial structure. Instead, we can use the spatial structure to our advantage and perform convolutions.
Convolution is an operation which performs the same local operation on each part of the image.
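
As an illustration, the sketch below applies one hand-written 3×3 kernel to a toy 6×6 image with a vertical edge. Both the image and the kernel are made up for this example; in a real network the kernel values are learned.

```python
import torch
import torch.nn.functional as F

# toy image: left half 0, right half 1, shape (samples, channels, height, width)
image = torch.zeros(1, 1, 6, 6)
image[..., 3:] = 1.0

# kernel that responds to brightness increasing from left to right
kernel = torch.tensor([[-1., 0., 1.],
                       [-1., 0., 1.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# the same kernel slides over every 3x3 patch (PyTorch does not flip the kernel);
# without padding, a 6x6 input gives a 4x4 output
response = F.conv2d(image, kernel)
print(response[0, 0])
```

Every output row is `[0, 3, 3, 0]`: the kernel responds only where a patch contains the edge, regardless of the vertical position.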

![](https://github.com/vdumoulin/conv_arithmetic/blob/master/gif/same_padding_no_strides.gif?raw=true)

Each convolution layer produces new channels based on those which preceded it. For color images we would start with 3 channels for the red, green and blue (RGB) components; our Quickdraw bitmaps are grayscale, so we start with a single channel. Next, channels get more and more abstract.

While producing new channels with representations of various properties of the image, we also reduce the resolution, usually using pooling layers.
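
To follow how channels grow while resolution shrinks, here is a sketch that pushes a dummy 32×32 single-channel image through two convolution and pooling steps. The layer sizes are chosen for illustration.

```python
import torch
from torch import nn

x = torch.zeros(1, 1, 32, 32)  # (samples, channels, height, width)

conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # padding=1 keeps 32x32
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2, 2)                            # halves height and width

shapes = [tuple(x.shape)]
for layer in [conv1, pool, conv2, pool]:
    x = layer(x)
    shapes.append(tuple(x.shape))

for s in shapes:
    print(s)

# number of features after flattening, i.e. the input size of a following dense layer
n_features = x.view(x.size(0), -1).size(1)
print(n_features)
```

After two 2×2 poolings, a 32×32 image becomes 8×8, so 32 channels flatten to 32 * 8 * 8 = 2048 features.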

See also:
* [Image Kernels - visually explained](http://setosa.io/ev/image-kernels/)
* [How neural networks build up their understanding of images](https://distill.pub/2017/feature-visualization/)
* source of above image: [Convolution arithmetic](https://github.com/vdumoulin/conv_arithmetic)
* [Convolutional Neural Networks by Andrej Karpathy](http://cs231n.github.io/convolutional-networks/) for in-depth explanation of convolutions and other accompanying blocks
* [CNNs, Part 1: An Introduction to Convolutional Neural Networks](https://victorzhou.com/blog/intro-to-cnns-part-1/) by Victor Zhou
* [How do Convolutional Neural Networks work?](http://brohrer.github.io/how_convolutional_neural_networks_work.html)

In [None]:
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Linear(32 * 8 * 8, len(classes))
    
    
    def forward(self, x):
        x = self.convs(x)
        x = self.fc(x.view(x.size(0), -1))
        return x

In [None]:
# or we can modularize that
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            self._block(1, 16),
            self._block(16, 32)
        )
        self.fc = nn.Sequential(
            # optionally add a hidden dense layer, with dropout between dense layers:
            # nn.Linear(...),
            # nn.ReLU(),
            # nn.Dropout(0.5),
            nn.Linear(32 * 8 * 8, len(classes))
        )
        
    def _block(self, in_channels, out_channels):
        return nn.Sequential(
            # nn.BatchNorm2d(in_channels),
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
    
    def forward(self, x):
        x = self.convs(x)
        x = self.fc(x.view(x.size(0), -1))
        return x

In [None]:
model = ConvNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

model_trained = train_model(model, criterion, optimizer, num_epochs=20)

### Exercise

Now, feel free to experiment.

Hints:
* Play with the number of channels and how they grow.
* Usually 3×3 convolutions work the best; stick to them (and 1×1 convolutions which only mix channels).
* You can have 1-3 convolutional layers before each MaxPool operation.
* Adding a Dense layer may help.
* Between dense layers you can use Dropout, to reduce overfitting (i.e. if you see that training accuracy is higher than validation accuracy).
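
On the remark that 1×1 convolutions only mix channels: the sketch below checks that a 1×1 convolution gives the same result as applying one linear map to the channel vector of every pixel. The sizes (16 input channels, 4 output channels, 8×8 images) are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv1x1 = nn.Conv2d(16, 4, kernel_size=1)
x = torch.randn(2, 16, 8, 8)

out_conv = conv1x1(x)

# the same weights, used as a linear map on each pixel's channel vector
weight = conv1x1.weight.view(4, 16)            # drop the trailing 1x1 kernel dimensions
pixels = x.permute(0, 2, 3, 1)                 # (samples, height, width, channels)
out_linear = pixels @ weight.T + conv1x1.bias  # (samples, height, width, 4)
out_linear = out_linear.permute(0, 3, 1, 2)    # back to (samples, channels, height, width)

print(out_conv.shape)
print(torch.allclose(out_conv, out_linear, atol=1e-5))
```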


In [None]:
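def predict(model, X):
    model.eval()  # make sure layers such as dropout are in evaluation mode
    with torch.no_grad():  # no gradients needed for prediction
        return model(torch.from_numpy(X).to(device)).softmax(dim=1).cpu().numpy()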

def plot_preditions(model, X, Y, classes, rows=6, only_wrong=False):
    
    # greedy: runs the model on all of X at once, even though only a few rows are plotted
    preds = predict(model, X)
    
    if only_wrong:
        incorrect = preds.argmax(1) != Y
        preds = preds[incorrect]
        X = X[incorrect]
        Y = Y[incorrect]

    fig, axs = plt.subplots(rows, 2, figsize=(8, 1.5 * rows))
    for i in range(rows):
        ax = axs[i, 0]
        ax.imshow(X[i].reshape(size, size),
                  cmap='Greys', interpolation='none')

        ax.axis('off')
    
        pd.DataFrame({"pred": preds[i], "true": [int(Y[i] == j) for j in range(len(classes))]}, index=classes) \
          .plot(kind='barh', ax=axs[i, 1], xlim=[0, 1], stacked=True, legend=False)
        
    fig.tight_layout()

In [None]:
plot_preditions(model, X_train, Y_train, classes)

In [None]:
plot_preditions(model, X_test, Y_test, classes)

In [None]:
plot_preditions(model, X_test, Y_test, classes, only_wrong=True)

## Quantify confusion

In [None]:
preds = predict(model, X_test).argmax(1)
cm = confusion_matrix(Y_test, preds)
cm

In [None]:
cm_df = pd.DataFrame(cm, index=classes, columns=classes)
cm_df.columns.name = "predicted"
cm_df.index.name = "ground truth"

plt.subplots(figsize=(10,10))
sns.heatmap(cm_df, annot=True, fmt='d')

In [None]:
def confusion_image_matrix(model, X, Y, classes, size=size):
    confused = np.zeros((len(classes), len(classes), size, size), dtype='float32')
    Y_pred = predict(model, X).argmax(1)
    # for each (true, predicted) pair keep the last example seen; pairs never seen stay blank
    for x, y_true, y_pred in zip(X, Y, Y_pred):
        confused[y_true, y_pred] = x[0, :, :]

    fig, axs = plt.subplots(len(classes), len(classes), figsize=(2*len(classes), 2*len(classes)))
    for i in range(len(classes)):
        for j in range(len(classes)):
            ax = axs[i, j]
            ax.imshow(confused[i, j], cmap='Greys', interpolation='none')
            ax.axis('off')

    fig.suptitle('predicted', fontsize=16)
    for ax, c in zip(axs[0], classes):
        ax.set_title(c)

In [None]:
confusion_image_matrix(model, X_test, Y_test, classes)

## Further notes

If you want to learn more, some relevant blog posts:

* [Data science intro for math/phys background](http://p.migdal.pl/2016/03/15/data-science-intro-for-math-phys-background.html)
* [Learning Deep Learning with Keras](https://p.migdal.pl/2017/04/30/teaching-deep-learning.html)
* [Keras or PyTorch as your first deep learning framework](https://deepsense.ai/keras-or-pytorch/) (previously with an inflammatory title *Don't learn TensorFlow - start with Keras or PyTorch instead*)
* [Keras vs. PyTorch: Alien vs. Predator recognition with transfer learning](https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning/) with interactive code in Jupyter Notebook: https://www.kaggle.com/pmigdal/alien-vs-predator-images/kernels
* [Simple diagrams of convoluted neural networks](https://medium.com/inbrowserai/simple-diagrams-of-convoluted-neural-networks-39c097d2925b) - In Browser AI
* [Train a model in tf.keras with Colab, and run it in the browser with TensorFlow.js](https://medium.com/tensorflow/train-on-google-colab-and-run-on-the-browser-a-case-study-8a45f9b1474e) - Zaid Alyafeai

