# First convolution-based neural net

## References

* LeCun et al. 1990, _Handwritten Digit Recognition: Applications of Neural Net Chips and Automatic Learning_, [Neurocomputing](https://link.springer.com/chapter/10.1007/978-3-642-76153-9_35)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# imports for plotting, data handling, and the PyTorch model
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm

# load mnist using scikit-learn
from sklearn.datasets import fetch_openml
from torch.optim import SGD
from torch.utils.data import DataLoader, Dataset

## LeCun et al. 1990, "Handwritten Digit Recognition: Applications of Neural Net Chips and Automatic Learning"
> The following tries to reproduce the original paper. Note that the digit dataset used in the paper could not be found, so [MNIST 784](https://www.openml.org/search?type=data&status=active&id=554) is used instead.

Specifics in the paper:

* neural net
    * weight initialization: uniformly at random $\in [-2.4 / F_i, 2.4 / F_i]$ with $F_i = $ number of inputs of the unit
    * "tanh activation": $A \cdot \tanh (S \cdot a)$ with $A = 1.716$, $S = 2/3$ and $a = \text{weights} \cdot \text{input}$
    * 256 inputs (16 x 16 pixel images)
    * layer #1:
        * convolution with 12 5x5-kernels and stride 2 (output: 8 x 8 x 12 = 768 "units")
        * tanh activation
        * $F_i = 25$?
    * layer #2: 
        * convolution with 12 5x5-kernels and stride 2 (output: 4 x 4 x 12 = 192 "units")
        * tanh activation
        * $F_i = 25$?
    * layer #3:
        * dense with 30 neurons
        * tanh activation
        * $F_i = 192$
    * layer #4:
        * dense output layer with 10 neurons
        * tanh activation
        * $F_i = 30$
* target: vector of 10 values, 1 for the true class and -1 for the other nine
* loss: mean squared error between prediction and target (paper reached 1.8e-2 on test and 2.5e-3 on train)
* error rates: 0.14% on train, 5% on test
* training:
    * stochastic gradient descent (1 sample per backpropagation)
    * samples always in the same order, no shuffling
    * 23 or 30 epochs; the paper is ambiguous
    * learning rate was set using an unspecified second-order derivative method
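
The activation, weight initialization, and target encoding listed above could be sketched in PyTorch as follows. This is a minimal sketch, not the paper's code: the names `ScaledTanh`, `init_uniform_fan_in`, and `encode_targets` are hypothetical, and $F_i$ is assumed to be the number of weights feeding one unit (`in_channels * kernel_h * kernel_w` for a convolution, `in_features` for a dense layer). Drawing the biases from the same interval is also an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledTanh(nn.Module):
    """A * tanh(S * a) with the constants listed above."""

    def __init__(self, A: float = 1.716, S: float = 2.0 / 3.0):
        super().__init__()
        self.A = A
        self.S = S

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        return self.A * torch.tanh(self.S * a)


def init_uniform_fan_in(module: nn.Module, scale: float = 2.4) -> None:
    """Draw weights uniformly from [-scale / F_i, scale / F_i].

    Assumption: F_i is the number of weights feeding a single unit.
    Biases use the same interval, which is a choice of this sketch.
    """
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        fan_in = module.weight[0].numel()
        bound = scale / fan_in
        nn.init.uniform_(module.weight, -bound, bound)
        if module.bias is not None:
            nn.init.uniform_(module.bias, -bound, bound)


def encode_targets(labels: torch.Tensor, n_classes: int = 10) -> torch.Tensor:
    """Map integer labels to vectors with +1 at the label and -1 elsewhere."""
    return 2.0 * F.one_hot(labels, n_classes).to(torch.float32) - 1.0


# Small demonstration on the dense layer #3 (192 inputs, 30 units).
layer = nn.Linear(192, 30)
layer.apply(init_uniform_fan_in)

act = ScaledTanh()
targets = encode_targets(torch.tensor([3, 0]))
loss = nn.MSELoss()(act(torch.zeros(2, 10)), targets)
```

With these constants the activation maps an input of 1 to approximately 1, so the targets of 1 and -1 lie inside the output range of the activation rather than at its asymptotes.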

In [None]:
mnist = fetch_openml("mnist_784", version=1, cache=True)

In [None]:
def get_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"


device = get_device()
device

In [None]:
X = mnist["data"]
y = mnist["target"]
X.shape, y.shape

In [None]:
ix0 = 100
X0, y0 = X[:ix0], y[:ix0]

In [None]:
class DigitsDataset(Dataset):
    def __init__(self, X: pd.DataFrame, y: pd.Series):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        img = (
            torch.from_numpy(self.X.iloc[idx].values / 255.0)
            .reshape(28, 28)
            .double()
        )
        label = int(self.y.iloc[idx])
        return (img, label)

In [None]:
ds = DigitsDataset(X0, y0)

In [None]:
item = ds[4]
plt.imshow(item[0], cmap="gray", origin="upper")
plt.title(f"Label: {item[1]}")
plt.tight_layout()

In [None]:
batch_size = 4
dataloader = DataLoader(ds, batch_size=batch_size, shuffle=True)

In [None]:
train_features, train_labels = next(iter(dataloader))

In [None]:
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0]  # already shaped (28, 28) by the dataset
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")

In [None]:
def calc_conv_output_dim(input_dim, kernel_size, padding, stride):
    # standard formula: floor((input + 2 * padding - kernel) / stride) + 1
    return (input_dim - kernel_size + 2 * padding) // stride + 1


calc_conv_output_dim(28, 5, 2, 2), calc_conv_output_dim(14, 5, 2, 2)
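
The same arithmetic can be checked against the unit counts listed for the paper's 16 x 16 input. The sketch below uses a hypothetical helper `conv_out` and assumes the padding of 2 used in this notebook; the layer list above does not state a padding value.

```python
def conv_out(size: int, kernel: int = 5, padding: int = 2, stride: int = 2) -> int:
    """Output edge length of a square convolution (floor division form)."""
    return (size + 2 * padding - kernel) // stride + 1


n_maps = 12                # feature maps per convolution layer
edge1 = conv_out(16)       # layer #1 feature-map edge
edge2 = conv_out(edge1)    # layer #2 feature-map edge
units1 = edge1 * edge1 * n_maps
units2 = edge2 * edge2 * n_maps
print(edge1, units1)  # 8 768
print(edge2, units2)  # 4 192
```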

In [None]:
class Model(nn.Module):
    def __init__(self, edge: int = 28, n_classes: int = 10):
        super().__init__()

        self.conv1 = nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2)
        edge = edge // 2  # effect of stride
        self.conv2 = nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2)
        edge = edge // 2  # effect of stride
        self.lin1 = nn.Linear(edge * edge * 12, 30)
        self.lin2 = nn.Linear(30, n_classes)
        self.act = torch.tanh

    def forward(self, x):
        x = x.unsqueeze(dim=1)
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = torch.flatten(x, 1)
        x = self.act(self.lin1(x))
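        # return raw logits: nn.CrossEntropyLoss applies log-softmax itself,
        # so a softmax here would be applied twice; argmax is unaffected
        return self.lin2(x)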

In [None]:
model = Model()
model.double()

In [None]:
# list the learnable tensors and their shapes
{name: tuple(p.shape) for name, p in model.named_parameters()}

In [None]:
opt = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

In [None]:
loss_func = nn.CrossEntropyLoss()

In [None]:
model.conv1.weight.device

In [None]:
model.to(device)

In [None]:
n_epochs = 5
model.train()
for epoch in tqdm.tqdm(range(n_epochs), desc="Epochs", total=n_epochs):
    for i, (xb, yb) in tqdm.tqdm(
        enumerate(dataloader), desc="Batches", total=len(dataloader)
    ):
        xb = xb.to(device)
        yb = yb.to(device)
        loss = loss_func(model(xb), yb)

        opt.zero_grad()
        loss.backward()
        opt.step()

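    # loss on the last batch, re-evaluated after the final update of the epoch
    with torch.no_grad():
        print(f"epoch {epoch}: loss {loss_func(model(xb), yb).item():.4f}")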
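
The loop above uses mini-batches, shuffling, momentum, and cross-entropy. A training setup closer to the paper's description (one sample per update, fixed sample order, mean squared error against targets of 1 and -1) might look like the sketch below. The random stand-in data, the single-layer stand-in model, and the learning rate of 0.01 are placeholders chosen only to keep the example self-contained; the paper's second-order learning-rate method is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: random 16 x 16 "images" with random labels (placeholder).
images = torch.rand(8, 16, 16)
labels = torch.randint(0, 10, (8,))

# One sample per update and a fixed order, as described in the paper.
loader = DataLoader(TensorDataset(images, labels), batch_size=1, shuffle=False)

# Stand-in model with 10 tanh outputs; the real network would go here.
net = nn.Sequential(nn.Flatten(), nn.Linear(256, 10), nn.Tanh())
optimizer = SGD(net.parameters(), lr=0.01)  # plain SGD, placeholder lr
mse = nn.MSELoss()

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for xb, yb in loader:
        # +1 for the true class, -1 for all others
        target = 2.0 * F.one_hot(yb, 10).float() - 1.0
        loss = mse(net(xb), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    epoch_losses.append(total / len(loader))
```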

In [None]:
train_features, train_labels = next(iter(dataloader))

In [None]:
model.eval()

In [None]:
train_features = train_features.to(device)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(train_features)

In [None]:
y_pred = outputs.argmax(dim=1).cpu().numpy()
y_pred

In [None]:
train_labels
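
Comparing one batch by eye does not give the error rate that the paper reports. A minimal sketch of computing it over a whole data loader is shown below; `error_rate` and the `AlwaysZero` demonstration model are hypothetical names introduced for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def error_rate(model: nn.Module, loader: DataLoader, device: str = "cpu") -> float:
    """Fraction of samples whose argmax prediction differs from the label."""
    model.eval()
    wrong, total = 0, 0
    for xb, yb in loader:
        preds = model(xb.to(device)).argmax(dim=1).cpu()
        wrong += (preds != yb).sum().item()
        total += yb.numel()
    return wrong / total


class AlwaysZero(nn.Module):
    """Demonstration model that predicts class 0 for every input."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.zeros(x.shape[0], 10)
        out[:, 0] = 1.0
        return out


demo_loader = DataLoader(
    TensorDataset(torch.rand(4, 28, 28), torch.tensor([0, 0, 1, 2])),
    batch_size=2,
)
rate = error_rate(AlwaysZero(), demo_loader)
print(rate)  # 0.5
```

Because only the argmax is used, the function works the same whether the model returns logits or probabilities.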

In [None]:
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].cpu()  # move back to the CPU for plotting
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}, pred: {y_pred[0]}")