## Read the data

We will download the MNIST dataset for training a classifier. The `mnist` package provides convenient functions for that.

The MNIST dataset is composed of images of digits that must be classified with labels from 0 to 9. The inputs are 28x28 matrices containing the grayscale intensity in each pixel.

In [None]:
import numpy as np
import mnist

train_x = mnist.train_images()
train_y = mnist.train_labels()
test_x = mnist.test_images()
test_y = mnist.test_labels()

print('%d training instances and %d test instances' % (len(train_x), len(test_x)))

Check the shape of our training data to see how many input features there are:

In [None]:
print(train_x.shape)
print(train_x[0])

### Formatting

Each sample is a 28x28 matrix, but we want to represent each one as a vector, since our model does not take any advantage of the 2-d structure of the data.

So, we reshape the data:

In [None]:
num_features = 28 * 28
new_shape = [60000, num_features]
train_x_vectors = train_x.reshape(new_shape)
print(train_x_vectors.shape)
print(train_x_vectors[0])

When we reshape an array (or torch tensor, for that matter), we don't need to specify all dimensions. We can leave one as -1, and it will be automatically determined from the size of the data. This is useful when we don't know a priori the shape of some array.

In [None]:
train_x_vectors = train_x.reshape([-1, num_features])
test_x_vectors = test_x.reshape([-1, num_features])
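
As a small illustration of the `-1` placeholder, the sketch below uses a made-up array of five blank 28x28 "images" (the name `fake_images` is hypothetical, not part of the dataset). It shows that NumPy infers the missing dimension, and that it complains when the sizes cannot match.

```python
import numpy as np

# Hypothetical stand-in for a small batch of MNIST images: 5 blank 28x28 arrays.
fake_images = np.zeros((5, 28, 28), dtype=np.uint8)

# -1 asks NumPy to infer this dimension: 5 * 28 * 28 / 784 = 5 rows.
flat = fake_images.reshape([-1, 28 * 28])
print(flat.shape)  # (5, 784)

# The inferred dimension must be a whole number, otherwise reshape fails.
try:
    fake_images.reshape([-1, 1000])
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True
```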

Also, the values are integers in the range $[0, 255]$. It is better to work with float values in a smaller interval, such as $[0, 1]$ or $[-1, 1]$. There are some more elaborate normalization techniques, but for now let's just normalize it to $[0, 1]$.

In [None]:
train_x_vectors /= 255
test_x_vectors /= 255

Oops! The arrays hold integer values, but the result of the division is floating point. In-place operators such as `/=` cannot change the `dtype` of an array, so we need to create new arrays instead.

Keep in mind that data types are a common source of errors!

In [None]:
train_x_norm = train_x_vectors / 255
test_x_norm = test_x_vectors / 255
print(train_x_norm[0])
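
The difference between in-place and out-of-place division can be reproduced on a tiny array. This is a sketch with a hypothetical three-pixel array `pixels`, assuming an unsigned 8-bit integer dtype. The exact error message depends on the NumPy version, so only the exception type is checked.

```python
import numpy as np

# Hypothetical three-pixel "image"; uint8 is an assumed integer dtype.
pixels = np.array([0, 128, 255], dtype=np.uint8)

# In-place division would have to store floats in a uint8 array, which NumPy refuses.
try:
    pixels /= 255
    in_place_failed = False
except TypeError:
    in_place_failed = True

# Out-of-place division allocates a new float array instead.
pixels_norm = pixels / 255
print(in_place_failed, pixels.dtype, pixels_norm.dtype)
```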

Now, check the labels:

In [None]:
print(np.unique(train_y))
num_classes = len(np.unique(train_y))

In [None]:
train_x.shape

In [None]:
train_y.shape

### Creating a simple linear classifier

Our input has 784 dimensions (each one is a pixel), and the output has 10 possible classes. We will create a weight matrix $w$ and a bias vector $b$.

The parameter `requires_grad` tells pytorch that their values are adjustable through gradient backpropagation.

In [None]:
import torch
w = torch.randn([num_features, num_classes], requires_grad=True)
b = torch.randn([num_classes], requires_grad=True)
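
To see what `requires_grad` does in isolation, here is a toy sketch unrelated to MNIST (the tensors `v` and `u` are made up for illustration). Only the tensor created with `requires_grad=True` receives a gradient after `backward()`.

```python
import torch

# Toy tensors, not part of the classifier: v is trainable, u is a constant.
v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
u = torch.tensor([10.0, 20.0, 30.0])

# y = sum(v * u), so dy/dv = u.
y = torch.sum(v * u)
y.backward()

print(v.grad)  # gradient with respect to v, equal to u
print(u.grad)  # None: u was not marked as requiring gradients
```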

For illustration purposes, let's take the first row of the data and create a pytorch tensor with it.

In [None]:
x0 = torch.tensor(train_x_norm[0])
torch.matmul(x0, w) + b

Again, take care with data types! The inputs are double-precision floats (64 bits) and the weights are single-precision floats (32 bits). Let's explicitly create the input tensor with single-precision floats.

In [None]:
x0 = torch.tensor(train_x_norm[0], dtype=torch.float)
logits = torch.matmul(x0, w) + b
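
The mismatch comes from the defaults of the two libraries: NumPy floats are 64-bit, while `torch.randn` produces 32-bit tensors unless the default dtype has been changed. A minimal sketch with a made-up array `arr` shows the dtype that `torch.tensor` inherits and two ways to get a 32-bit tensor.

```python
import numpy as np
import torch

# Made-up data: NumPy creates float64 arrays by default.
arr = np.array([0.0, 0.5, 1.0])

inherited = torch.tensor(arr)                    # keeps float64
explicit = torch.tensor(arr, dtype=torch.float)  # converted to float32 on creation
converted = inherited.float()                    # same conversion, done afterwards

print(inherited.dtype, explicit.dtype, converted.dtype)
```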

This is what the logits look like. Think of them as the scores of each class for this instance.

In [None]:
logits

We want to take the highest scoring class for each instance, i.e., the argmax:

In [None]:
answer = torch.argmax(logits)
answer
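
With a single instance, `torch.argmax` needs no extra arguments. For a batch of logits, the `dim` argument selects the axis over which to take the maximum. A sketch with hand-written scores for two hypothetical instances and three classes:

```python
import torch

# Hand-written logits: 2 instances (rows) x 3 classes (columns).
batch_logits = torch.tensor([[0.1, 2.0, -1.0],
                             [3.0, 0.0, 0.5]])

# dim=1 takes the argmax across classes, giving one prediction per instance.
predictions = torch.argmax(batch_logits, dim=1)
print(predictions)  # tensor([1, 0])
```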

What is the correct class for this instance? The prediction is most likely wrong, since we just initialized the weights randomly.

In [None]:
label = torch.tensor(train_y[0], dtype=torch.long)
label

#### Loss

We can compute the loss as the mean cross-entropy, as usual for classification problems. Remember that the cross-entropy between the true label distribution $p$ and the predicted $q$ is computed as:

\begin{align}
loss(x, y, \theta) = -\sum_c p(y=c|x) \log q(y=c|x, \theta)
\end{align}

for every label $c$.

The true distribution $p$ is 1 for the correct label and 0 elsewhere; the predicted $q$ can be computed as the softmax over the logits.

In [None]:
p = torch.zeros([num_classes])
p[label] = 1
p

In [None]:
q = torch.softmax(logits, dim=0)
q

In [None]:
q.sum()

In [None]:
cross_entropy = -torch.sum(p * torch.log(q))
cross_entropy
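
Because $p$ is one-hot, every term of the sum vanishes except the one for the correct label, so the cross-entropy reduces to minus the log of the probability assigned to that label. The NumPy sketch below checks this on made-up logits for four classes (all names and values are hypothetical).

```python
import numpy as np

# Made-up logits for 4 classes; the correct label is assumed to be class 2.
toy_logits = np.array([1.0, -0.5, 2.0, 0.3])
toy_label = 2

# Softmax, subtracting the max first for numerical stability.
exp = np.exp(toy_logits - toy_logits.max())
toy_q = exp / exp.sum()

# One-hot true distribution.
toy_p = np.zeros_like(toy_q)
toy_p[toy_label] = 1

full_sum = -np.sum(toy_p * np.log(toy_q))  # sum over all classes
shortcut = -np.log(toy_q[toy_label])       # only the correct label contributes
print(full_sum, shortcut)
```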

We can also use pytorch's own function for that! 

We just need to reshape our logits into a 1x10 tensor and the label into a 1-dimensional tensor, because the function expects a batch: usually we process many inputs at once.

In [None]:
logits = logits.reshape([1, -1])
label = label.reshape([1])

loss = torch.nn.functional.cross_entropy(logits, label)
loss
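
For a batch, `torch.nn.functional.cross_entropy` takes logits of shape `[batch, classes]` and integer labels of shape `[batch]`, and by default averages the per-instance losses. A sketch with made-up values, comparing it with a manual computation:

```python
import torch
import torch.nn.functional as F

# Made-up batch: 2 instances, 3 classes, with correct labels 1 and 0.
batch_logits = torch.tensor([[0.1, 2.0, -1.0],
                             [3.0, 0.0, 0.5]])
batch_labels = torch.tensor([1, 0])

# Built-in: applies log-softmax internally and averages over the batch.
builtin = F.cross_entropy(batch_logits, batch_labels)

# Manual: pick the log-probability of the correct class in each row.
log_q = torch.log_softmax(batch_logits, dim=1)
manual = -log_q[torch.arange(2), batch_labels].mean()

print(builtin, manual)
```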

#### Gradients

Now we have to compute the gradients to adjust the weights. If we take the partial derivative of the cross-entropy (formula above!) with respect to the weights $w_c$, we eventually end up with:

\begin{align}
\frac{\partial loss(x, y, \theta)}{\partial w_c} = x \cdot (q(c|x, \theta) - \mathbb{1}(c = y))
\end{align}

Let's compute the gradient for $w_0$, i.e., the weights for the label 0:

In [None]:
c = 0
q0 = q[c]
# the indicator is 1 only when c is the correct label
gradient0 = x0 * (q0 - int(label == c))

Of course, we can compute the gradient with pytorch as well. Once we call the method `backward()` on a tensor, every tensor with `requires_grad=True` that was used to compute it gets its gradient stored in the attribute `grad`.

In [None]:
loss.backward()
w.grad

In [None]:
w.grad.nonzero().shape

In [None]:
x0.nonzero().shape

Let's check if pytorch gradients match ours. Again, we use the sum of squared differences instead of the simple `==` operator because of possible differences in precision.

In [None]:
gradient0_pytorch = w.grad[:, 0]
torch.sum((gradient0 - gradient0_pytorch) ** 2)
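
Another way to double-check a gradient, independent of autograd, is a finite-difference test. The sketch below uses a made-up 3-feature input and 4 classes, and checks numerically that the derivative of the cross-entropy with respect to column $w_c$ is $x \cdot (q(c|x, \theta) - \mathbb{1}(c = y))$. The helper `loss_fn` and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(3)            # made-up input with 3 features
W = rng.normal(size=(3, 4))  # made-up weights for 4 classes
b = rng.normal(size=4)
y = 1                        # assumed correct label

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_fn(W):
    # Cross-entropy of a single instance: -log q(y | x).
    return -np.log(softmax(x @ W + b)[y])

# Analytic gradient: outer product of x and (q - one_hot(y)).
q = softmax(x @ W + b)
one_hot = np.eye(4)[y]
analytic = np.outer(x, q - one_hot)

# Numerical gradient by central differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(W + step) - loss_fn(W - step)) / (2 * eps)

print(np.abs(analytic - numeric).max())  # should be close to zero
```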

That's great! Now we have to actually change the weights in the direction opposite to the gradients. While we are at it, let's also update the bias, whose gradient was computed by the same `backward()` call.

In [None]:
w.data.sub_(w.grad.data)
b.data.sub_(b.grad.data)
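
The update above subtracts the full gradient, which corresponds to a learning rate of 1. A common variation scales the gradient by a smaller learning rate and resets the stored gradients afterwards, since pytorch accumulates them across calls to `backward()`. A sketch on a toy parameter (the names `theta` and `lr` and the value 0.1 are assumptions for illustration):

```python
import torch

# Toy parameter and loss, unrelated to the classifier.
theta = torch.tensor([1.0, -2.0], requires_grad=True)
toy_loss = torch.sum(theta ** 2)  # its gradient is 2 * theta
toy_loss.backward()

lr = 0.1  # assumed learning rate
with torch.no_grad():         # do not track the update itself
    theta -= lr * theta.grad  # step against the gradient
theta.grad.zero_()            # gradients accumulate, so reset them

print(theta)       # values are approximately [0.8, -1.6]
print(theta.grad)  # all zeros after the reset
```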

Run the forward pass again:

In [None]:
logits = torch.matmul(x0, w) + b
logits = logits.view([1, -1])
torch.softmax(logits, dim=1)

And the loss:

In [None]:
torch.nn.functional.cross_entropy(logits.view([1, -1]), label.view([1]))

We zeroed the loss! So is it that simple to get classifiers with 100% accuracy?