# PyTorch 60-Minute Blitz

This notebook is to jot down some thoughts as I'm going through the 60-Minute Blitz tutorial [here](http://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py). 

## Basic Stuff

In [1]:
import torch

x = torch.Tensor(5, 3)
print(x)


-2724.7168     0.0000 -2724.7168
    0.0000     0.0000     0.0000
    0.0000     0.0000     0.0000
    0.0000    -0.0000     0.0000
    0.0000     0.0000    -0.0000
[torch.FloatTensor of size 5x3]



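Note: tensors can be uninitialized. `torch.Tensor(5, 3)` allocates memory without filling it, so the values printed above are whatever happened to be in that memory.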
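
To make the uninitialized case explicit, here is a minimal sketch assuming a recent PyTorch release (0.4 or later), where `torch.empty` is the spelled-out way to allocate without initializing. The variable names are mine, and since the contents of an empty tensor are arbitrary, only its shape and dtype are printed.

```python
import torch

# Allocates a 5x3 float tensor without writing to it; the values are
# whatever was already in that memory, so assign before reading them.
uninitialized = torch.empty(5, 3)

# Same shape, but every entry is explicitly set.
zeros = torch.zeros(5, 3)
uniform = torch.rand(5, 3)  # samples from the uniform distribution on [0, 1)

print(uninitialized.shape, uninitialized.dtype)
print(zeros.sum().item())
```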

In [2]:
x = torch.rand(5, 3)
print(x)


 0.5033  0.7657  0.7389
 0.8549  0.7376  0.9325
 0.7186  0.5029  0.1061
 0.9886  0.8077  0.0406
 0.4225  0.5819  0.6325
[torch.FloatTensor of size 5x3]



In [3]:
print(x.size())

torch.Size([5, 3])


In [4]:
y = torch.rand(5, 3)
print(x + y)


 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.FloatTensor of size 5x3]



Different kinds of adding, without changing `x` or `y`:

In [5]:
print(x + y)
print(x.add(y))
print(torch.add(x, y))


 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.FloatTensor of size 5x3]


 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.FloatTensor of size 5x3]


 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.FloatTensor of size 5x3]



Different kinds of adding, where one of the variables is mutated:

In [6]:
xclone = x.clone() # So we don't mess up x
xclone.add_(y)
print(xclone) # Should be == x + y


 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.FloatTensor of size 5x3]



In [7]:
xclone = x.clone()
torch.add(xclone, y, out=xclone)
print(xclone)


 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.FloatTensor of size 5x3]



Slicing, and stuff:

In [8]:
print(x)
print(x[:, 0])
print(x[1, :])
print(x[2:3, 1:2])


 0.5033  0.7657  0.7389
 0.8549  0.7376  0.9325
 0.7186  0.5029  0.1061
 0.9886  0.8077  0.0406
 0.4225  0.5819  0.6325
[torch.FloatTensor of size 5x3]


 0.5033
 0.8549
 0.7186
 0.9886
 0.4225
[torch.FloatTensor of size 5]


 0.8549
 0.7376
 0.9325
[torch.FloatTensor of size 3]


 0.5029
[torch.FloatTensor of size 1x1]



Note: as expected, slices include the start index and exclude the stop index.
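
The slicing rules are easier to read off a tensor whose entries encode their own position. This is a small sketch, assuming PyTorch 0.4 or later for `reshape`; the tensor `t` is made up for illustration.

```python
import torch

# A 5x3 tensor holding 0..14, so the entry at (row, col) equals 3*row + col.
t = torch.arange(15).reshape(5, 3)

first_column = t[:, 0]   # every row, column 0 -> 1-D tensor of size 5
second_row = t[1, :]     # row 1, every column -> 1-D tensor of size 3
window = t[2:3, 1:2]     # rows [2, 3) and columns [1, 2) -> stays 2-D, 1x1

print(first_column.tolist())  # [0, 3, 6, 9, 12]
print(second_row.tolist())    # [3, 4, 5]
print(window.tolist())        # [[7]]
```

Indexing with an integer drops that dimension, while a one-element slice keeps it, which is why the last result is 1x1 rather than a scalar.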

A CPU tensor in PyTorch can share its memory with a numpy array. That array can be obtained and modified, and the tensor variable will reflect those changes (and vice versa).

In [9]:
a = torch.ones(5)
print(a)


 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]



In [10]:
b = a.numpy()
print(b)

[ 1.  1.  1.  1.  1.]


In [11]:
a.add_(1)


 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]

Note: scalar addition with tensors is handled conveniently.

In [12]:
print(b)

[ 2.  2.  2.  2.  2.]


You can build a tensor variable on top of an existing numpy array, and the tensor will use the existing numpy array as its underlying storage.

In [13]:
import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)
# a and b should be the same

[ 2.  2.  2.  2.  2.]

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]
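
The memory sharing is specific to `torch.from_numpy` (and `.numpy()`). As a contrast, here is a sketch assuming PyTorch 0.4 or later, where `torch.tensor` copies its input instead of sharing it. The names `shared` and `copied` are mine.

```python
import numpy as np
import torch

a = np.ones(5)
shared = torch.from_numpy(a)  # uses the memory of `a` directly
copied = torch.tensor(a)      # makes an independent copy of the data

np.add(a, 1, out=a)           # mutate the numpy array in place

print(shared.tolist())  # [2.0, 2.0, 2.0, 2.0, 2.0]
print(copied.tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0]
```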



Putting variables on the GPU:

In [14]:
print(torch.cuda.is_available())
if torch.cuda.is_available():
    x = x.cuda()
    y = y.cuda()
    print(x + y)

True

 0.5080  1.2187  1.3362
 1.1371  1.1508  1.6654
 1.3650  1.2020  0.7618
 1.5371  1.7909  0.7624
 0.6864  1.5570  0.7882
[torch.cuda.FloatTensor of size 5x3 (GPU 0)]



### Thoughts:

The underscore after an operation (e.g., `x.add_(y)`) means in-place. Does `torch.add(x, y, out=x)` also do it in-place (i.e., it optimizes this case automatically)? Either way, for in-place, it's more clear to use `_` notation, I think. 
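
One way to probe that question is to compare the address of `x`'s data before and after the call. This is only a sketch: `data_ptr()` shows that the result lands in the same memory, but it says nothing about whether a temporary buffer is used internally.

```python
import torch

x = torch.rand(5, 3)
y = torch.rand(5, 3)
expected = x + y               # out-of-place sum, taken before x is overwritten

address_before = x.data_ptr()  # address of x's first element
torch.add(x, y, out=x)         # ask for the result to be written into x
address_after = x.data_ptr()

print(address_before == address_after)
print(torch.allclose(x, expected))
```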

The `_` notation doesn't make sense to use with `torch.add`, since it would be ambiguous which argument gets modified in place. Lo and behold, as expected, it's not defined.

In [15]:
try:
    torch.add_(x, y)
except AttributeError as e:
    print(e)

module 'torch' has no attribute 'add_'


### Summary

We observed a few things:
 * creating a tensor;
 * operations on a tensor, which can be
    - with mutations (in-place, when the syntax makes sense, or using `out` keyword parameter), or
    - without mutations (`x.add(y)`, `torch.add(x, y)`, `x + y`);
 * CPU tensors can share memory with a numpy array:
   - operations to one are reflected in the other, and
   - we can go back and forth between them (`x.numpy()`, `torch.from_numpy(a)`);
 * and, finally, we can put tensors on the GPU (`x = x.cuda()`) and do operations on them (`x + y`) with ease.
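
The GPU step can also be written so the same code runs with or without CUDA. This is a sketch assuming PyTorch 0.4 or later, which provides `torch.device` and `.to()`; it falls back to the CPU when no GPU is present.

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(5, 3)
y = torch.rand(5, 3)

total_on_device = x.to(device) + y.to(device)  # both operands must share a device
total_on_cpu = total_on_device.cpu()           # bring the result back for printing

print(device, total_on_cpu.shape)
```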

## Autograd: closer to the good stuff

In [16]:
import torch
import torch.autograd as autograd

x = autograd.Variable(torch.ones(2, 2), requires_grad=True)
print(x)

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]



In [17]:
y = x + 2
print(y)

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]



In [18]:
print(y.grad_fn)

<torch.autograd.function.AddConstantBackward object at 0x7f15b8172b88>


In [19]:
y.sum().backward()
print(x.grad)

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]



In [20]:
y.grad = None
x.grad = None
grad_wrt_downstream_scalar_output = torch.ones(2, 2)
y.backward(grad_wrt_downstream_scalar_output)
print(x.grad)
print(y.grad)

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

None


So, to summarize the concepts of `autograd` so far:
* all gradients are computed w.r.t. some eventual downstream scalar output, so:
  - when computing `y.backward()` where `y` is a scalar, no arguments are needed (equivalent to `y.backward(torch.Tensor([1.]))`), but
  - when computing `y.backward()` where `y` is a non-scalar tensor, the gradient of the eventual downstream scalar output `z` with respect to `y` must be given as an argument to the `backward()` function;
* and there are two main classes in autograd, `Variable` and `Function`:
  - `Variable` is a symbolic variable with state, and
  - `Function` is...something (not entirely clear from the tutorial) which is used (at least) for computing downstream gradients. It is probably an object that stores references to the upstream `Variable`s (the ones used in the calculations that created the new `Variable`), as well as how to compute the gradient of the current `Variable` with respect to those immediately upstream `Variable`s; however, I cannot say for sure yet.
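
The points above can be poked at directly. Here is a sketch assuming PyTorch 0.4 or later, where `Variable` was merged into `Tensor` and `requires_grad` is passed to the tensor constructor. It checks which tensors carry a `grad_fn` and what happens when `backward()` is called on a non-scalar without an argument.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # leaf tensor created by the user
y = x + 2                                 # non-leaf, produced by an operation

print(x.is_leaf, x.grad_fn)  # leaf tensors have no grad_fn
print(y.is_leaf, y.grad_fn)  # y records the backward function of the addition

# A non-scalar output needs an explicit upstream gradient.
try:
    y.backward()
    raised = False
except RuntimeError as e:
    raised = True
    print(e)

y.backward(torch.ones(2, 2))  # gradient of the downstream scalar w.r.t. y
print(x.grad)
```

Passing `torch.ones(2, 2)` here is the same as calling `y.sum().backward()`, since the gradient of a sum with respect to each of its terms is 1.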

In [21]:
z = y*y*3
out = z.mean()
print(z, out)

Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]
 Variable containing:
 27
[torch.FloatTensor of size 1]



In [22]:
x.grad = None
y.grad = None
z.grad = None
out.grad = None
out.backward()
print(x.grad)

Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]



Let's verify by hand, where $x$ is `x`, $y$ is `y`, $z$ is `z`, and $o$ is `out`.
$$o = \frac{1}{4} \sum_{i=1}^4 3(x_i + 2)^2$$
$$\frac{\partial o}{\partial x_i} = 1.5(x_i + 2)$$
$$\frac{\partial o}{\partial x_i}\Bigr|_{x = 1} = 1.5(3) = 4.5$$
We see that our program is correct! Very cool. 
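
The same check can be done for arbitrary inputs rather than only at $x = 1$. This is a sketch assuming PyTorch 0.4 or later; `by_hand` is my name for the hand-derived formula.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()

# Hand-derived gradient from above: d(out)/dx_i = 1.5 * (x_i + 2)
by_hand = 1.5 * (x.detach() + 2)

print(torch.allclose(x.grad, by_hand))
```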

We can do "many crazy things", too (direct quote from tutorial). 

In [23]:
x = autograd.Variable(torch.randn(3, 3), requires_grad=True)
y = x * 2
while y.data.norm() < 1000:
    y = y * 2
y.sum().backward()
print(x.grad) # Should be some power of 2.

Variable containing:
 512  512  512
 512  512  512
 512  512  512
[torch.FloatTensor of size 3x3]



So, two more things:
* first, if no `Variable`s in a graph are initialized with `requires_grad=True`, then no gradients will be computed on a call to `backward()` (and it will raise an exception);
* second, gradients are accumulated at the leaf nodes in the acyclic computation graph (i.e., the ones which were created by the user), and persist between calls to `backward()`.
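
The accumulation in the second point is easy to see on a tiny example. This is a sketch assuming PyTorch 0.4 or later; the tensor `w` is made up for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)

(w * 2).sum().backward()
first = w.grad.clone()        # gradient of sum(2w) is 2 for every entry

(w * 2).sum().backward()      # a second call adds to the stored gradient
second = w.grad.clone()

w.grad.zero_()                # reset in place before the next backward pass
after_reset = w.grad.clone()

print(first.tolist())        # [2.0, 2.0, 2.0]
print(second.tolist())       # [4.0, 4.0, 4.0]
print(after_reset.tolist())  # [0.0, 0.0, 0.0]
```

This is why a hand-written training loop has to clear `.grad` on every iteration.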

### Practice: learning to detect symmetry

I'll take a page from one of the original neural net backprop papers by Rumelhart et al. (1986, Nature) and learn to detect symmetry in a binary vector $v \in \{0,1\}^{2n}$ using a two-layer neural network. 

First, let's define our dataset.

In [68]:
import sys
np.set_printoptions(threshold=sys.maxsize) # So I can print whole array; newer numpy rejects threshold=np.nan.

def fill_binary_numbers(arr):
    """Fills 2-d numpy array with binary numbers, counting upwards from 0, 1, 10, 11, etc."""
    n_rows, n_bits = arr.shape
    for power in range(n_bits):
        for i in range(n_rows):
            # Bit `power` of row index i; the most significant bit goes in column 0.
            arr[i, n_bits - power - 1] = (i // 2**power) % 2

dataset = np.zeros((64, 6), dtype="float32")
fill_binary_numbers(dataset)
print(dataset[0:17, :])

[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.  1.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.  1.]
 [ 0.  0.  0.  1.  1.  0.]
 [ 0.  0.  0.  1.  1.  1.]
 [ 0.  0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  1.  0.  0.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  1.  0.]
 [ 0.  0.  1.  1.  1.  1.]
 [ 0.  1.  0.  0.  0.  0.]]


Now that we've got our dataset, we need to get our labels. (Before rescaling, 0 indicates symmetry and 1 indicates asymmetry; the last assignment in the cell maps these to -0.9 and 0.9.)

In [75]:
labels = np.logical_not(
    np.all(
        np.equal(dataset[:, 0:3], dataset[:, 5:2:-1]), axis=1)).astype("float32")
labels = (labels - 0.5)*1.8
print(labels[0:17])

[-0.89999998  0.89999998  0.89999998  0.89999998  0.89999998  0.89999998
  0.89999998  0.89999998  0.89999998  0.89999998  0.89999998  0.89999998
 -0.89999998  0.89999998  0.89999998  0.89999998  0.89999998]


Some notes from debugging:
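* `==` on two numpy arrays is already element-wise (it gives the same result as `np.equal(a, b)`); the resulting boolean array just can't be used directly as a single truth value, so reduce it with `np.all`;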
* numpy arrays use `.shape` to store the shape tuple, and `.size` to store the number of elements in total.
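
A quick check of these numpy behaviours, using two small made-up arrays. As far as I can tell, `==` on arrays of the same shape is element-wise and matches `np.equal`; the usual failure is using the resulting array where a single bool is expected.

```python
import numpy as np

a = np.array([[0, 1, 1], [1, 0, 1]])
b = np.array([[0, 1, 0], [1, 0, 1]])

elementwise = a == b                      # boolean array, same shape as a and b
rows_match = np.all(elementwise, axis=1)  # one bool per row

print(np.array_equal(elementwise, np.equal(a, b)))
print(rows_match.tolist())  # [False, True]
print(a.shape, a.size)      # (2, 3) 6
```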

Now that we've got our labels, let's make our neural net function.

In [76]:
binary_inputs = autograd.Variable(torch.from_numpy(dataset))
labels = autograd.Variable(torch.from_numpy(labels))

In [77]:
weights_l1 = autograd.Variable(torch.randn(6, 2), requires_grad=True)
biases_l1 = autograd.Variable(torch.randn(1, 2), requires_grad=True)
weights_l2 = autograd.Variable(torch.randn(2, 1), requires_grad=True)
biases_l2 = autograd.Variable(torch.randn(1, 1), requires_grad=True)

hidden_layer = torch.sigmoid(torch.matmul(binary_inputs, weights_l1) + biases_l1)
print(hidden_layer.data[0:17])

output_layer = torch.sigmoid(torch.matmul(hidden_layer, weights_l2) + biases_l2).squeeze()

error = sum((labels - output_layer).pow(2))/64.0
print(error)


 0.2609  0.6436
 0.2619  0.3725
 0.4759  0.5814
 0.4771  0.3135
 0.2135  0.8076
 0.2144  0.5799
 0.4112  0.7636
 0.4124  0.5150
 0.2841  0.7472
 0.2851  0.4928
 0.5051  0.6946
 0.5064  0.4278
 0.2338  0.8730
 0.2347  0.6932
 0.4397  0.8410
 0.4410  0.6349
 0.1599  0.5721
[torch.FloatTensor of size 17x2]

Variable containing:
 0.4968
[torch.FloatTensor of size 1]



Let us proceed to the actual learning portion. We'll use $\epsilon = 0.1$ as our learning rate.

In [78]:
eps = 0.1

for i in range(10000):
    hidden_layer = torch.sigmoid(torch.matmul(binary_inputs, weights_l1) - biases_l1)
    output_layer = torch.sigmoid(torch.matmul(hidden_layer, weights_l2) - biases_l2).squeeze()
    error = sum((labels - output_layer).pow(2))/64.0
    error.backward()
    
    weights_l1.data -= eps * weights_l1.grad.data
    biases_l1.data -= eps * biases_l1.grad.data
    weights_l2.data -= eps * weights_l2.grad.data
    biases_l2.data -= eps * biases_l2.grad.data

hidden_layer = torch.sigmoid(torch.matmul(binary_inputs, weights_l1))
output_layer = torch.sigmoid(torch.matmul(hidden_layer, weights_l2))
output_layer.squeeze_()
error = sum((labels - output_layer).pow(2))/64.0

print(error)
print(weights_l1)
print(biases_l1)
print(weights_l2)
print(biases_l2)

Variable containing:
 0.3772
[torch.FloatTensor of size 1]

Variable containing:
-213.4112 -118.8703
-331.1634  -68.7048
-256.3842  -54.3993
-272.2270  -85.8507
-263.9906  -37.8422
-233.3955   13.7417
[torch.FloatTensor of size 6x2]

Variable containing:
-143.9969  174.5171
[torch.FloatTensor of size 1x2]

Variable containing:
-455.7071
 -20.2008
[torch.FloatTensor of size 2x1]

Variable containing:
-0.9131
[torch.FloatTensor of size 1x1]



So, there's something wrong here. I'll try to fix this by doing a simpler optimization problem later.
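
A few candidate causes stand out in the cells above, offered as a guess rather than a confirmed diagnosis:
* the gradients are never reset, so each `error.backward()` adds to the gradients from all earlier iterations, which would fit the very large weights;
* the labels are -0.9 and 0.9, but a sigmoid output lies strictly between 0 and 1, so the -0.9 target can never be reached;
* the biases are added in cell 77, subtracted in the training loop, and left out of the final evaluation, so the printed error is not the error of the model that was trained.

Below is a sketch of the same experiment with those three points changed, assuming PyTorch 0.4 or later. The 0/1 targets, the seed, the step count, and the helper names are my choices, not the original setup. It does not promise that two hidden units will actually learn symmetry in this many steps, only that the loop is self-consistent.

```python
import numpy as np
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# All 64 six-bit vectors, most significant bit first.
bits = np.array([[(i >> (5 - j)) & 1 for j in range(6)] for i in range(64)],
                dtype="float32")
# Target 1 for asymmetric rows, 0 for symmetric ones (no rescaling to +/-0.9).
asymmetric = np.logical_not(np.all(bits[:, 0:3] == bits[:, 5:2:-1], axis=1))
inputs = torch.from_numpy(bits)
targets = torch.from_numpy(asymmetric.astype("float32"))

params = [torch.randn(6, 2, requires_grad=True),  # weights_l1
          torch.randn(1, 2, requires_grad=True),  # biases_l1
          torch.randn(2, 1, requires_grad=True),  # weights_l2
          torch.randn(1, 1, requires_grad=True)]  # biases_l2

def forward(x):
    """Same formula for training and evaluation, biases included."""
    w1, b1, w2, b2 = params
    hidden = torch.sigmoid(x @ w1 + b1)
    return torch.sigmoid(hidden @ w2 + b2).squeeze(1)

def mean_squared_error():
    return (targets - forward(inputs)).pow(2).mean()

eps = 0.1
initial_error = mean_squared_error().item()
for step in range(2000):
    error = mean_squared_error()
    error.backward()
    with torch.no_grad():
        for p in params:
            p -= eps * p.grad  # plain gradient descent step
            p.grad.zero_()     # stop gradients accumulating across steps
final_error = mean_squared_error().item()

print(initial_error, final_error)
```

Only 8 of the 64 inputs are symmetric, so a network that always answers "asymmetric" already gets a fairly low mean squared error. The error value alone is therefore a weak signal here, and the per-row outputs are worth inspecting as well.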