# The PyTorch Tutorials

Notes following along with the [beginner NLP PyTorch tutorials](https://github.com/pytorch/tutorials/tree/master/beginner_source/nlp).

## Introduction to PyTorch

[Article 1](https://pytorch.org/tutorials/beginner/nlp/pytorch_tutorial.html)

In [1]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x12b91fd10>

### Create Tensors

Just wrap a list in `torch.tensor()` and you are good. 

Remember that a tensor of shape (2, 3, 4) is 2 matrices, each with 3 rows and 4 columns.

You can slice tensors as expected.

To get Python numbers from them call `.item()` on the particular slice. 

In [2]:
V_data = [1.0, 2.0, 3.0]
V = torch.tensor(V_data)
print(V)

T_data = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
T = torch.tensor(T_data)
print(T)

tensor([1., 2., 3.])
tensor([[[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]]])


In [3]:
print(V[0])
print(V[0].item())

print(T[0])

tensor(1.)
1.0
tensor([[1., 2.],
        [3., 4.]])


In [4]:
# Raises an error: only one element tensors can be converted to
# Python scalars (ValueError or RuntimeError, depending on version)
# print(T[0].item())
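A quick sketch of the shape and `.item()` points above. The tensor `A` and its contents are made up for illustration, and both exception types are caught because the type raised has differed between PyTorch versions.

```python
import torch

# Hypothetical example tensor: 2 matrices, each with 3 rows and 4 columns
A = torch.zeros(2, 3, 4)
print(A.shape)        # torch.Size([2, 3, 4])
print(A[0].shape)     # torch.Size([3, 4]), one matrix
print(A[0, 1].shape)  # torch.Size([4]), one row of that matrix

# .item() only works once the slice is down to a single element
value = A[0, 1, 2].item()
print(type(value))    # <class 'float'>

# A multi-element slice cannot be converted to a Python number
try:
    A[0].item()
    failed = False
except (ValueError, RuntimeError) as err:
    failed = True
    print("cannot convert:", err)
```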

Loads of ways to [combine tensors together](https://pytorch.org/docs/torch.html). 

One example is concatenation.

In [5]:
# By default it concatenates along the first axis (rows)
# Note that this is the same as adding the

Note that concatenation only works if the tensors have the same size in every dimension except the one you concatenate along. So for 2D tensors, either the row counts match or the column counts match, and you concatenate along the other one.

To add more columns, we must have the same number of rows. To add more rows, we must have the same number of columns. Light bulb moment.

Tensors with shape (3, 4) and (2, 4) can be concatenated along rows to give (5, 4) shape.

In [6]:
# Default is to cat along rows, i.e. dim=0.
# Think of the shape as a vector s: s[0] is the number of rows
# and s[1] is the number of columns.
# All makes so much sense now!
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1.shape)

x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
z_2 = torch.cat([x_2, y_2], 1)
print(z_2.shape)

torch.Size([5, 5])
torch.Size([2, 8])


In [7]:
# If the tensors are not compatible, PyTorch will complain
# torch.cat([x_1, x_2])
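To see the rule in action, here is a small sketch using the (3, 4) and (2, 4) shapes mentioned earlier. The names `a` and `b` are placeholders, not from the tutorial.

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(2, 4)

# Same number of columns, so stacking rows (dim=0) works
rows = torch.cat([a, b], dim=0)
print(rows.shape)  # torch.Size([5, 4])

# Row counts differ (3 vs 2), so adding columns (dim=1) fails
try:
    torch.cat([a, b], dim=1)
    cat_failed = False
except RuntimeError as err:
    cat_failed = True
    print("cat failed:", err)
```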

### Reshaping

Use the `.view()` method to reshape a tensor. Note that `.view()` never copies memory. Ever. It either returns a tensor sharing the same underlying data or raises an error. `.reshape()` returns a view when it can and copies otherwise, and `.resize_()` changes the size in place and may reallocate the storage. I imagine this is vital since we want to just move the shape around a bit but still remember everything in the past like gradients and such.

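Note: the `.stride()` method returns the strides of the tensor, not its shape: for each dimension, how many elements to skip in memory to move one step along it.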

In [8]:
x = torch.randn(2, 3, 4)
print(x.stride())
print(x.view(2, 12).stride())
print(x.view(2, -1).stride())

(12, 4, 1)
(12, 1)
(12, 1)
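A rough sketch of the difference between `.view()` and `.reshape()`, assuming a small integer tensor made up for the purpose. A transposed tensor is not contiguous, so it is a handy case where `.view()` refuses and `.reshape()` has to copy.

```python
import torch

x = torch.arange(6).reshape(2, 3)
print(x.stride())         # (3, 1)

# A view shares storage with the original, so writes show up in both
v = x.view(3, 2)
v[0, 0] = 100
print(x[0, 0].item())     # 100

# Transposing only swaps the strides; the result is not contiguous
t = x.t()
print(t.stride())         # (1, 3)
print(t.is_contiguous())  # False

# .view() refuses a layout it cannot express without copying ...
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# ... while .reshape() falls back to making a copy
r = t.reshape(6)
print(r.data_ptr() == x.data_ptr())  # False, new memory
```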


## Computation Graphs

Obvs we will be using computation graphs for our NNs. 

By default, user-created tensors do not track gradients. We can modify this though. All tensor factory methods have a `requires_grad` flag.

Now the tensor remembers how it was created. This is vital if we are to compute derivatives. 

In [9]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)
z = x + y
print(z)
print(z.grad_fn)

tensor([5., 7., 9.], grad_fn=<AddBackward0>)
<AddBackward0 object at 0x12c0e9150>


If you keep following `z.grad_fn` you will return to x and y. 
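One way to follow the graph backwards, as a sketch: each `grad_fn` has a `next_functions` attribute listing the nodes that fed into it. For leaf tensors these are `AccumulateGrad` nodes, and the `.variable` attribute used here is how I understand you get back to the leaf itself.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)
z = x + y

# Each entry of next_functions is a (node, input_index) pair
for node, _ in z.grad_fn.next_functions:
    print(node)  # AccumulateGrad nodes, one per leaf tensor

# An AccumulateGrad node holds the leaf tensor it accumulates into
first_leaf = z.grad_fn.next_functions[0][0].variable
print(torch.equal(first_leaf, x))

# Leaf tensors were created by the user, so they have no grad_fn
print(x.grad_fn)  # None
```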

In [10]:
s = z.sum()
print(s)
print(s.grad_fn)

tensor(21., grad_fn=<SumBackward0>)
<SumBackward0 object at 0x12c0e9190>


We are always calculating partial derivatives, since s depends on lots of variables and its gradient is just the collection of partial derivatives, one per variable. So we take, for example, the derivative of s wrt x_0 (i.e. the first element of x).

Well, to get s we did

```python
s = x_0 + y_0 + x_1 + y_1 + x_2 + y_2
```

And ds/dx_0 = 1!

In [11]:
# Call this on a scalar to run backprop starting from it,
# all the way back to the leaf tensors (here x and y)
s.backward()
print(x.grad)

tensor([1., 1., 1.])


If you run backprop multiple times, the gradients add up. PyTorch accumulates the gradient into the `.grad` property since for many models this is convenient.
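A minimal sketch of the accumulation, rebuilding the graph before each backward pass the way a training loop would. Zeroing with `x.grad.zero_()` is one way to reset; optimizers offer `zero_grad()` for the same purpose.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

# Two backward passes without resetting: gradients add up
for _ in range(2):
    s = (x + y).sum()
    s.backward()

accumulated = x.grad.clone()
print(accumulated)  # tensor([2., 2., 2.]), i.e. 1 + 1

# Reset before the next pass so old gradients do not leak in
x.grad.zero_()
print(x.grad)       # tensor([0., 0., 0.])
```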

In [12]:
x = torch.randn(2, 2)
y = torch.randn(2, 2)
print(x.requires_grad, y.requires_grad)
z = x + y
print(z.grad_fn)

False False
None


In [13]:
# This happens in-place but also returns the object
# In the tutorial he reassigns the output to x which is
# unnecessary
x.requires_grad_()
y.requires_grad_()
z = x + y
print(z.grad_fn)
print(z.requires_grad)

<AddBackward0 object at 0x12c0f9690>
True


You can stop tracking history on tensors that have `requires_grad=True` by wrapping the code in the context manager `with torch.no_grad():`

In [15]:
print(x.requires_grad)
print((x**2).requires_grad)

with torch.no_grad():
    print((x**2).requires_grad)

True
True
False


## Word Embeddings

[Article](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).

Idea super relevant to time series forecasting:

Recall that in an n-gram language model, given a sequence of words _w_, we want to compute 

```python
P(w_i | w_(i-1), w_(i-2), ..., w_(i-n+1))
```

Where `w_i` is the ith word of the sequence. In other words, we are predicting the ith word using _the n-1 words that came just before it_.
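As a sketch of what that means for the training data, each word gets paired with the words just before it. The sentence below is a made-up toy example, not the text used in the tutorial, and `context_size = 2` assumes a trigram model.

```python
# Hypothetical toy sentence, not the text used in the tutorial
words = "the quick brown fox jumps".split()
context_size = 2  # n - 1 for a trigram model

# Pair each word with the context_size words that come before it
ngrams = [
    (words[i - context_size:i], words[i])
    for i in range(context_size, len(words))
]
for context, target in ngrams:
    print(context, "->", target)
# ['the', 'quick'] -> brown
# ['quick', 'brown'] -> fox
# ['brown', 'fox'] -> jumps
```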

I understand pretty much everything apart from the line in `class NGramLanguageModeler(nn.Module):` class definition saying

```python
self.linear1 = nn.Linear(context_size * embedding_dim, 128)
```

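At first I had no idea where he got `context_size * embedding_dim` from. It's because the context is 2 words long and each word gets looked up as a vector of length `embedding_dim`. The forward pass then flattens those vectors into a single row with `.view((1, -1))`, so the linear layer receives `context_size * embedding_dim` numbers.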
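Walking through the shapes makes the input size easier to see. The sizes and word indices below are made up purely for readability; only the `context_size * embedding_dim` input and the 128 output come from the line quoted above.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only to make the shapes easy to read
vocab_size, embedding_dim, context_size = 10, 4, 2

embeddings = nn.Embedding(vocab_size, embedding_dim)
linear1 = nn.Linear(context_size * embedding_dim, 128)

# Indices of the two context words (arbitrary values)
context_idxs = torch.tensor([3, 7], dtype=torch.long)

embeds = embeddings(context_idxs)
print(embeds.shape)  # torch.Size([2, 4]), one row per context word

flat = embeds.view((1, -1))
print(flat.shape)    # torch.Size([1, 8]), the rows laid end to end

hidden = linear1(flat)
print(hidden.shape)  # torch.Size([1, 128])
```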

OOOOOH SHIT. Ok so the model itself never sees the targets. Its forward pass doesn't care about them. It is going to use the 2 previous words (the context) to predict the third word. So obvs the input is the context words and the output (what the model actually produces) is a set of log-probabilities, one for each word that could come next. We then compare this output to the actual output we want (i.e. the target) using the loss, and update the weights of the model so its output is more aligned with the one we want.

Then the output for a user may be a particular word. But a classifier NN like this one outputs a probability for every word in the vocabulary, which shows you the range of possible values it thinks it could be. Clever. Very clever.
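Here is a stripped-down sketch of that loop: predict from the context, compare with the target, update. It is not the tutorial's model; it skips the hidden layer and uses made-up sizes, indices and learning rate, just to show where the target enters (only in the loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)
vocab_size, embedding_dim, context_size = 10, 4, 2  # hypothetical sizes

embeddings = nn.Embedding(vocab_size, embedding_dim)
linear = nn.Linear(context_size * embedding_dim, vocab_size)
params = list(embeddings.parameters()) + list(linear.parameters())
optimizer = optim.SGD(params, lr=0.1)
loss_function = nn.NLLLoss()

context_idxs = torch.tensor([3, 7], dtype=torch.long)  # two context words
target = torch.tensor([5], dtype=torch.long)           # the word that follows

losses = []
for _ in range(20):
    optimizer.zero_grad()  # gradients accumulate, so reset them first
    flat = embeddings(context_idxs).view((1, -1))
    log_probs = F.log_softmax(linear(flat), dim=1)  # forward pass: no target
    loss = loss_function(log_probs, target)         # target only used here
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

probs = log_probs.exp()       # one probability per vocabulary word
print(probs.sum().item())     # approximately 1.0
print(losses[0], losses[-1])  # the loss should have gone down
```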

Wou