# Tensors

The index notation in the mathematical (TeX) sections is inconsistent and mixes conventions; I was just checking that I understood what would happen for various calculations.

## `torch` tensors basics

From Chapter 3 of the PyTorch book: This chapter focuses heavily on the syntax and typing of tensors.

From [Dive into Deep Learning Chapter 2](https://d2l.ai/chapter_preliminaries/linear-algebra.html#tensors): This section discusses arithmetic with tensors, as well as some performance and memory considerations.


A _tensor_ is a multidimensional array. Unlike a NumPy array, a PyTorch tensor can perform very fast operations on the GPU, distribute calculations over multiple devices, and keep track of the graph of computations that produced it.

In [34]:
import torch 

# create a 1d tensor of size 3 with all 1s
a = torch.ones(3) 
a

tensor([1., 1., 1.])

In [71]:
# create a tensor whose points are the corners of a triangle
points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])
points

tensor([[4., 1.],
        [5., 3.],
        [2., 1.]])

Indexing along the first axis returns a tensor as well, for example:

In [36]:
points[0]

tensor([4., 1.])

`points[0]` is the 1D tensor of size $2$ `tensor([4., 1.])`. 

We can see the shape of these tensors by using `points.shape` and `points[0].shape`. Both outputs have the form `torch.Size([...])`.

In [37]:
points.shape

torch.Size([3, 2])

In [38]:
points[0].shape

torch.Size([2])

In [39]:
points.storage

<bound method Tensor.storage of tensor([[4., 1.],
        [5., 3.],
        [2., 1.]])>
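
Written without parentheses, `points.storage` is only a reference to the method, which is why the output above is a `<bound method ...>` line rather than the stored values. A small sketch of another way to look at the memory layout, using `stride()`, `flatten()`, and `storage_offset()` (my choice of calls, not taken from the book):

```python
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

# Row-major layout: step 2 elements to reach the next row, 1 for the next column.
strides = points.stride()

# The values in the order they sit in memory for this contiguous tensor.
flat = points.flatten().tolist()

# points[1] is a view that starts 2 elements into the same storage.
offset = points[1].storage_offset()

print(strides, flat, offset)
```

The offset of `2` is just the row index times the row stride.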

We can also use `arange` to generate entries, `reshape` to specify the shape, and `dtype` to specify the data type.

In [72]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

We can also use `.zero_` and other operations with a trailing underscore (for example `.add_`, `.mul_`, and `.fill_`) to modify a tensor in place rather than creating a new one.

In [75]:
A.zero_()
A

tensor([[0., 0., 0.],
        [0., 0., 0.]])
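
To see a few of the trailing-underscore methods side by side, here is a minimal sketch. The particular sequence of calls is my own illustration, and `data_ptr()` is used only to check that no new memory was allocated along the way.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
ptr_before = A.data_ptr()  # address of the first element

A.add_(1)      # in place: A is now [[1, 2, 3], [4, 5, 6]]
A.mul_(2)      # in place: A is now [[2, 4, 6], [8, 10, 12]]
after_arith = A.tolist()

A.fill_(7.0)   # in place: every entry becomes 7
A.zero_()      # in place: every entry becomes 0

same_memory = A.data_ptr() == ptr_before  # no reallocation happened
print(after_arith, A.tolist(), same_memory)
```

By contrast, `A = A + 1` would bind the name `A` to a newly allocated tensor.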

In [77]:
# generate numbers 0 to 23
# then create a 3d tensor with two 3x4 matrices
X=torch.arange(24).reshape(2, 3, 4)
X


tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

The entries of `X` should be (indexed from $0$)
    $$X[0]=\begin{bmatrix}
        X_{000} &   X_{001} & X_{002} & X_{003} \\
        X_{010} &   X_{011} & X_{012} & X_{013} \\
        X_{020} &   X_{021} & X_{022} & X_{023}
    \end{bmatrix},
    X[1]=\begin{bmatrix}
        X_{100} &   X_{101} & X_{102} & X_{103} \\
        X_{110} &   X_{111} & X_{112} & X_{113} \\
        X_{120} &   X_{121} & X_{122} & X_{123}
    \end{bmatrix}
    $$

We can pull out parts of the tensor using indices

In [78]:
X[0], X[1],X[0][1]

(tensor([[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]),
 tensor([[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]),
 tensor([4, 5, 6, 7]))

We can also perform operations with the tensors. If the sizes match, operations will be performed entry-wise.

In [79]:
X[0]+ X[1], X[0]* X[1]

(tensor([[12, 14, 16, 18],
         [20, 22, 24, 26],
         [28, 30, 32, 34]]),
 tensor([[  0,  13,  28,  45],
         [ 64,  85, 108, 133],
         [160, 189, 220, 253]]))

We can also transpose a matrix using `transpose` and perform "standard" matrix multiplication using `.mm`. Dot products can be done with `torch.dot(x,y)` and a matrix times a vector can be done with `torch.mv(A,x)`, where `x` and `y` are vectors (1D tensors) with `torch.Size([n])` and `A` is a matrix (2D tensor) with `torch.Size([m,n])`.

In [80]:
torch.transpose(X[0],0,1)

tensor([[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]])

In [81]:
X[0].transpose(0,1)

tensor([[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]])

In [82]:
torch.mm(X[0].transpose(0,1),X[1]),torch.mm(X[1],X[0].transpose(0,1))

(tensor([[224, 236, 248, 260],
         [272, 287, 302, 317],
         [320, 338, 356, 374],
         [368, 389, 410, 431]]),
 tensor([[ 86, 302, 518],
         [110, 390, 670],
         [134, 478, 822]]))

Notice how the `(0,1)` in `transpose` says to switch the $0^{th}$ and $1^{st}$ dimensions (rows and columns).
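
The `torch.dot` and `torch.mv` calls mentioned earlier are not run anywhere in these notes, so here is a small sketch with made-up inputs (the values of `x`, `y`, and `A` are my own choice):

```python
import torch

x = torch.arange(3, dtype=torch.float32)                 # [0., 1., 2.]
y = torch.ones(3)                                        # [1., 1., 1.]
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)   # shape (m, n) = (2, 3)

dot_xy = torch.dot(x, y)   # 0*1 + 1*1 + 2*1
Ax = torch.mv(A, x)        # one dot product per row of A, result has shape (m,)

print(dot_xy, Ax, Ax.shape)
```

Each entry of `Ax` is the dot product of a row of `A` with `x`, so the second entry is $3\cdot 0+4\cdot 1+5\cdot 2=14$.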

Let's see what happens when we transpose the 3D tensor.
Switching the $0^{th}$ and $1^{st}$ dimensions should produce a tensor with shape `torch.Size([3, 2, 4])` whose entries correspond to
    $$\begin{bmatrix}
        X_{000} &   X_{001} & X_{002} & X_{003} \\
        X_{100} &   X_{101} & X_{102} & X_{103}
    \end{bmatrix} 
    = \begin{bmatrix}
        0   &   1   & 2 & 3 \\
        12  &   13  & 14& 15
    \end{bmatrix},$$
    $$\begin{bmatrix}
        X_{010} &   X_{011} & X_{012} & X_{013} \\
        X_{110} &   X_{111} & X_{112} & X_{113} \\
    \end{bmatrix}
    = \begin{bmatrix}
        4   &   5   & 6 & 7 \\
        16  &   17  & 18& 19
    \end{bmatrix},$$
    $$\begin{bmatrix}
        X_{020} &   X_{021} & X_{022} & X_{023} \\
        X_{120} &   X_{121} & X_{122} & X_{123} \\
    \end{bmatrix}
    = \begin{bmatrix}
        8   &   9   & 10 & 11 \\
        20  &   21  & 22 & 23
    \end{bmatrix}
    $$
So the $0^{th}$ row of each original matrix becomes a row in the new $0^{th}$ matrix, the $1^{st}$ (index from 0) row of each original matrix becomes a row in the new $1^{st}$ matrix, etc.


In [83]:
torch.transpose(X,0,1)

tensor([[[ 0,  1,  2,  3],
         [12, 13, 14, 15]],

        [[ 4,  5,  6,  7],
         [16, 17, 18, 19]],

        [[ 8,  9, 10, 11],
         [20, 21, 22, 23]]])
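
As a check of the index bookkeeping that does not rely on PyTorch, the same swap can be written with nested lists. This is only a sketch: the formula `12*i + 4*j + k` is my way of reproducing the entries of `torch.arange(24).reshape(2, 3, 4)`.

```python
# X[i][j][k] = 12*i + 4*j + k reproduces torch.arange(24).reshape(2, 3, 4)
X = [[[12 * i + 4 * j + k for k in range(4)] for j in range(3)] for i in range(2)]

# Swapping axes 0 and 1 means Y[j][i][k] = X[i][j][k]
Y = [[[X[i][j][k] for k in range(4)] for i in range(2)] for j in range(3)]

for matrix in Y:
    print(matrix)
```

The three printed matrices should agree with the output of `torch.transpose(X,0,1)`.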

Likewise, switching the $0^{th}$ and $2^{nd}$ dimensions should produce a tensor with shape `torch.Size([4, 3, 2])` whose entries correspond to
    $$\begin{bmatrix}
        X_{000} &   X_{100}  \\
        X_{010} &   X_{110}  \\
        X_{020} &   X_{120}  \\
    \end{bmatrix} 
    = \begin{bmatrix}
        0   &   12\\
        4   &   16\\
        8   &   20
    \end{bmatrix},
    \begin{bmatrix}
        X_{001} &   X_{101}  \\
        X_{011} &   X_{111}  \\
        X_{021} &   X_{121}  \\
    \end{bmatrix} 
    = \begin{bmatrix}
        1   &   13\\
        5   &   17\\
        9   &   21
    \end{bmatrix},$$
    $$\begin{bmatrix}
        X_{002} &   X_{102}  \\
        X_{012} &   X_{112}  \\
        X_{022} &   X_{122}  \\
    \end{bmatrix} 
    = \begin{bmatrix}
        2   &   14\\
        6   &   18\\
        10  &   22
    \end{bmatrix},
    \begin{bmatrix}
        X_{003} &   X_{103}  \\
        X_{013} &   X_{113}  \\
        X_{023} &   X_{123}  \\
    \end{bmatrix} 
    = \begin{bmatrix}
        3   &   15\\
        7   &   19\\
        11  &   23
    \end{bmatrix},$$
So the $0^{th}$ column of each original matrix becomes a column in the new $0^{th}$ matrix, the $1^{st}$ (index from $0$) column of each original matrix becomes a column in the new $1^{st}$ matrix, etc.

In [84]:
torch.transpose(X,0,2)

tensor([[[ 0, 12],
         [ 4, 16],
         [ 8, 20]],

        [[ 1, 13],
         [ 5, 17],
         [ 9, 21]],

        [[ 2, 14],
         [ 6, 18],
         [10, 22]],

        [[ 3, 15],
         [ 7, 19],
         [11, 23]]])
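
A way to check every entry at once, rather than by eye: swapping axes 0 and 2 should give `Z[k, j, i] == X[i, j, k]`. The sketch below also compares against `permute`, which I believe is an equivalent way to write the same swap.

```python
import torch

X = torch.arange(24).reshape(2, 3, 4)
Z = torch.transpose(X, 0, 2)          # shape (4, 3, 2)

# Every entry should satisfy Z[k, j, i] == X[i, j, k].
matches = all(
    Z[k, j, i] == X[i, j, k]
    for i in range(2) for j in range(3) for k in range(4)
)

# permute with the full axis order is another way to write the swap.
same_as_permute = torch.equal(Z, X.permute(2, 1, 0))

print(tuple(Z.shape), matches, same_as_permute, Z[3].tolist())
```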

## Reduction and norms

We can _reduce_ the size of a tensor by summing the elements or by summing along a particular axis.

In [49]:
x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

In [85]:
X.shape, X.sum()

(torch.Size([2, 3, 4]), tensor(276))

If a tensor $A$ has $n$ dimensions, the index for an entry is $A_{i_0 i_1 \dots i_{n-1}}$. Summing along `axis=j` produces an $(n-1)$-dimensional tensor $B$ whose entries are $\displaystyle B_{i_0 i_1 \dots i_{j-1} i_{j+1} \dots i_{n-1}}=\sum_{i_j} A_{i_0 i_1 \dots i_{j-1} i_j i_{j+1} \dots i_{n-1}}$. This reduces the $j^{th}$ axis to one entry, removing it from the shape of the output.

For our 3D tensor example, summing along `axis=0` is the same as adding the two matrices `X[0]` and `X[1]` entry-wise. That is, we reduce the axis that indexes the 2D slices (matrices).

In [86]:
X.sum(axis=0),X.sum(axis=0).shape


(tensor([[12, 14, 16, 18],
         [20, 22, 24, 26],
         [28, 30, 32, 34]]),
 torch.Size([3, 4]))

For our 3D tensor example, summing along `axis=1` is the same as summing the _columns_ of each matrix. That is, we are reducing the axis corresponding to rows, so the rows collapse into one: $B_{ij}=X_{i0j}+X_{i1j}+X_{i2j}$, the sum of the $j^{th}$ column in the $i^{th}$ matrix.

In [87]:
X.sum(axis=1),X.sum(axis=1).shape


(tensor([[12, 15, 18, 21],
         [48, 51, 54, 57]]),
 torch.Size([2, 4]))

Similarly, for our 3D tensor example, summing along `axis=2` is the same as summing the _rows_ of each matrix. That is, we are reducing the axis corresponding to columns, so the columns collapse into one: $B_{ij}=X_{ij0}+X_{ij1}+X_{ij2}+X_{ij3}$, the sum of the $j^{th}$ row in the $i^{th}$ matrix.

In [88]:
X.sum(axis=2),X.sum(axis=2).shape

(tensor([[ 6, 22, 38],
         [54, 70, 86]]),
 torch.Size([2, 3]))
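
The three reductions can also be written out directly from the index formula, which is a useful check on which index is being summed. A plain-Python sketch, where the nested list stands in for the `(2, 3, 4)` tensor `X`:

```python
# X[i][j][k] = 12*i + 4*j + k reproduces torch.arange(24).reshape(2, 3, 4)
X = [[[12 * i + 4 * j + k for k in range(4)] for j in range(3)] for i in range(2)]

# axis=0: B[j][k] = sum over i of X[i][j][k], shape (3, 4)
sum0 = [[sum(X[i][j][k] for i in range(2)) for k in range(4)] for j in range(3)]

# axis=1: B[i][k] = sum over j of X[i][j][k], shape (2, 4)
sum1 = [[sum(X[i][j][k] for j in range(3)) for k in range(4)] for i in range(2)]

# axis=2: B[i][j] = sum over k of X[i][j][k], shape (2, 3)
sum2 = [[sum(X[i][j][k] for k in range(4)) for j in range(3)] for i in range(2)]

print(sum0, sum1, sum2, sep="\n")
```

In each case the summed index is the one that disappears from the left-hand side.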

We can do the same thing with the mean, computed two different ways. Note that `mean` needs a floating-point tensor, so we create `A` with `dtype=torch.float32`.

In [89]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

In [90]:
A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

In [91]:
A.mean(axis=0), A.sum(axis=0) / A.shape[0]


(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))
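
A sketch of why the float dtype matters and of the mean along the other axis. As far as I know, calling `mean` on an integer tensor raises a `RuntimeError`; the `try`/`except` is there so the cell runs either way.

```python
import torch

A_int = torch.arange(6).reshape(2, 3)     # dtype is torch.int64

try:
    A_int.mean()
    int_mean_failed = False
except RuntimeError:
    int_mean_failed = True                # mean is not defined for integer dtypes

A = A_int.to(torch.float32)
row_means = A.mean(axis=1)                # average over the 3 columns of each row
by_hand = A.sum(axis=1) / A.shape[1]      # divide by the size of the reduced axis

print(int_mean_failed, row_means, by_hand)
```

Note that the divisor is the size of the axis being reduced, so it is `A.shape[1]` here rather than `A.shape[0]`.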

We can also sum (across various axes) without reducing the number of axes. Using our 2D tensor as an example, the size of the axis will be reduced to $1$, but the axis will not be eliminated:

In [92]:
sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

There is also a cumulative sum function along an axis, which does not reduce the number of axes. That is, if a tensor $A$ has $n$ dimensions, the index for an entry is $A_{i_0 i_1 \dots i_{n-1}}$. The cumulative sum along `axis=j` produces an $n$-dimensional tensor $B$ of the same shape whose entries are $\displaystyle B_{i_0 i_1 \dots i_{j-1} k i_{j+1} \dots i_{n-1}}=\sum_{i_j=0}^k A_{i_0 i_1 \dots i_{j-1} i_j i_{j+1} \dots i_{n-1}}$.

In [93]:
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])
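
The running-total formula can be checked by hand with `itertools.accumulate`. A plain-Python sketch, with the nested list standing in for the `2 x 3` tensor `A`:

```python
from itertools import accumulate

A = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]

# axis=0: running total down each column, B[k][c] = A[0][c] + ... + A[k][c]
columns = [list(accumulate(col)) for col in zip(*A)]
cumsum0 = [list(row) for row in zip(*columns)]

# axis=1: running total along each row, B[r][k] = A[r][0] + ... + A[r][k]
cumsum1 = [list(accumulate(row)) for row in A]

print(cumsum0)
print(cumsum1)
```

The last entry along the accumulated axis equals the ordinary sum along that axis.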

Just a brief reminder to myself about how $\ell_p$ norms work
    
$$|| \vec{x} ||_p = \left(\sum_{i} |x_i|^p \right)^{1/p}$$

It is not immediately clear why we are discussing general $\ell_p$ norms, as the only examples are $\ell_1$ (sum of absolute values of entries) and $\ell_2$ (Euclidean), and the corresponding norms for matrices. For example, an $m\times n$ matrix $A$ with entries $a_{ij}$ has _Frobenius norm_ 
    $$ ||A||_F = \sqrt{\sum_i \sum_j |a_{ij}|^2}.$$


In [94]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
torch.norm(A)

tensor(7.4162)

Which we can generalize to a tensor $X$ with $n$ axes:
    $$ ||X||_F = \sqrt{\sum_{i_0} \sum_{i_1} \cdots \sum_{i_{n-1}} |X_{i_0 i_1 \dots i_{n-1}}|^2}.$$

In [105]:
X=torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

In [106]:
torch.norm(X)


tensor(7.4833)
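
Both values can be reproduced from the $\ell_p$ formula applied to the flattened entries. A plain-Python sketch; `p_norm` is a helper name I made up for this check:

```python
def p_norm(values, p):
    """l_p norm of a flat sequence of numbers."""
    return sum(abs(v) ** p for v in values) ** (1 / p)

A_flat = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # entries of arange(6).reshape(2, 3)
X_flat = [4.0, 1.0, 5.0, 3.0, 2.0, 1.0]    # entries of the 3x2 tensor X

print(round(p_norm(A_flat, 2), 4))  # sqrt(55)
print(round(p_norm(X_flat, 2), 4))  # sqrt(56)
print(p_norm(X_flat, 1))            # sum of absolute values
```

The Frobenius norm is just the $\ell_2$ norm of the entries, so the shape of the tensor does not enter the calculation.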

### Exercises

We defined the tensor `X` of shape (2, 3, 4) in this section. What is the output of `len(X)`? Write your answer without implementing any code, then check your answer using code.

Answer: `len(X)` should be $2$, the size of the first axis of `X`.

For a tensor `X` of arbitrary shape, does `len(X)` always correspond to the length of a certain axis of `X`? What is that axis?

Answer: Yes, the $0^{th}$ axis

In [100]:
X = torch.arange(24).reshape(2, 3, 4)
len(X)

2

Run `A / A.sum(axis=1)` and see what happens. Can you analyze the results?

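Answer: This raises an error. `A` has shape `(2, 3)` and `A.sum(axis=1)` has shape `(2,)`; broadcasting lines the shapes up from the last axis, so it compares $3$ with $2$ and fails.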


In [102]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
A / A.sum(axis=1)

RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 1
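
If the intent was to divide each row by its own sum, one fix (a sketch of what I think the exercise is hinting at) is to keep the summed axis so that the shapes `(2, 3)` and `(2, 1)` broadcast:

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

row_sums = A.sum(axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
normalized = A / row_sums                 # (2, 3) / (2, 1) broadcasts across columns

print(normalized)
print(normalized.sum(axis=1))             # each row now sums to 1, up to rounding
```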

Consider three large matrices, say $A\in\mathbb{R}^{2^{10}\times 2^{16}}$, $B\in\mathbb{R}^{2^{16}\times 2^{5}}$, and 
$C\in\mathbb{R}^{2^{5}\times 2^{14}}$, initialized with Gaussian random variables. You want to compute the product $ABC$. Is there any difference in memory footprint and speed, depending on whether you compute $(AB)C$ or $A(BC)$? Why?

Answer: In either case, $ABC$ is in $\mathbb{R}^{2^{10}\times 2^{14}}$.
- $AB$ is in $\mathbb{R}^{2^{10}\times 2^{5}}$, where each of the $2^{15}$ entries is a dot product, i.e. the sum of $2^{16}$ products. This is a total of $2^{31}$ products. Then $(AB)C$ is an additional $2^{24}$ dot products, each of which sums $2^{5}$ products, for another $2^{29}$ products.
- $BC$ is in $\mathbb{R}^{2^{16}\times 2^{14}}$, where each of the $2^{30}$ entries is a dot product, i.e. the sum of $2^{5}$ products. This is a total of $2^{35}$ products. Then $A(BC)$ is an additional $2^{24}$ dot products, each of which sums $2^{16}$ products, for another $2^{40}$ products.

$(AB)C$ will be faster and use less storage: it needs about $2^{31}+2^{29}$ products and its intermediate result $AB$ has only $2^{15}$ entries, while $A(BC)$ needs about $2^{35}+2^{40}$ products and its intermediate result $BC$ has $2^{30}$ entries (4 GiB in `float32`). The timings recorded below do not test this, because a bare `%time` on its own line after the statement times an empty statement rather than the product; `%time` has to be on the same line as the statement (or `%%time` on the first line of the cell).
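
The operation counts can be tallied with a few lines of arithmetic. A sketch, assuming the textbook algorithm in which an $m\times k$ times $k\times n$ product costs $mkn$ scalar multiplications; `matmul_cost` is a helper name I made up:

```python
def matmul_cost(m, k, n):
    """Scalar multiplications for an (m x k) times (k x n) product."""
    return m * k * n

a_rows, a_cols = 2**10, 2**16   # A is 2**10 x 2**16
b_cols = 2**5                   # B is 2**16 x 2**5
c_cols = 2**14                  # C is 2**5 x 2**14

ab_then_c = matmul_cost(a_rows, a_cols, b_cols) + matmul_cost(a_rows, b_cols, c_cols)
a_then_bc = matmul_cost(a_cols, b_cols, c_cols) + matmul_cost(a_rows, a_cols, c_cols)

# Size of the intermediate product in bytes, at 4 bytes per float32 entry
ab_bytes = a_rows * b_cols * 4
bc_bytes = a_cols * c_cols * 4

print(ab_then_c, a_then_bc, a_then_bc / ab_then_c)
print(ab_bytes, bc_bytes)
```

Real libraries use blocked and vectorized kernels, so the ratio of run times need not equal the ratio of counts, but the ordering should hold.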

In [163]:
A = torch.randn(2**10,2**16)
B = torch.randn(2**16,2**5)
C = torch.randn(2**5,2**14)

In [164]:
torch.mm(torch.mm(A,B),C)
%time

CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 24.8 µs


In [166]:
torch.mm(A,torch.mm(B,C))
%time

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.87 µs
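
One way to time the two orderings without IPython magics is `time.perf_counter`. This is only a sketch: the sizes are scaled down from the exercise (my assumption, so that $BC$ fits comfortably in memory), and `timed` is a helper name I made up.

```python
import time
import torch

# Scaled-down stand-ins for the shapes in the exercise (assumed sizes).
A = torch.randn(2**6, 2**10)
B = torch.randn(2**10, 2**3)
C = torch.randn(2**3, 2**9)

def timed(fn, repeats=10):
    """Return the result of fn() and the best wall time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best

left, t_left = timed(lambda: torch.mm(torch.mm(A, B), C))    # (AB)C
right, t_right = timed(lambda: torch.mm(A, torch.mm(B, C)))  # A(BC)

print(f"(AB)C: {t_left:.2e} s, A(BC): {t_right:.2e} s")
```

Taking the best of several runs reduces the effect of one-off start-up costs; the numbers will differ from machine to machine, so none are quoted here.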


Consider three large matrices, say $A\in\mathbb{R}^{2^{10}\times 2^{16}}$, $B\in\mathbb{R}^{2^{16}\times 2^{5}}$, and 
$C\in\mathbb{R}^{2^{5}\times 2^{16}}$, initialized with Gaussian random variables. Is there any difference in speed depending on whether you compute $AB$ or $AC^\top$? Why? What changes if you initialize $C=B^\top$ without cloning memory? Why?

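Answer: For the first question, $B$ and $C^\top$ have the same shape but are otherwise unrelated, so $AB$ and $AC^\top$ require the same number of multiplications ($2^{31}$). Any difference in speed has to come from memory layout: `C.transpose(0,1)` is a view of `C` with swapped strides rather than a row-major copy, so the multiplication reads that operand in a different order. The timings recorded below cannot settle which is faster, since the bare `%time` lines time an empty statement rather than the product.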

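For the second question, setting `C = B.transpose(0,1)` without cloning makes `C` a view that shares memory with `B`, so `C.transpose(0,1)` has the same data and strides as `B`, and $AC^\top$ is the same computation as $AB$. The two should therefore take essentially the same time.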

In [157]:
A = torch.randn(2**10,2**16)
B = torch.randn(2**16,2**5)
C = torch.randn(2**5,2**16)

In [169]:
torch.mm(A,B)
%time

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 8.34 µs


In [159]:
torch.mm(A,C.transpose(0,1))
%time

CPU times: user 16 µs, sys: 1e+03 ns, total: 17 µs
Wall time: 5.96 µs


In [167]:
C = B.transpose(0,1)

In [170]:
torch.mm(A,C.transpose(0,1))
%time

CPU times: user 19 µs, sys: 1e+03 ns, total: 20 µs
Wall time: 8.11 µs
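
The claim about shared memory can be checked directly with `stride()`, `data_ptr()`, and `is_contiguous()`. A sketch with a scaled-down `B` (the size is my assumption, chosen so the cell runs instantly):

```python
import torch

B = torch.randn(2**8, 2**3)          # scaled-down stand-in for B (assumed size)
C = B.transpose(0, 1)                # a view: no data is copied

shares_memory = C.data_ptr() == B.data_ptr()
print(B.stride(), C.stride())        # (8, 1) and (1, 8)
print(B.is_contiguous(), C.is_contiguous())

# Transposing back recovers B's layout, so A @ C^T is the same call as A @ B.
back = C.transpose(0, 1)
same_layout = back.stride() == B.stride() and back.data_ptr() == B.data_ptr()

# An independent row-major copy, if that is what is wanted.
C_copy = C.contiguous()
print(shares_memory, same_layout, C_copy.data_ptr() == B.data_ptr())
```

With `C_copy` in place of `C`, the transposed operand would no longer alias `B`, which is the cloned case the exercise contrasts against.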
