## torch

### `broadcast`
- The docs on this feature are terse
- a dimension can only be expanded when its size is 1; two mismatched sizes that are both larger than 1 cannot be broadcast

In [4]:
import torch

a = torch.rand((6,5))
b = torch.ones((6,1))
c = torch.rand((1,1,3))

a,b,c,a+b,b+c

(tensor([[0.1536, 0.5777, 0.1836, 0.1555, 0.0351],
         [0.0210, 0.0448, 0.2552, 0.3459, 0.1751],
         [0.0208, 0.0027, 0.4269, 0.2744, 0.3023],
         [0.2878, 0.5131, 0.2553, 0.6664, 0.1479],
         [0.8311, 0.2688, 0.7493, 0.1318, 0.0262],
         [0.6754, 0.0981, 0.1209, 0.8408, 0.2742]]),
 tensor([[1.],
         [1.],
         [1.],
         [1.],
         [1.],
         [1.]]),
 tensor([[[0.7299, 0.6202, 0.6148]]]),
 tensor([[1.1536, 1.5777, 1.1836, 1.1555, 1.0351],
         [1.0210, 1.0448, 1.2552, 1.3459, 1.1751],
         [1.0208, 1.0027, 1.4269, 1.2744, 1.3023],
         [1.2878, 1.5131, 1.2553, 1.6664, 1.1479],
         [1.8311, 1.2688, 1.7493, 1.1318, 1.0262],
         [1.6754, 1.0981, 1.1209, 1.8408, 1.2742]]),
 tensor([[[1.7299, 1.6202, 1.6148],
          [1.7299, 1.6202, 1.6148],
          [1.7299, 1.6202, 1.6148],
          [1.7299, 1.6202, 1.6148],
          [1.7299, 1.6202, 1.6148],
          [1.7299, 1.6202, 1.6148]]]))
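
The failure case from the second bullet is not shown in the cell above, so here is a minimal sketch of it. It reuses the shapes of `a`, `b` and `c`; the flag name `broadcast_ok` is only an illustration.

```python
import torch

a = torch.rand((6, 5))
b = torch.ones((6, 1))
c = torch.rand((1, 1, 3))

# Shapes are aligned from the right: (6, 5) vs (1, 1, 3).
# The trailing sizes 5 and 3 differ and neither is 1, so broadcasting fails.
try:
    a + c
    broadcast_ok = True
except RuntimeError as err:
    broadcast_ok = False
    print("broadcast failed:", err)

# Size-1 dimensions are expanded: (6, 1) + (1, 1, 3) -> (1, 6, 3).
result_shape = (b + c).shape
print(result_shape)
```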

### `repeat_interleave`
repeats each sample along the given axis

- equivalent to *cat and view*
- a way to *broadcast* a tensor along a dimension whose size is not 1

In [3]:
import torch

a =torch.tensor([[[1,2]],[[2,3]]])
print(a,a.shape)

a.repeat_interleave(repeats=2,dim=0),torch.cat([a.unsqueeze(dim=1)]*2,dim=1).view(-1,1,2)

tensor([[[1, 2]],

        [[2, 3]]]) torch.Size([2, 1, 2])


(tensor([[[1, 2]],
 
         [[1, 2]],
 
         [[2, 3]],
 
         [[2, 3]]]),
 tensor([[[1, 2]],
 
         [[1, 2]],
 
         [[2, 3]],
 
         [[2, 3]]]))
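
For contrast, a small sketch of how `repeat_interleave` differs from `Tensor.repeat`: the former duplicates each sample in place, the latter tiles the whole tensor. The variable names `interleaved` and `tiled` are just for illustration.

```python
import torch

a = torch.tensor([[[1, 2]], [[2, 3]]])  # shape (2, 1, 2)

# repeat_interleave keeps copies of the same sample adjacent: x0, x0, x1, x1
interleaved = a.repeat_interleave(repeats=2, dim=0)
# repeat tiles the whole tensor along dim 0: x0, x1, x0, x1
tiled = a.repeat(2, 1, 1)

print(interleaved[:, 0, :].tolist())
print(tiled[:, 0, :].tolist())
```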

### `permute`

Suppose a tensor has shape $[dim0, dim1, dim2, dim3]$ and we permute it to $[dim3, dim1, dim2, dim0]$.

- **consequence**: the value originally at $[a, b, c, d]$ moves to $[d, b, c, a]$

In [None]:
import torch

a = torch.zeros([5,4,3,2])
# say the index is a=2, b=3, c=1, d=1
a[2,3,1,1] = 1
print("origin 1 at [2,3,1,1]: {}".format(a[2,3,1,1]))
b = a.permute(3,1,2,0)
# after the permutation the value sits at [d, b, c, a] = [1,3,1,2]
print("permuted 1 at [1,3,1,2]: {}".format(b[1,3,1,2]))

### `nonzero`
finding non-zero indices

In [1]:
import torch

a = torch.zeros(2,2)
a[1,0] = 1
a[1,1] = 1
a.nonzero()

tensor([[1, 0],
        [1, 1]])

### `matmul`
tensor multiplication

- The docs on this function are terse, so here is an example:

In [5]:
import torch

a = torch.tensor([[[1,2,3],[1,0,0]],[[2,3,4],[0,1,0]]])
b = torch.tensor([[[1,0],[0,1],[1,1]],[[1,1],[0,0],[0,1]]])

a,a.shape, b, b.shape, b[0], b[0].shape,torch.matmul(a,b[0]), torch.matmul(a,b), torch.matmul(a[0],b), a[0].shape

(tensor([[[1, 2, 3],
          [1, 0, 0]],
 
         [[2, 3, 4],
          [0, 1, 0]]]),
 torch.Size([2, 2, 3]),
 tensor([[[1, 0],
          [0, 1],
          [1, 1]],
 
         [[1, 1],
          [0, 0],
          [0, 1]]]),
 torch.Size([2, 3, 2]),
 tensor([[1, 0],
         [0, 1],
         [1, 1]]),
 torch.Size([3, 2]),
 tensor([[[4, 5],
          [1, 0]],
 
         [[6, 7],
          [0, 1]]]),
 tensor([[[4, 5],
          [1, 0]],
 
         [[2, 6],
          [0, 0]]]),
 tensor([[[4, 5],
          [1, 0]],
 
         [[1, 4],
          [1, 1]]]),
 torch.Size([2, 3]))
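
To make the outputs above easier to read, here is a sketch that rebuilds two of them with explicit per-batch products. It assumes the same `a` and `b`; the names `manual` and `manual_shared` are only illustrative.

```python
import torch

a = torch.tensor([[[1, 2, 3], [1, 0, 0]], [[2, 3, 4], [0, 1, 0]]])      # (2, 2, 3)
b = torch.tensor([[[1, 0], [0, 1], [1, 1]], [[1, 1], [0, 0], [0, 1]]])  # (2, 3, 2)

# Batched case: the i-th matrix of a is multiplied by the i-th matrix of b.
batched = torch.matmul(a, b)
manual = torch.stack([a[i] @ b[i] for i in range(a.shape[0])])

# Broadcast case: the 2-D operand b[0] is reused for every batch entry of a.
shared = torch.matmul(a, b[0])
manual_shared = torch.stack([a[i] @ b[0] for i in range(a.shape[0])])

print(torch.equal(batched, manual), torch.equal(shared, manual_shared))
```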

### `argmax`
find index of max value

In [5]:
import torch

a = torch.rand((2,2,3))
# when dim=-1, find the index of the max value in each row
# when dim=1, find the index of the max value in each column
# when dim=0, find the index of the max value at the same position across batches
a, a.argmax(dim=-1), a.argmax(dim=1), a.argmax(dim=0)

(tensor([[[0.9193, 0.9062, 0.3764],
          [0.2910, 0.3237, 0.0300]],
 
         [[0.9931, 0.2293, 0.9009],
          [0.5535, 0.4756, 0.2353]]]),
 tensor([[0, 1],
         [0, 0]]),
 tensor([[0, 0, 0],
         [0, 1, 0]]),
 tensor([[1, 0, 1],
         [1, 1, 1]]))

### `sort`

- sorts the values of a tensor along `dim` and also returns their original indices

In [2]:
import torch

a = torch.rand((2,2,3))
b,c = a.sort(dim=-1, descending=True)
a, b, c

(tensor([[[0.8392, 0.7046, 0.8891],
          [0.7748, 0.7321, 0.1097]],
 
         [[0.9433, 0.9492, 0.0342],
          [0.4776, 0.9315, 0.7581]]]),
 tensor([[[0.8891, 0.8392, 0.7046],
          [0.7748, 0.7321, 0.1097]],
 
         [[0.9492, 0.9433, 0.0342],
          [0.9315, 0.7581, 0.4776]]]),
 tensor([[[2, 0, 1],
          [0, 1, 2]],
 
         [[1, 0, 2],
          [1, 2, 0]]]))

## torch.nn

### `nn.ConvX`

- calculate *signal_length* $L_{out}$ after convolution:

    - consider a sequence of length $L_{in}$ with $C_{in}$ channels. After padding, the sequence has length $L_{in} + 2p$, and a dilated kernel spans $d \cdot (k-1) + 1$ positions, so the window can be shifted by at most $L_{in} + 2p - d \cdot (k-1) - 1$ positions from its starting point. Moving with stride $s$, the number of convolution calculations $L_{out}$ is $$L_{out} = \left\lfloor\frac{L_{in} + 2p - d \cdot (k-1) - 1}{s} + 1\right\rfloor$$ where $d$ denotes the *dilation rate*, $k$ the *kernel_size*, $p$ the *padding (on both sides)* and $s$ the *stride*
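
A quick way to sanity-check the formula is to compare it with the output of an actual `nn.Conv1d`. The helper `conv_out_length` and the concrete sizes below are assumptions made up for this sketch, not part of PyTorch.

```python
import torch
from torch import nn

def conv_out_length(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """L_out from the formula above, using integer floor division."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Hypothetical settings, chosen only for illustration.
l_in, k, s, p, d = 20, 3, 2, 1, 2
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=k,
                 stride=s, padding=p, dilation=d)

x = torch.rand(1, 4, l_in)  # (batch, C_in, L_in)
predicted = conv_out_length(l_in, k, s, p, d)
actual = conv(x).shape[-1]
print(predicted, actual)
```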


### `nn.LayerNorm`

- *mean and variance* are calculated over each sample rather than over the whole batch

### `learning-to-rank`

- say we have a vector describing the relatedness between a *query* and a *doc*; we feed it into an *MLP (Multi-Layer Perceptron)* that projects it to a single value, which stands for the *score*. This procedure is called **Learning to Rank** because the weights of the *MLP* are learnt automatically.
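
As a rough sketch of the scoring step only (no training loop or ranking loss), the MLP could look like the following. The layer sizes, the number of docs and the name `scorer` are all hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes: 4 candidate docs, each with an 8-dim relatedness vector.
num_docs, feature_dim = 4, 8
relatedness = torch.rand(num_docs, feature_dim)

# A small MLP that projects each relatedness vector to a single score.
scorer = nn.Sequential(
    nn.Linear(feature_dim, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

scores = scorer(relatedness).squeeze(-1)   # shape (num_docs,)
ranking = scores.argsort(descending=True)  # doc indices, best score first
print(scores.shape, ranking)
```

In practice the weights of `scorer` would be trained with some ranking loss; which loss to use is outside the scope of this note.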

### `index embedding`

sometimes we have to create an embedding layer (*lookup layer*).

The shape of the result is straightforward to derive: for an $n$-dimensional embedding layer, its last $n-1$ dimensions are appended to the shape of the index tensor.
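
The shape rule can be checked directly, and plain indexing can be compared with `nn.Embedding`. This is a sketch under the assumption that the lookup table is a plain tensor; the names `table` and `table_3d` are illustrative.

```python
import torch
from torch import nn

table = torch.rand((5, 3))
index_tensor = torch.tensor([[3, 4], [0, 1]])

# Plain indexing and nn.Embedding perform the same lookup on the same weights.
indexed = table[index_tensor]
looked_up = nn.Embedding.from_pretrained(table)(index_tensor)

# Result shape = index shape + trailing dimensions of the table.
table_3d = torch.rand((5, 5, 3))
print(indexed.shape, table_3d[index_tensor].shape)
```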

In [4]:
import torch

embedding_layer = torch.rand((5,3))
index_tensor = torch.tensor([[3,4],[0,1]]) # only tensor of dtype=torch.long works
print("1 dimensional embedding:{} of size {}\n".format(embedding_layer[index_tensor], embedding_layer[index_tensor].shape))

embedding_layer = torch.rand((5,5,3))
print("2 dimensional embedding:{} of size {}".format(embedding_layer[index_tensor], embedding_layer[index_tensor].shape))

1 dimensional embedding:tensor([[[0.4890, 0.8069, 0.8093],
         [0.6145, 0.3115, 0.6036]],

        [[0.3977, 0.5009, 0.9692],
         [0.8422, 0.7311, 0.7406]]]) of size torch.Size([2, 2, 3])

2 dimensional embedding:tensor([[[[4.3572e-01, 3.7110e-01, 5.5503e-01],
          [3.4236e-01, 3.0673e-01, 7.1331e-01],
          [2.3374e-01, 3.6648e-01, 9.3444e-01],
          [4.8977e-01, 9.6126e-01, 5.6057e-02],
          [5.8815e-01, 6.7106e-01, 9.9311e-01]],

         [[5.7527e-01, 2.8838e-01, 8.5722e-01],
          [5.2984e-01, 8.5100e-02, 5.7482e-01],
          [7.2258e-01, 2.0804e-01, 7.0520e-02],
          [9.9510e-01, 8.3778e-01, 6.3597e-01],
          [5.6489e-01, 3.9864e-01, 6.1766e-01]]],


        [[[5.9626e-01, 9.5241e-02, 8.3267e-01],
          [1.1131e-01, 3.0026e-01, 6.5316e-01],
          [6.7256e-01, 4.0614e-01, 6.9878e-01],
          [7.8866e-01, 7.1249e-01, 7.5015e-01],
          [2.0895e-01, 5.7120e-01, 8.2893e-01]],

         [[7.8303e-04, 1.8824e-01, 4.3864e-01],
 

### `nn.CosineSimilarity`
`PyTorch` provides a convenient API for computing the cosine similarity between two tensors; however, it gets confusing when the tensors have more than one dimension.

- From my perspective, the `dim` parameter can be viewed as the dimension to *compress*: computing cosine similarity along `dim` reduces the vector on this dimension to a single value.
- To calculate it, we slice the tensors along the given `dim` and compute the cosine similarity pair-wise
- when the tensor has several dimensions:
    - `dim` = $n$, $n \geq 2$: values at the same position across the given dimension are packed into a vector
    - `dim` = -1: values along the last dimension are collected into a vector
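
The "compress" view can be written out by hand: multiply element-wise, sum over `dim`, and divide by the two norms taken over the same `dim`. This is a sketch that ignores the `eps` clamp used by `CosineSimilarity`, which only matters for near-zero vectors.

```python
import torch
from torch.nn import CosineSimilarity

torch.manual_seed(0)
a = torch.rand((3, 2, 3))
b = torch.rand((3, 2, 3))

dim = 1
# Dot product and norms are all reduced over the same dimension.
manual = (a * b).sum(dim=dim) / (a.norm(dim=dim) * b.norm(dim=dim))
builtin = CosineSimilarity(dim=dim)(a, b)

# The reduced dimension disappears: (3, 2, 3) -> (3, 3)
print(manual.shape, torch.allclose(manual, builtin, atol=1e-6))
```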

In [1]:
import torch
from torch.nn import CosineSimilarity

" example for cosine similarity along the last dimension "
cos = CosineSimilarity(dim=2)

a = torch.rand((3,2,3))
b = torch.rand((3,2,3))

c = a[0]
d = b[0]

e = a[:,0,:].unsqueeze(dim=1)
f = b[:,0,:].unsqueeze(dim=2)
g = a[:,1,:].unsqueeze(dim=1)
h = b[:,1,:].unsqueeze(dim=2)

result_1 = torch.matmul(e,f) / torch.sqrt(torch.matmul(e,e.permute(0,2,1)) * torch.matmul(f.permute(0,2,1),f))
result_2 = torch.matmul(g,h) / torch.sqrt(torch.matmul(g,g.permute(0,2,1)) * torch.matmul(h.permute(0,2,1),h))

cos_2 = CosineSimilarity(dim=1)
cos(a,b), cos_2(c,d), result_1.squeeze(), result_2.squeeze()

(tensor([[0.7652, 0.9477],
         [0.9658, 0.5803],
         [0.6498, 0.9634]]),
 tensor([0.7652, 0.9477]),
 tensor([0.7652, 0.9658, 0.6498]),
 tensor([0.9477, 0.5803, 0.9634]))

In [3]:
cos = CosineSimilarity(dim=-1)
a = torch.zeros((1,2))
b = torch.zeros((2,2))
c = torch.rand(2,2)
a,b,c,cos(a,b),cos(b,c)

(tensor([[0., 0.]]),
 tensor([[0., 0.],
         [0., 0.]]),
 tensor([[0.8044, 0.7442],
         [0.1688, 0.2175]]),
 tensor([0., 0.]),
 tensor([0., 0.]))

In [5]:
" broadcast in CosineSimilarity "
import sys
sys.path.append('..')

import torch
from torch.nn import CosineSimilarity
from utils.TestTensors import t_2_2_3

cos = CosineSimilarity(dim=-1)
a = torch.tensor([[[1,2,3.]]])
a, t_2_2_3, cos(a,t_2_2_3)

(tensor([[[1., 2., 3.]]]),
 tensor([[[ 1.,  2.,  3.],
          [ 2.,  3.,  4.]],
 
         [[ 5.,  6.,  9.],
          [-9., -2., -7.]]]),
 tensor([[ 1.0000,  0.9926],
         [ 0.9868, -0.7850]]))

### `nn.LayerNorm`

Layer Normalization is applied over the last dimensions of the input tensor, as given by `normalized_shape`, i.e. `mean` and `variance` are calculated within each input sample rather than over the whole batch

In [4]:
LayerNorm = torch.nn.LayerNorm((2,3))
a = torch.tensor([[[-0.5,0.5,0],[0,1,-1]],[[1,2,3],[5,9,11]]])
a,LayerNorm(a)

(tensor([[[-0.5000,  0.5000,  0.0000],
          [ 0.0000,  1.0000, -1.0000]],
 
         [[ 1.0000,  2.0000,  3.0000],
          [ 5.0000,  9.0000, 11.0000]]]),
 tensor([[[-0.7746,  0.7746,  0.0000],
          [ 0.0000,  1.5492, -1.5492]],
 
         [[-1.1352, -0.8627, -0.5903],
          [-0.0454,  1.0444,  1.5893]]], grad_fn=<NativeLayerNormBackward>))
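
The numbers above can be reproduced by hand. The sketch below assumes a freshly constructed layer, so the learnable weight is 1 and the bias is 0, and it uses the biased variance.

```python
import torch

a = torch.tensor([[[-0.5, 0.5, 0.0], [0.0, 1.0, -1.0]],
                  [[1.0, 2.0, 3.0], [5.0, 9.0, 11.0]]])
layer_norm = torch.nn.LayerNorm((2, 3))

# Statistics are taken over the last two dimensions of each sample separately.
mean = a.mean(dim=(-2, -1), keepdim=True)
var = a.var(dim=(-2, -1), unbiased=False, keepdim=True)
manual = (a - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(manual, layer_norm(a), atol=1e-5))
```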

## torch.nn.functional

### `F.normalize()`

In [4]:
import torch
import torch.nn.functional as F

a = torch.ones((2,2,3,4))
b = F.normalize(a, dim=-1)
a,b

(tensor([[[[1., 1., 1., 1.],
           [1., 1., 1., 1.],
           [1., 1., 1., 1.]],
 
          [[1., 1., 1., 1.],
           [1., 1., 1., 1.],
           [1., 1., 1., 1.]]],
 
 
         [[[1., 1., 1., 1.],
           [1., 1., 1., 1.],
           [1., 1., 1., 1.]],
 
          [[1., 1., 1., 1.],
           [1., 1., 1., 1.],
           [1., 1., 1., 1.]]]]),
 tensor([[[[0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000]],
 
          [[0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000]]],
 
 
         [[[0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000]],
 
          [[0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000],
           [0.5000, 0.5000, 0.5000, 0.5000]]]]))

### `F.one_hot`
see the documentation

In [7]:
import torch
import torch.nn.functional as F
a = torch.rand((2,2,3))
b = a.argmax(dim=-1)
c = F.one_hot(b,num_classes=a.shape[-1])
a,b,c

(tensor([[[0.7259, 0.4600, 0.1225],
          [0.1762, 0.3771, 0.0185]],
 
         [[0.3044, 0.0490, 0.4388],
          [0.8985, 0.5789, 0.2526]]]),
 tensor([[0, 1],
         [2, 0]]),
 tensor([[[1, 0, 0],
          [0, 1, 0]],
 
         [[0, 0, 1],
          [1, 0, 0]]]))

## torch.autograd

You only need to write a custom `torch.autograd.Function` when an operation in the forward phase is **not differentiable** but you still want the gradient to pass through it. There you can define your own backward algorithm that gives an approximate gradient for the **non-differentiable** operation.
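
As a minimal sketch of this idea, here is a custom `torch.autograd.Function` that rounds in the forward pass and passes the gradient through unchanged in the backward pass. The class name `RoundSTE` is hypothetical, and the identity gradient is just one possible approximation.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; treat the op as identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Approximate gradient: pretend d(round(x))/dx == 1.
        return grad_output

x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)
y = RoundSTE.apply(x)
y.sum().backward()
print(y, x.grad)
```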

### `detach` 

*Straight-through trick*

- transform a tensor while keeping its gradient unchanged

In [2]:
import torch

a = torch.tensor([[0.0002,40.]],requires_grad=True)
b = torch.tensor([[0.,40.]]) - a.detach() + a
a,b 

(tensor([[2.0000e-04, 4.0000e+01]], requires_grad=True),
 tensor([[ 0., 40.]], grad_fn=<AddBackward0>))
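
To confirm that the gradient really flows back to `a` as if no transformation had happened, one can backpropagate through `b`. The weights `[[2., 3.]]` and the name `target` are arbitrary choices for this sketch.

```python
import torch

a = torch.tensor([[0.0002, 40.0]], requires_grad=True)
target = torch.tensor([[0.0, 40.0]])

# Forward value equals target; the only path with gradient is the trailing "+ a".
b = target - a.detach() + a

# Weighted sum so that each element gets a distinct gradient.
(b * torch.tensor([[2.0, 3.0]])).sum().backward()
print(b, a.grad)
```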