# Embedding layer vs Linear Layer in PyTorch

> See this question at  [What is the difference between an Embedding Layer with a bias immediately afterwards and a Linear Layer in PyTorch](https://stackoverflow.com/questions/65445174/what-is-the-difference-between-an-embedding-layer-with-a-bias-immediately-afterw)

## Embedding Layer Syntax

Syntax: 
```python
torch.nn.Embedding(num_embeddings, embedding_dim, 
                   padding_idx=None, max_norm=None,...)
```
                
A simple lookup table that stores embeddings of a fixed dictionary and size.

This module is often used to store word embeddings and retrieve them using indices. The input to the module is  a list of indices, and the output is the corresponding word embeddings.

## Answer 1:

Both implement a linear transformation of the input data; however, `nn.Embedding` lets you pass index-based inputs directly, while `nn.Linear` requires you to one-hot encode them first. For example, say you have a vocabulary of size 16, with characters encoded by numbers from 0 to 15.

In [1]:
import torch
from torch import nn

model = nn.Embedding(num_embeddings=16, embedding_dim=256)
x = torch.tensor([0, 4, 7]) # 3 characters
emb = model(x)
emb.shape #>> (3, 256)

torch.Size([3, 256])

So essentially, each of your input characters was mapped to a vector of 256 floats. Because of that, people refer to `nn.Embedding` as a look-up table. This might be confusing if you think of a look-up table as something frozen, but it is not: the parameters of `nn.Embedding` are trainable.
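
As a minimal sketch of that trainability (the sizes, learning rate, and variable names here are illustrative assumptions, not part of the original answer), one SGD step on a loss that only involves index 1 should change that row and leave the others alone:

```python
import torch
from torch import nn

torch.manual_seed(0)
table = nn.Embedding(num_embeddings=4, embedding_dim=2)
before = table.weight.detach().clone()

optimizer = torch.optim.SGD(table.parameters(), lr=0.1)
loss = table(torch.tensor([1])).sum()  # only the row at index 1 takes part in the loss
loss.backward()
optimizer.step()

after = table.weight.detach()
# A row counts as "changed" if any of its entries moved during the step.
row_changed = ~torch.isclose(before, after).all(dim=1)
print(table.weight.requires_grad)
print(row_changed)
```

Only the looked-up row receives a non-zero gradient, so it is the only one the optimizer moves.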

To achieve a similar result with `nn.Linear`, you first need to encode your input `x` using one-hot encoding:

In [2]:
model = nn.Linear(in_features=16, out_features=256, bias=False)
x = torch.tensor([0, 4, 7]) # 3 characters
# emb = model(x)
#>> RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x3 and 16x256)

x_enc = torch.eye(16)[[0, 4, 7], :]
emb = model(x_enc)
emb.shape
# >> (3, 256)

torch.Size([3, 256])
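
The `torch.eye(16)[...]` indexing trick above is one way to build the one-hot rows. As a hedged alternative sketch (variable names are illustrative), `torch.nn.functional.one_hot` produces the same matrix, though it returns integers and so needs a cast to float before being fed to `nn.Linear`:

```python
import torch
import torch.nn.functional as F

idx = torch.tensor([0, 4, 7])

# Row selection from the identity matrix, as in the cell above.
one_hot_eye = torch.eye(16)[idx]

# F.one_hot returns an int64 tensor, so cast it to float for nn.Linear.
one_hot_fn = F.one_hot(idx, num_classes=16).float()

print(one_hot_fn.shape)
print(torch.equal(one_hot_eye, one_hot_fn))
```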

From a usage standpoint, you typically use `nn.Embedding(num_embeddings=N, embedding_dim=M)` if you need to encode a categorical entity with N possible unique values as a vector of size M. E.g., you have a vocabulary of 26 + 10 + 1 = 37 characters (lower-case English alphabet, digits and a space), and each should be encoded with a 256-long vector. You would use `nn.Linear` everywhere else.

The two modules also implement different extras for their intended purposes, e.g.,
- `nn.Embedding` does not have a `bias` parameter, since it does not make sense to add a bias to an item value in a look-up table;
- `nn.Embedding` has a `padding_idx` argument, which lets you freeze the representation of some character (its vector is not updated during training); nothing similar is implemented for `nn.Linear`.
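
A small sketch of the `padding_idx` behaviour (sizes and names are illustrative assumptions): the row at `padding_idx` starts as zeros and receives a zero gradient, while the other rows are trained as usual.

```python
import torch
from torch import nn

padded = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)
print(padded.weight[0])  # the padding row is initialized to zeros

# Look up the padding index together with two ordinary indices.
padded(torch.tensor([0, 1, 2])).sum().backward()

print(padded.weight.grad[0])  # zero gradient: the padding row stays frozen
print(padded.weight.grad[1])  # ordinary rows get a gradient
```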

Here is how you could get exactly the same results with both:

In [3]:
vocab_size = 16
emb_dim = 4

linear = nn.Linear(vocab_size, emb_dim, bias=False)
embedding = nn.Embedding(vocab_size, emb_dim)

w = torch.rand(vocab_size, emb_dim)  # [N, D]
linear.weight = nn.parameter.Parameter(w.T)  # [D, N], i.e. (out_features, in_features)
embedding.weight = nn.parameter.Parameter(w)  # [N, D]

x = torch.tensor([1,2,3])
y1 = embedding(x)
print("via Embedding layer: ", y1)

x_enc = torch.eye(16)[[1, 2, 3], :]
y2 = linear(x_enc)
print("via linear layer: ", y2)

print(torch.allclose(y1, y2))

via Embedding layer:  tensor([[0.5940, 0.3003, 0.7661, 0.5040],
        [0.0210, 0.5245, 0.5118, 0.3714],
        [0.4909, 0.2059, 0.3402, 0.2457]], grad_fn=<EmbeddingBackward0>)
via linear layer:  tensor([[0.5940, 0.3003, 0.7661, 0.5040],
        [0.0210, 0.5245, 0.5118, 0.3714],
        [0.4909, 0.2059, 0.3402, 0.2457]], grad_fn=<MmBackward0>)
True


Recall that `nn.Embedding` stores its parameters as a `(num_embeddings, embedding_dim)` matrix, whereas `nn.Linear` stores them as `(out_features, in_features)`, which is why `w` is transposed for the linear layer above. There is no reason for that other than backward compatibility; see [PyTorch - shape of nn.Linear weights](https://stackoverflow.com/questions/53465608/pytorch-shape-of-nn-linear-weights) for an explanation.
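
The two layouts are easy to check directly. A minimal sketch, with layer sizes chosen only for illustration:

```python
import torch
from torch import nn

emb_layer = nn.Embedding(num_embeddings=16, embedding_dim=4)
lin_layer = nn.Linear(in_features=16, out_features=4, bias=False)

print(tuple(emb_layer.weight.shape))  # (num_embeddings, embedding_dim)
print(tuple(lin_layer.weight.shape))  # (out_features, in_features)
```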

## Answer 2
### TL;DR
-   `nn.Embedding`  is for categorical input.
-   `nn.Linear`  is for ordinal input.

### Explanation

You use `nn.Embedding` when dealing with categorical data, e.g., class labels (0, 1, 2, ...), because in a lookup table the value is not proportional to the key. This behavior suits categorical data, where the numeric value of a label has nothing to do with its semantics.

On the other hand, `nn.Linear`, being a matrix multiplication, does not provide the aforementioned behavior. The input and output are proportional due to the nature of multiplication. Therefore, you use `nn.Linear` for ordinal data.
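
To make the proportionality argument concrete, here is a hedged sketch (the layer sizes, seed, and names are assumptions for illustration). Doubling the input of a bias-free `nn.Linear` doubles its output, whereas the embedding of index 4 has no such relation to the embedding of index 2:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Bias-free linear layer: scaling the input scales the output.
lin = nn.Linear(in_features=1, out_features=3, bias=False)
value = torch.tensor([[2.0]])
linear_doubles = torch.allclose(lin(2 * value), 2 * lin(value))

# Lookup table: index 4 is just another key, not "twice index 2".
table = nn.Embedding(num_embeddings=5, embedding_dim=3)
embedding_doubles = torch.allclose(
    table(torch.tensor([4])), 2 * table(torch.tensor([2]))
)

print(linear_doubles)
print(embedding_doubles)
```

With randomly initialized weights the second check is expected to print `False`, since the rows of the table are independent of each other.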

## Answer 3

[`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) is a lookup table; it works like indexing a `torch.Tensor`, with a few twists (such as the possibility of sparse gradients or a default value at a specified index).
For example:

In [4]:
import torch

embedding = torch.nn.Embedding(3, 4)

print("embedding.weight: ", embedding.weight)

print("To select the row at index 1 of embedding:", embedding(torch.tensor([1])))

embedding.weight:  Parameter containing:
tensor([[ 0.2495, -0.6835, -0.3323,  0.4207],
        [-0.1019,  1.5309,  2.9065, -1.0414],
        [ 0.4161, -0.6921, -1.0098,  2.1612]], requires_grad=True)
To select the row at index 1 of embedding: tensor([[-0.1019,  1.5309,  2.9065, -1.0414]], grad_fn=<EmbeddingBackward0>)


So we took the row at index 1 (the second row) of the embedding. The layer does nothing more than that.
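
The same point can be checked programmatically. In this minimal sketch (names are illustrative), calling the layer gives exactly what plain tensor indexing of its weight gives:

```python
import torch
from torch import nn

lookup = nn.Embedding(3, 4)
ids = torch.tensor([1, 2, 1])  # indices may repeat

via_layer = lookup(ids)
via_indexing = lookup.weight[ids]

print(via_layer.shape)
print(torch.equal(via_layer, via_indexing))
```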

### Where is Embedding Layer used?

It is usually used when we want to encode some meaning for each row (as `word2vec` does, e.g., words that are close semantically end up close in Euclidean space) and possibly train those rows.

### Linear Layer

[`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) (without bias) also holds a `torch.Tensor` (its weight), **but** it performs an operation on it (and the input), which is essentially:

```python
output = input.matmul(weight.t())
```
This happens every time you call the layer (see the [source code](https://pytorch.org/docs/stable/_modules/torch/nn/functional.html#linear) and the [functional definition of this layer](https://pytorch.org/docs/stable/nn.functional.html)).
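
A quick sketch to confirm this for a bias-free layer (sizes and names are illustrative assumptions):

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=5, out_features=2, bias=False)
inputs = torch.rand(3, 5)

# Manual version of what the layer computes when it has no bias.
manual = inputs.matmul(layer.weight.t())

print(manual.shape)
print(torch.allclose(layer(inputs), manual, atol=1e-6))
```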

### Embedding Layer's Code snippet

The Embedding layer in your code snippet does this:

In [7]:
class DotProduct(torch.nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()  # required before registering submodules
        # creates two lookup tables
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        self.movie_factors = torch.nn.Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        return (users * movies).sum(dim=1)

What it does:
-   It creates two lookup tables in `__init__`.
-   The layer is called with an input of shape `(batch_size, 2)`:
    -  The first column contains indices of user embeddings.
    -  The second column contains indices of movie embeddings.
-   These embeddings are multiplied element-wise and summed, returning `(batch_size,)`. So it's different from `nn.Linear`, which would return `(batch_size, out_features)`.

This is probably used to train both representations (of users and movies) for some recommender-like system.
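
For a concrete feel of the shapes, here is a self-contained sketch. The class name `DotProductSketch`, the table sizes, and the `(user, movie)` index pairs are hypothetical and chosen only for illustration:

```python
import torch
from torch import nn


class DotProductSketch(nn.Module):
    """Same idea as the snippet above, repeated here to be self-contained."""

    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:, 0])    # (batch_size, n_factors)
        movies = self.movie_factors(x[:, 1])  # (batch_size, n_factors)
        return (users * movies).sum(dim=1)    # (batch_size,)


recommender = DotProductSketch(n_users=10, n_movies=20, n_factors=8)

# Hypothetical (user index, movie index) pairs.
pairs = torch.tensor([[0, 3], [1, 5], [9, 19], [4, 0]])
scores = recommender(pairs)
print(scores.shape)
```

Each score is the dot product of one user vector with one movie vector, so the output has one number per row of `pairs`.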

### Other stuff

> I know it does some faster computational version of a dot product where one of the matrices is a one-hot encoded matrix and the other is the embedding matrix.

No, it doesn't. `torch.nn.Embedding` **can be expressed with one-hot encoding** and might also be sparse, but whether there is a performance boost depends on the algorithms (and whether they support sparsity).
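
As a hedged illustration of the sparsity remark (sizes and names are assumptions), passing `sparse=True` makes the gradient of the weight a sparse tensor, which only some optimizers can consume:

```python
import torch
from torch import nn

dense_table = nn.Embedding(num_embeddings=10, embedding_dim=3)
sparse_table = nn.Embedding(num_embeddings=10, embedding_dim=3, sparse=True)

ids = torch.tensor([2, 7])
dense_table(ids).sum().backward()
sparse_table(ids).sum().backward()

print(dense_table.weight.grad.is_sparse)
print(sparse_table.weight.grad.is_sparse)
```

Whether this helps in practice depends on the rest of the training setup, in particular on the optimizer accepting sparse gradients.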