# Embeddings
Embeddings are vectorized, numerical representations of tokens: each token is mapped to a vector of numbers.

Tokenization can be done at different levels:
- Sentence Level
- Word Level
- Character Level etc.

We create embeddings from tokens. **Word embeddings** means the data was tokenized at the word level and the embeddings were created from those word-level tokens; the same applies to character-level and sentence-level tokens.
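
To make the three levels concrete, here is a minimal sketch that splits the same text at the sentence, word, and character level using only Python's `re` module. The sample text and the regular expressions are illustrative assumptions; real tokenizers, such as the subword tokenizers used by the models in this notebook, follow more elaborate rules.

```python
import re

# Illustrative sample text (an assumption, not part of the notebook's data)
text = "Hello world! Embeddings are fun."

# Sentence level: split on whitespace that follows sentence-ending punctuation
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character level: every character, including spaces
char_tokens = list(text)

print(sentence_tokens)
print(word_tokens)
print(char_tokens[:5])
```

Whichever level is chosen, each resulting token is what gets mapped to a vector.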



In [None]:
from transformers import AutoTokenizer, AutoModel

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

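# Load a language model
# Note: this checkpoint differs from the tokenizer's; normally both should come
# from the same checkpoint so that token IDs match the model's vocabulary.
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")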

# Tokenize a sentence and return PyTorch tensors
tokens = tokenizer("Hello world!.", return_tensors="pt")

# Run the model; index [0] is the last hidden state (one vector per token)
output = model(**tokens)[0]

In [None]:
output.shape, output

(torch.Size([1, 5, 384]),
 tensor([[[-3.3852,  0.1942, -0.3037,  ..., -0.1020, -0.3737,  0.2672],
          [-0.4374,  0.6422, -0.0889,  ...,  0.0400, -0.0443,  0.2459],
          [ 0.0882,  0.5516, -0.3626,  ...,  0.8513, -0.0747,  1.4372],
          [-0.4828,  0.5529,  0.1193,  ...,  0.3924, -0.4024,  0.3509],
          [-3.1103,  0.4121, -0.3935,  ...,  0.0524, -0.6022,  0.2807]]],
        grad_fn=<NativeLayerNormBackward0>))

In [None]:
# Decode each input ID separately to see how the sentence was split into tokens
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
!.
[SEP]


# Embeddings using the **sentence_transformers** package

In [None]:
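from sentence_transformers import SentenceTransformer

# Load a pretrained sentence embedding model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Encode a list of texts; the result holds one vector per input text
vector = model.encode(['Hello World!'])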

In [None]:
vector.shape

(1, 768)

In [None]:
# squeeze() removes all dimensions of size 1: (1, 768) -> (768,)
vector = vector.squeeze()

vector.shape

(768,)
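
The token-level model earlier returned one vector per token, while the sentence model returns a single vector for the whole text. One common way to bridge the two is mean pooling: averaging the token vectors while skipping padding positions. The sketch below illustrates that idea in plain Python on made-up numbers; the function name `mean_pool` and the toy values are assumptions for illustration, not the package's internal code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (made up): 4 token positions with 3-dimensional vectors
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, masked out below
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 2.0, 2.0]
```

The result has one entry per embedding dimension, no matter how many tokens the input had.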

# **Word2Vec** Embeddings

### Error:
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

**Resolve:** The error indicates that the installed numpy version is not compatible with the one gensim was built against. Reinstall numpy together with gensim so that pip picks compatible versions.

In [None]:
# -y skips the confirmation prompt, which would otherwise block the notebook cell
!pip uninstall -y numpy gensim
!pip install numpy gensim

# !pip install gensim

In [None]:
import gensim.downloader as api

# Download pretrained embeddings (about 66MB): GloVe vectors with 50 dimensions,
# trained on Wikipedia and Gigaword
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data

model = api.load("glove-wiki-gigaword-50")



In [None]:
# Print the 11 vectors closest to the vector for "king";
# because the query is a raw vector, "king" itself comes back first
model.most_similar([model['king']], topn=11)

[('king', 1.0000001192092896),
 ('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247777938843),
 ('son', 0.766719400882721),
 ('uncle', 0.7627150416374207),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492411136627197),
 ('ruler', 0.7434253692626953)]
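
The scores above are cosine similarities: the cosine of the angle between two vectors, where 1.0 means they point in the same direction. That is why "king" scores about 1.0 against itself, with the small excess coming from floating-point rounding. Here is a minimal sketch of the computation in plain Python; the three toy vectors are made up for illustration and are not taken from the GloVe model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors (assumptions, not real embeddings)
query = [0.5, 0.7, 0.1]
candidates = {
    "close": [0.4, 0.8, 0.2],   # points in a similar direction to query
    "far": [-0.6, 0.1, 0.9],    # points in a very different direction
}

# Rank candidates by similarity to the query, highest first
ranked = sorted(
    ((name, cosine_similarity(query, vec)) for name, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Ranking every vocabulary word this way and keeping the top `topn` entries gives a list like the one shown above.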