# Word embedding
Word embeddings represent words as low-dimensional vectors in a vector space, capturing their semantic and syntactic properties.

- Approaches:
    - One-Hot Encoding
    - Frequency-based
    - Neural Networks
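
As a baseline for the first approach, one-hot encoding can be sketched in a few lines. The toy vocabulary and the helper name `one_hot` are assumptions made for illustration. Each vector is as long as the vocabulary, and any two distinct words are equally far apart, so no similarity between words is captured.

```python
# Minimal one-hot encoding sketch (toy vocabulary, hypothetical helper name)
vocab_words = ["cat", "dog", "chess"]
word_to_index = {w: i for i, w in enumerate(vocab_words)}

def one_hot(word):
    # vector of zeros with a single 1 at the word's index
    vec = [0] * len(vocab_words)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0]
```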

NN-based word embeddings can be obtained with Word2Vec, GloVe, BERT, or GPT.

- Word2Vec has 2 approaches:
    - Continuous Bag of Words
    - Skip-Gram

NN-based embeddings aim to:
- capture context/meaning
- capture similarity to other words
- reduce dimension
- avoid memory issues

- They are built by training a NN on a corpus:

corpus --> NN --> Word Embedding

- Word2Vec: Continuous Bag of Words (CBOW)

The key idea is a sliding window whose center is the target word to be predicted.

With a window of size 2, we take the 2 words before the target word and the 2 words after it, look up their embeddings, average them, and feed the averaged vector to a NN that predicts the target word.
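
The sliding window and the averaging step can be sketched in plain Python. This is only an illustration under assumptions: the sentence, the 2-dimensional toy embeddings and the helper names `cbow_pairs` and `average_embedding` are made up here, and the NN that would consume the averaged vector is left out.

```python
# Sketch of CBOW training examples (pure Python, toy data).
# The sentence, window size and 2-d embeddings are illustrative assumptions.

def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs from a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

def average_embedding(context, embeddings):
    """Element-wise mean of the context word vectors (the CBOW input)."""
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][d] for w in context) / len(context) for d in range(dim)]

tokens = "the quick brown fox jumps".split()
pairs = cbow_pairs(tokens, window=2)
context, target = pairs[2]  # window centered on "brown"
print(context, "->", target)  # ['the', 'quick', 'fox', 'jumps'] -> brown

toy_embeddings = {
    "the": [1.0, 0.0], "quick": [0.0, 1.0], "brown": [0.5, 0.5],
    "fox": [1.0, 1.0], "jumps": [0.0, 0.0],
}
print(average_embedding(context, toy_embeddings))  # [0.5, 0.5]
```

Words near the start or end of the sentence simply get a shorter context in this sketch; real implementations may pad or handle the edges differently.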

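- GloVe: Global Vectors for Word Representation

It is based on the co-occurrence matrix of the words in a corpus, which counts how often pairs of words appear together in the same context window.

So you construct a matrix of word co-occurrence counts
and then factorize this matrix to obtain word embeddings.

The factorization is not a plain SVD (that is what classical count-based methods such as LSA use): GloVe fits the vectors with a weighted least-squares objective, so that the dot product of two word vectors approximates the logarithm of their co-occurrence count. The resulting embeddings are dense, low-dimensional vectors that summarize which words each word tends to co-occur with.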
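
The counting step can be sketched as follows. The toy corpus, the window size and the helper name `cooccurrence_counts` are assumptions for illustration; only the construction of the counts is shown, not the fitting of the vectors.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

toy_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("sat", "on")])   # 2: "sat" and "on" are neighbours in both sentences
print(counts[("cat", "dog")])  # 0: they never share a window
```

The counts are stored sparsely in a `Counter` because most word pairs never co-occur; a dense vocabulary-by-vocabulary matrix would be mostly zeros.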

- BERT: Bidirectional Encoder Representations from Transformers

BERT is a pre-trained Transformer encoder that produces contextual word embeddings. It is trained with masked language modelling: some words in a sentence are masked and the model learns to predict them.

It is also trained with next sentence prediction: given two sentences, the model predicts whether the second one actually follows the first in the original text.
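
The masking step of masked language modelling can be illustrated with a simplified sketch. The helper name `mask_tokens`, the whitespace tokenization and the fixed seed are assumptions; the actual BERT recipe is more elaborate (it works on subword tokens, and not every selected token is replaced by `[MASK]`).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Replace a random subset of tokens with mask_token.

    Returns the masked sequence and a dict {position: original token},
    i.e. the labels the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```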


- GPT: Generative Pre-trained Transformer

GPT does not provide a static word embedding but contextualized word embeddings: each occurrence of a word gets its own embedding, based on the other words in the text (in GPT, the preceding words, since it reads left to right). It uses the Transformer architecture.

# GloVe embedding

In [1]:
import torch 
import torchtext.vocab as vocab 

In [2]:
# download pre-trained word vectors
glove = vocab.GloVe(name = '6B', dim = 100)

.vector_cache/glove.6B.zip: 862MB [03:20, 4.29MB/s]                               
100%|█████████▉| 399999/400000 [00:24<00:00, 16463.58it/s]


In [3]:
# number of words and embedding dimension
print(f'number of tokens {glove.vectors.shape[0]}\nembedding space size {glove.vectors.shape[1]}')

number of tokens 400000
embedding space size 100


In [4]:
# get the embedding vector for a specific word
def get_embedding_vector(word):
    # stoi maps a token to its row index; an unknown word raises KeyError
    word_index = glove.stoi[word]
    emb = glove.vectors[word_index]
    return emb

# example
get_embedding_vector('chess')

tensor([ 3.7635e-01,  5.4567e-01,  3.0534e-01,  9.0395e-01, -8.8172e-02,
         6.2945e-01,  4.0376e-01, -8.1160e-01, -1.9370e-01, -3.1395e-01,
        -1.6067e-02, -6.8291e-01, -1.2400e-02, -2.0827e-01, -1.0267e+00,
         1.4386e+00,  5.1816e-01,  2.0026e-01, -8.3672e-04, -2.9563e-01,
        -7.5463e-01,  1.9618e-01,  6.0900e-01,  3.6774e-01,  7.2106e-01,
        -8.6832e-01, -2.1198e-01, -4.3051e-01,  7.1873e-01,  7.5019e-01,
        -6.0245e-01,  7.5618e-01, -5.5033e-01, -6.6510e-01,  5.3047e-01,
        -2.2391e-01, -9.2297e-01,  6.2659e-01, -2.5183e-01, -8.2082e-01,
        -1.6507e-01,  2.9234e-01, -2.6373e-01, -8.1124e-01, -4.0006e-02,
        -1.3341e-01,  2.9392e-01, -4.4894e-01,  5.6080e-02,  3.9754e-01,
        -6.8598e-01, -3.4001e-01, -1.1112e-02,  7.5445e-01,  2.8091e-01,
        -1.4169e+00,  2.7837e-01,  3.4846e-01,  1.3482e-01,  1.2508e+00,
        -8.0446e-02,  4.9207e-01, -7.0844e-01,  6.3239e-01, -3.8550e-01,
        -4.9367e-01, -2.1818e-01,  7.6461e-01,  6.3

In [5]:
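# find the words closest to an input word (Euclidean distance)
def get_closest_words_from_word(word, max_n = 5):
    word_emb = get_embedding_vector(word)
    # distance from the input word to every word in the vocabulary (one torch.dist call per word, so this is slow)
    distances = [(w,torch.dist(word_emb , get_embedding_vector(w)).cpu().item()) for w in glove.itos]
    # smallest distances first; the input word itself comes first with distance 0
    dist_sorted_list = sorted(distances, key = lambda x : x[1])[:max_n]
    return dist_sorted_list

get_closest_words_from_word('cat')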

[('cat', 0.0),
 ('dog', 2.681130886077881),
 ('rabbit', 3.648970603942871),
 ('cats', 3.6892004013061523),
 ('monkey', 3.7469322681427)]

In [8]:
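# find the words closest to an arbitrary embedding vector (Euclidean distance)
def get_closest_words_from_embedding(word_emb, max_n = 5):
    # distance from the given vector to every word in the vocabulary
    distances = [(w, torch.dist(word_emb, get_embedding_vector(w)).cpu().item()) for w in glove.itos ]
    # keep the max_n words with the smallest distance
    dist_sorted_list = sorted(distances, key = lambda x: x[1])[:max_n]
    return dist_sorted_list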
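
The functions above rank words by Euclidean distance. Cosine similarity is a common alternative that compares only the direction of the vectors and ignores their length. The sketch below uses hand-made 2-dimensional vectors (not GloVe values) and hypothetical helper names, so it runs without downloading anything.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_by_cosine(query, embeddings, max_n=3):
    """Words sorted by decreasing cosine similarity to the query vector."""
    scored = [(w, cosine_similarity(query, vec)) for w, vec in embeddings.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:max_n]

# toy vectors: "cats" points in the same direction as "cat" but is 5 times longer
toy = {"cat": [1.0, 0.1], "cats": [5.0, 0.5], "car": [0.1, 1.0]}
print(closest_by_cosine(toy["cat"], toy, max_n=2))
```

In this toy example "cats" is farther from "cat" than "car" is in Euclidean distance, yet it has the highest cosine similarity, because only its length differs.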


In [12]:
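# find word analogies
# for example: king is to queen as man is to woman

def get_word_analogy(word1, word2, word3, max_n = 5):
    # computes word1 - word2 + word3 and returns the words closest to the result
    word_emb1 = get_embedding_vector(word1)
    word_emb2 = get_embedding_vector(word2)
    word_emb3 = get_embedding_vector(word3)

    word_emb4 = word_emb1 - word_emb2 + word_emb3
    analogy = get_closest_words_from_embedding(word_emb4, max_n = max_n)
    return analogy

# note: this call computes king - queen + man, which stays close to 'man';
# to solve "king is to queen as man is to ?" the vector is queen - king + man,
# i.e. get_word_analogy('queen', 'king', 'man')
get_word_analogy('king', 'queen','man')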

[('man', 4.281251907348633),
 ('brother', 5.3268609046936035),
 ('thought', 5.396074295043945),
 ('son', 5.4332075119018555),
 ('father', 5.440552234649658)]
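
The analogy arithmetic can be checked on hand-made vectors. For "a is to b as c is to ?", the usual formulation is `b - a + c`, with the three input words excluded from the candidates so that an input word (such as 'man' above) cannot be returned as the answer. The 2-dimensional vectors and the helper name `solve_analogy` below are assumptions chosen so that the first axis roughly encodes "royalty" and the second "gender"; they are not GloVe values.

```python
import math

# toy embeddings: axis 0 ~ royalty, axis 1 ~ gender (illustrative only)
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "boy":   [-0.5, 1.0],
}

def solve_analogy(a, b, c, embeddings, max_n=1):
    """a is to b as c is to ?  ->  nearest words to b - a + c, inputs excluded."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [(w, math.dist(target, vec))
                  for w, vec in embeddings.items() if w not in (a, b, c)]
    return sorted(candidates, key=lambda x: x[1])[:max_n]

print(solve_analogy("king", "queen", "man", toy))  # [('woman', 0.0)]
```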