# Word Embeddings
## Word embedding - origins and fundamentals
-    Collective name for a set of language modeling and feature learning techniques in natural language processing (NLP)
-    One-hot encoding is an early example, but it captures no similarity between words: every pair of distinct words is equally far apart, and each vector is as long as the vocabulary
-    Other examples borrow from Information Retrieval (IR): Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), and topic modeling
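
To make the one-hot limitation concrete, here is a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `cosine` are assumptions made up for illustration; they are not part of any library.

```python
import math

# Toy vocabulary, invented for illustration.
vocab = ["king", "queen", "apple"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(one_hot("king"), one_hot("queen")))  # 0.0
print(cosine(one_hot("king"), one_hot("apple")))  # 0.0
print(cosine(one_hot("king"), one_hot("king")))   # 1.0
```

"king" comes out exactly as dissimilar to "queen" as it is to "apple", so the encoding carries no notion of meaning.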

### Distributed representations
-    Attempt to capture the meaning of a word by considering its relations with other words in its context.
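
One simple way to see "meaning from context" is to count which words appear near each other. The sketch below is only an illustration of that idea, not how Word2Vec is trained; the toy corpus, the window size, and the function name `cooccurrence` are all assumptions.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
sentences = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
    "the farmer eats the apple".split(),
]

def cooccurrence(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = cooccurrence(sentences)
print(counts["king"])
print(counts["queen"])
print(counts["farmer"])
```

In this corpus "king" and "queen" end up with identical context counts, while "farmer" does not, which is the kind of signal a distributed representation tries to encode.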

### Static embeddings
-    Embeddings are generated against a large corpus, but the number of words, though large, is finite.
-    Think of a static embedding as a dictionary, with words as keys and their vectors as values.
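
The dictionary analogy can be taken literally. In the sketch below the vectors are hypothetical three-dimensional values chosen by hand, and `lookup` is an illustrative helper, not a Gensim function.

```python
# Hypothetical 3-dimensional vectors; real embeddings are learned from a
# corpus and usually have far more dimensions.
embeddings = {
    "king": [0.8, 0.1, 0.3],
    "queen": [0.7, 0.2, 0.4],
}

def lookup(word):
    """Return the stored vector, or None if the word is out of vocabulary."""
    return embeddings.get(word)

print(lookup("king"))       # [0.8, 0.1, 0.3]
print(lookup("kingdomly"))  # None
```

A word always maps to the same vector regardless of the sentence it appears in, and a word that was not in the training corpus has no entry at all.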
#### Word2Vec
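-    Self-supervised
-    Two architectures: Continuous Bag of Words (CBOW), which predicts a word from its context, and Skip-gram, which predicts the context from a word
-    Skip-Gram with Negative Sampling (SGNS) model
-    GloVe (Global Vectors for Word Representation): a related static embedding, trained on global co-occurrence counts rather than local context windows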
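
The difference between CBOW and Skip-gram is easiest to see in the training examples each one generates. The following is a minimal sketch of that data preparation step only; the function names are illustrative, and the negative sampler draws uniformly, which is a simplification (Word2Vec samples negatives from a smoothed word-frequency distribution).

```python
import random

def skipgram_pairs(tokens, window=1):
    """Yield (center, context) pairs: predict each neighbour from the center."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def cbow_pairs(tokens, window=1):
    """Yield (context_words, center) pairs: predict the center from its neighbours."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, center

def negative_samples(vocab, exclude, k, rng):
    """Draw k words not in `exclude`, uniformly at random (a simplification)."""
    candidates = [w for w in vocab if w not in exclude]
    return [rng.choice(candidates) for _ in range(k)]

tokens = "the quick brown fox".split()
print(list(skipgram_pairs(tokens)))
print(list(cbow_pairs(tokens)))
# For the true pair ("quick", "brown"), draw two "wrong" context words.
print(negative_samples(tokens, exclude={"quick", "brown"}, k=2, rng=random.Random(0)))
```

In SGNS, each true (center, context) pair is contrasted against a handful of such negative words, so the model learns to tell real neighbours from random ones instead of predicting over the whole vocabulary.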
#### Creating your own embeddings using Gensim
-    Gensim is an open-source Python library designed to extract semantic meaning from text documents.

In [1]:
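import os
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # downloads the text8 corpus on first use
model = Word2Vec(dataset)    # train with Gensim's default hyperparameters
os.makedirs("data", exist_ok=True)  # ensure the output directory exists
model.save("data/text8-word2vec.bin")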

### Exploring the embedding space with Gensim
-    Reload the model we just built and explore it

In [2]:
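from gensim.models import Word2Vec

# The file was written by Word2Vec.save, so load it back as a full model.
model = Word2Vec.load("data/text8-word2vec.bin")
word_vectors = model.wv  # KeyedVectors: the word -> vector lookup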

-    Look at the first few words in the vocabulary

In [3]:
# Gensim 4 replaced `word_vectors.vocab` with `key_to_index` (word -> index)
# and `index_to_key` (index -> word; most frequent words first by default).
assert "king" in word_vectors.key_to_index

# Collect the vectors of the first 10 words in the vocabulary.
my_dict = {key: word_vectors[key] for key in word_vectors.index_to_key[:10]}

my_dict.keys()

dict_keys(['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two'])

Look for words similar to a given word, here "king".

In [4]:
def print_most_similar(word_conf_pairs, k):
    for i, (word, conf) in enumerate(word_conf_pairs):
        print("{:.3f} {:s}".format(conf, word))
        if i >= k-1:
            break
    if k < len(word_conf_pairs):
        print("...")

print_most_similar(word_vectors.most_similar("king"), 5)

0.737 prince
0.733 queen
0.716 emperor
0.709 vii
0.691 throne
...
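
The ranking above can be reproduced by hand: score every other word by its cosine similarity to the query vector and sort. The sketch below does this on hypothetical two-dimensional vectors; the values and the standalone `most_similar` function are assumptions for illustration, not the trained model or Gensim's implementation.

```python
import math

# Hypothetical 2-D vectors, invented so the arithmetic is easy to follow.
vectors = {
    "king": [1.0, 0.0],
    "prince": [0.8, 0.6],
    "throne": [0.6, 0.8],
    "apple": [0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

for other, score in most_similar("king", topn=3):
    print("{:.3f} {:s}".format(score, other))
```

The query word itself is excluded from the candidates, which is why "king" never appears in its own list of neighbours.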


You can also do vector arithmetic, similar to the country-capital example we described earlier: france - paris + berlin should land near germany.

In [8]:
print_most_similar(word_vectors.most_similar(
    positive=["france", "berlin"], negative=["paris"]), 1
)

0.826 germany
...
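
The query with `positive` and `negative` words amounts to adding and subtracting unit-length vectors, then ranking the remaining words by cosine similarity to the result. Below is a minimal sketch of that additive scheme on hand-made three-dimensional vectors; the vector values and the `analogy` helper are assumptions for illustration only.

```python
import math

# Hypothetical vectors: axis 0 ~ "French", axis 1 ~ "German", axis 2 ~ "capital".
vectors = {
    "france": [1.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, 1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, topn=1):
    """Add the positive unit vectors, subtract the negative ones, rank by cosine."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are not candidates
    scored = [(w, sum(q * x for q, x in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(analogy(positive=["france", "berlin"], negative=["paris"]))
```

With these toy values the top candidate is "germany", mirroring the result from the trained model.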


The preceding similarity value is the cosine similarity. Alternatively, compute the distance on a log scale, which amplifies the difference between shorter distances and reduces the difference between longer ones.

In [9]:
print_most_similar(word_vectors.most_similar_cosmul(
    positive=["france", "berlin"], negative=["paris"]), 1
)

0.971 germany
...
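
The multiplicative variant replaces the sum of similarities with a ratio of products; since multiplying and dividing is the same as adding and subtracting logarithms, this is one way to read the log-scale description. The sketch below follows the commonly cited 3CosMul formulation under stated assumptions: the vectors are hand-made, `analogy_cosmul` is an illustrative helper, and the small constant `eps` is only there to avoid division by zero.

```python
import math

# Hypothetical vectors: axis 0 ~ "French", axis 1 ~ "German", axis 2 ~ "capital".
vectors = {
    "france": [1.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_cosmul(positive, negative, topn=1, eps=1e-6):
    """Score = product of positive similarities / product of negative ones."""
    exclude = set(positive) | set(negative)
    scored = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        # Shift each cosine from [-1, 1] to [0, 1] so the products stay non-negative.
        pos = math.prod((1 + cosine(vec, vectors[p])) / 2 for p in positive)
        neg = math.prod((1 + cosine(vec, vectors[n])) / 2 for n in negative)
        scored.append((word, pos / (neg + eps)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(analogy_cosmul(positive=["france", "berlin"], negative=["paris"]))
```

Because the score is a ratio rather than a cosine, its values are not directly comparable with those from the additive query.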


Gensim also provides a doesnt_match function, which picks out the word that does not belong with the others in a list.

In [10]:
print(word_vectors.doesnt_match(["hindus", "parsis", "singapore", "christians"]))

singapore
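
One way to find the odd one out is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below applies this idea to hypothetical two-dimensional vectors; the values and the standalone `doesnt_match` function are assumptions for illustration, not the trained model.

```python
import math

# Hypothetical 2-D vectors: three words point one way, one points another.
vectors = {
    "hindus": [1.0, 0.1],
    "parsis": [0.9, 0.2],
    "christians": [1.0, 0.0],
    "singapore": [0.1, 1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    # Dot product of two unit vectors is their cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

print(doesnt_match(["hindus", "parsis", "singapore", "christians"]))  # singapore
```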
