<a href="https://colab.research.google.com/github/dude123studios/AdvancedDeepLearning/blob/main/Embeddings_with_Gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
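import os

import gensim.downloader as api
from gensim.models import Word2Vec

# Download (on first use) and load the text8 corpus, then train with default settings.
dataset = api.load('text8')
model = Word2Vec(dataset)

# Make sure the target directory exists before saving the full model.
os.makedirs('data', exist_ok=True)
model.save('data/text8-word2vec.bin')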

In [3]:
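from itertools import islice
from gensim.models import Word2Vec

# The file holds a full Word2Vec model, so load it with Word2Vec.load;
# the trained vectors live on its .wv attribute (a KeyedVectors instance).
model = Word2Vec.load('data/text8-word2vec.bin')
word_vectors = model.wv

# Gensim 3.x API; in Gensim 4+ the vocabulary is word_vectors.key_to_index.
words = word_vectors.vocab.keys()
print(list(islice(words, 10)))
assert 'king' in words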

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


In [4]:
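def print_most_similar(word_conf_pairs, k):
  """Print the first k (word, score) pairs, then '...' if more remain."""
  for i, (word, conf) in enumerate(word_conf_pairs):
    print('{:.3f} {:s}'.format(conf, word))
    if i >= k - 1:
      break
  if k < len(word_conf_pairs):
    print('...')

# most_similar returns (word, cosine similarity) pairs, best match first.
print_most_similar(word_vectors.most_similar('king'), 5)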

0.728 prince
0.719 queen
0.704 emperor
0.697 kings
0.694 throne
...


We can do arithmetic on word vectors. For example, paris : france ~ berlin : germany, so france - paris + berlin should land near germany.

In [6]:
print_most_similar(word_vectors.most_similar(positive=['france','berlin'],negative=['paris']),1)

0.773 germany
...
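
To make the arithmetic concrete, here is a minimal sketch of the additive analogy search on hand-made vectors. The 4-dimensional `toy_vectors` and the helper names (`unit`, `cosine`, `analogy_additive`) are invented for illustration and are not part of Gensim; we are only assuming that `most_similar` does roughly the same thing, namely normalise the input vectors, add the positives, subtract the negatives, and rank the remaining words by cosine similarity to the result.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# Hypothetical embeddings: axis 0 = "is a country", axes 1-3 = France, Germany, Italy.
toy_vectors = {
    'paris':   [0.0, 1.0, 0.0, 0.0],
    'france':  [1.0, 1.0, 0.0, 0.0],
    'berlin':  [0.0, 0.0, 1.0, 0.0],
    'germany': [1.0, 0.0, 1.0, 0.0],
    'rome':    [0.0, 0.0, 0.0, 1.0],
    'italy':   [1.0, 0.0, 0.0, 1.0],
}

def analogy_additive(vectors, positive, negative):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The input words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

print(analogy_additive(toy_vectors, ['france', 'berlin'], ['paris']))  # germany
```

On real embeddings the directions are learned rather than hand-assigned, so the match is approximate rather than exact.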


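Both methods are built on cosine similarity. `most_similar_cosmul` combines the similarities with the multiplicative objective (3CosMul) proposed by Levy and Goldberg instead of the additive one used by `most_similar`, which is often a better way to answer analogy queries.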

In [7]:
print_most_similar(word_vectors.most_similar_cosmul(positive=['france','berlin'],negative=['paris']),1)

0.935 germany
...


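As you can see, the reported score is considerably higher (0.935 versus 0.773), although the two methods score candidates differently, so the numbers are not directly comparable.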
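
The multiplicative scoring can be sketched in a few lines. This is an illustration under our own assumptions, not Gensim's code: the `toy_vectors` are invented, and `analogy_cosmul` is a hypothetical helper that shifts each cosine into the range [0, 1], multiplies the similarities to the positive words, and divides by the product of the similarities to the negative words plus a small constant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: axis 0 = "is a country", axes 1-3 = France, Germany, Italy.
toy_vectors = {
    'paris':   [0.0, 1.0, 0.0, 0.0],
    'france':  [1.0, 1.0, 0.0, 0.0],
    'berlin':  [0.0, 0.0, 1.0, 0.0],
    'germany': [1.0, 0.0, 1.0, 0.0],
    'rome':    [0.0, 0.0, 0.0, 1.0],
    'italy':   [1.0, 0.0, 0.0, 1.0],
}

def analogy_cosmul(vectors, positive, negative, eps=1e-6):
    """Rank candidates by prod(shifted cos to positives) / prod(shifted cos to negatives)."""
    def shifted(u, v):
        return (1.0 + cosine(u, v)) / 2.0  # map [-1, 1] onto [0, 1]

    exclude = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        numerator = math.prod(shifted(vec, vectors[p]) for p in positive)
        denominator = math.prod(shifted(vec, vectors[n]) for n in negative) + eps
        scores.append((word, numerator / denominator))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = analogy_cosmul(toy_vectors, ['france', 'berlin'], ['paris'])
print(ranking[0][0])  # germany
```

Because the score is a ratio, it is not bounded by 1 the way a cosine is (the top score in this toy example is above 1), which is one reason to avoid comparing it directly with a `most_similar` score.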

Now we can experiment with out-of-place words. In this example, the word "random" is clearly out of place among the fruits.

In [8]:
print(word_vectors.doesnt_match(['apples','oranges','random','bananas','grapes']))

random
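
One simple way to find the odd word out, and as far as we understand roughly what `doesnt_match` does, is to average the unit-length vectors of all the words and return the word whose vector points least in the direction of that average. The sketch below uses invented 2-dimensional `toy_vectors` and a hypothetical `odd_one_out` helper, so treat it as an illustration of the idea rather than the library's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

# Hypothetical embeddings: the fruits cluster along axis 0, 'random' does not.
toy_vectors = {
    'apples':  [1.0, 0.1],
    'oranges': [0.9, 0.2],
    'random':  [-0.2, 1.0],
    'bananas': [1.0, 0.0],
    'grapes':  [0.8, 0.1],
}

def odd_one_out(vectors, words):
    """Return the word least aligned with the mean of the normalised vectors."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(v[i] for v in normed.values()) / len(normed) for i in range(dim)])
    # Dot product of two unit vectors is their cosine similarity.
    alignment = {w: sum(a * b for a, b in zip(v, mean)) for w, v in normed.items()}
    return min(alignment, key=alignment.get)

print(odd_one_out(toy_vectors, ['apples', 'oranges', 'random', 'bananas', 'grapes']))  # random
```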




We can also calculate the cosine similarity between pairs of words:

In [9]:
words = ['apples','oranges','bananas','car','tree','rollercoaster']
for word in words:
  print('similarity(grapes,{:s}) = {:.3f}'.format(word,word_vectors.similarity('grapes',word)))

similarity(grapes,apples) = 0.775
similarity(grapes,oranges) = 0.727
similarity(grapes,bananas) = 0.610
similarity(grapes,car) = 0.063
similarity(grapes,tree) = 0.495
similarity(grapes,rollercoaster) = 0.307


Grapes are similar to apples, oranges, and bananas, somewhat similar to trees, less similar to rollercoasters, and not similar to cars at all.

In [11]:
print_most_similar(word_vectors.similar_by_word('singapore'),5)

0.878 malaysia
0.840 indonesia
0.833 uganda
0.829 brunei
0.823 barbados
...


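We can also compute the distance between word vectors. `distance` is 1 minus the cosine similarity, so the value below agrees with the 0.878 similarity reported for malaysia above.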

In [12]:
print('distance(singapore, malaysia) = '+ str(word_vectors.distance('singapore','malaysia')))

distance(singapore, malaysia) = 0.12174350023269653
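
For reference, cosine similarity and the corresponding distance are easy to compute by hand. The sketch below uses two made-up 2-dimensional vectors and our own helper names (`cosine_similarity`, `cosine_distance`); it only illustrates the relationship distance = 1 - similarity.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Two illustrative vectors that are 45 degrees apart.
u = [1.0, 0.0]
v = [1.0, 1.0]
sim = cosine_similarity(u, v)
dist = cosine_distance(u, v)
print('similarity = {:.3f}, distance = {:.3f}'.format(sim, dist))
# similarity = 0.707, distance = 0.293
```

Identical directions give a similarity of 1 (distance 0), and orthogonal vectors give a similarity of 0 (distance 1).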


In addition, we can look up the raw embedding vector for a specific word.

In [13]:
vec_apple = word_vectors['apple']
print(vec_apple)

[ 0.03849123  2.0464475  -1.5543152   0.425766    0.39379004 -0.6558501
  0.85083926 -0.04886514  0.6428664   0.41137114  0.5111856   2.4500477
 -0.6449271   0.6567483  -3.785012   -0.30663005  1.1449054   1.7108201
  0.010325   -0.12602803  0.7945448  -2.6246612  -1.01991     1.6095597
  0.22073463  0.8946597  -0.27490544 -0.32423925 -0.6489213  -0.33749405
 -0.25108802  2.0312889  -0.75423306 -0.57523495 -2.0655732  -0.28978768
  0.02136061 -1.6766669  -0.27531907  1.5908382  -2.655386    2.7505744
 -1.7928637   0.9573799   0.24740696 -0.49393222 -2.4332561   2.3486032
  1.7046545  -0.44828755  0.39563498 -1.1315526  -0.31336647 -1.3206514
  0.08113258  2.1907551   2.0735445   1.8546292   0.2900023   4.32731
 -1.0502985   0.25231117  1.0018309  -2.5597386  -0.2105543  -0.74183244
  1.9333822   1.4301778   0.57632947  1.7417641  -2.2287362  -1.222632
  0.197973   -1.0327437   2.3882747   0.3089105  -0.77825034  1.2305263
 -0.81766325 -0.40372443 -0.67805296 -1.8627924   0.81034505  0.