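This notebook tests word embeddings on analogy tasks. For each analogy, the target word's vector is predicted by vector arithmetic on the other three words, and the prediction is scored by the L2 norm of the difference between the actual and the predicted vector. The lower the L2 norm, the better the embedding captures the analogy.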
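
As a minimal sketch of this check, the toy example below uses made-up 2-dimensional vectors (not the embeddings loaded later in this notebook) and a hypothetical helper, `analogy_error`, to show how a predicted vector is compared with the actual one.

```python
import numpy as np

# Toy 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "boy": np.array([0.0, 0.0]),
    "girl": np.array([0.0, 1.0]),
    "prince": np.array([1.0, 0.0]),
    "princess": np.array([1.0, 1.0]),
    "queen": np.array([1.0, 2.0]),
}

def analogy_error(emb, a, b, c, target):
    """L2 norm between emb[target] and the prediction emb[b] - emb[a] + emb[c]."""
    predicted = emb[b] - emb[a] + emb[c]
    return float(np.linalg.norm(emb[target] - predicted))

# boy : prince :: girl : princess
exact = analogy_error(toy_embeddings, "boy", "prince", "girl", "princess")
# Same prediction scored against a word that does not complete the analogy.
off = analogy_error(toy_embeddings, "boy", "prince", "girl", "queen")
print(exact, off)
```

In this toy setup the analogy holds exactly for "princess", so its error is zero, while "queen" sits one unit away from the predicted point.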

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
!pip install bcolz --quiet

Collecting bcolz
  Downloading https://files.pythonhosted.org/packages/5c/4e/23942de9d5c0fb16f10335fa83e52b431bcb8c0d4a8419c9ac206268c279/bcolz-1.2.1.tar.gz (1.5MB)
Building wheels for collected packages: bcolz
  Building wheel for bcolz (setup.py) ... done
  Created wheel for bcolz: filename=bcolz-1.2.1-cp36-cp36m-linux_x86_64.whl size=2668999 sha256=85654e2ab275eccdac208716f3d29b9af64b5cd912afb204a84bc24fa6bb2902
  Stored in directory: /root/.cache/pip/wheels/9f/78/26/fb8c0acb91a100dc8914bf236c4eaa4b207cb876893c40b745
Successfully built bcolz
Installing collected packages: bcolz
Successfully installed bcolz-1.2.1


In [None]:
import bcolz
import numpy as np
import pickle

In [None]:
# Build a word -> vector dictionary from the carray for the 50-length embedding

vectors = bcolz.open('./drive/My Drive/AML_2/6B.50d.dat')[:]
with open('./drive/My Drive/AML_2/6B.50_words.pkl', 'rb') as f:
  words = pickle.load(f)
with open('./drive/My Drive/AML_2/6B.50_idx.pkl', 'rb') as f:
  word2idx = pickle.load(f)

myDict_50 = {w: vectors[word2idx[w]] for w in words}

In [None]:
# Build a word -> vector dictionary from the carray for the 200-length embedding

vectors_200 = bcolz.open('./drive/My Drive/AML_2/6B.200d.dat')[:]
with open('./drive/My Drive/AML_2/6B.200_words.pkl', 'rb') as f:
  words_200 = pickle.load(f)
with open('./drive/My Drive/AML_2/6B.200_idx.pkl', 'rb') as f:
  word2idx_200 = pickle.load(f)

myDict_200 = {w: vectors_200[word2idx_200[w]] for w in words_200}
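
The two cells above follow the same pattern: an array of vectors, a list of words, and a word-to-row index are combined into a dictionary. A minimal sketch of that pattern, using small in-memory stand-ins instead of the bcolz and pickle files (the helper name `build_embedding_dict` and the toy data are assumptions made for illustration):

```python
import numpy as np

def build_embedding_dict(vectors, words, word2idx):
    """Map each word to its row in the vectors array."""
    return {w: vectors[word2idx[w]] for w in words}

# Toy stand-ins for the array and pickles loaded from Drive.
toy_vectors = np.arange(6, dtype=float).reshape(3, 2)
toy_words = ["prince", "boy", "girl"]
toy_word2idx = {w: i for i, w in enumerate(toy_words)}

toy_dict = build_embedding_dict(toy_vectors, toy_words, toy_word2idx)

# Every vector should have the same length as the embedding dimension.
assert all(v.shape == (2,) for v in toy_dict.values())
print(toy_dict["girl"])
```

Wrapping the comprehension in a function would let both embedding lengths share one code path, at the cost of passing the three inputs explicitly.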

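Here, we use a weighted L2 norm to compare the effect of embedding vector length on the analogy task. The function below computes the sum of squared differences between the two vectors (the squared L2 norm), multiplies it by the length of the embedding being evaluated, and divides by 250 (200 + 50). The effective weight is therefore 50/250 = 0.2 for the 50-length embedding and 200/250 = 0.8 for the 200-length embedding.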

In [None]:
# Weighted L2 norm of the difference between two word embeddings

def takeL2norm(vec1, vec2):
  # Sum of squared differences (squared L2 norm), scaled by len(vec) / 250
  vec = vec1 - vec2
  norm = np.sum(np.power(vec, 2))
  l2 = (norm * len(vec)) / 250
  return l2
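
To see how the weighting behaves, here is a small sketch with made-up vectors whose entries all differ by 1. The function `weighted_sq_l2` is a hypothetical restatement of `takeL2norm` with the 250 made an explicit parameter; the inputs are not real embeddings.

```python
import numpy as np

def weighted_sq_l2(vec1, vec2, total_dims=250):
    """Sum of squared differences, scaled by len(vec) / total_dims."""
    diff = np.asarray(vec1) - np.asarray(vec2)
    return float(np.sum(diff ** 2) * len(diff) / total_dims)

# Every coordinate differs by exactly 1, so the sum of squares equals the length.
score_50 = weighted_sq_l2(np.ones(50), np.zeros(50))     # 50 * 50 / 250
score_200 = weighted_sq_l2(np.ones(200), np.zeros(200))  # 200 * 200 / 250
print(score_50, score_200)
```

With the same per-coordinate difference, the 200-length score is 16 times the 50-length score: a factor of 4 from the longer sum and another factor of 4 from the weight.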



---


Analogy 1

---

> Checking the logic,

> princess = prince - boy + girl



In [None]:
# Analogy 1
# for 50 length vector embedding

princess_50_actual = myDict_50['princess']
princess_50_predicted = myDict_50['prince'] - myDict_50['boy'] + myDict_50['girl'] 
loss_metric_50 = takeL2norm(princess_50_actual, princess_50_predicted)

In [None]:
# for 200 length vector embedding

princess_200_actual = myDict_200['princess']
princess_200_predicted = myDict_200['prince'] - myDict_200['boy'] + myDict_200['girl'] 
loss_metric_200 = takeL2norm(princess_200_actual, princess_200_predicted)

In [None]:
print(loss_metric_50)
print(loss_metric_200)

2.1125659543968114
23.508401588223208




---


Analogy 2

---



>Checking the logic,

>asia**:**india **::** europe**:**germany

In [None]:
# Analogy 2
# for 50 length embedding

asia_50 = myDict_50['asia']
india_50 = myDict_50['india']
europe_50 = myDict_50['europe']
germany_50_actual = myDict_50['germany']
germany_50_predicted = europe_50 - asia_50 + india_50
loss_metric_50 = takeL2norm(germany_50_actual, germany_50_predicted)

In [None]:
# for 200 length embedding

asia_200 = myDict_200['asia']
india_200 = myDict_200['india']
europe_200 = myDict_200['europe']
germany_200_actual = myDict_200['germany']
germany_200_predicted = europe_200 - asia_200 + india_200
loss_metric_200 = takeL2norm(germany_200_actual, germany_200_predicted)

In [None]:
print(loss_metric_50)
print(loss_metric_200)

5.26106093921469
41.86019923530512




---


Analogy 3

---



>Checking the logic,

>do**:**did  **::** go**:**went

In [None]:
# Analogy 3

# for 50 length embedding

went_50_actual = myDict_50['went']
went_50_predicted = myDict_50['did'] - myDict_50['do'] + myDict_50['go']
loss_metric_50 = takeL2norm(went_50_actual, went_50_predicted)

In [None]:
# for 200 length embedding

went_200_actual = myDict_200['went']
went_200_predicted = myDict_200['did'] - myDict_200['do'] + myDict_200['go']
loss_metric_200 = takeL2norm(went_200_actual, went_200_predicted)

In [None]:
print(loss_metric_50)
print(loss_metric_200)

0.6423520522060218
8.688189643888883



---
> OBSERVATION
---

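> It is observed that the smaller the embedding length, the lower the weighted L2 norm and the better the analogy holds: in all three analogies, the 50-length embedding scores lower than the 200-length embedding (2.11 vs 23.51, 5.26 vs 41.86, and 0.64 vs 8.69). We can conclude that analogically related words lie nearer to each other when represented in lower dimensions.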
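
Because the weight depends only on the embedding length, the unweighted sum of squared differences can be recovered from the printed values by multiplying by 250 and dividing by the length. A short sketch of that arithmetic, using the outputs printed above (the variable and function names are illustrative, not part of the original notebook):

```python
# Weighted scores printed by the notebook: target word -> (50-length, 200-length).
reported = {
    "princess": (2.1125659543968114, 23.508401588223208),
    "germany": (5.26106093921469, 41.86019923530512),
    "went": (0.6423520522060218, 8.688189643888883),
}

def unweight(score, dims, total_dims=250):
    """Invert the len(vec) / 250 weighting applied by takeL2norm."""
    return score * total_dims / dims

raw = {
    word: (unweight(s50, 50), unweight(s200, 200))
    for word, (s50, s200) in reported.items()
}

for word, (r50, r200) in raw.items():
    print(f"{word}: 50-length {r50:.3f}, 200-length {r200:.3f}")
```

The recovered 50-length sums remain lower than the 200-length sums for all three analogies.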