#Introduction

[Colab Notebook Link](https://colab.research.google.com/drive/1WC1pB5j5ZA7h92SqqFmBB8jUZH5qgdO9?usp=sharing)

In this part of the homework, cosine similarity and a dissimilarity metric were implemented to measure how similar or dissimilar two word embeddings are. Ideally, such metrics could be used in scenarios like estimating, from a patient's medical report, how likely a disease such as COVID is to be present.

#Setup Code

We will use the COVID-19 (CORD-19) Swivel word embeddings published on TensorFlow Hub, as shown in the tutorial at https://www.tensorflow.org/hub/tutorials/cord_19_embeddings_keras

This was referenced via Lecture 8 at
https://github.com/Uzmamushtaque/CSCI4962-Projects-ML-AI/blob/main/Lecture_8.ipynb

In [67]:
import numpy as np
from numpy.linalg import norm
import pandas as pd

import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_hub as hub

In [68]:
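module = hub.load('https://tfhub.dev/tensorflow/cord-19/swivel-128d/3')
#NOTE: queries is not defined anywhere in this notebook; it must be a list of
#nine strings (matching the [9, 128] shape printed below) before this cell runs
embeddings = module(queries)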

In [69]:
print(embeddings.get_shape().as_list())

[9, 128]


In [70]:
print(embeddings)

tf.Tensor(
[[-0.47197425 -0.33664635 -0.0161463  ...  0.26368123  0.05248641
   0.14978611]
 [-0.9464257  -0.5886067   0.01101076 ...  0.17479871 -0.11735114
   0.11467868]
 [-0.6674375  -0.24282083 -0.22304116 ... -0.1180256   0.03881054
   0.23996338]
 ...
 [-0.25398737 -0.28904256 -0.25699255 ...  0.00661578 -0.21710789
  -0.02514506]
 [-0.3953484  -0.00334413  0.2994688  ... -0.3633064   0.29915136
   0.47241527]
 [ 0.02567826 -0.28097168 -0.02309187 ... -0.19157878  0.3810111
  -0.5676846 ]], shape=(9, 128), dtype=float64)


In [71]:
#preliminary test: embed two single words
single_embedding = module(["coronavirus"])
single_embedding2 = module(["sick"])

In [72]:
print(embeddings.get_shape().as_list())

[9, 128]


In [73]:
se_np = single_embedding.numpy()
print(se_np)

[[-0.47197425 -0.33664635 -0.0161463  -0.20361769 -0.09667411  0.0999501
   0.16814792  0.08613649  0.25580052 -0.41777578  0.1589236   0.04924057
  -0.13339666 -0.41559076 -0.5020608  -0.3879442  -0.1449616   0.2236112
   0.1604103   0.02024477  0.21716478  1.3675784  -0.3759036   0.29201704
  -0.249948   -0.13364193  0.02869424 -0.08525381 -0.08422393  0.04434494
   0.46645662 -0.18982655 -0.06655415 -0.08425294  0.25412214  0.25736067
   0.17924756 -0.09944166  0.3524065   0.17569394  0.12211543 -0.05792318
   0.17645526  0.00318342  0.1335713  -0.14243527 -0.16918755  0.172651
  -0.22125176  0.18151772 -0.34897557 -0.07565673 -0.3123699   0.14559059
  -0.3429669   0.1311255   0.17742424 -0.01907614 -0.08728905 -0.38009694
   0.01538454  0.08928897  0.07758309  0.24075277 -0.5705643   0.05707884
  -0.10113283  0.19736579  0.24092886  0.8095957  -0.54002225 -0.11051869
  -0.03872953 -0.1940451  -1.1387123   0.21548632 -0.06138964 -0.1710748
  -0.19030465 -0.0698685  -0.10987974 -0.12

In [74]:
se_np2 = single_embedding2.numpy()
print(se_np2)

[[-0.03786096  0.41188627  0.21214679  0.21037325  0.28856608  0.14548205
   0.07558599 -0.2471419   0.13063702 -0.19849044 -0.2492039   0.3172877
  -0.46355227 -0.07040227 -0.0125507  -0.3099569  -0.09413338  0.0993289
   0.05264906  0.00647391  0.03397061 -0.66954565  0.13366504  0.17451647
  -0.45237887  0.13276652 -0.08639037 -0.66040087  0.2032552   0.1887481
  -0.15657339 -0.42457747  0.21376255 -0.02942701  0.004192   -0.17389831
  -0.1890726  -0.12050067  0.3550312   0.25426802 -0.10110611 -0.08281083
   0.532275   -0.23962778  0.43648535 -0.11170966  0.08970647 -0.05854528
  -0.23372138 -0.25665388 -0.11999303 -0.08714675  0.11556793  0.21268952
  -0.07424636  0.29746944  0.10880441 -0.01634128  0.12117627  0.07347861
  -0.25201792 -0.3749179   0.03636371  0.02347431  0.50819546  0.0843392
  -0.00735214  0.13006946  0.18798718  1.1503567   0.16891414  0.03991701
   0.10986429 -0.14725992 -0.7261987  -0.10976892  0.28582814  0.17468375
   0.00894413  0.27563784  0.10947332 -0.1

Cosine similarity: $\cos(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
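
As a quick sanity check of this formula, here is a minimal sketch on hand-picked 2-D vectors. The helper name `cosine_similarity_demo` and the vectors are illustrative assumptions, not part of the notebook or the embeddings.

```python
import numpy as np

def cosine_similarity_demo(a, b):
    """Cosine similarity of two 1-D vectors (illustrative helper, not from the notebook)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity_demo([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity_demo([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity_demo([1.0, 2.0], [-1.0, -2.0])       # antiparallel vectors
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1. The score depends only on the angle between the vectors, not on their lengths.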



In [75]:
#the closer the score is to 1, the more similar the two vectors are
def cosine_sim(a,b):
  return np.dot(a,b) / (norm(a)*norm(b))

In [76]:
cosine_sim(se_np,se_np2.T)[0][0]

0.11097815389511213
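
The `.T` and `[0][0]` in the call above are needed because `module([...]).numpy()` returns an array of shape `(1, 128)` rather than a flat vector. The sketch below uses small made-up arrays of shape `(1, 3)` as stand-ins for the embeddings. It suggests that flattening first gives the same score without the transpose or the indexing.

```python
import numpy as np
from numpy.linalg import norm

# Hypothetical stand-ins for module([...]).numpy() outputs, which have shape (1, d)
row_a = np.array([[1.0, 2.0, 2.0]])
row_b = np.array([[2.0, 0.0, 1.0]])

# 2-D route: (1, d) dot (d, 1) gives a (1, 1) matrix, hence the [0][0] indexing.
# norm() of a 2-D array defaults to the Frobenius norm, which for a single row
# equals the ordinary vector length.
sim_2d = (np.dot(row_a, row_b.T) / (norm(row_a) * norm(row_b)))[0][0]

# 1-D route: flatten first, so the dot product is already a scalar
vec_a, vec_b = row_a.ravel(), row_b.ravel()
sim_1d = np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

print(sim_2d, sim_1d)
```

Both routes agree, so returning a flat vector (as `query_word` does further down) keeps later calls simpler.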

In [77]:
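def query_word(word):
  #reuse the module loaded in the setup cell; the original reloaded it into an
  #unused variable (module_qw) on every call
  embeddings_qw = module([word])
  #module returns shape (1, 128); take row 0 to get a flat 128-d vector
  return embeddings_qw.numpy()[0]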


In [78]:
query_word("COVID")

array([-0.07893691, -0.84426206, -0.17797703, -0.33199316, -0.10179356,
        0.03715919, -0.23468351, -0.24482557, -0.02832239, -0.9027776 ,
       -0.22809282, -0.15733978, -0.4170667 ,  0.33589268,  0.25243297,
       -0.488037  , -0.01950541,  0.13705918,  0.00686291, -0.42542958,
        0.04271523, -0.3328992 , -0.303064  ,  0.1326254 , -0.17635414,
       -0.47369516, -0.10909307,  0.29479495,  0.00944898, -0.39672956,
        0.62460935,  0.07306614, -0.65731996,  0.3820488 , -0.14344929,
        0.01758205,  0.44644693,  0.70899796, -0.41327083,  0.23386273,
        0.9168475 , -0.179509  ,  0.15637589,  0.35650754,  0.1834853 ,
       -0.7329489 , -0.45997468,  0.20471582,  0.2629488 ,  0.1924147 ,
       -0.3271179 , -0.07381116,  0.19613439,  0.31000704,  0.45095775,
       -0.12058749, -0.29184896, -0.28959337,  0.00393923, -0.06627241,
        0.19456702, -0.4565516 , -0.7398012 ,  0.17144641,  0.09728805,
        0.03855183, -0.3078341 ,  0.0915118 ,  0.1244176 ,  2.15

In [79]:
query_word("Coronavirus")

array([-0.43641675, -0.58570874, -0.03159469,  0.02783273, -0.3453937 ,
       -0.19565678,  0.17670918, -0.13008611,  0.11136445, -0.19470885,
        0.06490666,  0.09060626, -0.47316647, -0.09063513,  0.05985916,
       -0.38905844, -0.18295151,  0.43953878,  0.15617532,  0.14274383,
        0.10613035,  1.1713923 , -0.23227891,  0.4581485 , -0.22382572,
        0.21189542, -0.18788725, -0.23981364, -0.27315247,  0.10085894,
        0.6325786 , -0.2737549 , -0.25355387,  0.09503996,  0.08208735,
        0.3652776 ,  0.26735675,  0.17778966,  0.2164811 ,  0.12169449,
        0.10741422,  0.08151668, -0.45908672, -0.42137375, -0.07021946,
        0.06885307, -0.4925379 ,  0.13148345, -0.16676837,  0.23670691,
       -0.33426556, -0.29625368, -0.25981098,  0.1682624 , -0.19043106,
        0.305021  ,  0.20358385,  0.14797693, -0.12732872, -0.43301624,
        0.32175058, -0.175613  , -0.13813487,  0.00160287,  0.15779684,
        0.24664287, -0.05768351,  0.46140814, -0.04929634,  0.71

In [80]:
#test similarity - identical word should return cosine similarity 1
cosine_sim(query_word("Coronavirus"),query_word("Coronavirus"))

1.0

In [81]:
#test with two different words
cosine_sim(query_word("COVID"),query_word("Coronavirus"))

0.29860783778945804

In [82]:
#COVID and Coronavirus don't have high similarity - perhaps
#the medical term and public term weren't used in very similar contexts?

#Similarity score is much better between terms corresponding to
#medical classifications of the word.
print(cosine_sim(query_word("COVID-19"),query_word("COVID")))
print(cosine_sim(query_word("COVID-19"),query_word("SARS")))

0.7096248764570396
0.5853716966291452


In [83]:
#testing non COVID-Specific terms
print(cosine_sim(query_word("sick"),query_word("flu")))

0.42147003897222785


A dissimilarity score can be thought of as the inverse of a similarity score:
the lower the similarity between two words, the higher their dissimilarity.
One way to calculate dissimilarity is with a distance metric such as
Euclidean distance.

Since more similar words lie closer together in the embedding space,
less similar words should lie farther apart.
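
The link between distance and cosine similarity is exact when the vectors are scaled to unit length: the squared Euclidean distance then equals 2 * (1 - cosine similarity). The sketch below checks this numerically on arbitrary random vectors, which stand in for embeddings and are an assumption of this example. The notebook's embeddings are not normalized, so their raw distances also reflect vector length.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed; the vectors are arbitrary test data
a = rng.normal(size=128)
b = rng.normal(size=128)

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos_sim = np.dot(a_unit, b_unit)        # for unit vectors, cosine similarity is the dot product
squared_dist = np.sum((a_unit - b_unit) ** 2)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos_sim)
print(squared_dist, 2 * (1 - cos_sim))
```

The two printed values match up to rounding, so on normalized vectors a larger distance always means a lower cosine similarity.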

Euclidean distance: $d(A, B) = \sqrt{\sum_i (A_i - B_i)^2}$

In [84]:
#Closer to 0 = more similar
def euclidean(a,b):
  squarediff = np.square(np.subtract(a, b))
  rootsum = np.sqrt(np.sum(squarediff))
  return rootsum

In [85]:
euclidean(query_word("Coronavirus"),query_word("Coronavirus"))

0.0
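
The hand-written `euclidean` function should agree with NumPy's built-in L2 norm of the difference vector. A minimal check follows, with the function restated under the hypothetical name `euclidean_demo` so the block runs on its own. The test vectors are made up.

```python
import numpy as np

def euclidean_demo(a, b):
    """Same computation as the notebook's euclidean(), restated so this block runs alone."""
    return np.sqrt(np.sum(np.square(np.subtract(a, b))))

p = np.array([3.0, 4.0])
q = np.array([0.0, 0.0])

manual = euclidean_demo(p, q)
builtin = np.linalg.norm(p - q)   # L2 norm of the difference vector
print(manual, builtin)
```

Both values are 5 (the 3-4-5 right triangle), so `np.linalg.norm(a - b)` could serve as a one-line alternative.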

We should expect more dissimilar words to have a larger positive value, further from zero. A dissimilarity threshold could be used to deliberately favor or ignore word pairings. For example, if the average dissimilarity score is known to be around 3.5, an algorithm could operate only on word pairings with a dissimilarity of 3.5 or above.
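
Such a threshold filter could be sketched as follows. The distances are the values printed by the final cell of this notebook, rounded to three decimals. The helper name `pairs_at_or_above` is hypothetical, and 3.5 is only the example cutoff from the text, not a tuned value.

```python
# Euclidean distances reported by the notebook's own run, rounded to three decimals
distances = {
    ("nail", "cough"): 4.930,
    ("COVID", "Coronavirus"): 4.729,
    ("sick", "flu"): 3.604,
    ("COVID-19", "COVID"): 3.174,
    ("COVID-19", "SARS"): 2.821,
}

THRESHOLD = 3.5  # the example cutoff from the text, not a tuned value

def pairs_at_or_above(scores, threshold):
    """Return the word pairs whose dissimilarity is at least `threshold`."""
    return [pair for pair, dist in scores.items() if dist >= threshold]

dissimilar_pairs = pairs_at_or_above(distances, THRESHOLD)
print(dissimilar_pairs)
```

With this cutoff, three of the five pairs are kept: nail/cough, COVID/Coronavirus, and sick/flu.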

In [86]:
print(euclidean(query_word("nail"),query_word("cough"))) #Not Similar At All
print(euclidean(query_word("COVID"),query_word("Coronavirus"))) #Least Similar
print(euclidean(query_word("sick"),query_word("flu"))) #Less Similar
print(euclidean(query_word("COVID-19"),query_word("COVID"))) #More Similar
print(euclidean(query_word("COVID-19"),query_word("SARS"))) #Most Similar

4.930275601915297
4.729276386554468
3.6042326231258217
3.1736450804098744
2.820645547358311
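
It may be worth comparing how the two metrics rank the same word pairs. The sketch below only re-sorts the scores already printed in this notebook, rounded to three decimals. It does not query the embedding module.

```python
# Scores copied from the notebook's printed outputs, rounded to three decimals
cosine_scores = {
    ("COVID", "Coronavirus"): 0.299,
    ("COVID-19", "COVID"): 0.710,
    ("COVID-19", "SARS"): 0.585,
    ("sick", "flu"): 0.421,
}
euclidean_scores = {
    ("COVID", "Coronavirus"): 4.729,
    ("COVID-19", "COVID"): 3.174,
    ("COVID-19", "SARS"): 2.821,
    ("sick", "flu"): 3.604,
}

# Most similar first: high cosine similarity, low Euclidean distance
by_cosine = sorted(cosine_scores, key=cosine_scores.get, reverse=True)
by_euclidean = sorted(euclidean_scores, key=euclidean_scores.get)

print(by_cosine[0])      # top pair under cosine similarity
print(by_euclidean[0])   # top pair under Euclidean distance
```

The two rankings agree on the bottom two pairs but swap the top two. Cosine similarity puts COVID-19/COVID first, while Euclidean distance puts COVID-19/SARS first. One likely reason is that the embeddings are not normalized, so Euclidean distance is also sensitive to vector length.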
