#Tokenizers
Copyright 2021-2023 Denis Rothman, MIT License

Reference 1 for word embedding:
https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/

Reference 2 for cosine similarity:
[SciKit Learn cosine similarity documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

**June 2023: notebook and code updated for Gensim 4.0.0**


#Prerequisites

In [1]:
!pip install gensim
import nltk
nltk.download('punkt')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
import math
import warnings
import numpy as np
import matplotlib.pyplot as plt
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings(action = 'ignore')

#Word2Vec Tokenization

Update: download from GitHub added

In [3]:
#Option 1: upload text.txt with the Colab file manager
#Option 2: download the file from GitHub
!curl -L https://raw.githubusercontent.com/fenago/nlp-transformers/master/Lab09/text.txt --output "text.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.9M  100 10.9M    0     0  36.2M      0 --:--:-- --:--:-- --:--:-- 36.2M


Update: With Gensim 4.0.0, use the vector_size parameter instead of size when initializing the Word2Vec model.
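
If the notebook has to run under both older and newer Gensim releases, the renamed parameter can be selected from the installed version. The following is a minimal sketch, not part of the original notebook: `embedding_size_kwarg` is a hypothetical helper, and it assumes the version string starts with a numeric major version, as `gensim.__version__` does.

```python
def embedding_size_kwarg(gensim_version, dim):
    """Return the keyword argument that sets the embedding size.

    Hypothetical helper: Gensim 4.x expects `vector_size`,
    earlier releases expect `size`.
    """
    major = int(gensim_version.split(".")[0])
    name = "vector_size" if major >= 4 else "size"
    return {name: dim}

print(embedding_size_kwarg("4.0.0", 512))
print(embedding_size_kwarg("3.8.3", 512))
```

With Gensim installed, the result could be unpacked into the constructor, for example `Word2Vec(data, min_count=1, window=5, sg=1, **embedding_size_kwarg(gensim.__version__, 512))`.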

In [4]:
# Read the 'text.txt' file; the with block closes it automatically
with open("text.txt", "r") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

data = []
# Split the text into sentences
for i in sent_tokenize(f):
    temp = []
    # Tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)

# Create the skip-gram model (sg = 1)
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 512, window = 5, sg = 1)
print(model2)

Word2Vec<vocab=11770, vector_size=512, alpha=0.025>
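
The vocabulary size reported above depends on `min_count`: Gensim ignores every word whose total frequency is lower than this threshold, so `min_count = 1` keeps every token the tokenizer produced. The sketch below illustrates that filtering rule on a toy corpus with the standard library only. The `build_vocab` function and the toy sentences are illustrative assumptions, not Gensim's implementation.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Keep the tokens that occur at least `min_count` times (illustrative only)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Toy corpus in the same shape as `data`: a list of tokenized sentences
toy = [["we", "the", "people"], ["the", "people", "pay"], ["the", "debt"]]

print(sorted(build_vocab(toy, min_count=1)))  # every token is kept
print(sorted(build_vocab(toy, min_count=2)))  # only repeated tokens survive
```

A word that is filtered out this way, or that never appears in the text, has no vector, which is why the lookups in the following sections guard against `KeyError`.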


#Cosine Similarity

In [5]:
def similarity(word1, word2):
    cosine = False  # default value
    try:
        a = model2.wv[word1]  # Gensim 4: vectors are accessed through model2.wv
        cosine = True
    except KeyError:  # word1 is not in the vocabulary
        print(word1, ":[unk] key not found in dictionary")  # cosine stays False

    try:
        b = model2.wv[word2]
    except KeyError:  # word2 is not in the vocabulary
        cosine = False  # both words must be found
        print(word2, ":[unk] key not found in dictionary")

    if cosine:
        # compute cosine similarity by hand
        dot = np.dot(a, b)
        norma = np.linalg.norm(a)
        normb = np.linalg.norm(b)
        cos = dot / (norma * normb)

        # compute it again with scikit-learn, which expects 2D arrays
        aa = a.reshape(1, 512)
        ba = b.reshape(1, 512)
        cos_lib = cosine_similarity(aa, ba)
    else:
        cos_lib = 0
    return cos_lib
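
The manual computation inside the function above is the standard formula: the dot product of the two vectors divided by the product of their Euclidean norms. As a minimal, dependency-free sketch of that formula (the `cosine` name and the toy vectors are illustrative, not taken from the notebook):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```

The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), up to floating-point rounding. This sketch raises `ZeroDivisionError` for an all-zero vector, a case it does not handle.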

#Case 0: Words in the dataset and the dictionary

Update: In Gensim 4.0.0, direct access to vectors through the model instance (model[word]) is no longer supported. Use model.wv[word] to access the vector for a word.

In [6]:
def similarity(word1, word2):
    cosine = False  # default value
    try:
        a = model2.wv[word1]
        cosine = True
    except KeyError:     # The KeyError exception is raised
        print("The word ",word1," does not exist in the dictionary")
    try:
        b = model2.wv[word2]
    except KeyError:     # The KeyError exception is raised
        print("The word ",word2," does not exist in the dictionary")
        cosine = False  # reset to False if the second word doesn't exist
    if cosine: # if both words are in the vocabulary
        return cosine_similarity([a],[b]) # sklearn cosine_similarity requires 2D arrays
    else:
        return 0 # if either word is not in the vocabulary return similarity as 0

In [7]:
word1 = "freedom"
word2 = "liberty"
print("Similarity between", word1, "and", word2, "is", similarity(word1, word2))

Similarity between freedom and liberty is [[0.34518605]]


#Case 1: Words not in the dataset or the dictionary

In [8]:
word1="corporations";word2="rights"
print("Similarity",similarity(word1,word2),word1,word2)

The word  corporations  does not exist in the dictionary
Similarity 0 corporations rights


#Case 2: Noisy Relationship

In [9]:
word1="etext";word2="declaration"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.5607709]] etext declaration


#Case 3: Words in the text but not in the dictionary

In [10]:
word1="pie";word2="logic"
print("Similarity",similarity(word1,word2),word1,word2)

The word  pie  does not exist in the dictionary
Similarity 0 pie logic


#Case 4: Rare words

In [11]:
word1="justiciar";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.22534767]] justiciar judgement


#Case 5: Replacing rare words

In [12]:
word1="judge";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.15180185]] judge judgement


#Case 6: Entailment

In [13]:
word1="pay";word2="debt"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.52759117]] pay debt
