# Word Similarity Task on WordSim353

A task for testing word embeddings and the semantic relations they capture.

Dataset:
- Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proceedings of NAACL-HLT 2009, at http://alfonseca.org/eng/research/wordsim353.html

Reference:
- Katrin Erk (2020). Python demo: predicting word similarity, at https://www.katrinerk.com/courses/python-worksheets/python-demo-predicting-word-similarity
- GluonNLP framework for embedding evaluation, at https://nlp.gluon.ai/examples/word_embedding_evaluation/word_embedding_evaluation.html

In [None]:
import nltk
import pandas
import scipy.stats  # spearmanr lives in the stats submodule, which must be imported explicitly
import warnings
from gensim.models import KeyedVectors
warnings.filterwarnings("ignore")

You can upload files in Google Colab with the following command. If you are working locally, you can skip the following cell.

In [None]:
# Upload the data to the Colab runtime in case drag-and-drop upload is not working,
# as was the case on my side.
from google.colab import files
dataset_file_dict = files.upload()

Saving win353.csv to win353.csv


The following cells use other files for testing. If you do not have them, just skip these cells; they belong to older tests and experimental results.

You need to adjust the file paths to your own setup.

In [None]:
large_corpus = []
with open('/content/line_corpus.txt', 'r') as inp:
    for line in inp:
        large_corpus.append(line.split())

small_corpus = []
with open('/content/small_line_corpus_no_stopwords.txt', 'r') as inp:
    for line in inp:
        small_corpus.append(line.split())

In [None]:
# now we build a toy distributional space
# (iter and size are the gensim 3.x parameter names; gensim 4.x calls them epochs and vector_size)
from gensim.models import Word2Vec
from nltk.corpus import brown
brownmodel1 = Word2Vec(brown.sents(), iter=100, min_count=10, size=128, workers=4)
# brownmodel2 = Word2Vec(brown.sents(), iter=100, min_count=10, size=100, workers=4)

In [None]:
# Some other models just for testing with our corpus
word2vec_small = Word2Vec(small_corpus, iter=100, min_count=10, size=128, workers=4)
word2vec_big = Word2Vec(large_corpus, iter=100, min_count=10, size=128, workers=4)

Since some files are too large to re-upload every time, I have uploaded them to Google Drive and load them from there. You can skip the Drive setup if you are working locally. Just make sure to change the file paths accordingly.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# This is from my drive
enahnced_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_128_enhanced_ws2.tsv"
normal_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_128_normal_ws2.tsv"
enahnced_file_ws3 = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_128_enhanced_ws3.tsv"
normal_file_ws3 = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_128_normal_ws3.tsv"

glove_file = "/content/gdrive/MyDrive/GloveEmbeddings/glove.twitter.27B.200d.txt"

In [None]:
improved_wg2v_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_50_enhanced_200k.tsv"
original_wg2v_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_50_enhanced_200k_small.tsv"
loss_wg2v_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_original_graph_wg2v.tsv"
baseline_w2v_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_50_normal_big.tsv"
synonym_w2v_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_synonym_relation_w2v.tsv"
hypernym_w2v_file = "/content/gdrive/MyDrive/GloveEmbeddings/embeddings_hypernym_relation_w2v.tsv"

Write the files in the correct format. For each file that you want to test, specify its path and the name under which it should be saved. Also count the number of words (lines) in the file, since gensim needs this number in the header.

The following function converts TSV embedding files into the .txt format used by the gensim library.

In [None]:
def writeEmbeddings2File(file, title='word2vec-enhanced-format.txt', line_count='60400', dimensions='100'):
    with open(file, 'r') as inp, open('/content/'+title, 'w') as outp:
        # The gensim library needs the first line of the file to be "<line_count> <dimensions>".
        # Some input files already carry a header ('60401 128') with a wrong count; it is skipped below.
        outp.write(' '.join([line_count, dimensions]) + '\n')
        for line in inp:
            if '60401 128' in line:
                print(line)
                continue
            words = line.strip().split()
            outp.write(' '.join(words) + '\n')
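
The line count and dimension passed to `writeEmbeddings2File` can also be read from the file itself rather than counted by hand. The following is a minimal sketch, assuming a headerless file with one word per line followed by its tab- or space-separated vector components. The helper `count_embedding_shape` and the temporary demo file are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def count_embedding_shape(path):
    """Return (line_count, dimensions) for a headerless embedding file."""
    line_count = 0
    dimensions = None
    with open(path, 'r', encoding='utf-8') as inp:
        for line in inp:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if dimensions is None:
                # the first field is the word, the rest are vector components
                dimensions = len(parts) - 1
            line_count += 1
    return line_count, dimensions

# Demo on a small temporary file with made-up vectors
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False, encoding='utf-8') as tmp:
    tmp.write('cat\t0.1\t0.2\t0.3\t0.4\n')
    tmp.write('dog\t0.5\t0.6\t0.7\t0.8\n')
    tmp.write('car\t0.9\t1.0\t1.1\t1.2\n')
    demo_path = tmp.name

n_words, n_dims = count_embedding_shape(demo_path)
os.remove(demo_path)
print(n_words, n_dims)
```

The two values can then be passed on as strings, for example `writeEmbeddings2File(path, title, str(n_words), str(n_dims))`. If the input still contains a header line, it would be counted as a word, so this sketch only fits files without one.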

In [None]:
writeEmbeddings2File(improved_wg2v_file, 'wg2v-improved.txt', '282164', '50')
writeEmbeddings2File(original_wg2v_file, 'wg2v-original.txt', '131110', '50')
writeEmbeddings2File(loss_wg2v_file, 'wg2v-loss.txt', '253696')
writeEmbeddings2File(baseline_w2v_file, 'w2v-baseline.txt', '282164', '50')
writeEmbeddings2File(synonym_w2v_file, 'w2v-synonym.txt', '253696')
writeEmbeddings2File(hypernym_w2v_file, 'w2v-hypernym.txt', '253696')

In [None]:
wordgraph2vec_improved = KeyedVectors.load_word2vec_format('/content/wg2v-improved.txt')
wordgraph2vec_original = KeyedVectors.load_word2vec_format('/content/wg2v-original.txt')
wordgraph2vec_loss = KeyedVectors.load_word2vec_format('/content/wg2v-loss.txt')
word2vec_baseline = KeyedVectors.load_word2vec_format('/content/w2v-baseline.txt')
word2vec_synonym = KeyedVectors.load_word2vec_format('/content/w2v-synonym.txt')
word2vec_hypernym = KeyedVectors.load_word2vec_format('/content/w2v-hypernym.txt')

In [None]:
word2vec_model_enhanced_ws2 = KeyedVectors.load_word2vec_format('/content/word2vec-enhanced-format.txt')
word2vec_model_normal_ws2 = KeyedVectors.load_word2vec_format('/content/word2vec-normal-format.txt')

In [None]:
word2vec_model_enhanced_ws3 = KeyedVectors.load_word2vec_format('/content/word2vec-enhanced-ws3-format.txt')
word2vec_model_normal_ws3 = KeyedVectors.load_word2vec_format('/content/word2vec-normal-ws3-format.txt')

In [None]:
glove_model = KeyedVectors.load_word2vec_format(glove_file)

In [None]:
# The ws2 models are assumed here; no model without a window-size suffix is defined above.
print('Enhanced Embeddings:', word2vec_model_enhanced_ws2.similarity('write', 'writes'))
print('Normal Embeddings:', word2vec_model_normal_ws2.similarity('write', 'writes'))
print('Word2Vec Embeddings:', brownmodel1.wv.similarity('write', 'writes'))
print('Glove Embeddings:', glove_model.similarity('write', 'writes'))

Enhanced Embeddings: 0.23435295
Normal Embeddings: 0.14642885
Word2Vec Embeddings: 0.22172073
Glove Embeddings: 0.6877123


Load the word similarity dataset

In [None]:
# read wordsim353 from a CSV file; read_csv splits the columns on commas by default
filename = "/content/win353.csv"
wordsim353 = pandas.read_csv(filename)

In [None]:
# accessing this table
print(wordsim353)

           Word 1      Word 2  Human (Mean)
0       admission      ticket         5.846
1         alcohol   chemistry         1.154
2        aluminum       metal         6.286
3    announcement      effort         2.000
4    announcement        news         7.077
..            ...         ...           ...
348        weapon      secret         1.500
349       weather    forecast         5.067
350     Wednesday        news         1.000
351          wood      forest         7.214
352          word  similarity         0.923

[353 rows x 3 columns]


In [None]:
# Pull similarity ratings from the model.
# If a word is missing from the model, try the fall-back model;
# if it is missing there too, return a similarity of zero.
def sim_or_zero(word1, word2, model, fallBack):
    # Word2Vec models keep their vectors in .wv; KeyedVectors are used directly
    vectors = getattr(model, 'wv', model)
    fallback_vectors = getattr(fallBack, 'wv', fallBack)
    if word1 in vectors and word2 in vectors:
        # Return the similarity score between the words
        return float(vectors.similarity(word1, word2))
    # Try the fall-back model in case a word is missing
    if word1 in fallback_vectors and word2 in fallback_vectors:
        return float(fallback_vectors.similarity(word1, word2))
    return 0.0
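
The similarity that gensim returns for a word pair is the cosine of the angle between the two word vectors. The sketch below reproduces the same look-up logic on plain dictionaries, so it runs without any embedding files. The names `cosine`, `toy_sim_or_zero`, `primary` and `backup`, and the two-dimensional vectors, are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_sim_or_zero(word1, word2, vectors, fallback):
    """Use the first table that contains both words; otherwise return 0.0."""
    for table in (vectors, fallback):
        if word1 in table and word2 in table:
            return cosine(table[word1], table[word2])
    return 0.0

# Hypothetical toy embeddings
primary = {'cat': [1.0, 0.0], 'dog': [1.0, 1.0]}
backup = {'cat': [1.0, 0.0], 'car': [0.0, 1.0]}

sim_primary = toy_sim_or_zero('cat', 'dog', primary, backup)   # both words in primary
sim_backup = toy_sim_or_zero('cat', 'car', primary, backup)    # 'car' only in backup
sim_missing = toy_sim_or_zero('cat', 'bird', primary, backup)  # 'bird' in neither table
print(sim_primary, sim_backup, sim_missing)
```

Note that a missing pair and a pair of orthogonal vectors both yield 0.0 here, so the zero default cannot be told apart from a genuine "unrelated" prediction.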

In [None]:
wordgraph2vec_improved = KeyedVectors.load_word2vec_format('/content/wg2v-improved.txt')
# wordgraph2vec_original = KeyedVectors.load_word2vec_format('/content/wg2v-original.txt')
# wordgraph2vec_loss = KeyedVectors.load_word2vec_format('/content/wg2v-loss.txt')
# word2vec_baseline = KeyedVectors.load_word2vec_format('/content/w2v-baseline.txt')
# word2vec_synonym = KeyedVectors.load_word2vec_format('/content/w2v-synonym.txt')
# word2vec_hypernym = KeyedVectors.load_word2vec_format('/content/w2v-hypernym.txt')

# making predictions for the wordsim353 data,
# storing them in the column "modelpredict"
modelpredict_wordgraph2vec_improved = [ ]
modelpredict_wordgraph2vec_original = [ ]
modelpredict_wordgraph2vec_loss = [ ]
modelpredict_word2vec_baseline = [ ]
modelpredict_word2vec_synonym = [ ]
modelpredict_word2vec_hypernym = [ ]
glovePred = []
word2vecPred = []
for index, row in wordsim353.iterrows():
    modelpredict_wordgraph2vec_improved.append( sim_or_zero(row["Word 1"], row["Word 2"], wordgraph2vec_improved, glove_model) )
    modelpredict_wordgraph2vec_original.append( sim_or_zero(row["Word 1"], row["Word 2"], wordgraph2vec_original, glove_model) )
    modelpredict_wordgraph2vec_loss.append( sim_or_zero(row["Word 1"], row["Word 2"], wordgraph2vec_loss, glove_model) )
    modelpredict_word2vec_baseline.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_baseline, glove_model) )
    modelpredict_word2vec_synonym.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_synonym, glove_model) )
    modelpredict_word2vec_hypernym.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_hypernym, glove_model) )

wordsim353["modelpredict_wordgraph2vec_improved"] = modelpredict_wordgraph2vec_improved
wordsim353["modelpredict_wordgraph2vec_original"] = modelpredict_wordgraph2vec_original
wordsim353["modelpredict_wordgraph2vec_loss"] = modelpredict_wordgraph2vec_loss
wordsim353["modelpredict_word2vec_baseline"] = modelpredict_word2vec_baseline
wordsim353["modelpredict_word2vec_synonym"] = modelpredict_word2vec_synonym
wordsim353["modelpredict_word2vec_hypernym"] = modelpredict_word2vec_hypernym

# we print pairs of correlation and pvalue

print('modelpredict_wordgraph2vec_improved Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredict_wordgraph2vec_improved"]), '\n')

print('modelpredict_wordgraph2vec_original Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredict_wordgraph2vec_original"]), '\n')

print('modelpredict_wordgraph2vec_loss Embedding Result 128 WS=3')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredict_wordgraph2vec_loss"]), '\n')

print('modelpredict_word2vec_baseline Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredict_word2vec_baseline"]), '\n')

print('modelpredict_word2vec_synonym Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredict_word2vec_synonym"]), '\n')

print('modelpredict_word2vec_hypernym Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredict_word2vec_hypernym"]), '\n')

modelpredict_wordgraph2vec_improved Embedding Result
SpearmanrResult(correlation=0.29723642045511517, pvalue=1.241287694374442e-08) 

modelpredict_wordgraph2vec_original Embedding Result
SpearmanrResult(correlation=0.2617145274063511, pvalue=6.135095460463393e-07) 

modelpredict_wordgraph2vec_loss Embedding Result 128 WS=3
SpearmanrResult(correlation=-0.040117485646638654, pvalue=0.45243106805183453) 

modelpredict_word2vec_baseline Embedding Result
SpearmanrResult(correlation=0.3725227252103845, pvalue=4.609207445689924e-13) 

modelpredict_word2vec_synonym Embedding Result
SpearmanrResult(correlation=-0.051021877478921354, pvalue=0.33915476269219547) 

modelpredict_word2vec_hypernym Embedding Result
SpearmanrResult(correlation=-0.0355718063161641, pvalue=0.5052999272219449) 
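
The scores above are Spearman rank correlations: both the human ratings and the model predictions are converted to ranks, and the Pearson correlation of the ranks is reported. The following is a minimal pure-Python sketch of that computation, with tied values receiving their average rank. The function names and the four toy ratings are hypothetical; the notebook itself relies on `scipy.stats.spearmanr`.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up ratings for four word pairs
human = [7.0, 1.0, 6.0, 2.0]
model = [0.5, 0.1, 0.9, 0.3]
rho = spearman(human, model)
tied_ranks = average_ranks([0.0, 0.0, 0.3, 0.5])
print(rho, tied_ranks)
```

Because only the ranks matter, the different scales of the human ratings and the cosine similarities do not affect the result.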



### Old Experimental Results

In [None]:
# making predictions for the wordsim353 data,
# storing them in the column "modelpredict"
# as one-liners:
# wordsim353["modelpredict1"] = [sim_or_zero(row["Word1"], row["Word2"], brownmodel1) for index, row in wordsim353.iterrows()]
# wordsim353["modelpredict2"] = [sim_or_zero(row["Word1"], row["Word2"], brownmodel2) for index, row in wordsim353.iterrows()]
# or less compactly:
modelpredictEnhancedws2 = [ ]
modelpredictNormalws2 = [ ]
modelpredictEnhancedws3 = [ ]
modelpredictNormalws3 = [ ]
modelpredictSmall = [ ]
modelpredictBig = [ ]
glovePred = []
word2vecPred = []
for index, row in wordsim353.iterrows():
    modelpredictEnhancedws2.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_model_enhanced_ws2, glove_model) )
    modelpredictNormalws2.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_model_normal_ws2, glove_model) )
    modelpredictEnhancedws3.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_model_enhanced_ws3, glove_model) )
    modelpredictNormalws3.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_model_normal_ws3, glove_model) )
    word2vecPred.append( sim_or_zero(row["Word 1"], row["Word 2"], brownmodel1, glove_model) )
    modelpredictSmall.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_small, brownmodel1) )
    modelpredictBig.append( sim_or_zero(row["Word 1"], row["Word 2"], word2vec_big, brownmodel1) )
    glovePred.append( sim_or_zero(row["Word 1"], row["Word 2"], glove_model, brownmodel1) )


wordsim353["modelpredictEnhancedws2"] = modelpredictEnhancedws2
wordsim353["modelpredictNormalws2"] = modelpredictNormalws2
wordsim353["modelpredictEnhancedws3"] = modelpredictEnhancedws3
wordsim353["modelpredictNormalws3"] = modelpredictNormalws3
wordsim353["modelpredictSmall"] = modelpredictSmall
wordsim353["modelpredictBig"] = modelpredictBig
wordsim353["glovePred"] = glovePred
wordsim353["word2VecPred"] = word2vecPred


# we print pairs of correlation and pvalue for each model
print('Enhanced Embedding Result 128 WS=2')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredictEnhancedws2"]), '\n')

# Words with missing entries hurt a model's score: such pairs are scored
# by the fall-back model or, if that fails too, as zero.
print('Normal Embedding Result 128 WS=2')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredictNormalws2"]), '\n')


print('Enhanced Embedding Result 128 WS=3')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredictEnhancedws3"]), '\n')

print('Normal Embedding Result 128 WS=3')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredictNormalws3"]), '\n')

print('Big Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredictBig"]), '\n')

print('Small Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["modelpredictSmall"]), '\n')

print('Glove Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["glovePred"]), '\n')

print('Word2vec Brown Embedding Result')
print(scipy.stats.spearmanr(wordsim353["Human (Mean)"], wordsim353["word2VecPred"]), '\n')



Enhanced Embedding Result 128 WS=2
SpearmanrResult(correlation=0.20220446404371606, pvalue=0.0001306722734620016) 

Normal Embedding Result 128 WS=2
SpearmanrResult(correlation=0.14972834204305163, pvalue=0.004816292922489155) 

Enhanced Embedding Result 128 WS=3
SpearmanrResult(correlation=0.24398752181175862, pvalue=3.5157224443736937e-06) 

Normal Embedding Result 128 WS=3
SpearmanrResult(correlation=0.15725722404236184, pvalue=0.003050827369036314) 

Big Embedding Result
SpearmanrResult(correlation=0.4892507259510937, pvalue=1.207630283546632e-22) 

Small Embedding Result
SpearmanrResult(correlation=0.15299156959054352, pvalue=0.003961191741279982) 

Glove Embedding Result
SpearmanrResult(correlation=0.38880933943525936, pvalue=3.4704034981132106e-14) 

Word2vec Brown Embedding Result
SpearmanrResult(correlation=0.37429903459685054, pvalue=3.500613751849894e-13) 
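
Since missing words fall back to another model or to a score of zero, it can help to check how many of the word pairs a vocabulary actually covers before comparing correlations. The following is a minimal sketch, assuming the vocabulary is available as a set of strings; `pair_coverage` and `toy_vocab` are hypothetical, and the four pairs are taken from the table printed earlier.

```python
def pair_coverage(pairs, vocab):
    """Return the pairs whose two words are both in vocab, and their share of all pairs."""
    covered = [(w1, w2) for w1, w2 in pairs if w1 in vocab and w2 in vocab]
    return covered, len(covered) / len(pairs)

pairs = [
    ('admission', 'ticket'),
    ('alcohol', 'chemistry'),
    ('Wednesday', 'news'),
    ('wood', 'forest'),
]
# Hypothetical lowercased vocabulary that lacks 'forest'
toy_vocab = {'admission', 'ticket', 'alcohol', 'chemistry', 'wednesday', 'news', 'wood'}

covered, ratio = pair_coverage(pairs, toy_vocab)
print(covered, ratio)
```

In this toy case the capitalised 'Wednesday' is not found in the lowercased vocabulary, which suggests that case normalisation may be worth checking when a model covers fewer pairs than expected.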

