# Assignment 2: Vector vs. Lexical Semantics

#### Given a gold standard G and a large English text corpus C, calculate the average Information Retrieval (IR) metric m over the top-k similar words retrieved by the vector semantics method v.

- G: Report the evaluation results based on the gold standard SimLex-999.
- C: Report the evaluation results on two large corpora from different genres available in the NLTK library.
- v: Report the evaluation results of the methods TF-IDF and Word2Vec using cosine similarity. These methods are also called baselines.
- top-k: Report the evaluation results for top-10, i.e., k=10.
- m: Report the evaluation results based on average nDCG using pytrec-eval-terrier.

In [4]:
import nltk
import ssl

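# Work around SSL certificate errors in nltk.download() by falling back to an
# unverified HTTPS context when the running Python provides one.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context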

nltk.download("wordnet")
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
import pytrec_eval
import pandas


[nltk_data] Downloading package wordnet to /Users/yduong/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Procedure 1: We select SimLex-999 as our gold standard.
1. For each word w, we order the top-10 words most similar to w as the gold list for w. Note that the lists may have different sizes for different words. For instance, ‘soccer’ may have only 3 similar words while ‘apple’ may have 20.
2. When the list has fewer than 10 words, we try to expand it by the transitivity rule: if w is similar to a and a is similar to b, then w is similar to b. If we still do not reach 10 words, we leave the list as it is.
3. When the list has more than 10 words, we truncate it to the top 10.
4. We call the gold top-10 similar words to w top-k-G[w], with k=10.
5. Note that the top-10 list is sorted in descending order of the similarity scores in G.
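
The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the assignment's reference solution: the function name `build_gold_topk` is hypothetical, similarity is treated as symmetric, only one hop of transitivity is applied, and an inferred pair is scored with the minimum of the two links it was derived from (the procedure does not say how inferred pairs should be scored).

```python
from collections import defaultdict


def build_gold_topk(pairs, k=10):
    """Build top-k-G[w] from (word1, word2, score) rows such as SimLex-999."""
    neighbours = defaultdict(dict)
    for w1, w2, score in pairs:
        # Assumption: similarity is symmetric, so store both directions.
        neighbours[w1][w2] = max(score, neighbours[w1].get(w2, 0.0))
        neighbours[w2][w1] = max(score, neighbours[w2].get(w1, 0.0))

    gold = {}
    for w, direct in neighbours.items():
        scores = dict(direct)
        if len(scores) < k:
            # One hop of transitivity: w ~ a and a ~ b implies w ~ b.
            # Assumption: the inferred score is the weaker of the two links.
            for a, s_wa in direct.items():
                for b, s_ab in neighbours[a].items():
                    if b != w and b not in direct:
                        scores[b] = max(scores.get(b, 0.0), min(s_wa, s_ab))
        # Sort by descending score, break ties alphabetically, then truncate.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        gold[w] = ranked[:k]
    return gold


# The five rows shown by G.head() in this notebook.
sample_pairs = [
    ("old", "new", 1.58),
    ("smart", "intelligent", 9.2),
    ("hard", "difficult", 8.77),
    ("happy", "cheerful", 9.55),
    ("hard", "easy", 0.95),
]
gold_lists = build_gold_topk(sample_pairs, k=10)
print(gold_lists["hard"])
print(gold_lists["difficult"])
```

With the real file, the rows could be taken from the `word1`, `word2`, and `SimLex999` columns of `G`. Because SimLex-999 also contains low-scoring pairs such as old/new, filtering by a minimum score before expansion may be worth considering.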

In [5]:
G = pandas.read_table("SimLex-999.txt")
G.head()

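| | word1 | word2 | POS | SimLex999 | conc(w1) | conc(w2) | concQ | Assoc(USF) | SimAssoc333 | SD(SimLex) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | old | new | A | 1.58 | 2.72 | 2.81 | 2 | 7.25 | 1 | 0.41 |
| 1 | smart | intelligent | A | 9.2 | 1.75 | 2.46 | 1 | 7.11 | 1 | 0.67 |
| 2 | hard | difficult | A | 8.77 | 3.76 | 2.21 | 2 | 5.94 | 1 | 1.19 |
| 3 | happy | cheerful | A | 9.55 | 2.56 | 2.34 | 1 | 5.85 | 1 | 2.18 |
| 4 | hard | easy | A | 0.95 | 3.76 | 2.07 | 2 | 5.82 | 1 | 0.93 |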


#### Procedure 2: We pick C as our large corpus.

In [6]:
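synonyms = []
antonyms = []

# Collect every lemma name from every WordNet synset of "old",
# plus the first antonym of each lemma that has one.
for syn in wordnet.synsets("old"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))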

{'old', 'Old', 'sometime', 'former', 'one-time', 'onetime', 'erstwhile', 'previous', 'older', 'honest-to-goodness', 'quondam', 'honest-to-god', 'sure-enough'}
{'new', 'young'}


#### Procedure 3: We pick a method v (baseline).
1. We train v on C.
    1. We report the running parameters of v, if any.
    2. For Word2Vec, we run with context window sizes {1, 2, 5, 10}, vector sizes {10, 50, 100, 300}, and 1000 iterations.
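
The Word2Vec settings listed above form a grid of 4 × 4 = 16 configurations. A minimal sketch of enumerating that grid is shown below. The helper names `word2vec_param_grid` and `train_word2vec` are hypothetical, `min_count=1` is an assumption that is not stated in the procedure, and the keyword names `vector_size` and `epochs` assume gensim 4 or later.

```python
from itertools import product

WINDOW_SIZES = (1, 2, 5, 10)
VECTOR_SIZES = (10, 50, 100, 300)
EPOCHS = 1000


def word2vec_param_grid():
    """Yield one keyword-argument dict per (window, vector_size) combination."""
    for window, vector_size in product(WINDOW_SIZES, VECTOR_SIZES):
        yield {
            "window": window,
            "vector_size": vector_size,
            "epochs": EPOCHS,
            "min_count": 1,  # assumption: keep rare words so fewer become OOV
        }


def train_word2vec(sentences, params):
    """Train one model; `sentences` is a list of token lists from corpus C."""
    # Imported lazily so the grid can be inspected without gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **params)


grid = list(word2vec_param_grid())
print(len(grid))  # 16
print(grid[0])
```

Each trained model would then be evaluated separately, so that the reported running parameters can be tied to a specific nDCG value.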

In [7]:
# Fit TF-IDF on the WordNet lemma names collected above; each name is one document.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(synonyms)
# Inspect the TF-IDF weights of the document at index 10.
selected_vector = tfidf_vectorizer_vectors[10]
TF_result = pandas.DataFrame(selected_vector.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
TF_result

| | tfidf |
|---|---|
| enough | 0.0 |
| erstwhile | 0.0 |
| former | 0.0 |
| god | 0.0 |
| goodness | 0.0 |
| honest | 0.0 |
| old | 0.0 |
| older | 0.0 |
| one | 0.707107 |
| onetime | 0.0 |
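
The output above gives the weights of tokens inside a single document. For the cosine comparison in the later procedures, each word needs its own vector. One common option, shown here as an assumption rather than as the method this notebook uses, is to take the word's row in the term-document matrix, i.e. its TF-IDF weight in every document of C. The function name `tfidf_word_vectors` is hypothetical, and the weighting is a simplified `tf * (ln(N / df) + 1)` without normalisation, so the numbers will differ from those of `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_word_vectors(documents):
    """Map each word to its list of TF-IDF weights, one entry per document.

    `documents` is a list of token lists.
    """
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for doc_counts in counts for word in doc_counts)
    vectors = {}
    for word, df in doc_freq.items():
        idf = math.log(n_docs / df) + 1.0
        vectors[word] = [doc_counts[word] * idf for doc_counts in counts]
    return vectors


toy_documents = [
    ["old", "car"],
    ["old", "house"],
    ["new", "car"],
]
word_vectors = tfidf_word_vectors(toy_documents)
print(word_vectors["old"])
print(word_vectors["house"])
```

Words that occur in the same documents end up with similar vectors, which is what the cosine similarity then measures.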


#### Procedure 4: For each word w in our gold standard G, we find the top-10 most similar words according to the cosine similarity of the vectors produced by method v.
1. If w is not in our large corpus, it is an unseen word, i.e., an out-of-vocabulary (OOV) instance. In this assignment, we simply ignore such words.
2. If w is in our large corpus, we retrieve its top-10 most similar words, sorted in descending order of cosine similarity.
3. Let’s call the top-10 most similar words of w based on v top-k-v[w], with k=10.
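
The retrieval step can be sketched without any library as follows. This is an illustrative sketch: the names `cosine` and `top_k_similar` are hypothetical, `vectors` is assumed to be a plain dict from word to a list of floats produced by whichever method v is being evaluated, and the toy vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(word, vectors, k=10):
    """Return top-k-v[word] as (neighbour, similarity) pairs, or None if OOV."""
    if word not in vectors:
        return None  # OOV: the caller skips this word
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Descending similarity; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


toy_vectors = {
    "old": [1.0, 0.0],
    "former": [0.9, 0.1],
    "new": [0.0, 1.0],
}
neighbours_of_old = top_k_similar("old", toy_vectors, k=2)
print(neighbours_of_old)
print(top_k_similar("soccer", toy_vectors))  # None, because "soccer" is OOV
```

For a trained gensim model, `model.wv.most_similar(word, topn=10)` returns the same kind of ranked list directly.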

#### Procedure 5: Now we compare top-k-G[w] and top-k-v[w] for all w that exist in both the gold standard and our large corpus.
1. We ask pytrec_eval to calculate ‘nDCG’ as our metric m. The result is computed for each w.
2. We calculate the average ‘nDCG’ over all words.
3. We report the results on a bar chart.
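
To make the comparison concrete, the sketch below builds the two nested dictionaries in the shape pytrec_eval works with (a `qrel` of integer relevance grades and a `run` of float scores, both keyed by query) and computes nDCG by hand as a sanity check. Several things here are assumptions: the helper names are hypothetical, the gold lists are turned into rank-based integer grades (best word gets the highest grade), and the hand-written `ndcg` follows the standard definition with a `log2(rank + 1)` discount, which is not guaranteed to match pytrec_eval digit for digit.

```python
import math


def ndcg(gold, retrieved, k=10):
    """nDCG of a ranked word list against {word: relevance grade}."""
    dcg = sum(
        gold.get(word, 0) / math.log2(rank + 1)
        for rank, word in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gold.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def to_qrel_and_run(top_k_gold, top_k_v):
    """Convert top-k-G and top-k-v into pytrec_eval-style nested dicts."""
    qrel, run = {}, {}
    for w, gold_words in top_k_gold.items():
        if w not in top_k_v:
            continue  # OOV words are ignored
        # Assumption: rank-based integer grades, since qrel grades are integers.
        qrel[w] = {g: len(gold_words) - i for i, g in enumerate(gold_words)}
        run[w] = {r: float(score) for r, score in top_k_v[w]}
    return qrel, run


def mean_ndcg(qrel, run, k=10):
    """Average nDCG over all query words present in qrel."""
    scores = []
    for w, gold in qrel.items():
        ranked = sorted(run[w], key=lambda r: -run[w][r])
        scores.append(ndcg(gold, ranked, k))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: gold words in descending similarity, retrieved words with scores.
toy_gold = {"hard": ["difficult", "tough"], "soccer": ["football"]}
toy_retrieved = {"hard": [("tough", 0.8), ("easy", 0.7), ("difficult", 0.6)]}
qrel, run = to_qrel_and_run(toy_gold, toy_retrieved)
average = mean_ndcg(qrel, run)
print(qrel)
print(average)
```

The same `qrel` and `run` can be passed to `pytrec_eval.RelevanceEvaluator(qrel, {"ndcg"}).evaluate(run)`, which returns one `ndcg` value per query word; averaging those values gives the number to plot for each method.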

#### Procedure 6: We repeat procedures 3 to 5 for all methods v (baselines).