# lexsub: default program

In [1]:
from default import *
import os

## Run the default solution on dev

In [2]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.magnitude'))
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner


## Evaluate the default output

In [3]:
from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=27.89


## Documentation

The default program returns the topn words most similar to the target word in the provided GloVe vector set.
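
The lookup described above can be sketched as follows. This is a minimal illustration, not the code in `default.py`: the `most_similar` and `cosine` helpers and the 2-d toy vectors are invented here, and plain Python lists stand in for the Magnitude vectors so the sketch has no dependencies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Return the topn words closest to `word` (hypothetical helper)."""
    scores = {other: cosine(vectors[word], vec)
              for other, vec in vectors.items()
              if other != word}  # the target itself is not a substitute
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy vectors, made up for illustration only.
toy_vectors = {"side": [1.0, 0.0], "edge": [0.9, 0.1], "club": [0.0, 1.0]}
print(most_similar("side", toy_vectors, topn=2))
```

Because a lookup like this depends only on the target word and never on the sentence, every sentence with the same target receives the same list, which matches the ten identical output lines above.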

## Analysis

The provided GloVe vectors were poor at finding synonyms of the target word in each sentence, so we decided to implement the baseline first.

# lexsub: baseline program (single synonym graph)

In [2]:
from lexsub_base import *
import os

## Run the baseline solution on dev

In [3]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'))
lexsub_old = LexSub(os.path.join('data','glove.6B.100d.magnitude'))
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), lexsub_old)))
print("\n".join(output[:10]))

english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front
english edge position line point place way while back front


## Evaluate the baseline output

In [4]:
from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=50.03


## Documentation

We used the WordNet synonym text file as our synonym set. According to the paper, the score is best when the weights are Beta = 1 and Alpha = (number of synonyms) * k, so we adopted these weights. The number of iterations T was set to 25. For the synonym graph, we used the single synonym graph.

## Building full synonym graph

The full undirected graph has an edge from every word on a synonym line to each of the other words on that line.

In [None]:
synonyms = line.lower().strip().split(sep)
            
synonyms = [filterWords(i) for i in synonyms]
          
############## using full graph
for i in range(0, len(synonyms)):
    if (synonyms[i] not in synonymSets):
        synonymSets[synonyms[i]] = set()
    synonymSets[synonyms[i]].update(synonyms[:i]+synonyms[i+1:])

## Building single synonym graph

The single undirected graph has edges only from the first word on a synonym line to the other words on that line.

In [None]:
synonyms = line.lower().strip().split(sep)
            
synonyms = [filterWords(i) for i in synonyms]

if (synonyms[0] not in synonymSets):
    synonymSets[synonyms[0]] = set()
synonymSets[synonyms[0]].update(synonyms[1:])
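
To make the difference between the two graph styles concrete, here is a small side-by-side sketch. The `build_graph` function and the toy lexicon line are made up for illustration, and the `filterWords` step from the excerpts is left out because its behaviour is not shown here.

```python
def build_graph(lines, full=False, sep=" "):
    """Build a synonym graph as a dict of sets (hypothetical helper)."""
    graph = {}
    for line in lines:
        words = line.lower().strip().split(sep)
        if full:
            # full graph: every word points to all other words on the line
            for i, word in enumerate(words):
                graph.setdefault(word, set()).update(words[:i] + words[i + 1:])
        else:
            # single graph: only the first word points to the rest
            graph.setdefault(words[0], set()).update(words[1:])
    return graph

toy_lines = ["side edge border"]  # invented lexicon line
single = build_graph(toy_lines)
full = build_graph(toy_lines, full=True)
print(len(single), len(full))
```

On this toy line the single graph has one key (`side`), while the full graph has three, so retrofitting over the full graph moves many more vectors per iteration.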

## Setting Weights according to the number of synonyms of a word

In [None]:
alpha = 1
beta = 1
k = 0.5

oldWordVector = oldWordVectors.query(word)
vectorSum = reduce(lambda x,y:x+y, (wordVectors[synonym]*beta for synonym in synonymSets[word] if synonym in wordVectors), array([0]*oldWordVector.ndim))
synonymCount = sum(1 for synonym in synonymSets[word] if synonym in wordVectors)
            
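if synonymCount > 0:
    # alpha grows with the number of synonyms found in the vector set
    alpha = beta * synonymCount * k
    vectorSum += oldWordVector * alpha

    # weighted average of the original vector and the synonym vectors
    vectorSum = vectorSum / (synonymCount * beta + alpha)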
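
Putting the weights together with the T iterations, the whole retrofitting loop might look roughly like the sketch below. It assumes the update is applied in place, word by word; the `retrofit` function, its parameter names, and the toy data are ours, and plain lists replace NumPy arrays to keep the sketch dependency-free.

```python
def retrofit(original, graph, k=0.5, beta=1.0, iterations=25):
    """Pull each word toward its synonyms while anchoring it to its
    original vector (hypothetical helper, not the lexsub_base code)."""
    new = {word: list(vec) for word, vec in original.items()}
    for _ in range(iterations):
        for word, synonyms in graph.items():
            neighbours = [new[s] for s in synonyms if s in new]
            if word not in new or not neighbours:
                continue  # nothing to average over
            n = len(neighbours)
            alpha = beta * n * k  # alpha = (number of synonyms) * k when beta = 1
            new[word] = [
                (beta * sum(column) + alpha * q_hat) / (beta * n + alpha)
                for column, q_hat in zip(zip(*neighbours), original[word])
            ]
    return new

# Toy data, invented for illustration only.
toy_vectors = {"side": [1.0, 0.0], "edge": [0.0, 1.0]}
toy_graph = {"side": {"edge"}}
result = retrofit(toy_vectors, toy_graph, iterations=1)
print(result["side"])
```

With k = 0.5 the original vector always carries one third of the total weight, however many synonyms a word has, which may be why tying alpha to the synonym count behaved more stably than a fixed alpha.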

## Approaches

1. adjusting the weights (alpha, beta, and k) and T according to the number of edges (synonyms) of a word <br>
2. testing two different styles of graph (full and single)<br>
3. using different lexicons (wordnet, wordnet+, ppdb, and framenet) or combinations of them (e.g. ppdb + wordnet)<br>

## Analysis

Adjusting T and the weights (alpha, beta, and k) was slightly effective. We tried merging the synonym files (ppdb, wordnet, wordnet+, and framenet) into one synonym set, but the result was not good. Using the wordnet text file alone gave the best result, so we decided to use only that file. We tried various combinations of weights and confirmed that beta = 1 and alpha = (number of synonyms) * 0.5 is the best setting. For the synonym graph, we tested two styles, the single and the full undirected graph, and the single graph performed better.

# lexsub: context substitution program

In [1]:
from lexsub_con import *
import os

## Run the context substitution solution on dev

In [3]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), 100)
lexsub_old = LexSub(os.path.join('data','glove.6B.100d.magnitude'))
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), lexsub_old)))
print("\n".join(output[:10]))

goal english draw match place bottom corner back edge away
english edge position line point place way while back front
english view perspective point middle way part place edge the
english edge position line point place way while back front
english edge position line point place way while back front
edge back turn way line look right turning shoes going
can means stand proper place they only turn rather way
english edge position line point place way while back front
english edge position line point place way while back front
along near line part located edge middle area the where


## Evaluate the context substitution output

In [4]:
from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
if len(ref_data) == len(output):
    print("Score={:.2f}".format(100*precision(ref_data, output)))
else:
    print("length error")

Score=42.51


## Documentation

Starting from the baseline, the lexical substitution step retrieves 100 candidates. This program then scores each candidate against the context words and the target word in the GloVe and retrofitted vector spaces.

## Getting context words

The sentence_range parameter selects w(i-r) to w(i-1) and w(i+1) to w(i+r) as the context words, where i is the index of the target word w(i) and r is the range.

In [None]:
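sentence_range = 1  # number of words taken on each side of the target index
# clamp the start at 0 so a range larger than the index does not wrap around
for word in sentence[max(0, index-sentence_range):index]: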
    if word in oldvec.wvecs_dict and word.isalpha() and word not in self.non_context:
        context_words.append(oldvec.wvecs_dict[word])
for word in sentence[index+1:index+sentence_range+1]:
    if word in oldvec.wvecs_dict and word.isalpha() and word not in self.non_context:
        context_words.append(oldvec.wvecs_dict[word])
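
The window selection can be tried on its own with the sketch below. It returns the context words rather than their vectors and skips the vocabulary check, since `oldvec.wvecs_dict` is not available here; the `context_window` function and the example sentence are made up. The left boundary is clamped at 0 so that a range larger than the index does not wrap around to the end of the sentence.

```python
def context_window(sentence, index, sentence_range=1, non_context=frozenset()):
    """Return up to `sentence_range` usable words on each side of the
    target at `index` (hypothetical helper)."""
    left = sentence[max(0, index - sentence_range):index]
    right = sentence[index + 1:index + sentence_range + 1]
    return [word for word in left + right
            if word.isalpha() and word not in non_context]

tokens = "he sat on the side of the road".split()  # invented example
stop = {"the", "of", "on"}
print(context_window(tokens, 4, sentence_range=3, non_context=stop))
```

Here the target is `side` at index 4. With range 3, the six neighbouring words shrink to `sat` and `road` once the non-context words are filtered out.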

## Score Calculation: Distance

In [None]:
bal = len(context_words)
if context_words:
    for word in candidates:
        candidates_dict[word] = sum(np.linalg.norm(oldvec.wvecs_dict[word] - c) for c in context_words)
else:
    # no usable context words: fall back to the top 10 baseline candidates
    return candidates[:10]

## Score Calculation: add(cosine similarity)

In [None]:
bal = len(context_words)
if context_words:
    for word in candidates:
        candidates_dict[word] = ((1-spatial.distance.cosine(self.wvecs_dict[word], v)) + 
                                 sum((1-spatial.distance.cosine(oldvec.wvecs_dict[word], c)) 
                                     for c in context_words)) / (bal+1)
else:
    # no usable context words: fall back to the top 10 baseline candidates
    return candidates[:10]

## Score Calculation: baladd(cosine similarity)

In [None]:
bal = len(context_words)
if context_words:
    for word in candidates:
        candidates_dict[word] = ((1-spatial.distance.cosine(self.wvecs_dict[word], v)) * bal + 
                                 sum((1-spatial.distance.cosine(oldvec.wvecs_dict[word], c)) 
                                     for c in context_words)) / (bal*2)
else:
    # no usable context words: fall back to the top 10 baseline candidates
    return candidates[:10]
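
The three scoring rules can be compared on toy vectors with the sketch below. It assumes that `v` in the excerpts is the target word's vector, which the excerpts do not define; the function names and the 2-d vectors are made up, and plain lists replace the NumPy and SciPy calls.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dist_score(cand_old, contexts):
    """Distance: summed Euclidean distance to the context words."""
    return sum(math.dist(cand_old, c) for c in contexts)

def add_score(cand_new, cand_old, target, contexts):
    """add: the target counts as much as one context word."""
    total = cosine(cand_new, target) + sum(cosine(cand_old, c) for c in contexts)
    return total / (len(contexts) + 1)

def baladd_score(cand_new, cand_old, target, contexts):
    """baladd: the target counts as much as all context words together.
    Requires at least one context word, as guarded by `if context_words`."""
    n = len(contexts)
    total = n * cosine(cand_new, target) + sum(cosine(cand_old, c) for c in contexts)
    return total / (2 * n)

# Toy case: the candidate matches the target but not the two context words.
target, cand_new, cand_old = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
contexts = [[1.0, 0.0], [1.0, 0.0]]
print(add_score(cand_new, cand_old, target, contexts),
      baladd_score(cand_new, cand_old, target, contexts))
```

In this toy case the add score is 1/3 and the baladd score is 1/2: baladd keeps the target's share at one half no matter how many context words there are. Note also that a distance score is better when smaller, while the two cosine scores are better when larger, so the sort direction (not shown in the excerpts) has to differ between them.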

## List of non context words

In [None]:
non_context = ["it's", "she's", 'were', 'because', 'this', 'couldn', 'then', 'how', 'd', 'doesn', 'down', 's', 
               'they', 'she', "needn't", 'wasn', 'haven', 'between', "wouldn't", 'the', 'ma', "wasn't", 'until', 
               'my', 'himself', "that'll", 'by', 'about', 'in', "aren't", "should've", 'why', 'nor', 'before', 
               'when', 'we', 'here', 'only', "couldn't", 'ain', 'no', 'your', 'will', 'own', 'his', "you'll", 
               'are', 'and', 'most', 'do', 'now', "isn't", 'having', 'on', 'her', 'theirs', 'under', 'with', 'to', 
               "mightn't", 'while', 'its', 'be', 'll', 'don', 'over', 'again', 'their', 'won', 'too', 'during', 
               'shan', 'herself', 'has', 'or', 'from', 'ours', 'into', 'our', 'above', 'wouldn', 'you', 'of', 'so', 
               't', 'he', 'doing', 'as', 'i', 'can', 'shouldn', 'have', 'at', 'other', 'hasn', 'more', 'yourselves', 
               'y', 'yours', 'very', 'themselves', 'which', 'these', 'being', 'both', 'aren', 'did', 'than', 'needn', 
               'for', 'itself', "haven't", 'through', 'weren', 'but', 'once', 'isn',  'ourselves', 'didn', 'not', 
               'yourself', 'mightn', 'after', 've', 'him', 'whom', "hasn't", 'a', 'hadn', "shouldn't", "mustn't", 
               'those', 'off',  'each', 'was', "didn't", "you'd", 'where', 'o', 'further', 'below', "shan't",  
               'myself', 'mustn', 'is', 'been', 'just', 'any', 'out', 'that', 'm', 'such',  'me', 'same', 'hers', 
               'some', 'had', 'does', 'against', 'should', "you've",  "doesn't", "you're", 'them', 'am', 'if', 
               'who', 'few', 'what', 'there',  "don't", "weren't", "won't", 'an', 'all', 're', 'it', 'up', "hadn't",
               "'ll", ',', '.']

## Approaches

1. changing the sentence range <br>
2. testing various score calculations (distance, add, and baladd)<br>
3. adjusting the number of candidates<br>
4. excluding non-context words from the context words list<br>

## Analysis

Using context words to rank synonyms always gave lower accuracy than not using them. We tried using every word in the sentence as a context word, but the score dropped sharply. The best score came with range 1. Excluding non-context words (such as it, as, a, the, and so on) from the context words list raised the score. Having too many candidates increased the randomness, so we set the number of candidates to 100, which is 20 times the topn. The paper describes several scoring methods, and baladd gave the best score.