# lexsub: default program

In [1]:
from default import *
import os

## Run the default solution on dev

In [4]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), topn=10)
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), similarity_measure=0)))
print("\n".join(output[:10]))

point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back
point position slope heading way english line course while back


## Evaluate the default output

In [5]:
from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.10f}".format(100*precision(ref_data, output)))

Score=47.0933646506


## Documentation



To create the retrofitted .magnitude file, run this command:

`sh run.sh`

This script reads the original word vector file `glove.6B.100d.magnitude` from the given CSIL path. 

The process takes approximately 15 minutes. The script runs `modifyWordVec.py`, which writes the retrofitted word vectors to a text file. By default it reads the lexicon from `wordnet-synonyms.txt` to build the ontology graph. The final step converts the generated `.txt` file into a `.magnitude` file.
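
As a rough illustration of the lexicon-reading step, here is a minimal sketch that turns lexicon lines into an undirected synonym graph. The function name `read_lexicon` and the file format (a headword followed by its synonyms, separated by whitespace) are assumptions made for this example; they are not taken from `modifyWordVec.py`.

```python
from collections import defaultdict


def read_lexicon(lines):
    """Build an undirected synonym graph from lexicon lines.

    Assumed format (not confirmed by the source): each line holds a
    headword followed by its synonyms, separated by whitespace.
    """
    graph = defaultdict(set)
    for line in lines:
        words = line.lower().split()
        if len(words) < 2:
            continue  # skip blank lines and words with no listed synonyms
        head, synonyms = words[0], words[1:]
        for syn in synonyms:
            if syn != head:
                graph[head].add(syn)
                graph[syn].add(head)  # store each edge in both directions
    return dict(graph)


# Tiny in-memory sample standing in for the real lexicon file.
sample = ["side position slope", "position place", ""]
graph = read_lexicon(sample)
```

With a real file, the same function could be called as `read_lexicon(open(path))`, where `path` points at the lexicon.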


The `modifyWordVec.py` script first reads the pymagnitude word vectors `Q̂` and makes a copy of them, `Q`.
Then, for 10 iterations, it loops over each word in the ontology lexicon file and updates that word's vector in `Q` so that it stays close to its original vector in `Q̂` while the vectors of words adjacent in the ontology move closer to each other in `Q`.

We minimize the following objective:
$$L(Q)=\sum_{i=1}^{n}\left[\alpha_i\|q_i-\hat{q}_i\|^2+\sum_{(i,j)\in E}\beta_{ij}\|q_i-q_j\|^2\right]$$

Overall the algorithm looks like this:
    
<br>1. Initialize Q to be equal to the vectors in Q̂
<br>2. For iterations t = 1 … 10:
&nbsp;&nbsp;&nbsp;Take the derivative of L(Q) with respect to each word vector q_i and set it to zero to get the update:
    &nbsp;$$q_i=\frac{\sum_{j:(i,j)\in E}\beta_{ij}q_j+\alpha_i\hat{q}_i}{\sum_{j:(i,j)\in E}\beta_{ij}+\alpha_i}$$
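
The update loop above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in `modifyWordVec.py`: the function name `retrofit`, the dictionary-based inputs, a single shared `alpha`, and `beta = 1 / degree` are all assumptions made for the example.

```python
import numpy as np


def retrofit(q_hat, edges, alpha=1.0, iterations=10):
    """Iteratively pull each vector towards its lexicon neighbours.

    q_hat: dict mapping word -> original vector (np.ndarray)
    edges: dict mapping word -> iterable of neighbouring words
    alpha: weight on staying close to the original vector (assumed shared)
    """
    q = {word: vec.copy() for word, vec in q_hat.items()}
    for _ in range(iterations):
        for word, neighbours in edges.items():
            if word not in q:
                continue  # lexicon word without a vector
            nbrs = [n for n in neighbours if n in q]
            if not nbrs:
                continue  # nothing to pull towards
            beta = 1.0 / len(nbrs)  # assumption: beta_ij = 1 / degree(i)
            numerator = alpha * q_hat[word] + beta * sum(q[n] for n in nbrs)
            denominator = alpha + beta * len(nbrs)
            q[word] = numerator / denominator
    return q


# Toy data: "a" and "b" are synonyms, "c" has no lexicon entry.
q_hat = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([5.0, 5.0]),
}
edges = {"a": ["b"], "b": ["a"]}
q = retrofit(q_hat, edges)
```

In this toy run the vectors of `a` and `b` move towards each other, while `c`, which has no neighbours, keeps its original vector.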
    

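We created 4 functions, `add_similarity`, `baladd_similarity`, `mul_similarity` and `balmul_similarity`, to calculate the 4 substitutability measures from the reference paper. In the formulas below, $s$ is the candidate substitute, $t$ is the target word and $C$ is the set of context words.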

`add_similarity` scores a candidate against the target word and the context words with the arithmetic mean of the cosines:
$$\frac{\cos(s,t)+\sum_{c\in C}\cos(s,c)}{|C|+1}$$

`baladd_similarity` uses the same sum but weights the target word as heavily as all context words combined:
$$\frac{|C|\cdot\cos(s,t)+\sum_{c\in C}\cos(s,c)}{2\cdot|C|}$$

`mul_similarity` scores a candidate with the geometric mean of the pcos values:
$$\sqrt[|C|+1]{\operatorname{pcos}(s,t)\cdot\prod_{c\in C}\operatorname{pcos}(s,c)}$$

`balmul_similarity` uses the same product but weights the target word as heavily as all context words combined:
$$\sqrt[2\cdot|C|]{\operatorname{pcos}(s,t)^{|C|}\cdot\prod_{c\in C}\operatorname{pcos}(s,c)}$$
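
To make the four formulas concrete, here is a minimal NumPy sketch. The function names (`add_score` and so on) are hypothetical and deliberately differ from the project's own functions, whose code is not shown here. The sketch also assumes that pcos is the cosine rescaled to the range [0, 1] as `(cos + 1) / 2`, which the text above does not define, and it falls back to the target-only score when the context is empty.

```python
import numpy as np


def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pcos(u, v):
    """Assumed definition: cosine rescaled from [-1, 1] to [0, 1]."""
    return (cos(u, v) + 1.0) / 2.0


def add_score(s, t, context):
    return (cos(s, t) + sum(cos(s, c) for c in context)) / (len(context) + 1)


def baladd_score(s, t, context):
    n = len(context)
    if n == 0:
        return cos(s, t)  # assumption: no context means target-only score
    return (n * cos(s, t) + sum(cos(s, c) for c in context)) / (2 * n)


def mul_score(s, t, context):
    product = pcos(s, t) * np.prod([pcos(s, c) for c in context])
    return float(product ** (1.0 / (len(context) + 1)))


def balmul_score(s, t, context):
    n = len(context)
    if n == 0:
        return pcos(s, t)  # assumption: no context means target-only score
    product = pcos(s, t) ** n * np.prod([pcos(s, c) for c in context])
    return float(product ** (1.0 / (2 * n)))


# Toy vectors: the candidate equals the target and one of two context words.
s = np.array([1.0, 0.0])
t = np.array([1.0, 0.0])
context = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
scores = {
    "add": add_score(s, t, context),
    "baladd": baladd_score(s, t, context),
    "mul": mul_score(s, t, context),
    "balmul": balmul_score(s, t, context),
}
```

In the balanced variants the target term carries as much weight as all context terms together, so a long context cannot drown out the target word.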

We copied the stop word list from the NLTK corpus and added a few extra tokens we wanted to ignore, such as `"e.g.", "http://pgina.xpasystems.com","'s"`.
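
A minimal sketch of how such a stop word list might be used to select context words is shown below. The helper `context_words`, the `window` parameter and the tiny stop word set are hypothetical; the real list is much longer, and the project's actual filtering code is not shown here.

```python
# Tiny illustrative subset; the real list comes from NLTK plus extra tokens.
STOP_WORDS = {"the", "a", "of", "and", "e.g.", "'s"}


def context_words(index, tokens, window=2, stop_words=STOP_WORDS):
    """Return the non-stop-word tokens within `window` positions of tokens[index]."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    return [
        tok
        for pos, tok in enumerate(tokens[lo:hi], start=lo)
        if pos != index and tok.lower() not in stop_words
    ]


tokens = "the side of the river 's bank".split()
context = context_words(1, tokens, window=3)  # target word is "side"
```

Here only `river` survives: the other tokens inside the window are stop words, and `bank` lies outside it.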

## Analysis



### Baseline - with stop words and no context words using `wordnet-synonyms.txt`

In [6]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), topn=10)
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), similarity_measure=0)))

from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.10f}".format(100*precision(ref_data, output)))

Score=47.0933646506


### With context words - Add and stop words and considering top 20 words using `wordnet-synonyms.txt`

In [13]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), topn=20)
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), similarity_measure=1)))

from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.10f}".format(100*precision(ref_data, output)))

Score=38.1092190252


### With context words - BalAdd and stop words and considering top 20 words using `wordnet-synonyms.txt`

In [14]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), topn=20)
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), similarity_measure=2)))

from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.10f}".format(100*precision(ref_data, output)))

Score=45.9189665297


### With context words - Mul and stop words and considering top 20 words using `wordnet-synonyms.txt`

In [18]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), topn=20)
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), similarity_measure=3)))

from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.10f}".format(100*precision(ref_data, output)))

Score=47.0933646506


### With context words - BalMul and stop words and considering top 20 words using `wordnet-synonyms.txt`

In [17]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.retrofit.magnitude'), topn=20)
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split(), similarity_measure=4)))

from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.10f}".format(100*precision(ref_data, output)))

Score=47.0933646506


### Retrofitting: ###

Before tuning, retrofitting with the `wordnet-synonyms.txt` lexicon gave us a dev score of `45.04`.

### Tuning alpha and beta: ###

We raised the dev score to `47.0933646506` by iteratively adjusting alpha while setting beta = (degree of node)^-1.

### Incorporating context words without stop words: ###

On top of tuning the parameters, we also incorporated the context words (excluding stop words) to try to improve the dev score. We tried two approaches:

1. We computed an average word vector, taking the mean of the target word vector and the vectors of the context words within a context window. This approach lowered the score to `38.5413`.


2. As described in the reference paper, we computed the 4 substitutability measures: Add, BalAdd, Mult and BalMult. None of them improved on the baseline: Add and BalAdd scored `38.1092190252` and `45.9189665297`, while Mult and BalMult matched the baseline at `47.0933646506`. We attribute this to the paper's measures being defined over context vectors rather than the word vectors of the context words.
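
The averaging approach from item 1 can be sketched as follows. This is an illustration under assumptions, not the code we ran: the helper names `mean_query_vector` and `nearest_words` are hypothetical, and a plain dictionary of toy vectors stands in for the `.magnitude` file.

```python
import numpy as np


def mean_query_vector(target_vec, context_vecs):
    """Average the target vector with the vectors of its context words."""
    return np.mean([target_vec, *context_vecs], axis=0)


def nearest_words(query, vocab, topn=10, exclude=()):
    """Rank vocabulary words by cosine similarity to the query vector."""
    scored = []
    for word, vec in vocab.items():
        if word in exclude:
            continue  # never propose the target word itself
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scored.append((float(sim), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]


# Toy 2-d vocabulary standing in for the retrofitted vectors.
vocab = {
    "side": np.array([1.0, 0.0]),
    "edge": np.array([0.9, 0.1]),
    "river": np.array([0.0, 1.0]),
    "bank": np.array([0.5, 0.5]),
}
query = mean_query_vector(vocab["side"], [vocab["river"]])
substitutes = nearest_words(query, vocab, topn=2, exclude={"side"})
```

Because the context words get the same weight as the target word, the query can drift away from the target's own neighbourhood, which is one plausible reason this approach scored lower.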





