## Intuition

### Sense Determination

We determine the sense of a target word in a given context by choosing the sense vector of the target word that has the highest cosine similarity to the aggregate context vector. The aggregate context vector is the average of the context word vectors after the stop words are removed.
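
The selection step can be sketched on toy data as follows. The 3-dimensional vectors and the names `cosine`, `context_vecs` and `sense_vectors` are made up for illustration and are not part of the notebook, which works with 300-dimensional vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical context word vectors (stop words already removed)
context_vecs = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])]

# Hypothetical sense vectors of the target word
sense_vectors = [np.array([0.0, 0.0, 1.0]),   # sense 0
                 np.array([2.0, 1.0, 0.0])]   # sense 1

# Aggregate context vector: the mean of the context word vectors
context_mean = np.mean(context_vecs, axis=0)

# Pick the sense whose vector is most similar to the context
scores = [cosine(s, context_mean) for s in sense_vectors]
best_idx = int(np.argmax(scores))
print(best_idx)
```

Here sense 1 points in the same direction as the context mean, so it is the one selected.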

### Evaluation

For evaluation, we use the answer key to **group** the test sentences that share the same sense of the same target word. We then run the function on every sentence in a *group* and check whether most of them (ideally all of them) receive the same sense index.

We assume that the most common sense index returned for a *group* is the correct sense index for that *group*. The accuracy is then calculated using the formula:

```example
accuracy = ∑(g) #(most_common_index(g)) / total_sentences
```

where `#(most_common_index(g))` is the number of occurrences of the most common index when the function is run on the sentences of a *group* `g`, and `total_sentences` is the number of sentences in the test dataset for which the function returns a valid output.
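
As a worked example of the formula, the sketch below computes the accuracy for two invented groups. The group names and predicted indices are hypothetical. `None` and `-1` are treated as invalid outputs, as in the evaluation cell at the end of the notebook.

```python
from collections import Counter

# Hypothetical sense indices returned for the sentences of two groups
predictions_by_group = {
    "group_a": [0, 0, 1, 0],
    "group_b": [2, 2, None, -1, 1],
}

correct = total = 0
for indices in predictions_by_group.values():
    # Keep only the sentences that produced a valid sense index
    counts = Counter(i for i in indices if i is not None and i != -1)
    if not counts:
        continue
    correct += counts.most_common(1)[0][1]  # size of the majority index
    total += sum(counts.values())           # number of valid sentences

accuracy = correct / total
print(correct, total, accuracy)
```

In `group_a` the majority index covers 3 of 4 valid sentences and in `group_b` it covers 2 of 3, so the accuracy is 5/7.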

## Imports and Initializations

We import `numpy` for array operations, `nltk` for POS tagging and lemmatization, `stop_words` for the English stop-word list, and the standard-library modules `os`, `pickle`, `pprint` and `re` for utility functions.

In [1]:
import os, pprint, pickle, re
import numpy as np
from stop_words import get_stop_words
import nltk

lem = nltk.stem.wordnet.WordNetLemmatizer()
pp = pprint.PrettyPrinter(indent=2)

#TEST_SENTENCES_PATH = '/Users/sounak/Documents/clg/nlp/nlp-projects/data/wsd/sentences.txt'
TEST_SENTENCES_PATH = '/home/nmn/Downloads/nlp-projects-master/data/wsd'

ModuleNotFoundError: No module named 'stop_words'

## Helper functions

The two helper functions `save_obj` and `load_obj` pickle an object to disk and load it back from the pickle file. They let us cache the vector dictionaries so that later runs can load them much faster.

In [2]:
5
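
The body of this cell is not shown above. A minimal sketch of what such helpers could look like is given below. It assumes a hypothetical cache directory `CACHE_DIR` and a `<name>.pkl` naming scheme, and it assumes that `load_obj` returns `None` when nothing has been cached, which matches how the later cells test the result with `if not ...`.

```python
import os
import pickle
import tempfile

CACHE_DIR = 'obj'  # hypothetical location, not taken from the notebook

def save_obj(obj, name, cache_dir=CACHE_DIR):
    """Pickle `obj` to <cache_dir>/<name>.pkl."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, name + '.pkl'), 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name, cache_dir=CACHE_DIR):
    """Return the pickled object, or None if it has not been cached yet."""
    path = os.path.join(cache_dir, name + '.pkl')
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round trip in a throwaway directory
demo_dir = tempfile.mkdtemp()
save_obj({'bank': [1.0, 2.0]}, 'demo', demo_dir)
restored = load_obj('demo', demo_dir)
missing = load_obj('not_cached', demo_dir)
```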

## Loading the Sensegram

In [3]:
sense_vecs = load_obj('sense_vecs')
pos_tags = load_obj('pos_tags')

if not (sense_vecs and pos_tags):
    #SENSEGRAM_PATH = "/Users/sounak/Documents/clg/nlp/nlp-projects/data/sensegrams_of_wikipedia_cluster"
    SENSEGRAM_PATH = "/home/nmn/Desktop/JOBIMTEXT_NAMAN_PRIYANKA_INTERNSHIP"
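    sense_vecs = {}
    pos_tags = set()

    # Each line holds: word#POS <tab> sense index <tab> vector literal
    with open(SENSEGRAM_PATH, 'r') as f:
        for line in f:
            t = line.rstrip('\n').split('\t')
            word, pos = t[0].split('#')
            pos_tags.add(pos)
            # Sense index 0 starts the list of senses for this (word, pos)
            if t[1] == '0':
                sense_vecs[(word, pos)] = []
            # NOTE: eval() executes arbitrary code; only use it on trusted files
            sense_vecs[(word, pos)].append(np.array(eval(t[2])))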
    save_obj(sense_vecs, 'sense_vecs')
    save_obj(pos_tags, 'pos_tags')

print('sense_vecs have been loaded')

IsADirectoryError: [Errno 21] Is a directory: '/home/nmn/Desktop/JOBIMTEXT_NAMAN_PRIYANKA_INTERNSHIP'

## Loading the Glove Model

In [4]:
word_vecs = load_obj('word_vecs')

if not word_vecs:
    #GLOVE_PATH = "/Users/sounak/Documents/clg/nlp/nlp-projects/data/glove.6B.300d.txt"
    GLOVE_PATH = "/home/nmn/Desktop/JOBIMTEXT_NAMAN_PRIYANKA_INTERNSHIP"
    word_vecs = {}
    # Each line holds a token followed by its space-separated vector components
    with open(GLOVE_PATH, 'r', encoding='utf-8') as f:
        for line in f:
            t = line.rstrip('\n').split(' ')
            word_vecs[t[0]] = np.array([float(x) for x in t[1:]])
    save_obj(word_vecs, 'word_vecs')
    
print('word_vecs have been loaded')

IsADirectoryError: [Errno 21] Is a directory: '/home/nmn/Desktop/JOBIMTEXT_NAMAN_PRIYANKA_INTERNSHIP'
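
The loader expects `GLOVE_PATH` to point at the vector file itself, whereas the error above reports that the configured path is a directory. As an illustration of the line format the loop parses, the sketch below reads two made-up lines from an in-memory buffer. The tokens, the values and the name `toy_vecs` are hypothetical, and the vectors have 3 components instead of 300.

```python
import io
import numpy as np

# Two invented lines in the GloVe text layout: token, then float components
sample = io.StringIO("bank 0.1 -0.2 0.3\nriver 0.4 0.5 -0.6\n")

toy_vecs = {}
for line in sample:
    token, *values = line.rstrip('\n').split(' ')
    toy_vecs[token] = np.array([float(x) for x in values])

print(toy_vecs['bank'])
```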

## Computing Sense

The function `compute_sense_idx` takes a tokenized sentence and the target word, and returns the index of the sense of the target that is used in that context. It returns `None` if the target does not occur in the sentence and `-1` if no sense could be scored.

The function builds the aggregate context vector from the lemmatized context words, excluding the stop words, and returns the index of the sense vector of the target word that has the highest cosine similarity to it.

In [5]:
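stop_words = set(get_stop_words('en'))

# First letter of a Penn Treebank tag -> WordNet POS tag.
# Adjective tags (JJ, JJR, JJS) start with 'J', not 'A'.
WORDNET_POS = {'j': 'a', 'r': 'r', 'n': 'n', 'v': 'v'}

def compute_sense_idx(sentence, target):
    """Return the index of the sense of `target` that best fits `sentence`.

    Returns None if the target is not in the sentence, and -1 if no
    context vector or no sense vector is available.
    """
    if target not in sentence:
        return None
    tagged = nltk.pos_tag(sentence)
    context = [(w, pos) for w, pos in tagged
               if w != target and w not in stop_words]

    # Sum the vectors of the lemmatized context words that have an embedding
    vec_sum = np.zeros(300)
    n_vecs = 0
    for w, pos in context:
        wn_pos = WORDNET_POS.get(pos[0].lower())
        if wn_pos:
            w = lem.lemmatize(w, wn_pos)
        if w in word_vecs:
            vec_sum += word_vecs[w]
            n_vecs += 1
    if n_vecs == 0:
        return -1

    # Average over the vectors that were actually found
    cw_mean = vec_sum / n_vecs
    cw_norm = np.linalg.norm(cw_mean)
    if cw_norm == 0:
        return -1

    # Pick the sense with the highest cosine similarity to the context
    max_idx = -1
    max_value = float('-inf')
    for pos in pos_tags:
        for idx, sense in enumerate(sense_vecs.get((target, pos), [])):
            sense_norm = np.linalg.norm(sense)
            if sense_norm > 0:
                result = np.dot(sense, cw_mean) / (sense_norm * cw_norm)
                if result > max_value:
                    max_value = result
                    max_idx = idx
    return max_idx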

NameError: name 'get_stop_words' is not defined

## Tokenizer

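This is a lightweight tokenizer for the input sentences. It lowercases the text, splits it on every character that is not a letter, digit or apostrophe, and strips the possessive `'s`.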

In [6]:
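def tokenize(text):
    # Lowercase and split on anything that is not a letter, digit or apostrophe
    words = [w.lower() for w in re.split(r"[^a-zA-ZÀ-ÿ0-9']+", text)]
    # Strip a trailing possessive 's (e.g. "dog's" -> "dog")
    words = [w[:-2] if w.endswith("'s") else w for w in words]
    # Drop the empty strings that re.split leaves at the boundaries
    return [w for w in words if w]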

## Testing

We test the function on the SemEval-2013 Task 13 test dataset.

In [8]:
res = {}

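# Requires `sents` from the cell that loads the test sentences
for k, v in sents.items():
    # Keys have the form '<instance id>.<target word>'
    inst_id, target = k.rsplit('.', 1)
    # Tokens are lowercased by tokenize(), so lowercase the target as well
    res[inst_id] = compute_sense_idx(v, target.lower())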
    
print('results have been loaded')

NameError: name 'compute_sense_idx' is not defined

In [9]:
from xml.dom.minidom import parse
FILE = './data/wsd-test/contexts/senseval2-format/semeval-2013-task-13-test-data.senseval2.xml'

dom = parse(FILE)
inst = dom.getElementsByTagName('instance')

sents = {}

for i in inst:
    k = i.attributes['id'].value
    context = i.getElementsByTagName('context')[0]
    word = context.getElementsByTagName('head')[0].childNodes[0].nodeValue
    v = ' {} '.format(word).join(t.nodeValue.strip() for t in context.childNodes if t.nodeType == t.TEXT_NODE)
    sents[k + '.' + word] = tokenize(v)

print('test sentences have been loaded')

test sentences have been loaded
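
To make the join step of the loading cell easier to follow, the sketch below applies the same logic to a single hand-written instance. The XML snippet is invented for illustration and is not taken from the dataset.

```python
from xml.dom.minidom import parseString

# Hypothetical instance in the same shape the loading cell expects
SAMPLE = ('<instance id="add.v.1"><context>Please <head>add</head> '
          'the numbers.</context></instance>')

dom = parseString(SAMPLE)
context = dom.getElementsByTagName('context')[0]
word = context.getElementsByTagName('head')[0].childNodes[0].nodeValue

# Text nodes on either side of <head>, joined by the head word itself
parts = [t.nodeValue.strip() for t in context.childNodes
         if t.nodeType == t.TEXT_NODE]
sentence = ' {} '.format(word).join(parts)
print(sentence)
```

One consequence of this approach is that the head word is only reinserted between two text nodes. If `<head>` is the first or last child of `<context>`, there is only one text node and the joined string does not contain the target.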


## Evaluation



In [10]:
KEY = './data/wsd-test/keys/gold/all.singlesense.key'

keys = {}      # instance id -> sense label
keys_rev = {}  # '<first field>%<sense label>' -> list of instance ids
with open(KEY, 'r') as f:
    for line in f:
        fields = line.strip().split(' ')
        # Sense label: the text between '%' and the first ':' of the third field
        label = fields[2].split(':')[0].split('%')[1]
        keys[fields[1]] = label
        keys_rev.setdefault(fields[0] + '%' + label, []).append(fields[1])
    
print('keys have been loaded')

keys have been loaded


In [11]:
from collections import Counter

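total = correct = 0

for group in keys_rev.values():
    c = Counter(res[inst_id] for inst_id in group)
    # Discard sentences without a valid sense index
    del c[-1]
    del c[None]
    if not c:
        continue  # no valid output in this group
    correct += c.most_common(1)[0][1]
    total += sum(c.values())

if total:
    print(correct / total * 100)
else:
    print('no valid predictions to evaluate')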

KeyError: 'add.v.1'