# Week 5 (Part 2): Dictionary Methods for WSD

We have seen that many words have many different senses.  In order to make the correct decision about the meaning of a sentence or a document, an application often needs to be able to **disambiguate** individual words, that is, choose the correct sense given the context.

In this lab we will be looking at methods for word sense disambiguation (WSD) that make use of dictionaries or other lexical resources (also referred to as **knowledge-based methods** for WSD).  In particular, we will look at
* simplified Lesk
* adapted Lesk
* minimising distance in a semantic hierarchy

As in the previous lab, we will be using WordNet as our lexical resource.  So, first, let's import it.

In [None]:
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
from nltk.stem.wordnet import WordNetLemmatizer

import operator, sys

sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/Documents/teaching/NLE2018/resources')
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader


In [None]:
#make sure that the path to your utils.py file is correct for your computer
sys.path.append('/Users/juliewe/Documents/teaching/NLE/NLE2019/w4/Week4Labs')

from utils import *

## Simplified Lesk

The Lesk algorithm is based on the intuition that the correct combination of senses in a sentence is the one whose definitions share the most words.

It is computationally very expensive to compare all possible sense combinations of words in a sentence.  Even if each word has just 2 senses, a sentence of $n$ words has $2^n$ possible sense combinations.
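To get a feel for how quickly this grows, note that the number of combinations is simply the product of the number of senses of each word. The sketch below illustrates the arithmetic; the sense counts are made-up numbers rather than WordNet lookups, and `count_sense_combinations` is a hypothetical helper name.

```python
import math

def count_sense_combinations(sense_counts):
    """Number of ways to pick one sense per word: the product of the counts."""
    return math.prod(sense_counts)

# Hypothetical sentence of 10 words, each with exactly 2 senses
print(count_sense_combinations([2] * 10))              # 1024, i.e. 2**10

# Hypothetical 6-word sentence with a more varied mix of sense counts
print(count_sense_combinations([1, 3, 10, 2, 1, 7]))   # 420
```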

In the simplified Lesk algorithm, below, we consider each word in turn and choose the sense whose definition has the most **overlap** with the contextual words in the sentence.


In [None]:
def simplifiedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    contexttokens=set(word_tokenize(sentence))-{word}
    
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=set(word_tokenize(synset.definition()))
        #find the size of the intersection of the sensetokens set with the contexttokens set
        scores.append((synset.definition(),len(sensetokens.intersection(contexttokens))))
    
    #a word with no WordNet synsets has no senses to score
    if not scores:
        return (None,0)
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True)
    #print(sortedscores)
    return sortedscores[0]
    


Now let's test it on a couple of sentences containing the word *bank*.

In [None]:
banksentences=["he borrowed money from the bank","he sat on the bank of the river and watched the currents"]
for sentence in banksentences:
    print(sentence,":",simplifiedLesk("bank",sentence))

It actually appears not to do too badly.  However, this is more by luck than anything else.  If you inspect the sentences and the definitions, you will notice that most of the overlap is currently generated by stopwords.
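One way to inspect this is to print the overlapping tokens themselves rather than just their count. The sketch below is only illustrative: `overlapping_tokens` is a hypothetical helper, whitespace splitting stands in for `word_tokenize`, and the definition string is one of the WordNet glosses for *bank* quoted later in this lab.

```python
def overlapping_tokens(word, sentence, definition):
    """Return the tokens shared by the sentence context and a sense definition."""
    # Whitespace splitting stands in for word_tokenize to keep the sketch dependency-free
    context = set(sentence.split()) - {word}
    return context & set(definition.split())

sentence = "he sat on the bank of the river and watched the currents"
definition = "an arrangement of similar objects in a row or in tiers"
print(overlapping_tokens("bank", sentence, definition))   # {'of'}
```

Here the only shared token is the stopword *of*, which says nothing about which sense is intended.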

### Exercise 1.1
Improve the SimplifiedLesk algorithm by carrying out:
* case and number normalisation 
* stopword filtering
* lemmatisation

You should find some useful functions for doing this in `utils.py` based on earlier labs.

Make sure you test it.  Unfortunately, you should now find 0 overlap between any of the senses and the two bank sentences given.
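As a starting point, the preprocessing could take roughly the following shape. This is a sketch under assumptions, not the `utils.py` implementation: `normalise`, the tiny `STOPWORDS` set, the `"NUM"` placeholder and `toy_lemmatise` are all made up for illustration. In the lab you would more likely pass in `WordNetLemmatizer().lemmatize` and use a proper stopword list.

```python
# Tiny hand-picked stopword list, for illustration only
STOPWORDS = {"a", "an", "and", "from", "he", "in", "of", "on", "or", "the", "to"}

def normalise(tokens, lemmatise=lambda tok: tok, stopwords=STOPWORDS):
    """Lowercase, map numbers to 'NUM', drop stopwords and punctuation, then lemmatise."""
    result = []
    for tok in tokens:
        tok = tok.lower()              # case normalisation
        if tok.isdigit():
            tok = "NUM"                # number normalisation
        if tok in stopwords or not tok.isalnum():
            continue                   # skip stopwords and punctuation tokens
        result.append(lemmatise(tok))
    return result

# Toy stand-in for a real lemmatiser: strips a trailing "s" from longer words
toy_lemmatise = lambda tok: tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok

tokens = "He sat on the bank of the river and watched the 2 currents".split()
print(normalise(tokens, lemmatise=toy_lemmatise))
# ['sat', 'bank', 'river', 'watched', 'NUM', 'current']
```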

## Adapted Lesk
WordNet definitions are very short.  However, it is possible to create a bigger set of sense words by including information about the hypernyms and hyponyms of each sense.

### Exercise 2.1
Adapt the Lesk algorithm to include in `sensetokens`:
* all of the lemma_names for the sense itself
* all of the lemma_names for the hypernyms of the sense
* all of the lemma_names for the hyponyms of the sense
* all of the words from the definitions of the hypernyms of the sense
* all of the words from the definitions of the hyponyms of the sense

Make sure you carry out normalisation and lemmatisation of these words as before.

Test each adaptation you make on the bank sentences, recording the overlap observed with the chosen sense.
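The gathering step might be sketched as follows. It relies only on the standard NLTK synset methods `lemma_names()`, `definition()`, `hypernyms()` and `hyponyms()`; the function name `expanded_sense_tokens` is hypothetical, whitespace splitting stands in for `word_tokenize`, and normalisation and lemmatisation are deliberately left out, so you would still need to apply them to the result.

```python
def expanded_sense_tokens(synset):
    """Collect tokens from a sense plus its immediate hypernyms and hyponyms."""
    tokens = set(synset.lemma_names())                # lemma names of the sense itself
    tokens.update(synset.definition().split())        # words in its own definition
    for related in synset.hypernyms() + synset.hyponyms():
        tokens.update(related.lemma_names())          # lemma names of the neighbour
        tokens.update(related.definition().split())   # words in the neighbour's definition
    return tokens

# In the lab this would be called on a WordNet synset, for example:
# expanded_sense_tokens(wn.synsets("bank")[0])
```

Note that multiword lemma names in WordNet are joined with underscores, so you may want to split them on `_` before computing overlap.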

### Exercise 2.2
* From a sample of 1000 sentences from the dvd category of the Amazon review corpus (using the `sample_raw_sents()` method), find sentences which contain the lemma *film*. It will depend on the exact sample, but I would expect there to be somewhere between 50 and 100. 
* Use your AdaptedLesk algorithm to disambiguate them.  You may want to adapt it slightly so that it takes as input a list or a set of context lemmas rather than the sentence itself.  
* Record the number of instances of each sense of *film* predicted by this algorithm.

In [None]:
dvd_reader = AmazonReviewCorpusReader().category("dvd")
sentences=dvd_reader.sample_raw_sents(1000)
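The filtering and tallying steps could be organised along these lines. This is a rough sketch that assumes each sampled sentence is a plain string; `sentences_with_lemma`, `count_predictions`, `toy_lemmatise` and `toy_disambiguate` are hypothetical names, and the toy sentences are invented. In the lab you would pass in your own lemmatiser and your AdaptedLesk function instead.

```python
from collections import Counter

def sentences_with_lemma(sentences, lemma, lemmatise):
    """Return the lemma lists of the sentences that contain the target lemma."""
    selected = []
    for sentence in sentences:
        lemmas = [lemmatise(tok.lower()) for tok in sentence.split()]
        if lemma in lemmas:
            selected.append(lemmas)
    return selected

def count_predictions(lemma_lists, disambiguate):
    """Tally how often each sense is predicted over a list of contexts."""
    return Counter(disambiguate(lemmas) for lemmas in lemma_lists)

# Toy stand-ins so that the sketch runs without the corpus or WordNet
toy_sentences = ["I loved the film", "Great films all round", "The disc arrived scratched"]
toy_lemmatise = lambda tok: tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok
toy_disambiguate = lambda lemmas: "sense A" if "loved" in lemmas else "sense B"

film_sentences = sentences_with_lemma(toy_sentences, "film", toy_lemmatise)
print(len(film_sentences))   # 2
print(count_predictions(film_sentences, toy_disambiguate))
```

Keeping the lemma lists (rather than the raw sentences) means the same contexts can be reused later with a different disambiguation method.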

### Exercise 2.3
Inspect some of the individual predictions for your film sentences (at least one for each sense predicted).  Do you agree with the sense prediction?

## Minimising the Distance in the Semantic Hierarchy
This WSD method is based on the intuition that the concepts mentioned in a sentence will be close together in the hyponym hierarchy.

### Exercise 3.1
Write a function `max_sim(word, contextlemmas, pos)` which will choose the sense of a *word* given the lemmas of its context *sentence*, using a WordNet-based semantic similarity measure (see Lab_5_1).  You can assume that the part of speech of the word is known and is supplied to the function as another argument.

Within the function, 
1. For each **sense** of the word under consideration:
* compute its semantic similarity with each context **lemma** of the same part of speech.  For each context lemma you will need to consider each of its **senses** (and take the maximum similarity).  Therefore, you will need a triple nested loop! 
* sum the semantic similarities over the sentence
2. Choose the **sense** with the maximum sum.
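The loop structure described above might look something like the sketch below. To keep it runnable without WordNet, the sense lookup and the similarity measure are passed in as the extra, hypothetical arguments `get_senses` and `similarity`; in the lab you could, for example, supply `wn.synsets` and `wn.path_similarity`, or whichever measure from Lab_5_1 you prefer. Treating a missing similarity (`None`) as 0 is an assumption of this sketch.

```python
def max_sim_sketch(word, contextlemmas, pos, get_senses, similarity):
    """Pick the sense of `word` with the largest summed similarity to the context.

    Returns (best sense, summed score), or (None, -inf) if the word has no senses.
    """
    best_sense, best_total = None, float("-inf")
    for sense in get_senses(word, pos):                   # loop 1: senses of the target word
        total = 0.0
        for lemma in contextlemmas:                       # loop 2: context lemmas
            best = 0.0
            for other in get_senses(lemma, pos):          # loop 3: senses of the context lemma
                score = similarity(sense, other)
                if score is not None and score > best:    # some measures can return None
                    best = score
            total += best                                 # sum of per-lemma maxima
        if total > best_total:
            best_sense, best_total = sense, total
    return best_sense, best_total

# Toy sense inventory and similarity table, invented purely for illustration
toy_senses = {"bank": ["bank_money", "bank_river"], "river": ["river_1"], "water": ["water_1"]}
toy_table = {("bank_river", "river_1"): 0.5, ("bank_river", "water_1"): 0.25,
             ("bank_money", "water_1"): 0.125}

print(max_sim_sketch("bank", ["river", "water"], "n",
                     get_senses=lambda lemma, pos: toy_senses.get(lemma, []),
                     similarity=lambda a, b: toy_table.get((a, b))))
# ('bank_river', 0.75)
```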

Test your function on the bank sentences.  You should find, disappointingly for the method, that the first sentence has a maximum score of 2.71 with "an arrangement of similar objects in a row or in tiers" and the second sentence has a maximum score of 4.68 with "an arrangement of similar objects in a row or in tiers".

### Exercise 3.2
* Run your max_sim function on all of your film sentences and record the number of predictions for each sense.
* Inspect some of the individual predictions.
* Compare the results with those from the AdaptedLesk algorithm and draw some conclusions.