## Exercise 1.2
### Word Sense Disambiguation with WordNet
#### Francesco Sannicola

Import the required libraries, including *nltk* and *string*.

In [1]:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

import nltk
import string

The following function parses the SemCor corpus, which is manually annotated with WordNet senses.

After reading the XML file, it repairs the file's malformed XML (attribute values are not quoted) and extracts all the sentences. The `re` and `lxml` packages handle the regular expressions and the XML parsing, respectively.

It then selects the words to disambiguate, namely nouns with more than one WordNet sense, and extracts their gold (hand-annotated) sense number from the `wnsn` attribute.

In [2]:
import re
from lxml import etree as Exml

def xml_parse(file):
    """
    :param file: the path to the XML file
    :return: list of (sentence, [(word, gold)]) tuples
    """
    with open(file, 'r') as XML_file:
        data = XML_file.read()
        
        # Remove all newlines
        data = data.replace('\n', '')
        
        # Wrap unquoted attribute values in double quotes so the XML becomes well-formed
        rep = re.compile(r"=([\w|:\-$()']*)")
        data = rep.sub(r'="\1"', data)

        res = []
        try:
            
            complete_xml_processed = Exml.XML(data)
            
            # Obtain all paragraphs
            par = complete_xml_processed.findall("./context/p")
            sentences = []
            
            # For each paragraph, take all sentences
            for p in par:
                sentences.extend(p.findall("./s"))
            
            # Process all sentences
            for sentence in sentences:
                
                # Take all words in a sentence
                words = sentence.findall('wf')
                complete_sentence = ""
                tuples = []
                
                for word in words:
                    
                    # lxml exposes the part of speech (PoS) as an attribute
                    w = word.text
                    pos = word.attrib['pos']
                    complete_sentence = complete_sentence + w + ' '
                    
                    # Keep only ambiguous single-word nouns that carry a sense id
                    if pos == 'NN' and len(wn.synsets(w)) > 1 and 'wnsn' in word.attrib and '_' not in w:
                        sense_id = word.attrib['wnsn']
                        tuples.append((w, sense_id))
                        
                res.append((complete_sentence, tuples))
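        except Exception as e:
            # Re-raise with the file name, keeping the original error as the cause
            raise ValueError(f"Could not parse {file}: {e}") from e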
    return res
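
The attribute-quoting step can be tried in isolation. The sketch below applies the same pattern to a single hand-written line in the SemCor style (the sample line and the helper name `quote_attributes` are assumptions made for illustration) and parses the result with the standard library's `xml.etree.ElementTree` rather than `lxml`, only to keep the example dependency-free.

```python
import re
import xml.etree.ElementTree as ET

# Same character class as in xml_parse: word characters plus | : - $ ( ) '
ATTR_VALUE = re.compile(r"=([\w|:\-$()']*)")

def quote_attributes(markup):
    """Wrap every unquoted attribute value in double quotes (hypothetical helper)."""
    return ATTR_VALUE.sub(r'="\1"', markup)

# Illustrative line in the SemCor style, not read from the corpus file
raw = "<s snum=1><wf cmd=done pos=NN lemma=jury wnsn=1 lexsn=1:14:00::>jury</wf></s>"
fixed = quote_attributes(raw)
print(fixed)

# Once the values are quoted, a standard XML parser accepts the markup
sentence = ET.fromstring(fixed)
for wf in sentence.findall("wf"):
    print(wf.text, wf.attrib["pos"], wf.attrib["wnsn"])
```

Note that the pattern rewrites every `=` it finds, so an `=` inside the text of a word would be altered too; this is a limitation of the regex approach rather than of the parser.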

Implementation of the *bag of words* approach.

Given a sentence, the function returns its bag of words: an unordered set of words that ignores their exact position in the sentence.

A preprocessing phase removes stop words and punctuation; the remaining tokens are then lemmatized.

In [3]:
def bag_of_word(sentence):
    """
    :param sentence: sentence
    :return: bag of words
    """
    stop_words = set(stopwords.words('english'))

    wordnet_lemmatizer = nltk.WordNetLemmatizer()
    
    # tokenization of a given sentence
    tokens = nltk.word_tokenize(sentence)
    
    # delete punctuation and stop words
    tokens = list(filter(lambda x: x not in stop_words and x not in string.punctuation, tokens))
    
    # set of lemmas (unordered, no duplicates)
    res = set(wordnet_lemmatizer.lemmatize(token) for token in tokens)
    
    return res
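
To see the filtering step without downloading any NLTK data, here is a minimal sketch. It is not the function above: the stop-word list is a tiny hypothetical one, the tokenizer is a plain regular expression, and no lemmatization is performed.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and", "is", "to"}

def toy_bag_of_words(sentence):
    """Return the set of tokens that are neither stop words nor punctuation."""
    # Split into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return {
        tok for tok in tokens
        if tok not in TOY_STOP_WORDS and tok not in string.punctuation
    }

bag = toy_bag_of_words("the jury praised the work of the committee, and the jury left.")
print(sorted(bag))
# ['committee', 'jury', 'left', 'praised', 'work']
```

Tokens are compared as they appear, so a capitalised "The" would not match the lowercase stop word "the" and would stay in the bag. The same holds for `bag_of_word` above, since it does not lowercase its tokens either.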

### Lesk's algorithm implementation

Given a word and a sentence that contains it, the function returns the best sense of that word.


In [4]:
def lesk(word, sentence):
    """
    :param word: word to disambiguate
    :param sentence: sentence to compare
    :return: best sense for the given word
    """

    max_overlap = 0
    senses = wn.synsets(word)
    
    # set word's first sense as the best
    best_sense = senses[0]
    
    # compute the bag of words (BoW) of the sentence, used as context
    context = bag_of_word(sentence)

    for sense in senses:
        
        # compute the BoW of the sense definition
        signature = bag_of_word(sense.definition())

        # add the BoW of each usage example (set union, so no duplicates)
        examples = sense.examples()
        for ex in examples:
            signature = signature.union(bag_of_word(ex))

        # overlap between the BoW of the signature and the BoW of the context
        overlap = len(signature & context)
        
        # keep the sense with the largest overlap so far
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense

    return best_sense
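
The overlap rule can be followed by hand on a toy example. The sketch below assumes a hypothetical two-sense inventory for "bank" whose signatures are already bags of words; the sense names and glosses are invented for illustration and do not come from WordNet.

```python
def simplified_lesk(context, sense_signatures):
    """Pick the sense whose signature shares the most words with the context.

    context: set of words taken from the sentence
    sense_signatures: list of (sense_name, set_of_words), in sense order
    """
    # Default to the first sense, as the function above does
    best_sense, max_overlap = sense_signatures[0][0], 0
    for name, signature in sense_signatures:
        overlap = len(signature & context)
        # Strict comparison: on a tie the earlier sense is kept
        if overlap > max_overlap:
            best_sense, max_overlap = name, overlap
    return best_sense, max_overlap

# Hypothetical signatures, not WordNet glosses
toy_senses = [
    ("bank_1", {"financial", "institution", "accepts", "deposits", "money"}),
    ("bank_2", {"sloping", "land", "beside", "river", "water"}),
]
context = {"sat", "bank", "river", "watched", "water"}

print(simplified_lesk(context, toy_senses))
# ('bank_2', 2)
```

When no signature shares a word with the context, the first sense is returned with an overlap of 0, which makes the method fall back to a first-sense baseline.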

This function runs the Lesk algorithm on the parsed corpus and compares the inferred senses with the gold annotations. Only the first selected noun of each sentence is evaluated.

In [5]:
def word_sense_disambiguation(xml_file_parsed, limit):
    '''
    :param xml_file_parsed: parsed corpus, a list of (sentence, [(word, gold)]) tuples
    :param limit: maximum number of sentences to process
    :return: number of correct senses, number of terms analyzed
    '''
    right = 0
    count = 0
    terms_analyzed = 0 

    for instance in xml_file_parsed:
        if count == limit:
            break
        
        if len(instance[1]) > 0:
            # NB instance is a tuple ('sentence', [('word', 'number'), ...])
            # compute the best sense of the first noun using Lesk
            my_sense = lesk(instance[1][0][0], instance[0])
            
            # 1-based position of the chosen sense in the word's full list of synsets
            index_val = wn.synsets(instance[1][0][0]).index(my_sense) + 1

            # obtain gold
            annotated_index = int(instance[1][0][1])
        
            if index_val == annotated_index:
                right += 1
                
            terms_analyzed += 1
        count += 1
        
    if terms_analyzed == 0:
        terms_analyzed = 1
    
    return right, terms_analyzed
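
The scoring logic reduces to comparing a 1-based position with the annotated sense number. A minimal sketch on invented data follows; the sense names, the chosen senses, and the gold numbers are all hypothetical.

```python
def sense_number(ordered_senses, chosen):
    """Return the 1-based position of `chosen` in `ordered_senses`."""
    return ordered_senses.index(chosen) + 1

def score(instances):
    """Count correct predictions.

    instances: iterable of (ordered_senses, chosen_sense, gold_number_as_string)
    """
    right = total = 0
    for ordered_senses, chosen, gold in instances:
        # The gold annotation is stored as a string, so convert before comparing
        if sense_number(ordered_senses, chosen) == int(gold):
            right += 1
        total += 1
    return right, total

# Hypothetical predictions and gold sense numbers
toy = [
    (["bank.n.01", "bank.n.02"], "bank.n.02", "2"),
    (["jury.n.01", "jury.n.02"], "jury.n.01", "2"),
    (["term.n.01", "term.n.02", "term.n.03"], "term.n.01", "1"),
]
right, total = score(toy)
print(right, total)
# 2 3
```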

In [6]:
import random
from prettytable import PrettyTable

iterations = 10
limit = 50
percentuals_sum = 0

xml_file_parsed = xml_parse("./Input/br-a01")
table = PrettyTable(["Iteration", "Accuracy", "Right", "Total"])

for i in range(iterations):
    
    right, total = word_sense_disambiguation(xml_file_parsed, limit)
    acc = right/total
    
    table.add_row([i, round(acc, 4), right, total])
    
    percentuals_sum += acc
    
    # Shuffle all the instances of the file (NB an element is (sentence, [(word, gold)]))
    random.shuffle(xml_file_parsed)
                   
print(table)

print("Accuracy mean:", percentuals_sum/iterations)


+-----------+----------+-------+-------+
| Iteration | Accuracy | Right | Total |
+-----------+----------+-------+-------+
|     0     |  0.5417  |   26  |   48  |
|     1     |   0.6    |   30  |   50  |
|     2     |  0.6122  |   30  |   49  |
|     3     |  0.6327  |   31  |   49  |
|     4     |  0.6122  |   30  |   49  |
|     5     |  0.5918  |   29  |   49  |
|     6     |   0.56   |   28  |   50  |
|     7     |  0.617   |   29  |   47  |
|     8     |  0.6383  |   30  |   47  |
|     9     |  0.5833  |   28  |   48  |
+-----------+----------+-------+-------+
Accuracy mean: 0.5989298740772905


The mean accuracy is about 60% (0.599).

This is not a bad result considering the difficulty of the task.
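
As a quick check, the figures in the table above can be aggregated in two ways: the mean of the per-iteration accuracies (what the notebook prints) and the pooled accuracy over all analyzed terms. The sketch below copies the `Right` and `Total` columns from the table; the variable names are chosen for illustration.

```python
from statistics import mean

# "Right" and "Total" columns copied from the output table
right = [26, 30, 30, 31, 30, 29, 28, 29, 30, 28]
total = [48, 50, 49, 49, 49, 49, 50, 47, 47, 48]

# Mean of the per-iteration accuracies
per_run = [r / t for r, t in zip(right, total)]
macro = mean(per_run)

# Pooled accuracy: all correct answers over all analyzed terms
micro = sum(right) / sum(total)

print(round(macro, 4), round(micro, 4))
# 0.5989 0.5988
```

The two values nearly coincide because every iteration analyzes a similar number of terms.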