# Simple Lesk Algorithm (Method 2)

__Anish Sachdeva (DTU/2K16/MC/013)__

__Natural Language Processing (Dr. Seba Susan)__

In the simple Lesk algorithm we are given a word together with its gloss, which here means the surrounding words/tokens of that word, and we need to disambiguate the word using the gloss. We compute an IDF (inverse document frequency) style weight for every word in the gloss, use these weights to score each possible sense of the word, and pick the highest-scoring sense.
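For reference, the weighting can be written out explicitly. The textbook inverse document frequency of a token $w$ over $N$ documents, $n_w$ of which contain $w$, is

$$\mathrm{idf}(w) = \log\frac{N}{n_w}$$

The `simple_lesk` function defined later in this notebook treats each candidate sense (its definition plus its examples) as a document and uses a variant of this idea. With $N_t$ the number of senses and $n_w$ the number of senses that contain $w$, the code appears to compute

$$N_w = 1 + N_t \, n_w, \qquad \mathrm{weight}(s) = \sum_{w \in G \cap T_s} \log\frac{N_w}{N_t}$$

where $G$ is the set of gloss tokens and $T_s$ is the set of tokens in sense $s$. The symbols $n_w$, $G$ and $T_s$ are notation introduced here for illustration; $N_t$ and $N_w$ follow the variable names in the code. Unlike textbook IDF, this weight grows with $n_w$.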

In [6]:
# importing the required packages
import pprint

import numpy as np
import nltk
from nltk.corpus import wordnet
# nltk.download('wordnet')
from nltk.corpus import stopwords
# nltk.download('stopwords')

We now define the `tokenize` method, which takes a string and returns its set of tokens with stop words and the target word removed.

In [3]:
stopwords_en = set(stopwords.words('english'))


def tokenize(document: str, word: str) -> set:
    # obtaining tokens from the gloss
    tokenizer = nltk.RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(document)

    # removing stop words and non-alphabetic tokens (e.g. numbers)
    tokens = [token for token in tokens if token not in stopwords_en and token.isalpha()]

    # removing the word from the tokens
    tokens = [token for token in tokens if token != word]
    return set(tokens)
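To see what this filtering does without downloading any NLTK data, here is a minimal dependency-free sketch. It assumes that `re.findall(r'\w+', ...)` behaves like the `RegexpTokenizer` above, and it uses a tiny hand-picked stop-word set; `TOY_STOPWORDS` and `toy_tokenize` are hypothetical names for illustration, not part of the notebook.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook itself uses NLTK's full English stop-word list.
TOY_STOPWORDS = {"i", "me", "a", "of", "in", "the", "is", "my"}


def toy_tokenize(document: str, word: str) -> set:
    # split on runs of word characters, like the \w+ pattern above
    tokens = re.findall(r"\w+", document)
    # drop stop words, non-alphabetic tokens and the target word itself
    return {t for t in tokens
            if t.isalpha() and t not in TOY_STOPWORDS and t != word}


tokens = toy_tokenize("i love me a hot cup of java in the morning", "java")
print(sorted(tokens))  # ['cup', 'hot', 'love', 'morning']
```

Returning a set means duplicate tokens and word order are discarded, which is all the overlap test in the next step needs.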

We now define the simple Lesk algorithm, which takes the gloss and the word and returns the disambiguated sense from the WordNet corpus together with the weight computed for each candidate sense.

In [8]:
def simple_lesk(gloss: str, word: str):
    """:returns the sense best suited to the given word as per the Simple Lesk algorithm, and the weights of all candidate senses"""

    # converting everything to lowercase
    gloss = gloss.lower()
    word = word.lower()

    # obtaining tokens from the gloss
    gloss_tokens = tokenize(gloss, word)

    # calculating the word sense disambiguation using simple LESK
    synsets = wordnet.synsets(word)
    weights = [0] * len(synsets)
    N_t = len(synsets)
    N_w = {}

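    # Counting, for each gloss token, the senses whose definition or examples contain it
    # (substring match); the count starts at 1 and grows by N_t for every matching sense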
    for gloss_token in gloss_tokens:
        N_w[gloss_token] = 1

        for sense in synsets:
            if gloss_token in sense.definition():
                N_w[gloss_token] += N_t
                continue

            for example in sense.examples():
                if gloss_token in example:
                    N_w[gloss_token] += N_t
                    break

    for index, sense in enumerate(synsets):
        # adding tokens from examples into the comparison set
        comparison = set()
        for example in sense.examples():
            for token in tokenize(example, word):
                comparison.add(token)

        # adding tokens from definition into the comparison set
        for token in tokenize(sense.definition(), word):
            comparison.add(token)

        # comparing the gloss tokens with comparison set
        for token in gloss_tokens:
            if token in comparison:
                weights[index] += np.log(N_w[token] / N_t)

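    # guarding against words that have no senses in WordNet
    if not synsets:
        raise ValueError(f'no WordNet senses found for {word!r}')

    # picking the sense with the highest weight (the first one in case of a tie)
    max_weight = max(weights)
    index = weights.index(max_weight)
    return synsets[index], weights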
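The scoring logic above can also be tried without WordNet. The following is a rough sketch under stated assumptions: the sense descriptions in `TOY_SENSES` are hand-written stand-ins rather than WordNet data, `TOY_STOPWORDS`, `toy_tokens` and `toy_lesk` are hypothetical names, and tokens are matched as whole words, whereas the function above uses a substring test when counting.

```python
import math
import re

# Hypothetical sense inventory: hand-written stand-ins, not WordNet entries.
TOY_SENSES = {
    "island": "an island in indonesia south of borneo",
    "beverage": "a drink brewed from ground coffee beans served in a cup",
    "language": "an object oriented programming language",
}
# Hypothetical stop-word list, much smaller than NLTK's.
TOY_STOPWORDS = {"a", "an", "in", "of", "the", "from", "is", "my", "i", "me"}


def toy_tokens(text: str, word: str) -> set:
    # lowercase, split on word characters, drop stop words and the target word
    return {t for t in re.findall(r"\w+", text.lower())
            if t not in TOY_STOPWORDS and t != word}


def toy_lesk(context: str, word: str):
    context_tokens = toy_tokens(context, word)
    n_t = len(TOY_SENSES)
    sense_tokens = {name: toy_tokens(text, word)
                    for name, text in TOY_SENSES.items()}
    scores = {}
    for name, tokens in sense_tokens.items():
        score = 0.0
        for token in context_tokens & tokens:
            # same counting rule as above: start at 1, add n_t per sense containing the token
            n_w = 1 + n_t * sum(token in other for other in sense_tokens.values())
            score += math.log(n_w / n_t)
        scores[name] = score
    # highest-scoring sense wins; ties go to the first sense listed
    return max(scores, key=scores.get), scores


best, scores = toy_lesk("I love me a hot cup of java in the morning", "java")
print(best)  # beverage
best2, scores2 = toy_lesk("java is my favourite programming language", "java")
print(best2)  # language
```

In the first call only `cup` overlaps with a sense, so one sense gets a non-zero score; in the second call both `programming` and `language` overlap with the same sense, so its score is twice as large.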

In [9]:
# We now test this code with our own gloss and word (you can replace them with any word or gloss of your choice)
gloss = 'I love me a hot cup of java in the morning'
word = 'java'
sense, weights = simple_lesk(gloss, word)
print('The disambiguated meaning is:', sense.definition())
print('The weight vector is:', weights)

The disambiguated meaning is: a beverage consisting of an infusion of ground coffee beans
The weight vector is: [0, 0.28768207245178085, 0]


In [10]:
# another test
gloss = 'java is my favourite programming language'
sense, weights = simple_lesk(gloss, word)
print('The disambiguated meaning is:', sense.definition())
print('The weight vector is:', weights)

The disambiguated meaning is: a platform-independent object-oriented programming language
The weight vector is: [0, 0, 0.5753641449035617]
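The printed weights can be checked by hand. The sketch below assumes three senses (the length of the printed weight vector) and that every matching token occurs in exactly one sense, so that `N_w = 1 + N_t`; which tokens matched is an inference from the outputs (presumably `cup` in the first run, `programming` and `language` in the second), not something the notebook prints.

```python
import math

n_t = 3              # number of senses, inferred from the length of the weight vector
n_w = 1 + n_t * 1    # a token found in exactly one sense, per the counting loop above

single = math.log(n_w / n_t)   # contribution of one matching token
print(round(single, 4))        # 0.2877
print(round(2 * single, 4))    # 0.5754
```

Under these assumptions the first run corresponds to one matching token and the second to two, which is consistent with the two weight vectors shown above.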


So, using the simple Lesk algorithm, we can disambiguate the meaning of a word given its gloss.