
# Practical 11: Word Sense Disambiguation


## Objectives

- Construct models to perform word sense disambiguation.


## WordNet

WordNet is a lexical database for the English language, created at Princeton University and distributed as one of the NLTK corpora. It is a machine-readable database of words that can be accessed from most popular programming languages (C, C#, Java, Ruby, Python, etc.). WordNet superficially resembles a thesaurus in that it groups words together based on their meanings.

WordNet is not like a traditional dictionary. It focuses on the relationships between words as well as their definitions, which makes WordNet a network rather than a list. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.

In the WordNet network, the words are connected by linguistic relations. These linguistic relations (hypernym, hyponym, meronym, holonym and other fancy sounding stuff), are WordNet’s secret sauce. They give you powerful capabilities that are missing in an ordinary dictionary/thesaurus.

Now, let's get into the implementation and usage of WordNet in Python!

### 1) Synonyms
WordNet stores synonyms in the form of synsets where each word in the synset shares the same meaning. Basically, each synset is a group of synonyms. Each synset has a definition associated with it. Relations are stored between different synsets.

Take the word ‘sofa’ as an example. There is only one synset for ‘sofa’, which means that it has only one context or meaning. A word like ‘jupiter’ gives two synsets because it has two meanings: one as the planet and the other as the Roman god.

~~~Python
from nltk.corpus import wordnet as wn

# get synsets of sofa
print (wn.synsets('sofa'))

# get synsets of jupiter
syns = wn.synsets('jupiter')
print (syns)

# definition of first synset
print (syns[0].definition())

# definition of second synset
print (syns[1].definition())
~~~

In [None]:
# Enter code here


### 2) Hyponyms and Hypernyms
Hyponyms are more specific concepts and hypernyms are more general concepts.

For example, ‘beach house’ and ‘guest house’ are hyponyms of ‘house’: they are more specific kinds of ‘house’. Conversely, ‘house’ is a hypernym of ‘guest house’ because it is the more general concept.

‘Egg Noodle’ is a hyponym of ‘noodle’ and ‘pasta’ is a hypernym of ‘noodle’.

~~~Python
# get hyponyms of noodle
print (wn.synsets('noodle')[0].hyponyms())

# get hypernyms of noodle
print (wn.synsets('noodle')[0].hypernyms())

# definition of egg noodle
print (wn.synset('egg_noodle.n.01').definition())

# definition of pasta
print (wn.synset('pasta.n.01').definition())
~~~

In [None]:
# Enter code here
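
To make the direction of the two relations concrete, here is a minimal sketch that does not call WordNet at all. The `HYPERNYM` table and both helper functions are hypothetical and hand-written for illustration; the real hierarchy in WordNet is larger and may link these words differently.

```python
# Toy "is-a" links (child -> parent). Hypothetical data, not read from WordNet.
HYPERNYM = {
    "egg_noodle": "noodle",
    "noodle": "pasta",
    "pasta": "dish",
    "dish": "food",
}

def hypernym_chain(word):
    """Follow is-a links upward and return the path from `word` to the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def hyponyms(word):
    """Return the direct children of `word`, i.e. its more specific concepts."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)

print(hypernym_chain("egg_noodle"))  # increasingly general concepts
print(hyponyms("noodle"))            # more specific concepts
```

Walking upward always yields more general concepts, while looking for children yields more specific ones, which mirrors what `hypernyms()` and `hyponyms()` return for a synset.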


### 3) Meronyms and Holonyms
Meronyms and holonyms represent the part-whole relationship. The meronym represents the part and the holonym represents the whole. For example, ‘kitchen’ is a meronym of ‘home’ (the kitchen is a part of the home), ‘mattress’ is a meronym of ‘bed’, and ‘bedroom’ is a holonym of ‘bed’.

~~~Python
# get holonyms of bed
print (wn.synsets('bed')[0].part_holonyms())

# get meronyms of bed
print (wn.synsets('bed')[0].part_meronyms())
~~~

In [None]:
# Enter code here


### 4) Entailments
An entailment is an implication. For example, looking implies seeing, and buying implies choosing and paying. The code below lists the synsets of ‘buy’ and prints the entailments of the second one.

~~~Python
# get entailments of buy
print (wn.synsets('buy'))
print (wn.synsets('buy')[1].entailments())
~~~

In [None]:
# Enter code here


### 5) Word Similarity
We can compute the similarity between two words based on the distance between words in the WordNet network. The smaller the distance, the more similar the words. In this way, it is possible to quantitatively figure out that a cat and a dog are similar, a phone and a computer are similar, but a cat and a phone are not similar!

~~~Python
# compute similarity between cat and dog
dog = wn.synsets('dog')[0]
cat = wn.synsets('cat')[0]
print (wn.path_similarity(dog, cat))

# compute similarity between phone and computer
phone = wn.synsets('phone')[0]
computer = wn.synsets('computer')[0]
print (wn.path_similarity(phone, computer))

# compute similarity between cat and phone
print (wn.path_similarity(phone, cat))
~~~

In [None]:
# Enter code here
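
The idea of "distance in the network" can be sketched without WordNet. The example below assumes the usual path-based definition, where the score is 1 / (1 + length of the shortest path between the two nodes). The `PARENT` taxonomy and the helper names are hypothetical and exist only for this illustration.

```python
# Toy taxonomy (child -> parent). Hypothetical data, not read from WordNet.
PARENT = {
    "dog": "carnivore",
    "cat": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "phone": "device",
    "computer": "device",
    "device": "entity",
}

def ancestors(node):
    """Map `node` and each of its ancestors to their distance (in edges) from `node`."""
    dist, steps = {}, 0
    while node is not None:
        dist[node] = steps
        node = PARENT.get(node)
        steps += 1
    return dist

def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), going through a common ancestor."""
    da, db = ancestors(a), ancestors(b)
    shortest = min(da[n] + db[n] for n in da if n in db)
    return 1 / (1 + shortest)

print(round(path_similarity("dog", "cat"), 3))
print(round(path_similarity("phone", "computer"), 3))
print(round(path_similarity("cat", "phone"), 3))
```

In this toy tree, ‘dog’ and ‘cat’ are two edges apart, while ‘cat’ and ‘phone’ only meet at the root and are five edges apart, so the second pair gets a much lower score.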


### Applications
There are several subfields in natural language processing which can benefit from having a large lexical database, especially one as big and extensive as WordNet. Many semantic applications can draw benefits from using WordNet, including Word Sense Disambiguation (WSD), question answering and sentiment analysis. WSD is the main application area: many papers have been published on WordNet and WSD, exploring different approaches and algorithms.

## Word Sense Disambiguation
One of the first things you realize when working with linguistic or textual data is just how ambiguous words are. Language is very contextual, and the meanings of words depend upon the context in which they are used.

To give a hint of how all this works, consider three examples of the distinct senses that exist for the word "bass":
1. a type of fish
2. tones of low frequency
3. a type of instrument

and the sentences:
1. I went fishing for some sea bass.
2. The bass line of the song is too weak.

To a human, it is obvious that the first sentence uses "**<em>bass (fish)</em>**", the first sense above, while the second sentence uses "**<em>bass (low tones)</em>**", the second sense above. Developing algorithms to replicate this human ability is often a difficult task.

Nevertheless, we have solved a number of very difficult problems to a reasonable degree of accuracy with computational approaches. Let's talk about one of the more naive approaches to word sense disambiguation, which actually does fairly well when given a reasonably large input.

But first, what is Word Sense Disambiguation all about? The sense of a word identifies how a given word is being used by associating it with one of its definitions. This is where the ambiguity problem comes in: how does a computer know how to treat a given input when each word has a number of different senses, some of which have wildly different connotations and usages? This one problem is a key building block for all sorts of more complex, interesting NLP systems, from sentiment analysis to machine translation.

## A Simple WSD Application
Next, we will create a simple program that decides which of two sentences a query relates to.
1. Give two sentences with some data.
2. Give a third sentence (the query), and the program will determine which of the first two sentences it relates to.

### Import libraries
~~~Python
import nltk

from nltk.tokenize import PunktSentenceTokenizer,sent_tokenize, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer
~~~

In [None]:
# Enter code here


The **simpleFilter** function takes the given query/sentence as input and returns a list of lemmatized tokens with the stopwords removed. Lemmatization refers to deriving the morphologically correct root word. It is often confused with stemming, but the two differ. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Stopwords are the high-frequency words in a language that do not contribute much to the topic of the sentence. In English, such words include ‘a’, ‘an’, ‘the’, ‘of’, ‘to’, etc. We remove these words and focus on the main subject/topic in order to resolve ambiguity. The function is applied to the training sentences as well as to the user's query.

~~~Python
def simpleFilter(sentence):
    filtered_sent = []
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(sentence)
    
    for w in words:
        if w not in stop_words:
            filtered_sent.append(lemmatizer.lemmatize(w))
    
    return filtered_sent
~~~

In [None]:
# Enter code here
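
The difference between the two approaches can be seen in a small sketch that avoids NLTK entirely. The stopword set, the `LEMMAS` vocabulary and both functions below are hypothetical stand-ins: `crude_stem` is far simpler than the Porter stemmer, and `toy_lemmatize` only knows the three words listed.

```python
# Hypothetical, hand-written resources for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "were", "is"}
LEMMAS = {"geese": "goose", "swimming": "swim", "ponies": "pony"}

def crude_stem(word):
    """Chop a few common suffixes without checking that the result is a word."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a vocabulary; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)

tokens = "the ponies were swimming to the geese".split()
content = [t for t in tokens if t not in STOP_WORDS]  # stopword removal

print(content)
print([crude_stem(t) for t in content])     # stems need not be real words
print([toy_lemmatize(t) for t in content])  # lemmas are dictionary forms
```

The stems ‘pon’ and ‘swimm’ are not dictionary words, and the irregular plural ‘geese’ is left untouched by suffix chopping, whereas the vocabulary lookup returns ‘pony’, ‘swim’ and ‘goose’.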


Next we perform a similarity check with the function **simlilarityCheck** (spelled as in the code below) on the filtered sentence tokens returned by the first function. Similarity is checked between the query tokens and the tokens of each training sentence. For each pair of tokens, the first noun synset of each word is loaded from the WordNet corpus and the Wu-Palmer similarity between the two synsets is returned. This score is based on the depth of the two synsets and of their closest common ancestor in the hypernym hierarchy, and lies on a scale of 0–1. If either word has no noun synset, the function returns 0. These scores are the main data used to resolve the ambiguity: the more text you provide, the more accurate the result tends to be. The summed similarity between the query and each sentence is stored.

~~~Python
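def simlilarityCheck(word1, word2):
    # Look up the first noun sense of each word, e.g. "bank" -> "bank.n.01"
    word1 = word1 + ".n.01"
    word2 = word2 + ".n.01"
    
    try:
        w1 = wordnet.synset(word1)
        w2 = wordnet.synset(word2)
        
        # Wu-Palmer similarity; fall back to 0 if no score is available
        return w1.wup_similarity(w2) or 0

    except Exception:
        # The lookup fails when a word has no noun synset in WordNet
        return 0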
~~~

In [None]:
# Enter code here
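
The Wu-Palmer score used above can be worked through by hand on a small tree. The sketch below assumes the standard formula, 2 × depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor shared by both nodes and the root has depth 1. The `PARENT` taxonomy and the function names are hypothetical; NLTK computes the same kind of score over the real WordNet hierarchy.

```python
# Toy taxonomy (child -> parent). Hypothetical data, not read from WordNet.
PARENT = {
    "dog": "carnivore",
    "cat": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "phone": "device",
    "device": "entity",
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))

def wup_similarity(a, b):
    """Wu-Palmer similarity on the toy tree."""
    ancestors_b = set(path_to_root(b))
    # The first ancestor of `a` that is also an ancestor of `b` is the deepest one.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("dog", "cat"))
print(round(wup_similarity("cat", "phone"), 3))
```

Here ‘dog’ and ‘cat’ share the ancestor ‘carnivore’ at depth 3, giving 2 × 3 / (4 + 4) = 0.75, while ‘cat’ and ‘phone’ only share the root, giving 2 × 1 / (4 + 3).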


**synonymsCreator** is a simple function that collects the synonyms of the given input word. It is used to store the synonyms of the data set and query tokens. These synonyms are also taken into consideration when performing the similarity check for the sentences.

~~~Python
def synonymsCreator(word):
    synonyms = []

    for syn in wordnet.synsets(word):
        for i in syn.lemmas():
            synonyms.append(i.name())

    return synonyms
~~~

In [None]:
# Enter code here


Once the similarity is stored, we apply the next-level filter, the function **filteredSentence**, which removes stop words again, stems each remaining token and then lemmatizes the stem. In the filtered sentence list, we now store each token together with its synonyms for more precise matching in the similarity check.

~~~Python
# Remove Stop Words. Word Stemming. Return new tokenised list.
def filteredSentence(sentence):
    filtered_sent = []
    lemmatizer = WordNetLemmatizer()   #lemmatizes the words
    ps = PorterStemmer()    #stemmer stems the root of the word.

    stop_words = set(stopwords.words("english"))
    words = word_tokenize(sentence)
    
    for w in words:
        if w not in stop_words:
            filtered_sent.append(lemmatizer.lemmatize(ps.stem(w)))
            
            for i in synonymsCreator(w):
                filtered_sent.append(i)

    return filtered_sent
~~~

In [None]:
# Enter code here


Next, we put all these together.

~~~Python
sent1 = input("Enter Sentence 1: ").lower()
sent2 = input("Enter Sentence 2: ").lower()
sent3 = input("Enter Query: ").lower()

# Running totals of the similarity between the query and each sentence
sent31_similarity = 0
sent32_similarity = 0

filtered_sent1 = simpleFilter(sent1)
filtered_sent2 = simpleFilter(sent2)
filtered_sent3 = simpleFilter(sent3)

for i in filtered_sent3:
    for j in filtered_sent1:
        sent31_similarity = sent31_similarity + simlilarityCheck(i, j)

    for j in filtered_sent2:
        sent32_similarity = sent32_similarity + simlilarityCheck(i, j)

# Second pass: stem, lemmatize and add synonyms for exact-match counting

filtered_sent1 = filteredSentence(sent1)
filtered_sent2 = filteredSentence(sent2)
filtered_sent3 = filteredSentence(sent3)

sent1_count = 0
sent2_count = 0

for i in filtered_sent3:
    for j in filtered_sent1:
        if(i == j):
            sent1_count = sent1_count + 1
            
    for j in filtered_sent2:
        if(i == j):
            sent2_count = sent2_count + 1

if((sent1_count + sent31_similarity) > (sent2_count+sent32_similarity)):
    print("Same synset as Sentence 1")
else:
    print("Same synset as Sentence 2")
~~~

Use the following to test: 

**Sentence 1**: "the commercial banks are used for finance. all the financial matters are managed by financial banks and they have lots of money, user accounts like salary account and savings account, current account. money can also be withdrawn from this bank."

**Sentence 2**: "the river bank has water in it and it has fishes trees. lots of water is stored in the banks. boats float in it and animals come and drink water from it."

**Query**: "from which bank should i withdraw money"


**Sentence 1**: "any of various nocturnal flying mammals of the order Chiroptera, having membranous wings that extend from the forelimbs to the hind limbs or tail and anatomical adaptations for echolocation, by which they navigate and hunt prey."

**Sentence 2**: "a cricket wooden bat is used for playing cricket. it is rectangular in shape and has handle and is made of wood or plastic and is used by cricket players."

**Query**: "which bat can fly?"

In [None]:
# Enter code here
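
The decision rule of the program (exact-match count plus summed similarity, compared across the two sentences) can be isolated in a short sketch. Everything here is a simplification under stated assumptions: `toy_similarity` and its `RELATED` pairs are hypothetical stand-ins for the WordNet-based score, the token lists are typed in by hand rather than produced by the filter functions, and the same tokens are used for both parts of the score.

```python
# Hypothetical pairs of related words, standing in for a WordNet-based score.
RELATED = {frozenset(("money", "finance")), frozenset(("withdraw", "account"))}

def toy_similarity(a, b):
    """Return 1.0 for identical words, 0.5 for listed related pairs, else 0.0."""
    if a == b:
        return 1.0
    return 0.5 if frozenset((a, b)) in RELATED else 0.0

def score(query_tokens, sentence_tokens, similarity):
    """Sum pairwise similarity plus one point for every exact token match."""
    sim_total = sum(similarity(q, s) for q in query_tokens for s in sentence_tokens)
    overlap = sum(1 for q in query_tokens for s in sentence_tokens if q == s)
    return sim_total + overlap

query = ["bank", "withdraw", "money"]
finance = ["commercial", "bank", "finance", "money", "account"]
river = ["river", "bank", "water", "boat", "animal"]

finance_score = score(query, finance, toy_similarity)
river_score = score(query, river, toy_similarity)
print(finance_score, river_score)
print("Sentence 1" if finance_score > river_score else "Sentence 2")
```

Passing the similarity function as a parameter keeps the comparison logic separate from the choice of measure, so a WordNet-based function could be swapped in without changing `score`.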


## Lesk Algorithm
Let's use the Lesk algorithm in NLTK to estimate the sense of the word in the given context sentence.

The Lesk algorithm is based on the assumption that words in a given "neighbourhood" (section of text) will tend to share a common topic. A simplified version of the Lesk algorithm compares the dictionary definition of each sense of an ambiguous word with the terms contained in its neighbourhood, and picks the sense whose definition overlaps the most.

The signature of the lesk function is:

> lesk(context_sentence, ambiguous_word, pos=None, synsets=None)

In the example below, the Lesk algorithm is somewhat accurate in understanding the context of the word ‘bass’ in the first and second sentences. For the third sentence, where ‘bass’ refers to a musical instrument, it returns Synset('sea_bass.n.01'), which is clearly not correct! Unfortunately, Lesk’s approach is very sensitive to the exact wording of definitions, so the absence of a certain word can radically change the results.

~~~Python
from nltk.wsd import lesk

for ss in wn.synsets('bass'):
    print (ss, ss.definition())

sentences = [
    'I went fishing for some sea bass.',
    'The bass line of the song is too weak.',
    'Avishai Cohen is an Israeli jazz musician. He plays double bass and is also a composer'
]

# get synset for each sentence
print()
print (lesk(sentences[0].split(), 'bass', 'n'))
print (lesk(sentences[1].split(), 'bass', 's'))
print (lesk(sentences[2].split(), 'bass', 'n'))
~~~

In [None]:
# Enter code here
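
To see why the overlap idea is so sensitive to wording, here is a minimal sketch of simplified Lesk that does not use NLTK. The `SENSES` glosses are hand-written and hypothetical (they are not WordNet definitions), and `simplified_lesk` is an illustrative helper, not the implementation behind `nltk.wsd.lesk`.

```python
# Hypothetical sense inventory with hand-written glosses.
SENSES = {
    "bass (fish)": "an edible fish found in the sea or in fresh water",
    "bass (tone)": "the lowest part in a song or piece of music",
    "bass (instrument)": "a large stringed instrument played by a musician",
}
STOP_WORDS = {"a", "an", "the", "of", "in", "or", "by", "is", "for", "some", "i", "he", "and"}

def content_words(text):
    """Lowercase, strip basic punctuation and drop stopwords."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}

def simplified_lesk(sentence, senses):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))

print(simplified_lesk("I went fishing for some sea bass.", SENSES))
print(simplified_lesk("The bass line of the song is too weak.", SENSES))
print(simplified_lesk("He plays double bass and is a jazz musician.", SENSES))
print(simplified_lesk("He plays double bass.", SENSES))
```

The last call shares no content word with any gloss (‘plays’ does not match ‘played’), so every sense scores zero and `max` simply returns the first entry, the fish sense. A single missing or differently inflected word is enough to change the answer.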
