## Exercise 4.2 – Word Sense Disambiguation with NLTK

### Use the Lesk algorithm provided by the NLTK Word Sense Disambiguation (WSD) module (`nltk.wsd`) to find the correct definition (i.e. WordNet’s synset definition) of the word “rock” in the following sentences:


#### ➢ “A rock is classified according to characteristics such as mineral and chemical composition”;

#### ➢ “Queen are a British rock band formed in London in 1970”.

In [1]:
#Import Libraries
from nltk.wsd import lesk
from nltk.corpus import wordnet as wn

In [2]:
#Different Definitions of "rock"
for ss in wn.synsets('rock'):
    print(ss, ss.definition())

Synset('rock.n.01') a lump or mass of hard consolidated mineral matter
Synset('rock.n.02') material consisting of the aggregate of minerals like those making up the Earth's crust
Synset('rock.n.03') United States gynecologist and devout Catholic who conducted the first clinical trials of the oral contraceptive pill (1890-1984)
Synset('rock.n.04') (figurative) someone who is strong and stable and dependable; ; --Gospel According to Matthew
Synset('rock_candy.n.01') hard bright-colored stick candy (typically flavored with peppermint)
Synset('rock_'n'_roll.n.01') a genre of popular music originating in the 1950s; a blend of black rhythm-and-blues with white country-and-western
Synset('rock.n.07') pitching dangerously to one side
Synset('rock.v.01') move back and forth or sideways
Synset('rock.v.02') cause to move back and forth


#### `lesk` performs the classic Lesk algorithm for Word Sense Disambiguation (WSD) using the definitions of the ambiguous word. Given an ambiguous word and the context in which it occurs, it returns the Synset whose definition shares the highest number of words with the context sentence.
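The overlap counting described above can be sketched in plain Python. This is a minimal illustration, not NLTK's actual code: the helper names `overlap` and `simple_lesk` are hypothetical, and only two of the glosses listed earlier are copied in by hand so that the snippet runs without WordNet data.

```python
def overlap(context_tokens, gloss):
    """Tokens shared by the context and a whitespace-split gloss."""
    return set(context_tokens) & set(gloss.split())


def simple_lesk(context_tokens, glosses):
    """Return the sense name whose gloss overlaps most with the context."""
    return max(glosses, key=lambda name: len(overlap(context_tokens, glosses[name])))


# Two of the WordNet glosses for "rock" listed earlier, copied by hand.
GLOSSES = {
    "rock.n.01": "a lump or mass of hard consolidated mineral matter",
    "rock.n.04": "(figurative) someone who is strong and stable and dependable; ; --Gospel According to Matthew",
}

context = "A rock is classified according to characteristics such as mineral and chemical composition".split()

for name, gloss in GLOSSES.items():
    print(name, sorted(overlap(context, gloss)))
# rock.n.01 ['mineral']
# rock.n.04 ['and', 'is', 'to']

print(simple_lesk(context, GLOSSES))
# rock.n.04
```

Note that the comparison is on raw tokens, so function words such as "is", "and" and "to" count as much as a content word like "mineral".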

In [3]:
#Tokenizing the sentence
sent1 = 'A rock is classified according to characteristics such as mineral and chemical composition'.split()

In [4]:
#Looks for the most appropriate Synset given the word in the specific context
lesk(sent1, 'rock')

Synset('rock.n.04')

In [5]:
#Prints the definition of the resulting Synset
print(wn.synset('rock.n.04').definition())

(figurative) someone who is strong and stable and dependable; ; --Gospel According to Matthew


In [6]:
sent2='Queen are a British rock band formed in London in 1970'.split()
lesk(sent2, 'rock')

Synset('rock_'n'_roll.n.01')

In [7]:
print(wn.synset("rock_'n'_roll.n.01").definition())

a genre of popular music originating in the 1950s; a blend of black rhythm-and-blues with white country-and-western


### Observation:

#### For the first sentence, Lesk returns the wrong synset (`rock.n.04`, the figurative sense instead of the mineral one). For the second sentence, it correctly identifies the music genre (`rock_'n'_roll.n.01`).

### Let's try removing stopwords and then running Lesk again

In [9]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_word=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/azima/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
#Removing Stopwords from the First Sentence
#(the comparison is case-sensitive, so the capitalized 'A' is kept)
new_sent1=[w for w in sent1 if w not in stop_word]
print(new_sent1)

['A', 'rock', 'classified', 'according', 'characteristics', 'mineral', 'chemical', 'composition']
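The leading 'A' survives the filter because NLTK's English stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. `STOP_WORDS` is a small hand-picked subset used here as an assumption so the snippet runs without NLTK data; it is not the full `stopwords.words('english')` list.

```python
# Hand-picked subset of English stopwords (assumption, for illustration only).
STOP_WORDS = {"a", "is", "to", "such", "as", "and"}

sent1 = "A rock is classified according to characteristics such as mineral and chemical composition".split()

# Case-sensitive filter, as in the cell above: 'A' != 'a', so 'A' stays.
case_sensitive = [w for w in sent1 if w not in STOP_WORDS]

# Case-insensitive variant: compare the lowercased token, keep the original form.
case_insensitive = [w for w in sent1 if w.lower() not in STOP_WORDS]

print(case_sensitive)
# ['A', 'rock', 'classified', 'according', 'characteristics', 'mineral', 'chemical', 'composition']
print(case_insensitive)
# ['rock', 'classified', 'according', 'characteristics', 'mineral', 'chemical', 'composition']
```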


In [14]:
#Removing Stopwords from the Second Sentence
new_sent2=[w for w in sent2 if w not in stop_word]
print(new_sent2)

['Queen', 'British', 'rock', 'band', 'formed', 'London', '1970']


In [21]:
print("For First Sentence meaning of 'rock': ",lesk(new_sent1, 'rock').definition())
print("For Second Sentence meaning of 'rock': ",lesk(new_sent2, 'rock').definition())

For First Sentence meaning of 'rock':  a lump or mass of hard consolidated mineral matter
For Second Sentence meaning of 'rock':  hard bright-colored stick candy (typically flavored with peppermint)


### Observation

#### Here we observe the opposite result. After removing the stopwords, the synset for the first sentence is correctly identified (`rock.n.01`), whereas the second sentence now gets the wrong one (`rock_candy.n.01`).
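One way to see why the second result changes is to count the overlaps by hand. The sketch below copies a few of the glosses listed earlier and scores the second sentence with and without stopwords. The helpers `scores` and `pick` are hypothetical, and the tie-break (largest sense name wins) is an assumption about how ties are resolved. It is consistent with the output above but is not taken from the NLTK documentation.

```python
# WordNet glosses for "rock", copied by hand from the listing earlier in this notebook.
GLOSSES = {
    "rock.n.01": "a lump or mass of hard consolidated mineral matter",
    "rock.n.02": "material consisting of the aggregate of minerals like those making up the Earth's crust",
    "rock_candy.n.01": "hard bright-colored stick candy (typically flavored with peppermint)",
    "rock_'n'_roll.n.01": "a genre of popular music originating in the 1950s; a blend of black rhythm-and-blues with white country-and-western",
    "rock.v.01": "move back and forth or sideways",
}


def scores(context_tokens):
    """Number of tokens each gloss shares with the context."""
    context = set(context_tokens)
    return {name: len(context & set(gloss.split())) for name, gloss in GLOSSES.items()}


def pick(context_tokens):
    """Highest overlap wins; ties go to the largest name string (assumed tie-break)."""
    return max((count, name) for name, count in scores(context_tokens).items())[1]


with_stop = "Queen are a British rock band formed in London in 1970".split()
without_stop = ["Queen", "British", "rock", "band", "formed", "London", "1970"]

print(scores(with_stop)["rock_'n'_roll.n.01"], pick(with_stop))
# 2 rock_'n'_roll.n.01   (the two shared tokens are 'a' and 'in')

print(set(scores(without_stop).values()), pick(without_stop))
# {0} rock_candy.n.01    (every gloss scores 0, so the tie-break decides)
```

Under these assumptions, the correct answer for the full sentence rests only on the stopwords "a" and "in", and once they are removed no gloss overlaps with the context at all.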

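### We should not rely on removing stopwords before calling Lesk: NLTK's implementation counts every token shared by the context and a definition, stopwords included, so the result changes with this preprocessing and neither variant gets both sentences right.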