<div class="pull-right"><img src="KEY-logo.png"></div>

## Knowledge representation and similarity 
### Grounding (Word-Sense Disambiguation) to WordNet

CSI4106 Artificial Intelligence  
Fall 2018  
Caroline Barrière

***

In this notebook, you will first explore WordNet, a lexical semantic network in which knowledge is organized into interrelated synsets (groups of synonyms). You will then attempt Word-Sense Disambiguation (WSD) using a simple Lesk-like algorithm that compares BOWs (bags of words).

This notebook uses the same NLTK package as the previous notebook. We will also reuse some knowledge from that notebook (tokenization, lemmatization, POS tagging), so make sure to do the NLP Pipeline notebook before this one.

*As you now have more experience, this notebook requires that you write more code by yourself than the previous ones.*

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time. Look for (**TO DO**) for the tasks that you need to perform.  
Make sure you *sign* (type your name) the notebook at the end. Once you're done, submit your notebook.

***

In [123]:
# let's import nltk, and wordnet

import nltk
from nltk.corpus import wordnet

**1. Exploring Wordnet**  

Let's first explore the WordNet interface within NLTK.  
You can also look at the [WordNet interface description](http://www.nltk.org/howto/wordnet.html).

In [124]:
# a synset is a concept associated with a set of synonyms

paperSenses = wordnet.synsets('paper')
print(paperSenses)

[Synset('paper.n.01'), Synset('composition.n.08'), Synset('newspaper.n.01'), Synset('paper.n.04'), Synset('paper.n.05'), Synset('newspaper.n.02'), Synset('newspaper.n.03'), Synset('paper.v.01'), Synset('wallpaper.v.01')]


This shows that there are 9 senses of paper, 7 nouns and 2 verbs.  The word displayed is the most representative word for each sense.  
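
The tally above can be double-checked from the synset names alone, since the middle field of a name such as `paper.n.01` is the part of speech. The short sketch below works on the names printed above, copied as plain strings so that it runs without NLTK; with NLTK loaded, calling `pos()` on each synset should give the same letters.

```python
from collections import Counter

# Synset names copied from the output of wordnet.synsets('paper') above
paperSenseNames = [
    'paper.n.01', 'composition.n.08', 'newspaper.n.01', 'paper.n.04',
    'paper.n.05', 'newspaper.n.02', 'newspaper.n.03', 'paper.v.01',
    'wallpaper.v.01',
]

# A synset name has the form lemma.pos.number; the middle field is the POS
posCounts = Counter(name.rsplit('.', 2)[1] for name in paperSenseNames)
print(posCounts)
```

Here `n` stands for noun and `v` for verb.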

You can try other words.  I recommend that you also perform the same search [online](http://wordnetweb.princeton.edu/perl/webwn) to better understand the results.

Let's look at the basic information in each synset.        

In [125]:
# We define a function to print the basic information

def printBasicSynsetInfo(d):
    print("SynLemmas")
    print(d.lemmas())
    print("Synonyms")
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Definition")
    print(d.definition())

In [126]:
# We can print the information for each sense of "paper"

for i in range(len(paperSenses)):
    print("[Sense " + str(i) + "]")
    printBasicSynsetInfo(paperSenses[i])
    print()

[Sense 0]
SynLemmas
[Lemma('paper.n.01.paper')]
Synonyms
['paper']
Definition
a material made of cellulose pulp derived mainly from wood or rags or certain grasses

[Sense 1]
SynLemmas
[Lemma('composition.n.08.composition'), Lemma('composition.n.08.paper'), Lemma('composition.n.08.report'), Lemma('composition.n.08.theme')]
Synonyms
['composition', 'paper', 'report', 'theme']
Definition
an essay (especially one written as an assignment)

[Sense 2]
SynLemmas
[Lemma('newspaper.n.01.newspaper'), Lemma('newspaper.n.01.paper')]
Synonyms
['newspaper', 'paper']
Definition
a daily or weekly publication on folded sheets; contains news and articles and advertisements

[Sense 3]
SynLemmas
[Lemma('paper.n.04.paper')]
Synonyms
['paper']
Definition
a medium for written communication

[Sense 4]
SynLemmas
[Lemma('paper.n.05.paper')]
Synonyms
['paper']
Definition
a scholarly article describing the results of observations or stating hypotheses

[Sense 5]
SynLemmas
[Lemma('newspaper.n.02.newspaper'), Lemm

A large taxonomy has been manually developed in WordNet, making it a rich resource.  

**(TO-DO : Q1)** Choose two words, and write code to print the taxonomic information for all senses of those words.

In [127]:
# We define a function to print the taxonomic information; it receives a synset

def printTaxonomyInfo(d):
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Hypernyms:")
    print(d.hypernyms())
    print("Hyponyms:")
    print(d.hyponyms())
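
To see more of the taxonomy than a single level, one option is to follow hypernyms repeatedly until none remain. The sketch below assumes only that an object has `name()` and `hypernyms()` methods, as NLTK synsets do. `hypernymChain` is a hypothetical helper, and `ToySynset` is a hypothetical stand-in so that the snippet runs without the WordNet data; the single link it encodes (computer to machine) is the one that appears in this notebook's output for "computer".

```python
class ToySynset:
    """Hypothetical stand-in exposing the two Synset methods used here."""
    def __init__(self, name, hypernyms=()):
        self._name = name
        self._hypernyms = list(hypernyms)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._hypernyms


def hypernymChain(synset):
    """Follow the first hypernym upward and return the names visited."""
    chain = [synset.name()]
    seen = {synset.name()}
    while synset.hypernyms():
        synset = synset.hypernyms()[0]   # a sense may have several; take the first
        if synset.name() in seen:        # guard against an unexpected cycle
            break
        seen.add(synset.name())
        chain.append(synset.name())
    return chain


machine = ToySynset('machine.n.01')
computer = ToySynset('computer.n.01', [machine])
print(hypernymChain(computer))
```

With NLTK, `hypernymChain(wordnet.synsets('computer')[0])` should walk from the first sense of "computer" up through its more general concepts.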

In [128]:
# Q1 - ANSWER
# We can print the taxonomy information for each sense of a word X
ComputerSenses = wordnet.synsets('computer')
for i in range(len(ComputerSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(ComputerSenses[i])
    print()



[Sense 0]
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system']
Hypernyms:
[Synset('machine.n.01')]
Hyponyms:
[Synset('analog_computer.n.01'), Synset('digital_computer.n.01'), Synset('home_computer.n.01'), Synset('node.n.08'), Synset('number_cruncher.n.02'), Synset('pari-mutuel_machine.n.01'), Synset('predictor.n.03'), Synset('server.n.03'), Synset('turing_machine.n.01'), Synset('web_site.n.01')]

[Sense 1]
['calculator', 'reckoner', 'figurer', 'estimator', 'computer']
Hypernyms:
[Synset('expert.n.01')]
Hyponyms:
[Synset('adder.n.01'), Synset('number_cruncher.n.01'), Synset('statistician.n.02'), Synset('subtracter.n.01')]



In [129]:
FlowerSenses = wordnet.synsets('flower')
for i in range(len(FlowerSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(FlowerSenses[i])
    print()


[Sense 0]
['flower']
Hypernyms:
[Synset('angiosperm.n.01')]
Hyponyms:
[Synset('achimenes.n.01'), Synset('african_daisy.n.01'), Synset('african_daisy.n.02'), Synset('african_daisy.n.03'), Synset('african_violet.n.01'), Synset('ageratum.n.02'), Synset('ammobium.n.01'), Synset('anemone.n.01'), Synset('aster.n.01'), Synset('baby's_breath.n.01'), Synset('bartonia.n.01'), Synset('begonia.n.01'), Synset('bellwort.n.01'), Synset('billy_buttons.n.01'), Synset('blazing_star.n.01'), Synset('bloomer.n.01'), Synset('blue-eyed_african_daisy.n.01'), Synset('blue_daisy.n.01'), Synset('brass_buttons.n.01'), Synset('bush_violet.n.01'), Synset('butterfly_flower.n.01'), Synset('calceolaria.n.01'), Synset('calendula.n.01'), Synset('calla_lily.n.01'), Synset('candytuft.n.01'), Synset('cape_marigold.n.01'), Synset('carolina_spring_beauty.n.01'), Synset('catananche.n.01'), Synset('centaury.n.01'), Synset('china_aster.n.01'), Synset('christmas_bells.n.01'), Synset('chrysanthemum.n.02'), Synset('cineraria.n.01'

**2. Word-Sense Disambiguation.**  

Let's now implement a simple modified Lesk algorithm for WSD.  
The idea is to compare the sentence containing the ambiguous word W to all the definitions of W and choose the most similar.

(Step 1) Create a BOW (bag of words) for each definition.

In [130]:
# we will need the tokenizer

from nltk import word_tokenize

In [131]:
# define a small method to return the set of words found in a text
# we can exclude one word (typically the word being defined)

def bow(text, excluded=None):
    text = text.replace("_", " ")  # the compound nouns in wordnet text have _
    tokens = word_tokenize(text)
    setTokens = set(tokens)
    if excluded is not None:
        setTokens.discard(excluded)  # no error if the word is absent
    return setTokens

In [132]:
# testing 
print(bow("There is a lot of food on the table", excluded='table'))
print(bow("He wrote an excellent conference paper referred by many researchers", excluded='paper'))

{'There', 'food', 'on', 'of', 'is', 'a', 'the', 'lot'}
{'conference', 'by', 'researchers', 'excellent', 'an', 'He', 'many', 'wrote', 'referred'}


In [133]:
# make BOWs for all the senses in a received word
# exclude from the BOW, the word being defined

def makeDefBOWs(testWord):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, excluded=testWord) for d in defs]
    return bows

In [134]:
# try with different words, look at the resulting info

testWord = "cell" # bank, course, paper, ...
defBOWs = makeDefBOWs(testWord)
    
print(*defBOWs, sep="\n")  # to print a list on separate lines

{'any', 'compartment', 'small'}
{')', '(', 'all', 'or', 'tissues', 'functional', 'higher', 'of', 'as', 'unit', 'the', 'basic', 'may', 'exist', 'animals', ';', 'they', 'plants', 'form', 'units', 'independent', 'in', 'and', 'life', 'structural', 'monads', 'biology', 'colonies', 'organisms'}
{'delivers', 'chemical', 'as', 'of', 'a', 'the', 'electric', 'reaction', 'result', 'current', 'that', 'an', 'device'}
{'small', 'movement', 'as', 'of', 'a', 'unit', 'the', 'larger', 'part', 'nucleus', 'or', 'serving', 'political'}
{',', 'short-range', 'radiotelephone', 'hand-held', 'in', 'divided', 'into', 'use', 'small', 'sections', 'with', 'a', 'its', 'mobile', 'own', 'transmitter/receiver', 'an', 'for', 'area', 'each'}
{'small', 'which', 'in', 'lives', 'a', 'room', 'or', 'monk', 'nun'}
{'is', 'kept', 'a', 'room', 'prisoner', 'where'}


(Step 2) Create a method to compare BOWs

In [135]:
# We're interested in the size of the intersection between the BOWs
# The words in common are printed to help understand the results; uncomment the other prints to also see both BOWs

def bowOverlap(bow1, bow2):
    #print(bow1)
    #print(bow2)
    print(bow1.intersection(bow2))
    return len(bow1.intersection(bow2))

**(TO-DO: Q2)** Implement Step 3 of the algorithm. Step 3 consists of comparing the BOW of a test sentence (let's call it our context C) containing an ambiguous word (X) to the BOWs of all the senses of X. To do Step 3, you need to complete the method below, which receives a word X as well as the text C in which X occurs. The method should return the synsets of X whose definition BOWs have the largest overlap with the BOW of C. Notice that there could be more than one maximum, so your method should return all synsets with maximum intersection.

In [136]:
# Q2 - ANSWER

# method receives a word and its context
# returns all the synsets with maximum overlap

def findMostProbableSense(word, context):
    bows = makeDefBOWs(word)
    textBOW = bow(context)
    maxSynet = []
    maxSame = 0
    for bag in bows:
        numSame = bowOverlap(textBOW, bag)
        if(numSame >= maxSame):
            for word in textBOW.intersection(bag):
                maxSynet.append(wordnet.synsets(word))
            if(numSame > maxSame):
                maxSame = numSame
    return maxSynet
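
For comparison, here is a minimal sketch of a selection step that keeps track of the senses of the target word itself rather than looking up the overlapping words in WordNet. It is one possible reading of Step 3, not the reference solution. `bestSenseIndexes` is a hypothetical helper; the definition BOWs are copied from the "cell" output earlier in this notebook and the context BOW is written out by hand, so the snippet runs without NLTK.

```python
def bestSenseIndexes(contextBOW, defBOWs):
    """Return the indexes of all definition BOWs with maximal overlap, and that overlap."""
    overlaps = [len(contextBOW & defBOW) for defBOW in defBOWs]
    best = max(overlaps, default=0)
    return [i for i, n in enumerate(overlaps) if n == best], best


# Definition BOWs for "cell", copied from the output of makeDefBOWs
cellDefBOWs = [
    {'any', 'compartment', 'small'},
    {')', '(', 'all', 'or', 'tissues', 'functional', 'higher', 'of', 'as',
     'unit', 'the', 'basic', 'may', 'exist', 'animals', ';', 'they', 'plants',
     'form', 'units', 'independent', 'in', 'and', 'life', 'structural',
     'monads', 'biology', 'colonies', 'organisms'},
    {'delivers', 'chemical', 'as', 'of', 'a', 'the', 'electric', 'reaction',
     'result', 'current', 'that', 'an', 'device'},
    {'small', 'movement', 'as', 'of', 'a', 'unit', 'the', 'larger', 'part',
     'nucleus', 'or', 'serving', 'political'},
    {',', 'short-range', 'radiotelephone', 'hand-held', 'in', 'divided',
     'into', 'use', 'small', 'sections', 'with', 'a', 'its', 'mobile', 'own',
     'transmitter/receiver', 'an', 'for', 'area', 'each'},
    {'small', 'which', 'in', 'lives', 'a', 'room', 'or', 'monk', 'nun'},
    {'is', 'kept', 'a', 'room', 'prisoner', 'where'},
]

# Tokens of "He lived in this prison cell for many years."
contextBOW = {'He', 'lived', 'in', 'this', 'prison', 'cell', 'for', 'many',
              'years', '.'}

indexes, overlap = bestSenseIndexes(contextBOW, cellDefBOWs)
print(indexes, overlap)
```

In the notebook, the returned indexes would be used to pick entries from `wordnet.synsets(word)`. Here the only maximum is index 4, the radiotelephone sense, and it is reached through the words `in` and `for` alone.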


##### Your method should return the chosen senses for the example below.  We will test your method using the following code.

In [137]:
# Show the BOWs of the senses with the overlap, and the chosen sense(s)
# You can try with various words and sentences

testWord = "cell"
testSentence = "He lived in this prison cell for many years."


####  CALL TO YOUR METHOD RECEIVING THE WORD AND ITS CONTEXT
chosenSynsets = findMostProbableSense(testWord, testSentence)  

# print all the definitions of the most probable senses
for s in chosenSynsets:
    for x in s:
        print(x)
        printBasicSynsetInfo(x)

set()
{'in'}
set()
set()
{'for', 'in'}
{'in'}
set()
Synset('inch.n.01')
SynLemmas
[Lemma('inch.n.01.inch'), Lemma('inch.n.01.in')]
Synonyms
['inch', 'in']
Definition
a unit of length equal to one twelfth of a foot
Synset('indium.n.01')
SynLemmas
[Lemma('indium.n.01.indium'), Lemma('indium.n.01.In'), Lemma('indium.n.01.atomic_number_49')]
Synonyms
['indium', 'In', 'atomic_number_49']
Definition
a rare soft silvery metallic element; occurs in small quantities in sphalerite
Synset('indiana.n.01')
SynLemmas
[Lemma('indiana.n.01.Indiana'), Lemma('indiana.n.01.Hoosier_State'), Lemma('indiana.n.01.IN')]
Synonyms
['Indiana', 'Hoosier_State', 'IN']
Definition
a state in midwestern United States
Synset('in.s.01')
SynLemmas
[Lemma('in.s.01.in')]
Synonyms
['in']
Definition
holding office
Synset('in.s.02')
SynLemmas
[Lemma('in.s.02.in')]
Synonyms
['in']
Definition
directed or bound inward
Synset('in.s.03')
SynLemmas
[Lemma('in.s.03.in')]
Synonyms
['in']
Definition
currently fashionable
Synset('in.r

**(TO-DO: Q3)** What do you notice? With the example above for "cell", what are the words making the BOWs look similar?  Are these significant words?

*Q3-ANSWER*  
The words the BOWs have in common are "in" and "for". They are not significant words in this case. The synsets returned are all senses of "in", such as "inch", "indium" and "Indiana", so the method seems to have focused on the wrong word. 
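
One way to check this is to remove function words from the overlaps and see what is left. The sketch below reuses the seven overlaps printed above; `FUNCTION_WORDS` is a small hand-picked set for illustration only, not a standard stop-word list.

```python
# Overlaps printed by bowOverlap for the seven senses of "cell"
overlaps = [set(), {'in'}, set(), set(), {'for', 'in'}, {'in'}, set()]

# Hand-picked function words (an assumption made for this illustration)
FUNCTION_WORDS = {'a', 'an', 'the', 'in', 'on', 'of', 'for', 'or', 'and'}

# Keep only the content words in each overlap
contentOverlaps = [o - FUNCTION_WORDS for o in overlaps]
print(contentOverlaps)
print(max(len(o) for o in contentOverlaps))
```

Once `in` and `for` are removed, no sense shares a word with the sentence, so the raw overlap gave no real evidence for any of the senses.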

**(TO-DO: Q4)  Refining our BOWs**

**Exploring variations:**
1. What if you lowercase everything?
2. What if you apply lemmatisation on all words in the BOWs?
3. What if you focus on only the NOUNS in the BOWs?

(hint) Go back to your NLP Pipeline notebook: for question (2), use the lemmatizer, and for question (3), perform POS tagging on the sentences. 

For your answer (code to write):  

a) First complete the BOW method below in which I've added parameters to possibly activate the lowercase, the lemmatization and the POS tagging.   
b) Add a few tests to see if your BOW works.  


In [138]:
# Q4 - ANSWER - part a)

# The parameters possibly ACTIVATE lowercase, lemmatization, and keeping only Nouns in BOWs.

# nltk provides a lemmatizer and a method to obtain the part-of-speech of each token
# Download the tagger model used by nltk.pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.ADV  # just use as default, for ADV the lemmatizer doesn't change anything 

# refine the method with parameters
def bow(text, excluded = None, lowercase = False, lemmatize=False, nounsOnly=False):
    text = text.replace("_", " ")
    if lowercase:
        text = text.lower()     
    tokens = word_tokenize(text)
    setTokens = set(tokens)
    if excluded != None:
        if (excluded in setTokens):
            setTokens.remove(excluded)
    if lemmatize: 
        setTokens = [wnl.lemmatize(t) for t in setTokens]
        setTokens = set(setTokens)
    if nounsOnly:
        nounToken = []
        posTokens = nltk.pos_tag(setTokens)
        wordnet_tags = [get_wordnet_pos(p[1]) for p in posTokens]
        for idx, val in enumerate(setTokens):
            if wordnet_tags[idx] == 'n':
                nounToken.append(val)
        setTokens = set(nounToken)
    return setTokens


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/FLSingerman/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [139]:
# Q4 - ANSWER - part b)

# TEST YOUR METHOD 
print(bow("There is a lot of food on the table", excluded='table', lowercase=True, lemmatize=True, nounsOnly=True))
# Your example 1
print(bow("I love the Philadelphia Eagles", excluded=None, lowercase=False, lemmatize=False, nounsOnly=False))
# Your example 2
print(bow("I hope I do well on this assignment. I hope I did well on the exams. I hope I do well in the course", 
          excluded='well', lowercase=True, lemmatize=True, nounsOnly=True))

{'food', 'lot'}
{'love', 'I', 'the', 'Eagles', 'Philadelphia'}
{'hope', 'course', 'i', 'exam'}
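
The last output keeps `hope` and `i`, which suggests that tagging may be unreliable when the tagger receives an unordered set of words instead of a sentence. A possible remedy, sketched below, is to tag the ordered token list first and build the set afterwards. `nounBOW` is a hypothetical helper, and `toyTagger` is a stand-in lookup table with hand-assigned tags so that the snippet runs without the NLTK models; in the notebook you would pass `nltk.pos_tag` instead.

```python
def nounBOW(tokens, tagger):
    """Tag tokens in sentence order, then keep the set of nouns."""
    tagged = tagger(tokens)   # list of (token, Penn Treebank tag) pairs
    return {tok for tok, tag in tagged if tag.startswith('N')}


def toyTagger(tokens):
    """Stand-in for nltk.pos_tag: looks tags up in a tiny hand-made table."""
    table = {'there': 'EX', 'is': 'VBZ', 'a': 'DT', 'lot': 'NN', 'of': 'IN',
             'food': 'NN', 'on': 'IN', 'the': 'DT', 'table': 'NN'}
    return [(tok, table.get(tok, 'XX')) for tok in tokens]


tokens = "there is a lot of food on the table".split()
print(nounBOW(tokens, toyTagger))
```

Excluding the target word and lemmatizing can then be applied to the resulting set, as in the `bow` method above.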


**(TO-DO: Q5)** TESTING BOW VARIATIONS IN LESK-LIKE DISAMBIGUATION

a) Redo the methods makeDefBOWs and findMostProbableSense to use the new parameters.  

b) Generate three example cases and test your disambiguation strategy programmed above.  An example case contains an ambiguous word (e.g. bank) and a sentence in which that word must be disambiguated (e.g. He sat on the bank throwing rocks in the water.).  

c) For your examples, which filtering seems to work better (with/without lemmatization, with/without focus only on nouns)?
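
To compare the filters for part c) without editing the call by hand each time, the flag combinations can be enumerated. This is only a sketch: the dictionary keys assume the parameter names `lowercase`, `lemmatize` and `nounsOnly` used in this notebook.

```python
from itertools import product

flagNames = ('lowercase', 'lemmatize', 'nounsOnly')

# One dictionary per combination of the three on/off flags
settings = [dict(zip(flagNames, values))
            for values in product([False, True], repeat=len(flagNames))]

for setting in settings:
    print(setting)
```

Each dictionary can then be unpacked into a call such as `findMostProbableSense(word, sentence, **setting)`, so that every example is run under all eight settings.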


In [158]:
# Q5 - ANSWER - part a)

# add the parameters to makeBOW as well, same default
def makeDefBOWs(testWord, lowercase=False, lemmatize=False, nounsOnly=False):
    if lowercase:
        testWord = testWord.lower()     
    excluded = testWord
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    for d in defs:
        bows = bow(d, excluded, lowercase, lemmatize, nounsOnly)
    #bows = [bow(d, excluded=testWord, lowecase, lemmatize, nounsOnly) for d in defs]
    return bows


from nltk.stem.porter import PorterStemmer
def findMostProbableSense(word, text, stemming=False, lowercase=False, lemmatize=False, nounsOnly=False):
    if stemming:
        tokens = word_tokenize(text)
        tokens_word = word_tokenize(text)
        stemmer = PorterStemmer()
        singles = [stemmer.stem(t) for t in tokens]
        singles_word = [stemmer.stem(t) for t in tokens_word]
        textBOW = bow(singles,lowercase, lemmatize, nounsOnly)
        bows = makeDefBOWs(singles_word, lowercase, lemmatize, nounsOnly)
    else:    
        bows = makeDefBOWs(word, lowercase, lemmatize, nounsOnly)
        textBOW = bow(text, lowercase, lemmatize, nounsOnly)
    maxSynet = []
    maxSame = 0
    for bag in bows:
        numSame = bowOverlap(textBOW, bag)
        if(numSame >= maxSame):
            for word in textBOW.intersection(bag):
                maxSynet.append(wordnet.synsets(word))
            if(numSame > maxSame):
                maxSame = numSame
    return maxSynet
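
A detail worth checking in calls such as `bow(text, lowercase, lemmatize, nounsOnly)`: Python fills parameters in order, so the first flag lands in `excluded`. The sketch below uses the standard `inspect` module on a stub with the same signature as `bow` (the stub body is empty on purpose) to show where each value goes, and how keyword arguments avoid the shift.

```python
import inspect


def bow(text, excluded=None, lowercase=False, lemmatize=False, nounsOnly=False):
    """Stub with the same signature as the notebook's bow; body omitted."""


sig = inspect.signature(bow)

# Positional call: the three flags shift one slot to the left
positional = sig.bind("some text", True, False, True).arguments
print(dict(positional))

# Keyword call: each flag reaches the intended parameter
keyword = sig.bind("some text", lowercase=True, lemmatize=False,
                   nounsOnly=True).arguments
print(dict(keyword))
```

In the positional call, `nounsOnly` is never set and `excluded` receives `True`, so passing the flags by name is the safer habit.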
        
        



In [159]:

# # also add the parameter here, copy your method from above and add a parameter for stemming
# def findMostProbableSense(senses, text, stemming=False):

#     maxSynet = []
#     maxSame = 0
#     for bag in bows:
#         numSame = bowOverlap(textBOW, bag)
#         if(numSame >= maxSame):
#             for word in textBOW.intersection(bag):
#                 maxSynet.append(wordnet.synsets(word))
#             if(numSame > maxSame):
#                 maxSame = numSame
#     return maxSynet
    
    

In [164]:
# Q5 - ANSWER - part b)

testWord = "table"
testSentence = "There is a lot of food on the table."
chosenSynsets = findMostProbableSense(testWord, testSentence, lowercase=True, lemmatize=True, nounsOnly=True)  

# print all the definitions of the most probable senses
for s in chosenSynsets:
    for x in s:
        printBasicSynsetInfo(x)
    
# Your example 1
myWord = "Mark"
mySentence = "Mark my word, I will seek vengence."
chosenSynsets = findMostProbableSense(myWord, mySentence, lowercase=True, lemmatize=False, nounsOnly=True)  

# print all the definitions of the most probable senses
print("******* lowercase=True, lemmatize=False, nounsOnly=True")
for s in chosenSynsets:
    for x in s:
        printBasicSynsetInfo(x)


# Your example 2
chosenSynsets = findMostProbableSense(myWord, mySentence, lowercase=True, lemmatize=True, nounsOnly=False)  

# print all the definitions of the most probable senses
print("****** lowercase=True, lemmatize=True, nounsOnly=False")
for s in chosenSynsets:
    for x in s:
        printBasicSynsetInfo(x)

# Your example 3
chosenSynsets = findMostProbableSense(myWord, mySentence, lowercase=True, lemmatize=True, nounsOnly=True)  

# print all the definitions of the most probable senses
print("******** lowercase=True, lemmatize=True, nounsOnly=True")
for s in chosenSynsets:
    for x in s:
        
        printBasicSynsetInfo(x)

        
chosenSynsets = findMostProbableSense(myWord, mySentence, lowercase=True, lemmatize=False, nounsOnly=False)  

# print all the definitions of the most probable senses
print("******** lowercase=True, lemmatize=False, nounsOnly=False")
for s in chosenSynsets:
    for x in s:
        
        printBasicSynsetInfo(x)

set()
{'a'}
set()
SynLemmas
[Lemma('angstrom.n.01.angstrom'), Lemma('angstrom.n.01.angstrom_unit'), Lemma('angstrom.n.01.A')]
Synonyms
['angstrom', 'angstrom_unit', 'A']
Definition
a metric unit of length equal to one ten billionth of a meter (or 0.0001 micron); used to specify wavelengths of electromagnetic radiation
SynLemmas
[Lemma('vitamin_a.n.01.vitamin_A'), Lemma('vitamin_a.n.01.antiophthalmic_factor'), Lemma('vitamin_a.n.01.axerophthol'), Lemma('vitamin_a.n.01.A')]
Synonyms
['vitamin_A', 'antiophthalmic_factor', 'axerophthol', 'A']
Definition
any of several fat-soluble vitamins essential for normal vision; prevents night blindness or inflammation or dryness of the eyes
SynLemmas
[Lemma('deoxyadenosine_monophosphate.n.01.deoxyadenosine_monophosphate'), Lemma('deoxyadenosine_monophosphate.n.01.A')]
Synonyms
['deoxyadenosine_monophosphate', 'A']
Definition
one of the four nucleotides used in building DNA; all four nucleotides have a common phosphate group and a sugar (ribose)
SynLe
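
The `{'a'}` overlap in this output is consistent with a BOW being compared against a single word rather than against a set of words: `set.intersection` accepts any iterable, so a string is treated as a sequence of characters. The sketch below illustrates the effect; the words and the two definitions are made up for the illustration and are not the actual values from the run.

```python
contextBOW = {'there', 'is', 'a', 'lot', 'of', 'food', 'on', 'the', 'table'}

# Intended comparison: two sets of words
wordOverlap = contextBOW.intersection({'a', 'piece', 'of', 'furniture'})
print(wordOverlap)

# Accidental comparison: the argument is one word (a string),
# so its characters 't', 'a', 'b', 'l', 'e' are what get compared
charOverlap = contextBOW.intersection('table')
print(charOverlap)

# Collecting one BOW per definition keeps every bag a set of words
definitions = ["a piece of furniture", "a set of data arranged in rows"]
defBOWs = [set(d.split()) for d in definitions]
print(len(defBOWs))
```

Building a list with one BOW per definition, instead of overwriting a single variable in a loop, ensures that each `bag` passed to `bowOverlap` is a set of words.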

*Q5 - ANSWER - part c)* No combination of filters clearly worked better than the others on these examples.

Restricting the BOWs to nouns alone did not work well at all. Combining lemmatization with nouns only seemed to give the same results as lemmatization alone, and those results were still not good. 

#### Signature

I, Felix Singerman, declare that the answers provided in this notebook are my own.