<div class="pull-right"><img src=KEY-logo.png></div>

## Knowledge Representation and Similarity
### Disambiguation to WordNet

CSI4506 Intelligence Artificielle  
Fall 2018  
Caroline Barrière

***

In this notebook, you will first explore WordNet, a resource in which knowledge is organized into synsets (groups of synonyms) interlinked by semantic relations.  You will then try to disambiguate words in context using a Lesk-style approach, which compares bags of words (BOWs).  

This notebook again uses NLTK, the package we used in the notebook on the NLP pipeline.  It also reuses notions covered in that notebook (tokenization, lemmatization, POS tagging), so make sure you complete the NLP pipeline notebook before this one.

*Now that you have more experience, this notebook asks you to write more code on your own than the previous notebooks did.*

***HOMEWORK***:  
Go through the notebook, running each cell one at a time.  
For each **(TO DO)**, carry out the requested tasks.  
When you are done, sign and submit your notebook.

***

In [1]:
# let's import nltk, and wordnet

import nltk
from nltk.corpus import wordnet

**1. Exploring WordNet**  

We will explore WordNet using a few of the methods that give access to its information. 
A more exhaustive description is available at http://www.nltk.org/howto/wordnet.html 

In [2]:
# a synset is a concept associated with a set of synonyms

paperSenses = wordnet.synsets('paper')
print(paperSenses)

[Synset('paper.n.01'), Synset('composition.n.08'), Synset('newspaper.n.01'), Synset('paper.n.04'), Synset('paper.n.05'), Synset('newspaper.n.02'), Synset('newspaper.n.03'), Synset('paper.v.01'), Synset('wallpaper.v.01')]


We see that "paper" has 9 senses: 7 nouns and 2 verbs.  The word shown in the name of each synset is the most representative word of that synset.  

You can test other words.  I encourage you to run the same searches in the [online version of WordNet](http://wordnetweb.princeton.edu/perl/webwn) to better understand the results.

Let's look at the basic information of the synsets.  

In [3]:
# We define a function to print the basic information

def printBasicSynsetInfo(d):
    print("SynLemmas")
    print(d.lemmas())
    print("Synonyms")
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Definition")
    print(d.definition())

In [4]:
# We can print the information for each sense of "paper"

for i in range(len(paperSenses)):
    print("[Sense " + str(i) + "]")
    printBasicSynsetInfo(paperSenses[i])
    print()

[Sense 0]
SynLemmas
[Lemma('paper.n.01.paper')]
Synonyms
['paper']
Definition
a material made of cellulose pulp derived mainly from wood or rags or certain grasses

[Sense 1]
SynLemmas
[Lemma('composition.n.08.composition'), Lemma('composition.n.08.paper'), Lemma('composition.n.08.report'), Lemma('composition.n.08.theme')]
Synonyms
['composition', 'paper', 'report', 'theme']
Definition
an essay (especially one written as an assignment)

[Sense 2]
SynLemmas
[Lemma('newspaper.n.01.newspaper'), Lemma('newspaper.n.01.paper')]
Synonyms
['newspaper', 'paper']
Definition
a daily or weekly publication on folded sheets; contains news and articles and advertisements

[Sense 3]
SynLemmas
[Lemma('paper.n.04.paper')]
Synonyms
['paper']
Definition
a medium for written communication

[Sense 4]
SynLemmas
[Lemma('paper.n.05.paper')]
Synonyms
['paper']
Definition
a scholarly article describing the results of observations or stating hypotheses

[Sense 5]
SynLemmas
[Lemma('newspaper.n.02.newspaper'), Lemm

WordNet also contains a taxonomy, which makes it a very rich resource. 

**(TO-DO: Q1)** Choose two words and write the code that displays the taxonomic information for these words.

In [5]:
# We define a function to print the taxonomy information; it receives a synset

def printTaxonomyInfo(d):
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Hypernyms:")
    print(d.hypernyms())
    print("Hyponyms:")
    print(d.hyponyms())

In [6]:
# Q1 - ANSWER
# We can print the taxonomy information for each sense of a word X
bottleSenses = wordnet.synsets('bottle')
for i in range(len(bottleSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(bottleSenses[i])
    print()

print()
mouseSenses = wordnet.synsets('mouse')
for i in range(len(mouseSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(mouseSenses[i])
    print()

[Sense 0]
['bottle']
Hypernyms:
[Synset('vessel.n.03')]
Hyponyms:
[Synset('beer_bottle.n.01'), Synset('carafe.n.01'), Synset('carboy.n.01'), Synset('catsup_bottle.n.01'), Synset('cruet.n.01'), Synset('demijohn.n.01'), Synset('flask.n.01'), Synset('gourd.n.01'), Synset('ink_bottle.n.01'), Synset('jug.n.01'), Synset('phial.n.01'), Synset('pill_bottle.n.01'), Synset('pop_bottle.n.01'), Synset('smelling_bottle.n.01'), Synset('specimen_bottle.n.01'), Synset('water_bottle.n.01'), Synset('whiskey_bottle.n.01'), Synset('wine_bottle.n.01')]

[Sense 1]
['bottle', 'bottleful']
Hypernyms:
[Synset('containerful.n.01')]
Hyponyms:
[Synset('split.n.02')]

[Sense 2]
['bottle', 'feeding_bottle', 'nursing_bottle']
Hypernyms:
[Synset('vessel.n.03')]
Hyponyms:
[]

[Sense 3]
['bottle']
Hypernyms:
[Synset('store.v.02')]
Hyponyms:
[]

[Sense 4]
['bottle']
Hypernyms:
[Synset('put.v.01')]
Hyponyms:
[]


[Sense 0]
['mouse']
Hypernyms:
[Synset('rodent.n.01')]
Hyponyms:
[Synset('field_mouse.n.02'), Synset('harvest

**2. Disambiguation**  

We will implement a Lesk-style algorithm to perform the disambiguation.
The idea is to compare the BOW of the sentence containing the ambiguous word with the BOWs of all the possible definitions of that word. 
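
As a rough illustration of this idea, here is a minimal sketch that does not depend on NLTK or WordNet. The names `simple_bow`, `TOY_SENSES`, and `toy_lesk` are hypothetical, the regular-expression tokenizer is a stand-in for `word_tokenize`, and the two definitions of "cell" are typed in by hand rather than read from WordNet.

```python
import re

def simple_bow(text, excluded=None):
    """Return the set of word tokens in `text`, without the ambiguous word."""
    tokens = set(re.findall(r"[A-Za-z][A-Za-z'-]*", text))
    tokens.discard(excluded)  # does nothing if `excluded` is None or absent
    return tokens

# Hypothetical mini sense inventory: two hand-typed definitions of "cell"
TOY_SENSES = {
    "cell (prison)": "a room where a prisoner is kept",
    "cell (monastery)": "small room in which a monk or nun lives",
}

def toy_lesk(word, context, senses):
    """Return (labels of the senses with the largest overlap, all overlap sizes)."""
    context_bow = simple_bow(context, excluded=word)
    overlaps = {
        label: len(simple_bow(definition, excluded=word) & context_bow)
        for label, definition in senses.items()
    }
    best = max(overlaps.values())
    winners = [label for label, size in overlaps.items() if size == best]
    return winners, overlaps

winners, overlaps = toy_lesk(
    "cell", "A prisoner escaped from the cell where he was kept.", TOY_SENSES
)
print(winners)
print(overlaps)
```

The comparison is case-sensitive, so "A" in the sentence does not match "a" in the definitions; the refinements explored later in the notebook address this kind of mismatch.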

(Step 1) Create the BOW (bag of words) for each definition.

In [7]:
# we will need the tokenizer

from nltk import word_tokenize

In [8]:
# define a small method to return the set of words found in a text
# we can exclude some words

def bow(text, excluded = None):
    text = text.replace("_", " ") # the compound nouns in wordnet text have _
    tokens = word_tokenize(text)
    setTokens = set(tokens)
    if excluded is not None:
        setTokens.discard(excluded)  # discard does nothing if the word is absent
    return setTokens

In [9]:
# testing 
print(bow("There is a lot of food on the table", excluded='table'))
print(bow("He wrote an excellent conference paper referred by many researchers", excluded='paper'))

{'lot', 'the', 'food', 'There', 'of', 'a', 'is', 'on'}
{'an', 'many', 'wrote', 'excellent', 'conference', 'referred', 'researchers', 'by', 'He'}


In [10]:
# make BOWs for all the senses of a testWord
# exclude the testWord from the BOWs

def makeDefBOWs(testWord):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, excluded=testWord) for d in defs]
    return bows

In [11]:
# try with different words, look at the resulting info

testWord = "cell" # bank, course, paper, ...
defBOWs = makeDefBOWs(testWord)
    
print(*defBOWs, sep="\n")  # to print a list on separate lines

{'small', 'compartment', 'any'}
{'units', 'life', 'plants', 'unit', 'or', 'tissues', 'independent', 'and', 'structural', 'animals', 'basic', 'functional', ';', 'of', 'organisms', 'all', 'may', 'monads', 'they', 'colonies', 'higher', 'exist', 'form', 'in', 'the', 'biology', '(', ')', 'as'}
{'an', 'chemical', 'delivers', 'the', 'electric', 'current', 'of', 'reaction', 'that', 'result', 'device', 'a', 'as'}
{'unit', 'movement', 'small', 'larger', 'the', 'of', 'serving', 'or', 'nucleus', 'part', 'a', 'as', 'political'}
{'an', 'for', 'into', 'transmitter/receiver', 'each', 'use', 'hand-held', 'radiotelephone', 'its', 'sections', 'divided', 'small', 'with', 'mobile', 'in', 'short-range', 'area', 'own', ',', 'a'}
{'monk', 'which', 'small', 'in', 'room', 'or', 'nun', 'lives', 'a'}
{'prisoner', 'where', 'room', 'kept', 'a', 'is'}


(Step 2) Create a method to compare the BOWs

In [12]:
# We're interested in the size of the intersection between the BOWs
# The words in common are printed to help understand the results;
# uncomment the other prints to also see the two full BOWs

def bowOverlap(bow1, bow2):
    # print(bow1)
    # print(bow2)
    print(bow1.intersection(bow2))
    return len(bow1.intersection(bow2))

**(TO-DO: Q2)** Write the code for step 3 of the algorithm.  Step 3 compares the BOW of the context surrounding the ambiguous word with the BOWs of its various senses.  To do so, complete the method below, which receives a word and a context and returns the most probable synsets (those with the largest intersection with the context BOW).  Since several synsets may have the same intersection size, the method must return all of them.

In [13]:
# Q2 - ANSWER

# method receives a word, and its context
# returns all the synsets with maximum overlap

def findMostProbableSense(word, context):
    bows = makeDefBOWs(word)
    textBOW = bow(context)
    # find senses with max overlap
    maximum_intersections = max([bowOverlap(bow,textBOW) for bow in bows])
    print()
    return [wordnet.synsets(word)[i] for i,bow in enumerate(bows) if bowOverlap(bow,textBOW)==maximum_intersections]
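
The answer above calls `bowOverlap` twice per sense, which is why each intersection appears twice in the printed output. One possible variant, sketched here with a hypothetical helper `senses_with_max_overlap` that works on plain sets, computes each overlap once and also copes with a word that has no senses. The three definition BOWs are copied from the "cell" output printed earlier, and the context BOW is typed in by hand.

```python
def senses_with_max_overlap(context_bow, sense_bows):
    """Return (indices of the senses tied for the largest overlap, overlap size)."""
    overlaps = [len(sense_bow & context_bow) for sense_bow in sense_bows]
    if not overlaps:  # a word without senses has nothing to rank
        return [], 0
    best = max(overlaps)
    return [i for i, size in enumerate(overlaps) if size == best], best

# Hand-typed context BOW for "He lived in this prison cell for many years."
context_bow = {"He", "lived", "in", "this", "prison", "cell",
               "for", "many", "years", "."}

# Three of the definition BOWs printed for "cell" earlier in the notebook
sense_bows = [
    {"an", "for", "into", "transmitter/receiver", "each", "use", "hand-held",
     "radiotelephone", "its", "sections", "divided", "small", "with", "mobile",
     "in", "short-range", "area", "own", ",", "a"},
    {"monk", "which", "small", "in", "room", "or", "nun", "lives", "a"},
    {"prisoner", "where", "room", "kept", "a", "is"},
]

winners, size = senses_with_max_overlap(context_bow, sense_bows)
print(winners, size)  # prints: [0] 2
```

The returned indices can be used to look up the corresponding entries of `wordnet.synsets(word)`, since the BOWs are built in the same order as the synsets.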
    


##### Your method should be able to return the chosen senses for examples such as the following:

In [14]:
# Show the BOWs of the senses with the overlap, and the chosen sense(s)
# You can try with various words and sentences

testWord = "cell"
testSentence = "He lived in this prison cell for many years."

####  CALL TO YOUR METHOD RECEIVING THE WORD AND ITS CONTEXTS
chosenSynsets = findMostProbableSense(testWord, testSentence)
print()

# print all the definitions of the most probable senses
for s in chosenSynsets:
    printBasicSynsetInfo(s)

set()
{'in'}
set()
set()
{'for', 'in'}
{'in'}
set()

set()
{'in'}
set()
set()
{'for', 'in'}
{'in'}
set()

SynLemmas
[Lemma('cellular_telephone.n.01.cellular_telephone'), Lemma('cellular_telephone.n.01.cellular_phone'), Lemma('cellular_telephone.n.01.cellphone'), Lemma('cellular_telephone.n.01.cell'), Lemma('cellular_telephone.n.01.mobile_phone')]
Synonyms
['cellular_telephone', 'cellular_phone', 'cellphone', 'cell', 'mobile_phone']
Definition
a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver


**(TO-DO: Q3)** What do you notice?  In the "cell" example above, which words make the BOWs similar?  Are they important words?

**Q3-ANSWER**

The cellular telephone definition was chosen as the most probable sense. The overlapping words were 'for' and 'in'. These are not important words: they are function words that say nothing about which sense of "cell" is meant.

**(TO-DO: Q4)  Refining the BOWs**

**Exploring variations:**
1. What if everything were in lowercase? 
2. What if we used lemmatization? 
3. What if the BOW contained only nouns?

(Hint) Go back to the NLP pipeline notebook to see how to do lemmatization (variation 2 above) and POS tagging (variation 3 above). 
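
To make variations 1 and 3 concrete without running a tagger, here is a small sketch on tokens that are already tagged. The Penn Treebank tags in `TAGGED` were written by hand for illustration (they are not tagger output), and `nouns_only` is a hypothetical helper name.

```python
# Hand-tagged tokens for "There is a lot of food on the table"
# (tags written by hand for illustration, not produced by nltk.pos_tag)
TAGGED = [
    ("There", "EX"), ("is", "VBZ"), ("a", "DT"), ("lot", "NN"), ("of", "IN"),
    ("food", "NN"), ("on", "IN"), ("the", "DT"), ("table", "NN"),
]

def nouns_only(tagged_tokens, excluded=None, lowercase=False):
    """Keep the tokens whose Penn Treebank tag starts with 'N' (NN, NNS, NNP, NNPS)."""
    nouns = set()
    for token, tag in tagged_tokens:
        if not tag.startswith("N"):
            continue
        nouns.add(token.lower() if lowercase else token)
    nouns.discard(excluded)  # drop the ambiguous word itself, if present
    return nouns

print(nouns_only(TAGGED, excluded="table"))
print(nouns_only([("Paris", "NNP"), ("is", "VBZ")], lowercase=True))
```

Restricting the BOW to nouns removes function words such as "of" and "the", which are the kind of words that produced the uninformative overlaps in the "cell" example.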

Write the code to:

a) Complete the BOW method below, which now takes several parameters that switch the various filters (lowercase, lemmatization, POS tagging) on or off.  
b) Add a few tests to validate that your BOW creation works.


In [15]:
# Q4 - ANSWER - part a)

# The parameters possibly ACTIVATE lowercase, lemmatization, and keeping only Nouns in BOWs.

# nltk contains a method to obtain the part-of-speech of each token
# Download the POS tagger model and create the WordNet lemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.ADV  # just use as default, for ADV the lemmatizer doesn't change anything 

# refine the method with parameters
def bow(text, excluded = None, lowercase = False, lemmatize=False, nounsOnly=False):
    text = text.replace("_", " ")

    # Continue with the options to deal with the various cases (lemmatized T/F, nounsOnly T/F)
    if lowercase:
        text = text.lower()

    tokens = word_tokenize(text)

    if lemmatize:
        tokens = [wnl.lemmatize(t) for t in tokens]

    if nounsOnly:
        # pos_tag returns (token, tag) pairs; keep the tokens whose tag maps to a WordNet noun
        tokens = [token for token, tag in nltk.pos_tag(tokens) if get_wordnet_pos(tag) == wordnet.NOUN]

    setTokens = set(tokens)
    if excluded is not None:
        setTokens.discard(excluded)  # discard does nothing if the word is absent

    return setTokens

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Oliver\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
# Q4 - ANSWER - part b)

# TEST YOUR METHOD 
print(bow("There is a lot of food on the table", excluded='table', lowercase=True, lemmatize=True, nounsOnly=True))
# Your example 1
print(bow("He wrote an excellent conference paper referred by many researchers", excluded='paper', lowercase=True, lemmatize=True))
# Your example 2
print(bow("In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat; it was a hobbit-hole, and that means comfort.", lemmatize=True, nounsOnly=True))

{'food', 'lot'}
{'an', 'many', 'wrote', 'researcher', 'excellent', 'he', 'conference', 'referred', 'by'}
{'hobbit', 'ground', 'dry', 'smell', 'worm', 'hole', 'bare', 'nothing', 'oozy', 'comfort', 'end'}


**(TO-DO: Q5)** Test the BOW variations in the disambiguation task

a) Rewrite the methods *makeDefBOWs* and *findMostProbableSense* so that the new parameters can be activated.

b) Create 3 example cases to test the implemented disambiguation strategy.  An example case consists of an ambiguous word (e.g. bank) and a sentence in which this word must be disambiguated (e.g. He sat on the bank throwing rocks in the water.)
 
c) For your examples, which strategies seem to work best (with/without lemmatization, with/without the restriction to nouns)?


In [17]:
# Q5 - ANSWER - part a)

# add the parameters to makeDefBOWs as well, same defaults
def makeDefBOWs(testWord, lowercase=False, lemmatize=False, nounsOnly=False):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, testWord, lowercase, lemmatize, nounsOnly) for d in defs]
    return bows

# same method as above, with the new parameters passed through to bow()
def findMostProbableSense(word, text, lowercase=False, lemmatize=False, nounsOnly=False):
    bows = makeDefBOWs(word, lowercase, lemmatize, nounsOnly)
    # pass the flags by keyword: the second positional parameter of bow() is `excluded`,
    # so passing them positionally would shift every flag by one position
    textBOW = bow(text, lowercase=lowercase, lemmatize=lemmatize, nounsOnly=nounsOnly)
    maximum_intersections = max([bowOverlap(bow,textBOW) for bow in bows])
    print()
    return [wordnet.synsets(word)[i] for i,bow in enumerate(bows) if bowOverlap(bow,textBOW)==maximum_intersections]

In [18]:
# Q5 - ANSWER - part b)

testWord = "table"
testSentence = "There is a lot of food on the table."
chosenSynsets = findMostProbableSense(testWord, testSentence, lowercase=True, lemmatize=True, nounsOnly=True)  

# print all the definitions of the most probable senses
print()
for s in chosenSynsets:
    printBasicSynsetInfo(s)
    
print()
# Your example 1
print("Example 1")
word1 = "bank"
sentence1 = "He sat on the bank throwing rocks in the water"
chosenSynsets1 = findMostProbableSense(word1, sentence1, lemmatize=True)  
print()
for s in chosenSynsets1:
    printBasicSynsetInfo(s)

print()
# Your example 2
print("Example 2")
word2 = "bank"
sentence2 = "He sat on the bank throwing rocks in the water"
chosenSynsets2 = findMostProbableSense(word2, sentence2, nounsOnly=True)  
print()
for s in chosenSynsets2:
    printBasicSynsetInfo(s)

print()
# Your example 3
print("Example 3")
word3 = "bank"
sentence3 = "He sat on the bank throwing rocks in the water"
chosenSynsets3 = findMostProbableSense(word3, sentence3, lemmatize=True, nounsOnly=True)  
print()
for s in chosenSynsets3:
    printBasicSynsetInfo(s)



set()
set()
set()
set()
set()
{'food'}
set()
set()

set()
set()
set()
set()
set()
{'food'}
set()
set()

SynLemmas
[Lemma('board.n.04.board'), Lemma('board.n.04.table')]
Synonyms
['board', 'table']
Definition
food or meals in general

Example 1
{'water', 'the'}
{'the'}
set()
{'in'}
{'in'}
{'in', 'the'}
{'in', 'the'}
{'in', 'the'}
{'in', 'the'}
{'in'}
set()
set()
set()
{'in', 'the'}
{'in', 'the'}
set()
{'the'}
{'in'}

{'water', 'the'}
{'the'}
set()
{'in'}
{'in'}
{'in', 'the'}
{'in', 'the'}
{'in', 'the'}
{'in', 'the'}
{'in'}
set()
set()
set()
{'in', 'the'}
{'in', 'the'}
set()
{'the'}
{'in'}

SynLemmas
[Lemma('bank.n.01.bank')]
Synonyms
['bank']
Definition
sloping land (especially the slope beside a body of water)
SynLemmas
[Lemma('bank.n.06.bank')]
Synonyms
['bank']
Definition
the funds held by a gambling house or the dealer in some gambling games
SynLemmas
[Lemma('bank.n.07.bank'), Lemma('bank.n.07.cant'), Lemma('bank.n.07.camber')]
Synonyms
['bank', 'cant', 'camber']
Definition
a slope 

**Q5 - ANSWER - part c)**

For these examples, the strategy that works best combines lowercasing (which ensures that capitalization does not affect string matching), lemmatization (which can change how many words are POS-tagged as nouns and raises the chances of sharing words with the dictionary definitions), and the restriction to nouns (the most useful parameter, as seen in the results of the second example, because it raises the chances that a relevant noun is common to the sentence and the definition).
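
One way to check such a conclusion more systematically is to run the same example under every on/off combination of the three flags. The sketch below assumes a disambiguation function that accepts the keyword arguments `lowercase`, `lemmatize`, and `nounsOnly`, as `findMostProbableSense` does in this notebook; `flag_combinations`, `compare_settings`, and the stand-in `fake_disambiguate` are hypothetical names used so that the example runs on its own.

```python
from itertools import product

FLAGS = ("lowercase", "lemmatize", "nounsOnly")

def flag_combinations(flags=FLAGS):
    """Yield one keyword-argument dict per on/off combination of the flags."""
    for values in product([False, True], repeat=len(flags)):
        yield dict(zip(flags, values))

def compare_settings(disambiguate, word, sentence):
    """Call disambiguate(word, sentence, **kwargs) for every flag combination."""
    return [
        (kwargs, disambiguate(word, sentence, **kwargs))
        for kwargs in flag_combinations()
    ]

# Stand-in for findMostProbableSense so that the sketch runs without NLTK:
# it simply echoes the flags it received.
def fake_disambiguate(word, sentence, lowercase=False, lemmatize=False, nounsOnly=False):
    return (lowercase, lemmatize, nounsOnly)

results = compare_settings(
    fake_disambiguate, "bank", "He sat on the bank throwing rocks in the water"
)
for kwargs, chosen in results:
    print(kwargs, "->", chosen)
```

In the notebook itself, passing `findMostProbableSense` instead of the stand-in would give the chosen synsets for each of the eight settings, which can then be compared by hand.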

#### Signature

I, Oliver Charles Scott, declare that the answers written in this notebook are my own.