# WordNet-based algorithms

## Lesk's algorithm

In [1]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.wsd import lesk
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [2]:
lesk(nltk.word_tokenize('Today I went to the bank to ask for a loan.'),'bank','n')

Synset('savings_bank.n.02')

In [3]:
lesk??

In [4]:
[s.definition() for s in wn.synsets('loan')]

['the temporary provision of money (usually at interest)',
 "a word borrowed from another language; e.g. `blitz' is a German word borrowed into modern English",
 'give temporarily; let have for a limited time']

In [5]:
syns=wn.synsets('bank', 'n')
for s in syns:
  print(s, s.definition())

Synset('bank.n.01') sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') a long ridge or pile
Synset('bank.n.04') an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') a building in which the business of banking transacted
Synset('bank.n.10') a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)


In [6]:
wn.synsets('savings_bank')[1].hyponyms()

[Synset('piggy_bank.n.01')]

In [7]:
lesk(nltk.word_tokenize('Students enjoy going to school, studying and reading books'),'school','n')

Synset('school.n.06')

In [8]:
syns = wn.synsets('school', 'n')
for s in syns:
  print(s, s.definition())

Synset('school.n.01') an educational institution
Synset('school.n.02') a building where young people receive education
Synset('school.n.03') the process of being formally educated at a school
Synset('school.n.04') a body of creative artists or writers or thinkers linked by a similar style or by similar teachers
Synset('school.n.05') the period of instruction in a school; the time period when school is in session
Synset('school.n.06') an educational institution's faculty and students
Synset('school.n.07') a large group of fish


In [9]:
wn.synsets('school')[5].examples()

['the school keeps parents informed',
 'the whole school turned out for the game']

In [10]:
lesk(nltk.word_tokenize('I was snorkeling and saw a school of trout'),'school','n')

Synset('school.n.05')

In [11]:
wn.synsets('trout')[1].definition()

'any of various game and food fishes of cool fresh waters mostly smaller than typical salmons'

In [12]:
s = lesk(nltk.word_tokenize("I went fishing for some sea bass"), 'bass', 'n')
print(s, s.definition())

Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes


In [13]:
for s in wn.synsets('bass'):
  print(s, s.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


In [14]:
syns=wn.synsets('trout', 'n')
for s in syns:
  print(s, s.member_meronyms())

Synset('trout.n.01') []
Synset('trout.n.02') []


## Extension: Banerjee and Pedersen's algorithm

Satanjeev Banerjee and Ted Pedersen presented this technique in a 2003 article: https://www.researchgate.net/publication/221629283_An_Adapted_Lesk_Algorithm_for_Word_Sense_Disambiguation_Using_WordNet

The algorithm measures the relatedness of two words. Like Lesk, it counts the overlaps between glosses, but it also takes into account the glosses of the synsets related to the two words.

Suppose we want to obtain the sense for a word in a certain context (for example a sentence or just a window of text). The steps of the algorithm are:

We first tag the words in the sentence with their part of speech.

For each word, we obtain the list of synsets corresponding to that part of speech.

For each synset s, we obtain the glosses of all the synsets related to it by:

- hypernyms
- hyponyms
- meronyms
- holonyms
- troponyms
- attributes
- similar-to
- also-see

It is good to use a structure that records where each gloss comes from (this is needed for the tests in the exercise). We add all these glosses to a single list (one for each target word). We call these lists "**extended glosses**".
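As a rough illustration of such a structure, the sketch below collects `(relation, source synset, gloss)` triples for one synset. The names `RELATIONS` and `extended_gloss` are hypothetical and not part of the exercise; the method names are those of NLTK's `Synset` class. It assumes that troponyms can be read through `hyponyms()`, since WordNet stores verb troponyms as hyponyms.

```python
# Relation label -> Synset method names (NLTK WordNet interface).
# Assumption: verb troponyms are reached through hyponyms().
RELATIONS = {
    "hypernym": ["hypernyms"],
    "hyponym/troponym": ["hyponyms"],
    "meronym": ["member_meronyms", "part_meronyms", "substance_meronyms"],
    "holonym": ["member_holonyms", "part_holonyms", "substance_holonyms"],
    "attribute": ["attributes"],
    "similar-to": ["similar_tos"],
    "also-see": ["also_sees"],
}


def extended_gloss(synset):
    """Return a list of (relation, source synset name, gloss) triples."""
    # The synset's own gloss comes first, labelled "gloss".
    entries = [("gloss", synset.name(), synset.definition())]
    for relation, methods in RELATIONS.items():
        for method in methods:
            for related in getattr(synset, method)():
                entries.append((relation, related.name(), related.definition()))
    return entries
```

With WordNet loaded as in the first cell, `extended_gloss(wn.synset('bank.n.01'))` should return the synset's own gloss followed by the glosses of its related synsets, each tagged with the relation it came from.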

For each synset of the target word (the word whose sense we want to obtain), we compute a score by counting the overlaps between its extended gloss and the extended glosses of all the synsets of the other words in the context.

When computing the score, we add 1 for each single word that appears in both extended glosses. However, if the words form a common phrase of length $L$, we add $L^2$ instead (for example, if "white bread" appears in both glosses, we add 4). In that case we don't also add the scores of the separate words in the phrase.

We try to find the longest common sequences of consecutive words (a sequence shouldn't start or end with a pronoun, preposition, article or conjunction). To avoid counting the same overlap multiple times for the same two glosses, after counting an overlap you should replace the sequence of words with a special string (don't remove it completely, as you may obtain false overlaps).

After computing the score for each synset of the target word, choose the synset with the highest score as the result.
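The scoring rule can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it assumes plain whitespace tokenization, it does not trim function words at the edges of an overlap, and the names `longest_common_run` and `overlap_score` are made up for this example. Each counted overlap is replaced with a placeholder that differs between the two glosses, so it can never match again or create a false overlap.

```python
def longest_common_run(a, b):
    """Return (length, start_a, start_b) of the longest run of consecutive
    tokens shared by the token lists a and b."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best


def overlap_score(gloss_a, gloss_b):
    """Sum L**2 over the common phrases of two glosses, longest first."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    total, counter = 0, 0
    while True:
        length, start_a, start_b = longest_common_run(a, b)
        if length == 0:
            return total
        total += length ** 2
        # Placeholders are tuples, so they never equal a word, and the
        # "A"/"B" tags keep the two sides from matching each other.
        a[start_a:start_a + length] = [("A", counter)]
        b[start_b:start_b + length] = [("B", counter)]
        counter += 1


print(overlap_score("more white bread with butter",
                    "white bread is better than rye"))  # 4
```

Deleting the phrase instead would let its neighbours become adjacent: for the token lists `x a b y` and `x y a b`, removing `a b` leaves `x y` on both sides and would wrongly add another 4.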

In [15]:
"some white snow and some type of rye bread" ## 2
"more white bread with butter"

"i eat white bread for breakfast" ## 2?
"white bread is better than rye"

"bread and butter"

'bread and butter'

In [16]:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
import string
nltk.download('stopwords')
import re

gloss_rel = lambda x: x.definition()
example_rel = lambda x: " ".join(x.examples())
hyponym_rel = lambda x: " ".join(w.definition() for w in x.hyponyms())
meronym_rel = lambda x: " ".join(w.definition() for w in x.member_meronyms() + \
                                 x.part_meronyms() + x.substance_meronyms())
also_rel = lambda x: " ".join(w.definition() for w in x.also_sees())
attr_rel = lambda x: " ".join(w.definition() for w in x.attributes())
hypernym_rel = lambda x: " ".join(w.definition() for w in x.hypernyms())

relpairs = {wn.NOUN: [(hyponym_rel, meronym_rel), (meronym_rel, hyponym_rel),
                      (hyponym_rel, hyponym_rel),
                      (gloss_rel, meronym_rel), (meronym_rel, gloss_rel),
                      (example_rel, meronym_rel), (meronym_rel, example_rel),
                      (gloss_rel, gloss_rel)],
            wn.ADJ: [(also_rel, gloss_rel), (gloss_rel, also_rel),
                     (attr_rel, gloss_rel), (gloss_rel, attr_rel),
                     (gloss_rel, gloss_rel),
                     (example_rel, gloss_rel), (gloss_rel, example_rel),
                     (gloss_rel, hypernym_rel), (hypernym_rel, gloss_rel)],
            wn.VERB:[(example_rel, example_rel),
                     (example_rel, hypernym_rel), (hypernym_rel, example_rel),
                     (hyponym_rel, hyponym_rel),
                     (gloss_rel, hyponym_rel), (hyponym_rel, gloss_rel),
                     (example_rel, gloss_rel), (gloss_rel, example_rel)]}

def preprocess(text):
    """
    Helper function to preprocess text (lowercase, remove punctuation etc.)
    """
    words = nltk.word_tokenize(text)
    punctuation = string.punctuation
    words = [word.lower() for word in words if word not in punctuation]
    # Note: removing all stopwords is not part of the original algorithm,
    # which only excludes them at the edges of a common subsequence.
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    return words

def lcs(S1, S2):
    """
    Helper function to compute length and offsets of longest common substring of
    S1 and S2. Uses the classical dynamic programming algorithm.
    """
    M = [[0]*(1+len(S2)) for i in range(1+len(S1))]
    longest, x_longest, y_longest = 0, 0, 0
    for x in range(1,1+len(S1)):
        for y in range(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y]>longest:
                    longest = M[x][y]
                    x_longest = x
                    y_longest = y
            else:
                M[x][y] = 0
    return longest, x_longest - longest, y_longest - longest

def score(gloss1, gloss2, normalized=False):
    """
    Compute score between two glosses based on length of common substrings.
    """
    gloss1 = preprocess(gloss1)
    gloss2 = preprocess(gloss2)
    curr_score = 0
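    longest, start1, start2 = lcs(gloss1, gloss2)
    while longest > 0:
        # Note: the matched run is deleted rather than replaced with a
        # placeholder, so the words around it become adjacent and may form
        # overlaps that were not present in the original glosses.
        gloss1[start1 : start1 + longest] = []
        gloss2[start2 : start2 + longest] = []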
        curr_score += longest ** 2
        longest, start1, start2 = lcs(gloss1, gloss2)
    if normalized and curr_score:
      return curr_score / (len(gloss1) + len(gloss2))
    return curr_score

def relatedness(sense1, sense2, relpairs, normalized=False):
    """
    Compute the relatedness of two senses (synsets) using the list of pairs of
    relations in relpairs.
    """
    return sum(score(pair[0](sense1), pair[1](sense2), normalized=normalized) # Note: normalization not explicitly part of original algorithm!
    for pair in relpairs)

def wsd(context, target, winsize, pos_tag, verbose=False, normalized=False):
    """
    Find the best sense for a word in a given context.
    Arguments:
    context - sentence(s) we are analyzing; expected as list of strings
    target  - string representing the word whose senses we're trying to
              disambiguate. Target is assumed to occur once in sentence. In case
              of multiple occurrences, the first one is considered. Will throw
              ValueError if target is not in sentence
    winsize - size of window used for disambiguating. The algorithm will only
              look at winsize words of the appropriate part-of-speech around the
              target word
    pos_tag - part of speech of target word
    """
    context = list(filter(None, [wn.synsets(word, pos=pos_tag) for word in context]))
    target_synsets = wn.synsets(target, pos=pos_tag)
    pos = context.index(target_synsets)
    window = context[max(pos - winsize, 0) : pos] + \
             context[pos + 1 : min(pos + winsize + 1, len(context))]
    sense_scores = [sum(sum(relatedness(sense, other_sense, relpairs[pos_tag], normalized=normalized)
                              for other_sense in senses)
                   for senses in window) for sense in target_synsets]
    if verbose:
      print("All scores:")
      for i, s in enumerate(target_synsets):
        print(sense_scores[i], s, s.definition())
    best_score = max(sense_scores)
    best_index = sense_scores.index(best_score)
    return target_synsets[best_index], best_score


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:

sentence = "I went fishing for some sea bass"
sentence = preprocess(sentence)
sense, outscore = wsd(sentence, "bass", 5, wn.NOUN)
print("best score was {} by synset {} \n(definition = '{}')".format(outscore, sense, sense.definition()))


best score was 1 by synset Synset('sea_bass.n.01') 
(definition = 'the lean flesh of a saltwater fish of the family Serranidae')


In [18]:
sentence = "Students enjoy going to school, studying and reading books"
sentence = preprocess(sentence)
sense, outscore = wsd(sentence, "school", 1, wn.NOUN)
print("best score was {} by synset {} \n(definition = '{}')".format(outscore, sense, sense.definition()))


best score was 1 by synset Synset('school.n.01') 
(definition = 'an educational institution')


In [19]:
sentence = "I went to the bank to get a loan"
sentence = preprocess(sentence)
sense, outscore = wsd(sentence, "bank", 3, wn.NOUN)
print("best score was {} by synset {} \n(definition = '{}')".format(outscore, sense, sense.definition()))


best score was 12 by synset Synset('depository_financial_institution.n.01') 
(definition = 'a financial institution that accepts deposits and channels the money into lending activities')


In [20]:
sentence = "I went to the bank to get a loan"
sentence = preprocess(sentence)
sense, outscore = wsd(sentence, "bank", 2, wn.NOUN)
print("best score was {} by synset {} \n(definition = '{}')".format(outscore, sense, sense.definition()))

best score was 12 by synset Synset('depository_financial_institution.n.01') 
(definition = 'a financial institution that accepts deposits and channels the money into lending activities')


In [21]:
sentence = 'I was snorkeling and saw a school of trout'
sentence = preprocess(sentence)
sense, outscore = wsd(sentence, "school", 3, wn.NOUN, verbose=True, normalized=True)
print("best score was {} by synset {} \n(definition = '{}')".format(outscore, sense, sense.definition()))

All scores:
0.030312006319115323 Synset('school.n.01') an educational institution
0.07142857142857142 Synset('school.n.02') a building where young people receive education
0 Synset('school.n.03') the process of being formally educated at a school
0.062405820142393036 Synset('school.n.04') a body of creative artists or writers or thinkers linked by a similar style or by similar teachers
0.018867924528301886 Synset('school.n.05') the period of instruction in a school; the time period when school is in session
0 Synset('school.n.06') an educational institution's faculty and students
0.125 Synset('school.n.07') a large group of fish
best score was 0.125 by synset Synset('school.n.07') 
(definition = 'a large group of fish')


In [22]:
[s.definition() for s in wn.synsets('whale')]

['a very large person; impressive in size or qualities',
 'any of the larger cetacean mammals having a streamlined body and breathing through a blowhole on the head',
 'hunt for whales']

# Knowledge-rich WSD based on WordNet++


Technique developed by Simone Paolo Ponzetto and Roberto Navigli in their article "Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems" (https://aclanthology.org/P10-1154/). This approach uses supplementary relations between words in order to compute the relatedness between concepts.

All the new relations are based on Wikipedia, which is why in this laboratory we need the Wikipedia module (documentation: https://wikipedia.readthedocs.io/en/latest/code.html). However, some of the relations we need are not implemented in the Wikipedia module, so we will also use the requests module to send requests to the MediaWiki action API.

**Types of relations**

- "Redirect to" relations (https://www.mediawiki.org/wiki/API:Redirects)
- disambiguation pages
- internal links

In order to use these relations we need a mapping between WordNet word senses and Wikipedia articles. As an example, the article gives the word "soda" (https://en.wikipedia.org/wiki/Soda). Notice that the disambiguation page redirects to this same page: https://en.wikipedia.org/wiki/Soda_(disambiguation). You can see that it has multiple senses, illustrated in a list of pages. You can obtain the ids of such pages with code similar to the following:

In [23]:
import requests

#create a connection(session)
r_session = requests.Session()

#url for the MediaWiki action API
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query", #we are creating a query
    "titles": "car", #for the title car    
    "prop": "redirects", #asking for all the redirects (to the title car)
    "format": "json" #and we want the output in a json format
}

#we obtain the response to the GET request with the given parameters
query_response = r_session.get(url=URL, params=PARAMS)
json_data = query_response.json()

wikipedia_pages = json_data["query"]["pages"]

#we iterate through items and print all the redirects (their title and id)
try:
    for k, v in wikipedia_pages.items():
        for redir in v["redirects"]:
            print("{} redirect to {}({})".format(redir["title"], v["title"], redir["pageid"]))
except KeyError as err:
    if err.args[0]=='redirects':
        print("It has no redirects")
    else:
        print(repr(err))

Cars redirect to Car(73688)
Motor car redirect to Car(458458)
Motorcar redirect to Car(458459)
Automobiles redirect to Car(513608)
Motor Car redirect to Car(840650)
Passenger cars redirect to Car(1288645)
Ottomobile redirect to Car(1836567)
Automobles redirect to Car(1842410)
Motorization redirect to Car(3223435)
Motorisation redirect to Car(3223436)


JSON data looks like this:

In [24]:
{
   "continue":{
      "rdcontinue":"6492781",
      "continue":"||"
   },
   "query":{
      "normalized":[
         {
            "from":"car",
            "to":"Car"
         }
      ],
      "pages":{
         "13673345":{
            "pageid":13673345,
            "ns":0,
            "title":"Car",
            "redirects":[
               {
                  "pageid":73688,
                  "ns":0,
                  "title":"Cars"
               },
               {
                  "pageid":458458,
                  "ns":0,
                  "title":"Motor car"
               },
               {
                  "pageid":458459,
                  "ns":0,
                  "title":"Motorcar"
               },
               {
                  "pageid":513608,
                  "ns":0,
                  "title":"Automobiles"
               },
               {
                  "pageid":840650,
                  "ns":0,
                  "title":"Motor Car"
               },
               {
                  "pageid":1836567,
                  "ns":0,
                  "title":"Ottomobile"
               },
               {
                  "pageid":1842410,
                  "ns":0,
                  "title":"Automobles"
               },
               {
                  "pageid":3223435,
                  "ns":0,
                  "title":"Motorization"
               },
               {
                  "pageid":3223436,
                  "ns":0,
                  "title":"Motorisation"
               },
               {
                  "pageid":6260924,
                  "ns":0,
                  "title":"Passenger Vehicle"
               }
            ]
         }
      }
   }
}

{'continue': {'continue': '||', 'rdcontinue': '6492781'},
 'query': {'normalized': [{'from': 'car', 'to': 'Car'}],
  'pages': {'13673345': {'ns': 0,
    'pageid': 13673345,
    'redirects': [{'ns': 0, 'pageid': 73688, 'title': 'Cars'},
     {'ns': 0, 'pageid': 458458, 'title': 'Motor car'},
     {'ns': 0, 'pageid': 458459, 'title': 'Motorcar'},
     {'ns': 0, 'pageid': 513608, 'title': 'Automobiles'},
     {'ns': 0, 'pageid': 840650, 'title': 'Motor Car'},
     {'ns': 0, 'pageid': 1836567, 'title': 'Ottomobile'},
     {'ns': 0, 'pageid': 1842410, 'title': 'Automobles'},
     {'ns': 0, 'pageid': 3223435, 'title': 'Motorization'},
     {'ns': 0, 'pageid': 3223436, 'title': 'Motorisation'},
     {'ns': 0, 'pageid': 6260924, 'title': 'Passenger Vehicle'}],
    'title': 'Car'}}}}
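The `continue` object in the response above signals that more redirects are available than fit in one reply. A minimal sketch of following it is shown below, assuming the usual MediaWiki convention of sending every key of `continue` back with the next request. The helper names `fetch_json` and `all_redirects` are hypothetical, and `fetch` is a parameter only so that the loop can be tried without network access.

```python
def fetch_json(params):
    """Send one GET request to the MediaWiki action API and decode the JSON."""
    import requests  # imported lazily so all_redirects can be used without it
    response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
    response.raise_for_status()
    return response.json()


def all_redirects(title, fetch=fetch_json):
    """Collect (pageid, title) for every redirect to `title`."""
    params = {"action": "query", "titles": title, "prop": "redirects",
              "rdlimit": "max", "format": "json"}
    found = []
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for redir in page.get("redirects", []):
                found.append((redir["pageid"], redir["title"]))
        if "continue" not in data:
            return found
        # Merge the continuation keys (e.g. rdcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Calling `all_redirects("car")` would then return the complete list rather than only the first batch printed earlier.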

Notice the `normalized` field: it is not what you might expect. It does not obtain the lemma of the word; it only reports how the API converted the requested title to its canonical form (here the first letter was capitalized, so "car" became "Car").

For disambiguations, notice the following two links:

https://en.wikipedia.org/w/api.php?action=query&titles=soda&prop=pageprops&format=json
https://en.wikipedia.org/w/api.php?action=query&titles=car&prop=pageprops&format=json
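The difference between the two responses can be tested programmatically. The sketch below assumes that a disambiguation page carries a `disambiguation` key inside its `pageprops` object, which is worth confirming against the two links above. `is_disambiguation` is a hypothetical helper name, and the function only inspects an already decoded response.

```python
# Same parameters as the first link above; pass them to requests.get(...)
# together with the API URL to obtain `json_data`.
PAGEPROPS_PARAMS = {"action": "query", "titles": "soda",
                    "prop": "pageprops", "format": "json"}


def is_disambiguation(json_data):
    """Return True if any page in the response is flagged as a
    disambiguation page (assumed marker: a 'disambiguation' page prop)."""
    pages = json_data.get("query", {}).get("pages", {})
    return any("disambiguation" in page.get("pageprops", {})
               for page in pages.values())
```

A page with no `pageprops` at all (for example a missing title) is treated as not being a disambiguation page.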

To create the mapping, we shall use the following for a given Wikipedia page:

- sense labels (these are actually the titles of the pages; at the time the article was written, the titles had the syntax "word (sense label)", like "soda (soft drink)", but notice that now you only find the sense label as the title)
- links (outgoing links from the current page)
- categories

The article uses the notation Ctx(w) for the set of words obtained from the text of some or all these pages.
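One possible way to build Ctx(w) once the title, links and categories of a page have been retrieved is sketched below. It is a simplification: words are lowercased and split on non-alphanumeric characters, with no lemmatization, and `wiki_context` is a hypothetical helper name.

```python
import re


def wiki_context(title, links=(), categories=()):
    """Return Ctx(w) as a set of lowercase words taken from the sense label
    of the title, the titles of linked pages and the category names."""
    # "Soda (soft drink)" -> "soft drink"; a title without parentheses is
    # used as the sense label itself.
    match = re.search(r"\(([^)]*)\)\s*$", title)
    label = match.group(1) if match else title
    # Drop a leading "Category:" prefix if present.
    category_names = [c.split(":", 1)[-1] for c in categories]
    words = set()
    for text in [label, *links, *category_names]:
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return words


print(sorted(wiki_context("Soda (soft drink)",
                          links=["Carbonated water"],
                          categories=["Category:Soft drinks"])))
# ['carbonated', 'drink', 'drinks', 'soft', 'water']
```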

Next, we need the WordNet context for a sense s, Ctx(s), for each sense of the word. For this we take the following relations:

- synonymy
- hypernymy/hyponymy
- sisterhood (senses that have the same direct hypernym)
- gloss
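The four relations above can be gathered with the standard NLTK `Synset` methods. The sketch below is one possible reading: it uses lemma names as the words contributed by each related synset, splits multiword lemmas such as `soda_pop` into their parts, and keeps every gloss word including stopwords. `wordnet_context` is a hypothetical helper name.

```python
import re


def wordnet_context(synset):
    """Return Ctx(s) as a set of lowercase words for one WordNet synset."""
    texts = list(synset.lemma_names())            # synonymy
    for hyper in synset.hypernyms():
        texts.extend(hyper.lemma_names())         # hypernymy
        for sister in hyper.hyponyms():           # sisterhood
            if sister != synset:
                texts.extend(sister.lemma_names())
    for hypo in synset.hyponyms():
        texts.extend(hypo.lemma_names())          # hyponymy
    texts.append(synset.definition())             # gloss
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return words
```

With WordNet loaded as in the first cell, `wordnet_context(s)` can be called on each synset returned by `wn.synsets('soda')` to obtain one context per sense.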

The next step is the mapping:

1. For each word that we want to disambiguate, if we have **only one sense** and only one Wikipedia page, we map that Wikipedia page to the word.
2. In the case of multiple senses, for each remaining Wikipedia page w (after the mapping from the previous step) that still has no associated WordNet sense, we take all the redirects to w. For each such redirect we check whether it already has a mapping (a relation between its sense and the Wikipedia page). If it does, and the mapped word is in w's synset, we map w to the sense associated with the redirect page.
3. For all Wikipedia pages that aren't mapped yet, we try to assign the most probable sense. The most probable sense has the highest value $p = score(s,w)/sum$, where $sum$ is the sum of the scores over all combinations of a WordNet sense of the word and a Wikipedia sense of the word. The score is the number of words shared by the context of the sense in Wikipedia and the context of the sense in WordNet, plus 1: $score(s,w)=|Ctx(s) \cap Ctx(w)|+1$
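Step 3 can be sketched as follows, assuming the contexts have already been computed as sets of words. The function names and the toy contexts are invented for illustration only.

```python
def score(ctx_s, ctx_w):
    """Number of words shared by the two contexts, plus 1."""
    return len(ctx_s & ctx_w) + 1


def best_sense(page, wn_contexts, wiki_contexts):
    """Pick the WordNet sense with the highest probability for `page`.

    wn_contexts:   dict mapping each WordNet sense of the word to Ctx(s)
    wiki_contexts: dict mapping each Wikipedia sense of the word to Ctx(w)
    """
    # Normalizer: scores summed over every (WordNet sense, Wikipedia sense) pair.
    total = sum(score(cs, cw)
                for cs in wn_contexts.values()
                for cw in wiki_contexts.values())
    probs = {sense: score(cs, wiki_contexts[page]) / total
             for sense, cs in wn_contexts.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


# Toy contexts, made up for the example.
wn_contexts = {"soda.n.01": {"sodium", "carbonate"},
               "soda.n.02": {"drink", "carbonated", "water"}}
wiki_contexts = {"Soft drink": {"drink", "water", "sugar"},
                 "Sodium carbonate": {"sodium", "salt"}}
print(best_sense("Soft drink", wn_contexts, wiki_contexts))
# ('soda.n.02', 0.42857142857142855)
```

Here the four pairwise scores are 1, 2, 3 and 1, so the normalizer is 7 and the winning sense gets 3/7. The added 1 keeps every sense at a non-zero probability even when the contexts share no words.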

In the end we have created new relations (WordNet++) that we can use in a **simplified Lesk manner** to disambiguate a text. We will compute the overlaps on all the glosses given by the mentioned relations.



In [25]:
sentence = "I was drinking a soda with a straw."