In [None]:
from google.colab import drive
drive.mount('/content/drive')

# path where the data is contained
path = '/content/drive/MyDrive/IHLT/lab2/'

import pandas as pd
import re
import nltk
from nltk.metrics import jaccard_distance
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from scipy.stats import pearsonr
from IPython.display import display_html 

nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

Mounted at /content/drive


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# Lab 6: Word Sense Disambiguation

For the sixth practical of the subject, the goal is to get familiar with Lesk's algorithm. The statement is as follows:

1. Read all pairs of sentences of the SMTeuroparl files of the test set within the evaluation framework of the project.
2. Apply Lesk’s algorithm to the words in the sentences.
3. Compute their similarities by considering senses and the Jaccard coefficient.
4. Compare the results with those of sessions 2 (document) and 3 (morphology), in which words and lemmas were considered.
5. Compare the results with the gold standard by giving the Pearson correlation between them.


We will do three comparisons. First, we compare only the words that have a representation in WordNet. Since this comparison is not entirely fair, we then repeat it with the stopwords removed. Finally, to compare the method with those presented in the second and third sessions, we also add to the sets the tokens of the words that do not appear in WordNet.

In [None]:
# read the dataset and apply the lesk function for each pair of sentences
def data_reader(function_preprocess):
  dt = pd.read_csv(path + 'STS.input.SMTeuroparl.txt', sep='\t', header = None) # i guess we could read this just once but eh
  dt[2] = dt.apply(lambda row: function_preprocess(row[0]), axis = 1)
  dt[3] = dt.apply(lambda row: function_preprocess(row[1]), axis = 1)
  dt['gs'] = pd.read_csv(path + 'STS.gs.SMTeuroparl.txt', sep='\t', header = None)
  dt['jac'] = dt.apply(lambda row: 5*(1 - jaccard_distance(row[2], row[3])), axis = 1)
  #dt.drop_duplicates(subset = [0, 1], keep=False, inplace = True) # drop duplicate rows, but no one seems to be doing it so why should we 
  return dt
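
As a worked example of the score stored in the `jac` column, the sketch below reimplements the Jaccard step in plain Python. The helper name `jaccard_score` and the handling of two empty sets are our own assumptions and are not part of the lab code; `jaccard_distance` from NLTK corresponds to one minus the ratio computed here. The two sets are the synsets of one sentence pair from the stopword-free run shown further down.

```python
def jaccard_score(a, b, scale=5):
    """Jaccard similarity of two sets, rescaled to the 0-5 range of the gold standard."""
    union = a | b
    if not union:
        # Assumption: two empty sets score 0 (the lab code does not define this case)
        return 0.0
    return scale * len(a & b) / len(union)

s1 = {'take.v.39', 'vote.n.04', 'topographic_point.n.01', 'today.n.02'}
s2 = {'take.v.39', 'vote.n.04', 'topographic_point.n.01'}

# 3 shared senses out of 4 in the union, rescaled by 5
score = jaccard_score(s1, s2)
print(score)
```

A single mismatching element already costs a quarter of the score here, which is why short sentences are especially sensitive to one wrong sense.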

## Lesk

In this basic case, we POS-tag the sentence and apply Lesk directly to each word-tag pair, using the sentence as context.

In [None]:
stopw = set(nltk.corpus.stopwords.words('english')) # english stopwords
wnl = WordNetLemmatizer() # initialize the lemmatizer

# define the tags of the wordnet
tags = {'NN': wordnet.NOUN,
        'VB': wordnet.VERB,
        'JJ': wordnet.ADJ, 
        'RB': wordnet.ADV}

def lesk_basic(sentence):
  tokens = nltk.word_tokenize(sentence) # tokenize
  pairs = nltk.pos_tag(tokens) # get the pos of the tokens
  synsets = [nltk.wsd.lesk(sentence, pair[0], pos = tags.get(pair[1][:2].upper())) for pair in pairs] # use lesk on the sentence using the word and the tag
  synsets = [syn.name() for syn in synsets if syn] # filter empty results
  return set(synsets)
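
To make the tag lookup in `lesk_basic` explicit, here is a minimal sketch of how the first two letters of a Penn Treebank tag are mapped to a WordNet part of speech. It uses plain strings instead of the `wordnet` constants (`wordnet.NOUN` is `'n'`, and so on), and the names `WN_TAGS` and `to_wordnet_pos` are hypothetical.

```python
# WordNet POS constants written as plain strings
WN_TAGS = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}

def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, or None if there is no counterpart."""
    return WN_TAGS.get(penn_tag[:2].upper())

# Plural nouns, past-tense verbs and comparatives share the prefix of their base tag;
# prepositions (IN) and numbers (CD) have no WordNet counterpart
mapped = {tag: to_wordnet_pos(tag) for tag in ['NNS', 'VBD', 'JJR', 'RB', 'IN', 'CD']}
print(mapped)
```

When the lookup returns `None`, `lesk` is called with `pos=None` and considers the senses of every part of speech, so a preposition can end up with a noun sense.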

In [None]:
dt = data_reader(lesk_basic)
dt.head()

Unnamed: 0,0,1,2,3,gs,jac
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,"{probability.n.01, give.v.43, seize.v.06, let....","{profit.v.01, therefore.r.01, seize.v.06, let....",4.5,1.875
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,"{amendment.n.02, propose.v.01, variety.n.06, s...","{amendment.n.02, propose.v.01, variety.n.06, s...",5.0,4.5
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,"{tax.n.01, maine.n.01, let.v.01, ally.n.01, in...","{potent.a.03, wish.v.02, tax.n.01, embody.v.02...",4.25,1.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,"{vote.n.04, today.n.02, astatine.n.01, take.v....","{vote.n.04, astatine.n.01, take.v.39, topograp...",4.5,4.166667
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...","{embody.v.02, tired.a.01, passive.a.01, fisher...","{embody.v.02, tired.a.01, passive.a.01, fisher...",5.0,5.0


In [None]:
pearsonr(dt['gs'], dt['jac'])[0]

0.4823328683435585
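
For reference, the correlation returned by `pearsonr` can be sketched in a few lines of plain Python. The function `pearson_r` below is our own illustration, not the SciPy implementation, and the five value pairs are only the rows displayed by `dt.head()` above, so the result is not comparable to the figure for the whole dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

gold = [4.5, 5.0, 4.25, 4.5, 5.0]          # 'gs' column of the first five rows
jac = [1.875, 4.5, 1.25, 4.166667, 5.0]    # 'jac' column of the first five rows

r = pearson_r(gold, jac)
# Dividing the scores by 5 again (undoing the rescaling) leaves the correlation unchanged
r_rescaled = pearson_r(gold, [x / 5 for x in jac])
print(r, r_rescaled)
```

Since the correlation is invariant to positive linear rescaling, the factor 5 in `data_reader` only makes the scores readable next to the gold standard; it does not affect the reported value.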

This seems to work reasonably well. However, we are only using the words that have a representation in WordNet, so this value is somewhat inflated.

Before correcting for this, let us check the worst sentences to see where we could improve. A low Jaccard score means that few elements lie in the intersection of both sets, but let us see why.

Looking at the output, we can see that Lesk returns quite a few very scientific synsets. Specifically, one- and two-letter words are often resolved to chemical elements (`in` $\rightarrow$ `indium.n.01`), which is clearly not the intended meaning. Most of these words are actually stopwords, and WordNet, which only covers nouns, verbs, adjectives and adverbs, has no entry for their function-word reading. Let us remove them before we proceed.



In [None]:
dt['diff'] = abs(dt['jac'] - dt['gs'])
dt_worst = dt.sort_values(by=['diff'], ascending=False).head(4)
df1_styler = dt_worst.style.set_table_attributes("style='display:inline'").set_caption('Highest difference between Jaccard and Gold Standard')
display_html(df1_styler._repr_html_(), raw=True)

Unnamed: 0,0,1,2,3,gs,jac,diff
414,Maij-Weggen report (A5-0323/2000),Report/ratio Maij-Weggen (A5-0323/2000),{'reputation.n.03'},set(),4.75,0.0,4.75
169,The vote will take place today at 5.30 p.m.,The vote will be to 17h30.,"{'vote.n.04', 'today.n.02', 'astatine.n.01', 'take.v.39', 'topographic_point.n.01', 'will.n.03'}","{'will.v.02', 'exist.v.01', 'vote.n.05'}",4.5,0.0,4.5
36,There must be a balance as a whole.,Group must be in equilibrium.,"{'symmetry.n.01', 'whole.n.02', 'embody.v.02', 'there.n.01', 'deoxyadenosine_monophosphate.n.01', 'must.n.01'}","{'mustiness.n.01', 'group.n.03', 'equilibrium.n.04', 'exist.v.01', 'indium.n.01'}",4.5,0.0,4.5
335,"Consumers will lose out, employees will lose out, Europe will lose competitive strength and growth.","Users are the losers, with workers and European competitiveness and innovation régresseront.","{'europe.n.01', 'suffer.v.11', 'increase.n.03', 'out.s.03', 'competitive.a.01', 'persuasiveness.n.01', 'consumer.n.01', 'will.n.03', 'employee.n.01'}","{'invention.n.02', 'competitiveness.n.01', 'loser.n.03', 'embody.v.02', 'european.a.01', 'user.n.01', 'worker.n.03'}",4.25,0.0,4.25


Let us do the same, but this time skip any word that is a stopword.

In [None]:
def lesk_stopless(sentence):
  tokens = nltk.word_tokenize(sentence) # tokenize
  pairs = nltk.pos_tag(tokens) # get the pos of the tokens
  synsets = [nltk.wsd.lesk(sentence, pair[0], pos = tags.get(pair[1][:2].upper())) for pair in pairs if pair[0].lower() not in stopw] # skip if the word is stopword else use lesk 
  synsets = [syn.name() for syn in synsets if syn] # we ignore the words without meaning
  return set(synsets)

In [None]:
dt = data_reader(lesk_stopless)
dt.head()

Unnamed: 0,0,1,2,3,gs,jac
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,"{probability.n.01, give.v.43, seize.v.06, let....","{profit.v.01, therefore.r.01, seize.v.06, let....",4.5,1.818182
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,"{amendment.n.02, propose.v.01, variety.n.06, s...","{amendment.n.02, propose.v.01, variety.n.06, s...",5.0,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,"{tax.n.01, let.v.01, ally.n.01, include.v.03, ...","{potent.a.03, wish.v.02, tax.n.01, ally.n.01, ...",4.25,1.875
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,"{take.v.39, vote.n.04, topographic_point.n.01,...","{take.v.39, vote.n.04, topographic_point.n.01}",4.5,3.75
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...","{passive.a.01, tired.a.01, fisherman.n.01}","{passive.a.01, tired.a.01, fisherman.n.01}",5.0,5.0


In [None]:
pearsonr(dt['gs'], dt['jac'])[0]

0.5083583421535602

Analyzing the worst results, we can see that the first one is hopeless without smarter initial preprocessing. We could split tokens at every `/`, but that would also split `A5-0323/2000`, which we may or may not want.
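
One possible compromise, sketched here as an assumption rather than something we ran in the lab, is to split on `/` only when it sits between two letters. The helper `split_word_slashes` is hypothetical.

```python
import re

def split_word_slashes(token):
    """Split on '/' only when both neighbours are letters (hypothetical helper)."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore, i.e. a letter
    return re.split(r'(?<=[^\W\d_])/(?=[^\W\d_])', token)

words = split_word_slashes('Report/ratio')
code = split_word_slashes('A5-0323/2000')
print(words, code)
```

With this rule `Report/ratio` becomes two words while the document code `A5-0323/2000` stays in one piece, since its slash is surrounded by digits.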

The next worst result is observation 169. Even though the same word appears in both sentences, it is assigned a different sense in each (`vote.n.04` and `vote.n.05`), so it does not count as a match.

For observation 36, the presence of `equilibrium` seems to have pushed the choice towards more mathematical definitions, so the mathematical WordNet entry is chosen for `group` and the wrong entry is then chosen for `must`. Still, even if `must` were disambiguated correctly, this would not be a great match, since the other synsets do not match either.

In [None]:
dt['diff'] = abs(dt['jac'] - dt['gs'])
dt_worst = dt.sort_values(by=['diff'], ascending=False).head(4)
df1_styler = dt_worst.style.set_table_attributes("style='display:inline'").set_caption('Highest difference between Jaccard and Gold Standard')
display_html(df1_styler._repr_html_(), raw=True)

Unnamed: 0,0,1,2,3,gs,jac,diff
414,Maij-Weggen report (A5-0323/2000),Report/ratio Maij-Weggen (A5-0323/2000),{'reputation.n.03'},set(),4.75,0.0,4.75
169,The vote will take place today at 5.30 p.m.,The vote will be to 17h30.,"{'take.v.39', 'vote.n.04', 'topographic_point.n.01', 'today.n.02'}",{'vote.n.05'},4.5,0.0,4.5
36,There must be a balance as a whole.,Group must be in equilibrium.,"{'symmetry.n.01', 'must.n.01', 'whole.n.02'}","{'equilibrium.n.04', 'mustiness.n.01', 'group.n.03'}",4.5,0.0,4.5
31,Maij-Weggen report (A5-0323/2000),Relation Maij-Weggen (A5-0323/2000),{'reputation.n.03'},{'sexual_intercourse.n.01'},4.25,0.0,4.25


With this in mind, let us add the tokens of the words for which Lesk returns nothing, since not all words appear in WordNet. This makes the results more comparable with the previous sessions.

In [None]:
def lesk_sentence_fair(sentence):
  tokens = nltk.word_tokenize(sentence) # tokenize
  pairs = nltk.pos_tag(tokens) # get the pos of the tokens
  pairs = [(re.sub(r'[^\w\s]', '', pair[0]), pair[1]) for pair in pairs] # strip punctuation inside the tokens (e.g. '5.30' -> '530')
  text = []
  for pair in pairs:
    # skip stopwords, tokens made only of symbols or underscores, and tokens left empty after stripping
    if pair[0].lower() in stopw or re.match(r'^[_\W]+$', pair[0].lower()) or not pair[0]:
      continue
    synset = nltk.wsd.lesk(sentence, pair[0], pos = tags.get(pair[1][:2].upper())) # use lesk on the sentence using the word and the tag
    if synset: # Lesk found a sense, so we keep the synset name
      text.append(synset.name())
    else: # Lesk found no sense for this word-tag pair: keep the lowercased token, without lemmatizing
      text.append(pair[0].lower())
  return set(text)

In [None]:
dt = data_reader(lesk_sentence_fair)
df1_styler = dt.head().style.set_table_attributes("style='display:inline'").set_caption('First rows with Lesk, stopword removal and tokens')
display_html(df1_styler._repr_html_(), raw=True)

Unnamed: 0,0,1,2,3,gs,jac
0,The leaders have now been given a new chance and let us hope they seize it.,The leaders benefit aujourd' hui of a new luck and let's let them therefore seize it.,"{'probability.n.01', 'give.v.43', 'seize.v.06', 'let.v.01', 'new.a.06', 'hope.v.03', 'uranium.n.01', 'leadership.n.02'}","{'profit.v.01', 'therefore.r.01', 'seize.v.06', 'let.v.01', 'new.a.06', 'leadership.n.02', 'luck.n.03', 'hui', 'aujourd'}",4.5,1.538462
1,Amendment No 7 proposes certain changes in the references to paragraphs.,Amendment No 7 is proposing certain changes in the references to paragraphs.,"{'amendment.n.02', 'propose.v.01', 'variety.n.06', 'sealed.a.01', 'seven.s.01', 'paragraph.v.03', 'reference_book.n.01'}","{'amendment.n.02', 'propose.v.01', 'variety.n.06', 'sealed.a.01', 'seven.s.01', 'paragraph.v.03', 'reference_book.n.01'}",5.0,5.0
2,Let me remind you that our allies include fervent supporters of this tax.,"I would like to remind you that among our allies, there are strong of this tax.","{'fervent', 'tax.n.01', 'let.v.01', 'ally.n.01', 'include.v.03', 'supporter.n.01', 'remind.v.01'}","{'potent.a.03', 'wish.v.02', 'tax.n.01', 'ally.n.01', 'would', 'among', 'remind.v.01'}",4.25,1.363636
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,"{'530', 'vote.n.04', 'today.n.02', 'take.v.39', 'topographic_point.n.01', 'promethium.n.01'}","{'take.v.39', 'vote.n.04', 'topographic_point.n.01', '530pm'}",4.5,2.142857
4,"The fishermen are inactive, tired and disappointed.","The fishermen are inactive, tired and disappointed.","{'passive.a.01', 'disappointed', 'tired.a.01', 'fisherman.n.01'}","{'passive.a.01', 'disappointed', 'tired.a.01', 'fisherman.n.01'}",5.0,5.0


In [None]:
pearsonr(dt['gs'], dt['jac'])[0] 

0.4780132616520053

We can see that adding the tokens helped some of the previous sentences. For example, the ones we deemed hopeless have improved, since `A5-0323/2000` and `Maij-Weggen` are now matched as plain tokens.

However, we still have the same problem as before: the same or similar words get different senses depending on the context of the sentence. In observation 452 we now see `catastrophe.n.03` and `catastrophe.n.02`; the former means a sudden violent change in the earth's surface and the latter a state of extreme ruin. The choice of sense 03 for the first sentence seems wrong; it should have been 02.

For the tokens, the problems we described in earlier sessions remain. For example, `530pm` and `530 pm` are still considered different, since in one case the number and the unit are joined and in the other they are separated. More preprocessing will be needed to deal with these cases.
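
As an illustration of the kind of extra preprocessing meant here, the following sketch separates a number from the letters glued to it. The helper `split_digits_letters` is hypothetical and was not part of our pipeline.

```python
import re

def split_digits_letters(token):
    """Break a token wherever a digit is directly followed by a letter (hypothetical helper)."""
    # The zero-width pattern matches the position between a digit and a letter
    return re.sub(r'(?<=\d)(?=[^\W\d_])', ' ', token).split()

joined = split_digits_letters('530pm')
plain = split_digits_letters('530')
hours = split_digits_letters('17h30')
print(joined, plain, hours)
```

This would let `530pm` share the token `530` with the other sentence, but it does nothing for a different notation such as `17h30`, which would still need a dedicated time normalization.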

In [None]:
dt['diff'] = abs(dt['jac'] - dt['gs'])
dt_worst = dt.sort_values(by=['diff'], ascending=False).head(4)
df1_styler = dt_worst.style.set_table_attributes("style='display:inline'").set_caption('Highest difference between Jaccard and Gold Standard')
display_html(df1_styler._repr_html_(), raw=True)

Unnamed: 0,0,1,2,3,gs,jac,diff
169,The vote will take place today at 5.30 p.m.,The vote will be to 17h30.,"{'530', 'vote.n.04', 'today.n.02', 'take.v.39', 'topographic_point.n.01', 'promethium.n.01'}","{'17h30', 'vote.n.05'}",4.5,0.0,4.5
36,There must be a balance as a whole.,Group must be in equilibrium.,"{'symmetry.n.01', 'must.n.01', 'whole.n.02'}","{'equilibrium.n.04', 'mustiness.n.01', 'group.n.03'}",4.5,0.0,4.5
335,"Consumers will lose out, employees will lose out, Europe will lose competitive strength and growth.","Users are the losers, with workers and European competitiveness and innovation régresseront.","{'europe.n.01', 'suffer.v.11', 'increase.n.03', 'competitive.a.01', 'persuasiveness.n.01', 'consumer.n.01', 'employee.n.01'}","{'invention.n.02', 'competitiveness.n.01', 'loser.n.03', 'régresseront', 'european.a.01', 'user.n.01', 'worker.n.03'}",4.25,0.0,4.25
452,Then perhaps we could have avoided a catastrophe.,We might have been able to prevent a disaster.,"{'catastrophe.n.03', 'could', 'possibly.r.01', 'keep_off.v.01'}","{'able.a.01', 'prevent.v.02', 'catastrophe.n.02', 'might.n.01'}",4.25,0.0,4.25



# Conclusion

We have compared three ways of approaching the STS task with Word Sense Disambiguation, using the classic Lesk algorithm. In the first case, we simply POS-tag the given sentence and disambiguate each word together with its tag. This returns adequate results, but on closer inspection some of the chosen senses are clearly wrong. For example, stopwords are not represented in WordNet, so when they are not removed Lesk returns incorrect synsets for them. Moreover, since these words are often only one or two letters long, they get confused with the symbols of chemical elements.

The second Lesk implementation fixes this by checking whether each word is a stopword and skipping it if so. We feel this is a more realistic approach, since it is also what we did in Lab 2 and Lab 3, and it therefore allows a better comparison. Removing the stopwords gives a higher correlation with the gold standard, as the unexpected synsets produced by the stopwords are no longer present.

| Processing | Pearson $r$ |  
|------------|:---:|
| Lesk     | 0.482 |
| Lesk + Stopwords   | 0.508  |


Finally, since the previous two implementations only compare the words that appear in WordNet, we decided to add the words that do not appear as plain tokens. This makes the comparison with the previous lab sessions fairer, since otherwise we would be inflating the results by ignoring part of the text (numbers, given names, etc.). We do not lemmatize these words: the lemmatizer also relies on WordNet, so if Lesk returns nothing there will be nothing to lemmatize either. We also remove the stopwords and the tokens that only contain non-alphanumeric symbols, to match the preprocessing of the previous sessions. Furthermore, we do not remove the symbols inside the synset names (e.g. `vote.n.03` stays the same). With these restrictions, the resulting Pearson correlations are given in the following table.

| Processing | Pearson $r$ |  
|------------|:---:|
| Simple (Lab 2)    | 0.468 |
| Stemming (Lab 2)  | 0.462  |
| Lemmatizing (Lab 3) | 0.483  |
| **Lesk + Stopwords + Tokens** | **0.478**  |

As we can see in the table, Lesk returns results quite similar to those of the previous labs: better than using the tokens directly or stemming them, but slightly worse than lemmatizing.

Lesk essentially picks the sense whose definition has the largest overlap with the `context` sentence, which makes it very sensitive to the exact wording of the definitions. Hence, the absence of certain words can radically change the result. Moreover, it only checks overlaps with the glosses of the senses being considered, and these are usually not long enough to discriminate well between senses.
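
The overlap idea can be sketched in a few lines. This is a simplified illustration under our own assumptions, not the NLTK source: the function `simplified_lesk` is hypothetical, and the two glosses are toy definitions written for this example, not the WordNet ones.

```python
def simplified_lesk(context_tokens, senses):
    """Return the sense whose gloss shares the most words with the context.

    `senses` maps a sense name to its gloss. Ties are broken by sense name,
    which is an arbitrary choice made for this sketch.
    """
    context = set(context_tokens)
    scored = [(len(context & set(gloss.split())), name) for name, gloss in senses.items()]
    return max(scored)[1]

# Toy glosses written for this example; they are NOT the WordNet definitions
toy_senses = {
    'bank.n.01': 'sloping land beside a body of water',
    'bank.n.02': 'a financial institution that accepts deposits of money',
}

sense_river = simplified_lesk('they sat on the land beside the water'.split(), toy_senses)
sense_money = simplified_lesk('she deposits money at the institution'.split(), toy_senses)
print(sense_river, sense_money)
```

Only exact word matches count, so a paraphrase of the context with no word in common with the gloss contributes nothing. One detail worth checking when reproducing the cells above: NLTK's `lesk` builds its context with `set(context_sentence)`, so it expects a list of tokens. If it receives a raw string, the context becomes a set of single characters, which weakens the overlap signal.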

This can be seen clearly in some of the previous examples. For instance, the word `vote` appeared in both sentences of a pair, yet Lesk assigned it two different senses:


- `The vote will take place today at 5.30 p.m.` $\rightarrow$ `vote.n.04`
- `The vote will be to 17h30.`  $\rightarrow$ `vote.n.05` (is this even a correct sentence?)

This makes our naive comparison fail, since we check whether the whole synset name matches. It could be fixed by dropping the part of speech and the sense number (everything after the first dot in the synset name). However, we were told by a higher power, our savior, Salvador, not to do it.
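
Purely to illustrate what that alternative would look like (we did not use it for the reported results), the sketch below strips the part of speech and sense number from synset names before comparing the sets of observation 169. The helper `lemma_of` is hypothetical.

```python
def lemma_of(item):
    """Drop the '.pos.nn' suffix of a synset name; plain tokens are returned unchanged."""
    parts = item.rsplit('.', 2)
    if len(parts) == 3 and parts[2].isdigit():
        return parts[0]
    return item

s1 = {'530', 'vote.n.04', 'today.n.02', 'take.v.39', 'topographic_point.n.01', 'promethium.n.01'}
s2 = {'17h30', 'vote.n.05'}

# Compare on the head word only, ignoring which sense Lesk picked
shared = {lemma_of(x) for x in s1} & {lemma_of(x) for x in s2}
print(shared)
```

The pair now shares `vote`, but the comparison no longer distinguishes between senses at all, which defeats the purpose of disambiguating in the first place.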

Overall, WSD seems a better approach to the STS problem, since it makes more sense to compare the meanings of the words than their surface forms. Nevertheless, there are still some quirks to fix, since some words are assigned the wrong WordNet entry, which makes the set intersection fail.