# Lab 7: Finishing n-grams and word sense disambiguation with the Lesk algorithm

```
Plan : 
  1. Ngrams as features, Skip-grams
  1. Simplified Lesk algorithm, using WordNet
  1. (optional) WSD with Naive Bayes Classifier
```

## Ngrams and skip-grams as features

What limitations of the Bag-of-Words model do you remember?

One of them is the loss of word order. For the following two sentences the BoW representations are identical, yet the labels differ.
* No, this movie is good. [Positive sentiment]
* This movie is no good. [Negative sentiment]
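
The loss of word order can be checked directly. The sketch below compares counts of single words with counts of adjacent word pairs for the two sentences; `tokenize` and `bag_of_ngrams` are illustrative helper names, and the tokenizer is a deliberately crude assumption (lowercase, letters only).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (punctuation is dropped).
    return re.findall(r"[a-z]+", text.lower())

def bag_of_ngrams(tokens, n):
    # Count every run of n adjacent tokens; the order of the runs is discarded.
    return Counter(zip(*(tokens[i:] for i in range(n))))

pos = tokenize("No, this movie is good.")
neg = tokenize("This movie is no good.")

same_unigrams = bag_of_ngrams(pos, 1) == bag_of_ngrams(neg, 1)
same_bigrams = bag_of_ngrams(pos, 2) == bag_of_ngrams(neg, 2)
print(same_unigrams, same_bigrams)  # True False
```

The single-word counts cannot tell the sentences apart, while the pair counts can: only the negative sentence contains the pair ("no", "good").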

So, introducing n-grams into the Bag-of-Words model (Bag-of-Ngrams) can mitigate this limitation (how?). What else can be done to improve it? Consider the following sentences:

* This movie is not quite good.
* This movie is not that good.
* This movie is not very good.

Can you see a pattern here? It seems we could add "not _ good" instead of all the possible ngrams. Such ngrams are called Skip-grams.

**N-grams** are sequences of adjacent units (letters, words, or whatever counting unit you happen to care about) of length "n"; it's a cover term for bigrams (sequences of 2 adjacent things, n = 2), trigrams (sequences of three adjacent things), 4-grams, etc.


**Skip-grams** (or "k-skip-n-grams") are sequences of ordered but not-necessarily-adjacent (thus "skipped") units, where the gaps can be at most "k" units long.


For example, in the sentence "The quick brown fox jumped over the lazy dog"
* bigrams (2-grams) include "The quick", "quick brown", "brown fox", "fox jumped", "jumped over", "over the", "the lazy", and "lazy dog",
* 1-skip-2-grams include all of the bigrams in addition to "The _ brown", "quick _ fox", "brown _ jumped", "fox _ over", "jumped _ the", "over _ lazy", and "the _ dog".
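
The two lists above can be reproduced with a few lines of plain Python. This is a minimal sketch for pairs only (n = 2); `skip_bigrams` is a name chosen for illustration, and each skipped token is rendered as `_` to match the notation used here.

```python
def skip_bigrams(tokens, k):
    """Return ordered token pairs with at most k tokens skipped between them."""
    grams = []
    for i in range(len(tokens)):
        for gap in range(k + 1):
            j = i + gap + 1  # index of the right-hand token
            if j < len(tokens):
                # One "_" placeholder per skipped token.
                grams.append(" ".join([tokens[i]] + ["_"] * gap + [tokens[j]]))
    return grams

tokens = "The quick brown fox jumped over the lazy dog".split()
bigrams = skip_bigrams(tokens, k=0)   # plain bigrams: no skips allowed
one_skip = skip_bigrams(tokens, k=1)  # bigrams plus pairs with one skipped token

print(len(bigrams), len(one_skip))  # 8 15
```

With `k=0` the function returns the 8 ordinary bigrams; with `k=1` it adds the 7 gapped pairs, for 15 features in total.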

### Implement a TF-IDF classifier that takes unigrams, bigrams and 1-skip-2-grams as input. Compare it with a similar classifier that uses only unigrams.

Possible dataset: [IMDB movie reviews](https://huggingface.co/datasets/imdb)

In [None]:
## Write your code here 
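
One possible starting point, offered as a sketch rather than a reference solution: scikit-learn's `TfidfVectorizer` accepts a callable `analyzer`, so the feature extraction can be swapped while the rest of the pipeline stays the same. The training sentences below are made up for illustration, the helper names `unigrams` and `uni_bi_skip` are hypothetical, logistic regression is just one reasonable choice of classifier, and loading the IMDB data is left out.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def unigrams(text):
    # Crude tokenizer (an assumption): lowercase words, apostrophes kept.
    return re.findall(r"[a-z']+", text.lower())

def uni_bi_skip(text):
    # Unigrams + bigrams + 1-skip-2-grams with exactly one skipped token.
    toks = unigrams(text)
    bigrams = [f"{a} {b}" for a, b in zip(toks, toks[1:])]
    skips = [f"{a} _ {b}" for a, b in zip(toks, toks[2:])]
    return toks + bigrams + skips

# Toy, made-up data: 1 = positive, 0 = negative.
train_texts = [
    "This movie is good.",
    "A really good film.",
    "I liked this movie.",
    "This movie is not good.",
    "This film is not very good.",
    "I did not like this movie.",
]
train_labels = [1, 1, 1, 0, 0, 0]

baseline = make_pipeline(TfidfVectorizer(analyzer=unigrams), LogisticRegression())
extended = make_pipeline(TfidfVectorizer(analyzer=uni_bi_skip), LogisticRegression())
baseline.fit(train_texts, train_labels)
extended.fit(train_texts, train_labels)

test_texts = ["This movie is not that good."]
print(baseline.predict(test_texts), extended.predict(test_texts))
```

On a real dataset, both pipelines would be fitted on the training split and compared on the held-out test split with the same metric.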

## Simplified Lesk algorithm, using WordNet.

NLTK provides an interface to access and explore WordNet.

In [None]:
import nltk 
from nltk.corpus import wordnet 
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

We can find every synset that contains a given word's base form, or lemma, using `wordnet.lemmas(word)`: each returned `Lemma` gives access to its synset through `.synset()`.

* To match, the word must be given in its base form.

Words may be converted automatically to their base forms using **`morphy(word)`**, which takes the word and an optional part-of-speech as arguments, and returns the base form of the given word with the matching PoS (or any, if none given).

### Lists of synsets and their definitions

In [None]:
w = wordnet.morphy('interest')
for l in wordnet.lemmas(w):
  print(l)
  print(l.synset().definition())

Lemma('interest.n.01.interest')
a sense of concern with and curiosity about someone or something
Lemma('sake.n.01.interest')
a reason for wanting something done
Lemma('interest.n.03.interest')
the power of attracting or holding one's attention (because it is unusual or exciting etc.)
Lemma('interest.n.04.interest')
a fixed charge for borrowing money; usually a percentage of the amount borrowed
Lemma('interest.n.05.interest')
(law) a right or legal share of something; a financial involvement with something
Lemma('interest.n.06.interest')
(usually plural) a social group whose members control some field of activity and who have common aims
Lemma('pastime.n.01.interest')
a diversion that occupies one's time and thoughts (usually pleasantly)
Lemma('interest.v.01.interest')
excite the curiosity of; engage the interest of
Lemma('concern.v.02.interest')
be on the mind of
Lemma('matter_to.v.01.interest')
be of importance or consequence


## Simplified Lesk

The Simplified Lesk algorithm chooses the sense whose dictionary definition and examples share the most words with the context of the target word.

For example, if we are interested in identifying the correct sense of the word **`interest`** in the following context:

```
While some in the United States cheered the election victory of Democrat Barack Obama, on the other side of the world, 
Chinese showed concern and interest over the state of the economy.
```

Choose the sense with the most matching words. <br>
**NB**: ignore stopwords.

In [None]:
# Words surrounding the target word in the example sentence
context = {'concern', 'over', 'state', 'economy'}
w_of_int = 'interest'

# Candidate senses of the target word
synsets = wordnet.synsets(w_of_int)
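
To make the overlap idea concrete without depending on downloaded WordNet data, here is a minimal sketch over a hand-copied sense inventory. The glosses are a subset of the definitions printed earlier in this lab; the stopword list is a small hand-written stand-in (an assumption), only definitions are used (no example sentences), and `simplified_lesk` is an illustrative name, not the NLTK function.

```python
import re

# Small hand-written stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "with", "about",
             "for", "to", "in", "on", "be", "is", "it", "that", "as"}

# Glosses copied from the WordNet output shown earlier (a subset).
SENSES = {
    "interest.n.01": "a sense of concern with and curiosity about someone or something",
    "interest.n.04": "a fixed charge for borrowing money; usually a percentage of the amount borrowed",
    "interest.n.05": "(law) a right or legal share of something; a financial involvement with something",
    "interest.v.01": "excite the curiosity of; engage the interest of",
}

def content_words(text):
    # Lowercased alphabetic tokens with stopwords removed.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(context_words, senses):
    # Pick the sense whose gloss shares the most words with the context.
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the sense listed first
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

context = {"concern", "over", "state", "economy"}
print(simplified_lesk(context, SENSES))  # ('interest.n.01', 1)
```

Here the only shared word is "concern", so the first noun sense wins with an overlap of 1. With such short glosses, ties and zero overlaps are common, which is why the tie-breaking rule matters in practice.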

In [None]:
from nltk.wsd import lesk
sent = 'While some in the United States cheered the election victory of Democrat Barack Obama, on the other side of the world, Chinese showed concern and interest over the state of the economy.'
ambiguous = 'interest'
# Restrict the candidate senses to nouns with pos='n'
print(lesk(sent.split(), ambiguous, 'n'))
# Without a POS restriction, senses of every part of speech are candidates
lesk(sent.split(), ambiguous).definition()

Synset('interest.n.06')


'excite the curiosity of; engage the interest of'

In [None]:
## TODO : Write your code here and test it with the example sentence above

## Task

```
1. Run your algorithm over the 300 Senseval example sentences.

2. Test accuracy using the Senseval-2 corpus.
  - It can be installed through NLTK.
  - The tags are a little different, but the order is the same:
      HARD1 corresponds to difficult.a.01
      HARD2 corresponds to hard.a.02
      HARD3 corresponds to hard.a.03
  - Remember to exclude stop words.

3. Does this algorithm perform better than simply selecting the most common sense for all examples?
```
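
The comparison in step 3 boils down to two accuracy numbers. The sketch below shows the bookkeeping only: the tag mapping is the one listed in the task, while the gold tags and the predictions are invented placeholders, not results from the corpus.

```python
from collections import Counter

# Senseval tag -> WordNet synset name, as listed in the task above.
SENSE_MAP = {"HARD1": "difficult.a.01", "HARD2": "hard.a.02", "HARD3": "hard.a.03"}

def accuracy(gold, predicted):
    # Fraction of examples where the prediction matches the gold sense.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def most_frequent_sense(gold):
    # The single sense that occurs most often among the gold labels.
    return Counter(gold).most_common(1)[0][0]

# Hypothetical labels, for illustration only.
gold_tags = ["HARD1", "HARD1", "HARD2", "HARD1", "HARD3"]
gold = [SENSE_MAP[tag] for tag in gold_tags]
lesk_predictions = ["difficult.a.01", "hard.a.03", "hard.a.02",
                    "difficult.a.01", "hard.a.03"]

mfs = most_frequent_sense(gold)
baseline_predictions = [mfs] * len(gold)

print(accuracy(gold, lesk_predictions))      # 0.8
print(accuracy(gold, baseline_predictions))  # 0.6
```

Note that taking the most frequent sense from the same examples it is scored on flatters the baseline slightly; estimating it from separate data would be a stricter comparison.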

### Download Senseval Data

In [None]:
from nltk.corpus import senseval as se
nltk.download('senseval')
se.fileids()

[nltk_data] Downloading package senseval to /root/nltk_data...
[nltk_data]   Package senseval is already up-to-date!


['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']

In [None]:
# sample from Senseval
se.instances()[:5]

[SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")], senses=('HARD1',)),
 SensevalInstance(word='hard-a', position=10, context=[('clever', 'NNP'), ('white', 'NNP'), ('house', 'NNP'), ('``', '``'), ('spin', 'VB'), ('doctors', 'NNS'), ("''", "''"), ('are', 'VBP'), ('having', 'VBG'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'), ('helping', 'VBG'), ('president', 'NNP'), ('bush', 'NNP'), ('explain', 'VB'), ('away', 'RB'), ('the', 'DT'), ('economic', 'JJ'), ('bashing', 'NN'), ('that', 'IN'), ('low-and', 'JJ'), ('middle-income', 'JJ'), ('workers', 'NNS'), ('are', 'VBP'), ('taking', 'VBG'), ('these', 'DT'), ('day
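
Each instance carries the tokenized, POS-tagged context plus the index of the target word, which is what a Lesk-style overlap needs. The sketch below turns the first instance shown above into a set of context words; the data is copied from that output so no download is needed, the stopword list is a tiny hand-written stand-in, `context_words` is an illustrative name, and the `isinstance` check is a defensive assumption in case an item is not a `(word, tag)` pair.

```python
# Tiny hand-written stopword list (an assumption, not NLTK's list).
STOPWORDS = {"he", "may", "all", "but", "has", "to", "him",
             "and", "that", "do", "the", "a", "of"}

# Fields of the first SensevalInstance printed above.
position = 20
context = [('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'),
           ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','),
           ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),
           ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'),
           ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'),
           ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")]

def context_words(context, position, stopwords):
    # Collect lowercase alphabetic tokens, skipping the target and stopwords.
    words = set()
    for i, item in enumerate(context):
        if i == position:
            continue  # the target word itself is not part of its context
        token = item[0] if isinstance(item, tuple) else item
        token = token.lower()
        if token.isalpha() and token not in stopwords:
            words.add(token)
    return words

print(sorted(context_words(context, position, STOPWORDS)))
```

The resulting set can be passed as the context to whatever overlap function you write for the task.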

In [None]:
w = wordnet.morphy('hard')
for l in wordnet.lemmas(w):
  print(l)
  print(l.synset().definition())

Lemma('difficult.a.01.hard')
not easy; requiring great physical or mental effort to accomplish or comprehend or endure
Lemma('hard.a.02.hard')
dispassionate; 
Lemma('hard.a.03.hard')
resisting weight or pressure
Lemma('hard.s.04.hard')
very strong or vigorous
Lemma('arduous.s.01.hard')
characterized by effort to the point of exhaustion; especially physical effort
Lemma('unvoiced.a.01.hard')
produced without vibration of the vocal cords
Lemma('hard.a.07.hard')
(of light) transmitted directly from a pointed light source
Lemma('hard.a.08.hard')
(of speech sounds); produced with the back of the tongue raised toward or touching the velum
Lemma('intemperate.s.03.hard')
given to excessive indulgence of bodily appetites especially for intoxicating liquors
Lemma('hard.s.10.hard')
being distilled rather than fermented; having a high alcoholic content
Lemma('hard.s.11.hard')
unfortunate or hard to bear
Lemma('hard.s.12.hard')
dried out
Lemma('hard.r.01.hard')
with effort or force or vigor
Lemma('

In [None]:
w = wordnet.morphy('serve')
for l in wordnet.lemmas(w):
  print(l)
  print(l.synset().definition())

Lemma('serve.n.01.serve')
(sports) a stroke that puts the ball in play
Lemma('serve.v.01.serve')
serve a purpose, role, or function
Lemma('serve.v.02.serve')
do duty or hold offices; serve in a specific function
Lemma('serve.v.03.serve')
contribute or conduce to
Lemma('service.v.01.serve')
be used by; as of a utility
Lemma('serve.v.05.serve')
help to some food; help with food or drink
Lemma('serve.v.06.serve')
provide (usually but not necessarily food)
Lemma('serve.v.07.serve')
devote (part of) one's life or efforts to, as of countries, institutions, or ideas
Lemma('serve.v.08.serve')
promote, benefit, or be useful or beneficial to
Lemma('serve.v.09.serve')
spend time in prison or in a labor camp
Lemma('serve.v.10.serve')
work for or be a servant to
Lemma('serve.v.11.serve')
deliver a warrant or summons to someone
Lemma('suffice.v.01.serve')
be sufficient; be adequate, either in quality or quantity
Lemma('serve.v.13.serve')
do military service
Lemma('serve.v.14.serve')
mate with
Lemma(

In [None]:
## Write your code here 

## References

1. [Creating text features with bag-of-words, n-grams, parts-of-speach and more](https://uc-r.github.io/creating-text-features)
1. [Speech and Language Processing. Daniel Jurafsky & James H. Martin.](https://web.stanford.edu/~jurafsky/slp3/18.pdf)