# Word Sense Disambiguation (WSD) with NLTK

Example Problem

Consider the word "bank":
  - Sense 1: A financial institution (e.g., "I deposited money at the bank.")
  - Sense 2: The side of a river (e.g., "We walked along the river bank.")

  

In [None]:
%pip install nltk


In [10]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# Understand the Structure of WordNet

from nltk.corpus import wordnet as wn

synsets = wn.synsets('bank')
for synset in synsets:
    print(synset, synset.definition())


Synset('bank.n.01') sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') a long ridge or pile
Synset('bank.n.04') an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') a building in which the business of banking transacted
Synset('bank.n.10') a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
Synset('bank.v.01') tip laterally
[output truncated]

In [12]:
# Preprocess the Text

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sentence = "He withdrew money from the bank."
tokens = word_tokenize(sentence)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)  # 'He' is removed too: the stopword check lowercases each token; punctuation is kept


['withdrew', 'money', 'bank', '.']


In [13]:
# Lesk Algorithm for WSD

# The Lesk algorithm is a simple dictionary-based baseline for WSD.
# It picks the sense whose dictionary definition (gloss) shares the most
# words with the context in which the ambiguous word occurs.


from nltk.wsd import lesk

context_sentence = "He deposited money in the bank."
ambiguous_word = "bank"
best_sense = lesk(word_tokenize(context_sentence), ambiguous_word)

# The intended sense is the financial institution, but plain Lesk can return
# a different synset (compare with the output below).
print(best_sense, best_sense.definition())


Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home


Note:

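-  The Lesk algorithm compares the definition (gloss) of each sense with the words in the context and chooses the sense with the highest overlap. In the run above it returns savings_bank.n.02 rather than the financial-institution sense. This is consistent with a plain word-overlap count: the savings_bank.n.02 gloss shares "in", "the", and "money" with the context, while the depository_financial_institution.n.01 gloss shares only "the" and "money". Function words count toward the overlap unless they are filtered out first.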
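
The overlap count described in the note can be sketched in plain Python, without NLTK. This is a simplified illustration, not NLTK's implementation: the two glosses are copied from the WordNet listing earlier in the notebook, the context tokens are written out by hand, and the helper names `gloss_overlap` and `simple_lesk` are hypothetical. It assumes the overlap is counted on raw, whitespace-split gloss words with no stopword filtering.

```python
# Glosses copied from the WordNet listing above; the keys are only labels.
GLOSSES = {
    "depository_financial_institution.n.01": (
        "a financial institution that accepts deposits and channels "
        "the money into lending activities"
    ),
    "savings_bank.n.02": (
        "a container (usually with a slot in the top) for keeping money at home"
    ),
}


def gloss_overlap(context_tokens, gloss):
    """Return the words shared by the context and a whitespace-split gloss."""
    return set(context_tokens) & set(gloss.split())


def simple_lesk(context_tokens, glosses):
    """Pick the sense label whose gloss shares the most words with the context."""
    return max(
        glosses,
        key=lambda sense: len(gloss_overlap(context_tokens, glosses[sense])),
    )


# Hand-written tokens for "He deposited money in the bank."
context = ["He", "deposited", "money", "in", "the", "bank", "."]

for sense, gloss in GLOSSES.items():
    print(sense, sorted(gloss_overlap(context, gloss)))

print("chosen:", simple_lesk(context, GLOSSES))
```

In this sketch the piggy-bank gloss wins only because it contains "in" and "the". Dropping stopwords from the context before counting leaves each gloss with the single shared word "money", so the two senses tie; plain word overlap cannot link "deposited" to "deposits".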

Advanced WSD Techniques
  - Supervised Learning: Train a machine learning model on labeled data where the senses of the words are already known.

  - Unsupervised Learning: Use clustering algorithms to group similar contexts, assuming each cluster corresponds to a different sense.

  - Contextualized Word Embeddings: Models such as BERT produce a different vector for each occurrence of a word depending on its sentence, so occurrences can be matched to senses by their context rather than by dictionary overlap.
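
As an illustration of the supervised approach from the list above, here is a minimal sketch of a Naive Bayes sense classifier in plain Python. Everything in it is an assumption made for illustration: the four training examples are invented toy data, the sense labels "financial" and "river" are hypothetical, and the functions `train_naive_bayes` and `predict` are not part of NLTK. A real system would train on a sense-annotated corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (context words, sense label for "bank").
TRAIN = [
    ("deposited money account loan".split(), "financial"),
    ("withdrew cash teller account".split(), "financial"),
    ("river water fishing shore".split(), "river"),
    ("walked along river grass".split(), "river"),
]


def train_naive_bayes(examples):
    """Count how often each sense and each (sense, word) pair occurs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def predict(model, words):
    """Return the sense with the highest log-probability for the context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior of the sense
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in words:
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


model = train_naive_bayes(TRAIN)
print(predict(model, "he deposited cash".split()))
print(predict(model, "fishing by the river".split()))
```

With this toy data the first context is labeled "financial" and the second "river". The quality of such a classifier depends entirely on how much labeled data is available for each ambiguous word.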

In [14]:
nltk.download('senseval')

[nltk_data] Downloading package senseval to /root/nltk_data...
[nltk_data]   Package senseval is already up-to-date!


True