# Word-Sense Disambiguation - A Dandelion by Any Other Name Would Reek Just as Candy

## Is there an art to replacing words?

Many problems in Natural Language Processing look like they should be trivial to solve. Take the case of finding synonyms or antonyms for a word in a sentence. All you have to do is look up the word in a thesaurus and pick a suitable replacement. Dictionaries like Merriam-Webster or Oxford are essentially databases. Surely we can just do a lookup, right? That's O(1) complexity! The title of this post, "A Dandelion by any other name would reek just as candy", is a play on Shakespeare's words "A rose by any other name would smell as sweet". I've replaced three words manually.

**Rose -> Dandelion**: These are not true synonyms, but they are both flowers. A human might find them to be a suitable replacement, but this is NOT what we want to solve. In fact, this is a replacement that a language model might make since the **sense** of the sentence is preserved.

**Sweet -> Candy**: As a noun, the word sweet can refer to a cookie or any other sugary confection, so candy is a synonym, but it doesn't preserve the sense of the sentence. The original describes the taste or smell of something sugary. The word **pleasant** would maintain the meaning of the original sentence, but instead we chose a synonym for the wrong meaning.

**Smell -> Reek**: This is a correct synonym change for the sentence, although it does change the connotation of the sentence. The sense of the word is preserved, but reek implies that the smell is bad! In looking at a thesaurus, the word **smack** would be more appropriate, but it's also more colloquial than the simple word **smell**.
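
To make the "just do a lookup" idea concrete, here is a minimal sketch of a naive replacer. The tiny thesaurus and the `naive_replace` function are made up for illustration and are not taken from any real dictionary or library. Each word maps to a flat list of replacements with no record of which meaning each replacement belongs to.

```python
# Hypothetical toy thesaurus: a flat list of replacements per word,
# with no notion of which sense each replacement belongs to.
TOY_THESAURUS = {
    "smell": ["reek", "smack"],
    "sweet": ["candy", "pleasant"],
}


def naive_replace(sentence, thesaurus):
    """Swap every known word for the first replacement listed for it."""
    replaced = []
    for word in sentence.split():
        options = thesaurus.get(word.lower())
        # Unknown words are kept as they are.
        replaced.append(options[0] if options else word)
    return " ".join(replaced)


original = "a rose by any other name would smell as sweet"
print(naive_replace(original, TOY_THESAURUS))
```

The lookup itself is fast, but nothing in it knows that "sweet" is used as an adjective here, so the first entry wins whether or not it fits.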

## First Attempt: Using a Dictionary

The first thought would be to use a dictionary database that supports synonyms. The most famous one is WordNet:
    
*George A. Miller (1995). WordNet: A Lexical Database for English. Communications of the ACM Vol. 38, No. 11: 39-41.*

*Princeton University "About WordNet." https://wordnet.princeton.edu. Princeton University. 2010.*

This database ships with the NLTK package, which provides APIs to map between words AND senses. WordNet is organized around what are called synsets (synonym sets). Each synset represents one sense, or meaning, and a word belongs to one synset for every meaning it can have. Words that share the same synset are synonyms.

This makes our algorithm simple:
1. Look up the word in the dictionary
2. Get all senses of the word
3. For the **correct** sense, select all words that share the same sense
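
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `SENSES` dictionary is a hand-copied subset of the WordNet verb senses for "smell" that appear later in this post, and `senses_of`, `synonyms` and `pick_sense` are hypothetical names, not part of NLTK.

```python
# Hand-copied subset of the WordNet verb senses for "smell" shown later in
# this post. The dict layout (sense name -> words) is an assumption.
SENSES = {
    "smell.v.01": ["smell"],
    "smell.v.02": ["smell"],
    "smell.v.03": ["smell"],
    "smack.v.02": ["smack", "reek", "smell"],
}


def senses_of(word, inventory):
    """Steps 1 and 2: find every sense whose word list contains `word`."""
    return [sense for sense, words in inventory.items() if word in words]


def synonyms(word, inventory, pick_sense):
    """Step 3: let `pick_sense` choose a sense, then list its other words."""
    candidates = senses_of(word, inventory)
    chosen = pick_sense(candidates)
    return [other for other in inventory[chosen] if other != word]


# Placeholder chooser: a human hard-codes the "correct" sense here.
print(synonyms("smell", SENSES, lambda candidates: "smack.v.02"))
```

Everything interesting is hidden inside `pick_sense`. In this sketch it is a hard-coded answer, which is exactly the part a real system has to work out for itself.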

We'll soon see that the third step is much more complicated than it looks and has a whole field of research behind it. Anyway, without further ado:



### 1. Look up the word in the dictionary

In [1]:
from nltk.corpus import wordnet
print(wordnet.synsets("smell"))

[Synset('smell.n.01'), Synset('olfactory_property.n.01'), Synset('spirit.n.02'), Synset('smell.n.04'), Synset('smell.n.05'), Synset('smell.v.01'), Synset('smell.v.02'), Synset('smell.v.03'), Synset('smack.v.02'), Synset('smell.v.05')]


What is this? Did we already finish the second step? This is a list of senses. You might expect that we would get a list of definitions, but the distinction here is that the definition is tied to a meaning, and not to the word itself. This seems obvious in hindsight. There are multiple meanings for the word. Some of them might be very specific to the word, such that there might not be another word that shares the same sense.

The senses are named in the format "sense_name.part_of_speech.number". 'smell.n.01' is the first meaning for the noun "smell". Next, let's get the definition of each of these.
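
As a small illustration of that naming format, the sketch below splits a sense name into its three parts. The `SenseName` type and `parse_sense_name` function are hypothetical helpers written for this post. With NLTK itself you would more likely call `name()` and `pos()` on the synset object instead of parsing strings.

```python
from typing import NamedTuple


class SenseName(NamedTuple):
    lemma: str
    pos: str
    number: int


def parse_sense_name(name):
    """Split a name such as 'olfactory_property.n.01' into its three parts."""
    # Split from the right so a lemma that contains a dot stays in one piece.
    lemma, pos, number = name.rsplit(".", 2)
    return SenseName(lemma, pos, int(number))


print(parse_sense_name("smell.n.01"))
print(parse_sense_name("smack.v.02"))
```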

### 2. Get all senses of the word

In [2]:
smell_senses = wordnet.synsets("smell")
for sense in smell_senses:
    print("{sense}: {definition}".format(sense=sense, definition=sense.definition()))

Synset('smell.n.01'): the sensation that results when olfactory receptors in the nose are stimulated by particular chemicals in gaseous form
Synset('olfactory_property.n.01'): any property detected by the olfactory system
Synset('spirit.n.02'): the general atmosphere of a place or situation and the effect that it has on people
Synset('smell.n.04'): the faculty that enables us to distinguish scents
Synset('smell.n.05'): the act of perceiving the odor of something
Synset('smell.v.01'): inhale the odor of; perceive by the olfactory sense
Synset('smell.v.02'): emit an odor
Synset('smell.v.03'): smell bad
Synset('smack.v.02'): have an element suggestive (of something)
Synset('smell.v.05'): become aware of not through the senses but instinctively


### 3. For the **correct** sense, select all words that share the same sense

We now have all of the senses and their definitions, but do we know the correct meaning? 'smell.v.01' looks promising. Let's look at that first.

In [3]:
words = wordnet.synset('smell.v.01').lemmas()
for word in words:
    print(word.name())

smell


We got back the word we started with! We'll have to try another. Let's just print out all of the synonyms for each sense to be safe.

In [4]:
for sense in smell_senses:
    print("{sense}: {definition}".format(sense=sense, definition=sense.definition()))
    for word in sense.lemmas():
        print("\t{}".format(word.name()))

Synset('smell.n.01'): the sensation that results when olfactory receptors in the nose are stimulated by particular chemicals in gaseous form
	smell
	odor
	odour
	olfactory_sensation
	olfactory_perception
Synset('olfactory_property.n.01'): any property detected by the olfactory system
	olfactory_property
	smell
	aroma
	odor
	odour
	scent
Synset('spirit.n.02'): the general atmosphere of a place or situation and the effect that it has on people
	spirit
	tone
	feel
	feeling
	flavor
	flavour
	look
	smell
Synset('smell.n.04'): the faculty that enables us to distinguish scents
	smell
	sense_of_smell
	olfaction
	olfactory_modality
Synset('smell.n.05'): the act of perceiving the odor of something
	smell
	smelling
Synset('smell.v.01'): inhale the odor of; perceive by the olfactory sense
	smell
Synset('smell.v.02'): emit an odor
	smell
Synset('smell.v.03'): smell bad
	smell
Synset('smack.v.02'): have an element suggestive (of something)
	smack
	reek
	smell
Synset('smell.v.05'): become aware of not through the senses but instinctively

It seems like 'smack.v.02' is our most promising result, and it matches what I found in the thesaurus (smack and reek). We could have ruled out half of the candidates by removing the nouns, but how would we have chosen among the remaining verb senses? **'smack.v.02': have an element suggestive (of something)** is our correct answer, but going by the definitions alone, why not use **'smell.v.02': emit an odor** instead? That definition seems more promising.
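
Removing the nouns is easy to sketch. The list below is copied from the sense names printed earlier, and `keep_pos` is a hypothetical helper. NLTK can apply the same filter at lookup time with `wordnet.synsets("smell", pos=wordnet.VERB)`.

```python
# Sense names copied from the WordNet output for "smell" earlier in this post.
SENSE_NAMES = [
    "smell.n.01", "olfactory_property.n.01", "spirit.n.02", "smell.n.04",
    "smell.n.05", "smell.v.01", "smell.v.02", "smell.v.03", "smack.v.02",
    "smell.v.05",
]


def keep_pos(names, pos):
    """Keep only the sense names whose part-of-speech tag equals `pos`."""
    return [name for name in names if name.rsplit(".", 2)[1] == pos]


verb_senses = keep_pos(SENSE_NAMES, "v")
print(len(SENSE_NAMES), "->", len(verb_senses))
print(verb_senses)
```

Filtering by part of speech halves the list, but five verb senses remain and nothing in the names says which one fits the sentence.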

My choice of sense was initially based on the definition. But who decides on the exact wording? If I had used a different dictionary, I might have come to a different conclusion, so a dictionary-agnostic method is preferable. This is the crux of the matter: the meaning, or sense, of the word is ambiguous. How should I know which sense to use? Is there a way to validate that I selected the correct one?

This problem is known as **Word-Sense Disambiguation**. It is hard, but much progress has been made in Natural Language Processing over the past few years. I plan to look at existing research in the field, starting with pre-machine-learning algorithms and building up to state-of-the-art research.