# WWC - Semantic Similarity

In this session we will use NLTK synsets to find the semantic similarity between two words.

**WordNet** is a lexical database for the English language. In other words, it is a dictionary designed specifically for natural language processing.

NLTK comes with a simple interface for looking up words in **WordNet**. What you get is a list of **Synset** instances, which are groupings of synonymous words that express the same concept. Many words have only one **synset**, but some have several.

Let's explore some of the properties and methods of a synset.

In [1]:
from nltk.corpus import wordnet

Each synset in the list has a number of attributes you can use to learn more about it.
The **name()** method will give you a unique name for the synset, which you can use to get the synset directly.

In [2]:
syn = wordnet.synsets('cookbook')[0]

In [5]:
syn.name()

'cookbook.n.01'

In [6]:
syn.definition()

'a book of recipes and cooking directions'

### Getting examples

A **lemma** is the headword of a dictionary entry, that is, the canonical form of a word. Synsets can also provide examples of how to use a word in a sentence.

In [9]:
wordnet.synsets('cooking')[0].examples()

['cooking can be a great art',
 'people are needed who have experience in cookery',
 'he left the preparation of meals to his wife']

In [11]:
synonyms = []
antonyms = []

In [12]:
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        print("l: ",l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

l:  Lemma('good.n.01.good')
l:  Lemma('good.n.02.good')
l:  Lemma('good.n.02.goodness')
l:  Lemma('good.n.03.good')
l:  Lemma('good.n.03.goodness')
l:  Lemma('commodity.n.01.commodity')
l:  Lemma('commodity.n.01.trade_good')
l:  Lemma('commodity.n.01.good')
l:  Lemma('good.a.01.good')
l:  Lemma('full.s.06.full')
l:  Lemma('full.s.06.good')
l:  Lemma('good.a.03.good')
l:  Lemma('estimable.s.02.estimable')
l:  Lemma('estimable.s.02.good')
l:  Lemma('estimable.s.02.honorable')
l:  Lemma('estimable.s.02.respectable')
l:  Lemma('beneficial.s.01.beneficial')
l:  Lemma('beneficial.s.01.good')
l:  Lemma('good.s.06.good')
l:  Lemma('good.s.07.good')
l:  Lemma('good.s.07.just')
l:  Lemma('good.s.07.upright')
l:  Lemma('adept.s.01.adept')
l:  Lemma('adept.s.01.expert')
l:  Lemma('adept.s.01.good')
l:  Lemma('adept.s.01.practiced')
l:  Lemma('adept.s.01.proficient')
l:  Lemma('adept.s.01.skillful')
l:  Lemma('adept.s.01.skilful')
l:  Lemma('good.s.09.good')
l:  Lemma('dear.s.02.dear')
l:  Lemma('d

In [13]:
print(set(synonyms))

{'thoroughly', 'upright', 'dear', 'beneficial', 'sound', 'expert', 'honest', 'just', 'goodness', 'unspoiled', 'in_effect', 'serious', 'secure', 'salutary', 'unspoilt', 'soundly', 'dependable', 'good', 'trade_good', 'commodity', 'full', 'skillful', 'well', 'near', 'right', 'honorable', 'proficient', 'skilful', 'safe', 'in_force', 'estimable', 'practiced', 'effective', 'undecomposed', 'adept', 'ripe', 'respectable'}


In [14]:
print(set(antonyms))

{'evilness', 'evil', 'bad', 'badness', 'ill'}


## Wu and Palmer Measure of Similarity

To find semantic similarity we will use the Wu & Palmer measure (wup), which calculates similarity from the depths of the two concepts in the WordNet taxonomy, along with the depth of their least common subsumer (LCS), the deepest ancestor they share. The formula is score = 2*depth(lcs) / (depth(s1) + depth(s2)). This means that 0 < score <= 1. The score can never be zero because the depth of the LCS is never zero (the depth of the root of a taxonomy is one). The score is one if the two input concepts are the same.
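The formula can be checked by hand with a few lines of Python. This is only a sketch of the arithmetic: the function name `wu_palmer` is made up for illustration, and the depths (10 for the LCS, 11 for each concept) are assumed values chosen because they reproduce the ship/boat score printed later in this notebook.

```python
def wu_palmer(depth_lcs: int, depth_s1: int, depth_s2: int) -> float:
    """Wu-Palmer score from node depths, with the taxonomy root counted as depth 1."""
    return 2 * depth_lcs / (depth_s1 + depth_s2)

# Assumed depths (illustrative): LCS at depth 10, both concepts at depth 11.
score = wu_palmer(10, 11, 11)
print(score)

# Identical concepts: the LCS is the concept itself, so the score is exactly 1.
same = wu_palmer(11, 11, 11)
print(same)
```

Because the LCS can never be deeper than either concept, the numerator never exceeds the denominator, which is why the score stays at or below one.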

### How it works

**wup_similarity** is short for Wu-Palmer Similarity, a scoring method based on how similar the word senses are and where the synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two synsets and their common hypernym.
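To make the idea concrete without loading WordNet, here is a minimal sketch on a tiny hand-written taxonomy. The `PARENT` table and all helper names are hypothetical and do not reflect the real WordNet hierarchy; the point is only to show how depths and the lowest common subsumer combine into a score.

```python
# Hypothetical toy taxonomy: child -> parent (hypernym). Not real WordNet data.
PARENT = {
    "vehicle": "entity",
    "vessel": "vehicle",
    "car": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth of a node, with the root counted as 1."""
    return len(path_to_root(node))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    return None

def wup(a, b):
    """Wu-Palmer score on the toy taxonomy (assumes a single shared root)."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("ship", "boat"))  # LCS is "vessel": 2*3 / (4+4)
print(wup("ship", "car"))   # LCS is "vehicle": 2*2 / (4+3)
```

In this toy tree, *ship* and *boat* score higher than *ship* and *car* because their shared ancestor sits deeper in the hierarchy.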


In [15]:
w1 = wordnet.synset("ship.n.01")

In [16]:
w2 = wordnet.synset("boat.n.01")

In [18]:
print(w1.wup_similarity(w2))

0.9090909090909091


In [19]:
w3 = wordnet.synset("ship.n.01")

In [20]:
w4 = wordnet.synset("car.n.01")

In [22]:
print(w3.wup_similarity(w4))

0.6956521739130435


In [23]:
w5 = wordnet.synset("ship.n.01")

In [24]:
w6 = wordnet.synset("cactus.n.01")

In [25]:
print(w5.wup_similarity(w6))

0.38095238095238093


In [26]:
w7 = wordnet.synset("cat.n.01")

In [27]:
print(w5.wup_similarity(w7))

0.32


In [51]:
ref = w1.hypernyms()[0]  # Direct hypernym of "ship" (a shared ancestor, not the taxonomy root)
ref

Synset('vessel.n.02')

In [49]:
w1_root = w1.shortest_path_distance(ref) # Distance between "ship" and "vessel"
w1_root

1

In [50]:
w2_root = w2.shortest_path_distance(ref) # Distance between "boat" and "vessel"
w2_root

1

In [52]:
w1.shortest_path_distance(w2)

2

### CONCLUSION

So *ship* and *boat* are similar since they are each only one step away from their common hypernym, *vessel*, and therefore only two steps away from each other.
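The step counting above can be sketched as a breadth-first search over hypernym links. The edge list below is a hypothetical miniature, not real WordNet data, and the search treats links as undirected, which is a simplification that holds for a plain tree like this one.

```python
from collections import deque

# Hypothetical hypernym edges as (child, parent) pairs. Not real WordNet data.
EDGES = [("ship", "vessel"), ("boat", "vessel"), ("vessel", "vehicle"), ("car", "vehicle")]

def build_graph(edges):
    """Build an undirected adjacency map from (child, parent) pairs."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph

def shortest_path_distance(graph, start, goal):
    """Number of edges on the shortest path, or None if goal is unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

graph = build_graph(EDGES)
print(shortest_path_distance(graph, "ship", "vessel"))  # one step up
print(shortest_path_distance(graph, "ship", "boat"))    # up to "vessel", then down
print(shortest_path_distance(graph, "ship", "car"))     # via "vessel" and "vehicle"
```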

### Exercise

Calculate shortest path distance between ship and cactus.



### USES

You can use WUP similarity to rewrite or replace words, correct words, or find out whether two words are alike.

In [53]:
w1.shortest_path_distance(w6) # distance between "ship" and "cactus"

13

Verbs can also be compared.

In [54]:
cook = wordnet.synset("cook.v.01")

In [55]:
bake = wordnet.synset("bake.v.02")

In [56]:
cook.wup_similarity(bake)

0.6666666666666666

### Hypernyms

Synsets are organized in a kind of inheritance tree. More abstract terms are known as hypernyms and more specific terms are hyponyms. This tree can be traced all the way up to the root hypernym.

**Hypernyms** provide a way to categorize and group words based on their similarity to each other. The similarity functions shown above calculate similarity based on the distance between two words in the hypernym tree.

In [34]:
syn = wordnet.synsets("cookbook")[0]

In [35]:
syn

Synset('cookbook.n.01')

In [36]:
syn.hypernyms()    # Get cookbook hypernym

[Synset('reference_book.n.01')]

In [37]:
syn.hypernyms()[0].hyponyms() # Get all the hyponyms comprised by this hypernym

[Synset('annual.n.02'),
 Synset('atlas.n.02'),
 Synset('cookbook.n.01'),
 Synset('directory.n.01'),
 Synset('encyclopedia.n.01'),
 Synset('handbook.n.01'),
 Synset('instruction_book.n.01'),
 Synset('source_book.n.01'),
 Synset('wordbook.n.01')]

In [38]:
syn.root_hypernyms()

[Synset('entity.n.01')]

1. "Reference book" is the hypernym of "cookbook"
2. "Cookbook" is one hyponym of "Reference book"
3. All kinds of "books" have a common root "Entity", which is an abstract term
4. You can trace the path from the root to cookbook using the *hypernym_paths()* method

In [40]:
syn.hypernym_paths() # This is tree branch for "cookbook"

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('creation.n.02'),
  Synset('product.n.02'),
  Synset('work.n.02'),
  Synset('publication.n.01'),
  Synset('book.n.01'),
  Synset('reference_book.n.01'),
  Synset('cookbook.n.01')]]

This method returns a list of lists, where each list starts at the root hypernym and ends with the original Synset. 
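The reason the result is a list of lists is that a synset can have more than one hypernym, and each one opens a separate route to the root. A small sketch on a hypothetical graph shows the idea; the `HYPERNYMS` table and the node names are invented for illustration and are not WordNet data.

```python
# Hypothetical hypernym graph: node -> list of direct hypernyms. Not real WordNet data.
HYPERNYMS = {
    "entity": [],
    "object": ["entity"],
    "book": ["object"],
    "tool": ["object"],
    "cookbook": ["book"],
    "manual": ["book", "tool"],  # two hypernyms, so two paths to the root
}

def hypernym_paths(node):
    """Return every root-to-node path as a list of lists."""
    parents = HYPERNYMS[node]
    if not parents:
        return [[node]]  # a root is its own single path
    return [path + [node] for parent in parents for path in hypernym_paths(parent)]

for path in hypernym_paths("cookbook"):
    print(" -> ".join(path))

for path in hypernym_paths("manual"):
    print(" -> ".join(path))
```

With a single hypernym chain, as for "cookbook" here, the outer list holds exactly one path.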

### Part-of-speech (POS)

Synsets also have a simplified part-of-speech tag:

1. Noun = n
2. Adjective = a
3. Adverb = r
4. Verb = v

These POS tags can be used for looking up specific *synsets* for a word. For example, the word **great** can be used as a noun or an adjective. In WordNet, **great** has one noun synset and six adjective synsets.

In [42]:
syn.pos()

'n'

In [43]:
len(wordnet.synsets('great'))

7

In [44]:
len(wordnet.synsets('great', pos ='n')) # Finding noun synsets

1

In [45]:
len(wordnet.synsets('great', pos = 'a')) # Finding adjective synsets

6
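The POS tag is also visible in the synset name itself, which follows the pattern `lemma.pos.number`. The sketch below groups a few names by tag using plain string handling; the names are copied from the lemma output for "good" earlier in this notebook, and `parse_synset_name` is a hypothetical helper, not part of NLTK. Note that some names carry the tag `s`, which WordNet uses for satellite adjectives.

```python
from collections import Counter

# Synset names copied from the lemma output for "good" earlier in this notebook.
names = [
    "good.n.01", "good.n.02", "good.n.03", "commodity.n.01",
    "good.a.01", "full.s.06", "good.a.03", "estimable.s.02",
]

def parse_synset_name(name):
    """Split 'lemma.pos.nn' into (lemma, pos, sense_number)."""
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)

# Count how many of the names belong to each part of speech.
pos_counts = Counter(parse_synset_name(n)[1] for n in names)
print(pos_counts)

# Keep only the noun synset names.
nouns = [n for n in names if parse_synset_name(n)[1] == "n"]
print(nouns)
```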

### CAVEAT

While most nouns can be traced up to *entity*, thereby providing a basis for similarity, many verbs do not share common hypernyms, making WordNet unable to calculate similarity.

For example, if you were to use the synset *bake.v.01* here instead of *bake.v.02*, the return value would be *None*. This is because the root hypernyms of the two synsets are different, with no overlapping paths. For the same reason, you cannot calculate similarity between words with different parts of speech.
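Since a similarity call may come back as *None*, code that consumes the score should handle that case explicitly. The sketch below reproduces the situation on a hypothetical forest with two unrelated roots; the node names and helper functions are invented for illustration and do not correspond to WordNet verbs.

```python
# Hypothetical forest with two separate roots ("root_a" and "root_b"). Not WordNet data.
PARENT = {"a1": "root_a", "a2": "a1", "b1": "root_b"}

def ancestors(node):
    """Return [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def wup_or_none(x, y):
    """Wu-Palmer score, or None when the two nodes share no ancestor."""
    chain_x, chain_y = ancestors(x), ancestors(y)
    common = [n for n in chain_x if n in chain_y]
    if not common:
        return None
    lcs_depth = len(ancestors(common[0]))  # common[0] is the deepest shared node
    return 2 * lcs_depth / (len(chain_x) + len(chain_y))

def similarity_or_default(x, y, default=0.0):
    """Replace a missing score with a caller-chosen default."""
    score = wup_or_none(x, y)
    return default if score is None else score

print(wup_or_none("a2", "a1"))            # same tree, so a score is returned
print(wup_or_none("a2", "b1"))            # different roots, so None
print(similarity_or_default("a2", "b1"))  # falls back to the default
```

Whether to fall back to zero, skip the pair, or raise an error is a design choice that depends on how the score is used downstream.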

### Path and LCH similarity

Two other similarity measures are path similarity and **Leacock-Chodorow (LCH)** similarity.


In [58]:
w1.path_similarity(w2) # Path similarity between "ship" and "boat"

0.3333333333333333

In [59]:
w1.lch_similarity(w2) # LCH similarity between "ship" and "boat"

2.538973871058276
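Both numbers can be reproduced from the shortest path distance of 2 between "ship" and "boat". The sketch below assumes that path similarity is 1 / (distance + 1) and that LCH is -log((distance + 1) / (2 * D)), where D is the maximum depth of the taxonomy. The value D = 19 is an assumption chosen because it matches the output above; the function names are illustrative, not NLTK's.

```python
import math

def path_similarity(distance):
    """Path similarity from a shortest path distance counted in edges."""
    return 1 / (distance + 1)

def lch_similarity(distance, max_depth):
    """Leacock-Chodorow similarity; max_depth is the assumed taxonomy depth D."""
    return -math.log((distance + 1) / (2 * max_depth))

path_score = path_similarity(2)     # "ship" and "boat" are two steps apart
lch_score = lch_similarity(2, 19)   # D = 19 is an assumed value, see the text above

print(path_score)
print(lch_score)
```

Unlike path and Wu-Palmer similarity, the LCH score is not bounded by one, so scores from different measures should not be compared directly.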