## WordNet

First, let's take a closer look at the `WordNet` corpus reader. <font color='blue'>**WordNet**</font> is a lexical database of English that was created at Princeton University and is distributed as one of the NLTK corpora.  

In [1]:
from nltk.corpus import wordnet as wn # conventional alias

You can look up a word with `wn.synsets()`. This function has an optional **`pos` argument** that restricts the results to one part of speech:

In [2]:
wn.synsets('sense', pos=wn.VERB) # an ordered list

[Synset('feel.v.03'),
 Synset('sense.v.02'),
 Synset('smell.v.05'),
 Synset('sense.v.04')]

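Alternatively, since a synset is identified by a three-part name of the form `word.pos.nn`, you can fetch it directly with `wn.synset()`.  
**Note**: `wn.synset()` (without the trailing 's') returns a single synset, whereas `wn.synsets()` returns a list.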

In [3]:
wn.synset('sense.v.01') # nn is the sense number: the first verb sense of 'sense', printed under its canonical name

Synset('feel.v.03')
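
The three-part `word.pos.nn` name can also be taken apart by hand. The helper below is a minimal sketch and not part of NLTK: `SynsetName` and `parse_synset_name` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    """Hypothetical container for the parts of a 'word.pos.nn' name."""
    lemma: str          # head word, e.g. 'sense'
    pos: str            # part-of-speech tag, e.g. 'v' for verb
    sense_number: int   # the nn part, e.g. 1


def parse_synset_name(name: str) -> SynsetName:
    # Split from the right so that any dots inside the word itself survive.
    lemma, pos, nn = name.rsplit('.', 2)
    return SynsetName(lemma, pos, int(nn))


print(parse_synset_name('sense.v.01'))
```

Splitting from the right is a deliberate choice here: only the last two dots are treated as separators.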

### Basic methods in WordNet
`Synset` objects have various methods:
1. `Synset.name()`: return the synset name in dot notation
2. `Synset.lemmas()`: return a **list** of `Lemma` objects
3. `Synset.lemma_names()`: return a **list** of the lemma names as strings
4. `Synset.examples()`: return a **list** of example sentences
5. `Synset.definition()`: return the definition as a string

In [4]:
from nltk.corpus import wordnet as wn
syn = wn.synsets("program") # a list of Synset objects
print('Synset object:', syn[0].name()) # name of the first synset
print('str name:', syn[0].lemmas()[0].name()) # name of its first lemma, a plain string
print('Def:', syn[0].definition()) # definition
print('Example:', syn[0].examples()) # list of example sentences

Synset object: plan.n.01
str name: plan
Def: a series of steps to be carried out or goals to be accomplished
Example: ['they drew up a six-step plan', 'they discussed plans for a new bond issue']


Another way to get just the lemma names of a `Synset` object:

In [5]:
syn[0].lemma_names()

['plan', 'program', 'programme']

In [6]:
syn[0].lemmas() # Lemma objects

[Lemma('plan.n.01.plan'),
 Lemma('plan.n.01.program'),
 Lemma('plan.n.01.programme')]

### Synonyms and Antonyms in WordNet
With <font color='blue'>**WordNet**</font>, you can look up a word's **synonyms**, **antonyms** and **definitions**, and even example sentences that show the word in context.

In [7]:
word = 'good'
syns = []
antonyms = []
for syn in wn.synsets(word):
    for l in syn.lemmas():
        syns.append(l.name()) # keep only the string name
        if l.antonyms(): # only some lemmas have antonyms
            antonyms.append(l.antonyms()[0].name())
print(set(syns),'\n------')
print(set(antonyms))

{'salutary', 'goodness', 'honest', 'commodity', 'beneficial', 'trade_good', 'dependable', 'thoroughly', 'unspoiled', 'proficient', 'full', 'good', 'soundly', 'unspoilt', 'sound', 'right', 'practiced', 'near', 'serious', 'dear', 'ripe', 'expert', 'safe', 'honorable', 'adept', 'effective', 'in_force', 'undecomposed', 'estimable', 'skillful', 'secure', 'skilful', 'in_effect', 'just', 'respectable', 'well', 'upright'} 
------
{'evil', 'bad', 'evilness', 'ill', 'badness'}


### Multilingual WordNet
The WordNet reader also gives access to the **Open Multilingual WordNet**, using ISO-639 language codes:

In [8]:
print(sorted(wn.langs())) # supported languages

['als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eng', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm']


In [9]:
# 'sense.v.01' resolves to Synset('feel.v.03')
wn.synset('sense.v.01').lemma_names('ita') # lemma names in Italian

['avvertire', 'intuire', 'percepire', 'sentire']

To find lemmas in another language, pass the `lang` argument:

In [10]:
wn.lemmas('intuire', lang='ita')

[Lemma('perceive.v.02.intuire'),
 Lemma('feel.v.03.intuire'),
 Lemma('divine.v.01.intuire')]

To check the language of a `Lemma` object, use `Lemma.lang()`:

In [11]:
wn.lemma('feel.v.03.intuire', lang='ita').lang()

'ita'

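### Semantic relations in WordNet
Synsets and lemmas are linked by several kinds of relations:

* hyponyms: more specific terms
* hypernyms: more general terms
* root_hypernyms: the topmost hypernyms
* holonyms: the wholes or groups that a term is a part or member of
* pertainyms: the words that an adjective pertains to (only on `lemmas`)
* derivationally_related_forms: related words in another part of speech (only on `lemmas`)
* lowest common hypernyms: the most specific hypernyms shared by two synsets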

In [34]:
wn.synset('dog.n.01').hyponyms()  # find hyponyms (more specific terms)

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

In [12]:
wn.synset('dog.n.01').hypernyms() # find hypernyms (more general terms)

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

In [13]:
wn.synset('dog.n.01').root_hypernyms() # find the root hypernyms

[Synset('entity.n.01')]

Count the steps from a synset up to `entity`, following the first hypernym at each level:

In [36]:
entity = wn.synset('entity.n.01')
start_synset = wn.synset('dog.n.01')
depth = 0
temp = start_synset
while temp != entity:
    temp = temp.hypernyms()[0] # simply follow the first hypernym listed
    print(temp.name())
    depth += 1
print(depth)

canine.n.02
carnivore.n.01
placental.n.01
mammal.n.01
vertebrate.n.01
chordate.n.01
animal.n.01
organism.n.01
living_thing.n.01
whole.n.02
object.n.01
physical_entity.n.01
entity.n.01
13
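
The same walk can be sketched without NLTK, using a plain dictionary that copies the chain printed above. `FIRST_HYPERNYM` and `first_hypernym_chain` are hypothetical names for this illustration. Unlike the loop above, this version stops at any node that has no parent, so it would not raise an `IndexError` on a synset whose root is not `entity` (which can happen outside the noun hierarchy).

```python
# First-listed hypernym of each synset, copied from the chain printed above.
FIRST_HYPERNYM = {
    'dog.n.01': 'canine.n.02',
    'canine.n.02': 'carnivore.n.01',
    'carnivore.n.01': 'placental.n.01',
    'placental.n.01': 'mammal.n.01',
    'mammal.n.01': 'vertebrate.n.01',
    'vertebrate.n.01': 'chordate.n.01',
    'chordate.n.01': 'animal.n.01',
    'animal.n.01': 'organism.n.01',
    'organism.n.01': 'living_thing.n.01',
    'living_thing.n.01': 'whole.n.02',
    'whole.n.02': 'object.n.01',
    'object.n.01': 'physical_entity.n.01',
    'physical_entity.n.01': 'entity.n.01',
}


def first_hypernym_chain(start, parents):
    """Follow the first hypernym until a node without a parent is reached."""
    chain = []
    node = start
    while node in parents:  # a root has no entry, so the loop ends there
        node = parents[node]
        chain.append(node)
    return chain


chain = first_hypernym_chain('dog.n.01', FIRST_HYPERNYM)
print(len(chain), chain[-1])
```

Because only the first hypernym is followed, the count describes one particular path; a synset with several hypernyms, such as `dog.n.01`, can have paths of other lengths.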


In [14]:
wn.synset('dog.n.01').member_holonyms() # holonyms: groups that a dog is a member of

[Synset('canis.n.01'), Synset('pack.n.06')]

`wn.lemma()` looks up a single `Lemma` by its full name:

In [15]:
vocal = wn.lemma('vocal.a.01.vocal')
vocal

Lemma('vocal.a.01.vocal')

In [16]:
vocal.pertainyms() # pertainyms: the words this adjective pertains to

[Lemma('voice.n.02.voice')]

In [17]:
# vocal (adjective) -> vocalize (verb): derivationally related forms
vocal.derivationally_related_forms()

[Lemma('vocalize.v.02.vocalize')]

Find the **lowest common hypernym** of two synsets:

In [18]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
dog.lowest_common_hypernyms(cat) # most specific hypernyms shared by both synsets

[Synset('carnivore.n.01')]
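
To see what "lowest common hypernym" means, here is a minimal sketch on a toy taxonomy in which every node has a single parent. The `PARENT` table is an assumption made for illustration (only the dog branch follows the output shown in this notebook), and the function names are hypothetical; real WordNet synsets can have several hypernyms, which this sketch ignores.

```python
# Toy single-parent taxonomy; the cat branch is an illustrative assumption.
PARENT = {
    'dog': 'canine',
    'canine': 'carnivore',
    'cat': 'feline',
    'feline': 'carnivore',
    'carnivore': 'mammal',
    'mammal': 'animal',
}


def ancestors(node, parent):
    """Return the node followed by its hypernyms, nearest first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_hypernym(a, b, parent):
    """Return the most specific node that lies above (or at) both a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):  # nearest ancestors of b come first
        if node in seen:
            return node
    return None  # the two nodes share no ancestor


print(lowest_common_hypernym('dog', 'cat', PARENT))
```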

## Lemmas

In [19]:
eat = wn.lemma('eat.v.03.eat')
eat

Lemma('feed.v.06.eat')

In [20]:
eat.key()

'eat%2:34:02::'

In [21]:
eat.count()

4

Retrieve a `Lemma` object from its key:

In [22]:
wn.lemma_from_key('eat%2:34:02::')

Lemma('feed.v.06.eat')
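
The key itself follows the WordNet sense key layout `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`. The parser below is a small sketch of that layout; `parse_sense_key` and `SS_TYPE` are hypothetical names, not NLTK functions.

```python
# Part-of-speech codes used in the ss_type field of a sense key.
SS_TYPE = {1: 'noun', 2: 'verb', 3: 'adjective', 4: 'adverb',
           5: 'adjective satellite'}


def parse_sense_key(key):
    """Split a sense key such as 'eat%2:34:02::' into its fields."""
    lemma, lex_sense = key.split('%', 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(':')
    return {
        'lemma': lemma,
        'pos': SS_TYPE[int(ss_type)],
        'lex_filenum': int(lex_filenum),  # number of the lexicographer file
        'lex_id': int(lex_id),            # distinguishes senses within that file
        'head_word': head_word,           # filled only for adjective satellites
        'head_id': head_id,
    }


print(parse_sense_key('eat%2:34:02::'))
```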

Get the `Synset` that a `Lemma` belongs to:

In [23]:
wn.lemma_from_key(eat.key()).synset()

Synset('feed.v.06')

## Verb Frames

Get the verb frame IDs of a synset with `Synset.frame_ids()`:

In [24]:
wn.synset('think.v.01').frame_ids()

[5, 9]

Get the usage patterns of a lemma with `Lemma.frame_strings()`:

In [25]:
a_think_lemma = wn.synset('think.v.01').lemmas()[0]
a_think_lemma.frame_strings() 

['Something think something Adjective/Noun', 'Somebody think somebody']

In [26]:
for lemma in wn.synset('think.v.01').lemmas():
    print(lemma, lemma.frame_ids())
    print(" | ".join(lemma.frame_strings()))

Lemma('think.v.01.think') [5, 9]
Something think something Adjective/Noun | Somebody think somebody
Lemma('think.v.01.believe') [5, 9]
Something believe something Adjective/Noun | Somebody believe somebody
Lemma('think.v.01.consider') [5, 9]
Something consider something Adjective/Noun | Somebody consider somebody
Lemma('think.v.01.conceive') [5, 9]
Something conceive something Adjective/Noun | Somebody conceive somebody
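
The output above suggests that each frame ID stands for a sentence template into which the lemma name is substituted. The sketch below mimics that with a hand-written table: `FRAME_TEMPLATES` and `frame_strings` are hypothetical names, and only the two templates visible in the output (IDs 5 and 9) are included.

```python
# Templates inferred from the output above; %s marks where the verb goes.
FRAME_TEMPLATES = {
    5: 'Something %s something Adjective/Noun',
    9: 'Somebody %s somebody',
}


def frame_strings(lemma_name, frame_ids):
    """Fill the template of each frame ID with the given verb."""
    return [FRAME_TEMPLATES[frame_id] % lemma_name for frame_id in frame_ids]


for verb in ['think', 'believe', 'consider', 'conceive']:
    print(verb, '->', ' | '.join(frame_strings(verb, [5, 9])))
```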


## Similarity
Path similarity rates two words by the **shortest path** along which they reach a **common hypernym**.  
`synset1.path_similarity(synset2)`:  
Return a **score** denoting how similar two word senses are, based on **the shortest path** that connects the senses in the is-a (hypernym/hyponym) taxonomy.

The score lies between 0 and 1, and comparing a sense with itself gives 1.

In [27]:
dog.path_similarity(cat)

0.2
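
The value 0.2 can be reproduced by hand, assuming the score is computed as 1 / (d + 1), where d is the number of edges on the shortest path between the two senses. The toy taxonomy below is an assumption for illustration (single parent per node, with the cat branch filled in by hand), and the function names are hypothetical.

```python
# Toy single-parent taxonomy; the cat branch is an illustrative assumption.
PARENT = {
    'dog': 'canine',
    'canine': 'carnivore',
    'cat': 'feline',
    'feline': 'carnivore',
    'carnivore': 'mammal',
}


def ancestors(node, parent):
    """Return the node followed by its hypernyms, nearest first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def shortest_path_length(a, b, parent):
    """Count the edges from a up to the nearest shared ancestor and down to b."""
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # no connecting path


def path_similarity(a, b, parent):
    distance = shortest_path_length(a, b, parent)
    return None if distance is None else 1.0 / (distance + 1)


# dog -> canine -> carnivore <- feline <- cat is 4 edges, so 1 / (4 + 1)
print(path_similarity('dog', 'cat', PARENT))
```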

### 1. Leacock-Chodorow Similarity 
Leacock-Chodorow similarity combines the <font color='red'>**shortest path**</font> that connects two word senses with the <font color='red'>**maximum depth**</font> of the taxonomy that contains them.  
`synset1.lch_similarity(synset2)`:   
Return a **score** denoting how similar two word **senses** are, based on **the shortest path (p)** that connects the senses (as above) and **the maximum depth (d)** of the taxonomy in which the senses occur. 

The relationship is given as **-log(p / (2d))**:  

In [28]:
dog.lch_similarity(cat)

2.0281482472922856

In [29]:
wn.lch_similarity(dog,cat)

2.0281482472922856
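
The score can be checked against the formula with plain arithmetic. This sketch rests on two assumptions that are not stated in the text above: that p is counted in nodes (the 4 edges between dog and cat plus 1), and that the maximum depth d of the noun taxonomy is 19. `lch_score` is a hypothetical helper name.

```python
import math


def lch_score(path_edges, max_depth):
    """Leacock-Chodorow score, -log(p / (2d)), with p = path_edges + 1."""
    p = path_edges + 1  # assumption: the path length is counted in nodes
    return -math.log(p / (2.0 * max_depth))


# Assumed inputs: 4 edges between dog and cat, maximum noun depth of 19.
print(lch_score(4, 19))
```

Under those assumptions the result agrees with the value returned by `dog.lch_similarity(cat)` above. Unlike path similarity, the score is not confined to the range 0 to 1.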

### 2. Wu-Palmer Similarity
Wu-Palmer similarity is based on the <font color='red'>**Least Common Subsumer**</font> of the two words: it uses the <font color='red'>**depth**</font> in the taxonomy at which the two words reach their common ancestor, rather than the shortest distance between them. Where there are several paths to the root, the longest path to the root node is the one used in the calculation.  
`synset1.wup_similarity(synset2)`:  
Return a **score** denoting how similar two word senses are, based on **the depth** of the two senses in the taxonomy and that of their **Least Common Subsumer (LCS)**, the most specific ancestor node.  
\> Note that at this time the scores given **do not always** agree with those given by Pedersen's Perl implementation of **WordNet Similarity**.

In [30]:
dog.wup_similarity(cat)

0.8571428571428571

In [31]:
wn.wup_similarity(dog,cat)

0.8571428571428571
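
The value above can be reproduced by hand, assuming the score is 2 * depth(lcs) / (depth(s1) + depth(s2)) with depths counted in nodes from the root. In the hypernym chain printed earlier, `carnivore.n.01` sits 12 nodes below and including `entity.n.01`, and dog and cat are each taken to be 2 edges below it (the cat side is an assumption). `wup_score` is a hypothetical helper name.

```python
def wup_score(lcs_depth, dist1, dist2):
    """Wu-Palmer score from the LCS depth and each sense's distance to the LCS.

    lcs_depth counts nodes from the root down to the LCS, inclusive.
    """
    depth1 = dist1 + lcs_depth  # depth of the first sense
    depth2 = dist2 + lcs_depth  # depth of the second sense
    return 2.0 * lcs_depth / (depth1 + depth2)


# Assumed inputs: LCS depth 12, dog and cat each 2 edges below the LCS.
print(wup_score(12, 2, 2))
```

A deeper common ancestor raises the score, which is why two specific, closely related senses score higher than two senses that only meet near the root.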

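## Information Content with `wordnet_ic`
The measures below need an information content (IC) dictionary, which `wordnet_ic.ic()` loads from a precomputed file:
```python
from nltk.corpus import wordnet_ic
wordnet_ic.ic('ic-brown.dat')  # the file name is required
```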

In [37]:
from nltk.corpus import wordnet_ic
from nltk.corpus import genesis

brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
genesis_ic = wn.ic(genesis, False, 0.0) # build IC from a corpus: (corpus, weight_senses_equally, smoothing)

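### 3. Resnik and Jiang-Conrath Similarity
`synset1.res_similarity(synset2, ic)`:  
Return a **score** denoting how similar two word senses are, based on **the Information Content (IC)** of the Least Common Subsumer (LCS), that is, $IC(lcs)$.

`synset1.jcn_similarity(synset2, ic)`:  
Return a **score** denoting how similar two word senses are, based on **the Information Content (IC)** of the Least Common Subsumer (LCS) and that of the two input Synsets. The relationship is given by the equation:
$$\cfrac{1}{IC(s1) + IC(s2) - 2 \times IC(lcs)}$$

The two cells below call `res_similarity`, so the values shown are Resnik scores; `jcn_similarity` takes the same arguments.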

In [38]:
dog.res_similarity(cat, brown_ic)

7.911666509036577

In [39]:
dog.res_similarity(cat, genesis_ic)

7.204023991374837

### 4. Lin Similarity
`synset1.lin_similarity(synset2, ic)`:  
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation:
$$\cfrac {2 \times IC(lcs)}{(IC(s1) + IC(s2))}$$

In [40]:
dog.lin_similarity(cat, semcor_ic)

0.8863288628086228
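
The information-content measures can be compared side by side with a small sketch. It assumes the usual definition IC(c) = -log P(c), where P(c) is the relative frequency of a concept in a corpus. The counts below are made up for illustration, and all function names are hypothetical.

```python
import math


def information_content(count, total):
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(count / total)


def resnik(ic_lcs):
    """Resnik: the information content of the least common subsumer."""
    return ic_lcs


def jiang_conrath(ic1, ic2, ic_lcs):
    """Jiang-Conrath: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

    Undefined when the denominator is zero, e.g. for identical senses.
    """
    return 1.0 / (ic1 + ic2 - 2.0 * ic_lcs)


def lin(ic1, ic2, ic_lcs):
    """Lin: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    return 2.0 * ic_lcs / (ic1 + ic2)


# Hypothetical corpus counts, not taken from any real IC file.
TOTAL = 1_000_000
ic_dog = information_content(100, TOTAL)
ic_cat = information_content(80, TOTAL)
ic_carnivore = information_content(400, TOTAL)  # the more general concept

print('Resnik:', resnik(ic_carnivore))
print('Jiang-Conrath:', jiang_conrath(ic_dog, ic_cat, ic_carnivore))
print('Lin:', lin(ic_dog, ic_cat, ic_carnivore))
```

Because all three depend on corpus frequencies, the same pair of synsets gets different scores under different IC files, as the `brown_ic` and `genesis_ic` results above show.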