# Lab2.1: Words, concepts, semantic relations in Wordnet-NLTK

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, you will work with the WordNet databases as they have been incorporated into the NLTK package.
Detailed information on how to access and use WordNet can be found here: http://www.nltk.org/howto/wordnet.html

Study the documentation and familiarize yourself with the different commands. Some of them are repeated below.


In [1]:
from nltk.corpus import wordnet as wn

Look up a word using the "wn.synsets()" function. This will give you a list of synsets in which the lookup string is matched with a lemma (synonym).

In [2]:
all_dog_synsets = wn.synsets('dog')
print('Number of synsets with "dog" as a synonym:', len(all_dog_synsets))
print(all_dog_synsets)

Number of synsets with "dog" as a synonym: 8
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]


Note that the synsets in this list are printed by listing the first synonym only. Also note that we got nouns and verbs. We can iterate over the list to get each synset as an 'object' and next call specific functions for each synset:

In [3]:
for synset in all_dog_synsets:
    print()
    print('The synonyms = ', synset.lemmas())
    print('The definition =', synset.definition())
    print('The full path of hypernyms =', synset.hypernym_paths())
    # max_depth() is the length of the longest hypernym path from this synset up to the root
    print('The maximum depth of its hyponymy chain is = ', synset.max_depth())



The synonyms =  [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
The definition = a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
The full path of hypernyms = [[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]
The maximum depth of its hyponymy chain is =  13

The synonyms =  [Lemma('frump.n.01.frump'), Lemma('frump.n.01.d
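
How hypernym_paths() and max_depth() relate can be sketched on a toy copy of the two dog.n.01 paths printed above. This is only an illustration: plain strings stand in for synsets, the names HYPERNYMS and hypernym_paths are made up for this sketch, and it assumes that the depth is the number of edges on a path to the root, which matches the value 13 in the output.

```python
# Toy hypernym links copied from the two dog.n.01 paths printed above.
# Plain strings stand in for synsets; nothing here calls NLTK.
HYPERNYMS = {
    'dog': ['canine', 'domestic_animal'],
    'canine': ['carnivore'],
    'carnivore': ['placental'],
    'placental': ['mammal'],
    'mammal': ['vertebrate'],
    'vertebrate': ['chordate'],
    'chordate': ['animal'],
    'domestic_animal': ['animal'],
    'animal': ['organism'],
    'organism': ['living_thing'],
    'living_thing': ['whole'],
    'whole': ['object'],
    'object': ['physical_entity'],
    'physical_entity': ['entity'],
}


def hypernym_paths(node):
    """Return every path from the root down to node, one list per path."""
    parents = HYPERNYMS.get(node, [])
    if not parents:
        # A node without hypernyms is a root: the path is just the node itself.
        return [[node]]
    return [path + [node] for parent in parents for path in hypernym_paths(parent)]


paths = hypernym_paths('dog')
# Depth is counted in edges, so it is one less than the number of nodes on a path.
depths = [len(path) - 1 for path in paths]
print('Number of paths:', len(paths))
print('Longest path (edges):', max(depths))
print('Shortest path (edges):', min(depths))
```

A synset with two hypernyms, such as dog.n.01, gets one path per route to the root, which is why the output above shows a list of two lists.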

You can also obtain only synsets with a certain part-of-speech.

In [17]:
all_dog_verb_synsets = wn.synsets('dog', 'v')
print(all_dog_verb_synsets)

[Synset('chase.v.01')]
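
The part of speech is also visible in the synset name itself: names such as 'chase.v.01' follow the pattern lemma.pos.sense-number. A small sketch that splits such a name into its parts (the helper split_synset_name is hypothetical and not part of NLTK):

```python
def split_synset_name(name):
    """Split a synset name like 'chase.v.01' into (lemma, pos, sense_number)."""
    # Split from the right so that a lemma containing dots would stay intact.
    lemma, pos, number = name.rsplit('.', 2)
    return lemma, pos, int(number)


# Names taken from the outputs in this notebook.
for name in ['dog.n.01', 'chase.v.01', "cat-o'-nine-tails.n.01"]:
    print(split_synset_name(name))
```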


Various functions can be called on a synset object to obtain related data structures. Try some to get a feeling for them.

In [4]:
doggy_synset = all_dog_synsets[0]
print('Part holonyms:',doggy_synset.part_holonyms())
print('Member holonyms:',doggy_synset.member_holonyms())
print('Substance holonyms:',doggy_synset.substance_holonyms())

print('Part meronyms:',doggy_synset.part_meronyms())
print('Member meronyms:',doggy_synset.member_meronyms())
print('Substance meronyms:',doggy_synset.substance_meronyms())


Part holonyms: []
Member holonyms: [Synset('canis.n.01'), Synset('pack.n.06')]
Substance holonyms: []
Part meronyms: [Synset('flag.n.07')]
Member meronyms: []
Substance meronyms: []


In [28]:
chase_synset = all_dog_verb_synsets[0]
print('Caused:', chase_synset.causes())
print('Entailments:',chase_synset.entailments())
print('Hyponyms:', chase_synset.hyponyms())
print('Examples:', chase_synset.examples())


Caused: []
Entailments: []
Hyponyms: [Synset('hound.v.01'), Synset('quest.v.02'), Synset('run_down.v.07'), Synset('tree.v.03')]
Examples: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit']


To list all attributes and methods of a synset object, you can use Python's built-in 'dir' function:

In [51]:
dir(chase_synset)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'instance_hypernyms',
 'instance_hyponyms',
 'jcn

## Wordnets in other languages

There are wordnets in many different languages, and many are linked to the English WordNet. The ones that are freely available through the Open Multilingual Wordnet platform (http://compling.hss.ntu.edu.sg/omw/) are also available in NLTK. You can use "wn.langs()" to get the full list.

In [47]:
sorted(wn.langs())

['als',
 'arb',
 'bul',
 'cat',
 'cmn',
 'dan',
 'ell',
 'eng',
 'eus',
 'fas',
 'fin',
 'fra',
 'glg',
 'heb',
 'hrv',
 'ind',
 'ita',
 'jpn',
 'nld',
 'nno',
 'nob',
 'pol',
 'por',
 'qcn',
 'slv',
 'spa',
 'swe',
 'tha',
 'zsm']

The listed language wordnets were created by translating the English synsets, the so-called Expand Method (Vossen (ed.) 1998). This means that the concepts of the English wordnet are re-used and the synonyms in the synsets are translated.

Since the concept structure is the same for all these wordnets (they share the English concepts), you can ask for the language lemmas linked to any synset in English.

Are there any Japanese lemmas linked to English dog sense 1?

In [5]:
# Are there any Japanese lemmas linked to English dog sense 1
wn.synset('dog.n.01').lemma_names('jpn')

['イヌ', 'ドッグ', '洋犬', '犬', '飼犬', '飼い犬']

In [6]:
# The same for Dutch
wn.synset('dog.n.01').lemma_names('nld')

['hond', 'joekel']

In [11]:
wn.synsets('dog.n.01.hond')

[]

Unfortunately, you cannot get the Dutch synsets by calling wn.synsets() the way we did before for 'dog': without a language argument the lookup is done in English, so the next call returns an empty list:

In [35]:
all_dog_synsets = wn.synsets('hond')
print('Number of synsets with "hond" as a synonym:', len(all_dog_synsets))
print(all_dog_synsets)

Number of synsets with "hond" as a synonym: 0
[]


One way to reach the Dutch entries is the wn.lemmas() function called with a language code, which returns a list of lemma objects:

In [9]:
dutch_dog_lemmas = wn.lemmas('hond', lang='nld')
print(dutch_dog_lemmas)

[Lemma('dog.n.01.hond'), Lemma('asshole.n.01.hond')]


A Lemma is yet another object with attributes and functions, some of which overlap with those of a synset. Let's check them out through 'dir':

In [13]:
dutch_lemma = dutch_dog_lemmas[0]
dir(dutch_lemma)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_frame_ids',
 '_frame_strings',
 '_hypernyms',
 '_instance_hypernyms',
 '_key',
 '_lang',
 '_lex_id',
 '_lexname_index',
 '_name',
 '_related',
 '_synset',
 '_syntactic_marker',
 '_wordnet_corpus_reader',
 'also_sees',
 'antonyms',
 'attributes',
 'causes',
 'count',
 'derivationally_related_forms',
 'entailments',
 'frame_ids',
 'frame_strings',
 'hypernyms',
 'hyponyms',
 'instance_hypernyms',
 'instance_hyponyms',
 'key',
 'lang',
 'member_holonyms',
 'member_meronyms',
 'name',
 'part_holonyms',
 'part_meronyms',
 'pertainyms',
 'region_domains',
 'similar_tos',
 'substance_holonyms',
 '

Some differ from those of a synset, such as lang(), and through .synset() we can go from the lemma to its synset. The synset information is the same as for the English wordnet because the Open Dutch Wordnet (http://wordpress.let.vupr.nl/odwn/) was created by expanding the English wordnet.

In [15]:
print(dutch_lemma.lang())

dutch_dog_synset = dutch_lemma.synset()
print(dutch_dog_synset.hypernyms())
print(dutch_dog_synset.definition())
print(dutch_dog_synset.hypernym_paths())



nld
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]


So we have many wordnets in different languages. Can we get statistics on their coverage?

In [39]:
print('Dutch:', len(wn.all_lemma_names(pos='n', lang='nld')))
print('Italian:', len(wn.all_lemma_names(pos='n', lang='ita')))
print('Japanese:', len(wn.all_lemma_names(pos='n', lang='jpn')))
print('Slovene:', len(wn.all_lemma_names(pos='n', lang='slv')))

Dutch: 36896
Italian: 31477
Japanese: 64797
Slovene: 30898


### Get the Dutch and Japanese dogs

The next simple "for" loop iterates over all direct dog hyponyms in the English wordnet and prints each synset with any Dutch and Japanese labels. We can easily see which dogs are 'lexicalized' in which language.

In [19]:
dog = wn.synset('dog.n.01')
dogs = dog.hyponyms()
print('Number of dogs:', len(dogs))

for s in dogs:
    print(s)
    print('Dutch:', s.lemma_names('nld'))
    print('Japanese:', s.lemma_names('jpn'))
    print()

Number of dogs: 18
Synset('basenji.n.01')
Dutch: []
Japanese: []

Synset('corgi.n.01')
Dutch: []
Japanese: ['ウェルシュ・コーギー']

Synset('cur.n.01')
Dutch: ['mormel', 'idioot', 'halve_gare', 'bastaard', 'bastaardhond', 'straathond']
Japanese: ['雑犬', '雑種犬', '駄犬']

Synset('dalmatian.n.02')
Dutch: ['dalmatiër', 'Dalmatische']
Japanese: []

Synset('great_pyrenees.n.01')
Dutch: []
Japanese: []

Synset('griffon.n.02')
Dutch: []
Japanese: ['グリフォン', 'ブリュッセルグリフォン', 'グリフォンブリュッセロワ']

Synset('hunting_dog.n.01')
Dutch: []
Japanese: ['猟犬']

Synset('lapdog.n.01')
Dutch: []
Japanese: []

Synset('leonberg.n.01')
Dutch: []
Japanese: []

Synset('mexican_hairless.n.01')
Dutch: []
Japanese: []

Synset('newfoundland.n.01')
Dutch: []
Japanese: []

Synset('pooch.n.01')
Dutch: ['bastaard', 'vuilnisbakkie']
Japanese: ['わんこ', 'わんわん', 'わんちゃん']

Synset('poodle.n.01')
Dutch: ['poedel']
Japanese: ['プードル']

Synset('pug.n.01')
Dutch: ['mops', 'mopshond']
Japanese: ['パグ']

Synset('puppy.n.01')
Dutch: ['hondejong', 'hondenjong

This gives the dogs that are direct hyponyms, but there may be more dogs among the hyponyms of hyponyms of hyponyms, etc. The WordNet interface documentation uses a so-called anonymous function (lambda) that is applied recursively to the hyponyms of each synset. This is more advanced Python. For now, accept that it traverses the hyponym tree from a starting synset and puts all results in a list.

In [17]:
hypo = lambda s: s.hyponyms()

In [20]:
dogs_at_all_levels = list(dog.closure(hypo))
print('Number of dogs:', len(dogs_at_all_levels))


Number of dogs: 189
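
What such a traversal does can be sketched without NLTK. The toy hierarchy below is a small, partly assumed excerpt of the dog hyponyms, and the closure function here is a simplified stand-in written for illustration, not NLTK's implementation. It walks the hierarchy breadth-first and yields every reachable node once.

```python
from collections import deque

# Toy hyponym links (an illustrative excerpt; plain strings stand in for synsets).
HYPONYMS = {
    'dog': ['corgi', 'dalmatian', 'poodle'],
    'corgi': ['cardigan', 'pembroke'],
    'dalmatian': ['liver-spotted_dalmatian'],
}


def closure(start, rel):
    """Yield every node reachable from start through rel, breadth-first, each once."""
    seen = set()
    queue = deque(rel(start))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        yield node
        # Queue the node's own hyponyms so deeper levels are visited later.
        queue.extend(rel(node))


# Same shape as the lambda above, but reading from the toy dictionary.
hypo = lambda s: HYPONYMS.get(s, [])

direct = hypo('dog')
all_levels = list(closure('dog', hypo))
print('Direct hyponyms:', len(direct))
print('Hyponyms at all levels:', len(all_levels))
```

The direct hyponyms come first in the result, followed by their own hyponyms, which is the same ordering visible in the full listing of dogs in this notebook.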


Ahhh, we now have 189 dogs instead of 18! Let's check their coverage in Dutch:

In [21]:

for s in dogs_at_all_levels:
    print(s)
    print(s.lemma_names('nld'))
    print()

Synset('basenji.n.01')
[]

Synset('corgi.n.01')
[]

Synset('cur.n.01')
['mormel', 'idioot', 'halve_gare', 'bastaard', 'bastaardhond', 'straathond']

Synset('dalmatian.n.02')
['dalmatiër', 'Dalmatische']

Synset('great_pyrenees.n.01')
[]

Synset('griffon.n.02')
[]

Synset('hunting_dog.n.01')
[]

Synset('lapdog.n.01')
[]

Synset('leonberg.n.01')
[]

Synset('mexican_hairless.n.01')
[]

Synset('newfoundland.n.01')
[]

Synset('pooch.n.01')
['bastaard', 'vuilnisbakkie']

Synset('poodle.n.01')
['poedel']

Synset('pug.n.01')
['mops', 'mopshond']

Synset('puppy.n.01')
['hondejong', 'hondenjong', 'pup', 'puppy']

Synset('spitz.n.01')
[]

Synset('toy_dog.n.01')
[]

Synset('working_dog.n.01')
['werkhond']

Synset('cardigan.n.02')
[]

Synset('pembroke.n.01')
[]

Synset('feist.n.01')
['Feist']

Synset('pariah_dog.n.01')
[]

Synset('liver-spotted_dalmatian.n.01')
[]

Synset('brabancon_griffon.n.01')
[]

Synset('courser.n.03')
[]

Synset('dachshund.n.01')
['dashond', 'taks', 'teckel']

Synset('hound.n

It is clear that the Open Dutch Wordnet lacks coverage compared to the English WordNet. There is work to do to complete it. Perhaps a nice project for you to work on to increase the coverage of the Dutch WordNet.
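
Instead of eyeballing the printed lists, the coverage can be counted. The sketch below assumes only that each synset offers a lemma_names(lang) method, as used above. The ToySynset class and the coverage function are hypothetical helpers for illustration, and the toy labels are copied from the outputs in this notebook.

```python
def coverage(synsets, lang):
    """Return (covered, total): how many synsets have at least one lemma in lang."""
    synsets = list(synsets)
    covered = sum(1 for s in synsets if s.lemma_names(lang))
    return covered, len(synsets)


class ToySynset:
    """Stand-in for an NLTK synset; only lemma_names(lang) is modelled."""

    def __init__(self, labels):
        self._labels = labels

    def lemma_names(self, lang='eng'):
        return self._labels.get(lang, [])


toy_dogs = [
    ToySynset({'nld': [], 'jpn': []}),                          # basenji.n.01
    ToySynset({'nld': ['poedel'], 'jpn': ['プードル']}),          # poodle.n.01
    ToySynset({'nld': ['mops', 'mopshond'], 'jpn': ['パグ']}),    # pug.n.01
    ToySynset({'nld': [], 'jpn': ['猟犬']}),                     # hunting_dog.n.01
]

for lang in ('nld', 'jpn'):
    covered, total = coverage(toy_dogs, lang)
    print(f'{lang}: {covered} of {total} synsets have a label')
```

With NLTK loaded, the same function should accept dogs_at_all_levels directly, for example coverage(dogs_at_all_levels, 'nld').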

Question: are there any Dutch words for dogs that are not in the English WordNet?

## Wordnet Similarity

A whole series of similarity functions has been built in and can be used for scoring synset pairs. We show here how it works for "path" similarity; see the documentation for the other methods.

In [74]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')

In [75]:
print(dog.path_similarity(cat))
print(hit.path_similarity(slap))
print(wn.path_similarity(hit, slap))


0.2
0.14285714285714285
0.14285714285714285
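
These scores can be read as 1 / (1 + d), where d is the number of edges on the shortest path between the two synsets in the hypernym hierarchy: 0.2 corresponds to d = 4 and 0.1428... to d = 6. A minimal sketch on a toy hierarchy follows. The links are an assumed excerpt of the noun hierarchy, and distances_up and path_similarity are illustrative functions written for this sketch, not NLTK's code.

```python
# Toy hypernym links (assumed excerpt; plain strings stand in for synsets).
HYPERNYMS = {
    'dog': ['canine'],
    'canine': ['carnivore'],
    'cat': ['feline'],
    'feline': ['carnivore'],
    'carnivore': [],
}


def distances_up(node):
    """Map node and each of its ancestors to the minimum number of edges from node."""
    dist = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in HYPERNYMS.get(current, []):
                if parent not in dist:
                    dist[parent] = dist[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None if a and b are not connected."""
    dist_a, dist_b = distances_up(a), distances_up(b)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return None
    # The shortest path goes up from a to a shared ancestor and down to b.
    shortest = min(dist_a[n] + dist_b[n] for n in shared)
    return 1 / (1 + shortest)


print(path_similarity('dog', 'cat'))
print(path_similarity('dog', 'dog'))
```

In the toy hierarchy the path dog, canine, carnivore, feline, cat has four edges, so the score is 1 / 5.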


If you have read about WordNet, you may know that the noun hierarchy has a single top-node synset, 'entity.n.01'. All nominal synsets descend from this synset. This is not the case for verbs or adjectives. The verb synsets form 559 islands of disconnected synsets with 559 root nodes. The English WordNet editors decided not to connect these islands in an artificial way, as was done for nouns.

Let's check this with the NLTK root_hypernyms() function for the above verb synsets 'hit' and 'slap'.

In [80]:
print('Root for slap:', slap.root_hypernyms())
print('Root for hit:', hit.root_hypernyms())

Root for slap: [Synset('touch.v.01')]
Root for hit: [Synset('move.v.02')]


How is it possible to get a similarity value if the subgraphs are not connected? The package imposes a simulated root node by grouping all the subgraph top nodes under a single node. This is the default setting. If you do not want this, you can pass simulate_root=False to turn it off:

In [76]:
print(hit.path_similarity(slap, simulate_root=False))
print(wn.path_similarity(hit, slap, simulate_root=False))

None
None


Without the simulated root there is no path from 'hit' to 'slap'.
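
The effect of a simulated root can be sketched on two disconnected toy trees. The chains below are shortened on purpose (the real WordNet paths from 'hit' and 'slap' to their roots are longer), and FAKE_ROOT, distances_up and path_similarity are hypothetical names for this illustration, not NLTK internals.

```python
# Two disconnected toy trees; only the root names come from the output above.
HYPERNYMS = {
    'slap': ['touch'],
    'touch': [],
    'hit': ['move'],
    'move': [],
}
FAKE_ROOT = '*ROOT*'


def distances_up(node, simulate_root):
    """Map node and its ancestors to their edge distance from node."""
    dist = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            parents = HYPERNYMS.get(current, [])
            # Optionally hang every real root under one artificial root node.
            if not parents and simulate_root and current != FAKE_ROOT:
                parents = [FAKE_ROOT]
            for parent in parents:
                if parent not in dist:
                    dist[parent] = dist[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def path_similarity(a, b, simulate_root=True):
    """Return 1 / (1 + shortest path length), or None if no path exists."""
    dist_a = distances_up(a, simulate_root)
    dist_b = distances_up(b, simulate_root)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return None
    return 1 / (1 + min(dist_a[n] + dist_b[n] for n in shared))


print(path_similarity('hit', 'slap'))
print(path_similarity('hit', 'slap', simulate_root=False))
```

With the artificial root the two trees share one ancestor, so a path exists; without it the sets of ancestors do not overlap and the function returns None.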

## Using WordNet similarity for words instead of synsets

If we want to use this for words, we first need to obtain all the synsets for each word and then compare every synset of the first word with every synset of the second. We thus need a for-loop inside a for-loop: the outer loop iterates over the synsets of the first word, and the inner loop compares each of them with the synsets of the second word.

In [81]:
w1='dog'
w2='cat'
for s1 in wn.synsets(w1, 'n'):
    print(s1,':')
    for s2 in wn.synsets(w2, 'n'):
        print('\t', s2,':', s1.path_similarity(s2))

Synset('dog.n.01') :
	 Synset('cat.n.01') : 0.2
	 Synset('guy.n.01') : 0.125
	 Synset('cat.n.03') : 0.125
	 Synset('kat.n.01') : 0.07692307692307693
	 Synset('cat-o'-nine-tails.n.01') : 0.08333333333333333
	 Synset('caterpillar.n.02') : 0.07692307692307693
	 Synset('big_cat.n.01') : 0.2
	 Synset('computerized_tomography.n.01') : 0.05263157894736842
Synset('frump.n.01') :
	 Synset('cat.n.01') : 0.07142857142857142
	 Synset('guy.n.01') : 0.125
	 Synset('cat.n.03') : 0.125
	 Synset('kat.n.01') : 0.1
	 Synset('cat-o'-nine-tails.n.01') : 0.07142857142857142
	 Synset('caterpillar.n.02') : 0.06666666666666667
	 Synset('big_cat.n.01') : 0.07142857142857142
	 Synset('computerized_tomography.n.01') : 0.05555555555555555
Synset('dog.n.03') :
	 Synset('cat.n.01') : 0.07692307692307693
	 Synset('guy.n.01') : 0.2
	 Synset('cat.n.03') : 0.14285714285714285
	 Synset('kat.n.01') : 0.1111111111111111
	 Synset('cat-o'-nine-tails.n.01') : 0.07692307692307693
	 Synset('caterpillar.n.02') : 0.07142857142857

We can use the highest similarity among all pairs to find the strongest association.
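
Picking the best pair can be sketched as follows. The function best_pair is a hypothetical helper; it takes the similarity measure as an argument so that it runs here on a toy score table, whose values are copied from the output above, without needing NLTK.

```python
from itertools import product


def best_pair(synsets1, synsets2, similarity):
    """Return (score, s1, s2) for the highest-scoring pair, or None if no pair has a score."""
    scored = []
    for s1, s2 in product(synsets1, synsets2):
        score = similarity(s1, s2)
        # Some similarity functions return None when no path exists; skip those pairs.
        if score is not None:
            scored.append((score, s1, s2))
    return max(scored, key=lambda item: item[0]) if scored else None


# Toy similarity table with a few values copied from the output above.
TOY_SCORES = {
    ('dog.n.01', 'cat.n.01'): 0.2,
    ('dog.n.01', 'guy.n.01'): 0.125,
    ('frump.n.01', 'cat.n.01'): 0.07142857142857142,
    ('frump.n.01', 'guy.n.01'): 0.125,
}
toy_similarity = lambda a, b: TOY_SCORES.get((a, b))

print(best_pair(['dog.n.01', 'frump.n.01'], ['cat.n.01', 'guy.n.01'], toy_similarity))
```

With NLTK, the call would look like best_pair(wn.synsets('dog', 'n'), wn.synsets('cat', 'n'), lambda a, b: a.path_similarity(b)).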

# End of this notebook