# Lab2.1: Words, concepts, semantic relations in Wordnet-NLTK

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, you are going to work with wordnet databases that have been incorporated in the NLTK package.
Detailed information on how to access and use wordnet can be found here: http://www.nltk.org/howto/wordnet.html

Study the documentation and familiarize yourself with the different commands. Some of the documentation is also explained below.

As an abstract model, WordNet provides a mapping of the vocabulary of a language to a set of concepts with semantic relations between these concepts. The concepts and relations form a huge semantic graph or space, in which one concept is related to another. In the next picture, you only see a fragment of the WordNet graph for all concepts related to communication, which is the concept in the center. This graph only shows the `hyponymy` relations, which are the most dominant relations in WordNet. You can travel from the edge of the graph to the center and take another branch back to the edge. In this way, you could reach any communication-related concept.

![wordnet](./images/wn-communication.png)

The graph is so dense that you cannot read the words that map to each concept. You can imagine what the graph would look like if you projected the complete WordNet space as a universe.

The mapping of the words of a language to the concepts is complex. Synonymous words form a so-called `synset`, e.g. `{board, surf board}`, which represents a single concept. However, a word from a synset such as `board` can also have other meanings (`polysemy`), and therefore occur in other synsets as well. There is therefore a many-to-many relation between words and concepts through the synset mappings. Finally, it is important to realize that nouns, verbs and adjectives form different subgraphs in WordNet and exhibit different relations.
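
The many-to-many relation between words and synsets can be made concrete with a small sketch. The synset identifiers and word lists below are hypothetical toy data, not taken from WordNet; the point is only that one synset holds several words and one word can occur in several synsets.

```python
from collections import defaultdict

# Hypothetical toy data: synset identifiers mapped to their synonyms.
toy_synsets = {
    "surfboard.n.01": ["board", "surf board"],
    "board.n.02": ["board", "plank"],
    "committee.n.01": ["board", "committee"],
}

def invert(synset_to_words):
    """Build the reverse index: word -> list of synsets it occurs in."""
    word_to_synsets = defaultdict(list)
    for synset_id, words in synset_to_words.items():
        for word in words:
            word_to_synsets[word].append(synset_id)
    return dict(word_to_synsets)

word_index = invert(toy_synsets)
print(word_index["board"])   # polysemous: occurs in three toy synsets
print(word_index["plank"])   # occurs in one toy synset only
```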

In this notebook, we will explain how to access the graph structure and explore it as well as how to measure how close or distant concepts and words are according to the graph. We first need to import the *wordnet* module from NLTK:

In [1]:
from nltk.corpus import wordnet as wn

You can look up a word using the "wn.synsets()" function. This will give you a list of synsets in which the lookup string is matched with a lemma (synonym).

In [2]:
all_dog_synsets = wn.synsets('dog')
print('Number of synsets with "dog" as a synonym:', len(all_dog_synsets))
print(all_dog_synsets)

Number of synsets with "dog" as a synonym: 8
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]


The word "dog" thus occurs as a synonym in 8 synsets (concepts), which means it has 8 different meanings in WordNet and can be located in 8 different places in the graph.

If we print the synsets, we get a shorthand representation in which only the first synonym of the complete synset is shown. In some cases this is the word "dog" but we also see other words such as "cad" and "chase". So "dog" is always one of the synonyms but not always the first one listed. You can also observe that the word is followed by a part-of-speech tag and a two-digit number (the sense number of the word). We see that "dog" can also be a verb in English according to WordNet because "chase.v.01" is a verb and synsets have only one part-of-speech.
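
The shorthand name of a synset can be taken apart with plain string operations. The helper below is a hypothetical illustration (it is not part of NLTK) and assumes the name always ends in a part-of-speech tag and a number, as in the output above.

```python
def parse_synset_name(name):
    """Split a name such as 'dog.n.01' into (word, pos, sense_number)."""
    # Split from the right, so that dots inside the word itself are kept.
    word, pos, sense = name.rsplit(".", 2)
    return word, pos, int(sense)

print(parse_synset_name("dog.n.01"))
print(parse_synset_name("chase.v.01"))
```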

Let us inspect a synset in more detail. We take the first synset from the list and print out some information to understand the structure.

In [3]:
first_synset = all_dog_synsets[0]
print('The synset:', first_synset)
print('Python data type:', type(first_synset))
print()
print('The synonyms:', first_synset.lemmas())
print()
print('The definition:', first_synset.definition())
print()
print('The full path of hypernyms:', first_synset.hypernym_paths())
print()
print('The maximum depth of its hyponymy chain is:', first_synset.max_depth())

The synset: Synset('dog.n.01')
Python data type: <class 'nltk.corpus.reader.wordnet.Synset'>

The synonyms: [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]

The definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds

The full path of hypernyms: [[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]

The maximum depth of 

We see that the synset is of a specific type **Synset** defined in the wordnet module. The function **lemmas** gives us all the synonyms, each prefixed by the synset identifier (e.g. **dog.n.01**). We can also get the `definition`, the `hypernym_paths` and the `max_depth` at which we find this synset in the complete wordnet graph.

The function **hypernym_paths** is interesting. Let us inspect it a bit more. It follows all the `hypernym` relations upward, starting from this synset, until there are no more hypernyms. Note that there can be multiple hypernym chains for a synset because occasionally WordNet gives multiple hypernyms for a synset. If you examine the output of the **hypernym_paths** function more closely, you will see it is actually a list of lists.

In [6]:
for path in first_synset.hypernym_paths():
    print(path)
    print()

[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')]

[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]



The next piece of code makes it easier to see the structure:

In [7]:
for path in first_synset.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('chordate.n.01')
                         Synset('vertebrate.n.01')
                            Synset('mammal.n.01')
                               Synset('placental.n.01')
                                  Synset('carnivore.n.01')
                                     Synset('canine.n.02')
                                        Synset('dog.n.01')
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('domestic_animal.n.01')
                         Synset('dog.n.01')


We see that dog in its first sense has two paths due to the hypernyms **canine.n.02** and **domestic_animal.n.01**. These are two different ways of classifying dogs that eventually end up in the same top nodes from **animal.n.01** onwards but provide different routes to it.
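
How a single synset ends up with several paths can be sketched on a toy graph. The dictionary below is a simplified fragment of the hierarchy shown above (intermediate nodes are left out), and the function is a hypothetical re-implementation for illustration, not NLTK's own code.

```python
# Toy fragment: each node maps to its direct hypernyms.
toy_hypernyms = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}

def all_paths_to_root(node, hypernyms):
    """Return every root-to-node path, one list per path."""
    parents = hypernyms[node]
    if not parents:              # a root has no hypernyms: the path starts here
        return [[node]]
    paths = []
    for parent in parents:       # one or more paths per direct hypernym
        for path in all_paths_to_root(parent, hypernyms):
            paths.append(path + [node])
    return paths

for path in all_paths_to_root("dog", toy_hypernyms):
    print(" -> ".join(path))
```

Every node with more than one direct hypernym multiplies the number of paths, which is why the result is a list of lists.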

This makes you think: what are all the possible ways to classify something, and how can we know them?

What about wordnets for other languages? Should these be classified in the same way? Also for different cultures? Clearly WordNet is just a proxy for a semantic space for the English language and different choices could have been made when building it. Nevertheless, it is still the most precise proxy we have at the moment.

We can iterate over the list of all synsets with the synonym dog, get each synset as an 'object' and next call specific functions to know more what they represent. This gives us a better insight into the different meanings of "dog" in WordNet.

In [8]:
for synset in all_dog_synsets:
    print()
    print('The synset =', synset)
    print('Type', type(synset))
    print('The synonyms = ', synset.lemmas())
    print('The definition =', synset.definition())
    print('The maximum depth of its hyponymy chain is = ', synset.max_depth())


The synset = Synset('dog.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
The definition = a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
The maximum depth of its hyponymy chain is =  13

The synset = Synset('frump.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('frump.n.01.frump'), Lemma('frump.n.01.dog')]
The definition = a dull unattractive unpleasant girl or woman
The maximum depth of its hyponymy chain is =  10

The synset = Synset('dog.n.03')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('dog.n.03.dog')]
The definition = informal term for a man
The maximum depth of its hyponymy chain is =  9

The synset = Synset('cad.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('cad.n.01.cad

Take your time to inspect the data and test your knowledge of the English language. Do these meanings make sense and how different are they?

You can also obtain synsets with a certain part-of-speech only by passing a part-of-speech value as a parameter.

In [9]:
all_dog_verb_synsets = wn.synsets('dog', 'v')
print(all_dog_verb_synsets)

[Synset('chase.v.01')]


Remember from the lecture that verbs do not have an augmented top layer in WordNet; instead, the verb subnetwork consists of many isolated subgraphs, like islands. So let's see what the hypernym path is for this verbal synset:

In [11]:
for path in all_dog_verb_synsets[0].hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper, hyper.definition())
        indent += "   "

 Synset('travel.v.01') change location; move, travel, or proceed, also metaphorically
    Synset('pursue.v.02') follow in or as if in pursuit
       Synset('chase.v.01') go after with the intent to catch


Interesting. The path is very short and not so informative. Is there a way to proceed further up than `travel.v.01`?
What would you consider as a possible hypernym above `travel.v.01` and what would be the sibling concepts?

These are the questions that wordnet builders need to answer.

Various functions can be called on the synset object, yielding different data structures. Try some to get a feeling for it. Below we show the specific relations that synsets can have besides hyponymy:

In [10]:
doggy_synset = all_dog_synsets[0]
#### Part - to -  whole relations:
print('Part holonyms:',doggy_synset.part_holonyms())
print('Member holonyms:',doggy_synset.member_holonyms())
print('Substance holonyms:',doggy_synset.substance_holonyms())

### Whole - to - part relations
print('Part meronyms:',doggy_synset.part_meronyms())
print('Member meronyms:',doggy_synset.member_meronyms())
print('Substance meronyms:',doggy_synset.substance_meronyms())


Part holonyms: []
Member holonyms: [Synset('canis.n.01'), Synset('pack.n.06')]
Substance holonyms: []
Part meronyms: [Synset('flag.n.07')]
Member meronyms: []
Substance meronyms: []


In [11]:
chase_synset = all_dog_verb_synsets[0]
#### Relations for verbal synsets
print('Caused:', chase_synset.causes())
print('Entailments:',chase_synset.entailments())
print('Hyponyms:', chase_synset.hyponyms())
print('Examples:', chase_synset.examples())


Caused: []
Entailments: []
Hyponyms: [Synset('hound.v.01'), Synset('quest.v.02'), Synset('run_down.v.07'), Synset('tree.v.03')]
Examples: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit']


To get the full Python object definition of a synset to show all options, you can use the 'dir' command:

In [10]:
dir(chase_synset)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_doc',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'acyclic_tree',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains

Some of these options may be familiar to you if you have read the literature. We will look into the similarity functions more closely later on.

#### Get all dogs

We can get more insight into the coverage of specific areas or semantic fields by collecting all the hyponyms "below" a synset, e.g. all types of dogs, cats, horses, ships, cars, etc.

The next simple "for" loop iterates over all direct dog-hyponyms in the English wordnet and prints the definition followed by the English synonyms. By passing a different language code to `lemma_names`, we could also see which dogs have a synonym in which language and which have not.

In [4]:
dog = wn.synset ('dog.n.01')
dogs = dog.hyponyms()
print('Number of dogs:', len(dogs))
print()
for s in dogs:
    print(s)
    print(s.definition())
    print('English:', s.lemma_names('eng'))
    print()

Number of dogs: 18

Synset('basenji.n.01')
small smooth-haired breed of African origin having a tightly curled tail and the inability to bark
English: ['basenji']

Synset('corgi.n.01')
either of two Welsh breeds of long-bodied short-legged dogs with erect ears and a fox-like head
English: ['corgi', 'Welsh_corgi']

Synset('cur.n.01')
an inferior dog or one of mixed breed
English: ['cur', 'mongrel', 'mutt']

Synset('dalmatian.n.02')
a large breed having a smooth white coat with black or brown spots; originated in Dalmatia
English: ['dalmatian', 'coach_dog', 'carriage_dog']

Synset('great_pyrenees.n.01')
bred of large heavy-coated white dogs resembling the Newfoundland
English: ['Great_Pyrenees']

Synset('griffon.n.02')
breed of various very small compact wiry-coated dogs of Belgian origin having a short bearded muzzle
English: ['griffon', 'Brussels_griffon', 'Belgian_griffon']

Synset('hunting_dog.n.01')
a dog used in hunting game
English: ['hunting_dog']

Synset('lapdog.n.01')
a dog sma

The above code only gives the direct hyponyms as dogs but maybe there are more dogs as hyponyms of these hyponyms or even deeper down. To get these, we need a recursive function. A recursive function is a function that calls itself. These functions are extremely powerful and also elegant. But this comes with a risk: if you do not build in a way to stop the calling, it will run forever and may consume all your memory.

We define a function that takes a synset and first adds its hyponyms as children to the family and next, for each child, calls the function again to get the grandchildren. If there are any grandchildren, these are also added to the family; if not, the call is done and we continue with the next child.

The function terminates when all children have been processed. Because it is recursive, the same applies to the function applied to the children, and the children of the children, etc. Since the graph ends, the function also ends.

What would happen if the WordNet builders made a mistake and made a hyponym also the hypernym of a concept?

dog -is-a-> working_dog -is-a-> dog

This would create a cycle and our function would never terminate on its own. This is the danger of recursive functions that do not have a clear stopping condition. Python normally aborts such runaway recursion with a `RecursionError` once its recursion limit is reached; without such a guard, your memory fills up, at some point everything slows down and eventually the application crashes. Not a real disaster, since you will not destroy anything, but you may not have saved your work in other applications or in this notebook. Let's hope that the WordNet builders did not make a mistake.

In [5]:
def get_hyponym_family (parent):
    family=[]
    children = parent.hyponyms()
    if children:
        family = family + children
        for child in children:
            grand_children = get_hyponym_family(child)
            if grand_children:
                family = family + grand_children
    return family
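
The cycle scenario described above can be guarded against by remembering which nodes have already been visited. The following is a minimal sketch on a hypothetical toy graph stored as a plain dictionary; the function name and the `seen` parameter are illustrative and not part of NLTK.

```python
def get_family_safely(parent, hyponyms, seen=None):
    """Collect all descendants of `parent`, visiting each node at most once."""
    if seen is None:
        seen = {parent}
    family = []
    for child in hyponyms.get(parent, []):
        if child in seen:        # already visited: skipping it breaks any cycle
            continue
        seen.add(child)
        family.append(child)
        family.extend(get_family_safely(child, hyponyms, seen))
    return family

# Toy graph containing the faulty cycle dog -> working_dog -> dog.
toy_hyponyms = {
    "dog": ["working_dog", "lapdog"],
    "working_dog": ["dog", "sled_dog"],
}
print(get_family_safely("dog", toy_hyponyms))
```

As a side effect, a node that is reachable along two different routes is only listed once.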

In [6]:
dog_family = get_hyponym_family(dog)

In [7]:
print('Number of the dog family:', len(dog_family))

Number of the dog family: 189


Ahhh, we now have 189 dogs instead of 18!

In [8]:
print(dog_family[170:])

[Synset('collie.n.01'), Synset('german_shepherd.n.01'), Synset('kelpie.n.02'), Synset('komondor.n.01'), Synset('old_english_sheepdog.n.01'), Synset('rottweiler.n.01'), Synset('shetland_sheepdog.n.01'), Synset('groenendael.n.01'), Synset('malinois.n.01'), Synset('malamute.n.01'), Synset('siberian_husky.n.01'), Synset('attack_dog.n.01'), Synset('housedog.n.01'), Synset('kuvasz.n.01'), Synset('pinscher.n.01'), Synset('schipperke.n.01'), Synset('affenpinscher.n.01'), Synset('doberman.n.01'), Synset('miniature_pinscher.n.01')]


A function such as the one above can be very handy to use in many programs. Note that the above function gives you the total set of synsets. A synset is an NLTK WordNet object and has many properties, among which the actual synonyms.

Let's make another function that just gets the synonyms from all these synsets. This function iterates over all the synsets in the "family", asks each synset for its lemma names in the given language and collects them in a single list. We add a parameter to specify the language for the synonyms.

In [9]:
def get_lemmas_from_wordnet_family(wnfamily, language):
    lemmas = []
    for synset in wnfamily:
        slemmas = synset.lemma_names(language)
        for slemma in slemmas:
            lemmas.append(slemma)
    return lemmas

In [10]:
### Get the words for dogs in English at any level of specificity
dog_lemmas = get_lemmas_from_wordnet_family(dog_family, 'eng')
print('There are so many dogs in WordNet:', len(dog_lemmas))
print(dog_lemmas)

There are so many dogs in WordNet: 279
['basenji', 'corgi', 'Welsh_corgi', 'cur', 'mongrel', 'mutt', 'dalmatian', 'coach_dog', 'carriage_dog', 'Great_Pyrenees', 'griffon', 'Brussels_griffon', 'Belgian_griffon', 'hunting_dog', 'lapdog', 'Leonberg', 'Mexican_hairless', 'Newfoundland', 'Newfoundland_dog', 'pooch', 'doggie', 'doggy', 'barker', 'bow-wow', 'poodle', 'poodle_dog', 'pug', 'pug-dog', 'puppy', 'spitz', 'toy_dog', 'toy', 'working_dog', 'Cardigan', 'Cardigan_Welsh_corgi', 'Pembroke', 'Pembroke_Welsh_corgi', 'feist', 'fice', 'pariah_dog', 'pye-dog', 'pie-dog', 'liver-spotted_dalmatian', 'Brabancon_griffon', 'courser', 'dachshund', 'dachsie', 'badger_dog', 'hound', 'hound_dog', 'Rhodesian_ridgeback', 'sporting_dog', 'gun_dog', 'terrier', 'sausage_dog', 'sausage_hound', 'Afghan_hound', 'Afghan', 'basset', 'basset_hound', 'beagle', 'bloodhound', 'sleuthhound', 'bluetick', 'boarhound', 'coonhound', 'foxhound', 'greyhound', 'harrier', 'Ibizan_hound', 'Ibizan_Podenco', 'Norwegian_elkhoun
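
Lemma names such as the ones printed above use underscores for multiword expressions. A small post-processing sketch, assuming a plain list of lemma-name strings: it replaces underscores by spaces and drops repeated names while keeping the order. The function name and the sample list (including its repeated entry) are made up for illustration.

```python
def tidy_lemmas(lemma_names):
    """Make lemma names readable and remove duplicates, preserving order."""
    seen = set()
    tidy = []
    for name in lemma_names:
        readable = name.replace("_", " ")
        if readable not in seen:
            seen.add(readable)
            tidy.append(readable)
    return tidy

# Hypothetical sample; the second 'corgi' is added to show deduplication.
sample = ["corgi", "Welsh_corgi", "hunting_dog", "corgi"]
print(tidy_lemmas(sample))
```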

## Wordnet Similarity

The structure of wordnet as a graph can be used to measure the similarity across concepts. The basic idea is that concepts can be connected by going up and down through the relations. By counting the steps, we can measure the distance between for example **car**, **train**, **man**, **woman**. In the image below, we can infer that "train#1" and "bus#1" are similar because they share the same hypernym "public_transport", but to get to e.g. "knowledge" requires many steps through the graph which makes "knowledge" very dissimilar.

![distance_in_wordnet](./images/wordnet_sim.png)

A whole series of similarity functions have been added to NLTK that measure the distances in different ways but use the same basic strategy that exploits the relations between synsets. See the documentation for the other methods. 

We show here how it works for the most basic method **path** that counts the steps. We first obtain a few synsets:

In [12]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')

We can get the full hypernym path for these synsets:

In [13]:
print('dogs are a type of:')
for path in dog.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

print('cats are a type of:')
for path in cat.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

dogs are a type of:
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('chordate.n.01')
                         Synset('vertebrate.n.01')
                            Synset('mammal.n.01')
                               Synset('placental.n.01')
                                  Synset('carnivore.n.01')
                                     Synset('canine.n.02')
                                        Synset('dog.n.01')
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('domestic_animal.n.01')
                         Synset('dog.n.01')
cats are a type of:
 Sy

First of all, we see that **cat** has only one path and we have seen that **dog** has two. The **cat** path is most similar to the **canine** path of **dog**.

We use the `path_similarity` function of a synset, which requires another synset as input. So let's get the score for dog and cat, where the function will use the shortest path among the alternatives.

In [14]:
print(dog.path_similarity(cat))

0.2


Is this very similar? The only way to find out is to compare this with something else such as a **car**:

In [16]:
car = wn.synset('car.n.01')
print(dog.path_similarity(car))

0.07692307692307693


Right, this score is a lot lower.
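
To see where such scores come from: the path score is 1 divided by (1 + the number of steps on the shortest path between the two synsets). The sketch below re-implements this idea on a simplified toy fragment of the noun hierarchy; the dictionary and function names are hypothetical and most intermediate nodes are left out. Under this formula, the 0.0769 for **car** corresponds to 1/13, i.e. a shortest path of twelve steps.

```python
def ancestor_distances(node, hypernyms):
    """Map the node and each of its ancestors to the minimal number of steps up."""
    distances = {node: 0}
    frontier = [node]
    while frontier:                      # breadth-first, one level per round
        next_frontier = []
        for current in frontier:
            for parent in hypernyms.get(current, []):
                if parent not in distances:
                    distances[parent] = distances[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return distances

def toy_path_similarity(a, b, hypernyms):
    """1 / (1 + shortest path through a shared ancestor), or None if unconnected."""
    dist_a = ancestor_distances(a, hypernyms)
    dist_b = ancestor_distances(b, hypernyms)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    shortest = min(dist_a[n] + dist_b[n] for n in common)
    return 1 / (1 + shortest)

# Simplified toy fragment: node -> direct hypernyms.
toy_nouns = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
}
# dog -> canine -> carnivore <- feline <- cat: four steps.
print(toy_path_similarity("dog", "cat", toy_nouns))
```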

Let's see if this also works for verbs.

In [17]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
print(hit.path_similarity(slap))
print(wn.path_similarity(hit, slap))

0.14285714285714285
0.14285714285714285


It seems to work well, but...

If you did some reading on WordNet, you know that the noun hierarchy has a single top-node synset, `entity.n.01`. All nominal synsets descend from this synset. This is not the case for verbs or adjectives. The verb part of WordNet therefore consists of 559 islands of disconnected synsets with 559 root nodes. The English WordNet editors decided not to connect these islands in an artificial way as was done for nouns. We can see this when we get the hypernym path for each of the above synsets:

In [18]:
print('hit is a type of:')
for path in hit.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "
print('slap is a type of:')
for path in slap.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

hit is a type of:
 Synset('move.v.02')
    Synset('propel.v.01')
       Synset('hit.v.01')
slap is a type of:
 Synset('touch.v.01')
    Synset('strike.v.01')
       Synset('slap.v.01')


So the top synsets **move** and **touch** are not connected in any way and there is no overlap between the paths for the two verb synsets.

We can also see this using the NLTK root_hypernyms() function for the above noun and verb synsets.

In [19]:
print('Root for dog:', dog.root_hypernyms())
print('Root for cat:', cat.root_hypernyms())
print('Root for slap:', slap.root_hypernyms())
print('Root for hit:', hit.root_hypernyms())

Root for dog: [Synset('entity.n.01')]
Root for cat: [Synset('entity.n.01')]
Root for slap: [Synset('touch.v.01')]
Root for hit: [Synset('move.v.02')]


How is it still possible to get a value for similarity if the subgraphs are not connected? Well, the package imposes a simulated root node by grouping all the subgraph top-nodes under a single node. This is the default setting. If you do not want to use this, you can turn it off by setting the parameter *simulate_root* to False.

In [20]:
print(hit.path_similarity(slap, simulate_root=False))
print(wn.path_similarity(hit, slap, simulate_root=False))

None
None


Without the simulated root there is no path from 'hit' to 'slap'.
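
The effect of the simulated root can be reproduced on the two verb paths shown above. This is a toy re-implementation under the assumption that the score is 1 / (1 + shortest path); the `FAKE_ROOT` label and the function names are hypothetical and not NLTK internals.

```python
FAKE_ROOT = "*ROOT*"   # hypothetical label for the artificial top node

def distances_up(node, hypernyms, simulate_root):
    """Minimal number of steps from node to itself and each ancestor."""
    distances = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            parents = hypernyms.get(current, [])
            # A real top node gets the artificial root as its only parent.
            if not parents and simulate_root and current != FAKE_ROOT:
                parents = [FAKE_ROOT]
            for parent in parents:
                if parent not in distances:
                    distances[parent] = distances[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return distances

def toy_similarity(a, b, hypernyms, simulate_root=True):
    dist_a = distances_up(a, hypernyms, simulate_root)
    dist_b = distances_up(b, hypernyms, simulate_root)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None                      # the two islands are not connected
    return 1 / (1 + min(dist_a[n] + dist_b[n] for n in common))

# The two verb paths printed above: node -> direct hypernyms.
toy_verbs = {
    "hit.v.01": ["propel.v.01"],
    "propel.v.01": ["move.v.02"],
    "slap.v.01": ["strike.v.01"],
    "strike.v.01": ["touch.v.01"],
}
print(toy_similarity("hit.v.01", "slap.v.01", toy_verbs))                       # 3 + 3 steps via the fake root
print(toy_similarity("hit.v.01", "slap.v.01", toy_verbs, simulate_root=False))
```

With the artificial root, each verb is three steps away from it, so the path has six steps and the score is 1/7, which agrees with the 0.142857... reported above.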

## Using WordNet similarity for words instead of synsets

Can we also determine the similarity of words?

Yes, we can, but we first need to obtain all the synsets for a word and then compare each synset with the synsets of another word to get the most similar meanings of these words.
We therefore need a **for-loop** inside a **for-loop**. The outer loop iterates over the synsets of the first word and, for each of these, the inner loop iterates over the synsets of the second word to compare them.

In [21]:
w1='dog'
w2='cat'
for s1 in wn.synsets(w1, 'n'):
    print(s1,':')
    for s2 in wn.synsets(w2, 'n'):
        print('\t', s2,':', s1.path_similarity(s2))

Synset('dog.n.01') :
	 Synset('cat.n.01') : 0.2
	 Synset('guy.n.01') : 0.125
	 Synset('cat.n.03') : 0.125
	 Synset('kat.n.01') : 0.07692307692307693
	 Synset('cat-o'-nine-tails.n.01') : 0.08333333333333333
	 Synset('caterpillar.n.02') : 0.07692307692307693
	 Synset('big_cat.n.01') : 0.2
	 Synset('computerized_tomography.n.01') : 0.05263157894736842
Synset('frump.n.01') :
	 Synset('cat.n.01') : 0.07142857142857142
	 Synset('guy.n.01') : 0.125
	 Synset('cat.n.03') : 0.125
	 Synset('kat.n.01') : 0.1
	 Synset('cat-o'-nine-tails.n.01') : 0.07142857142857142
	 Synset('caterpillar.n.02') : 0.06666666666666667
	 Synset('big_cat.n.01') : 0.07142857142857142
	 Synset('computerized_tomography.n.01') : 0.05555555555555555
Synset('dog.n.03') :
	 Synset('cat.n.01') : 0.07692307692307693
	 Synset('guy.n.01') : 0.2
	 Synset('cat.n.03') : 0.14285714285714285
	 Synset('kat.n.01') : 0.1111111111111111
	 Synset('cat-o'-nine-tails.n.01') : 0.07692307692307693
	 Synset('caterpillar.n.02') : 0.07142857142857

We can use the highest similarity from all pairs to find the strongest association.

In [22]:
w1='dog'
w2='cat'
for s1 in wn.synsets(w1, 'n'):
    top_sim_score = 0    
    top_sim_synset_w1 = ""
    top_sim_synset_w2 = ""
    for s2 in wn.synsets(w2, 'n'):
        sim = s1.path_similarity(s2)
        if sim>top_sim_score:
            top_sim_score = sim
            top_sim_synset_w2 = s2
    print('Most similar are', s1, top_sim_synset_w2,':', top_sim_score)

Most similar are Synset('dog.n.01') Synset('cat.n.01') : 0.2
Most similar are Synset('frump.n.01') Synset('guy.n.01') : 0.125
Most similar are Synset('dog.n.03') Synset('guy.n.01') : 0.2
Most similar are Synset('cad.n.01') Synset('guy.n.01') : 0.14285714285714285
Most similar are Synset('frank.n.02') Synset('kat.n.01') : 0.09090909090909091
Most similar are Synset('pawl.n.01') Synset('cat-o'-nine-tails.n.01') : 0.14285714285714285
Most similar are Synset('andiron.n.01') Synset('cat-o'-nine-tails.n.01') : 0.16666666666666666


If you need to use this code more often, it is convenient to define a function for it and always call this function instead of re-typing this code. Here is the function to measure shortest distance through the path function for any pair of words. This function assumes you imported NLTK wordnet as wn.

In [24]:
import nltk
from nltk.corpus import wordnet as wn

def word_similarity_wordnet_path(w1, w2):
    top_sim_score = 0    
    top_sim_synset_w1 = ""
    top_sim_synset_w2 = ""
    for s1 in wn.synsets(w1, 'n'):
        for s2 in wn.synsets(w2, 'n'):
            sim = s1.path_similarity(s2)
            if sim>top_sim_score:
                top_sim_score = sim
                top_sim_synset_w1 = s1
                top_sim_synset_w2 = s2
    return top_sim_synset_w1, top_sim_synset_w2, top_sim_score

In [25]:
s1, s2, sim = word_similarity_wordnet_path("mouse", "keyboard")
print(s1, s1.definition())
print(s2, s2.definition())
print("Similarity", sim)

Synset('mouse.n.04') a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad
Synset('keyboard.n.01') device consisting of a set of keys on a piano or organ or typewriter or typesetting machine or computer or the like
Similarity 0.25
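
The same best-pair search can be written independently of the similarity measure. The sketch below is a hypothetical helper (not part of NLTK) that takes two lists and a scoring function, skips pairs without a score, and returns `None` when no pair can be scored, for instance when a word has no synsets. The toy scores are copied from the dog/cat table above.

```python
from itertools import product

def best_pair(items1, items2, similarity):
    """Return (item1, item2, score) for the highest-scoring pair, or None."""
    best = None
    for a, b in product(items1, items2):
        score = similarity(a, b)
        if score is None:                # no path between these two items
            continue
        if best is None or score > best[2]:
            best = (a, b, score)
    return best

# Toy scores taken from the output shown earlier.
toy_scores = {
    ("dog.n.01", "cat.n.01"): 0.2,
    ("dog.n.01", "guy.n.01"): 0.125,
    ("frump.n.01", "cat.n.01"): 0.07142857142857142,
    ("frump.n.01", "guy.n.01"): 0.125,
}

def lookup(a, b):
    return toy_scores.get((a, b))

print(best_pair(["dog.n.01", "frump.n.01"], ["cat.n.01", "guy.n.01"], lookup))
print(best_pair([], ["cat.n.01"], lookup))   # nothing to compare
```

With NLTK, the two lists could be `wn.synsets(w1, 'n')` and `wn.synsets(w2, 'n')`, and the scoring function `lambda a, b: a.path_similarity(b)`.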


WordNet uses explicit representations of the meanings of words. It is debatable what these meanings are, even when they were created by trained experts. In the next notebook, we look at word embeddings, which do not distinguish between meanings at all. They place words directly in a semantic space instead of first mapping words to concepts through their meanings.

## Wordnets in other languages

There are wordnets in many different languages and many are linked to English. The ones that are freely available in the Open Multilingual Wordnet platform are also available in NLTK. You can use `wn.langs()` to get the full list.

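You might get an error about NLTK not finding an 'omw' dataset (depending on your NLTK version, the error may name 'omw-1.4' instead, in which case that is the package to download). You can download it just like you did in lab1.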

In [12]:
import nltk
nltk.download('omw')

[nltk_data] Downloading package omw to /Users/piek/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

Let's check out which languages have wordnets in the OMW package:

In [13]:
print(sorted(wn.langs()))

['als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eng', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm']


The listed language wordnets are created by translating the English synsets, following the so-called `Expand Method` (Vossen (ed.) 1998). This means that the concepts of the English wordnet are re-used and only the synonyms in the synsets are translated. Another approach is the merge method, in which a wordnet is built independently from English and mapped to English afterwards. Only a few wordnets are built following a merge approach and, often, they are not freely available as they started from existing dictionaries. The main reason for this is that building a wordnet from scratch is very expensive and labour-intensive.

    Vossen, Piek. "Introduction to EuroWordNet." In EuroWordNet: A multilingual database with lexical semantic networks, pp. 1-17. Springer, Dordrecht, 1998.

In NLTK, you only find wordnets built following the `Expand Method`. 

Starting from the English wordnet, you can ask for the lemmas in any language that are linked to any English-based synset.

Are there any Japanese lemmas linked to English dog sense 1? 

For this we need to use the function **lemma_names** on the synset and pass in the 3-letter language tag as a parameter:

In [14]:
# Are there any Japanese lemmas linked to English dog sense 1
wn.synset('dog.n.01').lemma_names('jpn')

['イヌ', 'ドッグ', '洋犬', '犬', '飼犬', '飼い犬']

In [15]:
# The same for Dutch
wn.synset('dog.n.01').lemma_names('nld')

['hond', 'joekel']

So that is great but can we also get the synset directly through a Dutch or Japanese synonym?

In [16]:
wn.synsets('dog.n.01.hond')

[]

Unfortunately not. `wn.synsets()` expects a plain word form rather than a lemma identifier, and a call like the ones we used before for 'dog' searches the English lemmas only. The next call therefore does not work either:

In [17]:
all_dog_synsets = wn.synsets('hond')
print('Number of synsets with "hond" as a synonym:', len(all_dog_synsets))
print(all_dog_synsets)

Number of synsets with "hond" as a synonym: 0
[]


To obtain the synsets for a non-English word, we can first use the `wn.lemmas()` function with the `lang` parameter to get the list of lemma objects for a specific language. The next cell shows this for the Dutch lemma *hond*.

In [18]:
dutch_dog_lemmas = wn.lemmas('hond', lang='nld')
print(dutch_dog_lemmas)

[Lemma('dog.n.01.hond'), Lemma('asshole.n.01.hond')]


In [19]:
type(dutch_dog_lemmas[0])

nltk.corpus.reader.wordnet.Lemma

*Lemma* is yet another class defined in the wordnet module with attributes and functions, some of which overlap with those of a synset. Let's check them out through the 'dir' function.

In [20]:
dutch_lemma = dutch_dog_lemmas[0]
dir(dutch_lemma)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_frame_ids',
 '_frame_strings',
 '_hypernyms',
 '_instance_hypernyms',
 '_key',
 '_lang',
 '_lex_id',
 '_lexname_index',
 '_name',
 '_related',
 '_synset',
 '_syntactic_marker',
 '_wordnet_corpus_reader',
 'also_sees',
 'antonyms',
 'attributes',
 'causes',
 'count',
 'derivationally_related_forms',
 'entailments',
 'frame_ids',
 'frame_strings',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains',
 'in_usage_domains',
 'instance_hypernyms',
 'instance_hyponyms',
 'key',
 'lang',
 'member_holonyms',
 'member_meronyms',
 'name',
 'part_holonyms',
 'part_meronyms',
 'pertainym

Some, such as lang(), differ from the `Synset` functions. The function *.synset()* returns the synset that a lemma belongs to. The synset information is the same as in the English wordnet, because the Open Dutch Wordnet (http://wordpress.let.vupr.nl/odwn/) was created by expanding the English wordnet:

    Postma, Marten, Emiel Van Miltenburg, Roxane Segers, Anneleen Schoen, and Piek Vossen. "Open Dutch WordNet." In Proceedings of the 8th Global WordNet Conference (GWC), pp. 302-310. 2016.
    

In [21]:
print(dutch_lemma, dutch_lemma.lang())

dutch_dog_synset = dutch_lemma.synset()
print('Synsets:', dutch_dog_synset)
print('Synonym:', dutch_lemma.name())  # use the public accessor instead of the private _name attribute
print('Hypernyms:', dutch_dog_synset.hypernyms())
print('Definition:', dutch_dog_synset.definition())

Lemma('dog.n.01.hond') nld
Synsets: Synset('dog.n.01')
Synonym: hond
Hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
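
The two steps shown above (look up the lemma objects with `wn.lemmas()`, then call `.synset()` on each of them) can be wrapped into one helper. This is a minimal sketch under stated assumptions: `synsets_for_word` is a hypothetical name, and `reader` stands for any object with a `lemmas(word, lang=...)` method, such as the `wn` module used in this notebook.

```python
def synsets_for_word(reader, word, lang):
    """Return the synsets that `word` belongs to in language `lang`."""
    synsets = []
    for lemma in reader.lemmas(word, lang=lang):
        synset = lemma.synset()
        # Keep the original order but skip synsets we have already seen
        if synset not in synsets:
            synsets.append(synset)
    return synsets

# Possible usage, assuming `wn` was imported as earlier in this notebook:
# synsets_for_word(wn, 'hond', 'nld')
```

For *hond*, this would be expected to return the synsets behind the two lemmas printed earlier, `dog.n.01` and `asshole.n.01`.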


So we have many wordnets in different languages. Can we get statistics on their coverage?

In [22]:
print('English:', len(list(wn.all_lemma_names(pos='n', lang='eng'))))

print('Dutch:', len(list(wn.all_lemma_names(pos='n', lang='nld'))))
print('Italian:', len(list(wn.all_lemma_names(pos='n', lang='ita'))))
print('Japanese:', len(list(wn.all_lemma_names(pos='n', lang='jpn'))))
print('Slovene:', len(list(wn.all_lemma_names(pos='n', lang='slv'))))
print('Spanish:', len(list(wn.all_lemma_names(pos='n', lang='spa'))))

English: 117798
Dutch: 36896
Italian: 31477
Japanese: 64797
Slovene: 31631
Spanish: 28647


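The English wordnet contains far more noun lemmas (117,798) than the others: Japanese, the largest of the other five, has 64,797, and the remaining four have between roughly 28,600 and 36,900. So there is still work to be done.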
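
The counts above can also be collected in a loop and expressed relative to English, which makes the coverage gap easier to read. This is a minimal sketch: `noun_lemma_counts` and `coverage_relative_to` are hypothetical helper names, and `reader` is assumed to be an object with an `all_lemma_names(pos=..., lang=...)` method like `wn`.

```python
def noun_lemma_counts(reader, langs):
    """Count the noun lemma names for every language code in `langs`."""
    counts = {}
    for lang in langs:
        # Count by iterating, so this works for lists and iterators alike
        counts[lang] = sum(1 for _ in reader.all_lemma_names(pos='n', lang=lang))
    return counts

def coverage_relative_to(counts, reference):
    """Express every count as a fraction of the reference language's count."""
    base = counts[reference]
    return {lang: count / base for lang, count in counts.items()}

# Possible usage, assuming `wn` was imported as earlier in this notebook:
# counts = noun_lemma_counts(wn, ['eng', 'nld', 'ita', 'jpn', 'slv', 'spa'])
# print(coverage_relative_to(counts, 'eng'))
```

With the numbers printed above, Dutch would come out at about 0.31 of the English noun lemma count (36896 / 117798).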

In [23]:
dog = wn.synset('dog.n.01')
dogs = dog.hyponyms()
print('Number of dogs:', len(dogs))
print()
for s in dogs:
    print(s)
    print(s.definition())
    print('English:', s.lemma_names('eng'))
    print('Dutch:', s.lemma_names('nld'))
    print('Japanese:', s.lemma_names('jpn'))
    print('Italian:', s.lemma_names('ita'))
    print('Spanish:', s.lemma_names('spa'))
    print()

Number of dogs: 18

Synset('basenji.n.01')
small smooth-haired breed of African origin having a tightly curled tail and the inability to bark
English: ['basenji']
Dutch: []
Japanese: []
Italian: ['basenji', 'cane_del_Congo']
Spanish: ['basenji']

Synset('corgi.n.01')
either of two Welsh breeds of long-bodied short-legged dogs with erect ears and a fox-like head
English: ['corgi', 'Welsh_corgi']
Dutch: []
Japanese: ['ウェルシュ・コーギー']
Italian: []
Spanish: []

Synset('cur.n.01')
an inferior dog or one of mixed breed
English: ['cur', 'mongrel', 'mutt']
Dutch: ['mormel', 'idioot', 'halve_gare', 'bastaard', 'bastaardhond', 'straathond']
Japanese: ['雑犬', '雑種犬', '駄犬']
Italian: ['bastardo']
Spanish: ['chucho', 'gozque', 'mestizo']

Synset('dalmatian.n.02')
a large breed having a smooth white coat with black or brown spots; originated in Dalmatia
English: ['dalmatian', 'coach_dog', 'carriage_dog']
Dutch: ['dalmatiër', 'Dalmatische']
Japanese: []
Italian: ['dalmata']
Spanish: []

Synset('great_pyrene

We can now call the `get_lemmas_from_wordnet_family` function for another language:

In [11]:

dutch_dog_lemmas = get_lemmas_from_wordnet_family(dog_family, 'nld')
print('There are so many Dutch dogs in WordNet:', len(dutch_dog_lemmas))
print(dutch_dog_lemmas)

There are so many Dutch dogs in WordNet: 69
['mormel', 'idioot', 'halve_gare', 'bastaard', 'bastaardhond', 'straathond', 'dalmatiër', 'Dalmatische', 'bastaard', 'vuilnisbakkie', 'poedel', 'mops', 'mopshond', 'hondejong', 'hondenjong', 'pup', 'puppy', 'werkhond', 'Feist', 'dashond', 'taks', 'teckel', 'jachthond', 'vogelhond', 'terriër', 'Afghaanse_windhond', 'brak', 'snuffelaar', 'beagle', 'brak', 'vossenjacht', 'hazewind', 'hazewindhond', 'windhond', 'hond_voor_jacht_op_otters', 'trillen_op_zijn_benen', 'spaniël', 'waterhond', 'golden_retriever', 'Ierse_setter', 'bulterriër', 'Schotse_terriër', 'pitbullterriër', 'eten', 'keeshond', 'Pommers', 'Samojeed', 'chihuahua', 'Maltese', 'boxer', 'pugilist', 'vuistvechter', 'bokser', 'buldog', 'sledehond', 'bulhond', 'mastiff', 'politiehond', 'herdershond', 'sledehond', 'waakhond', 'Tibetaanse_mastiff', 'belgische_herder', 'Duitse_herdershond', 'komondor', 'Siberische_husky', 'schippersklokje', 'schippertje', 'dwergpinscher']


In [33]:
for s in dog_family: 
    print(s)
    print(s.definition())
    print('English:', s.lemma_names('eng'))
    print('Dutch:', s.lemma_names('nld'))
    print('Japanese:', s.lemma_names('jpn'))
    print('Italian:', s.lemma_names('ita'))
    print('Spanish:', s.lemma_names('spa'))
    print()

Synset('basenji.n.01')
small smooth-haired breed of African origin having a tightly curled tail and the inability to bark
English: ['basenji']
Dutch: []
Japanese: []
Italian: ['basenji', 'cane_del_Congo']
Spanish: ['basenji']

Synset('corgi.n.01')
either of two Welsh breeds of long-bodied short-legged dogs with erect ears and a fox-like head
English: ['corgi', 'Welsh_corgi']
Dutch: []
Japanese: ['ウェルシュ・コーギー']
Italian: []
Spanish: []

Synset('cur.n.01')
an inferior dog or one of mixed breed
English: ['cur', 'mongrel', 'mutt']
Dutch: ['mormel', 'idioot', 'halve_gare', 'bastaard', 'bastaardhond', 'straathond']
Japanese: ['雑犬', '雑種犬', '駄犬']
Italian: ['bastardo']
Spanish: ['chucho', 'gozque', 'mestizo']

Synset('dalmatian.n.02')
a large breed having a smooth white coat with black or brown spots; originated in Dalmatia
English: ['dalmatian', 'coach_dog', 'carriage_dog']
Dutch: ['dalmatiër', 'Dalmatische']
Japanese: []
Italian: ['dalmata']
Spanish: []

Synset('great_pyrenees.n.01')
bred of la

Because of the way that we defined the function, the dogs deeper in the hierarchy, that is, closer to the edge of the graph, are added at the end of the list. If you look closely at the results, you can also see that the coverage decreases towards the end of the `dog_family`. We expect that more specific and less frequent dog names closer to the edge are less likely to be translated using the expand method.
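
To illustrate why deeper synsets end up at the end of the list, here is a minimal sketch of a breadth-first traversal over hyponyms. It is an assumption about how such a family could be collected, not the notebook's own definition: `hyponym_family` and `lemma_names_for_family` are hypothetical names, and the code only relies on the `hyponyms()` and `lemma_names(lang)` methods used earlier.

```python
from collections import deque

def hyponym_family(root):
    """Collect all hyponyms of `root` breadth-first, so deeper synsets come later."""
    family = []
    seen = set()
    queue = deque(root.hyponyms())
    while queue:
        synset = queue.popleft()
        if synset in seen:
            # A synset can be reachable through more than one parent
            continue
        seen.add(synset)
        family.append(synset)
        # Children are queued behind everything on the current level
        queue.extend(synset.hyponyms())
    return family

def lemma_names_for_family(family, lang):
    """Concatenate the lemma names in `lang` of every synset in `family`."""
    names = []
    for synset in family:
        names.extend(synset.lemma_names(lang))
    return names
```

Because a queue is used, all direct hyponyms are visited before any of their own hyponyms, which produces the ordering described above.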

The following code counts, for each language, how many synsets in `dog_family` have no lemma at all.

In [34]:
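# One counter per language for synsets without any lemma
dog_gaps_nl = 0
dog_gaps_jp = 0
dog_gaps_it = 0
dog_gaps_es = 0

for s in dog_family:
    # An empty lemma list means the synset has no translation in that language
    if not s.lemma_names('nld'):
        dog_gaps_nl += 1
    if not s.lemma_names('jpn'):
        dog_gaps_jp += 1
    if not s.lemma_names('ita'):
        dog_gaps_it += 1
    if not s.lemma_names('spa'):
        dog_gaps_es += 1
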
print('Missing dogs in Dutch:', dog_gaps_nl)
print('Missing dogs in Japanese:', dog_gaps_jp)
print('Missing dogs in Italian:', dog_gaps_it)
print('Missing dogs in Spanish:', dog_gaps_es)

Missing dogs in Dutch: 142
Missing dogs in Japanese: 100
Missing dogs in Italian: 127
Missing dogs in Spanish: 154
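
The same count can be written without a separate variable per language. This is a minimal sketch: `count_gaps` is a hypothetical helper name, and it assumes only that every item in `synsets` offers a `lemma_names(lang)` method.

```python
def count_gaps(synsets, langs):
    """Count, per language code, the synsets that have no lemma in that language."""
    gaps = {lang: 0 for lang in langs}
    for synset in synsets:
        for lang in langs:
            # An empty list means there is no translation for this synset
            if not synset.lemma_names(lang):
                gaps[lang] += 1
    return gaps

# Possible usage with the dog_family list from this notebook:
# count_gaps(dog_family, ['nld', 'jpn', 'ita', 'spa'])
```

Adding another language then only requires one more code in the list instead of a new counter and a new `if` statement.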


Something to think about: could there be any dogs that are not in the English WordNet but are common in other languages? Where can you find the answer to such a question?

# End of this notebook