# Exercise 1

In [1]:
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
import itertools

#### Given the following (lemma, category) pairs:

In [2]:
pairs = [('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'),
         ('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
         ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')]

#### For each pair, when possible, print their most frequent WordNet synset, their corresponding least common subsumer (LCS) and their similarity value, using the following functions:
- Path Similarity
- Leacock-Chodorow Similarity
- Wu-Palmer Similarity
- Lin Similarity

#### Normalize similarity values when necessary. Which similarity seems better?

In [3]:
def get_valid_pairs(original_pairs):
    """
    Produces a list of the (word, tag) pairs that are valid
    for the WordNet synset analysis (only nouns, verbs, adjectives and adverbs),
    and converts each POS tag to the corresponding WordNet POS tag
    """
    valid_pairs = []
    for word, tag in original_pairs:
        #if word is a noun
        if tag.startswith('N'):
            valid_pairs.append((word, wn.NOUN))
        #if word is a verb
        elif tag.startswith('V'):
            valid_pairs.append((word, wn.VERB))
        #if word is an adjective
        elif tag.startswith('J'):
            valid_pairs.append((word, wn.ADJ))
        #if word is an adverb
        elif tag.startswith('R'):
            valid_pairs.append((word, wn.ADV))
    return valid_pairs

**We show the most frequent WordNet synset for each valid word (by valid we mean that the word is a noun, a verb, an adjective or an adverb)**

In [4]:
valid_pairs = get_valid_pairs(pairs)
synsets = []
for word, tag in valid_pairs:
    synset = wn.synsets(word, tag)[0]
    synsets.append(synset)
    print(word, ': ', synset)

man :  Synset('man.n.01')
swim :  Synset('swim.v.01')
girl :  Synset('girl.n.01')
boy :  Synset('male_child.n.01')
woman :  Synset('woman.n.01')
walk :  Synset('walk.v.01')


**For each pair of words, when possible, we print the corresponding least common subsumer (LCS) and the similarity value, using the following functions:**
- Path Similarity
- Leacock-Chodorow Similarity
- Wu-Palmer Similarity
- Lin Similarity

**We normalize the Leacock-Chodorow similarity by dividing the similarity between two synsets by the similarity of a synset with itself, which is the highest value the function can return.**
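To make the normalization concrete, here is a minimal sketch of the arithmetic. It assumes the usual definition of Leacock-Chodorow similarity, -log(p / 2D), where p is the number of nodes on the shortest path between the two synsets and D is the maximum depth of the taxonomy. The helper names `lch` and `normalized_lch` are hypothetical, and D = 19 is an assumption chosen because it reproduces the noun values printed in this notebook. The sketch does not call NLTK.

```python
import math

def lch(path_nodes, max_depth):
    """Leacock-Chodorow similarity: -log(p / (2 * D))."""
    return -math.log(path_nodes / (2.0 * max_depth))

def normalized_lch(path_nodes, max_depth):
    """Divide by the self-similarity (p = 1), which is the maximum value, log(2 * D)."""
    return lch(path_nodes, max_depth) / lch(1, max_depth)

# Assumption: D = 19 for the noun taxonomy; this is not read from WordNet here.
NOUN_DEPTH = 19

# Two synsets that are two edges apart have three nodes on the path between them.
two_edges_apart = normalized_lch(3, NOUN_DEPTH)
# A synset compared with itself is a single node, so the normalized value is 1.
identical = normalized_lch(1, NOUN_DEPTH)

print(two_edges_apart, identical)
```

Under these assumptions the normalized score always falls in the interval (0, 1], which makes it directly comparable with the other three measures.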

In [5]:
combinations = itertools.combinations(synsets, 2)
# Load the information content file once, outside the loop
brown_ic = wordnet_ic.ic('ic-brown.dat')
for s1, s2 in combinations:
    print('Pair: ', s1, s2)
    print('Least common subsumer: ', s1.lowest_common_hypernyms(s2))
    print('Path Similarity: ', wn.path_similarity(s1, s2))
    # Leacock-Chodorow Similarity can only be computed between synsets of the same POS
    if s1.pos() == s2.pos():
        print('Leacock-Chodorow Similarity (normalized): ', wn.lch_similarity(s1, s2)/wn.lch_similarity(s1, s1))
    else:
        print('Leacock-Chodorow Similarity: -')
    print('Wu-Palmer Similarity: ', s1.wup_similarity(s2))
    # Lin Similarity can only be computed between synsets of the same POS
    if s1.pos() == s2.pos():
        print('Lin Similarity: ', s1.lin_similarity(s2, brown_ic))
    else:
        print('Lin Similarity: -')
    print()

Pair:  Synset('man.n.01') Synset('swim.v.01')
Least common subsumer:  []
Path Similarity:  None
Leacock-Chodorow Similarity: -
Wu-Palmer Similarity:  None
Lin Similarity: -

Pair:  Synset('man.n.01') Synset('girl.n.01')
Least common subsumer:  [Synset('adult.n.01')]
Path Similarity:  0.25
Leacock-Chodorow Similarity (normalized):  0.6188971751464533
Wu-Palmer Similarity:  0.631578947368421
Lin Similarity:  0.7135111237276783

Pair:  Synset('man.n.01') Synset('male_child.n.01')
Least common subsumer:  [Synset('male.n.02')]
Path Similarity:  0.3333333333333333
Leacock-Chodorow Similarity (normalized):  0.6979831568441128
Wu-Palmer Similarity:  0.6666666666666666
Lin Similarity:  0.7294717876200584

Pair:  Synset('man.n.01') Synset('woman.n.01')
Least common subsumer:  [Synset('adult.n.01')]
Path Similarity:  0.3333333333333333
Leacock-Chodorow Similarity (normalized):  0.6979831568441128
Wu-Palmer Similarity:  0.6666666666666666
Lin Similarity:  0.7870841372982784

Pair:  Synset('man.n.0

### Which similarity seems better?
Based on this exercise, Wu-Palmer similarity seems the better function, because it does not require the two synsets to share a POS tag in order to be called.
Of the four similarity functions we tried, two (Leacock-Chodorow and Lin) work only for synsets of the same POS, which is not ideal if we want to compare words of different POS (for example verbs and nouns, or nouns and adjectives).
The other two (path similarity and Wu-Palmer similarity) can be called on synsets of different POS, although for the cross-POS pair "man" and "swim" both return None, since the list of common subsumers is empty. Comparing their results on same-POS pairs, Wu-Palmer seems better. For example, path similarity gives "man" and "male_child" only 0.33, whereas Wu-Palmer gives 0.67, which seems fairer. We observe similar results for "man" and "woman" and for "girl" and "male_child".
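
One possible reason for the gap between the two measures can be sketched with their textbook formulas. Path similarity is 1 / (d + 1), where d is the number of edges on the shortest path, so any two synsets that are two edges apart score 0.33 regardless of where they sit in the taxonomy. Wu-Palmer similarity is 2 * depth(LCS) / (depth(a) + depth(b)), so it also rewards a deep, specific common subsumer. The function names and the depths below are hypothetical illustrations, not values read from WordNet, and NLTK's implementation may differ in how it counts depth.

```python
def path_similarity(distance):
    """Textbook path similarity: 1 / (shortest path length in edges + 1)."""
    return 1.0 / (distance + 1)

def wu_palmer(depth_lcs, depth_a, depth_b):
    """Textbook Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

# Two siblings are always two edges apart (child -> parent -> child).
sibling_distance = 2

# Hypothetical case 1: siblings whose shared parent is shallow (depth 2).
shallow = wu_palmer(depth_lcs=2, depth_a=3, depth_b=3)
# Hypothetical case 2: siblings whose shared parent is deep (depth 6).
deep = wu_palmer(depth_lcs=6, depth_a=7, depth_b=7)

print('Path similarity (both cases):', path_similarity(sibling_distance))
print('Wu-Palmer, shallow parent:', shallow)
print('Wu-Palmer, deep parent:', deep)
```

In this toy setting path similarity cannot tell the two cases apart, while Wu-Palmer gives the siblings under the deeper parent a higher score.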