# IHLT Lab 5: Lexical Semantics

**Authors:** *Zachary Parent ([zachary.parent](mailto:zachary.parent@estudiantat.upc.edu)), Carlos Jiménez ([carlos.humberto.jimenez](mailto:carlos.humberto.jimenez@estudiantat.upc.edu))*

### 2024-10-17

**Instructions:**

- Given the following (lemma, category) pairs:

`('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')`

- For each pair, when possible, print their most frequent WordNet synset

- For each pair of words, when possible, print their corresponding least common subsumer (LCS) and their similarity value, using the following functions:

    - Path Similarity

    - Leacock-Chodorow Similarity

    - Wu-Palmer Similarity

    - Lin Similarity

Normalize similarity values when necessary. What similarity seems better?

## Notes

- `lch_similarity` should be normalized (not all taxonomies have the same depth)
- most frequent ≈ most likely
- lemmas expose their frequency via `count()`
- given a synset, we can get its lemmas and their counts
- we should establish rules/criteria to identify which similarity is 'best'
- not all words have an entry in WordNet; in that case the similarity cannot be computed
- we should implement our own ranking of synsets by lemma count


## Setup

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus import sentiwordnet as swn
import math
import operator
import pandas as pd

In [2]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


True

In [3]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
word = wn.synsets('word', 'n')

In [4]:
for s in word:
    print(s.name())
    for l in s.lemmas():
        print(l.name(), l.count())
        
    print()

word.n.01
word 117

word.n.02
word 18

news.n.01
news 21
intelligence 0
tidings 3
word 5

word.n.04
word 3

discussion.n.02
discussion 25
give-and-take 0
word 3

parole.n.01
parole 0
word 1
word_of_honor 0

word.n.07
word 0

son.n.02
Son 2
Word 0
Logos 0

password.n.01
password 0
watchword 0
word 0
parole 0
countersign 0

bible.n.01
Bible 13
Christian_Bible 0
Book 0
Good_Book 0
Holy_Scripture 1
Holy_Writ 0
Scripture 5
Word_of_God 1
Word 0



## Helper methods

In [5]:
def getRelations(ss):
    """Return a dict mapping each non-empty lexical relation name to its synsets."""
    lexRels = ['hypernyms', 'instance_hypernyms', 'hyponyms',
       'instance_hyponyms', 'member_holonyms', 'substance_holonyms',
       'part_holonyms', 'member_meronyms', 'substance_meronyms',
       'part_meronyms', 'attributes', 'entailments', 'causes', 'also_sees',
       'verb_groups', 'similar_tos']

    results = {}
    for rel in lexRels:
        # Look up the relation method by name and call it
        val = getattr(ss, rel)()
        if val:
            results[rel] = val
    return results

In [6]:
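def normalized_lch_similarity(synset1, synset2):
    """Squash the unbounded LCH score into (0, 1) with a sigmoid."""
    lch_sim = wn.lch_similarity(synset1, synset2)
    # Note: with a sigmoid, identical synsets score close to, but below, 1.0
    return 1 / (1 + math.exp(-lch_sim))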

## WordNet Similarities

In [7]:
dog.lowest_common_hypernyms(cat) 

[Synset('carnivore.n.01')]

In [8]:
dog.path_similarity(cat)

0.2
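
As a rough illustration of what the two calls above compute, the sketch below rebuilds the least common subsumer and the path similarity on a hand-written mini-taxonomy. The `PARENT` map and all helper names are hypothetical and are not WordNet data; the chain is only chosen to resemble the dog/cat case. NLTK's real implementation also handles synsets with several hypernyms, which this sketch does not.

```python
# Hypothetical mini-taxonomy (child -> parent); NOT real WordNet data.
PARENT = {
    "animal": "entity",
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "dog": "canine",
    "cat": "feline",
}

def hypernym_chain(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def lowest_common_subsumer(a, b):
    """Return the deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = hypernym_chain(a)
    for candidate in hypernym_chain(b):  # walks upward from b
        if candidate in ancestors_a:
            return candidate
    return None

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path through the LCS)."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return None
    edges = hypernym_chain(a).index(lcs) + hypernym_chain(b).index(lcs)
    return 1.0 / (1 + edges)

print(lowest_common_subsumer("dog", "cat"))
print(path_similarity("dog", "cat"))
```

In this toy tree `dog` and `cat` are four edges apart, so the score is 1 / (1 + 4), the same form as the 0.2 returned above.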

In [9]:
print(dog.lch_similarity(cat))
normalized_lch_similarity(dog, cat)

2.0281482472922856


0.8837209302325582
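
The two values above can be reproduced by hand from the LCH formula, which also makes it easier to compare normalization choices. The sketch below is a minimal illustration, not the notebook's method: the taxonomy depth of 19 for nouns is an assumption inferred from the outputs in this notebook (it reproduces the 2.028 above for a five-node path), and `max_normalized` is a hypothetical alternative to the sigmoid.

```python
import math

# Assumption: noun taxonomy depth D, inferred from this notebook's outputs;
# it is not read from NLTK here.
NOUN_TAXONOMY_DEPTH = 19

def lch(path_nodes: int, depth: int = NOUN_TAXONOMY_DEPTH) -> float:
    """Leacock-Chodorow score: -log(p / (2 * D)), p = nodes on the shortest path."""
    return -math.log(path_nodes / (2.0 * depth))

def sigmoid_normalized(score: float) -> float:
    """Normalization used in this notebook: squash the score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def max_normalized(score: float, depth: int = NOUN_TAXONOMY_DEPTH) -> float:
    """Hypothetical alternative: divide by the largest possible score, log(2 * D)."""
    return score / math.log(2.0 * depth)

dog_cat = lch(5)    # a path similarity of 0.2 corresponds to 5 nodes on the path
identical = lch(1)  # a synset compared with itself

print(dog_cat, sigmoid_normalized(dog_cat), max_normalized(dog_cat))
print(sigmoid_normalized(identical), max_normalized(identical))
```

Under these assumptions the sigmoid gives identical synsets a score of about 0.974 rather than 1.0, whereas dividing by log(2D) maps them to 1.0, at the cost of needing the depth for each part of speech.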

In [10]:
dog.wup_similarity(cat) 

0.8571428571428571
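
The Wu-Palmer score relates the depth of the LCS to the depths of the two synsets: 2 · depth(LCS) / (depth(s1) + depth(s2)). The sketch below only illustrates that arithmetic. The depths used (12 for the LCS, 14 for each synset) are assumptions picked because they are consistent with the value above and with a four-edge path between the synsets; they were not read from WordNet.

```python
def wu_palmer(depth_lcs: int, depth_a: int, depth_b: int) -> float:
    """Wu-Palmer similarity from node depths (root counted as depth 1)."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

# Assumed (hypothetical) depths: LCS at depth 12, each synset two edges below it
deep_pair = wu_palmer(depth_lcs=12, depth_a=14, depth_b=14)

# Same path distance (two edges each side of the LCS), but close to the root
shallow_pair = wu_palmer(depth_lcs=2, depth_a=4, depth_b=4)

print(deep_pair, shallow_pair)
```

Because the score depends on absolute depth, two synsets deep in the taxonomy score higher than two shallow ones at the same path distance, which is one way Wu-Palmer differs from path similarity.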

In [11]:
brown_ic = wordnet_ic.ic('ic-brown.dat')
dog.lin_similarity(cat,brown_ic)

0.8768009843733973
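
Lin similarity replaces path lengths with information content (IC): 2 · IC(LCS) / (IC(s1) + IC(s2)), where IC(c) = -log P(c). The sketch below shows the arithmetic on made-up numbers. The counts are hypothetical and do not come from `ic-brown.dat`, so the result will not match the value above.

```python
import math

def information_content(count: float, total: float) -> float:
    """IC(c) = -log P(c), with P(c) estimated as count / total."""
    return -math.log(count / total)

def lin(ic_a: float, ic_b: float, ic_lcs: float) -> float:
    """Lin similarity: 2 * IC(LCS) / (IC(a) + IC(b))."""
    return 2.0 * ic_lcs / (ic_a + ic_b)

# Hypothetical corpus counts; a concept's count includes those of its hyponyms,
# so the LCS is always at least as frequent as the concepts below it.
TOTAL = 1000
ic_carnivore = information_content(50, TOTAL)
ic_dog = information_content(20, TOTAL)
ic_cat = information_content(10, TOTAL)

dog_cat_lin = lin(ic_dog, ic_cat, ic_carnivore)
print(dog_cat_lin)
```

As long as the LCS is at least as frequent as both concepts, the score stays between 0 and 1, so Lin similarity should not need further normalization.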

## SentiWordNet

In [12]:
# getting the wordnet synset
synset = wn.synset('good.a.1')
# getting the sentiwordnet synset
sentiSynset = swn.senti_synset(synset.name())

sentiSynset.pos_score(), sentiSynset.neg_score(), sentiSynset.obj_score()

(0.75, 0.0, 0.25)

## Most frequent synsets

In [13]:
words = [('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')]

In [14]:
synset_pos_converter = {
    'NN': 'n',
    'VB': 'v',
}

In [15]:
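def get_most_frequent_synset(word, pos):
    """Return the name of the synset in which `word` has the highest lemma count, or None."""
    max_count = -1  # start below zero so a sense whose counts are all 0 can still be selected
    most_frequent_synset = None

    for synset in wn.synsets(word, pos):
        for lemma in synset.lemmas():
            # Lemma names are matched case-sensitively against the input word
            if lemma.name() == word and lemma.count() > max_count:
                max_count = lemma.count()
                most_frequent_synset = synset.name()

    return most_frequent_synset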
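
To make the selection rule concrete, the sketch below applies it to the counts of the lemma `word` printed earlier in this notebook. The helper name `pick_most_frequent` is hypothetical, and the fallback to a zero-count sense is an assumption about the desired behavior rather than something the lab statement prescribes.

```python
def pick_most_frequent(counts_by_synset):
    """Return the synset name with the highest count; ties keep the earliest entry."""
    best_name, best_count = None, -1  # -1 lets a zero-count sense be chosen
    for name, count in counts_by_synset.items():
        if count > best_count:
            best_name, best_count = name, count
    return best_name

# Counts of the lowercase lemma 'word' per synset, copied from the output of In [4]
word_counts = {
    "word.n.01": 117,
    "word.n.02": 18,
    "news.n.01": 5,
    "word.n.04": 3,
    "discussion.n.02": 3,
    "parole.n.01": 1,
    "word.n.07": 0,
    "password.n.01": 0,
}

print(pick_most_frequent(word_counts))
```

Starting the running maximum at -1 means a word whose senses all have a count of 0 still gets its first listed synset instead of `None`.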

In [16]:

words_df = pd.DataFrame(words,columns=['word', 'pos'])
words_df['synset_pos'] = words_df['pos'].apply(lambda pos: synset_pos_converter[pos] if pos in synset_pos_converter else None)

print(f"Null values:\n{words_df.isnull().sum()}")
words_df.dropna(inplace=True)
print(f"Null values:\n{words_df.isnull().sum()}")

words_df['synset_accessor'] = words_df.apply(lambda row: get_most_frequent_synset(row['word'], row['synset_pos']), axis=1)
words_df


Null values:
word          0
pos           0
synset_pos    7
dtype: int64
Null values:
word          0
pos           0
synset_pos    0
dtype: int64


| | word | pos | synset_pos | synset_accessor |
|---|---|---|---|---|
| 1 | man | NN | n | man.n.01 |
| 2 | swim | VB | v | swim.v.01 |
| 5 | girl | NN | n | girl.n.01 |
| 8 | boy | NN | n | male_child.n.01 |
| 11 | woman | NN | n | woman.n.01 |
| 12 | walk | VB | v | walk.v.01 |


In [17]:
def get_similarity_matrix(words_df: pd.DataFrame, similarity_fn: callable) -> pd.DataFrame:
    """Build a word-by-word matrix by applying similarity_fn to every pair of synsets in words_df."""
    similarity_matrix = pd.DataFrame(columns=words_df['word'], index=words_df['word'])
    # Each row unpacks as (word, pos, synset_pos, synset_accessor)
    for _, (word1, _, _, word1_synset_accessor) in words_df.iterrows():
        word1_synset = wn.synset(word1_synset_accessor)
        for _, (word2, _, _, word2_synset_accessor) in words_df.iterrows():
            word2_synset = wn.synset(word2_synset_accessor)
            similarity_matrix.loc[word1, word2] = similarity_fn(word1_synset, word2_synset)
    return similarity_matrix

In [18]:
words_df

| | word | pos | synset_pos | synset_accessor |
|---|---|---|---|---|
| 1 | man | NN | n | man.n.01 |
| 2 | swim | VB | v | swim.v.01 |
| 5 | girl | NN | n | girl.n.01 |
| 8 | boy | NN | n | male_child.n.01 |
| 11 | woman | NN | n | woman.n.01 |
| 12 | walk | VB | v | walk.v.01 |


## Path similarities

In [19]:
# Verbs and nouns are filled in as separate blocks; cross-POS cells are left empty (NaN)
path_similarities = pd.DataFrame(index=words_df['word'].to_list(), columns=words_df['word'])
verb_words = words_df[words_df['synset_pos'] == 'v']['word'].tolist()
path_similarities.loc[verb_words, verb_words] = get_similarity_matrix(words_df[words_df['synset_pos'] == 'v'], wn.path_similarity)

noun_words = words_df[words_df['synset_pos'] == 'n']['word'].tolist()
path_similarities.loc[noun_words, noun_words] = get_similarity_matrix(words_df[words_df['synset_pos'] == 'n'], wn.path_similarity)

path_similarities

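| word | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 1.0 | | 0.25 | 0.333333 | 0.333333 | |
| swim | | 1.0 | | | | 0.333333 |
| girl | 0.25 | | 1.0 | 0.166667 | 0.5 | |
| boy | 0.333333 | | 0.166667 | 1.0 | 0.2 | |
| woman | 0.333333 | | 0.5 | 0.2 | 1.0 | |
| walk | | 0.333333 | | | | 1.0 |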


## Leacock-Chodorow (LCH) similarities

In [20]:
lch_similarities = pd.DataFrame(index=words_df['word'].to_list(), columns=words_df['word'])
lch_similarities.loc[verb_words, verb_words] = get_similarity_matrix(words_df[words_df['synset_pos'] == 'v'], normalized_lch_similarity)
lch_similarities.loc[noun_words, noun_words] = get_similarity_matrix(words_df[words_df['synset_pos'] == 'n'], normalized_lch_similarity)

lch_similarities

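| word | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 0.974359 | | 0.904762 | 0.926829 | 0.926829 | |
| swim | | 0.962963 | | | | 0.896552 |
| girl | 0.904762 | | 0.974359 | 0.863636 | 0.95 | |
| boy | 0.926829 | | 0.863636 | 0.974359 | 0.883721 | |
| woman | 0.926829 | | 0.95 | 0.883721 | 0.974359 | |
| walk | | 0.896552 | | | | 0.962963 |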
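
Path similarity and LCH are both decreasing functions of the same shortest-path length, and the sigmoid is increasing, so within one part of speech they would be expected to rank pairs identically. The sketch below checks this on the noun values copied from the two matrices above; the helper names are illustrative only.

```python
from itertools import combinations

# Noun-pair values copied from the path and LCH matrices in this notebook
path_noun = {
    ("man", "girl"): 0.25, ("man", "boy"): 0.333333, ("man", "woman"): 0.333333,
    ("girl", "boy"): 0.166667, ("girl", "woman"): 0.5, ("boy", "woman"): 0.2,
}
lch_noun = {
    ("man", "girl"): 0.904762, ("man", "boy"): 0.926829, ("man", "woman"): 0.926829,
    ("girl", "boy"): 0.863636, ("girl", "woman"): 0.95, ("boy", "woman"): 0.883721,
}

def sign(x, tol=1e-9):
    """Return -1, 0 or 1, treating differences below tol as ties."""
    return (x > tol) - (x < -tol)

def same_ordering(scores_a, scores_b):
    """True if both measures order every two word pairs the same way (ties included)."""
    return all(
        sign(scores_a[p] - scores_a[q]) == sign(scores_b[p] - scores_b[q])
        for p, q in combinations(scores_a, 2)
    )

print(same_ordering(path_noun, lch_noun))
```

If the check holds, the two measures carry the same ranking information for these words and differ only in scale, which may be worth keeping in mind when deciding which similarity seems better.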


## Wu-Palmer (WUP) similarities

In [21]:
wup_similarities = pd.DataFrame(index=words_df['word'].to_list(), columns=words_df['word'])
wup_similarities.loc[verb_words, verb_words] = get_similarity_matrix(words_df[words_df['synset_pos'] == 'v'], wn.wup_similarity)
wup_similarities.loc[noun_words, noun_words] = get_similarity_matrix(words_df[words_df['synset_pos'] == 'n'], wn.wup_similarity)

wup_similarities

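| word | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 1.0 | | 0.631579 | 0.666667 | 0.666667 | |
| swim | | 1.0 | | | | 0.333333 |
| girl | 0.631579 | | 1.0 | 0.631579 | 0.631579 | |
| boy | 0.666667 | | 0.631579 | 1.0 | 0.666667 | |
| woman | 0.666667 | | 0.947368 | 0.666667 | 1.0 | |
| walk | | 0.333333 | | | | 1.0 |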


## Lin similarities

In [22]:
## TODO:

## Conclusions

### What similarity seems better?