# IHLT Lab 5: Lexical Semantics

**Authors:** *Zachary Parent ([zachary.parent](mailto:zachary.parent@estudiantat.upc.edu)), Carlos Jiménez ([carlos.humberto.jimenez](mailto:carlos.humberto.jimenez@estudiantat.upc.edu))*

### 2024-10-17

**Instructions:**

- Given the following (lemma, category) pairs:

`('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'),
('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')`

- For each pair, when possible, print their most frequent WordNet synset

- For each pair of words, when possible, print their corresponding least common subsumer (LCS) and their similarity value, using the following functions:

    - Path Similarity

    - Leacock-Chodorow Similarity

    - Wu-Palmer Similarity

    - Lin Similarity

Normalize similarity values when necessary. Which similarity measure seems better?

## Notes

- We should normalize `lch_similarity`: its maximum depends on the taxonomy depth, and not all trees have the same depth.
- "Most frequent" is treated as approximately "most likely".
- Lemmas expose a `count()` method.
- Given a synset, we can get its lemmas and their counts.
- We should establish rules/criteria to decide which similarity is "best".
- Not all words have an entry in WordNet; in that case computing the similarity is not possible.
  - We should implement our own method for sorting synsets by lemma count.
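
The lemma-count idea from the notes could be sketched roughly as follows. This is only an illustration, not the method used later in the notebook (which takes the `.01` sense directly). The helper names `lemma_frequency` and `most_frequent_synset` are hypothetical, and the code only assumes objects that behave like NLTK synsets and lemmas (`lemmas()`, `name()`, `count()`).

```python
from typing import Iterable, Optional


def lemma_frequency(synset, word: str) -> int:
    """Sum the counts of the synset's lemmas whose name matches `word`."""
    target = word.lower()
    return sum(
        lemma.count()
        for lemma in synset.lemmas()
        if lemma.name().lower() == target
    )


def most_frequent_synset(word: str, synsets: Iterable) -> Optional[object]:
    """Return the synset whose matching lemma has the highest count.

    Returns None when `synsets` is empty, i.e. the word has no WordNet
    entry for the requested part of speech. Ties keep the earliest synset,
    which preserves the order in which the synsets were given.
    """
    candidates = list(synsets)
    if not candidates:
        return None
    return max(candidates, key=lambda synset: lemma_frequency(synset, word))
```

With the imports from the Setup section, a call would look like `most_frequent_synset('man', wn.synsets('man', pos='n'))`. A word without any synsets for that part of speech yields `None`, which covers the "no entry in WordNet" case from the notes.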


## Setup

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus import sentiwordnet as swn

In [2]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


True

In [3]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

In [4]:
def getRelations(ss):
    """Return the non-empty lexical relations of synset `ss` as a dict."""
    lexRels = ['hypernyms', 'instance_hypernyms', 'hyponyms',
               'instance_hyponyms', 'member_holonyms', 'substance_holonyms',
               'part_holonyms', 'member_meronyms', 'substance_meronyms',
               'part_meronyms', 'attributes', 'entailments', 'causes',
               'also_sees', 'verb_groups', 'similar_tos']

    results = {}
    for rel in lexRels:
        # Each relation name is a Synset method that returns a list.
        val = getattr(ss, rel)()
        if val:
            results[rel] = val
    return results

## WordNet Similarities

In [5]:
dog.lowest_common_hypernyms(cat) 

[Synset('carnivore.n.01')]

In [6]:
dog.path_similarity(cat)

0.2

In [7]:
dog.lch_similarity(cat) 


2.0281482472922856

In [8]:
dog.wup_similarity(cat) 

0.8571428571428571

In [9]:
brown_ic = wordnet_ic.ic('ic-brown.dat')
dog.lin_similarity(cat,brown_ic)

0.8768009843733973
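
As a sanity check, the path and LCH values above can be related through their usual definitions. The sketch below is not NLTK code. It assumes a shortest path of 4 edges between `dog.n.01` and `cat.n.01` (inferred from the path similarity of 0.2) and a noun taxonomy depth of 19 (inferred from the noun self-similarity of about 3.6376 reported later in this notebook). The function and constant names are hypothetical.

```python
import math


def path_similarity(distance: int) -> float:
    """Path similarity from the number of edges on the shortest path."""
    return 1.0 / (distance + 1)


def lch_similarity(distance: int, max_depth: int) -> float:
    """Leacock-Chodorow similarity: -log((distance + 1) / (2 * max_depth))."""
    return -math.log((distance + 1) / (2.0 * max_depth))


# Assumed inputs, inferred from the notebook's outputs rather than queried
# from WordNet: 4 edges between dog.n.01 and cat.n.01, noun depth 19.
DOG_CAT_DISTANCE = 4
NOUN_MAX_DEPTH = 19

path_score = path_similarity(DOG_CAT_DISTANCE)
lch_score = lch_similarity(DOG_CAT_DISTANCE, NOUN_MAX_DEPTH)
print(f"path: {path_score:.4f}, lch: {lch_score:.4f}")
# prints "path: 0.2000, lch: 2.0281"
```

Under these assumptions both numbers agree with the cell outputs. The check also shows why LCH is the one measure that needs rescaling: path, Wu-Palmer and Lin similarities are at most 1, whereas the LCH maximum grows with the taxonomy depth.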

## SentiWordNet

In [10]:
# getting the wordnet synset
synset = wn.synset('good.a.01')
# getting the sentiwordnet synset
sentiSynset = swn.senti_synset(synset.name())

sentiSynset.pos_score(), sentiSynset.neg_score(), sentiSynset.obj_score()

(0.75, 0.0, 0.25)

In [11]:
words = [('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')]

In [12]:
import pandas as pd

# Map the tags that WordNet covers to its POS codes; other tags map to None.
synset_pos_converter = {
    'NN': 'n',
    'VB': 'v',
}

words_df = pd.DataFrame(words, columns=['word', 'pos'])
words_df['synset_pos'] = words_df['pos'].apply(synset_pos_converter.get)

# Drop the words whose tag has no WordNet counterpart (DT, PR, CC).
print(f"Null values: {words_df.isnull().sum()}")
words_df.dropna(inplace=True)
print(f"Null values: {words_df.isnull().sum()}")

# The first sense ('.01') is used as the most frequent synset.
words_df['synset_accessor'] = words_df.apply(lambda row: f"{row['word']}.{row['synset_pos']}.01", axis=1)
words_df


Null values: word          0
pos           0
synset_pos    7
dtype: int64
Null values: word          0
pos           0
synset_pos    0
dtype: int64


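|    | word  | pos | synset_pos | synset_accessor |
|----|-------|-----|------------|-----------------|
| 1  | man   | NN  | n          | man.n.01        |
| 2  | swim  | VB  | v          | swim.v.01       |
| 5  | girl  | NN  | n          | girl.n.01       |
| 8  | boy   | NN  | n          | boy.n.01        |
| 11 | woman | NN  | n          | woman.n.01      |
| 12 | walk  | VB  | v          | walk.v.01       |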


In [13]:
from typing import Callable

def get_similarity_matrix(words_df: pd.DataFrame, similarity_fn: Callable) -> pd.DataFrame:
    """Return a word-by-word matrix of similarity_fn applied to each synset pair."""
    # Resolve each synset once instead of once per matrix cell.
    synsets = {
        word: wn.synset(accessor)
        for word, accessor in zip(words_df['word'], words_df['synset_accessor'])
    }
    similarity_matrix = pd.DataFrame(columns=words_df['word'], index=words_df['word'])
    for word1, synset1 in synsets.items():
        for word2, synset2 in synsets.items():
            similarity_matrix.loc[word1, word2] = similarity_fn(synset1, synset2)
    return similarity_matrix
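
The instructions also ask for the least common subsumer and all four measures per pair. One possible way to gather them for a single pair is sketched below. `compare_synsets` is a hypothetical helper, not part of NLTK. It relies on the `Synset` methods already used in this notebook plus `pos()`, and it assumes that LCH and Lin similarity are only defined when both synsets share a part of speech, so those two are skipped otherwise.

```python
from typing import Any, Dict, Optional


def compare_synsets(s1: Any, s2: Any, ic: Optional[Any] = None) -> Dict[str, Any]:
    """Collect the LCS and the four similarity scores for one synset pair.

    LCH and Lin are only attempted when both synsets share a part of
    speech; Lin additionally needs an information-content dictionary `ic`.
    Skipped measures are reported as None.
    """
    same_pos = s1.pos() == s2.pos()
    return {
        "lcs": s1.lowest_common_hypernyms(s2),
        "path": s1.path_similarity(s2),
        "wup": s1.wup_similarity(s2),
        "lch": s1.lch_similarity(s2) if same_pos else None,
        "lin": s1.lin_similarity(s2, ic) if same_pos and ic is not None else None,
    }
```

With the objects defined earlier, `compare_synsets(dog, cat, brown_ic)` should return the same LCS and scores as the individual cells in the WordNet Similarities section.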

In [14]:
path_similarities = get_similarity_matrix(words_df, wn.path_similarity)

# LCH similarity is only defined between synsets with the same part of speech,
# so fill the verb and noun blocks separately and leave cross-POS cells empty.
lch_similarities = pd.DataFrame(index=words_df['word'].to_list(), columns=words_df['word'])

verb_df = words_df[words_df['synset_pos'] == 'v']
verb_words = verb_df['word'].tolist()
lch_similarities.loc[verb_words, verb_words] = get_similarity_matrix(verb_df, wn.lch_similarity)

noun_df = words_df[words_df['synset_pos'] == 'n']
noun_words = noun_df['word'].tolist()
lch_similarities.loc[noun_words, noun_words] = get_similarity_matrix(noun_df, wn.lch_similarity)

lch_similarities


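| word  | man      | swim     | girl     | boy      | woman    | walk     |
|-------|----------|----------|----------|----------|----------|----------|
| man   | 3.637586 |          | 2.251292 | 2.538974 | 2.538974 |          |
| swim  |          | 3.258097 |          |          |          | 2.159484 |
| girl  | 2.251292 |          | 3.637586 | 1.845827 | 2.944439 |          |
| boy   | 2.538974 |          | 1.845827 | 3.637586 | 2.028148 |          |
| woman | 2.538974 |          | 2.944439 | 2.028148 | 3.637586 |          |
| walk  |          | 2.159484 |          |          |          | 3.258097 |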
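
The diagonal of this table differs between nouns (about 3.64) and verbs (about 3.26), which is why the raw LCH scores are not directly comparable across parts of speech. One simple normalization, sketched here as an assumption and not as the notebook's final method, divides each row by its self-similarity so that every word scores 1.0 against itself. The names `Matrix`, `normalize_by_self_similarity` and `lch_excerpt` are hypothetical, and the values are copied from the output above.

```python
from typing import Dict, Optional

Matrix = Dict[str, Dict[str, Optional[float]]]


def normalize_by_self_similarity(matrix: Matrix) -> Matrix:
    """Divide every row by its diagonal entry so self-similarity becomes 1.0.

    Missing entries (None) stay None.
    """
    normalized: Matrix = {}
    for row_word, row in matrix.items():
        self_score = row[row_word]
        normalized[row_word] = {
            col_word: None if score is None else score / self_score
            for col_word, score in row.items()
        }
    return normalized


# Excerpt of the LCH table above (values copied from the notebook output).
lch_excerpt: Matrix = {
    "man": {"man": 3.637586, "girl": 2.251292, "swim": None},
    "girl": {"man": 2.251292, "girl": 3.637586, "swim": None},
    "swim": {"man": None, "girl": None, "swim": 3.258097},
}

normalized_lch = normalize_by_self_similarity(lch_excerpt)
```

Dividing by the diagonal avoids hard-coding the taxonomy depth for each part of speech, since the self-similarity already encodes it.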

