# IHLT Lab 6: Word Sense Disambiguation

**Authors:** *Zachary Parent ([zachary.parent](mailto:zachary.parent@estudiantat.upc.edu)), Carlos Jiménez ([carlos.humberto.jimenez](mailto:carlos.humberto.jimenez@estudiantat.upc.edu))*

### 2024-10-24

**Instructions:**

1. Read all pairs of sentences from the SMTeuroparl files of the test set within the evaluation framework of the project.

2. Apply WSD algorithms to the words in the sentences.

3. Compute their similarities by considering senses and Jaccard coefficient.

4. Compare the results with those of sessions 2 (document) and 3 (morphology), in which words and lemmas were considered.

5. Compare the results with the gold standard by giving the Pearson correlation between them.

## Notes

* This time we are forced to use TextServer (this session only).
* Spoiler alert: using word senses will not work as well as the earlier approaches.
    * The best approach is morphology.
* If we combine the approaches (synsets, lemmas, tokens) into one model, we will probably get better results.
* We could use lemma tokens as a fallback.
* We could combine similarity metrics, such as all the previous methods, and use a model like KNN or a decision tree to get better results on the project.
* We could use metadata as features, like document length or parts of speech.
* Lesk is quite bad; UKB is slightly better.
* The WordNet sense IDs in the response use a different format than we are used to, e.g. `10285313-n`.
    * To get the synset, use `wn.synset_from_pos_and_offset('n', 10285313)`.
* The TextServer Python API is not working, but we could use the browser console to hit the web API.
* The browser GUI works too.
* It's a good idea to cache the results from TextServer.
* We have to decide what to do with tokens that don't have a sense. We could ignore them; if we keep them, we apply the same pre-processing as in sessions 2 and 3.
    * He's being coy, but we should probably not discard those words.
    * If the API doesn't work, it will be OK not to use UKB.
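
The sense-ID conversion mentioned in the notes can be sketched as follows. The helper names `parse_sense_id` and `sense_id_to_synset` are hypothetical; only `wn.synset_from_pos_and_offset` comes from NLTK, and the `'N/A'` marker for words without a sense is taken from the code further down in this notebook.

```python
def parse_sense_id(sense_id):
    """Split a sense ID such as '10285313-n' into (pos, offset)."""
    offset, pos = sense_id.split("-")
    return pos, int(offset)


def sense_id_to_synset(sense_id):
    """Look up the WordNet synset for a sense ID, or return None for 'N/A'."""
    if sense_id == "N/A":
        return None
    # Imported lazily so the parsing helper works without the WordNet corpus
    from nltk.corpus import wordnet as wn
    pos, offset = parse_sense_id(sense_id)
    return wn.synset_from_pos_and_offset(pos, offset)


print(parse_sense_id("10285313-n"))  # ('n', 10285313)
```

Keeping the parsing separate from the lookup makes it easy to check the ID handling without downloading WordNet.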


## Setup

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus import sentiwordnet as swn
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from nltk.metrics.distance import jaccard_distance
from functools import partial
import itertools
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
from textserver import TextServer
import config

In [2]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


True

In [3]:

ts = TextServer(config.TEXT_SERVER_USERNAME, config.TEXT_SERVER_PASSWORD, 'morpho')

In [4]:
context = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
synset = nltk.wsd.lesk(context, 'bank', 'n')

In [5]:
synset.name(), synset.definition()

('savings_bank.n.02',
 'a container (usually with a slot in the top) for keeping money at home')
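
To see why Lesk can land on an unexpected sense, here is a minimal sketch of the gloss-overlap idea behind simplified Lesk: pick the sense whose gloss shares the most words with the context. The function `simplified_lesk` and the toy glosses are hypothetical illustrations, not NLTK's implementation or real WordNet definitions.

```python
def simplified_lesk(context, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = set(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical toy glosses, not real WordNet definitions
toy_glosses = {
    "bank_institution": "a financial institution that accepts deposits of money",
    "bank_river": "sloping land beside a body of water",
}
context = ["I", "went", "to", "the", "bank", "to", "deposit", "money", "."]
print(simplified_lesk(context, toy_glosses))  # bank_institution
```

Because the overlap is a raw word count, any shared word (including function words such as "the") can tip the decision, which is one plausible reason the piggy-bank sense wins in the cell above.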

### 1. Read all pairs of sentences from the SMTeuroparl files of the test set within the evaluation framework of the project.

In [6]:
BASE_PATH = './'

In [7]:
assert BASE_PATH is not None, "BASE_PATH is not set"

## Load the data

In [8]:
dt = pd.read_csv(
    f"{BASE_PATH}/test-gold/STS.input.SMTeuroparl.txt", sep="\t", header=None
)
dt.columns = ["s1", "s2"]
gs = pd.read_csv(
    f"{BASE_PATH}/test-gold/STS.gs.SMTeuroparl.txt", sep="\t", header=None
)
dt["gs"] = gs[0]
dt.head()

| | s1 | s2 | gs |
|---|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... | 4.5 |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... | 5.0 |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... | 4.25 |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm | 4.5 |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... | 5.0 |


## Previous results

In [9]:
# Lemmatization methods
wnl = nltk.stem.WordNetLemmatizer()
def lemmatize_one(word):
    x, pos = nltk.pos_tag([word])[0]
    d = {'NN': 'n', 'NNS': 'n', 
       'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 
       'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v', 
       'RB': 'r', 'RBR': 'r', 'RBS': 'r'}
    if pos in d:
        return wnl.lemmatize(word, pos=d[pos])
    return x

def lemmatize_many(words):
    return [lemmatize_one(word) for word in words]

In [10]:
# Token pre-processing methods
def remove_non_alnum(tokens):
    return [token for token in tokens if token.isalnum()]

def lower(tokens):
    return [token.lower() for token in tokens]

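# Build the stop-word set once instead of reloading the list for every token
# (requires nltk.download('stopwords'))
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]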

In [11]:
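# Scoring methods
def jaccard_vector(s1, s2):
    # Row-wise Jaccard *distance* (1 - Jaccard coefficient) between two Series of token lists
    return pd.concat([s1, s2], axis=1).apply(lambda x: jaccard_distance(set(x.iloc[0]), set(x.iloc[1])), axis=1)

def score_jaccard_vector(distances):
    # Pearson correlation with the gold standard. A distance shrinks as similarity
    # grows, so a good measure gives a negative value here.
    return pearsonr(gs[0], distances)[0]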

In [12]:
results=pd.DataFrame(index=['score'])

s1_tokens = dt['s1'].apply(nltk.word_tokenize).apply(remove_non_alnum).apply(lower)
s2_tokens = dt['s2'].apply(nltk.word_tokenize).apply(remove_non_alnum).apply(lower)
results['tokenize'] = score_jaccard_vector(jaccard_vector(s1_tokens, s2_tokens))

s1_lemmas = dt['s1'].apply(nltk.word_tokenize).apply(remove_non_alnum).apply(lower).apply(remove_stopwords).apply(lemmatize_many)
s2_lemmas = dt['s2'].apply(nltk.word_tokenize).apply(remove_non_alnum).apply(lower).apply(remove_stopwords).apply(lemmatize_many)
results['lemmatize'] = score_jaccard_vector(jaccard_vector(s1_lemmas, s2_lemmas))

results.head()

| | tokenize | lemmatize |
|---|---|---|
| score | -0.490289 | -0.503693 |
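
The scores are negative because `jaccard_distance` is a distance, i.e. one minus the Jaccard coefficient. The sketch below, using hypothetical toy sentences and gold scores, shows that switching to the coefficient flips the sign of the Pearson correlation and leaves its magnitude unchanged. The helpers `jaccard_similarity` and `pearson` are written out by hand for illustration only.

```python
import math


def jaccard_similarity(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: three sentence pairs and made-up gold scores
pairs = [(["a", "b", "c"], ["a", "b", "c"]), (["a", "b"], ["b", "c"]), (["a"], ["b"])]
gold = [5.0, 2.5, 0.0]

sims = [jaccard_similarity(s1, s2) for s1, s2 in pairs]
dists = [1 - s for s in sims]  # what jaccard_distance would return
print(pearson(gold, sims), pearson(gold, dists))
```

For comparisons between methods only the magnitude matters, so the distance-based scores in this notebook remain comparable with each other.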


### 2. Apply WSD algorithms to the words in the sentences.

In [40]:
import pickle

CACHE_FILE = 'textserver_cache.pkl'

def load_cache():
    try:
        with open(CACHE_FILE, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}

cache = load_cache()

def dump_cache():
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump(cache, f)

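def get_textserver_response(s):
    # Return the cached response if this sentence was already queried
    if s in cache:
        return cache[s]
    try:
        response = ts.senses(s)
    except Exception as e:
        print(f"Error getting response for {s}: {e}")
        raise  # re-raise with the original traceback
    cache[s] = response
    return response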
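
The cache above is only written to disk when `dump_cache()` is called, so an interrupted run loses every response fetched so far. A minimal sketch of an alternative that saves after each miss is shown below. The class `DiskCache` and the stand-in `fake_fetch` are hypothetical; in the notebook the fetch function would be the TextServer call.

```python
import os
import pickle
import tempfile


class DiskCache:
    """Memoize a fetch function and persist its results with pickle."""

    def __init__(self, path, fetch):
        self.path = path
        self.fetch = fetch
        self.data = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)

    def get(self, key):
        if key not in self.data:
            self.data[key] = self.fetch(key)  # only reached on a cache miss
            self.save()
        return self.data[key]

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)


calls = []


def fake_fetch(sentence):
    """Stand-in for a slow remote call; records how often it is invoked."""
    calls.append(sentence)
    return sentence.upper()


path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
demo_cache = DiskCache(path, fake_fetch)
demo_cache.get("hello world")
demo_cache.get("hello world")          # served from memory
reloaded = DiskCache(path, fake_fetch)
reloaded.get("hello world")            # served from disk
print(len(calls))  # 1
```

Writing the whole file on every miss is wasteful for large caches, but for a few hundred sentences it trades a little I/O for not having to repeat remote calls.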

In [36]:
def get_lesk_synset_if_exists(tokens):
    # Disambiguate each token with Lesk, using the whole sentence as context
    synsets = [nltk.wsd.lesk(tokens, word) for word in tokens]
    # Fall back to the token itself when Lesk finds no synset
    synsets_or_tokens = [synset.name() if synset is not None else token for synset, token in zip(synsets, tokens)]
    return synsets_or_tokens

def get_ukb_synset_if_exists(tokens):
    try:
        response = get_textserver_response(' '.join(tokens))
    except Exception as e:
        print(f"Error getting response for {tokens}: {e}")
        return tokens  # fall back to the plain tokens if TextServer fails
    # sense[4] holds the sense ID (e.g. '10285313-n') or 'N/A'
    synset_accessors = [sense[4] for sense in response[0]]
    synsets = []
    for x in synset_accessors:
        if x == 'N/A':
            synsets.append(None)
        else:
            offset, pos = x.split('-')
            synsets.append(wn.synset_from_pos_and_offset(pos, int(offset)).name())
    # zip assumes the response has one entry per input token, in the same order
    synsets_or_tokens = [synset if synset is not None else token for synset, token in zip(synsets, tokens)]
    return synsets_or_tokens

In [41]:
print(f"get_lesk_synset_if_exists(s1_lemmas[0]): {get_lesk_synset_if_exists(s1_lemmas[0])}")
print(f"get_ukb_synset_if_exists(s1_lemmas[0]): {get_ukb_synset_if_exists(s1_lemmas[0])}")

get_lesk_synset_if_exists(s1_lemmas[0]): ['leader.n.01', 'render.v.04', 'newfangled.s.01', 'gamble.v.01', 'permit.v.01', 'uranium.n.01', 'promise.n.02', 'assume.v.06']
Error getting response for leader give new chance let us hope seize: 401 Client Error:  for url: http://frodo.lsi.upc.edu:8080/TextWS/textservlet/ws/processQuery/senses
Error getting response for ['leader', 'give', 'new', 'chance', 'let', 'us', 'hope', 'seize']: 401 Client Error:  for url: http://frodo.lsi.upc.edu:8080/TextWS/textservlet/ws/processQuery/senses
get_ukb_synset_if_exists(s1_lemmas[0]): ['leader', 'give', 'new', 'chance', 'let', 'us', 'hope', 'seize']


### 3. Compute their similarities by considering senses and Jaccard coefficient.

In [42]:
s1_with_lesk_synsets = s1_lemmas.apply(get_lesk_synset_if_exists)
s2_with_lesk_synsets = s2_lemmas.apply(get_lesk_synset_if_exists)
results['lesk_synsets'] = score_jaccard_vector(jaccard_vector(s1_with_lesk_synsets, s2_with_lesk_synsets))

s1_with_ukb_synsets = s1_lemmas.apply(get_ukb_synset_if_exists)
s2_with_ukb_synsets = s2_lemmas.apply(get_ukb_synset_if_exists)
results['ukb_synsets'] = score_jaccard_vector(jaccard_vector(s1_with_ukb_synsets, s2_with_ukb_synsets))

results.head()

| | tokenize | lemmatize | lesk_synsets |
|---|---|---|---|
| score | -0.490289 | -0.503693 | -0.500498 |
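
As an illustration of what sense-level matching is meant to achieve, the sketch below maps tokens to sense labels before computing the Jaccard coefficient, so that two synonyms count as a match. The mapping `TOY_SENSES` is a hypothetical hand-written dictionary standing in for a WSD algorithm; tokens without an entry are kept as-is, mirroring the fallback used above.

```python
# Hypothetical hand-written sense map standing in for a WSD algorithm
TOY_SENSES = {"car": "car.n.01", "automobile": "car.n.01", "fast": "fast.a.01"}


def to_senses(tokens, sense_map):
    """Replace each token by its sense label, keeping the token if none is known."""
    return [sense_map.get(token, token) for token in tokens]


def jaccard_similarity(a, b):
    """Jaccard coefficient of two non-empty token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


s1 = ["the", "car", "is", "fast"]
s2 = ["the", "automobile", "is", "fast"]

print(jaccard_similarity(s1, s2))  # 0.6 on raw tokens
print(jaccard_similarity(to_senses(s1, TOY_SENSES), to_senses(s2, TOY_SENSES)))  # 1.0 on senses
```

The gain depends entirely on the disambiguation being right: if the two synonyms are assigned different senses, the overlap is no better than with raw tokens.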


In [21]:
dump_cache()

### 4. Results comparison

#### Lab2 (document)

#### Lab3 (morphology)

### Gold standard

# Analysis & Conclusions


