# IHLT Lab 6: Word Sense Disambiguation

**Authors:** *Zachary Parent ([zachary.parent](mailto:zachary.parent@estudiantat.upc.edu)), Carlos Jiménez ([carlos.humberto.jimenez](mailto:carlos.humberto.jimenez@estudiantat.upc.edu))*

### 2024-10-24

**Instructions:**

1. Read all pairs of sentences in the SMTeuroparl files of the test set within the evaluation framework of the project.

2. Apply WSD algorithms to the words in the sentences.

3. Compute their similarities by considering senses and Jaccard coefficient.

4. Compare the results with those in session 2 (document) and 3 (morphology) in which words and lemmas were considered.

5. Compare the results with the gold standard by computing the Pearson correlation between them.

## Notes

* this time we are required to use TextServer (this session only)
* spoiler alert: using word senses will not work as well as the earlier approaches
    * the best approach is morphology
* if we combine the synset, lemma, and token approaches into one model, we will probably get better results
* could use lemma tokens as a fallback
* could combine the similarity metrics from all the previous methods and feed them to a model, such as KNN or a decision tree, to get better results on the project
* could use metadata as features, such as document length or parts of speech
* Lesk is quite bad; UKB is slightly better
* the wn response uses a different format than we are used to, e.g. `10285313-n`
    * to get the synset, use `wn.synset_from_pos_and_offset('n', 10285313)`
* the TextServer Python API is not working, but we could use the browser console to hit the web API
* the browser GUI works too
* it's a good idea to cache the results from TextServer
* we must decide what to do with tokens that don't have a sense: we could ignore them, or keep them and apply the same pre-processing as in sessions 2 and 3
    * he's being coy, but we should probably not discard those words
    * if the API doesn't work, it will be OK to not use UKB
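
The sense-ID conversion and the caching idea from the notes above could be sketched roughly as follows. This is only an illustration under assumptions, not code from the lab: `parse_sense_id`, `sense_id_to_synset`, and `cached_call` are hypothetical helper names, `fetch` is a stand-in for whatever actually queries TextServer, and the cache assumes the fetched results are JSON-serializable.

```python
import json
from pathlib import Path


def parse_sense_id(sense_id):
    """Split an ID such as '10285313-n' into (pos, offset)."""
    offset, pos = sense_id.rsplit("-", 1)
    return pos, int(offset)


def sense_id_to_synset(sense_id):
    """Look up the WordNet synset for an ID such as '10285313-n'.

    Requires NLTK and the 'wordnet' corpus; imported lazily so the
    parsing helper above works without them.
    """
    from nltk.corpus import wordnet as wn

    pos, offset = parse_sense_id(sense_id)
    return wn.synset_from_pos_and_offset(pos, offset)


def cached_call(cache_path, key, fetch):
    """Return fetch(key), storing results in a JSON file between runs.

    `fetch` is a hypothetical callable standing in for the TextServer
    request; it is only called when `key` is not in the cache yet.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if key not in cache:
        cache[key] = fetch(key)
        path.write_text(json.dumps(cache), encoding="utf-8")
    return cache[key]
```

With this sketch, `parse_sense_id("10285313-n")` yields `("n", 10285313)`, which are the two arguments `wn.synset_from_pos_and_offset` expects. Keying the cache on the raw sentence text would mean each sentence is sent to the server at most once.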


## Setup

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus import sentiwordnet as swn
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial
import itertools
from textserver import TextServer

In [2]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/carlos.jimenez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/carlos.jimenez/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /home/carlos.jimenez/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/carlos.jimenez/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


True

In [3]:
context = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
synset = nltk.wsd.lesk(context, 'bank', 'n')

In [4]:
synset.name(), synset.definition()

('savings_bank.n.02',
 'a container (usually with a slot in the top) for keeping money at home')

### 1. Read all pairs of sentences in the SMTeuroparl files of the test set within the evaluation framework of the project.

In [8]:
BASE_PATH = './'

In [9]:
assert BASE_PATH is not None, "BASE_PATH is not set"

In [10]:
dt = pd.read_csv(
    f"{BASE_PATH}/test-gold/STS.input.SMTeuroparl.txt", sep="\t", header=None
)
dt.head()

| | 0 | 1 |
|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... |


### 2. Apply WSD algorithms to the words in the sentences.
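
One possible way to turn a tokenized sentence into a set of sense labels, keeping tokens that have no sense instead of discarding them (as discussed in the notes), is sketched below. It is an assumption-laden illustration rather than the lab's implementation: `sentence_senses` and `lesk_sense` are hypothetical names, and the fallback simply lowercases the token instead of applying the full pre-processing from sessions 2 and 3.

```python
def lesk_sense(tokens, word):
    """Disambiguate `word` in its sentence with NLTK's Lesk.

    Returns the synset name, or None when Lesk finds no sense.
    Requires NLTK and the 'wordnet' corpus (imported lazily).
    """
    from nltk.wsd import lesk

    synset = lesk(tokens, word)
    return synset.name() if synset is not None else None


def sentence_senses(tokens, disambiguate):
    """Map each token to a sense label, falling back to the token itself.

    `disambiguate(tokens, word)` must return a label or None; any WSD
    method (Lesk, or cached UKB output) can be plugged in here.
    """
    senses = set()
    for token in tokens:
        label = disambiguate(tokens, token)
        # Assumption: tokens without a sense are kept as lowercased words.
        senses.add(label if label is not None else token.lower())
    return senses
```

Calling `sentence_senses(tokens, lesk_sense)` would use Lesk; passing a different callable lets the same function consume UKB results without changing the fallback logic.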

### 3. Compute their similarities by considering senses and Jaccard coefficient.
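
As a minimal sketch of this step, the Jaccard coefficient between two sets of sense labels is the size of their intersection divided by the size of their union. The function name `jaccard_similarity` is hypothetical, and returning 1.0 for two empty sets is an assumed convention, not something the lab specifies.

```python
def jaccard_similarity(senses_a, senses_b):
    """Jaccard coefficient |A & B| / |A | B| between two collections."""
    a, b = set(senses_a), set(senses_b)
    if not a and not b:
        # Assumed convention: two empty sentences count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two of four distinct labels are shared, so the similarity is 0.5.
example = jaccard_similarity(
    {"vote.n.01", "take_place.v.01", "today"},
    {"vote.n.01", "take_place.v.01", "tomorrow"},
)
```

Because the inputs are plain sets, the same function works whether the elements are synset names, lemmas, or raw tokens, which keeps the comparison with sessions 2 and 3 like for like.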

### 4. Results comparison

#### Lab2 (document)

#### Lab3 (morphology)

### Gold standard
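
The comparison with the gold standard relies on the Pearson correlation, which could be computed as in the sketch below. The function name `pearson` is hypothetical and the implementation is written out with the standard library only for illustration; `scipy.stats.pearsonr` returns the same coefficient together with a p-value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)
```

Pearson correlation is unchanged by positive linear rescaling, so the Jaccard scores do not need to be mapped onto the gold-standard score range before comparing.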

# Analysis & Conclusions


