# WSL task

As stated in the paper:

Formally, let $t$ be the input text, with $t_1, \dots, t_{|t|}$ being its words, and $I$ the reference inventory, containing a set of senses. Then, a WSL system can be represented as a function $f$ that takes as input the tuple $(t, I)$ and outputs a list of triples $[(s_1, e_1, g_1), \dots, (s_n, e_n, g_n)]$, where each triple $(s_i, e_i, g_i)$, $i \in [1, n]$, represents a disambiguated span, with $s_i$ and $e_i$ being the start and end token index of the span, and $g_i \in I$ representing the corresponding sense chosen from the inventory.
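
The definition above can be read as a function signature. The following is a minimal sketch, not the paper's system: the names `DisambiguatedSpan` and `toy_wsl` are hypothetical, the inventory is a toy dictionary rather than WordNet, the sense identifiers are made-up placeholders, and treating $e_i$ as an exclusive end index is an assumption.

```python
from typing import NamedTuple


class DisambiguatedSpan(NamedTuple):
    start: int  # s_i: index of the first token in the span
    end: int    # e_i: end token index (assumed exclusive in this sketch)
    sense: str  # g_i: a sense taken from the inventory I


def toy_wsl(tokens: list[str], inventory: dict[tuple[str, ...], str]) -> list[DisambiguatedSpan]:
    """Toy f(t, I): tag every token span that exactly matches an inventory entry."""
    max_len = max((len(key) for key in inventory), default=0)
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            key = tuple(token.lower() for token in tokens[start:end])
            if key in inventory:
                spans.append(DisambiguatedSpan(start, end, inventory[key]))
    return spans


# Hypothetical inventory: the sense identifiers are placeholders, not WordNet keys.
toy_inventory = {
    ("new", "york"): "new_york%city",
    ("york",): "york%city",
    ("visited",): "visit%travel",
}
toy_spans = toy_wsl(["She", "visited", "New", "York"], toy_inventory)
print(toy_spans)
```

In this toy run the multi-word span "New York" and the single word "York" are both returned, so output spans may overlap.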

# WSL Dataset

Note that the WSL dataset consists of two splits, which contain the following datasets:
* Validation - Datasets:
  - SemEval 2007
* Test - Datasets:
  - SemEval 2007
  - SemEval 2013
  - SemEval 2015
  - SensEval 2
  - SensEval 3

As you can see, the validation data (SemEval 2007) is also part of the test split, which is expected, as stated in the [WSL paper](https://aclanthology.org/2024.findings-acl.851/).

Loading the dataset:

In [1]:
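import os
from collections import defaultdict

import pandas as pd
import wn
from bokeh.plotting import show, output_notebook
from datasets import load_dataset
from dotenv import load_dotenv
from wn.compat import sensekey

# `wn` and `defaultdict` are imported explicitly above rather than relying on the star import.
from experimental_wsd.wsl import *
from experimental_wsd.wordnet_utils import check_lexicon_exists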

output_notebook()

load_dotenv()

EN_LEXICON = 'omw-en:1.4'
check_lexicon_exists(EN_LEXICON)
ENGLISH_WN = wn.Wordnet(lexicon=EN_LEXICON, expand='')
GET_SENSE = sensekey.sense_getter(EN_LEXICON, ENGLISH_WN)

In [2]:
wsl_ds = load_dataset("Babelscape/wsl", token=os.getenv("HF_TOKEN"))


We can iterate over all sentences in both the validation and test dataset splits:

In [3]:
wsl_splits = ['validation', 'test']
for wsl_split in wsl_splits:
    for wsl_sentence in wsl_sentence_generator(wsl_ds, wsl_split, word_net_sense_getter=GET_SENSE):
        continue

We can list the datasets contained in each split:

In [4]:
wsl_splits = ['validation', 'test']
for wsl_split in wsl_splits:
    print(f"Split: {wsl_split}")
    for dataset_name in get_all_dataset_ids(wsl_sentence_generator(wsl_ds, wsl_split, word_net_sense_getter=GET_SENSE)):
        print(f"Dataset: {dataset_name}")
    print('--------------')
    print()

Split: validation
Dataset: semeval2007
--------------

Split: test
Dataset: senseval3
Dataset: senseval2
Dataset: semeval2015
Dataset: semeval2013
Dataset: semeval2007
--------------



As stated earlier, the validation data is contained in the test split, so the per-dataset statistics below are generated from the test split only. For each dataset we report:

* number of documents
* number of sentences
* number of tokens
  - number of content words
* number of annotations
  - number of Multi Word Expressions (MWE)
  - number of sub-words
In [5]:
dataset_id_statistics = defaultdict(dict)

for dataset_id in get_all_dataset_ids(wsl_sentence_generator(wsl_ds, 'test', word_net_sense_getter=GET_SENSE)):
    dataset_id_statistics[dataset_id] = wsl_data_statistics(wsl_sentence_generator(wsl_ds, 'test', word_net_sense_getter=GET_SENSE, filter_by_dataset_id=dataset_id))
pd.DataFrame(dataset_id_statistics)[["semeval2007", "semeval2013", "semeval2015", "senseval2", "senseval3"]]

| Statistic | semeval2007 | semeval2013 | semeval2015 | senseval2 | senseval3 |
|---|---|---|---|---|---|
| No. Docs | 3 | 13 | 4 | 3 | 3 |
| No. Sent | 135 | 306 | 138 | 242 | 352 |
| No. Tokens | 3219 | 8535 | 2645 | 5846 | 5642 |
| No. Content Tokens (%) | 1,449 (45.01%) | 3,861 (45.24%) | 1,175 (44.42%) | 2,723 (46.58%) | 2,534 (44.91%) |
| No. Annotations | 1419 | 3880 | 1179 | 2733 | 2503 |
| No. MWEs (%) | 63 (4.44%) | 203 (5.23%) | 42 (3.56%) | 125 (4.57%) | 143 (5.71%) |
| No. Sub tokens | 0 | 1 | 1 | 1 | 2 |
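
The percentages in these statistics appear to be computed with content tokens measured against all tokens and MWEs measured against all annotations. A quick check on the semeval2007 figures, where the helper name `percentage` is ours and not part of `experimental_wsd`:

```python
def percentage(part: int, whole: int) -> str:
    """Format part/whole as a percentage with two decimal places."""
    return f"{100 * part / whole:.2f}%"


# Values copied by hand from the semeval2007 statistics.
tokens, content_tokens = 3219, 1449
annotations, mwes = 1419, 63

content_pct = percentage(content_tokens, tokens)
mwe_pct = percentage(mwes, annotations)
print(content_pct)  # 45.01%
print(mwe_pct)      # 4.44%
```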


We can now show the dataset statistics by split:

In [6]:
split_statistics = defaultdict(dict)

for dataset_split in ['validation', 'test']:
    split_statistics[dataset_split] = wsl_data_statistics(wsl_sentence_generator(wsl_ds, dataset_split, word_net_sense_getter=GET_SENSE))
pd.DataFrame(split_statistics)[["validation", "test"]]

| Statistic | validation | test |
|---|---|---|
| No. Docs | 3 | 26 |
| No. Sent | 135 | 1173 |
| No. Tokens | 3219 | 25887 |
| No. Content Tokens (%) | 1,449 (45.01%) | 11,742 (45.36%) |
| No. Annotations | 1419 | 11714 |
| No. MWEs (%) | 63 (4.44%) | 576 (4.92%) |
| No. Sub tokens | 0 | 5 |
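
As a sanity check, the test figures should equal the sum of the five per-dataset figures, since the test split is made up of those five datasets. The sketch below re-enters the counts from the two sets of statistics by hand; it does not call `wsl_data_statistics`.

```python
# Counts copied by hand from the per-dataset statistics, in the order:
# docs, sentences, tokens, content tokens, annotations, MWEs, sub tokens.
per_dataset = {
    "semeval2007": [3, 135, 3219, 1449, 1419, 63, 0],
    "semeval2013": [13, 306, 8535, 3861, 3880, 203, 1],
    "semeval2015": [4, 138, 2645, 1175, 1179, 42, 1],
    "senseval2": [3, 242, 5846, 2723, 2733, 125, 1],
    "senseval3": [3, 352, 5642, 2534, 2503, 143, 2],
}
# Counts copied by hand from the test split statistics, in the same order.
test_split = [26, 1173, 25887, 11742, 11714, 576, 5]

# Sum each statistic across the five datasets.
totals = [sum(values) for values in zip(*per_dataset.values())]
print(totals == test_split)  # True
```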


As stated in the original [WSL paper (section 4.2)](https://aclanthology.org/2024.findings-acl.851.pdf), tokens that are part of an MWE can also be annotated as single entities themselves. Below we show the number of these overlapping occurrences:

* Number of overlapping groups, whereby a group can contain two or more entities (annotations).
* Number of entities/annotations that are in these overlapping groups.
* One common label groups - Number of groups whereby all entities in that group have one common WSD label.
* A breakdown of the overlapping groups based on number of entities in the groups.
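
One way to picture these counts is to treat each annotation as a token span and merge spans that share tokens into groups. The sketch below is an illustration under assumptions, not the implementation of `get_overlapping_occurrences_statistics`: the `(start, end, labels)` tuples with an exclusive `end`, the helper name `overlapping_groups`, the toy label strings, and the reading of "one common label" as a non-empty intersection of the label sets are all hypothetical.

```python
from collections import Counter

# Hypothetical annotation: (start token, exclusive end token, set of gold labels).
Annotation = tuple[int, int, frozenset[str]]


def overlapping_groups(annotations: list[Annotation]) -> list[list[Annotation]]:
    """Group annotations whose token spans overlap; keep groups of two or more."""
    groups: list[list[Annotation]] = []
    current: list[Annotation] = []
    current_end = -1
    # Sort by start, longest span first, so an MWE precedes the words inside it.
    for annotation in sorted(annotations, key=lambda a: (a[0], -a[1])):
        start, end, _ = annotation
        if current and start < current_end:
            current.append(annotation)
            current_end = max(current_end, end)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [annotation]
            current_end = end
    if len(current) > 1:
        groups.append(current)
    return groups


toy_annotations = [
    (0, 1, frozenset({"a"})),       # single word, overlaps nothing
    (1, 3, frozenset({"b", "c"})),  # MWE covering tokens 1 and 2
    (1, 2, frozenset({"c"})),       # first word of that MWE
    (2, 3, frozenset({"d"})),       # second word of that MWE
    (4, 6, frozenset({"e"})),       # MWE covering tokens 4 and 5
    (5, 6, frozenset({"e"})),       # second word of that MWE, same label
]
groups = overlapping_groups(toy_annotations)
group_sizes = Counter(len(group) for group in groups)
# A group counts as "one common label" if all its label sets share a label.
common_label_groups = sum(
    1 for group in groups
    if frozenset.intersection(*(labels for _, _, labels in group))
)
print(len(groups), sum(len(group) for group in groups), dict(group_sizes), common_label_groups)
```

On the toy data this yields two groups containing five entities in total: one group of three with no shared label and one group of two that shares the label `e`.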

In [7]:
dataset_id_overlapping_groups_statistics = defaultdict(dict)

for dataset_id in get_all_dataset_ids(wsl_sentence_generator(wsl_ds, 'test', word_net_sense_getter=GET_SENSE)):
    dataset_id_overlapping_groups_statistics[dataset_id] = get_overlapping_occurrences_statistics(wsl_sentence_generator(wsl_ds, 'test', word_net_sense_getter=GET_SENSE, filter_by_dataset_id=dataset_id))
pd.DataFrame(dataset_id_overlapping_groups_statistics)[["semeval2007", "semeval2013", "semeval2015", "senseval2", "senseval3"]]

| Statistic | semeval2007 | semeval2013 | semeval2015 | senseval2 | senseval3 |
|---|---|---|---|---|---|
| Number of entities in overlapping groups | 69 | 393 | 81 | 241 | 221 |
| Number of overlapping groups | 32 | 150 | 33 | 95 | 87 |
| One common label groups | 0 | 0 | 1 | 2 | 4 |
| Overlapping groups of 2 | 27 | 58 | 19 | 44 | 45 |
| Overlapping groups of 3 | 5 | 91 | 13 | 51 | 39 |
| Overlapping groups of 4 | | 1 | 1 | | 1 |
| Overlapping groups of 5 | | | | | 2 |


We can also generate these statistics aggregated across all of these datasets, that is, over the whole test split. No `filter_by_dataset_id` argument is passed here, so every dataset is included:

In [8]:
get_overlapping_occurrences_statistics(wsl_sentence_generator(wsl_ds, 'test', word_net_sense_getter=GET_SENSE))

Since the test split is made up of the five datasets, the result should match the per-dataset figures above summed across datasets: 1,005 entities in 397 overlapping groups (193 of size 2, 199 of size 3, 3 of size 4 and 2 of size 5), of which 7 groups share one common label.

We now show the breakdown of POS tags for all content words in the test set:

In [9]:
show(create_wsd_pos_content_words_plot(wsl_sentence_generator(wsl_ds, 'test', word_net_sense_getter=GET_SENSE)))