This repository contains the code used for experiments reported in the paper:
"PISA: A measure of Preference In Selection of Arguments to model verb argument recoverability"
Authors: Giulia Cappelli and Alessandro Lenci
Presented at *SEM 2020
To cite the paper:
@inproceedings{cappelli-lenci-2020-pisa,
title = "{PISA}: A measure of Preference In Selection of Arguments to model verb argument recoverability",
author = "Cappelli, Giulia and
Lenci, Alessandro",
booktitle = "Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.starsem-1.14",
pages = "131--136",
abstract = "Our paper offers a computational model of the semantic recoverability of verb arguments, tested in particular on direct objects and Instruments. Our fully distributional model is intended to improve on older taxonomy-based models, which require a lexicon in addition to the training corpus. We computed the selectional preferences of 99 transitive verbs and 173 Instrument verbs as the mean value of the pairwise cosines between their arguments (a weighted mean between all the arguments, or an unweighted mean with the topmost k arguments). Results show that our model can predict the recoverability of objects and Instruments, providing a similar result to that of taxonomy-based models but at a much cheaper computational cost.",
}
Under the main directory:
python3 setup.py [install|develop]
The first step is to extract from a corpus the list of target nouns associated with each verb. To do so, launch the command:
resnikmeasure extract-dobjects \
--verbs-input abs_path_to_verb_list \
--corpus abs_path_to_corpus_files \
--rels list_of_relations \
--num-workers number_of_workers_for_multiprocessing \
--output-dir abs_path_to_output_directory
The verb list for Resnik's original measure is in data/verb_list_resnik.txt
The script will create, in the output directory:
- a file nouns.freq, containing the absolute frequencies of the extracted nouns
- a file verbs.freq, containing the absolute frequencies of the selected verbs
- a set of files output_nouns.[verb], one for each verb, containing the list of nouns found for that verb together with their relative frequencies
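For illustration, a per-verb output file can be loaded with a short Python helper. The tab-separated noun/frequency layout assumed here is a guess; check the actual files for the exact format.

```python
import csv

def load_noun_list(path):
    """Read a hypothetical output_nouns.[verb] file, assuming one
    noun<TAB>relative_frequency pair per line (the real layout may differ)."""
    nouns = {}
    with open(path, encoding="utf-8") as f:
        for noun, freq in csv.reader(f, delimiter="\t"):
            nouns[noun] = float(freq)
    return nouns
```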
After extracting the complete list of co-occurrences, the next step is to filter the noun lists by absolute frequency (to reduce noise) and by their coverage in the distributional models we plan to use.
To do so, run the following commands:
resnikmeasure filter-threshold \
--input-dir abs_path_to_directory_containing_input_files \
--threshold minimum_admitted_frequency \
--output-dir abs_path_to_output_directory
resnikmeasure filter-coverage \
--input-filepaths abs_paths_to_files_with_lists \
--models-fpath abs_path_to_file_containing_models_list \
--nouns-fpath abs_path_to_noun_frequency_file \
--output-dir abs_path_to_output_directory
The file containing the list of models should be formatted as follows:
model.one.id /abs/path/to/file/containing/one/vector/per/line
model.two.id /abs/path/to/file/containing/one/vector/per/line
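For reference, a file in this format can be parsed into a dictionary with a few lines of Python (a sketch, not the package's own loader):

```python
def load_model_list(path):
    """Parse a model-list file with one 'model_id /abs/path' pair per line."""
    models = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            model_id, vectors_path = line.split(maxsplit=1)
            models[model_id] = vectors_path
    return models
```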
Next, you can compute the standard measure proposed by Resnik, using the following command:
resnikmeasure resnik \
--input-filepaths abs_paths_to_files_with_lists \
--wordnet true_if_specified \
--language-code wordnet_language_code \
--output-dir abs_path_to_output_directory
The --language-code parameter is required only if the --wordnet flag is used.
Next, you want to compute the set of measures based on distributional information. A preliminary step is needed first, for efficiency reasons.
Before getting to the actual computation, we need to store the pairwise cosine similarities between the nouns extracted in the previous steps. This is a costly operation in terms of both space and time, so expect it to take a while.
resnikmeasure cosines \
--input-filepaths abs_paths_to_files_with_lists \
--nouns-fpath abs_path_to_file_with_noun_frequencies \
--num-workers number_of_workers_for_multiprocessing \
--models-fpath abs_path_to_file_containing_models_list \
--output-dir abs_path_to_output_directory
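The cost comes from the quadratic number of pairs: n nouns yield n(n-1)/2 cosines per model. A minimal pure-Python sketch of the quantity being precomputed (the package's on-disk format and vector handling may differ):

```python
import math

def pairwise_cosines(nouns, vectors):
    """Return {(noun_i, noun_j): cosine} for all unordered noun pairs;
    vectors is a parallel list of equal-length float lists."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return {
        (nouns[i], nouns[j]): cos(vectors[i], vectors[j])
        for i in range(len(nouns))
        for j in range(i + 1, len(nouns))
    }
```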
Another thing we might want to pre-compute is the set of weights that will be used when computing the distributional measures. This also makes it easier to get a qualitative understanding of what each measure does.
resnikmeasure weights \
--input-filepaths abs_paths_to_files_with_lists \
--weight-name [id|frequency|idf|entropy|in_entropy|lmi] \
--noun-freqs abs_path_to_file_with_noun_frequencies \
--verb-freqs abs_path_to_file_with_verb_frequencies \
--output-dir abs_path_to_output_directory
The --noun-freqs parameter is only needed for entropy and lmi computation. The --verb-freqs parameter is only needed for lmi computation.
[TODO: add a description of weights]
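As an illustration of one of the simpler schemes, idf can be computed per noun from the number of verbs it occurs with. This is a generic idf sketch, not necessarily the exact formula the package uses:

```python
import math

def idf_weights(verb_noun_lists):
    """Generic idf over verbs: nouns occurring with few verbs get high weight.
    verb_noun_lists maps each verb to the set of nouns observed with it."""
    n_verbs = len(verb_noun_lists)
    df = {}  # document frequency: in how many verbs' lists each noun appears
    for nouns in verb_noun_lists.values():
        for noun in set(nouns):
            df[noun] = df.get(noun, 0) + 1
    return {noun: math.log(n_verbs / count) for noun, count in df.items()}
```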
We can now turn to the computation of the actual measures. Note that they are quite time-consuming.
[TODO: add description]
resnikmeasure weighted-dist-measure \
--input-filepaths abs_paths_to_files_with_lists \
--models-filepaths abs_paths_to_files_containing_pairwise_cosines \
--weight-filepaths abs_paths_to_files_containing_weights \
--output-dir abs_path_to_output_directory
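Conceptually (following the abstract), the weighted measure is a weighted mean of the pairwise cosines between a verb's arguments. A minimal sketch, assuming each pair is weighted by the product of its two arguments' weights; the actual pair weighting in the package may differ:

```python
def weighted_mean_cosine(cosines, weights):
    """cosines: {(noun_i, noun_j): cosine}; weights: {noun: weight}.
    Returns the weighted mean of the pairwise cosines."""
    num = den = 0.0
    for (a, b), cos in cosines.items():
        w = weights[a] * weights[b]  # assumed pair weight
        num += w * cos
        den += w
    return num / den
```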
[TODO: add description]
resnikmeasure topk-dist-measure \
--input-filepaths abs_paths_to_files_with_lists \
--weight-filepaths abs_paths_to_files_containing_weights \
--models-filepaths abs_paths_to_files_containing_pairwise_cosines \
--top-k number_of_items_to_consider \
--output-dir abs_path_to_output_directory
If --top-k is given a negative value, the k least significant values will be considered.
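A sketch of the top-k logic, following the "unweighted mean with the topmost k arguments" description in the abstract: the k arguments with the highest weights are kept (the lowest when k is negative) and their pairwise cosines are averaged without weights. The exact selection criterion is an assumption.

```python
def topk_mean_cosine(cosines, weights, k):
    """Unweighted mean of pairwise cosines among the k arguments with the
    highest weights (the lowest weights if k is negative)."""
    ranked = sorted(weights, key=weights.get, reverse=(k > 0))
    kept = set(ranked[:abs(k)])
    vals = [cos for (a, b), cos in cosines.items() if a in kept and b in kept]
    return sum(vals) / len(vals)
```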