TemporalReferencing

Data and code for the experiments in

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi and Dominik Schlechtweg. 2019. Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. Association for Computational Linguistics.

If you use this software for academic research, please cite the paper above and give appropriate credit to the software listed below, on which this repository strongly depends.

The code heavily relies on DISSECT (modules/composes). For aligning embeddings (SGNS) we used VecMap (alignment/map_embeddings). For alignment of the PPMI matrices and measuring cosine distance we relied on code from LSCDetection. We used hyperwords for training SGNS and PPMI on the extracted word-context pairs.

Usage

The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to change the path passed to sys.path.append('./modules/') in the scripts. All scripts can be run directly from the command line, e.g.:

python2 corpus_processing/extract_pairs.py <corpDir> <outPath> <windowSize> <lowerBound> <upperBound> <vocabset>

We recommend running the scripts with Python 2.7.15; Python 3 is needed only for VecMap. You will have to install some additional packages, such as docopt and gensim. Those that aren't available from the Anaconda installer can be installed via EasyInstall, or by running pip install -r requirements.txt.

Pipelines

Under scripts/ we provide full pipelines running the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then download DISSECT, VecMap and hyperwords with

bash -e scripts/get_packs.sh

Then run the pipelines with

bash -e scripts/run_tr_sgns.sh
bash -e scripts/run_tr_ppmi.sh

bash -e scripts/run_bin_sgns.sh
bash -e scripts/run_bin_ppmi.sh

TR

The Temporal Referencing pipelines for SGNS/PPMI run through the following steps:

get vocabulary from corpus (corpus_processing/make_vocab.py)
extract temporally referenced word-context pairs (word_year) for specified target words (corpus_processing/extract_pairs.py)
learn one TR matrix for all bins (modules/hyperwords/)
extract matrix for each bin from TR matrix (space_creation/tr2bin.py)
extract cosine distances for each pair of adjacent time bins (measures/cd.py)
extract nearest neighbors for each time bin (measures/knn.py)
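The pair extraction step can be illustrated with a minimal sketch. It is not the code in corpus_processing/extract_pairs.py: the function name extract_tr_pairs, its arguments and the example sentences are hypothetical, and it assumes that only the target position receives the word_year tag while context words stay untagged.

```python
def extract_tr_pairs(lines, targets, window_size):
    """Return (word, context) pairs; target words are tagged with the year of their line.

    Hypothetical helper for illustration only, not the repository's implementation.
    Each input line is expected in the form: year [tab] word1 word2 word3...
    """
    pairs = []
    for line in lines:
        year, _, text = line.rstrip("\n").partition("\t")
        tokens = text.split()
        for i, word in enumerate(tokens):
            # Assumption: only the center (target) position is time-tagged,
            # so all time bins share one context vocabulary.
            center = "%s_%s" % (word, year) if word in targets else word
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


# Toy input, invented for this example
example = ["1920\tthe plane flew low", "1940\tthe plane flew high"]
pairs = extract_tr_pairs(example, {"plane"}, 1)
```

Under this assumption, plane_1920 and plane_1940 are distinct target rows in the single TR matrix but are described by the same context columns, which is why no alignment step appears in the TR pipeline.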

Bins

The bin pipelines run through the following steps:

get vocabulary from corpus (corpus_processing/make_vocab.py)
extract regular word-context pairs for each time bin (corpus_processing/extract_pairs.py)
learn matrix for each bin (modules/hyperwords/)
align matrices for each pair of adjacent time bins (alignment/, modules/vecmap/)
extract cosine distances from aligned matrix pairs (measures/cd.py)
extract nearest neighbors for each time bin (measures/knn.py)
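The cosine distance step can be sketched as follows. This is a plain-Python illustration rather than the code in measures/cd.py; the function name cosine_distance and the toy vectors are assumptions, and the two vectors are taken to come from matrices that already share one space (after alignment in the bin pipelines).

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors.

    Hypothetical helper for illustration only.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors for one word in adjacent time bins, invented for this example
vectors_by_bin = {1920: [1.0, 0.0], 1930: [0.0, 1.0], 1940: [0.0, 2.0]}
bins = sorted(vectors_by_bin)
# One distance per pair of adjacent bins: (1920, 1930), (1930, 1940)
distances = [
    cosine_distance(vectors_by_bin[a], vectors_by_bin[b])
    for a, b in zip(bins, bins[1:])
]
```

A larger distance between adjacent bins indicates that the word's vector has moved more between the two periods.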

Corpus

Under corpus/test/files/ we provide a small test corpus containing many duplicate sentences for the time bins 1920, 1930 and 1940, with each line in the following format:

year [tab] word1 word2 word3...
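As a minimal sketch of reading this format, the snippet below groups tokenized sentences by year. The function name read_corpus and the sample lines are hypothetical; the repository's scripts may read the corpus differently.

```python
from collections import defaultdict


def read_corpus(lines):
    """Group tokenized sentences by year from 'year<TAB>word1 word2 ...' lines.

    Hypothetical helper for illustration only.
    """
    by_year = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        year, sep, text = line.partition("\t")
        if not sep:
            raise ValueError("expected a tab after the year: %r" % line)
        by_year[int(year)].append(text.split())
    return by_year


# Sample lines invented for this example
sample = ["1920\tthe plane flew low\n", "1920\tthe plane flew low\n", "1930\ta new plane\n"]
corpus = read_corpus(sample)
```

With a real file, the same function could be called on an open file handle, since iterating over a file yields its lines.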

Data

Under data/ we provide a spreadsheet with experimental results on the Word Sense Change testset, along with lists of the nearest neighbors for the test words found by the different models when trained on COHA (1920-1970).

BibTeX

@inproceedings{Dubossarskyetal19,
	title = {Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change},
	author = {Haim Dubossarsky and Simon Hengchen and Nina Tahmasebi and Dominik Schlechtweg},
	booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
	year = {2019},
	address = {Florence, Italy},
	publisher = {Association for Computational Linguistics},
	pages = {457--470}
}
