This repository contains code to compute continuously evolving word embeddings as weighted running averages of contextualized embeddings (e.g. computed by a transformer model such as BERT). Specifically, these embeddings can be used as continuous diachronic embeddings to study the semantic shift and usage change of words over time as described in our paper.
The code is written for Python 3. Please make sure you have all the dependencies from the requirements.txt
file installed (especially flair>=0.8.0
).
The main code to compute the evolving embeddings can be found in the file evolvemb/embeddings.py
. The evolving embeddings are built on top of Flair Embeddings. They can be used as follows:
from flair.embeddings import TransformerWordEmbeddings
from evolvemb import EvolvingEmbeddings
# the contextualized embeddings from which the weighted running averages should be computed
local_emb = TransformerWordEmbeddings("bert-base-uncased", layers="all", layer_mean=True, subtoken_pooling="mean", fine_tune=False)
# initialize the continuously evolving embeddings on top of the BERT embeddings with some alpha for the weighted average
evolving_emb = EvolvingEmbeddings(local_emb, alpha=0.1)
# ## basic, direct usage (--> see also diachronic embedding example)
# the corpus from which the embeddings should be learned: list of lists of words
sentences = [["This", "is", "one", "sentence", "."], ["This", "is", "another", "sentence", "."]]
# learn embeddings
for s in sentences:
evolving_emb.update_evolving_embeddings(s)
# get embedding vector for a word of interest as a numpy array
emb_vector = evolving_emb["sentence"]
# ## working with Flair data structures
from flair.data import Sentence
# make a flair sentence
sentence = Sentence('I love Berlin .')
# embed sentence with evolving embeddings
evolving_emb.embed(sentence)
If you need to save the final embeddings without the dependency on the flair library, you can have a look at the SimplePretrainedEmbeddings
class in evolvemb/base.py
and the diachronic embedding example.
The following steps need to be executed to explore diachronic embeddings created from your own dataset in a local dash web app with interactive graphics (see screencast for a demonstration). See data/nytimes_dataset_excerpt.txt
for a preview of how the full nytimes_dataset.txt
file used in the examples looks like.
-
[Optional (but advised)] Fine-tune Transformer Model: especially if your data is very different from any of the available pre-trained HuggingFace models, consider fine-tuning a related model checkpoint on your dataset. The
finetune_huggingface.ipynb
notebook contains code for how we did this on our NYTimes dataset. -
Dataset Preparation & Computation of Continuously Evolving Embedding Snapshots: see
nytimes_diachronic.ipynb
for an example of how to do this. The embedding snapshots can be computed with the functioncompute_emb_snapshots
fromevolvemb
. The function requires as input:sentences
: a list of sentences, where a sentence is represented as a list of words (possibly preprocessed, i.e., strings will be passed as is to the transformer model). The sentences have to be in chronological order (i.e., oldest first), corresponding to the dates (see below).dates
: a list of dates (as strings) of the same length assentences
, i.e., each date corresponds to the respective sentence in the previous list. The date strings should be written in a format that can easily be compared, e.g.,'%Y-%m-%d'
(->'2021-02-16'
). If your dataset is in the same format as our original dataset (seedata/nytimes_dataset_excerpt.txt
), then you can also use the functionload_diachronic_dataset
to read in your data and generate thesentences
anddates
lists.snapshots
: a list of strings with dates after which the snapshot should be saved; in the same format as the strings indates
. For example,snapshots=['2020-01-31', '2020-02-31', '2020-03-31']
will result in 3 embedding snapshots that are saved at the end of Jan/Feb/Mar 2020 respectively. It doesn't matter that the snapshots contain impossible dates, e.g., the snapshot for'2020-02-31'
is simply taken before the first sentence with the date'2020-03-01'
is processed.local_emb_name
: this will be passed as the checkpoint name to the transformer model. By default, random embeddings are used, which can be helpful to quickly test your code without applying the transformer to all your sentences.
The
compute_emb_snapshots
function then returns a dictionary with{'snapshot-date': SimplePretrainedEmbeddings}
where the snapshot dates consist of the actual dates where the snapshots were taken (e.g.'2020-02-28'
instead of'2020-02-31'
) and the embeddings themselves areSimplePretrainedEmbeddings
fromevolvemb/base.py
, which is a convenience format to easily access embeddings in a numpy array with keys, e.g.,snapshot_emb['2020-02-28']['banana']
could give you the snapshot of the embedding for the word 'banana' at the end of February 2020.
To use the computed embedding snapshots in the next step (web app with interactive plots), use pickle to save them assnapshot_emb.pkl
in the same folder asapp.py
, i.e.,pickle.dump(snapshot_emb, open("snapshot_emb.pkl", "wb"), -1)
. -
Interactive Dash App: Once you've computed and saved the embedding snapshots as
snapshot_emb.pkl
, the interactive dash app can be started by running$ python app.py
and opening a browser at http://127.0.0.1:8050/, where you can now explore the plots for different words in your dataset. Alternatively, the same plots can also be viewed in a Jupyter Notebook (seenytimes_diachronic.ipynb
for an example).
The web app and plots can also easily be created using other diachronic embeddings by simply wrapping the existing embedding matrices in a SimplePretrainedEmbeddings
or PretrainedEmbeddings
object (e.g. gensim
KeyedVectors
) when constructing the snapshot_emb
dictionary; see nytimes_diachronic_gensim.ipynb
for an example based on Kim et. al 2014.
In addition to the evolvemb
library, this repository also contains all necessary code to reproduce the results from our paper: first the New York Times article snippets need to be downloaded from the NYTimes API. For this the script nytimes_make_dataset.py
can be used, which requires an API key in a file nytimes_apikey.txt
and saves the final dataset at data/nytimes_dataset.txt
(see data/nytimes_dataset_excerpt.txt
for a preview of how the full dataset is structured). Once the dataset was created, the experiments can be executed in the notebook nytimes_diachronic.ipynb
(and nytimes_diachronic_gensim.ipynb
). Please note that the date given for some NYTimes article snippets is noisy, because an article can be updated again at a much later date without this being reflected in the data obtained from the API (this becomes noticeable, for example, when looking at the plots for 'coronavirus').
If you found any of this code helpful, please consider citing the paper:
@inproceedings{horn2021exploring,
title = "Exploring Word Usage Change with Continuously Evolving Embeddings",
author = "Horn, Franziska",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-demo.35",
doi = "10.18653/v1/2021.acl-demo.35",
pages = "290--297",
}