Diachronic Word Embedding from Google Ngrams Data

Author: Bogdan Asztalos (abogdan@caesar.elte.hu)

Based on William Hamilton's code: Original Repository

Overview

This repository contains the code needed to reproduce the results of this paper. The diachronic word embedding is produced by running full_process.sh. This pipeline carries out the steps shown in panels a-h of the figure below (see the paper for details). To extract information about the subdiffusive behavior of words from the embedding data, use the scripts in the data_analysis directory. The figures of the paper were made with the IPython notebooks in the notebooks directory.

The code is an extended and (in some cases) modified version of William Hamilton's code for historical word embeddings.
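As a rough illustration of what "subdiffusive behavior" means for a word's trajectory in embedding space, the sketch below estimates a diffusion exponent from a sequence of vectors, one per time step. It is not taken from the data_analysis scripts: the function names, the use of squared Euclidean distance, and the time-averaged mean squared displacement are all assumptions made for this example, and the paper's actual definitions may differ. It uses only the Python standard library so that it runs without the dependencies listed below.

```python
import math


def mean_squared_displacement(trajectory, lag):
    """Average squared Euclidean distance between vectors `lag` steps apart.

    `trajectory` is a list of equal-length vectors, one per time step
    (hypothetical input format, chosen for this sketch).
    """
    pairs = zip(trajectory[:-lag], trajectory[lag:])
    squared = [sum((b - a) ** 2 for a, b in zip(u, v)) for u, v in pairs]
    return sum(squared) / len(squared)


def diffusion_exponent(trajectory, max_lag):
    """Least-squares slope of log(MSD) against log(lag).

    A slope near 1 corresponds to ordinary diffusion, a slope below 1
    to subdiffusion, and a slope near 2 to straight-line (ballistic) motion.
    """
    lags = range(1, max_lag + 1)
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(mean_squared_displacement(trajectory, lag)) for lag in lags]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


if __name__ == "__main__":
    # Synthetic straight-line trajectory in 2D, used only as a sanity check:
    # the estimated exponent should be very close to 2.
    line = [(0.5 * t, -1.0 * t) for t in range(20)]
    print(diffusion_exponent(line, max_lag=5))
```

For real embedding data, each trajectory would be the sequence of one word's vectors across the time slices of the diachronic embedding, and an exponent clearly below 1 would indicate subdiffusion under the assumptions stated above.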

Code organization

The structure of the code (in terms of folder organization) is as follows:

cooc_randomization contains code for randomizing co-occurrence matrices before embedding. (This is not relevant to the paper, but it is useful for understanding the logic of Word2vec.)
data_analysis contains code for extracting and studying information from the embedding data. Subdiffusive behavior can be observed through this information.
googlengram contains code for pulling and processing historical Google N-Gram Data (Version 2).
notebooks contains IPython notebooks to reproduce figures from the paper.
representations contains code that provides a high-level interface to (historical) word vectors and is originally based upon Omer Levy's hyperwords package (https://bitbucket.org/omerlevy/hyperwords).
sgns contains a modified version of Google's word2vec code (https://code.google.com/archive/p/word2vec/).
vecanalysis contains code for evaluating and analyzing historical word vectors.

Dependencies

For the diachronic embedding:

python 2.7
numpy: http://numpy.org/install
sklearn: http://scikit-learn.org/stable/
cython: http://docs.cython.org/src/quickstart/install.html
Natural Language Toolkit: http://nltk.org/install.html

For the IPython notebooks:

python 3.8
jupyter: http://docs.jupyter.org/en/latest/install.html
numpy: http://numpy.org/install
scipy: http://scipy.org/install
matplotlib: http://matplotlib.org/stable
