Diachronic Word Embedding from Google Ngrams Data

Author: Bogdan Asztalos (abogdan@caesar.elte.hu)

Based on William Hamilton's code: Original Repository

Overview

This repository contains the code needed to reproduce the results of this paper. Running full_process.sh produces the diachronic word embedding; the pipeline carries out the steps shown in panels a-h of the figure below (see the paper for details). To extract information about the subdiffusive behavior of words from the embedding data, use the scripts in the data_analysis directory. The figures in the paper were made with the IPython notebooks in the notebooks directory.

The code is a further developed and (in some cases) modified version of William Hamilton's code for historical word embeddings.

Figure: pipeline of the embedding process (panels a-h)

Code organization

The structure of the code (in terms of folder organization) is as follows:

  • cooc_randomization contains code for randomizing co-occurrence matrices before embedding. (This is not relevant to the paper, but it is useful for understanding the logic of word2vec.)
  • data_analysis contains code for extracting and studying information from the embedding data. Subdiffusive behavior can be observed in this information.
  • googlengram contains code for pulling and processing historical Google N-Gram Data (Version 2).
  • notebooks contains IPython notebooks to reproduce figures from the paper.
  • representations contains code that provides a high-level interface to (historical) word vectors and is originally based upon Omer Levy's hyperwords package (https://bitbucket.org/omerlevy/hyperwords).
  • sgns contains a modified version of Google's word2vec code (https://code.google.com/archive/p/word2vec/).
  • vecanalysis contains code for evaluating and analyzing historical word vectors.
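
As a rough illustration of what "subdiffusive behavior" means for the data_analysis step, the sketch below estimates a diffusion exponent from word-vector trajectories. It is not taken from the repository: the function names (mean_squared_displacement, diffusion_exponent) and the array layout (words × time steps × dimensions) are assumptions made for this example, and the data is a synthetic random walk rather than real embeddings. The idea is that the mean squared displacement grows roughly as a power of the time lag; an exponent below 1 indicates subdiffusion, while an ordinary random walk gives an exponent close to 1.

```python
import numpy as np


def mean_squared_displacement(trajectories, max_lag):
    """Average squared displacement for each time lag.

    trajectories: array of shape (n_words, n_times, dim); this layout is
    an assumption of the sketch, not the repository's data format.
    """
    lags = np.arange(1, max_lag + 1)
    msd = np.empty(len(lags))
    for i, lag in enumerate(lags):
        # Displacement of every word over every window of length `lag`.
        diffs = trajectories[:, lag:, :] - trajectories[:, :-lag, :]
        msd[i] = np.mean(np.sum(diffs ** 2, axis=-1))
    return lags, msd


def diffusion_exponent(lags, msd):
    """Slope of log(MSD) against log(lag); below 1 means subdiffusion."""
    slope, _intercept = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope


# Synthetic stand-in for embedding trajectories: ordinary random walks,
# for which the fitted exponent should come out close to 1.
rng = np.random.default_rng(0)
walks = np.cumsum(rng.normal(size=(200, 100, 10)), axis=1)
lags, msd = mean_squared_displacement(walks, max_lag=20)
print("estimated exponent:", diffusion_exponent(lags, msd))
```

With real embedding data, the trajectories array would instead hold each word's vector at successive time points, after whatever alignment across time the pipeline applies.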

Dependencies

For the diachronic embedding:

For the IPython notebooks: