GeoSGLM

Code for learning geographically-informed word embeddings, as used in Bamman et al. 2014, "Distributed Representations of Geographically Situated Language" (ACL). This draws on code from Mikolov et al. 2013, "Efficient estimation of word representations in vector space" (ICLR), https://code.google.com/p/word2vec/ (Apache 2.0).

To run, adjust the input/output parameters in run.sh and execute it. The required arguments are as follows:

DATA=data/data.test.txt

The data file contains the text (and associated metadata) to learn word representations from. The main columns should be tab-separated, and the text should be space-separated (and tokenized). Sample records include:

id	location	message
480326347508969000	PA	There is a great research question in how long a sequence of blog comments can go before it descends into madness http://t.co/NFqKgaZRuO
472023364908118000	PA	So much easier than hunting through individual websites : using Google Scholar to get BibTeX citations http://t.co/H2inkMGMom
105039889808109000	PA	Just discovered Conflict Kitchen in Pittsburgh - brilliant idea that needs to catch on in other cities . http://t.co/FkSLGD9

In the work described in Bamman et al. (2014), the metadata values are the 51 US states (the 50 states plus DC), but they can be any categorical feature.
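The record layout above can be read with a few lines of Python. This is a minimal sketch, not part of the GeoSGLM code: the function names `parse_record` and `read_records` are hypothetical, and it assumes each line holds exactly the three tab-separated columns shown in the sample. The README does not say whether the `id location message` header line is present in the file itself, so the sketch does not treat it specially.

```python
from typing import Iterator, List, Tuple


def parse_record(line: str) -> Tuple[str, str, List[str]]:
    """Split one tab-separated record into (id, location, tokens)."""
    # Split on the first two tabs only, so a stray tab in the message is kept.
    record_id, location, message = line.rstrip("\n").split("\t", 2)
    # The text is already tokenized, so whitespace splitting recovers the tokens.
    return record_id, location, message.split()


def read_records(path: str) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield parsed records from a data file, skipping incomplete lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.count("\t") < 2:
                continue  # not a full id/location/message record
            yield parse_record(line)
```

For example, `read_records("data/data.test.txt")` would yield one `(id, location, tokens)` tuple per line of the sample file.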

VOCABFILE=data/vocab.txt

The vocab file contains the maximal set of words to learn representations for; no representation is learned for a word that is not in this list. The code further filters this list to words that are seen at least 5 times in the data, keeping at most the $MAXVOCAB most frequent terms (specified below).
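The filtering described above can be sketched as follows. This is an illustration of the stated rule (candidate list, minimum count of 5, cap at the most frequent terms), not the filtering code in the GeoSGLM source. The function name `filter_vocab`, exact string matching of tokens, and the alphabetical tie-breaking are all assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_vocab(
    candidates: Set[str],
    token_stream: Iterable[str],
    min_count: int = 5,
    max_vocab: int = 100000,
) -> List[str]:
    """Return the candidate words that pass the frequency and size limits."""
    # Count only tokens that appear in the candidate list.
    counts = Counter(tok for tok in token_stream if tok in candidates)
    # Drop words seen fewer than min_count times.
    frequent = [(word, n) for word, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically (an assumption).
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    # Keep at most max_vocab words.
    return [word for word, _ in frequent[:max_vocab]]
```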

FEATUREFILE=data/states.txt

The feature file lists the valid metadata values to learn embeddings for (e.g., a list of all US states).

OUTFILE=data/out.embeddings

The outfile contains the learned word embeddings. The output format is space-separated (facet, term, K-dimensional word representation). "Facet" denotes either the base representation (MAIN) or the state-specific deviation from that base representation (e.g., "CA" for California). To get the word representation for the word "city" in California, add together the vectors for city/MAIN and city/CA.
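The addition of the base vector and a state deviation can be sketched in Python. This is a minimal example under stated assumptions: each line is read as `facet term v1 ... vK` separated by spaces, following the format described above, and the helper names `load_embeddings` and `situated_vector` are hypothetical rather than part of the repository.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def load_embeddings(lines: Iterable[str]) -> Dict[Tuple[str, str], Vector]:
    """Map (facet, term) to its vector, one space-separated record per line."""
    table: Dict[Tuple[str, str], Vector] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        facet, term = fields[0], fields[1]
        table[(facet, term)] = [float(x) for x in fields[2:]]
    return table


def situated_vector(
    table: Dict[Tuple[str, str], Vector], term: str, facet: str
) -> Vector:
    """Add the facet-specific deviation to the MAIN (base) vector."""
    base = table[("MAIN", term)]
    deviation = table[(facet, term)]
    return [b + d for b, d in zip(base, deviation)]
```

With the table loaded from an open file handle on $OUTFILE, `situated_vector(table, "city", "CA")` gives the representation of "city" in California.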

MAXVOCAB=100000

MAXVOCAB specifies the maximum size of the vocabulary.

DIMENSIONALITY=100

Dimensionality specifies the size of the learned word representations.

L2=0.0001

L2 specifies the L2 regularization parameter.

Viewing embeddings

For a given query q, you can view the terms most similar to q in all 51 states using scripts/findNearest.py:

python scripts/findNearest.py $OUTFILE
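For readers who want to rank neighbors in their own code, the idea can be sketched as below. This is not the implementation of scripts/findNearest.py: cosine similarity is assumed as the similarity measure, the names `cosine` and `nearest_terms` are hypothetical, and `vectors` is assumed to map each term to its vector for one state (the MAIN vector plus that state's deviation).

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_terms(
    vectors: Dict[str, Vector], query: str, k: int = 10
) -> List[Tuple[str, float]]:
    """Rank every other term by cosine similarity to the query term."""
    q = vectors[query]
    scored = [(term, cosine(q, vec)) for term, vec in vectors.items() if term != query]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Running `nearest_terms` once per state, each time with that state's vectors, gives a per-state list of neighbors for the same query.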
