Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks (EMNLP2017)

GEOMDN readme

Introduction

GEOMDN is an implementation of Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks (EMNLP2017).

The neural network is implemented using Theano/Lasagne, but it shouldn't be difficult to adapt it to other NN frameworks.

The work has 3 main modules:

  1. lang2loc.py implements mixture density networks to predict location from text input.

  2. lang2loc_mdnshared.py implements mixture density networks to predict location from text input, with the difference that the mus, sigmas and corxys of the mixture of Gaussians are shared between all the input samples and only the pis are conditioned on the input. This improves the model because the global mixture-of-Gaussians structure can be learned from all the samples rather than predicted for each individual sample.

  3. loc2lang.py implements a lexical dialectology model that, given 2D coordinate inputs, predicts a unigram probability distribution over the vocabulary. The input is a normal 2D input layer, but the hidden layer consists of several Gaussian distributions whose mus, sigmas and corxys are learned; the hidden layer's output is the probability of the input under each of the Gaussian components.
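As a rough illustration of the quantities named in the list above (mus, sigmas, corxys, pis), here is a minimal NumPy sketch of a bivariate Gaussian mixture negative log-likelihood. It is not the repository's Theano/Lasagne code: the function names, array shapes, and the use of a log-softmax for the pis are assumptions made for this example.

```python
import numpy as np

def bivariate_gaussian_logpdf(xy, mus, sigmas, corxys):
    """Log-density of N points under each of K bivariate Gaussians.

    xy: (N, 2) coordinates; mus: (K, 2) means; sigmas: (K, 2) positive
    standard deviations; corxys: (K,) correlations in (-1, 1).
    Returns an (N, K) array.
    """
    dx = (xy[:, None, 0] - mus[None, :, 0]) / sigmas[None, :, 0]
    dy = (xy[:, None, 1] - mus[None, :, 1]) / sigmas[None, :, 1]
    rho = corxys[None, :]
    one_minus_rho2 = 1.0 - rho ** 2
    z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
    log_norm = (-np.log(2.0 * np.pi * sigmas[None, :, 0] * sigmas[None, :, 1])
                - 0.5 * np.log(one_minus_rho2))
    return log_norm - z / (2.0 * one_minus_rho2)

def log_softmax(logits):
    """Row-wise log-softmax, used here to turn network outputs into log pis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def mixture_nll(xy, log_pis, mus, sigmas, corxys):
    """Mean negative log-likelihood of xy under the mixture.

    log_pis: (N, K) per-sample log mixture weights. In the shared-component
    setting, mus, sigmas and corxys are the same for every sample and only
    log_pis depends on the input.
    """
    joint = log_pis + bivariate_gaussian_logpdf(xy, mus, sigmas, corxys)
    m = joint.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    log_lik = m[:, 0] + np.log(np.exp(joint - m).sum(axis=1))
    return -log_lik.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_comp = 4, 3                    # toy sizes, chosen arbitrarily
    xy = rng.normal(size=(n_samples, 2))        # stand-in for coordinates
    mus = rng.normal(size=(n_comp, 2))
    sigmas = np.exp(rng.normal(size=(n_comp, 2)))     # keep sigmas positive
    corxys = np.tanh(rng.normal(size=n_comp))         # keep corxys in (-1, 1)
    log_pis = log_softmax(rng.normal(size=(n_samples, n_comp)))
    print(mixture_nll(xy, log_pis, mus, sigmas, corxys))
```

Under the same assumptions, `np.exp(bivariate_gaussian_logpdf(xy, mus, sigmas, corxys))` gives one value per Gaussian component for each 2D input, which is the kind of hidden-layer output described for loc2lang.py.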

Word maps: look at some of the maps. They cover many local words, including named entities, for several DARE dialect regions, and city terms, including named entities, for about 100 U.S. cities.

local words retrieved for dialect region Delmarva:

    "delmarva": [
        "llsssss", 
        "llssss", 
        "llsss", 
        "downingtown", 
        "ardd", 
        "dickeating", 
        "llss", 
        "brovah", 
        "millersville", 
        "erked", 
        "rehoboth", 
        "suitland", 
        "arddd", 
        "oldhead", 
        "deptford", 
        "exton", 
        "youngbull", 
        "harford", 
        "fraudin", 
        "drawlin", 
        "dfl", 
        "cheltenham", 
        "reisterstown", 
        "ared", 
        "parkville", 
        "nizz", 
        "#ttm", 
        "marlton", 
        "xib", 
        "llls", 
        "norristown", 
        "horsham", 
        "owings", 
        "schuylkill", 
        "ard", 
        "kutztown", 
        "manayunk", 
        "bensalem", 
        "elkridge", 
        "btfu", 
        "fyd", 
        "llab", 

Geolocation Datasets

The datasets are GEOTEXT, a.k.a. CMU (a small Twitter geolocation dataset), and TwitterUS, a.k.a. NA (a larger Twitter geolocation dataset). Both cover the continental U.S. and can be downloaded from here.

Quick Start

  1. Download the datasets and place them in ./datasets/cmu and ./datasets/na for GEOTEXT and TwitterUS, respectively (contact me for the datasets).

  2. For lang2loc geolocation run:

For GEOTEXT a.k.a CMU run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc.py -d ./datasets/cmu/ -enc latin1 -reg 0 -drop 0.5 -mindf 10 -hid 100 -ncomp 100

For TwitterUS a.k.a NA run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc.py -d ./datasets/na/ -enc utf-8 -reg 1e-5 -drop 0.0 -mindf 10 -hid 300 -ncomp 100
  3. For lang2loc_mdnshared geolocation run:

For GEOTEXT a.k.a CMU run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc_mdnshared.py -d ~/datasets/cmu/ -enc latin1 -reg 0.0 -drop 0.0 -mindf 10 -hid 100 -ncomp 300 -batch 200

For TwitterUS a.k.a NA run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc_mdnshared.py -d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 10 -hid 900 -ncomp 900 -batch 2000
  4. For loc2lang lexical dialectology model run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python loc2lang.py -d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 100 -hid 1000 -ncomp 500 -batch 5000

Note that CMU is too small to be used for lexical dialectology.
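If you prefer to launch these runs from Python rather than the shell, the commands above can be collected in one place. The following is only a convenience sketch: the `RUNS` table, `build_command` and `run` are hypothetical helpers that are not part of this repository, the argument strings are copied verbatim from the Quick Start commands (including their `./datasets` and `~/datasets` paths), and the `nice -n 10` prefix is omitted.

```python
import os
import shlex
import subprocess

# Argument strings copied verbatim from the Quick Start commands.
RUNS = {
    ("lang2loc.py", "cmu"):
        "-d ./datasets/cmu/ -enc latin1 -reg 0 -drop 0.5 -mindf 10 -hid 100 -ncomp 100",
    ("lang2loc.py", "na"):
        "-d ./datasets/na/ -enc utf-8 -reg 1e-5 -drop 0.0 -mindf 10 -hid 300 -ncomp 100",
    ("lang2loc_mdnshared.py", "cmu"):
        "-d ~/datasets/cmu/ -enc latin1 -reg 0.0 -drop 0.0 -mindf 10 -hid 100 -ncomp 300 -batch 200",
    ("lang2loc_mdnshared.py", "na"):
        "-d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 10 -hid 900 -ncomp 900 -batch 2000",
    ("loc2lang.py", "na"):
        "-d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 100 -hid 1000 -ncomp 500 -batch 5000",
}

def build_command(script, dataset, python="python"):
    """Return the argument list for one (script, dataset) pair.

    A leading "~" is expanded here because no shell is involved
    when the list is passed to subprocess.
    """
    args = [os.path.expanduser(a) for a in shlex.split(RUNS[(script, dataset)])]
    return [python, script] + args

def run(script, dataset):
    """Run one configuration with the THEANO_FLAGS used in the Quick Start."""
    env = dict(os.environ, THEANO_FLAGS="device=cpu,floatX=float32")
    return subprocess.run(build_command(script, dataset), env=env, check=True)

if __name__ == "__main__":
    # Print a command instead of running it, so this is safe without the datasets.
    print(" ".join(build_command("lang2loc.py", "cmu")))
```

Calling `run("lang2loc.py", "cmu")` from the repository root would then start the GEOTEXT run, assuming the datasets are in place.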

Citation

@InProceedings{rahimicontinuous2017,
  author    = {Rahimi, Afshin  and  Baldwin, Timothy and Cohn, Trevor},
  title     = {Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks},
  booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP2017)},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  url       = {http://people.eng.unimelb.edu.au/tcohn/papers/emnlp17geomdn.pdf}
}

Contact

Afshin Rahimi afshinrahimi@gmail.com