Code for A Kernel Independence Test for Geographical Language Variation, Nguyen and Eisenstein, 2017.
sample_data
HSIC.ipynb
HSIC.py
README.md
delaunay_ppa.py
hsic_wrapper.py
mantel.py
morans_i.py
plots.r
synthetic_data.ipynb
testing.py

A Kernel Independence Test for Geographical Language Variation.

Code for the following paper:

D. Nguyen and J. Eisenstein. A Kernel Independence Test for Geographical Language Variation. 
To appear in Computational Linguistics.

Getting Started

The code is written for Python 2.7 and depends on the following Python packages:

  • descartes
  • fiona
  • numpy
  • pyproj
  • scipy
  • shapely
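Before running anything, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of this repository: `find_missing` and `REQUIRED_PACKAGES` are hypothetical names, and the check assumes each package is imported under the same name as it is listed.

```python
import importlib

# Package names as listed above; assumed to be importable under the same names.
REQUIRED_PACKAGES = ["descartes", "fiona", "numpy", "pyproj", "scipy", "shapely"]


def find_missing(packages):
    """Return the names in `packages` that cannot be imported."""
    missing = []
    for name in packages:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only catches `ImportError`, so a package that is installed but fails for another reason (for example a missing system library) will still raise.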

Tests

Run the unit tests with:

python testing.py

Data

The data can be downloaded from http://www.dongnguyen.nl/data/dataset-nguyen-eisenstein-cl2017.zip (274 MB)

  • synthetic_experiments: The synthetic datasets. Each directory corresponds to one experiment. Each directory contains a results.txt file with the raw results.
  • shapefiles: Shapefiles of the Netherlands, used for plotting the synthetic datasets, aggregating data into bins (for Moran's I), etc. You need these if you want to experiment with the synthetic data.
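The directory layout described above can be walked with a short helper. This is a minimal sketch, assuming the archive has been extracted so that `synthetic_experiments` contains one subdirectory per experiment. `find_result_files` is a hypothetical helper, not part of this repository, and it does not parse results.txt because the file format is not described here.

```python
import os


def find_result_files(root):
    """Map each experiment directory name under `root` to its results.txt path.

    Assumes one level of experiment subdirectories; directories without a
    results.txt file are skipped.
    """
    results = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name, "results.txt")
        if os.path.isfile(path):
            results[name] = path
    return results
```

For example, `find_result_files("synthetic_experiments")` would return one entry per experiment directory that has a results file.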

Experiments

  • synthetic_data.ipynb: This notebook plots several selected synthetic datasets and shows how to apply the methods to the different types of data (binary, categorical and frequency data).
  • HSIC.ipynb: This notebook illustrates HSIC with several synthetic (non-geographical) datasets.
  • plots.r: Shows how to generate the plots in the paper based on the result files in the synthetic_experiments directory.
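For readers who want the core idea of HSIC before opening the notebooks, here is a small standard-library sketch. It is not the implementation in HSIC.py. It assumes scalar inputs, Gaussian kernels with hand-picked bandwidths (`sigma_x`, `sigma_y`), the biased estimate trace(KHLH) / n^2 (some references normalize by (n - 1)^2 instead), and a permutation test for the p-value. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel_matrix(values, sigma):
    """Gram matrix k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for scalar inputs."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(K, L):
    """Biased HSIC estimate trace(K H L H) / n^2 from two Gram matrices."""
    n = len(K)
    Kc = center(K)
    Lc = center(L)
    # For symmetric matrices, trace(Kc Lc) equals the sum of elementwise products.
    total = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return total / float(n * n)


def permutation_p_value(x, y, sigma_x=1.0, sigma_y=1.0, n_perm=200, seed=0):
    """Return (observed HSIC, p-value) using random permutations of y."""
    rng = random.Random(seed)
    K = gaussian_kernel_matrix(x, sigma_x)
    L = gaussian_kernel_matrix(y, sigma_y)
    observed = hsic(K, L)
    n = len(y)
    count = 0
    for _ in range(n_perm):
        idx = list(range(n))
        rng.shuffle(idx)
        # Permuting y is the same as permuting rows and columns of L together.
        L_perm = [[L[i][j] for j in idx] for i in idx]
        if hsic(K, L_perm) >= observed:
            count += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (count + 1.0) / (n_perm + 1.0)
```

Calling `permutation_p_value(x, y)` on two equal-length lists returns the statistic and an estimated p-value. A small p-value suggests that `x` and `y` are dependent under the chosen kernels.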

Command line tool

python hsic_wrapper.py 

Frequency data (should return 0.00653):

python hsic_wrapper.py -d freq -l sample_data/locs1.txt -f sample_data/data1.txt

Binary data (should return 0.00241):

python hsic_wrapper.py -d bin -l sample_data/locs2.txt -f sample_data/data2.txt

Categorical data (should return 0.00111):

python hsic_wrapper.py -d cat -l sample_data/locs3.txt -f sample_data/data3.txt
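The three sample commands above can also be driven from a script. This is a minimal sketch, assuming it is run from the repository root and that `python` on the PATH is a Python 2.7 interpreter with the dependencies installed. `build_command`, `run_samples`, and `SAMPLE_RUNS` are hypothetical names; only the `-d`, `-l`, and `-f` flags and the sample file paths come from the commands above.

```python
import subprocess

# Data-type values and sample files taken from the commands above.
SAMPLE_RUNS = [
    ("freq", "sample_data/locs1.txt", "sample_data/data1.txt"),
    ("bin", "sample_data/locs2.txt", "sample_data/data2.txt"),
    ("cat", "sample_data/locs3.txt", "sample_data/data3.txt"),
]


def build_command(data_type, locations_file, data_file, python="python"):
    """Build the argument list for one hsic_wrapper.py invocation."""
    return [python, "hsic_wrapper.py",
            "-d", data_type,
            "-l", locations_file,
            "-f", data_file]


def run_samples():
    """Run each sample command and print whatever the tool writes to stdout."""
    for data_type, locs, data in SAMPLE_RUNS:
        output = subprocess.check_output(build_command(data_type, locs, data))
        print("%s: %s" % (data_type, output.decode().strip()))
```

Passing the arguments as a list avoids shell quoting issues. `run_samples()` raises `subprocess.CalledProcessError` if any invocation exits with a non-zero status.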

Authors

Contact: dong.p.ng@gmail.com