first release

alexandres · Jul 26, 2016 · 93e7617
1 parent 2bce43f
commit 93e7617

Showing 10 changed files with 1,581 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1,3 +1,58 @@
 # LexVec
 
-Source code for the model will be published shortly before ACL 2016 which takes place between August 7-12.
+This is an implementation of the **LexVec word embedding model** (similar to word2vec and GloVe) that achieves state-of-the-art results in multiple NLP tasks, as described in [this paper](https://arxiv.org/pdf/1606.00819v2) and [this one](https://arxiv.org/pdf/1606.01283v1).
+
+## Installation
+
+### Binary
+
+The easiest way to get started with LexVec is to download the binary release. We only distribute amd64 binaries for Linux.
+
+**[Download binary](https://github.com/alexandres/lexvec/releases)**
+
+If you are using Windows, OS X, 32-bit Linux, or any other OS, follow the instructions below to build from source.
+
+### Building from source
+
+1. [Install the Go compiler](https://golang.org/doc/install)
+2. Make sure your `$GOPATH` is set
+3. Execute the following commands in your terminal:
+
+   ```bash
+   go get github.com/alexandres/lexvec
+   cd $GOPATH/src/github.com/alexandres/lexvec
+   go build
+   ```
+
+## Usage
+
+### In-memory (default, faster)
+
+To get started, run `$ ./demo.sh` which trains a model using the small [text8](http://mattmahoney.net/dc/text8.zip) corpus (100MB from Wikipedia).
+
+Basic usage of LexVec is:
+
+`$ ./lexvec -corpus somecorpus -output someoutputdirectory/vectors`
+
+Run `$ ./lexvec -h` for a full list of options.
+
+Additionally, we provide a `word2vec` script which implements the exact same interface as the [word2vec](https://code.google.com/archive/p/word2vec/) package should you want to test LexVec using existing scripts. 
+
+### External Memory
+
+By default, LexVec stores the sparse matrix being factorized in memory. This can be a problem if your training corpus is large and your system memory is limited.
+We suggest you first try the in-memory implementation. If you run into out-of-memory errors, try this external-memory approximation:
+
+`env OUTPUTDIR=output ./external_memory_lexvec.sh -corpus somecorpus -dim 300 ...exactsameoptionsasinmemory`
+
+Pre-processing can be accelerated by installing [nsort](http://www.ordinal.com/try.cgi/nsort-i386-3.4.54.rpm) and [pypy](http://pypy.org/) and editing `pairs_to_counts.sh`.
+
+## References
+
+Salle, A., Idiart, M., & Villavicencio, A. (2016). [Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations](https://arxiv.org/pdf/1606.00819v2). arXiv preprint arXiv:1606.00819.
+
+Salle, A., Idiart, M., & Villavicencio, A. (2016). [Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory](https://arxiv.org/pdf/1606.01283v1). arXiv preprint arXiv:1606.01283.
+
+## License
+
+Copyright (c) 2016 Salle, Alexandre <atsalle@inf.ufrgs.br>. All work in this package is distributed under the MIT License.
diff --git a/demo.sh b/demo.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+set -e
+
+# build lexvec binary
+if [ ! -e lexvec ]; then
+	go build
+fi
+
+if [ ! -e text8 ]; then
+	echo Downloading text8 corpus
+	if hash wget 2>/dev/null; then
+		wget http://mattmahoney.net/dc/text8.zip
+	else
+		curl -O http://mattmahoney.net/dc/text8.zip
+	fi
+	unzip text8.zip
+	rm text8.zip
+fi
+
+OUTPUTDIR=output
+
+mkdir -p $OUTPUTDIR
+# These settings are for small corpora such as text8. For larger corpora, stick to the default settings.
+./lexvec -corpus text8 -output $OUTPUTDIR/vectors -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false
+
+echo Trained vectors saved to file $OUTPUTDIR/vectors
diff --git a/demo_external_memory.sh b/demo_external_memory.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+
+set -e
+
+# build lexvec binary
+if [ ! -e lexvec ]; then
+	go build
+fi
+
+if [ ! -e text8 ]; then
+	echo Downloading text8 corpus
+	if hash wget 2>/dev/null; then
+		wget http://mattmahoney.net/dc/text8.zip
+	else
+		curl -O http://mattmahoney.net/dc/text8.zip
+	fi
+	unzip text8.zip
+	rm text8.zip
+fi
+
+export OUTPUTDIR=output
+export MI=false
+export MEMORY=1
+
+# These settings are for small corpora such as text8. For larger corpora, stick to the default settings.
+./external_memory_lexvec.sh -corpus text8 -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false
+
+echo Trained vectors saved to file $OUTPUTDIR/vectors
diff --git a/external_memory_lexvec.sh b/external_memory_lexvec.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+
+# Copyright (c) 2016 Salle, Alexandre <atsalle@inf.ufrgs.br>
+# Author: Salle, Alexandre <atsalle@inf.ufrgs.br>
+# 
+# Permission is hereby granted, free of charge, to any person obtaining a copy of
+# this software and associated documentation files (the "Software"), to deal in
+# the Software without restriction, including without limitation the rights to
+# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+# the Software, and to permit persons to whom the Software is furnished to do so,
+# subject to the following conditions:
+# 
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+# 
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+# FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+set -e
+
+if [ -z "$OUTPUTDIR" ]; then echo "Need to set OUTPUTDIR" >&2; exit 1; fi
+
+mkdir -p "$OUTPUTDIR"
+
+export TMPDIR=$OUTPUTDIR
+
+export MI=${MI:-false}
+COOC=$OUTPUTDIR/coocs
+COOC_TOTALS=$COOC.totals
+VOCAB=$OUTPUTDIR/vocab
+
+CMD="./lexvec $@ -mi=$MI -cooctotalspath $COOC_TOTALS -externalmemory"
+
+echo identifying w,c pairs
+eval $CMD -printcooc -coocpath $COOC -savevocab $VOCAB 
+
+echo aggregating pairs
+./pairs_to_counts.sh < $COOC > $COOC.ready
+rm $COOC
+
+echo training model
+eval $CMD -coocpath $COOC.ready -output $OUTPUTDIR/vectors -readvocab $VOCAB
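
The "aggregating pairs" step in `external_memory_lexvec.sh` pipes the raw (word, context) pairs through `pairs_to_counts.sh`, whose contents are not shown in this part of the diff. As a rough illustration of what such an aggregation might involve, here is a minimal sketch. It assumes one whitespace-separated `word context` pair per input line and a `word context count` output line; both formats, and the function name `count_pairs`, are assumptions for illustration and are not taken from the repository.

```bash
#!/bin/bash
# Hypothetical helper (not from the repository): collapse repeated
# "word context" lines read from stdin into "word context count" lines.
#
# Steps:
#   1. sort brings identical pairs next to each other; LC_ALL=C gives a
#      locale-independent byte ordering.
#   2. uniq -c prefixes each distinct line with its number of occurrences.
#   3. awk moves the count from the first column to the last.
count_pairs() {
	LC_ALL=C sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

It could be called with the same redirection shape the script uses for the real helper, for example `count_pairs < pairs.txt > counts.txt` (file names hypothetical). Because `sort` spills to temporary files when its input exceeds memory, a sort-based approach of this kind suits the external-memory setting.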