first release
alexandres committed Jul 26, 2016
1 parent 2bce43f commit 93e7617
Showing 10 changed files with 1,581 additions and 1 deletion.
57 changes: 56 additions & 1 deletion README.md
@@ -1,3 +1,58 @@
# LexVec

Source code for the model will be published shortly before ACL 2016 which takes place between August 7-12.
This is an implementation of the **LexVec word embedding model** (similar to word2vec and GloVe) that achieves state-of-the-art results in multiple NLP tasks, as described in [this paper](https://arxiv.org/pdf/1606.00819v2) and [this one](https://arxiv.org/pdf/1606.01283v1).

## Installation

### Binary

The easiest way to get started with LexVec is to download the binary release. We only distribute amd64 binaries for Linux.

**[Download binary](https://github.com/alexandres/lexvec/releases)**

If you are using Windows, OS X, 32-bit Linux, or any other OS, follow the instructions below to build from source.

### Building from source

1. [Install the Go compiler](https://golang.org/doc/install)
2. Make sure your `$GOPATH` is set
3. Execute the following commands in your terminal:

```bash
go get github.com/alexandres/lexvec
cd $GOPATH/src/github.com/alexandres/lexvec
go build
```

## Usage

### In-memory (default, faster)

To get started, run `$ ./demo.sh`, which trains a model on the small [text8](http://mattmahoney.net/dc/text8.zip) corpus (100MB from Wikipedia).

Basic usage of LexVec is:

`$ ./lexvec -corpus somecorpus -output someoutputdirectory/vectors`

Run `$ ./lexvec -h` for a full list of options.
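
As a rough sketch of a fuller invocation, the helper below prints a command line using the small-corpus settings from `demo.sh`. The flag names and values are copied from that script; `build_lexvec_cmd` is a hypothetical helper for illustration and is not part of LexVec. For larger corpora, `demo.sh` recommends sticking to the default settings instead.

```bash
#!/bin/bash

# build_lexvec_cmd is a hypothetical helper, not shipped with LexVec.
# It only prints the command; nothing is executed.
# Flag names and values are the small-corpus settings used in demo.sh.
build_lexvec_cmd() {
  local corpus=$1   # path to the training corpus
  local outdir=$2   # directory that will hold the trained vectors
  printf './lexvec -corpus %s -output %s/vectors -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false\n' \
    "$corpus" "$outdir"
}

# Print the command for the text8 demo corpus.
build_lexvec_cmd text8 output
```

Printing the command first makes it easy to review the settings before starting a long training run; create the output directory with `mkdir -p` before running the printed command, as `demo.sh` does.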

Additionally, we provide a `word2vec` script which implements the exact same interface as the [word2vec](https://code.google.com/archive/p/word2vec/) package should you want to test LexVec using existing scripts.

### External Memory

By default, LexVec stores the sparse matrix being factorized in memory. This can be a problem if your training corpus is large and your system memory is limited. We suggest you first try the in-memory implementation. If you run into out-of-memory errors, try this External Memory approximation.

`env OUTPUTDIR=output ./external_memory_lexvec.sh -corpus somecorpus -dim 300 ...exactsameoptionsasinmemory`

Pre-processing can be accelerated by installing [nsort](http://www.ordinal.com/try.cgi/nsort-i386-3.4.54.rpm) and [pypy](http://pypy.org/) and editing `pairs_to_counts.sh`.
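
The pre-processing step aggregates the (word, context) pairs emitted by `lexvec` into counts. The contents of `pairs_to_counts.sh` are not shown here, so the following is only a minimal sketch of the general idea. It assumes a hypothetical input format of one whitespace-separated `word context` pair per line, and `count_pairs` is an illustrative name rather than part of LexVec.

```bash
#!/bin/bash

# count_pairs is a hypothetical helper illustrating pair aggregation.
# Assumed input: one "word context" pair per line on stdin.
# Output: "word context count", one line per distinct pair.
count_pairs() {
  # Sorting groups identical pairs together so uniq -c can count them;
  # LC_ALL=C gives a fast, byte-wise sort order.
  LC_ALL=C sort | uniq -c | awk '{print $2, $3, $1}'
}

# Tiny example: the pair "the cat" occurs twice, "the dog" once.
printf 'the cat\nthe dog\nthe cat\n' | count_pairs
```

Sorting dominates the cost of this kind of aggregation on large corpora, which is presumably why a faster external sort such as nsort is suggested above.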

## References

Salle, A., Idiart, M., & Villavicencio, A. (2016). [Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations](https://arxiv.org/pdf/1606.00819v2). arXiv preprint arXiv:1606.00819.

Salle, A., Idiart, M., & Villavicencio, A. (2016). [Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory](https://arxiv.org/pdf/1606.01283v1). arXiv preprint arXiv:1606.01283.

## License

Copyright (c) 2016 Salle, Alexandre <atsalle@inf.ufrgs.br>. All work in this package is distributed under the MIT License.
27 changes: 27 additions & 0 deletions demo.sh
@@ -0,0 +1,27 @@
#!/bin/bash

set -e

# build lexvec binary
if [ ! -e lexvec ]; then
go build
fi

if [ ! -e text8 ]; then
echo Downloading text8 corpus
if hash wget 2>/dev/null; then
wget http://mattmahoney.net/dc/text8.zip
else
curl -O http://mattmahoney.net/dc/text8.zip
fi
unzip text8.zip
rm text8.zip
fi

OUTPUTDIR=output

mkdir -p $OUTPUTDIR
# These settings are for small corpora such as text8. For larger corpora, stick to the default settings.
./lexvec -corpus text8 -output $OUTPUTDIR/vectors -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false

echo Trained vectors saved to file $OUTPUTDIR/vectors
28 changes: 28 additions & 0 deletions demo_external_memory.sh
@@ -0,0 +1,28 @@
#!/bin/bash

set -e

# build lexvec binary
if [ ! -e lexvec ]; then
go build
fi

if [ ! -e text8 ]; then
echo Downloading text8 corpus
if hash wget 2>/dev/null; then
wget http://mattmahoney.net/dc/text8.zip
else
curl -O http://mattmahoney.net/dc/text8.zip
fi
unzip text8.zip
rm text8.zip
fi

export OUTPUTDIR=output
export MI=false
export MEMORY=1

# These settings are for small corpora such as text8. For larger corpora, stick to the default settings.
./external_memory_lexvec.sh -corpus text8 -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false

echo Trained vectors saved to file $OUTPUTDIR/vectors
46 changes: 46 additions & 0 deletions external_memory_lexvec.sh
@@ -0,0 +1,46 @@
#!/bin/bash

# Copyright (c) 2016 Salle, Alexandre <atsalle@inf.ufrgs.br>
# Author: Salle, Alexandre <atsalle@inf.ufrgs.br>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of
# this software and associated documentation files (the "Software"), to deal in
# the Software without restriction, including without limitation the rights to
# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
# the Software, and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
# FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

set -e

if [ -z "$OUTPUTDIR" ]; then echo "Need to set OUTPUTDIR"; exit 1; fi

mkdir -p $OUTPUTDIR

export TMPDIR=$OUTPUTDIR

export MI=${MI:-false}
COOC=$OUTPUTDIR/coocs
COOC_TOTALS=$COOC.totals
VOCAB=$OUTPUTDIR/vocab

CMD="./lexvec $@ -mi=$MI -cooctotalspath $COOC_TOTALS -externalmemory"

echo identifying w,c pairs
eval $CMD -printcooc -coocpath $COOC -savevocab $VOCAB

echo aggregating pairs
./pairs_to_counts.sh < $COOC > $COOC.ready
rm $COOC

echo training model
eval $CMD -coocpath $COOC.ready -output $OUTPUTDIR/vectors -readvocab $VOCAB
