Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 2bce43f, commit 93e7617
Showing 10 changed files with 1,581 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,58 @@ | ||
# LexVec | ||
|
||
Source code for the model will be published shortly before ACL 2016 which takes place between August 7-12. | ||
This is an implementation of the **LexVec word embedding model** (similar to word2vec and GloVe) that achieves state-of-the-art results in multiple NLP tasks, as described in [this paper](https://arxiv.org/pdf/1606.00819v2) and [this one](https://arxiv.org/pdf/1606.01283v1). | ||
|
||
## Installation | ||
|
||
### Binary | ||
|
||
The easiest way to get started with LexVec is to download the binary release. We only distribute amd64 binaries for Linux. | ||
|
||
**[Download binary](https://github.com/alexandres/lexvec/releases)** | ||
|
||
If you are using Windows, OS X, 32-bit Linux, or any other OS, follow the instructions below to build from source. | ||
|
||
### Building from source | ||
|
||
1. [Install the Go compiler](https://golang.org/doc/install) | ||
2. Make sure your `$GOPATH` is set | ||
3. Execute the following commands in your terminal: | ||
|
||
```bash | ||
go get github.com/alexandres/lexvec | ||
cd $GOPATH/src/github.com/alexandres/lexvec | ||
go build | ||
``` | ||
|
||
## Usage | ||
|
||
### In-memory (default, faster) | ||
|
||
To get started, run `$ ./demo.sh`, which trains a model on the small [text8](http://mattmahoney.net/dc/text8.zip) corpus (100MB from Wikipedia). | ||
|
||
Basic usage of LexVec is: | ||
|
||
`$ ./lexvec -corpus somecorpus -output someoutputdirectory/vectors` | ||
|
||
Run `$ ./lexvec -h` for a full list of options. | ||
|
||
Additionally, we provide a `word2vec` script which implements the exact same interface as the [word2vec](https://code.google.com/archive/p/word2vec/) package should you want to test LexVec using existing scripts. | ||
|
||
### External Memory | ||
|
||
By default, LexVec stores the sparse matrix being factorized in memory. This can be a problem if your training corpus is large and your system memory is limited. We suggest you first try the in-memory implementation. If you run into out-of-memory issues, try this External Memory approximation: | ||
|
||
`env OUTPUTDIR=output ./external_memory_lexvec.sh -corpus somecorpus -dim 300 ...exactsameoptionsasinmemory` | ||
|
||
Pre-processing can be accelerated by installing [nsort](http://www.ordinal.com/try.cgi/nsort-i386-3.4.54.rpm) and [pypy](http://pypy.org/) and editing `pairs_to_counts.sh`. | ||
|
||
## References | ||
|
||
Salle, A., Idiart, M., & Villavicencio, A. (2016). [Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations](https://arxiv.org/pdf/1606.00819v2). arXiv preprint arXiv:1606.00819. | ||
|
||
Salle, A., Idiart, M., & Villavicencio, A. (2016). [Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory](https://arxiv.org/pdf/1606.01283v1). arXiv preprint arXiv:1606.01283. | ||
|
||
## License | ||
|
||
Copyright (c) 2016 Salle, Alexandre <atsalle@inf.ufrgs.br>. All work in this package is distributed under the MIT License. |
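
The build-from-source steps in the README above depend on `$GOPATH` being set. As a minimal sketch, a small guard can make that assumption explicit before running `go get` and `go build`. The helper name `lexvec_src_dir` is hypothetical and not part of LexVec, and the sketch assumes a single-entry `$GOPATH` with the classic `src/` layout.

```shell
#!/bin/bash
# Hypothetical helper (not part of LexVec): print the directory where
# `go get github.com/alexandres/lexvec` places the source, assuming a
# single-entry GOPATH with the classic src/ layout.
lexvec_src_dir() {
    if [ -z "${GOPATH:-}" ]; then
        # Fail loudly instead of silently resolving to /src/...
        echo "GOPATH is not set" >&2
        return 1
    fi
    printf '%s\n' "$GOPATH/src/github.com/alexandres/lexvec"
}

# Usage, mirroring the README commands (left commented out so that
# sourcing this file has no side effects):
#   go get github.com/alexandres/lexvec
#   cd "$(lexvec_src_dir)" && go build
```

Keeping the check in a function lets a wrapper script stop early with a clear message rather than failing later inside `cd`.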
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
#!/bin/bash | ||
|
||
set -e | ||
|
||
# build lexvec binary | ||
if [ ! -e lexvec ]; then | ||
go build | ||
fi | ||
|
||
if [ ! -e text8 ]; then | ||
echo Downloading text8 corpus | ||
if hash wget 2>/dev/null; then | ||
wget http://mattmahoney.net/dc/text8.zip | ||
else | ||
curl -O http://mattmahoney.net/dc/text8.zip | ||
fi | ||
unzip text8.zip | ||
rm text8.zip | ||
fi | ||
|
||
OUTPUTDIR=output | ||
|
||
mkdir -p $OUTPUTDIR | ||
# These settings are for small corpora such as text8. For larger corpora, stick to the default settings. | ||
./lexvec -corpus text8 -output $OUTPUTDIR/vectors -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false | ||
|
||
echo Trained vectors saved to file $OUTPUTDIR/vectors |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
#!/bin/bash | ||
|
||
set -e | ||
|
||
# build lexvec binary | ||
if [ ! -e lexvec ]; then | ||
go build | ||
fi | ||
|
||
if [ ! -e text8 ]; then | ||
echo Downloading text8 corpus | ||
if hash wget 2>/dev/null; then | ||
wget http://mattmahoney.net/dc/text8.zip | ||
else | ||
curl -O http://mattmahoney.net/dc/text8.zip | ||
fi | ||
unzip text8.zip | ||
rm text8.zip | ||
fi | ||
|
||
export OUTPUTDIR=output | ||
export MI=false | ||
export MEMORY=1 | ||
|
||
# These settings are for small corpora such as text8. For larger corpora, stick to the default settings. | ||
./external_memory_lexvec.sh -corpus text8 -dim 200 -iterations 15 -subsample 1e-4 -window 2 -model 2 -negative 25 -minfreq 5 -threads 12 -pos=false | ||
|
||
echo Trained vectors saved to file $OUTPUTDIR/vectors |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
#!/bin/bash | ||
|
||
# Copyright (c) 2016 Salle, Alexandre <atsalle@inf.ufrgs.br> | ||
# Author: Salle, Alexandre <atsalle@inf.ufrgs.br> | ||
# | ||
# Permission is hereby granted, free of charge, to any person obtaining a copy of | ||
# this software and associated documentation files (the "Software"), to deal in | ||
# the Software without restriction, including without limitation the rights to | ||
# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of | ||
# the Software, and to permit persons to whom the Software is furnished to do so, | ||
# subject to the following conditions: | ||
# | ||
# The above copyright notice and this permission notice shall be included in all | ||
# copies or substantial portions of the Software. | ||
# | ||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS | ||
# FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR | ||
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER | ||
# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | ||
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
|
||
set -e | ||
|
||
if [ -z "$OUTPUTDIR" ]; then echo "Need to set OUTPUTDIR" >&2; exit 1; fi | ||
|
||
mkdir -p $OUTPUTDIR | ||
|
||
export TMPDIR=$OUTPUTDIR | ||
|
||
export MI=${MI:-false} | ||
COOC=$OUTPUTDIR/coocs | ||
COOC_TOTALS=$COOC.totals | ||
VOCAB=$OUTPUTDIR/vocab | ||
|
||
CMD="./lexvec $@ -mi=$MI -cooctotalspath $COOC_TOTALS -externalmemory" | ||
|
||
echo identifying w,c pairs | ||
eval $CMD -printcooc -coocpath $COOC -savevocab $VOCAB | ||
|
||
echo aggregating pairs | ||
./pairs_to_counts.sh < $COOC > $COOC.ready | ||
rm $COOC | ||
|
||
echo training model | ||
eval $CMD -coocpath $COOC.ready -output $OUTPUTDIR/vectors -readvocab $VOCAB |
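
The "aggregating pairs" step in the script above pipes the raw (w, c) pairs through `pairs_to_counts.sh`, whose contents are not shown in this diff. Purely as an illustration of the idea, and assuming one space-separated `word context` pair per line with no embedded spaces, collapsing duplicate pairs into counts could look like the sketch below. The function name `count_pairs` and the input and output formats are assumptions; the real script may differ.

```shell
#!/bin/bash
# Hypothetical sketch (not the actual pairs_to_counts.sh): read
# "word context" pairs on stdin, emit "word context count" on stdout.
count_pairs() {
    # Sort so identical pairs become adjacent (byte order, locale-independent),
    # let uniq -c prefix each distinct pair with its count,
    # then move the count to the last column.
    LC_ALL=C sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Example:
#   printf 'b c\na b\na b\n' | count_pairs
```

Because `sort` can spill to disk, this style of aggregation does not require holding all pairs in memory, which fits the motivation for the external memory mode.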