word2vec

Golang "native" implementation of word2vec algorithm (word2vec++ port)


The library brings the word2vec algorithm to Golang using the native runtime (no servers, no Python, etc.). This Golang module implements a cgo bridge to Max Fomichev's word2vec C++ library.

Getting started

Building the C++ library

Use a C++11-compatible compiler and cmake 3.1 to build the library. This is an essential step before going further.

mkdir _build && cd _build
brew install cmake
cmake -DCMAKE_BUILD_TYPE=Release ../libw2v
make
cp ../libw2v/lib/libw2v.dylib /usr/local/lib/libw2v.dylib

Note: the project does not distribute library binaries yet; this is an upcoming feature. You have to build the binaries yourself for your target runtime, or raise an issue if you need help.

Training the model

A trained model is required before moving on. Use either Max Fomichev's original word2vec C++ utility or the Golang front-end supplied by this project:

go install github.com/fogfish/word2vec/w2v@latest

In the following examples, "War and Peace" by Leo Tolstoy is used for training. Stop words are also used to increase accuracy.

Start training by defining the config file:

w2v train config > wap-en.yaml

w2v train -C wap-en.yaml \
  -o wap-v300w5e5s1h005-en.bin \
  -f ../doc/leo-tolstoy-war-and-peace-en.txt

Name the output model after the parameters used for training: v for the vector size, w for the nearby-words window, e for the training epochs, s for the architecture (s1 skip-gram, s0 CBoW), and h for the algorithm (h1 hierarchical softmax, h0 negative sampling).
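The naming convention can be captured in a small helper so that file names stay consistent across training runs. The following is only a sketch: the `Params` type and the `ModelName` function are hypothetical and not part of this library, and the file names used in this README carry extra digits after the h flag that the description does not explain, so the sketch leaves them out.

```go
package main

import "fmt"

// Params holds the training parameters named in the convention.
// Hypothetical type for illustration; the library does not define it.
type Params struct {
	Vector   int  // v: vector size
	Window   int  // w: nearby-words window
	Epoch    int  // e: training epochs
	SkipGram bool // s: true for skip-gram (s1), false for CBoW (s0)
	HSoftmax bool // h: true for hierarchical softmax (h1), false for negative sampling (h0)
}

// bit renders a boolean as the 1/0 digit used in the file name.
func bit(b bool) int {
	if b {
		return 1
	}
	return 0
}

// ModelName builds "<prefix>-v<V>w<W>e<E>s<S>h<H>-<lang>.bin".
func ModelName(prefix, lang string, p Params) string {
	return fmt.Sprintf("%s-v%dw%de%ds%dh%d-%s.bin",
		prefix, p.Vector, p.Window, p.Epoch, bit(p.SkipGram), bit(p.HSoftmax), lang)
}

func main() {
	p := Params{Vector: 300, Window: 5, Epoch: 5, SkipGram: true, HSoftmax: false}
	fmt.Println(ModelName("wap", "en", p))
}
```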

The default arguments give sufficient results; see the article Word2Vec: Optimal hyperparameters and their impact on natural language processing downstream tasks for a discussion of the training options.

Using word2vec

The latest version of the library is available on its main branch. All development, including new features and bug fixes, takes place on the main branch using forking and pull requests, as described in the contribution guidelines. The stable version is available via Golang modules.

Use go get to retrieve the library and add it as a dependency to your application.

go get -u github.com/fogfish/word2vec

The example below shows the usage pattern for the library:

import "github.com/fogfish/word2vec"

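// 1. Load the model produced by the training step
w2v, err := word2vec.Load("wap-v300w5e5s1h005-en.bin", 300)
if err != nil {
	panic(err)
}

// 2. Look up the 30 words nearest to "alexander"
seq := make([]word2vec.Nearest, 30)
w2v.Lookup("alexander", seq)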

See the example or try it out via the command line:

w2v lookup \
  -m wap-v300w5e5s1h005-en.bin \
  -k 30 \
  alexander
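To build intuition for what a nearest-words lookup returns, the sketch below ranks words by cosine similarity over a handful of toy vectors. It is an illustration under stated assumptions, not the library's implementation: the `Neighbor` type, the `cosine` and `nearest` functions, and the vector values are all made up for this example, and the actual lookup runs inside the C++ library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Neighbor pairs a word with its similarity to the query.
// Hypothetical type for this sketch; it is not word2vec.Nearest.
type Neighbor struct {
	Word       string
	Similarity float64
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest ranks every word except the query by similarity and keeps the top k.
// It returns nil when the query is not in the vocabulary.
func nearest(vectors map[string][]float64, query string, k int) []Neighbor {
	q, ok := vectors[query]
	if !ok {
		return nil
	}
	seq := make([]Neighbor, 0, len(vectors))
	for word, vec := range vectors {
		if word == query {
			continue
		}
		seq = append(seq, Neighbor{Word: word, Similarity: cosine(q, vec)})
	}
	sort.Slice(seq, func(i, j int) bool { return seq[i].Similarity > seq[j].Similarity })
	if k < len(seq) {
		seq = seq[:k]
	}
	return seq
}

func main() {
	// Toy 3-dimensional vectors; a trained model would hold 300 dimensions.
	vectors := map[string][]float64{
		"alexander": {0.9, 0.1, 0.0},
		"emperor":   {0.8, 0.2, 0.1},
		"napoleon":  {0.7, 0.1, 0.3},
		"horse":     {0.0, 0.9, 0.4},
	}
	for _, n := range nearest(vectors, "alexander", 2) {
		fmt.Printf("%s %.3f\n", n.Word, n.Similarity)
	}
}
```

With a real model, the -k flag of w2v lookup plays the role of k here.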

Embeddings

Calculate the embedding for a document:

import "github.com/fogfish/word2vec"

// 1. Load the model produced by the training step
w2v, err := word2vec.Load("wap-v300w5e5s1h005-en.bin", 300)
if err != nil {
	panic(err)
}

// 2. Allocate the memory for the vector
vec := make([]float32, 300)

// 3. Calculate the embedding for the document
doc := "braunau was the headquarters of the commander-in-chief"
err = w2v.Embedding(doc, vec)

See the example or try it out via the command line:

w2v embedding \
  -m wap-v300w5e5s1h005-en.bin \
  ../doc/leo-tolstoy-war-and-peace-en.txt
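A common way to turn word vectors into a single document vector is to average the vectors of the words in the document. The sketch below shows that technique on toy data. Whether this library computes its embeddings this way is an assumption that the README does not confirm, and the `embed` function and the vector values are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// embed writes the mean of the vectors of all known words in doc into vec.
// It returns the number of words that contributed to the mean.
func embed(vectors map[string][]float32, doc string, vec []float32) int {
	for i := range vec {
		vec[i] = 0
	}
	n := 0
	for _, word := range strings.Fields(strings.ToLower(doc)) {
		wv, ok := vectors[word]
		if !ok || len(wv) != len(vec) {
			continue // skip out-of-vocabulary words and mismatched vectors
		}
		for i := range vec {
			vec[i] += wv[i]
		}
		n++
	}
	if n > 0 {
		for i := range vec {
			vec[i] /= float32(n)
		}
	}
	return n
}

func main() {
	// Toy 2-dimensional vocabulary; a trained model would hold 300 dimensions.
	vectors := map[string][]float32{
		"braunau":      {1, 0},
		"headquarters": {0, 1},
	}
	vec := make([]float32, 2)
	n := embed(vectors, "Braunau was the headquarters", vec)
	fmt.Println(n, vec)
}
```

As in the library example, the caller allocates the output vector, so the same buffer can be reused across documents.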

How To Contribute

The library is MIT licensed and accepts contributions via GitHub pull requests:

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

The build and testing process requires Go version 1.21 or later.

commit message

The commit message helps us write good release notes and speeds up the review process. The message should address two questions: what changed and why. The project follows the template defined by the chapter Contributing to a Project of the Git book.

bugs

If you experience any issues with the library, please let us know via GitHub issues. We appreciate detailed and accurate reports that help us identify and replicate the issue.

License

See LICENSE

References

  1. Go Wiki: cgo
  2. Calling C code with cgo
  3. Pass struct and array of structs to C function from Go
  4. cgo - cast C struct to Go struct
  5. Word2Vec (google code)
  6. word2vec patch for Mac OS X
