Vegetables

This is a collection of word embeddings repackaged for easy machine loading and human reading.

Each set of embeddings should come with the following files:

  • .tsv is a tab-separated file where (a parsing sketch follows this list):
    • the first column is the word/token,
    • the second column is the count (set to -1 if the original pre-trained embedding didn't save any count),
    • the third to last columns form the actual embedding for the word/token in the first column.
  • .txt is the vocabulary, one word/token per line:
    • same as the first column of the .tsv file.
  • .npy is the embedding matrix that can be loaded directly with numpy:
    • same as the third to last columns of the .tsv file.
  • .pkl is a pickled dict mapping each word/token to its count:
    • set to -1 if the original pre-trained embedding didn't save any count.
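
For the .tsv layout above, a minimal pure-Python parsing sketch; the filename hlbl.rcv1.original.50d.tsv is an assumption that mirrors the naming pattern in the Usage example below:

>>> tokens, counts, vectors = [], [], []
>>> for line in open('hlbl.rcv1.original.50d.tsv'):
...     columns = line.rstrip('\n').split('\t')
...     tokens.append(columns[0])                        # first column: word/token
...     counts.append(int(columns[1]))                   # second column: count, -1 if unknown
...     vectors.append([float(x) for x in columns[2:]])  # remaining columns: the embedding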

Usage

>>> import pickle
>>> import numpy as np

>>> # Row i of the embedding matrix corresponds to line i of the vocabulary file.
>>> embeddings = np.load('hlbl.rcv1.original.50d.npy')
>>> tokens = [line.strip() for line in open('hlbl.rcv1.original.50d.txt')]
>>> embeddings[tokens.index('hello')]
array([-0.21167406, -0.04189226,  0.22745571, -0.09330438,  0.13239339,
        0.25136262, -0.01908735, -0.02557277,  0.0029353 , -0.06194451,
       -0.22384156,  0.04584747,  0.03227248, -0.13708033,  0.17901117,
       -0.01664691,  0.09400477,  0.06688628, -0.09019949, -0.06918809,
        0.08437972, -0.01485273, -0.12062263,  0.05024147, -0.00416972,
        0.04466985, -0.05316647,  0.00998635, -0.03696947,  0.10502578,
       -0.00190554,  0.03435732, -0.05715087, -0.06777468, -0.11803425,
        0.17845355,  0.18688948, -0.07509124, -0.16089943,  0.0396672 ,
       -0.05162677, -0.12486628, -0.03870481,  0.0928738 ,  0.06197058,
       -0.14603543,  0.04026282,  0.14052328,  0.1085517 , -0.15121481])
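
The pickle import above is for the .pkl counts. A minimal sketch, assuming the .pkl holds the token-to-count dict described earlier and follows the same filename pattern:

>>> with open('hlbl.rcv1.original.50d.pkl', 'rb') as fin:
...     counts = pickle.load(fin)
>>> counts['hello']  # -1 if the original pre-trained embedding saved no counts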

Monolingual

| Pre-trained Embeddings | Type | Lang | Cite | Year | Bib | License | Kaggle Dataset |
|---|---|---|---|---|---|---|---|
| Senna (aka C&W) | LM2 | eng | Collobert et al. (aka C&W) | 2008/2011 | bib / bib | License | senna-embeddings |
| HLBL Embeddings (from Turian et al. 2011) | HLBL | eng | Mnih and Hinton | 2009 | bib | License | hlbl-embeddings |
| Turian Embeddings (aka scaled HLBL and C&W) | C&W, HLBL | eng | Turian et al. | 2011 | bib | License | turian-embeddings |
| Huang Embeddings (aka Huang) | Huang | eng | Huang et al. | 2012 | bib | License | huang-embeddings |
| Word2Vec (News) | word2vec | eng | Mikolov et al. | 2013 | bib | License | google-word2vec |
| Word2Vec (Freebase) | word2vec | eng | Mikolov et al. | 2013 | bib | License | google-word2vec-freebase |
| morphoRNN | Huang, C&W | eng | Luong et al. | 2013 | bib | License | csrnn-embeddings |
| GloVe (6B) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-6b |
| GloVe (42B) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-42b |
| GloVe (840B) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-840b |
| GloVe (Twitter) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-twitter |
| COMPOSES | word2vec | eng | Baroni et al. | 2014 | bib | License: CC BY 4.0 | composes-embeddings |
| Dependency | word2vec | eng | Levy and Goldberg | 2014 | bib | License | dependency-embeddings |
| Word2Vec (Shiroyagi) | word2vec | jpn | Shiroyagi Corp. | 2017 | ~ | License: MIT | shiroyagi-word2vec |