alvations/vegetables

Vegetables

This is a collection of word embeddings repackaged for easy machine loading and human reading.

Each set of embeddings should come with the following files:

  • .tsv is a tab-separated file in which
    • (i) the first column is the word/token,
    • (ii) the second column is the count (set to -1 if the original pre-trained embedding did not save counts),
    • (iii) the third through last columns form the embedding for the word/token in the first column.
  • .txt lists the words/tokens, one per line
    • same as the first column of the .tsv file.
  • .npy is the embedding matrix, which can be loaded directly with numpy
    • same as the third through last columns of the .tsv file.
  • .pkl is a pickled file whose keys are the words/tokens and whose values are their counts
    • set to -1 if the original pre-trained embedding did not save counts.
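The .tsv layout described above can be read one row at a time without loading the whole file. The following is a minimal sketch, not code from this repository: the helper name `parse_tsv_line` and the sample row are made up for illustration, and it assumes the count column holds an integer.

```python
import numpy as np

def parse_tsv_line(line):
    """Split one .tsv row into (token, count, vector).

    Assumes the layout described above: token, count, then the
    embedding values, all separated by tabs.
    """
    fields = line.rstrip('\n').split('\t')
    token = fields[0]
    count = int(fields[1])  # -1 means the original embedding saved no count
    vector = np.array([float(x) for x in fields[2:]], dtype=np.float32)
    return token, count, vector

# Made-up row for illustration only; real rows have many more columns.
row = "hello\t-1\t0.1\t-0.2\t0.3\n"
token, count, vector = parse_tsv_line(row)
print(token, count, vector.shape)
```

For a real file, the same helper could be applied to each line of the opened .tsv file in turn.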

Usage

>>> import pickle 
>>> import numpy as np

>>> embeddings = np.load('hlbl.rcv1.original.50d.npy')
>>> with open('hlbl.rcv1.original.50d.txt', encoding='utf-8') as fin:
...     tokens = [line.strip() for line in fin]
...
>>> embeddings[tokens.index('hello')]
array([-0.21167406, -0.04189226,  0.22745571, -0.09330438,  0.13239339,
        0.25136262, -0.01908735, -0.02557277,  0.0029353 , -0.06194451,
       -0.22384156,  0.04584747,  0.03227248, -0.13708033,  0.17901117,
       -0.01664691,  0.09400477,  0.06688628, -0.09019949, -0.06918809,
        0.08437972, -0.01485273, -0.12062263,  0.05024147, -0.00416972,
        0.04466985, -0.05316647,  0.00998635, -0.03696947,  0.10502578,
       -0.00190554,  0.03435732, -0.05715087, -0.06777468, -0.11803425,
        0.17845355,  0.18688948, -0.07509124, -0.16089943,  0.0396672 ,
       -0.05162677, -0.12486628, -0.03870481,  0.0928738 ,  0.06197058,
       -0.14603543,  0.04026282,  0.14052328,  0.1085517 , -0.15121481])
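`tokens.index()` scans the list on every call, so repeated lookups may be faster with a token-to-row dictionary built once. The sketch below shows that idea together with a cosine similarity between two word vectors. It uses small made-up arrays in place of the real files so that it runs on its own; the names `token2idx`, `vector` and `cosine` are illustrative and not part of this repository.

```python
import numpy as np

# Toy stand-ins for the loaded token list and embedding matrix
# (values are made up; real embeddings have 50 or more dimensions).
tokens = ['hello', 'hi', 'car']
embeddings = np.array([[1.0, 0.0],
                       [0.8, 0.6],
                       [0.0, 1.0]])

# Build the token -> row index mapping once; each lookup is then O(1).
# If a token appears more than once, the last row wins here,
# whereas list.index() would return the first.
token2idx = {token: i for i, token in enumerate(tokens)}

def vector(token):
    """Return the embedding row for a token (raises KeyError if unknown)."""
    return embeddings[token2idx[token]]

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_hi = cosine(vector('hello'), vector('hi'))
sim_car = cosine(vector('hello'), vector('car'))
print(sim_hi, sim_car)
```

With the real data, `tokens` and `embeddings` would come from the .txt and .npy files as in the session above.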

Monolingual

| Pre-trained Embeddings | Type | Lang | Cite | Year | Bib | License | Kaggle Dataset |
|---|---|---|---|---|---|---|---|
| Senna (aka. C&W) | LM2 | eng | Collobert et al. (aka. C&W) | 2008/2011 | bib / bib | License | senna-embeddings |
| HLBL Embeddings (from Turian et al. 2011) | HLBL | eng | Mnih and Hinton | 2009 | bib | License | hlbl-embeddings |
| Turian Embeddings (aka. scaled HLBL and C&W) | C&W, HLBL | eng | Turian et al. | 2011 | bib | License | turian-embeddings |
| Huang Embeddings (aka. Huang) | Huang | eng | Huang et al. | 2012 | bib | License | huang-embeddings |
| Word2Vec (News) | word2vec | eng | Mikolov et al. | 2013 | bib | License | google-word2vec |
| Word2Vec (Freebase) | word2vec | eng | Mikolov et al. | 2013 | bib | License | google-word2vec-freebase |
| morphoRNN | Huang, C&W | eng | Luong et al. | 2013 | bib | License | csrnn-embeddings |
| GloVe (6B) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-6b |
| GloVe (42B) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-42b |
| GloVe (840B) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-840b |
| GloVe (Twitter) | GloVe | eng | Pennington et al. | 2014 | bib | License | stanford-glove-twitter |
| COMPOSES | word2vec | eng | Baroni et al. | 2014 | bib | License: CC BY 4.0 | composes-embeddings |
| Dependency | word2vec | eng | Levy and Goldberg | 2014 | bib | License | dependency-embeddings |
| Word2Vec (Shiroyagi) | word2vec | jpn | Shiroyagi Corp. | 2017 | ~ | License: MIT | shiroyagi-word2vec |
