FastText embeddings from SUC

Below are FastText embeddings of several dimensionalities, all computed from the Spanish Unannotated Corpora (SUC).

Embeddings

Links to the embeddings:

XS (#dimensions=10, #vectors=1313423):
S (#dimensions=30, #vectors=1313423):
M (#dimensions=100, #vectors=1313423):
L (#dimensions=300, #vectors=1313423):
new L (#dimensions=300, #vectors=1451827):
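
If the downloaded files are in the plain-text `.vec` layout that FastText can export (a header line with the vector count and dimension, then one word and its values per line), they can be read without extra dependencies. The sketch below assumes that layout; `read_vec` and the two-word sample are hypothetical and are not part of the released files.

```python
import io


def read_vec(stream, limit=None):
    """Parse the text vector format: 'count dim' header, then 'word v1 ... vdim' lines."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break  # read only the first `limit` vectors, useful for large files
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        # Lines may end with a trailing space, so drop empty fields.
        values = [float(x) for x in parts[1:] if x]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return count, dim, vectors


# Tiny in-memory sample standing in for a real file (made-up values).
sample = "2 3\nhola 0.1 0.2 0.3 \nmundo 0.4 0.5 0.6 \n"
count, dim, vectors = read_vec(io.StringIO(sample))
print(count, dim, sorted(vectors))
```

For a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `StringIO` sample, and use `limit` to avoid loading all vectors at once.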

Algorithm

  • Implementation: FastText with Skipgram
  • Parameters:
    • min subword-ngram = 3
    • max subword-ngram = 6
    • minCount = 5
    • epochs = 20
    • dim = 10 (XS), 30 (S), 100 (M), 300 (L), 300 (new L)
    • all other parameters set as default
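
The parameters above map onto the keyword arguments of `fasttext.train_unsupervised` in the official Python bindings. The following is a minimal sketch rather than the exact command used to build these embeddings: the helper names, the `DIMS` table, and the corpus and output paths are assumptions. `char_ngrams` only illustrates which character n-grams fall in the 3 to 6 range once a word is wrapped in the `<` and `>` boundary markers.

```python
# Dimensions per embedding size, taken from the list above.
DIMS = {"XS": 10, "S": 30, "M": 100, "L": 300}


def skipgram_params(size):
    """Keyword arguments for fasttext.train_unsupervised matching the listed settings."""
    return {
        "model": "skipgram",
        "minn": 3,       # min subword n-gram length
        "maxn": 6,       # max subword n-gram length
        "minCount": 5,   # discard words seen fewer than 5 times
        "epoch": 20,
        "dim": DIMS[size],
    }


def char_ngrams(word, minn=3, maxn=6):
    """Illustrative only: character n-grams of '<word>' with length in [minn, maxn]."""
    wrapped = f"<{word}>"
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                grams.append(wrapped[start:start + n])
    return grams


def train(corpus_path, size, output_path):
    """Train and save a model; requires the `fasttext` package and a preprocessed corpus."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_unsupervised(corpus_path, **skipgram_params(size))
    model.save_model(output_path)
    return model


print(skipgram_params("M"))
print(char_ngrams("sol"))
```

A call such as `train("corpus.txt", "L", "embeddings-l.bin")` would train the 300-dimensional variant; both file names here are placeholders.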

Corpus

  • Spanish Unannotated Corpora
  • Corpus size: 2.6 billion words (XS, S, M, L) and 3 billion words (new L)
  • Post-processing: described in the Embeddings and Corpora repos; it includes tokenization, lowercasing, and removal of listings and URLs.
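
To give a rough idea of what such post-processing involves, here is a hypothetical sketch covering lowercasing, URL removal, and a simple regex tokenizer. It is not the pipeline from the Embeddings and Corpora repos: the regular expressions and the `preprocess` function are assumptions, and the removal of listings is not attempted.

```python
import re

# Assumed patterns for illustration; the actual repos may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text):
    """Remove URLs, lowercase, and split into word and punctuation tokens."""
    without_urls = URL_RE.sub(" ", text)
    return TOKEN_RE.findall(without_urls.lower())


print(preprocess("Visita https://example.com ¡Hola, Mundo!"))
```

Splitting punctuation into separate tokens keeps marks such as `¡` and `,` from being glued to neighbouring words, which matters when words with a frequency below `minCount` are dropped.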