---
id: crawl-vectors
title: Word vectors for 157 languages
---
We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. We also distribute three new word analogy datasets, for French, Hindi and Polish.
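These hyperparameters correspond to fastText's standard training flags. As a rough sketch only: the position-weighting variant of CBOW used for the released models is not exposed through the public command-line options, so a plain CBOW run is the closest public equivalent, and `corpus.txt` / `my_model` below are placeholder names:

```shell
# Approximation of the published setup: CBOW, 300 dimensions,
# character n-grams of length 5, window size 5, 10 negative samples.
./fasttext cbow -input corpus.txt -output my_model \
  -dim 300 -minn 5 -maxn 5 -ws 5 -neg 10
```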
In order to download with the command line or from Python code, you must have installed the Python package as described here.
$ ./download_model.py en # English
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
(19.78%) [=========> ]
Once the download is finished, use the model as usual:
$ ./fasttext nn cc.en.300.bin 10
Query word?
>>> import fasttext.util
>>> fasttext.util.download_model('en', if_exists='ignore') # English
>>> ft = fasttext.load_model('cc.en.300.bin')
Word vectors for 157 languages are available on the Hugging Face Hub under the fasttext
tag, and more documentation is available here.
>>> import fasttext
>>> from huggingface_hub import hf_hub_download
>>> model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
The pre-trained word vectors we distribute have dimension 300. If you need a smaller size, you can use our dimension reducer. In order to use that feature, you must have installed the Python package as described here.
For example, in order to get vectors of dimension 100:
$ ./reduce_model.py cc.en.300.bin 100
Loading model
Reducing matrix dimensions
Saving model
cc.en.100.bin saved
Then you can use the cc.en.100.bin
model file as usual.
>>> import fasttext
>>> import fasttext.util
>>> ft = fasttext.load_model('cc.en.300.bin')
>>> ft.get_dimension()
300
>>> fasttext.util.reduce_model(ft, 100)
>>> ft.get_dimension()
100
Then you can use the ft model object as usual:
>>> ft.get_word_vector('hello').shape
(100,)
>>> ft.get_nearest_neighbors('hello')
[(0.775576114654541, u'heyyyy'), (0.7686290144920349, u'hellow'), (0.7663413286209106, u'hello-'), (0.7579624056816101, u'heyyyyy'), (0.7495524287223816, u'hullo'), (0.7473770380020142, u'.hello'), (0.7407292127609253, u'Hiiiii'), (0.7402616739273071, u'hellooo'), (0.7399682402610779, u'hello.'), (0.7396857738494873, u'Heyyyyy')]
or save it for later use:
>>> ft.save_model('cc.en.100.bin')
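Under the hood, get_nearest_neighbors ranks words by cosine similarity to the query word's vector. A minimal pure-Python sketch of that ranking, on a toy set of made-up vectors (the words and values here are purely illustrative):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbors(vectors, query, k=2):
    # Rank every word except the query by similarity, best first,
    # mirroring the (score, word) pairs returned by get_nearest_neighbors.
    scores = [(cosine(vec, vectors[query]), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scores, reverse=True)[:k]

toy = {
    'hello':   [1.0, 0.0, 0.2],
    'hullo':   [0.9, 0.1, 0.2],
    'goodbye': [-0.5, 1.0, 0.0],
}
print(nearest_neighbors(toy, 'hello'))  # 'hullo' ranks first
```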
The word vectors are available in both binary and text formats.
Using the binary models, vectors for out-of-vocabulary words can be obtained with
$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt
where the file oov_words.txt contains out-of-vocabulary words.
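The binary models can do this because each word vector is composed from the vectors of the word's character n-grams, so an unseen word still decomposes into known subwords. A simplified sketch of the n-gram extraction (fastText wraps the word in `<` and `>` boundary markers before slicing; the released models use n-grams of length 5):

```python
def char_ngrams(word, minn=5, maxn=5):
    # fastText pads the word with boundary markers, then extracts all
    # character n-grams with lengths between minn and maxn.
    padded = '<' + word + '>'
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

print(char_ngrams('hello'))  # ['<hell', 'hello', 'ello>']
```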
In the text format, each line contains a word followed by its vector. Values are space-separated, and words are sorted by frequency in descending order. These text models can easily be loaded in Python using the following code:
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    # The first line holds the vocabulary size and the vector dimension.
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # list(...) materializes the values; a bare map object would be
        # exhausted after a single iteration.
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data
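As a quick self-contained check, a loader of this kind can be exercised on a small hand-written file in the same format (a header line with the vocabulary size and dimension, then one word and its values per line; the filename `toy.vec` is arbitrary):

```python
import io

def load_vectors(fname):
    # Read a fastText-style .vec file: header "n d", then "word v1 ... vd".
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data

# Write a toy file with 2 words in dimension 3, then read it back.
with io.open('toy.vec', 'w', encoding='utf-8') as f:
    f.write('2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n')

vectors = load_vectors('toy.vec')
print(vectors['hello'])  # [0.1, 0.2, 0.3]
```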
We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.
More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.
The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.
If you use these word vectors, please cite the following paper:
E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.
The models can be downloaded from:
Afrikaans: bin, text | Albanian: bin, text | Alemannic: bin, text |
Amharic: bin, text | Arabic: bin, text | Aragonese: bin, text |
Armenian: bin, text | Assamese: bin, text | Asturian: bin, text |
Azerbaijani: bin, text | Bashkir: bin, text | Basque: bin, text |
Bavarian: bin, text | Belarusian: bin, text | Bengali: bin, text |
Bihari: bin, text | Bishnupriya Manipuri: bin, text | Bosnian: bin, text |
Breton: bin, text | Bulgarian: bin, text | Burmese: bin, text |
Catalan: bin, text | Cebuano: bin, text | Central Bicolano: bin, text |
Chechen: bin, text | Chinese: bin, text | Chuvash: bin, text |
Corsican: bin, text | Croatian: bin, text | Czech: bin, text |
Danish: bin, text | Divehi: bin, text | Dutch: bin, text |
Eastern Punjabi: bin, text | Egyptian Arabic: bin, text | Emilian-Romagnol: bin, text |
English: bin, text | Erzya: bin, text | Esperanto: bin, text |
Estonian: bin, text | Fiji Hindi: bin, text | Finnish: bin, text |
French: bin, text | Galician: bin, text | Georgian: bin, text |
German: bin, text | Goan Konkani: bin, text | Greek: bin, text |
Gujarati: bin, text | Haitian: bin, text | Hebrew: bin, text |
Hill Mari: bin, text | Hindi: bin, text | Hungarian: bin, text |
Icelandic: bin, text | Ido: bin, text | Ilokano: bin, text |
Indonesian: bin, text | Interlingua: bin, text | Irish: bin, text |
Italian: bin, text | Japanese: bin, text | Javanese: bin, text |
Kannada: bin, text | Kapampangan: bin, text | Kazakh: bin, text |
Khmer: bin, text | Kirghiz: bin, text | Korean: bin, text |
Kurdish (Kurmanji): bin, text | Kurdish (Sorani): bin, text | Latin: bin, text |
Latvian: bin, text | Limburgish: bin, text | Lithuanian: bin, text |
Lombard: bin, text | Low Saxon: bin, text | Luxembourgish: bin, text |
Macedonian: bin, text | Maithili: bin, text | Malagasy: bin, text |
Malay: bin, text | Malayalam: bin, text | Maltese: bin, text |
Manx: bin, text | Marathi: bin, text | Mazandarani: bin, text |
Meadow Mari: bin, text | Minangkabau: bin, text | Mingrelian: bin, text |
Mirandese: bin, text | Mongolian: bin, text | Nahuatl: bin, text |
Neapolitan: bin, text | Nepali: bin, text | Newar: bin, text |
North Frisian: bin, text | Northern Sotho: bin, text | Norwegian (Bokmål): bin, text |
Norwegian (Nynorsk): bin, text | Occitan: bin, text | Oriya: bin, text |
Ossetian: bin, text | Palatinate German: bin, text | Pashto: bin, text |
Persian: bin, text | Piedmontese: bin, text | Polish: bin, text |
Portuguese: bin, text | Quechua: bin, text | Romanian: bin, text |
Romansh: bin, text | Russian: bin, text | Sakha: bin, text |
Sanskrit: bin, text | Sardinian: bin, text | Scots: bin, text |
Scottish Gaelic: bin, text | Serbian: bin, text | Serbo-Croatian: bin, text |
Sicilian: bin, text | Sindhi: bin, text | Sinhalese: bin, text |
Slovak: bin, text | Slovenian: bin, text | Somali: bin, text |
Southern Azerbaijani: bin, text | Spanish: bin, text | Sundanese: bin, text |
Swahili: bin, text | Swedish: bin, text | Tagalog: bin, text |
Tajik: bin, text | Tamil: bin, text | Tatar: bin, text |
Telugu: bin, text | Thai: bin, text | Tibetan: bin, text |
Turkish: bin, text | Turkmen: bin, text | Ukrainian: bin, text |
Upper Sorbian: bin, text | Urdu: bin, text | Uyghur: bin, text |
Uzbek: bin, text | Venetian: bin, text | Vietnamese: bin, text |
Volapük: bin, text | Walloon: bin, text | Waray: bin, text |
Welsh: bin, text | West Flemish: bin, text | West Frisian: bin, text |
Western Punjabi: bin, text | Yiddish: bin, text | Yoruba: bin, text |
Zazaki: bin, text | Zeelandic: bin, text |