---
id: crawl-vectors
title: Word vectors for 157 languages
---
We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. We also distribute three new word analogy datasets, for French, Hindi and Polish.
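These hyperparameters correspond to fastText's standard training flags. As a rough sketch only: the position-weighting variant of CBOW used for the released models is not exposed through the public command-line options, so a plain CBOW run is the closest public equivalent, and `corpus.txt` / `my_model` below are placeholder names:

```shell
# Approximation of the published setup: CBOW, 300 dimensions,
# character n-grams of length 5, window size 5, 10 negative samples.
./fasttext cbow -input corpus.txt -output my_model \
  -dim 300 -minn 5 -maxn 5 -ws 5 -neg 10
```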
In order to download with the command line or from Python code, you must have installed the Python package as described here.
$ ./download_model.py en # English
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
(19.78%) [=========> ]
Once the download is finished, use the model as usual:
$ ./fasttext nn cc.en.300.bin 10
Query word?
>>> import fasttext.util
>>> fasttext.util.download_model('en', if_exists='ignore') # English
>>> ft = fasttext.load_model('cc.en.300.bin')
Word vectors for 157 languages are available on the Hugging Face Hub under the fasttext
tag, and more documentation is available here.
>>> import fasttext
>>> from huggingface_hub import hf_hub_download
>>> model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
The pre-trained word vectors we distribute have dimension 300. If you need a smaller size, you can use our dimension reducer. In order to use that feature, you must have installed the Python package as described here.
For example, in order to get vectors of dimension 100:
$ ./reduce_model.py cc.en.300.bin 100
Loading model
Reducing matrix dimensions
Saving model
cc.en.100.bin saved
Then you can use the cc.en.100.bin
model file as usual.
>>> import fasttext
>>> import fasttext.util
>>> ft = fasttext.load_model('cc.en.300.bin')
>>> ft.get_dimension()
300
>>> fasttext.util.reduce_model(ft, 100)
>>> ft.get_dimension()
100
Then you can use the ft model object as usual:
>>> ft.get_word_vector('hello').shape
(100,)
>>> ft.get_nearest_neighbors('hello')
[(0.775576114654541, u'heyyyy'), (0.7686290144920349, u'hellow'), (0.7663413286209106, u'hello-'), (0.7579624056816101, u'heyyyyy'), (0.7495524287223816, u'hullo'), (0.7473770380020142, u'.hello'), (0.7407292127609253, u'Hiiiii'), (0.7402616739273071, u'hellooo'), (0.7399682402610779, u'hello.'), (0.7396857738494873, u'Heyyyyy')]
or save it for later use:
>>> ft.save_model('cc.en.100.bin')
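Under the hood, get_nearest_neighbors ranks words by cosine similarity to the query word's vector. A minimal pure-Python sketch of that ranking, on a toy set of made-up vectors (the words and values here are purely illustrative):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbors(vectors, query, k=2):
    # Rank every word except the query by similarity, best first,
    # mirroring the (score, word) pairs returned by get_nearest_neighbors.
    scores = [(cosine(vec, vectors[query]), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scores, reverse=True)[:k]

toy = {
    'hello':   [1.0, 0.0, 0.2],
    'hullo':   [0.9, 0.1, 0.2],
    'goodbye': [-0.5, 1.0, 0.0],
}
print(nearest_neighbors(toy, 'hello'))  # 'hullo' ranks first
```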
The word vectors are available in both binary and text formats.
Using the binary models, vectors for out-of-vocabulary words can be obtained with
$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt
where the file oov_words.txt contains out-of-vocabulary words.
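The binary models can do this because each word vector is composed from the vectors of the word's character n-grams, so an unseen word still decomposes into known subwords. A simplified sketch of the n-gram extraction (fastText wraps the word in `<` and `>` boundary markers before slicing; the released models use n-grams of length 5):

```python
def char_ngrams(word, minn=5, maxn=5):
    # fastText pads the word with boundary markers, then extracts all
    # character n-grams with lengths between minn and maxn.
    padded = '<' + word + '>'
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

print(char_ngrams('hello'))  # ['<hell', 'hello', 'ello>']
```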
In the text format, each line contains a word followed by its vector. Values are space-separated, and words are sorted by frequency in descending order. These text models can easily be loaded in Python using the following code:
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    # The first line holds the vocabulary size and the vector dimension.
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # list(...) materializes the values; a bare map object would be
        # exhausted after a single iteration.
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data
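As a quick self-contained check, a loader of this kind can be exercised on a small hand-written file in the same format (a header line with the vocabulary size and dimension, then one word and its values per line; the filename `toy.vec` is arbitrary):

```python
import io

def load_vectors(fname):
    # Read a fastText-style .vec file: header "n d", then "word v1 ... vd".
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data

# Write a toy file with 2 words in dimension 3, then read it back.
with io.open('toy.vec', 'w', encoding='utf-8') as f:
    f.write('2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n')

vectors = load_vectors('toy.vec')
print(vectors['hello'])  # [0.1, 0.2, 0.3]
```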
We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.
More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.
The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.
If you use these word vectors, please cite the following paper:
E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.
The models can be downloaded from:
Afrikaans: bin, text | Albanian: bin, text | Alemannic: bin, text |
Amharic: bin, text | Arabic: bin, text | Aragonese: bin, text |
Armenian: bin, text | Assamese: bin, text | Asturian: bin, text |
Azerbaijani: bin, text | Bashkir: bin, text | Basque: bin, text |
Bavarian: bin, text | Belarusian: bin, text | Bengali: bin, text |
Bihari: bin, text | Bishnupriya Manipuri: bin, text | Bosnian: bin, text |
Breton: bin, text | Bulgarian: bin, text | Burmese: bin, text |
Catalan: bin, text | Cebuano: bin, text | Central Bicolano: bin, text |
Chechen: bin, text | Chinese: bin, text | Chuvash: bin, text |
Corsican: bin, text | Croatian: bin, text | Czech: bin, text |
Danish: bin, text | Divehi: bin, text | Dutch: bin, text |
Eastern Punjabi: bin, text | Egyptian Arabic: bin, text | Emilian-Romagnol: bin, text |
English: bin, text | Erzya: bin, text | Esperanto: bin, text |
Estonian: bin, text | Fiji Hindi: bin, text | Finnish: bin, text |
French: bin, text | Galician: bin, text | Georgian: bin, text |
German: bin, text | Goan Konkani: bin, text | Greek: bin, text |
Gujarati: bin, text | Haitian: bin, text | Hebrew: bin, text |
Hill Mari: bin, text | Hindi: bin, text | Hungarian: bin, text |
Icelandic: bin, text | Ido: bin, text | Ilokano: bin, text |
Indonesian: bin, text | Interlingua: bin, text | Irish: bin, text |
Italian: bin, text | Japanese: bin, text | Javanese: bin, text |
Kannada: bin, text | Kapampangan: bin, text | Kazakh: bin, text |
Khmer: bin, text | Kirghiz: bin, text | Korean: bin, text |
Kurdish (Kurmanji): bin, text | Kurdish (Sorani): bin, text | Latin: bin, text |
Latvian: bin, text | Limburgish: bin, text | Lithuanian: bin, text |
Lombard: bin, text | Low Saxon: bin, text | Luxembourgish: bin, text |
Macedonian: bin, text | Maithili: bin, text | Malagasy: bin, text |
Malay: bin, text | Malayalam: bin, text | Maltese: bin, text |
Manx: bin, text | Marathi: bin, text | Mazandarani: bin, text |
Meadow Mari: bin, text | Minangkabau: bin, text | Mingrelian: bin, text |
Mirandese: bin, text | Mongolian: bin, text | Nahuatl: bin, text |
Neapolitan: bin, text | Nepali: bin, text | Newar: bin, text |
North Frisian: bin, text | Northern Sotho: bin, text | Norwegian (Bokmål): bin, text |
Norwegian (Nynorsk): bin, text | Occitan: bin, text | Oriya: bin, text |
Ossetian: bin, text | Palatinate German: bin, text | Pashto: bin, text |
Persian: bin, text | Piedmontese: bin, text | Polish: bin, text |
Portuguese: bin, text | Quechua: bin, text | Romanian: bin, text |
Romansh: bin, text | Russian: bin, text | Sakha: bin, text |
Sanskrit: bin, text | Sardinian: bin, text | Scots: bin, text |
Scottish Gaelic: bin, text | Serbian: bin, text | Serbo-Croatian: bin, text |
Sicilian: bin, text | Sindhi: bin, text | Sinhalese: bin, text |
Slovak: bin, text | Slovenian: bin, text | Somali: bin, text |
Southern Azerbaijani: bin, text | Spanish: bin, text | Sundanese: bin, text |
Swahili: bin, text | Swedish: bin, text | Tagalog: bin, text |
Tajik: bin, text | Tamil: bin, text | Tatar: bin, text |
Telugu: bin, text | Thai: bin, text | Tibetan: bin, text |
Turkish: bin, text | Turkmen: bin, text | Ukrainian: bin, text |
Upper Sorbian: bin, text | Urdu: bin, text | Uyghur: bin, text |
Uzbek: bin, text | Venetian: bin, text | Vietnamese: bin, text |
Volapük: bin, text | Walloon: bin, text | Waray: bin, text |
Welsh: bin, text | West Flemish: bin, text | West Frisian: bin, text |
Western Punjabi: bin, text | Yiddish: bin, text | Yoruba: bin, text |
Zazaki: bin, text | Zeelandic: bin, text |