# Word Embeddings

Documentation:

* https://fasttext.cc/docs/en/english-vectors.html
* https://fasttext.cc/docs/en/unsupervised-tutorial.html
* https://fasttext.cc/docs/en/html/index.html

1. Load the pre-trained word embeddings (`/opt/data/nlp/cc.en.300.bin`) into a fastText model.

In [1]:
import fasttext

In [2]:
model = fasttext.load_model('/opt/data/nlp/cc.en.300.bin')



2. How large is the vocabulary?

In [3]:
print(len(model.words))

2000000


3. How many dimensions does the model have?

In [4]:
model.get_dimension()

300

4. Print the word vector for Munich.

In [5]:
model.get_word_vector('Munich')

array([-0.01772507,  0.04481639, -0.0458482 , -0.01564159,  0.00569844,
       -0.01493978, -0.00475513,  0.05498182,  0.03653041,  0.05107911,
       -0.00792894, -0.04519449,  0.02314676, -0.04049589,  0.0533328 ,
       -0.01538926, -0.07663719, -0.1266364 , -0.0860248 ,  0.01669377,
       -0.06406741, -0.01205093,  0.10280135, -0.05018425, -0.00066486,
       -0.08689803, -0.03842681, -0.08041406,  0.16761142,  0.11794221,
        0.01121775,  0.00119475, -0.01672026,  0.01051223, -0.08705565,
        0.01279867, -0.04793347,  0.07529095,  0.06565578, -0.03357543,
        0.06183482, -0.01693818,  0.07336422, -0.12507132, -0.06156905,
        0.01393662, -0.02854991,  0.04012572,  0.08620507,  0.07610758,
        0.00872343,  0.12036866, -0.09645183, -0.0086469 , -0.04031846,
       -0.03134852, -0.08990994,  0.01966693,  0.11320422, -0.08001418,
        0.028819  , -0.01371382,  0.06330752,  0.09222437,  0.02637078,
       -0.05378804, -0.09224798, -0.05626002,  0.02702164,  0.03
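fastText can return a vector even for words outside the vocabulary because it also represents each word through character n-grams taken from the word wrapped in `<` and `>` boundary markers. The following is a minimal pure-Python sketch of that n-gram extraction, not the library's implementation. The helper name `char_ngrams` is hypothetical, and the default n-gram length of 5 is an assumption about how this model was trained. The library's own `model.get_subwords('Munich')` reports the subwords it actually uses.

```python
def char_ngrams(word, minn=5, maxn=5):
    """Return the character n-grams of `word`, padded with < and > markers.

    Hypothetical helper for illustration; minn/maxn = 5 is an assumption.
    """
    padded = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


print(char_ngrams("Munich"))
# ['<Muni', 'Munic', 'unich', 'nich>']
```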

5. What are the most similar terms to LRZ?

In [6]:
model.get_nearest_neighbors('LRZ')

[(0.6676446795463562, 'SuperMUC'),
 (0.6408230066299438, 'WLCG'),
 (0.6323488354682922, 'FZI'),
 (0.6296346783638, 'PDSF'),
 (0.624134361743927, 'FZK'),
 (0.6112837195396423, 'ESSL'),
 (0.6090673208236694, 'iVEC'),
 (0.6009016633033752, 'ETHZ'),
 (0.6007493734359741, 'XK7'),
 (0.5999080538749695, 'NERSC')]
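The scores above can be read as cosine similarities between the query vector and each vocabulary vector. The sketch below shows that ranking on a tiny, made-up set of 2-dimensional vectors so that it runs without the model file. The names `cosine_similarity`, `nearest_neighbors`, and `toy_vectors` are hypothetical, and the toy values are not taken from the real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity to a zero vector is undefined; use 0 by convention
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=10):
    """Rank every other word by cosine similarity to `query`, highest first."""
    q = vectors[query]
    scored = [
        (cosine_similarity(q, vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "LRZ": [1.0, 0.0],
    "SuperMUC": [0.9, 0.1],
    "banana": [0.0, 1.0],
}

neighbors = nearest_neighbors("LRZ", toy_vectors)
print(neighbors)
```

With the real model, the same function could be fed vectors from `model.get_word_vector(word)`, although scanning all 2,000,000 words this way would be much slower than the built-in `get_nearest_neighbors`.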

6. Find the analogy for Denmark that matches the relation of Berlin to Germany.

In [8]:
dir(model)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_labels',
 '_words',
 'f',
 'get_analogies',
 'get_dimension',
 'get_input_matrix',
 'get_input_vector',
 'get_labels',
 'get_line',
 'get_nearest_neighbors',
 'get_output_matrix',
 'get_sentence_vector',
 'get_subword_id',
 'get_subwords',
 'get_word_id',
 'get_word_vector',
 'get_words',
 'is_quantized',
 'labels',
 'predict',
 'quantize',
 'save_model',
 'set_args',
 'set_matrices',
 'test',
 'test_label',
 'words']

In [13]:
model.get_analogies('Berlin', 'Germany', 'Denmark')

[(0.7887206673622131, 'Copenhagen'),
 (0.7260220050811768, 'Aarhus'),
 (0.6744620203971863, 'Stockholm'),
 (0.6726727485656738, 'Århus'),
 (0.6658076047897339, 'Frederiksberg'),
 (0.642078697681427, 'Odense'),
 (0.6269904375076294, 'Christiania'),
 (0.6266441345214844, 'Roskilde'),
 (0.6243226528167725, 'Oslo'),
 (0.6190618276596069, 'Copenhagan')]
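Analogy queries of this kind are commonly explained as vector arithmetic: take `Berlin - Germany + Denmark` and look for the words closest to the result. The sketch below reproduces that idea on made-up 3-dimensional vectors, so it is an illustration under stated assumptions rather than the library's code. The names `normalize`, `analogy`, and `toy_vectors` are hypothetical, and normalizing each vector before combining is an assumption about how the query is built.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def analogy(a, b, c, vectors, k=3):
    """Rank words by cosine similarity to normalize(a) - normalize(b) + normalize(c)."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    query = normalize([x - y + z for x, y, z in zip(va, vb, vc)])
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the three input words are excluded from the answer
        score = sum(p * q for p, q in zip(query, normalize(vec)))
        scored.append((score, word))
    scored.sort(reverse=True)
    return scored[:k]


# Made-up toy vectors with axes (capital-ness, Germany, Denmark).
toy_vectors = {
    "Berlin": [1.0, 1.0, 0.0],
    "Germany": [0.0, 1.0, 0.0],
    "Denmark": [0.0, 0.0, 1.0],
    "Copenhagen": [1.0, 0.0, 1.0],
    "Aarhus": [0.3, 0.0, 1.0],
    "Hamburg": [0.2, 1.0, 0.0],
}

result = analogy("Berlin", "Germany", "Denmark", toy_vectors)
print(result)
```

The argument order mirrors the call above: the first word is added, the second subtracted, and the third added.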