# Exploring word vectors

This notebook provides very basic tools for exploring the contents of word embeddings.

NOTE:
This notebook requires Gensim and will not work without it. If you are using Anaconda, open your conda prompt and type ```conda install -c conda-forge gensim```. You may also need to update smart_open by typing ```conda update smart_open```.
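As a quick sanity check before running the notebook, the following sketch reports whether the two packages mentioned above can be found in the current environment. The helper name ```missing_packages``` is hypothetical and not part of this project; it only uses the standard library.

```python
from importlib.util import find_spec


def missing_packages(names=("gensim", "smart_open")):
    """Return the packages from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("gensim and smart_open are importable.")
```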

If everything works, the following line should run without errors.

In [1]:
from explore_embeddings import VectorExplorer

First, we initialize the VectorExplorer object and load our vector file. If you produced your word embeddings from the test corpus included in ```corpora/akkadian.zip```, you can also load the dictionary file. If you don't have a dictionary, just comment out the ```read_dict``` line.

The dictionary is a TSV file that lists each lemma of the test corpus with its frequency and translations:

| Field1 | Field2 | Field3 |
| -- | -- | -- |
| lemma | frequency | translations separated by ```;``` |
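To make the three-field layout concrete, here is a minimal sketch of how such a file could be read. It is not the code used by ```VectorExplorer```; the function names are hypothetical, and the sample line is illustrative (the exact content of the translation field in the real file may differ).

```python
def parse_dict_line(line):
    """Split one TSV line into (lemma, frequency, list of translations)."""
    lemma, freq, translations = line.rstrip("\n").split("\t", 2)
    # Translations are separated by ';' in the third field.
    items = [t.strip() for t in translations.split(";") if t.strip()]
    return lemma, int(freq), items


def load_dictionary(path):
    """Read the whole file into {lemma: (frequency, translations)}."""
    entries = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip empty lines
                lemma, freq, items = parse_dict_line(line)
                entries[lemma] = (freq, items)
    return entries


# Illustrative line only; assumed to resemble an entry of the dictionary file.
sample = "ṭarādu_I\t24\tsend;one who drives away;send off"
print(parse_dict_line(sample))
```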


In [2]:
embeddings = VectorExplorer()
embeddings.read_vectors('akkadian.vec')
embeddings.read_dict('akk_corpus.dict')

## Finding nearest neighbors by lemma

You can query the embeddings with the ```nearest_neighbors()``` function. It takes three arguments:


| Parameter | Type | What it does |
| :- | -: | :- |
| **word** | str | Word you want to find the nearest neighbors for. |
| **num** | int | How many nearest neighbors you want to find. |
| **min_freq** | int | Do not print nearest neighbors if they are rarer than this (requires a dictionary). |


In [3]:
embeddings.nearest_neighbors('nêru_I', num=10, min_freq=2)

----------------------------------------------------------------
nêru_I strike[110];killer[11]
----------------------------------------------------------------
0.853664 gērû_I                    72     opponent[72]                            
0.837647 zāʾeru_1                  61     enemy[61]                               
0.833618 zāwiānu_I                 48     enemy[48]                               
0.829792 ayyābu_I                  121    enemy[121]                              
0.791450 ṭarādu_I                  24     send[12];one who drives away[9];send off[3]
0.771796 qemû_1                    4      grind[4]                                
0.767695 nakāpu_I                  16     push[11];one who gores[5]               
0.754151 šūbu_1                    2      rush[2]                                 
0.748208 šuknušu_I                 37     one who makes someone bow down[35];humble[2]
0.748008 kāšidu_I                  100    conqueror[98];conquering[2]             
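The ranking above can be illustrated with a small, self-contained sketch. It assumes that the score in the first column is a cosine similarity (the usual choice for word embeddings, but not confirmed by this notebook) and that ```min_freq``` keeps words whose frequency is at least the given value. The function names and the toy data are hypothetical and unrelated to the Akkadian corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, freqs, word, num=10, min_freq=1):
    """Rank all other words by similarity to `word`, skipping rare ones."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word and freqs.get(other, 0) >= min_freq
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:num]


# Toy two-dimensional vectors and frequencies, for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.05]}
toy_freqs = {"a": 10, "b": 5, "c": 7, "d": 1}

for score, word in nearest(toy_vectors, toy_freqs, "a", num=2, min_freq=2):
    print(f"{score:.6f} {word}")
```

In this toy data, ```d``` is very close to ```a``` but is filtered out because its frequency is below ```min_freq```.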


In the test corpus, lemmata have been disambiguated by meaning. Thus for ```rabû``` there are ```rabû_I```, ```rabû_II```, etc. You do not have to include the Roman numerals in your queries: VectorExplorer will find all the lemmata that match your search.

In [4]:
embeddings.nearest_neighbors('sisû', num=10, min_freq=2)

----------------------------------------------------------------
sisû_I horse[967]
----------------------------------------------------------------
0.847284 ṣimittu_I                 52     binding[52]                             
0.823162 parû_I                    114    mule[114]                               
0.810611 udru_I                    23     Bactrian camel[23]                      
0.743731 kūdanu_I                  122    mule[122]                               
0.720178 nīru_I                    316    yoke[315];(ornament for a yoke)[1]      
0.712449 agālu_I                   8      donkey[8]                               
0.706719 attartu_I                 5      (military) cart[5]                      
0.687537 rukūbu_I                  32     vehicle[32]                             
0.679626 urû_I                     62     team[57];team of equids[4];attendant of teams[1]
0.672090 halluptu_1                8      armour[8]                               
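One way to resolve a query without its disambiguation suffix is to compare it against the part of each lemma before the last underscore. The sketch below shows this idea; it is an assumption about how such matching could work, not the actual ```VectorExplorer``` logic, and ```matching_lemmata``` and the toy vocabulary are hypothetical.

```python
def matching_lemmata(query, vocabulary):
    """Return vocabulary entries that match `query`.

    A query containing '_' is treated as fully specified and matched exactly.
    Otherwise it is compared with each lemma stripped of its '_suffix'.
    """
    if "_" in query:
        return [lemma for lemma in vocabulary if lemma == query]
    return [lemma for lemma in vocabulary if lemma.rsplit("_", 1)[0] == query]


# Toy vocabulary for illustration only.
vocabulary = ["rabû_I", "rabû_II", "rabûtu_I", "sisû_I"]
print(matching_lemmata("rabû", vocabulary))
print(matching_lemmata("rabû_II", vocabulary))
```

Comparing the stripped lemma for equality, rather than using a prefix test, keeps ```rabû``` from also matching longer lemmata such as ```rabûtu_I```.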


## Finding nearest neighbors by English translation

If you have a dictionary file, you can also find words by their English translations with the ```nn_by_translation()``` function. It takes the same arguments as ```nearest_neighbors()```, plus the following:

| Parameter | Type | What it does |
| :- | -: | :- |
| **start** | bool | If set to True, only matches if a translation starts with the given word. Without this, ```eat``` will also match ```beat```, ```great```, etc. |
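The difference between the two modes can be sketched as follows. This is a guess at the behavior, not the actual implementation: since the ```silver``` query below matches ```(a silver alloy)``` with ```start=True```, the sketch assumes that the query may begin at the start of any word in the translation. The function name ```translation_matches``` is hypothetical.

```python
import re


def translation_matches(query, translation, start=False):
    """Check whether `query` occurs in `translation`.

    start=False: plain substring search.
    start=True:  the query must begin at a word boundary (assumed behavior).
    """
    if start:
        return re.search(r"\b" + re.escape(query), translation) is not None
    return query in translation


for candidate in ("eat", "beat", "great", "(a silver alloy)"):
    print(
        candidate,
        translation_matches("eat", candidate),
        translation_matches("eat", candidate, start=True),
    )
```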


In [5]:
embeddings.nn_by_translation('silver', num=10, min_freq=2, start=True)

----------------------------------------------------------------
ešmarû_I (a silver alloy)[25]
----------------------------------------------------------------
0.817299 ihzētu_I                  13     decorative inlay[13]                    
0.812660 lahmu_I                   14     hairy[14]                               
0.781151 patāqu_I                  117    brickwork[75];make brick structures[33];shaper[7];builder[1];shape[1]
0.758944 zahalû_I                  46     silver alloy[46]                        
0.731947 ebbu_I                    80     bright[80]                              
0.725137 pašallu_I                 14     (an alloy of gold)[14]                  
0.718554 urudû_I                   14     copper[14]                              
0.684151 ṣāriru_I                  13     flashing red[13]                        
0.663184 tiṣbutu_I                 4      linked[4]                               
0.656203 gattu_I                   34     form[34]              

In [7]:
embeddings.nn_by_translation('elephant', num=10, min_freq=2, start=False)

----------------------------------------------------------------
pīru_I elephant[103]
----------------------------------------------------------------
0.870922 pašhu_1                   8      (a hand-held weapon)[8]                 
0.863017 pagû_I                    16     monkey[16]                              
0.770402 turāhu_I                  4      ibex[4]                                 
0.753727 lurmu_I                   11     ostrich[11]                             
0.752174 illūru_I                  4      (a flower)[4]                           
0.749477 būṣu_I                    4      hyaena[4]                               
0.737633 ayyalu_I                  13     stag[13]                                
0.729275 serrēmu_I                 11     onager[11]                              
0.704693 nālu_1                    2      roe deer?[2]                            
0.704264 ṣabītu_I                  15     gazelle[15]                             

## TODO:

- Unescape HTML entities
- Analogies
- POS filters
- Gephi export