# Produce word embeddings

In this notebook, we will produce word embeddings to be used with the soft cosine measure.
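
As a reminder of how the embeddings are consumed, the following is a minimal sketch of the soft cosine measure. The toy vocabulary, the embedding matrix `E`, and the choice of clipped cosine similarity as the term-similarity matrix `S` are assumptions made for illustration; they are not taken from this notebook.

```python
import numpy as np

def soft_cosine(x, y, S):
    """Soft cosine measure between bag-of-words vectors x and y.

    S is a term-similarity matrix. With S equal to the identity matrix,
    this reduces to the ordinary cosine similarity.
    """
    numerator = x @ S @ y
    denominator = np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y)
    return float(numerator / denominator)

# Toy unit-length embeddings for a hypothetical three-word vocabulary.
E = np.array([[1.0, 0.0],
              [0.8, 0.6],
              [0.0, 1.0]])

# Assumed term similarity: cosine similarity of embeddings, clipped at zero.
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
S = np.clip(E_unit @ E_unit.T, 0.0, None)

x = np.array([1.0, 0.0, 0.0])  # a document containing only word 0
y = np.array([0.0, 1.0, 0.0])  # a document containing only word 1

similarity_soft = soft_cosine(x, y, S)          # uses embedding similarity
similarity_hard = soft_cosine(x, y, np.eye(3))  # ordinary cosine similarity
```

The two documents share no words, so the ordinary cosine similarity is zero, while the soft cosine measure credits them with the similarity of their word embeddings.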

In [1]:
! hostname

apollo.fi.muni.cz


## Produce non-positional `word2vec` embeddings

In this section, we will produce word embeddings for global `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
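
To make the format concrete, here is a small sketch that wraps inline math in the special tokens. The helper `mark_math` and the assumption that math is delimited by `$...$` in the input are hypothetical; the preprocessing behind the `make` target may differ.

```python
import re

def mark_math(text: str) -> str:
    """Wrap every $...$ span in [MATH] and [/MATH] tokens (illustrative only)."""
    return re.sub(r"\$(.+?)\$", r"[MATH] \1 [/MATH]", text)

example = mark_math("Solve $x^2 = 4$ for $x$.")
print(example)
```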

In [2]:
%%capture
! make word2vec-text+latex.vec

In [3]:
! head -2 word2vec-text+latex.vec

69177 300
Ġ -0.4911327 0.10599916 0.3505817 0.6833688 -1.3446841 -0.27966222 -0.006334139 -0.24758625 0.36254022 -0.11361687 -0.16100892 -0.11946388 -0.14131856 -0.13496888 0.24247374 0.07401053 -0.3440946 0.08469719 -0.20505634 0.3357777 0.45511454 0.15884897 0.0059555485 -0.88288 0.27665764 0.5844739 0.10001706 0.062370844 -0.035176527 -0.4532718 -0.69284165 0.083393484 0.46663472 0.19422898 0.29948747 0.46402577 0.34319147 -0.5321049 0.30657673 -1.0308042 -0.098499976 0.87004673 0.45476702 -0.027368661 -0.5547552 -0.57541937 0.05903322 0.28396904 0.10996748 0.24960591 -0.098507866 0.12050137 0.41581747 -0.16060467 -0.3960163 -0.91345847 -0.3254867 -0.44595346 -0.6622143 -0.2838936 -0.33838993 -0.3503998 -0.042346112 0.5254559 0.27442947 0.051749878 -0.29121464 0.15925126 0.080277115 -0.27463135 0.7243122 -0.24872757 -0.22137575 0.18234788 0.3587295 0.38679105 -0.010842248 -0.30203652 0.0001471663 -0.47879148 -0.54526114 -0.4690693 0.24877721 0.39388976 0.5369823 -0.01143753 0.838683
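
The output above is the `word2vec` text format: a header line with the vocabulary size and the vector dimensionality, followed by one line per token. The sketch below reads this format from an in-memory sample. The function `read_word2vec_text` and the sample data are made up for illustration, and the sketch assumes that tokens contain no whitespace.

```python
import io
import numpy as np

def read_word2vec_text(stream):
    """Read the word2vec text format into a dict of token -> vector."""
    # Header: "<number of tokens> <dimensionality>"
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        token, *values = line.split()
        vector = np.array([float(value) for value in values], dtype=np.float32)
        if vector.shape != (dim,):
            raise ValueError(f"expected {dim} values for token {token!r}")
        vectors[token] = vector
    if len(vectors) != count:
        raise ValueError(f"expected {count} tokens, found {len(vectors)}")
    return vectors

# A made-up two-token sample, not real output of the make target.
sample = "2 3\nĠ 0.1 0.2 0.3\n= -1.0 0.5 0.0\n"
vectors = read_word2vec_text(io.StringIO(sample))
```

For real files of this size, a library loader such as gensim's `KeyedVectors.load_word2vec_format` is likely a better fit than a hand-written parser.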

### The LaTeX format

To build a separate soft vector space model for math alone, we also use a separate dataset that contains only LaTeX.

In [4]:
%%capture
! make word2vec-latex.vec

In [5]:
! head -2 word2vec-latex.vec

29626 300
= 0.45211563 0.39751762 0.42374533 0.7656501 -0.687599 -1.1972449 -1.1884291 0.4996499 0.1104988 -1.0093739 -0.5944532 1.5990381 -0.87857515 0.4789259 -0.72392815 -0.8281125 -0.34271508 -0.1456393 -1.1095649 0.39455065 0.27540746 1.3610456 -1.2389966 -0.1572963 2.151942 1.2567799 0.43847576 -0.70683247 -0.6971553 0.54050136 0.24628767 -1.2517096 0.14973809 0.36064965 0.6866465 -0.48631003 0.10373527 -0.05338171 -1.2745261 -0.88932055 0.9549716 0.66074157 1.5348703 -0.049554724 -0.4072205 1.047728 1.6011713 1.3103225 -0.64296234 -0.41527897 -0.8602007 1.1396009 0.9975877 -0.32012704 -1.8375577 -0.57838714 -0.6749642 -0.2723207 0.18643135 0.8779331 0.16570298 0.83777815 0.30962312 0.18147355 0.03476536 -1.0669308 -0.6659401 -0.03910051 -0.19898187 -0.98984283 -0.8313048 -1.2218608 -0.28428096 0.58778864 -0.13910562 1.8572913 -0.52205503 0.054531824 -0.47444367 1.3603321 0.81436384 -0.8444932 0.25017482 0.6253206 0.43668544 1.1883595 0.060562678 0.835826 1.4584521 -0.35038212 -1

### The Tangent-L format

To build a separate soft vector space model for math alone, we also use a separate dataset that contains only math in the format used by [the Tangent-L search engine from UWaterloo][2].

 [2]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [6]:
%%capture
! make word2vec-tangentl.vec

In [7]:
! head -2 word2vec-tangentl.vec

1986439 300
(start) -0.7883517 -0.4147513 0.109570004 0.30688292 3.6664822 -1.8741546 0.5164354 -0.3924657 0.7925373 -1.4477867 -0.23767693 -2.0926077 0.4927457 -0.75468713 0.54832906 1.0683583 -0.6374132 0.061881304 0.09244942 -2.2327147 1.1101563 0.5815448 0.2750769 1.1154897 -0.96233326 2.1798615 0.053582694 -0.71024233 0.01958334 0.11164918 -0.2490331 -1.1246321 1.8936965 2.0551243 -0.67028046 1.387179 0.37774748 2.308518 0.79442024 -0.7665613 -1.3263549 1.0679648 -1.3366199 0.31892794 1.0904627 0.6458117 -1.3810245 -0.059227582 0.41283834 -0.6207842 -1.3623626 1.7307568 0.70670766 1.8203899 -0.8782172 -0.6903082 2.0658896 -0.8475402 0.7414832 0.0026021323 -1.5228511 -1.5677015 0.38836935 -0.15504538 -1.664133 1.1359817 0.32770953 -0.7394564 -1.185888 -0.38944528 -1.1922969 1.1946704 2.3373098 -2.6557286 -0.37687904 -1.317471 -0.15844513 -0.19751823 0.62681144 -0.6336746 0.55956715 0.19937158 0.012083598 0.8902954 -0.10233464 0.8064445 0.29683325 0.23845148 -1.2934929 0.43259752 -1

## Produce positional `word2vec` embeddings

In this section, we will produce word embeddings for global `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
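
For intuition, the sketch below contrasts the two ways of building a context vector in a CBOW-style model. It assumes the common formulation of positional weighting, in which each context word vector is multiplied elementwise by a learned vector for its position before averaging. The function names and the random vectors are hypothetical, and the implementation in the linked library may differ in detail.

```python
import numpy as np

def cbow_context(context_vectors):
    """Non-positional context: plain average of the context word vectors."""
    return context_vectors.mean(axis=0)

def positional_cbow_context(context_vectors, positional_vectors):
    """Positional context: reweight each context vector elementwise by the
    vector of its position, then average."""
    return (positional_vectors * context_vectors).mean(axis=0)

rng = np.random.default_rng(0)
window, dim = 2, 4  # two words on each side, toy dimensionality

# One row per context position (offsets -2, -1, +1, +2).
context_vectors = rng.normal(size=(2 * window, dim))
positional_vectors = rng.normal(size=(2 * window, dim))

plain = cbow_context(context_vectors)
weighted = positional_cbow_context(context_vectors, positional_vectors)
```

When every positional vector is all ones, the positional context vector coincides with the plain average, so the non-positional model can be seen as a special case.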

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [8]:
%%capture
! make word2vec-text+latex-positional.vec

In [9]:
! head -2 word2vec-text+latex-positional.vec

69177 300
Ġ 0.16035363 -0.30829433 -0.17519654 0.28695825 -0.031610817 -0.53657657 -0.08200527 -0.71196944 0.1975642 0.79927015 -1.4589858 -0.25060907 0.1409496 0.51592255 0.14705835 1.0694468 0.1486603 -0.7698888 0.30867752 0.15096696 -0.76286983 0.3385864 -0.2529503 -1.0759047 -1.0047637 1.0050743 -1.2993101 0.15850481 -0.06875775 0.7808354 -0.3946995 -0.9622622 -0.16689946 -0.11755313 -0.08954904 0.77235913 -0.24658005 0.16299461 0.23526783 -0.13695192 -0.49874997 -0.25241584 0.06373526 0.20177509 -0.5996808 0.60709125 -0.13289207 0.22112171 -0.6708566 -0.34966278 -0.2291605 0.25201464 -0.19016406 0.14451861 -0.26050755 -0.7426428 -0.7466208 -0.4368737 0.2527407 -0.6455509 -0.30178457 -0.20722939 -0.15675136 -0.43205494 -0.57021964 0.14128122 0.069268554 -0.13737638 -0.13881494 -0.15382406 -0.4104811 -0.44630072 -0.22543418 0.5596907 -0.22434202 -0.2929695 0.2014402 0.027469423 0.16393778 -0.09448801 0.5805148 -0.09726634 0.39869404 0.3131959 0.19083415 -0.22665048 0.10135016 -0.217

### The LaTeX format

To build a separate soft vector space model for math alone, we also use a separate dataset that contains only LaTeX.

In [10]:
%%capture
! make word2vec-latex-positional.vec

In [11]:
! head -2 word2vec-latex-positional.vec

29626 300
= 0.13982615 -0.56022334 -0.28534672 0.45654088 -0.50039023 -0.40535104 0.8029495 -0.045476936 -0.16988455 -0.27256945 -0.1577064 -0.73274046 0.6016116 0.54523355 0.15174997 -0.21776019 -0.7108013 -0.0073919306 -0.32877383 -1.1984626 0.18486197 0.10209546 2.0590396 -0.058102265 -0.052770197 0.026080692 0.48529533 -0.6797519 -0.37992993 -0.27292103 -0.14054266 -0.23104043 0.0037257895 -0.38512194 -1.6816893 -0.69398135 -0.50314385 0.025338978 -0.5085116 -0.052280564 0.07204428 -0.047413018 1.0062249 -0.7108461 0.067347944 -0.51193166 -1.129081 0.02056672 -0.1928178 -0.33797947 1.5729941 0.91550654 0.7663111 0.3601008 1.0404599 -0.6997386 -1.6628864 1.3028282 0.6607716 0.5822879 -0.55878335 -1.0914717 1.0250407 0.7648821 -0.48592326 1.2578444 0.53003174 -1.3726004 0.32341287 1.3854774 0.19301677 -0.55102706 0.66091806 -0.41993397 -1.0413152 1.3418113 0.033148527 -1.6679255 1.3095134 -0.7279497 -0.6734412 -1.9760101 -0.08267678 0.2894707 -0.6673951 2.2595806 0.9393743 -0.9024336

### The Tangent-L format

To build a separate soft vector space model for math alone, we also use a separate dataset that contains only math in the format used by [the Tangent-L search engine from UWaterloo][2].

 [2]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [12]:
%%capture
! make word2vec-tangentl-positional.vec

In [13]:
! head -2 word2vec-tangentl-positional.vec

1986439 300
(start) -1.7576293 0.19926013 2.1739433 -0.10596601 1.0934087 -1.5911621 0.19819638 0.09222508 1.3350728 0.042416114 -0.10117604 -0.021549191 0.31576115 1.4128927 -2.547857 0.10248722 -1.6801174 -0.20031011 0.34349585 -0.008054039 -0.23402317 0.17021184 0.6303321 0.03904953 0.23646358 0.5115825 -0.5102902 -0.8792905 1.2061231 0.34561715 0.0322433 -0.23454382 -0.11002169 -1.0626395 -0.27713925 -1.2055808 0.026971424 -0.16196723 0.018942563 1.8719597 -0.2117105 0.021235488 -0.9593002 -0.06983088 -0.041619234 -0.018843519 0.052969422 -0.035224788 -0.16969824 0.14895898 -0.4481881 0.049159262 0.05333016 -0.630466 0.041137047 -0.20374206 -2.2369826 -0.1592554 -0.014324772 0.43846983 0.5197639 -0.71592665 1.2351409 1.439014 0.66466594 0.30253163 -2.385935 -0.7764015 -0.048062928 -0.6276131 0.6590605 0.16693367 2.2304666 -1.8663439 -0.27034754 -0.05648147 0.5388975 0.7610479 1.3103434 -0.19490878 -0.088016 -1.047175 -0.59148896 1.0930003 1.2452099 -0.7901551 1.1972663 -0.69072413 