# Train `word2vec` models

In this notebook, we will train `word2vec` models that can represent math-specific tokens.

In [1]:
! hostname

mir


In [2]:
%%capture
! pip install .[word2vec]

## Train non-positional `word2vec` models

In this section, we will train `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
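To make the format concrete, here is a minimal sketch of how a sentence with inline math could be mapped to it. The helper `to_text_latex` is hypothetical, and the sketch assumes that formulas arrive delimited by single dollar signs; the actual preprocessing behind `make word2vec-text+latex` is not shown in this notebook and may differ.

```python
import re

# Assumption: inline math is delimited by single dollar signs in the raw input.
MATH_PATTERN = re.compile(r"\$(.+?)\$")


def to_text_latex(paragraph: str) -> str:
    """Wrap every inline formula in [MATH] ... [/MATH] marker tokens (hypothetical helper)."""
    return MATH_PATTERN.sub(
        lambda match: f"[MATH] {match.group(1).strip()} [/MATH]",
        paragraph,
    )


example = to_text_latex("The equation $x^2 + y^2 = z^2$ has integer solutions.")
print(example)
```

The marker tokens let a single model see both the surrounding words and the LaTeX tokens while still knowing where each formula starts and ends.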

In [3]:
%%capture
! make word2vec-text+latex

In [4]:
! du -hc word2vec-text+latex

491M	word2vec-text+latex/model/custom-en-word2vec_cbow-epochs=1
491M	word2vec-text+latex/model
0	word2vec-text+latex/cache/custom-en-word2vec_cbow-epochs=1
0	word2vec-text+latex/cache
491M	word2vec-text+latex
491M	total


In [5]:
! head -n 2 word2vec-text+latex/model/*/model.vec

71540 300
Ġ -0.08767373 0.45843008 -0.008227732 -0.12825662 0.40604976 -0.08795726 0.25151646 0.6583432 -0.2843155 -0.36248145 -0.90115786 -0.040905494 0.5283086 -0.059527937 0.0679248 -0.9048535 -1.3493316 -0.029202132 0.537324 -1.0813957 -0.0958676 0.20301759 -0.42168352 0.2544472 0.40478817 -0.16245314 -0.87397313 0.1526849 -0.84534985 0.3666558 0.43923587 0.1550599 0.116662145 -0.545336 0.21840543 -0.5227038 -0.822009 0.64769906 -0.080697194 0.14055185 -0.39671564 -0.19012949 0.18417068 -0.1396354 -0.14135444 -0.009711535 0.5304928 0.11867654 0.29499966 -0.085327715 -0.34632125 -0.8764067 0.9665256 -0.3684227 0.4612774 -0.4608598 1.1773355 -0.25845176 0.7018656 -0.33604014 -0.6696823 -0.58044964 -0.6593275 0.4494126 -0.2941379 -0.0895813 0.17626098 0.326288 0.7238655 0.52115345 -0.36118248 -0.04045876 -0.740491 -0.8278218 -0.030319996 -0.21983442 -0.39075902 -0.23682156 -0.2234199 -0.8090305 -0.5750199 -0.38446173 -0.5580111 -0.35380092 -0.12416926 -0.40380785 0.08775886 -0.6330915
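The output above follows the word2vec text format: a header line with the vocabulary size and the vector dimensionality, then one line per token with its vector components. The `Ġ` prefix looks like the marker that byte-level BPE tokenizers use for a leading space; that reading is an assumption, since the tokenizer is not described here. Below is a small sketch of a reader for this format. The function name `read_word2vec_text` and the sample data are illustrative, not part of the project.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_word2vec_text(
    stream: TextIO, limit: int = 5
) -> Tuple[int, int, Dict[str, List[float]]]:
    """Read the header and at most `limit` vectors from a word2vec text stream."""
    # Header line: "<vocabulary size> <vector dimensionality>"
    vocab_size, dim = (int(field) for field in stream.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        if len(vectors) >= limit:
            break
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so that exactly `dim` numbers are peeled off.
        token, *values = line.rstrip().rsplit(" ", dim)
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for token {token!r}")
        vectors[token] = [float(value) for value in values]
    return vocab_size, dim, vectors


# Tiny hand-made sample, not real model output.
sample = io.StringIO("2 3\nĠthe 0.1 0.2 0.3\n% 1.0 -2.0 0.5\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
print(vocab_size, dim, sorted(vectors))
```

To inspect a trained model, the same function can be given a file opened with `open(path, encoding="utf-8")`. For real work, gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` reads this format directly.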

### The text format

To train a `word2vec` model on text alone, we use a separate dataset that contains only text.

In [6]:
%%capture
! make word2vec-text

In [7]:
! du -hc word2vec-text

339M	word2vec-text/model/custom-en-word2vec_cbow-epochs=2
339M	word2vec-text/model
0	word2vec-text/cache/custom-en-word2vec_cbow-epochs=2
0	word2vec-text/cache
339M	word2vec-text
339M	total


In [8]:
! head -n 2 word2vec-text/model/*/model.vec

49310 300
Ġthe 0.7032212 0.03634197 -0.5135605 0.3357469 0.7312258 0.351296 -0.896176 -0.3988427 -0.10211854 -0.11711154 0.32692328 0.29528573 0.12167 0.34439772 0.08338026 0.20829177 -0.36266616 -0.5905029 0.47458592 -0.35525286 0.013833059 0.6882711 -1.1641 -0.038543947 -1.1867766 -0.6374976 0.0148884 0.6491241 0.05241671 0.35373393 -0.103352964 -0.17781082 -0.05479559 -1.2284712 -0.590193 0.19130298 -0.31961197 -0.61971647 0.70202094 0.169576 0.69972306 0.2638736 0.32178205 0.283192 -0.9492635 -0.34824353 -0.017195312 -0.07383108 0.5235321 -0.98763424 -0.7479147 0.36549878 -0.16234569 -0.1586904 0.23776159 -1.0839185 0.24051921 0.11132975 -0.023195984 0.39197168 1.5016292 0.1194545 -0.15831332 0.6309355 0.6230399 -0.45612815 -0.33054242 -0.22435433 -0.5830698 -0.8330652 0.26828966 -1.0591255 0.62596726 0.117289886 0.35900852 -0.106096074 -0.4572636 0.53354627 0.15884255 0.055366952 0.101207554 -0.20706625 0.29750574 0.22175674 -0.12139011 0.38055658 0.5777266 -0.69356936 0.08338858 

### The LaTeX format

To train a `word2vec` model on math alone, we use a separate dataset that contains only LaTeX.

In [9]:
%%capture
! make word2vec-latex

In [10]:
! du -hc word2vec-latex

203M	word2vec-latex/model/custom-en-word2vec_cbow-epochs=5
203M	word2vec-latex/model
0	word2vec-latex/cache/custom-en-word2vec_cbow-epochs=5
0	word2vec-latex/cache
203M	word2vec-latex
203M	total


In [11]:
! head -n 2 word2vec-latex/model/*/model.vec

29675 300
% -0.8123627 -0.4865128 -1.6376101 -1.0977508 -0.0026990217 -0.81645113 1.092308 0.4254648 -1.7375989 1.0688992 -0.9607899 -0.84274924 0.030448405 -0.16411988 -0.11348923 1.3030134 1.5414021 0.4524594 1.6606439 -0.1709696 0.468996 -0.006377744 0.03779586 0.19229876 1.8753405 -0.7870954 1.3225005 1.5892274 0.7606449 0.46662828 -1.3111341 -0.16214229 -0.43260443 -1.8870902 -0.39288288 -2.175066 0.6409486 -1.2540636 -0.69904655 -1.4974338 3.1012635 0.9241071 0.57255584 -0.4895081 0.404196 1.4884887 -0.98764694 0.28389123 -0.3655657 0.72464097 2.0262249 -0.0961223 -0.14714527 -0.13083251 -2.4775496 -1.8082969 -2.3877156 0.37305248 0.5121084 -3.3746302 0.06849546 -0.42158577 1.6311511 0.006520479 -0.5237138 2.7410126 0.8662932 1.7007933 1.372537 -0.01475599 0.26689458 1.2083758 0.8985544 0.35812837 -0.84248924 0.27347162 0.51197463 0.94877106 -0.8685739 1.3737218 0.26224765 -1.1566223 0.16475773 1.4718179 0.29538846 1.2587968 -0.5608657 0.85403377 0.61386347 1.0316347 -2.148078 0.

### The Tangent-L format

To train a `word2vec` model on math alone, we use a separate dataset in the format used by [the Tangent-L search engine from UWaterloo][2].

 [2]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [12]:
%%capture
! make word2vec-tangentl

In [13]:
! du -hc word2vec-tangentl

14G	word2vec-tangentl/model/custom-en-word2vec_cbow-epochs=2
14G	word2vec-tangentl/model
0	word2vec-tangentl/cache/custom-en-word2vec_cbow-epochs=2
0	word2vec-tangentl/cache
14G	word2vec-tangentl
14G	total


In [14]:
! head -n 2 word2vec-tangentl/model/*/model.vec

1986439 300
(start) -0.7883517 -0.4147513 0.109570004 0.30688292 3.6664822 -1.8741546 0.5164354 -0.3924657 0.7925373 -1.4477867 -0.23767693 -2.0926077 0.4927457 -0.75468713 0.54832906 1.0683583 -0.6374132 0.061881304 0.09244942 -2.2327147 1.1101563 0.5815448 0.2750769 1.1154897 -0.96233326 2.1798615 0.053582694 -0.71024233 0.01958334 0.11164918 -0.2490331 -1.1246321 1.8936965 2.0551243 -0.67028046 1.387179 0.37774748 2.308518 0.79442024 -0.7665613 -1.3263549 1.0679648 -1.3366199 0.31892794 1.0904627 0.6458117 -1.3810245 -0.059227582 0.41283834 -0.6207842 -1.3623626 1.7307568 0.70670766 1.8203899 -0.8782172 -0.6903082 2.0658896 -0.8475402 0.7414832 0.0026021323 -1.5228511 -1.5677015 0.38836935 -0.15504538 -1.664133 1.1359817 0.32770953 -0.7394564 -1.185888 -0.38944528 -1.1922969 1.1946704 2.3373098 -2.6557286 -0.37687904 -1.317471 -0.15844513 -0.19751823 0.62681144 -0.6336746 0.55956715 0.19937158 0.012083598 0.8902954 -0.10233464 0.8064445 0.29683325 0.23845148 -1.2934929 0.43259752 -1
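The header lines printed so far give a rough way to compare model sizes. The sketch below estimates the size of a single dense embedding matrix from each header, assuming 32-bit floats. That storage assumption is mine, and the `du` totals also cover whatever else the model directory holds, so the estimate is at best a lower bound on the disk usage reported above.

```python
def dense_matrix_mib(vocab_size: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size in MiB of one vocab_size x dim matrix (assumes float32 by default)."""
    return vocab_size * dim * bytes_per_value / 2**20


# (vocabulary size, dimensionality) copied from the `head -n 2` outputs above.
headers = {
    "text+latex": (71540, 300),
    "text": (49310, 300),
    "latex": (29675, 300),
    "tangentl": (1986439, 300),
}

for name, (vocab_size, dim) in headers.items():
    print(f"{name:>10}: {dense_matrix_mib(vocab_size, dim):8.1f} MiB")
```

The Tangent-L vocabulary has roughly 67 times as many entries as the LaTeX vocabulary, which is consistent with its model directory being far larger than the others.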

## Train positional `word2vec` models

In this section, we will train `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
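As a reminder of what positional weighting changes, the following sketch contrasts the two ways of building a CBOW context vector. Plain CBOW averages the context word vectors; positional weighting first scales each context vector elementwise by a vector learned for its position in the window. The `constrained_dims` parameter reflects my reading of the `constrained_positional` model name, namely that only a subset of dimensions is weighted. That reading is an assumption, and none of the function names below come from the linked library.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def cbow_context(context: Sequence[Vector]) -> Vector:
    """Plain CBOW: average the context word vectors, ignoring their positions."""
    dim = len(context[0])
    return [sum(vec[i] for vec in context) / len(context) for i in range(dim)]


def positional_cbow_context(
    context: Sequence[Vector],
    positional: Sequence[Vector],
    constrained_dims: Optional[int] = None,
) -> Vector:
    """Scale each context vector by the vector for its position, then average.

    If `constrained_dims` is given, only the first that many dimensions are
    scaled and the rest are left untouched (assumed meaning of "constrained").
    """
    dim = len(context[0])
    k = dim if constrained_dims is None else constrained_dims
    weighted = [
        [value * weights[i] if i < k else value for i, value in enumerate(vec)]
        for vec, weights in zip(context, positional)
    ]
    return cbow_context(weighted)


# Toy window with two context words and two-dimensional vectors.
context = [[1.0, 2.0], [3.0, 4.0]]
positional = [[2.0, 2.0], [0.0, 0.0]]
print(cbow_context(context))
print(positional_cbow_context(context, positional))
print(positional_cbow_context(context, positional, constrained_dims=1))
```

In the toy example, the second position is weighted to zero, so only the first context word contributes to the weighted dimensions.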

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [15]:
%%capture
! make word2vec-text+latex-positional

In [16]:
! du -hc word2vec-text+latex-positional

490M	word2vec-text+latex-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=1
490M	word2vec-text+latex-positional/model
0	word2vec-text+latex-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=1
0	word2vec-text+latex-positional/cache
490M	word2vec-text+latex-positional
490M	total


In [17]:
! head -n 2 word2vec-text+latex-positional/model/*/model.vec

71540 300
Ġ -0.1875479 0.8918865 0.30134764 0.61537254 0.16202745 0.16520086 -0.031330816 -1.4616253 -0.66537344 0.12608404 1.8684675 -0.35645586 0.68418777 -0.75702 -0.10578573 -0.42978078 -0.48130286 0.7738346 0.5421547 0.2888756 0.65891993 0.45541206 0.014337258 -0.94012845 0.84462523 0.23259977 -0.37134096 0.40669978 -0.86422944 0.17734031 -0.62940836 -1.7463795 0.8235458 0.30762246 0.7477674 -0.24838255 1.5179554 -0.32424372 0.6378801 0.57103693 0.7150298 1.2847652 0.21121918 0.037991 0.5547662 -0.11271347 1.0556422 -0.47664535 0.23925139 -0.94804865 0.26941147 0.5826693 0.39958215 -0.0810523 -0.86759174 -0.46414074 -1.1166532 -0.8150435 1.0491073 -0.059790947 0.1920825 0.31189322 -0.25114658 0.8361882 -0.32525218 -0.48084393 -0.13130282 0.05037433 -0.19800007 0.583167 -0.19305679 0.25336298 -0.5069497 0.25286612 0.06388154 0.067826465 0.16921827 0.2411752 -0.06670237 -0.1617248 -0.96788156 0.5192936 0.13087822 -0.9724806 -0.050773334 0.60405374 -0.30630124 -0.6006769 -0.6523337 -

### The text format

To train a `word2vec` model on text alone, we use a separate dataset that contains only text.

In [18]:
%%capture
! make word2vec-text-positional

In [19]:
! du -hc word2vec-text-positional

338M	word2vec-text-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=2
338M	word2vec-text-positional/model
0	word2vec-text-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=2
0	word2vec-text-positional/cache
338M	word2vec-text-positional
338M	total


In [20]:
! head -n 2 word2vec-text-positional/model/*/model.vec

49310 300
Ġthe -0.0948922 0.46487233 0.53211135 -0.041445784 1.6362921 -0.10327671 1.3562237 -0.6604223 -0.5201821 -0.074263625 1.2274079 -0.5351085 0.05654604 -1.4765608 0.22930744 0.76847905 -0.22626773 -1.4716115 -0.24517034 1.646543 -0.31159335 0.026490277 0.38063404 0.6644411 -0.25424173 0.8107544 -1.7661531 2.2621965 0.02205728 -1.1229014 -0.2920755 1.4923047 -0.32814467 -2.2108057 0.32496676 -0.7461394 0.7928426 -0.1117733 0.54062027 -1.5501577 -0.2083572 -0.91348296 0.7764302 -0.64569443 0.32488403 -0.42126602 -1.657367 0.9719783 -0.17734364 0.046469208 0.8273238 -1.5106404 -0.37492773 1.3581058 0.08664954 -0.10760002 -0.6162666 -0.53794456 0.36314002 1.2475406 0.26962999 -0.17876142 0.015458746 0.9794636 0.08373761 -0.38949597 -0.22871017 0.1528613 -0.41417062 -0.3410043 -0.48524252 0.31217077 -0.1679793 -0.42378938 -0.20197037 -0.0029911604 0.59018666 -0.0024281521 -0.046227697 -0.74926484 0.24579398 0.5046935 -0.030378481 -0.21657053 0.22813272 0.13031432 -0.1576565 -0.20766

### The LaTeX format

To train a `word2vec` model on math alone, we use a separate dataset that contains only LaTeX.

In [21]:
%%capture
! make word2vec-latex-positional

In [22]:
! du -hc word2vec-latex-positional

202M	word2vec-latex-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=5
202M	word2vec-latex-positional/model
0	word2vec-latex-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=5
0	word2vec-latex-positional/cache
202M	word2vec-latex-positional
202M	total


In [23]:
! head -n 2 word2vec-latex-positional/model/*/model.vec

29675 300
% 0.8901723 0.47753775 -0.25580484 -0.49702826 -1.2510386 0.28111792 0.23857468 -1.0780758 -1.4763756 0.2956801 0.055865422 0.026966417 0.326713 0.6106123 -0.87188375 -1.1999911 0.34720817 -0.4468305 -0.3698747 -0.4408113 0.071058765 -1.0206571 0.4129372 -0.30090615 -0.94386333 -0.8120412 1.0945362 -0.10427095 0.99721193 0.29392707 1.0004305 -0.06176557 0.40482345 -0.30843467 0.42092946 0.3593929 -0.59519 1.3344753 0.85418326 0.67675304 0.31679824 -0.5394206 0.72658104 -0.07458251 -0.36788827 -0.80487925 -0.7956097 0.19825435 -0.043689813 -1.1933388 -0.5309262 0.18160331 -0.60350937 -0.5598332 0.34673214 0.10679278 -1.5410138 0.50901806 0.76687497 0.45944133 -0.52140075 -0.056415234 -1.2947371 0.6880816 0.41167077 -0.36029693 2.0548363 -0.26385704 0.22266854 -0.0026835317 -0.91878295 2.0201352 -0.44925925 -0.46337846 1.0135711 0.17934509 0.86974764 -1.5536085 0.051703487 -0.45952705 0.556976 1.0920143 0.5733894 -0.26399985 1.5638045 -1.2356725 0.28950712 -0.69601995 -0.873385

### The Tangent-L format

To train a `word2vec` model on math alone, we use a separate dataset in the format used by [the Tangent-L search engine from UWaterloo][2].

 [2]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [24]:
%%capture
! make word2vec-tangentl-positional

In [25]:
! du -hc word2vec-tangentl-positional

14G	word2vec-tangentl-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=2
14G	word2vec-tangentl-positional/model
0	word2vec-tangentl-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=2
0	word2vec-tangentl-positional/cache
14G	word2vec-tangentl-positional
14G	total


In [26]:
! head -n 2 word2vec-tangentl-positional/model/*/model.vec

1986439 300
(start) 0.3166078 0.32142597 0.34454095 -0.07255285 0.38778037 -0.14426847 -0.01732232 -0.26281923 1.252759 0.049788285 -0.37009966 -0.15158775 0.07610741 2.0655713 -2.5267181 0.22337817 -0.004000889 -0.5431778 -0.076354615 -0.03853108 1.0655208 -0.116475835 -0.10065177 -0.078265786 0.07453254 0.988562 0.2175998 1.3232359 0.87032974 -0.00044681778 0.176348 0.023796262 -0.053987995 -0.6664432 -0.6631214 0.090399295 0.13955821 1.2250364 -0.017805219 -0.989486 -0.27473718 0.14729865 -0.14013909 -0.056522164 -0.1962746 -0.10431214 -0.17020053 -0.0711098 0.17637534 0.23139422 1.329581 -0.07272253 -0.037196796 0.41047052 0.14671275 -1.2506131 -1.5762898 -0.17462981 -0.031680465 -1.0242909 0.47987965 -0.8490513 -0.23418312 0.6616664 0.80433255 0.5773198 -1.518877 -0.49992532 -0.7281841 -1.3296658 1.5090414 -0.8300803 -1.3984554 1.6109108 0.36564013 0.10850824 0.7570445 1.623565 1.4367251 1.0795465 -1.3869472 0.48184153 1.693365 1.3953366 0.32961982 0.09521597 0.69020426 2.0102963 