# Train `word2vec` models

In this notebook, we will train `word2vec` models so that they can represent math-specific tokens.

In [1]:
! hostname

apollo.fi.muni.cz


## Train non-positional `word2vec` models

In this section, we will produce word embeddings for global `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
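
To make the format concrete, here is a minimal sketch of how a sentence with inline math could be turned into a token sequence with `[MATH]` and `[/MATH]` delimiters. The function `to_text_latex`, the `$...$` input convention, and the whitespace tokenization are assumptions made for illustration only; the preprocessing behind `make word2vec-text+latex` is not shown in this notebook and most likely tokenizes LaTeX differently.

```python
import re

# Hypothetical helper: the notebook's real preprocessing is not shown here.
# Assumes inline math is written as $...$ and tokens are whitespace-separated.
MATH_SPAN = re.compile(r"\$(.+?)\$")

def to_text_latex(sentence):
    """Wrap each $...$ span in [MATH] ... [/MATH] and split on whitespace."""
    marked = MATH_SPAN.sub(lambda m: f" [MATH] {m.group(1)} [/MATH] ", sentence)
    return marked.split()

tokens = to_text_latex("The identity $a^2 + b^2 = c^2$ holds")
print(tokens)
```

With this layout, text tokens and math tokens share one vocabulary, and the two delimiter tokens mark where the model switches between them.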

In [2]:
%%capture
! make word2vec-text+latex

In [3]:
! du -hc word2vec-text+latex

473M	word2vec-text+latex/model/custom-en-word2vec_cbow-epochs=10
473M	word2vec-text+latex/model
0	word2vec-text+latex/cache/custom-en-word2vec_cbow-epochs=10
4,0K	word2vec-text+latex/cache
473M	word2vec-text+latex
473M	total


### The LaTeX format

To train a `word2vec` model for math alone, we also have a separate dataset that contains only LaTeX.
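
One way to picture how a LaTeX-only dataset relates to the text + LaTeX format is to keep only the tokens between `[MATH]` and `[/MATH]`. The sketch below is an illustration under that assumption; `math_only` is a hypothetical helper, and the actual LaTeX dataset is produced by a pipeline that this notebook does not show.

```python
def math_only(tokens):
    """Return one token list per [MATH] ... [/MATH] span, dropping the text."""
    formulas, current = [], None
    for token in tokens:
        if token == "[MATH]":
            current = []                 # open a new formula
        elif token == "[/MATH]":
            if current is not None:
                formulas.append(current) # close the open formula
            current = None
        elif current is not None:
            current.append(token)        # token inside a formula
    return formulas

tokens = ["The", "identity", "[MATH]", "a^2", "+", "b^2", "=", "c^2", "[/MATH]", "holds"]
formulas = math_only(tokens)
print(formulas)
```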

In [4]:
%%capture
! make word2vec-latex

In [5]:
! du -hc word2vec-latex

202M	word2vec-latex/model/custom-en-word2vec_cbow-epochs=50
202M	word2vec-latex/model
0	word2vec-latex/cache/custom-en-word2vec_cbow-epochs=50
4,0K	word2vec-latex/cache
202M	word2vec-latex
202M	total


### The Tangent-L format

To train a `word2vec` model for math alone, we also have a separate dataset in the format used by [the Tangent-L search engine from UWaterloo][2].

 [2]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [6]:
%%capture
! make word2vec-tangentl

In [7]:
! du -hc word2vec-tangentl

14G	word2vec-tangentl/model/custom-en-word2vec_cbow-epochs=2
14G	word2vec-tangentl/model
0	word2vec-tangentl/cache/custom-en-word2vec_cbow-epochs=2
4,0K	word2vec-tangentl/cache
14G	word2vec-tangentl
14G	total


## Train positional `word2vec` models

In this section, we will produce word embeddings for global `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
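
As a rough intuition for what positional weighting changes, the sketch below contrasts a plain CBOW context vector (the average of the context word vectors) with a positionally weighted one, where each word vector is first multiplied element-wise by a vector that belongs to its position in the window. This is a common formulation written in plain Python for illustration; the function `cbow_context` and the toy numbers are assumptions, and the "constrained" variant named in the model directories below may differ, so the linked repository remains the authority.

```python
def cbow_context(word_vectors, positional_vectors=None):
    """Average the context word vectors; if positional vectors are given,
    scale each word vector element-wise by the vector of its position."""
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for position, vector in enumerate(word_vectors):
        for i in range(dim):
            if positional_vectors is None:
                weight = 1.0  # plain CBOW: every position counts equally
            else:
                weight = positional_vectors[position][i]
            total[i] += weight * vector[i]
    return [value / len(word_vectors) for value in total]

# Toy example: two context words with two-dimensional vectors.
context = [[1.0, 2.0], [3.0, 4.0]]
positions = [[1.0, 0.0], [0.5, 2.0]]

plain = cbow_context(context)
weighted = cbow_context(context, positions)
print(plain, weighted)
```

Without positional vectors, swapping the two context words leaves the result unchanged; with them, the same words in a different order generally produce a different context vector.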

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [8]:
%%capture
! make word2vec-text+latex-positional

In [9]:
! du -hc word2vec-text+latex-positional

472M	word2vec-text+latex-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=10
472M	word2vec-text+latex-positional/model
0	word2vec-text+latex-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=10
4,0K	word2vec-text+latex-positional/cache
472M	word2vec-text+latex-positional
472M	total


### The LaTeX format

To train a `word2vec` model for math alone, we also have a separate dataset that contains only LaTeX.

In [10]:
%%capture
! make word2vec-latex-positional

In [11]:
! du -hc word2vec-latex-positional

202M	word2vec-latex-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=50
202M	word2vec-latex-positional/model
0	word2vec-latex-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=50
4,0K	word2vec-latex-positional/cache
202M	word2vec-latex-positional
202M	total


### The Tangent-L format

To train a `word2vec` model for math alone, we also have a separate dataset in the format used by [the Tangent-L search engine from UWaterloo][2].

 [2]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [12]:
%%capture
! make word2vec-tangentl-positional

In [13]:
! du -hc word2vec-tangentl-positional

14G	word2vec-tangentl-positional/model/custom-en-constrained_positional_word2vec_cbow-epochs=2
14G	word2vec-tangentl-positional/model
0	word2vec-tangentl-positional/cache/custom-en-constrained_positional_word2vec_cbow-epochs=2
4,0K	word2vec-tangentl-positional/cache
14G	word2vec-tangentl-positional
14G	total
