# Prepare training dataset

To train tokenizers and language models, we will prepare datasets from [ArXMLiv 2020][1] and [Math StackExchange][2].

 [1]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
 [2]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html

In [1]:
! hostname

docker.apollo.fi.muni.cz


## The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
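To illustrate the format, here is a minimal sketch that splits one line into alternating text and math segments. It assumes that math spans are always delimited by a matching `[MATH]` ... `[/MATH]` pair, as in the sample output of this notebook; the helper name `split_text_and_math` is hypothetical and not part of the dataset tooling.

```python
import re

# Assumption: every math span is wrapped in "[MATH]" and "[/MATH]" and the
# two markers are balanced within a line.
MATH_SPAN = re.compile(r"\[MATH\](.*?)\[/MATH\]", re.DOTALL)


def split_text_and_math(line):
    """Return a list of (kind, content) pairs, where kind is "text" or "math"."""
    segments = []
    position = 0
    for match in MATH_SPAN.finditer(line):
        # Plain text between the previous math span and this one.
        text = line[position:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("math", match.group(1).strip()))
        position = match.end()
    # Plain text after the last math span.
    tail = line[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "The spectrum [MATH] X [/MATH] of the signal [MATH] x [/MATH] is bounded."
segments = split_text_and_math(example)
for kind, content in segments:
    print(kind, repr(content))
```

Math that appears outside the markers, or a span cut off by truncation, is treated as plain text by this sketch.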

In [2]:
%%capture
! make dataset-text+latex.txt

In [3]:
%ls -lh dataset-text+latex.txt

-rw-rw-r-- 1 1008 1011 4.3G Apr 28 02:18 dataset-text+latex.txt


In [4]:
! head -3 dataset-text+latex.txt

That means that [MATH] \delta [/MATH] should retain the signal under convolution. In the frequency domain the spectrum [MATH] \Delta:=\sF(\delta) [/MATH] of [MATH] \delta [/MATH] should retain the spectrum [MATH] X [/MATH] of the signal [MATH] x [/MATH] under multiplication. If our signal space [MATH] \sS [/MATH] contains signals [MATH] x [/MATH] whose spectra [MATH] X [/MATH] have unbounded support this implies that the spectrum of [MATH] \Delta [/MATH] must be the unit function, i.e., [MATH] \Delta(\omega)=1 [/MATH] for all [MATH] \omega\in\nR [/MATH] . But, if there exists an upper limit frequency [MATH] \omega_\rmu>0 [/MATH] such that the spectra [MATH] X [/MATH] for all signals [MATH] x\in\sS [/MATH] are zero outside of [MATH] [-\omega_\rmu,\omega_\rmu] [/MATH] then [MATH] \Delta(\omega) [/MATH] must only be 1 for [MATH] \omega\in[-\omega_\rmu,\omega_\rmu] [/MATH] . We can construct an appropriate generator for our base as inverse Fourier transformed of \begin{align} \Delta(\omega

## The LaTeX format

To train a tokenizer just for math, we also have a separate dataset with just LaTeX.

In [5]:
%%capture
! make dataset-latex.txt

In [6]:
%ls -lh dataset-latex.txt

-rw-rw-r-- 1 1008 1011 702M Apr 28 01:33 dataset-latex.txt


In [7]:
! head -3 dataset-latex.txt

f(x) = m(E \cap (-\infty,x])
\displaystyle=|\psi(\boldsymbol{R})|^{2}.
\|f-\Gamma_n\|_\infty \le \epsilon


## The Tangent-L format

We also have a separate math-only dataset in the format used by [the Tangent-L search engine from UWaterloo][3], which we can use to train a tokenizer just for math.

 [3]: http://ceur-ws.org/Vol-2936/paper-05.pdf
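As a rough illustration of how a line in this format might be broken into tokens, the sketch below assumes that tokens are separated by whitespace, contain no spaces themselves, and are wrapped in either `#(`...`)#` or `#{`...`}#`. These assumptions are inferred only from the sample output of this notebook, not from the Tangent-L specification, and the helper name `split_tangentl_tokens` is hypothetical.

```python
import re

# Assumption: a token looks like "#(...)#" or "#{...}#" with no inner spaces.
TOKEN = re.compile(r"#([({])(.*)[)}]#")


def split_tangentl_tokens(line):
    """Return (bracket, body) pairs for every well-formed token in a line."""
    tokens = []
    for raw in line.split():
        match = TOKEN.fullmatch(raw)
        if match is None:
            # Skip anything that does not match, e.g. a token cut off
            # by truncated output.
            continue
        tokens.append((match.group(1), match.group(2)))
    return tokens


example = "#(start)# #(v!p,[n,b],-)# #{v!g,nnnnw,w}# #(end)#"
tokens = split_tangentl_tokens(example)
for bracket, body in tokens:
    print(bracket, body)
```

The token bodies are kept as opaque strings here. Splitting them further on commas would be unsafe, because bracketed fields such as `[n,b]` contain commas themselves.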

In [8]:
%%capture
! make dataset-tangentl.txt

In [9]:
%ls -lh dataset-tangentl.txt

-rw-rw-r-- 1 1008 1011 24G Apr 28 06:40 dataset-tangentl.txt


In [10]:
! head -3 dataset-tangentl.txt

#(start)# #(v!p,[n,b],-)# #(v!p,[n,b])# #(v!p,m!()1x1,n,-)# #(v!p,m!()1x1,n)# #(m!()1x1,[n,w],n)# #(m!()1x1,[n,w])# #(m!()1x1,=,n,n)# #(m!()1x1,=,n)# #(=,n!2,n,nn)# #(=,n!2,n)# #(n!2,v!p,n,nnn)# #(n!2,v!p,n)# #(v!p,m!()1x1,n,nnnn)# #(v!p,m!()1x1,n)# #(m!()1x1,v!g,w,nnnnn)# #(m!()1x1,v!g,w)# #(v!g,!0,5n1w)# #(v!g,!0)# #(m!()1x1,v!g,w,n)# #(m!()1x1,v!g,w)# #(v!g,!0,nw)# #(v!g,!0)# #{v!g,nnnnw,w,n}# #{v!g,nnnnw,w}# #{m!()1x1,nnnn,n}# #{m!()1x1,nnnn}# #(v!p,n!2,b,-)# #(v!p,n!2,b)# #(n!2,!0,b)# #(n!2,!0)# #{n!2,nnn,b,-}# #{n!2,nnn,b}# #{v!p,nnnn,-}# #{v!p,nnnn}# #(end)#
#(start)# #(v!φ,[n,b],-)# #(v!φ,[n,b])# #(v!φ,m!()1x1,n,-)# #(v!φ,m!()1x1,n)# #(m!()1x1,[n,w],n)# #(m!()1x1,[n,w])# #(m!()1x1,f!,n,n)# #(m!()1x1,f!,n)# #(f!,[n,o,u],nn)# #(f!,[n,o,u])# #(f!,=,n,nn)# #(f!,=,n)# #(=,v!φ,n,nnn)# #(=,v!φ,n)# #(v!φ,[n,a,b],nnnn)# #(v!φ,[n,a,b])# #(v!φ,v!e,n,nnnn)# #(v!φ,v!e,n)# #(v!e,[n,a],nnnnn)# #(v!e,[n,a])# #(v!e,×,n,nnnnn)# #(v!e,×,n)# #(×,m!{2x2,n,6n)# #(×,m!{2x2,n)# #(m!{2x2,v!f,w,7n)# #(m