# Prepare dataset

To train the tokenizers and language models, we prepare datasets from the [ArXMLiv 2020][1] and [Math StackExchange][2] corpora.

 [1]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
 [2]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html

In [1]:
! hostname

mir


## The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
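To make the format concrete, here is a minimal Python sketch that splits one line of the text + LaTeX format into alternating text and math segments. The function name `split_segments` and the regular expression are illustrative assumptions, not part of the dataset tooling; only the `[MATH]` and `[/MATH]` token spellings come from the description above.

```python
import re

# One math span, delimited by the special tokens described above.
MATH_SPAN = re.compile(r"\[MATH\](.*?)\[/MATH\]")


def split_segments(line):
    """Return (kind, content) pairs in order, where kind is "text" or "math"."""
    segments = []
    position = 0
    for match in MATH_SPAN.finditer(line):
        # Text between the previous math span and this one.
        text = line[position:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("math", match.group(1).strip()))
        position = match.end()
    # Whatever follows the last math span, including any unterminated span.
    tail = line[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "Participants performed [MATH] 10 [/MATH] trials per scene."
print(split_segments(example))
```

An unterminated span (for example, a line cut off in the middle of a closing token) is not matched by the pattern, so this sketch returns it as ordinary text rather than guessing where the math ends.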

In [2]:
%%capture
! make dataset-text+latex.txt

In [3]:
%ls -lh dataset-text+latex.txt

-rw-r--r-- 1 novotny novotny 52G May  5 03:52 dataset-text+latex.txt


In [4]:
! head -3 dataset-text+latex.txt

The study was conducted based on a within-subjects design. Participants were shown five scenes with a virtual character walking in different environments. Participants performed [MATH] 10 [/MATH] trials per scene, corresponding to [MATH] 10 [/MATH] virtual characters with varying levels of predicted dominance. The order of the scenes and the dominance levels of the virtual characters were counterbalanced. Participants performed the study using HTC Vive HMD. Participants could look around in the virtual environment by rotating their heads and could also walk in the tracking space, but there was no interaction with the virtual characters (Figure 1 (top)).
The security properties of the longest chain protocol has been intensely studied in recent years. The strong security properties have been demonstrated in increasing sophistication (both network models as well as tightness of security threshold): the pioneering work of [12] on the round by round network model has been extended to discre

For validation of our language models, we use the `error` subset (documents with recoverable errors) of [ArXMLiv 2020][1].

 [1]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/

In [5]:
%%capture
! make dataset-text+latex-validation.txt

In [6]:
%ls -lh dataset-text+latex-validation.txt

-rw-r--r-- 1 novotny novotny 22G May  5 07:25 dataset-text+latex-validation.txt


In [7]:
! head -3 dataset-text+latex-validation.txt

Following our conventions in [GHK13], let [MATH] \mathfrak{T} [/MATH] be the infinite oriented rooted tree with [MATH] |I_{\mathrm{uf}}| [/MATH] outgoing edges from each vertex, labelled by the elements of [MATH] I_{\mathrm{uf}} [/MATH] . Let [MATH] v [/MATH] be the root of the tree. Attach some choice of initial seed [MATH] {\bf s}\in[{\bf s}] [/MATH] to the vertex [MATH] v [/MATH] . (We write [MATH] \mathfrak{T}_{{\bf s}} [/MATH] if we want to record this choice of initial seed.) Now each simple path starting at [MATH] v [/MATH] determines a sequence of mutations, just mutating at the label attached to the edge. In this way we attach a seed to each vertex of [MATH] \mathfrak{T} [/MATH] . We write the seed attached to a vertex [MATH] w [/MATH] as [MATH] {\bf s}_{w} [/MATH] , and write [MATH] T_{N^{\circ},{\bf s}_{w}},T_{M,{\bf s}_{w}} [/MATH] etc. for the corresponding tori. Mutations define birational maps between these tori, and the associated Fock-Goncharov [MATH] \mathcal{A} [/MAT

## The text format

To train a text-only tokenizer, we also prepare a separate dataset that contains only the text, with the math removed.
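As a rough illustration of what a text-only line looks like relative to the text + LaTeX format, the following Python sketch drops every `[MATH] … [/MATH]` span from a line. This is an assumption about the relationship between the two formats, not a description of how `make dataset-text.txt` builds the file; `strip_math` is a hypothetical helper.

```python
import re

# One math span, including its delimiting tokens.
MATH_SPAN = re.compile(r"\[MATH\].*?\[/MATH\]")


def strip_math(line):
    """Remove all math spans and collapse the whitespace left behind."""
    without_math = MATH_SPAN.sub(" ", line)
    return re.sub(r"\s+", " ", without_math).strip()


print(strip_math("Let [MATH] v [/MATH] be the root of the tree."))
# prints "Let be the root of the tree."
```

Sentences produced this way have gaps where the formulae used to be, in the same way as "Let be the Euclidean unit disk" in the sample output of this section.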

In [8]:
%%capture
! make dataset-text.txt

In [9]:
%ls -lh dataset-text.txt

-rw-rw-r-- 1 novotny novotny 32G May  5 08:49 dataset-text.txt


In [10]:
! head -3 dataset-text.txt

 In the re-acceleration model, the radio halo switches off with the radio halo CRe cooling timescale of about 0.1 Gyr the moment the re-acceleration stops to operate, irrespective if there is CRe transport from the outside or not.
 Let be the Euclidean unit disk Let and define . Let be open and nonempty. Then determines and , up to the gauge transformations, in the convex hull of .
 In real space the simplest estimator for the full box is given by Mo and White (1996) (33) where is the mean number of neighbour halos in a shell at distance with width around a halo at , and is the mean number density of halos in the simulation at a given time, such that gives the mean number of neighbour halos if the halos were evenly distributed. Therefore estimates the excess probability to find a halo within an interval away from another halo. We determine as , where is the total number of halos in the box and is the number of all halo pairings with distance . In practice we calculate (34) where is the

## The LaTeX format

To train a math-only tokenizer, we also prepare a separate dataset that contains only the LaTeX formulae.

In [11]:
%%capture
! make dataset-latex.txt

In [12]:
%ls -lh dataset-latex.txt

-rw-r--r-- 1 novotny novotny 11G May  5 03:36 dataset-latex.txt


In [13]:
! head -3 dataset-latex.txt

\displaystyle=E(F_{n}(L(a),L(F_{n}(b,c))))
\int_{\hat{S}_{\rm men}}\!\!\!\!\!\!dA\;\Pi_{1}u_{2}\sim 2\pi\gamma r_{0}% \varepsilon_{\Pi}u(d)+\Pi(d)\left[2\int_{{S}_{\rm men,2}}\!\!\!\!\!\!dA\;u_{2}% -\frac{1}{2}\pi r_{0}^{3}\varepsilon_{\Pi}+\pi r_{0}^{2}\langle u_{2}\rangle_{% 2}\right],
\left|t\right|>\left|\tau\right|


## The Tangent-L format

To train a math-only tokenizer on an alternative math representation, we also prepare a separate dataset in the format used by [the Tangent-L search engine from the University of Waterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf
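For readers unfamiliar with this representation, here is a small Python sketch that breaks a Tangent-L line into its individual tokens. It assumes that tokens are separated by whitespace and wrapped either as `#(...)#` or as `#{...}#`, which is what the sample output in this section looks like; it does not interpret what the two delimiter kinds or the token bodies mean. The function name `parse_tangentl` is hypothetical.

```python
def parse_tangentl(line):
    """Split a line into (delimiter, body) pairs, one per whitespace-separated token."""
    tokens = []
    for raw in line.split():
        if raw.startswith("#(") and raw.endswith(")#"):
            tokens.append(("()", raw[2:-2]))
        elif raw.startswith("#{") and raw.endswith("}#"):
            tokens.append(("{}", raw[2:-2]))
        else:
            # Truncated or otherwise unrecognized token: keep it verbatim.
            tokens.append(("?", raw))
    return tokens


sample = "#(start)# #(v!p,[n,b],-)# #{v!p,nnnn}# #(end)#"
for delimiter, body in parse_tangentl(sample):
    print(delimiter, body)
```

The token bodies are deliberately left unsplit: commas also appear inside bracketed parts such as `[n,b]`, so a plain split on commas would break them apart incorrectly.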

In [14]:
%%capture
! make dataset-tangentl.txt

In [15]:
%ls -lh dataset-tangentl.txt

-rw-rw-r-- 1 novotny novotny 24G Apr 28 08:40 dataset-tangentl.txt


In [16]:
! head -3 dataset-tangentl.txt

#(start)# #(v!p,[n,b],-)# #(v!p,[n,b])# #(v!p,m!()1x1,n,-)# #(v!p,m!()1x1,n)# #(m!()1x1,[n,w],n)# #(m!()1x1,[n,w])# #(m!()1x1,=,n,n)# #(m!()1x1,=,n)# #(=,n!2,n,nn)# #(=,n!2,n)# #(n!2,v!p,n,nnn)# #(n!2,v!p,n)# #(v!p,m!()1x1,n,nnnn)# #(v!p,m!()1x1,n)# #(m!()1x1,v!g,w,nnnnn)# #(m!()1x1,v!g,w)# #(v!g,!0,5n1w)# #(v!g,!0)# #(m!()1x1,v!g,w,n)# #(m!()1x1,v!g,w)# #(v!g,!0,nw)# #(v!g,!0)# #{v!g,nnnnw,w,n}# #{v!g,nnnnw,w}# #{m!()1x1,nnnn,n}# #{m!()1x1,nnnn}# #(v!p,n!2,b,-)# #(v!p,n!2,b)# #(n!2,!0,b)# #(n!2,!0)# #{n!2,nnn,b,-}# #{n!2,nnn,b}# #{v!p,nnnn,-}# #{v!p,nnnn}# #(end)#
#(start)# #(v!φ,[n,b],-)# #(v!φ,[n,b])# #(v!φ,m!()1x1,n,-)# #(v!φ,m!()1x1,n)# #(m!()1x1,[n,w],n)# #(m!()1x1,[n,w])# #(m!()1x1,f!,n,n)# #(m!()1x1,f!,n)# #(f!,[n,o,u],nn)# #(f!,[n,o,u])# #(f!,=,n,nn)# #(f!,=,n)# #(=,v!φ,n,nnn)# #(=,v!φ,n)# #(v!φ,[n,a,b],nnnn)# #(v!φ,[n,a,b])# #(v!φ,v!e,n,nnnn)# #(v!φ,v!e,n)# #(v!e,[n,a],nnnnn)# #(v!e,[n,a])# #(v!e,×,n,nnnnn)# #(v!e,×,n)# #(×,m!{2x2,n,6n)# #(×,m!{2x2,n)# #(m!{2x2,v!f,w,7n)# #(m