Set of tools we used to create, cultivate and process datasets for our math2vec project.

gipplab/math2vec

Workflow

  1. Get the arXMLiv dataset (see arXMLiv)
  2. Process it with PlaneText (see PlaneText)
  3. Post-process the output with custom procedures (see Post-Processing)

arXMLiv

Download arXMLiv 08.2018 (requires git lfs) and extract the html files.

PlaneText

We use PlaneText to process the HTML files from arXMLiv. We customized the code slightly. The original sources can be found on KMCS-NII. Our customized version can be found under planetext.

Our version of PlaneText (a) does not substitute MathML inner elements and (b) does not create XHTML or HTML files.

For processing arXMLiv:

# navigate to the planetext directory
./bin/planetext arxmliv.yaml <path to html files> -O <output path>

The process may stop or crash for a subset of files. In that case, you can use missing.sh to copy the files that have not yet been processed to another directory and repeat the conversion for that subset. Before you do so, change the paths in missing.sh:

DIRIN="no_problems_raw"
DIRPROC="no_problems_txt"
DIROUT="no_problems_tmp"
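
The idea behind this recovery step can be sketched as follows. This is only an illustration, not the contents of the repository's missing.sh: the function name copy_missing is hypothetical, and the sketch assumes that PlaneText writes one .txt file per input .html file with the same base name, which you should check against your own output directory.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the repository's missing.sh.
DIRIN="no_problems_raw"    # HTML files given to PlaneText
DIRPROC="no_problems_txt"  # PlaneText output produced so far
DIROUT="no_problems_tmp"   # receives the inputs that still need processing

# Copy every *.html file in $1 that has no matching *.txt file in $2 into $3.
# Assumption: outputs keep the input base name and use a .txt extension.
copy_missing() {
    local in_dir="$1" proc_dir="$2" out_dir="$3" f base
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.html; do
        [ -e "$f" ] || continue            # the glob matched nothing
        base="$(basename "$f" .html)"
        if [ ! -e "$proc_dir/$base.txt" ]; then
            cp "$f" "$out_dir/"
        fi
    done
}

# Run only when the configured input directory exists.
if [ -d "$DIRIN" ]; then
    copy_missing "$DIRIN" "$DIRPROC" "$DIROUT"
fi
```

After such a run, the planetext command shown earlier can be repeated with DIROUT as the input path.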

Another problem is input files that are empty or have no meaningful content. For these, PlaneText will generate empty annotation files and possibly empty text files. To identify such files in advance (before post-processing), we created broken.sh. As with missing.sh, you may have to modify the paths in broken.sh.
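
A check of this kind might look like the sketch below. It is not the repository's broken.sh; the function name list_blank_files is hypothetical, and "without meaning" is interpreted here, as an assumption, as "contains nothing but whitespace".

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the repository's broken.sh.

# Print the path of every regular file below directory $1 that is empty
# or contains only whitespace.
list_blank_files() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            # grep -q succeeds if the file has a non-whitespace character
            grep -q "[^[:space:]]" "$f" || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

For example, `list_blank_files no_problems_txt` would list the suspicious files in the PlaneText output directory, so they can be inspected or excluded before post-processing.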

Post-Processing

The post folder contains a Java project that post-processes the files generated by PlaneText. It splits the text into sentences (one line per sentence) and replaces all math tokens with their LLaMaPuN (Language and Mathematics Processing and Understanding) representation. You have to specify the input and output directories.
