Workflow

Get the arXMLiv dataset (see arXMLiv)
Process it with PlaneText (see PlaneText)
Post-process the output with custom procedures (see Post-Processing)

arXMLiv

Download arXMLiv 08.2018 (requires git lfs) and extract the html files.

PlaneText

We use PlaneText to process the HTML files from arXMLiv. We customized the code slightly. The original sources can be found at KMCS-NII; our customized version is in the planetext directory.

Our version of PlaneText (a) does not substitute the inner elements of MathML and (b) does not create XHTML or HTML files.

For processing arXMLiv:

# navigate to the planetext directory
./bin/planetext arxmliv.yaml <path to html files> -O <output path>

The process may stop or crash for a subset of files. In that case, you can use missing.sh to copy the files that have not yet been processed to another directory and repeat the conversion for that subset. Before you do so, change the paths in missing.sh:

DIRIN="no_problems_raw"
DIRPROC="no_problems_txt"
DIROUT="no_problems_tmp"
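
The idea behind this step can be sketched as follows. This is a hypothetical illustration, not the repository's missing.sh: the function name copy_missing is made up, and the sketch assumes that PlaneText writes one <name>.txt file into DIRPROC for every input file <name>.<ext> in DIRIN.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the repository's missing.sh.
# Copies every input file that has no processed counterpart to a third directory.
# Assumption: an input "<name>.<ext>" counts as processed if "<name>.txt" exists.

copy_missing() {
  local dirin="$1" dirproc="$2" dirout="$3"
  local f base

  mkdir -p "$dirout"

  for f in "$dirin"/*; do
    [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
    base="$(basename "$f")"
    base="${base%.*}"                # strip the last extension: paper.html -> paper
    if [ ! -e "$dirproc/$base.txt" ]; then
      cp "$f" "$dirout/"             # not processed yet: queue it for another run
    fi
  done
}

# Example call with the variables shown above:
# copy_missing "$DIRIN" "$DIRPROC" "$DIROUT"
```

After copying, the PlaneText command can be run again with DIROUT as the input path.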

Another problem is files that are empty or have no meaningful content. For these, PlaneText will then generate empty annotation files and possibly even empty text files. To identify those files in advance (before post-processing), we created broken.sh. As with missing.sh, you may have to modify the paths in broken.sh.
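
A minimal sketch of such a check, assuming that "broken" simply means a zero-byte file in the PlaneText output directory. The function name list_empty_outputs is hypothetical, and the actual broken.sh may use different criteria.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the repository's broken.sh.
# Prints the path of every zero-byte regular file below the given directory.

list_empty_outputs() {
  local dir="$1"
  # -type f : regular files only
  # -size 0 : file size is zero blocks, which is true only for empty files
  find "$dir" -type f -size 0
}

# Example call (the directory name is taken from the missing.sh variables above):
# list_empty_outputs "no_problems_txt"
```

The listed files can then be excluded or inspected before the post-processing step is run.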

Post-Processing

The post folder contains a Java project that post-processes the files generated by PlaneText. We split the text into sentences (one line per sentence) and replace all math tokens with their LLaMaPuN (Language and Mathematics Processing and Understanding) representation. You have to specify the input and output directories.

gipplab/math2vec

Folders and files

Latest commit

History

Repository files navigation

Workflow

arXMLiv

PlaneText

Post-Processing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages