math-dlmf-dataset

This is a twin math dataset consisting of a per-expression dataset and a Simple-XML dataset (with marked-up sentences and marked-up math).

This twin dataset is derived from the Digital Library of Mathematical Functions (DLMF) of NIST.

The per-expression dataset, residing in One-Record-Per-Math-Expression.zip, is structured and labeled at fine granularity. For each math equation or expression in the DLMF, there is a record that provides a number of related elements, both contextual and annotational. The Simple-XML dataset, residing in Simple-XML.zip, consists of "Simple XML" files. The contents of each Simple XML file are organized as marked-up sentences within the marked-up hierarchy of paragraphs/subsections/sections inherited from the original DLMF XML files (which are derived from LaTeX source files using Bruce Miller's LaTeXML tool). Each sentence consists of its text and math XML elements, each with its own unique ID.

The per-expression dataset is organized by records, one per equation and one per mathematical expression. Each equation record starts with the keyword "Equation" and ends with the keyword "End-equation", each on a separate line, and has separate fields, where each field is a name:value pair. The name of each field is a meaningful string, and the value is a text string that can be a LaTeX encoding, an ID, or a sequence of name:value fields. The fields' names and values are described in Tables 1 and 2 below. The record of an expression is identical to that of an equation except that it starts with the keyword "Expression", ends with the keyword "End-expression", and does not have the following fields: equation-number, permalink, constraints, symbols-used, and symbols-defined.
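The record layout described above can be read with a small parser. The following is a minimal sketch, not the dataset's official loader: it assumes that each top-level field sits on its own line and that the name is separated from the value by the first colon. The sample text and its IDs are made up for illustration, and values that are themselves sequences of name:value fields are left unparsed, since their exact layout is not specified here.

```python
from typing import Dict, Iterator, List

# Start keyword -> end keyword, as described in the text above.
RECORD_KEYWORDS = {"Equation": "End-equation", "Expression": "End-expression"}

# Made-up sample; the IDs and values are placeholders, not real DLMF data.
SAMPLE = """\
Equation
equation-number: 0.0.1
xml-id: hypothetical-id-1
tex: $$a+b=c$$
End-equation
Expression
xml-id: hypothetical-id-2
tex: $$x$$
End-expression
"""


def iter_records(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per record, holding its kind and its raw field lines."""
    kind = None              # "Equation" or "Expression" while inside a record
    lines: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if kind is None:
            if line in RECORD_KEYWORDS:      # start keyword opens a record
                kind, lines = line, []
        elif line == RECORD_KEYWORDS[kind]:  # matching end keyword closes it
            yield {"kind": kind, "lines": lines}
            kind = None
        elif line:
            lines.append(line)


def top_level_fields(lines: List[str]) -> Dict[str, str]:
    """Split each line at its first colon (ASSUMPTION: one field per line)."""
    fields: Dict[str, str] = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if sep:
            # Keep the first occurrence of a name; later repeats are ignored.
            fields.setdefault(name.strip(), value.strip())
    return fields


records = list(iter_records(SAMPLE))
first = top_level_fields(records[0]["lines"])
```

Splitting at the first colon keeps URL values such as a permalink intact, because only the text before the first colon is treated as the field name.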

In the Simple-XML dataset, each section of the DLMF is a lean XML file, structured as a tree of sections and subsections. Each subsection consists of paragraphs, and each paragraph is a sequence of marked-up sentences that contain text and/or marked-up math elements. Each sentence element carries XML attributes, including the xml:id attribute of the sentence. The attributes of the sentence element and of the Math element are explained below in Tables 3 and 4, respectively.

Both datasets have a directory structure that mirrors the directory structure of the DLMF: each chapter is a directory of files, one file per section, where the chapters are numbered 1-36 and the section files also have numeric names. For example, file "2.3.txt" in the per-expression dataset is the (text) file corresponding to Section 3 of Chapter 2, and it contains the records of the equations and math expressions of Section 3 of Chapter 2 of the DLMF. Similarly, file "2.3.xml" in the Simple-XML dataset is the lean sentence-oriented XML file corresponding to the contents of Section 3 of Chapter 2 of the DLMF.
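Given this naming scheme, the file for a section can be located by name alone. The sketch below is an illustration under stated assumptions: `find_section_file` is a hypothetical helper, and because the names of the chapter directories are not spelled out above, it searches the whole extracted tree recursively instead of building a fixed path.

```python
from pathlib import Path
from typing import Optional


def find_section_file(root, chapter: int, section: int, ext: str) -> Optional[Path]:
    """Return the first file named '<chapter>.<section>.<ext>' under root, or None."""
    name = f"{chapter}.{section}.{ext}"
    # rglob walks every subdirectory, so the chapter directory name is not needed.
    matches = sorted(Path(root).rglob(name))
    return matches[0] if matches else None
```

For example, `find_section_file("One-Record-Per-Math-Expression", 2, 3, "txt")` would look for "2.3.txt", assuming the archive was extracted into a directory with that name.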

To make the format of the equation/expression records more "standard" and easier to load and process, a JSON version of the per-expression dataset is provided at One-Json-Object-Per-Math-Expression.zip. It contains three subdirectories holding the same JSON objects. One subdirectory has the same chapter-section tree structure as the DLMF (one JSON file per section); another subdirectory has one JSON file per chapter; and the third subdirectory has all of the JSON objects of the expressions/equations of the DLMF in one single large file. Depending on the computing resources available, users may opt for one subdirectory or another. Note that the JSON objects mirror the records in One-Record-Per-Math-Expression.zip in that the JSON object and the record of any expression/equation have the same field names and field values.
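Loading the JSON version might look like the sketch below. How the objects are laid out inside each file is not stated above, so `parse_json_objects` (a hypothetical helper) accepts three possibilities as an assumption: a single JSON array, a single JSON object, or one JSON object per line.

```python
import json
from typing import Any, Dict, List


def parse_json_objects(text: str) -> List[Dict[str, Any]]:
    """Return the JSON objects in text as a list (the file layout is an assumption)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: treat the text as one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def index_by_xml_id(objects: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Index objects by their xml-id field (field name taken from Table 1)."""
    return {obj["xml-id"]: obj for obj in objects if "xml-id" in obj}
```

Indexing by xml-id gives constant-time lookup of a record once the objects are in memory, which is convenient when working from the single large file.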

Table 1: Names, values and explanations of the fields of equation records

Field Name	Field Value and its Explanation
equation-number	the unique equation number of the equation in DLMF
permalink	a unique URL of the equation
xml-id	a unique XML ID of the equation within the DLMF
tex	LaTeX encoding of the equation, surrounded with double dollar signs
content-tex	LaTeX encoding of the equation, but using DLMF-defined semantic LaTeX macros
constraints	a number of name:value fields encoding the constraints of the equation, if any, in both LaTeX and content-tex
symbols-defined	a number of name:value fields where the name is "symbol", and the value is in turn a number of name:value fields encoding and describing a math symbol in the equation, where the description gives the meaning of the symbol, which can be viewed as a symbol label in the ML sense
symbols-used	similar to the symbols-defined values above, except that each symbol has an additional idref:value field where the latter value provides the ID where the original definition of that symbol is located in the DLMF
meaning	the meaning or role of the symbol in question
idref	a unique ID reference to the location where a symbol is initially defined in the DLMF
context-references	a number of name:value fields that provide context-identifying references and titles of the textual units containing the equation, such as subsection and section titles, as detailed in Table 2

Table 2: Names, values and explanations of the context-references fields of equation records

Field Name	Field Value and its Explanation
sentence-xmlid	a unique sentence ID within the Simple-XML files
sentence-num-in-section	the in-section number of the sentence containing the equation
sentence-num-in-chapter	the in-chapter number of the sentence containing the equation
sentence-num-in-corpus	the in-corpus number of the sentence containing the equation
para-xmlid	a unique ID of the physical paragraph of the equation
para-num-of-sentences	the number of sentences in the physical paragraph of the equation
paragraph-xmlid	a unique XML ID of the logical paragraph of the equation
paragraph-title	the title of the logical paragraph of the equation
subsection-xmlid	a unique XML ID of the subsection containing the equation
subsection-title	the title of the subsection containing the equation
section-xmlid	a unique XML ID of the section containing the equation
section-title	the title of the section containing the equation
chapter-xmlid	a unique XML ID of the chapter containing the equation
chapter-title	the title of the chapter containing the equation

Table 3: Attribute names and values of the sentence element in Simple XML files

Attribute Name	Attribute Value and its Explanation
xml:id	a unique sentence ID within the Simple-XML files
sentence-num-in-para	the number of the sentence in its physical paragraph
sentence-num-in-section	the number of the sentence in its section

Table 4: Attribute names and values of the Math element in Simple XML files

Attribute Name	Attribute Value and its Explanation
mode	the value is "inline" for unnumbered math expressions, and "display" for numbered equations
xml:id	a unique ID of the math expression/equation within the Simple-XML files
equation-number	the unique equation number of the equation in DLMF, if the Math element is a numbered equation

The full context of each equation or expression can be derived easily and quickly from the twin datasets. This enables users to identify and fully extract the sentence containing a given equation/expression, as well as neighboring sentences or full paragraphs, for the contextualized processing needed in many math language processing (MLP) tasks.
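As an illustration of this kind of context lookup, the sketch below finds the sentence that contains a math element with a given ID. It is a sketch under assumptions, not the dataset's own tooling: the element names are taken from the descriptions above ("sentence" and "Math") and matched case-insensitively, the sample XML and its IDs are made up, and the real files may nest elements differently.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up sample; tag names and IDs are placeholders for illustration.
SAMPLE = """<section>
  <para>
    <sentence xml:id="s1">Let <Math mode="inline" xml:id="m1">x</Math> be real.</sentence>
    <sentence xml:id="s2">Then <Math mode="display" xml:id="m2" equation-number="0.0.1">x=x</Math> holds.</sentence>
  </para>
</section>"""


def local_name(element: ET.Element) -> str:
    """Lower-cased tag name without any namespace part."""
    tag = element.tag
    return tag.rsplit("}", 1)[-1].lower() if isinstance(tag, str) else ""


def sentence_of_math(root: ET.Element, math_id: str) -> Optional[ET.Element]:
    """Return the sentence element containing the math element with this xml:id."""
    for sentence in root.iter():
        if local_name(sentence) != "sentence":
            continue
        for node in sentence.iter():
            if local_name(node) == "math" and node.get(XML_ID) == math_id:
                return sentence
    return None


sentence = sentence_of_math(ET.fromstring(SAMPLE), "m2")
sentence_text = "".join(sentence.itertext()) if sentence is not None else ""
```

The sentence-xmlid field of an equation record (Table 2) gives the same sentence directly, so a lookup by sentence ID is an alternative when starting from the per-expression dataset.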
