This is a twin math dataset consisting of a per-expression dataset and a Simple-XML dataset (with marked-up sentences and marked-up math).
This twin dataset is derived from the Digital Library of Mathematical Functions (DLMF) of NIST.
The per-expression dataset, residing in One-Record-Per-Math-Expression.zip, is structured and labeled at fine granularity. For each math equation or expression in the DLMF, there is a record that provides a number of related elements, both contextual and annotational. The Simple-XML dataset, residing in Simple-XML.zip, consists of "Simple XML" files. The contents of each Simple XML file are organized as marked-up sentences within the marked-up hierarchy of paragraphs, subsections, and sections inherited from the original DLMF XML files (which are derived from LaTeX source files using Bruce Miller's LaTeXML tool). Each sentence consists of its text and math XML elements, which have their own unique IDs.
The per-expression dataset is organized by records, one per equation and one per mathematical expression. Each equation record starts with the keyword "Equation" and ends with the keyword "End-equation", each on a separate line, and has separate fields, where each field is a name:value pair. The name of each field is a meaningful string, and the value is a text string that can be a LaTeX encoding, an ID, or a sequence of name:value fields. The fields' names and values are described in Tables 1 and 2 below. The record of an expression is identical to that of an equation except that it starts with the keyword "Expression", ends with the keyword "End-expression", and does not have the following fields: equation-number, permalink, constraints, symbols-used, and symbols-defined.
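The record layout described above can be read with a small parser. The following is a minimal sketch, not the dataset's official loader. It assumes that each field sits on its own line as `name: value`, and it does not rebuild nested name:value sequences (such as those of symbols-used); it only collects the fields in order. The function name `parse_records` and the sample text are made up for illustration.

```python
def parse_records(text):
    """Split per-expression text into records holding flat (name, value) fields."""
    # Opening keyword -> matching closing keyword, as described in the text.
    openers = {"Equation": "End-equation", "Expression": "End-expression"}
    records, current, closer = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            # Outside a record: wait for an opening keyword.
            if line in openers:
                current = {"kind": line.lower(), "fields": []}
                closer = openers[line]
            continue
        if line == closer:
            records.append(current)
            current, closer = None, None
        elif ":" in line:
            # Split on the first colon only, so values such as URLs stay intact.
            name, value = line.split(":", 1)
            current["fields"].append((name.strip(), value.strip()))
        # Assumption: lines without a colon inside a record are skipped.
    return records


# Made-up sample in the assumed layout (not copied from the dataset).
SAMPLE = """Equation
equation-number: 0.0.0
tex: $$a+b=c$$
End-equation
Expression
tex: $$x$$
End-expression
"""

records = parse_records(SAMPLE)
```

Fields are kept as an ordered list of pairs rather than a dictionary because a name such as "symbol" may occur more than once in a record.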
In the Simple-XML dataset, each section of the DLMF is a lean XML file, structured as a tree of a section and its subsections. Each subsection consists of paragraphs, and each paragraph is a sequence of marked-up sentences that contain text and/or marked-up math elements. Each sentence element carries XML attributes, including the xml:id attribute of the sentence. The attributes of the sentence element and of the Math element are explained below in Tables 3 and 4, respectively.
Both datasets have a directory structure that mirrors the directory structure of the DLMF: each chapter is a directory of files, one file per section, where the chapters are numbered 1-36 and the section files also have numeric names. For example, file "2.3.txt" in the per-expression dataset is the (text) file corresponding to Section 3 of Chapter 2, and contains the records of the equations and math expressions of that section of the DLMF. Similarly, file "2.3.xml" in the Simple-XML dataset is the lean sentence-oriented XML file corresponding to the contents of Section 3 of Chapter 2 of the DLMF.
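The naming scheme above maps a chapter and section number to a file. A minimal sketch follows. It assumes that each chapter directory is named by its chapter number, which the description does not state explicitly, so adjust the directory part to the actual layout after unzipping. The helper name `section_file` is hypothetical.

```python
from pathlib import Path


def section_file(root, chapter, section, ext):
    """Build the path of a section file, e.g. <root>/2/2.3.txt."""
    if not 1 <= chapter <= 36:
        raise ValueError("DLMF chapters are numbered 1-36")
    # Assumption: the chapter directory is named by the chapter number.
    return Path(root) / str(chapter) / f"{chapter}.{section}.{ext}"


txt_path = section_file("One-Record-Per-Math-Expression", 2, 3, "txt")
xml_path = section_file("Simple-XML", 2, 3, "xml")
```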
To make the format of the equation/expression records more "standard" and easier to load and process, a JSON version of the per-expression dataset is provided in One-Json-Object-Per-Math-Expression.zip. It contains three subdirectories that hold the same JSON objects. One subdirectory has the same chapter-section tree structure as the DLMF (one JSON file per section); another subdirectory has one JSON file per chapter; and the third subdirectory has all of the JSON objects of the expressions/equations of the DLMF in one single large file. Depending on the computing resources available, users may opt for one subdirectory or another. Note that the JSON objects mirror the records in One-Record-Per-Math-Expression.zip in that the JSON object and the record of any expression/equation have the same field names and field values.
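Loading the JSON version might look like the sketch below. The top-level layout of each JSON file is not described here, so the code assumes that a file holds either a single JSON object or a JSON array of objects; check one file before relying on it. The function names are hypothetical, and the only field name used, xml-id, is the one listed in Table 1.

```python
import json
from pathlib import Path


def objects_from_text(text):
    """Return a list of JSON objects from a JSON document."""
    data = json.loads(text)
    # Assumption: the document is either one object or an array of objects.
    return data if isinstance(data, list) else [data]


def load_objects(path):
    """Read one JSON file (per section, per chapter, or the single large file)."""
    return objects_from_text(Path(path).read_text(encoding="utf-8"))


def index_by_xml_id(objects):
    """Map each object's xml-id field to the object for quick lookup."""
    return {obj["xml-id"]: obj for obj in objects if "xml-id" in obj}
```

Indexing by xml-id is one convenient choice because that field is present in both equation and expression records.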
Table 1: Names, values and explanations of the fields of equation records

Field Name | Field Value and its Explanation |
---|---|
equation-number | the unique equation number of the equation in DLMF |
permalink | a unique URL of the equation |
xml-id | a unique XML ID of the equation within the DLMF |
tex | LaTeX encoding of the equation, surrounded with double dollar signs |
content-tex | LaTeX encoding of the equation, but using DLMF-defined semantic LaTeX macros |
constraints | a number of name:value fields encoding the constraints of the equation, if any, in both LaTeX and content-tex |
symbols-defined | a number of name:value fields where the name is "symbol", and the value is in turn a number of name:value fields encoding and describing a math symbol in the equation, where the description gives the meaning of the symbol, which can be viewed as a symbol label in the ML sense |
symbols-used | similar to the symbols-defined values above, except that each symbol has an additional idref:value field where the latter value provides the ID where the original definition of that symbol is located in the DLMF |
meaning | the meaning or role of the symbol in question |
idref | a unique ID reference to the location where a symbol is initially defined in the DLMF |
context-references | a number of name:value fields that provide context-identifying references and titles of the textual units containing the equation, such as subsection and section titles, as detailed in Table 2 |

Table 2: Names, values and explanations of the fields of context-references

Field Name | Field Value and its Explanation |
---|---|
sentence-xmlid | a unique sentence ID within the Simple-XML files |
sentence-num-in-section | the in-section number of the sentence containing the equation |
sentence-num-in-chapter | the in-chapter number of the sentence containing the equation |
sentence-num-in-corpus | the in-corpus number of the sentence containing the equation |
para-xmlid | a unique ID of the physical paragraph of the equation |
para-num-of-sentences | the number of sentences in the physical paragraph of the equation |
paragraph-xmlid | a unique XML ID of the logical paragraph of the equation |
paragraph-title | the title of the logical paragraph of the equation |
subsection-xmlid | a unique XML ID of the subsection containing the equation |
subsection-title | the title of the subsection containing the equation |
section-xmlid | a unique XML ID of the section containing the equation |
section-title | the title of the section containing the equation |
chapter-xmlid | a unique XML ID of the chapter containing the equation |
chapter-title | the title of the chapter containing the equation |

Table 3: Attributes of the sentence element

Attribute Name | Attribute Value and its Explanation |
---|---|
xml:id | a unique sentence ID within the Simple-XML files |
sentence-num-in-para | the number of the sentence in its physical paragraph |
sentence-num-in-section | the number of the sentence in its section |

Table 4: Attributes of the Math element

Attribute Name | Attribute Value and its Explanation |
---|---|
mode | the value is "inline" for unnumbered math expressions, and "display" for numbered equations |
xml:id | a unique ID of the math expression/equation within the Simple-XML files |
equation-number | the unique equation number of the equation in DLMF, if the Math element is a numbered equation |

The full context of each equation or expression can be easily and quickly derived from the twin datasets. This enables users to identify and fully extract the sentence containing a given equation/expression, as well as neighboring sentences or full paragraphs, for the contextualized processing needed in many math language processing (MLP) tasks.
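Extracting the sentence that contains a given math element could be sketched as follows. The element names `sentence` and `Math` are assumptions based on the wording above, as is the absence of a default XML namespace on the elements; both are exposed as parameters so they can be adjusted after inspecting a real file. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_sentence_with_math(root, math_id, sentence_tag="sentence", math_tag="Math"):
    """Return the sentence element containing the math element with this xml:id."""
    for sentence in root.iter(sentence_tag):
        for math in sentence.iter(math_tag):
            if math.get(XML_ID) == math_id:
                return sentence
    return None


def sentence_text(sentence):
    """Flatten a sentence (text plus math content) into one normalized string."""
    return " ".join("".join(sentence.itertext()).split())


# Made-up sample using the assumed element names.
SAMPLE = """<section><subsection><para>
<sentence xml:id="s1">First sentence.</sentence>
<sentence xml:id="s2">Here <Math mode="inline" xml:id="m1">x</Math> appears.</sentence>
</para></subsection></section>"""

root = ET.fromstring(SAMPLE)
hit = find_sentence_with_math(root, "m1")
```

With a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`, and the sentence-xmlid field of a record (Table 2) could be matched against the sentence's xml:id directly.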