abdouyoussef/math-dlmf-dataset

math-dlmf-dataset

This is a twin math dataset consisting of a per-expression dataset and a Simple-XML dataset (with marked-up sentences and marked-up math).

This twin dataset is derived from the Digital Library of Mathematical Functions (DLMF) of NIST.

The per-expression dataset, residing in One-Record-Per-Math-Expression.zip, is structured and labeled at fine granularity. For each math equation or expression in the DLMF, there is a record that provides a number of related elements, both contextual and annotational. The Simple-XML dataset, residing in Simple-XML.zip, consists of "Simple XML" files. The contents of each Simple XML file are organized as marked-up sentences within the marked-up hierarchy of paragraphs, subsections, and sections inherited from the original DLMF XML files (which are derived from LaTeX source files using Bruce Miller's LaTeXML tool). Each sentence consists of its text and math XML elements with their own unique IDs.

The per-expression dataset is organized by records, one per equation and one per mathematical expression. Each equation record starts with the keyword "Equation" and ends with the keyword "End-equation", each on a separate line, and is made up of fields, where each field is a name:value pair. The name of each field is a meaningful string, and the value is a text string that can be a LaTeX encoding, an ID, or a sequence of name:value fields. The fields' names and values are described in Tables 1 and 2 below. The record of an expression is identical to that of an equation except that it starts with the keyword "Expression", ends with the keyword "End-expression", and does not have the following fields: equation-number, permalink, constraints, symbols-used, and symbols-defined.
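
The record layout described above can be read with a small splitter. This is a minimal sketch, not the dataset's official loader: it assumes the four keywords appear alone on their lines, and that a top-level field is written as `name: value` on a single line. The helper names `split_records` and `top_level_fields` are hypothetical, and nested values (such as those of constraints or symbols-used) are not interpreted.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Opening keyword -> closing keyword, as named in the description above.
# Assumption: each keyword sits alone on its own line.
OPENERS = {"Equation": "End-equation", "Expression": "End-expression"}


def split_records(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (kind, body_lines) for every record found in `lines`."""
    kind = None
    closer = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if kind is None:
            # Outside a record: only an opening keyword is meaningful.
            if stripped in OPENERS:
                kind, closer, body = stripped, OPENERS[stripped], []
        elif stripped == closer:
            yield kind, body
            kind = None
        else:
            body.append(line)


def top_level_fields(body: List[str]) -> Dict[str, str]:
    """Split each line on its first colon (an assumed 'name: value' layout).

    Only the first occurrence of a name is kept; nested fields are ignored.
    """
    fields: Dict[str, str] = {}
    for line in body:
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields.setdefault(name.strip(), value.strip())
    return fields
```

Calling `list(split_records(open(path, encoding="utf-8")))` on a section file would give one `(kind, body_lines)` pair per record, where `kind` is either "Equation" or "Expression". Splitting on the first colon only keeps URL values such as a permalink intact.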

In the Simple-XML dataset, each section of the DLMF is a lean XML file, structured as a tree of sections and subsections. Each subsection consists of paragraphs, and each paragraph is a sequence of marked-up sentences that contain text and/or marked-up math elements. Each sentence element carries XML attributes, including the xml:id attribute of the sentence. The attributes of the sentence element and of the Math element are explained below in Tables 3 and 4, respectively.

Both datasets have a directory structure that mirrors that of the DLMF: each chapter is a directory of files, one file per section, where the chapters are numbered 1-36 and the section files also have numeric names. For example, file "2.3.txt" in the per-expression dataset is the (text) file corresponding to Section 3 of Chapter 2 of the DLMF, and it contains the records of the equations and math expressions of that section. Similarly, file "2.3.xml" in the Simple-XML dataset is the lean sentence-oriented XML file corresponding to the contents of Section 3 of Chapter 2 of the DLMF.
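
The naming convention above can be turned into a small helper that recovers the chapter and section from a file name. This is a sketch under the stated convention only: the function name `parse_section_file` is hypothetical, and nothing is assumed about how the chapter directories themselves are named.

```python
import re
from pathlib import Path
from typing import Tuple

# Section files are named "<chapter>.<section>.txt" or "<chapter>.<section>.xml".
_SECTION_FILE = re.compile(r"^(\d+)\.(\d+)\.(txt|xml)$")


def parse_section_file(path: str) -> Tuple[int, int, str]:
    """Return (chapter, section, extension) for names like '2.3.txt'."""
    match = _SECTION_FILE.match(Path(path).name)
    if match is None:
        raise ValueError(f"not a section file name: {path!r}")
    chapter, section = int(match.group(1)), int(match.group(2))
    # The DLMF chapters are numbered 1-36.
    if not 1 <= chapter <= 36:
        raise ValueError(f"chapter out of range 1-36: {chapter}")
    return chapter, section, match.group(3)
```

Because only the final path component is inspected, the helper works the same way on files from either dataset, so the ".txt" and ".xml" files of one section can be paired by their (chapter, section) key.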

To make the format of the equation/expression records more standard and easier to load and process, a JSON version of the per-expression dataset is provided in One-Json-Object-Per-Math-Expression.zip. It contains three subdirectories holding the same JSON objects. One subdirectory has the same chapter-section tree structure as the DLMF (one JSON file per section); another has one JSON file per chapter; and the third has all of the JSON objects of the expressions/equations of the DLMF in a single large file. Depending on the computing resources available, users may choose one subdirectory or another. Note that the JSON objects mirror the records in One-Record-Per-Math-Expression.zip: the JSON object and the record of any expression/equation have the same field names and field values.
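
A loader for the JSON version might look like the sketch below. How the objects are laid out inside each file is not stated here, so the code makes no single assumption about it: it accepts one object, an array of objects, or several objects written back to back. The function name `iter_json_objects` is hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def iter_json_objects(text: str) -> Iterator[Dict[str, Any]]:
    """Yield every JSON object found at the top level of `text`.

    The file layout is an assumption, so three shapes are handled: a single
    object, an array of objects, or objects concatenated one after another.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here
        # (commas are skipped too, in case objects are comma-separated).
        while pos < end and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                yield item
```

The sketch takes the whole file content as one string, which is comfortable for the per-section and per-chapter files; for the single large file, memory use grows with the file size, which is one reason to prefer the smaller files on constrained machines.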

Table 1: Names, values and explanations of the fields of equation records

| Field Name | Field Value and its Explanation |
| --- | --- |
| equation-number | the unique equation number of the equation in DLMF |
| permalink | a unique URL of the equation |
| xml-id | a unique XML ID of the equation within the DLMF |
| tex | LaTeX encoding of the equation, surrounded with double dollar signs |
| content-tex | LaTeX encoding of the equation, but using DLMF-defined semantic LaTeX macros |
| constraints | a number of name:value fields encoding the constraints of the equation, if any, in both LaTeX and content-tex |
| symbols-defined | a number of name:value fields where the name is "symbol", and the value is in turn a number of name:value fields encoding and describing a math symbol in the equation, where the description gives the meaning of the symbol, which can be viewed as a symbol label in the ML sense |
| symbols-used | similar to the symbols-defined values above, except that each symbol has an additional idref:value field where the latter value provides the ID where the original definition of that symbol is located in the DLMF |
| meaning | the meaning or role of the symbol in question |
| idref | a unique ID reference to the location where a symbol is initially defined in the DLMF |
| context-references | a number of name:value fields that provide context-identifying references and titles of the textual units containing the equation, such as subsection and section titles, as detailed in Table 2 |

Table 2: Names, values and explanations of the context-references fields of equation records

| Field Name | Field Value and its Explanation |
| --- | --- |
| sentence-xmlid | a unique sentence ID within the Simple-XML files |
| sentence-num-in-section | the in-section number of the sentence containing the equation |
| sentence-num-in-chapter | the in-chapter number of the sentence containing the equation |
| sentence-num-in-corpus | the in-corpus number of the sentence containing the equation |
| para-xmlid | a unique ID of the physical paragraph of the equation |
| para-num-of-sentences | the number of sentences in the physical paragraph of the equation |
| paragraph-xmlid | a unique XML ID of the logical paragraph of the equation |
| paragraph-title | the title of the logical paragraph of the equation |
| subsection-xmlid | a unique XML ID of the subsection containing the equation |
| subsection-title | the title of the subsection containing the equation |
| section-xmlid | a unique XML ID of the section containing the equation |
| section-title | the title of the section containing the equation |
| chapter-xmlid | a unique XML ID of the chapter containing the equation |
| chapter-title | the title of the chapter containing the equation |

Table 3: Attribute names and values of the sentence element in Simple XML files

| Attribute Name | Attribute Value and its Explanation |
| --- | --- |
| xml:id | a unique sentence ID within the Simple-XML files |
| sentence-num-in-para | the number of the sentence in its physical paragraph |
| sentence-num-in-section | the number of the sentence in its section |

Table 4: Attribute names and values of the Math element in Simple XML files

| Attribute Name | Attribute Value and its Explanation |
| --- | --- |
| mode | the value is "inline" for unnumbered math expressions, and "display" for numbered equations |
| xml:id | a unique ID of the math expression/equation within the Simple-XML files |
| equation-number | the unique equation number of the equation in DLMF, if the Math element is a numbered equation |

The full context of each equation or expression can be derived easily and quickly from the twin datasets. Users can identify and fully extract the sentence containing a given equation/expression, as well as neighboring sentences or full paragraphs, for the contextualized processing needed in many math language processing (MLP) tasks.
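
One way to recover such context is to look up a record's sentence-xmlid among the sentence elements of the matching Simple XML file and return that sentence with its neighbors. The sketch below rests on two assumptions: that the sentence element's tag is literally "sentence" (exposed as the hypothetical `sentence_tag` parameter in case it differs), and that a record's sentence-xmlid value equals the xml:id of the sentence element. The function name `sentence_window` is also hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix and lowercase the tag name."""
    return tag.rsplit("}", 1)[-1].lower()


def sentence_window(root: ET.Element, sentence_xmlid: str, radius: int = 1,
                    sentence_tag: str = "sentence") -> List[str]:
    """Return the text of the target sentence and up to `radius` neighbors
    on each side, in document order; an empty list if the ID is not found."""
    sentences = [el for el in root.iter()
                 if isinstance(el.tag, str) and local_name(el.tag) == sentence_tag]
    for i, el in enumerate(sentences):
        if el.get(XML_ID) == sentence_xmlid:
            window = sentences[max(0, i - radius):i + radius + 1]
            # itertext() also includes the text of nested math elements.
            return [" ".join("".join(s.itertext()).split()) for s in window]
    return []
```

With `root = ET.parse("2.3.xml").getroot()`, a call such as `sentence_window(root, some_id, radius=2)` would return up to five sentences around the one holding the equation. The window follows document order and may cross paragraph boundaries; restricting it to one paragraph would require comparing each sentence's enclosing paragraph element.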
