https://physicsderivationgraph.blogspot.com/2020/05/characterizing-latex-content-in.html

This notebook explores two methods for extracting math expressions from LaTeX source:
* regex
* TexSoup

_Results_: 
* TexSoup is very slow because it builds a complex data structure for each document: parsing took roughly 5 to 31 seconds per document in the runs below, versus about 6 seconds for regex over the whole corpus. It therefore isn't suitable for bulk processing, and regex is preferable.
* TexSoup raises an error on invalid LaTeX, which halts processing.

tex2py was also investigated, but it doesn't seem relevant to this task.

# load libraries

In [1]:
# https://github.com/alvinwan/tex2py
!pip install tex2py



In [2]:
# https://github.com/alvinwan/TexSoup
!pip install texsoup



In [3]:
import re
import sys
print(sys.version)
import time
import glob
import matplotlib.pyplot as plt

3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]


# download data

https://www.cs.cornell.edu/projects/kddcup/datasets.html

In [4]:
# the line count includes ls's "total" line plus the "." and ".." entries
!ls -hal 2003/ | wc -l

1022


# load data

In [5]:
list_of_files = glob.glob('2003/*')

In [6]:
# how many files are in the corpus?
len(list_of_files)

1019

In [7]:
# what's the first file name?
list_of_files[0]

'2003/0301116'

In [8]:
# load data from the first file
with open(list_of_files[0], 'r') as f:
    data = f.read()

In [9]:
# quick preview of file content
data[0:100]

'\n\n%******************LATEX FILE OF THE PAPER***********************\n%\n%\n\n%%%%%%%%%%%%%%%%%%%%%%%%%%%'

# method 1: find relevant tags in latex using regex

This section uses regular expressions directly rather than a LaTeX parsing library.

In [10]:
start_time = time.time()
regex_reslts = {}
number_of_eq = 0
for this_file in list_of_files:
    # read raw bytes so that files in unknown encodings do not raise
    with open(this_file, 'rb') as f:
        data = f.read()

    # str(data) is the bytes repr, so every backslash in the file appears doubled
    # and the matches shown below keep that escaping.
    # 'multiline' is likely a typo for amsmath's 'multline'; such environments are not matched.
    resp = re.findall(r'\\begin\s*{(?:eqnarray|equation|multiline)}.*?end\s*{(?:eqnarray|equation|multiline)}',
                      str(data),
                      re.DOTALL)

    for eq in resp:
        number_of_eq += 1
        # create the per-file list on first use, then append
        regex_reslts.setdefault(this_file, []).append(eq)

print(round(time.time() - start_time, 2), 'seconds')

6.31 seconds


In [11]:
# number of matching candidates in .tex files
number_of_eq

29481

In [12]:
# number of files containing candidates
len(regex_reslts.keys())

929
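The two counts above (candidates overall and files with at least one candidate) can be folded into one summary. The following is a minimal sketch; the helper name `summarize` and the returned keys are illustrative, not part of the original notebook.

```python
from statistics import median

def summarize(results, total_files):
    """Summarize a {filename: [candidates]} dict such as regex_reslts."""
    counts = [len(v) for v in results.values()]
    return {
        'files_with_candidates': len(counts),
        'files_without_candidates': total_files - len(counts),
        'total_candidates': sum(counts),
        # median over files that matched at least once
        'median_per_matching_file': median(counts) if counts else 0,
    }
```

In the notebook this would be called as `summarize(regex_reslts, len(list_of_files))`.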

In [13]:
# first file results
regex_reslts[list(regex_reslts.keys())[0]]

['\\begin{equation}}\\n\\\\newcommand{\\\\eeq}{\\\\end{equation}',
 '\\begin{eqnarray}}\\n\\\\newcommand{\\\\eeqa}{\\\\end{eqnarray}']
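Both candidates from the first file are macro definitions (`\newcommand{\eeq}{\end{equation}}` and similar), not equations. One possible way to discard such false positives is sketched below. The pattern and the helper name `drop_macro_definitions` are assumptions for illustration, and the filter would also drop a genuine equation whose body happens to contain `\def` or `\newcommand`.

```python
import re

# One or more backslashes, so the filter works on plain LaTeX text as well as
# on the doubled-backslash strings produced by matching against str(bytes).
MACRO_DEF = re.compile(r'\\+(?:re)?newcommand|\\+def\b')

def drop_macro_definitions(candidates):
    """Keep only candidates that do not contain a macro definition."""
    return [c for c in candidates if not MACRO_DEF.search(c)]
```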

In [14]:
# first expression in the second file results
regex_reslts[list(regex_reslts.keys())[1]][0]

'\\begin{equation}\\\\label{planewave}\\n\\\\begin{split}\\nds^2 & = 2dx^+dx^--\\\\m^2\\\\vec{x}^2\\\\bigl(dx^+\\\\bigr)^2+d\\\\vec{x}^2\\\\,,\\\\\\\\\\nF_5 & = 4\\\\m dx^+\\\\wedge\\\\bigl(dx^1\\\\wedge dx^2\\\\wedge dx^3\\\\wedge dx^4+dx^5\\\\wedge dx^6\\\\wedge dx^7\\\\wedge dx^8\\\\bigr)\\\\,.\\n\\\\end{split}\\n\\\\end{equation}'
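The doubled backslashes in these results come from matching against `str(data)`, which is the repr of a bytes object. A variant that decodes the bytes first returns the LaTeX unchanged. The sketch below is an alternative, not the pattern used above: it assumes ISO-8859-1 decoding is acceptable (it maps every byte to a character, so it never raises), uses amsmath's `multline` spelling, accepts starred environments, and requires the closing tag to name the same environment as the opening tag. Counts obtained this way may differ from the ones reported above.

```python
import re

# Hypothetical pattern for illustration.
EQUATION_ENV = re.compile(
    r'\\begin\s*\{(eqnarray|equation|multline)(\*?)\}'  # opening tag, optional star
    r'(.*?)'                                            # body, non-greedy
    r'\\end\s*\{\1\2\}',                                # closing tag must match the opening
    re.DOTALL,
)

def find_equations(raw_bytes):
    """Return every matching environment, delimiters included, as plain text."""
    text = raw_bytes.decode('ISO-8859-1')
    return [m.group(0) for m in EQUATION_ENV.finditer(text)]
```

This still matches macro definitions such as `\newcommand{\beq}{\begin{equation}}`, so those need separate filtering.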

# parse the LaTeX using libraries


options: 
* plasTeX - http://plastex.sourceforge.net/plastex/sect0025.html
* tex2py - https://github.com/alvinwan/tex2py
* texsoup - https://texsoup.alvinwan.com/docs/quickstart.html; https://github.com/alvinwan/TexSoup

## tex2py

https://github.com/alvinwan/tex2py - "converts LaTeX into a Python parse tree, allowing navigation using the default or a custom hierarchy"<BR>
Built on top of TexSoup.

BHP: I was unable to extract math or text using this library. tex2py seems to be intended for navigating the document hierarchy (sections and subsections) rather than the content.

In [15]:
# https://github.com/alvinwan/tex2py
from tex2py import tex2py

In [16]:
list_of_files[1]

'2003/0304232'

In [17]:
with open(list_of_files[1]) as f:
    data = f.read()

# this takes a long time because tex2py relies on TexSoup
start_time = time.time()
toc = tex2py(data)
print('elapsed =', round(time.time() - start_time, 2), 'seconds')

elapsed = 33.18 seconds


In [18]:
toc.valid_tags[0:10]

('addcontentsline',
 'addtocontents',
 'addtocounter',
 'address',
 'addtolength',
 'addvspace',
 'alph',
 'appendix',
 'arabic',
 'author')

In [19]:
toc.branches

[Introduction,
 Preliminaries,
 A supersymmetric extension in the $SO(4)\times SO(4)$ \\ formalism,
 Anomalous dimension from string field theory,
 Conclusions,
 Conventions and Notation,
 Useful identities and (anti)commutators,
 More detailed computations,
 Functional expressions for the prefactor]

## TexSoup

https://texsoup.alvinwan.com/docs/quickstart.html

In [20]:
from TexSoup import TexSoup

### single document

In [21]:
with open(list_of_files[1]) as f:
    data = f.read()

In [22]:
data[0:200]

'\\documentclass[12pt]{article}\n\\usepackage{graphics}\n\\usepackage{epsfig}\n\\usepackage{color}\n\\usepackage{amsmath}\n\\usepackage{amssymb}\n\\def\\da{\\dot{\\alpha}}\n\\def\\db{\\dot{\\beta}}\n\\def\\dg{\\dot{\\gamma}}\n\\d'

In [23]:
start_time=time.time()
soup = TexSoup(data)
print(round(time.time()-start_time,2),'seconds')

26.35 seconds


In [24]:
list(soup.text)[0:20]

['12pt',
 'article',
 'graphics',
 'epsfig',
 'color',
 'amsmath',
 'amssymb',
 '\\dot',
 '\\dot',
 '\\dot',
 '\\dot',
 '\\Gamma',
 '\\Delta',
 '\\Lambda',
 '\\Sigma',
 '\\alpha',
 '\\beta',
 '\\gamma',
 '\\delta',
 '\\varepsilon']

In [25]:
soup.equation

\begin{equation}\label{planewave}
\begin{split}
ds^2 & = 2dx^+dx^--\m^2\vec{x}^2\bigl(dx^+\bigr)^2+d\vec{x}^2\,,\\
F_5 & = 4\m dx^+\wedge\bigl(dx^1\wedge dx^2\wedge dx^3\wedge dx^4+dx^5\wedge dx^6\wedge dx^7\wedge dx^8\bigr)\,.
\end{split}
\end{equation}

In [26]:
lst = list(soup.find_all('equation'))

In [27]:
# how many equations are in the document?
len(lst) 

79

In [28]:
# show the first match
lst[0] 

\begin{equation}\label{planewave}
\begin{split}
ds^2 & = 2dx^+dx^--\m^2\vec{x}^2\bigl(dx^+\bigr)^2+d\vec{x}^2\,,\\
F_5 & = 4\m dx^+\wedge\bigl(dx^1\wedge dx^2\wedge dx^3\wedge dx^4+dx^5\wedge dx^6\wedge dx^7\wedge dx^8\bigr)\,.
\end{split}
\end{equation}

In [29]:
# what is the string inside the "begin{equation}"?
lst[0][0] 

'\\label{planewave}\n\\begin{split}\nds^2 & = 2dx^+dx^--\\m^2\\vec{x}^2\\bigl(dx^+\\bigr)^2+d\\vec{x}^2\\,,\\\\\nF_5 & = 4\\m dx^+\\wedge\\bigl(dx^1\\wedge dx^2\\wedge dx^3\\wedge dx^4+dx^5\\wedge dx^6\\wedge dx^7\\wedge dx^8\\bigr)\\,.\n'

In [30]:
lst[1]

\begin{equation}
\o_n=\sqrt{n^2+\bigl(\m\a'p^+\bigr)^2}\,,\qquad n\in\Nop\,,
\end{equation}

In [31]:
lst[1][0]

"\n\\o_n=\\sqrt{n^2+\\bigl(\\m\\a'p^+\\bigr)^2}\\,,\\qquad n\\in\\Nop\\,,\n"

In [32]:
# convert the generator to a list so we can limit the output to the first 3 equations
for this_eq in list(soup.find_all('equation'))[0:3]:
    print(this_eq[0])

\label{planewave}
\begin{split}
ds^2 & = 2dx^+dx^--\m^2\vec{x}^2\bigl(dx^+\bigr)^2+d\vec{x}^2\,,\\
F_5 & = 4\m dx^+\wedge\bigl(dx^1\wedge dx^2\wedge dx^3\wedge dx^4+dx^5\wedge dx^6\wedge dx^7\wedge dx^8\bigr)\,.


\o_n=\sqrt{n^2+\bigl(\m\a'p^+\bigr)^2}\,,\qquad n\in\Nop\,,

\label{dict}
\frac{1}{\m}H = \D-J\,,\qquad \frac{1}{\bigl(\m\a'p^+\bigr)^2} = \frac{g^2_{\text{YM}}N}{J^2}\equiv \l'\,,\qquad
4\pi g_{\text{s}}\bigl(\m\a'p^+\bigr)^2 = \frac{J^2}{N}\equiv g_2
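The equation bodies printed above still carry their `\label{...}` commands and the surrounding newlines. If only the math is wanted, a small post-processing step could remove them. This is a sketch operating on plain strings such as `str(this_eq[0])`; the helper name `strip_label` is hypothetical, and labels containing nested braces are not handled.

```python
import re

# \label followed by a brace group without nested braces
LABEL = re.compile(r'\\label\s*\{[^{}]*\}')

def strip_label(body):
    """Remove \\label{...} commands and surrounding whitespace from an equation body."""
    return LABEL.sub('', body).strip()
```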



### loop over all documents

In [33]:
start_time = time.time()
texsoup_reslts = {}
number_of_eq = 0
for this_file in list_of_files[0:12]:  # the complete set would take a long time
    loop_time = time.time()
    print(this_file)
    try:
        with open(this_file, 'r', encoding='utf-8') as f:
            data = f.read()
    except UnicodeDecodeError as err:
        print(err)
        # ISO-8859-1 maps every byte to a character, so this fallback cannot fail to decode
        with open(this_file, 'r', encoding='ISO-8859-1') as f:
            data = f.read()

    soup = TexSoup(data)  # raises on LaTeX it cannot parse, which stops the loop
    for eq in soup.find_all('equation'):
        number_of_eq += 1
        # create the per-file list on first use, then append
        texsoup_reslts.setdefault(this_file, []).append(eq)
    print(round(time.time() - loop_time, 2), 'seconds')
print(round(time.time() - start_time, 2), 'seconds')

2003/0301116
5.78 seconds
2003/0304232
31.12 seconds
2003/0303017
5.46 seconds
2003/0303225
6.46 seconds
2003/0302131
'utf-8' codec can't decode byte 0xa0 in position 38109: invalid start byte
8.53 seconds
2003/0303028
5.33 seconds
2003/0301129
14.49 seconds
2003/0302136


KeyboardInterrupt: 
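Because one unparseable or very slow document stops the whole loop, a bulk run needs per-document error handling and timing. The sketch below shows one way to structure that. The parser is passed in as a callable so the example does not depend on TexSoup, and the name `parse_many` and its return shape are assumptions for illustration.

```python
import time

def parse_many(documents, parser):
    """Apply parser to each (name, text) pair, recording failures instead of halting.

    Returns three dicts keyed by name: parsed results, error messages, elapsed seconds.
    """
    results, failures, timings = {}, {}, {}
    for name, text in documents:
        t0 = time.perf_counter()
        try:
            results[name] = parser(text)
        except Exception as err:  # a parser may raise many different error types
            failures[name] = repr(err)
        timings[name] = time.perf_counter() - t0
    return results, failures, timings
```

In the notebook the callable could be something like `lambda text: list(TexSoup(text).find_all('equation'))`. This does not make parsing faster; it only keeps one bad file from discarding the work already done.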

In [34]:
import chardet

In [35]:
with open('2003/0302131', 'rb') as f:
    data = f.read()

In [36]:
chardet.detect(data)

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
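Since chardet reports ISO-8859-1 for the file that failed to decode as UTF-8, a simple fallback chain is probably enough for this corpus. The following sketch uses only the standard library; the helper name `decode_tex` and the default encoding order are assumptions, and a detector such as chardet could supply additional candidates.

```python
def decode_tex(raw, encodings=('utf-8', 'ISO-8859-1')):
    """Decode bytes with the first encoding that succeeds.

    Returns (text, encoding_used). ISO-8859-1 accepts any byte sequence,
    so with the default list the final fallback is never reached.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # only reached if the caller supplies a list where every encoding fails
    return raw.decode('utf-8', errors='replace'), 'utf-8 (with replacement)'
```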