# Convert preprints 

Contents:
1. Introduction
2. Identify preprints
3. Convert preprints
4. Appendix: Request a submission from arXiv API

## 1. Introduction 

In this notebook, we convert preprints from TEX to XML, a format that simplifies downstream parsing. 

Each submission in the ./latex/ folder should contain one or more .tex files. If a submission contains more than one .tex file, we identify the main file; the additional files are usually inserts for it.

After collecting the filepaths for all submissions' main files, we convert them from .tex to .xml using the [latexml](https://dlmf.nist.gov/LaTeXML/) package, spreading the work across all CPU cores (4 on my machine).

All converted .xml files are stored in ./xml/.

## 2. Identify preprints

Import all dependencies:

In [1]:
import os, re, subprocess, glob, multiprocessing, time, pathlib

Collect preprint filepaths:

In [2]:
# Module-level results, filled in by collect_all_preprints()
empty_submissions = []
preprints = []
corrupt_submissions = []

def collect_all_preprints():
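    '''
    Collects the filepaths for all preprints within 
    the ./latex/ folder and stores them in the global list
    preprints, each entry being the path of a preprint.
    Empty and potentially corrupt submissions are recorded
    by arXiv id in empty_submissions and corrupt_submissions.
    '''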
    
    # Initialize variables
    global empty_submissions 
    empty_submissions = []
    global preprints 
    preprints = []
    global corrupt_submissions 
    corrupt_submissions = []
    base_path = 'latex'
    submission_count = 0
    texfile_count = 0

    # Walk through tar directories
    for idx, tar_folder in enumerate(os.listdir(base_path)):
        
        # If current path isn't a directory, skip
        tar_path = base_path + '/' + tar_folder
        if not os.path.isdir(tar_path):
            continue
        
        # Walk through each submission directory
        submission_dirs = os.listdir(tar_path)
        submission_count += len(submission_dirs)
        for submission in submission_dirs:
            
            # If current path isn't a directory, skip
            submission_path = tar_path + '/' + submission
            if not os.path.isdir(submission_path):
                submission_count -= 1
                continue

            arxiv_id = os.path.basename(submission_path) # used to note empty or corrupt submissions 

            # If submission is empty, note & skip
            texs = glob.glob(submission_path + '/**/*.tex', recursive=True)
            texfile_count += len(texs)
            if len(texs) == 0:
                empty_submissions.append(arxiv_id)
                continue
            
            # Otherwise get the preprint
            else:
                preprint_path = identify_preprint(submission_path, texs)
                if preprint_path:
                    preprints.append(preprint_path)
                else:
                    corrupt_submissions.append(arxiv_id)
    
    print('TEX files: ' + str(texfile_count))
    print('Submissions: ' + str(submission_count))
    print('Preprints: ' + str(len(preprints)))
    print('Empty submissions: ' + str(len(empty_submissions)))
    print('Potentially corrupt submissions: ' + str(len(corrupt_submissions)))
    

def identify_preprint(submission_path, texs):
    '''
    Identifies the preprint within a given submission. 
    
    Parameters
    ----------
    submission_path : str
        Filepath to submission directory
    texs : list of str
        Filepaths to all TEX files within submission directory
    '''
    preprint = None
    
    # If submission contains only one file, this is the preprint
    if len(texs) == 1:
        preprint = texs[0]
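    # If submission contains ms.tex or main.tex, this is the preprint.
    # texs holds full paths, so match on the file name rather than the whole path.
    elif any(os.path.basename(t) == 'ms.tex' for t in texs):
        preprint = next(t for t in texs if os.path.basename(t) == 'ms.tex')
    elif any(os.path.basename(t) == 'main.tex' for t in texs):
        preprint = next(t for t in texs if os.path.basename(t) == 'main.tex')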
    # Otherwise, iterate through each .tex looking for \documentclass or \documentstyle
    else: 
        for tex_path in texs: 
            with open(tex_path, 'rb') as f: 
                data = f.readlines()
                r = re.compile(b'(.*\\\\documentclass.*)|(.*\\\\documentstyle.*)')
                if len(list(filter(r.match, data))) > 0:
                    preprint = tex_path
                    break
    
    return preprint
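
The fallback branch of identify_preprint depends on a bytes regex matched line by line. The small check below, using made-up sample lines, is a sketch of what that pattern accepts. Note that a commented-out `\documentclass` line also matches, so the heuristic can occasionally pick the wrong file.

```python
import re

# Same pattern as the fallback branch of identify_preprint
pattern = re.compile(b'(.*\\\\documentclass.*)|(.*\\\\documentstyle.*)')

# Illustrative sample lines (not taken from any real submission)
lines = [
    b'\\documentclass[12pt]{article}\n',
    b'  \\documentstyle{aa}\n',
    b'% \\documentclass{article}\n',  # commented out, but still matches
    b'\\section{Introduction}\n',
]

matches = [bool(pattern.match(line)) for line in lines]
print(matches)
```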

In [3]:
collect_all_preprints()

TEX files: 125484
Submissions: 89908
Preprints: 89630
Empty submissions: 271
Potentially corrupt submissions: 7


View arXiv ids for the potentially corrupt submissions: 

In [11]:
corrupt_submissions

['1105.1087',
 '1211.4277',
 '1304.7762',
 '1308.6483',
 '1409.3422',
 '1606.06791',
 '1607.01189']

The website arXiv Vanity is unable to render these preprints either, which suggests that something is wrong with their TEX structure. Since there are only a few, and I don't want to bother with their PDFs, we skip them for now.

## 3. Convert preprints

In [8]:
def get_outpath(tex_path):
    '''
    Returns the path for an XML file,
    to be named with its arXiv identifier,
    which is extracted from the path of given TEX file. 
    '''
    
    path_parts = pathlib.Path(tex_path).parts
    arxiv_id = path_parts[2]
    outpath = 'xml/' + arxiv_id + '.xml'
    return outpath


def convert_to_xml(inpath):
    '''
    Converts given TEX file to XML using a subprocess that calls latexml.
    '''

    # Sleep briefly; in practice this lets KeyboardInterrupt be caught (reason unclear)
    time.sleep(0.1) 
    
    # Get the path for the converted file
    outpath = get_outpath(inpath)
    
    # If the file has already been converted or conversion has been attempted, stop
    logfile_path = 'logs/' + os.path.splitext(os.path.basename(outpath))[0] + '.txt'
    if os.path.isfile(outpath) or os.path.isfile(logfile_path):
        print('...')
        return
    
    # Otherwise, try converting it, logging latexml output
    stderr = b''
    try:
        print('..... processing {}'.format(inpath))
        # Open a subprocess that runs latexml
        sub_proc = subprocess.Popen(['latexml', '--dest=' + outpath, inpath],
                                   stderr=subprocess.PIPE)
        # Make this function wait until the process has finished
        # (otherwise convert_to_xml will finish, leaving latexml running in the background,
        # and the pool will give the worker a new task, which is another convert_to_xml call,
        # again spawning another latexml subprocess, until we are up to our eyeballs in latexml
        # processes and the resource limit on the number of open files has been reached).
        # Also specify a timeout of 4 minutes, so we stop waiting after that
        # (which is useful for some conversions that hang recursively, such as with
        # latex/arXiv_src_1009_002/1009.1724/15727_eger.tex)
        _, stderr = sub_proc.communicate(timeout=240)
    # If the subprocess has timed out, kill it (or it will eat up memory)
    except subprocess.TimeoutExpired:
        print('{}: CONVERSION FAILED - TIMEOUT'.format(inpath))
        sub_proc.kill()
        _, stderr = sub_proc.communicate()  # reap the process and collect its output
    except Exception as error:
        print('{}: CONVERSION FAILED: {}'.format(inpath, error))
    finally:
        # Always write a log, so that this file is skipped on the next run
        print('Writing log... {}'.format(inpath))
        os.makedirs('logs', exist_ok=True)
        with open(logfile_path, 'w') as logfile:
            logfile.write(stderr.decode(errors='replace'))
        
    
def start_conversion():
    '''
    Begins the preprint conversion, spreading the work across all cores.
    '''
    
    # Set maxtasksperchild so that each worker is replaced after one task,
    # since each conversion takes up considerable memory
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
                                maxtasksperchild=1)
    print('Initialized {} workers'.format(multiprocessing.cpu_count()))
    print('Beginning conversion...')
    
    try:
        for _ in pool.imap_unordered(convert_to_xml, preprints):
            pass
    except KeyboardInterrupt:
        pool.terminate()
        exit(1)
    except Exception:
        pool.terminate()
        exit(1)
    
    pool.close()
    pool.join()
    
    
if __name__ == '__main__':
    start_conversion()

...
...
...
...
Initialized 4 workers
Beginning conversion...
...
...
...


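If the conversion is restarted after many files have already been converted, it takes a while to get going: every already-processed file is still dispatched to a worker, which checks for an existing XML or log file (and sleeps briefly) before the unconverted files are reached.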
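
One way to avoid that wait is to filter the list before handing it to the pool. The following is a minimal sketch, not part of the original run: `pending_conversions` is a hypothetical helper, and it assumes the same layout as get_outpath (latex/<tar folder>/<arXiv id>/...) with logs named <arXiv id>.txt.

```python
import os
import pathlib


def pending_conversions(tex_paths, xml_dir='xml', log_dir='logs'):
    """Hypothetical helper: keep only the TEX paths with no XML or log file yet."""

    def stems(folder):
        # File names without extension, e.g. '1009.1724.xml' -> '1009.1724'
        if not os.path.isdir(folder):
            return set()
        return {os.path.splitext(name)[0] for name in os.listdir(folder)}

    attempted = stems(xml_dir) | stems(log_dir)
    # Same id extraction as get_outpath: the third path component is the arXiv id
    return [p for p in tex_paths if pathlib.Path(p).parts[2] not in attempted]
```

With this, the pool could be fed `pending_conversions(preprints)` instead of `preprints`, so workers are only spawned for files that still need an attempt.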

Afterwards, compare the contents of ./logs/ and ./xml/: any arXiv id that has a log file but no XML file failed to convert.
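
That comparison is a set difference on file names. A minimal sketch, assuming logs are stored as <arXiv id>.txt and converted files as <arXiv id>.xml, and that both folders exist (`failed_conversions` is a hypothetical name):

```python
import os


def failed_conversions(xml_dir='xml', log_dir='logs'):
    """Hypothetical helper: arXiv ids with a log file but no XML file."""
    converted = {os.path.splitext(name)[0]
                 for name in os.listdir(xml_dir) if name.endswith('.xml')}
    attempted = {os.path.splitext(name)[0]
                 for name in os.listdir(log_dir) if name.endswith('.txt')}
    # Attempted but never produced an XML file
    return sorted(attempted - converted)
```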

https://github.com/hopper-project/hoptex/search?q=generate_sanitized_document&unscoped_q=generate_sanitized_document

First, I need to confirm the main file in each submission.
- If the submission doesn't contain a .bbl file, add it to the bbl_lack folder, set it aside, and skip it for now.
- If the submission doesn't contain any file, it needs to be retrieved again; set it aside and skip it for now.

For each submission folder, I check whether ./xml/ contains a file named after it. If not, I go through the files in the submission folder until one contains \\documentclass, convert that file, and stop looking.
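
The check described above could be sketched as follows. This is an illustration under assumptions rather than the code that was run: `find_unconverted` is a hypothetical helper, it assumes the latex/<tar folder>/<arXiv id>/ layout, and it uses a plain substring test for `\documentclass` instead of the regex used earlier.

```python
import glob
import os


def find_unconverted(latex_dir='latex', xml_dir='xml'):
    """Hypothetical helper: map arXiv id -> main file for submissions lacking XML."""
    todo = {}
    for submission in sorted(glob.glob(os.path.join(latex_dir, '*', '*'))):
        if not os.path.isdir(submission):
            continue
        arxiv_id = os.path.basename(submission)
        if os.path.isfile(os.path.join(xml_dir, arxiv_id + '.xml')):
            continue  # already converted
        candidates = glob.glob(os.path.join(submission, '**', '*'), recursive=True)
        for path in sorted(candidates):
            if not os.path.isfile(path):
                continue
            with open(path, 'rb') as f:
                if b'\\documentclass' in f.read():
                    todo[arxiv_id] = path
                    break  # first matching file wins
    return todo
```

Submissions that have no XML file and no file containing `\documentclass` are simply absent from the result, which makes them easy to set aside for later.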

## 4. Appendix: Request a submission from arXiv API

In [165]:
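import requests


class DownloadError(Exception):
    """Raised when the type of a downloaded e-print cannot be determined."""


def guess_extension_from_headers(h):
    """
    Given headers from an arXiv e-print response, try and guess what the file
    extension should be.
    Based on: https://arxiv.org/help/mimetypes
    """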
    if h.get('content-type') == 'application/pdf':
        return '.pdf'
    if h.get('content-encoding') == 'x-gzip' and h.get('content-type') == 'application/postscript':
        return '.ps.gz'
    if h.get('content-encoding') == 'x-gzip' and h.get('content-type') == 'application/x-eprint-tar':
        return '.tar.gz'
    # content-encoding is x-gzip but this appears to normally be a lie - it's
    # just plain text
    if h.get('content-type') == 'application/x-eprint':
        return '.tex'
    if h.get('content-encoding') == 'x-gzip' and h.get('content-type') == 'application/x-dvi':
        return '.dvi.gz'
    return None

def arxiv_id_to_source_url(arxiv_id):
    # This URL is normally a tarball, but sometimes something else.
    # arXiv provides a /src/ URL which always serves up a tarball,
    # but if we used this, we'd have to untar the file to figure out
    # whether it's renderable or not. By using the /e-print/ endpoint
    # we can figure out straight away whether we should bother rendering
    # it or not.
    # https://arxiv.org/help/mimetypes has more info
    return 'https://arxiv.org/e-print/' + arxiv_id

def download_source_file(arxiv_id):
    """
    Download the LaTeX source of this paper and returns as ContentFile.
    """
    source_url = arxiv_id_to_source_url(arxiv_id)
    res = requests.get(source_url)
    res.raise_for_status()
    extension = guess_extension_from_headers(res.headers)
    if not extension:
        raise DownloadError("Could not determine file extension from "
                            "headers: Content-Type: {}; "
                            "Content-Encoding: {}".format(
                                res.headers.get('content-type'),
                                res.headers.get('content-encoding')))
    with open(arxiv_id + extension, 'wb+') as f:
        f.write(res.content)
        print('Created ' + arxiv_id + extension)

download_source_file('1010.3382')

Created 1010.3382.tar.gz
