# Convert preprints 

Contents:
1. Introduction
2. Extract and convert preprints
3. Analyze
4. Appendix: Retrieve individual preprints

## 1. Introduction 

In this notebook, we convert preprints from TeX to XML, a format that simplifies downstream parsing. 

Each submission in the ./latex/ folder should contain one or more .tex files. If a submission contains more than one .tex file, we identify the main file. The additional files are usually inserts for the main file. 

After collecting the filepaths of all submissions' main files, we convert them from .tex to .xml using the [LaTeXML](https://dlmf.nist.gov/LaTeXML/) package, spreading the work across all but one of the CPU cores (3 of the 4 on my machine).

All converted .xml files are stored in ./xml/.
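
The cells below convert one submission at a time, so the parallel step is not shown in them. A minimal sketch of how it could look follows. `run_one` and `run_parallel` are hypothetical helper names, and a thread pool stands in for `multiprocessing` here, on the assumption that each worker only waits for an external process to finish.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd):
    """Run one external command and return (return code, stderr text)."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    return proc.returncode, proc.stderr.decode('utf-8', errors='replace')


def run_parallel(commands, workers):
    """Run the commands on a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


# Stand-in commands so the sketch runs without LaTeXML installed; a real run
# would pass lists such as ['latexmlc', '--dest=' + outpath, tex_path].
demo = [[sys.executable, '-c', 'import sys; sys.stderr.write("done")']] * 3
results = run_parallel(demo, workers=2)
```

Each entry of `results` pairs a return code with the captured stderr text, which is the stream the conversion step inspects for errors.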

## 2. Extract and convert preprints

Import all dependencies:

In [11]:
import os, re, subprocess, glob, multiprocessing, time, pathlib, tarfile, gzip, shutil, zipfile
import pandas as pd

Grab arXiv submission identifiers from the metadata:

In [2]:
metadata_df = pd.read_csv('../data/2020_03_06_arxiv_metadata_astroph/arxiv_metadata_astroph.csv', 
                         dtype={'filename_parsed': str})
identifiers = list(metadata_df['filename_parsed'])
len(identifiers)

267794

Load conversion log:

In [3]:
log_df = pd.read_csv('../data/2020_03_09_extract_and_convert_submissions/conversion_log.csv')
log_df

| Unnamed: 0 | submission | tarfile | type | extracted | extracted_suffix | converted | conversion_result |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | astro-ph0001001 | arXiv_src_0001_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 1 | astro-ph0001002 | arXiv_src_0001_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 2 | astro-ph0001003 | arXiv_src_0001_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 3 | astro-ph0001004 | arXiv_src_0001_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 4 | astro-ph0001005 | arXiv_src_0001_001 | .gz | yes | .zip | no | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12527 | nlin0109004 | arXiv_src_0109_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 12528 | nlin0109028 | arXiv_src_0109_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 12529 | nucl-th0109009 | arXiv_src_0109_001 | .gz | yes | .zip | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
| 12530 | physics0109018 | arXiv_src_0109_001 | .gz | no | .tex | yes | \n(Loading /opt/local/lib/perl5/vendor_perl/5.... |
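
To see at a glance how many conversions failed, the `converted` column of the log can be tallied. The sketch below uses only the standard library. `tally` is a hypothetical helper, and the sample rows are invented placeholders in the layout of conversion_log.csv, not real log entries.

```python
import csv
import io
from collections import Counter


def tally(rows, column):
    """Count how often each value occurs in one column of the log."""
    return Counter(row[column] for row in rows)


# Invented sample rows in the layout of conversion_log.csv (not real data)
sample = io.StringIO(
    'submission,tarfile,type,extracted,extracted_suffix,converted,conversion_result\n'
    'example-0000001,example_tar,.gz,yes,.zip,yes,ok\n'
    'example-0000002,example_tar,.gz,yes,.zip,no,failed\n'
    'example-0000003,example_tar,.pdf,no,,no,\n'
)
rows = list(csv.DictReader(sample))
converted_counts = tally(rows, 'converted')
```

With the DataFrame already loaded, `log_df['converted'].value_counts()` should give the same kind of summary.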


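Set the tar directories, list the submissions that are already converted, and choose the number of worker processes. The utility functions follow in the next cell: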

In [62]:
tar_dirs = ['../data/2020_03_03_original_arxiv_tars/', '../data/2020_03_07_update_tars/']
print('Number of tar directories to process: ' + str(len(tar_dirs)))

converted_submissions = [os.path.splitext(os.path.basename(x))[0] for x in os.listdir('../data/2020_03_09_extract_and_convert_submissions/converted_xml/')]
print('Submissions already converted: ' + str(len(converted_submissions)))

processes = multiprocessing.cpu_count() - 1
print('This program will run on ' + str(processes) + ' cores.')

Number of tar directories to process: 2
Submissions already converted: 15510
This program will run on 3 cores.


In [63]:
def get_outpath(inpath):
    '''
    Returns the output filepath for the XML file
    and the submission ID, both derived from the
    given input filepath. 
    '''
    # Use the file name itself rather than a fixed position in the path
    submission_id = os.path.splitext(os.path.basename(inpath))[0]
    outpath = '../data/2020_03_09_extract_and_convert_submissions/converted_xml/' + submission_id + '.xml'
    return outpath, submission_id


def extract(submission):
    '''
    Extracts the given submission (a TarInfo object) from the open tar.
    Relies on the globals tar, extraction_path and submission_id,
    which are set in the main loop.
    Returns 'yes' or 'no' for whether extraction succeeded,
    plus the suffix of the extracted file.
    '''
    
    try:
        suffix = '.zip'
        gz_obj = tar.extractfile(submission)
        gz = tarfile.open(fileobj=gz_obj, mode='r|gz')
        zipf = zipfile.ZipFile(file=extraction_path + submission_id + suffix, mode='a', compression=zipfile.ZIP_DEFLATED)

        # Repack every file of the gzipped tar into the .zip
        for m in gz:
            if m.isdir():
                continue
            f = gz.extractfile(m)
            f_out = f.read()
            f_in = m.name
            zipf.writestr(f_in, f_out)
        zipf.close()
        gz.close()
        extracted = 'yes'

    except tarfile.ReadError: 
        # These submissions contain a single .tex file with no extension,
        # so we need to treat them differently
        suffix = '.tex'
        try:
            tar.extract(submission, extraction_path)
            with gzip.open(extraction_path + submission.name, 'rb') as f_in:
                with open(extraction_path + submission_id + suffix, 'wb+') as f_out:
                    shutil.copyfileobj(f_in, f_out)
            extracted = 'yes'
        except Exception as e:
            # An error raised inside an except block is not caught by the
            # sibling handler below, so it is handled here
            print('Something went wrong in extract(): ' + str(e))
            extracted = 'no'

    except Exception as e:
        print('Something went wrong in extract(): ' + str(e))
        extracted = 'no'
        
    return extracted, suffix


def do_not_convert(submission_id):
    '''
    Returns True if the given submission should be skipped.
    We only want submissions that are from the
    astrophysics archive, and, for performance
    purposes, that haven't already been attempted or converted.
    '''
    
    not_astrophysics = submission_id not in identifiers
    conversion_attempted = submission_id in log_df['submission'].values # catches ones that failed, we don't want to try them again
    already_converted = submission_id in converted_submissions
    return not_astrophysics or conversion_attempted or already_converted

        
def convert(submission_path):
    '''
    Converts the file at the passed filepath to XML
    using latexmlc. Returns the stderr output of
    latexmlc and 'yes' or 'no' for whether an XML
    file was written.
    '''
    outpath, submission_id = get_outpath(submission_path)
    try:
        print('Converting ' + submission_path + '...')
        # The module is imported as subprocess, not sp
        proc = subprocess.Popen(['latexmlc', '--timeout=240', '--dest=' + outpath, submission_path], stderr=subprocess.PIPE)
        out, err = proc.communicate()
        err = err.decode('utf-8')
        # Check if file was converted successfully
        if 'Error! Did not write file' in err:
            converted = 'no'
        else:
            converted = 'yes'
    except Exception as e:
        print('Something went wrong in convert(): ' + str(e))
        # Make sure err is defined even if latexmlc could not be started
        err = str(e)
        converted = 'no'
    return err, converted
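
`do_not_convert()` checks membership in `identifiers`, a list of 267,794 entries, once per submission. A possible variant, sketched below, builds sets up front so that each lookup takes constant time on average. `make_skip_check` and its parameter names are hypothetical, and the identifiers in the example are invented.

```python
def make_skip_check(wanted, attempted, converted):
    """Build a skip test backed by sets instead of lists."""
    wanted, attempted, converted = set(wanted), set(attempted), set(converted)

    def should_skip(submission_id):
        # Skip anything outside the wanted archive or already handled
        return (submission_id not in wanted
                or submission_id in attempted
                or submission_id in converted)

    return should_skip


# Invented identifiers, for illustration only
should_skip = make_skip_check(
    wanted=['example-0000001', 'example-0000002', 'example-0000003'],
    attempted=['example-0000002'],
    converted=['example-0000003'],
)
```

In the notebook, the three arguments would be `identifiers`, `log_df['submission']` and `converted_submissions`.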

The main code:

In [64]:
# For each tar directory,
for tar_dir in tar_dirs: 
    files = os.listdir(tar_dir)
    
    # For each tar file, 
    for file in files:
        filepath = tar_dir + file
        if os.path.isfile(filepath) and tarfile.is_tarfile(filepath):
            
            # Open it as read-only
            log = []
            print('Opening ' + file + ',')
            tar = tarfile.open(filepath)
            tar_name = os.path.splitext(os.path.basename(tar.name))[0]
            extracted_tar_path = None # temp, only for ReadErrors, will know path once we get tar contents
            
            # Iterate over its .gz files (which are each an article submission),
            for submission in tar.getmembers():
                
                # Only look at submissions (.gz and .pdf)
                if submission.name.endswith('.gz') or submission.name.endswith('.pdf'):
                    
                    submission_id = os.path.splitext(os.path.basename(submission.name))[0]
                    if do_not_convert(submission_id):
                        continue
                    
                    print('Working on submission: ' + submission_id + '...')
                    submission_path = tar_dir + os.path.splitext(os.path.basename(file))[0] + '/' + submission_id
                    submission_type = os.path.splitext(os.path.basename(submission.name))[1]

                    # If .pdf, skip, we cannot extract and will not convert here
                    if submission_type == '.pdf':
                        result = None
                        suffix = None # nothing is extracted, so there is no suffix to log
                        extracted = 'no'
                        converted = 'no'
                    else:
                        # Extract submission 
                        extraction_path = '../data/2020_03_09_extract_and_convert_submissions/temp/'
                        extracted_tar_path = extraction_path + submission.name.split('/')[0]
                        extracted, suffix = extract(submission)
                        
                        # Convert submission and remove extracted submission
                        if extracted == 'yes':
                            result, converted = convert(extraction_path + submission_id + suffix)
                            os.remove(extraction_path + submission_id + suffix)
                        else:
                            # Nothing to convert if extraction failed
                            result = None
                            converted = 'no'
                        
                        # Remove the folder that tar.extract() creates during ReadErrors, if it exists,
                        # whether or not extraction succeeded
                        if os.path.exists(extracted_tar_path):
                            shutil.rmtree(extracted_tar_path)
                        
                    # Log submission extraction & conversion info
                    log.append([submission_id, tar_name, submission_type, extracted, suffix, converted, result])
            
            # Save log so we have something if it fails 
            df = pd.DataFrame(log, columns=['submission', 
                                    'tarfile', 
                                    'type', 
                                    'extracted', 
                                    'extracted_suffix',
                                    'converted',
                                    'conversion_result'])
            # Write the header only if the log file does not exist yet
            log_path = '../data/2020_03_09_extract_and_convert_submissions/conversion_log.csv'
            df.to_csv(log_path, mode='a', header=not os.path.exists(log_path), index=False)
        
            # After finishing tarfile, close it and move it to 'processed' directory
            tar.close()
            os.makedirs(tar_dir + 'processed', exist_ok=True)
            os.rename(filepath, tar_dir + 'processed/' + file)
            
            print()

Opening arXiv_src_0203_001.tar,
Working on submission: astro-ph0203078...
Converting ../data/2020_03_09_extract_and_convert_submissions/temp/astro-ph0203078.zip...
Working on submission: astro-ph0203113...
Working on submission: astro-ph0203140...
Working on submission: astro-ph0203212...
Working on submission: astro-ph0203231...
Converting ../data/2020_03_09_extract_and_convert_submissions/temp/astro-ph0203231.zip...



KeyboardInterrupt: 
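
Because a long run like this one can be interrupted, the log is appended once per tar file, and the header row should be written only the first time. A standard-library sketch of that append step follows. `append_log` is a hypothetical helper, the column names are those used in the main cell, and the rows are invented placeholders.

```python
import csv
import os
import tempfile

COLUMNS = ['submission', 'tarfile', 'type', 'extracted',
           'extracted_suffix', 'converted', 'conversion_result']


def append_log(path, rows):
    """Append rows to the CSV log, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(COLUMNS)
        writer.writerows(rows)


# Demonstration in a temporary directory with invented rows
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'conversion_log.csv')
    append_log(log_path, [['example-0000001', 'example_tar', '.gz', 'yes', '.zip', 'yes', 'ok']])
    append_log(log_path, [['example-0000002', 'example_tar', '.gz', 'yes', '.zip', 'no', 'failed']])
    with open(log_path, newline='') as f:
        logged = list(csv.reader(f))
```

After two appends the file holds one header row followed by two data rows.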

## 3. Analyze

TODO: move the analysis cells over from Untitled.ipynb, then delete Untitled.ipynb.

## 4. Appendix: Retrieve individual preprints online or from the associated tar file

This post on running shell commands in parallel from Python helped: https://jarrodmcclean.com/simple-bash-parallel-commands-in-python/

First, I need to confirm the main file in each submission folder:
- If the folder doesn't contain a .bbl file, I need to add it to the bbl_lack folder. For now, set it aside and skip it.
- If the folder doesn't contain any file, I need to retrieve the submission again. For now, set it aside and skip it.

For each submission folder, I check the xml folder for a file with the same name. If none exists, I go through the files in the submission folder and look for one that contains `\documentclass`. As soon as I find one, I convert it and stop checking the remaining files.
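
The search for the main file can be sketched as follows. `find_main_tex` is a hypothetical helper, and the two files in the demonstration are invented. Older submissions written for LaTeX 2.09 declare `\documentstyle` instead, so the marker may need widening; the sketch matches only `\documentclass`, as described above.

```python
import os
import tempfile


def find_main_tex(folder):
    """Return the path of the first file that declares a document class, or None."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        # Ignore undecodable bytes; only the marker matters here
        with open(path, 'r', errors='ignore') as f:
            if '\\documentclass' in f.read():
                return path
    return None


# Demonstration with two invented files: an insert and a main file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'intro.tex'), 'w') as f:
        f.write('\\section{Introduction}\n')
    with open(os.path.join(tmp, 'paper.tex'), 'w') as f:
        f.write('\\documentclass{article}\n\\begin{document}\\input{intro}\\end{document}\n')
    main_name = os.path.basename(find_main_tex(tmp))
```

Returning on the first match mirrors the "break out of the loop" step: once a main file is found, the remaining files are not read.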

In [165]:
def guess_extension_from_headers(h):
    """
    Given headers from an arXiv e-print response, try to guess what the file
    extension should be.
    Based on: https://arxiv.org/help/mimetypes
    """
    if h.get('content-type') == 'application/pdf':
        return '.pdf'
    if h.get('content-encoding') == 'x-gzip' and h.get('content-type') == 'application/postscript':
        return '.ps.gz'
    if h.get('content-encoding') == 'x-gzip' and h.get('content-type') == 'application/x-eprint-tar':
        return '.tar.gz'
    # content-encoding is x-gzip but this appears to normally be a lie - it's
    # just plain text
    if h.get('content-type') == 'application/x-eprint':
        return '.tex'
    if h.get('content-encoding') == 'x-gzip' and h.get('content-type') == 'application/x-dvi':
        return '.dvi.gz'
    return None

def arxiv_id_to_source_url(arxiv_id):
    # This URL is normally a tarball, but sometimes something else.
    # arXiv provides a /src/ URL which always serves up a tarball,
    # but if we used this, we'd have to untar the file to figure out
    # whether it's renderable or not. By using the /e-print/ endpoint
    # we can figure out straight away whether we should bother rendering
    # it or not.
    # https://arxiv.org/help/mimetypes has more info
    return 'https://arxiv.org/e-print/' + arxiv_id

import requests


class DownloadError(Exception):
    """Raised when the file type of an arXiv e-print cannot be determined."""


def download_source_file(arxiv_id):
    """
    Downloads the source of this paper from arXiv and writes it to
    the working directory as <arxiv_id><extension>.
    """
    source_url = arxiv_id_to_source_url(arxiv_id)
    res = requests.get(source_url)
    res.raise_for_status()
    extension = guess_extension_from_headers(res.headers)
    if not extension:
        raise DownloadError("Could not determine file extension from "
                            "headers: Content-Type: {}; "
                            "Content-Encoding: {}".format(
                                res.headers.get('content-type'),
                                res.headers.get('content-encoding')))
    with open(arxiv_id + extension, 'wb+') as f:
        f.write(res.content)
        print('Created ' + arxiv_id + extension)

download_source_file('1010.3382')

Created 1010.3382.tar.gz
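
The other route named in the heading, reading a single submission straight from its tar file, could look like the sketch below. `read_submission` is a hypothetical helper. It assumes that each member is named after its submission ID, as in the main loop, and the demonstration builds a tiny stand-in tar with an invented member.

```python
import io
import os
import tarfile
import tempfile


def read_submission(tar_path, submission_id):
    """Return the raw bytes of one submission stored in a tar file, or None."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            stem = os.path.splitext(os.path.basename(member.name))[0]
            if member.isfile() and stem == submission_id:
                return tar.extractfile(member).read()
    return None


# Build a tiny stand-in tar with an invented member so the sketch is runnable
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, 'example.tar')
    payload = b'placeholder bytes'
    with tarfile.open(tar_path, 'w') as tar:
        info = tarfile.TarInfo(name='0000/example-0000001.gz')
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    found = read_submission(tar_path, 'example-0000001')
    missing = read_submission(tar_path, 'example-0000002')
```

The returned bytes are still gzip-compressed in the real tars, so they would go through the same handling as in `extract()`.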
