In [None]:
from catminer.download import get_pub_info, get_Elsevier_XML
from catminer.preprocess import Elsevier_to_sentences
import pandas as pd
import requests
import json
import time
import os

# Print the current working directory; the relative paths used below
# ('XMLs/Elsevier', 'sentences') are resolved against it
print(os.path.abspath(''))

In this tutorial, we will use the CatMiner package to download one of our lab's papers in XML format and then convert it into a plain text file. First, we define the DOI and retrieve the publisher information and open-access status using the CrossRef API.

In [None]:
doi = '10.1016/j.apcatb.2015.11.002'
dois = [doi] # To target multiple DOIs, append them to this list
pub_info = get_pub_info(dois)
print(pub_info)
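
For readers curious about what a CrossRef lookup involves, the following is a minimal sketch and not CatMiner's actual implementation of get_pub_info. It assumes the public CrossRef REST endpoint https://api.crossref.org/works/<DOI> and the 'publisher' and 'license' fields of its JSON 'message' object. The helper names (fetch_crossref_record, summarize_crossref_record) and the example record are hypothetical and hand-written for illustration.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS_URL = 'https://api.crossref.org/works/'


def summarize_crossref_record(record):
    """Pull the publisher name and license URLs out of a CrossRef 'message' dict."""
    license_urls = [entry.get('URL', '') for entry in record.get('license', [])]
    return {'publisher': record.get('publisher'), 'license_urls': license_urls}


def fetch_crossref_record(doi, timeout=30):
    """Query the CrossRef REST API for one DOI (hypothetical helper; needs network access)."""
    url = CROSSREF_WORKS_URL + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)['message']


# Hand-written record shaped like a CrossRef response (illustrative values,
# not fetched from the API)
example_record = {
    'publisher': 'Elsevier BV',
    'license': [{'URL': 'https://www.elsevier.com/tdm/userlicense/1.0/'}],
}
print(summarize_crossref_record(example_record))
```

With network access, summarize_crossref_record(fetch_crossref_record(doi)) would apply the same summary to the live record. How CatMiner decides open-access status from such a record is not shown here.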

From the pub_info dictionary, we can see that this article was published by Elsevier and that it is not open-access. We can therefore use the get_Elsevier_XML function to download the associated XML file. By default, this function saves the XML to 'XMLs/Elsevier', creating the folder if it does not exist.

This function uses the Elsevier API, so you must obtain and provide an API key. Instructions for doing so can be found at https://dev.elsevier.com/. After obtaining a key, replace the 'PLACEHOLDER' string below with your key.

In [None]:
get_Elsevier_XML(dois, key='PLACEHOLDER')
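
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only a suggestion: the variable name ELSEVIER_API_KEY and the helper load_api_key are assumptions made for illustration, not part of CatMiner or the Elsevier API.

```python
import os


def load_api_key(env_var='ELSEVIER_API_KEY'):
    """Return the key stored in an environment variable (hypothetical name), or raise if unset."""
    key = os.environ.get(env_var, '').strip()
    if not key or key == 'PLACEHOLDER':
        raise RuntimeError(
            f'Set the {env_var} environment variable to your Elsevier API key.'
        )
    return key


# Usage, left commented out so this cell does not fail when the variable is unset:
# get_Elsevier_XML(dois, key=load_api_key())
```

Keeping the key out of the notebook also avoids committing it by accident when the notebook is shared.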

An XML copy of the article should now be downloaded. However, CatMiner is designed to accept plain text files. As a last step, we use features from the ChemDataExtractor package (http://chemdataextractor.org/), called through CatMiner preprocessing functions, to convert this XML file into plain text. First, we create the folder that the plain text file will be written to. After running the Elsevier_to_sentences function, we finish by deleting the intermediate files that were generated. These intermediate files should contain the plain text, but not split line-by-line into sentences.

In [None]:
# Create the directory to save sentence files in (no error if it already exists)
os.makedirs('sentences', exist_ok=True)

# Convert files from markup language into text file inputs for CatMiner
Elsevier_to_sentences('XMLs/Elsevier/', 'sentences/')

# Delete the intermediate files that were generated. These should contain the plain text, but not separated line-by-line into sentences
for file in os.listdir('sentences'):
    if 'continuous' in file:
        os.remove(os.path.join('sentences', file))
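
To confirm that the conversion produced one sentence per line, it can help to look at the first few lines of each output file. The following is a small sketch using only the standard library. The helper name preview_sentences is hypothetical, and it assumes the output files are UTF-8 text in the 'sentences' folder created above.

```python
import os
from itertools import islice


def preview_sentences(folder='sentences', n_lines=3):
    """Return the first n_lines of every file in folder, keyed by file name."""
    previews = {}
    if not os.path.isdir(folder):
        return previews  # nothing to preview if the folder is missing
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding='utf-8', errors='replace') as handle:
            previews[name] = [line.rstrip('\n') for line in islice(handle, n_lines)]
    return previews


# Print a short preview of each sentence file
for name, lines in preview_sentences().items():
    print(name)
    for line in lines:
        print('   ', line)
```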