# Corpus generation from articles published by Wiley
--------------------------

This notebook demonstrates the methods and functionality of the `WileyCorpusGenerator()` class. The general workflow is described in the top-level `README.md`; the specific workflow for this publisher is as follows:

1. Use search terms to query the CrossRef Clickthrough API for article DOIs and API-specific URLs. This is done using the `meta_data_search()` method.

2. Use the identified URLs, which are API queries, to receive the PDF files of the original articles. The tracked errors and their error codes correspond to this step. This is done through the `get_fulltexts()` method.

3. The received PDF is converted to a dictionary of strings through a combination of `scipdf-parser` and custom methods. At this step, the user may also choose to save the PDF and to extract and save the figures as PNG files. This is all nested within the `get_fulltexts()` method.
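As an illustration of step 1, the sketch below builds a keyword query URL for CrossRef's public REST API, restricted to Wiley's `10.1002` DOI prefix. This is an assumption about what such a query can look like; the notebook does not show how `meta_data_search()` constructs its requests, and `build_crossref_query` is a hypothetical helper that is not part of `wiley_corpus_generator.py`.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_query(search_terms, mailto, rows=100, cursor="*"):
    """Build a CrossRef /works URL for a keyword search (hypothetical helper)."""
    params = {
        "query": " ".join(search_terms),  # free-text query over the metadata
        "filter": "prefix:10.1002",       # assumption: limit results to Wiley's DOI prefix
        "rows": rows,                      # results per page
        "cursor": cursor,                  # "*" starts deep paging
        "mailto": mailto,                  # contact address for CrossRef's polite pool
    }
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


url = build_crossref_query(["chemistry", "energy"], "name@example.org")
print(url)
```

Only the URL is built here, so the sketch runs without network access or credentials.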

-------------------------

### Pre-requisites:

To use the `WileyCorpusGenerator()` class, several pre-requisite credentials and software packages are needed. These are listed in the class docstring, along with URLs for obtaining them. They are also listed below:

 - Users must obtain a CrossRef clickthrough API token from https://olabout.wiley.com/WileyCDA/Section/id-829772.html, which requires a valid email address and the user's ORCID iD. Both are free of charge to register for.
 
 - In order to receive full text PDFs from Wiley through the API, it is suggested that users be logged into a university network, which automatically supplies subscription information. For the University of Washington, the BIG-IP EDGE client should be used: https://www.lib.washington.edu/help/connect/husky-onnet. Make sure that all internet traffic is routed through this VPN during any use of this class that interfaces with the CrossRef API (i.e., `self.meta_data_search()` and `self.get_fulltexts()`).
 
 - In order to convert the PDF files to machine-readable text formats, the `scipdf-parser` Python library is used: https://github.com/titipata/scipdf_parser. This is a Python interface for a Java API (GROBID), so users will need to download and install a Java development kit. The Oracle Java development kit (JDK) is suggested: https://docs.oracle.com/en/java/javase/15/install/overview-jdk-installation.html#GUID-8677A77F-231A-40F7-98B9-1FD0B48C346A. The `scipdf-parser` GitHub repo can be installed with `pip install git+https://github.com/titipata/scipdf_parser.git#egg=scipdf_parser`. Alternatively, the repo can be cloned locally and referred to as shown in the `wiley_corpus_generator.py` module.

 
 - The specific Python library requirements for this class are detailed in `wiley_requirements.txt`.
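Before running the cells below, it can help to confirm that the software pre-requisites are visible from the notebook's environment. The following is a minimal sketch, assuming that `scipdf-parser` is importable as `scipdf` and that the JDK puts a `java` executable on the `PATH`; `check_prerequisites` is a hypothetical helper, not part of `wiley_corpus_generator.py`.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("scipdf",), executables=("java",)):
    """Return {name: bool} for each importable module and each executable on PATH."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        status[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        status[f"executable:{exe}"] = shutil.which(exe) is not None
    return status


for item, ok in check_prerequisites().items():
    print(f"{item}: {'found' if ok else 'MISSING'}")
```

This only checks that the tools can be located; it does not verify versions, credentials, or VPN routing.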

In [1]:
import json
import os
import sys

module_path = os.path.abspath(os.path.join('../modules/'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import wiley_corpus_generator as wcg

%load_ext autoreload

In [6]:
# CrossRef API Token is obtained from: 
# https://olabout.wiley.com/WileyCDA/Section/id-829772.html
cr_clickthrough_client_token = '********-********-********-********'

# Users need to follow this format to fill API headers
user_agent = 'Wesley Tatum, mailto:wesleyktatum@gmail.com'
email = 'wesleyktatum@gmail.com'

# Where to save JSON of URL's and DOI's
meta_save_path = '/Users/wesleytatum/Desktop/'
meta_data_json_path = meta_save_path+'wiley_meta_list.json'

# Where to save any converted articles, PDF's, and figures
fulltext_save_path = '/Volumes/easystore/post_doc/tutorial_fulltexts/'

# Search terms submitted to identify articles
search_terms = ['chemistry','energy','molecular']

In [3]:
# Initialize the corpus generator with your credentials
corp_gen = wcg.WileyCorpusGenerator(cr_clickthrough_client_token,
                                    user_agent, email)

# Using the search terms, query the CrossRef API. Returns the list of articles
# as a dictionary, as well as saving it for later access.
publist = corp_gen.meta_data_search(search_terms, meta_save_path)

Percent complete: 0.19510748471332856
Percent complete: 0.3902149694266571
Percent complete: 0.5853224541399857
Percent complete: 0.7804299388533142
Percent complete: 0.9755374235666429
Percent complete: 1.1706449082799715
Percent complete: 1.3657523929933
Percent complete: 1.5608598777066285
Percent complete: 1.7559673624199572
Percent complete: 1.9510748471332857
Percent complete: 2.146182331846614
Percent complete: 2.341289816559943
Percent complete: 2.5363973012732712
Percent complete: 2.7315047859866
Percent complete: 2.9266122706999287
Percent complete: 3.121719755413257
Percent complete: 3.3168272401265857
Percent complete: 3.5119347248399144
Percent complete: 3.7070422095532427
Percent complete: 3.9021496942665714
Percent complete: 4.0972571789799
Percent complete: 4.292364663693228
Percent complete: 4.487472148406558
Percent complete: 4.682579633119886
Percent complete: 4.877687117833214
Percent complete: 5.0727946025465425
Percent complete: 5.267902087259872
Percent complete:

In [10]:
print(list(publist.keys()))
print(len(publist['URL']))

# For this example notebook, we only need to request a few example articles
example_pub_urls = []
example_pub_dois = []

for i in range(100):
    example_pub_urls.append(publist['URL'][i])
    example_pub_dois.append(publist['DOI'][i])
    
example_arts = {'URL': example_pub_urls,
                'DOI': example_pub_dois}

# Overwrites the full publist returned by meta_data_search with our example list
with open(meta_data_json_path, 'w') as f:
    json.dump(example_arts, f)

['URL', 'DOI']
450746
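The loop above assumes that the search returned at least 100 entries. A slightly more defensive variant, sketched here with a hypothetical `take_first` helper and made-up placeholder values, uses slicing so that shorter lists are handled without an `IndexError`:

```python
def take_first(publist, n):
    """Return a new dict holding at most the first n entries of every list."""
    return {key: list(values[:n]) for key, values in publist.items()}


# Placeholder values for illustration only, not real URLs or DOIs
demo = {'URL': ['u1', 'u2', 'u3'], 'DOI': ['d1', 'd2', 'd3']}
subset = take_first(demo, 2)
print(subset)
```

With the real search result, `take_first(publist, 100)` would produce the same `example_arts` dictionary as the loop.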


Using either the same `corp_gen` instance or a new one, read the saved list of article URLs and DOIs and request the full text PDFs. This function is rate-limited by Wiley. Errors associated with rate limits, access, and broken URLs are tracked and saved in a JSON file, which is saved in the same directory as the article JSON.
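When an API is rate-limited, a client usually spaces out its requests. The sketch below shows one generic way to do that. It is not taken from `get_fulltexts()`, the `Throttle` class is hypothetical, and the interval is a placeholder rather than Wiley's actual limit.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.01)  # placeholder interval
for _ in range(3):
    throttle.wait()  # a request would be issued right after this call
```

Calling `wait()` before each request keeps consecutive requests at least `min_interval` seconds apart.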

If you have already begun requesting full texts but had to stop, you can resume from the same point by setting `checkpoint = True`. The resume point is found by counting the successfully converted article JSONs and any failed attempts, as tracked in `error_counts.json`.
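The checkpoint idea can be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: it assumes each converted article is a `.json` file directly inside one directory, that `error_counts.json` sits in that same directory, and that it holds a flat mapping of error name to count. `resume_index` is a hypothetical name.

```python
import json
from pathlib import Path


def resume_index(article_dir, error_file="error_counts.json"):
    """Estimate how many list entries were already attempted (hypothetical helper)."""
    article_dir = Path(article_dir)
    # Converted articles: every JSON file except the error log itself
    converted = sum(1 for p in article_dir.glob("*.json") if p.name != error_file)
    failed = 0
    error_path = article_dir / error_file
    if error_path.exists():
        with open(error_path) as f:
            # Assumed layout: {"error403": 2, "error404": 1, ...}
            failed = sum(json.load(f).values())
    return converted + failed
```

The returned number is the index in the URL list at which requests would start again.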

If figures are extracted and saved, a directory is created within the article save directory. The figures are saved in the `figures/` directory, and their captions and other information are saved in the `data/` directory.

In [8]:
%autoreload

corp_gen.get_fulltexts(meta_data_json_path, fulltext_save_path,
                       checkpoint = False, save_pdf = True,
                       save_figs = True)

100%|██████████| 100/100 [38:55<00:00, 22.69s/it]

Total URL's queried = 100
Successful scrape rate = 75.00%
Successful PDF scrape rate = 75.00%
Total error rate = 25.00%
error299 rate = 0.00%
error400 rate = 0.00%
error403 rate = 20.00%
error404 rate = 5.00%
error500 rate = 0.00%
error503 rate = 0.00%
error503 rate = 0.00%


When full text acquisition runs to completion, `get_fulltexts` automatically calls an evaluation function, which reports the percentage of successful and failed full text queries. The end of the `get_fulltexts` function also terminates server connections to the GROBID API.

If you terminate the `get_fulltexts` function early, it is important to kill the background subprocess that maintains the connection to the GROBID server. That can be done with the `%killbgscripts` magic, as in the next code cell, or by passing the `subprocess.Popen` object to the `corp_gen.terminate_grobid()` method.
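Outside of a notebook, where `%killbgscripts` is unavailable, a `subprocess.Popen` object can be stopped with the standard library alone. The sketch below is a generic pattern, not the body of `terminate_grobid()`; `stop_process` is a hypothetical helper, and a short-lived Python child stands in for the GROBID process.

```python
import subprocess
import sys


def stop_process(proc, timeout=5.0):
    """Ask a child process to exit, then force-kill it if it does not."""
    if proc.poll() is not None:
        return proc.returncode      # already finished
    proc.terminate()                # polite request to stop
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                 # forceful fallback
        proc.wait()
    return proc.returncode


# Stand-in for a long-running background server
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(child)
print(child.poll() is not None)
```

If the server was launched through a shell or wrapper script, its own child processes may need to be stopped separately.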

At any point, you may evaluate the success-rate and error-rate of the full text query by calling `corp_gen.evaluate_corp_gen()`. The different error codes that are listed are defined in the `corp_gen.get_fulltexts()` method docstring.
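The reported percentages are simple ratios of counts to the number of URLs queried. A minimal sketch of that arithmetic follows, using the counts implied by the report above (20 `error403` and 5 `error404` out of 100). `summarize` is a hypothetical function and is not how `evaluate_corp_gen()` is necessarily written.

```python
def summarize(total, error_counts):
    """Return success, total-error, and per-code rates in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    errors = sum(error_counts.values())
    return {
        "success": 100.0 * (total - errors) / total,
        "error": 100.0 * errors / total,
        "by_code": {name: 100.0 * n / total for name, n in error_counts.items()},
    }


report = summarize(100, {"error403": 20, "error404": 5})
print(f"Successful scrape rate = {report['success']:.2f}%")
print(f"Total error rate = {report['error']:.2f}%")
```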

In [9]:
%killbgscripts
corp_gen.evaluate_corp_gen(fulltext_save_path, save_pdf = True)

All background processes were killed.
Total URL's queried = 100
Successful scrape rate = 75.00%
Successful PDF scrape rate = 75.00%
Total error rate = 25.00%
error299 rate = 0.00%
error400 rate = 0.00%
error403 rate = 20.00%
error404 rate = 5.00%
error500 rate = 0.00%
error503 rate = 0.00%
error503 rate = 0.00%


The internal methods of the `WileyCorpusGenerator()` class can also be used on individual articles or, as shown below, on single PDFs that you may already have saved.

In [10]:
%autoreload

pdf_path = '/Users/wesleytatum/Desktop/post_doc/data/1-s2.0-S0013468607007153-main.pdf'
figure_save_path = '/Volumes/easystore/post_doc/wiley_fulltexts/figures/'

corp_gen = wcg.WileyCorpusGenerator(cr_clickthrough_client_token, user_agent, email)

p = corp_gen.start_grobid()

article_dict = corp_gen.parse_pdf(pdf_path, figure_save_path, figures = True)

corp_gen.terminate_grobid(p)

In [11]:
article_dict

{'headers_list': ['Introduction',
  'Experimental',
  'Substrates preparation',
  'Solutions',
  'Experimental techniques',
  'Results and discussion',
  'EIS measurements and visual observation',
  'Polarization measurements',
  'SEM/EDS study',
  'XPS characterization',
  'SKPFM investigations',
  'Discussion and perspectives',
  'Conclusions',
  'Acknowledgements'],
 ' Meta-data ': {'DOI': '10.1016/j.electacta.2007.05.058',
  'Title': 'High effective organic corrosion inhibitors for 2024 aluminium alloy',
  'Figures': [{'figure_label': '1',
    'figure_type': '',
    'figure_id': 'fig_0',
    'figure_caption': 'Fig. 1 .1Fig. 1. (a and b) EIS spectra obtained for AA2024 after 14 days of immersion in 0.05 M NaCl with or without inhibitors; (c) Bode plots for samples immersed into blank NaCl and NaCl doped with quinaldic acid, salicylaldoxime and 8-hydroxyquinoline with fitting results.',
    'figure_data': ''},
   {'figure_label': '3',
    'figure_type': '',
    'figure_id': 'fig_1',
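The returned dictionary can be navigated like any nested Python structure. The sketch below collects figure captions by label, assuming the key names shown in the output above (note the spaces inside `' Meta-data '`). The `sample` dictionary is an abbreviated stand-in for `article_dict`, with the caption shortened for readability, and `figure_captions` is a hypothetical helper.

```python
def figure_captions(article_dict, meta_key=' Meta-data '):
    """Map each figure label to its caption; empty dict if no figures are present."""
    meta = article_dict.get(meta_key, {})
    return {fig['figure_label']: fig['figure_caption']
            for fig in meta.get('Figures', [])}


# Abbreviated stand-in for the parse_pdf output shown above
sample = {
    'headers_list': ['Introduction', 'Experimental'],
    ' Meta-data ': {
        'DOI': '10.1016/j.electacta.2007.05.058',
        'Figures': [{'figure_label': '1',
                     'figure_type': '',
                     'figure_id': 'fig_0',
                     'figure_caption': 'Fig. 1. (a and b) EIS spectra ...',
                     'figure_data': ''}],
    },
}

captions = figure_captions(sample)
print(captions)
```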


In [26]:
%autoreload