# Resolve PIDs

Ian Thomas
Research Capability Unit, 
RMIT University


## Introduction

The script takes a list of library persistent identifiers (PIDs), such as ISBN and LCCN, and returns the corresponding
data from [openlibrary.org](http://openlibrary.org). It creates a CSV file with selected fields, leaving a PID's fields blank when no data is found for it.
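
As a rough illustration of the "selected fields, blank when missing" behaviour, the sketch below flattens one response into one CSV row using plain dictionary access. The `flatten` helper and the sample payload are hypothetical: the payload only mirrors the shape implied by the JSON paths used later in this notebook, its values are placeholders, and it assumes that an unknown PID yields an empty JSON object.

```python
# Hypothetical response, shaped like the JSON paths used later in this notebook
# ("$.*.title", "$.*.authors[*].name"); the values are placeholders, not real data.
sample_response = {
    "ISBN:0000000000": {
        "title": "Example Title",
        "authors": [{"name": "First Author"}, {"name": "Second Author"}],
        "publish_date": "2000",
    }
}


def flatten(pid, response):
    """Build one CSV row; fields absent from the response become empty strings."""
    # Assumption: the response holds at most one record, keyed by the PID.
    record = next(iter(response.values()), {})
    return {
        "pid": pid,
        "Title": record.get("title", ""),
        "Author(s) Name": ", ".join(a.get("name", "") for a in record.get("authors", [])),
        "Publish Date": record.get("publish_date", ""),
        "URL": record.get("url", ""),
    }


print(flatten("ISBN:0000000000", sample_response))
print(flatten("ISBN:123141244124", {}))  # every field except "pid" is blank
```

The notebook itself uses `jsonpath_rw` expressions instead of hard-coded keys, which makes it easier to add or change columns by editing a single list.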

## Instructions

Replace the example list of PIDs in the next cell with your own, one PID per line, each with the correct prefix (for example `ISBN:` or `LCCN:`).

In [3]:

pids = """
    ISBN:9780980200447
    LCCN:93005405
    ISBN:123141244124
    """

When you are ready, select the above cell and press the run ">" button above to execute it.

Then press the run button multiple times to execute the next cell and then wait for output.

When the program completes, you will see a new file called `final.csv` in the left-hand side panel. Double-click that CSV file to review it here, then right-click the filename and select download to save the file to your desktop.

This notebook deployment under [mybinder.org](http://mybinder.org) is temporary and will disappear quickly if unused or left idle, so it is important that you download the CSV as soon as you are satisfied with the result.
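
Because each request is followed by a pause of several seconds, a missing or mistyped prefix is only noticed late. The sketch below is one possible pre-flight check of the PID list. The `check_pids` helper and the `KNOWN_PREFIXES` tuple are hypothetical; the tuple lists only the two prefixes named in this notebook, so extend it if you use others.

```python
# Assumption: only the prefixes mentioned in this notebook are accepted.
KNOWN_PREFIXES = ("ISBN", "LCCN")


def check_pids(text, prefixes=KNOWN_PREFIXES):
    """Split a block of text into PIDs and separate well-formed from suspect entries."""
    good, suspect = [], []
    for pid in text.split():
        prefix, sep, value = pid.partition(":")
        if sep and value and prefix.upper() in prefixes:
            good.append(f"{prefix.upper()}:{value}")
        else:
            suspect.append(pid)
    return good, suspect


good, suspect = check_pids("""
    ISBN:9780980200447
    LCCN:93005405
    9780980200447
    """)
print("well-formed:", good)
print("needs a prefix:", suspect)
```

This only checks the shape of each entry, not whether the identifier exists in Open Library.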


In [None]:
import csv
import sys
import json
from jsonpath_rw import jsonpath, parse
from time import sleep
import requests
import random
import logging
from pprint import pformat
from IPython.display import display

logger = logging.getLogger(__name__)

#logging.basicConfig(level=logging.DEBUG)
logging.basicConfig(level=logging.INFO)

# The query for openlibrary {} is replaced with the pids
query_template = "https://openlibrary.org/api/books?bibkeys={}&jscmd=data&format=json"

# The set of patterns for selecting which fields in the response to put into the csv
# Uses the jsonpath schema: http://goessner.net/articles/JsonPath/
paths = [
    ("$.*.title", "Title"),
    ("$.*.authors[*].name", "Author(s) Name"),
    ("$.*.publish_date", "Publish Date"),
    ("$.*.url","URL")
    ]

output_file = "final.csv"
pid_column_name = "pid"

# Validate the JSONPath expressions up front so that a typo fails before any requests are made
for path,pname in paths:
    try:
        jsonpath_expr = parse(path)
        value = [match.value for match in jsonpath_expr.find([])]
    except Exception as e:
        logger.error(f"Possible path error in {path} gave error: {repr(e)}")
        raise  # re-raise with the original traceback

pids_list =  pids.split()
logger.info(f"Pids to Check: {', '.join(pids_list)}")
logger.info("checking...")
# TODO: Check api encoding 
results = []
for pid in pids_list:
    trim_pids = pid.strip() # remove leading and trailing whitespace (split() has already done this; kept as a safeguard)
    query = query_template.format(trim_pids)
    logger.info(f"query = {query}")
    # TODO: add retries for this request
    res = requests.get(query)
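    if res:  # a Response is truthy only when the status code is below 400
        results.append((trim_pids,res.json()))
    else:
        logger.warning(f"Request for {trim_pids} failed with status {res.status_code}; it will be missing from the CSV")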
    sleep(random.randint(5,10)) # sleep to avoid flooding api

logger.debug(f"query results: {pformat(results)}")

# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open(output_file, 'w', encoding="utf-8-sig", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=[pid_column_name] + [pname for (p, pname) in paths])
    writer.writeheader()
    for pid, v in results:
        logger.debug(f"v={v}")
        row = {}
        row[pid_column_name] = pid
        for p,pname in paths:
            try:
                jsonpath_expr = parse(p)
                value = [match.value for match in jsonpath_expr.find(v)]
            except Exception as e:
                logger.warning(f"Parse error in entry {pid} gave error: {e}")
                value = []
            row[pname] = ', '.join(str(item) for item in value)  # str() guards against non-string matches
        writer.writerow(row)
        
logger.info("Done")
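
The request loop above leaves retries as a TODO. One possible approach, sketched here with only the standard library, wraps the call in a helper that retries with a doubling delay. The name `get_with_retries` and its parameters are hypothetical, and the `flaky` function is a stand-in for a network call so the example runs offline.

```python
import time


def get_with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or attempts run out, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # in the notebook, requests.RequestException would be narrower
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay * 2 ** attempt)
    raise last_error


# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"ok": True}


result = get_with_retries(flaky, attempts=3, delay=0)
print(result, "after", calls["count"], "calls")
```

In the notebook, `fetch` could be something like `lambda: requests.get(query, timeout=30)`. Note that `requests.get` does not raise on HTTP error statuses by itself, so calling `raise_for_status()` inside the lambda's function body would be needed for those to trigger a retry.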

Don't forget to download the `final.csv` result from the left-hand panel.
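
When reviewing the result, it can be useful to list the PIDs for which nothing was found. The sketch below is one way to do that. The `rows_without_data` helper is hypothetical, and the sample text only imitates the column layout of `final.csv` with placeholder values; it is not real output.

```python
import csv
import io


def rows_without_data(csvfile, pid_column="pid"):
    """Return the PIDs whose every other column is empty."""
    reader = csv.DictReader(csvfile)
    return [
        row[pid_column]
        for row in reader
        if not any(value for key, value in row.items() if key != pid_column)
    ]


# Placeholder content in the same layout as final.csv; not real output.
sample = io.StringIO(
    "pid,Title,Author(s) Name,Publish Date,URL\n"
    "ISBN:0000000000,Example Title,Example Author,2000,https://example.org/book\n"
    "ISBN:123141244124,,,,\n"
)
print(rows_without_data(sample))
```

For the real file, pass `open("final.csv", encoding="utf-8-sig", newline="")` instead of the in-memory sample; the `utf-8-sig` encoding matches the one used when the file is written.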