Skip to content
master
Switch branches/tags
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
src
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

mod-oai-pmh

Copyright (C) 2019 The Open Library Foundation

This software is distributed under the terms of the Apache License, Version 2.0. See the file "LICENSE" for more information.

Portions of this software may use XML schemas Copyright © 2011 DCMI, the Dublin Core Metadata Initiative. These are licensed under the Creative Commons 3.0 Attribution license.

Introduction

Backend Module implementing the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH Version 2.0), but providing more RESTful API than described in the specification. At the core places the /oai/records endpoint which accepts verb name as main parameter which defines what type of request is and which handler should ve invoked for processing the request.

The following verbs are used:

Verb Required parameters Optional parameters Exclusive parameters Response status codes
Identify - - - 200, 400
ListRecords metadataPrefix from,until,set resumptionToken 200, 400, 404, 422
ListIdentifiers metadataPrefix from,until,set resumptionToken 200, 400, 404, 422
ListSets - - resumptionToken 200, 400
ListMetadataFormats - identifier - 200, 400, 404
GetRecord identifier, metadataPrefix - - 200, 400, 404, 422

The repository supports oai_dc, marc21 and marc21_withholdingsmetadata prefixes. The Latest is used to return holding and item information along with MARC records. The OAI Identifier Format is used for resource identifiers with the following pattern: oai:<repositoryBaseUrl>:<tenantId>/<uuid of record> e.g. oai:demo.folio.org:tenant123/fb857902-3ab2-4c34-9772-14ad7acdfe76.

Additional information

Schemas

The following schemas used:

Deployment requirements

OAI-PMH is heavily loaded module and for correct work with big data set(approximately 4-5 millions records) it requires to have as least 400 Mb of java heap and 1Gb for docker container memory.

Configuration

Configuration properties are intended to be retrieved from mod-configuration module. System property values are used as a fallback. Configurations can be managed from the UI through the mod-configuration via folio settings. The default configuration system properties split into the logically bounded groups and defined within next 3 json files: behavior.json, general.json, technical.json. The configurations by itself are placed within json 'value' field in the "key":"value" way.

The following configuration properties are used:

Module Config Code System Default Value Description
OAI-PMH repository.name FOLIO_OAI_Repository The name of the repository. The value is used to construct value for OAI-PMH/Identify/repositoryName element.
OAI-PMH repository.baseURL http://folio.org/oai The URL of the repository (basically the URL of the edge-oai-pmh). The value is used in OAI-PMH/Identify/baseURL element.
OAI-PMH repository.adminEmails oai-pmh@folio.org The e-mail address of an administrator(s) of the repository. Might contain several emails which should be separated by comma. The value is used in OAI-PMH/Identify/adminEmail element(s).
OAI-PMH repository.timeGranularity YYYY-MM-DDThh:mm:ssZ The finest harvesting granularity supported by the repository. The legitimate values are YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ with meanings as defined in ISO8601.
OAI-PMH repository.deletedRecords persistent The manner in which the repository supports the notion of deleted records. Legitimate values are no ; transient ; persistent with meanings defined in the section on deletion.
OAI-PMH repository.maxRecordsPerResponse 100 The maximum number of records returned in the List responses. The main intention is to implement Flow Control
OAI-PMH jaxb.marshaller.enableValidation false Boolean value which defines if the response content should be validated against xsd schemas.
OAI-PMH jaxb.marshaller.formattedOutput false Boolean value which is used to specify whether or not the marshalled XML data is formatted with linefeeds and indentation.
OAI-PMH repository.errorsProcessing 500 Defines in which way OAI-PMH level errors are going to be processed. 200 - OAI-PMH level error is associated with HTTP status 200. 500 - OAI-PMH level error may be associated with HTTP error status (4xx or 5xx).
OAI-PMH repository.srsHttpRequestRetryAttempts 50 Property is used in marc21_withholdings metadata prefix handler. If SRS returns an incorrect response then the same request will be sent again up to 50 times until the expected response will not be received or all 50 attempts will fail which leads to error response.
OAI-PMH repository.srsClientIdleTimeoutSec 20 The idle timeout for requests to SRS.

Configuration priority resolving

TenantApi 'POST' implementation is responsible for getting configurations for a module from mod-configuration and adjusting them to system properties when posting module for tenant. Since there 3 places of configurations (mod-configuration, JVM, default form resources), there are ways of resolving configuration inconsistencies when TenantAPI executes.
First one - if mod-configuration doesn't contain config entry for the particular configuration group then such group will be picked up from the resources with their default values and then will be posted to a mod-configuration. As well if some of them was already defined through JVM property setting up, then such "JVM" value will be used and posted instead of the default one. As well, these values will be used farther within module business logic.
Second one - if mod-configuration has successfully returned config entry for particular group then such configuration values overrides default and JVM specified as well and these values are set up to system properties and they will be used farther within module business logic.
So, configurations priority from highest to lowest is the next: mod-configuration (1 priority) -> JVM specified (2 priority) -> default from resources (3 priority)

Notes:

  • There is an option, as it says at "Configuration priority resolving" paragraph, if mod-configuration doesn't contain a value for configuration, then default or specified via JVM value will be used further as expected, but in the opposite case such value will be used only during InitAPIs execution and will be overridden further with value from mod-configuration after ModTenantAPI execution. Only configPath and repository.storage configurations are an exception here.
  • The system default values can be overwritten by VM options e.g. -Drepository.name=Specific_FOLIO_OAI-PMH_Repository.
  • Another configuration file can be specified via -DconfigPath=<path_to_configs> but the file should be accessible by ClassLoader.
  • For verb ListRecords and metadata prefix marc21_withholdings, holding and item fields from mod-inventory-storage are returned along with the corresponding records from mod-source-record-storage

About Marc21 with holdings and initial load

Oai-pmh supports marc21_withholdings metadata prefix for ListRecords request. Handling this metadata prefix differs from marc21 and oai_dc. Request processing of marc21_whithholdings involves “initial-load” (IL) process. Initial load is a process when all instances ids with some metadata are retrieved from inventory via inventory-hierarchy API and put to instances table of the local oia-pmh database. Since there may be a lot of instances, such process can take significant time for processing. For 8 million instances it takes around 6-7 minutes to save instances and respond the first batch of records. Each next batch of records does not take the same time because instances are loaded to DB only once for each harvesting process. After initial-load is completed the first batch of instances ids is processed.

To differ each set of saved instances between several marc21_withholdings requests/initial-loads the “requestId” param is used. Such request id is generated for each single harvester doing the request and describes the request until harvesting will be ended. Request id is stored among resumptionTokens generated for harvester in scope of single harvesting process. As well request id is stored into next database tables – “request_metadata_lb” and “instances”. First table holds request id as primary key and date column for keeping last updated date of request id.

At the first time, the last updated date is set during the initial-load and then is updated when each next batch of records is requested via resumptionToken which holds such request id. Instances table as well holds particular requestId as foreign key for associating instance ids with particular harvesting process. By default, each batch of instance ids are removed from database when requesting records via resumptionToken, i.e. when a request with resumption token is sent then required instances ids are processed and then they are cleaned from the database.

But in a case when harvester lost his resumptionToken then saved instances ids with metadata that have not been retrieved and processed yet will be kept in DB and never will be cleaned. For preventing this case, request id has an expired period which for now equals to one day(24 h.) Therefore, when some request ids with expired last updated date exist then both such request ids and associated with them instances ids start to be considered as expired and will be removed from DB by cleaning job which are run each 2 hours.

Issue tracker

See project MODOAIPMH at the FOLIO issue tracker.

Other documentation

Other modules are described, with further FOLIO Developer documentation at dev.folio.org

About

Backend Module implementing the OAI-PMH metadata harvesting protocol

Resources

License

Packages

No packages published