Copyright (C) 2019 The Open Library Foundation
This software is distributed under the terms of the Apache License, Version 2.0. See the file "LICENSE" for more information.
Backend module implementing the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH Version 2.0), but providing a more RESTful API than the one described in the specification. At its core is the /oai/records endpoint, which accepts the verb name as its main parameter. The verb defines the type of the request and which handler is invoked to process it.
The following verbs are used:
| Verb | Required parameters | Optional parameters | Exclusive parameters | Response status codes |
|---|---|---|---|---|
| ListRecords | metadataPrefix | from, until, set | resumptionToken | 200, 400, 404, 422 |
| ListIdentifiers | metadataPrefix | from, until, set | resumptionToken | 200, 400, 404, 422 |
| ListMetadataFormats | - | identifier | - | 200, 400, 404 |
| GetRecord | identifier, metadataPrefix | - | - | 200, 400, 404, 422 |
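The parameter rules in the table can be checked on the client side before a request is sent. The following is a minimal sketch, not the module's own validation logic. It assumes that the verb is passed as a query parameter named `verb` and that an exclusive parameter must be the only one besides the verb, as in the OAI-PMH specification. The helper name `build_query` is hypothetical.

```python
from urllib.parse import urlencode

# Parameter rules transcribed from the verb table above.
VERB_RULES = {
    "ListRecords": {
        "required": {"metadataPrefix"},
        "optional": {"from", "until", "set"},
        "exclusive": "resumptionToken",
    },
    "ListIdentifiers": {
        "required": {"metadataPrefix"},
        "optional": {"from", "until", "set"},
        "exclusive": "resumptionToken",
    },
    "ListMetadataFormats": {
        "required": set(),
        "optional": {"identifier"},
        "exclusive": None,
    },
    "GetRecord": {
        "required": {"identifier", "metadataPrefix"},
        "optional": set(),
        "exclusive": None,
    },
}


def build_query(verb, params):
    """Return the path and query string for a request, or raise ValueError.

    `params` is a dict because `from` is a reserved word in Python and
    cannot be passed as a keyword argument.
    """
    rules = VERB_RULES.get(verb)
    if rules is None:
        raise ValueError(f"unsupported verb: {verb}")

    names = set(params)
    exclusive = rules["exclusive"]
    if exclusive is not None and exclusive in names:
        # An exclusive parameter may not be combined with any other one.
        if names != {exclusive}:
            raise ValueError(f"{exclusive} must be the only parameter")
    else:
        missing = rules["required"] - names
        if missing:
            raise ValueError(f"missing required parameters: {sorted(missing)}")
        unknown = names - rules["required"] - rules["optional"]
        if unknown:
            raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    return "/oai/records?" + urlencode({"verb": verb, **params})
```

For example, `build_query("ListRecords", {"metadataPrefix": "marc21"})` produces a query for the first page, while a follow-up call would pass only `{"resumptionToken": ...}`.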
The repository supports the oai_dc, marc21 and marc21_withholdings metadata prefixes. The last one is used to return holdings and item information along with MARC records.
The OAI Identifier Format is used for resource identifiers with the following pattern:
oai:<repositoryBaseUrl>:<tenantId>/<uuid of record> e.g.
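The identifier pattern can be illustrated with a small sketch that builds and parses such identifiers. This is not the module's own code. The function names are hypothetical, and any base URL, tenant id and UUID passed to them in a call are made-up values. Parsing splits on the last colon, on the assumption that the tenant id itself contains no colon.

```python
import uuid


def build_identifier(base_url, tenant_id, record_id):
    """Assemble an identifier following oai:<repositoryBaseUrl>:<tenantId>/<uuid>."""
    return f"oai:{base_url}:{tenant_id}/{record_id}"


def parse_identifier(identifier):
    """Split an identifier into (base_url, tenant_id, record_id).

    Raises ValueError if the identifier does not follow the pattern.
    """
    scheme, sep, rest = identifier.partition(":")
    if scheme != "oai" or not sep:
        raise ValueError("identifier must start with 'oai:'")

    # Split on the LAST colon so a base URL containing colons still parses.
    base_url, sep, local_part = rest.rpartition(":")
    if not sep or not base_url:
        raise ValueError("missing repository base URL or tenant part")

    tenant_id, sep, record_id = local_part.partition("/")
    if not sep or not tenant_id:
        raise ValueError("missing tenant id or record uuid")

    uuid.UUID(record_id)  # raises ValueError if the record id is not a UUID
    return base_url, tenant_id, record_id
```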
The following schemas are used:
- OAI-PMH Schema: OAI-PMH.xsd (please refer to the OAI-PMH specification for more details)
- XML Schema for Dublin Core without qualification: oai_dc.xsd (please refer to the OAI-PMH specification for more details)
- MARC 21 XML Schema: MARC21slim.xsd (please refer to MARC 21 XML Schema for more details)
OAI-PMH is a heavily loaded module. To work correctly with large data sets (approximately 4-5 million records), it requires at least 400 MB of Java heap and 1 GB of Docker container memory.
Configuration properties are intended to be retrieved from the mod-configuration module. System property values are used as a fallback. Configurations can be managed from the UI through mod-configuration via FOLIO settings. The default system properties are split into logically bounded groups and defined in the following three JSON files: behavior.json, general.json, technical.json. The configurations themselves are placed within the JSON 'value' field as "key":"value" pairs.
The following configuration properties are used:
|Module||Config Code||System Default Value||Description|
||The name of the repository. The value is used to construct value for
||The URL of the repository (basically the URL of the edge-oai-pmh). The value is used in
||The e-mail address of the administrator(s) of the repository. May contain several e-mail addresses, which should be separated by commas. The value is used in
||The finest harvesting granularity supported by the repository. The legitimate values are
||The manner in which the repository supports the notion of deleted records. Legitimate values are no ; transient ; persistent with meanings defined in the section on deletion.|
||The maximum number of records returned in the List responses. The main intention is to implement Flow Control|
||Boolean value which defines if the response content should be validated against xsd schemas.|
||Boolean value which is used to specify whether or not the marshalled XML data is formatted with linefeeds and indentation.|
||Defines in which way OAI-PMH level errors are going to be processed.
||Property is used in the marc21_withholdings metadata prefix handler. If SRS returns an incorrect response, the same request is sent again, up to 50 times, until the expected response is received or all 50 attempts fail, which leads to an error response.|
||The idle timeout for requests to SRS.|
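The retry behaviour described for the marc21_withholdings handler can be sketched as a bounded retry loop. This is an illustration only, not the module's implementation. The function name and the `send_request` and `is_expected` callables are hypothetical, and the sketch assumes that 50 is the total number of attempts.

```python
def fetch_with_retries(send_request, is_expected, max_attempts=50):
    """Repeat the same request until the response is acceptable.

    send_request: callable performing the request and returning a response.
    is_expected:  callable deciding whether a response is acceptable.
    Raises RuntimeError once all attempts have failed.
    """
    for _ in range(max_attempts):
        response = send_request()
        if is_expected(response):
            return response
    raise RuntimeError(f"no expected response after {max_attempts} attempts")
```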
Configuration priority resolving
The TenantAPI 'POST' implementation is responsible for getting the module's configurations from mod-configuration and applying them to system properties when posting the module for a tenant. Since there are three sources of configuration (mod-configuration, JVM, defaults from resources), configuration inconsistencies are resolved in the following ways when TenantAPI executes.
First, if mod-configuration doesn't contain a config entry for a particular configuration group, that group is picked up from the resources with its default values and then posted to mod-configuration. If some of these values were already defined through JVM properties, the JVM value is used and posted instead of the default one. These values are also used further within the module's business logic.
Second, if mod-configuration successfully returns a config entry for a particular group, its configuration values override both the default and the JVM-specified ones. These values are set as system properties and used further within the module's business logic.
So, the configuration priority from highest to lowest is: mod-configuration (priority 1) -> JVM-specified (priority 2) -> default from resources (priority 3).
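The two resolution cases can be sketched as a small function. This is a simplified model, not the TenantAPI code. Each source is represented as a plain dict, `None` stands for "mod-configuration has no entry for this group", and the function name is hypothetical. The sketch also assumes that values missing from a returned mod-configuration entry fall back to the JVM or default value.

```python
def resolve_group(defaults, jvm_props, mod_config):
    """Resolve one configuration group.

    defaults:   default values for the group, as loaded from resources
    jvm_props:  values set through JVM system properties (may be empty)
    mod_config: values returned by mod-configuration, or None if the
                group has no entry there

    Returns (effective_values, needs_posting). needs_posting is True when
    the effective values should be posted to mod-configuration.
    """
    # Priority 3 (defaults), overridden by priority 2 (JVM-specified).
    effective = {key: jvm_props.get(key, value) for key, value in defaults.items()}

    if mod_config is None:
        # Case 1: nothing stored yet, so defaults/JVM values are used and posted.
        return effective, True

    # Case 2: priority 1, mod-configuration overrides both other sources.
    effective.update(mod_config)
    return effective, False
```

In both cases the returned values are what would then be set as system properties and used by the business logic.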
- As described in the "Configuration priority resolving" paragraph, if mod-configuration doesn't contain a value for a configuration, the default value or the value specified via JVM is used throughout, as expected. Otherwise, the default or JVM value is used only during InitAPIs execution and is then overridden with the value from mod-configuration after ModTenantAPI execution. The only exceptions are the configPath and repository.storage configurations.
- The system default values can be overwritten by VM options e.g.
- Another configuration file can be specified via -DconfigPath=<path_to_configs>, but the file should be accessible by the ClassLoader.
- For the verb ListRecords and the metadata prefix marc21_withholdings, holdings and item fields from mod-inventory-storage are returned along with the corresponding records from mod-source-record-storage.
About Marc21 with holdings and initial load
OAI-PMH supports the marc21_withholdings metadata prefix for ListRecords requests. This metadata prefix is handled differently from marc21 and oai_dc: processing a marc21_withholdings request involves an “initial-load” (IL) process. During the initial load, all instance ids with some metadata are retrieved from inventory via the inventory-hierarchy API and put into the instances table of the local oai-pmh database. Since there may be a lot of instances, this process can take significant time. For 8 million instances it takes around 6-7 minutes to save the instances and respond with the first batch of records. Subsequent batches do not take as long, because instances are loaded into the DB only once per harvesting process. After the initial load is completed, the first batch of instance ids is processed.
To distinguish the sets of saved instances belonging to different marc21_withholdings requests/initial loads, the “requestId” parameter is used. A request id is generated for each harvester making the request and identifies the request until harvesting ends. The request id is stored in the resumptionTokens generated for the harvester within a single harvesting process. The request id is also stored in two database tables: “request_metadata_lb” and “instances”. The first table holds the request id as its primary key and a date column that keeps the last updated date of the request id.
The last updated date is first set during the initial load and is then updated each time the next batch of records is requested via a resumptionToken holding that request id. The instances table also holds the requestId as a foreign key, which associates instance ids with a particular harvesting process. By default, each batch of instance ids is removed from the database when records are requested via resumptionToken, i.e. when a request with a resumption token is sent, the required instance ids are processed and then cleaned from the database.
However, if a harvester loses its resumptionToken, the saved instance ids with metadata that have not yet been retrieved and processed would stay in the DB and never be cleaned. To prevent this, a request id has an expiration period, which currently equals one day (24 h). When request ids with an expired last updated date exist, both those request ids and the instance ids associated with them are considered expired and are removed from the DB by a cleaning job that runs every 2 hours.
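The expiration rule can be modelled in a few lines. This is an in-memory sketch, not the module's cleaning job. The two tables are represented as a dict and a list of dicts, and the field names `request_id` and `instance_id` as well as the function names are hypothetical.

```python
from datetime import timedelta

EXPIRATION_PERIOD = timedelta(hours=24)  # the one-day period described above


def expired_request_ids(request_metadata, now):
    """Return request ids whose last updated date is older than the period.

    request_metadata: dict mapping request id -> last updated datetime
    """
    return {
        request_id
        for request_id, last_updated in request_metadata.items()
        if now - last_updated > EXPIRATION_PERIOD
    }


def clean_expired(request_metadata, instances, now):
    """Drop expired request ids together with their saved instance ids.

    instances: list of dicts, each with a "request_id" key linking the row
               to an entry in request_metadata
    Returns the remaining (request_metadata, instances).
    """
    expired = expired_request_ids(request_metadata, now)
    remaining_metadata = {
        request_id: last_updated
        for request_id, last_updated in request_metadata.items()
        if request_id not in expired
    }
    remaining_instances = [
        row for row in instances if row["request_id"] not in expired
    ]
    return remaining_metadata, remaining_instances
```

A scheduler calling `clean_expired` every 2 hours would correspond to the cleaning job. Every request with a resumptionToken would refresh the last updated date of its request id and so keep an active harvest from expiring.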