
ConvertModifyload

Chris Delis edited this page Oct 15, 2015 · 1 revision

Convert, Modify, and Load data

The ILS Export Script exports metadata from the ILS, transports it to the machine where the OAI Toolkit is running, and then converts, modifies, and loads it into the OAI Toolkit. This section details the features in the OAI Toolkit for converting, modifying, and loading metadata into the repository embedded in the OAI Toolkit.

These steps are accomplished with a command-line utility provided with the OAI Toolkit. They can be executed together in a single call to the utility, or performed in separate calls. If you call the utility with the -convert, -modify, and -load parameters, it performs all three steps at once. Alternatively, you can pass just one of the parameters per call, in which case you will have to call the utility three times. The pros and cons of running the three steps in one call or in separate calls are covered later in the "One Step Load - Convert, Modify, and Load" and "Multi Step Load - Separate, Convert, Modify, and Load" sections. The convert and modify steps are sometimes optional, depending on the format your metadata is in when it comes out of the ILS or another repository.
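As a rough illustration of the difference between a single call and separate calls, the sketch below builds the argument lists for both styles. Only the -convert, -modify, and -load parameters come from this page; the launcher name "oaitoolkit" is a hypothetical placeholder, and other parameters the utility may need (for example, directories) are omitted because they are not listed here.

```python
from typing import List, Optional, Sequence

# Hypothetical launcher name; this page does not say how the utility is started.
LAUNCHER = ["oaitoolkit"]


def build_commands(one_step: bool,
                   stylesheets: Optional[Sequence[str]] = None) -> List[List[str]]:
    """Return one argument list per call to the command-line utility."""
    convert = ["-convert"]
    # Several stylesheets are passed as ONE argument, separated by spaces.
    # On a shell command line this is why the list is wrapped in double quotes.
    modify = ["-modify", " ".join(stylesheets)] if stylesheets else []
    load = ["-load"]
    steps = [convert, modify, load]
    if one_step:
        # All requested steps in a single call.
        return [LAUNCHER + [arg for step in steps for arg in step]]
    # One call per requested step, in processing order.
    return [LAUNCHER + step for step in steps if step]


if __name__ == "__main__":
    for command in build_commands(False, ["drop_pipeline.xsl", "populate003.xsl"]):
        print(command)
```

Each returned list could be handed to subprocess.run once the real launcher and the remaining parameters are filled in.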

Also included with the OAI Toolkit are some simple command-line scripts. Each script contains a single call to the OAI Toolkit command-line utility with the most common parameters specified. Look for these sample scripts in the distribution directory of the OAI Toolkit. There is a script for convert, a script for load, a script that calls both convert and load, and a script that does all three together (convert, modify, and load). The modify step requires an XSLT stylesheet, so no script is provided for that step on its own. Various sample XSLT stylesheets are provided in the xslts directory. If you write your own XSLT stylesheets, you can place them in the same xslts directory.

Convert – The MARCXML Transformer

This feature of the toolkit will take MARC21 files and convert them to the MARCXML format. The OAI Toolkit embeds an open-source tool called MARC4J (we currently use version 2.4) that is able to convert a valid raw MARC record into a valid MARCXML-formatted record. This is a lossless conversion, meaning that none of the information from the original MARC record is lost during its conversion into MARCXML.
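To make the target format concrete, here is a minimal sketch of the MARCXML structure that a converted record takes: a leader, control fields, and data fields with indicators and subfields. It is written in Python with the standard library and is not how MARC4J or the OAI Toolkit performs the conversion; the function name and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace
ET.register_namespace("marc", MARC_NS)


def q(name):
    """Qualify an element name with the MARCXML namespace."""
    return f"{{{MARC_NS}}}{name}"


def to_marcxml(leader, controlfields, datafields):
    """Build a MARCXML record element from already-parsed MARC content.

    controlfields: list of (tag, value)
    datafields:    list of (tag, ind1, ind2, [(subfield code, value), ...])
    """
    record = ET.Element(q("record"))
    ET.SubElement(record, q("leader")).text = leader
    for tag, value in controlfields:
        ET.SubElement(record, q("controlfield"), {"tag": tag}).text = value
    for tag, ind1, ind2, subfields in datafields:
        field = ET.SubElement(record, q("datafield"),
                              {"tag": tag, "ind1": ind1, "ind2": ind2})
        for code, value in subfields:
            ET.SubElement(field, q("subfield"), {"code": code}).text = value
    return record


if __name__ == "__main__":
    sample = to_marcxml(
        "00000nam a2200000 a 4500",
        [("001", "12345")],
        [("245", "1", "0", [("a", "An example title")])],
    )
    print(ET.tostring(sample, encoding="unicode"))
```

Every piece of the input appears in the output, which is the sense in which the conversion is lossless.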

Many institutions have a long history of MARC data in different formats, and, as a result, some records found in an ILS might not conform to the current MARC21 format schema. There are three options for dealing with this situation. First, an institution may tweak the MARCXML schema (the .xsd file) used by the OAI Toolkit so that the toolkit accepts the institution's past or non-standard cataloging practices. Technically, this causes the output MARCXML records to differ slightly from true MARCXML records. In some cases, this will not cause problems with the other XC applications (and may indeed avoid massive record cleanup projects). Examples of acceptable changes can be found in the "Schema Changes" section. Second, instead of changing the MARCXML schema to accommodate the non-standard data, an institution may perform "record cleanup" on its ILS records to address invalid MARC format issues before using the convert feature. The third option is the next step in the OAI Toolkit process, modify (described below), which supports minimal cleanup on the fly. Once the data is minimally valid, another application in the eXtensible Catalog system (the Metadata Services Toolkit) can be used to perform other, more extensive types of cleanup.

Please note that if a record is invalid, it will not be converted or imported into the OAI Toolkit's database, and thus it cannot be harvested. If you modify the schema file to allow invalid records to be converted and imported into the database, you are effectively changing what it means to be valid MARCXML. Because you are changing the standard MARCXML schema that the OAI Toolkit checks against, you run the risk that the harvested records will be invalid according to the master schema file maintained by the Library of Congress. Please think these options over carefully before modifying the .xsd file.

Modify – Modify MARCXML with XSLT stylesheets

We mentioned above that the MARC records exported from an ILS do not always follow the current MARC standards. Correcting the source records in the ILS is the best approach to fixing this problem. Using Modify is the next best approach, because the Modify step lets the user change the MARCXML in the records to fit the standards while maintaining the validity of the MARCXML. Altering the .xsd file (mentioned in the Convert step above) is the least desirable approach, as it changes the rules the OAI Toolkit uses to create MARCXML records; however, it can often be an effective solution, too.

Some typical issues that the Modify step can be used to resolve:

  • The ILS uses a 9xx field as control number instead of the standard 001 field
  • The data field indicator contains pipe (|) or underscore (_) characters instead of a space
  • Some custom/local data fields do not have the default structure, etc.
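The indicator issue in the list above is easy to picture in code. The following sketch, written in Python rather than XSLT, replaces pipe and underscore indicator values with a space in a parsed MARCXML record. It only illustrates the kind of cleanup involved; it is not the toolkit's drop_pipeline.xsl stylesheet, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace


def normalize_indicators(record):
    """Replace '|' and '_' indicator values with a space.

    Returns the number of indicator attributes that were changed.
    """
    changed = 0
    for field in record.iter(f"{{{MARC_NS}}}datafield"):
        for attr in ("ind1", "ind2"):
            if field.get(attr) in ("|", "_"):
                field.set(attr, " ")
                changed += 1
    return changed


if __name__ == "__main__":
    xml = (
        f'<record xmlns="{MARC_NS}">'
        '<datafield tag="245" ind1="|" ind2="_">'
        '<subfield code="a">An example title</subfield>'
        '</datafield></record>'
    )
    record = ET.fromstring(xml)
    print(normalize_indicators(record), "indicators normalized")
```

Valid indicator values are left untouched, so the function can be run safely on records that are already clean.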

XSLT is a useful and flexible way to modify XML. We have provided some sample XSLT stylesheets in the xslts directory, but you can modify them or create new ones to suit your own needs. You can even apply multiple stylesheets; they are processed in the order in which you list them on the command line. Note: you must surround a list of multiple stylesheets with double quotes, e.g., -modify "drop_pipeline.xsl populate003.xsl"

For example, suppose you choose to use the Modify step with the drop_pipeline.xsl stylesheet. You would include "-modify drop_pipeline.xsl" in the command or script that runs the import process. The utility would then apply drop_pipeline.xsl from the "xslts" directory in the OAI Toolkit home directory and remove the pipe (|) characters from the MARCXML files.

Another stylesheet in the xslts directory is "populate003.xsl". This stylesheet adds an 003 field with the value specified in the stylesheet or, if an 003 field already exists, replaces its value with the one specified in the stylesheet. We recommend using this stylesheet with the XC software if the 003 field is inconsistent or not present in the extracted records.
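The add-or-replace behavior described for the 003 field can be sketched as follows. This is a Python illustration of the logic only, not the contents of populate003.xsl; the function name, the placement of a new 003 directly after the leader and 001, and the sample code "EXAMPLE" are assumptions.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace


def q(name):
    """Qualify an element name with the MARCXML namespace."""
    return f"{{{MARC_NS}}}{name}"


def populate_003(record, org_code):
    """Set the 003 control field to org_code, adding the field if it is missing."""
    existing = record.find(f"{q('controlfield')}[@tag='003']")
    if existing is not None:
        existing.text = org_code  # replace the existing value
        return
    field = ET.Element(q("controlfield"), {"tag": "003"})
    field.text = org_code
    # Assumed placement: right after the leader and the 001, if they exist.
    position = 0
    for index, child in enumerate(record):
        is_leader = child.tag == q("leader")
        is_001 = child.tag == q("controlfield") and child.get("tag") == "001"
        if is_leader or is_001:
            position = index + 1
    record.insert(position, field)


if __name__ == "__main__":
    xml = (
        f'<record xmlns="{MARC_NS}"><leader>00000nam a2200000 a 4500</leader>'
        '<controlfield tag="001">12345</controlfield></record>'
    )
    record = ET.fromstring(xml)
    populate_003(record, "EXAMPLE")
    print(ET.tostring(record, encoding="unicode"))
```

Calling the function a second time with a different code overwrites the value instead of adding a second 003 field.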

Load - API

The next step in the process is to take the MARCXML records and load them into a persistent database in the OAI Toolkit. The OAI Toolkit determines whether each incoming record is new or updated by comparing the unique identifier of the incoming record with those of records already stored in the OAI Toolkit database. For MARC, the unique identifier is the 001 field. Because some institutions would like to collect records from different sources, the OAI Toolkit automatically uses both the MARC repository code (field 003) and the control number (field 001) to determine uniqueness. If your ILS uses another field for the unique ID, we advise you to create an XSLT stylesheet that copies the value of that field into the 001 (you will find a sample stylesheet for this procedure, use907as001.xsl, in the xslts directory). If your ILS does not have a unique ID, contact us.

For bibliographic and holdings records, uniqueness means that the identifier is unique among records of its own type. Note that a bibliographic record and a holdings record may contain the same identifiers in different fields, which provides a way to link the two records together.

There are three possible scenarios for each record that is loaded into the OAI Toolkit. For each record that is loaded, the incoming record's 001, 003, and record type (bibliographic, holdings, etc.) are compared with those of records already stored in the OAI Toolkit repository:

  • New – Incoming new records are added to the database
  • Updates – Incoming updated records will result in a replacement of the record in the database. Nothing will remain of the original record.
  • Deletions – Incoming deleted records cause the prior record to be overwritten by the deleted record and marked as deleted. The OAI-PMH protocol relies on record synchronization to support deletion, which requires that records not be physically removed from the system. Instead, record stubs are kept in the OAI Toolkit repository: the contents of the record can be removed, but the record container remains, along with the unique identifier and a record status of "deleted". Sample scripts are provided for "deleting" records from the OAI Toolkit's repository. If you run Convert and Load in one step, the script is called "convertload_as_deleted"; if you run Load as a separate step, it is "load_as_deleted". These scripts treat the incoming MARCXML records as deleted and load them into the database. Records that are suppressed in the ILS can also make use of this feature.
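The three scenarios above can be summarized in a small sketch. It models the repository as an in-memory dictionary keyed by record type, 003, and 001, which mirrors the comparison described on this page but is not the OAI Toolkit's actual storage or API. All names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (record type, 003 repository code, 001 control number)
Key = Tuple[str, str, str]


@dataclass
class StoredRecord:
    xml: Optional[str]      # None once only a deleted stub remains
    deleted: bool = False


def load_record(repo: Dict[Key, StoredRecord], record_type: str, org_code: str,
                control_number: str, xml: str, as_deleted: bool = False) -> str:
    """Store one incoming record and report which scenario applied."""
    key = (record_type, org_code, control_number)
    if as_deleted:
        # Keep a stub with the identifier and a deleted status; drop the content.
        repo[key] = StoredRecord(xml=None, deleted=True)
        return "deleted"
    outcome = "updated" if key in repo else "new"
    # An update replaces the stored record entirely.
    repo[key] = StoredRecord(xml=xml)
    return outcome


if __name__ == "__main__":
    repo: Dict[Key, StoredRecord] = {}
    print(load_record(repo, "bibliographic", "EXAMPLE", "12345", "<record/>"))
    print(load_record(repo, "bibliographic", "EXAMPLE", "12345", "<record/>"))
    print(load_record(repo, "holdings", "EXAMPLE", "12345", "<record/>"))
    print(load_record(repo, "bibliographic", "EXAMPLE", "12345", "", as_deleted=True))
```

Because the record type is part of the key, a bibliographic record and a holdings record with the same 001 and 003 are stored as two separate records.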

Once the MARCXML records are imported into the database in a “harvestable” format, the OAI-PMH Repository will be able to serve the native MARCXML and, with built-in transformation methods, the following converted formats:

Stored Format    Converted Format    In Which Release?
MARCXML          OAI_DC              Current
MARCXML          MARCXML             Current
MARCXML          MODS                Current
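A harvester can ask any OAI-PMH repository which formats it serves by sending a ListMetadataFormats request, and can then request records in one of them with ListRecords. The sketch below only builds the request URLs. The base URL is a hypothetical placeholder, not the OAI Toolkit's real address, and apart from oai_dc (which the OAI-PMH specification requires every repository to support) the metadata prefixes a given installation uses should be read from the ListMetadataFormats response rather than guessed.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of your own repository.
BASE_URL = "http://example.org/oai"


def oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Ask the repository which metadata formats it can serve.
    print(oai_request(BASE_URL, "ListMetadataFormats"))
    # Harvest records in unqualified Dublin Core.
    print(oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```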