Structure of a Workbench Project

gregjan edited this page Feb 14, 2011 · 5 revisions

A Curator's Workbench project is a standard Eclipse IProject, as defined by the Eclipse Resources plugin, which provides the notions of Workspace, Project, Folder, File and Project "Natures". Each Workbench project has a "METS Project" nature added. The METS Project nature adds a METS file to the project, along with automation for loading, saving and tracking changes.

Projects consist of four sections that are visible to the user. These sections are described below, along with some salient implementation details.

Originals

The Originals section of a project contains a list of folder locations outside the project that the project links to. These linked locations are the only ones outside the project folder that the project knows about. The Workbench can only read and copy originals; it cannot modify them. This keeps originals intact while still allowing the Workbench to track any changes made to these folders by other software.

Eclipse File System and URI File Tracking

The Workbench links to original folders (and to all file systems) through an abstraction layer called EFS (Eclipse File System). Files are referenced by URI, and an EFS implementation may define its own URI scheme to reach its own source of data. For instance, in the CDR environment we stage files to an IRODS grid, and we identify locations in IRODS via a URI with an "irods://" scheme. Using URIs makes it easy to add new file systems to the Workbench. It also makes it easy to track file locations in metadata, such as in METS.

This is an example of an IRODS URI:

irods://count0@cdr-dev-vault.libint.unc.edu:5555/cdrTestZone/home/count0/staging/myproject

This is an example of a typical URI for an original file mounted on the local computer:

file:///home/count0/Desktop/workshop+samples
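
As a rough illustration of why URIs are convenient here, the two example locations above can be split into the parts a file-system layer would dispatch on. This is a minimal sketch using only Python's standard library, not the Workbench's EFS code; the helper name describe_location is made up for the example.

```python
from urllib.parse import urlsplit


def describe_location(uri):
    """Split a location URI into scheme, user, host, port and path."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,   # selects the file system, e.g. "irods" or "file"
        "user": parts.username,   # None when the URI carries no user
        "host": parts.hostname,   # None for local file URIs
        "port": parts.port,       # None when no port is given
        "path": parts.path,
    }


irods = describe_location(
    "irods://count0@cdr-dev-vault.libint.unc.edu:5555/cdrTestZone/home/count0/staging/myproject"
)
local = describe_location("file:///home/count0/Desktop/workshop+samples")
```

The scheme is the only part that differs in kind between the two locations, which is what lets a new file system be added without changing how locations are recorded.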

Arrangement

The Arrangement is the tree of objects and their folders that together make up the structure of the finished project. This tree consists of those objects that are selected from the Originals. The Workbench stores this Arrangement in a METS structMap. Currently this structMap is composed of div elements of type "Collection", "Folder" and "File".
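
To make the shape of such a structMap concrete, the following sketch builds a small one with Python's standard XML library. It assumes the usual METS namespace and the TYPE and LABEL attributes on div; the labels are borrowed from the example URIs on this page, and the helper add_div is hypothetical. The Workbench's actual METS output may carry more attributes than shown.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def add_div(parent, div_type, label):
    """Append a METS div of the given TYPE and LABEL under parent."""
    return ET.SubElement(
        parent, f"{{{METS_NS}}}div", {"TYPE": div_type, "LABEL": label}
    )


# A Collection containing one Folder containing one File.
struct_map = ET.Element(f"{{{METS_NS}}}structMap")
collection = add_div(struct_map, "Collection", "myproject")
images = add_div(collection, "Folder", "images")
add_div(images, "File", "001xdd.tif")

xml_text = ET.tostring(struct_map, encoding="unicode")
```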

Users may modify the Arrangement as they wish: moving, reordering, removing or renaming objects. As the user edits the tree, only the structMap in METS actually changes. Originals and staged files are left as they are. So while editing the Arrangement gives users control over the order of a submission, they are not needlessly modifying physical files on media.
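
A move in the Arrangement can be pictured as a pure tree edit. The sketch below moves a File div from one Folder div to another in an in-memory structMap and touches no files. It is an illustration only: namespaces are omitted for brevity, the folder name "selected" is invented, and find_div and move_div are hypothetical helpers rather than Workbench functions.

```python
import xml.etree.ElementTree as ET

ARRANGEMENT = """<structMap>
  <div TYPE="Collection" LABEL="myproject">
    <div TYPE="Folder" LABEL="images">
      <div TYPE="File" LABEL="001xdd.tif"/>
    </div>
    <div TYPE="Folder" LABEL="selected"/>
  </div>
</structMap>"""


def find_div(root, label):
    """Return the first div with the given LABEL."""
    for d in root.iter("div"):
        if d.get("LABEL") == label:
            return d
    raise KeyError(label)


def move_div(root, label, new_parent_label):
    """Detach a div from its current parent and append it to another div."""
    target = find_div(root, label)
    new_parent = find_div(root, new_parent_label)
    for parent in root.iter():
        if target in list(parent):
            parent.remove(target)
            break
    new_parent.append(target)


root = ET.fromstring(ARRANGEMENT)
move_div(root, "001xdd.tif", "selected")
```

Only the XML tree changes; nothing in the sketch reads or writes the file the div refers to.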

Staged Files

Files that have been "captured" to the Arrangement are queued for staging. These files will be staged to a location that is unique to the project. A staging location may be local or remote and is configured by means of a URI when the project is created. Once a file is staged it remains in a unique static location for the life of a project. This location is based on both the staging URI and the original file location URI.

This might be the URI of a staged file, combining the staging URI with the path of the original:

irods://count0@cdr-dev-vault.libint.unc.edu:5555/cdrTestZone/home/count0/staging/myproject/workshop+samples/images/001xdd.tif
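
One way such a location could be derived is sketched below. The exact rule the Workbench uses is not spelled out on this page, so this is an assumption inferred from the example URIs: the staged path is taken to be the staging URI plus the original's path relative to the parent of the linked original folder. The function name staged_uri and its parameters are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def staged_uri(staging_base, linked_folder, original_file):
    """Derive a staged location from the staging URI and an original's URI.

    Assumption: the linked folder's own name is kept as the first path
    segment under the staging base.
    """
    folder_path = urlsplit(linked_folder).path.rstrip("/")
    file_path = urlsplit(original_file).path
    if not file_path.startswith(folder_path + "/"):
        raise ValueError("original file is not inside the linked folder")
    # Path of the file relative to the parent of the linked folder.
    relative = posixpath.relpath(file_path, posixpath.dirname(folder_path))
    return staging_base.rstrip("/") + "/" + relative


result = staged_uri(
    "irods://count0@cdr-dev-vault.libint.unc.edu:5555/cdrTestZone/home/count0/staging/myproject",
    "file:///home/count0/Desktop/workshop+samples",
    "file:///home/count0/Desktop/workshop+samples/images/001xdd.tif",
)
```

Because the result depends only on the two URIs, the same original always maps to the same staged location for the life of the project.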

Crosswalks

The Crosswalks within a project are automated metadata transformations: user-defined mappings of custom metadata to a standard format, such as MODS. In terms of project structure, Crosswalks are XML files that are created by means of a visual editor. All crosswalk files are stored in a "crosswalks" subdirectory of the project folder. When new resources are captured or a Crosswalk definition is modified, the Crosswalk is run, generating a MODS record for each row of the custom metadata. The MODS records output by a Crosswalk are embedded in the project METS file and, where possible, linked from the relevant div in the structMap for the Arrangement.
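
The row-to-record idea can be sketched in a few lines. This is not the Workbench's crosswalk file format, which is not shown on this page; the mapping table, column names and sample rows below are invented for illustration, and only the MODS namespace and the titleInfo/title and name/namePart elements come from the MODS schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Hypothetical mapping: custom metadata column -> nested MODS element names.
CROSSWALK = {
    "Title": ("titleInfo", "title"),
    "Creator": ("name", "namePart"),
}


def row_to_mods(row, crosswalk=CROSSWALK):
    """Build one MODS record from one row of custom metadata."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    for column, path in crosswalk.items():
        value = row.get(column)
        if not value:
            continue  # skip columns that are empty or absent in this row
        node = mods
        for name in path:
            node = ET.SubElement(node, f"{{{MODS_NS}}}{name}")
        node.text = value
    return mods


# Made-up sample rows; one record is produced per row.
rows = [
    {"Title": "Workshop sample 1", "Creator": "count0"},
    {"Title": "Workshop sample 2"},
]
records = [row_to_mods(r) for r in rows]
```

Re-running the mapping whenever the rows or the mapping change mirrors the behaviour described above, where a modified Crosswalk definition regenerates its MODS output.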

NOTE: I think we will move toward putting the Crosswalk output into separate files, instead of embedding them in the project METS file. This will save on memory/performance for large record sets. It will also permit more general file-based tooling downstream.