
Commit

Merge branch '7.x-2.x' of https://github.com/Islandora-Labs/islandora into 7.x-2.x
daniel-dgi committed Feb 12, 2015
2 parents 40f3d2c + 1f97592 commit 1459e24
Showing 2 changed files with 24 additions and 17 deletions.
13 changes: 10 additions & 3 deletions README.md
@@ -2,20 +2,27 @@

## Introduction

Islandora Fedora 4 Integration Project!
[Islandora](http://islandora.ca) and [Fedora 4](http://fedorarepository.org/) Integration Project!

This is where the Islandora and Fedora 4 development will happen.
This is where the Islandora and Fedora 4 development will happen. If you would like to get involved in the community around this project, please check out the Islandora Foundation [Fedora 4 Interest Group](https://github.com/Islandora/Islandora-Fedora4-Interest-Group).

## Maintainers/Sponsors

- LYRASIS
- York University
- McMaster University
- University of Prince Edward Island
- University of Manitoba
- University of Limerick

Current maintainers:

* [Nick Ruest](https://github.com/ruebot)
* [Daniel Lamb](https://github.com/daniel-dgi/)

## Development

If you would like to contribute, please check out our helpful [Documentation for Developers](https://github.com/Islandora/islandora/wiki#wiki-documentation-for-developers) info, as well as our [Developers](http://islandora.ca/developers) section on the Islandora.ca site.
If you would like to contribute, please check out our helpful [Documentation for Developers](https://github.com/Islandora/islandora/wiki#wiki-documentation-for-developers) info, as well as our [Developers](http://islandora.ca/developers) section on the [Islandora.ca](http://islandora.ca) site.

## License

28 changes: 14 additions & 14 deletions docs/technical_documentation.md
@@ -1,28 +1,28 @@
#Islandora 7.x 2.x Technical Design Doc
#Islandora 7.x-2.x Technical Design Doc

Islandora version 7.x-2.x is middleware built using Apache Camel to orchestrate distributed data processing and to provide web services required by institutions who would like to use Drupal as a frontend to a Fedora 4 JCR repository. This goal presents a unique set of challenges, as Drupal is much, much more than a simple display layer. It is a full blown content management system designed to be built on top of a traditional relational database such as MySQL or Postgres, not a JCR repository. Additionally, there is a large amount of data processing and manipulation that must be performed for presentation and discovery. This means that there is much more software that must be integrated than just Drupal and Fedora (Tesseract, ImageMagick, ffmpeg, just to name a few). To make matters worse, doing all of this processing on the servers containing either Fedora or Drupal is detrimental to the performance of the overall system, resulting in a unusable site during periods of content migration or manipulation. Plus, as most of us have already found out, systems such as these are incredibly difficult to install, configure, and maintain.
Islandora version 7.x-2.x is middleware built using Apache Camel to orchestrate distributed data processing and to provide web services required by institutions who would like to use Drupal as a front-end to a Fedora 4 JCR repository. This goal presents a unique set of challenges, as Drupal is much, much more than a simple display layer. It is a full blown content management system designed to be built on top of a traditional relational database such as MySQL or Postgres, not a JCR repository. Additionally, there is a large amount of data processing and manipulation that must be performed for presentation and discovery. This means that there is much more software that must be integrated than just Drupal and Fedora (Tesseract, ImageMagick, FFmpeg, just to name a few). To make matters worse, doing all of this processing on the servers containing either Fedora or Drupal is detrimental to the performance of the overall system, resulting in a unusable site during periods of content migration or manipulation. Plus, as most of us have already found out, systems such as these are incredibly difficult to install, configure, and maintain.

To mitigate these issues, the overall design goals of the 7.x-2.x version of Islandora are:
- A properly modularized installation procedure so that Islandora can be consistently installed and configured in distributed environments. As a result of this, a consistent development environment can also be made available to contributors.
- Asynchronous communication between Fedora and Drupal, so that neither waits on the other nor any of the various processing components of the stack. This will be achieved through the use of persistant queues, which will also allow the stack to be easily distribruted across multiple computers.
- Fedora is treated as the source of the truly important data, only containing preservation masters and descriptive metadata. Metadata can exist either in Fedora's native RDF attached to the resource itself, or as standardized formats such as MODS, MADS, PBCORE, etc... that exist as resources in their own right.
- Data from Fedora is transformed and indexed into the other major system of the stack, most notably Drupal and Solr. This includes lower quality access copies of perservation masters such as thumbnails or streaming video, which will be stored in Drupal as managed files.
- Drupal content is represented as Nodes and Fields, allowing the content management system to utilize the relational database it is expecting instead of shimming in a completely different type of datastore. This will open up the entire Drupal module ecosystem to Islandora. As an added benefit, viewers (OpenSeaDragon, IA Book Viewer, Video.js, etc...) can be written as custom Field renderers, finally giving site builders the ability to control the display of content.
- Asynchronous communication between Fedora and Drupal, so that neither waits on the other nor any of the various processing components of the stack. This will be achieved through the use of persistent queues, which will also allow the stack to be easily distributed across multiple computers.
- Fedora is treated as the source of the truly important data, only containing preservation objects and descriptive metadata. Metadata can exist either in Fedora's native RDF attached to the resource itself, or as standardized formats such as MODS, MADS, PBCORE, etc... that exist as resources in their own right.
- Data from Fedora is transformed and indexed into the other major system of the stack, most notably Drupal and Solr. This includes lower quality access copies of preservation masters such as thumbnails or streaming video, which will be stored in Drupal as managed files.
- Drupal content is represented as Nodes and Fields, allowing the content management system to utilize the relational database it is expecting instead of shimming in a completely different type of datastore. This will open up the entire Drupal module ecosystem to Islandora. As an added benefit, viewers (OpenSeadragon, IA Book Viewer, Video.js, etc...) can be written as custom Field renderers, finally giving site builders the ability to control the display of content.
- Drupal's Services module will be used to expose RESTful services to middleware layer so that it can sanely perform CRUD operations on Nodes without having to delve into Drupal's internals.

### The Importance of Using an Integration Framework
Let's not mince words. *Islandora is middleware, warts and all.* The word 'middleware' has plenty of connotations and baggage, but it really is what we're doing. We have a huge stack with a lot of moving parts, and we have to glue them all together. So it only makes sense to adopt an integration framework to help us pull this off.

The framework that has been chosen for the project is [Apache Camel]. It seeks to provide implementations of the fundamental design patterns codified by Gregor Hohpe and Bobby Woolf in their book [Enterprise Integration Patterns]. The best way to describe these patterns is that they're standardized, re-usable templates for processing messages that have to flow through multiple pieces of software before arriving at their final destination. So Camel will help us do things like route messages from Fedora's queue/topic to the appropriate handling function for an operation on a particular content type, allowing it to be processed along the way by derivative generation tools on the commandline. In addition to this, Camel provides code for interacting with software through basically any protocol you can think of, so we don't have to waste our time writing code for common situations like posting to http endpoints, reading data from files, polling queues, etc... It also has fantastic support for try/catch exception handling across distributed systems and transactional functionality.
The framework that has been chosen for the project is [Apache Camel]. It seeks to provide implementations of the fundamental design patterns codified by Gregor Hohpe and Bobby Woolf in their book [Enterprise Integration Patterns]. The best way to describe these patterns is that they're standardized, re-usable templates for processing messages that have to flow through multiple pieces of software before arriving at their final destination. So Camel will help us do things like route messages from Fedora's queue/topic to the appropriate handling function for an operation on a particular content type, allowing it to be processed along the way by derivative generation tools on the command line. In addition to this, Camel provides code for interacting with software through basically any protocol you can think of, so we don't have to waste our time writing code for common situations like posting to http endpoints, reading data from files, polling queues, etc... It also has fantastic support for try/catch exception handling across distributed systems and transactional functionality.

But perhaps the greatest advantage of using an integration framework is that lets us focus solely on the application logic that's important to Islandora. *There is no need for us to engineer any generic systems to get our job done.* We can identify the operations that need to happen for every supported content type, carve out a space to do the work, and get to it. It's not particuarly sexy, but the work we have to do is difficult and we've already got enough on our plate!
But perhaps the greatest advantage of using an integration framework is that lets us focus solely on the application logic that's important to Islandora. *There is no need for us to engineer any generic systems to get our job done.* We can identify the operations that need to happen for every supported content type, carve out a space to do the work, and get to it. It's not particularly sexy, but the work we have to do is difficult and we've already got enough on our plate!

### Using Camel
Camel, which at first glance appears as terrifying as Java IOC frameworks (more on those later), is actually incredibly straightforward. A Camel application is known as a [Camel Context], which is really just a collection of messaging [Route]s. These routes are defined in [Route Builder] classes. Each route has a starting point (the from() method), from which an initial [Message] is consumed. The Message is placed in an [Exchange], which contains two messages: one incoming, and another outgoing. As the Exchange is passed through each step of the route, the outgoing message from one step becomes the incoming message of the next. Data that must persist between multiple steps in the route can be cached in the Exchange as properties.

That's pretty much it. Seriously.
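
To make the Exchange hand-off described above concrete, here is a minimal sketch in plain Java. It deliberately does not use Camel: `ToyExchange`, `ToyRoute`, and `ToyRouteDemo` are hypothetical names invented for this illustration, and real Camel routes are written with `RouteBuilder` and the Camel DSL instead. The sketch only models the idea that each step's outgoing message becomes the next step's incoming message, while properties persist for the whole route.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Demo: a three-step route that counts the words in the message body.
class ToyRouteDemo {
    static ToyExchange run(String body) {
        return new ToyRoute()
            // Step 1: cache the original body as a property so later steps can see it.
            .step(ex -> ex.properties.put("original", ex.in))
            // Step 2: write a word count to the outgoing message.
            .step(ex -> ex.out = ((String) ex.in).trim().split("\\s+").length)
            // Step 3: the count is now the incoming message; format it.
            .step(ex -> ex.out = "words=" + ex.in)
            .send(body);
    }

    public static void main(String[] args) {
        ToyExchange result = run("preservation masters and metadata");
        System.out.println(result.in + " (from: " + result.properties.get("original") + ")");
    }
}

// Hypothetical stand-in for an Exchange: one incoming message, one outgoing
// message, and a property map for data that must survive across steps.
class ToyExchange {
    Object in;
    Object out;
    final Map<String, Object> properties = new HashMap<>();

    ToyExchange(Object body) {
        this.in = body;
    }
}

// Hypothetical stand-in for a Route: an ordered list of steps.
class ToyRoute {
    private final List<Consumer<ToyExchange>> steps = new ArrayList<>();

    ToyRoute step(Consumer<ToyExchange> step) {
        steps.add(step);
        return this;
    }

    ToyExchange send(Object body) {
        ToyExchange exchange = new ToyExchange(body);
        for (Consumer<ToyExchange> step : steps) {
            step.accept(exchange);
            // The outgoing message of one step becomes the incoming message of the next.
            if (exchange.out != null) {
                exchange.in = exchange.out;
                exchange.out = null;
            }
        }
        return exchange;
    }
}
```

A step that only records a property (like the first one) leaves the incoming message untouched, which is why the word-count step still sees the original text.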

Camel provides most of the functionality we need to work with these routes out of the box, and has built-ins for message routing, filtering, data extraction with XPath, transformations with Xslt's, and much more. If you need something beyond what is offered by default, you can make your own custom [Processor]s, which have unfettered access to the Exchange for whatever custom logic you desire. Heck, you can even define Processors as anonymous subclasses on the fly within the RouteBuilder. It looks almost like Javascript!
Camel provides most of the functionality we need to work with these routes out of the box, and has built-ins for message routing, filtering, data extraction with XPath, transformations with XSLTs, and much more. If you need something beyond what is offered by default, you can make your own custom [Processor]s, which have unfettered access to the Exchange for whatever custom logic you desire. Heck, you can even define Processors as anonymous subclasses on the fly within the RouteBuilder. It looks almost like Javascript!

But perhaps the best part of using Camel is that [Aaron Coburn] has already created [fcrepo-camel]. This Camel component makes working with Fedora's REST API and JMS Messages incredibly easy. Everyone involved in the project now officially owes Aaron a beer :)

@@ -111,9 +111,9 @@ To keep things simple for the purposes of this example, we're leaving out some p
For more on Camel, check out the [Camel API] documentation and the community documentation on the [Camel][Apache Camel] website. If you're looking for something more than just community and api docs, check out [Camel in Action]. It's well worth the money.

### Inversion of Control and Camel
Camel works very well with both the [Spring] and [Blueprint] Inversion of Control (e.g. Dependency Injection) frameworks. The routes can even be defined directly in the application context xml's for either. These sorts of frameworks, while both powerful and valuable, are often a stumbling point for developers who have never been exposed to them. We will be using an inversion of control framework to bootstrap the application, but routes will be defined in the Java DSL. It is also advisable to stick to the Camel API, extending custom Processors when extra functionality is needed. Bean injection and delegation should only happen when interfaces and single inheritance cannot be utilized for code re-use. We are attempting to keep the application context's xml as simple as possible. Plus, let's be honest, non-programmers and managers aren't going to be manipulating the xml and redeploying if we expose bean references in this manner. We're programmers, let's do as much in code as possible.
Camel works very well with both the [Spring] and [Blueprint] Inversion of Control (e.g. Dependency Injection) frameworks. The routes can even be defined directly in the application context XML's for either. These sorts of frameworks, while both powerful and valuable, are often a stumbling point for developers who have never been exposed to them. We will be using an inversion of control framework to bootstrap the application, but routes will be defined in the Java DSL. It is also advisable to stick to the Camel API, extending custom Processors when extra functionality is needed. Bean injection and delegation should only happen when interfaces and single inheritance cannot be utilized for code re-use. We are attempting to keep the application context's XML as simple as possible. Plus, let's be honest, non-programmers and managers aren't going to be manipulating the XML and redeploying if we expose bean references in this manner. We're programmers, let's do as much in code as possible.

Aside from configuration and activemq setup, hopefully the application context can stay as simple as this for as long as possible:
Aside from configuration and ActiveMQ setup, hopefully the application context can stay as simple as this for as long as possible:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0"
@@ -137,7 +137,7 @@ Camel provides extra functionality for scripting language integration, but it is
In order to avoid over-genericized and over-engineered code, we are simply going to map out space for each operation that must be performed on each type of resource based on message type and content model. As work progresses and similarities present themselves, we will aggressively refactor in order to maintain code re-use. But as experience has proven, attempting to make a single system that handles all use cases will only lead to deterioration over time as the assumptions of the generic system are violated with each new data type/format and use case. We have to give each concept its own room in the code base so that things which at first appear similar can vary independently over the course of development.

### The Gateway
When messages first come in from Fedora through Activemq, there will be a sorting layer that will process each message so that it eventually winds up in the appropriate place. The things we will have to sort on are:
When messages first come in from Fedora through ActiveMQ, there will be a sorting layer that will process each message so that it eventually winds up in the appropriate place. The things we will have to sort on are:
- Resource type
- Container
- Binary
@@ -178,7 +178,7 @@ RouteBuilder builder = new RouteBuilder() {
```

### Derivative Creation
In order to interact with the various commandline programs utilized to create derivatives, we will take advantage of Camel's exec component, which passes the message body into the program that is executed through STDIN. Here' a trivial example using the wordcount function in linux 'wc', demonstrating how to handle the results:
In order to interact with the various command line programs utilized to create derivatives, we will take advantage of Camel's exec component, which passes the message body into the program that is executed through STDIN. Here' a trivial example using the word count function in Linux `wc`, demonstrating how to handle the results:
```Java
from("direct:exec")
.to("exec:wc?args=--words /usr/share/dict/words")
@@ -205,7 +205,7 @@ With so much of the core functionality being moved out of the Drupal layer, we'l
- Provide custom Islandora views
- Provide custom renderers for the access copy derivatives

It should be noted that although there will still exist a module for each content model, they will not be in separate git repos. There is a difference between modularity of code and modulatity of revision control. Managing some thirty odd git repos is a maintenance nightmare, and so you will see all code move into a single repository. This will help eliminate commit mis-matches between modules, and will synchronize changes with the middlware layer as well.
It should be noted that although there will still exist a module for each content model, they will not be in separate git repos. There is a difference between modularity of code and modularity of revision control. Managing some thirty odd git repos is a maintenance nightmare, and so you will see all code move into a single repository. This will help eliminate commit mis-matches between modules, and will synchronize changes with the middleware layer as well.

[Apache Camel]:http://camel.apache.org/
[Enterprise Integration Patterns]:http://www.enterpriseintegrationpatterns.com/