dccn-tg/dr-data-stager

Data Stager

An efficient data transfer service for transferring data between the work-in-progress storage at DCCN and the Radboud Data Repository.

Note: this is a rewrite of the DCCN data-stager in Go.

Architecture

The figure below is a schematic drawing of the data stager architecture in relation to the DCCN infrastructure and the iRODS service of the Radboud Data Repository.

The components in green are key components of the data stager stack.

[figure: data stager architecture]

Stager users

There are two types of users of the data-stager: the DCCN data streamer and the researcher.

  1. The data streamer implements automatic raw data transfer from DCCN MEG/MRI labs to DACs.

    After a data acquisition in the lab is completed, the streamer initiates the data transfer. Since every data acquisition is associated with a DCCN project ID, the streamer calls out to the data stager to resolve the DAC namespace corresponding to the project ID and submits a data stager task.

    Since no particular RDR user is involved in this automatic data transfer, an RDR service account is used to interact with RDR and transfer data to it.

  2. Researchers use the data-stager to transfer data between RDR and the DCCN's project storage.

    In this use case, researchers use the web-based graphical interface to specify transfer sources and destinations, and submit the corresponding transfer tasks to the data stager.

    Researchers log in to RDR with their data-access credentials (retrieved from the RDR portal) in order to browse through RDR collections. The data-access credential is transferred to the data stager and used to interact with iRODS for data transfer.

Stager task

The stager task is a JSON document like the one below.

{
  "drPass": "string",
  "drUser": "string",
  "dstURL": "string",
  "srcURL": "string",
  "stagerUser": "string",
  "stagerEmail": "string",
  "timeout": 0,
  "timeout_noprogress": 0,
  "title": "string"
}

A task is submitted to the API server and dispatched to a distributed Worker. The task scheduler is implemented with the asynq Go library. Administrators can manage the tasks through the Asynqmon WebUI.

For each transfer, the Worker spawns a child process as the stagerUser to execute a CLI program called s-isync, which performs data transfer between the local filesystem and iRODS. When interacting with iRODS, s-isync makes use of the go-irodsclient Go library.

DCCN credential

The UI (frontend) implements the OIDC workflow through the UI (backend).

RDR credential

The user is authenticated through the UI (backend) with the RDR data-access credential when submitting transfer jobs from the UI (frontend). The authentication is done by the UI (backend) making a PROPFIND call to the RDR WebDAV endpoint and checking the response code (e.g. a 401 response indicates an invalid credential).

Following a successful authentication, the credential is transferred and stored at the UI (backend) as a session cookie which is valid for 4 hours. When the user submits a transfer job, the credential is encrypted by the UI (backend) and transferred to the API server as part of the task payload. Tasks are stored in the task store (redis). When the task is processed by the Worker, the credential is decrypted and used by the s-isync program to perform data transfer using the iRODS protocol.

The credential en-/decryption uses an RSA key pair.

The s-isync program

The s-isync program is a standalone CLI written in Go and uses the go-irodsclient library to communicate with the RDR iRODS service.

When the Worker processes a transfer job, it makes a system call to run the s-isync program under the account of stagerUser. This guarantees that the data-access rights on the host filesystem (e.g. the /project directory) are respected.

When interacting with iRODS, the s-isync program uses the RDR data-access credential (i.e. drUser and drPass), so that access rights to RDR collections are respected and the resulting RDR event logs are attributed to the right user.

Build the containers

The containers for the API server and the Worker can be built with the command below:

$ docker-compose build

Environment variables

Some of the supported environment variables are listed in env.sh or print_env.sh.