
architecture


This page provides a high-level description of the system that runs http://gimmefreshdata.github.io and associated services.

components

Fresh Data uses the following components:

  1. https://github.com/gimmefreshdata/gimmefreshdata.github.io uses GitHub Pages and Jekyll to provide web pages at http://gimmefreshdata.github.io
  2. Archive uses Jenkins to manage occurrence data archives.
  3. https://github.com/bio-guoda/effechecka uses Akka to run the api as a service and provide a way for users to check for new data (see the api sketch after this list).
  4. The Spark job at https://github.com/bio-guoda/idigbio-spark uses Apache Spark v2.x and Mesos v0.23 to crunch the data archives and populate monitors with relevant occurrence data.
  5. Both the api and the Spark job use Hadoop's HDFS to persist monitors and searches and to cache monitor results.
  6. Mesos v0.23 and Marathon run the api, Apache Spark, and Kafka v0.9.0.0. Cassandra runs as a service outside of Mesos/Marathon (see https://github.com/gimmefreshdata/freshdata/issues/37).
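
The api component exposes monitor searches over HTTP. Below is a minimal sketch of what such an endpoint could look like using Akka HTTP; the route path, bind address, and query parameter names (taxonSelector, wktString) are assumptions for illustration and do not reproduce the actual effechecka code.

```scala
// Minimal sketch of an HTTP endpoint in the spirit of the effechecka api.
// Route path and parameter names are illustrative assumptions.
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object MonitorApiSketch extends App {
  implicit val system: ActorSystem = ActorSystem("effechecka-sketch")

  // A user describes a monitor with a taxon selector and a WKT area of interest.
  val route =
    path("checklist") {
      get {
        parameters("taxonSelector", "wktString") { (taxa, wkt) =>
          complete(s"monitor for taxa [$taxa] within [$wkt]: status would be reported here")
        }
      }
    }

  Http().newServerAt("localhost", 8080).bind(route)
}
```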

Note that the use of Cassandra and Kafka was later removed in an attempt to make the system easier to manage by reducing the number of external dependencies. This effectively turned the system from a push notification system (users receive a message when records change) into a pull notification system (users periodically check whether records were modified).

dynamics

Roughly three workflows are supported: create monitor, add archive, and check.

create monitor (previously: subscribe)

A freshdata user specifies search criteria to create a monitor. When a monitor is first created, a Spark job is launched to collect occurrence records that match the search criteria from registered data archives. Once all results are collected and cached in Hadoop's HDFS, the web api provides the information needed for users to check for, and retrieve, new records.
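
A rough sketch of such a first-run Spark job follows, assuming occurrence archives are stored as parquet with Darwin Core column names; the HDFS paths, the monitor id, and the filter criteria are hypothetical, not the actual idigbio-spark implementation.

```scala
// Sketch of a first-run monitor job: filter occurrence records from a parquet
// archive by the monitor's search criteria and cache the matches on HDFS.
// Paths, column names, and criteria are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CreateMonitorSketch extends App {
  val spark = SparkSession.builder
    .appName("create-monitor-sketch")
    .getOrCreate()

  // Hypothetical criteria: birds within a bounding box (columns assumed numeric).
  val occurrences = spark.read.parquet("hdfs:///freshdata/archives/occurrences.parquet")
  val matches = occurrences
    .filter(col("class") === "Aves")
    .filter(col("decimalLatitude").between(24.0, 50.0))
    .filter(col("decimalLongitude").between(-125.0, -66.0))

  // Cache the result set so the web api can answer checks without rescanning archives.
  matches.write.mode("overwrite")
    .parquet("hdfs:///freshdata/monitors/monitor-1234/occurrences.parquet")

  spark.stop()
}
```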

add archive

var 1 - manually add archive

A data manager/provider adds a new data archive by creating a job at Archive using a Jenkins pipeline cloned and customized from https://github.com/gimmefreshdata/source-neon. After the new archive has been successfully converted to parquet and linked to the freshdata archive library, the subscriber monitors are updated by http://archive.effechecka.org/job/update%20monitors/. Subscribers are notified daily of new changes to their (fresh) data monitors by http://archive.effechecka.org/job/notify%20subscribers/.
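
The parquet conversion step could look roughly like the Spark snippet below; the input layout (a tab-separated Darwin Core occurrence file with a header row) and the HDFS paths are assumptions for illustration.

```scala
// Sketch of the archive-to-parquet conversion a Jenkins job might run.
// Input format and paths are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object ConvertArchiveSketch extends App {
  val spark = SparkSession.builder
    .appName("convert-archive-sketch")
    .getOrCreate()

  // Read a delimited occurrence dump with a header row and write it out as
  // parquet so downstream monitor jobs can scan it efficiently.
  spark.read
    .option("header", "true")
    .option("sep", "\t")
    .csv("hdfs:///freshdata/staging/neon/occurrence.txt")
    .write.mode("overwrite")
    .parquet("hdfs:///freshdata/archives/neon/occurrence.parquet")

  spark.stop()
}
```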

var 2 - refresh archive

A Jenkins pipeline at http://archive.effechecka.org is configured to periodically download a data archive. After the download, the pipeline compares the previous version of the archive with the new one. If a change is detected, the monitors are updated by running a Spark job.
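
The comparison can be as simple as hashing the old and new downloads; the sketch below illustrates that idea in Scala, with the file locations as assumptions (the actual pipeline may detect changes differently).

```scala
// Sketch of the change-detection step in the refresh pipeline: hash the
// freshly downloaded archive and compare it with the previous version.
// File locations and the triggering mechanism are assumptions.
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object DetectArchiveChangeSketch extends App {
  // Reads the whole file into memory: fine for a sketch, but a streaming
  // digest would suit very large archives better.
  def sha256(path: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(Files.readAllBytes(Paths.get(path)))
      .map("%02x".format(_)).mkString

  val previous = sha256("/var/freshdata/archives/neon/previous.zip")
  val latest   = sha256("/var/freshdata/archives/neon/latest.zip")

  if (previous != latest) {
    // In the real pipeline, this is where Jenkins would launch the Spark job
    // that updates the monitors.
    println("archive changed; trigger monitor update")
  } else {
    println("archive unchanged; nothing to do")
  }
}
```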

check

Users are expected to periodically check (poll) the web api for changes in their monitors.
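
A check is a plain HTTP request against the api. The sketch below issues one using the JDK HTTP client; the host, endpoint, and query parameters mirror the assumptions in the api sketch above and are not the actual service URL.

```scala
// Sketch of the pull-style "check" workflow: ask the web api whether a
// monitor has new records. Host, path, and parameters are assumptions.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object CheckMonitorSketch extends App {
  val client = HttpClient.newHttpClient()
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://api.effechecka.org/checklist?taxonSelector=Aves&wktString=ENVELOPE(-125,-66,50,24)"))
    .GET()
    .build()

  // A real client would run this on a schedule and diff against the last
  // response it saw; here we issue a single check.
  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"status=${response.statusCode()} body=${response.body().take(200)}")
}
```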

deployment

At the time of writing (Aug 2016), the components were deployed on a single-node cluster: one 32-core Ubuntu server with 128 GB of memory and ~1.5 TB of disk space.

At the time of writing (Feb 2019), the components are deployed on a compute cluster of more than 10 nodes.