The GUODA infrastructure is a server cluster and software stack hosted by the ACIS Lab at the University of Florida. These resources provide the capacity and technology required to process data sets of the size made available by biodiversity aggregators. The cluster is also co-located with the iDigBio infrastructure, so working with iDigBio data, such as images, can be as fast as a local network transfer.
We run Apache Mesos as the cluster's scheduler and process manager. Mesos allocates resources dynamically across the cluster in response to requests from processes, which lets us run both long-running applications, such as API servers and database caches, and batch-style jobs, such as Spark.
The main processing technology we use is Apache Spark. Spark is a distributed processing engine that can be considered a modern replacement for Hadoop MapReduce. The GUODA services are interfaces for running Spark jobs, whether through public APIs, jobs submitted by users, or Jupyter notebooks.
GUODA data is stored in the Hadoop Distributed File System (HDFS), a distributed, parallel file system that provides high-performance reads and writes for the data Spark processes.
Spark is aware of HDFS block placement, that is, where the parts of each file live relative to where Spark tasks are running. It attempts to read HDFS data from the local machine first, minimizing the network transfer HDFS would otherwise require.
The cluster is currently one IBM H-series blade chassis filled with fourteen HS22 blades, each with 8 cores and 24 GB of memory. Storage is integrated with these machines and consists of 1 TB of space per node. The services the cluster provides are available on UF's campus research network (CRN), which is connected to Internet2 at 10 Gbps and to the commodity internet at 1 Gbps.
By modern standards, this is a very small cluster. However, with a total of 112 cores and 336 GB of memory, it is significantly larger than the machines available to the collections community, and its total memory is roughly three times the size of the largest biodiversity data sets typically used.