RFC: Distributed Pipeline execution via Workers #107

Skarlso · 2018-09-05T20:05:35Z

Abstract

This document discusses the problem of executing pipelines in a distributed
manner.

Introduction

Problem Statement

The problem poses the following set of challenges for Gaia:

- Manage workers:
  - See which pipeline is running on which worker at any given point in time
  - Add / Delete / Suspend workers
  - Add specific environment variables to the worker
- Choose, either automatically or manually, which pipeline should run on which
  worker.
- Label the workers so the user knows whether a worker is a Windows or a Linux
  machine, or whether the Go, Python, or Java SDK is available on it, etc.

Terminology

Gaia Master: The Gaia Master is a running instance of Gaia, launched via make or the
released Gaia binary.
Worker: A worker is a server which is connected to the Gaia Master and has
certain capabilities, such as which SDKs it supports or which operating system
is installed on it.
Pipeline: A pipeline is a configured entity with a set of jobs.
Job: A job is a single running task, such as creating a user. A pipeline can have multiple jobs.
RPC: Remote Procedure Call.

Architecture Diagram

Proposed Worker Distribution Model

The proposed model which aims to solve this problem is laid out as follows.

Managing Workers

Workers will be managed through a set of API endpoints.
All workers are stored in the database with a designated set of labels, an
assigned name, and an IP address.

The endpoints will be Delete and Suspend. Since adding a worker will be taken care of
by the Gaia agent, that operation is not supported here specifically.

Delete: Delete simply removes the server from the rotation. It won't restart
the server or shut it down; it just deletes it from the database which
holds the worker instances.
Suspend: Suspending a worker takes it out of rotation but does not delete it.
While suspended, the worker will not be able to run any pipelines. This is a good option
if some kind of maintenance needs to be performed on the machine.
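
The difference between the two operations can be sketched roughly as below. This is only an illustration, not the actual Gaia code: the `Registry` type stands in for the database, and `Status`, `Add`, and `InRotation` are hypothetical names made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical worker state; the RFC only names "suspended".
type Status int

const (
	Active Status = iota
	Suspended
)

// Worker mirrors the fields the RFC says are stored: name, IP and labels.
type Worker struct {
	Name   string
	IP     string
	Labels []string
	Status Status
}

// Registry stands in for the database which holds the worker instances.
type Registry struct {
	workers map[string]*Worker
}

// ErrNotFound is returned when no worker with the given name is stored.
var ErrNotFound = errors.New("worker not found")

func NewRegistry() *Registry {
	return &Registry{workers: make(map[string]*Worker)}
}

// Add is what the agent's registration would end up doing (assumption).
func (r *Registry) Add(w Worker) {
	r.workers[w.Name] = &w
}

// Delete only removes the record; the machine itself is left untouched.
func (r *Registry) Delete(name string) error {
	if _, ok := r.workers[name]; !ok {
		return ErrNotFound
	}
	delete(r.workers, name)
	return nil
}

// Suspend keeps the record but takes the worker out of rotation.
func (r *Registry) Suspend(name string) error {
	w, ok := r.workers[name]
	if !ok {
		return ErrNotFound
	}
	w.Status = Suspended
	return nil
}

// InRotation reports whether the worker may receive pipelines.
func (r *Registry) InRotation(name string) bool {
	w, ok := r.workers[name]
	return ok && w.Status == Active
}

func main() {
	reg := NewRegistry()
	reg.Add(Worker{Name: "worker-1", IP: "10.0.0.5", Labels: []string{"linux"}})
	_ = reg.Suspend("worker-1")
	fmt.Println(reg.InRotation("worker-1")) // false
}
```

A suspended worker stays in the registry, so it could be put back into rotation later without the agent having to register again.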

Worker Tags

The workers will need to be tagged with what kind of resource they are providing. For example:

name	tags
Worker 1	Ubuntu Linux 64bit
Worker 2	Windows 10 64bit
Worker 3	Debian Linux 64bit

When a pipeline is first created, it needs to state on the pipeline creation window what kind of resources it requires. These tags should be made available in a drop-down list for ease of use. Tags can be created when a Worker is created and saved to Gaia. Workers can also be tagged manually on the Worker Manager screen.
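
Matching a pipeline's required tags against the worker tags could look something like the sketch below. It assumes that tags are stored as separate values per worker and compared case-insensitively; neither is specified in this RFC, and `hasAllTags` and `Eligible` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// Worker holds a name and the tags from the table above.
type Worker struct {
	Name string
	Tags []string
}

// hasAllTags reports whether every required tag is present on the worker.
// The case-insensitive comparison is an assumption of this sketch.
func hasAllTags(w Worker, required []string) bool {
	have := make(map[string]bool, len(w.Tags))
	for _, t := range w.Tags {
		have[strings.ToLower(t)] = true
	}
	for _, t := range required {
		if !have[strings.ToLower(t)] {
			return false
		}
	}
	return true
}

// Eligible returns the workers that satisfy a pipeline's required tags.
func Eligible(workers []Worker, required []string) []Worker {
	var out []Worker
	for _, w := range workers {
		if hasAllTags(w, required) {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	workers := []Worker{
		{Name: "Worker 1", Tags: []string{"Ubuntu", "Linux", "64bit"}},
		{Name: "Worker 2", Tags: []string{"Windows 10", "64bit"}},
		{Name: "Worker 3", Tags: []string{"Debian", "Linux", "64bit"}},
	}
	// A pipeline that requires a 64bit Linux machine.
	for _, w := range Eligible(workers, []string{"linux", "64bit"}) {
		fmt.Println(w.Name) // prints "Worker 1" and "Worker 3"
	}
}
```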

The Worker RPC API

The Workers will talk to the Gaia Master via a set of defined RPC interfaces.
These are as follows:

// RegisterWorker will take a worker struct which contains the following information:
// Security: This will be protected by the TLS connection between master and worker.
// IP: The address of the worker
// Name: The name of the worker which typically can be `hostname`.
// Operating System: The OS of the worker to save as a label.
// SDK: The SDK the worker has.
rpc RegisterWorker(Worker) {}

// RunPipeline will take a pipeline and execute it. This is a bi-directional endpoint.
// Pipeline struct:
// ID: Id of the pipeline
// Repo: The git repository for the pipeline. This is needed because the worker needs
// to build the pipeline.
rpc RunPipeline(Pipeline) returns (Success) {}

rpc GetAllPipelines(Worker) returns (Pipelines) {}

Gaia Master - Agent

The current Gaia implementation will remain and will be designated as Gaia Master.
The master will be a hub for the workers to connect to, get pipelines from, and report
back to on the current state of the pipelines they are running.

As such, Gaia Master will no longer be solely responsible for building and distributing
binaries. Since the operating system of the worker determines the format of the
binaries, the workers will build their own binaries.

This means a worker will get a repository to pull code from and will do everything
that Gaia does currently. This will not involve duplicating code, however, since all of
that logic will live in the worker package. Gaia Master will use this package by
setting the worker to localhost.

The go-plugin facility will need to be extracted into the Workers because HashiCorp's plugin
system does not support RPC calls over the network; only localhost communication
is allowed. Pipeline execution and the communication of job runs and state
changes all happen through RPC.

Scheduling Jobs

Job scheduling will also have to move into the workers. Workers will schedule
the parallel execution of their own jobs, and Gaia Master will have to schedule and manage
which worker to distribute pipelines to. This means that the workers will need an indicator
that signals when they are too busy to accept more pipelines.
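
One possible shape for such an indicator is sketched below. The `Running` and `Capacity` fields and the "least loaded worker wins" rule are assumptions made for the sake of the example; the RFC does not define how busy is measured or how the master picks a worker.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker carries a hypothetical load indicator: how many pipelines it is
// running and how many it is willing to run at once.
type Worker struct {
	Name     string
	Running  int
	Capacity int
}

// Busy is the "too busy to accept more pipelines" signal.
func (w Worker) Busy() bool { return w.Running >= w.Capacity }

// ErrAllBusy is returned when no worker can take another pipeline.
var ErrAllBusy = errors.New("all workers are busy")

// Pick returns the index of the least loaded worker that is not busy.
func Pick(workers []Worker) (int, error) {
	best := -1
	for i, w := range workers {
		if w.Busy() {
			continue
		}
		if best == -1 || w.Running < workers[best].Running {
			best = i
		}
	}
	if best == -1 {
		return 0, ErrAllBusy
	}
	return best, nil
}

func main() {
	workers := []Worker{
		{Name: "worker-1", Running: 2, Capacity: 2},
		{Name: "worker-2", Running: 1, Capacity: 4},
	}
	i, err := Pick(workers)
	if err != nil {
		fmt.Println(err)
		return
	}
	workers[i].Running++
	fmt.Println("dispatch to", workers[i].Name) // dispatch to worker-2
}
```

When every worker reports busy, the master would have to queue the pipeline and retry later; how that queue works is left open here.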

Where jobs are built

Currently, once a user initiates a pipeline build, that pipeline is saved and built on Gaia Master.
This has to change in order for the worker to be able to run the pipeline: the binary
needs to be built on the worker. However, Gaia Master also needs to be aware of the jobs
and does pre-validation, which means it also needs to build the pipeline.

Scenario 1:

We build the pipeline on both the Gaia Master and the Worker. This gives us immediate validation of the pipeline, but we have to duplicate the building process.

Scenario 2:

We only build on the worker and just save the pipeline on the master to track it. Validation is deferred until the pipeline is actually built on one of the workers, but the building process isn't duplicated.

Implementation Approach

- Extract all functionality regarding running and building pipelines, including
  the SDK and the go-plugin facility, into a worker package. This should not change
  the current behavior of Gaia. All tests should still pass, including the WebHook
  capability, which should still be able to just call build. The worker package should
  take care of building and distributing the binary.
- Create the API which handles most of the worker-related things, but still don't
  bother extracting it.
- Create an Agent binary which calls back to the master's RPC API and registers a
  server as a worker.
- Implement the management of the servers below settings on the left of the admin
  screen.

Skarlso · 2018-09-05T20:12:10Z

Deals with #46.

michelvocks · 2018-09-06T08:08:48Z

Awesome work, @Skarlso . 🤗

A few hints from my side:

- RegisterWorker should pass something like a secret. Otherwise attackers could easily attach random workers to Gaia and use that to their advantage.
- Currently it's not really clear when pipelines are built on the worker. When a user opens the detailed view of a pipeline, he should be able to see all jobs from this pipeline. That means the pipeline needs to be started after creation and GetJobs must be executed to get this information.
- We might need an interface for DeregisterWorker. If you dynamically spawn new workers "on-demand", they need to be removed somehow. And a graceful removal can be done by the worker itself.
- Really love the Suspend functionality! 🤗

Skarlso · 2018-09-06T09:02:15Z

Of course. I'll spell that out more in detail.
Uh. :D Yeah.
The dynamic spawning I would leave for later. For now I was just thinking that someone spins up a server and executes the agent on it, which calls home.
Thanks. :) 😁

Skarlso · 2018-09-26T20:44:21Z

@michelvocks just a heads up. I'm going to start on this soon, which means there will be merge conflicts all over the place as I extract the building functionality into a worker package. :)

michelvocks · 2018-09-27T06:53:15Z

Awesome @Skarlso 🤗 Looking forward to all the merge conflicts 💀 🤓

Skarlso · 2018-10-31T13:31:13Z

@michelvocks Added some more info about Worker tags and requirements for resource tagging.

michelvocks · 2019-02-01T20:54:09Z

One additional point which is missing in the RFC: Securing the communication between Master and Worker.

The proposal is as follows:

- Generate a global secret in Gaia Master. This secret can be replaced by a newly generated one (in case of a leak).
- Every Worker expects that secret as an argument, as well as the hostname of Gaia Master.
- Gaia Master provides two different gRPC APIs: 1.) Registration API 2.) API for already registered workers.
- Worker connects (insecure) to the registration API and starts the registration process with the given secret.
- Gaia Master validates the given secret and registers the worker.
- Gaia Master returns valid certificates for mTLS which can be used for API 2.).
- Worker can securely talk with Gaia Master via an mTLS secured connection.
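
The secret handling on the master side could be sketched roughly like this. It only covers generating, rotating, and validating the secret; issuing the mTLS certificates and the gRPC wiring are left out. The `Master` type and its methods are hypothetical names for this example, not existing Gaia code.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

// Master holds the global registration secret and the registered workers.
type Master struct {
	secret     string
	registered map[string]bool
}

// ErrBadSecret is returned when a worker presents the wrong secret.
var ErrBadSecret = errors.New("invalid registration secret")

// newSecret returns 32 random bytes, hex encoded (length is an assumption).
func newSecret() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// NewMaster generates the global secret on startup.
func NewMaster() (*Master, error) {
	s, err := newSecret()
	if err != nil {
		return nil, err
	}
	return &Master{secret: s, registered: make(map[string]bool)}, nil
}

// Rotate replaces the secret with a newly generated one, e.g. after a leak.
func (m *Master) Rotate() error {
	s, err := newSecret()
	if err != nil {
		return err
	}
	m.secret = s
	return nil
}

// Register validates the presented secret in constant time and records the
// worker. A real implementation would return the mTLS certificates here.
func (m *Master) Register(workerName, presented string) error {
	if subtle.ConstantTimeCompare([]byte(presented), []byte(m.secret)) != 1 {
		return ErrBadSecret
	}
	m.registered[workerName] = true
	return nil
}

func main() {
	m, err := NewMaster()
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Register("worker-1", "wrong"))  // invalid registration secret
	fmt.Println(m.Register("worker-1", m.secret)) // <nil>
}
```

Workers that registered before a rotation already hold their certificates, so in this sketch rotating the secret only affects new registrations.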

This should work without any problems. The only disadvantage (from a security perspective) is the initial registration process, where the secret is sent to the master in plain text. What do you think @Skarlso ?

Skarlso added the needs review label Sep 5, 2018

Skarlso mentioned this issue Sep 28, 2018

Extracting building logic into a workers package #115

Merged

Skarlso mentioned this issue Dec 11, 2018

Implement pipeline scheduling via cron syntax #105

Closed

michelvocks mentioned this issue Feb 2, 2019

Distributed execution via worker #166

Merged

michelvocks closed this as completed in #166 Jun 25, 2019
