RFC: Distributed Pipeline execution via Workers #107

Open
Skarlso opened this Issue Sep 5, 2018 · 7 comments

Skarlso commented Sep 5, 2018

Abstract

This document discusses the problem of executing pipelines in a distributed
manner.

Table of Contents

  1. Introduction
  2. Problem Statement
  3. Terminology
  4. Architecture Diagram
  5. Proposed Worker Distribution Model
  6. Managing Workers
  7. Worker Tags
  8. The Worker RPC API
  9. Gaia Master - Agent
  10. Scheduling Jobs
  11. Implementation Approach

Introduction

Problem Statement

The problem poses the following set of challenges for Gaia:

  1. Manage workers
    • See what pipeline is running on which worker at any given point in time
    • Add / Delete / Suspend workers
    • Add specific environment variables to the worker
  2. Choose, either automatically or manually, which pipeline should run on which
    worker.
  3. Label the workers so the user knows whether a worker is a Windows or a Linux
    machine and which SDKs (Go, Python, Java, etc.) are available on it.

Terminology

Gaia Master: The Gaia Master is a running instance of Gaia launched via make or the
released Gaia binary.
Worker: A worker is a server which is connected to the Gaia Master and has
certain capabilities, such as which SDKs it supports or which operating system
is installed on it.
Pipeline: A pipeline is a configured entity with a set of Jobs.
Job: A job is a single running task, such as creating a user. A pipeline can have multiple jobs.
RPC: Remote Procedure Call

Architecture Diagram

[Figure: distributed workers]

Proposed Worker Distribution Model

The proposed model which aims to solve this problem is laid out as follows.

Managing Workers

Workers will be managed through a set of API endpoints.
All workers are stored in the database with a designated set of labels,
an assigned name, and an IP address.

These endpoints will be Delete and Suspend. Since adding a worker will be taken care of
by the Gaia agent, that operation is not supported here specifically.

Delete: Delete will simply remove the server from the rotation. It won't restart
the server or shut it down; it will only delete it from the database which
holds the worker instances.
Suspend: Suspending a worker will take it out of rotation but will not delete it.
While suspended, the worker will not be able to run any pipelines. This is a good option
if some kind of maintenance needs to be performed on the machine.
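
The difference between the two operations could be sketched as follows. This is only an illustration: the in-memory WorkerStore, the Suspended flag, and the Put and InRotation helpers are assumptions made for the example and are not part of Gaia.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Worker holds what the RFC says is stored: a name, an IP address and labels.
// The Suspended flag is an assumption about how suspension could be recorded.
type Worker struct {
	Name      string
	IP        string
	Labels    []string
	Suspended bool
}

// ErrWorkerNotFound is returned when no worker with the given name is stored.
var ErrWorkerNotFound = errors.New("worker not found")

// WorkerStore is a hypothetical in-memory stand-in for the worker database.
type WorkerStore struct {
	mu      sync.Mutex
	workers map[string]*Worker
}

// NewWorkerStore returns an empty store.
func NewWorkerStore() *WorkerStore {
	return &WorkerStore{workers: make(map[string]*Worker)}
}

// Put saves a worker. In the RFC this is done by the agent, not by an endpoint.
func (s *WorkerStore) Put(w Worker) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.workers[w.Name] = &w
}

// Delete removes the record only; the machine itself is left untouched.
func (s *WorkerStore) Delete(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.workers[name]; !ok {
		return ErrWorkerNotFound
	}
	delete(s.workers, name)
	return nil
}

// Suspend keeps the record but takes the worker out of rotation.
func (s *WorkerStore) Suspend(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	w, ok := s.workers[name]
	if !ok {
		return ErrWorkerNotFound
	}
	w.Suspended = true
	return nil
}

// InRotation returns the sorted names of workers that may receive pipelines.
func (s *WorkerStore) InRotation() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	names := make([]string, 0, len(s.workers))
	for name, w := range s.workers {
		if !w.Suspended {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	store := NewWorkerStore()
	store.Put(Worker{Name: "worker-1", IP: "10.0.0.1", Labels: []string{"linux"}})
	store.Put(Worker{Name: "worker-2", IP: "10.0.0.2", Labels: []string{"windows"}})

	if err := store.Suspend("worker-1"); err != nil {
		fmt.Println("suspend failed:", err)
	}
	fmt.Println("in rotation after suspend:", store.InRotation())

	if err := store.Delete("worker-2"); err != nil {
		fmt.Println("delete failed:", err)
	}
	fmt.Println("in rotation after delete:", store.InRotation())
}
```

A suspended worker stays in the store and could be resumed later, while a deleted worker would have to register again through the agent.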

Worker Tags

The workers will need to be tagged with what kind of resource they are providing. For example:

name      tags
Worker 1  Ubuntu Linux 64bit
Worker 2  Windows 10 64bit
Worker 3  Debian Linux 64bit

When a pipeline is first created, the kind of resources it requires needs to be set in the pipeline creation window. The tags will need to be made accessible through a drop-down list for ease of use. Tags can be created when a Worker is created and saved to Gaia. Tagging can also be done manually on the Worker Manager screen.
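
Matching a pipeline's required tags against the workers could look roughly like the sketch below. It assumes that each tag is stored as a separate string (for example "Linux" and "64bit"), that matching ignores case, and that a worker qualifies only if it has every required tag. None of these rules are specified above; HasAllTags and EligibleWorkers are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Worker is reduced to the two columns of the table: a name and its tags.
type Worker struct {
	Name string
	Tags []string
}

// HasAllTags reports whether the worker provides every required tag.
// Comparison is case-insensitive, which is an assumption of this sketch.
func HasAllTags(w Worker, required []string) bool {
	have := make(map[string]struct{}, len(w.Tags))
	for _, t := range w.Tags {
		have[strings.ToLower(t)] = struct{}{}
	}
	for _, r := range required {
		if _, ok := have[strings.ToLower(r)]; !ok {
			return false
		}
	}
	return true
}

// EligibleWorkers returns the names of all workers that satisfy the
// pipeline's required tags, in their original order.
func EligibleWorkers(workers []Worker, required []string) []string {
	var names []string
	for _, w := range workers {
		if HasAllTags(w, required) {
			names = append(names, w.Name)
		}
	}
	return names
}

func main() {
	workers := []Worker{
		{Name: "Worker 1", Tags: []string{"Ubuntu", "Linux", "64bit"}},
		{Name: "Worker 2", Tags: []string{"Windows 10", "64bit"}},
		{Name: "Worker 3", Tags: []string{"Debian", "Linux", "64bit"}},
	}
	// A pipeline that asked for a 64bit Linux machine at creation time.
	fmt.Println(EligibleWorkers(workers, []string{"Linux", "64bit"}))
}
```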

The Worker RPC API

The Workers will talk to the Gaia Master via a set of defined RPC interfaces.
These are as follows:

// RegisterWorker will take a worker struct which contains the following information:
// Security: This will be protected by the TLS connection between master and worker.
// IP: The address of the worker
// Name: The name of the worker which typically can be `hostname`.
// Operating System: The OS of the worker to save as a label.
// SDK: The SDK the worker has.
rpc RegisterWorker(Worker) {}

// RunPipeline will take a pipeline and execute it. This is a bi-directional endpoint.
// Pipeline struct:
// ID: Id of the pipeline
// Repo: The git repository for the pipeline. This is needed because the worker needs
// to build the pipeline.
rpc RunPipeline(Pipeline) returns (Success) {}

rpc GetAllPipelines(Worker) returns (Pipelines) {}
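
To make the message shapes more concrete, here is a minimal in-process sketch in Go of the master side of RegisterWorker and GetAllPipelines. It is not generated gRPC code. The Master type, the Assign helper, the field types (for example an integer pipeline ID) and the validation rule are all assumptions made for illustration. The sketch is also not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker carries the fields listed in the RegisterWorker comment.
type Worker struct {
	IP   string
	Name string
	OS   string
	SDK  string
}

// Pipeline carries the fields listed in the RunPipeline comment.
// The ID type is an assumption; the RFC does not specify it.
type Pipeline struct {
	ID   int
	Repo string
}

// Master is a hypothetical in-process stand-in for the RPC server.
type Master struct {
	workers  map[string]Worker     // registered workers by name
	assigned map[string][]Pipeline // pipelines handed to each worker
}

// NewMaster returns a master with no registered workers.
func NewMaster() *Master {
	return &Master{
		workers:  make(map[string]Worker),
		assigned: make(map[string][]Pipeline),
	}
}

// RegisterWorker stores the worker; the OS and SDK would become labels.
func (m *Master) RegisterWorker(w Worker) error {
	if w.Name == "" || w.IP == "" {
		return errors.New("worker needs a name and an IP address")
	}
	m.workers[w.Name] = w
	return nil
}

// Assign is a hypothetical helper that records a pipeline for a worker.
func (m *Master) Assign(workerName string, p Pipeline) error {
	if _, ok := m.workers[workerName]; !ok {
		return errors.New("unknown worker: " + workerName)
	}
	m.assigned[workerName] = append(m.assigned[workerName], p)
	return nil
}

// GetAllPipelines returns the pipelines recorded for the calling worker.
func (m *Master) GetAllPipelines(w Worker) ([]Pipeline, error) {
	if _, ok := m.workers[w.Name]; !ok {
		return nil, errors.New("unknown worker: " + w.Name)
	}
	return m.assigned[w.Name], nil
}

func main() {
	m := NewMaster()
	w := Worker{IP: "10.0.0.5", Name: "build-host", OS: "linux", SDK: "go"}
	if err := m.RegisterWorker(w); err != nil {
		fmt.Println("register failed:", err)
		return
	}
	if err := m.Assign(w.Name, Pipeline{ID: 1, Repo: "https://example.com/repo.git"}); err != nil {
		fmt.Println("assign failed:", err)
		return
	}
	pipelines, err := m.GetAllPipelines(w)
	fmt.Println(pipelines, err)
}
```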

Gaia Master - Agent

The current Gaia implementation will remain and will be designated as Gaia Master.
The master will be a hub for the workers to connect to, get pipelines from, and report
back to on the current state of the pipelines they are running.

As such, Gaia Master will no longer be solely responsible for building and distributing
binaries. Since the operating system of the worker determines the format of the
binaries, the workers will build their own binaries.

This means a worker will get a repository to pull code from and will do everything
that Gaia currently does. This will not involve duplicating code, however, since all of
that functionality will live in the worker package. Gaia Master will use this package by
setting the worker to localhost.

The Workers will need to have the go-plugin extracted because HashiCorp's plugin
system does not support RPC calls over the network; only localhost communication
is allowed. Pipeline execution and communication about running jobs and their state
changes all go through RPC.

Scheduling Jobs

Job scheduling will also have to be included in the workers. Workers will schedule
their own parallel job execution, while Gaia Master will have to schedule and manage
which worker to distribute pipelines to. This means that the workers will need an indicator
that signals when they are too busy to accept more pipelines.
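
One possible shape for such an indicator is a simple capacity check, sketched below. The Running and Capacity fields and the "most free slots wins" rule in PickWorker are assumptions of this example. The RFC does not define how busyness is measured or how the master chooses between idle workers.

```go
package main

import "fmt"

// Worker tracks how many pipelines it runs and how many it can take.
// Both fields are hypothetical; the RFC only asks for a busy indicator.
type Worker struct {
	Name     string
	Running  int
	Capacity int
}

// Busy is the indicator: true when the worker cannot accept more pipelines.
func (w Worker) Busy() bool {
	return w.Running >= w.Capacity
}

// PickWorker returns the non-busy worker with the most free slots.
// Ties go to the worker that appears first. The boolean is false when
// every worker is busy.
func PickWorker(workers []Worker) (Worker, bool) {
	best := -1
	bestFree := 0
	for i, w := range workers {
		if w.Busy() {
			continue
		}
		if free := w.Capacity - w.Running; free > bestFree {
			best = i
			bestFree = free
		}
	}
	if best < 0 {
		return Worker{}, false
	}
	return workers[best], true
}

func main() {
	workers := []Worker{
		{Name: "worker-1", Running: 2, Capacity: 2},
		{Name: "worker-2", Running: 1, Capacity: 4},
		{Name: "worker-3", Running: 0, Capacity: 2},
	}
	if w, ok := PickWorker(workers); ok {
		fmt.Println("distribute next pipeline to", w.Name)
	} else {
		fmt.Println("all workers are busy")
	}
}
```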

Where jobs are built

Currently, once a user initiates a pipeline build, that pipeline is saved and built on Gaia Master.
This has to change in order for the worker to be able to run the pipeline. The binary
needs to be built on the worker. However, Gaia Master also needs to be aware of the jobs
and performs pre-validation, which means it also needs to build the pipeline.

Scenario 1:

We build the pipeline on both the Gaia Master and the Worker. This gives us immediate validation of the pipeline, but we have to duplicate the building process.

Scenario 2:

We build only on the worker and just save the pipeline on the master to track it. Validation is deferred until the pipeline is actually built on one of the workers, but the building process isn't duplicated.

Implementation Approach

  1. Extract all functionality regarding running and building pipelines, including
    the SDK and the go-plugin facility, into a worker package. This should not change
    the current behavior of Gaia. All tests should still pass, including the WebHook
    capability, which should still be able to just call build. The worker package should
    take care of building and distributing the binary.

  2. Create the API which handles most worker-related operations, but don't
    extract it yet.

  3. Create an Agent binary which calls back to the master's RPC API and registers a
    server as a worker.

  4. Implement the management of the servers below Settings on the left of the admin
    screen.

@Skarlso Skarlso added the needs review label Sep 5, 2018

Skarlso commented Sep 5, 2018

Deals with #46.

michelvocks commented Sep 6, 2018

Awesome work, @Skarlso . 🤗

A few hints from my side:

  1. RegisterWorker should pass something like a secret. Otherwise attackers could easily attach random workers to Gaia and use that to their advantage.
  2. Currently it's not really clear when pipelines are built on the worker. When a user opens the detailed view of a pipeline, they should be able to see all jobs from this pipeline. That means the pipeline needs to be started after creation and GetJobs must be executed to get this information.
  3. We might need an interface for DeregisterWorker. If you dynamically spawn new workers "on-demand", they need to be removed somehow. And a graceful removal can be done by the worker itself.
  4. Really love the Suspend functionality! 🤗

Skarlso commented Sep 6, 2018

  1. Of course. I'll spell that out more in detail.
  2. Uh. :D Yeah.
  3. The dynamic spawning I would leave for later. For now I was just thinking that someone spins up a server and executes the agent on it, which calls home.
  4. Thanks. :) 😁

Skarlso commented Sep 26, 2018

@michelvocks just a heads up. I'm going to start on this soon, which means there will be merge conflicts all over the place as I extract the building functionality into a worker package. :)

michelvocks commented Sep 27, 2018

Awesome @Skarlso 🤗 Looking forward to all the merge conflicts 💀 🤓

Skarlso commented Oct 31, 2018

@michelvocks Added some more info about Worker tags and requirements for resource tagging.

michelvocks commented Feb 1, 2019

One additional point which is missing in the RFC: Securing the communication between Master and Worker.

The proposal is as follows:

  1. Generate a global secret in Gaia Master. This secret can be replaced by a newly generated one (in case of a leak).
  2. Every Worker expects that secret as an argument as well as the hostname of Gaia Master.
  3. Gaia Master provides two different gRPC APIs: 1.) Registration API 2.) API for already registered workers.
  4. Worker connects (insecure) to the registration API and starts the registration process with the given secret.
  5. Gaia Master validates given secret and registers worker.
  6. Gaia Master returns valid certificates for mTLS which can be used for API 2.).
  7. Worker can securely talk with Gaia Master via a mTLS secured connection.

This should work without any problems. The only disadvantage (from a security perspective) is the initial registration process, where the secret is sent to the master in plain text. What do you think @Skarlso ?
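
Steps 1 and 5 of the list above (generating the global secret and validating it during registration) could be sketched in Go as follows. GenerateSecret and ValidSecret are hypothetical names, and the 32-byte length and hex encoding are assumptions. Certificate issuance for mTLS (step 6) is deliberately left out.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// GenerateSecret returns n random bytes as a hex string (2*n characters).
// Calling it again yields a replacement secret, e.g. after a leak.
func GenerateSecret(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// ValidSecret compares the secret a worker presented with the stored one.
// ConstantTimeCompare avoids leaking, through timing, how many leading
// bytes matched. It returns 0 right away if the lengths differ.
func ValidSecret(expected, given string) bool {
	return subtle.ConstantTimeCompare([]byte(expected), []byte(given)) == 1
}

func main() {
	secret, err := GenerateSecret(32)
	if err != nil {
		fmt.Println("could not generate secret:", err)
		return
	}
	fmt.Println("worker with correct secret accepted:", ValidSecret(secret, secret))
	fmt.Println("worker with wrong secret accepted:", ValidSecret(secret, "guess"))
}
```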

@michelvocks michelvocks referenced a pull request that will close this issue Feb 2, 2019

Open

[WIP] Distributed execution via worker #166
