Kube Queueing POC #3464

Merged
merged 47 commits into from
Jun 10, 2021

@davinchia (Contributor) commented May 18, 2021

What

Required changes for Airbyte to be deployed into separate nodes on Kubernetes.

How

See https://docs.google.com/document/d/1h1aJr0_AFx68EQL2W0QqVN8OSWpmik-uli6ynQ2weF0/edit?usp=sharing for up to date information.

Remaining work required to merge this into master:

  • move templates to fluent interface to get rid of hacky yaml templating
  • use the kube API for getting the entrypoint (use the fabric8 SDK instead of calling kube directly; the main challenge is pulling stdout properly from the logging endpoint)
  • remove kubectl install dependency
  • clean up getPodIP (remove scan)
  • robust pod state lookup in the factory and the pod process (every place where we look up pods; lookups need to be scoped by namespace too)
  • tighten up shutdown handling (clean up all threads)
  • capture stderr with entrypoint overriding (we want to capture error logs from the main entrypoint and use a separate named pipe for stderr)
  • handle KubePodProcess executorService exceptions (#3849)
  • create a ProcessBuilderFactory that produces a ProcessBuilder implementing only .start(), using KubePodProcess under the hood. We may need to adjust the interface to specify I/O requirements (e.g. whether it uses stdin) and update usages. We may want a different abstraction than ProcessBuilder itself, since we only want a startable process, not something whose environment etc. can be modified. Maybe Supplier<Process>? Or could we just produce the process itself if nobody modifies it?
  • do end to end sync test
  • do a high scale end to end sync test (Conduct High Scale Sync on new Kube deploy. #3847)
  • handle entrypoint passthrough for operators from the ProcessBuilderFactory
  • add tests for not logging records in any of the containers (not doing; see airbytehq/airbyte-internal-issues#103)
  • test that we're producing valid manifests from the fluent API (there is a getSpec call, but it spits out a whole bunch of random fields that are annoying to clean up; created airbytehq/airbyte-internal-issues#116)
  • test prematurely closing the process to ensure it kills the underlying pod
  • test failure modes for connector images (does waitFor work, is the return code correct, etc.) (not doing; see airbytehq/airbyte-internal-issues#103)
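The ProcessBuilderFactory item leaves the abstraction open, and one option it floats is returning the started process directly. A minimal sketch of that option, assuming a hypothetical `ProcessFactory` interface with an explicit `usesStdin` flag to express the I/O requirement. The local implementation wraps `java.lang.ProcessBuilder`; a Kubernetes implementation would presumably return a `KubePodProcess` instead, which is not shown here and is not the project's actual code.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical abstraction: callers receive an already-started Process and
// declare their I/O needs up front instead of mutating a ProcessBuilder.
interface ProcessFactory {
  Process create(List<String> command, boolean usesStdin) throws IOException;
}

// Local implementation backed by java.lang.ProcessBuilder. A Kubernetes
// implementation would presumably return a KubePodProcess instead (not shown).
final class LocalProcessFactory implements ProcessFactory {
  @Override
  public Process create(List<String> command, boolean usesStdin) throws IOException {
    Process process = new ProcessBuilder(command).start();
    if (!usesStdin) {
      // Close the child's stdin right away so it sees EOF instead of blocking on input.
      process.getOutputStream().close();
    }
    return process;
  }
}
```

Because callers only ever receive a started `Process`, they cannot change the environment or working directory afterwards, which is the property the item asks for.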

Recommended reading order

This is a big change. Each change is reviewed as it's merged into this branch. Not applicable.


davinchia and others added 24 commits May 19, 2021 16:10
…It also waits for the pod to be ready before doing anything else. Sync worker will also remove the pod on termination.
* nearly working sources

* update

* stdin example
…t the airbyte-workers resource folder; place all the poc yamls together.
also clean up kubernetes deploys.
@davinchia mentioned this pull request May 25, 2021
davinchia and others added 15 commits May 26, 2021 12:30
The first 6 points of #3464.

The only interesting thing about this PR is the kube pod shutdown. For whatever reason, the OkHttpPool isn't respecting the evictAll call and 1 idle thread remains. So instead of shutting down immediately, the worker pod shuts down after 5 mins, when the idle thread is reaped. There isn't an easy way to modify the pool's idle reap configuration right now. I don't think this issue is blocking since it's relatively benign, so I vote we create a ticket and come back to this once we do an e2e test.
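The delay described above, where a lingering idle pool thread keeps the JVM alive until it is reaped, can be reproduced with the JDK alone. This is an analogy, not OkHttp's code: a `ThreadPoolExecutor` with zero core threads and a five-minute keep-alive keeps its non-daemon worker alive after the task finishes, and only `shutdown()`, which interrupts idle workers, lets it exit early. For OkHttp itself, the `OkHttpClient` documentation suggests shutting down `client.dispatcher().executorService()` in addition to calling `client.connectionPool().evictAll()`; whether that explains the remaining thread here is an assumption worth checking. `IdleThreadDemo` is a hypothetical name.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: an idle worker thread lingers until keep-alive expires
// unless the pool is explicitly shut down.
final class IdleThreadDemo {
  // Zero core threads, idle threads kept for 5 minutes (threads are non-daemon by default).
  static ThreadPoolExecutor newPool() {
    return new ThreadPoolExecutor(0, Integer.MAX_VALUE, 5, TimeUnit.MINUTES, new SynchronousQueue<>());
  }

  // shutdown() interrupts idle workers, so they exit now instead of after the keep-alive.
  static boolean shutDownPromptly(ThreadPoolExecutor pool) throws InterruptedException {
    pool.shutdown();
    return pool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```

Without the `shutdown()` call, the single idle worker in this pool would keep the JVM running for up to five minutes after the last task, matching the symptom in the commit message.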
* processes must handle file mounting

* remove comment

* default to base entrypoint

* use process builder factory / select stdin / use a pool of ports

* fix up

* add super hacky copying example

* Checkpoint: Works end to end!

* Checkpoint: Use API to make sure init container is ready instead of blind sleep. Propagate exception in DefaultCheckConnectionWorker.

* Refactor KubePodProcess. Checked to make sure everything still works.

* Format.

* Clean up code. Begin putting this into variables and breaking up long constructor function.

* Add comments to explain what is happening.

* fix normalization test

* increase timeout for initcontainer

Co-authored-by: Davin Chia <davinchia@gmail.com>
* clean up

* remove source-always-works

* create separate commons-docker

* fix test
* enable kube e2e tests

* use more generally accepted env definition

* use new runners

* use its own runner and install minikube differently

* update name

* use kubectl alias

* use link instead of alias that doesn't propagate

* start minikube

* use driver=none

* go back to using action

* mess with versions

* revert runner

* install socat

* print logs after run

* also try re-running tasks

* always wait for file transfer

* use ports

* increase wait timeout for kube

* use different localhost ips and bump normalization to include an entrypoint

* proposed fix

* all working locally

* revert temporary changes

* revert normalization image change that's happening in a separate pr

* readability

* final comment
@github-actions bot added the area/platform and area/worker labels Jun 9, 2021
ports:
- containerPort: 8001
- containerPort: 9000
Review comment (Contributor, PR author):
@jrhizor why is this necessary?

Reply (Contributor):
This is the pool of worker ports that can be used by jobs on this scheduler node.
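The reply describes the exposed container ports as a pool shared by jobs on the scheduler node, and an earlier commit mentions "use a pool of ports". A minimal sketch of handing such ports out and returning them, assuming a fixed list of ports held in a blocking queue. `WorkerPortPool`, its methods, and the example port numbers are hypothetical, not the Airbyte implementation.

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: hands out ports from a fixed set exposed on the worker pod.
final class WorkerPortPool {
  private final LinkedBlockingQueue<Integer> available;

  WorkerPortPool(List<Integer> ports) {
    this.available = new LinkedBlockingQueue<>(ports);
  }

  // Waits up to the timeout for a free port; fails loudly rather than double-allocating.
  int acquire(long timeout, TimeUnit unit) throws InterruptedException {
    Integer port = available.poll(timeout, unit);
    if (port == null) {
      throw new IllegalStateException("no free worker port within timeout");
    }
    return port;
  }

  // Returns a port to the pool once the job's pod is done with it.
  void release(int port) {
    available.offer(port);
  }

  int freeCount() {
    return available.size();
  }
}
```

A queue keeps allocation thread-safe without explicit locking, and the timeout bounds how long a job waits when every port on the node is in use.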

davinchia and others added 3 commits June 9, 2021 13:32
* Port over the basic changes.

* Add logic to return proper exit code in the event of termination. Add comments to explain why.
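The commit message does not say which exit code is returned on termination. A minimal sketch, assuming the rule is that a process terminated before it reports its own exit code must never look successful. `TerminationAwareExit` and the value 143 (the conventional shell code for 128 + SIGTERM) are assumptions for illustration only.

```java
// Hypothetical sketch: exit-code bookkeeping for a process that may be
// terminated (e.g. its pod deleted) before it reports its own exit code.
final class TerminationAwareExit {
  // Assumption: 143 = 128 + SIGTERM(15), the code a shell reports for a SIGTERM'd child.
  static final int TERMINATED_EXIT_CODE = 143;

  private volatile boolean terminated = false;
  private volatile Integer reportedExitCode = null; // null until the process reports one

  void markTerminated() {
    terminated = true;
  }

  void reportExitCode(int code) {
    reportedExitCode = code;
  }

  // Mirrors java.lang.Process#exitValue: throws if the process has not exited yet.
  int exitValue() {
    Integer code = reportedExitCode;
    if (code != null) {
      return code; // a real exit code always wins
    }
    if (terminated) {
      return TERMINATED_EXIT_CODE; // never report success for a killed process
    }
    throw new IllegalThreadStateException("process has not exited");
  }
}
```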
@jrhizor marked this pull request as ready for review June 10, 2021 01:11
@jrhizor merged commit b04c080 into master Jun 10, 2021
@jrhizor deleted the davinchia/kube-queueing-poc branch June 10, 2021 01:12