
Configure kube pod process per job type #10200

Merged
merged 2 commits from parker/check-connection-speedup on Feb 15, 2022

Conversation

pmossman
Contributor

@pmossman pmossman commented Feb 9, 2022

What

Issue: #10208

Rather than defining a single WorkerConfigs and KubePodProcess to serve every job type, we want job-type-specific configurations so that we can, for example, run Check job pods in a node pool that is tailored to the needs of that particular job type.

How

Before this PR, the WorkerApp instantiated a single WorkerConfigs instance and a single ProcessFactory instance, which were shared across every type of job.

Now, the WorkerApp instantiates one WorkerConfigs for each job type, and one ProcessFactory for each job type.

For now, only the Check job has new behavior (it will pull node selectors from a new environment variable, CHECK_JOB_KUBE_NODE_SELECTORS, and will use a status check interval of 1 second instead of 30). But the scaffolding is now in place to make further job-type-specific customization easier going forward.

@github-actions github-actions bot added the area/platform (issues related to the platform) and area/worker (related to worker) labels Feb 9, 2022
Contributor

@cgardens cgardens left a comment


This looks like a good approach to me.

In terms of going forward with the env variables I think we should do one of the following:

  1. Do what you did for check for discover and spec as well.
  2. Instead of calling it check selectors, call it quick jobs selectors or something.

#2 is a bet that we want to just use the same node pool for check, discover, and spec, AKA all the jobs that return quickly. This seems plausible to me. That said, if we are wrong, it'll be a breaking change to get out of (or at least awkward to get out of). #1 will be a bit more verbose, though.

Only commenting since this is a WIP.

Contributor

@davinchia davinchia left a comment


Thanks for the tag - direction looks good to me.

I'll take a closer look when the PR is ready to be reviewed.

@pmossman pmossman temporarily deployed to more-secrets February 10, 2022 01:55 Inactive
@pmossman
Contributor Author

@davinchia @cgardens made some updates to this today and added a few tests. I tried to figure out a good way to test this more thoroughly via KubePodProcessIntegrationTest.java, but since that test actually runs against a local kube cluster, I couldn't set job-specific node pools, as they would prevent the pod from ever starting. If either of you have suggestions for testing this more thoroughly, please let me know!

@pmossman
Contributor Author

pmossman commented Feb 10, 2022

@jrhizor this PR currently has merge conflicts with the resourceRequirement changes you made for replication orchestrator pods. I think your approach conflicts a little bit with the approach I took in this PR, because you're adding additional information to WorkerConfigs, while my PR is instantiating a separate WorkerConfigs object for each pod type.

I'm not super opinionated here, but I think we should be consistent. So either I can refactor my PR to use a single WorkerConfigs with new fields like this:

class WorkerConfigs {
  ...
  private final Map<String, String> workerKubeNodeSelectors;
  private final Map<String, String> specKubeNodeSelectors;
  private final Map<String, String> checkKubeNodeSelectors;
  private final Map<String, String> discoverKubeNodeSelectors;
  ...
}

Or I can create a new WorkerConfigs specifically for the ReplicationWorker that sets its resourceRequirements to the new values (this would be more consistent with the approach in my PR right now). I think I slightly lean towards this approach just because it seems like WorkerConfigs can easily become a huge bloated class with custom configurations for every worker type.

Let me know if that makes sense; I'm also down to chat through it in person tomorrow if that's easier.

@cgardens
Contributor

I tried to figure out a good way to test this more thoroughly via KubePodProcessIntegrationTest.java, but since that test actually runs against a local kube cluster, I couldn't set job-specific node pools, as they would prevent the pod from ever starting.

Maybe dumb, but are you allowed to use wildcards with node selectors? If you can, then you could set the default node selector to something you know will fail, e.g. my-fake-node-pool, and set the job-type-specific selectors to *.

There may be a real way to do this; this is obviously a bit of a workaround.

Contributor

@cgardens cgardens left a comment


Makes sense to me. I left a suggestion that may or may not work for your integration test. @davinchia may have a better idea.

@cgardens
Contributor

(also you can remove [wip] from the title)

@jrhizor
Contributor

jrhizor commented Feb 10, 2022

Or I can create a new WorkerConfigs specifically for the ReplicationWorker that sets its resourceRequirements to the new values (this would be more consistent with the approach in my PR right now). I think I slightly lean towards this approach just because it seems like WorkerConfigs can easily become a huge bloated class with custom configurations for every worker type.

Agreed with your approach.

@pmossman pmossman changed the title [WIP] Configure kube pod process per job type Configure kube pod process per job type Feb 10, 2022
@davinchia
Contributor

davinchia commented Feb 10, 2022

w.r.t. testing, I would give a manual test in one of our dev envs a shot as a sanity check. I don't think there is a lot of value, relative to the complexity, in setting up an E2E node pool test, since what the node selector bits would mainly assert is mostly hitting the already well-tested Kube API.

The bit here with more custom logic is the env var injection, and it looks like there are tests to make sure the env vars are injected as expected - this seems fine to me.

@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 17856f0 to 9fe0d05 on February 10, 2022 23:20
@pmossman pmossman temporarily deployed to more-secrets February 10, 2022 23:22 Inactive
@pmossman pmossman temporarily deployed to more-secrets February 11, 2022 00:12 Inactive
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 9fe0d05 to 7a67942 on February 11, 2022 00:50
@pmossman pmossman temporarily deployed to more-secrets February 11, 2022 00:52 Inactive
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 7a67942 to 3b78abf on February 11, 2022 01:23
@pmossman pmossman temporarily deployed to more-secrets February 11, 2022 01:25 Inactive
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 3b78abf to 40855f2 on February 11, 2022 01:36
@pmossman pmossman temporarily deployed to more-secrets February 11, 2022 01:38 Inactive
@pmossman
Contributor Author

@jrhizor can you sanity check this PR for the places it touches Replication Orchestrator configs?

Key change is here: https://github.com/airbytehq/airbyte/pull/10200/files#diff-b332562b5e3b34bdc4e253e7186267ab8dbc13ec4364fd0d28e7fa418b5bfa6aR138-R154

@davinchia and @cgardens I will plan on releasing this with a new OSS version tomorrow, and will sanity check that new version on Cloud dev before doing the Cloud upgrade. Then once we have the new node pool with appropriate labels, we should be able to add CHECK_JOB_KUBE_NODE_SELECTORS here in order to get those Check pods moved over to the node pool.

Of course, let me know if anything there doesn't make sense or should be approached differently!

Copy link
Contributor

@davinchia davinchia left a comment


🚀

… job node selectors

* move status check interval to WorkerConfigs and customize for check worker

* add scaffolding for spec, discover, and sync configs

* optional orElse instead of orElseGet
@pmossman pmossman force-pushed the parker/check-connection-speedup branch from 40855f2 to f39cf5d on February 11, 2022 16:45
@pmossman pmossman temporarily deployed to more-secrets February 11, 2022 16:47 Inactive
@jrhizor
Contributor

jrhizor commented Feb 11, 2022

@pmossman lgtm

@pmossman
Contributor Author

QA Summary:

  • Opened Cloud PR to populate CHECK_JOB_KUBE_NODE_SELECTORS with new quick-jobs node pools that Davin recently added
  • Deployed that Cloud branch to dev-2
  • Built, tagged, and pushed OSS airbyte/worker image from this branch
  • Edited dev-2 airbyte-worker deployment to pull this tagged image
  • Created a new source on dev-2 via UI and described the check pod that was spawned:
❯ kl describe pod ource-pokeapi-sync-7e551fa5-ba30-4348-9fa5-8861ea56a5de-0-buiol -n jobs | grep -i selector

Node-Selectors:  cloud.google.com/gke-nodepool=dev-2-quick-jobs-pool

Also found the pod via GCP console on one of the new quick-jobs nodes: https://console.cloud.google.com/kubernetes/pod/us-west3/dev-2-gke-cluster/jobs/ource-pokeapi-sync-7e551fa5-ba30-4348-9fa5-8861ea56a5de-0-buiol/details?project=dev-2-ab-cloud-proj

So, I feel pretty comfortable that this can be merged. Cloud PR should go first, then this one. Will wait until Monday to avoid any issues over the weekend. @davinchia @cgardens for visibility
