🎉 Migrate OSS to temporal scheduler #12757

lmossman · 2022-05-11T01:07:01Z

What

See issue comment for context: #10021 (comment)
Resolves #10021

Migrates OSS to the new scheduler.

How

This PR sets the usesNewScheduler feature flag to true for everyone regardless of the environment variable value.
I also added logic that sets a database metadata flag to true after the migration is performed and skips the migration if the flag is already true, so that this server logic does not have to run on every startup. I borrowed this idea from Benoit's PR here.

I have tested this out locally and it seems to properly migrate my deployment from the old scheduler to the new one.
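
For readers following along, the run-once behavior described above can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `MetadataStore`, `migrateOnce`, and the `Consumer`-based migrator are hypothetical stand-ins, and only the `schedulerMigration` key is taken from the diff.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.function.Consumer;

// Hypothetical sketch: none of these types are Airbyte's real classes.
final class SchedulerMigrationSketch {

  // Stand-in for a key/value metadata table in the jobs database.
  static final class MetadataStore {
    private final Map<String, String> rows = new HashMap<>();

    Optional<String> get(final String key) {
      return Optional.ofNullable(rows.get(key));
    }

    void put(final String key, final String value) {
      rows.put(key, value);
    }
  }

  // Key name as it appears in the diff; how it is stored here is assumed.
  static final String MIGRATION_KEY = "schedulerMigration";

  // Runs the migration at most once; returns true if it ran on this call.
  static boolean migrateOnce(final MetadataStore metadata,
                             final List<UUID> connectionIds,
                             final Consumer<List<UUID>> migrator) {
    final boolean alreadyMigrated =
        metadata.get(MIGRATION_KEY).map(Boolean::parseBoolean).orElse(false);
    if (alreadyMigrated) {
      return false;
    }
    migrator.accept(connectionIds);
    // Record success only after the migration call returns without throwing.
    metadata.put(MIGRATION_KEY, "true");
    return true;
  }

  public static void main(final String[] args) {
    final MetadataStore metadata = new MetadataStore();
    final List<UUID> ids = List.of(UUID.randomUUID(), UUID.randomUUID());
    System.out.println(migrateOnce(metadata, ids, c -> {})); // first startup: true
    System.out.println(migrateOnce(metadata, ids, c -> {})); // later startup: false
  }
}
```

Writing the flag only after the migrator returns means a failed migration would be retried on the next startup, which seems like the safer default for a one-time migration.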

🚨 User Impact 🚨

The impact to the user is that they will be migrated to the new temporal-based scheduler upon upgrade, which should improve OSS Airbyte deployments' ability to handle a large number of connections.

A smooth migration requires users to spin down their existing deployment and spin it up on the upgraded version, as described in our docs here, which should prevent the old scheduler from running during the migration.
If any jobs were actively running at the time of the upgrade, then the next time those connections perform a sync, those zombie jobs will be marked as failed.

lmossman · 2022-05-11T01:13:24Z

airbyte-server/src/main/java/io/airbyte/server/ServerApp.java

@@ -233,6 +233,10 @@ public static ServerRunnable getServer(final ServerFactory apiFactory,
    final Flyway jobsFlyway = FlywayFactory.create(jobsDataSource, DbMigrationHandler.class.getSimpleName(), JobsDatabaseMigrator.DB_IDENTIFIER,
        JobsDatabaseMigrator.MIGRATION_FILE_LOCATION);

+    // It is important that the migration to the temporal scheduler is performed before the server accepts any requests.
+    // This is why this migration is performed here instead of in the bootloader - so that the server blocks on this.
+    migrateExistingConnectionsToTemporalScheduler(configRepository, jobPersistence, eventRunner);


Recall that, as I described in this comment, the logic to clean up non-terminal jobs is executed only just before the temporal workflow creates a new job.

One consequence of this is that if a user upgrades while a job is running, then that job will continue to have that RUNNING status until the next time that the connection is synced. So in the worst case, this means 24 hours could pass before that zombie job is marked as FAILED and a new job is created.

This isn't an ideal situation, but I think it is tolerable as a one-time occurrence for this migration.

Just to note, this was fixed in that other PR by also cleaning up jobs at the beginning of the temporal workflow, so this should no longer be an issue.
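
To illustrate the cleanup behavior discussed in this thread, here is a rough sketch of failing leftover non-terminal jobs before a new job is created. The `Job` and `JobStatus` types and both method names are hypothetical and simplified; the real persistence layer and workflow code are not shown in this PR excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Job and JobStatus are illustrative, not Airbyte's real classes.
final class ZombieJobCleanupSketch {

  enum JobStatus {
    PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED;

    boolean isTerminal() {
      return this == SUCCEEDED || this == FAILED || this == CANCELLED;
    }
  }

  static final class Job {
    final long id;
    JobStatus status;

    Job(final long id, final JobStatus status) {
      this.id = id;
      this.status = status;
    }
  }

  // Marks every non-terminal job as FAILED and returns the ids that were changed.
  static List<Long> failNonTerminalJobs(final List<Job> jobsForConnection) {
    final List<Long> failedIds = new ArrayList<>();
    for (final Job job : jobsForConnection) {
      if (!job.status.isTerminal()) {
        job.status = JobStatus.FAILED;
        failedIds.add(job.id);
      }
    }
    return failedIds;
  }

  // Cleans up leftovers first, then creates the new job for this sync.
  static Job startNewJob(final List<Job> jobsForConnection, final long newJobId) {
    failNonTerminalJobs(jobsForConnection);
    final Job job = new Job(newJobId, JobStatus.RUNNING);
    jobsForConnection.add(job);
    return job;
  }

  public static void main(final String[] args) {
    final List<Job> jobs = new ArrayList<>();
    jobs.add(new Job(1L, JobStatus.SUCCEEDED));
    jobs.add(new Job(2L, JobStatus.RUNNING)); // left over from before the upgrade
    startNewJob(jobs, 3L);
    for (final Job job : jobs) {
      System.out.println(job.id + " " + job.status);
    }
  }
}
```

In this sketch the zombie job (id 2) only becomes FAILED when `startNewJob` is called, which mirrors why a stale RUNNING status could linger until the next sync unless cleanup also runs when the workflow starts.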

jdpgrailsdev

Follow up question: is there a plan/scheduled work to come back and a) remove the environment variable that represented the feature flag and b) remove/clean up the code so that there is no mention of "new" scheduler, now that there is only one?

jdpgrailsdev · 2022-05-11T13:15:00Z

@lmossman Looks like some files in the PR need formatting.

cgardens · 2022-05-11T16:24:09Z

...eduler/persistence/src/main/java/io/airbyte/scheduler/persistence/DefaultJobPersistence.java

+  private final String SCHEDULER_MIGRATION_STATUS = "schedulerMigration";
+
+  @Override
+  public boolean isSchedulerMigrated() throws IOException {


Is this something that we can remove the next time we do a major version bump? I don't think we should do a major bump just for this, but if we can remove it when we do one, it would be good to make sure we track or label it somehow.

Yeah, we should be able to remove this on the next major version bump. I will add a comment in the code and also make a ticket for tracking.

@lmossman Do you have a ticket for the tracking? I have some items I want to add to the list of things we want to do for the next major bump.

Yes here is the ticket: #12823

benmoriceau · 2022-05-11T16:57:01Z

airbyte-server/src/main/java/io/airbyte/server/ServerApp.java

+      LOGGER.info("Migration to temporal scheduler has already been performed");
+      return;
+    }
+
    LOGGER.info("Start migration to the new scheduler...");
    final Set<UUID> connectionIds =
        configRepository.listStandardSyncs().stream()
            .filter(standardSync -> standardSync.getStatus() == Status.ACTIVE || standardSync.getStatus() == Status.INACTIVE)
            .map(standardSync -> standardSync.getConnectionId()).collect(Collectors.toSet());
    eventRunner.migrateSyncIfNeeded(connectionIds);


Should we add a test that checks that we don't run anything if jobPersistence returns true for isSchedulerMigrated?

Yeah I can add that test

I have added tests for both cases
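
As a rough picture of what covering "both cases" could look like, the sketch below uses hand-written fakes instead of a mocking library. The interfaces are simplified stand-ins: `isSchedulerMigrated` and `migrateSyncIfNeeded` are named after the methods in the diff, while `setSchedulerMigrationDone`, `FakePersistence`, `RecordingRunner`, and `migrate` are hypothetical names introduced only for this example. The actual tests in the PR may be structured differently.

```java
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch: simplified stand-ins, not the PR's actual test code.
final class MigrationGuardTestSketch {

  interface JobPersistence {
    boolean isSchedulerMigrated();

    void setSchedulerMigrationDone(); // hypothetical setter name
  }

  interface EventRunner {
    void migrateSyncIfNeeded(Set<UUID> connectionIds);
  }

  // Fake persistence that keeps the flag in memory.
  static final class FakePersistence implements JobPersistence {
    boolean migrated;

    FakePersistence(final boolean migrated) {
      this.migrated = migrated;
    }

    @Override
    public boolean isSchedulerMigrated() {
      return migrated;
    }

    @Override
    public void setSchedulerMigrationDone() {
      migrated = true;
    }
  }

  // Fake runner that only counts how often it was invoked.
  static final class RecordingRunner implements EventRunner {
    int calls = 0;

    @Override
    public void migrateSyncIfNeeded(final Set<UUID> connectionIds) {
      calls++;
    }
  }

  // Logic under test: skip when already migrated, otherwise migrate and set the flag.
  static void migrate(final JobPersistence persistence, final EventRunner runner, final Set<UUID> ids) {
    if (persistence.isSchedulerMigrated()) {
      return;
    }
    runner.migrateSyncIfNeeded(ids);
    persistence.setSchedulerMigrationDone();
  }

  static void check(final boolean condition, final String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(final String[] args) {
    final Set<UUID> ids = Set.of(UUID.randomUUID());

    // Case 1: already migrated, so the runner must not be called.
    final RecordingRunner skipped = new RecordingRunner();
    migrate(new FakePersistence(true), skipped, ids);
    check(skipped.calls == 0, "runner should not be called when already migrated");

    // Case 2: not yet migrated, so the runner is called once and the flag is set.
    final FakePersistence fresh = new FakePersistence(false);
    final RecordingRunner ran = new RecordingRunner();
    migrate(fresh, ran, ids);
    check(ran.calls == 1, "runner should be called exactly once");
    check(fresh.migrated, "flag should be set after migration");

    System.out.println("both cases pass");
  }
}
```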

lmossman · 2022-05-12T21:06:39Z

Follow up question: is there a plan/scheduled work to come back and a) remove the environment variable that represented the feature flag and b) remove/clean up the code so that there is no mention of "new" scheduler, now that there is only one?

@jdpgrailsdev Yes, this work is planned; we briefly discussed it this morning during sprint planning. This is the ticket: #8445 (it still needs to be fleshed out, but it is basically what you described).

lmossman · 2022-05-13T00:56:23Z

I merged the base branch of this PR into master without realizing that doing so would auto-close this PR. Reopening this and rebasing it back on master.

marcosmarxm · 2022-05-19T16:57:52Z

@lmossman Is there any impact for users? What improvements can OSS users expect with the new scheduler? Asking because this could be a good feature release.
@Amruta-Ranade fyi.

Amruta-Ranade · 2022-05-19T17:01:27Z

If it is a user-facing feature, can you add the 🎉 emoji in the PR title so I will know to add it to the changelog?

lmossman · 2022-05-19T17:39:57Z

@marcosmarxm @Amruta-Ranade I described the user impact in the PR description. The migration to the temporal scheduler happens without any user action, so users won't necessarily see any change in the user interface. But now that scheduling is handled by temporal, OSS deployments should be able to handle a larger number of active connections.

And for OSS deployments on Kubernetes, after upgrading to the next version, which will include this change, users should be able to do high-availability upgrades of Airbyte without causing job failures. I.e., in future upgrades, kube users can just run kubectl apply -k kube/overlays/stable without first needing to scale down Airbyte.

I updated the PR title with the correct emoji, but that won't change the commit history on master since this PR was already merged in.

* Migrate OSS to temporal scheduler
* add comment about migration being performed in server
* add comments about removing migration logic
* formatting and add tests for migration logic
* rm duplicated test
* remove more duplicated build task
* remove retry
* disable acceptance tests that call temporal directly when on kube
* set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent
* set default value of container orchestrator enabled to true
* Revert "set default value of container orchestrator enabled to true" This reverts commit 21b3670.
* Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent" This reverts commit 6dd2ec0.
* Revert "Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent"" This reverts commit 2f40f9d.
* Revert "Revert "set default value of container orchestrator enabled to true"" This reverts commit 26068d5.
* fix sync workflow test
* remove defunct cancellation tests due to internal temporal error
* format - remove unused imports
* revert changes that set container orchestrator enabled to true everywhere
* remove NEW_SCHEDULER feature flag from .env files, and set CONTAINER_ORCHESTRATOR_ENABLED flag to true for kube .env files

Co-authored-by: Benoit Moriceau <benoit@airbyte.io>

github-actions bot added area/platform issues related to the platform area/scheduler area/server labels May 11, 2022

lmossman temporarily deployed to more-secrets May 11, 2022 01:11 Inactive

lmossman commented May 11, 2022

lmossman requested review from benmoriceau, jdpgrailsdev and cgardens May 11, 2022 01:13

jdpgrailsdev approved these changes May 11, 2022

cgardens reviewed May 11, 2022

benmoriceau reviewed May 11, 2022

lmossman mentioned this pull request May 12, 2022

Remove scheduler migration logic on next major version bump #12823

Closed

lmossman temporarily deployed to more-secrets May 12, 2022 22:19 Inactive

Base automatically changed from lmossman/start-temporal-from-clean-state to lmossman/repair-unexpected-temporal-state May 13, 2022 00:42

michel-tricot closed this May 13, 2022

lmossman reopened this May 13, 2022

lmossman changed the base branch from lmossman/repair-unexpected-temporal-state to master May 13, 2022 00:56

github-actions bot added area/api Related to the api area/worker Related to worker labels May 13, 2022

lmossman added 3 commits May 12, 2022 18:00

Migrate OSS to temporal scheduler

cc54c7e

add comment about migration being performed in server

8f350ea

add comments about removing migration logic

e452515

lmossman force-pushed the lmossman/migrate-oss-to-temporal-scheduler branch from 2ffe5f8 to e452515 Compare May 13, 2022 01:00

github-actions bot removed area/worker Related to worker area/api Related to the api labels May 13, 2022

lmossman added 2 commits May 18, 2022 09:00

set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true…

6dd2ec0

… to be consistent

set default value of container orchestrator enabled to true

21b3670

lmossman temporarily deployed to more-secrets May 18, 2022 16:04 Inactive

lmossman added 4 commits May 18, 2022 09:32

Revert "set default value of container orchestrator enabled to true"

26068d5

This reverts commit 21b3670.

Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars…

2f40f9d

… to true to be consistent" This reverts commit 6dd2ec0.

Revert "Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED …

4299db6

…env vars to true to be consistent"" This reverts commit 2f40f9d.

Revert "Revert "set default value of container orchestrator enabled t…

fa08b93

…o true"" This reverts commit 26068d5.

lmossman temporarily deployed to more-secrets May 18, 2022 16:35 Inactive

fix sync workflow test

a5248fb

github-actions bot added the area/worker Related to worker label May 18, 2022

lmossman temporarily deployed to more-secrets May 18, 2022 18:28 Inactive

remove defunct cancellation tests due to internal temporal error

155715f

lmossman temporarily deployed to more-secrets May 18, 2022 20:40 Inactive

format - remove unused imports

c66f45f

lmossman temporarily deployed to more-secrets May 18, 2022 20:59 Inactive

lmossman added 2 commits May 18, 2022 15:52

revert changes that set container orchestrator enabled to true everyw…

f3c2d1d

…here

remove NEW_SCHEDULER feature flag from .env files, and set CONTAINER_…

b0b9b0e

…ORCHESTRATOR_ENABLED flag to true for kube .env files

github-actions bot removed the area/worker Related to worker label May 18, 2022

lmossman temporarily deployed to more-secrets May 18, 2022 22:56 Inactive

lmossman changed the title ~~Migrate OSS to temporal scheduler~~ ✨ Migrate OSS to temporal scheduler May 19, 2022

lmossman merged commit 26ed385 into master May 19, 2022

lmossman deleted the lmossman/migrate-oss-to-temporal-scheduler branch May 19, 2022 00:05

lmossman changed the title ~~✨ Migrate OSS to temporal scheduler~~ 🎉 Migrate OSS to temporal scheduler May 19, 2022

octavia-squidington-iii mentioned this pull request May 20, 2022

Bump Airbyte version from 0.38.4-alpha to 0.39.0-alpha #13065

Merged

lmossman mentioned this pull request May 25, 2022

Track record schema validation errors in Datadog #13114

Closed
