k8s runtime: force deletion to avoid hung function worker during connector restart #12504

Merged: 4 commits, Nov 6, 2021

@dlg99 dlg99 commented Oct 26, 2021

Motivation

Restarting a connector via pulsar-admin source restart (first seen with the Debezium Postgres source, but reproduced with another connector too) failed, and the function worker became unresponsive, repeatedly logging:

19:43:58.139 [function-web-27-6] ERROR org.apache.pulsar.functions.worker.rest.api.ComponentImpl - Failed to restart Source: public/default/cassandra-source-ks1-table1
org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.client.admin.internal.SourcesImpl.restartSource(SourcesImpl.java:474) ~[com.datastax.oss-pulsar-client-admin-original-2.7.2.1.1.8-SNAPSHOT.jar:2.7.2.1.1.8-SNAPSHOT]
        at org.apache.pulsar.functions.worker.FunctionRuntimeManager.restartFunctionInstances(FunctionRuntimeManager.java:427) ~[com.datastax.oss-pulsar-functions-worker-2.7.2.1.1.8-SNAPSHOT.jar:2.7.2.1.1.8-SNAPSHOT]
        at org.apache.pulsar.functions.worker.rest.api.ComponentImpl.restartFunctionInstances(ComponentImpl.java:699) [com.datastax.oss-pulsar-functions-worker-2.7.2.1.1.8-SNAPSHOT.jar:2.7.2.1.1.8-SNAPSHOT]
        at org.apache.pulsar.functions.worker.rest.api.v3.SourcesApiV3Resource.restartSource(SourcesApiV3Resource.java:187) [com.datastax.oss-pulsar-functions-worker-2.7.2.1.1.8-SNAPSHOT.jar:2.7.2.1.1.8-SNAPSHOT]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
...
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
        at org.apache.pulsar.client.admin.internal.SourcesImpl.restartSource(SourcesImpl.java:467) ~[com.datastax.oss-pulsar-client-admin-original-2.7.2.1.1.8-SNAPSHOT.jar:2.7.2.1.1.8-SNAPSHOT]

Modifications

The root cause was traced to the k8s client call timing out.
It looks like V1DeleteOptions weren't passed to the corresponding delete calls, and the Foreground propagation policy was not passed properly, as far as I can tell from the k8s-client GitHub issues.
I also moved the grace period for deleteNamespacedStatefulSetCall into the config.
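
As a rough illustration of the change described above, the sketch below builds the DeleteOptions request body that a delete call would carry, with an explicit grace period and the Foreground propagation policy. It is plain Java with no Kubernetes client dependency. The class and method names (DeleteOptionsSketch, deleteOptions, toJson) are hypothetical and not taken from this PR; only the field names gracePeriodSeconds and propagationPolicy come from the Kubernetes DeleteOptions API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: models the DeleteOptions body, not the PR's actual code.
public class DeleteOptionsSketch {

    // Collects the DeleteOptions fields. The grace period is a parameter
    // (assumed to come from configuration) instead of a hardcoded constant.
    static Map<String, Object> deleteOptions(long gracePeriodSeconds, String propagationPolicy) {
        if (gracePeriodSeconds < 0) {
            throw new IllegalArgumentException("gracePeriodSeconds must be >= 0");
        }
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("apiVersion", "v1");
        body.put("kind", "DeleteOptions");
        body.put("gracePeriodSeconds", gracePeriodSeconds);
        body.put("propagationPolicy", propagationPolicy);
        return body;
    }

    // Minimal JSON rendering for flat maps of numbers and simple strings.
    // It does not escape special characters, so it is for illustration only.
    static String toJson(Map<String, Object> body) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> entry : body.entrySet()) {
            Object value = entry.getValue();
            String rendered = (value instanceof Number) ? value.toString() : "\"" + value + "\"";
            json.add("\"" + entry.getKey() + "\":" + rendered);
        }
        return json.toString();
    }

    public static void main(String[] args) {
        // A grace period of 0 requests immediate (forced) deletion.
        System.out.println(toJson(deleteOptions(0L, "Foreground")));
    }
}
```

In the real runtime the equivalent values would presumably be set on the client's V1DeleteOptions object (for example via setGracePeriodSeconds, as in the review snippet in this thread) and passed to each delete call, with the grace period read from the new config parameter.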

Verifying this change

Tested in a deployed environment.
I don't know how to unit test this.

Does this pull request potentially affect one of the following parts:

No, as far as I know.
A new config parameter is added; it keeps the same value as the hardcoded one it replaces.

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

  • doc

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 26, 2021
@dlg99 dlg99 marked this pull request as draft October 27, 2021 00:16
// https://amalgjose.com/2021/07/28/how-to-delete-a-kubernetes-pod-which-is-stuck-in-terminating-state/
// https://www.ibm.com/support/pages/kubernetes-pods-are-stuck-terminating-state
// https://github.com/kubernetes-client/java/issues/770
options.setGracePeriodSeconds(0L);

@dlg99 (Contributor Author) commented:

@jerrypeng fixed.

@github-actions github-actions bot added the doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. label Oct 27, 2021
@dlg99 dlg99 marked this pull request as ready for review October 27, 2021 04:41
@eolivelli eolivelli left a comment

Lgtm

site2/docs/functions-runtime.md: two outdated review threads (resolved)
@Anonymitaet Anonymitaet removed the doc-not-needed Your PR changes do not impact docs label Oct 28, 2021
Co-authored-by: Anonymitaet <50226895+Anonymitaet@users.noreply.github.com>

dlg99 commented Oct 28, 2021

@jerrypeng @Anonymitaet Please take another look; I addressed your suggestions.

@Anonymitaet (Member) commented:

@dlg99 LGTM from a tech writing perspective.

@eolivelli eolivelli merged commit a3f6aba into apache:master Nov 6, 2021
zeo1995 pushed a commit to zeo1995/pulsar that referenced this pull request Nov 7, 2021
* up/master: (55 commits)
  [broker] remove useless method "PersistentTopic#getPersistentTopic" (apache#12655)
  [Python Schema] Python schema support custom Avro configurations for Enum type (apache#12642)
  Allow to configure different implementations for Pulsar functions state store (apache#12646)
  Remove replicator global test from the quarantine group (apache#12648)
  [Java Client] Remove invalid call to Thread.currentThread().interrupt(); (apache#12652)
  k8s runtime: force deletion to avoid hung function worker during connector restart (apache#12504)
  [Broker] Optimize exception information for schemas (apache#12647)
  Close Zk database on unit tests (apache#12649)
  Fix call sync method in an async callback when enabling geo replicator. (apache#12590)
  [pulsar-broker] Add git branch information for PulsarVersion (apache#12541)
  PulsarAdmin: Fix last exit code storage (apache#12581)
  Add @test annotation to test methods (apache#12640)
  Upgrade debezium to 1.7.1 (apache#12644)
  [ML] Avoid passing OpAddEntry across a thread boundary in asyncAddEntry (apache#12606)
  [Functions] Prevent NPE while stopping a non started Pulsar LogAppender (apache#12643)
  Update io-debezium-source.md (apache#12638)
  Add missing cmds on pulsar-admin document page (apache#12634)
  Clean up the metadata of the non-persistent partitioned topics. (apache#12550)
  modify check waitingForPingResponse with volatile (apache#12615)
  [pulsar-admin] Check backlog quota policy for namespace (apache#12512)
  ...
eolivelli pushed a commit that referenced this pull request Nov 9, 2021
@eolivelli eolivelli modified the milestones: 2.10.0, 2.9.0 Nov 9, 2021
@dlg99 dlg99 deleted the k8s_connector_restart branch November 9, 2021 22:20
codelipenghui pushed a commit that referenced this pull request Nov 18, 2021
@codelipenghui codelipenghui added release/2.8.2 cherry-picked/branch-2.8 Archived: 2.8 is end of life and removed release/2.8.3 labels Nov 18, 2021
eolivelli pushed a commit to eolivelli/pulsar that referenced this pull request Nov 29, 2021
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Dec 1, 2021
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Dec 3, 2021
…ector restart (apache#12504)

(cherry picked from commit a3f6aba)
(cherry picked from commit 82c01bc)
momo-jun added a commit to momo-jun/pulsar that referenced this pull request Aug 4, 2022
Labels
area/function, cherry-picked/branch-2.8, doc, release/2.8.2