[BUG] flytepropeller fails trying to get pod resource using the kubeClient #4730

Closed · 2 tasks done
andresgomezfrr opened this issue Jan 16, 2024 · 1 comment · Fixed by #4733 or #4752
Labels: backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working), exo

Comments

@andresgomezfrr (Contributor)

Describe the bug

Yesterday we upgraded the propeller to version 1.10.6, and today we noticed a lot of pods stuck in the Terminating state in our cluster. After some investigation we found that this warning is logged many times:

logger.Warningf(ctx, "Failed to find the Resource with name: %v. Error: %v", nsName, err)

Before the upgrade we did not see any logs like this.

We deleted all the stuck pods manually and downgraded the version; the log messages disappeared, and pods are now deleted again without issues.
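
For reference, the manual cleanup amounts to removing the lingering finalizers from the Terminating pods so the API server can finish deleting them. A rough client-go sketch of that step (illustrative only; the function name and namespace handling below are not the exact commands we ran):

```go
// Illustrative sketch: clear finalizers from pods stuck in Terminating so the
// API server can finish deleting them. Assumes an already-constructed
// kubernetes.Interface; names here are illustrative, not flytepropeller code.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func clearStuckFinalizers(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		// A pod stuck in Terminating has its DeletionTimestamp set but is never
		// removed because its finalizers are never cleared.
		if p.DeletionTimestamp == nil || len(p.Finalizers) == 0 {
			continue
		}
		patch := []byte(`{"metadata":{"finalizers":null}}`)
		if _, err := cs.CoreV1().Pods(namespace).Patch(ctx, p.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("patching pod %s/%s: %w", namespace, p.Name, err)
		}
	}
	return nil
}
```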

Expected behavior

The pods should be deleted properly and not get stuck in the Terminating state.

Additional context to reproduce

Upgrade to the latest version of the propeller and execute pods.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
andresgomezfrr added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Jan 16, 2024

dosubot bot commented Jan 16, 2024

🤖

Hello @andresgomezfrr! I'm Dosu, a friendly bot here to assist you with your Flyte issues and queries. I'm here to help you while we wait for a human maintainer to join us. Don't worry, together we'll figure this out!

Thank you for reporting this issue. It seems like you're experiencing a problem similar to a previously resolved issue in the Flyte repository: [BUG] Pods stuck on Terminating with finalizer. This issue was resolved by a pull request: Abort subworkflow on subnode failure.

However, since you're using a newer version of the propeller (1.10.6), it's possible that the problem has resurfaced due to changes made in this version. Here are some changes in propeller 1.10.6 that could potentially cause pods to get stuck in the Terminating state:

  1. "Moved controller-runtime start out of webhook Run function" - PR#546
  2. "Fixing recovering of SKIPPED nodes" - PR#551
  3. "Remove resource injection on the node for container task" - PR#544
  4. "Remove BarrierTick" - PR#545
  5. "Check for TerminateExecution error and eat Precondition status" - PR#553
  6. "Setting primaryContainerName by default on Pod plugin" - PR#555
  7. "Fast fail if task resource requests exceed k8s resource limits" - PR#488

These are potential causes and further investigation would be needed to confirm the exact cause.

As for the message you're seeing in the logs, it's a generic warning emitted when the propeller fails to get the pod resource using the kubeClient. The actual failure detail is carried by the "err" variable in the Go code, and its value is not visible in the code line you quoted.
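
To make that concrete, here is a minimal sketch (not flytepropeller's actual code) of the kind of lookup that produces such a warning, using a controller-runtime client; the interesting information is whatever err turns out to be:

```go
// Minimal sketch (not flytepropeller's actual code) of a pod lookup via a
// controller-runtime client. The generic warning hides the real cause, which
// lives entirely in err.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func getPod(ctx context.Context, kubeClient client.Client, nsName types.NamespacedName) (*corev1.Pod, error) {
	pod := &corev1.Pod{}
	if err := kubeClient.Get(ctx, nsName, pod); err != nil {
		// This is the point where a warning like
		// "Failed to find the Resource with name: %v. Error: %v" would be logged.
		return nil, fmt.Errorf("failed to get pod %v: %w", nsName, err)
	}
	return pod, nil
}
```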

To help us investigate this issue further, could you please provide more details about your environment? Specifically, it would be helpful to know:

  • The Kubernetes version you're using
  • The configuration of your Flyte deployment
  • Any customizations you've made to the default Flyte configuration
  • The exact error message(s) you're seeing in the logs, including the values of the "err" variable and other context-specific variables

This information will help us understand the context in which the issue is occurring and potentially identify the cause. Thank you for your cooperation.
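
If it helps with the last point, here is a small sketch of how the underlying error could be classified before reporting it back, assuming the warning wraps a standard Kubernetes API error (the helper below is illustrative, not flytepropeller code):

```go
// Illustrative helper for classifying a Kubernetes client error before
// reporting it; assumes err comes from a standard API call.
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func describeGetError(err error) string {
	switch {
	case apierrors.IsNotFound(err):
		return fmt.Sprintf("resource genuinely missing: %v", err)
	case apierrors.IsForbidden(err):
		return fmt.Sprintf("RBAC/permission problem: %v", err)
	case apierrors.IsTimeout(err) || apierrors.IsServerTimeout(err):
		return fmt.Sprintf("API server timeout: %v", err)
	default:
		return fmt.Sprintf("other client error: %v", err)
	}
}
```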

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

hamersaw added the exo and backlogged (For internal use. Reserved for contributor team workflow.) labels and removed the untriaged (This issue has not yet been looked at by the Maintainers) label on Jan 16, 2024