Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Flytepropeller error and crash when "limit-namespace" is set #5087

Closed
2 tasks done
timheb opened this issue Mar 21, 2024 · 7 comments · Fixed by #5238
Closed
2 tasks done

[BUG] Flytepropeller error and crash when "limit-namespace" is set #5087

timheb opened this issue Mar 21, 2024 · 7 comments · Fixed by #5238
Assignees
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working exo

Comments

@timheb
Copy link

timheb commented Mar 21, 2024

Describe the bug

Flytepropeller error and crash of the pod (version 1.11.0, same for version 1.10.6/1.10.7)

I0321 12:57:51.491581       1 leaderelection.go:260] successfully acquired lease peach2-flyte/propeller-leader
{"json":{},"level":"panic","msg":"failed to load plugin - container: [PluginInitializationFailed] wrong type. Actual: *cache.multiNamespaceInformer","ts":"2024-03-21T12:57:51Z"}
panic: (*logrus.Entry) 0xc000703880

goroutine 366 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc000703810, 0x0, {0xc00013fd00, 0x71})
	/go/pkg/mod/github.com/sirupsen/logrus@v1.9.3/entry.go:260 +0x491
github.com/sirupsen/logrus.(*Entry).Log(0xc000703810, 0x0, {0xc000f813d0?, 0x4521720?, 0xc000f813d0?})
	/go/pkg/mod/github.com/sirupsen/logrus@v1.9.3/entry.go:304 +0x48
github.com/sirupsen/logrus.(*Entry).Panic(0x2f37ff0?, {0xc000f813d0?, 0x23ac240?, 0x1?})
	/go/pkg/mod/github.com/sirupsen/logrus@v1.9.3/entry.go:342 +0x25
github.com/flyteorg/flyte/flytestdlib/logger.Panic({0x2f37ff0?, 0xc0009fc5f0?}, {0xc000f813d0, 0x1, 0x1})
	/go/src/github.com/flyteorg/flytestdlib/logger/logger.go:144 +0x47
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*Controller).onStartedLeading.func1()
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/controller.go:132 +0x99
created by github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*Controller).onStartedLeading in goroutine 395
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/controller.go:130 +0xc5

Expected behavior

no crash

Additional context to reproduce

  1. Version 1.9.0 not crash
  2. Version 1.10.0 not crash
  3. Version 1.10.6 crash
  4. Version 1.10.7 crash
  5. Version 1.11.0 crash

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@timheb timheb added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Mar 21, 2024

This comment was marked as off-topic.

@eapolinario eapolinario added backlogged For internal use. Reserved for contributor team workflow. exo and removed untriaged This issues has not yet been looked at by the Maintainers labels Mar 21, 2024
@ByronHsu ByronHsu self-assigned this Mar 27, 2024
@ByronHsu
Copy link
Contributor

I also run into this issue. Will help taking a look

@ByronHsu
Copy link
Contributor

Root cause analysis

  1. New Cache creates a cache (by controller-runtime)
  2. Inside Cache, it creates multinamespace cache if options.DefaultNamespace exists. Otherwise, create an all namespace cache.
  3. In the plugin manager, it tried to get a shared informer. However, it enforces the informer type to be cache.SharedIndexInformer (go-client) but the type is cache.multiNamespaceInformer (controller-runtime) if options.DefaultNamespace exists.
  4. Technically, limit-namespace should be a “list” since the task can be created in multiple namespaces

I haven't found a good solution yet. Flyte mixes the usage of controller-runtime cache and client-go cache, so what returned from Cache.New might not be compatible with returned type (cache.SharedIndexInformer)

@MortalHappiness
Copy link
Contributor

#take

MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue Mar 28, 2024
Resolves: flyteorg#5087
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
@austin362667 austin362667 removed their assignment Mar 28, 2024
@wild-endeavor
Copy link
Contributor

wild-endeavor commented Mar 28, 2024

It looks like we upgraded from 0.12.1 of controller-runtime to 0.16.2 when we added support for open telemetry. That brought in a whole lot of changes one of which was the change you (line 287 of cache.go which is hidden) mentioned @ByronHsu. How did you find this? Tracing through, I couldn't find my way to that cache.go function.

I don't think the current proposal quite works though. We have to do this through the informer. it's way too expensive to go through the client every time (unless the newer client is somehow already caching things?).

@hamersaw - could you take a look at this when you get a chance please? i haven't worked with controller runtime enough.

Thanks again for the investigation byron.

@ByronHsu
Copy link
Contributor

Thanks @wild-endeavor! bullet 1 is in flyte code, but bullet 2 (cache.go) is in controller-runtime code.

@MortalHappiness
Copy link
Contributor

@wild-endeavor
I think the controller-runtime client internally uses the informer cache because it does not provide an interface to access the store from the informer. It efficiently retrieves resources from the cache without the need for direct store access. But I haven't deeply investigated the codebase of controller-runtime.

MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue Mar 29, 2024
Resolves: flyteorg#5087
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue Apr 4, 2024
Resolves: flyteorg#5087
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue Apr 4, 2024
Update unittests corresponding to the changes from SharedIndexInformer to Informer

Resolves: flyteorg#5087
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working exo
Projects
None yet
6 participants