-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement repo version 4 with migration #4030
Conversation
- fixes #4024 - this ensures that each time a compute node is started it attempts to register itself wit the requester. This is imporatant since in the event a requester loses it state compute nodes will re-register themselves with it. If they have already registered with a requester node registering again idempotent. - pairing this with the parent commit regarding the V3 Migration is required.
Important Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
if m.registrationFile.Exists() { | ||
log.Ctx(ctx).Debug().Msg("not registering with requester, already registered") | ||
return nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we removing the registration state? This means the compute node will have to re-register every time at startup and the node will fail to start if it is offline or disconnected from the requester.
Overall, I think we need to do better for node registration and allow the compute node to periodically re-check with the requester if the registration info match, and re-register if needed. So not only during startup, and should be best effort if the compute node is offline. Meaning the compute node should continue to run the locally allocated jobs even if it fails to register or re-register. I have a NodeManagerV2 placeholder for 1.5 that should track and fix most issues related to node management.
Since we don't have full offline capabilities today, it might make sense to force compute nodes to re-register at startup if we know what we are doing, and we can fix things with NodeManagerV2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we removing the registration state?
Its a requirement for the migration added here to work since we are deleting the requester nodes nodestore. Removing this lock ensures each time a compute node is started it attempts to register itself wit the requester. This is important since in the event a requester loses it nodestate store (i.e. the migration) compute nodes will re-register themselves with it. If they have already registered with a requester node registering again idempotent.
the node will fail to start if it is offline or disconnected from the requester
True, although the compute nodes will also fail to start if starting fresh and they are unable to connect to the requester. I don't think a failure here should result in an aborted start, rather the node should continue attempting to register itself until successful (e.g. loop until success)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I understand. As I mentioned it is fine to remove the registration state for now
if err != nil { | ||
return err | ||
} | ||
return os.RemoveAll(resolvedCfg.Node.Network.StoreDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually Node.Network.StoreDir
gives the parent
~/.bacalhau/orchestrator_store
which holds both job store and nats store. I don't recall the motivation for doing it this way.
I am also concerned that we delete something not intended if the user misconfigured this path to be something outside of the bacalhau dir and something we shouldn't delete.
If we were to go with the repo migration route instead of migrating entries, then we need to validate this path , and only delete specific sub-directories that are related to node kv store and not the whole path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually Node.Network.StoreDir gives the parent
~/.bacalhau/orchestrator_store which holds both job store and nats store. I don't recall the motivation for doing it this way.
I don't think that is the case see https://github.com/bacalhau-project/bacalhau/blob/main/pkg/config/config.go#L87 and attached debugger screenshot.
We can check if the path is outside the repo, and if that is the case abort the migration, but I think the odds of that happening are next to zero.
var V3Migration = repo.NewMigration( | ||
repo.RepoVersion3, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test scenario for this migration. We already have a test suite for repo migrations and should be trivial to add this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also you might want to validate that migration will be graceful if the store doesn't exist or the path is not configured properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was waiting for feedback on this vs #4029 before doing any more work here. But if we decide we like this PR better I will add tests before merging.
closing in favor of alternative #4029 |
A simpler alternative of #4029
bacalhau node list
returns errorfailed request: invalid node type: nodeTypeUndefined
#4024