Deploy workflow #1540
Codecov Report
@@            Coverage Diff             @@
##        dynamic_services    #1540      +/-   ##
====================================================
- Coverage          93.87%   93.77%   -0.11%
====================================================
  Files                205      203       -2
  Lines              29086    29282     +196
====================================================
+ Hits               27304    27458     +154
- Misses              1782     1824      +42
Continue to review full report at Codecov.
exonum/src/runtime/mod.rs
//! 4. Async deployment usually has a deadline and/or success and failure conditions. As an example,
//!    the supervisor may collect confirmations from the validator nodes that have
//!    successfully deployed the artifact, and once all the validator nodes have sent
//!    their confirmations, the artifact is *committed*.
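The confirmation-collection scheme described in the quoted doc can be sketched as follows. This is a toy model, not the actual supervisor service; `DeployTracker` and its methods are invented for illustration:

```rust
use std::collections::HashSet;

/// Hypothetical sketch: track deploy confirmations from validators
/// and consider an artifact committed once all of them have confirmed.
struct DeployTracker {
    validators: HashSet<u32>, // IDs of all validator nodes
    confirmed: HashSet<u32>,  // validators that reported a successful deploy
}

impl DeployTracker {
    fn new(validators: impl IntoIterator<Item = u32>) -> Self {
        Self {
            validators: validators.into_iter().collect(),
            confirmed: HashSet::new(),
        }
    }

    /// Record a confirmation; returns `true` once every validator has
    /// confirmed, i.e. the artifact may be considered *committed*.
    fn confirm(&mut self, validator_id: u32) -> bool {
        if self.validators.contains(&validator_id) {
            self.confirmed.insert(validator_id);
        }
        self.confirmed == self.validators
    }
}

fn main() {
    let mut tracker = DeployTracker::new([0, 1, 2]);
    assert!(!tracker.confirm(0)); // 1 of 3 confirmations
    assert!(!tracker.confirm(1)); // 2 of 3
    assert!(tracker.confirm(2));  // all validators confirmed: committed
}
```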
As there is no 'undeploy' operation for the case when some nodes failed to deploy the artifact, is it correct that the runtimes do not know if and when the artifact is actually committed, and just rely on the dispatcher (?) issuing start_adding_service
only if the conditions are met?
What would happen if:
-
Most nodes deployed an artifact correctly, but a couple of nodes failed (e.g., the admins forgot to put the artifact where it belongs). As a result, the runtimes on the 'good' nodes have it deployed, while those on the 'bad' nodes don't. The runtimes on the 'good' nodes don't know that it was not committed.
-
The administrators attempt a re-deploy. This time, as the 'good' nodes already have this artifact, the operation fails there (simply because it is an illegal attempt to deploy the artifact twice), and succeeds on the 'bad' ones.
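The divergence described above can be simulated in plain Rust. This is a toy model of per-node runtime state, not the actual exonum `Runtime`; all names here are invented:

```rust
use std::collections::HashSet;

/// Toy model of one node's runtime: it only knows which artifacts it
/// has deployed locally, not whether the network committed them.
#[derive(Default)]
struct NodeRuntime {
    deployed: HashSet<String>,
}

impl NodeRuntime {
    /// Deploy fails if the artifact file is missing, or if the artifact
    /// is already deployed (an illegal second deploy attempt).
    fn deploy(&mut self, artifact: &str, file_present: bool) -> Result<(), String> {
        if !file_present {
            return Err(format!("{artifact}: file not found"));
        }
        if !self.deployed.insert(artifact.to_owned()) {
            return Err(format!("{artifact}: already deployed"));
        }
        Ok(())
    }
}

fn main() {
    let mut good = NodeRuntime::default();
    let mut bad = NodeRuntime::default();

    // First attempt: the admins forgot the file on the 'bad' node.
    assert!(good.deploy("token:1.0", true).is_ok());
    assert!(bad.deploy("token:1.0", false).is_err());

    // Re-deploy: it now fails on the 'good' node (already deployed)
    // and succeeds on the 'bad' one, so the network never converges.
    assert!(good.deploy("token:1.0", true).is_err());
    assert!(bad.deploy("token:1.0", true).is_ok());
}
```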
So, "consensus-agreed result" = failure -> the deploy of this particular artifact failed in the whole network, but the 'good' runtimes will retain the artifact as deployed until restarted -> a redeploy of the same artifact is impossible without a restart. Is that correct?
I think a deploy error on some but not all nodes is a realistic use case (though one we can probably agree to ignore in the upcoming release). Would a 'Runtime#unloadArtifact' operation, which the dispatcher invokes when the operation failed network-wide but completed successfully on this node (i.e., if I remember the current policy correctly, failed on any single node), help? Or 'Runtime#unloadIfPresent',
if we'd like the dispatcher not to keep local data on whether the local deployment was successful (though it might already have that).
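To make the suggestion concrete, such an unload hook could look like the sketch below. The trait and its methods (`unload_artifact`, `is_artifact_deployed`, `unload_if_present`) are hypothetical and do not exist in exonum's `Runtime` interface:

```rust
use std::collections::HashSet;

/// Hypothetical extension of the dispatcher-facing runtime interface.
trait RuntimeExt {
    /// Called by the dispatcher when the network-wide deploy failed but
    /// the local deploy may have succeeded; the runtime drops the
    /// artifact so a later re-deploy can start from a clean slate.
    fn unload_artifact(&mut self, artifact: &str);

    /// Whether this runtime currently has the artifact deployed.
    fn is_artifact_deployed(&self, artifact: &str) -> bool;

    /// Variant for when the dispatcher keeps no local success/failure
    /// record: unload only if the artifact is actually present.
    fn unload_if_present(&mut self, artifact: &str) {
        if self.is_artifact_deployed(artifact) {
            self.unload_artifact(artifact);
        }
    }
}

/// Minimal in-memory runtime to exercise the trait.
struct MyRuntime {
    deployed: HashSet<String>,
}

impl RuntimeExt for MyRuntime {
    fn unload_artifact(&mut self, artifact: &str) {
        self.deployed.remove(artifact);
    }
    fn is_artifact_deployed(&self, artifact: &str) -> bool {
        self.deployed.contains(artifact)
    }
}

fn main() {
    let mut rt = MyRuntime {
        deployed: ["token:1.0".to_owned()].into_iter().collect(),
    };
    rt.unload_if_present("token:1.0");
    assert!(!rt.is_artifact_deployed("token:1.0"));
    rt.unload_if_present("token:1.0"); // no-op on a node that never had it
}
```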
AFAIK yes, we plan it.
/// if several alternative initial service configurations are tried), but as a rule of thumb,
/// a `Runtime` should not return an error or panic here unless it wants the node to stop forever.
/// An error or panic returned from this method will not be processed and will lead
/// to the node stopping. A runtime should only return an error / panic if the error is local
Am I right that the Java runtime, which performs some user and framework code (= the same on all nodes and hopefully mostly deterministic; we can't say that of all the framework and library code) outside of start_adding_service
(code that is not considered likely to fail), violates this restriction? But respects
The runtime should commit long-term resources for the service after a commit_service() call.
Is that fine given the small/unknown likelihood? Or shall we attempt to do everything in start_adding_service
and then just flip a switch? (That would complicate things with clean-ups in after_commit and, if we need that, a restriction of API operation of a newly added service before commit_service; also, with no start_adding_
for restarts, it would have to operate differently for newly added and restarted services.)
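The "flip a switch" alternative, i.e. doing the fallible work in the adding phase and committing long-term resources only on commit, can be sketched as below. These are toy types invented for illustration, not the actual `Runtime` trait methods:

```rust
/// Toy model of two-phase service activation: fallible validation
/// happens in the adding phase, while long-term resources (e.g. an
/// HTTP API) are committed only when the service is committed.
struct ServiceSlot {
    name: String,
    validated: bool, // set in the adding phase
    api_live: bool,  // the "switch", flipped only after commit
}

impl ServiceSlot {
    /// Adding phase: run the fallible, hopefully-deterministic checks.
    /// An error here is reported to the dispatcher; it is not fatal
    /// for the node.
    fn start_adding_service(name: &str) -> Result<Self, String> {
        if name.is_empty() {
            return Err("empty service name".to_owned());
        }
        Ok(Self {
            name: name.to_owned(),
            validated: true,
            api_live: false,
        })
    }

    /// Commit phase: flip the switch. Per the quoted doc, this should
    /// not fail unless the node cannot usefully continue.
    fn commit_service(&mut self) {
        assert!(self.validated, "{} was never validated", self.name);
        self.api_live = true; // e.g. wire up the service's HTTP endpoints
    }
}

fn main() {
    let mut slot =
        ServiceSlot::start_adding_service("token").expect("validation failed");
    assert!(!slot.api_live); // no long-term resources committed yet
    slot.commit_service();
    assert!(slot.api_live);
}
```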
Am I right that the Java runtime [...] violates this restriction?
There is no 100% non-violable restriction here, hence the use of "should" instead of "must" as per RFC 2119. The balance between delaying resource commitment and potentially ending up with a non-functional blockchain is up to the runtime. Subjectively, I don't think it's reasonable to assume that the probability of the service developer screwing up the HTTP API (but not screwing up anything else) is high enough to complicate the workflow because of it.
Thanks, I agree it is reasonable not to complicate the workflow for highly unlikely scenarios (also given that the user tests their code with the testkit).
(Runtime) LGTM.
This PR describes artifact deployment in greater detail and adds some tests covering deployment.