New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: when cluster fails ensure end time of open allocation is set to the last cluster heartbeat [DET 6509] #3657
fix: when cluster fails ensure end time of open allocation is set to the last cluster heartbeat [DET 6509] #3657
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
✔️ Deploy Preview for determined-ui canceled. 🔨 Explore the source changes: 2f8bbac 🔍 Inspect the deploy log: https://app.netlify.com/sites/determined-ui/deploys/62195cc025d95400082b8905 |
master/internal/core.go
Outdated
@@ -807,6 +820,8 @@ func (m *Master) Run(ctx context.Context) error { | |||
return err | |||
} | |||
|
|||
go updateClusterHeartbeat(m.db) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you drop a comment like explaining that this must happen after CloseOpenAllocations above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -0,0 +1 @@ | |||
ALTER TABLE public.cluster_id ADD COLUMN cluster_heartbeat timestamp not null DEFAULT (DATE_TRUNC('millisecond', now() at time zone 'utc')::timestamp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add new line endings to the sql files and format them a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
master/internal/db/postgres_tasks.go
Outdated
cluster_heartbeat, err := db.GetClusterHeartBeat() | ||
|
||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cluster_heartbeat, err := db.GetClusterHeartBeat() | |
if err != nil { | |
return err | |
} | |
cluster_heartbeat, err := db.GetClusterHeartBeat() | |
if err != nil { | |
return err | |
} |
master/internal/db/postgres_tasks.go
Outdated
if _, err := db.sql.Exec(` | ||
UPDATE allocations | ||
SET end_time = current_timestamp AT TIME ZONE 'UTC' | ||
SET end_time = $1 | ||
WHERE end_time IS NULL | ||
`); err != nil { | ||
`, cluster_heartbeat); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer something like
UPDATE allocations SET end_time = cluster_heartbeat FROM cluster_id;
and dropping the code above this, with the advantages mainly being GetClusterHeartBeat
becomes unused and less code is always better (and this becomes much less code), this is 1 db call instead of 2 (but this is not a big deal since it's every 10m, but applies elsewhere so it makes sense to try to be consistent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I'm not sure why I did it this way in the first place. Getting directly from the cluster_id makes a lot more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the sql query to this: UPDATE allocations
SET end_time = cluster_heartbeat FROM cluster_id
WHERE end_time IS NULL. We need the end_time is NULL bit to ensure that only the open allocations end_time is set to cluster heartbeat, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I got rid fo the GetClusterHeartBeat method from postgres_cluster.go
var uuidVal []string | ||
if err := db.sql.Select(&uuidVal, `SELECT cluster_id FROM cluster_id`); err != nil { | ||
return errors.Wrapf(err, "error reading cluster_id from cluster_id table") | ||
} | ||
if len(uuidVal) != 1 { | ||
return errors.Errorf( | ||
"expecting exactly one cluster_id from cluster_id table, %d values found", len(uuidVal), | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is all basically unused, I think we can just get rid of this. if it is to check the 1 row invariant is upheld, then i would say the check should happen elsewhere because there is nothing we can do about it here, we may as well still do what we need to do (set cluster heartbeat).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. done!
var layout = "2006-01-02T15:04:05.000Z" | ||
cluster_heartbeat, err := time.Parse(layout, row_heartbeat[0]) | ||
return cluster_heartbeat, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i commented elsewhere i dont think we need this func, but i'm also curious, i would think this should just work if you use time.Time rather than string + time.Parse. is that not the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean something like this:
var rowHeartbeat time.Time
if err := db.sql.Select(&rowHeartbeat, SELECT cluster_heartbeat FROM cluster_id
); err != nil {
return time.Time{}, errors.Wrapf(err, "error reading cluster_heartbeat from cluster_id table")
}
Yes, I'll have to check if that works. I added this to the integration test so that we can ensure that we're updating the cluster heartbeat correctly.
master/internal/core.go
Outdated
t := time.NewTicker(10 * time.Minute) | ||
defer t.Stop() | ||
current_cluster_heartbeat := time.Now().UTC().Truncate(time.Millisecond) | ||
for range t.C { | ||
err := db.UpdateClusterHeartBeat(current_cluster_heartbeat) | ||
if err != nil { | ||
log.Error(err.Error()) | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to get the first tick immediately
t := time.NewTicker(10 * time.Minute) | |
defer t.Stop() | |
current_cluster_heartbeat := time.Now().UTC().Truncate(time.Millisecond) | |
for range t.C { | |
err := db.UpdateClusterHeartBeat(current_cluster_heartbeat) | |
if err != nil { | |
log.Error(err.Error()) | |
} | |
} | |
t := time.NewTicker(10 * time.Minute) | |
defer t.Stop() | |
current_cluster_heartbeat := time.Now().UTC().Truncate(time.Millisecond) | |
for { | |
err := db.UpdateClusterHeartBeat(current_cluster_heartbeat) | |
if err != nil { | |
log.Error(err.Error()) | |
} | |
<-t.C | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also just because making sure we can gracefully shutdown is generally useful, can you go ahead pass a context.Context
to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
current_time := time.Now().UTC().Truncate(time.Millisecond) | ||
db.UpdateClusterHeartBeat(current_time) | ||
|
||
cluster_heartbeat, err := db.GetClusterHeartBeat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.NoError after this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
_, err := db.GetOrCreateClusterID() | ||
|
||
require.NoError(t, err, "failed to get or create cluster id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i prefer we keep the require.NoError and the call that created it side by side to signal to the reader 'this is just an err check for the above call'. similar to how most go code reads
x, err := doThing()
if err != nil {
return err
}
y, err := doOtherThing()
...
not
x, err := doThing()
if err != nil {
return err
}
y, err := doOtherThing()
...
as another example (not just me), go source https://github.com/golang/go/blob/master/src/io/fs/readdir.go#L22-L47 usually does this, i find it makes it a lot easily to quickly skim code. as an extension of this, there's also https://github.com/golang/go/wiki/CodeReviewComments#indent-error-flow which helps too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does make it easier to read. done!
aOut, err := db.AllocationByID(aIn.AllocationID) | ||
|
||
if *aOut.EndTime != cluster_heartbeat { | ||
errors.Errorf("Expected end time of open allocation is = %q but it is = %q instead", cluster_heartbeat.String(), (*aOut.EndTime).String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use require.Equal
, errors.Errorf
doesn't do anthing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops yes thank you. I missed this one! Changed it.
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
2 similar comments
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
@nrajanee when you are done addressing feedback, could you reassign me so i know to re-review (or request a re-review and re-assign, too) |
@cla-bot[bot] check |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
The cla-bot has been summoned, and re-checked this pull request! |
@nrajanee you're added to contributors but cla-bot is still failing so you'll need to follow the instructions it commented likely. fyi, make sure to sign your commits with your |
@cla-bot[bot] check |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
The cla-bot has been summoned, and re-checked this pull request! |
@cla-bot[bot] check |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
The cla-bot has been summoned, and re-checked this pull request! |
@cla-bot[bot] check |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
The cla-bot has been summoned, and re-checked this pull request! |
@cla-bot[bot] check |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
The cla-bot has been summoned, and re-checked this pull request! |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just a few final comments
err = db.AddAllocation(aIn) | ||
require.NoError(t, err, "failed to add allocation") | ||
|
||
//Don't complete the above allocation and call CloseOpenAllocations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit here and elsewhere
//Don't complete the above allocation and call CloseOpenAllocations | |
// Don't complete the above allocation and call CloseOpenAllocations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@@ -0,0 +1 @@ | |||
ALTER TABLE public.cluster_id DROP COLUMN IF EXISTS cluster_heartbeat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@cla-bot[bot] check |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
The cla-bot has been summoned, and re-checked this pull request! |
59f4317
to
727c09d
Compare
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
1 similar comment
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
50cdbcb
to
3d0b138
Compare
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
|
… is set to the last cluster heartbeat in cluster_id table
3d0b138
to
2f8bbac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Great work!
Description
fix: when cluster fails ensure end time of open allocation is set to the last cluster heartbeat [DET 6509]
Test Plan
Added an integration test which tests to ensure that the cluster heartbeat is added to the cluster id table accurately. It also checks whether open allocations have the correct cluster heartbeat as their endTime. The integration test covers the new functions added to postgres_cluster.go and the change made in CloseOpenAllocaitons in postgres_tasks.go. Manually check and ensure that the cluster heartbeat is getting updated every 10 minutes. This manual check will ensure that the go routine added to core.go works correctly.
Checklist
docs/release-notes/
.See Release Note for details.