fix: when cluster fails ensure end time of open allocation is set to the last cluster heartbeat [DET 6509] #3657

Merged

Contributor

@nrajanee nrajanee commented Feb 23, 2022

Description

fix: when cluster fails ensure end time of open allocation is set to the last cluster heartbeat [DET 6509]

Test Plan

Added an integration test that verifies the cluster heartbeat is written to the cluster_id table accurately and that open allocations receive the correct cluster heartbeat as their endTime. The integration test covers the new functions added to postgres_cluster.go and the change made to CloseOpenAllocations in postgres_tasks.go. Manually check that the cluster heartbeat is updated every 10 minutes; this confirms that the goroutine added to core.go works correctly.

Checklist

  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

cla-bot bot commented Feb 23, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Nikita Rajaneesh.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

netlify bot commented Feb 23, 2022

✔️ Deploy Preview for determined-ui canceled.

🔨 Explore the source changes: 2f8bbac

🔍 Inspect the deploy log: https://app.netlify.com/sites/determined-ui/deploys/62195cc025d95400082b8905

@nrajanee nrajanee requested a review from stoksc February 23, 2022 23:37
@@ -807,6 +820,8 @@ func (m *Master) Run(ctx context.Context) error {
return err
}

go updateClusterHeartbeat(m.db)
Contributor

Can you add a comment explaining that this must happen after CloseOpenAllocations above?

Contributor Author

done

@@ -0,0 +1 @@
ALTER TABLE public.cluster_id ADD COLUMN cluster_heartbeat timestamp not null DEFAULT (DATE_TRUNC('millisecond', now() at time zone 'utc')::timestamp);
Contributor

can you add new line endings to the sql files and format them a bit?

Contributor Author

done

Comment on lines 195 to 199
cluster_heartbeat, err := db.GetClusterHeartBeat()

if err != nil {
return err
}
Contributor

Suggested change
cluster_heartbeat, err := db.GetClusterHeartBeat()
if err != nil {
return err
}
cluster_heartbeat, err := db.GetClusterHeartBeat()
if err != nil {
return err
}

Comment on lines 201 to 205
if _, err := db.sql.Exec(`
UPDATE allocations
SET end_time = current_timestamp AT TIME ZONE 'UTC'
SET end_time = $1
WHERE end_time IS NULL
`); err != nil {
`, cluster_heartbeat); err != nil {
Contributor

I would prefer something like

UPDATE allocations SET end_time = cluster_heartbeat FROM cluster_id;

and dropping the code above this. The main advantages: GetClusterHeartBeat becomes unused, and less code is always better (this becomes much less code). It is also 1 db call instead of 2; that is not a big deal here since it's every 10m, but the same applies elsewhere, so it makes sense to be consistent.

Contributor Author

Yes! I'm not sure why I did it this way in the first place. Getting directly from the cluster_id makes a lot more sense.

Contributor Author

I changed the SQL query to this:

UPDATE allocations
SET end_time = cluster_heartbeat FROM cluster_id
WHERE end_time IS NULL

We need the end_time IS NULL condition to ensure that only the open allocations' end_time is set to the cluster heartbeat, correct?

Contributor Author

And I got rid of the GetClusterHeartBeat method from postgres_cluster.go

Comment on lines 36 to 44
var uuidVal []string
if err := db.sql.Select(&uuidVal, `SELECT cluster_id FROM cluster_id`); err != nil {
return errors.Wrapf(err, "error reading cluster_id from cluster_id table")
}
if len(uuidVal) != 1 {
return errors.Errorf(
"expecting exactly one cluster_id from cluster_id table, %d values found", len(uuidVal),
)
}
Contributor

this is all basically unused, I think we can just get rid of this. if it is to check the 1 row invariant is upheld, then i would say the check should happen elsewhere because there is nothing we can do about it here, we may as well still do what we need to do (set cluster heartbeat).

Contributor Author

I agree. done!

Comment on lines 62 to 64
var layout = "2006-01-02T15:04:05.000Z"
cluster_heartbeat, err := time.Parse(layout, row_heartbeat[0])
return cluster_heartbeat, err
Contributor

i commented elsewhere that i don't think we need this func, but i'm also curious: i would think this should just work if you use time.Time rather than string + time.Parse. is that not the case?

Contributor Author

@nrajanee nrajanee Feb 24, 2022

Do you mean something like this:

var rowHeartbeat time.Time
if err := db.sql.Select(&rowHeartbeat, `SELECT cluster_heartbeat FROM cluster_id`); err != nil {
return time.Time{}, errors.Wrapf(err, "error reading cluster_heartbeat from cluster_id table")
}

Yes, I'll have to check if that works. I added this to the integration test so that we can ensure that we're updating the cluster heartbeat correctly.

Comment on lines 617 to 630
t := time.NewTicker(10 * time.Minute)
defer t.Stop()
current_cluster_heartbeat := time.Now().UTC().Truncate(time.Millisecond)
for range t.C {
err := db.UpdateClusterHeartBeat(current_cluster_heartbeat)
if err != nil {
log.Error(err.Error())
}

}
Contributor

to get the first tick immediately

Suggested change
t := time.NewTicker(10 * time.Minute)
defer t.Stop()
current_cluster_heartbeat := time.Now().UTC().Truncate(time.Millisecond)
for range t.C {
err := db.UpdateClusterHeartBeat(current_cluster_heartbeat)
if err != nil {
log.Error(err.Error())
}
}
t := time.NewTicker(10 * time.Minute)
defer t.Stop()
current_cluster_heartbeat := time.Now().UTC().Truncate(time.Millisecond)
for {
err := db.UpdateClusterHeartBeat(current_cluster_heartbeat)
if err != nil {
log.Error(err.Error())
}
<-t.C
}

Contributor

also, just because making sure we can gracefully shut down is generally useful, can you go ahead and pass a context.Context to this?

Contributor Author

done!

current_time := time.Now().UTC().Truncate(time.Millisecond)
db.UpdateClusterHeartBeat(current_time)

cluster_heartbeat, err := db.GetClusterHeartBeat()
Contributor

require.NoError after this

Contributor Author

done!

Comment on lines 23 to 22
_, err := db.GetOrCreateClusterID()

require.NoError(t, err, "failed to get or create cluster id")
Contributor

i prefer we keep the require.NoError and the call that created it side by side to signal to the reader 'this is just an err check for the above call'. similar to how most go code reads

x, err := doThing()
if err != nil {
    return err
}

y, err := doOtherThing()
...

not

x, err := doThing()

if err != nil {
    return err
}

y, err := doOtherThing()
...

as another example (not just me), go source https://github.com/golang/go/blob/master/src/io/fs/readdir.go#L22-L47 usually does this, i find it makes it a lot easier to quickly skim code. as an extension of this, there's also https://github.com/golang/go/wiki/CodeReviewComments#indent-error-flow which helps too.

Contributor Author

Yes it does make it easier to read. done!

aOut, err := db.AllocationByID(aIn.AllocationID)

if *aOut.EndTime != cluster_heartbeat {
errors.Errorf("Expected end time of open allocation is = %q but it is = %q instead", cluster_heartbeat.String(), (*aOut.EndTime).String())
Contributor

use require.Equal, errors.Errorf doesn't do anything here

Contributor Author

@nrajanee nrajanee Feb 24, 2022

Oops yes thank you. I missed this one! Changed it.

Contributor

stoksc commented Feb 25, 2022

@nrajanee when you are done addressing feedback, could you reassign me so i know to re-review (or request a re-review and re-assign, too)

Contributor

stoksc commented Feb 25, 2022

@cla-bot[bot] check


cla-bot bot commented Feb 25, 2022

The cla-bot has been summoned, and re-checked this pull request!

Contributor

stoksc commented Feb 25, 2022

@nrajanee you're added to contributors but cla-bot is still failing, so you'll likely need to follow the instructions it commented. fyi, make sure to sign your commits with your @hpe email.

@nrajanee
Contributor Author

@cla-bot[bot] check

@nrajanee nrajanee requested a review from stoksc February 25, 2022 20:55

Contributor

@stoksc stoksc left a comment

lgtm, just a few final comments

err = db.AddAllocation(aIn)
require.NoError(t, err, "failed to add allocation")

//Don't complete the above allocation and call CloseOpenAllocations
Contributor

nit here and elsewhere

Suggested change
//Don't complete the above allocation and call CloseOpenAllocations
// Don't complete the above allocation and call CloseOpenAllocations

Contributor Author

done!

@@ -0,0 +1 @@
ALTER TABLE public.cluster_id DROP COLUMN IF EXISTS cluster_heartbeat;
Contributor

new lines

Contributor Author

done

@nrajanee nrajanee force-pushed the DET-6509_accurate_allocation_time branch from 59f4317 to 727c09d Compare February 25, 2022 21:28
@nrajanee nrajanee requested a review from stoksc February 25, 2022 21:38
@nrajanee nrajanee force-pushed the DET-6509_accurate_allocation_time branch from 50cdbcb to 3d0b138 Compare February 25, 2022 21:53
… is set to the last cluster heartbeat in cluster_id table
@nrajanee nrajanee force-pushed the DET-6509_accurate_allocation_time branch from 3d0b138 to 2f8bbac Compare February 25, 2022 22:48
@cla-bot cla-bot bot added the cla-signed label Feb 25, 2022
Contributor

@stoksc stoksc left a comment

lgtm. Great work!

@nrajanee nrajanee changed the title fix: [DET 6509] when cluster fails ensure end time of open allocation is set to the last cluster heartbeat fix: when cluster fails ensure end time of open allocation is set to the last cluster heartbeat [DET 6509] Feb 28, 2022
@nrajanee nrajanee merged commit ecea441 into determined-ai:master Feb 28, 2022
@dannysauer dannysauer modified the milestones: 0.0.102, 0.17.11 Feb 6, 2024