fix: Skip resource checking for unmanaged exp #9372

Merged
gt2345 merged 4 commits into main from gt/385-unmanaged-resource on May 21, 2024

@gt2345 gt2345 commented May 15, 2024

Ticket

MD-385

Description

For unmanaged experiments, the slots per trial should be 0, since the master does not spend any resources on them.

Test Plan

Start a master with no agents connected, then run unmanaged experiments. The experiments should complete successfully even though no compute resources are available.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@gt2345 gt2345 requested a review from a team as a code owner May 15, 2024 19:06
@gt2345 gt2345 requested a review from ShreyaLnuHpe May 15, 2024 19:06
@cla-bot cla-bot bot added the cla-signed label May 15, 2024

codecov bot commented May 15, 2024

Codecov Report

Attention: Patch coverage is 77.77778%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 46.07%. Comparing base (d4e23f4) to head (a61f606).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9372   +/-   ##
=======================================
  Coverage   46.07%   46.07%           
=======================================
  Files        1228     1228           
  Lines      155900   155902    +2     
  Branches     2439     2439           
=======================================
+ Hits        71837    71838    +1     
- Misses      83872    83873    +1     
  Partials      191      191           
Flag      Coverage Δ
backend   42.02% <77.77%> (+<0.01%) ⬆️
harness   64.06% <ø> (-0.01%) ⬇️
web       38.24% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
master/internal/core_experiment.go 61.84% <77.77%> (+0.33%) ⬆️

... and 3 files with indirect coverage changes


netlify bot commented May 15, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit a61f606
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/664bd2ed3a0a7f0008605550
😎 Deploy Preview https://deploy-preview-9372--determined-ui.netlify.app

master/internal/core_experiment.go (outdated)
	slotsPerTrial = 0
}

poolName, _, err := m.ResolveResources(resources.ResourcePool(), slotsPerTrial, workspaceID, isSingleNode)
Contributor

is there any reason to call resolve resources at all if the experiment is unmanaged? e.g., we have a very similar bug if someone passes a non-existent resource pool.

Contributor

i guess what makes this one so painful is that the default is 1 slot which doesn't work

Contributor

+1, haran hit this recently as well: we don't need to validate resource pool names for unmanaged.
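
The two failure modes raised in this thread can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `pool` and `resolveResources` are hypothetical stand-ins, not the master's real `ResolveResources`, and the rule "reject a request larger than the pool can ever offer" is a guess at why the default of 1 slot fails when no agent is connected.

```go
package main

import (
	"errors"
	"fmt"
)

// pool is a hypothetical stand-in for a resource pool known to the master.
type pool struct {
	maxSlots int // slots offered by connected agents; 0 when none is connected
}

// resolveResources is a hypothetical, simplified validator (not the real
// ResolveResources). It rejects unknown pools and requests that cannot fit.
func resolveResources(pools map[string]pool, name string, slots int) (string, error) {
	p, ok := pools[name]
	if !ok {
		return "", fmt.Errorf("resource pool %q not found", name)
	}
	if slots > p.maxSlots {
		return "", errors.New("request can never be scheduled: not enough slots")
	}
	return name, nil
}

func main() {
	// A master with one pool and no agents connected.
	pools := map[string]pool{"default": {maxSlots: 0}}

	// The default of 1 slot is rejected.
	_, err := resolveResources(pools, "default", 1)
	fmt.Println("1 slot, existing pool:", err)

	// Forcing slots to 0 passes the slot check.
	_, err = resolveResources(pools, "default", 0)
	fmt.Println("0 slots, existing pool:", err)

	// A non-existent pool still fails, even with 0 slots.
	_, err = resolveResources(pools, "missing", 0)
	fmt.Println("0 slots, missing pool:", err)
}
```

Under these assumptions, zeroing the slot count only removes the first failure; the pool-name failure remains, which is the argument for not calling the validator at all on the unmanaged path.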

Contributor Author

Hi @stoksc and @ioga, I've moved ResolveResources and TaskContainerDefaults behind the unmanaged check.

Contributor

@stoksc stoksc left a comment

seems to be a fine fix for this, but i'd like to split the managed/unmanaged code paths even more eventually.

@gt2345 gt2345 changed the title from "fix: Set slots to zero for unmanaged exp" to "fix: Skip resource checking for unmanaged exp" May 16, 2024
@gt2345 gt2345 requested review from ioga and stoksc May 16, 2024 17:28
Comment on lines 303 to 311
taskContainerDefaults, err := m.rm.TaskContainerDefaults(
	poolName,
	m.config.TaskContainerDefaults,
)
if err != nil {
	return nil, nil, config, nil, nil, errors.Wrapf(err, "error getting TaskContainerDefaults")
}
taskSpec.TaskContainerDefaults = taskContainerDefaults
taskSpec.TaskContainerDefaults.MergeIntoExpConfig(&config)
Contributor

I am not sure we need to put this under the unmanaged check. I think it's supposed to work for unmanaged experiments, and having a properly merged spec might become helpful if/when we want the master to have access to checkpoints produced by the unmanaged experiment, e.g. for downloads or GC. LGTM otherwise.

Contributor Author

Sure, I've moved taskContainerDefaults out of the condition
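
The ordering agreed on here (skip pool resolution for unmanaged experiments, but merge defaults on both paths) might look roughly like the sketch below. This is not the code in `core_experiment.go`: `expConfig`, `resolvePool`, `prepare`, and the `Image` field are hypothetical names chosen for illustration, and how the real code selects defaults when no pool was resolved is not shown in this conversation.

```go
package main

import "fmt"

// expConfig is a hypothetical, trimmed-down experiment config.
type expConfig struct {
	Unmanaged    bool
	ResourcePool string
	Image        string // hypothetical example of a field that defaults can fill in
}

// resolvePool is a hypothetical stand-in for resource pool validation.
func resolvePool(known map[string]bool, name string) (string, error) {
	if !known[name] {
		return "", fmt.Errorf("resource pool %q not found", name)
	}
	return name, nil
}

// prepare validates the pool only for managed experiments, then merges
// defaults into the config regardless of the managed/unmanaged flag.
func prepare(known map[string]bool, cfg expConfig, defaultImage string) (expConfig, error) {
	if !cfg.Unmanaged {
		poolName, err := resolvePool(known, cfg.ResourcePool)
		if err != nil {
			return cfg, err
		}
		cfg.ResourcePool = poolName
	}
	// Defaults are merged on both paths so the stored spec is complete.
	if cfg.Image == "" {
		cfg.Image = defaultImage
	}
	return cfg, nil
}

func main() {
	known := map[string]bool{"default": true}

	// Unmanaged: the unknown pool is never validated, defaults still apply.
	cfg, err := prepare(known, expConfig{Unmanaged: true, ResourcePool: "missing"}, "example/image:latest")
	fmt.Println("unmanaged:", cfg.Image, err)

	// Managed: the unknown pool is rejected.
	_, err = prepare(known, expConfig{ResourcePool: "missing"}, "example/image:latest")
	fmt.Println("managed:", err)
}
```

Keeping the defaults merge outside the condition means an unmanaged experiment's stored config carries the same filled-in fields as a managed one, which is the property the review comment above asks for.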

@gt2345 gt2345 force-pushed the gt/385-unmanaged-resource branch from 85b1281 to a61f606 Compare May 20, 2024 22:47
@gt2345 gt2345 merged commit 5c51164 into main May 21, 2024
84 of 97 checks passed
@gt2345 gt2345 deleted the gt/385-unmanaged-resource branch May 21, 2024 16:21