Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: non-training-specific checkpoints #3630

Closed
wants to merge 3 commits into from
Closed

Conversation

rb-determined-ai
Copy link
Member

Description

Checkpoints in Determined were originally designed exclusively for
training. We would like in the future to support checkpointing for all
tasks. To get there, we need to change what data we keep about
checkpoints. This change focuses on migrating existing
training-specific data into checkpoint metadata, modifying how
checkpoints are uploaded so that the Core API checkpointing does not
feel training-specific, and updating the WebUI to reflect the new
schema.

Test Plan

Commentary (optional)

This upstream feature branch will include backend work, webui client-side work, and python client-side work before it is complete.

@netlify
Copy link

netlify bot commented Feb 18, 2022

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 2cb88bb
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/623e44045ae4fc000886347e

@rb-determined-ai rb-determined-ai force-pushed the checkpoints branch 2 times, most recently from 4d74f54 to 2eefccb Compare March 16, 2022 23:17

// QueryExt is a helper struct to define custom .Where, .Order, and .Limit without a ton of inlined
// functions.
type QueryExt struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not super excited about this solution. asymptotically it's a custom proxy that'll need to handle the entirety of bun options.

Couple thoughts/options:

  1. Functions like CheckpointsList that use this doesn't add much before/after the query: basically only db.NewSelect().Model(...) call before and error handling after. is that worth the extra complexity? both can be done inline in the "view" methods.
  2. We could have filterFn approach similar to bun's RunInTx interface. i.e.
    signature
func CheckpointsList(ctx context.Context, filterFn func(bun.SelectQuery) (bun.SelectQuery, error))

and then at call-site

# req is grpc request thing
ckpts, err := CheckpointsList(ctx, func(ctx context.Context, query bun.SelectQuery) {
  	query = query.Where("trial_id = ?", req.Id)
    ...
    return query, nil
})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with the filterFn but I really disliked it, because there are cases where you hit errors creating the filter (not running the filter), but when when you do ckpts, err := ckpts.List(ctx, func(*bun.SelectQuery)(*bun.SelectQuery, error)), you don't know if your err came from user error (defining a nonsensical error) or from the database call itself. This would be the difference between an invalid request error and a 500 error, so the calling code does need to distinguish.

I toyed with nesting closures, but it got pretty unreadable pretty quick.

Copy link
Contributor

@stoksc stoksc Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be the difference between an invalid request error and a 500 error, so the calling code does need to distinguish.

At what point would we be building an invalid query because of user input?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If for some reason we are, then we could define the signature as like

func CheckpointsList(ctx context.Context, filterFn func(bun.SelectQuery) (bun.SelectQuery, errInvalidQuery))

and use errors.Is.

Comment on lines 364 to 377
return resp, db.Bun.RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error {
total, err := ckpts.CountTx(ctx, tx, ext)
if err != nil {
return errors.Wrapf(err, "error counting checkpoints for trial %d from database", req.Id)
}

if len(req.States) != 0 && !found {
return false
// apply pagination
ext = ext.Limit(int(req.Limit))
ext = ext.Offset(int(req.Offset))

ckpts, err := ckpts.ListTx(ctx, tx, ext)
if err != nil {
return errors.Wrapf(err, "error fetching checkpoints for trial %d from database", req.Id)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ioga this is the code in question, where it's really nice to have this QueryExt object.

If we decide that separate CountTx and ListTx calls are not acceptable anyway, then we should just rm -rf the QueryExt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we were to follow 1), i.e. simply inline it, then we'd be building a bun.SelectQuery from db.NewSelect() instead of QueryExt, call query.Count(ctx) to get the count, and then proceed with adding Limit and Offset to the same query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a separate note, I think this transaction block doesn't do anything. You get "read committed" level by default, and these selects can get different committed data.
you've got to set transaction to repeatable read if you want to get count & list results matching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related uptrace/bun#342

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this transaction block doesn't do anything

Interesting. So I'm fine with any of the following, in order I would prefer them:

  • don't bother with transactions on pagination data, it seems racy anyway. Pagination data will change from call to call so, eh.
  • increase transaction level
  • switch to one megaquery, as with get_experiments. I see this as an optimization that complicates the code to require much more SQL skill (more than I have at least, and I'm having to write it). IMO that's worth it only when we've identified this code as a bottleneck, otherwise I think it's a premature optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for not bothering here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stoksc So you like the closures and you don't like QueryExt (totally fine), but what about the megaquery vs two separate queries?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two separate queries is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better, even.

@rb-determined-ai
Copy link
Member Author

Ok, @stoksc and @ioga, I would like to highlight three patterns I would like feedback on before going any further:

  1. basic queries in internal/ckpts/ckpts.go and internal/models/models.go. I don't think these are terribly controversial, but getting settled on these is probably enough to get @mapmeld taken care of.
  2. pagination queries like internal/api_trials.go::GetTrialCheckpoints(). See thread above to continue discussion.
  3. compound queries, like internal/api_models.go::GetModelVersion(). I found that static/srv/get_model_versions.sql was both insanely complicated and silly in its implementation; it's secretly just three separate, non-interacting queries (get model, get checkpoint, get model version) that are running as one query and using json to combine their output into a REST-API-compatible form. Also, one of those queries (get model) is run before get_model_versions.sql, then the model_id is fed back into get_model_versions.sql so the model can be fetched again. Instead, I wrote it as three completely separate queries in a transaction (probably has the same isolation level issue @ioga mentioned already). But it is way simpler like this.

@stoksc
Copy link
Contributor

stoksc commented Mar 21, 2022

totally pointless comment: I think internal/ckpts/ckpts.go should be internal/db/ckpts/ckpts.go

@ioga
Copy link
Contributor

ioga commented Mar 22, 2022

3. compound queries, like `internal/api_models.go::GetModelVersion()`.  I found that `static/srv/get_model_versions.sql` was both insanely complicated and silly in its implementation; it's secretly just three separate, non-interacting queries (get model, get checkpoint, get model version) that are running as one query and using json to combine their output into a REST-API-compatible form.  Also, one of those queries (get model) is run before `get_model_versions.sql`, then the model_id is fed back into `get_model_versions.sql` so the model can be fetched again.  Instead, I wrote it as three completely separate queries in a transaction (probably has the same isolation level issue @ioga mentioned already).  But it is _way_ simpler like this.

I am on board. Let's have these as separate queries, and not bother with transactions and their isolation levels. The fallout from a possible "wrong" data here seems minimal. We only try to optimize only the expensive ones.

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

personally:

  1. basic queries ✔️ i think what's here looks great
  2. for paginated/parameterized queries i much prefer using bun directly and accepting closures to extend queries. if we want different behavior on different errors, just use different error types.
  3. for compound queries, i think we should have a compound/composite layer, like a service layer, that encapsulates that logic. user.Service is basically the exact pattern i think we should follow (maybe start to stash them all in internal/service/thing). so 3 separate queries calling into the internal/db/things_you_need layers is great to me.

@rb-determined-ai
Copy link
Member Author

for compound queries, i think we should have a compound/composite layer, like a service layer, that encapsulates that logic. user.Service is basically the exact pattern i think we should follow (maybe start to stash them all in internal/service/thing). so 3 separate queries calling into the internal/db/things_you_need layers is great to me.

I'm not sure I understand. user.Service looks to me like a series of REST API endpoints (patchUser, postLogin, etc). I don't understand the connection to multiple queries vs a megaquery.

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

I'm not sure I understand. user.Service looks to me like a series of REST API endpoints (patchUser, postLogin, etc). I don't understand the connection to multiple queries vs a megaquery.

model api need to return the results of many queries. write a model service, and put the combining logic in there so it can be reused without calling a.GetModel() from inside a.UpdateModel later.

@rb-determined-ai
Copy link
Member Author

model api need to return the results of many queries. write a model service, and put the combining logic in there so it can be reused without calling a.GetModel() from inside a.UpdateModel later

I definitely agree with avoiding calling a.GetModel() from a.UpdateModel(), really under any circumstance. But what you call a "service layer"... do you just mean "some helper functions"? Or do you mean "an actor that you pass messages to and get messages back from"?

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

i mean a struct called ModelService that has helper functions that encapsulate shared logic around models. Stuff like "validate this model before I insert it" or "composite these 3 queries into some other object" rather than that being duplicated or forgotten at every call site, or every set of endpoints using a different pattern for how to share/reuse this type of logic.

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

in determined, a lot right now, this layer is just done by an actor (think agents actor, jobs actor now, etc)

@rb-determined-ai
Copy link
Member Author

i mean a struct called ModelService that has helper functions that encapsulate shared logic around models

But isn't this what the internal/db/models module would be for? What would it need to be a stateful object?

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

But isn't this what the internal/db/models module would be for?

i thought internal/db/models was just for accessing the db, service layer would be for composite queries. if ckpts is in internal/db/ckpts and models is in internal/db/models, the model service that composites them would just be in internal/service/models so that it can call db/ckpts+models without ever hitting circular imports.

What would it need to be a stateful object?

it doesn't have to be. it's just helper methods in a special place. (previously it may've just had a config struct or a db, but the db is a singleton at least)

@ioga
Copy link
Contributor

ioga commented Mar 22, 2022

We are inching towards a model-view-controller here. "Service layer" sounds like a controller.

Similarly to the shortcuts code, I think the code should be structured not by layer but by "entity". i.e. internal/workflows/ directory will have:

  • api.go for "view" part: serialization between protobufs and internal data types.
  • models.go for "model" part: database-related code, basic business logic such as helpers or methods on model instances.
  • controller.go for "controller": composite business logic, stateful objects, watchers, polling, heartbeats, etc. in practice, simple APIs don't need that and it can be folded either into models or into view.

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

this is exactly what i'm pushing for, just pushed around some (sorry, now it is obvious i used to write a lot of vanilla .net core apis, but the pattern works "just fine"). the users code is fairly close to this already https://github.com/determined-ai/determined/tree/master/master/internal/user

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

@ioga the only issue is defining the apis in a different package than the apiServer is basically impossible.

@rb-determined-ai
Copy link
Member Author

I think the code should be structured not by layer but by "entity". i.e. internal/workflows/ directory will have:

This doesn't seem like it would solve the circular dependency problem in the way that @stoksc's suggestion would. Where would compound queries go? Or would we still have internal/db/* in addition to internal/workflows and internal/ckpts, etc?

@ioga the only issue is defining the apis in a different package than the apiServer is basically impossible.

I don't see why this wouldn't work:

type struct apiServer {
    models.Server
    workflows.Server
    users.Server
}

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

obviously i dont read well, let me come back to this when i have more than 5m straight (~30m)

@ioga
Copy link
Contributor

ioga commented Mar 22, 2022

This doesn't seem like it would solve the circular dependency problem in the way that @stoksc's suggestion would. Where would compound queries go? Or would we still have internal/db/* in addition to internal/workflows and internal/ckpts, etc?

The "entities" should form a dependency tree. 3 options:

  • workflows depend on checkpoints and import them.
  • checkpoints depend on workflows and import them.
  • if they're tightly intertwined and depend on each other, add separate compound entity "checkpointed_workflows" — it would probably not have models.go, only api and controller.

@rb-determined-ai
Copy link
Member Author

rb-determined-ai commented Mar 22, 2022

So then what about leaving api.go in the internal/workflows and use:

type struct apiServer {
    m *Master
    workflows.Server
}

?

I tried using composition like this to define one of the required methods of the apiServer and it still compiles.

@stoksc
Copy link
Contributor

stoksc commented Mar 22, 2022

are we all in agreement now? seems like it?

err := idb.NewSelect().
Model(&out).
Where("modelId = ?", modelId).
Where("version = ?", version).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is different from what we currently do in WebUI. In our current setup /models/Tester/versions/41/overview , links to model_versions.id = 41, not the auto-incremented model_versions.version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I think I messed up my interpretation of the original sql.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it should be:

	err := idb.NewSelect().
		Model(&out).
		Where("model_id = ?", modelId).
		Where("id = ?", version).
		Scan(ctx)

right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait... model_version.version is not auto-incremented, but id is:

                                          Table "public.model_versions"
      Column       |           Type           | Collation | Nullable |                  Default
-------------------+--------------------------+-----------+----------+--------------------------------------------
 version           | integer                  |           | not null |
 checkpoint_uuid   | uuid                     |           | not null |
 creation_time     | timestamp with time zone |           | not null |
 last_updated_time | timestamp with time zone |           |          |
 metadata          | jsonb                    |           |          |
 model_id          | integer                  |           | not null |
 id                | integer                  |           | not null | nextval('model_versions_id_seq'::regclass)
 name              | text                     |           |          |
 comment           | text                     |           |          |
 user_id           | integer                  |           |          |
 labels            | text[]                   |           |          |
 notes             | text                     |           |          |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since ID is unique I think it can safely be just Where("id = ?", version).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code looks buggy to me.

The go code was:

QueryProto("get_model_version", resp.ModelVersion, parentModel.Model.Id, req.ModelVersion)

The sql was:

WITH mv AS (
  SELECT version, checkpoint_uuid, model_versions.id, creation_time, name, comment, metadata, labels, notes, username, last_updated_time
    FROM model_versions
    LEFT JOIN users ON users.id = model_versions.user_id
    WHERE model_id = $1 AND model_versions.id = $2
),

It looks like we were previously matching model_versions.id against req.ModelVersion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the increment is being done in insert_model_version but maybe should be defined by the table
(SELECT COALESCE(max(version), 0) + 1 FROM model_versions WHERE model_id = $1)

I'd say req.ModelVersion is named poorly rather than it being a bug, because in the Model page we link to ModelVersions by the id field: https://github.com/determined-ai/determined/blob/master/webui/react/src/components/Table.tsx#L167-L182 - it's a long time ago but I think I had a convo about which one we wanted to use

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait... isn't the id field already an autoincrement field? I thought that's what the nextval('model_versions_id_seq'::regclass) thing meant. Why are we doing a manual "auto" increment on insert?

Nevermind, I misunderstood the SQL I was looking at again... I understand now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say req.ModelVersion is named poorly rather than it being a bug

ok.

master/pkg/model/model.go Outdated Show resolved Hide resolved
// Validation metrics reported at the same latest_batch as the checkpoint.
google.protobuf.Struct validation_metrics = 6;
// TODO: is this actually a good idea?
google.protobuf.DoubleValue searcher_metric = 17;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was recently added to make checkpoints sort-able by searcher_metric (even server-side needs it to appear here to read it out of the DB) --- if there's a better way then I'm open to it.

on a related note, does ProtoConverter handle these nullable types (google.protobuf.DoubleValue, google.protobuf.StringValue) well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO was mostly a note to myself. This is a very training-specific detail, and more specific than that, it tightly couples several details together (experiment config, searcher apis, metrics apis, and checkpoints apis). If we already have it, I don't think it's going away overnight, but the note was more along the lines of, "what would be a better way to keep these things more loosely coupled in the future?".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, does ProtoConverter handle these nullable types (google.protobuf.DoubleValue, google.protobuf.StringValue) well?

If there's a type it doesn't handle yet... just add it. Really the ProtoConverter just makes it easier to copy from normal go types (int, time.Time, or map[string]interface{}) to protobuf equivalents without a zillion error checks and intermediate values.

for _, vstate := range req.ValidationStates {
switch vstate {
case checkpointv1.State_STATE_ACTIVE:
validationStates = append(validationStates, "'ACTIVE'")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra layers of quotes

This includes:

- a TODO listing the queries to be changed and the work done on those
  already completed
- creates a view called checkpoints_expanded that collects the data
  needed to populate the protobuf Checkpoint object from various tables
- bun-ifying the model.Checkpoint, model.Model, and model.Version
  objects.  model.CheckpointExpanded was added to have a bun-ified
  struct for querying the checkpoints_expanded view.
- Introducing internal/ckpts and internal/models
- a bunch of XXX comments that need to be resolved before landing this
@rb-determined-ai
Copy link
Member Author

closing in favor of #3859.

@dannysauer dannysauer deleted the checkpoints branch August 19, 2022 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants