feat: non-training-specific checkpoints #3630

rb-determined-ai · 2022-02-18T13:25:09Z

Description

Checkpoints in Determined were originally designed exclusively for
training. We would like in the future to support checkpointing for all
tasks. To get there, we need to change what data we keep about
checkpoints. This change focuses on migrating existing
training-specific data into checkpoint metadata, modifying how
checkpoints are uploaded so that the Core API checkpointing does not
feel training-specific, and updating the WebUI to reflect the new
schema.

Test Plan

Commentary (optional)

This upstream feature branch will include backend work, webui client-side work, and python client-side work before it is complete.

netlify · 2022-02-18T13:25:18Z

✅ Deploy Preview for determined-ui canceled.

Name	Link
🔨 Latest commit	`2cb88bb`
🔍 Latest deploy log	https://app.netlify.com/sites/determined-ui/deploys/623e44045ae4fc000886347e

ioga · 2022-03-21T19:05:02Z

master/internal/db/bun.go

+
+// QueryExt is a helper struct to define custom .Where, .Order, and .Limit without a ton of inlined
+// functions.
+type QueryExt struct {


I am not super excited about this solution. asymptotically it's a custom proxy that'll need to handle the entirety of bun options.

Couple thoughts/options:

Functions like CheckpointsList that use this doesn't add much before/after the query: basically only db.NewSelect().Model(...) call before and error handling after. is that worth the extra complexity? both can be done inline in the "view" methods.

We could have filterFn approach similar to bun's RunInTx interface. i.e.
signature

func CheckpointsList(ctx context.Context, filterFn func(bun.SelectQuery) (bun.SelectQuery, error))

and then at call-site

# req is grpc request thing ckpts, err := CheckpointsList(ctx, func(ctx context.Context, query bun.SelectQuery) { query = query.Where("trial_id = ?", req.Id) ... return query, nil })

I experimented with the filterFn but I really disliked it, because there are cases where you hit errors creating the filter (not running the filter), but when when you do ckpts, err := ckpts.List(ctx, func(*bun.SelectQuery)(*bun.SelectQuery, error)), you don't know if your err came from user error (defining a nonsensical error) or from the database call itself. This would be the difference between an invalid request error and a 500 error, so the calling code does need to distinguish.

I toyed with nesting closures, but it got pretty unreadable pretty quick.

This would be the difference between an invalid request error and a 500 error, so the calling code does need to distinguish.

At what point would we be building an invalid query because of user input?

If for some reason we are, then we could define the signature as like

func CheckpointsList(ctx context.Context, filterFn func(bun.SelectQuery) (bun.SelectQuery, errInvalidQuery))

and use errors.Is.

rb-determined-ai · 2022-03-21T19:39:04Z

master/internal/api_trials.go

+	return resp, db.Bun.RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error {
+		total, err := ckpts.CountTx(ctx, tx, ext)
+		if err != nil {
+			return errors.Wrapf(err, "error counting checkpoints for trial %d from database", req.Id)
+		}

-		if len(req.States) != 0 && !found {
-			return false
+		// apply pagination
+		ext = ext.Limit(int(req.Limit))
+		ext = ext.Offset(int(req.Offset))
+
+		ckpts, err := ckpts.ListTx(ctx, tx, ext)
+		if err != nil {
+			return errors.Wrapf(err, "error fetching checkpoints for trial %d from database", req.Id)
+		}


@ioga this is the code in question, where it's really nice to have this QueryExt object.

If we decide that separate CountTx and ListTx calls are not acceptable anyway, then we should just rm -rf the QueryExt.

if we were to follow 1), i.e. simply inline it, then we'd be building a bun.SelectQuery from db.NewSelect() instead of QueryExt, call query.Count(ctx) to get the count, and then proceed with adding Limit and Offset to the same query.

on a separate note, I think this transaction block doesn't do anything. You get "read committed" level by default, and these selects can get different committed data.
you've got to set transaction to repeatable read if you want to get count & list results matching.

related uptrace/bun#342

I think this transaction block doesn't do anything

Interesting. So I'm fine with any of the following, in order I would prefer them:

don't bother with transactions on pagination data, it seems racy anyway. Pagination data will change from call to call so, eh.

increase transaction level

switch to one megaquery, as with get_experiments. I see this as an optimization that complicates the code to require much more SQL skill (more than I have at least, and I'm having to write it). IMO that's worth it only when we've identified this code as a bottleneck, otherwise I think it's a premature optimization.

I'd vote for not bothering here.

@stoksc So you like the closures and you don't like QueryExt (totally fine), but what about the megaquery vs two separate queries?

two separate queries is fine.

better, even.

rb-determined-ai · 2022-03-21T23:32:33Z

Ok, @stoksc and @ioga, I would like to highlight three patterns I would like feedback on before going any further:

basic queries in internal/ckpts/ckpts.go and internal/models/models.go. I don't think these are terribly controversial, but getting settled on these is probably enough to get @mapmeld taken care of.
pagination queries like internal/api_trials.go::GetTrialCheckpoints(). See thread above to continue discussion.
compound queries, like internal/api_models.go::GetModelVersion(). I found that static/srv/get_model_versions.sql was both insanely complicated and silly in its implementation; it's secretly just three separate, non-interacting queries (get model, get checkpoint, get model version) that are running as one query and using json to combine their output into a REST-API-compatible form. Also, one of those queries (get model) is run before get_model_versions.sql, then the model_id is fed back into get_model_versions.sql so the model can be fetched again. Instead, I wrote it as three completely separate queries in a transaction (probably has the same isolation level issue @ioga mentioned already). But it is way simpler like this.

stoksc · 2022-03-21T23:42:55Z

totally pointless comment: I think internal/ckpts/ckpts.go should be internal/db/ckpts/ckpts.go

ioga · 2022-03-22T03:13:54Z

3. compound queries, like `internal/api_models.go::GetModelVersion()`.  I found that `static/srv/get_model_versions.sql` was both insanely complicated and silly in its implementation; it's secretly just three separate, non-interacting queries (get model, get checkpoint, get model version) that are running as one query and using json to combine their output into a REST-API-compatible form.  Also, one of those queries (get model) is run before `get_model_versions.sql`, then the model_id is fed back into `get_model_versions.sql` so the model can be fetched again.  Instead, I wrote it as three completely separate queries in a transaction (probably has the same isolation level issue @ioga mentioned already).  But it is _way_ simpler like this.

I am on board. Let's have these as separate queries, and not bother with transactions and their isolation levels. The fallout from a possible "wrong" data here seems minimal. We only try to optimize only the expensive ones.

stoksc · 2022-03-22T18:56:31Z

personally:

basic queries ✔️ i think what's here looks great
for paginated/parameterized queries i much prefer using bun directly and accepting closures to extend queries. if we want different behavior on different errors, just use different error types.
for compound queries, i think we should have a compound/composite layer, like a service layer, that encapsulates that logic. user.Service is basically the exact pattern i think we should follow (maybe start to stash them all in internal/service/thing). so 3 separate queries calling into the internal/db/things_you_need layers is great to me.

rb-determined-ai · 2022-03-22T19:20:41Z

for compound queries, i think we should have a compound/composite layer, like a service layer, that encapsulates that logic. user.Service is basically the exact pattern i think we should follow (maybe start to stash them all in internal/service/thing). so 3 separate queries calling into the internal/db/things_you_need layers is great to me.

I'm not sure I understand. user.Service looks to me like a series of REST API endpoints (patchUser, postLogin, etc). I don't understand the connection to multiple queries vs a megaquery.

stoksc · 2022-03-22T19:24:10Z

I'm not sure I understand. user.Service looks to me like a series of REST API endpoints (patchUser, postLogin, etc). I don't understand the connection to multiple queries vs a megaquery.

model api need to return the results of many queries. write a model service, and put the combining logic in there so it can be reused without calling a.GetModel() from inside a.UpdateModel later.

rb-determined-ai · 2022-03-22T19:27:24Z

model api need to return the results of many queries. write a model service, and put the combining logic in there so it can be reused without calling a.GetModel() from inside a.UpdateModel later

I definitely agree with avoiding calling a.GetModel() from a.UpdateModel(), really under any circumstance. But what you call a "service layer"... do you just mean "some helper functions"? Or do you mean "an actor that you pass messages to and get messages back from"?

stoksc · 2022-03-22T19:36:27Z

i mean a struct called ModelService that has helper functions that encapsulate shared logic around models. Stuff like "validate this model before I insert it" or "composite these 3 queries into some other object" rather than that being duplicated or forgotten at every call site, or every set of endpoints using a different pattern for how to share/reuse this type of logic.

stoksc · 2022-03-22T19:38:22Z

in determined, a lot right now, this layer is just done by an actor (think agents actor, jobs actor now, etc)

rb-determined-ai · 2022-03-22T19:50:30Z

i mean a struct called ModelService that has helper functions that encapsulate shared logic around models

But isn't this what the internal/db/models module would be for? What would it need to be a stateful object?

stoksc · 2022-03-22T20:35:41Z

But isn't this what the internal/db/models module would be for?

i thought internal/db/models was just for accessing the db, service layer would be for composite queries. if ckpts is in internal/db/ckpts and models is in internal/db/models, the model service that composites them would just be in internal/service/models so that it can call db/ckpts+models without ever hitting circular imports.

What would it need to be a stateful object?

it doesn't have to be. it's just helper methods in a special place. (previously it may've just had a config struct or a db, but the db is a singleton at least)

ioga · 2022-03-22T20:45:07Z

We are inching towards a model-view-controller here. "Service layer" sounds like a controller.

Similarly to the shortcuts code, I think the code should be structured not by layer but by "entity". i.e. internal/workflows/ directory will have:

api.go for "view" part: serialization between protobufs and internal data types.
models.go for "model" part: database-related code, basic business logic such as helpers or methods on model instances.
controller.go for "controller": composite business logic, stateful objects, watchers, polling, heartbeats, etc. in practice, simple APIs don't need that and it can be folded either into models or into view.

stoksc · 2022-03-22T20:58:24Z

this is exactly what i'm pushing for, just pushed around some (sorry, now it is obvious i used to write a lot of vanilla .net core apis, but the pattern works "just fine"). the users code is fairly close to this already https://github.com/determined-ai/determined/tree/master/master/internal/user

stoksc · 2022-03-22T21:01:07Z

@ioga the only issue is defining the apis in a different package than the apiServer is basically impossible.

rb-determined-ai · 2022-03-22T21:03:33Z

I think the code should be structured not by layer but by "entity". i.e. internal/workflows/ directory will have:

This doesn't seem like it would solve the circular dependency problem in the way that @stoksc's suggestion would. Where would compound queries go? Or would we still have internal/db/* in addition to internal/workflows and internal/ckpts, etc?

@ioga the only issue is defining the apis in a different package than the apiServer is basically impossible.

I don't see why this wouldn't work:

type struct apiServer {
    models.Server
    workflows.Server
    users.Server
}

stoksc · 2022-03-22T21:06:28Z

obviously i dont read well, let me come back to this when i have more than 5m straight (~30m)

ioga · 2022-03-22T21:09:16Z

This doesn't seem like it would solve the circular dependency problem in the way that @stoksc's suggestion would. Where would compound queries go? Or would we still have internal/db/* in addition to internal/workflows and internal/ckpts, etc?

The "entities" should form a dependency tree. 3 options:

workflows depend on checkpoints and import them.
checkpoints depend on workflows and import them.
if they're tightly intertwined and depend on each other, add separate compound entity "checkpointed_workflows" — it would probably not have models.go, only api and controller.

rb-determined-ai · 2022-03-22T21:35:53Z

So then what about leaving api.go in the internal/workflows and use:

type struct apiServer {
    m *Master
    workflows.Server
}

?

I tried using composition like this to define one of the required methods of the apiServer and it still compiles.

stoksc · 2022-03-22T21:51:25Z

are we all in agreement now? seems like it?

mapmeld · 2022-03-23T18:11:37Z

master/internal/models/models.go

+	err := idb.NewSelect().
+		Model(&out).
+		Where("modelId = ?", modelId).
+		Where("version = ?", version).


This is different from what we currently do in WebUI. In our current setup /models/Tester/versions/41/overview , links to model_versions.id = 41, not the auto-incremented model_versions.version.

good catch. I think I messed up my interpretation of the original sql.

So it should be:

err := idb.NewSelect(). Model(&out). Where("model_id = ?", modelId). Where("id = ?", version). Scan(ctx)

right?

since ID is unique I think it can safely be just Where("id = ?", version).

The old code looks buggy to me.

The go code was:

QueryProto("get_model_version", resp.ModelVersion, parentModel.Model.Id, req.ModelVersion)

The sql was:

WITH mv AS ( SELECT version, checkpoint_uuid, model_versions.id, creation_time, name, comment, metadata, labels, notes, username, last_updated_time FROM model_versions LEFT JOIN users ON users.id = model_versions.user_id WHERE model_id = $1 AND model_versions.id = $2 ),

It looks like we were previously matching model_versions.id against req.ModelVersion.

the increment is being done in insert_model_version but maybe should be defined by the table
(SELECT COALESCE(max(version), 0) + 1 FROM model_versions WHERE model_id = $1)

I'd say req.ModelVersion is named poorly rather than it being a bug, because in the Model page we link to ModelVersions by the id field: https://github.com/determined-ai/determined/blob/master/webui/react/src/components/Table.tsx#L167-L182 - it's a long time ago but I think I had a convo about which one we wanted to use

~~wait... isn't the id field already an autoincrement field? I thought that's what the nextval('model_versions_id_seq'::regclass) thing meant. Why are we doing a manual "auto" increment on insert?~~

Nevermind, I misunderstood the SQL I was looking at again... I understand now

I'd say req.ModelVersion is named poorly rather than it being a bug

ok.

master/pkg/model/model.go

mapmeld · 2022-03-23T18:17:59Z

proto/src/determined/checkpoint/v1/checkpoint.proto

+  // Validation metrics reported at the same latest_batch as the checkpoint.
+  google.protobuf.Struct validation_metrics = 6;
+  // TODO: is this actually a good idea?
+  google.protobuf.DoubleValue searcher_metric = 17;


this was recently added to make checkpoints sort-able by searcher_metric (even server-side needs it to appear here to read it out of the DB) --- if there's a better way then I'm open to it.

on a related note, does ProtoConverter handle these nullable types (google.protobuf.DoubleValue, google.protobuf.StringValue) well?

The TODO was mostly a note to myself. This is a very training-specific detail, and more specific than that, it tightly couples several details together (experiment config, searcher apis, metrics apis, and checkpoints apis). If we already have it, I don't think it's going away overnight, but the note was more along the lines of, "what would be a better way to keep these things more loosely coupled in the future?".

On a related note, does ProtoConverter handle these nullable types (google.protobuf.DoubleValue, google.protobuf.StringValue) well?

If there's a type it doesn't handle yet... just add it. Really the ProtoConverter just makes it easier to copy from normal go types (int, time.Time, or map[string]interface{}) to protobuf equivalents without a zillion error checks and intermediate values.

rb-determined-ai · 2022-03-23T23:52:21Z

master/internal/ckpts/api.go

+	for _, vstate := range req.ValidationStates {
+		switch vstate {
+		case checkpointv1.State_STATE_ACTIVE:
+			validationStates = append(validationStates, "'ACTIVE'")


remove extra layers of quotes

This includes: - a TODO listing the queries to be changed and the work done on those already completed - creates a view called checkpoints_expanded that collects the data needed to populate the protobuf Checkpoint object from various tables - bun-ifying the model.Checkpoint, model.Model, and model.Version objects. model.CheckpointExpanded was added to have a bun-ified struct for querying the checkpoints_expanded view. - Introducing internal/ckpts and internal/models - a bunch of XXX comments that need to be resolved before landing this

rb-determined-ai · 2022-05-03T16:26:55Z

closing in favor of #3859.

rb-determined-ai requested a review from stoksc February 18, 2022 13:25

rb-determined-ai self-assigned this Feb 18, 2022

cla-bot bot added the cla-signed label Feb 18, 2022

rb-determined-ai force-pushed the checkpoints branch 2 times, most recently from 4d74f54 to 2eefccb Compare March 16, 2022 23:17

ioga reviewed Mar 21, 2022

View reviewed changes

rb-determined-ai force-pushed the checkpoints branch from 2eefccb to 511076f Compare March 21, 2022 19:33

rb-determined-ai commented Mar 21, 2022

View reviewed changes

rb-determined-ai force-pushed the checkpoints branch from 511076f to 30f77ed Compare March 21, 2022 23:20

mapmeld reviewed Mar 23, 2022

View reviewed changes

master/pkg/model/model.go Outdated Show resolved Hide resolved

mapmeld reviewed Mar 23, 2022

View reviewed changes

rb-determined-ai force-pushed the checkpoints branch from 30f77ed to 17e6d26 Compare March 23, 2022 20:58

rb-determined-ai commented Mar 23, 2022

View reviewed changes

rb-determined-ai force-pushed the checkpoints branch from 25497d0 to 8b35cf5 Compare March 24, 2022 17:09

rb-determined-ai added 2 commits March 25, 2022 10:42

create bun-ready model structs

ec7583c

rb-determined-ai force-pushed the checkpoints branch from 8b35cf5 to 2fce3ee Compare March 25, 2022 22:10

intial work on task3

2cb88bb

rb-determined-ai force-pushed the checkpoints branch from 2fce3ee to 2cb88bb Compare March 25, 2022 22:36

rb-determined-ai closed this May 3, 2022

dannysauer deleted the checkpoints branch August 19, 2022 19:15

feat: non-training-specific checkpoints #3630

feat: non-training-specific checkpoints #3630

Conversation

rb-determined-ai commented Feb 18, 2022

Description

Test Plan

Commentary (optional)

netlify bot commented Feb 18, 2022 • edited

✅ Deploy Preview for determined-ui canceled.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stoksc Mar 21, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rb-determined-ai Mar 21, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rb-determined-ai commented Mar 21, 2022

stoksc commented Mar 21, 2022

ioga commented Mar 22, 2022

stoksc commented Mar 22, 2022

rb-determined-ai commented Mar 22, 2022

stoksc commented Mar 22, 2022

rb-determined-ai commented Mar 22, 2022

stoksc commented Mar 22, 2022

stoksc commented Mar 22, 2022

rb-determined-ai commented Mar 22, 2022

stoksc commented Mar 22, 2022

ioga commented Mar 22, 2022

stoksc commented Mar 22, 2022 • edited

stoksc commented Mar 22, 2022

rb-determined-ai commented Mar 22, 2022

stoksc commented Mar 22, 2022

ioga commented Mar 22, 2022 • edited

rb-determined-ai commented Mar 22, 2022 • edited

stoksc commented Mar 22, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rb-determined-ai Mar 23, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rb-determined-ai commented May 3, 2022

netlify bot commented Feb 18, 2022 •

edited

stoksc Mar 21, 2022 •

edited

rb-determined-ai Mar 21, 2022 •

edited

stoksc commented Mar 22, 2022 •

edited

ioga commented Mar 22, 2022 •

edited

rb-determined-ai commented Mar 22, 2022 •

edited

rb-determined-ai Mar 23, 2022 •

edited