feat: non-training-specific checkpoints #3630
Conversation
(force-pushed 4d74f54 to 2eefccb)
master/internal/db/bun.go (outdated)

```go
// QueryExt is a helper struct to define custom .Where, .Order, and .Limit without a ton of inlined
// functions.
type QueryExt struct {
```
I am not super excited about this solution. Asymptotically it's a custom proxy that'll need to handle the entirety of bun's options.
Couple thoughts/options:
- Functions like `CheckpointsList` that use this don't add much before/after the query: basically only a `db.NewSelect().Model(...)` call before and error handling after. Is that worth the extra complexity? Both can be done inline in the "view" methods.
- We could have a `filterFn` approach similar to bun's `RunInTx` interface, i.e. the signature

```go
func CheckpointsList(ctx context.Context, filterFn func(*bun.SelectQuery) (*bun.SelectQuery, error))
```

and then at call-site

```go
// req is the incoming gRPC request
ckpts, err := CheckpointsList(ctx, func(query *bun.SelectQuery) (*bun.SelectQuery, error) {
	query = query.Where("trial_id = ?", req.Id)
	...
	return query, nil
})
```
I experimented with the `filterFn` but I really disliked it, because there are cases where you hit errors creating the filter (not running the filter). But when you do `ckpts, err := ckpts.List(ctx, func(*bun.SelectQuery) (*bun.SelectQuery, error))`, you don't know if your `err` came from user error (defining a nonsensical filter) or from the database call itself. This would be the difference between an invalid request error and a 500 error, so the calling code does need to distinguish.
I toyed with nesting closures, but it got pretty unreadable pretty quickly.
> This would be the difference between an invalid request error and a 500 error, so the calling code does need to distinguish.

At what point would we be building an invalid query because of user input?
If for some reason we are, then we could define the signature as something like

```go
func CheckpointsList(ctx context.Context, filterFn func(*bun.SelectQuery) (*bun.SelectQuery, error))
```

have invalid filters return a sentinel `errInvalidQuery`, and use `errors.Is`.
(force-pushed 2eefccb to 511076f)
master/internal/api_trials.go (outdated)

```go
return resp, db.Bun.RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error {
	total, err := ckpts.CountTx(ctx, tx, ext)
	if err != nil {
		return errors.Wrapf(err, "error counting checkpoints for trial %d from database", req.Id)
	}
...
		if len(req.States) != 0 && !found {
			return false
...
	// apply pagination
	ext = ext.Limit(int(req.Limit))
	ext = ext.Offset(int(req.Offset))
...
	ckpts, err := ckpts.ListTx(ctx, tx, ext)
	if err != nil {
		return errors.Wrapf(err, "error fetching checkpoints for trial %d from database", req.Id)
	}
```
@ioga this is the code in question, where it's really nice to have this QueryExt object.
If we decide that separate CountTx and ListTx calls are not acceptable anyway, then we should just rm -rf the QueryExt.
if we were to follow 1), i.e. simply inline it, then we'd be building a `bun.SelectQuery` from `db.NewSelect()` instead of `QueryExt`, call `query.Count(ctx)` to get the count, and then proceed with adding `Limit` and `Offset` to the same query.
on a separate note, I think this transaction block doesn't do anything. You get the "read committed" level by default, and these selects can see different committed data.
You've got to set the transaction to repeatable read if you want the count & list results to match.
related uptrace/bun#342
> I think this transaction block doesn't do anything

Interesting. So I'm fine with any of the following, in the order I would prefer them:
- don't bother with transactions on pagination data; it seems racy anyway. Pagination data will change from call to call, so, eh.
- increase the transaction isolation level
- switch to one megaquery, as with get_experiments. I see this as an optimization that complicates the code and requires much more SQL skill (more than I have, at least, and I'm the one having to write it). IMO that's worth it only once we've identified this code as a bottleneck; otherwise I think it's a premature optimization.
I'd vote for not bothering here.
@stoksc So you like the closures and you don't like QueryExt (totally fine), but what about the megaquery vs two separate queries?
two separate queries is fine.
better, even.
(force-pushed 511076f to 30f77ed)
Ok, @stoksc and @ioga, I would like to highlight three patterns I would like feedback on before going any further:

totally pointless comment: I think
I am on board. Let's have these as separate queries, and not bother with transactions and their isolation levels. The fallout from possibly "wrong" data here seems minimal. We should try to optimize only the expensive queries.
personally:

I'm not sure I understand. user.Service looks to me like a series of REST API endpoints (

the model api needs to return the results of many queries. write a model service, and put the combining logic in there so it can be reused without calling

I definitely agree with avoiding calling

i mean a struct called

in determined right now, a lot of this layer is just done by an actor (think the agents actor, the jobs actor, etc.)

But isn't this what the

i thought internal/db/models was just for accessing the db, and the service layer would be for composite queries. if ckpts is in internal/db/ckpts and models is in internal/db/models, the model service that composites them would just be in internal/service/models, so that it can call db/ckpts+models without ever hitting circular imports.

it doesn't have to be. it's just helper methods in a special place. (previously it may've just had a config struct or a db, but the db is a singleton at least)

We are inching towards a model-view-controller here. "Service layer" sounds like a controller. Similarly to the shortcuts code, I think the code should be structured not by layer but by "entity", i.e.

this is exactly what i'm pushing for, just pushed around some (sorry, now it is obvious i used to write a lot of vanilla .net core apis, but the pattern works "just fine"). the users code is fairly close to this already: https://github.com/determined-ai/determined/tree/master/master/internal/user

@ioga the only issue is that defining the apis in a different package than the apiServer is basically impossible.
This doesn't seem like it would solve the circular dependency problem in the way that @stoksc's suggestion would. Where would compound queries go? Or would we still have

I don't see why this wouldn't work:

```go
type apiServer struct {
	models.Server
	workflows.Server
	users.Server
}
```
obviously i don't read well, let me come back to this when i have more than 5m straight (~30m)

The "entities" should form a dependency tree. 3 options:
So then what about leaving

```go
type apiServer struct {
	m *Master
	workflows.Server
}
```

? I tried using composition like this to define one of the required methods of the apiServer and it still compiles.
are we all in agreement now? seems like it?
master/internal/models/models.go (outdated)

```go
err := idb.NewSelect().
	Model(&out).
	Where("modelId = ?", modelId).
	Where("version = ?", version).
```
This is different from what we currently do in the WebUI. In our current setup, `/models/Tester/versions/41/overview` links to `model_versions.id = 41`, not the auto-incremented `model_versions.version`.
good catch. I think I messed up my interpretation of the original sql.
So it should be:

```go
err := idb.NewSelect().
	Model(&out).
	Where("model_id = ?", modelId).
	Where("id = ?", version).
	Scan(ctx)
```

right?
Wait... `model_versions.version` is not auto-incremented, but `id` is:

```
                       Table "public.model_versions"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+--------------------------------------------
 version           | integer                  |           | not null |
 checkpoint_uuid   | uuid                     |           | not null |
 creation_time     | timestamp with time zone |           | not null |
 last_updated_time | timestamp with time zone |           |          |
 metadata          | jsonb                    |           |          |
 model_id          | integer                  |           | not null |
 id                | integer                  |           | not null | nextval('model_versions_id_seq'::regclass)
 name              | text                     |           |          |
 comment           | text                     |           |          |
 user_id           | integer                  |           |          |
 labels            | text[]                   |           |          |
 notes             | text                     |           |          |
```
since ID is unique I think it can safely be just `Where("id = ?", version)`.
The old code looks buggy to me.

The go code was:

```go
QueryProto("get_model_version", resp.ModelVersion, parentModel.Model.Id, req.ModelVersion)
```

The sql was:

```sql
WITH mv AS (
  SELECT version, checkpoint_uuid, model_versions.id, creation_time, name, comment, metadata, labels, notes, username, last_updated_time
  FROM model_versions
  LEFT JOIN users ON users.id = model_versions.user_id
  WHERE model_id = $1 AND model_versions.id = $2
),
```

It looks like we were previously matching `model_versions.id` against `req.ModelVersion`.
the increment is being done in `insert_model_version`:

```sql
(SELECT COALESCE(max(version), 0) + 1 FROM model_versions WHERE model_id = $1)
```

but maybe it should be defined by the table.

I'd say `req.ModelVersion` is named poorly rather than it being a bug, because on the Model page we link to ModelVersions by the id field: https://github.com/determined-ai/determined/blob/master/webui/react/src/components/Table.tsx#L167-L182 - it was a long time ago, but I think I had a convo about which one we wanted to use.
wait... isn't the `id` field already an autoincrement field? I thought that's what the `nextval('model_versions_id_seq'::regclass)` thing meant. Why are we doing a manual "auto" increment on insert?

Nevermind, I misunderstood the SQL I was looking at again... I understand now: the manual increment is for `version`, not `id`.
> I'd say req.ModelVersion is named poorly rather than it being a bug

ok.
```proto
// Validation metrics reported at the same latest_batch as the checkpoint.
google.protobuf.Struct validation_metrics = 6;
// TODO: is this actually a good idea?
google.protobuf.DoubleValue searcher_metric = 17;
```
this was recently added to make checkpoints sortable by searcher_metric (even the server side needs it to appear here to read it out of the DB) --- if there's a better way, then I'm open to it.
on a related note, does ProtoConverter handle these nullable types (google.protobuf.DoubleValue, google.protobuf.StringValue) well?
The TODO was mostly a note to myself. This is a very training-specific detail, and more specific than that, it tightly couples several details together (experiment config, searcher apis, metrics apis, and checkpoints apis). If we already have it, I don't think it's going away overnight, but the note was more along the lines of, "what would be a better way to keep these things more loosely coupled in the future?".
> On a related note, does ProtoConverter handle these nullable types (google.protobuf.DoubleValue, google.protobuf.StringValue) well?

If there's a type it doesn't handle yet... just add it. Really, the `ProtoConverter` just makes it easier to copy from normal go types (`int`, `time.Time`, or `map[string]interface{}`) to protobuf equivalents without a zillion error checks and intermediate values.
(force-pushed 30f77ed to 17e6d26)
```go
for _, vstate := range req.ValidationStates {
	switch vstate {
	case checkpointv1.State_STATE_ACTIVE:
		validationStates = append(validationStates, "'ACTIVE'")
```
remove extra layers of quotes
(force-pushed 25497d0 to 8b35cf5)
This includes:
- a TODO listing the queries to be changed and the work done on those already completed
- a view called `checkpoints_expanded` that collects the data needed to populate the protobuf Checkpoint object from various tables
- bun-ifying the `model.Checkpoint`, `model.Model`, and `model.Version` objects. `model.CheckpointExpanded` was added to have a bun-ified struct for querying the `checkpoints_expanded` view
- introducing `internal/ckpts` and `internal/models`
- a bunch of XXX comments that need to be resolved before landing this
(force-pushed 8b35cf5 to 2fce3ee)
(force-pushed 2fce3ee to 2cb88bb)
closing in favor of #3859.
Description
Checkpoints in Determined were originally designed exclusively for
training. We would like in the future to support checkpointing for all
tasks. To get there, we need to change what data we keep about
checkpoints. This change focuses on migrating existing
training-specific data into checkpoint metadata, modifying how
checkpoints are uploaded so that the Core API checkpointing does not
feel training-specific, and updating the WebUI to reflect the new
schema.
Test Plan
Commentary (optional)
This upstream feature branch will include backend work, webui client-side work, and python client-side work before it is complete.