
feat: add model streaming #8973

Merged
merged 17 commits into main from gt/261-stream-model on Mar 19, 2024
Conversation

@gt2345 (Contributor) commented Mar 7, 2024

Description

Add model streaming; this is not yet exposed to the WebUI.
A separate PR for EE will handle RBAC.

Test Plan

Connect to ws://localhost:8080/stream; you should be able to subscribe and unsubscribe to model updates.
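
As a concrete sketch of that flow (a minimal example only: it assumes JSON subscribe/unsubscribe messages wrapping a ModelSubscriptionSpec-style payload, and the message envelope below is hypothetical; only the workspace_ids/model_ids keys come from this PR's struct):

import asyncio
import json

import websockets  # third-party: pip install websockets

async def main():
    async with websockets.connect("ws://localhost:8080/stream") as ws:
        # Subscribe to model events in workspace 1 (hypothetical envelope;
        # workspace_ids mirrors the ModelSubscriptionSpec json tag).
        await ws.send(json.dumps({"subscribe": {"models": {"workspace_ids": [1]}}}))
        # Print a few events as they arrive.
        for _ in range(3):
            print(json.loads(await ws.recv()))
        # Unsubscribe using the same spec (hypothetical envelope).
        await ws.send(json.dumps({"unsubscribe": {"models": {"workspace_ids": [1]}}}))

asyncio.run(main())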

Commentary (optional)

There is a separate ticket for adding unit tests, so they are not yet included in this PR.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

MD-261

@gt2345 gt2345 requested review from a team as code owners March 7, 2024 17:46
@cla-bot cla-bot bot added the cla-signed label Mar 7, 2024
netlify bot commented Mar 7, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: c33a76d
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/65f99fddcef390000835b543

@gt2345 gt2345 requested review from stoksc and corban-beaird and removed request for szewaiyuen6 and eecsliu March 7, 2024 17:47
DROP TRIGGER IF EXISTS stream_model_trigger_seq ON models;
CREATE TRIGGER stream_model_trigger_seq
BEFORE INSERT OR UPDATE OF
name, description, creation_time, last_updated_time, metadata, labels, user_id, archived, notes, workspace_id
Contributor:

It's likely outside the scope of this ticket; however, due to the limitations of postgres notify/listen queues, we'll want to communicate over these channels using only filterable columns. It doesn't look like the work to transition to that strategy is complete.

That being said, this will need to be adjusted alongside that work so that we don't exceed the 8k-character payload limit with user-defined fields.
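
For illustration, a minimal sketch of the kind of guard this constraint implies (pure Python; the payload shape, field names, and fallback behavior here are assumptions, not this PR's implementation):

import json

PG_NOTIFY_MAX_PAYLOAD = 8000  # bytes; postgres NOTIFY payloads must stay under this

def notify_payload(event: dict) -> str:
    # Try to send the full event inline.
    payload = json.dumps(event)
    if len(payload.encode("utf-8")) < PG_NOTIFY_MAX_PAYLOAD:
        return payload
    # User-defined fields (metadata, notes, ...) can blow past the limit,
    # so fall back to filterable identifiers only; the client would then
    # re-fetch the full row instead of receiving it inline.
    return json.dumps({"id": event["id"], "seq": event["seq"]})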

Contributor (author):

For models, I think it should be safe to exclude creation_time and user_id from the trigger, but do you mean we also want to exclude metadata and notes? I know we have a ticket to address the character limit; I wonder if that's still valid. Maybe @jgongd has more information?

Contributor:

The ticket is related to this issue. I plan to add unit tests next before focusing on this ticket. Let me know if you think this should take priority, and I'll be happy to tackle it first.

@corban-beaird (Contributor) left a comment:

The streaming updates code looks solid; could integration tests be added?


codecov bot commented Mar 10, 2024

Codecov Report

Attention: Patch coverage is 67.94258%, with 67 lines in your changes missing coverage. Please review.

Project coverage is 47.56%. Comparing base (68017dd) to head (c33a76d).
Report is 38 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8973      +/-   ##
==========================================
+ Coverage   47.55%   47.56%   +0.01%     
==========================================
  Files        1168     1169       +1     
  Lines      176706   176883     +177     
  Branches     2356     2353       -3     
==========================================
+ Hits        84026    84137     +111     
- Misses      92522    92588      +66     
  Partials      158      158              
Flag      Coverage Δ
backend   42.84% <74.13%> (+0.09%) ⬆️
harness   63.91% <37.14%> (-0.03%) ⬇️
web       42.89% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
master/internal/stream/authz_basic_impl.go 83.33% <100.00%> (+8.33%) ⬆️
master/internal/stream/messages.go 0.00% <ø> (ø)
master/internal/stream/projects.go 77.01% <100.00%> (+1.55%) ⬆️
master/internal/stream/subscription.go 94.82% <100.00%> (+2.14%) ⬆️
master/internal/stream/test_util.go 68.57% <100.00%> (+2.15%) ⬆️
master/cmd/stream-gen/main.go 0.00% <0.00%> (ø)
master/internal/stream/publisher.go 65.08% <92.30%> (+1.61%) ⬆️
harness/determined/common/streams/_client.py 80.00% <53.84%> (-1.40%) ⬇️
master/internal/stream/util.go 69.49% <71.42%> (+1.74%) ⬆️
harness/determined/common/streams/wire.py 69.33% <27.27%> (-17.46%) ⬇️
... and 1 more

... and 3 files with indirect coverage changes

@gt2345 (Contributor, author) commented Mar 11, 2024

> The streaming updates code looks solid; could integration tests be added?

Thanks! I refactored the project entity to use common utility functions, and I also added an integration test for streaming multiple entities.

// determined:stream-gen source=client
type ModelSubscriptionSpec struct {
WorkspaceIDs []int `json:"workspace_ids"`
ModelIDs []int `json:"Model_ids"`
Contributor:

Is this supposed to be capitalized?

// ModelSubscriptionSpec is what a user submits to define a Model subscription.
//
// determined:stream-gen source=client
type ModelSubscriptionSpec struct {
Contributor:

It might be helpful to be able to subscribe to model streaming updates by model name(s). I think in most cases users won't easily have access to the model ID, and the model name is unique.

For reference, the Python SDK accepts both model ID and model name as possible fetch parameters (client.get_model(ID|name)), and the CLI only accepts name (in det model describe NAME).

Contributor:

There's hesitancy to allow subscribing based on model name due to limitations of postgres notify/listen queues. To allow subscribing based on model names, we'd have to enforce a maximum length. Right now the maximum size of a model name is restricted only by the postgres index limit (8191 bytes), which would exceed the capacity of the postgres queues.

// ModelSubscriptionSpec is what a user submits to define a Model subscription.
//
// determined:stream-gen source=client
type ModelSubscriptionSpec struct {
Contributor:

Similarly, would it be a good idea to subscribe by user ID and/or labels? I'm imagining a use case where you don't want to have to figure out specific identifiers for the models you're interested in, so a broader subscription scope might be helpful.

Contributor:

I don't see an issue with allowing subscription based on user ID, assuming RBAC is implemented in EE.

I would be hesitant to allow label-based subscriptions for the same reason as above, especially since there doesn't seem to be a limit on the length of a tag.

@azhou-determined (Contributor) commented:

Should model streaming be added to the Python streaming client _client.py as well? Or is that not relevant to this PR?

@azhou-determined (Contributor) commented:

A few general design questions wrt streaming updates, not in the scope of this PR (feel free to redirect me if this isn't the right place to ask; just some thoughts I had while reading through the code as a newcomer to streaming updates 😄):

  1. It seems like streaming can only happen on a single table, and events returned to the client are also scoped to a single table. What's the plan for more complicated streaming use cases in the future? Say, for example, I want to listen for the latest validation metric for a given trial, or the best checkpoint for an experiment (something our SDK/CLI/REST APIs currently support).
     In the current streaming updates world, this means I would have to listen for updates on the experiments, trials, checkpoints, and also metrics tables? Or do we intend to allow joins, either in the SubscriptionSpecs or the returned events? I.e., if I subscribe to the runs table, there's a foreign key best_validation_id to the metrics table, and that FK row automatically gets joined and included in the event.

  2. Is the assumption that streaming is mainly for listening for updates to an object that has already been created? I'm realizing that for a lot of these SubscriptionSpecs we're going to require IDs, but in order to know the IDs, you'd have to have created the object before you start streaming; for some tables this means you basically can't listen for creation/insert events. In this case I wonder if broader filters in the SubscriptionSpec might be useful, like created_ts > now().

@azhou-determined (Contributor) left a comment:

LGTM. Commented a few suggestions/questions, but up to you whether they're relevant.

@gt2345 (Contributor, author) commented Mar 13, 2024

> A few general design questions wrt streaming updates, not in the scope of this PR (feel free to redirect me if this isn't the right place to ask; just some thoughts I had while reading through the code as a newcomer to streaming updates 😄):
>
>   1. It seems like streaming can only happen on a single table, and events returned to the client are also scoped to a single table. What's the plan for more complicated streaming use cases in the future? Say, for example, I want to listen for the latest validation metric for a given trial, or the best checkpoint for an experiment (something our SDK/CLI/REST APIs currently support). In the current streaming updates world, this means I would have to listen for updates on the experiments, trials, checkpoints, and also metrics tables? Or do we intend to allow joins, either in the SubscriptionSpecs or the returned events?
>   2. Is the assumption that streaming is mainly for listening for updates to an object that has already been created? I'm realizing that for a lot of these SubscriptionSpecs we're going to require IDs, but in order to know the IDs, you'd have to have created the object before you start streaming; for some tables this means you basically can't listen for creation/insert events. In this case I wonder if broader filters in the SubscriptionSpec might be useful, like created_ts > now().

Hi @azhou-determined, thanks for your review. To answer your questions:

  1. One subscription should target only one table in the database; the join is supposed to happen on the client side. Just as you said, we can listen for updates on multiple tables, then join the data on the front end (see the sketch after this list).

  2. Streaming updates only serve reads; for creating/updating objects we still use the REST APIs. Could you expand on the kinds of tables for which we can't listen for creation/insert events?
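
To illustrate the client-side join mentioned in (1), here's a minimal, hypothetical sketch (the event shapes, including the "type" key, are assumptions; real events from /stream may differ):

# Cache per-table events as they stream in, then join in application code.
models_by_workspace: dict[int, list] = {}
projects_by_workspace: dict[int, list] = {}

def handle_event(event: dict) -> None:
    # Each event is scoped to a single table.
    if event["type"] == "model":
        models_by_workspace.setdefault(event["workspace_id"], []).append(event)
    elif event["type"] == "project":
        projects_by_workspace.setdefault(event["workspace_id"], []).append(event)

def workspace_view(workspace_id: int) -> dict:
    # The "join": combine cached rows from both tables by workspace_id.
    return {
        "workspace_id": workspace_id,
        "models": models_by_workspace.get(workspace_id, []),
        "projects": projects_by_workspace.get(workspace_id, []),
    }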

type ModelSubscriptionSpec struct {
WorkspaceIDs []int `json:"workspace_ids"`
ModelIDs []int `json:"model_ids"`
ModelNames []string `json:"model_names"`
Contributor:

Are model names being restricted in length to ensure that we don't exceed the postgres payload limit?

Contributor (author):

I believe not. Thanks for your input; I decided to remove subscription by model name.

@gt2345 gt2345 requested a review from a team as a code owner March 18, 2024 22:26
@gt2345 gt2345 requested a review from wes-turner March 18, 2024 22:26
@corban-beaird (Contributor) left a comment:

Looks great! 😄

@gt2345 gt2345 merged commit 137bfcd into main Mar 19, 2024
70 of 83 checks passed
@gt2345 gt2345 deleted the gt/261-stream-model branch March 19, 2024 14:49
maxrussell pushed a commit that referenced this pull request Mar 21, 2024