Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update state message #12586

Merged
merged 8 commits into from
May 12, 2022
Merged

Update state message #12586

merged 8 commits into from
May 12, 2022

Conversation

benmoriceau
Copy link
Contributor

@benmoriceau benmoriceau commented May 4, 2022

What

Update the airbyte state message to support the per stream state.

The state message still contains the old way of storing the state in the data fields. It introduce 2 new fields to represent the global and the per stream state.

The new state message has been defined in: https://docs.google.com/document/d/1zJVUQkVy8ZJUcqzM_hYPGijr4pI0Xx6I2eqpagJ2r7E/edit#heading=h.voxzl9lf9qd1

@github-actions github-actions bot added area/platform issues related to the platform area/protocol labels May 4, 2022
Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benmoriceau was it discussed whether the connector should declare via spec.json if it manages global or per-stream state? this would allow transparency to the user about whether a table reset they perform in the UI will nuke their whole dataset vs. just one stream. Or is the idea that this determination would happen by reading the STATE message?

The type of state the other fields represent.
If not set, the state data is interpreted as GLOBAL and should be read from the `data` field for backwards compatibility.
GLOBAL means that the state should be read from `global` and means that it represents the state for all the streams.
PER_STREAM means that the state should be read from `streams`. Each item in the ist represents the state for the associated stream.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PER_STREAM means that the state should be read from `streams`. Each item in the ist represents the state for the associated stream.
PER_STREAM means that the state should be read from `streams`. Each item in the list represents the state for the associated stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

description: "per stream state data"
additionalProperties: false
properties:
name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name:
stream_name:

Copy link
Contributor Author

@benmoriceau benmoriceau May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it redundant with the name of the fields in the AirbyteSateMessage it will be something like stream[x].stream_name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly on this one, but I think I do lean towards sherif's suggestion. I don't think it's redundant because when you say an AirbyteStreamState has a name it isn't necessarily obvious that that name is a stream name. It could be interpreted as AirbyteStreamStates just have a name for some reason. doing with stream_name removes the ambiguity.

global:
"$ref": "#/definitions/AirbyteStateBlob"
streams:
type: array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this an array instead of a map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The map type doesn't exist in json schema. One version was using the additional properties for it but it was not clear: https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.4.2.1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would "additionalProperties": { "type": AirbyteStateBlob } work with object definition?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth exploring something like what Sherif proposed above. Otherwise we will need to need to sort through all of the streamStates in this array to find specific ones, rather than having direct access to a stream state given the stream name

Copy link
Contributor Author

@benmoriceau benmoriceau May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would "additionalProperties": { "type": AirbyteStateBlob } work with object definition?

This would work but it is not explicit that the key is a string plus the field is gonna be named additionalProperties and not streams (doc). We considered it at some point but abandoned it because of all the confusion it was bringing.
The tradeoff of rebuilding it as a map sounds better than having a field that it is not properly named and which can lead to have something unexpected in it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maps in JSONSchema have been pretty sus. i.e. it seems like a lot of the codegen doesn't know what to do with them. I think the array is going to lead to less headache.

Lake makes a good point on the sorting, but I think we can handle it.

additionalProperties: true
required:
- data
additionalProperties: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing this to false? I think that will make it harder in the future to add/remove fields to this object in a backwards-compatible way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field allows too many thing it is a Map<String, Object>, we can't know what is in it and we should avoid using it. I don't think it is being used at the moment and we shouldn't be using it in the future.

global:
"$ref": "#/definitions/AirbyteStateBlob"
streams:
type: array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth exploring something like what Sherif proposed above. Otherwise we will need to need to sort through all of the streamStates in this array to find specific ones, rather than having direct access to a stream state given the stream name

@lmossman
Copy link
Contributor

lmossman commented May 5, 2022

@benmoriceau was it discussed whether the connector should declare via spec.json if it manages global or per-stream state? this would allow transparency to the user about whether a table reset they perform in the UI will nuke their whole dataset vs. just one stream. Or is the idea that this determination would happen by reading the STATE message?

I will let Benoit confirm, but AFAIK the idea was that this would be determined by reading the most recent state from the connector, defaulting to GLOBAL if there is no state for that connector. Once we process the first PER_STREAM state message from that connector, it would change the behavior in the UI.

@CLAassistant
Copy link

CLAassistant commented May 5, 2022

CLA assistant check
All committers have signed the CLA.

@benmoriceau
Copy link
Contributor Author

I will let Benoit confirm, but AFAIK the idea was that this would be determined by reading the most recent state from the connector, defaulting to GLOBAL if there is no state for that connector. Once we process the first PER_STREAM state message from that connector, it would change the behavior in the UI.

We spoke about that offline with Sherif. Some connector will support both global and per stream update. The way we will know which state to pass will be by using the state table and the TBD per stream state table and see which one is the active one. As you say the default state will be GLOBAL if no state is present.

@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 17:08 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 17:08 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 17:34 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 17:35 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 18:44 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 18:44 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 19:40 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 5, 2022 19:40 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 9, 2022 17:08 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 9, 2022 17:08 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 10, 2022 15:58 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 10, 2022 15:58 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 10, 2022 20:31 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets May 10, 2022 20:31 Inactive
@benmoriceau benmoriceau requested a review from cgardens May 11, 2022 20:12
Copy link
Contributor

@cgardens cgardens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. i think it is worth doing name => stream_name

properties:
state_type:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't make this required because it wouldn't be backwards compatible, right?

global:
"$ref": "#/definitions/AirbyteStateBlob"
streams:
type: array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maps in JSONSchema have been pretty sus. i.e. it seems like a lot of the codegen doesn't know what to do with them. I think the array is going to lead to less headache.

Lake makes a good point on the sorting, but I think we can handle it.

description: "per stream state data"
additionalProperties: false
properties:
name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly on this one, but I think I do lean towards sherif's suggestion. I don't think it's redundant because when you say an AirbyteStreamState has a name it isn't necessarily obvious that that name is a stream name. It could be interpreted as AirbyteStreamStates just have a name for some reason. doing with stream_name removes the ambiguity.

@benmoriceau benmoriceau merged commit b3373b9 into master May 12, 2022
@benmoriceau benmoriceau deleted the bmoric/new-state-message branch May 12, 2022 16:00
suhomud pushed a commit that referenced this pull request May 23, 2022
What
Update the airbyte state message to support the per stream state.

The state message still contains the old way of storing the state in the data fields. It introduce 2 new fields to represent the global and the per stream state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants