Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
π [airbyte-cdk] Fix bug where substreams depending on an RFR parent stream don't paginate or use existing state #40671
π [airbyte-cdk] Fix bug where substreams depending on an RFR parent stream don't paginate or use existing state #40671
Changes from all commits
df655bf
b6febc8
fca7f71
94b4d77
c69c578
d114927
5d5a2c2
77cc58d
85c92e5
3979cbf
eec284b
2b0801b
bfde8a5
fe402e9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we raise an exception if parent_record is neither an
AirbyteMessage
nor aRecord
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are effectively calling the top level
read()
command which allows forStreamData
which could be a Mapping, I think we also need to account for that type, but if not those 3 (record, message, mapping), then we can throw the error. will addThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how will we map the record to the slice when
parent_record
is aMapping
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't in the end. granted like we discussed above, its only really for custom components thats don't return the right
Record
interface, but for a majority of our cases this should not be an issueThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one drawback is that we no longer checkpoint per slice like we used to, although not sure this was something we must retain given we lose some context going from read_records to read()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems fairly undesirable to me. Will we only checkpoint at the end of the sync?
naive question: could we instead keep the iteration on the stream_slices and expose a method to read a single, but complete slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like in order to do this, we would have to track in the model_to_component_factory which stream is a parent and if it is a parent, avoid instantiating
ResumableFullRefreshCursor
. It feels possible to me but it would require passing a new parameter all the way to_merge_stream_slicers
. This seems fair to me thoughThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely yield state after every slice. Otherwise, we risk having stuck syncs, where some transient error will result in a failed sync without any progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, alex and i discussed a bit on wednesday, it's a bit hacky, but one way we can do this is summarized in the above comment. By inspecting the
associated_slice
we can tell if we moved onto the next slice if it changes and checkpoint + emit the current set of records as mentioned in the new comment above.If this approach I implemented seems too crazy or prone to failure, then @maxi297 's suggestion to just not use RFR might be reasonable. Although the drawback is that we have effectively two different implementations of the parent stream at runtime. Since
incremental_dependency
will switch our the cursor for substream, it's a small gotchaThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still see cases where some sources use
stream_slices
andread_record
afterward (like this). Are we fine with those because they don't rely on low-code so they shouldn't be affected by the RFR cursor?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point and that's correct. This bug was only for low-code because all non-incremental and non-substreams for low-code automatically turned on RFR. whereas for python based sources streams are currently opt in based on a code change. I think we can live with this for now, but as we look into implementing auto-rfr for concurrent/python then we need to be aware.