fix(flows): FlowAppendView support for 'once' flag by forcing batch read #22
Conversation
LGTM
```python
spark = self.spark
spark_reader = spark.readStream
if self.once:
    spark_reader = spark.read
```
What happens in streaming flows if this is turned on? Have we tested this? And could we please add a sample for this as well, if possible?
I've successfully tested this manually, but I'm happy to add a sample as well.
Awesome, yeah, it would be good if you could add one to the feature samples under bronze. We use these as regression tests for now and will later turn them into tests for the CI/CD pipeline, so having a test for every feature would help make sure we don't break anything in future development.
Ok, sounds good.
I just finished testing this change on new samples locally with these results:
- When the source view has `mode: batch`, the flow succeeds (the batch view is inserted "once" into the target streaming table during the initial pipeline run, and then ignored in subsequent runs).
- When the source view has `mode: stream`, the flow fails with `View 'v_append_view_once_stream_flow' is a streaming view and must be referenced using readStream.`
This behaviour is expected, as customers should only ever use a batch view as the source for an append-once flow. As outlined in the append_flow documentation:
> Using once=True changes the flow: the return value must be a batch DataFrame in this case, not a streaming DataFrame.
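For context, a hypothetical DLT pipeline fragment showing what an append-once flow looks like on the user side (the table and view names here are made up; `once=True` on `dlt.append_flow` is the documented flag this PR supports):

```python
import dlt

# Hypothetical target streaming table that the one-time flow appends into.
dlt.create_streaming_table("target_table")

@dlt.append_flow(target="target_table", once=True)
def backfill_once():
    # With once=True the flow must return a batch DataFrame,
    # so the source is read with spark.read, not spark.readStream.
    return spark.read.table("v_batch_source")  # hypothetical batch source view
```

This fragment only executes inside a DLT pipeline run, which is why the fix forces the framework-side reader to `spark.read` when the flag is set.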
Here's a link to my test pipeline run: https://e2-demo-field-eng.cloud.databricks.com/pipelines/d55b8d12-d5c0-4e8e-85d9-234f3d99d594/updates/d2073a2f-c1ca-4094-9727-7b18212c67bd?o=1444828305810485
The relevant table is named `append_view_once_flow`.
…ead (#22)

* Fix FlowAppendView support for 'once' flag by forcing batch read
* Add new sample dataflow for append_view_once_flow
* Improve flow name in append_view_once dataflow
Implementing a fix for appending views using the 'once' flag by ensuring the returned value is a batch DataFrame, not a streaming DataFrame, as outlined in the append_flow documentation.

Currently, configuring the 'once' flag with a batch data source fails with:

`View is not a streaming view and must be referenced using read.`

I've tested this fix manually, and it resolves the issue: it successfully appends a one-time batch flow into the target streaming table, as expected.