BigQueryStreamingBufferEmptySensor can falsely report an empty streaming buffer (metadata lag) #66963

@shahar1

Description

Apache Airflow Provider(s)

google

What happened?

BigQueryStreamingBufferEmptySensor (added in #66652) decides the streaming buffer is empty by checking table.streaming_buffer is None. That check is unreliable because BigQuery's streamingBuffer table-metadata field is eventually consistent.

For a window of several seconds after a streaming insert, the row is physically in the streaming buffer but table.streaming_buffer is still None. During that window the sensor reports the buffer empty — a false negative.
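The failing check reduces to a single attribute test. A minimal sketch of that logic (FakeTable and buffer_looks_empty are illustrative names standing in for the provider's code, not its actual API):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FakeTable:
    """Stand-in for google.cloud.bigquery.Table (illustrative only)."""
    streaming_buffer: Optional[Any]

def buffer_looks_empty(table) -> bool:
    # Mirrors the sensor's check: absent metadata is read as "empty",
    # even during the lag window right after a streaming insert.
    return table.streaming_buffer is None
```

During the lag window get_table() returns a table whose streaming_buffer is still None, so a check of this shape passes immediately after an insert.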

streaming_buffer is None is therefore ambiguous — it means either:

  • "fully flushed / table never had streamed rows", or
  • "rows were just streamed in, metadata hasn't caught up yet".

The sensor cannot tell these apart, so a DML task placed downstream of it (UPDATE/DELETE/MERGE) can still fail with "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer", which is exactly the error the sensor exists to prevent.

Evidence

Empirical timing check against real BigQuery — stream one row, then poll get_table().streaming_buffer:

t=   1.2s     streaming_buffer = None                        -> sensor reports EMPTY (false)
t=  11.6s     streaming_buffer present (estimated_rows=1)    -> sensor WAITs (correct)
t= 22s..302s  still present, estimated_rows=1                -> stays correct

There is a ~10–12s false-empty window. System tests run with the simulated executor (tasks run back-to-back, near-zero scheduling overhead), so the sensor's first poke fires ~1–2s after the upstream streaming-insert task finishes — squarely inside that window.
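The probe behind these timings can be sketched as a generic polling helper (the helper name, timeouts, and interval are illustrative; against real BigQuery you would pass something like `lambda: client.get_table(ref).streaming_buffer`):

```python
import time

def time_until_buffer_visible(get_streaming_buffer, timeout=300.0, interval=1.0):
    """Poll a zero-arg callable until it returns a non-None value.

    Returns the elapsed seconds at the first non-None reading, or None
    if the timeout expires first. Used here to measure how long the
    streamingBuffer metadata field lags the actual insert.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if get_streaming_buffer() is not None:
            return time.monotonic() - start
        time.sleep(interval)
    return None
```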

What you think should happen instead

The sensor should not report an empty buffer while rows are still buffered. Possible directions to evaluate:

  • Require the sensor to observe a non-empty → empty transition rather than trusting a single None reading.
  • Otherwise, clearly document the eventual-consistency limitation in the sensor docstring and treat it as best-effort, so users know to add their own guard between a streaming insert and the sensor.
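The first direction (trusting None only after a non-empty reading, or after a grace period covering the metadata lag) could be sketched as a small state machine. Everything here is an assumption for discussion, including the ~15s grace bound; it is not the provider's code:

```python
import time

class TransitionAwarePoke:
    """Sketch of the 'observe non-empty -> empty transition' idea.

    A None reading is trusted only once the buffer has been seen
    non-empty, or after a grace period (assumed bound on metadata lag).
    """

    def __init__(self, grace_seconds: float = 15.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self._clock = clock
        self._first_poke_at = None
        self._seen_non_empty = False

    def poke(self, streaming_buffer) -> bool:
        now = self._clock()
        if self._first_poke_at is None:
            self._first_poke_at = now
        if streaming_buffer is not None:
            self._seen_non_empty = True
            return False  # rows still buffered
        if self._seen_non_empty:
            return True   # non-empty -> empty transition observed
        # None without ever seeing non-empty: ambiguous, wait out the lag
        return now - self._first_poke_at >= self.grace_seconds
```

With this shape, a None reading in the first seconds after an insert no longer short-circuits the sensor, at the cost of delaying genuinely-empty tables by at most the grace period.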

Acceptance criteria: either the sensor no longer false-passes immediately after a streaming insert, or the limitation is clearly documented — at which point the workaround below can be removed.

Workaround currently in place

The example_bigquery_sensors system test's streaming_insert task polls get_table().streaming_buffer until it becomes non-None before completing, so the metadata is caught up before check_streaming_buffer_empty runs its first poke. This makes the system test deterministic but does not fix the sensor for real users. It is tagged in the test with a link back to this issue.
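The guard in the streaming_insert task amounts to blocking until the metadata catches up. A sketch of that guard (function name, timeout, and interval are illustrative; `client.get_table(...).streaming_buffer` follows the google-cloud-bigquery API):

```python
import time

def wait_for_streaming_buffer_metadata(client, table_ref, timeout=120.0, interval=5.0):
    """Block until get_table().streaming_buffer is non-None.

    Run this right after a streaming insert so that a downstream
    sensor's first poke never lands inside the metadata-lag window.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if client.get_table(table_ref).streaming_buffer is not None:
            return
        time.sleep(interval)
    raise TimeoutError(
        f"streamingBuffer metadata for {table_ref} not visible after {timeout}s"
    )
```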

How to reproduce

  1. Create a BigQuery table.
  2. Stream a row into it (BigQueryHook.insert_all).
  3. Immediately call BigQueryStreamingBufferEmptySensor.poke() — it returns True (empty) although the row is in the streaming buffer.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct


Drafted-by: Claude Code (Opus 4.7) (human reviewed)
