perf: replace UNION with UNION ALL to avoid costly deduplication #372

Merged
halconel merged 2 commits into epoch8:feat/offsets from halconel:Looky-7769/offsets
Feb 2, 2026

@halconel halconel commented Jan 30, 2026

Summary

  1. Replace UNION with UNION ALL to avoid costly HashAggregate deduplication
  2. Filter NaN values from SQL IN clauses to prevent type mismatch errors

Problem 1: UNION deduplication overhead

The current implementation uses UNION, which forces PostgreSQL to:

  1. Read ALL rows from both parts (update_ts and delete_ts)
  2. Perform expensive HashAggregate deduplication
  3. Only then apply LIMIT

This causes slow queries (~818ms) even with proper indexes.

Solution

Use UNION ALL with explicit duplicate exclusion in WHERE:

  • Part 1: WHERE update_ts > offset
  • Part 2: WHERE delete_ts > offset AND update_ts <= offset

The two WHERE clauses are mutually exclusive, so no row can appear in both parts. No deduplication step is needed, and LIMIT can be applied early.
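
A minimal sketch of why the rewritten query returns the same rows, using an in-memory SQLite database rather than PostgreSQL. The table name `meta`, the sample rows, and the offset value are made up for illustration; only the `update_ts` / `delete_ts` predicates come from this PR. It shows equivalence of results only, not the performance gain.

```python
import sqlite3

# Hypothetical metadata table; only update_ts and delete_ts mirror the PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta (id TEXT PRIMARY KEY, update_ts REAL, delete_ts REAL)"
)
conn.executemany(
    "INSERT INTO meta VALUES (?, ?, ?)",
    [
        ("a", 5, None),   # older than the offset, never deleted
        ("b", 15, None),  # updated after the offset
        ("c", 8, 12),     # deleted after the offset, updated before it
        ("d", 20, 25),    # updated AND deleted after the offset
    ],
)
offset = 10

# Old form: row "d" matches both branches, so UNION must deduplicate.
union_rows = conn.execute(
    """
    SELECT id FROM meta WHERE update_ts > :offset
    UNION
    SELECT id FROM meta WHERE delete_ts > :offset
    """,
    {"offset": offset},
).fetchall()

# New form: the second branch excludes rows already matched by the first,
# so UNION ALL cannot produce duplicates.
union_all_rows = conn.execute(
    """
    SELECT id FROM meta WHERE update_ts > :offset
    UNION ALL
    SELECT id FROM meta WHERE delete_ts > :offset AND update_ts <= :offset
    """,
    {"offset": offset},
).fetchall()

union_ids = sorted(row[0] for row in union_rows)
union_all_ids = sorted(row[0] for row in union_all_rows)
```

Row `d` satisfies both `update_ts > offset` and `delete_ts > offset`, which is the case UNION had to deduplicate. With the extra `update_ts <= offset` condition it is returned only by the first branch.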

Performance Impact: about 4,750x faster (818ms → 0.172ms in production)

Problem 2: NaN values cause type mismatch

When DataFrame columns contain NaN values (common for nullable foreign keys such as profile_id), pandas passes them to SQL queries as Python float('nan'):

WHERE (id, profile_id) IN (
    ('uuid1', 'uuid2'),
    ('uuid3', nan)  -- Python float, not SQL NULL
)

PostgreSQL error: operator does not exist: character varying = double precision

This causes:

  • 87% query failure rate in affected transformations
  • Offset stuck at old timestamps (90+ days lag)
  • Production incidents (discovered 29-30 Jan 2026)

Solution

Filter NaN values before building SQL IN clauses:

  • Single key: idx[key].dropna().to_list()
  • Multiple keys: idx[primary_keys].dropna()

This follows the pattern from commit b3e7cc5, which fixed the same issue for join_keys.
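
The two `dropna()` forms above can be sketched on a small DataFrame. This is an illustration only, not the body of `sql_apply_idx_filter_to_table`; the sample values and the variable names `single_key_values` and `multi_key_rows` are assumptions.

```python
import pandas as pd

# Hypothetical index frame: the second row has a missing profile_id.
idx = pd.DataFrame(
    {
        "id": ["uuid1", "uuid3"],
        "profile_id": ["uuid2", float("nan")],
    }
)

# Single key: drop NaN before the values go into an IN (...) list.
single_key_values = idx["profile_id"].dropna().to_list()

# Multiple keys: drop every row in which any key column is NaN,
# then turn the remaining rows into tuples for a row-value IN clause.
primary_keys = ["id", "profile_id"]
multi_key_rows = list(
    idx[primary_keys].dropna().itertuples(index=False, name=None)
)
```

Note that `dropna()` on several columns removes the whole row when any key is missing, so `('uuid3', nan)` never reaches the query.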

Changes

Modified files:

  1. datapipe/meta/sql_meta.py - UNION ALL optimization in 3 functions
  2. datapipe/sql_util.py - NaN filtering in sql_apply_idx_filter_to_table

All offset tests pass (64 tests).

@halconel halconel merged commit 8a477c2 into epoch8:feat/offsets Feb 2, 2026
37 checks passed