perf: replace UNION with UNION ALL to avoid costly deduplication by halconel · Pull Request #372 · epoch8/datapipe

halconel · 2026-01-30T00:51:14Z

Summary

Replace UNION with UNION ALL to avoid costly HashAggregate deduplication
Filter NaN values from SQL IN clauses to prevent type mismatch errors

Problem 1: UNION deduplication overhead

The current implementation uses UNION, which forces PostgreSQL to:

Read ALL rows from both parts (update_ts and delete_ts)
Perform expensive HashAggregate deduplication
Only then apply LIMIT

This causes slow queries (~818ms) even with proper indexes.

Solution

Use UNION ALL with explicit duplicate exclusion in WHERE:

Part 1: WHERE update_ts > offset
Part 2: WHERE delete_ts > offset AND update_ts <= offset

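The two parts are disjoint by construction: a row with update_ts > offset can only appear in Part 1. No deduplication step is needed, so PostgreSQL can apply LIMIT early instead of reading and hashing every row first.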
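The equivalence of the two query shapes can be checked on a toy table. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL; the table name `meta`, the sample rows, and the helper `ids` are assumptions made for illustration and are not taken from datapipe/meta/sql_meta.py.

```python
import sqlite3

# Hypothetical table and rows; only update_ts and delete_ts come from the description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (id TEXT PRIMARY KEY, update_ts REAL, delete_ts REAL)")
conn.executemany(
    "INSERT INTO meta VALUES (?, ?, ?)",
    [
        ("a", 12.0, None),  # updated after the offset
        ("b", 12.0, 15.0),  # updated and deleted after the offset
        ("c", 3.0, 20.0),   # only deleted after the offset
        ("d", 1.0, None),   # unchanged since the offset
    ],
)
offset = 10.0


def ids(sql, params):
    """Run a query and return the ids in the order the database yields them."""
    return [row[0] for row in conn.execute(sql, params)]


# Old form: UNION has to deduplicate the combined result.
union_ids = ids(
    "SELECT id FROM meta WHERE update_ts > ? "
    "UNION "
    "SELECT id FROM meta WHERE delete_ts > ?",
    (offset, offset),
)

# Naive UNION ALL: row "b" satisfies both conditions and comes back twice.
naive_ids = ids(
    "SELECT id FROM meta WHERE update_ts > ? "
    "UNION ALL "
    "SELECT id FROM meta WHERE delete_ts > ?",
    (offset, offset),
)

# New form: the extra predicate keeps Part 2 from repeating rows of Part 1.
disjoint_ids = ids(
    "SELECT id FROM meta WHERE update_ts > ? "
    "UNION ALL "
    "SELECT id FROM meta WHERE delete_ts > ? AND update_ts <= ?",
    (offset, offset, offset),
)

print(sorted(union_ids), sorted(naive_ids), sorted(disjoint_ids))
```

This only demonstrates that the rewritten query returns the same set of rows without a deduplication pass. It does not reproduce the PostgreSQL query plan or the timings quoted in this description.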

Performance Impact: 4,757x faster (818ms → 0.172ms in production)

Problem 2: NaN values cause type mismatch

When DataFrame columns contain NaN values (common for nullable foreign keys such as profile_id), pandas passes them to SQL queries as Python float('nan'):

WHERE (id, profile_id) IN (
    ('uuid1', 'uuid2'),
    ('uuid3', nan)  -- Python float, not SQL NULL
)

PostgreSQL error: operator does not exist: character varying = double precision

This causes:

87% query failure rate in affected transformations
Offset stuck at old timestamps (90+ days lag)
Production incidents (discovered 29-30 Jan 2026)

Solution

Filter NaN values before building SQL IN clauses:

Single key: idx[key].dropna().to_list()
Multiple keys: idx[primary_keys].dropna()
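A minimal sketch of the two filtering calls, assuming a small index DataFrame named `idx` with the columns used in the example above. The sample values and the conversion to tuples via `itertuples` are illustrative assumptions and are not necessarily how sql_apply_idx_filter_to_table builds its IN clause.

```python
import pandas as pd

# Hypothetical index frame; column names follow the example in the description.
idx = pd.DataFrame(
    {
        "id": ["uuid1", "uuid3"],
        "profile_id": ["uuid2", float("nan")],
    }
)

# Single key: drop missing values before building the IN list.
single_key_values = idx["profile_id"].dropna().to_list()

# Multiple keys: drop every row in which any key column is missing,
# then turn the remaining rows into plain tuples for a tuple IN clause.
primary_keys = ["id", "profile_id"]
key_tuples = list(idx[primary_keys].dropna().itertuples(index=False, name=None))

print(single_key_values)
print(key_tuples)
```

Dropping these rows should not lose matches: under SQL comparison semantics a NULL inside an IN list never compares equal to anything, so a key tuple with a missing component could not have matched a row anyway.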

This follows the pattern from commit b3e7cc5, which fixed the same issue for join_keys.

Changes

Modified files:

datapipe/meta/sql_meta.py - UNION ALL optimization in 3 functions
datapipe/sql_util.py - NaN filtering in sql_apply_idx_filter_to_table

All offset tests pass (64 tests).

halconel added 2 commits January 30, 2026 03:48

cf8b966 perf: replace UNION with UNION ALL and exclude duplicates in WHERE to avoid costly deduplication

4a0d33a fix: filter NaN values in sql_apply_idx_filter_to_table to prevent VARCHAR-float type mismatch errors

halconel merged commit 8a477c2 into epoch8:feat/offsets Feb 2, 2026
37 checks passed

halconel mentioned this pull request Feb 6, 2026

feat: add offset diagnostic logging and max_records_per_run limit #373

Merged

halconel merged 2 commits into epoch8:feat/offsets from halconel:Looky-7769/offsets
