perf(in-filter): Update in filter to use any operator for performance improvements #165

zaidjan-devrev · 2025-11-08T14:14:49Z

https://app.devrev.ai/devrev/works/ISS-225995

🚀 Optimize IN and NOT IN Filter Query Generation with String Split Approach

Summary

This PR introduces a significant performance optimization for IN and NOT IN filter operations on string and number types by replacing the traditional AST node-per-value approach with a string_split + unnest + ANY subquery pattern. This dramatically reduces AST size and query generation time, especially for large filter value sets.

🎯 Problem Statement

Previously, when filtering with IN or NOT IN operators containing many values (e.g., 1000+ items), the query generation created one AST node per value. This resulted in:

- AST size growing linearly with the number of filter values
- Slow query generation times (120ms+ for 10,000 values)
- Performance degradation proportional to the number of filter values
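
To make the node-per-value growth concrete, here is a minimal sketch that builds a simplified IN tree and counts its nodes. The `AstNode` shape and the helper names are hypothetical and much simpler than the real DuckDB AST that meerkat emits; the point is only that the node count tracks the number of values.

```typescript
// Hypothetical, simplified node shape (assumption): the real AST carries more fields.
interface AstNode {
  kind: string;
  children: AstNode[];
}

// Node-per-value form: one constant node is created for every filter value.
function buildNaiveInAst(column: string, values: string[]): AstNode {
  const columnRef: AstNode = { kind: `COLUMN_REF:${column}`, children: [] };
  const constants: AstNode[] = values.map(
    (v): AstNode => ({ kind: `CONSTANT:${v}`, children: [] })
  );
  return { kind: 'COMPARE_IN', children: [columnRef, ...constants] };
}

// Count every node in the tree, including the root.
function countNodes(node: AstNode): number {
  return 1 + node.children.reduce((sum, child) => sum + countNodes(child), 0);
}

for (const n of [10, 1000, 10000]) {
  const values = Array.from({ length: n }, (_, i) => `value${i}`);
  // In this simplified model the count is n + 2 (root + column + n constants).
  console.log(n, countNodes(buildNaiveInAst('column', values)));
}
```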

✨ Solution

Implemented an optimized approach for string and number types that:

- Joins all filter values into a single delimited string, using §§ as the delimiter
- Uses DuckDB's string_split to split the string back into an array
- Applies unnest to convert the array into rows
- Wraps the result in an ANY subquery for comparison
- Casts the unnested values to DOUBLE when the column is numeric

This reduces the AST from O(n) nodes to O(1) nodes, where n is the number of filter values.
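
The steps above can be sketched as a small SQL string builder. This is an illustration only: the PR builds DuckDB AST nodes rather than SQL text, and `buildInCondition`, `sqlStringLiteral`, and `ScalarKind` are hypothetical names that do not come from the meerkat codebase.

```typescript
// Delimiter described in the PR; the constant name is an assumption.
const DELIMITER = '§§';

type ScalarKind = 'string' | 'number';

// Escape single quotes so the joined value list is a valid SQL string literal.
function sqlStringLiteral(text: string): string {
  return `'${text.replace(/'/g, "''")}'`;
}

// Build `column = ANY(SELECT unnest(string_split(<joined>, <delimiter>)))`,
// adding a CAST to DOUBLE for numeric columns.
function buildInCondition(
  column: string,
  values: Array<string | number>,
  kind: ScalarKind
): string {
  const joined = values.map(String).join(DELIMITER);
  const unnested =
    `unnest(string_split(${sqlStringLiteral(joined)}, ${sqlStringLiteral(DELIMITER)}))`;
  const projected = kind === 'number' ? `CAST(${unnested} AS DOUBLE)` : unnested;
  return `${column} = ANY(SELECT ${projected})`;
}

console.log(buildInCondition('customer_id', ['a1', 'b2', 'c3'], 'string'));
// customer_id = ANY(SELECT unnest(string_split('a1§§b2§§c3', '§§')))
console.log(buildInCondition('amount', [10, 20.5], 'number'));
// amount = ANY(SELECT CAST(unnest(string_split('10§§20.5', '§§')) AS DOUBLE))
```

However many values are passed, the generated expression keeps the same shape; only the string literal grows.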

📊 Performance Improvements

Query Creation Benchmark Results

Query time baseline vs Optimized
================================================================================
Size    | Baseline (ms) | Optimized (ms) | Improvement | Speedup | AST Reduction | Valid
-----------------------------------------------------------------------------------------------
10      | 1.67          | 0.78           | 53.4%       | 2.1x    | 41.7%         | ✅
50      | 1.72          | 0.84           | 50.9%       | 2.0x    | 85.6%         | ✅
100     | 1.95          | 0.69           | 64.7%       | 2.8x    | 92.0%         | ✅
500     | 7.56          | 0.69           | 90.9%       | 11.0x   | 97.0%         | ✅
1000    | 13.04         | 0.77           | 94.1%       | 16.9x   | 97.7%         | ✅
2000    | 23.93         | 1.13           | 95.3%       | 21.2x   | 97.9%         | ✅
5000    | 58.28         | 1.88           | 96.8%       | 31.0x   | 98.0%         | ✅
10000   | 120.63        | 4.14           | 96.6%       | 29.1x   | 98.0%         | ✅
================================================================================
✅ ALL TESTS COMPLETED - ALL RESULTS VALIDATED
================================================================================

Key Highlights

- Up to 31x faster query generation for large filter sets (5000 values)
- 96.8% improvement in query generation time at 5000 values
- 98% AST size reduction for large filter sets
- Sub-2ms query generation up to 5000 values and about 4ms at 10,000 values (vs 120ms+ baseline)
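
For readers checking the table, the Improvement and Speedup columns are consistent with the usual definitions, (baseline - optimized) / baseline and baseline / optimized. The benchmark script itself is not shown in this PR, so the formulas and names below are an assumption; the timings are copied from the table.

```typescript
interface BenchmarkRow {
  size: number;
  baselineMs: number;
  optimizedMs: number;
}

// Percentage of the baseline time that was saved (assumed definition).
function improvementPercent(row: BenchmarkRow): number {
  return ((row.baselineMs - row.optimizedMs) / row.baselineMs) * 100;
}

// How many times faster the optimized path is (assumed definition).
function speedup(row: BenchmarkRow): number {
  return row.baselineMs / row.optimizedMs;
}

// Timings taken from the benchmark table above.
const rows: BenchmarkRow[] = [
  { size: 5000, baselineMs: 58.28, optimizedMs: 1.88 },
  { size: 10000, baselineMs: 120.63, optimizedMs: 4.14 },
];

for (const row of rows) {
  console.log(
    `${row.size}: ${improvementPercent(row).toFixed(1)}% improvement, ` +
      `${speedup(row).toFixed(1)}x speedup`
  );
}
// 5000: 96.8% improvement, 31.0x speedup
// 10000: 96.6% improvement, 29.1x speedup
```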

🔧 Changes Made

Modified Files

meerkat-core/src/cube-filter-transformer/in/in.ts
- Added string split optimization for string and number types
- Maintained array overlap operator (&&) for string_array and number_array types
- Implemented ANY subquery pattern with unnest(string_split())
- Added type casting for numeric values

meerkat-core/src/cube-filter-transformer/not-in/not-in.ts
- Applied same optimization pattern wrapped in OPERATOR_NOT
- Consistent handling across both IN and NOT IN operators

Test Files
- meerkat-core/src/cube-filter-transformer/in/in.spec.ts
- meerkat-core/src/cube-filter-transformer/not-in/not-in.spec.ts
- meerkat-node/src/__tests__/test-data.ts
- Enhanced test coverage for new optimization paths
- Added validation for string and number type filtering

Package Versions
- Updated versions in meerkat-browser, meerkat-core, and meerkat-node

🎨 Technical Details

Example Transformation

Before: IN filter with 1000 values creates 1000+ AST nodes

column IN (value1, value2, ..., value1000)

After: a single string_split operation (numeric column shown; string columns omit the CAST)

column = ANY(
  SELECT CAST(unnest(string_split('value1§§value2§§...§§value1000', '§§')) AS DOUBLE)
)

Type Handling

- string and number: Uses optimized string_split approach
- string_array and number_array: Uses array overlap operator (&&)
- Other types: Falls back to standard COMPARE_IN approach
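
The type handling amounts to a three-way dispatch. A minimal sketch follows, assuming the member type arrives as a plain string; the `InStrategy` labels and `chooseInStrategy` are hypothetical names used only for illustration and are not taken from in.ts.

```typescript
// Hypothetical labels for the three code paths described above.
type InStrategy = 'string_split_any' | 'array_overlap' | 'compare_in';

// Pick the code path for an IN / NOT IN filter from the member type.
function chooseInStrategy(memberType: string): InStrategy {
  switch (memberType) {
    case 'string':
    case 'number':
      // Scalar types use the unnest(string_split(...)) + ANY pattern.
      return 'string_split_any';
    case 'string_array':
    case 'number_array':
      // Array-typed columns keep the array overlap operator (&&).
      return 'array_overlap';
    default:
      // Everything else falls back to the node-per-value COMPARE_IN form.
      return 'compare_in';
  }
}

console.log(chooseInStrategy('number')); // string_split_any
console.log(chooseInStrategy('string_array')); // array_overlap
console.log(chooseInStrategy('boolean')); // compare_in
```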

Implementation Details

The optimization is applied in the inDuckDbCondition and notInDuckDbCondition functions:

- Delimiter Selection: Uses §§ (section sign) as delimiter - uncommon in normal data
- Value Joining: All filter values are joined into a single string
- String Split: DuckDB's string_split function splits the string back into an array
- Unnest: Converts the array into individual rows
- Type Casting: For numeric types, wraps in CAST(...AS DOUBLE)
- Subquery Pattern: Uses ANY subquery with COMPARE_EQUAL for efficient comparison
- NOT IN: Wraps the entire subquery in OPERATOR_NOT
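
Two of these points are easy to show in a few lines: the NOT IN form is the same comparison wrapped in a negation, and the join/split round trip only works if no value contains the delimiter. The sketch below emits SQL text rather than AST nodes, and the guard is an assumption added for illustration; the PR does not say whether in.ts or not-in.ts performs such a check. All function names are hypothetical.

```typescript
// Delimiter described in the PR; the constant name is an assumption.
const DELIMITER = '§§';

// Hypothetical guard: a value containing the delimiter would be split into
// two values by string_split, silently changing the filter.
function assertDelimiterSafe(values: string[]): void {
  const offending = values.find((v) => v.includes(DELIMITER));
  if (offending !== undefined) {
    throw new Error(`Filter value contains the delimiter: ${offending}`);
  }
}

// Build the NOT IN form for a string column: the ANY comparison wrapped in NOT.
function buildNotInCondition(column: string, values: string[]): string {
  assertDelimiterSafe(values);
  // Escape single quotes so the joined list stays a valid SQL string literal.
  const joined = values.join(DELIMITER).replace(/'/g, "''");
  return `NOT (${column} = ANY(SELECT unnest(string_split('${joined}', '${DELIMITER}'))))`;
}

console.log(buildNotInCondition('status', ['open', 'closed']));
// NOT (status = ANY(SELECT unnest(string_split('open§§closed', '§§'))))
```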

✅ Testing

✅ All existing tests pass
✅ New test cases added for string and number filtering
✅ Benchmark validation confirms correctness across all test sizes
✅ No breaking changes to existing functionality
✅ Validated with 10, 50, 100, 500, 1000, 2000, 5000, and 10000 value filter sets

📈 Impact

This optimization is particularly beneficial for:

- Large filter sets (100+ values)
- Dashboard queries with multi-select filters
- Bulk data filtering operations
- Any scenario using IN/NOT IN with many values

🔍 Performance Analysis

AST Size Reduction

The optimization achieves dramatic AST size reductions:

- Small sets (10 values): 41.7% reduction
- Medium sets (100 values): 92.0% reduction
- Large sets (1000+ values): 97.7%+ reduction

Query Generation Time

Query generation time remains consistently low:

- Baseline: Scales linearly with filter size (120ms at 10k values)
- Optimized: Remains under 5ms even at 10k values
- Improvement: 2x-31x faster depending on filter size

zaidjan-devrev added 2 commits November 8, 2025 19:42

added any filter

a69ddb0

added not in operator update

dae2358

zaidjan-devrev force-pushed the ISS-225995 branch 2 times, most recently from 7561be2 to 8fe297c, November 10, 2025 11:06

smaller AST

3951c14

zaidjan-devrev force-pushed the ISS-225995 branch from 8fe297c to 3951c14, November 10, 2025 11:15

zaidjan-devrev added 3 commits November 10, 2025 17:52

recd '/Users/zaidjan/Documents/Projects/meerkat'

6e69168

updated comments

b344b1e

updated versions

10e0a15

zaidjan-devrev marked this pull request as ready for review November 10, 2025 12:24

zaidjan-devrev requested review from itsTalwar and vpbs2 as code owners November 10, 2025 12:24

zaidjan-devrev added 5 commits November 10, 2025 18:21

added number and string based filtering with string split

6705786

common delimiter

b82aa8f

added combined test cases

8b3831f

added combined test cases

cbccc03

elongate the delimiter

b2fe578

shriramchandirasekaran-beep approved these changes Nov 10, 2025

vpbs2 approved these changes Nov 10, 2025

zaidjan-devrev merged commit 0a1a1d5 into main Nov 10, 2025
3 checks passed

zaidjan-devrev commented Nov 8, 2025 • edited by shriramchandirasekaran-beep
