@zaidjan-devrev zaidjan-devrev commented Nov 8, 2025

https://app.devrev.ai/devrev/works/ISS-225995

🚀 Optimize IN and NOT IN Filter Query Generation with String Split Approach

Summary

This PR optimizes IN and NOT IN filter operations on string and number types by replacing the AST node-per-value approach with a string_split + unnest + ANY subquery pattern. In the benchmarks below, this reduces AST size by up to 98% and query generation time by up to 96.8%, with the largest gains on large filter value sets.

🎯 Problem Statement

Previously, when filtering with IN or NOT IN operators containing many values (e.g., 1000+ items), the query generation created one AST node per value. This resulted in:

  • AST size growing linearly with the number of filter values
  • Slow query generation times (120ms+ for 10,000 values)
  • Performance degradation proportional to the number of filter values

✨ Solution

Implemented an optimized approach for string and number types that:

  1. Joins all filter values into a single delimited string using §§ as delimiter
  2. Uses DuckDB's string_split to split the string back into an array
  3. Applies unnest to convert the array into rows
  4. Wraps in an ANY subquery for comparison
  5. Casts to appropriate type for numeric values

This reduces the AST from O(n) nodes to O(1) nodes, where n is the number of filter values.
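The five steps above can be sketched as plain SQL text generation. This is a minimal illustration, not the PR's code: the PR builds DuckDB AST nodes, whereas the hypothetical `buildInCondition` and `sqlStringLiteral` helpers below emit the equivalent SQL string so the shape of the result is easy to read. The column name is also an assumption.

```typescript
// Hypothetical helpers for illustration only; the PR itself constructs AST nodes.
const DELIMITER = "§§";

type ScalarType = "string" | "number";

// Escape single quotes so the joined values form a valid SQL string literal.
function sqlStringLiteral(text: string): string {
  return `'${text.replace(/'/g, "''")}'`;
}

function buildInCondition(
  column: string,
  values: Array<string | number>,
  type: ScalarType
): string {
  // Step 1: join all filter values into one delimited string.
  const joined = values.map(String).join(DELIMITER);
  // Steps 2-3: split the string back into an array, then unnest it into rows.
  const rows = `unnest(string_split(${sqlStringLiteral(joined)}, ${sqlStringLiteral(DELIMITER)}))`;
  // Step 5: cast only when the column is numeric.
  const projected = type === "number" ? `CAST(${rows} AS DOUBLE)` : rows;
  // Step 4: compare the column against the subquery with ANY.
  return `${column} = ANY(SELECT ${projected})`;
}

const inSql = buildInCondition("customer_id", [1, 2, 3], "number");
console.log(inSql);
```

However many values are passed, the generated expression keeps the same structure; only the length of the single string literal grows.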

📊 Performance Improvements

Query Creation Benchmark Results

Query time baseline vs Optimized
================================================================================
Size    | Baseline (ms) | Optimized (ms) | Improvement | Speedup | AST Reduction | Valid
-----------------------------------------------------------------------------------------------
10      | 1.67          | 0.78           | 53.4%       | 2.1x    | 41.7%         | ✅
50      | 1.72          | 0.84           | 50.9%       | 2.0x    | 85.6%         | ✅
100     | 1.95          | 0.69           | 64.7%       | 2.8x    | 92.0%         | ✅
500     | 7.56          | 0.69           | 90.9%       | 11.0x   | 97.0%         | ✅
1000    | 13.04         | 0.77           | 94.1%       | 16.9x   | 97.7%         | ✅
2000    | 23.93         | 1.13           | 95.3%       | 21.2x   | 97.9%         | ✅
5000    | 58.28         | 1.88           | 96.8%       | 31.0x   | 98.0%         | ✅
10000   | 120.63        | 4.14           | 96.6%       | 29.1x   | 98.0%         | ✅
================================================================================
✅ ALL TESTS COMPLETED - ALL RESULTS VALIDATED
================================================================================

Key Highlights

  • Up to 31x faster query generation for large filter sets (5000 values)
  • 96.8% improvement in query generation time at 5000 values
  • 98% AST size reduction for large filter sets
  • Query generation stays under 2ms up to 5,000 values and reaches about 4ms at 10,000 values (vs 120ms+ baseline)
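The PR does not state how the Improvement and Speedup columns were computed. The sketch below assumes the conventional definitions (relative time saved, and the ratio of baseline to optimized time) and applies them to the 5000-value row. Because the table shows rounded timings, recomputing other rows from the displayed values may differ in the last digit.

```typescript
// Assumed formulas; the benchmark script itself is not shown in the PR.
function improvementPercent(baselineMs: number, optimizedMs: number): number {
  return ((baselineMs - optimizedMs) / baselineMs) * 100;
}

function speedup(baselineMs: number, optimizedMs: number): number {
  return baselineMs / optimizedMs;
}

// Timings copied from the 5000-value row of the benchmark table.
const row5000 = { baselineMs: 58.28, optimizedMs: 1.88 };

const improvement5000 = improvementPercent(row5000.baselineMs, row5000.optimizedMs).toFixed(1);
const speedup5000 = speedup(row5000.baselineMs, row5000.optimizedMs).toFixed(1);

console.log(`${improvement5000}% improvement, ${speedup5000}x speedup`);
```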

🔧 Changes Made

Modified Files

  • meerkat-core/src/cube-filter-transformer/in/in.ts

    • Added string split optimization for string and number types
    • Maintained array overlap operator (&&) for string_array and number_array types
    • Implemented ANY subquery pattern with unnest(string_split())
    • Added type casting for numeric values
  • meerkat-core/src/cube-filter-transformer/not-in/not-in.ts

    • Applied same optimization pattern wrapped in OPERATOR_NOT
    • Consistent handling across both IN and NOT IN operators
  • Test Files

    • meerkat-core/src/cube-filter-transformer/in/in.spec.ts
    • meerkat-core/src/cube-filter-transformer/not-in/not-in.spec.ts
    • meerkat-node/src/__tests__/test-data.ts
    • Enhanced test coverage for new optimization paths
    • Added validation for string and number type filtering
  • Package Versions

    • Updated versions in meerkat-browser, meerkat-core, and meerkat-node

🎨 Technical Details

Example Transformation

Before: IN filter with 1000 values creates 1000+ AST nodes

column IN (value1, value2, ..., value1000)

After: Single string split operation (numeric column shown; string columns omit the CAST)

column = ANY(
  SELECT CAST(unnest(string_split('value1§§value2§§...§§value1000', '§§')) AS DOUBLE)
)

Type Handling

  • string and number: Uses optimized string_split approach
  • string_array and number_array: Uses array overlap operator (&&)
  • Other types: Falls back to standard COMPARE_IN approach
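The type handling above amounts to a three-way dispatch on the member type. The following is a sketch only: `chooseInStrategy` and the strategy labels are hypothetical names, and the real functions return AST fragments rather than labels.

```typescript
// Hypothetical labels for the three code paths described in the list above.
type InStrategy = "string_split" | "array_overlap" | "compare_in";

function chooseInStrategy(memberType: string): InStrategy {
  switch (memberType) {
    case "string":
    case "number":
      // Scalar types take the optimized string_split path.
      return "string_split";
    case "string_array":
    case "number_array":
      // Array types keep the array overlap operator (&&).
      return "array_overlap";
    default:
      // Every other type falls back to the standard COMPARE_IN approach.
      return "compare_in";
  }
}

console.log(chooseInStrategy("number"));
console.log(chooseInStrategy("string_array"));
console.log(chooseInStrategy("time"));
```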

Implementation Details

The optimization is applied in the inDuckDbCondition and notInDuckDbCondition functions:

  1. Delimiter Selection: Uses §§ (section sign) as delimiter - uncommon in normal data
  2. Value Joining: All filter values are joined into a single string
  3. String Split: DuckDB's string_split function splits the string back into an array
  4. Unnest: Converts the array into individual rows
  5. Type Casting: For numeric types, wraps in CAST(...AS DOUBLE)
  6. Subquery Pattern: Uses ANY subquery with COMPARE_EQUAL for efficient comparison
  7. NOT IN: Wraps the entire subquery in OPERATOR_NOT
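To make the node-count argument concrete, here is a toy model of the two AST shapes. The `SketchNode` type and most node labels are invented for illustration (only COMPARE_IN, COMPARE_EQUAL and OPERATOR_NOT are names used in this PR), and DuckDB's real serialized AST carries far more fields per node, so the absolute counts are not meaningful. What the sketch shows is that one shape grows with the number of values while the other stays fixed.

```typescript
// Toy AST node; not the structure DuckDB or meerkat-core actually uses.
interface SketchNode {
  kind: string;
  value?: string;
  children: SketchNode[];
}

function node(kind: string, children: SketchNode[] = [], value?: string): SketchNode {
  return { kind, value, children };
}

function countNodes(root: SketchNode): number {
  return 1 + root.children.reduce((sum, child) => sum + countNodes(child), 0);
}

// Baseline shape: one constant node per filter value.
function perValueIn(column: string, values: string[]): SketchNode {
  const constants = values.map((v) => node("CONSTANT", [], v));
  return node("COMPARE_IN", [node("COLUMN_REF", [], column), ...constants]);
}

// Optimized shape: all values live inside a single string constant.
function splitIn(column: string, values: string[], numeric: boolean): SketchNode {
  const split = node("FUNCTION:string_split", [
    node("CONSTANT", [], values.join("§§")),
    node("CONSTANT", [], "§§"),
  ]);
  const unnested = node("FUNCTION:unnest", [split]);
  const projected = numeric ? node("CAST:DOUBLE", [unnested]) : unnested;
  const subquery = node("SUBQUERY:ANY", [projected]);
  return node("COMPARE_EQUAL", [node("COLUMN_REF", [], column), subquery]);
}

// NOT IN wraps the whole comparison in a single extra node.
function splitNotIn(column: string, values: string[], numeric: boolean): SketchNode {
  return node("OPERATOR_NOT", [splitIn(column, values, numeric)]);
}

const thousand = Array.from({ length: 1000 }, (_, i) => String(i));
console.log(countNodes(perValueIn("col", thousand)));
console.log(countNodes(splitNotIn("col", thousand, true)));
```

In this model the per-value tree has two nodes plus one per value, while the split-based tree has the same size for 10 values as for 1000.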

✅ Testing

  • ✅ All existing tests pass
  • ✅ New test cases added for string and number filtering
  • ✅ Benchmark validation confirms correctness across all test sizes
  • ✅ No breaking changes to existing functionality
  • ✅ Validated with 10, 50, 100, 500, 1000, 2000, 5000, and 10000 value filter sets
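One property worth checking alongside these tests is that joining and splitting reproduces the original value list. The sketch below emulates that round trip in plain TypeScript rather than in DuckDB, so it is an approximation of the behaviour, and `roundTrips` is a hypothetical helper. It also surfaces two edge cases the PR description does not discuss: a value that itself contains §§, and an empty value list.

```typescript
const DELIM = "§§";

// Returns true when join followed by split yields exactly the input values.
function roundTrips(values: string[]): boolean {
  const back = values.join(DELIM).split(DELIM);
  return back.length === values.length && back.every((v, i) => v === values[i]);
}

console.log(roundTrips(["alpha", "beta", "gamma"])); // ordinary values survive
console.log(roundTrips(["a§§b"]));                   // a value containing the delimiter is split in two
console.log(roundTrips([]));                         // an empty list comes back as one empty string
```

A caller that cannot rule out these inputs could fall back to the per-value approach for them; whether the merged code does so is not stated here.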

📈 Impact

This optimization is particularly beneficial for:

  • Large filter sets (100+ values)
  • Dashboard queries with multi-select filters
  • Bulk data filtering operations
  • Any scenario using IN/NOT IN with many values

🔍 Performance Analysis

AST Size Reduction

The optimization achieves dramatic AST size reductions:

  • Small sets (10 values): 41.7% reduction
  • Medium sets (100 values): 92.0% reduction
  • Large sets (1000+ values): 97.7%+ reduction

Query Generation Time

Query generation time remains consistently low:

  • Baseline: Scales linearly with filter size (120ms at 10k values)
  • Optimized: Remains under 5ms even at 10k values
  • Improvement: 2x-31x faster depending on filter size

@zaidjan-devrev zaidjan-devrev force-pushed the ISS-225995 branch 2 times, most recently from 7561be2 to 8fe297c November 10, 2025 11:06
@zaidjan-devrev zaidjan-devrev marked this pull request as ready for review November 10, 2025 12:24
@zaidjan-devrev zaidjan-devrev merged commit 0a1a1d5 into main Nov 10, 2025
3 checks passed