Skip to content

Best Practices for Dynamic Filtering on Array Columns (e.g., ClickHouse Array(Int32)) #10048

@OrBarCycode

Description

@OrBarCycode

Problem

I'm using Cube.js with a ClickHouse data source and need to support dynamic filtering on an array column (Project IDs of type Array(Int32)) via the GraphQL API. Users should be able to provide an arbitrary list of values (e.g., [123, 456]) to filter rows where the array contains any of those values (OR semantics).
Cube.js doesn't natively support array types or operators like contains for arrays, so standard dimension filters append = ? to the WHERE clause, which breaks array functions (e.g., hasAny). I've tried several workarounds, but each has trade-offs (detailed below). I'm looking for:

Better approaches or best practices for this use case.
Confirmation if the "formatted string" method is acceptable (it works but feels hacky).
Any plans for native array support (e.g., custom operators like arrayContains)?

Attempts and Issues

  1. Unnest with arrayJoin:
    javascriptproject_id_element: {
    sql: arrayJoin(${CUBE}."Project IDs"),
    type: number
    }
    Issue: Filtering by multiple values duplicates measures (e.g., count) when multiple values match in one row.

  2. Direct Array Function:
    javascripthas_project_id: {
    sql: arrayExists(x -> x = {filterValue}, ${CUBE}."Project IDs"),
    type: boolean
    }
    Issue: Cube appends = ? to filters, breaking arrayExists or hasAny.

  3. Using FILTER_PARAMS in Segment:
    javascriptsegments: {
    project_filter: { sql: hasAny(${CUBE}."Project IDs", ${FILTER_PARAMS.project_ids}) }
    }
    Issue: Still gets wrapped in = ? or is hard to maintain with dynamic values.

  4. Format Array as String (Current Solution):
    javascriptproject_id_filter: {
    type: string,
    sql: arrayStringConcat(arrayMap(x -> concat('$', toString(x), '$'), ${CUBE}."Project IDs"), ',')
    }
    Works: Filters with contains on ["$123$", "$456$"] produce WHERE ... LIKE '%$123$%' OR ... LIKE '%$456$%'.
    Concerns: Brittle (client must format $x$), potential performance issues with string ops, edge cases (empty arrays).

Related Cube.js schema

cube(`violations`, {
  sql_table: `violations`,
  
  data_source: `default`,

  measures: {
    // Assuming some measures like count
    count: {
      type: `count`,
    },
  },

  dimensions: {
    id: {
      sql: `${CUBE}."ID"`,
      type: `string`,
      primary_key: true
    },
    // Note: "Project IDs" is the array column (Array(Int32))
    project_id: {
      type: `number`,
      sql: `${CUBE}."Project IDs"`,  // This doesn't work for filtering as-is
    },
    has_project_id: {
      sql: `arrayExists(x -> x = {filterValue}, ${CUBE}."Project IDs")`,
      type: `boolean`,
    },
    project_id_element: {
      sql: `arrayJoin(${CUBE}."Project IDs")`,
      type: `number`,  // Or string
    },
    project_id_filter: {
      type: `string`,
      sql: `arrayStringConcat(arrayMap(x -> concat('$', toString(x), '$'), ${CUBE}."Project IDs"), ',')`,
    },
  }
});```

**Related Cube.js generated SQL**

```sql
SELECT 
  someOtherDim,
  COUNT(*) AS count
FROM violations
WHERE hasAny("Project IDs", [123, 456])  -- OR equivalent with arrayExists
GROUP BY someOtherDim;

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionThe issue is a question. Please use Stack Overflow for questions.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions