Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query not using indices when using columnar storage #7524

Open
vbergeron-ledger opened this issue Feb 20, 2024 · 7 comments
Open

Query not using indices when using columnar storage #7524

vbergeron-ledger opened this issue Feb 20, 2024 · 7 comments
Labels

Comments

@vbergeron-ledger
Copy link

PG 15
Citus 12.1
I am evaluating the columnar storage to compress old data for our use case.
So far the compression ratio is great, but the querying is very slow.

After analysis, and despite defining a primary key index, all queries are scanning the whole table.

(Domain model is ethereum call data)

Schemas :

                                            Table "public.calls_columnar"
      Column      |   Type    | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
------------------+-----------+-----------+----------+---------+----------+-------------+--------------+-------------
 height           | height    |           | not null |         | plain    |             |              | 
 tx_index         | integer   |           | not null |         | plain    |             |              | 
 trace_index      | integer   |           | not null |         | plain    |             |              | 
 trace_address    | integer[] |           |          |         | extended |             |              | 
 subtraces        | integer   |           |          |         | plain    |             |              | 
 error            | text      |           |          |         | extended |             |              | 
 action_from      | bytea     |           |          |         | extended |             |              | 
 action_to        | bytea     |           |          |         | extended |             |              | 
 action_value     | numeric   |           |          |         | main     |             |              | 
 action_input     | bytea     |           |          |         | extended |             |              | 
 action_gas       | numeric   |           |          |         | main     |             |              | 
 action_call_type | call_type |           |          |         | plain    |             |              | 
 result_output    | bytea     |           |          |         | extended |             |              | 
 result_gas_used  | bigint    |           |          |         | plain    |             |              | 
Indexes:
    "calls_columnar_pkey" PRIMARY KEY, btree (height, tx_index, trace_index)
Access method: columnar
                                                Table "public.calls"
      Column      |   Type    | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
------------------+-----------+-----------+----------+---------+----------+-------------+--------------+-------------
 height           | height    |           | not null |         | plain    |             |              | 
 tx_index         | integer   |           | not null |         | plain    |             |              | 
 trace_index      | integer   |           | not null |         | plain    |             |              | 
 trace_address    | integer[] |           |          |         | extended |             |              | 
 subtraces        | integer   |           |          |         | plain    |             |              | 
 error            | text      |           |          |         | extended |             |              | 
 action_from      | bytea     |           |          |         | extended |             |              | 
 action_to        | bytea     |           |          |         | extended |             |              | 
 action_value     | numeric   |           |          |         | main     |             |              | 
 action_input     | bytea     |           |          |         | extended |             |              | 
 action_gas       | numeric   |           |          |         | main     |             |              | 
 action_call_type | call_type |           |          |         | plain    |             |              | 
 result_output    | bytea     |           |          |         | extended |             |              | 
 result_gas_used  | bigint    |           |          |         | plain    |             |              | 
Indexes:
    "calls_pkey" PRIMARY KEY, btree (height, tx_index, trace_index)
Access method: heap

Exemple of a query that should be fast using the index (as index support is claimed in the documentation) :

explain analyse select * from calls_columnar where height between 1000 and 2000;

                                                                                                   QUERY PLAN                                                                                                   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Custom Scan (ColumnarScan) on calls_columnar  (cost=0.00..909705.14 rows=575113 width=288) (actual time=130.746..50768.705 rows=153736 loops=1)
   Filter: (((height)::bigint >= 1000) AND ((height)::bigint <= 2000))
   Rows Removed by Filter: 114868939
   Columnar Projected Columns: height, tx_index, trace_index, trace_address, subtraces, error, action_from, action_to, action_value, action_input, action_gas, action_call_type, result_output, result_gas_used
 Planning Time: 3.430 ms
 JIT:
   Functions: 1
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.326 ms, Inlining 2.154 ms, Optimization 7.891 ms, Emission 5.067 ms, Total 15.439 ms
 Execution Time: 50773.834 ms
(10 rows)

We can see clearly no index usage in the plan.

For reference, here is the plan for the regular table.

                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using calls_pkey on calls  (cost=0.57..1139168.99 rows=1962967 width=178) (actual time=1.055..103.694 rows=153736 loops=1)
   Index Cond: (((height)::bigint >= 1000) AND ((height)::bigint <= 2000))
 Planning Time: 0.318 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.972 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 1.972 ms
 Execution Time: 111.912 ms
(8 rows)

I have the feeling I am either not using the whole columnar store correctly or there is some pitfall in the planner regarding costs.

@onurctirtir
Copy link
Member

onurctirtir commented Feb 21, 2024

Similar to #7202.

@onurctirtir
Copy link
Member

onurctirtir commented Feb 21, 2024

Hey @vbergeron-ledger,

Columnar is more inclined to use the index if the predicates in the where clause are based on some equality filters. And when using range based predicates, I'd expect columnar to prefer ColumnarScan because it would be much faster than an index-scan on columnar tables --due to underlying index support implementation in columnar.

And ColumnarScan performs best when it can make use of chunk group filtering. When this is the case, you can see which filters are used for chunk group filtering in the EXPLAIN output (Columnar Chunk Group Filters), as in:

                                    QUERY PLAN
---------------------------------------------------------------------
 Custom Scan (ColumnarScan) on simple_chunk_filtering (actual rows=111111 loops=1)
   Filter: (i > 123456)
   Rows Removed by Filter: 3457
   Columnar Projected Columns: i
   Columnar Chunk Group Filters: (i > 123456)
   Columnar Chunk Groups Removed by Filter: 12

However, this doesn't seem like the case for your query. To understand why this is so, could you re-run the query by first setting columnar.planner_debug_level to something that can be captured based on the value of client_min_messages And then, you would see such debug/notice log that reports the correlation calculated for height column:

NOTICE:  columnar planner: cannot push down clause:  ....

And if you can share that log line, I can help you further.

@vbergeron-ledger
Copy link
Author

Hello and thanks for your detailed reply.
Using the configuration options from #7202 I managed to get a decent speed for point-blank queries and range queries based on the primary keys. Even better, I observe 50% more performance consistently using columnar storage, which I presume is linked with lesser data size on disk.
I will get back to you with the push down logs detailed asap.

@vbergeron-ledger
Copy link
Author

And here it comes :

explain analyse select * from calls_columnar where height = 5000;
NOTICE:  columnar planner: cannot push down clause: must match 'Var <op> Expr' or 'Expr <op> Var'
HINT:  Var must only reference this rel, and Expr must not reference this rel
NOTICE:  columnar planner: adding CustomScan path for calls_columnar
DETAIL:  unparameterized; 0 clauses pushed down

Resulting plan is kind of the same as before.

@onurctirtir
Copy link
Member

And here it comes :

explain analyse select * from calls_columnar where height = 5000;
NOTICE:  columnar planner: cannot push down clause: must match 'Var <op> Expr' or 'Expr <op> Var'
HINT:  Var must only reference this rel, and Expr must not reference this rel
NOTICE:  columnar planner: adding CustomScan path for calls_columnar
DETAIL:  unparameterized; 0 clauses pushed down

Resulting plan is kind of the same as before.

There must be something going wrong with the planner because the reason stated in NOTICE is not much applicable for height = 5000, interesting ..

@vbergeron-ledger
Copy link
Author

if it can add more context, when specifying more columns on conditions that should also use a more complex index :

explain (analyse,costs) select height, tx_index, trace_index from calls_columnar where action_from = '\x095ddce4fd8818ad159d778e6a9898a2474933ca' and height between 804000 and 804500;
NOTICE:  columnar planner: cannot push down clause: must match 'Var <op> Expr' or 'Expr <op> Var'
HINT:  Var must only reference this rel, and Expr must not reference this rel
NOTICE:  columnar planner: cannot push down clause: must match 'Var <op> Expr' or 'Expr <op> Var'
HINT:  Var must only reference this rel, and Expr must not reference this rel
NOTICE:  columnar planner: adding CustomScan path for calls_columnar
DETAIL:  unparameterized; 1 clauses pushed down

So it happens on every pushdowns.
when deactivating seq and columnar scans, I also get quite bad performance for an index scan but this maybe linked to the storage itself.

@vbergeron-ledger
Copy link
Author

Another feedback : I managed to get the predicate pushdown working : this is caused by a domain type.
I altered the type of the height column to bigint and the error vanished.

the performance of the height = 5000 is still way worse than the index scan though, so maybe there is some tuning to be done for their relative costs, idk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants