Simplify query in bootstrap_query method by jordanlewis · Pull Request #310 · elixir-ecto/postgrex

jordanlewis · 2017-04-04T18:52:55Z

Previously, bootstrap_query constructed a pretty complicated SELECT
statement that used a subquery, array and generate_series to exclude
rows whose oids were part of a predefined set. The query looked like
this:

SELECT ... WHERE t.oid NOT IN
  (SELECT (ARRAY[integer, literals, ...])[i]
   FROM generate_series(1, length_of_above_array) AS i)

This subquery is now replaced by a simpler list of integer literals,
which should be more performant and easier to understand.

Now, the query looks like this:

SELECT ... WHERE t.oid NOT IN (integer, literals, ...)

This should work in all versions of Postgres and Redshift, and works around a typing issue in CockroachDB: cockroachdb/cockroach#14554
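To make the two shapes concrete, here is a minimal sketch that renders both `NOT IN` clauses from a list of OIDs. The function names and the `t.oid` column reference are illustrative assumptions, not Postgrex's actual code (which is Elixir); the sketch only shows the SQL text each approach would produce.

```python
def _literals(oids):
    """Render OIDs as a comma-separated list of integer literals."""
    if not oids:
        # "NOT IN ()" is not valid SQL, so refuse an empty list.
        raise ValueError("need at least one oid")
    return ", ".join(str(int(oid)) for oid in oids)


def not_in_generate_series(oids):
    """Old shape: index into an array literal via generate_series (hypothetical helper)."""
    return (
        f"t.oid NOT IN (SELECT (ARRAY[{_literals(oids)}])[i] "
        f"FROM generate_series(1, {len(oids)}) AS i)"
    )


def not_in_literal_list(oids):
    """New shape proposed in this PR: a plain list of integer literals (hypothetical helper)."""
    return f"t.oid NOT IN ({_literals(oids)})"


known_oids = [26, 27, 28, 29, 30]
old_clause = not_in_generate_series(known_oids)
new_clause = not_in_literal_list(known_oids)
print(old_clause)
print(new_clause)
```

Both clauses exclude the same rows; they differ only in how the database is asked to evaluate the exclusion.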


sourcelevel-bot · 2017-04-04T18:53:01Z

Hello, @jordanlewis! This is your first Pull Request that will be reviewed by Ebert, an automatic Code Review service. It will leave comments on this diff with potential issues and style violations found in the code as you push new commits. You can also see all the issues found on this Pull Request on its review page. Please check our documentation for more information.

fishcakez · 2017-04-04T19:58:18Z

Can you check the query plan for this on Postgres < 9.6? I believe PostgreSQL will make a suboptimal decision, causing O(n^2) complexity.


jordanlewis · 2017-04-04T20:50:53Z

It seems like 9.5 and 9.6 both claim to plan the simpler query as a simple filter, which does sound like it would be O(n^2). However, empirically it seems like this isn't a problem, since the literal integer list is small and fully in-memory. I did a test on both 9.5 and 9.6 and it seems like the two queries are pretty much the same (very low) cost, with the simpler version often running more quickly.

I think it's probably not a big deal either way, and the simpler query has the benefit of being much easier to understand.

On 9.5:

psql (9.5.6)
Type "help" for help.

postgres=# explain analyze select typname from pg_type where oid NOT IN (26,27,28,29,30);
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=0.00..13.75 rows=349 width=64) (actual time=0.011..0.115 rows=349 loops=1)
   Filter: (oid <> ALL ('{26,27,28,29,30}'::oid[]))
   Rows Removed by Filter: 5
 Planning time: 0.098 ms
 Execution time: 0.162 ms
(5 rows)

postgres=# explain analyze select typname from pg_type where oid NOT IN (SELECT (ARRAY[26,27,28,29,30])[i] FROM generate_series(1, 5) i);
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=12.50..24.93 rows=177 width=64) (actual time=0.033..0.241 rows=349 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 5
   SubPlan 1
     ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.010..0.012 rows=5 loops=1)
 Planning time: 0.082 ms
 Execution time: 0.359 ms
(7 rows)

On 9.6:

psql (9.6.2)
Type "help" for help.

postgres=# explain analyze select typname from pg_type where oid NOT IN (26,27,28,29,30);
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=0.00..14.83 rows=354 width=64) (actual time=0.009..0.117 rows=354 loops=1)
   Filter: (oid <> ALL ('{26,27,28,29,30}'::oid[]))
   Rows Removed by Filter: 5
 Planning time: 0.070 ms
 Execution time: 0.150 ms
(5 rows)

postgres=# explain analyze select typname from pg_type where oid NOT IN (SELECT (ARRAY[26,27,28,29,30])[i] FROM generate_series(1, 5) i);
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=12.50..25.99 rows=180 width=64) (actual time=0.038..0.114 rows=354 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 5
   SubPlan 1
     ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.014..0.016 rows=5 loops=1)
 Planning time: 0.077 ms
 Execution time: 0.202 ms
(7 rows)

jordanlewis · 2017-04-04T20:54:15Z

Also, now that I think about it, even if the more complex query is in fact hashing the array as it claims with "Hashed SubPlan 1", wouldn't the overall query still be an O(n^2) operation since the query is doing a negative match between the two data sets?

Regardless, I'm still of the opinion that the efficiency difference here is negligible because of the small size of the data involved.

fishcakez · 2017-04-04T21:03:53Z

Unfortunately, the efficiency difference is significant, and a function scan can be orders of magnitude slower for some users. The hashed anti join is O(n log(n)) IIRC for the expensive part.
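As one possible intuition for why the two plans can diverge as the data grows, the following toy model counts the work done by a per-row linear scan of the excluded list versus a probe into a hash set built once. It is a deliberately simplified sketch, not PostgreSQL's executor, and the sizes (5000 rows, 150 excluded values) are made-up numbers chosen only to be in the spirit of this thread.

```python
def filter_by_scan(row_oids, excluded):
    """Compare each row against the excluded list one element at a time."""
    comparisons = 0
    kept = []
    for oid in row_oids:
        found = False
        for candidate in excluded:
            comparisons += 1
            if oid == candidate:
                found = True
                break
        if not found:
            kept.append(oid)
    return kept, comparisons


def filter_by_hash(row_oids, excluded):
    """Build a hash set once, then do one probe per row."""
    excluded_set = set(excluded)
    probes = 0
    kept = []
    for oid in row_oids:
        probes += 1
        if oid not in excluded_set:
            kept.append(oid)
    return kept, probes


rows = list(range(1, 5001))      # assumed number of catalog rows
excluded = list(range(1, 151))   # assumed number of already-known oids

kept_scan, scan_cost = filter_by_scan(rows, excluded)
kept_hash, hash_cost = filter_by_hash(rows, excluded)
print(f"linear scan comparisons: {scan_cost}")
print(f"hash probes:             {hash_cost}")
```

In this model the scan cost grows with rows times excluded values, while the hashed variant grows with their sum. That is why a five-element test can look equally cheap either way and still hide a large gap on a bigger catalog.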

jordanlewis · 2017-04-04T21:19:23Z

I don't believe that the more complicated query uses a hashed anti-join, though, since it doesn't show up in the EXPLAIN output. When a plan actually uses a hashed anti-join it shows up prominently in EXPLAIN.

Can you give an example of an instance of an order of magnitude slowdown between these two queries?

fishcakez · 2017-04-04T21:36:55Z

On a local database this branch takes approximately 37 seconds and master takes 100 milliseconds.

jordanlewis · 2017-04-04T21:41:07Z

Wow! That's a huge difference. 37 seconds/100ms to do what though? I'd love to reproduce it locally.

fishcakez · 2017-04-04T21:50:49Z

That is the execution time when I ran EXPLAIN ANALYZE on the rebootstrap query using a database with quite a few thousand tables on 9.5.5:

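# Connect with default options
{:ok, pid} = Postgrex.start_link([])
# Prepare a trivial query to get hold of the connection's types
%{types: types} = Postgrex.prepare!(pid, "SELECT 42", [])
# Build the bootstrap query for server version 9.5.5
statement = Postgrex.Types.bootstrap_query({9, 5, 5}, types)
# Run it under EXPLAIN ANALYSE with no timeout and print the plan
%{rows: rows} = Postgrex.query!(pid, ["EXPLAIN ANALYSE " | statement], [], [timeout: :infinity])
IO.puts Enum.join(rows, "\n")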

jordanlewis · 2017-04-04T22:04:12Z

Thank you! Clearly this PR won't work.

fishcakez · 2017-04-04T22:49:57Z

I was hoping you could figure out a query that has similar efficiency and works in CockroachDB and Redshift :(

jordanlewis · 2017-04-05T01:53:47Z

Another option is to use a VALUES clause instead of the array subquery. This seems to also produce a hashed subplan in Postgres, which I assume will get hash anti-joined just the same as the other version.

jordan=# explain analyze select typname from pg_type where oid not in (values(26),(27),(28),(29),(30));
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=0.07..13.56 rows=180 width=64) (actual time=0.037..0.180 rows=365 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 5
   SubPlan 1
     ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=4) (actual time=0.004..0.004 rows=5 loops=1)
 Planning time: 0.075 ms
 Execution time: 0.241 ms
(7 rows)
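A quick way to check that the `VALUES` form and the literal-list form select the same rows is to run both against a small table. The sketch below uses Python's built-in SQLite purely as a stand-in engine; the table `demo_type`, its columns, and the helper names are assumptions made for illustration. It says nothing about how PostgreSQL plans either query, or about whether Redshift or CockroachDB accept them.

```python
import sqlite3


def not_in_values(column, oids):
    """Render `column NOT IN (VALUES (a), (b), ...)` (hypothetical helper)."""
    rows = ", ".join(f"({int(oid)})" for oid in oids)
    return f"{column} NOT IN (VALUES {rows})"


def not_in_literals(column, oids):
    """Render `column NOT IN (a, b, ...)` (hypothetical helper)."""
    items = ", ".join(str(int(oid)) for oid in oids)
    return f"{column} NOT IN ({items})"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_type (type_oid INTEGER PRIMARY KEY, typname TEXT)")
conn.executemany(
    "INSERT INTO demo_type VALUES (?, ?)",
    [(oid, f"type_{oid}") for oid in range(20, 36)],
)

excluded = [26, 27, 28, 29, 30]
results = {}
for build in (not_in_values, not_in_literals):
    sql = (
        "SELECT typname FROM demo_type "
        f"WHERE {build('type_oid', excluded)} ORDER BY type_oid"
    )
    results[build.__name__] = [name for (name,) in conn.execute(sql)]

print(results["not_in_values"] == results["not_in_literals"])
```

The comparison covers result sets only. The planner behaviour would still need to be confirmed with `EXPLAIN ANALYZE` on each target database.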

jordanlewis · 2017-04-05T02:10:38Z

Hmm, I'm wondering what the purpose of this query is in general. It seems like oids contains over 150 values that have to be sent to the database each time this runs - perhaps there is a more efficient way of getting the information you want from the database?

fishcakez · 2017-04-05T07:13:57Z

I'm not sure that Redshift supports that query.


fishcakez · 2017-04-05T09:04:39Z

> I'm wondering what the purpose of this query is in general.

On connect we want to fetch all rows we don't currently know about.
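Read as a set operation, that is a difference between what the server has and what the client already knows. The snippet below is a toy model of the idea only, assuming an in-memory mapping in place of the `pg_type` catalog. The function name and the `my_enum` entry are hypothetical, and this is not how Postgrex implements it.

```python
def unknown_types(server_types, known_oids):
    """Return the server-side types whose oids the client has not seen yet."""
    known = set(known_oids)
    return {oid: name for oid, name in server_types.items() if oid not in known}


# Stand-in for the catalog: oid -> type name (16385 is a made-up user-defined type).
server_types = {16: "bool", 23: "int4", 25: "text", 16385: "my_enum"}
known_oids = [16, 23, 25]

to_fetch = unknown_types(server_types, known_oids)
print(to_fetch)
```

The `NOT IN` clause discussed in this PR asks the database to perform the same exclusion, which is why the full list of known oids has to travel with the query.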

talentdeficit · 2017-06-16T12:17:12Z

Redshift does not support VALUES in that manner.

jordanlewis mentioned this pull request Apr 4, 2017

unsupported comparison operator: <oid> NOT IN <tuple{int}> cockroachdb/cockroach#14554

Closed

jordanlewis closed this Apr 4, 2017

jordanlewis reopened this Apr 5, 2017

fishcakez mentioned this pull request Jul 8, 2017

Avoid loading information about table-types at start #320

Closed

fishcakez mentioned this pull request Nov 25, 2017

Filter bootstrap more efficiently #352

Merged

josevalim closed this in #352 Jul 16, 2019
