Simplify query in bootstrap_query method #310

Closed
jordanlewis wants to merge 1 commit into elixir-ecto:master from jordanlewis:simplify-bootstrap-query

@jordanlewis

Previously, bootstrap_query constructed a pretty complicated SELECT
statement that used a subquery, array and generate_series to exclude
rows whose oids were part of a predefined set. The query looked like
this:

```
SELECT ... WHERE t.oid NOT IN
  (SELECT (ARRAY[integer, literals, ...])[i]
   FROM generate_series(1, length_of_above_array) i)
```

This subquery is now replaced by a simpler list of integer literals,
which should be more performant and easier to understand.

Now, the query looks like this:

```
SELECT ... WHERE t.oid NOT IN (integer, literals, ...)
```

This should work in all versions of Postgres and Redshift, and works around a typing issue in CockroachDB: cockroachdb/cockroach#14554
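
As a rough illustration of the two query shapes described above, here is a minimal Python sketch that renders both filters from a list of oids. It is not Postgrex's Elixir implementation: the function names are hypothetical, and only the `WHERE` fragment is built because the rest of the `SELECT` is elided in the description.

```python
def _oid_literals(oids):
    """Render oids as comma-separated integer literals (hypothetical helper)."""
    if not oids:
        # NOT IN () is not valid SQL, so refuse an empty list.
        raise ValueError("at least one oid is required")
    # int() guards against splicing arbitrary text into the statement.
    return ", ".join(str(int(oid)) for oid in oids)


def array_subquery_filter(oids):
    """Old shape: index into an array literal via generate_series."""
    return (
        f"t.oid NOT IN (SELECT (ARRAY[{_oid_literals(oids)}])[i] "
        f"FROM generate_series(1, {len(oids)}) i)"
    )


def literal_list_filter(oids):
    """New shape proposed in this PR: a plain list of integer literals."""
    return f"t.oid NOT IN ({_oid_literals(oids)})"


if __name__ == "__main__":
    print(array_subquery_filter([26, 27, 28]))
    print(literal_list_filter([26, 27, 28]))
```

Both functions take the same input, so the only difference is the shape of the SQL the server has to plan.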

@fishcakez
Member

fishcakez commented Apr 4, 2017 via email

@jordanlewis
Author

It seems like 9.5 and 9.6 both claim to plan the simpler query as a simple filter, which does sound like it would be O(n^2). However, empirically it seems like this isn't a problem, since the literal integer list is small and fully in-memory. I did a test on both 9.5 and 9.6 and it seems like the two queries are pretty much the same (very low) cost, with the simpler version often running more quickly.

I think it's probably not a big deal either way, and the simpler query has the benefit of being much easier to understand.

On 9.5:

```
psql (9.5.6)
Type "help" for help.

postgres=# explain analyze select typname from pg_type where oid NOT IN (26,27,28,29,30);
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=0.00..13.75 rows=349 width=64) (actual time=0.011..0.115 rows=349 loops=1)
   Filter: (oid <> ALL ('{26,27,28,29,30}'::oid[]))
   Rows Removed by Filter: 5
 Planning time: 0.098 ms
 Execution time: 0.162 ms
(5 rows)

postgres=# explain analyze select typname from pg_type where oid NOT IN (SELECT (ARRAY[26,27,28,29,30])[i] FROM generate_series(1, 5) i);
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=12.50..24.93 rows=177 width=64) (actual time=0.033..0.241 rows=349 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 5
   SubPlan 1
     ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.010..0.012 rows=5 loops=1)
 Planning time: 0.082 ms
 Execution time: 0.359 ms
(7 rows)
```

On 9.6:

```
psql (9.6.2)
Type "help" for help.

postgres=# explain analyze select typname from pg_type where oid NOT IN (26,27,28,29,30);
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=0.00..14.83 rows=354 width=64) (actual time=0.009..0.117 rows=354 loops=1)
   Filter: (oid <> ALL ('{26,27,28,29,30}'::oid[]))
   Rows Removed by Filter: 5
 Planning time: 0.070 ms
 Execution time: 0.150 ms
(5 rows)

postgres=# explain analyze select typname from pg_type where oid NOT IN (SELECT (ARRAY[26,27,28,29,30])[i] FROM generate_series(1, 5) i);
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=12.50..25.99 rows=180 width=64) (actual time=0.038..0.114 rows=354 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 5
   SubPlan 1
     ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.014..0.016 rows=5 loops=1)
 Planning time: 0.077 ms
 Execution time: 0.202 ms
(7 rows)
```

@jordanlewis
Author

Also, now that I think about it, even if the more complex query is in fact hashing the array, as it claims with "hashed SubPlan 1", wouldn't the overall query still be an O(n^2) operation, since the query is doing a negative match between the two data sets?

Regardless, I'm still of the opinion that the efficiency difference here is negligible because of the small size of the data involved.
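
To make the complexity question concrete, here is a toy Python model of the two filter strategies. It is only an analogy and does not model the Postgres executor: `filter_with_list` compares each row against the exclusion list element by element, while `filter_with_hash` builds a hash set once and probes it per row. The function names and the "operation" counts are assumptions made for the illustration.

```python
def filter_with_list(rows, excluded):
    """Keep rows not in `excluded`, comparing against each element in turn."""
    comparisons = 0
    kept = []
    for oid in rows:
        found = False
        for candidate in excluded:
            comparisons += 1
            if oid == candidate:
                found = True
                break
        if not found:
            kept.append(oid)
    return kept, comparisons


def filter_with_hash(rows, excluded):
    """Keep rows not in `excluded`, probing a hash set built once up front."""
    table = set(excluded)
    operations = len(excluded)  # one insert per excluded value
    kept = []
    for oid in rows:
        operations += 1  # one expected-constant-time probe per row
        if oid not in table:
            kept.append(oid)
    return kept, operations


if __name__ == "__main__":
    rows = list(range(10))
    excluded = [7, 8, 9]
    print(filter_with_list(rows, excluded))
    print(filter_with_hash(rows, excluded))
```

In this model the list scan costs up to `len(rows) * len(excluded)` comparisons, while the hashed version costs about `len(rows) + len(excluded)` operations, so the two only diverge noticeably once both inputs are large.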

@fishcakez
Member

Unfortunately, the efficiency difference is significant, and a function scan can be orders of magnitude slower for some users. The hashed anti join is O(n log(n)) IIRC for the expensive part.

@jordanlewis
Author

I don't believe that the more complicated query uses a hashed anti-join, though, since it doesn't show up in the EXPLAIN output. When a plan actually uses a hashed anti-join it shows up prominently in EXPLAIN.

Can you give an example of an instance of an order of magnitude slowdown between these two queries?

@fishcakez
Member

On a local database this branch takes approximately 37 seconds and master takes 100 milliseconds.

@jordanlewis
Author

Wow! That's a huge difference. 37 seconds/100ms to do what though? I'd love to reproduce it locally.

@fishcakez
Member

That is the execution time when I ran EXPLAIN ANALYZE on the rebootstrap query using a database with quite a few thousand tables on 9.5.5:

```
{:ok, pid} = Postgrex.start_link([])
%{types: types} = Postgrex.prepare!(pid, "SELECT 42", [])
statement = Postgrex.Types.bootstrap_query({9, 5, 5}, types)
%{rows: rows} = Postgrex.query!(pid, ["EXPLAIN ANALYSE " | statement], [], [timeout: :infinity])
IO.puts Enum.join(rows, "\n")
```

@jordanlewis
Author

Thank you! Clearly this PR won't work.

jordanlewis closed this on Apr 4, 2017

@fishcakez
Member

I was hoping you could figure out a query that has similar efficiency and works in CockroachDB and Redshift :(

jordanlewis reopened this on Apr 5, 2017

@jordanlewis
Author

Another option is to use a VALUES clause instead of the array subquery. This seems to also produce a hashed subplan in Postgres, which I assume will get hash anti-joined just the same as the other version.

```
jordan=# explain analyze select typname from pg_type where oid not in (values(26),(27),(28),(29),(30));
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_type  (cost=0.07..13.56 rows=180 width=64) (actual time=0.037..0.180 rows=365 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 5
   SubPlan 1
     ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=4) (actual time=0.004..0.004 rows=5 loops=1)
 Planning time: 0.075 ms
 Execution time: 0.241 ms
(7 rows)
```
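
For reference, a `VALUES`-based filter like the one in the query above could be rendered from a list of oids along these lines. This is a minimal Python sketch with a hypothetical function name, not code from Postgrex, and it only builds the `WHERE` fragment.

```python
def values_filter(oids):
    """Render a NOT IN filter whose right-hand side is a VALUES list."""
    if not oids:
        # VALUES with no rows is not valid SQL, so refuse an empty list.
        raise ValueError("at least one oid is required")
    # Each oid becomes its own single-column row: (26), (27), ...
    rows = ", ".join(f"({int(oid)})" for oid in oids)
    return f"t.oid NOT IN (VALUES {rows})"


if __name__ == "__main__":
    print(values_filter([26, 27, 28, 29, 30]))
```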

@jordanlewis
Author

Hmm, I'm wondering what the purpose of this query is in general. It seems like `oids` contains over 150 values that have to be sent to the database each time this runs. Perhaps there is a more efficient way of getting the information you want from the database?

@fishcakez
Member

fishcakez commented Apr 5, 2017 via email

@fishcakez
Member

> I'm wondering what the purpose of this query is in general.

On connect we want to fetch all rows we don't currently know about.

@talentdeficit
Contributor

Redshift does not support VALUES in that manner.
