Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: implement nested SRFs #26234

Open
knz opened this issue May 30, 2018 · 7 comments
Open

sql: implement nested SRFs #26234

knz opened this issue May 30, 2018 · 7 comments
Labels
A-sql-pgcompat Semantic compatibility with PostgreSQL C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team X-anchored-telemetry The issue number is anchored by telemetry references.

Comments

@knz
Copy link
Contributor

knz commented May 30, 2018

for example:

kena=# explain(verbose) select 123, generate_series(1, generate_series(2,3)), generate_series(4,generate_series(5,generate_series(6,generate_Series(7,8)))) from foo;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..11356550895255.20 rows=2260000000000000 width=12)
   Output: 123, (generate_series(1, (generate_series(2, 3)))), generate_series(4, (generate_series(5, (generate_series(6, (generate_series(7, 8)))))))
   ->  ProjectSet  (cost=0.00..11350895255.20 rows=2260000000000 width=8)
         Output: generate_series(5, (generate_series(6, (generate_series(7, 8))))), (generate_series(1, (generate_series(2, 3))))
         ->  ProjectSet  (cost=0.00..11345255.20 rows=2260000000 width=8)
               Output: generate_series(1, (generate_series(2, 3))), generate_series(6, (generate_series(7, 8)))
               ->  ProjectSet  (cost=0.00..11355.20 rows=2260000 width=8)
                     Output: generate_series(2, 3), generate_series(7, 8)
                     ->  Seq Scan on kena.foo  (cost=0.00..32.60 rows=2260 width=0)
                           Output: foo, bar

Jira issue: CRDB-5683

@knz knz added this to Triage in (DEPRECATED) SQL Front-end, Lang & Semantics via automation May 30, 2018
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-pgcompat Semantic compatibility with PostgreSQL labels May 30, 2018
@knz knz moved this from Triage to Current milestone in (DEPRECATED) SQL Front-end, Lang & Semantics May 30, 2018
@knz
Copy link
Contributor Author

knz commented May 30, 2018

So here are some investigation steps:

kena=# explain(verbose) select 123, generate_series(1,2);
                   QUERY PLAN
-------------------------------------------------
 ProjectSet  (cost=0.00..5.02 rows=1000 width=8)
   Output: 123, generate_series(1, 2)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)

Learned:

  • ProjectSet is able to render simple scalar expressions.

  • if there is no FROM clause, then there is a project set operator but no further relational expression underneath. In other words the implicit relational first operand of a project set generated with no FROM clause is the unary table (with no column).

kena=# explain(verbose) select 123+generate_series(1,2);
                      QUERY PLAN
-------------------------------------------------------
 Result  (cost=0.00..20.02 rows=1000 width=4)
   Output: (123 + (generate_series(1, 2)))
   ->  ProjectSet  (cost=0.00..5.02 rows=1000 width=4)
         Output: generate_series(1, 2)
         ->  Result  (cost=0.00..0.01 rows=1 width=0)

Learned: If the result of a SRF is used in a larger scalar context, the operation is performed by inserting an additional regular projection on top of the "project set" operator.

kena=# explain(verbose) select generate_series(1,2)+10, 123 from kv;
                              QUERY PLAN
-----------------------------------------------------------------------
 Result  (cost=0.00..45249.55 rows=2260000 width=8)
   Output: ((generate_series(1, 2)) + 10), 123
   ->  ProjectSet  (cost=0.00..11349.55 rows=2260000 width=4)
         Output: generate_series(1, 2)
         ->  Seq Scan on kena.kv  (cost=0.00..32.60 rows=2260 width=0)
               Output: k, v

Learned: as an exception to rule 1 above, if rule 2 has caused the introduction of a render stage, then that render stage is used to render non-SRF expressions.

kena=# explain(verbose) select generate_series(1, random()::int);
                    QUERY PLAN
---------------------------------------------------
 ProjectSet  (cost=0.00..5.02 rows=1000 width=4)
   Output: generate_series(1, (random())::integer)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)

Learned: if the input of a SRF is a scalar expression, the "project set" oeprator is able to evaluate it itself (i.e. it does not need a projection "underneath")

kena=# explain(verbose) select 123, generate_series(1,2), generate_series(3,4);
                         QUERY PLAN
-------------------------------------------------------------
 ProjectSet  (cost=0.00..5.02 rows=1000 width=12)
   Output: 123, generate_series(1, 2), generate_series(3, 4)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)

Learned: if there are multiple SRFs at the same level of projection, a single "project set" operator is used (but see below).

kena=# explain(Verbose) select generate_series(1,2) from kv;
                           QUERY PLAN
-----------------------------------------------------------------
 ProjectSet  (cost=0.00..11349.55 rows=2260000 width=4)
   Output: generate_series(1, 2)
   ->  Seq Scan on kena.kv  (cost=0.00..32.60 rows=2260 width=0)
         Output: k, v

Learned: a SRF in render position with a single input table is a project set with that table as relational operand.

kena=# explain(Verbose) select generate_series(1,2) from kv, foo, rows from(unnest(array[1,2,3]));
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..2564018102.50 rows=510760000000 width=4)
   Output: generate_series(1, 2)
   ->  Nested Loop  (cost=0.00..6387402.50 rows=510760000 width=0)
         ->  Nested Loop  (cost=0.00..2864.25 rows=226000 width=0)
               ->  Function Scan on pg_catalog.unnest  (cost=0.00..1.00 rows=100 width=0)
                     Output: unnest.unnest
                     Function Call: unnest('{1,2,3}'::integer[])
               ->  Materialize  (cost=0.00..43.90 rows=2260 width=0)
                     ->  Seq Scan on kena.kv  (cost=0.00..32.60 rows=2260 width=0)
         ->  Materialize  (cost=0.00..43.90 rows=2260 width=0)
               ->  Seq Scan on kena.foo  (cost=0.00..32.60 rows=2260 width=0)

Learned: a SRF in render position with a complex FROM clause uses a single "project set" clause with the entire relational expression of the original FROM clause as operand.

kena=# select *, generate_series(1,2) from kv;
  k  | v  | generate_series
-----+----+-----------------

Learned: the * expands against the original FROM clause, not the FROM clause expanded with "project set".


Drawing the line so far, suppose we introduce a new relational operator project_set(E, f, g, h, ...) which zips (f, g, h, ...) for every row in E.

Then here are the "simple case" transformations on input queries:

  • select ... srf(x) ... with no FROM

    -> proceed as per select ... srf(x) ... from unary_table()

  • select *, srfa(x), srfb(y), z from E and there are N columns in E

    transformed to: select E.*, @(N+1) as "srfa", @(N+2) as "srfb", z from project_set(E, srfa(x), srfb(y))

  • select *, f(srfa(x)), g(srfb(y)), z from E and there are N columns in E

    transformed to: select E.*, f(@(N+1)), g(@(N+2)), z from project_set(E, srfa(x), srfb(y))

All this is not considering SRFs as argument to other SRFs.

@knz
Copy link
Contributor Author

knz commented May 30, 2018

Nested SRFs.

For the analysis I'll be writing

  • g(N, x) to shortengenerate_series(N, x).
  • project_set(E, f(x)) as t to rename the pseudo-table returned by project_set() to t.
  • t:N to refer to the Nth column of the pseudo-table t currently available in the context of an expression

Let's analyze this recursion.

On the left:

  • select foo, g(a, a), g(Z,Z) from E
    ->
            project_set(E, 
                               foo, g(A,A), g(Z, Z)
            ) as t1
    
  • select foo, g(A, g(B, B)), g(Z, Z) from E
    ->
         project_set(
             project_set(E, 
                                g(B,B), g(Z, Z), foo
              ) as t1,
             foo, g(A, t1:1), t1:2
         ) as t2
    
  • select foo, g(A, g(B, g(C,C)), g(Z, Z) from E
    ->
         project_set(
           project_set(
             project_set(E, 
                                g(C,C), g(Z, Z), foo
             ) as t1,
             g(B, t1:1), foo, t1:2
           ) as t2,
           foo, g(A, t2:1), t2:3
         ) as t3
    
  • select foo, g(A, g(B, g(C,g(D,D))), g(Z, Z) from E
    ->
         project_set(
           project_set(
             project_set(
                 project_set(E, 
                                g(D,D), g(Z, Z), foo
                 ) as t1,
                 g(C, t1:1), foo, t1:2
             ) as t2,
             g(B, t2:1), foo, t2:3
           ) as t3,
           foo, g(A, t3:1), t3:3
       ) as t4
    

A few observations already:

  • the non-SRF renders are pulled on the right in the innermost (lowest) project_set
  • they are pulled on the left in the outermost project_set, to respect the original query ordering
  • in the middle something else is happening.

We have something like a recursion.

Let's make an attempt to define it. Let's define XFORM(E, f) to be the result of transforming scalar expression f in the context of a FROM clause E. Then we have:

  • XFORM(E, srf(x)) = project_set(E, srf(x)) if x does not contain a SRF
  • XFORM(E, srf(F)) = project_set(XFORM(E, F) as t, srf(t:1)) otherwise

Let's try this:

  • XFORM(E, g(A,A))
    -> project_set(E, g(A,A))
  • XFORM(E, g(A, g(B,B)))
    -> project_set(XFORM(E, g(B,B)) as t, g(A, t:1))
    -> project_set(project_set(E, g(B,B)) as t, g(A, t:1))
  • XFORM(E, g(A, g(B, g(C, C))))
    -> project_set(XFORM(E, g(B,g(C,C))) as t, g(A, t:1))
    -> project_set(project_set(XFORM(E, g(C,C)) as u, g(B, u:1)) as t, g(A, t:1))
    -> project_set(project_set(project_set(E, g(C,C)) as u, g(B, u:1)) as t, g(A, t:1))

This is equivalent to the observed structure above, looking good!

@knz
Copy link
Contributor Author

knz commented May 30, 2018

Let's try to extend our attempt to define XFORM above to support simple, non-SRF columns.

Let's denote XFORM(E, a, b) the result of transforming the expressions a, b where a may contain a SRF, but b cannot.

We have already:

  • XFORM(E, srf(x), b) = project_set(E, srf(x), b) if x does not contain a SRF
  • XFORM(E, srf(F), b) = project_set(XFORM(E, F, b) as t, srf(t:1), t1:2) if F contains a SRF

And we can add:

  • XFORM(E, a, b) = project(E, a, b) if a doesn't contain SRFs
  • XFORM(E, f(a), b) = project(XFORM(E, a, b) as t, t:1, t:2) if f is not a SRF and a may contain SRFs

@knz
Copy link
Contributor Author

knz commented May 30, 2018

Now that I knew what the recursion would look like, I knew how to recognize it in pg's own source code, and here it is:

 * split_pathtarget_at_srfs
 *              Split given PathTarget into multiple levels to position SRFs safely
 *
 * The executor can only handle set-returning functions that appear at the
 * top level of the targetlist of a ProjectSet plan node.  If we have any SRFs
 * that are not at top level, we need to split up the evaluation into multiple
 * plan levels in which each level satisfies this constraint.  This function
 * creates appropriate PathTarget(s) for each level.
 *
 * As an example, consider the tlist expression
 *              x + srf1(srf2(y + z))
 * This expression should appear as-is in the top PathTarget, but below that
 * we must have a PathTarget containing
 *              x, srf1(srf2(y + z))
 * and below that, another PathTarget containing
 *              x, srf2(y + z)
 * and below that, another PathTarget containing
 *              x, y, z
 * When these tlists are processed by setrefs.c, subexpressions that match
 * output expressions of the next lower tlist will be replaced by Vars,
 * so that what the executor gets are tlists looking like
 *              Var1 + Var2
 *              Var1, srf1(Var2)
 *              Var1, srf2(Var2 + Var3)
 *              x, y, z
 * which satisfy the desired property.
 *
 * Another example is
 *              srf1(x), srf2(srf3(y))
 * That must appear as-is in the top PathTarget, but below that we need
 *              srf1(x), srf3(y)
 * That is, each SRF must be computed at a level corresponding to the nesting
 * depth of SRFs within its arguments.
 *

src/backend/optimizer/util/tlist.c (target list)

@knz
Copy link
Contributor Author

knz commented May 30, 2018

The algorithm to perform this substitution is split into two phases:

  • a first phase splits each target expression into two data structures. This phase is applied to all the target expressions, including the main SELECT targets, but also the other sub-clauses including ORDER BY, GROUP BY, etc (but not HAVING)
  • after the various sub-clauses have been analyzed and planned, the collected data structure pairs are processed to modify the resulting logical plan.

@knz knz changed the title sql: define the semantics of pg's "Project set" properly sql: implement nested SRFs Jun 6, 2018
@knz knz moved this from Current milestone to Feature requests / pie-in-the-skie in (DEPRECATED) SQL Front-end, Lang & Semantics Jun 6, 2018
@knz knz added the X-anchored-telemetry The issue number is anchored by telemetry references. label Mar 13, 2019
@jordanlewis jordanlewis added this to To do in BACKLOG, NO NEW ISSUES: SQL Optimizer via automation Apr 30, 2019
@jordanlewis jordanlewis removed this from Feature requests / pie-in-the-skie in (DEPRECATED) SQL Front-end, Lang & Semantics Apr 30, 2019
@jordanlewis
Copy link
Member

@RaduBerinde moving to planning

@RaduBerinde RaduBerinde moved this from Triage to Lower Priority Backlog in BACKLOG, NO NEW ISSUES: SQL Optimizer Jul 9, 2019
@RaduBerinde RaduBerinde moved this from Lower Priority Backlog to Functional issues in BACKLOG, NO NEW ISSUES: SQL Optimizer Apr 18, 2020
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@jordanlewis
Copy link
Member

Still a thing.

@rytaft rytaft added this to Triage in SQL Queries via automation Mar 28, 2023
@rytaft rytaft moved this from Triage to New Backlog in SQL Queries Mar 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-pgcompat Semantic compatibility with PostgreSQL C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team X-anchored-telemetry The issue number is anchored by telemetry references.
Projects
Status: Backlog
SQL Queries
New Backlog
Development

No branches or pull requests

3 participants