Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

TAJO-1352: Improve the join order algorithm to consider missed cases of associative join operators#593

Closed
jihoonson wants to merge 99 commits into
apache:masterfrom
jihoonson:TAJO-1352_4
Closed

TAJO-1352: Improve the join order algorithm to consider missed cases of associative join operators#593
jihoonson wants to merge 99 commits into
apache:masterfrom
jihoonson:TAJO-1352_4

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

Main changes are found at the GreedyHeuristicJoinOrderAlgorithm class. The findBestOrder() function finds the best relation pair among remaining join candidates based on join commutativity and associativity.

Review on Reviewable

…into TAJO-1352_4

Conflicts:
	tajo-plan/src/main/java/org/apache/tajo/plan/expr/RowConstantEval.java
	tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java
@jihoonson
Copy link
Copy Markdown
Contributor Author

Here is the simple evaluation result.

Query

select 
  count(*) 
from 
  lineitem left outer join orders 
    on l_orderkey = o_orderkey 
  left outer join partsupp 
    on ps_suppkey = o_custkey 
  left outer join customer 
    on ps_suppkey = c_custkey 
  left outer join part 
    on p_partkey = c_nationkey

Data

  • TPC-H of scale factor 100

Cluster

  • One master and 3 workers
  • Each worker equips 4 cores, 8 GB memory, and 2 HDDs.

Performance comparison

  • Elapsed time
    • Before patch: 25 mins, 23 sec
    • After patch: 19 mins, 59 sec
  • Performance improvement ratio: about 20%

Query plan

The query execution time is reduced due to the improved query plan as follows.

Before patch

-----------------------------
Query Block Graph
-----------------------------
|-#ROOT
-----------------------------
Optimization Log:
[LogicalPlan]
    > ProjectionNode is eliminated.
[#ROOT]
    > Non-optimized join order: ((((tpch100.lineitem ⟕ tpch100.orders) ⟕ tpch100.partsupp) ⟕ tpch100.customer) ⟕ tpch100.part) (cost: 1.0447965953264456E46)
    > Optimized join order    : ((((tpch100.lineitem ⟕ tpch100.orders) ⟕ tpch100.partsupp) ⟕ tpch100.customer) ⟕ tpch100.part) (cost: 1.0447965859707236E46)
-----------------------------

GROUP_BY(10)()
  => exprs: (count())
  => target list: ?count (INT8)
  => out schema:{(1) ?count (INT8)}
  => in schema:{(0) }
   JOIN(15)(LEFT_OUTER)
     => Join Cond: tpch100.part.p_partkey (INT8) = tpch100.customer.c_nationkey (INT8)
     => target list: 
     => out schema: {(0) }
     => in schema: {(2) tpch100.customer.c_nationkey (INT8), tpch100.part.p_partkey (INT8)}
      SCAN(7) on tpch100.part
        => target list: tpch100.part.p_partkey (INT8)
        => out schema: {(1) tpch100.part.p_partkey (INT8)}
        => in schema: {(9) tpch100.part.p_partkey (INT8), tpch100.part.p_name (TEXT), tpch100.part.p_mfgr (TEXT), tpch100.part.p_brand (TEXT), tpch100.part.p_type (TEXT), tpch100.part.p_size (INT4), tpch100.part.p_container (TEXT), tpch100.part.p_retailprice (FLOAT8), tpch100.part.p_comment (TEXT)}
      JOIN(14)(LEFT_OUTER)
        => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.customer.c_custkey (INT8)
        => target list: tpch100.customer.c_nationkey (INT8)
        => out schema: {(1) tpch100.customer.c_nationkey (INT8)}
        => in schema: {(3) tpch100.partsupp.ps_suppkey (INT8), tpch100.customer.c_nationkey (INT8), tpch100.customer.c_custkey (INT8)}
         SCAN(5) on tpch100.customer
           => target list: tpch100.customer.c_nationkey (INT8), tpch100.customer.c_custkey (INT8)
           => out schema: {(2) tpch100.customer.c_nationkey (INT8), tpch100.customer.c_custkey (INT8)}
           => in schema: {(8) tpch100.customer.c_custkey (INT8), tpch100.customer.c_name (TEXT), tpch100.customer.c_address (TEXT), tpch100.customer.c_nationkey (INT8), tpch100.customer.c_phone (TEXT), tpch100.customer.c_acctbal (FLOAT8), tpch100.customer.c_mktsegment (TEXT), tpch100.customer.c_comment (TEXT)}
         JOIN(13)(LEFT_OUTER)
           => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.orders.o_custkey (INT8)
           => target list: tpch100.partsupp.ps_suppkey (INT8)
           => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
           => in schema: {(2) tpch100.orders.o_custkey (INT8), tpch100.partsupp.ps_suppkey (INT8)}
            SCAN(3) on tpch100.partsupp
              => target list: tpch100.partsupp.ps_suppkey (INT8)
              => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
              => in schema: {(5) tpch100.partsupp.ps_partkey (INT8), tpch100.partsupp.ps_suppkey (INT8), tpch100.partsupp.ps_availqty (INT4), tpch100.partsupp.ps_supplycost (FLOAT8), tpch100.partsupp.ps_comment (TEXT)}
            JOIN(12)(LEFT_OUTER)
              => Join Cond: tpch100.lineitem.l_orderkey (INT8) = tpch100.orders.o_orderkey (INT8)
              => target list: tpch100.orders.o_custkey (INT8)
              => out schema: {(1) tpch100.orders.o_custkey (INT8)}
              => in schema: {(3) tpch100.lineitem.l_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderkey (INT8)}
               SCAN(1) on tpch100.orders
                 => target list: tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderkey (INT8)
                 => out schema: {(2) tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderkey (INT8)}
                 => in schema: {(9) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderstatus (TEXT), tpch100.orders.o_totalprice (FLOAT8), tpch100.orders.o_orderdate (DATE), tpch100.orders.o_orderpriority (TEXT), tpch100.orders.o_clerk (TEXT), tpch100.orders.o_shippriority (INT4), tpch100.orders.o_comment (TEXT)}
               SCAN(0) on tpch100.lineitem
                 => target list: tpch100.lineitem.l_orderkey (INT8)
                 => out schema: {(1) tpch100.lineitem.l_orderkey (INT8)}
                 => in schema: {(16) tpch100.lineitem.l_orderkey (INT8), tpch100.lineitem.l_partkey (INT8), tpch100.lineitem.l_suppkey (INT8), tpch100.lineitem.l_linenumber (INT8), tpch100.lineitem.l_quantity (FLOAT8), tpch100.lineitem.l_extendedprice (FLOAT8), tpch100.lineitem.l_discount (FLOAT8), tpch100.lineitem.l_tax (FLOAT8), tpch100.lineitem.l_returnflag (TEXT), tpch100.lineitem.l_linestatus (TEXT), tpch100.lineitem.l_shipdate (DATE), tpch100.lineitem.l_commitdate (DATE), tpch100.lineitem.l_receiptdate (DATE), tpch100.lineitem.l_shipinstruct (TEXT), tpch100.lineitem.l_shipmode (TEXT), tpch100.lineitem.l_comment (TEXT)}

After patch

-----------------------------
Query Block Graph
-----------------------------
|-#ROOT
-----------------------------
Optimization Log:
[LogicalPlan]
    > ProjectionNode is eliminated.
[#ROOT]
    > Non-optimized join order: ((((tpch100.lineitem ⟕ tpch100.orders) ⟕ tpch100.partsupp) ⟕ tpch100.customer) ⟕ tpch100.part) (cost: 7.933924122078524E46)
    > Optimized join order    : ((tpch100.lineitem ⟕ (tpch100.orders ⟕ tpch100.partsupp)) ⟕ (tpch100.customer ⟕ tpch100.part)) (cost: 4.016549062824562E47)
-----------------------------

GROUP_BY(10)()
  => exprs: (count())
  => target list: ?count (INT8)
  => out schema:{(1) ?count (INT8)}
  => in schema:{(0) }
   JOIN(15)(LEFT_OUTER)
     => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.customer.c_custkey (INT8)
     => target list: 
     => out schema: {(0) }
     => in schema: {(2) tpch100.partsupp.ps_suppkey (INT8), tpch100.customer.c_custkey (INT8)}
      JOIN(14)(LEFT_OUTER)
        => Join Cond: tpch100.part.p_partkey (INT8) = tpch100.customer.c_nationkey (INT8)
        => target list: tpch100.customer.c_custkey (INT8)
        => out schema: {(1) tpch100.customer.c_custkey (INT8)}
        => in schema: {(3) tpch100.customer.c_custkey (INT8), tpch100.customer.c_nationkey (INT8), tpch100.part.p_partkey (INT8)}
         SCAN(7) on tpch100.part
           => target list: tpch100.part.p_partkey (INT8)
           => out schema: {(1) tpch100.part.p_partkey (INT8)}
           => in schema: {(9) tpch100.part.p_partkey (INT8), tpch100.part.p_name (TEXT), tpch100.part.p_mfgr (TEXT), tpch100.part.p_brand (TEXT), tpch100.part.p_type (TEXT), tpch100.part.p_size (INT4), tpch100.part.p_container (TEXT), tpch100.part.p_retailprice (FLOAT8), tpch100.part.p_comment (TEXT)}
         SCAN(5) on tpch100.customer
           => target list: tpch100.customer.c_custkey (INT8), tpch100.customer.c_nationkey (INT8)
           => out schema: {(2) tpch100.customer.c_custkey (INT8), tpch100.customer.c_nationkey (INT8)}
           => in schema: {(8) tpch100.customer.c_custkey (INT8), tpch100.customer.c_name (TEXT), tpch100.customer.c_address (TEXT), tpch100.customer.c_nationkey (INT8), tpch100.customer.c_phone (TEXT), tpch100.customer.c_acctbal (FLOAT8), tpch100.customer.c_mktsegment (TEXT), tpch100.customer.c_comment (TEXT)}
      JOIN(13)(LEFT_OUTER)
        => Join Cond: tpch100.lineitem.l_orderkey (INT8) = tpch100.orders.o_orderkey (INT8)
        => target list: tpch100.partsupp.ps_suppkey (INT8)
        => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
        => in schema: {(3) tpch100.lineitem.l_orderkey (INT8), tpch100.partsupp.ps_suppkey (INT8), tpch100.orders.o_orderkey (INT8)}
         JOIN(12)(LEFT_OUTER)
           => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.orders.o_custkey (INT8)
           => target list: tpch100.partsupp.ps_suppkey (INT8), tpch100.orders.o_orderkey (INT8)
           => out schema: {(2) tpch100.partsupp.ps_suppkey (INT8), tpch100.orders.o_orderkey (INT8)}
           => in schema: {(3) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.partsupp.ps_suppkey (INT8)}
            SCAN(3) on tpch100.partsupp
              => target list: tpch100.partsupp.ps_suppkey (INT8)
              => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
              => in schema: {(5) tpch100.partsupp.ps_partkey (INT8), tpch100.partsupp.ps_suppkey (INT8), tpch100.partsupp.ps_availqty (INT4), tpch100.partsupp.ps_supplycost (FLOAT8), tpch100.partsupp.ps_comment (TEXT)}
            SCAN(1) on tpch100.orders
              => target list: tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8)
              => out schema: {(2) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8)}
              => in schema: {(9) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderstatus (TEXT), tpch100.orders.o_totalprice (FLOAT8), tpch100.orders.o_orderdate (DATE), tpch100.orders.o_orderpriority (TEXT), tpch100.orders.o_clerk (TEXT), tpch100.orders.o_shippriority (INT4), tpch100.orders.o_comment (TEXT)}
         SCAN(0) on tpch100.lineitem
           => target list: tpch100.lineitem.l_orderkey (INT8)
           => out schema: {(1) tpch100.lineitem.l_orderkey (INT8)}
           => in schema: {(16) tpch100.lineitem.l_orderkey (INT8), tpch100.lineitem.l_partkey (INT8), tpch100.lineitem.l_suppkey (INT8), tpch100.lineitem.l_linenumber (INT8), tpch100.lineitem.l_quantity (FLOAT8), tpch100.lineitem.l_extendedprice (FLOAT8), tpch100.lineitem.l_discount (FLOAT8), tpch100.lineitem.l_tax (FLOAT8), tpch100.lineitem.l_returnflag (TEXT), tpch100.lineitem.l_linestatus (TEXT), tpch100.lineitem.l_shipdate (DATE), tpch100.lineitem.l_commitdate (DATE), tpch100.lineitem.l_receiptdate (DATE), tpch100.lineitem.l_shipinstruct (TEXT), tpch100.lineitem.l_shipmode (TEXT), tpch100.lineitem.l_comment (TEXT)}

@jihoonson
Copy link
Copy Markdown
Contributor Author

I expect that the performance will be more improved by simultaneously executing multiple execution blocks after our scheduler is implemented.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to need some comment because it is hard to image its return value from the function name.
Also, estimateRowSize rather than schema may be more intuitive.

@hyunsik
Copy link
Copy Markdown
Member

hyunsik commented Jul 13, 2015

Not all changes are shown in the diff in github due to lots of changes. So, I leave some comments here.

  • cachekey in JoinGraphContext should be changed to a local member variable. Its cache effect is trivial because the object lifecycle is very short.
  • the member variable LOG in GreedyHeuristicJoinOrderAlgorithm is added but is not used.
  • PlannerUtil::isSymmetricJoin includes the comment about commutative. But, the method name has been changed from isCommutativeJoin to isSymmetricJoin. As far as I know, commutative join is used in common for the exact meaning we know. But, symmetric join is not usual. Please see two links: symmetric join (https://goo.gl/d4uOO3) vs. commutative join (https://goo.gl/oFCgtM).

I'm still reviewing the patch. I'll give more comments soon.

@hyunsik
Copy link
Copy Markdown
Member

hyunsik commented Jul 14, 2015

I'm leaving additional comments.

  • Nested if-condition branches in JoinedRelationsVertex::buildPlan for left-deep tree can be simplified. If you intend better code readability, it would be Ok.
  • isSymmetricJoinOnly in JoinGraph only seems to be affected by only the last addJoin() call. But, its name and isSymmetricJoinOnly() name seems that this join graph consists of only symmetric joins.
    • JoinGraph::addJoin cuts the cyclic if it find out the loop. It would be Ok in the current implementation. As a database researcher and aside from this patch, I have just a question. Arbitrary cyclic cuts like this can lose the better join opportunities. Is this singular position true? I'm just wondering. If so, should we find a join order from just a join graph even including loops?

@hyunsik
Copy link
Copy Markdown
Member

hyunsik commented Jul 14, 2015

Additional comments:

  • The javadoc in JoinOrderAlgorithm::FoundJoinOrder includes wrong parameter name. That is, joinGraph should be joinGraphContext.
  • Methods (especially public methods) in JoinOrderingUtil seem to need Javadoc because it may be hard to guess their exact purpose from their signature.

@jihoonson
Copy link
Copy Markdown
Contributor Author

@hyunsik thanks for your review. I've reflected all your comments and added more comments to help your understanding.
Regarding on the question about JoinGraph::addJoin, it seems that there are some misunderstandings. The JoinGraph::addJoin method just checks whether there is a cycle rather than physically cutting it. This is to maintain the root vertexes which the graph traverse is started from. In this patch, the root vertexes are identified if they don't have any incoming edges. However, in the presence of a cycle, every vertex has at least one incoming edge, so cannot be the root vertex. The if clause related to the cycle at JoinGraph.java:43 is to prevent such case.

@hyunsik
Copy link
Copy Markdown
Member

hyunsik commented Jul 17, 2015

Thank you for your work.

I leave additional trivial comments.

  • LogicalPlanRewriteRule in LogicalOptimizer is not used.
  • handleRemainingFiltersIfNecessary needs a brief comment to explain its purpose.
  • JoinOrderingUtil includes several unused imports.
  • The multi line comments In 63 line in GreedyHeuristicAlgorithm.java should use // instead of /*.
  • GreedyHeuristicAlgorithm::prepareGraphUpdate needs a brief comment to its purpose.

The patch seems to be ready to be committed. After your answer, I'll finish the review.

@jihoonson
Copy link
Copy Markdown
Contributor Author

Thank you for the detailed review.
I've reflected all your comments.

@jihoonson
Copy link
Copy Markdown
Contributor Author

Test failures are fixed.

@hyunsik
Copy link
Copy Markdown
Member

hyunsik commented Jul 19, 2015

+1 The latest patch looks good to me.

@asfgit asfgit closed this in bedce3a Jul 20, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants