QueryEvaluation

Brad Bebee edited this page Feb 13, 2020 · 1 revision

Bigdata has two primary kinds of joins: (1) nested index joins (pipelined joins); and (2) hash joins (at-once joins).

Pipeline joins

Pipeline joins are "zero investment" joins because the access path is pre-existing, at least for the RDF database. They use a B+Tree index (access path) described by a predicate. The pipeline join also supports cutoff evaluation which is used for sampling the estimated cardinality of join paths during runtime query optimization.
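As a rough illustration of the idea, the sketch below joins each incoming solution against a sorted in-memory list that stands in for a B+Tree access path. It is not Bigdata's implementation: `SPO_INDEX`, `access_path`, `pipeline_join`, and the `cutoff` parameter are hypothetical names chosen for this example, and the cutoff is modeled simply as a limit on the number of output solutions.

```python
from bisect import bisect_left
from itertools import islice

# Hypothetical stand-in for a B+Tree index: triples kept in sorted (s, p, o) order.
SPO_INDEX = sorted([
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "dave"),
    ("carol", "name", "Carol"),
])

def access_path(index, s, p):
    """Range-scan the sorted index for all triples with the (s, p) prefix."""
    # "" sorts before any other string, so this finds the first matching triple.
    start = bisect_left(index, (s, p, ""))
    for i in range(start, len(index)):
        if index[i][:2] != (s, p):
            break
        yield index[i]

def pipeline_join(solutions, index, s_var, p, o_var, cutoff=None):
    """As-bound join: each solution is joined against the index as it arrives."""
    joined = (
        {**sol, o_var: o}
        for sol in solutions
        for (_, _, o) in access_path(index, sol[s_var], p)
    )
    # With a cutoff, stop after that many output solutions (e.g. for sampling).
    return islice(joined, cutoff) if cutoff is not None else joined
```

No index is built for the join itself; the only per-solution cost is the range scan. Calling `pipeline_join(source, SPO_INDEX, "x", "knows", "y", cutoff=1)` stops after the first joined solution, which is the kind of bounded evaluation a sampling step could use.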

PipelineJoin

SubqueryOp

SubqueryOp is used for pipelined (as-bound) subquery joins. The source solutions are always from the pipeline and are joined against elements read from access paths, just like normal join evaluation. Because this is pipelined evaluation, there is no hash index involved.

Hash joins

Hash joins are built dynamically during query evaluation. They are "at once" operators because they rely on a hash index built at runtime over the solutions to be joined. Those solutions may come from upstream in the pipeline, from a subquery, or from an external service.
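The build-then-probe pattern behind an "at once" join can be sketched as follows. This is a generic hash join over dictionaries of variable bindings, not one of Bigdata's operators; `hash_join`, `build_solutions`, `probe_solutions`, and `join_vars` are names assumed for illustration.

```python
from collections import defaultdict

def hash_join(build_solutions, probe_solutions, join_vars):
    """At-once join: materialize one side in a hash index, then probe it."""
    # Build phase: consume every build-side solution before producing output.
    index = defaultdict(list)
    for sol in build_solutions:
        key = tuple(sol[v] for v in join_vars)
        index[key].append(sol)
    # Probe phase: look up each probe-side solution by its join-variable values.
    out = []
    for sol in probe_solutions:
        key = tuple(sol[v] for v in join_vars)
        for match in index.get(key, []):
            out.append({**match, **sol})
    return out
```

Unlike a pipelined join, nothing is emitted until the whole build side has been read, which is why the operator has to see all of those solutions at once.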

Bigdata has hash join operators that run on the JVM heap and are suitable for low data volumes. It also has "analytic" hash join operators that run on the native process heap using the memory manager. These joins can handle as much data as you have RAM and do not suffer from GC overhead problems.

SubqueryHashJoinOp

The left is the pipeline, whose solutions are fully materialized in a hash index. The right is the subquery solutions iterator. The subquery is run once using an empty source binding set. We then join each right solution from the subquery iterator with the hash index over the left solutions. If the join is optional, we then output each left solution that did not join.
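The steps above, including the optional case, might look like the following sketch. It only mirrors the description and is not the operator's actual code: `subquery_hash_join` and the `run_subquery` callable (which takes a source binding set and returns an iterable of solutions) are hypothetical.

```python
from collections import defaultdict

def subquery_hash_join(left_solutions, run_subquery, join_vars, optional=False):
    """Join pipeline (left) solutions with subquery (right) solutions."""
    # Fully materialize the left solutions and index them by join variables.
    left = list(left_solutions)
    index = defaultdict(list)
    for i, sol in enumerate(left):
        index[tuple(sol[v] for v in join_vars)].append(i)

    joined = set()  # positions of left solutions that joined at least once
    out = []
    # Run the subquery once, with an empty source binding set.
    for right in run_subquery({}):
        key = tuple(right[v] for v in join_vars)
        for i in index.get(key, []):
            out.append({**left[i], **right})
            joined.add(i)

    # Optional join: also output each left solution that did not join.
    if optional:
        out.extend(sol for i, sol in enumerate(left) if i not in joined)
    return out
```

Tracking which left solutions joined is what makes the optional case possible: the unjoined ones can only be identified after the subquery iterator has been exhausted.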

HTreeHashJoinOp

The left is the access path solutions iterator and the right is the pipeline. The solutions from the pipeline (right) are fully materialized on a hash index before the join is executed. If the join is optional, then right solutions (from the pipeline) are output if the join failed for that right solution.

NamedSubqueryInclude

The left is the upstream solutions iterator. The right is the named solution set, which was materialized on a hash index by the NamedSubqueryOp.
