Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Add docs of Runtime Filter #6154

Merged
merged 1 commit into from Jul 18, 2021

Conversation

xinyiZzz
Copy link
Contributor

@xinyiZzz xinyiZzz commented Jul 5, 2021

Proposed changes

Add docs of Runtime Filter #6116

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Code refactor (Modify the code structure, format the code, etc...)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have created an issue on (Fix #ISSUE) and described the bug/feature there in detail
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • If these changes need document changes, I have updated the document
  • Any dependent changes have been merged

Comment on lines 190 to 267
+-------------------------------------------------------------------+
-- 上面`runtime filters`的行显示了`PLAN FRAGMENT 1`的`2:HASH JOIN`生成了ID为RF000的IN predicate,
-- 其中`test2`.`t2`的key values仅在运行时可知,
-- 在`0:OlapScanNode`使用了该IN predicate用于在读取`test`.`t1`时过滤不必要的数据。

SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2;
-- 返回2行结果[3, 4];

-- 通过query的profile(set is_report_success=true;)可以查看查询内部工作的详细信息,
-- 包括每个Runtime Filter是否下推、等待耗时、以及OLAP_SCAN_NODE从prepare到接收到Runtime Filter的总时长。
RuntimeFilter:in:
- HasPushDownToEngine: true
- AWaitTimeCost: 0ns
- EffectTimeCost: 2.76ms

-- 此外,在profile的OLAP_SCAN_NODE中还可以查看Runtime Filter下推后的过滤效果和耗时。
- RowsVectorPredFiltered: 9.320008M (9320008)
- VectorPredEvalTime: 364.39ms
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be English here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, thks

@morningman morningman added area/runtime-filter kind/docs Categorizes issue or PR as related to documentation. labels Jul 9, 2021
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add these doc to docs/.vuepress/sidebar/*.js


## 使用方式

### 设置Session变量
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议在这里先把所有runtime filter相关参数罗列一些,然后说明,在绝大多数情况下,只需要修改 runtime_filter_type,其他参数保持默认即可。在某些特定场景下,才需要进一步的调整其他参数已达到最优效果

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, thks

Runtime Filter is generated during query planning, constructed in HashJoinNode, and applied in ScanNode.

For example, there is currently a Join query between the T1 table and the T2 table. Its Join mode is HashJoin. T1 is a fact table with 100,000 rows of data. T2 is a dimension table with 100 rows of data. Doris join The actual situation is:
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里写100 rows,下面图里为什么是2000?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, thks

## 原理
Runtime Filter在查询规划时生成,在HashJoinNode中构建,在ScanNode中应用。

举个例子,当前存在T1表与T2表的Join查询,它的Join方式为HashJoin,T1是一张事实表,数据行数为100000,T2是一张维度表,数据行数为100,Doris join的实际情况是:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是100还是2000?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, thks

set runtime_filter_type="BLOOM_FILTER,IN,MIN_MAX";
```
默认会保守的只使用IN predicate,部分情况下同时使用Bloom Filter、MinMax Filter、IN predicate时性能更高。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

分别解释一下Bloom Filter MinMax Filter IN predicate的具体算法吧。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在下面有对这三个filter说明

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bloom Filter如何过滤的,没有说明白。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bloom Filter的概念和原理感觉没必要在这里解释了,解释起来繁琐,这应该算基础知识…

- **MinMax Filter**: 包含最大值和最小值,从而过滤小于最小值和大于最大值的数据,MinMax Filter的过滤效果与join on clause中Key列的类型和左右表数据分布有关。
- 当join on clause中Key列的类型为int/bigint/double等时,极端情况下,如果左右表的最大最小值相同则没有效果,反之右表最大值小于左表最小值,或右表最小值大于左表最大值,则效果最好。
- 当join on clause中Key列的类型为varchar等时,应用MinMax Filter往往会导致性能降低。
- **IN predicate**: 根据join on clause中Key列在右表上的所有值构建IN predicate,使用构建的IN predicate在左表上过滤,相比loom Filter构建和应用的开销更低,在右表数据量较少时往往性能更高。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loom Filter拼写错误

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, thks

在不同Fragment上构建和应用Runtime Filter时,需要合并Runtime Filter的原因和策略可参阅 [ISSUE 6116](https://github.com/apache/incubator-doris/issues/6116)

#### 3.runtime_filter_wait_time_ms
在开启Runtime Filter后,join左表的ScanNode会为每一个分配给自己的Runtime Filter等待一段时间再扫描数据,默认等待1s(1000ms,以毫秒为单位),即如果ScanNode被分配了3个Runtime Filter,那么它最多会等待3s。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么是3秒?不是并行的么?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不是,目前是在一个循环里面,一个一个等待的

@xinyiZzz xinyiZzz force-pushed the runtime_filter_docs branch 2 times, most recently from e3e6a6d to a893a64 Compare July 9, 2021 06:43
如果Runtime Filter在ScanNode开始扫描之后到达,则ScanNode不会将该Runtime Filter下推到存储引擎,而是对已经从存储引擎扫描上来的数据,在ScanNode上基于该Runtime Filter使用表达式过滤,之前已经扫描的数据则不会应用该Runtime Filter,这样得到的中间数据规模会大于最优解,但可以避免严重的裂化。

如果集群比较繁忙,并且集群上有许多资源密集型或长耗时的查询,可以考虑增加等待时间,以避免复杂查询错过优化机会。如果集群负载较轻,并且集群上有许多只需要几秒的小查询,可以考虑减少等待时间,以避免每个查询增加1s的延迟。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

错过优化机会是什么意思?如何错过的?如何才能避免错过?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

错过:如果没有在规定时间内等到Runtime filter到达,会不使用Runtime FIlter直接开始扫描,即没用上runtime filter
避免:可以考虑增加等待时间

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Jul 17, 2021
@morningman morningman merged commit afce23a into apache:master Jul 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. area/runtime-filter kind/docs Categorizes issue or PR as related to documentation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants