Skip to content

Conversation

@lincoln-lil
Copy link
Contributor

What is the purpose of the change

This is the user documentation work for the new LOOKUP hint, includes Chinese version

Brief change log

add user documentation for the new LOOKUP hint in hints.md page

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @public(Evolving): (no)
  • The serializers: (no )
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@flinkbot
Copy link
Collaborator

flinkbot commented Aug 15, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@lincoln-lil
Copy link
Contributor Author

@flinkbot run azure

Copy link
Contributor

@godfreyhe godfreyhe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @lincoln-lil , I left some comments, and please add some doc to explain the relation of retry behavior with/without full cache/partial cache

Comment on lines 124 to 130
<tr>
<th rowspan="1">table</th>
<th>table</th>
<th>N</th>
<th>string</th>
<th>N/A</th>
<th>the table name of the lookup source</th>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<th> is used for table header, while <td> is used for table body.

Comment on lines 138 to 139
<th>value can be 'true' or 'false' to suggest the planner choose the corresponding lookup function.
If the backend lookup source does not support the suggested lookup mode, it will take no effect.</th>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the text should align left

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

<th>N</th>
<th>string</th>
<th>N/A</th>
<th>the table name of the lookup source</th>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is full path of table name required ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is consistent with from clause

<tr>
<th>option type</th>
<th>option name</th>
<th>optional</th>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use required ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

```
Note: the optimizer prefers async lookup if no 'async' option is specified, it will always use sync lookup when:
1. the connector only implements the sync lookup
2. user enables 'TRY_RESOLVE' mode of 'table.optimizer.non-deterministic-update.strategy' and the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any doc to explain them more? if existing, please add a link to refer it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there should be a configuration link

Note: the async options are consistent with the async options in [job level Execution Options]({{< ref "docs/dev/table/config" >}}#execution-options),
will use job level configuration if not set. Another difference is that the scope of the LOOKUP hint
is smaller, limited to the table name corresponding to the hint option set in the current lookup
operation (other lookup operations will not be affected by the LOOKUP hint).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Another difference is that the scope of the LOOKUP hint
is smaller, limited to the table name corresponding to the hint option set in the current lookup
operation (other lookup operations will not be affected by the LOOKUP hint)."

It's better we should introduce select scope concept to explain it, which is more specific and precise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you that select scope is necessary for query hints, and I suggested @swuferhong introducing this concept in the begining of query hints in #20513
so this part will not add it again, but after #20351 merged, we can add a link here

##### 3. Enable Delayed Retry Strategy For Lookup
Delayed retry for lookup join is intended to solve the problem of delayed updates in external system
which cause unexpected enrichment with stream data. The hint option 'retry-predicate'='lookup_miss'
can enable retry on both sync and async lookup, only support fixed delay retry strategy currently.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only fixed delay retry strategy is supported currently.

Copy link
Contributor Author

@lincoln-lil lincoln-lil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@godfreyhe thanks for your comments here! I've addressed these comments and add a section about 'Effect Of Enabling Caching On Retries'

<tr>
<th>option type</th>
<th>option name</th>
<th>optional</th>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

<th>N</th>
<th>string</th>
<th>N/A</th>
<th>the table name of the lookup source</th>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is consistent with from clause

Comment on lines 138 to 139
<th>value can be 'true' or 'false' to suggest the planner choose the corresponding lookup function.
If the backend lookup source does not support the suggested lookup mode, it will take no effect.</th>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Note: the async options are consistent with the async options in [job level Execution Options]({{< ref "docs/dev/table/config" >}}#execution-options),
will use job level configuration if not set. Another difference is that the scope of the LOOKUP hint
is smaller, limited to the table name corresponding to the hint option set in the current lookup
operation (other lookup operations will not be affected by the LOOKUP hint).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you that select scope is necessary for query hints, and I suggested @swuferhong introducing this concept in the begining of query hints in #20513
so this part will not add it again, but after #20351 merged, we can add a link here

##### Further Notes

###### Effect Of Enabling Caching On Retries
[FLIP-221](https://cwiki.apache.org/confluence/display/FLINK/FLIP-229%3A+Introduces+Join+Hint+for+Flink+SQL+Batch+Job) adds caching support for lookup source,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this url is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

When PARTIAL caching is enabled, it will lookup from local cache first for a coming record and will
do an external lookup via backend connector if cache miss(if cache hit, then return the record immediately),
and this will trigger a retry when lookup result is empty(same with caching disabled), the final lookup
result is determined when retry done(in PARTIAL caching mode, it will also update local cache).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the retry is complete ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to "the final lookup result is determined when retry completed"

Comment on lines +100 to +108
```sql
SELECT /*+ LOOKUP(key=value[, key=value]*) */

key:
stringLiteral

value:
stringLiteral
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a section ### Syntax

- async options are all optional, will use default value if not configured.
- there is no default value for retry options, all retry options should be set to valid values when need to enable retry.

##### 1. Use Sync And Async Lookup Function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the font is too small, it's better we can change the title levels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 5th level is too small, seems we should not follow the sub levels strictly, I'll update this.

stringLiteral
```

The available hint options:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a section ### Lookuup Hint Options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor Author

@lincoln-lil lincoln-lil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@godfreyhe I've addressed all your comments and updated the pr, thanks again!

When PARTIAL caching is enabled, it will lookup from local cache first for a coming record and will
do an external lookup via backend connector if cache miss(if cache hit, then return the record immediately),
and this will trigger a retry when lookup result is empty(same with caching disabled), the final lookup
result is determined when retry done(in PARTIAL caching mode, it will also update local cache).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to "the final lookup result is determined when retry completed"

stringLiteral
```

The available hint options:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

- async options are all optional, will use default value if not configured.
- there is no default value for retry options, all retry options should be set to valid values when need to enable retry.

##### 1. Use Sync And Async Lookup Function
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 5th level is too small, seems we should not follow the sub levels strictly, I'll update this.

@lincoln-lil
Copy link
Contributor Author

@flinkbot run azure

1 similar comment
@lincoln-lil
Copy link
Contributor Author

@flinkbot run azure

Copy link
Contributor

@godfreyhe godfreyhe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update

2. configure the async parameters
3. enable delayed retry strategy for lookup

##### Syntax
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

### Syntax

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to level4 "#### Syntax" ?
because LOOKUP hint is level4 and not displayed on index
image

Comment on lines 89 to 91
### Join Hints

#### LOOKUP Hint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about merge them into ### LOOKUP Join Hints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "### Join Hints" here is reserved for merge new join hint doc, to keep consistent with other join hints, change "#### LOOKUP Hint" -> "#### LOOKUP"

```
Note: the optimizer prefers async lookup if no 'async' option is specified, it will always use sync lookup when:
1. the connector only implements the sync lookup
2. user enables 'TRY_RESOLVE' mode of ['table.optimizer.non-deterministic-update.strategy']({{ < ref "docs/dev/table/config" > }}#table-optimizer-non-deterministic-update-strategy) and the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong display result:
image

LOOKUP('table'='Customers', 'async'='true', 'output-mode'='allow_unordered', 'capacity'='100', 'timeout'='180s')
```
注意:联接提示上的异步查找参数和[作业级别配置参数]({{< ref "docs/dev/table/config" >}}#execution-options)的
含义是一致的,没有设置的参数值由默认值生效,另一个区别是联接提示作用的范围更小,仅限于当前联接操作中对应联接提示选项设置的表名(未被联接提示作用的其他联接查询不受影响)。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need new line here

Comment on lines 276 to 277
缓存策略有部分缓存、全部缓存两种,当开启全部缓存时('lookup.cache'='FULL'),重试无法起作用(因为查找表被完整缓存,重试查找没有任何实际意义);当开启部分缓存时,当一条数据开始查找处理时,
先在本地缓存中查找,如果没找到则通过连接器进行外部查找(如果存在,则立即返回),此时查不到的记录和不开启缓存时一样,会触发重试查找,重试结束时的结果即为最终的查找结果(在部分缓存模式下,更新本地缓存)。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当开启全部缓存时 -> 开启全部缓存时
当开启部分缓存时 -> 开启部分缓存时 otherwise, there are two s


#### 关于查找键及 'retry-predicate'='lookup_miss' 重试条件的说明
对不同的连接器,提供的索引查找能力可能是不同的,例如内置的 HBase 连接器,默认仅提供了基于 `rowkey` 的索引查找能力(未
启用二级索引),而对于内置的 JDBC 连接器,默认情况下任何字段都可以被用作索引查找,这是物理存储的特性不同所决定的。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

@lincoln-lil lincoln-lil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@godfreyhe thanks for your detailed comments! I've updated the pr.

Comment on lines 89 to 91
### Join Hints

#### LOOKUP Hint
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "### Join Hints" here is reserved for merge new join hint doc, to keep consistent with other join hints, change "#### LOOKUP Hint" -> "#### LOOKUP"

2. configure the async parameters
3. enable delayed retry strategy for lookup

##### Syntax
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to level4 "#### Syntax" ?
because LOOKUP hint is level4 and not displayed on index
image

@lincoln-lil
Copy link
Contributor Author

@flinkbot run azure

Copy link
Contributor

@godfreyhe godfreyhe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@godfreyhe godfreyhe closed this in bbe9690 Sep 2, 2022
huangxiaofeng10047 pushed a commit to huangxiaofeng10047/flink that referenced this pull request Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants