[docs] add tutorial for sharding table #551
Conversation
Force-pushed 5488eb0 to f3e6c15
…ding table with Flink CDC
Force-pushed 1bc9468 to d388c3b
Hi @openinx, could you please help review it?
Thanks @luoyuxia for the great contribution.
Let me take a look at this today! Thanks @luoyuxia for the work!
This material is a great way to show people how to integrate database CDC with a downstream Iceberg/Hudi data lake. I would propose adding it to the Apache Iceberg official blog if possible. I've just left several comments that we may need to address.
> This tutorial will show how to use Flink CDC to build a real-time data lake for such a scenario.
> You can walk through the tutorial easily, as the environment is built with Docker, and the entire process uses standard SQL syntax without a single line of Java/Scala code or any IDE installation.
> The following sections will take the pipeline from MySQL to Iceberg as an example. The overview of the architecture is as follows:
Could you link the word "Iceberg" to the official Apache Iceberg documentation? https://iceberg.apache.org/
> 1. Enable checkpoints every 3 seconds
> Checkpointing is disabled by default; we need to enable it to commit Iceberg files.
Nit: commit Iceberg transactions.
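The step under discussion can be sketched in the Flink SQL CLI as follows. This is a sketch based on the quoted tutorial text, assuming Flink's `execution.checkpointing.interval` option:

```sql
-- Flink SQL
-- Checkpointing is disabled by default; Iceberg commits files to the table
-- on each completed checkpoint, so a short interval keeps the lake fresh.
SET execution.checkpointing.interval = 3s;
```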
Force-pushed e78a58a to 200bdb5
> ![Architecture of Real-Time Data Lake](/_static/fig/real-time-data-lake-tutorial/real-time-data-lake-tutorial.png "architecture of real-time data lake")
> You can also use a different source such as Oracle/Postgres and a different sink such as Doris/Hudi to build your own ETL pipeline.
Suggested change (drop "Doris"):
> You can also use a different source such as Oracle/Postgres and a different sink such as Hudi to build your own ETL pipeline.

Let's not mention Doris here; since this topic is about data lakes, only mention lake formats.
> - MySQL: the data source for the sharded databases and tables, storing this tutorial's `user` tables
> ***Note:***
> 1. To keep the tutorial simple, the jar packages it needs have already been bundled into the SQL-Client container. If you want to run the tutorial in your own Flink environment, download the packages listed below and put them in the `lib` directory of your Flink home, i.e. `FLINK_HOME/lib/`.
It would be good to also publish the build sources for the SQL Client image (Dockerfile, etc.), so that everyone can follow along and build their own image.
> After each step, we can query the table `all_users_sink` with `SELECT * FROM all_users_sink` in the Flink SQL CLI to see the data changes.
> The overall data changes are shown below:
> ![Data Changes in Iceberg](/_static/fig/real-time-data-lake-tutorial/data-changes-in-iceberg.gif "Data Changes in Iceberg")
This query should be something like a batch query, so it can't show the changes, right? But the image does show the data changing; was it stitched together from multiple screenshots?
Yes, exactly, it is a batch query. @openinx Since Iceberg does not yet support streaming queries over source data that contains updates/deletes, I run this batch query after each step, and the GIF was made by stitching the screenshots of those batch queries together.
Then how about just showing a single screenshot of the final result at the end?
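The batch query discussed above, run in the Flink SQL CLI after each step, is simply:

```sql
-- Flink SQL
-- A batch query over the Iceberg table; re-running it after each step
-- takes a fresh snapshot of the table's current contents.
SELECT * FROM all_users_sink;
```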
```sql
-- Flink SQL
Flink SQL> INSERT INTO all_users_sink SELECT * FROM user_source;
```
> ![Data Changes in Iceberg](/_static/fig/real-time-data-lake-tutorial/data-changes-in-iceberg.gif "Data Changes in Iceberg")
> ## Clean Up the Environment
> After finishing the tutorial, run the following command in the directory containing the `docker-compose.yml` file to stop all containers:
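The cleanup command the quoted text refers to is presumably the standard Compose teardown (a sketch; the tutorial's exact command may differ):

```shell
# Run from the directory containing docker-compose.yml;
# stops and removes all of the tutorial's containers.
docker-compose down
```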
What I was thinking is whether we could show the files inside Iceberg, the metadata and data files, to demonstrate that the data really has been written into Iceberg.
That works too. You can use the `tree` command.
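As a sketch of that suggestion: an Iceberg table directory contains a `metadata/` folder (JSON metadata files plus Avro manifests) and a `data/` folder (Parquet files), so listing it shows that commits really landed. The warehouse path below is a hypothetical example, not taken from the tutorial:

```shell
# Hypothetical warehouse path; adjust to where your Iceberg catalog writes.
tree /path/to/warehouse/default_database/all_users_sink
```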
Force-pushed d3c57bf to 51c1a63
Force-pushed 51c1a63 to 2614b45
Thanks @luoyuxia for the great work! LGTM.
…ding table with Flink CDC (#551)
Merged in aa5a2ee
write tutorials for database/table sharding scenario