Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] add tutorial for sharding table #551

Closed
wants to merge 5 commits into from

Conversation

luoyuxia
Copy link
Contributor

@luoyuxia luoyuxia commented Nov 3, 2021

write tutorials for database/table sharding scenario

@luoyuxia
Copy link
Contributor Author

luoyuxia commented Nov 5, 2021

@openinx hi, could you please help review it?

Copy link
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @luoyuxia for the great contribution.

@openinx
Copy link
Member

openinx commented Nov 9, 2021

Let me take a look for this today ! Thanks @luoyuxia for the work !

Copy link
Member

@openinx openinx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This materials is a great one to show people how to integrate database cdc with downstream iceberg/hudi data lake. I will propose to add this to apache iceberg offical blog if possible. Just left several comments that we may need to address.

This tutorial will show how to use Flink CDC to build a real-time data lake for such a scenario.
You can walk through the tutorial easily for the environment is built with docker, and the entire process uses standard SQL syntax without a single line of Java/Scala code or IDE installation.

The following sections will take the pipeline from MySQL to Iceberg as an example. The overview of the architecture is as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide the apache iceberg official document link for the Iceberg word ? https://iceberg.apache.org/


1. Enable checkpoints every 3 seconds

Checkpoint is disabled by default, we need to enable it to commit Iceberg files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: commit Iceberg transactions.


![Architecture of Real-Time Data Lake](/_static/fig/real-time-data-lake-tutorial/real-time-data-lake-tutorial.png "architecture of real-time data lake")

你也可以使用不同的 source 比如 Oracle/Postgres 和 sink 比如 Doris/Hudi 来构建自己的 ETL 流程。
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
你也可以使用不同的 source 比如 Oracle/Postgres 和 sink 比如 Doris/Hudi 来构建自己的 ETL 流程。
你也可以使用不同的 source 比如 Oracle/Postgres 和 sink 比如 Hudi 来构建自己的 ETL 流程。

Doris 就不提了,数据湖专题就只提湖吧。

- MySQL:作为分库分表的数据源,存储本教程的 `user` 表

***注意:***
1. 为了简化整个教程,本教程需要的 jar 包都已经被打包进 SQL-Client 容器中了,如果你想要在自己的 Flink 环境运行本教程,需要下载下面列出的包并且把它们放在 Flink 所在目录的 lib 目录下,即 `FLINK_HOME/lib/`。
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL Client 的镜像打包源码(Dockerfile, etc..)最好也公开下,这样大家也可以自己照着做自己的镜像。

每执行一步,我们就可以在 Flink Client CLI 中使用 `SELECT * FROM all_users_sink` 查询表 `all_users_sink` 来看到数据的变化。

整体的数据变化如下所示:
![Data Changes in Iceberg](/_static/fig/real-time-data-lake-tutorial/data-changes-in-iceberg.gif "Data Changes in Iceberg")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 query 应该是个类似 batch query ,无法看到变化吧?但是看这个图有在变的,是将多个截图拼在了一起么?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

嗯 对,这是个 batch query, @openinx 目前如果iceberg 还不支持源数据有update/delete的 streaming query 所以是在每一步执行后就运行这个batch query。
然后多个截图拼起来得到了这个batch query。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

那要不最后就展示一个最终结果的截图吧。

```sql
-- Flink SQL
Flink SQL> INSERT INTO all_users_sink select * from user_source;
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里展示一下 flink web ui 的作业图,另外也提一句,这是一个流式作业,所以会源源不断地将数据库新增数据同步到 iceberg中。

image

![Data Changes in Iceberg](/_static/fig/real-time-data-lake-tutorial/data-changes-in-iceberg.gif "Data Changes in Iceberg")

## 环境清理
本教程结束后,在 `docker-compose.yml` 文件所在的目录下执行如下命令停止所有容器:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有没有其他的方式可以展示下 Iceberg 中的数据信息、版本信息? 现在都是在 Flink SQL CLI 中操作,直观上 iceberg 存在感不强。像我们另外的 Demo 会在 Kibana 中展示 ES 的数据,让用户直观感受到数据确实进到 ES 里面了。@openinx 觉得呢?

Copy link
Contributor Author

@luoyuxia luoyuxia Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我想的是是不是可以
展示一下iceberg中的文件,metadata 和 data 这种,表示数据确实写入到iceberg当中了。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

也可以的。 可以用 tree 命令。

Copy link
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @luoyuxia for the great work! LGTM.

wuchong pushed a commit that referenced this pull request Nov 11, 2021
@wuchong
Copy link
Member

wuchong commented Nov 11, 2021

Merged in aa5a2ee

@wuchong wuchong closed this Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants