[docs] add tutorial for sharding table #551

Closed

luoyuxia wants to merge 5 commits into apache:master from luoyuxia:tutorial-with-sharding-table

Contributor

luoyuxia commented Nov 3, 2021

Write tutorials for the database/table sharding scenario.


          [docs] add tutorial for sharding table

f3e6c15

luoyuxia force-pushed the tutorial-with-sharding-table branch from 5488eb0 to f3e6c15 Compare

November 3, 2021 09:52

wuchong reviewed


          [docs] add tutorial about building real-time data lake for MySQL shar…

d388c3b

…ding table with Flink CDC

luoyuxia force-pushed the tutorial-with-sharding-table branch from 1bc9468 to d388c3b Compare

November 5, 2021 11:53

Contributor Author

luoyuxia commented Nov 5, 2021 •

edited

@openinx hi, could you please help review it?

wuchong reviewed

Member

wuchong left a comment

Thanks @luoyuxia for the great contribution.


          fix comments

b09c1c0

Member

openinx commented Nov 9, 2021

Let me take a look at this today! Thanks @luoyuxia for the work!

openinx reviewed

Member

openinx left a comment

This material is a great way to show people how to integrate database CDC with a downstream Iceberg/Hudi data lake. I will propose adding it to the Apache Iceberg official blog if possible. I just left several comments that we may need to address.

docs/content/quickstart/build-real-time-data-lake-tutorial.md Outdated

+              This tutorial will show how to use Flink CDC to build a real-time data lake for such a scenario.
+              You can walk through the tutorial easily for the environment is built with docker, and the entire process uses standard SQL syntax without a single line of Java/Scala code or IDE installation.
+              The following sections will take the pipeline from MySQL to Iceberg as an example. The overview of the architecture is as follows:

Member

openinx Nov 9, 2021

Could you link the word "Iceberg" to the official Apache Iceberg documentation? https://iceberg.apache.org/

docs/content/quickstart/build-real-time-data-lake-tutorial.md Outdated


		1. Enable checkpoints every 3 seconds

		Checkpoint is disabled by default, we need to enable it to commit Iceberg files.

Member

openinx Nov 9, 2021

Nit: commit Iceberg transactions.
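
For readers following this thread, the step under review could be sketched in the Flink SQL Client roughly as below. This is a minimal sketch, not the tutorial's exact text: it assumes a Flink version (1.13 or later) whose SQL Client accepts the quoted `SET` syntax, and the 3-second value is taken from the step title quoted above.

```sql
-- Flink SQL Client (assumed: Flink 1.13+ quoted SET syntax)
-- Checkpointing is disabled by default; the Iceberg sink commits its
-- transactions on checkpoint completion, so an interval must be set.
SET 'execution.checkpointing.interval' = '3s';
```

A short interval like this suits a demo where changes should become visible quickly; a real deployment would likely choose a longer one.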


          fix comments

200bdb5

luoyuxia force-pushed the tutorial-with-sharding-table branch from e78a58a to 200bdb5 Compare

November 10, 2021 09:17

openinx approved these changes

wuchong reviewed

docs/content/快速上手/build-real-time-data-lake-tutorial-zh.md Outdated


		![Architecture of Real-Time Data Lake](/_static/fig/real-time-data-lake-tutorial/real-time-data-lake-tutorial.png "architecture of real-time data lake")

		You can also use different sources such as Oracle/Postgres and sinks such as Doris/Hudi to build your own ETL pipeline.

Member

wuchong Nov 11, 2021

Suggested change

      
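-           You can also use different sources such as Oracle/Postgres and sinks such as Doris/Hudi to build your own ETL pipeline.
+           You can also use different sources such as Oracle/Postgres and sinks such as Hudi to build your own ETL pipeline.

Let's leave Doris out; since this is a data lake topic, let's mention only lakes.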

docs/content/快速上手/build-real-time-data-lake-tutorial-zh.md Outdated

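+              - MySQL: the data source for the sharded databases and tables; it stores the `user` table used in this tutorial
+              ***Note:***
+. To simplify the tutorial, the jar packages it requires are already packaged into the SQL-Client container. If you want to run this tutorial in your own Flink environment, download the packages listed below and put them in the lib directory under the Flink home directory, i.e. `FLINK_HOME/lib/`.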

Member

wuchong Nov 11, 2021

It would be best to also publish the source used to build the SQL Client image (Dockerfile, etc.), so that people can follow it to build their own image.

docs/content/快速上手/build-real-time-data-lake-tutorial-zh.md Outdated

+                 After each step, we can query the table `all_users_sink` with `SELECT * FROM all_users_sink` in the Flink Client CLI to see the data changes.
+                 The overall data changes are shown below:
+                 ![Data Changes in Iceberg](/_static/fig/real-time-data-lake-tutorial/data-changes-in-iceberg.gif "Data Changes in Iceberg")

Member

wuchong Nov 11, 2021

This query should be something like a batch query, so it can't show the changes, right? But the figure does change; were multiple screenshots stitched together?

Contributor Author

luoyuxia Nov 11, 2021

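Yes, that's right, this is a batch query. @openinx Iceberg currently does not yet support streaming queries when the source data has updates/deletes, so I ran this batch query after each step
and then stitched the screenshots together to produce this figure.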

Member

wuchong Nov 11, 2021

Then how about just showing a single screenshot of the final result at the end?

docs/content/快速上手/build-real-time-data-lake-tutorial-zh.md

+                 ```sql
+                 -- Flink SQL
+                 Flink SQL> INSERT INTO all_users_sink select * from user_source;
+                 ```

Member

wuchong Nov 11, 2021

Show the job graph from the Flink web UI here. Also mention that this is a streaming job, so it will continuously sync newly added data from the database to Iceberg.

docs/content/快速上手/build-real-time-data-lake-tutorial-zh.md

+                 ![Data Changes in Iceberg](/_static/fig/real-time-data-lake-tutorial/data-changes-in-iceberg.gif "Data Changes in Iceberg")
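+              ## Clean up the environment
+              After finishing the tutorial, run the following command in the directory containing the `docker-compose.yml` file to stop all containers: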

Member

wuchong Nov 11, 2021

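Is there another way to show the data and version information in Iceberg? Right now everything is done in the Flink SQL CLI, so Iceberg doesn't have much of a visible presence. In our other demos we show the ES data in Kibana, so users can see directly that the data really did land in ES. @openinx what do you think?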

Contributor Author

luoyuxia Nov 11, 2021 •

edited

I was thinking we could
show the files in Iceberg, such as the metadata and data files, to show that the data really has been written to Iceberg.

Member

wuchong Nov 11, 2021

That works too. You can use the tree command.

luoyuxia force-pushed the tutorial-with-sharding-table branch from d3c57bf to 51c1a63 Compare

November 11, 2021 07:38


          fix comments

2614b45

luoyuxia force-pushed the tutorial-with-sharding-table branch from 51c1a63 to 2614b45 Compare

November 11, 2021 07:46

wuchong approved these changes

Member

wuchong left a comment

Thanks @luoyuxia for the great work! LGTM.

wuchong pushed a commit that referenced this pull request


          [docs] Add tutorial about building real-time data lake for MySQL shar…

aa5a2ee

…ding table with Flink CDC (#551)

Member

wuchong commented Nov 11, 2021

Merged in aa5a2ee

wuchong closed this
