[Discuss] Add Server for seatunnel #1947

Open
5 of 17 tasks
dijiekstra opened this issue May 25, 2022 · 27 comments

Comments

@dijiekstra
Contributor

dijiekstra commented May 25, 2022

Code of Conduct

Search before asking

  • I had searched in the issues and found no similar issues.

Describe the proposal

Background

Why do we need a Server

Suppose I am a SeaTunnel user who wants to import database data or business logs into an OLAP engine. Today I can only submit tasks from the command line, and stopping and maintaining tasks depends on Spark/Flink. This creates a large amount of extra work for us:

  1. Task script management: how do we manage scripts scattered across Linux servers? If a task needs a timed trigger, crontab is the only option.
  2. There is no unified entry point for users to submit and maintain tasks.
  3. I may need pre-processing or post-processing before and after a task is submitted. I can wrap that in my own script, but if every task needs it, task management becomes even more complicated.
  4. I could use Azkaban or DolphinScheduler for all of this, but that brings in more components, and all I wanted from the start was to synchronize data to OLAP. I don't need a scheduling engine's scheduling capabilities; I just want to manage and operate my SeaTunnel scripts better. Moreover, the capabilities of Azkaban or DolphinScheduler may not be enough to support our data integration scenarios, since they are not designed specifically for data integration.

Back to the SeaTunnel developer's perspective
For a platform or a service to offer only command-line interaction, with no visual control platform, is unreasonable.
What does the control platform need to do?
The most important thing is to manage the configuration of data integration tasks.

Users should be able to complete task configuration easily in the WebUI: input and output data sources, field information, partition information, filter conditions, handling of abnormal data, scheduling time, concurrency control, traffic control, and incremental or full data integration.

In short, the goal is to let users express their business requirements through simple configuration.
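To make the configuration items above more concrete, here is a minimal sketch of how a task definition submitted from the WebUI might be modeled. Everything in it is an assumption for illustration: the `TaskConfig` class, its field names, and the validation rules are hypothetical and are not taken from the SeaTunnel codebase or from the designs in #1968 and #1969.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TaskConfig:
    """Hypothetical task definition, roughly as a WebUI form might submit it."""

    name: str
    source: str                              # input data source (assumed identifier)
    sink: str                                # output data source (assumed identifier)
    fields: List[str] = field(default_factory=list)
    partition_field: Optional[str] = None
    filter_condition: Optional[str] = None
    schedule_cron: Optional[str] = None      # None means "run on demand"
    parallelism: int = 1                     # concurrency control
    incremental: bool = False                # False means full synchronization

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means valid."""
        problems = []
        if not self.name.strip():
            problems.append("name must not be empty")
        if not self.fields:
            problems.append("at least one field is required")
        if self.parallelism < 1:
            problems.append("parallelism must be >= 1")
        return problems

    def to_json(self) -> str:
        """Serialize the task so a server could store or version it."""
        return json.dumps(asdict(self), sort_keys=True)


# Example task: a nightly sync of an orders table into an OLAP engine.
task = TaskConfig(
    name="orders_to_olap",
    source="mysql_orders",
    sink="olap_orders",
    fields=["id", "amount", "created_at"],
    partition_field="created_at",
    schedule_cron="0 2 * * *",
)
```

Keeping the whole task in one serializable object is one possible way for a server to store, validate, and version task definitions instead of leaving them in scattered scripts.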

Besides making development easier, the remaining goal is to make operations easier:

  • Provide task execution logs so users can check how their tasks ran;

  • Provide management of data sources and permissions, which is a common need in multi-user and multi-tenant scenarios;

  • Provide system load monitoring and task execution alarms.

Ideally, these capabilities would live on the same platform as other types of jobs, because those jobs have similar requirements. This places higher demands on the design of the control platform: it should be able to reuse the existing capabilities of a scheduling system or operations center, and where no such service exists, it should have built-in capabilities to cover the same needs.

However, there are some things we cannot do by relying on other applications, or for which the ROI would be too low:

  1. Schema evolution: this depends mostly on the capabilities of the engine, but with a control platform we can do some simple schema evolution, such as automatically adding new fields or inserting deleted fields as empty data.
  2. Data time: for example, in a MySQL binlog -> Hive scenario, data may arrive late for various reasons. How do we route that data into the correct partition?
  3. Dynamic partition: in some scenarios we need to re-synchronize a copy of historical data to the big data cluster. The data already contains the partition field, and we need to use this field to insert into the corresponding Hive partition. This is easy in batch processing, but in data integration it comes at a high cost, for example modifying or adding Spark/Flink connectors. However, if we have a control platform that provides post-processing capabilities, SeaTunnel only needs to write the data to a partition of a temporary table and then run a batch job over it through Hive/Spark/Flink.
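As a rough illustration of items 2 and 3, the sketch below derives a partition value from a record's own event time (so late data still lands in the right partition) and builds the kind of batch statement a post-processing step could run over a temporary table. The function names, table names, and column names are hypothetical. The generated SQL relies on Hive's dynamic partition insert, where the partition column comes last in the `SELECT` list; `INSERT OVERWRITE` is chosen here as an assumption so that a re-run replaces the affected partitions.

```python
from datetime import datetime, timedelta, timezone
from typing import List

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_for(event_time_ms: int, fmt: str = "%Y-%m-%d") -> str:
    """Pick the partition from the record's event time, not its arrival time."""
    return (EPOCH + timedelta(milliseconds=event_time_ms)).strftime(fmt)


def build_post_process_sql(
    tmp_table: str, target_table: str, columns: List[str], partition_field: str
) -> str:
    """Build a Hive dynamic-partition insert from a temporary table.

    Identifiers are assumed to be trusted, already-validated names.
    """
    # Hive expects dynamic partition columns at the end of the SELECT list.
    select_list = ", ".join(columns + [partition_field])
    return (
        "SET hive.exec.dynamic.partition.mode=nonstrict;\n"
        f"INSERT OVERWRITE TABLE {target_table} PARTITION ({partition_field})\n"
        f"SELECT {select_list} FROM {tmp_table};"
    )


# A record produced on 2022-05-25 (UTC) keeps that partition even if it arrives days later.
late_record_partition = partition_for(1653436800000)

sql = build_post_process_sql(
    tmp_table="tmp_orders",
    target_table="ods_orders",
    columns=["id", "amount"],
    partition_field="dt",
)
```

With this split, the data integration job itself stays simple (write everything to one temporary partition), and the partition routing is handled by an ordinary batch statement that the control platform runs afterwards.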

Therefore, for users to make better use of SeaTunnel, a control platform is essential.

Target

In a word: provide convenient task development, operation, and maintenance, so that end-to-end data integration is easy to complete.

Functional Target

  • Management
    • data source management
    • auth management
    • service management
  • Development
    • CRUD of tasks
    • whole-database migration
  • Maintenance
    • maintenance panel
    • manual (temporary) task maintenance
    • scheduled task maintenance
    • real-time task maintenance
  • Monitoring
    • service status
    • task metrics (or would relying on Grafana be a better implementation?)
    • alarms
      • alarm configuration
      • alarm records
      • alarm suppression

Rome was not built in a day
There are many more features to implement than the ones I've listed. Given the time available, I plan to finish the development and maintenance parts first. Because my tech stack is more Flink-oriented, support for Spark-type engines may not arrive as quickly.

Maintainability

  1. The Server should have only one role, and multiple Server instances must have equal status.

Extensibility

Architectural Design

#1968

Detail Design

#1969

Subsequent planning

This round of design and development is v1.0. We need more people to join later and implement more functions:

  1. Integration with DolphinScheduler, for example as scheduler-engine-ds

  2. Web page development, using open-source front-end scaffolding to complete it quickly

Task list

// I will fill in the task list soon

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@EricJoy2048
Member

This is a very good proposal. I think web server is a very important feature of SeaTunnel.
Could you update the content to English?

@Hisoka-X Hisoka-X added discuss feature New feature labels May 25, 2022
@dijiekstra
Contributor Author

This is a very good proposal. I think web server is a very important feature of SeaTunnel. Could you update the content to English?

Thank you for your comment.
I will translate the content into English once the whole design is finished. It will be done before the end of this week.

@fanyo190

@dijiekstra When will the feature you described be released?

@dijiekstra
Contributor Author

@dijiekstra When will the feature you described be released?

It is still in the design stage. Such a large feature needs to be discussed and approved by the community before it can be developed.

@dijiekstra
Contributor Author

@CalvinKirs @ruanwenjun @gaojun2048
Hi, I have completed most of the design content. Please help me review it.
If there are no problems, I will start development based on this design.

@dijiekstra
Contributor Author

Hi, contributors and everyone else:
Would you like to work with me on this feature? I need your help. If I developed it by myself, it would take me a lot of time. After all, I have a day job, and I'm not doing open source full-time.

@Hisoka-X
Member

I'm not sure we need authority management and task alerts, because SeaTunnel is a framework.

@CalvinKirs
Member

CalvinKirs commented May 30, 2022

I'm not sure we need authority management and task alerts, because SeaTunnel is a framework.

It depends on the community. If SeaTunnel is committed to becoming a platform, this could be good.

@dijiekstra
Contributor Author

I'm not sure we need authority management and task alerts, because SeaTunnel is a framework.

Task alerts depend on a scheduler such as DolphinScheduler or others.
If we want to do it ourselves, we must check task status in real time.

@dijiekstra
Contributor Author

Or the scheduler pushes the result to us.

@legendtkl
Contributor

Great. This proposal will take a long time to build.

I suggest taking integration into consideration, such as user integration with LDAP, auth integration with Ranger, etc.

@dijiekstra
Contributor Author

Integration with Ranger is a good idea.
But I haven't integrated with Ranger before, so it will take longer.

@dijiekstra
Contributor Author

Since no one has raised any objections, I will start development next week. I will post progress updates in this issue regularly.

@luketalent
Contributor

good !! I'm in

@dijiekstra
Contributor Author

This is a very good proposal. I think web server is a very important feature of SeaTunnel. Could you update the content to English?

Done.

@EricJoy2048
Member

EricJoy2048 commented Jun 14, 2022

[OnlineMeeting&June 7]SeaTunnel community meeting Topic collect #1986

We are looking for people who are willing to work together for this feature. Are you interested in participating?

@mengfeiMonica

mengfeiMonica commented Jun 15, 2022

Hi guys, my name is Monica and I am a PM. I designed some parts of the functionality below. Looking forward to your feedback. #2099

@dijiekstra
Contributor Author

ONLINE API VIEW
URL: https://www.apifox.cn/apidoc/shared-c6e2e561-8e6d-446d-9386-1c4a2c3ab50f PASSWORD : seatunnel-1947

@luketalent
Contributor

@dijiekstra I wanna join, how do I start??

@Hxssssss

@dijiekstra I wanna join, how do I start??

@songjianet
Member

I will handle the front-end work. For the front-end changes, please refer to #2076.

@songjianet
Member

Is ST not considering the function of theme switching?

@dijiekstra
Contributor Author

Is ST not considering the function of theme switching?
Ignore these features for now. Thanks for your idea.

@dijiekstra
Contributor Author

Basic script management is ready. Next I'll focus on developing the integration with the scheduler.

@2013650523
Contributor

@dijiekstra I wanna join, how do I start??

@zhuangchong
Contributor

What tasks are currently unclaimed?

@jianneng-fit2cloud

I'm running into a problem when using it at the moment: No matched script save dir [/dj]. I don't know how to deal with this.
