
[Feature] Introduce flink to dataproxy module #1599

Closed
ghost opened this issue Oct 9, 2021 · 7 comments

@ghost commented Oct 9, 2021

Description

Apache InLong contains four modules: Ingest, Converge, Cache, and Consume. Looking into the project, we
can find a similar structure across several modules, which we can call "source - channel - sink" (SCS for short).

  • In the Ingest module, there is a customized "source - channel - sink" structure.
  • The DataProxy module depends on the Flume project, which is a typical SCS structure.
  • In the Consume module, Flink can also be regarded as an SCS structure.

Several disadvantages:
In the Ingest module, we may need to implement many sources ourselves, covering both streaming and batch ingest, such as CDC, file, socket...
In the DataProxy module, as we know, development of the Flume project is slowing down.

This makes the modules hard to maintain: we need to learn the Flume API and implement many connectors ourselves. Apache Flink already contains rich connectors, supports SQL grammar and both batch and streaming source connectors, and
offers many other excellent features we can benefit from. So we can merge some functions of the Ingest and Converge modules and introduce Flink into the DataProxy module.
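
As a rough illustration (this is not existing InLong code; the host, port, and job name are made up), the SCS pattern maps naturally onto Flink's DataStream API:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative sketch only: the "source - channel - sink" pattern on Flink.
public class ScsOnFlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // "source": any of Flink's many connectors; a socket is used here for brevity
        DataStream<String> source = env.socketTextStream("localhost", 9999);

        // "channel": Flink's internal network/buffer layer plus user transforms
        DataStream<String> cleaned = source.map(String::trim);

        // "sink": print() stands in for an MQ or storage connector
        cleaned.print();

        env.execute("scs-on-flink-sketch");
    }
}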

Architecture

Here is the new architecture compared to the original.

Origin

(architecture diagram: current design)

New

(architecture diagram: proposed design)

@baomingyu (Contributor)

Thank you @leo65535 for your proposal, great ideas. The scheme of introducing Flink described here can take advantage of the Flink connector ecosystem to cover the data collection function. However, it also requires introducing a complete set of deployment and management mechanisms for Flink. For InLong, whose positioning makes data access the top priority, that would make service usage, operations, deployment, and maintenance somewhat more complicated.

DataProxy currently has two functions:

  1. Convergence link
    The main purpose here is to avoid having too many connections between clients and the MQ servers. DataProxy handles this problem centrally, so each message middleware connected behind it does not have to deal with it by itself.
  2. Lightweight data cache (memory + local disk)
    The consideration here is that when writes to the downstream message middleware jitter, the overall write performance can still be maintained (see the sketch after this list).
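
For illustration only, a minimal sketch of the memory-plus-local-disk idea behind function 2 (this is not DataProxy's actual implementation; all names are hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a memory-first buffer that spills events to local disk
// when the in-memory queue is full, so producers keep writing even while the
// downstream MQ jitters. Not DataProxy's actual code.
public class TwoTierBuffer {
    private final BlockingQueue<byte[]> memory = new ArrayBlockingQueue<>(10_000);
    private final Path spillDir;
    private long spillSeq = 0;

    public TwoTierBuffer(Path spillDir) throws IOException {
        this.spillDir = Files.createDirectories(spillDir);
    }

    // Try memory first; on overflow, persist the event to a local file.
    // (A real implementation would also replay spilled files once the MQ recovers.)
    public synchronized void put(byte[] event) throws IOException {
        if (!memory.offer(event)) {
            Files.write(spillDir.resolve("spill-" + spillSeq++), event);
        }
    }
}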

DataProxy is based on actual problems encountered in online big data scenarios; its design is relatively lightweight and convenient to operate, deploy, and manage. It has also passed the test of our current internal big data scenarios and is considered a relatively stable design and implementation. If a Flink-based solution is adopted, the two points described above need to be further considered and adapted.

Therefore, InLong should not adopt the Flink-based design described above in the short term. Still, Flink's connector mechanism is indeed a big advantage, and we will continue to pay attention to developments there.

@ghost (Author) commented Oct 12, 2021

Thanks @baomingyu, I get your points.

Looking at the history of data transmission at my company, we went through Sqoop -> Flume -> DataX -> Flink SQL, and Flink has brought us great benefits:

  • Works with resource frameworks like YARN and K8s
  • Easy to control source/sink parallelism
  • Friendly to developers, for example:
CREATE TABLE source
(
    id      BIGINT,
    user_id BIGINT,
    name    STRING
) WITH (
      'connector' = 'x-mysql',
      'url' = 'jdbc:mysql://ip:3308/tutiao?useSSL=false',
      'table-name' = 'kudu',
      'username' = 'username',
      'password' = 'password'
);

CREATE TABLE sink
(
    id      BIGINT,
    user_id BIGINT,
    name    STRING
) WITH (
      'connector' = 'x-hdfs',
      'path' = 'hdfs://ns/user/hive/warehouse/kudu_txt',
      'fileName' = 'pt=1',
      'properties.hadoop.user.name' = 'root',
      'sink.parallelism' = '3'
);

INSERT INTO sink
SELECT *
FROM source;

About DataProxy's two functions:

Convergence link

Flink can meet this requirement: we can use a socket as the Flink source and package an SDK for the user client.
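
For example, a client-side SDK could be as thin as a socket writer pointed at the proxy (a hypothetical sketch; the host, port, and wire format are placeholders):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical SDK sketch: every producer sends to one proxy endpoint, so the
// MQ cluster sees a handful of proxy connections instead of one per client.
public class ProxyClientSketch {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("proxy-host", 9999);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write("group-id|stream-id|payload\n"); // placeholder wire format
            out.flush();
        }
    }
}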

Lightweight data cache (memory + local disk)

This is not a big problem: in a big data scenario, as long as our system can scale out the MQ, it is not an issue. In fact, relying on memory or local disk is a dangerous policy, since it can cause data loss. The key point is that we should focus on the performance of the MQ.

@gosonzhang (Contributor)

@leo65535, I also believe that Flink can definitely meet the current functional requirements of DataProxy: implementing the access proxy function, accepting data over different protocols, packaging the data and reporting it to different target sets, and providing persistent cache processing for abnormal conditions.

But we will not consider using heavy components such as Flink in this module for the time being. For this kind of access-side module, the capabilities and community popularity a component can offer are not the focus of our considerations; if a required function is missing, we simply implement it ourselves. We care more about stability, performance, and ease of management and control at super-large scale. We tend to make it like a water pipe, stable and easy to use: if the flow is not enough, add another one; if a pipe breaks, just replace it.

For some complex processing logic we do use Flink, such as in our Sort ETL module, which requires fast iteration and connects directly with changing business needs. But even there it is not used directly and purely: after meeting business needs, we contribute our optimizations back to the community.

BTW, after using Flume for so many years, we also feel that it does not meet our needs in some places. For example, the performance of persistent storage is relatively poor, the wiring of Source, Channel, and Sink has to be controlled through static configuration files (see the sketch below), and the community is not active enough. We have sorted out its set of functions and are considering starting a transformation of it at some point. If you are interested in this module, we can discuss it together.
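
For context, this is roughly what the static wiring looks like in a stock Flume 1.x agent file (property names as in the Flume user guide; paths are placeholders). Any topology change means editing this file:

# One agent: netcat source -> file channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1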

@dockerzhang (Contributor)

@leo65535 @gosonzhang Thank you for the great thinking about InLong. Let me try to look at this question from another side.
We agree that Apache Flink has rich and mature connectors, and maybe we could add a plugin implementation like DataProxy-on-Flink in the future. But at present we will keep the Flume implementation for DataProxy and will not replace it with Flink in the upcoming versions, for these reasons:
1. InLong has an internal version that has been serving lots of Tencent businesses for years. We found that no MQ service could process data at that super-large scale, which is why we added DataProxy (using Flume) and developed TubeMQ.
2. We will focus on adding more data flows in the following versions, like ClickHouse, Iceberg, and HBase, plus Pulsar support in DataProxy; we will not replace it with Flink immediately.

@ghost (Author) commented Oct 13, 2021

Hi @dockerzhang @gosonzhang

But at present we will keep the Flume implementation for DataProxy and will not replace it with Flink in the upcoming versions

+1, you are right.

maybe we could add a plugin implementation like DataProxy-on-Flink in the future.

Yeah, this is exactly what I want to talk about: not replacing Flume right now.

we will focus on adding more data flows in the following versions, like ClickHouse, Iceberg, and HBase, plus Pulsar support in DataProxy.

In fact, we can start with the Sort module.

@gosonzhang (Contributor)

@leo65535

It would be better to refine these ideas by combining them with your concrete scenarios.

We only reuse part of the Flink connector and Flink format content. The reasons for not fully adopting the Flink community connectors are:

  1. Sort needs to support dynamic updates of metadata, while the community connectors are static (see the sketch after this list)
  2. Sort is currently a multi-tenant model, while the community connectors are all single-tenant
  3. Metrics and exception handling: we need to collect metrics, and some of them must trigger the alarm mechanism
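
As a rough illustration of point 1 (hypothetical code, not Sort's actual design): community connectors bake the schema into the DDL at job compile time, whereas something like the registry below could be consulted per record, so metadata can change without restarting the job:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "dynamic metadata": a hot-swappable, per-tenant schema
// lookup that a control channel can update while the job keeps running.
public class DynamicSchemaRegistry {
    private final Map<String, String> schemaByTenant = new ConcurrentHashMap<>();

    // Called from a control/metadata channel when metadata changes.
    public void update(String tenantId, String schemaJson) {
        schemaByTenant.put(tenantId, schemaJson);
    }

    // Called on the hot path for each record (keying by tenant also hints at point 2).
    public String schemaFor(String tenantId) {
        return schemaByTenant.get(tenantId);
    }
}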

Hope it can help you!

@gosonzhang (Contributor)

This issue has not been active for a long time, so I am closing it for now.
