[Feature] Introduce flink to dataproxy module #1599
Comments
Thank you @leo65535 for your proposal, great ideas. The scheme of introducing Flink described here can leverage the Flink connector ecosystem to handle data collection. However, it also requires bringing in a complete set of deployment and management mechanisms for Flink. For InLong, where data access is the core concern, that makes service usage, operations, and deployment noticeably more complicated. DataProxy currently has two functions:
DataProxy was designed around practical problems encountered in online big data scenarios. Its design is relatively lightweight and convenient to operate, deploy, and manage. It has also stood the test of our current internal big data workloads, so we consider the design and implementation fairly stable. A Flink-based solution would need to be further evaluated and adapted on both of these points. Therefore, InLong should not adopt the Flink design described above in the short term; that said, Flink's connector model is indeed a big advantage of it, and we will keep watching how it develops.
Thanks @baomingyu, I get your points. From the history of data transmission development, we used
About DataProxy's two functions:
Flink can meet this requirement: we can use a socket as the Flink source and package an SDK for the user client.
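To make the "socket as the source" idea concrete, here is a minimal stdlib-only sketch (not InLong or Flink code; the class and method names are hypothetical): a server socket plays the role of the source, and a thin client thread stands in for the packaged user SDK sending newline-delimited events.

```java
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: a TCP socket as the ingest source, with a thin
// client standing in for the packaged SDK. A real deployment would use
// Flink's socket source instead of reading the stream by hand.
public class SocketSourceSketch {

    // Reads newline-delimited events from one client connection, the way
    // a socket source would hand records downstream, until `expected`
    // events arrive or the client disconnects.
    static List<String> collectEvents(ServerSocket server, int expected) throws IOException {
        List<String> events = new ArrayList<>();
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()))) {
            String line;
            while (events.size() < expected && (line = in.readLine()) != null) {
                events.add(line);
            }
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0); // bind any free port
        int port = server.getLocalPort();

        // "SDK" side: connect and report three events.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try (Socket s = new Socket("127.0.0.1", port);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                out.println("event-1");
                out.println("event-2");
                out.println("event-3");
            }
            return null;
        });

        List<String> got = collectEvents(server, 3);
        pool.shutdown();
        server.close();
        System.out.println(got); // prints [event-1, event-2, event-3]
    }
}
```

With Flink in place, the server-side half collapses to a one-liner such as `env.socketTextStream(host, port)` in the DataStream API, which is what makes the socket-source approach attractive.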
This is not a big deal: in a big data scenario, as long as our system can scale out the MQ, it will not be a problem. In fact, use
@leo65535, I also believe Flink can meet DataProxy's current functional requirements: act as an access proxy, accept data over different protocols, package the data, report it to different target sets, and provide a persistent cache for abnormal conditions. But for the time being we will not consider using a heavy component like Flink in this module.

For an access-side module like this, the capabilities a component provides and its community popularity are not the focus of our evaluation; if a required function is missing, we simply implement it ourselves. We care more about stability, performance, and how easy the module is to manage and control at very large scale. We want it to be like a water pipe, stable and easy to use: if the flow is not enough, add another one; if a pipe breaks, just replace it.

For complex processing logic we do use Flink, for example in our Sort ETL module, which needs fast iteration and connects directly to changing business needs. Even there it is not used purely as-is: after meeting our business needs, we contribute our optimizations back to the community.

BTW, after using Flume for so many years, we have also found places where it does not meet our needs: the performance of its persistent store is relatively poor, the Source, Channel, and Sink must be wired up through configuration files, the community is not active enough, and so on. We have sorted out its feature set and are considering starting a rework of it at some point. If you are interested in this module, we can discuss it together.
@leo65535 @gosonzhang Thank you for the great thinking behind InLong. Let me try to look at this question from another angle.
+1, you are right.
Yeah, this is what I want to talk about: not replacing Flume right now.
In fact, we can start with
@leo65535 It would be better to refine these ideas by combining them with your concrete scenarios. We only reuse part of the Flink connector and Flink format content; the reasons for not fully adopting the Flink community connectors are:
Hope it can help you!
This issue has not seen activity for a long time, so I am closing it for now.
Description
Apache InLong contains four modules: Ingest, Converge, Cache, and Consume. After looking into the project, we
can find a similar structure in several of them, which we can call "source - channel - sink" (SCS for short).
Several disadvantages:
In the Ingest module, we may need to implement many sources ourselves, including streaming and batch ingest, such as CDC, file, socket...
In the DataProxy module, as we know, development of the Flume project is getting slower.
We can see that these are hard for us to maintain: we need to learn the Flume API and implement many connectors ourselves. Apache Flink already provides rich connectors, supports SQL grammar and both batch and streaming source connectors, and
contains many other excellent features we can benefit from. So we can merge some functions of the Ingest and Converge modules and introduce Flink into the DataProxy module.
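As a purely illustrative example of the SQL support mentioned above (this is not InLong code; the table names are made up, and the built-in `datagen` and `print` connectors merely stand in for real ingest and MQ connectors), a source-to-sink pipeline can be declared in Flink SQL without writing connector code:

```sql
-- Illustrative only: 'datagen' stands in for an ingest source,
-- 'print' stands in for the downstream sink.
CREATE TABLE ingest_events (
  event_id BIGINT,
  payload  STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10'
);

CREATE TABLE proxy_sink (
  event_id BIGINT,
  payload  STRING
) WITH (
  'connector' = 'print'
);

INSERT INTO proxy_sink
SELECT event_id, payload FROM ingest_events;
```

Swapping a source or sink then becomes a change to the `WITH` clause rather than a new Flume-style connector implementation.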
Architecture
Here is the new architecture diagram compared to the original.
Origin (architecture diagram)
New (architecture diagram)