[RFC-69] Hudi 1.X #8679
you've started calling "batch" old school 👏
Well, ingest is completely incremental now - across the industry. Once upon a time, it was unthinkable. :)
It might be worth focusing on bringing the cost of streaming ingestion/processing down; I think multi-table deltastreamer concepts or similar will be very important for wider adoption.
In my experience, stream data processing can get expensive, especially in the cloud, when you pay for the time your streaming Spark job runs on something like EMR or Glue.
Agree. Efforts like the record index should bring that down dramatically, I feel, for say random-write workloads. This is definitely something to measure, baseline, and set plans for, but I am not sure how to pull a concrete project around this yet. The multi-table delta streamer needs more love for sure. Will make it 1.1 for now though, since we need to front-load other things before that.
#6612 should help reduce costs as well.
Generally speaking, the current implementation of Hudi works as a micro-batch data lake, not a truly streaming lakehouse. Do we propose to build a truly streaming lakehouse with Hudi?
This is a good point. There was a discussion in 2021 on how we can make streaming writes more efficient, especially for Flink, by redesigning the core abstraction. We should revisit that. cc @danny0405
https://lists.apache.org/thread/fsxbjm1w3gmn818lxn79lm6s56892s40
cc @garyli1019 as he had a lot of ideas on this topic.
@yihua, IMO, the proposal in the above discussion is only about making the Flink writer work in a streaming fashion. But the streaming lakehouse idea is mainly about end-to-end streaming reads and writes, so that the data in the lake can truly be processed in a streaming manner.
Good point! I think @SteNicholas is pointing towards more general-purpose streaming capabilities such as watermarks, windows and accumulators - https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/. Please correct me if I'm wrong.
We should certainly revive that devlist thread for a detailed discussion.
+1. Materialized views are great and are supported in Trino as well. Shouldn't be that hard to implement after Trino can write to Hudi.
@vinothchandar, @kazdy, could we also support streaming views? Internally at Bilibili, we build streaming views on Hudi based on watermarks. We also support materialized views, cached in Alluxio.
@vinothchandar Hudi + Dynamic Table aims to resolve 2 use cases: real-time and incremental-style processing, where the intermediate accumulation can be checkpointed and reused. The end-to-end latency is in the minutes range, not seconds or even milliseconds. I can see the materialized view being very useful in these streaming ingestion and near-real-time analytical use cases. The MVs can serve as a pre-aggregation layer to speed up queries raised by frontier users. If we make the pre-aggregation flexible enough, we can even embed custom agg logic from the user, to supply a direct serving layer! Inspired by engines like Apache Pinot and Apache Kylin.
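For concreteness, here is a minimal PySpark sketch of that pre-aggregation idea - a streaming read from one Hudi table feeding a checkpointed aggregate that is upserted into a second Hudi table acting as the MV. Paths, table names, and columns are illustrative assumptions, not final API.

```python
# Minimal sketch: stream from an upstream Hudi table, keep a checkpointed
# running aggregate, and upsert it into a second Hudi table (the "MV").
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-mv-sketch").getOrCreate()

# Streaming read of the upstream Hudi table.
orders = spark.readStream.format("hudi").load("s3://bucket/lake/orders")

# The "intermediate accumulation": Spark checkpoints this state and reuses it
# across micro-batches, which keeps end-to-end latency in the minutes range.
agg = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

def upsert_mv(batch_df, batch_id):
    # Upsert each micro-batch of updated aggregates into the MV table.
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", "orders_by_customer")
        .option("hoodie.datasource.write.recordkey.field", "customer_id")
        .option("hoodie.datasource.write.precombine.field", "total_amount")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("s3://bucket/lake/orders_by_customer"))

(agg.writeStream.outputMode("update")
    .option("checkpointLocation", "s3://bucket/lake/_ckpt/orders_by_customer")
    .foreachBatch(upsert_mv)
    .start())
```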
In the streaming world, watermarks and event time are very important, no matter whether you query/write a single table or work across multiple tables.
For example, querying a snapshot of a table at a specified event time instead of a commit time, or joining snapshots of two tables at the same event time.
When building a streaming warehouse or streaming materialized views, do we also consider introducing the concepts of event time and watermark in Hudi?
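To make the distinction concrete: Hudi's Spark reader already supports commit-time travel via the `as.of.instant` option, while an event-time snapshot would need something new. A small sketch, where the event-time option is purely hypothetical:

```python
# Today: snapshot of the table as of a *commit* time (existing Hudi option).
df_commit = (spark.read.format("hudi")
             .option("as.of.instant", "20230601123000000")
             .load("s3://bucket/lake/trades"))

# Hypothetical: snapshot as of an *event* time, resolved via watermarks.
# The option name below does not exist today; it only illustrates the ask.
df_event = (spark.read.format("hudi")
            .option("read.as.of.event.time", "2023-06-01T12:30:00Z")
            .load("s3://bucket/lake/trades"))
```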
@beyond1920 that's a good point on using watermarks. +1 to providing the watermark to the query engine. This could work pretty well with materialized views, and some query logic is needed to get a view with a precise event time.
I'm all for this - make it a really good experience to use Hudi via SQL, so that it's more pleasant for DWH-turned-DE folks.
+1.
Agree. The direction seems to be that query-side smartness is increasingly, and rightfully, being pushed down to Hudi's layer.
I love that you mention SQL support multiple times in the RFC - to bring an experience equal to DWH and RDBMS so that SQL-heavy users can use Hudi without friction. Do we plan to make SQL support a priority in all areas (read, write, indexing, e.g., CREATE INDEX, concurrency control, etc.) and even more usable than the data source / write client APIs?
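As one illustration of SQL-first coverage, index lifecycle management could read like ordinary DDL. The statements below are a sketch in the direction of Hudi's secondary-index work, not committed syntax:

```python
# Sketch: SQL-first index management on a Hudi table (illustrative syntax).
spark.sql("CREATE INDEX idx_city ON hudi_trips (city)")   # build an index
spark.sql("SHOW INDEXES FROM hudi_trips")                 # inspect indexes
spark.sql("DROP INDEX idx_city ON hudi_trips")            # drop it again
```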
@yihua +1. All operations on Hudi could be expressed and executed via SQL, which reduces users' usage and learning costs and decouples from the engine, so we are not limited to an engine's own SQL like Flink SQL etc.
I cannot agree more - it should be super easy to start using Hudi with SQL (and Python, which is by far the most popular language I see in the Hudi Slack?) as a first-class citizen.
@kazdy, Iceberg has PyIceberg. IMO, we could support PyHudi to enable ML developers to access Hudi data.
Yes, the single-node Python use case is good. I wonder if we should just add support to popular frameworks like Polars directly instead. Thoughts? @kazdy @SteNicholas ?
@vinothchandar I think there are upsides to having, e.g., a Rust client with Arrow underneath, and maintaining bindings for Python.
Then integration with Polars, Pandas, Apache DataFusion, or whatever comes up later should be fairly easy by using a hudi-rs crate.
At the same time, we can have a Python client in the form of a PyPI package with pretty low effort, thanks to the bindings.
@SteNicholas PyIceberg, if I'm not wrong, is written fully in Python and uses PyArrow underneath. That's another option, but it feels like hudi-rs + Python bindings will create less overhead to maintain overall.
Btw, once we have Python bindings/a client, I will be happy to add Hudi support to AWS SDK for pandas; a good chunk of the Hudi Slack community seems to use Hudi on AWS, so it makes sense to do so.
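A rough sketch of what such Rust-backed Python bindings could feel like; every name here (the `hudi` package, `HudiTable`, `read_snapshot`) is hypothetical, purely to show the Arrow hand-off into Polars:

```python
# Hypothetical PyHudi API over a hudi-rs core with Arrow underneath.
import hudi        # hypothetical package built from Rust bindings
import polars as pl

table = hudi.HudiTable("s3://bucket/lake/orders")   # hypothetical constructor
batches = table.read_snapshot()                     # Arrow record batches
df = pl.from_arrow(batches)                         # zero-copy into Polars
print(df.filter(pl.col("amount") > 100).head())
```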
I am more inclined towards this. Just have two native implementations - Java and Rust - and wrap the others. I just need to make sure Rust to C++ works great as well. Did not have a stellar experience here with Go.
If there are some changes to be done to the Hudi Spark integration, then maybe it's the right time to use Spark's hidden _metadata field (which is a struct) to keep the hoodie meta fields there. If users want them, they can do "select _metadata". This feels like a breaking change, so maybe it's the right time to do it? (Btw, it's possible to hide meta columns in the new Presto integration with a config, afaik.)
https://spark.apache.org/docs/3.2.1/api/java/org/apache/spark/sql/connector/catalog/MetadataColumn.html
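For illustration - the first query reflects today's Hudi behavior, while the `_metadata` access is hypothetical for Hudi, borrowing Spark's hidden metadata column semantics:

```python
# Today: Hudi meta fields are ordinary top-level columns on every row.
df = spark.read.format("hudi").load("s3://bucket/lake/orders")
df.select("_hoodie_commit_time", "_hoodie_record_key").show()

# Proposed: tuck them into Spark's hidden _metadata struct so they only
# materialize when selected explicitly (hypothetical for Hudi today).
df.select("_metadata.hoodie_commit_time").show()
```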
Yes, even on Trino - something to consider. When we were writing Hudi originally, there were no Spark DataSet APIs even, FWIW :)
Tracking here: https://issues.apache.org/jira/browse/HUDI-6488 - feel free to grab it!
I really like this point, but what do you have in mind? What needs to change to support the relational data model better? Make the precombine field optional, or something else?
How does this relate to the point below, "Beyond structured data"? How do we marry these two together elegantly?
@kazdy I have some concrete thoughts here. Will write them up in the next round. By and large, to generalize keys, we need to stop special-casing record keys in the indexing layer and have the metadata layer build indexes for any column; some of this is already there. Another aspect is introducing key-constraint keywords similar to RDBMSes (unique key, maybe even composite keys).
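A sketch of what such constraint keywords might look like in Hudi SQL; the `UNIQUE KEY` / `PRIMARY KEY` clauses below are hypothetical syntax for illustration only:

```python
# Hypothetical RDBMS-style key constraints on a Hudi table.
spark.sql("""
  CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    ts          TIMESTAMP,
    UNIQUE KEY (order_id),                -- hypothetical constraint keyword
    PRIMARY KEY (customer_id, order_id)   -- hypothetical composite key
  ) USING hudi
""")
```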
I think we can study the experience offered by JSON/document stores more here, and see if we can borrow some approaches, e.g., Mongo. Wdyt?
@vinothchandar, the definition of a data lake is unstructured-data oriented; therefore Hudi's data model could be upgraded to support unstructured data.
@vinothchandar
I don't know how well both can be supported at the same time.
E.g., in SQL Server, when JSON data is stored in a column, they parse it and store it internally in a normalized form in internal tables; when a user issues a query to read it, it's parsed back into JSON format.
This is clever but feels clunky at the same time.
On the other hand, Mongo is super flexible but is (was?) not that great at joins, so it is not aligned with the relational model.
I myself sometimes keep JSON data as strings in Hudi tables, but that's mostly because I was scared of schema evolution and compatibility issues, and I can't be sure what I will get from the data producer. At the same time, I wanted to use some of Hudi's capabilities.
So for semi-structured data, a solution could be to introduce super flexibility with schemas in Hudi + maybe add another base format like BSON to support this flexibility more easily.
But maybe you are thinking about something far beyond this?
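The "JSON as strings" pattern from this comment is workable today with standard Spark functions; a small sketch (the table path and fields are assumptions):

```python
# Keep the raw payload as a JSON string column in a Hudi table and
# project fields lazily at read time with Spark's get_json_object.
from pyspark.sql import functions as F

events = spark.read.format("hudi").load("s3://bucket/lake/events")
events.select(
    "event_id",
    F.get_json_object("payload", "$.user.id").alias("user_id"),
    F.get_json_object("payload", "$.device.os").alias("os"),
).show()
```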
@kazdy @SteNicholas I owe you both a response here. Researching more. Will circle back here. Thanks for the great points raised.
OK, my conclusions here @kazdy @SteNicholas after researching many systems that are used for ML/AI use cases involving unstructured data:
I think both of these can be supported as long as the new storage format can
I have some concrete ideas to try once we have the storage format finalized more and some abstractions are in place. LMK if you want to flesh this out more and do some research ahead of it. cc @bhasudha who's looking into some of these.
Couldn’t agree more.
One can feel that there is a lack of consensus in the community in this regard.
What do you think about the Hudi metaserver?
@BruceKellan Actually, this got heavy pushback before - valid concerns about running yet another metastore. We could take a flexible approach to bridge the lack of consensus. I feel we can abstract the metadata fetching out, such that metadata can be read from the storage/metadata table or from the metaserver. I think the metaserver does something similar today; it's not a required component - to query the timeline, instead of reading .hoodie, you'd ask the metaserver.
So I always thought the metadata server was kind of a requirement for the implementation of additional clients, such as the Rust client with Python bindings. But do I understand right that you are planning on making the metaserver optional and still supporting all operations with the clients going directly to the metadata log files?
Are you planning to change the file format of the metadata logs, such that it's easier to read them from outside of Java, or are they going to stay the same?
@moritzmeister yes, the idea is the latter. The metaserver will be optional and will speed up query planning and bring other benefits like UI/RBAC and so on. Planning to track format changes here: https://issues.apache.org/jira/browse/HUDI-6242 . The main thing to figure out is how to move the metadata indexed in HFile into another format. Looking into the Lance format, which already has readers in Rust/C++ (but not Java; we'd need to add that). One way or the other.
+1 on this.
In our internal Hudi practice at our company (Kuaishou), we have encountered the need for concurrent processing of a Hudi table - for example, multiple streaming jobs processing a Hudi table simultaneously to achieve streaming joins and streaming unions. Having a long-running service which could act as a coordinator between multiple jobs would make everything easier.
Yes. If we can embrace this model, management becomes much simpler.
Maybe it will get less pushback if it's easy to deploy in the cloud (e.g., from the marketplace on AWS).
In a smaller company, having another thing to deploy and maintain can be a no-go, especially when the rest of the stack can be (and probably is) serverless.
Having a choice here would be best from the user's perspective.
Yes - k8s support on EKS with a Helm chart? All major cloud providers support some level of managed k8s. We can start from there and let the community add more things to it. It'll be hard for us to build any native integrations anyway. We can be OSS-focused.
Excited about this part. For "embracing a hybrid architecture", we have done some work around this. I think the key is an abstract, unified, well-designed API to serve the queries (like the methods in FSUtils) and operations (like managing instants) on metadata, and to inject it into the Hudi client (aka HoodieTableMetaClient). After that, we can easily implement different timeline services by parsing the .hoodie directory or by connecting to the backing Hudi metastore service. @vinothchandar
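One way to picture the proposed abstraction (Python pseudocode for brevity - the real code would live in Java, and all names here are illustrative): the meta client programs against a timeline-source interface, with one implementation parsing `.hoodie` and another talking to the metaserver.

```python
# Illustrative shape of the unified metadata API: the client only sees
# TimelineSource and never cares where instants actually come from.
from abc import ABC, abstractmethod

class TimelineSource(ABC):
    @abstractmethod
    def load_instants(self, table_path: str) -> list:
        ...

class StorageTimelineSource(TimelineSource):
    """Parses the .hoodie directory directly (today's default path)."""
    def load_instants(self, table_path):
        import os
        timeline_dir = os.path.join(table_path, ".hoodie")
        return sorted(f for f in os.listdir(timeline_dir)
                      if f.endswith(".commit"))

class MetaserverTimelineSource(TimelineSource):
    """Asks a running metaserver instead of listing storage (sketch only)."""
    def __init__(self, endpoint):
        self.endpoint = endpoint
    def load_instants(self, table_path):
        raise NotImplementedError("sketch: fetch instants via an RPC call")
```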
@YannByron agree. Anything in the works, in terms of upstreaming? :) I have now pushed the metaserver out to 1.1.0 since it needs broad alignment across the community. So we have some time, but I'd love to just get some alignment here and move ahead.
@vinothchandar yep, we want to work on it and already have some research. We will propose an RFC (maybe an umbrella one) in the short run; then let's discuss deeper. cc @Zouxxyy
+1
Goes without saying that we need more research here, but I am wondering how this intersects with the "generalized" data model.
Also, I think UDFs and materialized views (MV) will be key for ML ecosystem.
Any plan of supporting MVs?
I want to understand better what supporting MVs would mean. See my comment on Flink MVs using Dynamic Table above - shall we consolidate there?
Pretty excited to see how Hudi can help AI/ML use cases. I believe we're just scratching the surface here, and going to provide efficient mutation solutions tailored for AI/ML on top of a generic framework.
This part is amazing as it's mostly driven by the community, crowdsourcing ideas from various use cases in production.
I see that the catalog manager is blue. The metaserver is actually positioned towards Hudi's own catalog service, so should it be yellow?
I feel the catalog is a little different, in that we need access controls and other permissions added. I was thinking about just how to scale metadata for planning. Maybe the metaserver authors can help clarify? @minihippo ?
Agree. Maybe at least the index system and stronger transaction management (including cross-table transactions) should be counted in.
Infinite timeline and data history will be gold 🏅
Time-travel writes look interesting to me.
100% with you!
One might think that serializable would be the norm in production databases, but it's far from reality even for RDBMSes. There is an awesome SIGMOD talk by Dr. Andy Pavlo which discusses this point. Sharing the screenshot from the slides: https://www.cs.cmu.edu/~pavlo/slides/pavlo-keynote-sigmod2017.pdf
@codope, could we build an HTAP database like TiDB etc.?
It will be hard to support any real, high-perf transactional workloads on the lake, IMO. Our commit times will still be, say, 20-30 seconds. Those are long-running transactions in the transactional world (speaking from my Oracle work experience) :)
Vector databases are quite a hot topic recently, and I took a look to figure out what they are. Looks like it's a database with a special index that provides similarity search. With pluggable indexes and clustering, Hudi is in a sweet spot in this race!
@garyli1019 came across this: https://www.linkedin.com/pulse/text-based-search-from-elastic-vector-kaushik-muniandi - thoughts?
Very nice article - Elasticsearch just can't handle that much data. It would be really helpful if this user could elaborate on the details of their use cases. A general-purpose LLM + a database that stores the customized data = a customized AI assistant. Search could be the next battleground for lakehouse tech.
@garyli1019 sg. Interested in trying https://github.com/lancedb/lance as a base file format and adding some capabilities through one of the existing engines? I can see Elastic, PG, etc. all adding specialized indexes. I think vector search itself will be commoditized soon. It's just another query type.
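For anyone curious, a small sketch of what Lance's similarity search looks like from Python (API as of mid-2023; treat the exact signatures as approximate):

```python
# Vector similarity search over a Lance dataset with an "embedding" column.
import lance

ds = lance.dataset("s3://bucket/embeddings.lance")  # previously written dataset
query = [0.1] * 768                                 # an embedding vector
hits = ds.to_table(nearest={"column": "embedding", "q": query, "k": 10})
print(hits.to_pandas())
```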
Interesting, lots of new stuff coming out recently, let me take a look 👀
When improving Hudi's multi-modal index for query performance, should we also think about the user experience of creating, managing, and "tuning" the indexes? Say, an advanced user may create a specific index such as a B-tree for query speedup; PostgreSQL has this functionality: https://www.postgresql.org/docs/current/indexes.html. Also, query engines can pick and choose what they need in the integration. Wdyt?
Yes, a DB-advisor-type thing would be awesome to have; it can recommend indexes and such. We can do it in 1.1 though.
Are we also thinking about some AI component in the query planning phase which would automatically infer which type of index to use for which columns, so as to get the best possible read performance? That would be super powerful.
I understand this could be very challenging, but it would be really nice if we could start some initiatives or abstractions towards it.
IDK if we need AI, but yeah, index selection is a hard problem we'd solve.
Do we have a plan to introduce a query/relation/fine-grained SQL-semantics cache layer, so that these caches can be shared by common sub-clauses of different queries, and can even serve as pre-aggregation for online materialized views?
Maybe cache-aware query planning and execution with a distributed cache is something relevant we can take on.
Pre-existing data on query patterns and performance is needed to train a model for optimizing query planning, if that is going to be anywhere near helpful.
I know that @XuQianJin-Stars is working on a UI component as part of platformization; an RFC is pending.
Great. A Hudi UI would be amazing.
@xushiyan, does the Hudi UI only work for admins? Could the Hudi UI display the metadata and data volume of a Hudi table and related metrics?
The UI would be a killer feature. @XuQianJin-Stars is pushing this forward, and they have put it into production at Tencent.
@xushiyan do we have an RFC or epic for this already?
@vinothchandar just created this one https://issues.apache.org/jira/browse/HUDI-6255
@XuQianJin-Stars can you add more details and link any issues to it pls?
I have moved it to 1.1.0 in scoping, so that it's easier with the newer APIs available. But @XuQianJin-Stars, please feel free to get started sooner if you think it's a good idea.
Could integration with other databases like Doris, StarRocks, etc. be included? Meanwhile, is a C++ client which reads and writes Hudi data files proposed, to serve AI scenarios?
+1 on the first. If we can solidify the format changes, then a Java API and a Rust/C++ API for the metadata table, timeline/snapshots, and the FileGroup reader/writer would be cool. It boils down to finding enough contributors to drive it.
Do you have some concrete ideas for AI scenarios?
@vinothchandar, I would like to contribute the Rust/C++ client and drive contributors in China to contribute. Meanwhile, for AI scenarios there is the idea of distributed and incremental reading of Hudi data via a C++/Python client, like the Kafka Python client which does distributed reading of messages from a Kafka topic. This idea comes from AI cases at Bilibili.
@vinothchandar, another idea is to integrate with the OpenAI API to answer user questions about Hudi and provide the usage or SQL for Hudi features.
Sounds amazing - I'm overloaded with all kinds of issues and feature enquiries.
@yihua, most AI scenarios are focused on feature engineering, model training, and prediction. Feature data could be stored in Hudi in JSON format, and a job may only use a few features within the JSON, which need materialized columns. Meanwhile, models are unstructured data that Hudi doesn't support at present. From the user's perspective, an ML engineer uses a C++/Python distributed client to access Hudi data and consume it incrementally. Therefore, the features mainly required for AI scenarios are materialized columns, unstructured data support, and a C++/Python distributed client.
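The incremental-consumption pattern described here already exists in Hudi's Spark datasource (these read options are real); a Rust/C++/Python client would expose the same semantics outside the JVM. For reference:

```python
# Incremental consumption of a Hudi feature table: fetch only records
# committed after a given instant (Hudi's Spark incremental query).
incr = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20230601000000000")
        .load("s3://bucket/lake/features"))
incr.show()
```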
@SteNicholas @danny0405 @yihua This is a good read on the topic of MLOps, ML workloads, and their intersection with regular data engineering: https://www.cpard.xyz/posts/mlops_is_mostly_data_engineering/
@SteNicholas I am also interested in helping with Rust/C++
@SteNicholas @jonvex filed https://issues.apache.org/jira/browse/HUDI-6486 under the Table Format API for now. We need to first finalize the 1.0 storage format draft and get busy on this. Please let me know how you want to engage.
BTW, could we consider some cloud-native functions, like separation of hot and cold data, integration with a K8s operator, etc.? IMO, the future trend for databases is closer to cloud native.
@SteNicholas Interesting. I was thinking that we'd make operators for components like the metaserver or cache server (depending on how we build it) or the table management server. Could you expand on the hot/cold data separation idea?
@vinothchandar, the metaserver, cache server, or table management server could serve as Kubernetes services. The hot/cold data separation idea refers to CloudJump: Optimizing Cloud Databases for Cloud Storages.
Thanks @SteNicholas! My thoughts are also to start with something that can run these on k8s. I will read the paper and reflect back.
Tracking here: https://issues.apache.org/jira/browse/HUDI-6489 . Happy to help firm up an RFC/design; we can evolve it as the 1.0 format evolves.
cc @SteNicholas
Do you want to discuss how the 0.x releases (e.g., 0.14) will work alongside 1.x development, to avoid conflicts and rebasing difficulties?
Good point. I am thinking we merge all the core code abstraction changes - HoodieSchema, HoodieData, and such - into the 0.x line to make rebasing easier, then fork off the 1.0 feature branch.