[RFC-69] Hudi 1.X #8679

Merged: 4 commits merged on Jul 11, 2023
Conversation

vinothchandar
Member

@vinothchandar vinothchandar commented May 10, 2023

Change Logs

  • Summarized current state and efforts
  • Explain the different components in Hudi, in relation to database system design
  • Sketch goals for Hudi 1.0 to build alignment

Impact

N/A

Risk level (write none, low, medium or high below)

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

 - Summarized current state and efforts
 - Explain the different components in Hudi, in relation to database system design
 - Sketch goals for Hudi 1.0 to build alignment
Rome was not built in a day, and neither can the Hudi 1.x vision be realized in one release. This section outlines the goals for the first 1.0 release and the must-have changes that should be front-loaded.
This RFC solicits more feedback and contributions from the community, to expand the scope or deliver more value to users in the 1.0 release.

In short, we propose that Hudi 1.0 try to achieve the following.
Member

@SteNicholas SteNicholas May 10, 2023

Could the integration with other databases like Doris, StarRocks, etc. be included? Meanwhile, is a C++ client that reads and writes Hudi data files being proposed for the AI scenario?

Member Author

+1 on the first. If we can solidify the format changes, then a Java API and a Rust/C++ API for the metadata table, timeline/snapshots, and file group reader/writer would be cool. It boils down to finding enough contributors to drive it.

Member Author

Do you have some concrete ideas for AI scenarios?

Member

@vinothchandar, I would like to contribute the Rust/C++ client and drive contributors in China to contribute. Meanwhile, the AI scenarios involve distributed and incremental reading of Hudi data via a C++/Python client, like the Kafka Python client that reads messages of a Kafka topic in a distributed way (see the sketch below). This idea comes from AI cases at Bilibili.
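To make the ask concrete, here is a purely hypothetical sketch of what a Kafka-consumer-style incremental reader could look like from Python. The `pyhudi` package, class and method names below do not exist today; they only illustrate the usage pattern being requested.

```python
# Hypothetical sketch: a Python binding over a native (Rust/C++) Hudi core that
# consumes a table incrementally, the way a Kafka consumer polls a topic.
import time

from pyhudi import HudiIncrementalReader  # hypothetical package and class


def process(batch):
    # placeholder for a feature-engineering or training step
    print(f"got {batch.num_rows} rows")


reader = HudiIncrementalReader(
    table_path="s3://bucket/warehouse/features",  # illustrative path
    start_instant="000",        # consume from the beginning of the timeline
    num_splits=8,               # read in parallel, similar to consumer-group partitions
)

while True:
    batch = reader.poll(max_records=10_000, timeout_s=10)  # e.g. an Arrow RecordBatch
    if batch is None:
        time.sleep(5)           # no new commits yet
        continue
    process(batch)
    reader.commit()             # persist the last consumed instant, like Kafka offsets
```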

Member

@SteNicholas SteNicholas May 11, 2023

@vinothchandar, another idea is to integrate with the OpenAI API to answer users' questions about Hudi and to provide usage guidance or SQL for Hudi features.

Contributor

@danny0405 danny0405 May 11, 2023

Sounds amazing; I'm overburdened with all kinds of issues and feature inquiries.

Member

@yihua, most AI scenarios focus on feature engineering, model training and prediction. Feature data could be stored in Hudi in JSON format, and often only a few features in the JSON are used, which calls for materialized columns. Meanwhile, models are unstructured data that Hudi doesn't support at present. From the user's perspective, an ML engineer uses the C++/Python distributed client to access Hudi data and consume it incrementally. Therefore, materialized columns, unstructured data support and a C++/Python distributed client are the main requirements for AI scenarios.

Member Author

@SteNicholas @danny0405 @yihua This is a good read on MLOps and ML workloads and their intersection with regular data engineering: https://www.cpard.xyz/posts/mlops_is_mostly_data_engineering/

Contributor

@SteNicholas I am also interested in helping with Rust/C++

Member Author

@SteNicholas @jonvex filed https://issues.apache.org/jira/browse/HUDI-6486 under this Table Format API for now. We need to first finalize the 1.0 storage format draft and get busy on this. Please let me know how you want to engage.

@vinothchandar changed the title from [RFC-69] Hudi 1.X to [DOCS] [RFC-69] Hudi 1.X on May 10, 2023
Member

@xushiyan xushiyan left a comment

Great summary of the past, present, and future!


![](./hudi-dblayers.png)

_Reference diagram highlighting existing (green) and new (yellow) Hudi components, along with external components (blue)._
Member

I see that the catalog manager is blue. The metaserver is actually positioned towards Hudi's own catalog service, so should it be yellow?

Member Author

I feel the catalog is a little different, since we need access controls and other permissions added. I was thinking just about how to scale metadata for planning. Maybe the metaserver authors can help clarify? @minihippo ?

Contributor

Agree. Maybe at least the index system and stronger transaction management (including cross-table transactions) should be counted in.


Shared components include replication, loading, and various utilities, complete with a catalog or metadata server. Most databases hide the underlying format/storage complexities, providing users
with many data management tools. Hudi is no exception. Hudi has battle-hardened bulk and continuous data loading utilities (deltastreamer, flinkstreamer tools, along with Kafka Connect Sink),
a comprehensive set of table services (cleaning, archival, compaction, clustering, indexing, ..), admin CLI and much much more. The community has been working on new server components
Member

I know that @XuQianJin-Stars is working on a UI component as part of platformization; an RFC is pending.

Member Author

Great. A Hudi UI would be amazing.

Member

@SteNicholas SteNicholas May 11, 2023

@xushiyan, does the Hudi UI only work for admins? Could the Hudi UI display the metadata and data volume of a Hudi table and related metrics?

Contributor

The UI would be a killer feature. @XuQianJin-Stars is pushing this forward, and they have put it into production at Tencent.

Member Author

@xushiyan do we have an RFC or Epic for this already?

Member

@vinothchandar just created this one https://issues.apache.org/jira/browse/HUDI-6255
@XuQianJin-Stars can you add more details and link any issues to it pls?

Member Author

I have moved it to 1.1.0 in scoping, so that it's easier with newer APIs available. But @XuQianJin-Stars, please feel free to get started sooner if you think it's a good idea.

## **Future Opportunities**

We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes"
to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up.
Contributor

I'm all for this: make it a really good experience to use Hudi via SQL, so that it's more pleasant for DWH-turned-DE folks.

Member Author

+1 .

Comment on lines +58 to +60
* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized,
relational data model for Hudi tables.
Contributor

@kazdy kazdy May 10, 2023

I really like this point, but what do you have in mind? What needs to be changed to support the relational data model better? Make the precombine field optional, or something else?

How does this relate to the point below, "Beyond structured Data"? How do we marry these two together elegantly?

Member Author

@kazdy I have some concrete thoughts here and will write them up in the next round. By and large, to generalize keys we need to stop special-casing record keys in the indexing layer and have the metadata layer build indexes for any column; some of this is already there. Another aspect is introducing key-constraint keywords similar to RDBMSes (unique keys, maybe even composite keys). See the SQL sketch below.
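A rough sketch of what that "generalized keys" surface could look like from Spark SQL. The statements are illustrative only; the RFC does not fix this syntax, and the constraint/index keywords in particular are a sketch rather than a committed design.

```python
# Illustrative only: possible SQL surface for indexes on arbitrary columns and
# RDBMS-style key declarations. Assumes a Spark session with the Hudi bundle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-keys-sketch").getOrCreate()

# Declare keys up front, like an RDBMS table definition.
spark.sql("""
  CREATE TABLE hudi_trips (
    trip_id  STRING,
    rider_id STRING,
    city     STRING,
    ts       TIMESTAMP
  ) USING hudi
  TBLPROPERTIES (primaryKey = 'trip_id', preCombineField = 'ts')
""")

# Hypothetical: build a secondary index on a non-key column in the metadata layer.
spark.sql("CREATE INDEX idx_city ON hudi_trips USING column_stats (city)")
```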

Member Author

How does this relate to the below point "Beyond structured Data"? How to marry these two together elegantly?

I think we can study the experience offered by JSON/document stores more here and see if we can borrow some approaches, e.g. Mongo. WDYT?

Member

@vinothchandar, the definition of a data lake is oriented towards unstructured data, so Hudi's data model could be upgraded to support unstructured data as well.

Contributor

How does this relate to the below point "Beyond structured Data"? How to marry these two together elegantly?

I think we can study the experience offered by json/document stores more here. and see if we can borrow some approaches.. e.g Mongo. wdyt?

@vinothchandar
I don't know how well both can be supported at the same time.
E.g. in SQL Server, when JSON data is stored in a column, they parse it and store it internally in a normalized form in internal tables; when a user issues a query to read it, it's parsed back into JSON.
This is clever but feels clunky at the same time.

On the other hand, Mongo is super flexible but is (was?) not that great at joins, so that is not aligned with the relational model.

I myself sometimes keep JSON data as strings in Hudi tables, but that's mostly because I was scared of schema evolution and compatibility issues and I cannot be sure what I will get from the data producer. At the same time, I wanted to use some of Hudi's capabilities.

So for semi-structured data, a solution could be to introduce real schema flexibility in Hudi, plus maybe add another base format like BSON to support this flexibility more easily.
But maybe you are thinking about something far beyond this?
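For reference, this is roughly the pattern described above with today's Hudi: keep the raw JSON as a string column and extract fields at query time. It assumes a Spark session with the hudi-spark bundle on the classpath; paths and option values are illustrative.

```python
# Minimal sketch: JSON kept as a string column in a Hudi table, schema-on-read at query time.
from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

spark = SparkSession.builder.appName("hudi-json-sketch").getOrCreate()

events = spark.createDataFrame(
    [("e1", '{"user": {"id": 42, "country": "PL"}, "score": 0.87}', "2023-05-10 00:00:00")],
    ["event_id", "payload_json", "ts"],
)

(events.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save("/tmp/hudi/events"))

# Pull individual fields out of the JSON blob; a "materialized column" would
# precompute exactly this kind of projection.
(spark.read.format("hudi").load("/tmp/hudi/events")
    .select(
        "event_id",
        get_json_object("payload_json", "$.user.id").alias("user_id"),
        get_json_object("payload_json", "$.score").alias("score"),
    )
    .show())
```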

Member Author

@kazdy @SteNicholas I owe you both a response here. Researching more. Will circle back here. Thanks for the great points raised.

Member Author

OK, my conclusions here @kazdy @SteNicholas, after researching many systems used for ML/AI use-cases that involve unstructured data:

  • Most of them store images, videos, etc. as-is and index different kinds of metadata for the different types, to support queries.
  • The second approach is that some kind of structure is added to the data, e.g. JSON/XML, and the querying is more "forgiving" (think schema-on-read).

I think both of these can be supported as long as the new storage format can:

  • Support multiple file formats within a single file group
  • Provide a flexible indexing system
  • Offer clean APIs to access data at different levels (raw files, higher-level libraries built on top, SQL)

Member Author

I have some concrete ideas to try once we have the storage format finalized more and some abstractions are in place. LMK if you want to flesh this out more and do some research ahead of it. cc @bhasudha who's looking into some of these.

Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes.
Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in
recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout.
Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines.
Contributor

you started to call "batch" as old school 👏

Member Author

Well, ingest is completely incremental now - across industry. Once upon a time, it was unthinkable. :)

Contributor

It might be worth focusing on bringing the cost of streaming ingestion/processing down; I think multi-table deltastreamer concepts or similar will be very important for wider adoption.
In my experience, stream data processing can get expensive, especially in the cloud when you pay for the time your streaming Spark job runs on something like EMR or Glue.

Member Author

Agree. Efforts like the record index should bring that down dramatically, I feel, for say random-write workloads. This is definitely something to measure, baseline and set plans for, but I'm not sure how to pull a concrete project around this yet. The multi-table deltastreamer needs more love for sure; I'll make it 1.1 for now though, since we need to front-load other things first.

Member Author

#6612 should help reduce costs as well.
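For readers less familiar with the incremental chaining described in the excerpt above, here is a minimal sketch of a Hudi incremental query feeding a derived table. It assumes a Spark session with the hudi-spark bundle on the classpath; paths and the begin instant are illustrative.

```python
# Read only the records committed after a given instant, then upsert them downstream.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

delta = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230510093000000")
    .load("/tmp/hudi/trips"))

(delta.write.format("hudi")
    .option("hoodie.table.name", "trips_derived")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save("/tmp/hudi/trips_derived"))
```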

The buffer manager component manages dirtied blocks of storage and also caches data for faster query responses. In Hudi's context, we want to bring to life our now long-overdue
columnar [caching service](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#lake-cache) that can sit transparently between lake storage and query engines while understanding
transaction boundaries and record mutations. The tradeoffs in designing systems that balance read, update and memory costs are detailed in the [RUM conjecture](https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf).
Our basic idea here is to optimize for read (faster queries served out of cache) and update (amortizing MoR merge costs by continuously compacting in-memory) costs while adding the cost of cache/memory to the system.
Contributor

@nsivabalan nsivabalan May 10, 2023

Are we also thinking about some AI component in the query planning phase which will automatically infer which type of index to use for which columns, so as to get the best possible read performance? That would be super powerful.
I understand this could be very challenging, but it would be really nice if we can start some initiatives or abstractions towards it.

Member Author

IDK if we need AI, but yeah, index selection is a hard problem we'd want to solve.

Contributor

Do we have a plan to introduce a query/relation/fine-grained SQL-semantics cache layer, so that these caches can be shared by common sub-clauses of different queries, and can even be served as pre-aggregations for online materialized views?

Contributor

Maybe cache-aware query planning and execution with a distributed cache is something relevant we can take on.

Contributor

Pre-existing data on query patterns and performance is needed for training a model to optimize query planning if that is going to be anywhere near helpful.

* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver
(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
Contributor

Couldn’t agree more.
It can be felt that there is a lack of consensus in the community in this regard.
What do you think about Hudi-metaserver?

Member Author

@BruceKellan Actually, this got heavy pushback before: valid concerns about running yet another metastore. We could take a flexible approach to bridge the lack of consensus. I feel we can abstract the metadata fetching out, such that metadata can be read from storage/the metadata table, or from the metaserver. I think the metaserver does something similar today; it's not a required component for querying the timeline, i.e. instead of .hoodie, you'd ask the metaserver. A sketch of that abstraction follows below.
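A hypothetical sketch of that abstraction: the same timeline-read interface backed either by `.hoodie` files on lake storage (serverless) or by an optional metaserver. Class and method names are illustrative, not an existing Hudi API.

```python
from abc import ABC, abstractmethod
from typing import List


class TimelineReader(ABC):
    @abstractmethod
    def completed_instants(self) -> List[str]:
        """Return completed instant times, oldest first."""


class StorageTimelineReader(TimelineReader):
    """Serverless mode: list instant files under <table>/.hoodie."""

    def __init__(self, fs, table_path: str):
        self.fs, self.table_path = fs, table_path

    def completed_instants(self) -> List[str]:
        names = self.fs.listdir(f"{self.table_path}/.hoodie")
        return sorted(n.split(".")[0] for n in names if n.endswith(".commit"))


class MetaserverTimelineReader(TimelineReader):
    """Optional server mode: ask a metaserver instead of scanning storage."""

    def __init__(self, client, table_id: str):
        self.client, self.table_id = client, table_id

    def completed_instants(self) -> List[str]:
        return self.client.list_instants(self.table_id, state="COMPLETED")
```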


So I always thought the metadata server was kind of a requirement for the implementation of additional clients such as the Rust + Python bindings client. But do I understand correctly that you are planning on making the metaserver optional and still supporting all operations with clients going directly to the metadata log files?
Are you planning to change the file format of the metadata logs, so that it's easier to read them from outside of Java, or are they going to stay the same?

Member Author

@moritzmeister yes, the idea is the latter. The metaserver will be optional and will speed up query planning and bring other benefits like UI/RBAC and so on. Planning to track the format changes here: https://issues.apache.org/jira/browse/HUDI-6242 . The main thing to figure out is how to move the metadata currently indexed in HFile to another format. Looking into the Lance format, which already has readers in Rust/C++ (but not Java; we'd need to add that). One way or the other.


## **Future Opportunities**

We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes"
Member

Generally speaking, the current implementation of Hudi works as a micro-batch data lake, not a truly streaming lakehouse. Do we propose to build a truly streaming lakehouse with Hudi?

Contributor

This is a good point. There was a discussion in 2021 on how we can make streaming writes more efficient, especially for Flink, by redesigning the core abstraction. We should revisit that. cc @danny0405

https://lists.apache.org/thread/fsxbjm1w3gmn818lxn79lm6s56892s40

Contributor

cc @garyli1019 as he had a lot of ideas on this topic.

Member

@yihua, IMO the proposal in the above discussion is only about making the Flink writer work in a streaming
fashion. But the streaming lakehouse idea is mainly about end-to-end streaming reads and writes, to make data processing on the lake truly streaming.

Member

Good point! I think @SteNicholas is pointing towards more general-purpose streaming capabilities such as watermarks, windows and accumulators - https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/. Please correct me if i'm wrong.
We should certainly revive that devlist thread for a detailed discussion.

Contributor

+1. Materialized views are great and are supported in Trino as well. It shouldn't be that hard to implement once Trino can write to Hudi.

Member

@vinothchandar, @kazdy, could we also support streaming views? Inside Bilibili, we build streaming views on Hudi based on watermarks. Meanwhile, we also support materialized views, which are cached in Alluxio.

Contributor

@danny0405 danny0405 May 17, 2023

Could we not accomplish this with some enhancements in Hudi + Dynamic Tables? cc @danny0405, thoughts?

@vinothchandar Hudi + Dynamic Tables aims to resolve 2 use cases:

  1. the ability to query the tables of the intermediate layers, like the raw and silver tables; these tables can be updated in a continuous streaming style and are also queryable.
  2. to unlock streaming computation like aggregation in a REAL incremental style, where the intermediate accumulation can be checkpointed and reused.

The end-to-end latency is in the minutes range, not seconds or milliseconds. I can see that materialized views can be very useful in these streaming-ingestion and near-real-time analytical use cases. The MVs can serve as a pre-aggregation layer to speed up queries raised by frontier users. If we make the pre-aggregation flexible enough, we can even embed custom aggregation logic from the user, to supply a direct serving layer! Inspired by engines like Apache Pinot and Apache Kylin.
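As a purely conceptual sketch of the pre-aggregation idea (not a committed design, and glossing over state and watermark handling), a streaming aggregate maintained by continuously upserting into a Hudi table could look roughly like this:

```python
# Conceptual sketch: keep a per-city aggregate table fresh by upserting changed
# aggregates into Hudi from a Spark structured streaming job. Paths/options illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hudi-preagg-sketch").getOrCreate()

orders = spark.readStream.format("hudi").load("/tmp/hudi/orders")  # streaming source

per_city = (orders.groupBy("city")
            .agg(F.sum("amount").alias("total_amount"),
                 F.count("*").alias("order_count")))

(per_city.writeStream.format("hudi")
    .outputMode("update")                                   # emit only changed groups
    .option("hoodie.table.name", "orders_by_city_mv")
    .option("hoodie.datasource.write.recordkey.field", "city")
    .option("hoodie.datasource.write.precombine.field", "order_count")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("checkpointLocation", "/tmp/checkpoints/orders_by_city_mv")
    .start("/tmp/hudi/orders_by_city_mv"))
```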

Contributor

In the streaming world, watermarks and event time are very important, no matter whether querying/writing a single table or working across multiple tables.
For example, querying a snapshot of a table at a specified event time instead of a commit time, or joining snapshots of two tables at the same event time.
When building a streaming warehouse or streaming materialized views, do we also consider introducing the concepts of event time and watermark in Hudi?

Member

@beyond1920 that's a good point about using watermarks. +1 to providing the watermark to the query engine. This could work pretty well with materialized views, and some query logic is needed to get a view with a precise event time.

Contributor

@yihua yihua left a comment

I love the Hudi 1.X vision of "transactional database for the lake with polyglot persistence". I'm all in on driving some core initiatives in 1.X 🙌


around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.
* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized,
Contributor

I love that you mention SQL support multiple times in the RFC - to bring equal experience to DWH and RDBMS so that SQL-heavy users can use Hudi without friction. Do we plan to make SQL support a priority in all areas (read, write, indexing, e.g., CREATE INDEX, concurrency control, etc.) and even more usable than data source / write client APIs?

Member

@yihua +1. Every operation on Hudi could be expressed and executed with SQL, which reduces users' usage and learning costs and decouples from the engine, so that we don't rely only on engine-specific SQL like Flink SQL, etc.

Contributor

I cannot agree more, it should be super easy to start using Hudi with SQL (and Python which is by far the most popular language I see in Hudi slack?) as a first-class citizen

Member

@kazdy, Iceberg already supports PyIceberg. IMO, we could support a PyHudi to let ML developers access Hudi data.

Member Author

Yes, the single-node Python use-case is good. I wonder if we should just add support for popular frameworks like Polars directly instead. Thoughts? @kazdy @SteNicholas ?

Contributor

@kazdy kazdy May 16, 2023

@vinothchandar I think there are upsides to having e.g. a Rust client with Arrow underneath and maintaining bindings for Python.
Then integration with Polars, Pandas, Apache DataFusion, or whatever comes up later should be fairly easy by using the hudi-rs crate.
At the same time, we can have a Python client in the form of a PyPI package with pretty low effort thanks to the bindings.

@SteNicholas PyIceberg, if I'm not wrong, is written fully in Python and uses PyArrow underneath. That's another option, but it feels like hudi-rs + Python bindings will create less overhead to maintain overall.

Btw, once we have Python bindings/a client, I will be happy to add Hudi support to AWS SDK for pandas; a good chunk of the Hudi Slack community seems to use Hudi on AWS, so it makes sense to do so.
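A purely hypothetical sketch of what such bindings could feel like; no `hudi` PyPI package with this API exists at the time of this discussion, and the names are illustrative.

```python
# Hypothetical: Python bindings over a Rust core, handing Arrow data to Polars.
import polars as pl
from hudi import HudiTable  # hypothetical package backed by a hudi-rs crate

table = HudiTable("s3://bucket/warehouse/trips")

# The Rust core would resolve the timeline and file slices and return Arrow data.
arrow_table = table.read_snapshot(columns=["trip_id", "city", "fare"])

# Hand off into DataFrame libraries via Arrow.
df = pl.from_arrow(arrow_table)
print(df.group_by("city").agg(pl.col("fare").sum()))
```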

Member Author

I think there are upsides to having eg a rust client with Arrow underneath and maintain bindings for Python.

I am more inclined towards this: just have two native implementations, Java and Rust, and wrap the others. I just need to make sure Rust-to-C++ works great as well; I did not have a stellar experience here with Go.

solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
* **Beyond structured Data**: Even as we solved challenges around ingesting, storing, managing and transforming data in parquet/avro/orc, there is still a majority of other data that does not benefit from these capabilities.
Using Hudi's HFile tables for ML Model serving is an emerging use case with users who want a lower-cost, lightweight means to serve computed data directly off the lake storage. Often, unstructured data like JSON and blobs
like images must be pseudo-modeled with some structure, leading to poor performance or manageability. With the meteoric rise of AI/ML in recent years, the lack of support for complex, unstructured, large blobs in a project like Hudi will only fragment data in lakes.
Contributor

Pretty excited to see how Hudi can help AI/ML use cases. I believe we're just scratching the surface here and are going to provide efficient mutation solutions tailored for AI/ML on top of a generic framework.

* **Even greater self-management**: Hudi offers the most extensive set of capabilities today in open-source data lake management, from ingesting data to optimizing data and automating various bookkeeping activities to
automatically manage table data and metadata. Seeing how the community has used this management layer to up-level their data lake experience is impressive. However, we have plenty of capabilities to be added, e.g.,
reverse streaming data into other systems or [snapshot management](https://github.com/apache/hudi/pull/6576/files) or [diagnostic reporters](https://github.com/apache/hudi/pull/6600) or cross-region logical replication or
record-level [time-to-live management](https://github.com/apache/hudi/pull/8062), to name a few.
Contributor

This part is amazing as it's mostly driven by the community, crowdsourcing ideas from various use cases in production.


Rome was not built in a day, and neither can the Hudi 1.x vision be realized in one release. This section outlines the goals for the first 1.0 release and the must-have changes that should be front-loaded.
This RFC solicits more feedback and contributions from the community, to expand the scope or deliver more value to users in the 1.0 release.

In short, we propose that Hudi 1.0 try to achieve the following.
Contributor

@SteNicholas I have the same idea of using LLM to help answer community questions on Hudi, at least providing some hints at the minimum if the machine answer is not perfect 😄

Rome was not built in a day, and neither can the Hudi 1.x vision be realized in one release. This section outlines the goals for the first 1.0 release and the must-have changes that should be front-loaded.
This RFC solicits more feedback and contributions from the community, to expand the scope or deliver more value to users in the 1.0 release.

In short, we propose that Hudi 1.0 try to achieve the following.
Contributor

@SteNicholas do you see any other requirements for AI use cases beyond improving the performance of readers and writers using C++/Rust? Anything on the storage format or the representation of data in the table to foster fast mutation?

In short, we propose that Hudi 1.0 try to achieve the following.

1. Incorporate all/any changes to the format - timeline, log, metadata table...
2. Put up new APIs, if any, across - Table metadata, snapshots, index/metadata table, key generation, record merger,...
Contributor

Do you want to discuss how the 0.x releases (e.g., 0.14) will work alongside 1.x development, to avoid conflict and rebasing difficulties?

Member Author

Good point. I am thinking we merge all the core code abstraction changes - HoodieSchema, HoodieData and such into 0.X line to make rebasing easier, then fork off the 1.0 feature branch.

around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.
* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized,
Contributor

@kazdy kazdy May 11, 2023

If there are some changes to be made to the Hudi Spark integration, then maybe it's the right time to
use Spark's hidden _metadata field (which is a struct) to keep the hoodie meta fields there; if users want them, they can get them with "select _metadata". This feels like a breaking change, so maybe it's the right time to do it? (Btw, it's possible to hide meta columns in the new Presto integration with a config, AFAIK.)
https://spark.apache.org/docs/3.2.1/api/java/org/apache/spark/sql/connector/catalog/MetadataColumn.html
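To make the suggestion concrete: today the hoodie meta fields are ordinary leading columns, while the proposal is to tuck them behind a hidden struct (like Spark's `_metadata` column for file sources) that only appears when explicitly selected. A small sketch; the `_metadata`-based shape for Hudi is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-metacols-sketch").getOrCreate()

trips = spark.read.format("hudi").load("/tmp/hudi/trips")

# Current behavior: meta fields show up as regular columns (and in SELECT *).
trips.select("_hoodie_commit_time", "_hoodie_record_key", "trip_id").show()

# Proposed (hypothetical) shape: nothing hoodie-specific in the default projection,
# and something like `SELECT _metadata.commit_time FROM trips` only on demand.
```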

Member Author

Yes, even on Trino; something to consider. When we were writing Hudi originally, there were no Spark Dataset APIs even, FWIW :)

Member Author

Tracking here. https://issues.apache.org/jira/browse/HUDI-6488 feel free to grab it!

(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
* **Beyond structured Data**: Even as we solved challenges around ingesting, storing, managing and transforming data in parquet/avro/orc, there is still a majority of other data that does not benefit from these capabilities.
Member

+1
Goes without saying that we need more research here, but, I am wondering how this intersects with "generalized" data model.

Member

Also, I think UDFs and materialized views (MV) will be key for ML ecosystem.
Any plan of supporting MVs?

Member Author

I want to understand better what supporting MVs means. See my comment above on Flink MVs using Dynamic Tables, and consolidate there?

Rome was not built in a day, and neither can the Hudi 1.x vision be realized in one release. This section outlines the goals for the first 1.0 release and the must-have changes that should be front-loaded.
This RFC solicits more feedback and contributions from the community, to expand the scope or deliver more value to users in the 1.0 release.

In short, we propose that Hudi 1.0 try to achieve the following.
Member

BTW, could we consider some functions for cloud native, like separation of hot and cold data, integration with a K8s operator, etc.? IMO, the future trend of databases is to move closer to cloud native.

Member Author

@SteNicholas Interesting. I was thinking that we make operators for components like metaserver or cache server (depending on how we build it) or the table management server . Could you expand on the hot/cold data separation idea?

Member

@vinothchandar, the metaserver, cache server or table management server could be run as Kubernetes services. The hot/cold data separation idea refers to "CloudJump: Optimizing Cloud Databases for Cloud Storages".

Member Author

Thanks @SteNicholas ! My thoughts are also to start with sth that can run these on k8s. I will read the paper and reflect back.

Member Author

Tracking here. https://issues.apache.org/jira/browse/HUDI-6489 Happy to help firm up a RFC/design, we can evolve as 1.0 format evolves.


our [metaserver](https://github.com/apache/hudi/pull/4718) that serves only timeline metadata today. The paper (pg 81) describes the tradeoffs between the common concurrency control techniques in
databases: _2Phase Locking_ (hard to implement without a central transaction manager), _OCC_ (works well w/o contention, fails very poorly with contention) and _MVCC_ (yields high throughputs, but relaxed
serializability in some cases). Hudi implements OCC between concurrent writers while providing MVCC-based concurrency for writers and table services to avoid any blocking between them. Taking a step back,
we need to ask ourselves if we are building an OLTP relational database to avoid falling into the trap of blindly applying the same concurrency control techniques that apply to them to the high-throughput
Member

100% with you!
One might think that serializable would be the norm in production databases, but it's far from reality even for RDBMSes. There is an awesome SIGMOD talk by Dr. Andy Pavlo which discusses this point. Sharing a screenshot from the slides: https://www.cs.cmu.edu/~pavlo/slides/pavlo-keynote-sigmod2017.pdf

Member

@codope, could we build HTAP database like TiDB etc?

Member Author

It will be hard to support any real, high-performance transactional workloads on the lake, IMO. Our commit times will still be, say, 20-30 seconds; those are long-running transactions in the transactional world (speaking from my Oracle work experience). :)
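For readers following the OCC discussion above, here is a toy sketch of the conflict check that optimistic concurrency implies: at commit time a writer aborts if any concurrently committed instant touched the same file groups. It mirrors the idea, not Hudi's actual implementation.

```python
from typing import Dict, Set


def has_conflict(my_file_groups: Set[str],
                 committed_since_start: Dict[str, Set[str]]) -> bool:
    """my_file_groups: file group ids this writer modified.
    committed_since_start: instant time -> file group ids committed after we started."""
    for instant, file_groups in committed_since_start.items():
        overlap = my_file_groups & file_groups
        if overlap:
            print(f"conflict with instant {instant} on {sorted(overlap)}: abort and retry")
            return True
    return False


# Writer A and a concurrent commit overlap on 'fg-2', so writer A must retry.
assert has_conflict({"fg-1", "fg-2"}, {"20230512101500": {"fg-2", "fg-7"}})
assert not has_conflict({"fg-1"}, {"20230512101500": {"fg-2", "fg-7"}})
```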

@vinothchandar
Member Author

As next steps, I am consolidating these into the RFC itself, and getting our JIRAs into initial shape.

@parisni
Contributor

parisni commented May 16, 2023

Hi @vinothchandar, I suggest 3 features that we found very important (some are on the roadmap but not mentioned here) and that we'd love to see soon in Hudi:

@soumilshah1995

Ideas

  • Data Replication and Multi-Data Center Support: Enable replication of Hudi tables across multiple data centers or cloud regions, providing high availability, disaster recovery, and improved data locality for distributed applications.

  • Machine Learning Integration: Provide native integration with machine learning frameworks like Apache Spark or TensorFlow, allowing users to train and deploy ML models directly on data stored in Hudi, enabling real-time predictions and analytics.

  • Geospatial Data Support: Extend Hudi to natively support geospatial data types and operations, enabling spatial analytics, geospatial indexing, and querying capabilities within Hudi tables.

Contributor

@pratyakshsharma pratyakshsharma left a comment

Nice writeup. Looks promising.

The log manager component in a database helps organize logs for recovery of the database during crashes, among other things. At the transactional layer, Hudi implements ways to organize data into file groups and
file slices and stores events that modify the table state in a timeline. Hudi also tracks inflight transactions using marker files for effective rollbacks. Since the lake stores way more data than typical operational
databases or data warehouses while needing much longer record version tracking, Hudi generates record-level metadata that compresses well to aid in features like change data capture or incremental queries, effectively
treating data itself as a log. In the future, we would want to continue improving the data organization in Hudi, to provide scalable, infinite timeline and data history, time-travel writes, storage federation and other features.
Contributor

Time-travel writes look interesting to me.
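As a small aside for readers new to the timeline mentioned in the excerpt above: Hudi records table-state changes as instant files under `<table>/.hoodie` (for example `20230510093000123.commit`), and listing and sorting them reconstructs the commit history. A rough sketch, with the file naming simplified and paths illustrative:

```python
import os
import re

# Completed instants look like "<instant time>.<action>", e.g. "20230510093000123.commit".
INSTANT_RE = re.compile(r"^(\d{14,17})\.(commit|deltacommit|clean|compaction)$")


def list_completed_instants(table_path: str):
    timeline_dir = os.path.join(table_path, ".hoodie")
    instants = [(m.group(1), m.group(2))
                for name in os.listdir(timeline_dir)
                if (m := INSTANT_RE.match(name))]
    return sorted(instants)


# e.g. [('20230510093000123', 'commit'), ('20230510101500456', 'deltacommit'), ...]
print(list_completed_instants("/tmp/hudi/trips"))
```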

@jhon-ye

jhon-ye commented May 17, 2023

  1. Hudi should support syncing whole databases through CDC tools like Flink CDC or SeaTunnel CDC.

* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver
(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
Contributor

+1 on this.
In our internal Hudi practice at our company (Kuaishou), we have encountered the need for concurrent processing of a Hudi table, for example multiple streaming jobs processing a Hudi table simultaneously to achieve streaming joins and streaming unions. Having a long-running service which could act as a coordinator between multiple jobs would make everything easier.

Member Author

yes. if we can embrace this model, management becomes much simpler.

Contributor

Maybe it will get less pushback if it's easy to deploy in the cloud (e.g. from the marketplace on AWS).

Being in a smaller company, having another thing to deploy and maintain can be a no-go, especially when the rest of the stack can be, and probably is, serverless.

Having a choice here would be best from the user's perspective.

Member Author

Yes, k8s support on EKS with a Helm chart? All major cloud providers support some level of managed k8s. We can start from there and let the community add more things to it. It'll be hard for us to build any native integrations anyway; we can stay OSS-focused.

@xiarixiaoyao
Contributor

The merge-on-read implementation is friendly for updates, but has poor query performance for the following reasons:

  1. when querying a MoR table, Hudi needs to perform a merge operation between the base file and the log files, which brings additional CPU/memory cost
  2. data skipping works inefficiently, since the log is unsorted and the min-max index is invalid
  3. secondary indexes cannot work on log files directly, and have poor performance

How about introducing delete vectors to Hudi, just like Doris / Delta Lake / Hologres?
Here is the Delta Lake design:
https://docs.google.com/document/d/1lv35ZPfioopBbzQ7zT82LOev7qV7x4YNLkMr2-L5E_M/edit#heading=h.dd2cc57oc5wk
With delete vectors (see the toy sketch below):

  1. the data merge operation is eliminated,
  2. the MDT and the min-max/secondary indexes can work well.
    Of course, we also lose the payload capability if we use delete vectors.
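A toy illustration of the read path with a delete vector: instead of merging log records with the base file, the reader simply skips row positions marked as deleted. Real designs use compressed bitmaps (e.g. roaring bitmaps); a plain set is enough to show the idea.

```python
from typing import Iterable, List, Set


def read_with_delete_vector(base_rows: List[dict],
                            deleted_positions: Set[int]) -> Iterable[dict]:
    for pos, row in enumerate(base_rows):
        if pos not in deleted_positions:   # no record-level merge needed
            yield row


base_file = [
    {"id": 1, "city": "warsaw"},
    {"id": 2, "city": "berlin"},
    {"id": 3, "city": "madrid"},
]
delete_vector = {1}   # the second row was deleted by a later commit

assert [r["id"] for r in read_with_delete_vector(base_file, delete_vector)] == [1, 3]
```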

@parisni
Contributor

parisni commented May 17, 2023

How about introducing delete vectors to Hudi, just like Doris / Delta Lake / Hologres? Here is the Delta Lake design:

More generally, Hudi has been innovating and has been ahead of its time. It's now time to look at the challengers' innovations and backport them where possible. Just by doing so, 1.x will be innovative, because Hudi already has tons of unique features. But there are features out there that are missing in Hudi, and it's time to introduce them too (again: S3-optimized object storage, full DataSource V2 support, partition evolution, and this proposal about delete vectors).

Contributor

@bvaradar bvaradar left a comment

1.0 vision is incredible !


* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions
around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.
Contributor

Agree. The direction seems to be that query-side smartness is increasingly, and rightfully, being pushed down to Hudi's layer.

Member Author

@vinothchandar vinothchandar left a comment

Resolving a few comments. I have cleaned up our existing 1.0 JIRAs. I will pull together the suggestions from here into JIRA epics, with an organization for 1.0 and a follow-on 1.1, and get more feedback from you all.

I would still love some concrete feedback on how we manage the 1.x feature branch alongside 0.x.

Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes.
Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in
recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout.
Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Efforts like Record index, should bring that down dramatically I feel for say random write workloads. This is definitely sth to measure and baseline and set plans for, but not sure how to pull a concrete project around this yet. Multi delta streamer needs more love for sure. will make it 1.1 for now though, since we need to front load other things before

around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
Presto and Trino connectors. That said, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.
* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized,
Member Author

I think there are upsides to having, e.g., a Rust client with Arrow underneath and maintaining bindings for Python.

I am more inclined towards this: just have two native implementations - Java and Rust - and wrap the others. I just need to make sure Rust-to-C++ interop works great as well. Did not have a stellar experience here with Go.

* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver
(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
Member Author

Yes - k8s support on EKS with a Helm chart? All major cloud providers support some level of managed k8s. We can start from there and let the community add more things to it. It'll be hard for us to build any native integrations anyway. We can be OSS-focused.

(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
* **Beyond structured Data**: Even as we solved challenges around ingesting, storing, managing and transforming data in parquet/avro/orc, there is still a majority of other data that does not benefit from these capabilities.
Member Author

I want to understand better what supporting MVs means. See my comment above on Flink MVs using dynamic tables - can we consolidate there?

on MVCC to build indexes without blocking writers and still be consistent upon completion with the table data. Our focus has thus far been more narrowly aimed at using indexing techniques for write performance,
while queries benefit from files and column statistics metadata for planning. In the future, we want to generalize support for using various index types uniformly across writes and queries so that queries can be planned,
optimized and executed efficiently on top of Hudi's indices. This is now possible due to having Hudi's connectors for popular open-source engines like Presto, Spark and Trino.
New [secondary indexing schemes](https://github.com/apache/hudi/pull/5370) and a proposal for built-in index functions to index values derived from columns have already been added.
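To make the planning-side benefit concrete, here is a simplified, engine-agnostic sketch of how per-file column statistics can prune files before any data is read. The class and field names are hypothetical, not Hudi APIs.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: data skipping with per-file column statistics.
public class DataSkippingSketch {

  static final class FileColumnStats {
    final String filePath;
    final long minValue;  // min of the filtered column within this file
    final long maxValue;  // max of the filtered column within this file

    FileColumnStats(String filePath, long minValue, long maxValue) {
      this.filePath = filePath;
      this.minValue = minValue;
      this.maxValue = maxValue;
    }
  }

  /** Keep only files whose [min, max] range can possibly satisfy "col = value". */
  static List<String> pruneForEquality(List<FileColumnStats> stats, long value) {
    return stats.stream()
        .filter(s -> s.minValue <= value && value <= s.maxValue)
        .map(s -> s.filePath)
        .collect(Collectors.toList());
  }
}
```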
Member Author

Yes, a DB-advisor type of thing would be awesome to have; it could recommend indexes and such. We can do it in 1.1 though.


Shared components include replication, loading, and various utilities, complete with a catalog or metadata server. Most databases hide the underlying format/storage complexities, providing users
with many data management tools. Hudi is no exception. Hudi has battle-hardened bulk and continuous data loading utilities (deltastreamer, flinkstreamer tools, along with Kafka Connect Sink),
a comprehensive set of table services (cleaning, archival, compaction, clustering, indexing, ..), admin CLI and much much more. The community has been working on new server components
Member Author

@xushiyan do we have an RFC or epic for this already?

Rome was not built in a day, and neither will the Hudi 1.x vision be. This section outlines the first 1.0 release goals and the potentially must-have changes to be front-loaded.
This RFC solicits more feedback and contributions from the community for expanding the scope or delivering more value to the users in the 1.0 release.

In short, we propose that Hudi 1.0 try to achieve the following.
Member Author

@SteNicholas @danny0405 @yihua This is a good read on the topic of MLOps and ML workloads and their intersection with regular data engineering: https://www.cpard.xyz/posts/mlops_is_mostly_data_engineering/

@vinothchandar
Member Author

Folks - I have cleaned up a lot of items and streamlined most of the format, concurrency and metadata-level changes - anything that affects storage bits and APIs - here as 1.0.

The remaining ideas are added to a 1.1.0 version here, along with others. We can continue to flesh these out in more detail (they may need research) while we make concrete progress on the 1.0 alpha version. We could pull stuff in based on how it shapes up. Thoughts?



The access methods component encompasses indexes, metadata and storage layout organization techniques exposed to reads/writes on the database. Last year, we added
a new [multi-modal index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi) with support for [asynchronous index building](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md) based
Member

Vector databases are quite a hot topic recently, and I took a look trying to figure out what they are. It looks like a database with a special index that provides similarity search. With pluggable indexes and clustering, Hudi is in a sweet spot in this race!

Member

Very nice article - Elasticsearch just can't handle that much data. It would be really helpful if this user could elaborate on their use cases in more detail. A general-purpose LLM + a database that stores the customized data = a customized AI assistant. Search could be the next battleground for lakehouse tech.

Member Author

@garyli1019 Sounds good. Interested in trying https://github.com/lancedb/lance as a base file format and adding some capabilities through one of the existing engines? I can see Elastic, Postgres etc. all adding specialized indexes. I think vector search itself will be commoditized soon; it's just another query type.

Member

Interesting, lots of new stuff coming out recently, let me take a look 👀

@vinothchandar
Member Author

Still cleaning up JIRAs and doing some benchmarks. Will update the RFC with some concrete points for 1.0, around format changes. Stay tuned!

@SteNicholas
Member

SteNicholas commented Jun 2, 2023

@vinothchandar, would the OneTable feature be introduced in Hudi 1.X? Meanwhile, could Hudi provide a self-optimizing feature (user-transparent asynchronous self-optimization mechanisms that keep the lakehouse fresh and healthy)?

BTW, has the database system design been compared with Materialize, which is a streaming database purpose-built for low-latency applications?

Following the above ideas, what's the difference between the database based on Hudi 1.X and other databases? Or, what are the advantages, and what problems can it help users solve, compared to databases like Aliyun's Hologres, which supports a second-level streaming warehouse?

* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver
(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data.
Contributor

Excited about this part. For embracing a hybrid architecture, we have done some work around this. I think the key is an abstract, unified, well-designed API to serve metadata queries (like the methods in FSUtils) and operations (like managing instants), and to plug it into the Hudi client (aka HoodieTableMetaClient). After that, we can easily implement different timeline services, either by parsing the .hoodie directory or by connecting to a backing Hudi metastore service. @vinothchandar
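A rough sketch of the kind of abstraction described above - a pluggable timeline/metadata API that could be backed either by listing the .hoodie directory or by a remote metastore service. All names here are hypothetical, not existing Hudi interfaces.

```java
import java.util.List;

// Hypothetical abstraction: the same interface could be implemented by a
// filesystem-backed timeline (parsing .hoodie) or a metastore-backed one.
public interface TimelineServiceSketch {

  /** Return all completed instants for the table, oldest first. */
  List<String> getCompletedInstants(String tableBasePath);

  /** Begin a new instant of the given action (commit, compaction, clean, ...). */
  String startInstant(String tableBasePath, String action);

  /** Transition an instant to completed, publishing its metadata atomically. */
  void completeInstant(String tableBasePath, String instantTime, byte[] metadata);
}
```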

Member Author

@YannByron agree. Anything in the works, in terms of upstreaming? :) I have now pushed out the metaserver to 1.1.0 since it needs broad alignment across the community. So we have some time, but I'd love to get some alignment here and move ahead.

Contributor

@vinothchandar yep, we want to work on it and already have some research. We will propose an RFC (maybe an umbrella one) in the short run; then let's discuss in more depth. cc @Zouxxyy

@vinothchandar
Member Author

@parisni All of these are in the 1.0.0 epics. Contributions welcome.
#8679 (comment)

@parisni @xiarixiaoyao On the delete-vector-like design: it's actually quite possible today, by simply writing delete blocks and creating new parquet files. Tracking here: https://issues.apache.org/jira/browse/HUDI-6490. Agree on the tradeoffs cited; we can easily support both. There are a few other things to consider around how it affects layout as well.

@vinothchandar
Member Author

@soumilshah1995 All great ideas. Tracking as below for 1.1.0

@vinothchandar
Member Author

vinothchandar commented Jul 5, 2023

The JIRA organization is now in place, and if some of the reviewers here approve - esp. PMC members/committers @danny0405 @leesf @yanghua @yihua @nsivabalan @xushiyan @bvaradar @codope @YannByron @SteNicholas @XuQianJin-Stars @pratyakshsharma - I will land this and we can get to work.

@vinothchandar vinothchandar changed the title from [DOCS] [RFC-69] Hudi 1.X to [RFC-69] Hudi 1.X on Jul 5, 2023
@n3nash
Contributor

n3nash commented Jul 6, 2023

@vinothchandar Great write-up and some very cool ideas and suggestions already! It's an impressive roadmap. Just adding some thoughts and personal opinions from previous experience, what I've heard from other users, and gaps around the most interesting (and impactful) aspects as features are being prioritized.

  • Table APIs: As already called out in the RFC, this enables faster query engine integrations for Hudi and expands the ecosystem to varied kinds of data users in the community. It is one of the pieces that has been missing for Hudi's unique features to be adopted more widely. Additionally, with these APIs, folks who want (and miss) a higher-level, "simpler" interface can start to play with Hudi more easily (see the illustrative sketch after this list).
  • Caching Service: One of the powerful and unique parts of Hudi's design is the file group layout. Building an integrated caching service that can tap into it will provide higher performance than a general table format design. With DuckDB et al. taking off, having low-latency access to hot data, more smartly, would go a long way.
  • Metastore / Catalog service: Looking forward to how writers and readers can deeply integrate with and utilize the power of Hudi's data model without being constrained by Hive SerDes, query planning, etc. Additionally, this will centralize other table services/operations, making it easier for users to manage them uniformly (across deltastreamer, stand-alone Spark or Flink jobs, etc.).
  • Additional Indexing techniques: Due to Hudi's pluggable architecture and the work already done upfront, it's easier to support different forms of indexing. Personally, I've seen geospatial indexing be a requirement for many users.
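As referenced in the first bullet, here is a purely illustrative sketch of what a higher-level, engine-agnostic table API could look like. None of these names are existing Hudi interfaces.

```java
import java.util.Map;

// Purely illustrative: a "simpler", engine-agnostic table API of the kind discussed
// in the "Table APIs" bullet above.
public interface LakeTableSketch {

  /** Open a table directly by its storage path, without going through a query engine. */
  static LakeTableSketch open(String basePath) {
    throw new UnsupportedOperationException("sketch only - no real implementation");
  }

  /** Upsert a batch of records keyed by the table's record key field. */
  void upsert(Iterable<Map<String, Object>> records);

  /** Read the latest snapshot of the table. */
  Iterable<Map<String, Object>> snapshotRead();

  /** Read only the records changed since the given instant (incremental query). */
  Iterable<Map<String, Object>> incrementalRead(String sinceInstantTime);
}
```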

Contributor

@yihua yihua left a comment

LGTM

@soumilshah1995

Looks Great

@vinothchandar
Member Author

@n3nash Thanks for the comments! Agree with them all. A bunch of this work is scheduled for 1.1. We will focus on making the timeline/storage-facing changes in the first 1.0 release!

@vinothchandar vinothchandar merged commit cb48527 into apache:master Jul 11, 2023
1 of 3 checks passed