[SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating #2238

@RajasekarSribalan

Description
Hi all, I have a query regarding CDC using Hudi.

I am using the Spark Datasource API for upserts and deletes on Hudi. What is the best way of doing deletes in Hudi?
Our code flow is:
read Kafka -> persist DF in memory -> filter upserts -> write to Hudi -> filter deletes -> write to Hudi

Is this the right way of handling both upserts and deletes from incoming streams? The problem with this approach is that Hudi does the indexing twice for a single batch of records, because we upsert and delete in separate writes. I would like your suggestions for improving our pipeline.
Can we use `_hoodie_is_deleted` with the Spark Datasource API? We could append a `_hoodie_is_deleted` column, set to true for delete records and false for insert/update records, and then do a single write. If we use `_hoodie_is_deleted`, will Hudi hard-delete the row, or does it set it to null? Please confirm.
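To make the question concrete, here is a minimal plain-Python sketch (not Hudi code; the function and table layout are made up for illustration) of the single-write semantics being asked about: one batch carries both upserts and delete markers, and records flagged `_hoodie_is_deleted = true` are hard-deleted, i.e. the row is removed entirely rather than being set to null.

```python
# Plain-Python sketch of the desired single-pass semantics (hypothetical,
# not Hudi's implementation): one batch drives inserts, updates, and deletes.
def upsert_with_deletes(table, batch, key="id"):
    """table: dict mapping record key -> record; batch: incoming records."""
    for rec in batch:
        k = rec[key]
        if rec.get("_hoodie_is_deleted", False):
            # Hard delete: the row disappears from the table, it is not nulled.
            table.pop(k, None)
        else:
            # Insert or update, dropping the marker column itself.
            table[k] = {f: v for f, v in rec.items()
                        if f != "_hoodie_is_deleted"}
    return table

table = {1: {"id": 1, "val": "a"}, 2: {"id": 2, "val": "b"}}
batch = [
    {"id": 2, "val": "b2", "_hoodie_is_deleted": False},  # update
    {"id": 1, "_hoodie_is_deleted": True},                # delete
    {"id": 3, "val": "c", "_hoodie_is_deleted": False},   # insert
]
result = upsert_with_deletes(table, batch)
# id 1 is gone, id 2 is updated, id 3 is inserted
```

With this shape, the pipeline above would tag each record with the flag and issue a single Hudi write, so indexing runs once per batch instead of twice.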
