[SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating #2238

@RajasekarSribalan

Description
Hi all, I have a query regarding CDC using Hudi.

I am using the Spark Datasource API for upserts and deletes on Hudi. What is the best way of doing deletes in Hudi?
Our code flow is:
read Kafka -> persist DF in memory -> filter upserts -> write to Hudi -> filter deletes -> write to Hudi

Is this the right way of handling both upserts and deletes from incoming streams? The problem with this approach is that Hudi does the indexing twice for a single batch of records, because we upsert and delete in separate writes. I would like your suggestions for improving our pipeline.
Can we use `_hoodie_is_deleted` with the Spark Datasource API? We could append a `_hoodie_is_deleted` column, set to true for delete records and false for insert/update records, and then do a single write. If we use `_hoodie_is_deleted`, will Hudi hard-delete the row, or does it set it to null? Please confirm.
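To make the question concrete, here is a minimal plain-Python sketch (not Hudi code; the function and table layout are made up for illustration) of the single-write semantics being asked about: one batch carries both upserts and delete markers, and records flagged `_hoodie_is_deleted = true` are hard-deleted, i.e. the row is removed entirely rather than being set to null.

```python
# Plain-Python sketch of the desired single-pass semantics (hypothetical,
# not Hudi's implementation): one batch drives inserts, updates, and deletes.
def upsert_with_deletes(table, batch, key="id"):
    """table: dict mapping record key -> record; batch: incoming records."""
    for rec in batch:
        k = rec[key]
        if rec.get("_hoodie_is_deleted", False):
            # Hard delete: the row disappears from the table, it is not nulled.
            table.pop(k, None)
        else:
            # Insert or update, dropping the marker column itself.
            table[k] = {f: v for f, v in rec.items()
                        if f != "_hoodie_is_deleted"}
    return table

table = {1: {"id": 1, "val": "a"}, 2: {"id": 2, "val": "b"}}
batch = [
    {"id": 2, "val": "b2", "_hoodie_is_deleted": False},  # update
    {"id": 1, "_hoodie_is_deleted": True},                # delete
    {"id": 3, "val": "c", "_hoodie_is_deleted": False},   # insert
]
result = upsert_with_deletes(table, batch)
# id 1 is gone, id 2 is updated, id 3 is inserted
```

With this shape, the pipeline above would tag each record with the flag and issue a single Hudi write, so indexing runs once per batch instead of twice.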
