Hi all, I have a query regarding CDC using Hudi.
I am using the Spark DataSource API for upserts and deletes on Hudi. What is the best way of doing deletes in Hudi?
Our code flow is:
read Kafka -> persist DF in memory -> filter upserts -> write to Hudi -> filter deletes -> write to Hudi
Is this the right way of handling both upserts and deletes from an incoming stream? The problem with this approach is that Hudi does indexing twice for a single batch of records, since we run the upsert and the delete as separate writes. I would appreciate your suggestions for improving our pipeline.
Can we use "_hoodie_is_deleted" with the Spark DataSource API? We could append a new column, "_hoodie_is_deleted", set to true for delete records and false for insert/update records. If we use "_hoodie_is_deleted", will Hudi hard-delete the row, or will it just null it out? Please confirm.
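To make the idea concrete, here is a minimal sketch of the tagging step we have in mind. Records are plain dicts standing in for DataFrame rows, and the "op" field ('i'/'u'/'d') is an assumed CDC marker coming off Kafka; the actual Spark/Hudi write (with assumed table and key names) is shown in comments, since that part needs a live cluster.

```python
# Sketch: tag every incoming CDC record with "_hoodie_is_deleted" so that
# upserts and deletes can go through ONE Hudi write (one index lookup per
# batch) instead of two separate writes.

def tag_deletes(records):
    """Add `_hoodie_is_deleted` = True for delete ops, False otherwise."""
    tagged = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k != "op"}  # drop CDC marker
        out["_hoodie_is_deleted"] = rec["op"] == "d"
        tagged.append(out)
    return tagged

batch = [
    {"id": 1, "name": "a", "op": "i"},   # insert
    {"id": 2, "name": "b", "op": "u"},   # update
    {"id": 3, "name": "c", "op": "d"},   # delete
]
tagged = tag_deletes(batch)

# In Spark the equivalent would be a single upsert write, roughly
# (table name, record key, and precombine field are assumptions):
#
# df.withColumn("_hoodie_is_deleted", col("op") == "d") \
#   .write.format("hudi") \
#   .option("hoodie.datasource.write.operation", "upsert") \
#   .option("hoodie.datasource.write.recordkey.field", "id") \
#   .option("hoodie.datasource.write.precombine.field", "ts") \
#   .option("hoodie.table.name", "my_table") \
#   .mode("append") \
#   .save("/path/to/table")
```

This is only how we imagine the single-write flow would look; whether Hudi then physically removes the flagged rows is exactly what we are asking above.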