Help with Reading Kafka topic written using Debezium Connector - Deltastreamer #2149
@ashishmgofficial : You need to plug in a transformer class to select only the columns you need, and a record payload to handle deletions. We are currently in the process of adding the transformer to OSS Hudi, but broadly, here is how it will look (gist):
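Broadly, such a transformer would flatten the Debezium envelope and carry the op code forward for the payload class. A hedged, illustrative sketch only (the class and `_change_operation_type` column names are made up, and the TypedProperties package differs between Hudi versions):

```java
import org.apache.hudi.common.config.TypedProperties; // org.apache.hudi.common.util.TypedProperties in older Hudi versions
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

public class DebeziumFlattenTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // Delete events carry the row image in "before" while inserts/updates carry it in "after",
    // so pick whichever struct is populated and keep the op code for the payload class.
    return rowDataset
        .withColumn("row_image",
            when(col("op").equalTo("d"), col("before")).otherwise(col("after")))
        .selectExpr("row_image.*", "op as _change_operation_type");
  }
}
```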
@bvaradar So in this case we should be giving an updated schema file for the target?
@ashishmgofficial : Yes, you are correct. You could create a custom SchemaProvider that inherits from, say, the Confluent Schema Registry based schema provider. Please see below for an example implementation.
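A minimal sketch of the idea, assuming Hudi's SchemaRegistryProvider as the base class (the class name and the unwrapping logic below are illustrative, and constructor/package details may differ across Hudi versions):

```java
import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * Reads the full Debezium envelope schema from the Confluent Schema Registry,
 * but exposes only the nested "after" record as the target (table) schema.
 */
public class DebeziumSchemaRegistryProvider extends SchemaRegistryProvider {

  public DebeziumSchemaRegistryProvider(TypedProperties props, JavaSparkContext jssc) {
    super(props, jssc);
  }

  @Override
  public Schema getTargetSchema() {
    Schema envelope = super.getSourceSchema();
    // "after" is a union of [null, <row record>]; pick the non-null branch.
    Schema afterSchema = envelope.getField("after").schema().getTypes().stream()
        .filter(s -> s.getType() != Schema.Type.NULL)
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("No record schema found for 'after'"));
    // If the transformer appends extra columns (e.g. a change-operation flag),
    // those fields would need to be added to this schema as well.
    return afterSchema;
  }
}
```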
@bvaradar Thanks for the code. I followed your instructions, but when I tried to add an _is_hoodie_deleted column to the dataset for testing, I got the following error with the code mentioned in the post:
Transformer:
SchemaProviderDebezium:
@ashishmgofficial : You don't need _hoodie_is_deleted if you are using the custom transformer.
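Deletions are instead handled by the record payload mentioned earlier. A rough, hedged sketch of what such a payload might look like (the class name and the `_change_operation_type` column are illustrative and assume the transformer emits that column; constructor shapes may vary across Hudi versions):

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

/**
 * Treats Debezium delete events (op == "d") as Hudi deletes by returning an
 * empty insert value, so the record is removed instead of written.
 */
public class DebeziumDeleteAwarePayload extends OverwriteWithLatestAvroPayload {

  public DebeziumDeleteAwarePayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  public DebeziumDeleteAwarePayload(Option<GenericRecord> record) {
    super(record);
  }

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
    Option<IndexedRecord> insertValue = super.getInsertValue(schema);
    if (insertValue.isPresent()) {
      // The op code is expected to be carried through by the transformer.
      Object op = ((GenericRecord) insertValue.get()).get("_change_operation_type");
      if (op != null && "d".equals(op.toString())) {
        return Option.empty(); // signal deletion to Hudi
      }
    }
    return insertValue;
  }
}
```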
@bvaradar I changed the code back as before and ran the DeltaStreamer, but something is causing an error and the data is getting rolled back:
I found this error message in the logs:
I'm able to see the transformed dataset properly when I print (sysout) the dataset from the transformer:
@ashishmgofficial : The exception you pasted is not the real root cause. You should see the root-cause exceptions in the executor logs as well. It would be easier to debug if we knew the root cause. Can you find those exceptions and paste them here? From the logs you pasted, it should be coming from the payload class.
@ashishmgofficial : The one you pasted is only the driver-side log. Do you have executor logs? If you have the Spark history server set up, you can look at the tasks section of the failed stage to find the exception. You can also simply collect all the logs (executor + driver) and attach them (e.g. `yarn logs -applicationId <appId>`).
@bvaradar My bad... I'm attaching the logs.
@ashishmgofficial : This looks like a schema mismatch issue. There might be a bug in the schema provider implementation that I pasted. Can you also attach the schema from the schema registry to help debug? (I guess you are using http://xxxxx:8081/subjects/airflow.public.motor_crash_violation_incidents-value/versions/latest)
@bvaradar Yes, I'm using the above-mentioned URL for the schema:
@ashishmgofficial : I tried to repro with the schema you sent but am unable to. I guess the schema provider may not have been set correctly. Can you paste the entire spark-submit command here along with all the hoodie configs? (The one in the ticket description is old and doesn't include the new changes that were proposed.)
@bvaradar Please find the details:
hudi-kafka.properties
From your earlier link #2149 (comment), I do see that in the SchemaProvider, hoodieDeleteField is set wrongly: `Field hoodieDeleteField = registrySchema.getField("op");` Have you removed hoodieDeleteField from both the SchemaProvider and the Transformer?
@bvaradar I have changed all the code back to how you had sent it earlier, so the hoodieDeleteField is not present now.
Avro Payload:
SchemaProvider:
Transformer:
I have added these three classes to hudi-utilities and hudi-common and built the jar.
@ashishmgofficial : I think there is a bug in the Schema provider implementation:
should have been
Can you try with that change?
@bvaradar Thanks for noticing it. I think that solved the previous error, but it's producing the following error now:
I think I hit this error earlier in the same thread too, when I was trying to add the _hoodie_is_deleted field.
Following is the Kafka data as consumed using kafkacat:
Let me try with the sample data you provided and get back over the weekend.
@ashishmgofficial : It looks like the JSON data and the Avro schema are not matching correctly. When I read the file through Spark directly (please see below), I get a different schema than the one you provided. This is because Debezium is configured to write in "JSON_SCHEMA" mode, which I think is the default. This has both data and schema inlined and is space-inefficient. Since you are actually managing Avro schemas, can you configure Debezium to write Avro records directly rather than JSON? In my experiments (with a custom schema), I saw an 8x speedup in Debezium by changing the format from json_schema to avro. If you still want to write as JSON, disable the inline schema by setting the below Debezium configs to false:

scala> val df = spark.read.json("file:///var/hoodie/ws/docker/inp.json")
scala> df.printSchema()
@bvaradar The JSON I had provided is the output of the kafkacat utility, which outputs JSON. In our process, we have Key = String and Value = Avro for Kafka. The different schema is due to the inline data types in kafkacat's JSON output, which is read as-is by Spark.
@ashishmgofficial : Would it be possible to dump the Avro records (value) as-is into a file and attach it?
@bvaradar Please find the files attached below.
@ashishmgofficial : It took a while to debug this. Basically, the problem is in how Spark deduces the Avro schema from a Row (code in spark-avro). This is incompatible with the schema passed through the schema registry. Here is a gist which avoids the problem with a workaround: https://gist.github.com/bvaradar/f2dbb50f7c7a82178c04d41603269306 Please try this and see if you are able to ingest successfully...
@bvaradar Getting the following error when applying the patch:
I'm applying a git patch for the first time, so I might be doing something silly.
@ashishmgofficial : I found a simpler way to work around this and updated the gist: https://gist.github.com/bvaradar/f2dbb50f7c7a82178c04d41603269306 Can you refresh the above link? You should download it to a local file and apply it with `patch -p1 < <file_path>`.
@bvaradar Thanks!!! It seems to ingest properly. I will test all scenarios like delete etc. and let you know. Thanks for such amazing support!
@bvaradar The patch worked successfully for inserts and upserts, but not for deletes. Attaching the executor logs and kafkacat outputs for reference. I had issued a delete for the record with inc_id = 3.
@bvaradar I changed the Postgres configuration and now the Debezium delete action doesn't create a null value in "before":
But Hudi still throws the earlier error.
This is clearly the error: "Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:" It would be helpful if you attach the dataset in Avro format so I can try reproducing it.
@bvaradar I thought that at first. To confirm, I retried the scenario multiple times and I'm getting the same error every time, but only during deletes.
airflow.public.motor_crash_violation_incidents+0+0000000000 (1).avro.zip
Following is the table when I read the above Avro file:
@ashishmgofficial : With your provided avro file, I am able to ingest without any errors.
Logs
I am able to read the newly added data successfully too:
Do you see anything I am missing here?
BTW, it looks like both the create and the delete have the same last_modified_ts, which means that precombine would not have deleted the records. Is this fake data? If so, can you set the deletion timestamp to be higher?
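For reference, one way to sanity-check what actually landed in the table is to read it back with Spark. A minimal sketch, assuming a hypothetical base path and the inc_id column from this thread:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HudiReadCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-read-check")
        .master("local[*]")
        .getOrCreate();

    // Snapshot-read the Hudi table and check whether the deleted key is still present.
    Dataset<Row> table = spark.read()
        .format("org.apache.hudi")
        .load("/tmp/hudi/motor_crash_violation_incidents/*"); // illustrative base path

    table.filter("inc_id = 3").show(); // should return no rows once the delete is applied
  }
}
```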
@bvaradar The delete worked fine for me as well once I replaced the source. I checked the same scenario using the spark-submit you had provided, with:
I followed these steps:
AvroKafkaSource:
AvroDFSSource:
@ashishmgofficial : If I need to test with Kafka, I would need a way to generate both the key and value payloads. Do you have some script to publish records to Kafka? BTW, yeah, you are right about the _ts_ms ordering field.
@bvaradar We are using the Debezium Postgres connector with Confluent Kafka.
@bvaradar I can provide all the SQL statements in Postgres which I'm using to reproduce this, though:
Insert records:
Issue delete:
These changes are automatically picked up by Confluent Kafka's Postgres Debezium connector and written to the topic.
Not sure if this is going to be of any help, but attaching the latest logs. I can see these messages towards the end:
@ashishmgofficial : This turned out to be unrelated to Hudi. I tested with the local Debezium setup. Debezium is writing 2 Kafka records for each delete, with one of the records having its value set to "null". You can inspect the Kafka topic using kafka-avro-console-consumer. This "null" record is causing the Spark row encoding to fail.

root@schemaregistry:/# kafka-avro-console-consumer --bootstrap-server kafka:9092 --topic debezium.public.motor_crash_violation_incidents --offset 'earliest' --partition 0 --property schema.registry.url=http://localhost:8085 --property print.key=true
@bvaradar Yes, I think that's the tombstone event. You can disable it with the connector config; I believe it's tombstones.on.delete = false.
@ashishmgofficial : Let us know after you update the debezium setting if things work fine end to end. |
@bvaradar You are correct. It worked fine once the config was added. For some reason, kafkacat was not showing the tombstone record.
@bvaradar @ashishmgofficial I used the patch mentioned in #2149 (comment) and the instructions from #2149 (comment), but I got:
I suppose something is wrong with my build? Either way, is there a timeline for if and when the community will integrate a DebeziumAvroPayload into Hudi? Thanks, and sorry for mentioning this in a closed issue.
@toninis This is kind of weird, given the snippet that has the constructor; the class seems to be there in the build.
@vinothchandar Thanks for your response at the time.
Hi Team,
I'm facing a use case where I need to ingest data from a Kafka topic using DeltaStreamer; the topic is loaded using the Debezium connector. So the topic schema contains fields like before, after, ts_ms, op, source, etc. I'm providing the record key as after.id and the precombine key as after.timestamp, but still the entire Debezium output is being ingested. Please find my properties: