[SUPPORT] SqlQueryBasedTransformer new field issue with PostgresDebeziumSource #11468
Comments
Slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1718691646054409. This issue arises when attempting to use a transformer with --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer. It throws an error even when using SELECT *.
@ashwinagalcha-ps The SQL should also include <SRC>, right?
@ad1happy2go I missed adding it here, but we did test with <SRC> in the query. Sorry for the confusion, I'll update the description.
@ad1happy2go It was already added here, but since it was plain text and not code, GitHub stripped it.
Hey Aditya, I personally verified that the SQL transformer throws an error. Here is the lab: https://github.com/soumilshah1995/universal-datalakehouse-postgres-ingestion-deltastreamer. I changed the streamer code; as you can see, we just tried a simple SELECT *.
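A hedged sketch of the minimal transformer configuration being described (the exact lab code lives in the repo linked above; the config key and arguments here are assumptions in the same style as the configs later in this issue):

# Hypothetical minimal transformer config mirroring the lab above;
# even this pass-through query reportedly triggers the error.
transformer_args = [
    "--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
    "--hoodie-conf", "hoodie.streamer.transformer.sql=SELECT * FROM <SRC>",
]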
Thanks @soumilshah1995. Will check it out.
aye aye captain
Has anyone solved this? I also get this error when using Kafka + Debezium + Streamer (with --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer). The error only happens when the checkpoint is at the end of the Kafka topic and hoodie.deltastreamer.transformer.sql specifically references a column name.
It seems that when the transformer class is applied in spark-submit, even with 0 new records the Streamer still runs my configured SQL string against the temp table (which is empty).
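To make that failure mode concrete, here is a minimal PySpark repro sketch. It assumes the empty micro-batch surfaces as a DataFrame with an empty schema (an assumption about the mechanism, not the actual Hudi code path); the temp view name is illustrative, not Hudi's internal table name.

# Simulate the described failure: with 0 new records, SQL that references
# a concrete column cannot be resolved against the empty temp view.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local[1]").appName("empty-batch-repro").getOrCreate()

# A micro-batch with zero records and no columns.
empty_df = spark.createDataFrame([], StructType([]))
empty_df.createOrReplaceTempView("SRC_TMP_TABLE")

# Raises AnalysisException ("cannot resolve 'created_at'"), since the
# empty view has no such column to resolve against.
spark.sql("SELECT *, extract(year FROM created_at) AS year FROM SRC_TMP_TABLE").show()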
When using Kafka + Debezium + Streamer, we are able to write data and the job works fine. But with the SqlQueryBasedTransformer, the job initially writes data to S3 with the new field and then ultimately fails.
query=SELECT *, extract(year from a.created_at) as year FROM <SRC> a
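For context, SqlQueryBasedTransformer conceptually registers the incoming batch as a temp view and replaces the <SRC> placeholder with that view's name before running the SQL. A simplified sketch of that behavior (names are illustrative; the real implementation generates its own temp table name):

from pyspark.sql import SparkSession, DataFrame

def apply_sql_transform(spark: SparkSession, source_df: DataFrame, sql: str) -> DataFrame:
    # Register the current micro-batch so the user query can reference it.
    tmp_table = "SRC_TMP_TABLE"  # illustrative; Hudi generates its own name
    source_df.createOrReplaceTempView(tmp_table)
    # Substitute the <SRC> placeholder, then execute the user-supplied SQL.
    return spark.sql(sql.replace("<SRC>", tmp_table))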
Below are the Hudi DeltaStreamer job configs (a sketch of a typical spark-submit invocation follows the list):
"--table-type", "COPY_ON_WRITE",
"--source-class", "org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource",
"--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
"--hoodie-conf", f"hoodie.streamer.transformer.sql={query}",
"--source-ordering-field", output["source_ordering_field"],
"--target-base-path", f"s3a://{env_params['deltastreamer_bucket']}/{db_name}/{schema}/{output['table_name']}/",
"--target-table", output["table_name"],
"--auto.offset.reset=earliest
"--props", properties_file,
"--payload-class", "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload",
"--enable-hive-sync",
"--hoodie-conf", "hoodie.datasource.hive_sync.mode=hms",
"--hoodie-conf", "hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true",
"--hoodie-conf", f"hoodie.deltastreamer.source.kafka.topic={connector_name}.{schema}.{output['table_name']}",
"--hoodie-conf", f"schema.registry.url={env_params['schema_registry_url']}",
"--hoodie-conf", f"hoodie.deltastreamer.schemaprovider.registry.url={env_params['schema_registry_url']}/subjects/{connector_name}.{schema}.{output['table_name']}-value/versions/latest",
"--hoodie-conf", "hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer",
"--hoodie-conf", "hoodie.datasource.hive_sync.use_jdbc=false",
"--hoodie-conf", f"hoodie.datasource.hive_sync.database={output['hive_database']}",
"--hoodie-conf", f"hoodie.datasource.hive_sync.table={output['table_name']}",
"--hoodie-conf", "hoodie.datasource.hive_sync.metastore.uris=",
"--hoodie-conf", "hoodie.datasource.hive_sync.enable=true",
"--hoodie-conf", "hoodie.datasource.hive_sync.support_timestamp=true",
"--hoodie-conf", "hoodie.deltastreamer.source.kafka.maxEvents=100000",
"--hoodie-conf", f"hoodie.datasource.write.recordkey.field={output['record_key']}",
"--hoodie-conf", f"hoodie.datasource.write.precombine.field={output['precombine_field']}",
"--hoodie-conf", f"hoodie.datasource.hive_sync.glue_database={output['hive_database']}",
"--continuous"
Properties file:
bootstrap.servers=host:port
auto.offset.reset=earliest
schema.registry.url=http://host:8081
Expected behavior: To be able to extract a new field (year) into the target Hudi table with the help of SqlQueryBasedTransformer.
Environment Description
Hudi version : 0.14.0
Spark version : 3.4.1
Hadoop version : 3.3.4
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Base image & jars:
public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar
https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.14.0/hudi-utilities-bundle_2.12-0.14.0.jar
Stacktrace