
[SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta column #5189

Closed
nsivabalan opened this issue Mar 31, 2022 · 13 comments
Assignees
Labels
hudistreamer issues related to Hudi streamer (formerly deltastreamer) priority:major degraded perf; unable to move forward; potential bugs

Comments

@nsivabalan
Contributor


Describe the problem you faced

From user:
I am trying to read from a Hudi table and write to another Hudi table using DeltaStreamer, and I am getting this error:

Steps to reproduce:

1. Create the first Hudi table using ConfluentAvroKafkaSource.
2. Create the second table via HoodieIncrSource, consuming the output of the first table.
3. Create the third table via HoodieIncrSource, consuming the output of the second table.

(The error occurs on incremental runs of DeltaStreamer for the 3rd table.)
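The chain above can be modeled without Spark or Hudi at all. The sketch below is a plain-Python illustration (all helper names are hypothetical, not Hudi APIs), assuming, as discussed later in this thread, that the incremental source strips Hoodie meta columns from the upstream table except `_hoodie_partition_path`, while each write adds the full set of meta columns again:

```python
# Plain-Python model of the three-table chain; no Spark/Hudi required.
# Assumption (per the HoodieIncrSource code comment cited later in this
# thread): the incremental read drops Hoodie meta columns EXCEPT
# _hoodie_partition_path.

META_COLS = [
    "_hoodie_commit_time", "_hoodie_commit_seqno",
    "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name",
]

def hudi_write(data_schema):
    # Model a Hudi write as prepending the meta columns to the data schema.
    return META_COLS + list(data_schema)

def incr_read(table_schema):
    # Drop meta columns, but keep _hoodie_partition_path.
    return [c for c in table_schema
            if c not in META_COLS or c == "_hoodie_partition_path"]

t1 = hudi_write(["id", "col1", "col2"])   # table 1 (Kafka source)
t2 = hudi_write(incr_read(t1))            # table 2 (incremental source)

# Table 2's schema now carries _hoodie_partition_path twice, so the
# incremental read feeding table 3 trips Spark's duplicate-column check.
dupes = {c for c in t2 if t2.count(c) > 1}
print(dupes)  # {'_hoodie_partition_path'}
```

Under this model the duplicate is already baked into table 2, which matches the observation below that the failure surfaces when the third hop merely tries to load the second table.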

stacktrace:

client token: N/A
	 diagnostics: User class threw exception: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_partition_path`;
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:90)
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:70)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:440)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
	at com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:76)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:95)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:388)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:283)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:728)

write configs:

spark-submit --master yarn \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,s3://***/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.cores=3 \
  --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=800m \
  --conf spark.executor.memoryOverhead=1800m \
  --conf spark.executor.memory=16g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=1 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=6 \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 \
  --conf spark.rdd.compress=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.yarn.max.executor.failures=5 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.sql.catalogImplementation=hive \
  --deploy-mode cluster \
  s3://*****/jars/deltastreamer-addons-1.1-SNAPSHOT.jar \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=10 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --table-type COPY_ON_WRITE \
  --source-class com.navi.sources.HoodieIncrSource \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://*****/input_path \
  --hoodie-conf hoodie.metrics.on=true \
  --hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY \
  --hoodie-conf hoodie.metrics.pushgateway.host=pushgateway.prod.navi-tech.in \
  --hoodie-conf hoodie.metrics.pushgateway.port=443 \
  --hoodie-conf hoodie.metrics.pushgateway.delete.on.shutdown=false \
  --hoodie-conf hoodie.metrics.pushgateway.job.name=*** \
  --hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false \
  --hoodie-conf hoodie.metrics.reporter.metricsname.prefix=hudi \
  --target-base-path s3://*****/output_path \
  --target-table some_table \
  --enable-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=db \
  --hoodie-conf hoodie.datasource.hive_sync.table=out_tbl \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.inline.max.commits=2 \
  --hoodie-conf hoodie.datasource.write.recordkey.field=contact_number_cleaned \
  --hoodie-conf hoodie.datasource.write.precombine.field=id \
  --hoodie-conf hoodie.datasource.clustering.inline.enable=true \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=134217728 \
  --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=273741824 \
  --source-ordering-field id \
  --hoodie-conf transformer.normalize.json.column=id \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id,col1,col2 from <SRC>)" \
  --transformer-class com.custom.transform.ArrayJsonToStructTypeTransformer,org.apache.hudi.utilities.transform.SqlQueryBasedTransformer


@nsivabalan nsivabalan changed the title from "[SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta colunms" to "[SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta column" Mar 31, 2022
@low-on-mana
Contributor

Environment Description

    Hudi version : 0.10.0

    Spark version : 3.0.1

    Hive version : Hive 3.1.2

    Hadoop version :

    Storage (HDFS/S3/GCS..) : S3

    Running on Docker? (yes/no) : no

@nsivabalan
Contributor Author

@harsh1231: Can you take a look at this issue?

@harsh1231
Contributor

@bvaradar can you please share some context on why we can't delete _hoodie_partition_path, while we do delete all other source meta columns?

// Remove Hoodie meta columns except partition path from input source

@nsivabalan
Contributor Author

@harsh1231: in the meantime (until @bvaradar responds), can you investigate why we are encountering the duplicate column issue?

@nsivabalan nsivabalan added the hudistreamer issues related to Hudi streamer (formerly deltastreamer) label Apr 16, 2022
@nsivabalan nsivabalan added the priority:major degraded perf; unable to move forward; potential bugs label Apr 16, 2022
@harsh1231
Contributor

harsh1231 commented Apr 19, 2022

@nsivabalan @lowmmrfeeder the load from the previous table itself is failing:
at com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
This looks like your internal class, @lowmmrfeeder; you should check why the 2nd table has duplicate columns named _hoodie_partition_path.

@low-on-mana
Contributor

@harsh1231 I am not sure about this; we aren't using the column name _hoodie_partition_path anywhere explicitly.
My guess is that deltastreamer adds the _hoodie_partition_path column to the 1st transformed table. On chaining a transformation, it tries adding it a 2nd time without removing the original column, as we can read here:

// Remove Hoodie meta columns except partition path from input source

I haven't gone in depth, so I might be wrong.
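This hypothesis can be checked against a tiny plain-Python model (hypothetical helper names, not Hudi code): if the incremental read also dropped _hoodie_partition_path along with the other meta columns, the downstream write would no longer produce a duplicate.

```python
# Hypothetical variant of the meta-column handling: strip EVERY meta
# column, including _hoodie_partition_path, before the next write.

META_COLS = {
    "_hoodie_commit_time", "_hoodie_commit_seqno",
    "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name",
}

def drop_all_meta(schema):
    # Remove every Hoodie meta column from the upstream schema.
    return [c for c in schema if c not in META_COLS]

def hudi_write(data_schema):
    # Model a Hudi write as prepending the meta columns.
    return sorted(META_COLS) + list(data_schema)

t1 = hudi_write(["id", "col1", "col2"])
t2 = hudi_write(drop_all_meta(t1))

# With the partition-path column dropped upstream, the 2nd table's
# schema contains each column exactly once.
assert len(t2) == len(set(t2))
```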

@harsh1231
Contributor

@lowmmrfeeder Got it. I got distracted by the private package com.navi; let me try to reproduce this, then we can work on a fix.

@low-on-mana
Contributor

Oh yes, we made a small modification to HoodieIncrSource.java and created our own class. The modification was related to reading the whole source table on the first incremental read with deltastreamer (that is supported in HoodieIncrSource now, but wasn't some time back); the rest of the code is the same.
We were able to reproduce this issue with the current state of HoodieIncrSource.

@yihua
Contributor

yihua commented Apr 29, 2022

@harsh1231 Have you made any progress on reproducing the problem?

@nsivabalan
Contributor Author

@harsh1231: can you spend some time on this? It would be good to get to the bottom of it.

@xushiyan xushiyan moved this to 🚧 Needs Repro in Hudi Issue Support Oct 30, 2022
@nsivabalan
Contributor Author

I was able to reproduce this. Will try to put in a fix by this weekend.

@nsivabalan nsivabalan moved this from 🚧 Needs Repro to 🏁 Triaged in Hudi Issue Support Nov 3, 2022

Repository owner moved this from 🏁 Triaged to ✅ Done in Hudi Issue Support Nov 3, 2022
@nsivabalan
Contributor Author

#7132
