
[SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta column #5189

Closed
nsivabalan opened this issue Mar 31, 2022 · 13 comments
Assignees
Labels
hudistreamer issues related to Hudi streamer (formerly deltastreamer) priority:major degraded perf; unable to move forward; potential bugs

Comments

@nsivabalan
Contributor


Describe the problem you faced

From user:
I am trying to read from a Hudi table and write to another Hudi table using DeltaStreamer, and I am getting this error:

Steps to reproduce:

1. Create the first Hudi table using ConfluentAvroKafkaSource.
2. Create the second table via HoodieIncrSource, consuming the output of the first table.
3. Create the third table via HoodieIncrSource, consuming the output of the second table.

(The error occurs on incremental runs of DeltaStreamer for the 3rd table.)
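The chain above can be modeled without Spark or Hudi at all. The sketch below is a plain-Python illustration (all helper names are hypothetical, not Hudi APIs), assuming, as discussed later in this thread, that the incremental source strips Hoodie meta columns from the upstream table except `_hoodie_partition_path`, while each write adds the full set of meta columns again:

```python
# Plain-Python model of the three-table chain; no Spark/Hudi required.
# Assumption (per the HoodieIncrSource code comment cited later in this
# thread): the incremental read drops Hoodie meta columns EXCEPT
# _hoodie_partition_path.

META_COLS = [
    "_hoodie_commit_time", "_hoodie_commit_seqno",
    "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name",
]

def hudi_write(data_schema):
    # Model a Hudi write as prepending the meta columns to the data schema.
    return META_COLS + list(data_schema)

def incr_read(table_schema):
    # Drop meta columns, but keep _hoodie_partition_path.
    return [c for c in table_schema
            if c not in META_COLS or c == "_hoodie_partition_path"]

t1 = hudi_write(["id", "col1", "col2"])   # table 1 (Kafka source)
t2 = hudi_write(incr_read(t1))            # table 2 (incremental source)

# Table 2's schema now carries _hoodie_partition_path twice, so the
# incremental read feeding table 3 trips Spark's duplicate-column check.
dupes = {c for c in t2 if t2.count(c) > 1}
print(dupes)  # {'_hoodie_partition_path'}
```

Under this model the duplicate is already baked into table 2, which matches the observation below that the failure surfaces when the third hop merely tries to load the second table.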

stacktrace:

client token: N/A
	 diagnostics: User class threw exception: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_partition_path`;
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:90)
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:70)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:440)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
	at com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:76)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:95)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:388)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:283)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:728)

write configs:

spark-submit --master yarn \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,s3://***/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.cores=3 \
  --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=800m \
  --conf spark.executor.memoryOverhead=1800m \
  --conf spark.executor.memory=16g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=1 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=6 \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 \
  --conf spark.rdd.compress=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.yarn.max.executor.failures=5 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.sql.catalogImplementation=hive \
  --deploy-mode cluster \
  s3://*****/jars/deltastreamer-addons-1.1-SNAPSHOT.jar \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=10 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --table-type COPY_ON_WRITE \
  --source-class com.navi.sources.HoodieIncrSource \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://*****/input_path \
  --hoodie-conf hoodie.metrics.on=true \
  --hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY \
  --hoodie-conf hoodie.metrics.pushgateway.host=pushgateway.prod.navi-tech.in \
  --hoodie-conf hoodie.metrics.pushgateway.port=443 \
  --hoodie-conf hoodie.metrics.pushgateway.delete.on.shutdown=false \
  --hoodie-conf hoodie.metrics.pushgateway.job.name=*** \
  --hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false \
  --hoodie-conf hoodie.metrics.reporter.metricsname.prefix=hudi \
  --target-base-path s3://*****/output_path \
  --target-table some_table \
  --enable-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=db \
  --hoodie-conf hoodie.datasource.hive_sync.table=out_tbl \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.inline.max.commits=2 \
  --hoodie-conf hoodie.datasource.write.recordkey.field=contact_number_cleaned \
  --hoodie-conf hoodie.datasource.write.precombine.field=id \
  --hoodie-conf hoodie.datasource.clustering.inline.enable=true \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=134217728 \
  --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=273741824 \
  --source-ordering-field id \
  --hoodie-conf transformer.normalize.json.column=id \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id,col1,col2 from <SRC>)" \
  --transformer-class com.custom.transform.ArrayJsonToStructTypeTransformer,org.apache.hudi.utilities.transform.SqlQueryBasedTransformer


@nsivabalan nsivabalan changed the title from "[SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta colunms" to "[SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta column" Mar 31, 2022
@low-on-mana
Contributor

Environment Description

    Hudi version : 0.10.0

    Spark version : 3.0.1

    Hive version : Hive 3.1.2

    Hadoop version :

    Storage (HDFS/S3/GCS..) : S3

    Running on Docker? (yes/no) : no

@nsivabalan
Contributor Author

@harsh1231: Can you take a look at this issue?

@harsh1231
Contributor

@bvaradar can you please share some context on why we can't delete _hoodie_partition_path, while we do delete all other source meta columns?

// Remove Hoodie meta columns except partition path from input source

@nsivabalan
Contributor Author

@harsh1231: in the meantime (until @bvaradar responds), can you investigate why we are encountering the duplicate column issue?

@nsivabalan nsivabalan added the hudistreamer issues related to Hudi streamer (formerly deltastreamer) label Apr 16, 2022
@nsivabalan nsivabalan added the priority:major degraded perf; unable to move forward; potential bugs label Apr 16, 2022
@harsh1231
Contributor

harsh1231 commented Apr 19, 2022

@nsivabalan @lowmmrfeeder the load from the previous table itself is failing:
at com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
This looks like your internal class, @lowmmrfeeder; you should check why the 2nd table has duplicate columns named _hoodie_partition_path.

@low-on-mana
Contributor

@harsh1231 I am not sure about this; we aren't using the column name _hoodie_partition_path anywhere explicitly.
My guess is that deltastreamer adds the _hoodie_partition_path column to the 1st transformed table. On chaining a transformation, it tries adding it a 2nd time without removing the original column, as we can read here:

// Remove Hoodie meta columns except partition path from input source

I haven't gone in depth, so I might be wrong.
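This hypothesis can be checked against a tiny plain-Python model (hypothetical helper names, not Hudi code): if the incremental read also dropped _hoodie_partition_path along with the other meta columns, the downstream write would no longer produce a duplicate.

```python
# Hypothetical variant of the meta-column handling: strip EVERY meta
# column, including _hoodie_partition_path, before the next write.

META_COLS = {
    "_hoodie_commit_time", "_hoodie_commit_seqno",
    "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name",
}

def drop_all_meta(schema):
    # Remove every Hoodie meta column from the upstream schema.
    return [c for c in schema if c not in META_COLS]

def hudi_write(data_schema):
    # Model a Hudi write as prepending the meta columns.
    return sorted(META_COLS) + list(data_schema)

t1 = hudi_write(["id", "col1", "col2"])
t2 = hudi_write(drop_all_meta(t1))

# With the partition-path column dropped upstream, the 2nd table's
# schema contains each column exactly once.
assert len(t2) == len(set(t2))
```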

@harsh1231
Contributor

@lowmmrfeeder Got it. I got distracted by the private package com.navi; let me try to reproduce this, then we can work on a fix.

@low-on-mana
Contributor

Oh yes, we made a small modification to HoodieIncrSource.java and created our own class. The modification was related to reading the whole source table on the first incremental read with deltastreamer (that is supported in HoodieIncrSource now, but wasn't some time back); the rest of the code is the same.
We were able to reproduce this issue with the current state of HoodieIncrSource.

@yihua
Contributor

yihua commented Apr 29, 2022

@harsh1231 Have you made any progress on reproducing the problem?

@nsivabalan
Contributor Author

@harsh1231: can you spend some time on this? It would be good to get to the bottom of it.

@xushiyan xushiyan moved this to 🚧 Needs Repro in Hudi Issue Support Oct 30, 2022
@nsivabalan
Contributor Author

I was able to reproduce this. Will try to put in a fix by this weekend.

@nsivabalan nsivabalan moved this from 🚧 Needs Repro to 🏁 Triaged in Hudi Issue Support Nov 3, 2022

Repository owner moved this from 🏁 Triaged to ✅ Done in Hudi Issue Support Nov 3, 2022
@nsivabalan
Contributor Author

#7132
