
[SUPPORT] BQ synch tool not working with HUDI bundle jar #10629

Closed
masthanmca opened this issue Feb 6, 2024 · 6 comments
Labels
gcp-support (issues related to google ecosystem) · meta-sync · on-call-triaged · priority:major (degraded perf; unable to move forward; potential bugs)

Comments

@masthanmca

Tips before filing an issue

  • Have you gone through our FAQs? yes

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced
BQ sync is not working with the Hudi bundle jar.
I want to enable BQ sync while ingesting data into a Hudi table using a manifest file.
To Reproduce

Steps to reproduce the behavior:

  1. Create a data frame with any schema.
  2. Use the below options for BQ sync along with the other default Hudi configurations.
  3.     hiveConfigs.put("org.apache.hudi.gcp.bigquery.BigQuerySyncTool", "true")
     hiveConfigs.put("hoodie.gcp.bigquery.sync.project_id", bqSyncProjectId)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.dataset_name", bqSyncDatasetName)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.table_name", hoodieHiveSyncTable)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.dataset_location", "us")
     hiveConfigs.put("hoodie.gcp.bigquery.sync.source_uri", bqSyncSourceUri)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.source_uri_prefix", bqSyncSourceUriPrefix)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.base_path", bqSyncBasePath)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.partition_fields", hoodieHiveSyncPartitionFields)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.use_bq_manifest_file", "true")
    
  4. Write the data frame to the Hudi table:
     ds.write.format(HudiFormat).options(hoodieConfigs).options(hiveConfigs).mode(writeMode).save(location)

Expected behavior


Environment Description

  • Hudi version : 0.14.0

  • Spark version : 3.3.2

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : GCS

  • Running on Docker? (yes/no) :no

Additional context


Stacktrace


No error, but the external table is not created in BigQuery.

@ad1happy2go
Collaborator

@masthanmca
Is this the first time you are facing this issue, or did it start after an upgrade?

Your configurations also look wrong. Where did you get these, and which doc did you refer to?
Can you refer to https://hudi.apache.org/docs/gcp_bigquery/ ?

@codope codope added meta-sync gcp-support issues related to google ecosystem priority:major degraded perf; unable to move forward; potential bugs labels Feb 7, 2024
@abhishekshenoy

abhishekshenoy commented Feb 19, 2024

Facing the same issue; it does not work with org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1.

The Hudi write to the path works and Hive sync works, but BQ sync does not.

For now, based on a flag, I have taken the route of manually performing the BQ sync with BigQuerySyncTool after the dataframe.write:

#9355 (comment)

@ad1happy2go
Collaborator

@abhishekshenoy @masthanmca That (#9355 (comment)), i.e. BigQuerySyncTool, is the correct way of doing BQ sync with batch jobs.

The other way is doing this with HudiStreamer.

@abhishekshenoy

abhishekshenoy commented Feb 20, 2024

@ad1happy2go @the-other-tim-brown

But shouldn't that be called internally when we are providing the Hudi BQ configs and enabling META_SYNC_ENABLED?

In my case we use df.write.options(hudiAndHiveAndBQConfigs).save(), and hudiAndHiveAndBQConfigs has both the Hive and BQ related configs.

*But still only Hive sync happens implicitly.*

Is it by design that as part of our write function we need to perform both:

df.write.options(hudiAndHiveAndBQConfigs).save()
new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()
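The manual workaround above can be sketched in Scala as follows. This is a hedged sketch, not the thread's exact code: `getBigQueryProps` is the commenter's own helper name, all property values are placeholders, and it assumes the Hudi 0.14.x `BigQuerySyncTool` constructor that accepts `java.util.Properties`; it needs a running Spark job with GCS and BigQuery access to actually execute.

```scala
import java.util.Properties
import org.apache.hudi.gcp.bigquery.BigQuerySyncTool

// Build the BQ sync properties by hand (all values below are placeholders).
def getBigQueryProps: Properties = {
  val props = new Properties()
  props.setProperty("hoodie.gcp.bigquery.sync.project_id", "my-gcp-project")            // placeholder
  props.setProperty("hoodie.gcp.bigquery.sync.dataset_name", "my_dataset")              // placeholder
  props.setProperty("hoodie.gcp.bigquery.sync.dataset_location", "us")
  props.setProperty("hoodie.gcp.bigquery.sync.table_name", "my_table")                  // placeholder
  props.setProperty("hoodie.gcp.bigquery.sync.source_uri", "gs://bucket/table/dt=*")    // placeholder
  props.setProperty("hoodie.gcp.bigquery.sync.source_uri_prefix", "gs://bucket/table/") // placeholder
  props.setProperty("hoodie.gcp.bigquery.sync.base_path", "gs://bucket/table")          // placeholder
  props.setProperty("hoodie.gcp.bigquery.sync.use_bq_manifest_file", "true")
  props
}

// After df.write...save(basePath) completes, trigger the BQ sync explicitly.
new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()
```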

@ad1happy2go
Collaborator

@masthanmca @abhishekshenoy I went through the code and identified that we need to set both class names to do both meta syncs together. The default value for the below prop is just Hive sync. I tried with Hudi version 0.14.1, and after the write and the Hive sync completed, it tried to do the BigQuery sync as well.

"hoodie.meta.sync.client.tool.class" : "org.apache.hudi.hive.HiveSyncTool,org.apache.hudi.gcp.bigquery.BigQuerySyncTool"
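Applied to the write path from the original report, the fix can be sketched as below. This is a hedged sketch: the option keys come from this thread and the Hudi BigQuery sync docs, all values are placeholders, and `hoodieConfigs`/`basePath` stand in for the rest of the writer's existing configuration.

```scala
// Hedged sketch: enable meta sync and list BOTH sync tool classes, so that
// Hive sync and BigQuery sync both run as part of the Spark write.
val bqSyncOptions = Map(
  "hoodie.datasource.meta.sync.enable" -> "true",
  "hoodie.meta.sync.client.tool.class" ->
    "org.apache.hudi.hive.HiveSyncTool,org.apache.hudi.gcp.bigquery.BigQuerySyncTool",
  "hoodie.gcp.bigquery.sync.project_id"        -> "my-gcp-project",          // placeholder
  "hoodie.gcp.bigquery.sync.dataset_name"      -> "my_dataset",              // placeholder
  "hoodie.gcp.bigquery.sync.dataset_location"  -> "us",
  "hoodie.gcp.bigquery.sync.table_name"        -> "my_table",                // placeholder
  "hoodie.gcp.bigquery.sync.source_uri"        -> "gs://bucket/table/dt=*",  // placeholder
  "hoodie.gcp.bigquery.sync.source_uri_prefix" -> "gs://bucket/table/",      // placeholder
  "hoodie.gcp.bigquery.sync.base_path"         -> "gs://bucket/table",       // placeholder
  "hoodie.gcp.bigquery.sync.use_bq_manifest_file" -> "true"
)

// Single write; both meta syncs run implicitly after it.
ds.write.format("hudi")
  .options(hoodieConfigs)
  .options(bqSyncOptions)
  .mode("append")
  .save(basePath)
```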

@ad1happy2go
Collaborator

@masthanmca Closing out this issue as I confirmed it works. Please reopen in case you still see this issue.

@codope codope closed this as completed Feb 27, 2024