[HUDI-6333] allow using manifest file directly to create a bigquery external table #8898
Merged
danny0405 merged 1 commit into apache:master on Jun 22, 2023
Contributor
Author
Hi @yihua, could you please help review or find someone to review this one? Thanks.
danny0405 reviewed on Jun 21, 2023
    private boolean tableExists(HoodieBigQuerySyncClient bqSyncClient, String tableName) {
      if (bqSyncClient.tableExists(tableName)) {
        LOG.info(tableName + " already exists");
Contributor
Do we need an extra util method just for purpose of logging?
Contributor
Author
Not necessary; it just avoids duplicating the code that logs that the table already exists.
danny0405 reviewed on Jun 21, 2023
          manifestFileWriter.getManifestSourceUri(true),
          config.getString(BIGQUERY_SYNC_SOURCE_URI_PREFIX));
      LOG.info("Table creation using the manifest file complete for " + tableName);
    }
Contributor
"Table creation using the manifest file complete for " + tableName
-> "Completed table {} creation using the manifest file"
Contributor
Author
Changed as suggested, thanks!
      extraOptions = String.format("hive_partition_uri_prefix=\"%s\",", sourceUriPrefix);
    }
    String query =
        String.format(
Contributor
Do we have a test case for the creation of the manifest file?
Contributor
Author
Yes, covered in TestManifestFileWriter. Thanks for your review!
danny0405 approved these changes on Jun 22, 2023
Change Logs
To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two BigQuery external tables: one over the data files and the other over a manifest file that contains the data file names. Based on these two tables, it creates a view that reflects the latest version of the data using the following query:

SELECT * FROM data_table WHERE _hoodie_file_name IN (SELECT filename FROM manifest_file_table)

The direct reason for this workaround is that BigQuery did not support manifest files. However, BigQuery is now rolling out manifest file support. This feature allows users to use manifest files rather than data files as source URIs. Right now the roll-out seems to cover only non-partitioned external tables (using hive partitioning returns the error "not supported for hive partition"); it should cover partitioned external tables soon.

Given this new BigQuery feature, it would be better to update BigQuerySyncTool correspondingly:

- A new option, hoodie.gcp.bigquery.sync.use_bq_manifest_file, is introduced in this PR to control whether to use the BigQuery manifest file feature. When this option is true, BigQuerySyncTool creates a manifest file with absolute paths and issues a CREATE EXTERNAL TABLE ... OPTIONS (uris=[MANIFEST_FILE_URI], format="PARQUET", file_set_spec_type="NEW_LINE_DELIMITED_MANIFEST", ...) query to BigQuery.
- Set hoodie.datasource.write.drop.partition.columns as false and allow users to not specify hoodie.gcp.bigquery.sync.source_uri_prefix, so that the partition columns are written into the parquet files and BigQuerySyncTool creates a non-partitioned external table. Querying this external table produces the same results as querying the aforementioned view.

Impact
Provides a more efficient way to query Hudi tables from BigQuery.
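To make the manifest-based path from the change log more concrete, here is a minimal sketch of how the CREATE EXTERNAL TABLE statement could be assembled. It is an illustration under stated assumptions, not the code merged in this PR: the class name ManifestTableDdlSketch, the method buildCreateTableQuery, its parameters, and the example URIs are all hypothetical. Only the option names (uris, format, file_set_spec_type, hive_partition_uri_prefix) come from the PR description and the reviewed diff.

```java
public class ManifestTableDdlSketch {

  /**
   * Hypothetical helper (not the PR's actual method): builds the DDL string only
   * and does not talk to BigQuery.
   */
  public static String buildCreateTableQuery(
      String projectId, String dataset, String table,
      String manifestUri, String sourceUriPrefix) {
    String extraOptions = "";
    if (sourceUriPrefix != null && !sourceUriPrefix.isEmpty()) {
      // Mirrors the hive_partition_uri_prefix fragment shown in the reviewed diff.
      // Per the description, BigQuery may still reject this for manifest-based tables.
      extraOptions = String.format("hive_partition_uri_prefix=\"%s\", ", sourceUriPrefix);
    }
    // The manifest file, not the data files, is passed as the single source URI.
    return String.format(
        "CREATE EXTERNAL TABLE `%s.%s.%s` OPTIONS (%suris=[\"%s\"], format=\"PARQUET\", "
            + "file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")",
        projectId, dataset, table, extraOptions, manifestUri);
  }

  public static void main(String[] args) {
    // Placeholder identifiers and URI, chosen only for illustration.
    System.out.println(buildCreateTableQuery(
        "example-project", "example_dataset", "example_table",
        "gs://example-bucket/path/to/manifest-file", null));
  }
}
```

Leaving the source URI prefix unset yields the non-partitioned form described above, where the partition columns are read from the parquet files themselves.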
Risk level (write none, low, medium or high below)
None. The existing view-based approach will still work.