[HUDI-6333] allow using manifest file directly to create a bigquery external table #8898

Merged
danny0405 merged 1 commit into apache:master from jp0317:manifest
Jun 22, 2023

jp0317 commented Jun 7, 2023

Change Logs

To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two BigQuery external tables: one over the data files and one over a manifest file that lists the data file names. On top of these two tables it creates a view that reflects the latest version of the data, using the following query: SELECT * FROM data_table WHERE _hoodie_file_name IN (SELECT filename FROM manifest_file_table).
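For readers unfamiliar with the view-based workaround, here is a minimal sketch of how such a view query could be assembled. The class name, method name, and dataset-qualified table names are hypothetical and only illustrate the query shape quoted above; this is not the tool's actual code.

```java
// Illustrative sketch only: ViewQuerySketch and viewQuery are hypothetical names.
final class ViewQuerySketch {

  // Builds the filter query described above: keep only rows whose
  // _hoodie_file_name appears in the manifest table's filename column.
  static String viewQuery(String dataTable, String manifestTable) {
    return String.format(
        "SELECT * FROM %s WHERE _hoodie_file_name IN (SELECT filename FROM %s)",
        dataTable, manifestTable);
  }

  public static void main(String[] args) {
    // Table names here are placeholders, not values from the PR.
    System.out.println(viewQuery("my_dataset.data_table", "my_dataset.manifest_file_table"));
  }
}
```

Every query against the view pays for the subquery over the manifest table, which is the overhead the manifest-based external table is meant to remove.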

The direct reason for this workaround is that BigQuery did not support manifest files. However, BigQuery is now rolling out manifest file support, which lets users specify manifest files rather than data files as source URIs. At the moment the rollout seems to cover only non-partitioned external tables (using hive partitioning returns a "not supported for hive partition" error); support for partitioned external tables should follow soon.

Given this new BigQuery feature, it would be better to update BigQuerySyncTool accordingly:

  1. Allow creating a BigQuery-compatible manifest file, which expects absolute paths of data files. This has been done in HUDI-6254.
  2. Allow using the new manifest file to create the external table directly. Users can set the option hoodie.gcp.bigquery.sync.use_bq_manifest_file, introduced in this PR, to control whether to use the BigQuery manifest file feature. When this option is true, BigQuerySyncTool creates a manifest file with absolute paths and issues a CREATE EXTERNAL TABLE ... OPTIONS (uris=[MANIFEST_FILE_URI], format="PARQUET", file_set_spec_type="NEW_LINE_DELIMITED_MANIFEST", ...) query to BigQuery.
  3. Avoid breaking existing user workflows. Some users may rely on the view-based workaround, so it probably makes sense to keep it alive, at least for now. That requires maintaining two versions of the manifest file.
  4. Provide a temporary workaround for using BigQuery manifest file support until the feature extends to partitioned tables. For now, partition columns are not recognized when creating a hive-partitioned external table, and a non-partitioned external table only has the columns present in the parquet data files. To keep the partition columns, a workaround is to set hoodie.datasource.write.drop.partition.columns to false and allow users to leave hoodie.gcp.bigquery.sync.source_uri_prefix unspecified, so that the partition columns are written into the parquet files and BigQuerySyncTool creates a non-partitioned external table. Querying this external table produces the same results as querying the aforementioned view.
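Items 2 and 4 can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class and method names, the example table name, and the manifest URI are hypothetical, and treating the option as false when unset is an assumption. The option keys and the OPTIONS entries are taken from the description above, and the optional hive_partition_uri_prefix entry mirrors the snippet discussed in the review.

```java
import java.util.Properties;

// Illustrative sketch only: ManifestTableDdlSketch and its members are hypothetical names.
final class ManifestTableDdlSketch {
  // Option keys quoted in the PR description.
  static final String USE_BQ_MANIFEST_FILE = "hoodie.gcp.bigquery.sync.use_bq_manifest_file";
  static final String SOURCE_URI_PREFIX = "hoodie.gcp.bigquery.sync.source_uri_prefix";

  // Builds a CREATE EXTERNAL TABLE statement whose only source URI is the manifest file.
  // When sourceUriPrefix is null or empty, the table is created as non-partitioned.
  static String createExternalTableDdl(String tableName, String manifestUri, String sourceUriPrefix) {
    String extraOptions = "";
    if (sourceUriPrefix != null && !sourceUriPrefix.isEmpty()) {
      extraOptions = String.format("hive_partition_uri_prefix=\"%s\", ", sourceUriPrefix);
    }
    return String.format(
        "CREATE EXTERNAL TABLE %s OPTIONS (%suris=[\"%s\"], format=\"PARQUET\", "
            + "file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")",
        tableName, extraOptions, manifestUri);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(USE_BQ_MANIFEST_FILE, "true");
    // SOURCE_URI_PREFIX is deliberately left unset (the workaround in item 4).

    // Assumption: an unset option is treated as false here.
    if (Boolean.parseBoolean(props.getProperty(USE_BQ_MANIFEST_FILE, "false"))) {
      // The table name and manifest URI below are placeholders.
      System.out.println(createExternalTableDdl(
          "my_dataset.my_table",
          "gs://my-bucket/my_table/manifest.csv",
          props.getProperty(SOURCE_URI_PREFIX)));
    }
  }
}
```

With the prefix unset, the generated statement has no hive partitioning option, so the partition columns are only visible if they were written into the parquet files.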

Impact

A more efficient way to query Hudi tables from BigQuery.

Risk level (write none, low, medium or high below)

None. The existing view-based approach will still work.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua self-assigned this Jun 15, 2023
@yihua yihua added the area:query-engine Query engine integrations label Jun 15, 2023
jp0317 commented Jun 20, 2023

Hi @yihua, could you please help review this one or find someone to review it? Thanks.


private boolean tableExists(HoodieBigQuerySyncClient bqSyncClient, String tableName) {
  if (bqSyncClient.tableExists(tableName)) {
    LOG.info(tableName + " already exists");
Contributor commented:

Do we need an extra util method just for purpose of logging?

Contributor Author commented:

Not strictly necessary; it just avoids duplicating the code that logs "table already exists".

      manifestFileWriter.getManifestSourceUri(true),
      config.getString(BIGQUERY_SYNC_SOURCE_URI_PREFIX));
  LOG.info("Table creation using the manifest file complete for " + tableName);
}
Contributor commented:

"Table creation using the manifest file complete for " + tableName
-> "Completed table {} creation using the manifest file"

Contributor Author commented:

Changed as suggested, thanks!

  extraOptions = String.format("hive_partition_uri_prefix=\"%s\",", sourceUriPrefix);
}
String query =
    String.format(
Contributor commented:

Do we have a test case for the creation of the manifest file?

Contributor Author commented:

Yes, covered in TestManifestFileWriter. Thanks for your review!

hudi-bot (Collaborator) commented:

CI report:

@jp0317 jp0317 requested a review from danny0405 June 22, 2023 03:17
@danny0405 danny0405 merged commit 5fd9263 into apache:master Jun 22, 2023