
[BEAM-8458] Add option to set temp dataset in BigQueryIO.Read #9852

Merged
merged 7 commits into apache:master from iht:query_temp_dataset on Feb 26, 2020

Conversation

@iht (Contributor) commented Oct 22, 2019

When using fromQuery, BigQueryIO creates a temporary dataset and table to store the results of the query. As a result, Beam requires permission to create datasets, a very broad permission, just to be able to run a query. With this option, BigQueryIO can write the temporary results of the query to a pre-existing dataset, so it only needs permission to run queries and to create tables inside that dataset (not in the whole project) in order to use fromQuery.

For instance, if we read directly from a table with from, as in this example:

PCollection<TableRow> rows = p.apply(BigQueryIO.readTableRows().from(tableSpec));

we only need to assign the role roles/bigquery.jobUser to the service account used by the runner (e.g. Dataflow) to be able to extract data from tableSpec.

However, if we want to read from a view and we try the following code (where query selects from that view) with only that role, it will fail, because the pipeline does not have permission to create datasets (and fromQuery creates a temporary dataset and table):

PCollection<TableRow> rows = 
    p.apply(BigQueryIO.readTableRows().fromQuery(query).usingStandardSql());

So in order to read views with BigQueryIO, we currently need to grant the role roles/bigquery.user. This role is much broader than roles/bigquery.jobUser: it allows the pipeline to create as many datasets as it likes in the project it is reading data from.

With this PR, I am adding a new option to BigQueryIO, withQueryTempDataset, that makes it possible to set an existing dataset to create the necessary temporary tables. Thus, users wanting to limit the amount of permissions granted to Apache Beam could create a dataset prior to creating the pipeline, and assign permissions in that dataset only to the Apache Beam service account.

This is a much narrower set of permissions: the pipeline's ability to write to BigQuery is confined to the specified dataset, compared to the current requirement of granting permission to create datasets anywhere in the project just to read from views.
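
For example, a pipeline reading from a view could then look like this (a sketch: the project, view, and dataset names are placeholders, and it assumes withQueryTempDataset takes the dataset name as a String):

PCollection<TableRow> rows =
    p.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT * FROM `my_project.my_dataset.my_view`")
            .usingStandardSql()
            .withQueryTempDataset("my_pre_created_temp_dataset"));

The pipeline then only needs roles/bigquery.jobUser to run the query, plus permission to create and delete tables inside my_pre_created_temp_dataset.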


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [x] Choose reviewer(s) and mention them in a comment (R: @username).
  • [x] Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[Post-commit build status badge matrix omitted]

Pre-Commit Tests Status (on master branch)

[Pre-commit build status badge matrix omitted]

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@iht (Contributor, Author) commented Oct 22, 2019

R: @chamikaramj

When using fromQuery, BigQueryIO creates a temp dataset to store the results of
the query. Therefore, Beam requires permissions to create datasets just to be
able to run a query. With this option, BigQueryIO can write the temp results of
the query to a pre-existing dataset, and therefore it only needs permissions to
run queries and create tables to be able to use fromQuery.
stale bot commented Jan 8, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jan 8, 2020
@iht (Contributor, Author) commented Jan 10, 2020

I will resolve the conflicts, and will add some more docs for this PR.

@stale stale bot removed the stale label Jan 10, 2020
@iht (Contributor, Author) commented Feb 2, 2020

(PR text updated to provide more details about the intent of this PR)

@chamikaramj (Contributor) left a comment:

Thanks!

static TableReference createTempTableReference(String projectId, String jobUuid) {
String queryTempDatasetId = "temp_dataset_" + jobUuid;
static TableReference createTempTableReference(
String projectId, String jobUuid, Optional<String> queryTempDatasetIdOpt) {
@chamikaramj (Contributor):

Just "tempDatasetId" should be good I think.

@iht (Contributor, Author):

Changed.
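
For context on the createTempTableReference change discussed above, here is a simplified sketch of what the updated helper could look like (not the exact Beam implementation; the temp table naming below is an assumption):

import com.google.api.services.bigquery.model.TableReference;
import java.util.Optional;

class TempTableRefSketch {
  // Simplified sketch, not the exact Beam code: fall back to the generated
  // per-job dataset name when the user did not supply a temp dataset.
  static TableReference createTempTableReference(
      String projectId, String jobUuid, Optional<String> tempDatasetIdOpt) {
    String datasetId = tempDatasetIdOpt.orElse("temp_dataset_" + jobUuid);
    // Assumption: the temp table name stays job-scoped, so collisions are unlikely.
    String tableId = "temp_table_" + jobUuid;
    return new TableReference()
        .setProjectId(projectId)
        .setDatasetId(datasetId)
        .setTableId(tableId);
  }
}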

* <p>Users can optionally specify a query priority using {@link
* TypedRead#withQueryPriority(TypedRead.QueryPriority)} and a geographic location where the query
* will be executed using {@link TypedRead#withQueryLocation(String)}. Query location must be
* specified for jobs that are not executed in US or EU, or if you are reading from an authorized
@chamikaramj (Contributor):

Did you mean to mention "withQueryTempDataset" here ?

@iht (Contributor, Author):

No, I accidentally reformatted these lines, that's why they are in the PR. Let me see if I can undo these changes.

@@ -1342,16 +1354,23 @@ void cleanup(ContextContainer c) throws Exception {
BigQueryOptions options = c.getPipelineOptions().as(BigQueryOptions.class);
String jobUuid = c.getJobId();

Optional<String> queryTempDataset = Optional.ofNullable(getQueryTempDataset());
@chamikaramj (Contributor):

If the dataset is provided by the user, we should try to validate (before pipeline submission) that it exists, unless the user specified withoutValidation().

@iht (Contributor, Author):

Got it, let me add that.

@iht (Contributor, Author):

Commit added for that. See also reply to another of your comments.

* needed. No other tables in the dataset will be modified. If your job does not have
* permissions to create a new dataset, and you want to use {@link #fromQuery(String)} (for
* instance, to read from a view), you should use this option. Remember that the dataset must
* exist and your job needs permissions to create and remove tables inside that dataset.
@chamikaramj (Contributor):

We should also make sure that any table that Beam creates or deletes dynamically does not conflict with an existing table in the dataset (at runtime).

@iht (Contributor, Author):

I have just added two commits. If the user specifies the temp dataset and the pipeline is using fromQuery, the following checks are performed (roughly as sketched below):

  • check that the specified dataset exists
  • check that the destination table does not exist, to avoid overwriting any existing table in the dataset specified by the user (unlikely, due to the random generation of UUIDs for the temp tables, but not impossible)
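
A rough sketch of these two checks, written against the standalone google-cloud-bigquery client purely for illustration (Beam performs them through its own BigQuery services layer; the project, dataset, and table names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.TableId;

class TempDatasetValidationSketch {
  static void validate(String project, String tempDatasetId, String tempTableId) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // 1. The user-supplied temp dataset must already exist; Beam will not create it.
    if (bigquery.getDataset(DatasetId.of(project, tempDatasetId)) == null) {
      throw new IllegalArgumentException(
          "Temp dataset " + tempDatasetId + " does not exist in project " + project);
    }

    // 2. The generated temp table must not collide with an existing table,
    //    since Beam will create it and delete it when the read finishes.
    if (bigquery.getTable(TableId.of(project, tempDatasetId, tempTableId)) != null) {
      throw new IllegalArgumentException(
          "Table " + tempTableId + " already exists in dataset " + tempDatasetId);
    }
  }
}

This mirrors the behavior described in the comments above: validation only runs when fromQuery is used and withoutValidation() was not specified.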

@@ -68,13 +78,15 @@ private BigQueryQuerySourceDef(
Boolean useLegacySql,
BigQueryIO.TypedRead.QueryPriority priority,
String location,
String queryTempDataset,
@chamikaramj (Contributor):

Just "tempDatasetId"

@iht (Contributor, Author):

Changed

@iht (Contributor, Author) commented Feb 5, 2020

I am working on addressing the suggestions from the code review, and submitting additional commits. Still WIP. I will write again once I have submitted all the changes.

Only if the user is using fromQuery and not skipping validation
When the user specifies the temp dataset, Beam does not control its creation,
and it might write over an existing table if the generated table name collides
with an existing table (unlikely, but not impossible)
@iht (Contributor, Author) commented Feb 7, 2020

I have now addressed all your comments @chamikaramj

Please have a look at the new changes.

Thanks.

@aaltay (Member) commented Feb 20, 2020

R: @chamikaramj / @pabloem -- could you please take a look?

@chamikaramj (Contributor):
LGTM. Thanks.

@chamikaramj (Contributor):
Retest this please

@chamikaramj (Contributor):
Run Java PostCommit

@chamikaramj (Contributor):
Run Dataflow ValidatesRunner

@chamikaramj (Contributor):
Run Java PreCommit

1 similar comment
@chamikaramj (Contributor):
Run Java PreCommit

@chamikaramj (Contributor):
Run JavaPortabilityApi PreCommit

@chamikaramj (Contributor):
Error for failing task seems to be unrelated.
Task 'javaPreCommitPortabilityApiJava11' not found in root project 'beam'.
org.gradle.execution.TaskSelectionException: Task 'javaPreCommitPortabilityApiJava11' not found in root project 'beam'

Merging.

@chamikaramj chamikaramj merged commit 2a4092d into apache:master Feb 26, 2020
@iht (Contributor, Author) commented Feb 26, 2020

Thank you!

@iht iht deleted the query_temp_dataset branch February 26, 2020 22:40