[BEAM-1909] BigQuery read transform fails for DirectRunner when querying non-US regions #2509

ubunatic · 2017-04-12T13:19:42Z

I partially fixed the issue by reading the location of the source dataset and using it as the location of the temporary dataset that gets created. The added parameters are optional and should not break anything.

Note: The solution works for me with the DirectRunner when I create a BigQuerySource with table=<some-table>. It does not work for a BigQuerySource with query=<some-query>.
The corresponding Jira issue should therefore not be closed yet.

asfbot · 2017-04-12T14:24:10Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9443/

Build result: FAILURE

[...truncated 2.40 MB...]
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed.
    at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:302)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
    ... 31 more
Caused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
    at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
    ... 33 more
2017-04-12T14:08:06.192 [ERROR]
2017-04-12T14:08:06.192 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
2017-04-12T14:08:06.192 [ERROR]
2017-04-12T14:08:06.192 [ERROR] For more information about the errors and possible solutions, please read the following articles:
2017-04-12T14:08:06.192 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
2017-04-12T14:08:06.192 [ERROR]
2017-04-12T14:08:06.192 [ERROR] After correcting the problems, you can resume the build with the command
2017-04-12T14:08:06.192 [ERROR] mvn -rf :beam-sdks-python
channel stopped
Setting status of d773fbd to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9443/ and message: 'Build finished. '
Using context: Jenkins: Maven clean install

sb2nov · 2017-04-12T14:31:51Z

.gitignore

@@ -52,6 +52,9 @@ hs_err_pid*.log
 # Ignore MacOSX files.
 .DS_Store

+# ignore cythonized files


I think this was just added in #2494

OK, I will check that and also fix the issues below.

sb2nov · 2017-04-12T14:33:58Z

sdks/python/apache_beam/io/gcp/bigquery.py

+    try:
+      tr = source_table_reference
+      if tr is not None:
+        if tr.projectId is None: table_project_id = project_id


Nit: avoid inline statements; put the body of each if on the next line.

Sorry, I will use your linting rules from now on.

sb2nov · 2017-04-12T14:42:46Z

sdks/python/apache_beam/io/gcp/bigquery.py

+      tr = source_table_reference
+      if tr is not None:
+        if tr.projectId is None: table_project_id = project_id
+        else:                    table_project_id = tr.projectId


Question: say I'm reading from a public table such as the GitHub Archive. Won't this fail, since the temp table would be created in a project I don't have write access to? Can we keep the same project but pass in a location based on the source?

cc @chamikaramj who might know more about this.

This code section only reads the location from the given source.
The temp dataset is then created with the job's project ID, but in the same location as the source table.
The source table's project ID is ignored after this section.
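
Editor's sketch of the behavior described in this reply, not the code from the PR: the source table's project is used only to look up a location, while the temp dataset itself stays in the executing project. The dicts mirror the `datasetReference` and `location` fields of the BigQuery REST Dataset resource; `source_dataset_location`, `temp_dataset_body`, `fake_get_dataset`, and all project and dataset names are hypothetical and exist only for illustration.

```python
def source_dataset_location(table_ref, executing_project, get_dataset):
    """Return the location of the dataset that holds the source table."""
    # A table reference without a project is assumed to live in the
    # executing project, as in the diff under review.
    project = table_ref.get("projectId") or executing_project
    dataset = get_dataset(project, table_ref["datasetId"])
    return dataset.get("location")


def temp_dataset_body(executing_project, dataset_id, location=None):
    """Build a dataset body that always targets the executing project."""
    body = {
        "datasetReference": {
            "projectId": executing_project,
            "datasetId": dataset_id,
        }
    }
    # Only set a location when one was found; otherwise leave it unset.
    if location is not None:
        body["location"] = location
    return body


# Hypothetical in-memory stand-in for a real dataset lookup.
FAKE_DATASETS = {("public-data", "github_archive"): {"location": "EU"}}


def fake_get_dataset(project_id, dataset_id):
    return FAKE_DATASETS[(project_id, dataset_id)]


source = {
    "projectId": "public-data",
    "datasetId": "github_archive",
    "tableId": "events",
}
location = source_dataset_location(source, "my-project", fake_get_dataset)
body = temp_dataset_body("my-project", "temp_dataset_abc", location)
```

In this sketch the public project is only read from, so missing write access there would not matter; whether the real client behaves the same way depends on the code in the PR.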

sb2nov · 2017-04-12T14:43:43Z

sdks/python/apache_beam/io/gcp/bigquery.py

-        dataset = bigquery.Dataset(
-            datasetReference=bigquery.DatasetReference(
-                projectId=project_id, datasetId=dataset_id))
+        dr = bigquery.DatasetReference(


Nit: Rename dr to dataset_reference

OK.
Question: Do you have any rule for abbreviating short-lived variables?

dhalperi · 2017-04-12T15:51:34Z

R: @sb2nov @chamikaramj

ubunatic · 2017-04-14T23:04:03Z

sdks/python/apache_beam/io/gcp/bigquery.py

    dataset_id = BigQueryWrapper.TEMP_DATASET + self._temporary_table_suffix
+    location = None
+


I removed the try-except. If a source table is given, we should be able to read its location, so there is nothing to catch.

ubunatic · 2017-04-14T23:07:25Z

sdks/python/run_pylint.sh

@@ -23,7 +23,7 @@
 #
 # The exit-code of the script indicates success or a failure.

-set -e
+set -o errexit


Using set -o errexit is preferable, since it is more explicit and thus more readable.

ubunatic · 2017-04-14T23:09:42Z

sdks/python/run_pylint.sh

@@ -39,14 +39,19 @@ EXCLUDED_GENERATED_FILES=(

 FILES_TO_IGNORE=""
 for file in "${EXCLUDED_GENERATED_FILES[@]}"; do
-  if [[ $FILES_TO_IGNORE ]]; then
+  if test -n "$FILES_TO_IGNORE"; then


Square brackets in Bash are a frequent source of confusion. Just use test, so that the reader can type man test to check the options.

ubunatic · 2017-04-14T23:10:32Z

sdks/python/run_pylint.sh

    FILES_TO_IGNORE="$FILES_TO_IGNORE, "
  fi
  FILES_TO_IGNORE="$FILES_TO_IGNORE$(basename $file)"
 done
 echo "Skipping lint for generated files: $FILES_TO_IGNORE"

+if test $# -gt 0


Running pylint on the whole SDK may take some time. This option lets you run it only on the modules you are working on.

coveralls · 2017-04-15T00:03:11Z

Coverage increased (+0.007%) to 70.474% when pulling 9bac351 on ubunatic:temp-table-region into f30d5b9 on apache:master.

asfbot · 2017-04-15T00:05:20Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9561/

chamikaramj

Thanks for the PR.

The way to support queries in general is to do a dry run of the query and get the location from the information available in the result. See below for how this is done in the Java SDK.

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java#L382

I'm OK with that part not being implemented in this PR.
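
Editor's sketch of the dry-run idea in Python, under stated assumptions: it is not taken from this PR or from the Java SDK. It assumes the dry-run response is shaped like the BigQuery REST Job resource, where `statistics.query.referencedTables` lists the tables the query touches. `location_from_dry_run` and the `get_dataset_location` callable are hypothetical names, and the sample job is made up.

```python
def location_from_dry_run(job, get_dataset_location):
    """Pick a location for the temp dataset from a dry-run job response."""
    referenced = (
        job.get("statistics", {}).get("query", {}).get("referencedTables", [])
    )
    if not referenced:
        # A query such as "SELECT 1" references no tables, so nothing is known.
        return None
    # Assumption: every referenced table is in the same location, so the
    # first one is enough to decide.
    first = referenced[0]
    return get_dataset_location(first["projectId"], first["datasetId"])


# Made-up dry-run response containing only the fields used above.
dry_run_job = {
    "statistics": {
        "query": {
            "referencedTables": [
                {"projectId": "p", "datasetId": "eu_data", "tableId": "t"}
            ]
        }
    }
}


def fake_location_lookup(project_id, dataset_id):
    # Hypothetical stand-in for fetching the dataset and reading its location.
    return {"eu_data": "EU"}[dataset_id]


query_location = location_from_dry_run(dry_run_job, fake_location_lookup)
```

A real implementation would obtain `dry_run_job` by submitting the query with the dry-run flag set, which is left out here.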

chamikaramj · 2017-04-16T18:10:46Z

sdks/python/apache_beam/io/gcp/bigquery.py

@@ -607,7 +607,8 @@ def __init__(self, source, test_bigquery_client=None, use_legacy_sql=True,

  def __enter__(self):
    self.client = BigQueryWrapper(client=self.test_bigquery_client)
-    self.client.create_temporary_dataset(self.executing_project)
+    self.client.create_temporary_dataset(
+        self.executing_project, self.source.table_reference)


Can we pass location here instead of a table reference (so that we don't have to update this function when we support queries)?

+1, move the location-fetching logic to a separate helper function.

Agree, that would be better.
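
Editor's sketch of the shape agreed on in this thread: the dataset-creation call receives a location, and a separate helper decides where that location comes from. `RecordingWrapper`, `resolve_location`, `Reader`, and the `lookup` callable are hypothetical stand-ins, not Beam classes.

```python
class RecordingWrapper:
    """Hypothetical stand-in for BigQueryWrapper; it only records calls."""

    def __init__(self):
        self.calls = []

    def create_temporary_dataset(self, project_id, location=None):
        # Takes a location rather than a table reference, so the signature
        # can stay the same once query sources can report a location.
        self.calls.append((project_id, location))


def resolve_location(source, lookup_table_location):
    """Hypothetical helper that finds the location for a source, if known."""
    table_reference = source.get("table_reference")
    if table_reference is not None:
        return lookup_table_location(table_reference)
    # Query sources are not handled in this sketch; no location is set.
    return None


class Reader:
    def __init__(self, source, executing_project, wrapper, lookup):
        self.source = source
        self.executing_project = executing_project
        self.wrapper = wrapper
        self.lookup = lookup

    def __enter__(self):
        location = resolve_location(self.source, self.lookup)
        self.wrapper.create_temporary_dataset(self.executing_project, location)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False


def lookup(table_reference):
    # Pretend every table lives in the EU.
    return "EU"


wrapper = RecordingWrapper()
table_source = {"table_reference": {"datasetId": "d", "tableId": "t"}}
query_source = {"query": "SELECT 1"}
with Reader(table_source, "my-project", wrapper, lookup):
    pass
with Reader(query_source, "my-project", wrapper, lookup):
    pass
```

Adding query support later would then only touch `resolve_location` in this sketch, which is the benefit the reviewers point to.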

chamikaramj · 2017-04-16T18:50:28Z

sdks/python/apache_beam/io/gcp/bigquery.py

+    if tr is not None:
+      if tr.projectId is None:
+        # if the source table has no projectId, assume the given project_id
+        source_project_id = project_id


You should use the executing project here (project_id). The entity executing the query might not have permission to create a Dataset in the project that owns the source table (but the new Dataset should be in the same region as the source Dataset).

Ignore this one. I see that you are only using 'source_project_id' to get the table (not to create the Dataset). But can we avoid that and pass location to this function, as I mentioned in my other comment?

Yes, passing the location is better.

chamikaramj · 2017-04-16T18:52:37Z

sdks/python/run_pylint.sh

@@ -23,7 +23,7 @@
 #
 # The exit-code of the script indicates success or a failure.

-set -e
+set -o errexit


Please move changes to run_pylint.sh to a separate PR.

+1 to moving to an independent PR

ubunatic · 2017-04-18T19:21:19Z

Please use this updated PR: #2582

Uwe Jugel added 8 commits April 10, 2017 14:24

better log message for bigquery temp tables

7953784

fixed spelling, improved wording

6526638

use log formatting instead of string formatting

4378ace

Merge remote-tracking branch 'origin/master' into temp-table-log-message

b818777

ignore cythonized files

0075add

add temp dataset location for non-query BigQuerySource

9f966e6

Merge branch 'master' into temp-table-region

7d9f5ee

fixed location comment

d773fbd

sb2nov suggested changes Apr 12, 2017


Uwe Jugel added 5 commits April 14, 2017 23:41

fixed var names and lint errors

7439fa6

Merge remote-tracking branch 'origin/master' into temp-table-region

b273849

reverted .gitignore

f6da0fa

removed try-except, added comments

79b844e

added module arg, explicit errexit, explicit non-zero test

9bac351

ubunatic commented Apr 14, 2017


chamikaramj reviewed Apr 16, 2017


ubunatic mentioned this pull request Apr 18, 2017

[BEAM-1909] BigQuery read transform fails for DirectRunner when querying non-US regions #2582

Closed

ubunatic closed this Apr 18, 2017


