-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-1909] BigQuery read transform fails for DirectRunner when querying non-US regions #2509
Conversation
Refer to this link for build results (access rights to CI server needed): Build result: FAILURE[...truncated 2.40 MB...] at java.lang.Thread.run(Thread.java:745)Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed. at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:302) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) ... 31 moreCaused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) ... 33 more2017-04-12T14:08:06.192 [ERROR] 2017-04-12T14:08:06.192 [ERROR] Re-run Maven using the -X switch to enable full debug logging.2017-04-12T14:08:06.192 [ERROR] 2017-04-12T14:08:06.192 [ERROR] For more information about the errors and possible solutions, please read the following articles:2017-04-12T14:08:06.192 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException2017-04-12T14:08:06.192 [ERROR] 2017-04-12T14:08:06.192 [ERROR] After correcting the problems, you can resume the build with the command2017-04-12T14:08:06.192 [ERROR] mvn -rf :beam-sdks-pythonchannel stoppedSetting status of d773fbd to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9443/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install--none-- |
.gitignore
Outdated
@@ -52,6 +52,9 @@ hs_err_pid*.log | |||
# Ignore MacOSX files. | |||
.DS_Store | |||
|
|||
# ignore cythonized files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was just added in #2494
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I will check that and also fix the below issues.
try: | ||
tr = source_table_reference | ||
if tr is not None: | ||
if tr.projectId is None: table_project_id = project_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: avoid inline statements and format the statements inside if on the next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I will use your linting rules from now on.
tr = source_table_reference | ||
if tr is not None: | ||
if tr.projectId is None: table_project_id = project_id | ||
else: table_project_id = tr.projectId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question let's say I'm reading from a public table such as Github archive then won't this fail as the temp table is being created in a project I don't have write access to. Can we keep the same project but pass in a location based on the source.
cc @chamikaramj who might know more about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code section is just for reading the location from the given source.
The temp creation will then use the jobs project id but use the same location as the source table.
The source table's project id is ignored after this section.
dataset = bigquery.Dataset( | ||
datasetReference=bigquery.DatasetReference( | ||
projectId=project_id, datasetId=dataset_id)) | ||
dr = bigquery.DatasetReference( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Rename dr to dataset_reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Question: Do you have any rule for abbreviating short-lived variables?
dataset_id = BigQueryWrapper.TEMP_DATASET + self._temporary_table_suffix | ||
location = None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the try-except. If a source table is given, we should be able t read it's location. There is nothing to except.
@@ -23,7 +23,7 @@ | |||
# | |||
# The exit-code of the script indicates success or a failure. | |||
|
|||
set -e | |||
set -o errexit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using set -o errexit
is preferable, since it is more explicit and thus more readable
@@ -39,14 +39,19 @@ EXCLUDED_GENERATED_FILES=( | |||
|
|||
FILES_TO_IGNORE="" | |||
for file in "${EXCLUDED_GENERATED_FILES[@]}"; do | |||
if [[ $FILES_TO_IGNORE ]]; then | |||
if test -n "$FILES_TO_IGNORE"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
square brackets in Bash are always subject to confusion, just use test
thus the reader may type man test
to check the options
FILES_TO_IGNORE="$FILES_TO_IGNORE, " | ||
fi | ||
FILES_TO_IGNORE="$FILES_TO_IGNORE$(basename $file)" | ||
done | ||
echo "Skipping lint for generated files: $FILES_TO_IGNORE" | ||
|
||
if test $# -gt 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pylint
on the whole sdk may take some time. This option helps to run it on the modules you are working on
Refer to this link for build results (access rights to CI server needed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
The way to support queries in general will be to run a dry run of the query and get the location using the information available in the result of the query. See below for how this is done in Java SDK.
I OK with that part not being implemented in this PR.
@@ -607,7 +607,8 @@ def __init__(self, source, test_bigquery_client=None, use_legacy_sql=True, | |||
|
|||
def __enter__(self): | |||
self.client = BigQueryWrapper(client=self.test_bigquery_client) | |||
self.client.create_temporary_dataset(self.executing_project) | |||
self.client.create_temporary_dataset( | |||
self.executing_project, self.source.table_reference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass location here instead of a table reference (so that we don't have to update this function when we support queries) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 move the location fetching logic to a separate helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, that would be better.
if tr is not None: | ||
if tr.projectId is None: | ||
# if the source table has no projectId, assume the given project_id | ||
source_project_id = project_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use the executing project here (project_id). The entity executing the query might not have permissions to create a Dataset in the project that owns the source table (but new Dataset should be in the same region as the source Dataset).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore this one. I see that you are only using 'source_project_id' to get the table (not to create the Datasaet). But can we not do that and pass location to this function as I mentioned in my other comment ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, passing location is better
@@ -23,7 +23,7 @@ | |||
# | |||
# The exit-code of the script indicates success or a failure. | |||
|
|||
set -e | |||
set -o errexit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move changes to run_pylint.sh to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to moving to an independent PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
use this updated PR: #2582 |
I partially fixed the issue by getting the
location
of the sourceDataset
. Then I use this location as location of the created temp dataset. The added parameters are optional and should not break anything.Note: The solution works for me when using the
DirectRunner
if I create aBigQuerySource
withtable=<some-table>
. It does not work for aBigQuerySource
withquery=<some-query>
.The corresponding Jira issue should not yet be closed.