Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-1909] BigQuery read transform fails for DirectRunner when querying non-US regions #2509

Closed
wants to merge 13 commits into from

Conversation

ubunatic
Copy link

I partially fixed the issue by getting the location of the source Dataset. Then I use this location as location of the created temp dataset. The added parameters are optional and should not break anything.

Note: The solution works for me when using the DirectRunner if I create a BigQuerySource with table=<some-table>. It does not work for a BigQuerySource with query=<some-query>.
The corresponding Jira issue should not yet be closed.

@asfbot
Copy link

asfbot commented Apr 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9443/

Build result: FAILURE

[...truncated 2.40 MB...] at java.lang.Thread.run(Thread.java:745)Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed. at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:302) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) ... 31 moreCaused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) ... 33 more2017-04-12T14:08:06.192 [ERROR] 2017-04-12T14:08:06.192 [ERROR] Re-run Maven using the -X switch to enable full debug logging.2017-04-12T14:08:06.192 [ERROR] 2017-04-12T14:08:06.192 [ERROR] For more information about the errors and possible solutions, please read the following articles:2017-04-12T14:08:06.192 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException2017-04-12T14:08:06.192 [ERROR] 2017-04-12T14:08:06.192 [ERROR] After correcting the problems, you can resume the build with the command2017-04-12T14:08:06.192 [ERROR] mvn -rf :beam-sdks-pythonchannel stoppedSetting status of d773fbd to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9443/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install
--none--

.gitignore Outdated
@@ -52,6 +52,9 @@ hs_err_pid*.log
# Ignore MacOSX files.
.DS_Store

# ignore cythonized files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just added in #2494

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will check that and also fix the below issues.

try:
tr = source_table_reference
if tr is not None:
if tr.projectId is None: table_project_id = project_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: avoid inline statements and format the statements inside if on the next line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I will use your linting rules from now on.

tr = source_table_reference
if tr is not None:
if tr.projectId is None: table_project_id = project_id
else: table_project_id = tr.projectId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question let's say I'm reading from a public table such as Github archive then won't this fail as the temp table is being created in a project I don't have write access to. Can we keep the same project but pass in a location based on the source.

cc @chamikaramj who might know more about this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code section is just for reading the location from the given source.
The temp creation will then use the jobs project id but use the same location as the source table.
The source table's project id is ignored after this section.

dataset = bigquery.Dataset(
datasetReference=bigquery.DatasetReference(
projectId=project_id, datasetId=dataset_id))
dr = bigquery.DatasetReference(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Rename dr to dataset_reference

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
Question: Do you have any rule for abbreviating short-lived variables?

@dhalperi
Copy link
Contributor

R: @sb2nov @chamikaramj

dataset_id = BigQueryWrapper.TEMP_DATASET + self._temporary_table_suffix
location = None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the try-except. If a source table is given, we should be able t read it's location. There is nothing to except.

@@ -23,7 +23,7 @@
#
# The exit-code of the script indicates success or a failure.

set -e
set -o errexit
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using set -o errexit is preferable, since it is more explicit and thus more readable

@@ -39,14 +39,19 @@ EXCLUDED_GENERATED_FILES=(

FILES_TO_IGNORE=""
for file in "${EXCLUDED_GENERATED_FILES[@]}"; do
if [[ $FILES_TO_IGNORE ]]; then
if test -n "$FILES_TO_IGNORE"; then
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

square brackets in Bash are always subject to confusion, just use test thus the reader may type man test to check the options

FILES_TO_IGNORE="$FILES_TO_IGNORE, "
fi
FILES_TO_IGNORE="$FILES_TO_IGNORE$(basename $file)"
done
echo "Skipping lint for generated files: $FILES_TO_IGNORE"

if test $# -gt 0
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pylint on the whole sdk may take some time. This option helps to run it on the modules you are working on

@coveralls
Copy link

Coverage Status

Coverage increased (+0.007%) to 70.474% when pulling 9bac351 on ubunatic:temp-table-region into f30d5b9 on apache:master.

@asfbot
Copy link

asfbot commented Apr 15, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9561/
--none--

Copy link
Contributor

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

The way to support queries in general will be to run a dry run of the query and get the location using the information available in the result of the query. See below for how this is done in Java SDK.

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java#L382

I OK with that part not being implemented in this PR.

@@ -607,7 +607,8 @@ def __init__(self, source, test_bigquery_client=None, use_legacy_sql=True,

def __enter__(self):
self.client = BigQueryWrapper(client=self.test_bigquery_client)
self.client.create_temporary_dataset(self.executing_project)
self.client.create_temporary_dataset(
self.executing_project, self.source.table_reference)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass location here instead of a table reference (so that we don't have to update this function when we support queries) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 move the location fetching logic to a separate helper function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, that would be better.

if tr is not None:
if tr.projectId is None:
# if the source table has no projectId, assume the given project_id
source_project_id = project_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use the executing project here (project_id). The entity executing the query might not have permissions to create a Dataset in the project that owns the source table (but new Dataset should be in the same region as the source Dataset).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this one. I see that you are only using 'source_project_id' to get the table (not to create the Datasaet). But can we not do that and pass location to this function as I mentioned in my other comment ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, passing location is better

@@ -23,7 +23,7 @@
#
# The exit-code of the script indicates success or a failure.

set -e
set -o errexit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move changes to run_pylint.sh to a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to moving to an independent PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@ubunatic
Copy link
Author

use this updated PR: #2582

@ubunatic ubunatic closed this Apr 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants