[BEAM-2991] Sets a TTL on BigQueryIO.read().fromQuery() temp dataset #3883

Closed
jkff wants to merge 1 commit into apache:master from jkff:bq-temp-ttl

jkff commented Sep 22, 2017

Also fixes a bug where we start the query job twice - once to extract the files, once to get schema. Luckily it doesn't actually run twice, because inserting the same job a second time gives an ignorable error, but it was still icky.

Also adds some logging.

R: @reuvenlax
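
To illustrate the double-start issue described above, here is a minimal sketch. It is not the code in this PR and it does not use the BigQuery client. `FakeJobService`, `DuplicateJobException`, `insertIgnoringDuplicate`, and `QueryJobOnce` are hypothetical names invented for this example. The sketch assumes a job service that rejects a second insert with the same job ID, which is the behavior the description relies on.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory illustration; none of these types exist in Beam or BigQuery. */
final class QueryJobSketch {

  /** Thrown when a job with the same ID was already inserted. */
  static final class DuplicateJobException extends RuntimeException {
    DuplicateJobException(String jobId) {
      super("Job already exists: " + jobId);
    }
  }

  /** Stand-in job service: remembers jobs by ID and counts insert attempts. */
  static final class FakeJobService {
    private final Map<String, String> jobs = new HashMap<>();
    private int insertAttempts = 0;

    void insert(String jobId, String query) {
      insertAttempts++;
      // A second insert with the same ID is rejected instead of running again.
      if (jobs.putIfAbsent(jobId, query) != null) {
        throw new DuplicateJobException(jobId);
      }
    }

    int startedJobs() {
      return jobs.size();
    }

    int insertAttempts() {
      return insertAttempts;
    }
  }

  /** The pattern described as "icky": every caller inserts and swallows the duplicate error. */
  static void insertIgnoringDuplicate(FakeJobService service, String jobId, String query) {
    try {
      service.insert(jobId, query);
    } catch (DuplicateJobException e) {
      // The job was already submitted under this ID, so nothing new runs.
    }
  }

  /** One possible alternative: start the job at most once and let later callers reuse it. */
  static final class QueryJobOnce {
    private final FakeJobService service;
    private final String jobId;
    private final String query;
    private boolean started = false;

    QueryJobOnce(FakeJobService service, String jobId, String query) {
      this.service = service;
      this.jobId = jobId;
      this.query = query;
    }

    synchronized String ensureStarted() {
      if (!started) {
        service.insert(jobId, query);
        started = true;
      }
      return jobId;
    }
  }

  public static void main(String[] args) {
    FakeJobService before = new FakeJobService();
    insertIgnoringDuplicate(before, "query-job-1", "SELECT 1"); // caller extracting files
    insertIgnoringDuplicate(before, "query-job-1", "SELECT 1"); // caller reading the schema
    System.out.println("before: jobs=" + before.startedJobs()
        + " attempts=" + before.insertAttempts());

    FakeJobService after = new FakeJobService();
    QueryJobOnce job = new QueryJobOnce(after, "query-job-1", "SELECT 1");
    job.ensureStarted(); // caller extracting files
    job.ensureStarted(); // caller reading the schema
    System.out.println("after: jobs=" + after.startedJobs()
        + " attempts=" + after.insertAttempts());
  }
}
```

In both variants only one job runs. The difference is that the second variant never sends the redundant insert, so there is no error to ignore.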

@jkff jkff force-pushed the bq-temp-ttl branch 2 times, most recently from 9b1aaa5 to d06d4b2 Compare September 25, 2017 19:27

jkff commented Sep 25, 2017

Rebased to fix compilation error. PTAL.

jkff commented Sep 25, 2017

retest this please

reuvenlax left a comment

however, please change description to link to appropriate JIRA

@jkff jkff changed the title from "Sets a TTL on BigQueryIO.read().fromQuery() temp dataset" to "[BEAM-2991] Sets a TTL on BigQueryIO.read().fromQuery() temp dataset" Sep 27, 2017
@asfgit asfgit closed this in 41239d8 Sep 27, 2017
@jkff jkff deleted the bq-temp-ttl branch September 27, 2017 05:35

polleyg commented Oct 13, 2017

@jkff - Has Dataflow always created Datasets/Tables in BigQuery? I've never seen it do that before, and I always thought it exported tables/queries to GCS for reading into the pipeline. The issue we are now seeing is that our Dataflow pipelines share the same project id as our BigQuery users.

So, these temp datasets/tables are now showing up in their web UI. Our BigQuery users know nothing about Dataflow and the pipelines we run behind the scenes for them - they just ETL the data in and out of BigQuery. It's abstracted from the BigQuery users.

We've had several users contact us asking why they suddenly see these weird-looking datasets and tables in the BigQuery Web UI, e.g.:

[Image: screen shot 2017-10-09 at 5 06 15 pm]

This is not a good experience for our BQ users.

Also, what if a batch pipeline takes more than 24hrs? We've had batch pipelines run close to that due to the sheer volume of data they're ingesting and writing.

Cheers,
Graham

jkff commented Oct 14, 2017 via email
