[BEAM-2991] Sets a TTL on BigQueryIO.read().fromQuery() temp dataset by jkff · Pull Request #3883 · apache/beam

jkff · 2017-09-22T02:05:09Z

Also fixes a bug where we start the query job twice - once to extract the files, once to get schema. Luckily it doesn't actually run twice, because inserting the same job a second time gives an ignorable error, but it was still icky.

Also adds some logging.

R: @reuvenlax

jkff · 2017-09-25T19:28:14Z

Rebased to fix compilation error. PTAL.

jkff · 2017-09-25T22:51:06Z

retest this please

Also fixes a bug where we start the query job twice - once to extract the files, once to get schema. Luckily it doesn't actually run twice, because inserting the same job a second time gives an ignorable error, but it was still icky. Also adds some logging.

reuvenlax

however, please change description to link to appropriate JIRA

polleyg · 2017-10-13T04:53:50Z

@jkff - Has Dataflow always created Datasets/Tables in BigQuery? I've never seen it do that before, and I always through it exported tables/queries to GCS for reading into the pipeline. The issue we are now seeing is that our Dataflow pipelines share the same project id as our BigQuery users.

So, these temp datasets/tables are now showing up in their web UI. Our BigQuery users know nothing about Dataflow and the pipelines we run behind the scenes for them - they just ETL the data in and out of BigQuery. It's abstracted from the BigQuery users.

We've had several users contact us asking why they suddenly see these weird looking datasets and tables in the BigQuery Web UI e.g:

This is not a good experience for our BQ users.

Also, what if a batch pipeline takes more than 24hrs? We've had batch pipelines run close to that due to sheer volume of data its ingesting and writing.

Cheers,
Graham

jkff · 2017-10-14T22:08:51Z

Yes, I believe dataflow always created temporary datasets when reading from a query (except perhaps in some versions before 2.0? I'm not sure but I don't think so, because Bigquery doesn't have a feature to directly export results of a large query to files). Query results are exported to a dataset, then the dataset is exported to files and the dataset is deleted, then the job processes the files and the files are deleted at some point.

…

On Thu, Oct 12, 2017, 9:53 PM Graham Polley ***@***.***> wrote: @jkff <https://github.com/jkff> - Has Dataflow always created Datasets/Tables in BigQuery? I've never seen it do that before, and I always through it exported tables/queries to GCS for reading into the pipeline. The issue we are now seeing is that our Dataflow pipelines share the same project id as our BigQuery users. So, these temp datasets/tables are now showing up in their web UI. Our BigQuery users know nothing about Dataflow and the pipelines we run behind the scenes for them - they just ETL the data in and out of BigQuery. It's abstracted from the BigQuery users. We've had several users contact us asking why they suddenly see these weird looking datasets and tables in the BigQuery Web UI e.g: [image: screen shot 2017-10-09 at 5 06 15 pm] <https://user-images.githubusercontent.com/5554342/31530787-77e96922-b02e-11e7-8cf6-28aa24d87211.png> This is not a good experience for our BQ users. Also, what if a batch pipeline takes more than 24hrs? We've had batch pipelines run close to that due to sheer volume of data its ingesting and writing. Cheers, Graham — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3883 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAU-qv-pEY2XlUrrhAMeYtjhomX1KN02ks5sruzjgaJpZM4PgIq-> .

jkff force-pushed the bq-temp-ttl branch 2 times, most recently from 9b1aaa5 to d06d4b2 Compare September 25, 2017 19:27

jkff force-pushed the bq-temp-ttl branch from d06d4b2 to 911327b Compare September 26, 2017 18:50

reuvenlax approved these changes Sep 27, 2017

View reviewed changes

jkff changed the title ~~Sets a TTL on BigQueryIO.read().fromQuery() temp dataset~~ [BEAM-2991] Sets a TTL on BigQueryIO.read().fromQuery() temp dataset Sep 27, 2017

asfgit closed this in 41239d8 Sep 27, 2017

jkff deleted the bq-temp-ttl branch September 27, 2017 05:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BEAM-2991] Sets a TTL on BigQueryIO.read().fromQuery() temp dataset#3883

[BEAM-2991] Sets a TTL on BigQueryIO.read().fromQuery() temp dataset#3883
jkff wants to merge 1 commit intoapache:masterfrom
jkff:bq-temp-ttl

jkff commented Sep 22, 2017

Uh oh!

jkff commented Sep 25, 2017

Uh oh!

jkff commented Sep 25, 2017

Uh oh!

reuvenlax left a comment

Uh oh!

polleyg commented Oct 13, 2017

Uh oh!

jkff commented Oct 14, 2017 via email

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jkff commented Sep 22, 2017

Uh oh!

jkff commented Sep 25, 2017

Uh oh!

jkff commented Sep 25, 2017

Uh oh!

reuvenlax left a comment

Choose a reason for hiding this comment

Uh oh!

polleyg commented Oct 13, 2017

Uh oh!

jkff commented Oct 14, 2017 via email

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants