
[BEAM-6553] A Python SDK sink that supports File Loads into BQ #7655

Merged: 11 commits into apache:master on Feb 17, 2019

Conversation

@pabloem (Member) commented Jan 28, 2019

This PR implements file loads into BQ from the Python SDK for Batch pipelines.

r: @chamikaramj

Open question

  • Need to improve documentation significantly before merging.

Features being implemented

  • Use of temporary tables for atomicity in case a subset of load jobs fails
  • Support for multiple destinations on streaming - left for a later PR.
  • File Loads transform has not been wired up to WriteToBigQuery (a usage sketch of applying it directly follows this list).
  • Support for time partitioning - left for a later PR.
  • Support for multiple destinations
  • Known issue: if a worker dies and the same load job is triggered again, it will fail because it will have the same name as before.
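
Since the transform is not yet reachable through WriteToBigQuery, a minimal sketch of applying it directly; the module path matches this PR, but the table, bucket, and parameter names should be treated as illustrative:

import apache_beam as beam
from apache_beam.io.gcp import bigquery_file_loads

with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([{'name': 'beam', 'stars': 42}])
      | bigquery_file_loads.BigQueryBatchFileLoads(
          destination='my-project:my_dataset.my_table',  # hypothetical table
          schema='name:STRING,stars:INTEGER',
          custom_gcs_temp_location='gs://my-bucket/tmp'))  # hypothetical bucket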

Test results

@pabloem (Member, Author) commented Jan 29, 2019

Run Python Dataflow ValidatesRunner

1 similar comment
@pabloem (Member, Author) commented Jan 29, 2019

Run Python Dataflow ValidatesRunner

@pabloem (Member, Author) commented Jan 30, 2019

Run Python PostCommit

@chamikaramj (Contributor) commented:

cc: @reuvenlax

@pabloem changed the title from "A Python SDK sink that supports File Loads into BQ" to "[BEAM-6553] A Python SDK sink that supports File Loads into BQ" on Jan 30, 2019
@pabloem (Member, Author) commented Jan 31, 2019

Run Python PostCommit

@pabloem (Member, Author) commented Jan 31, 2019

Run Python PreCommit

@pabloem (Member, Author) commented Jan 31, 2019

Run Python PostCommit

1 similar comment
@pabloem (Member, Author) commented Feb 2, 2019

Run Python PostCommit

@pabloem (Member, Author) commented Feb 5, 2019

Fixed lint issue. Ready for review.

@chamikaramj (Contributor) left a comment:

Thanks Pablo. This looks great.

@chamikaramj (Contributor) commented:

Sorry approved by mistake. Just wanted to send comments.

@chamikaramj (Contributor) left a comment:

(resending as a change request)

@pabloem (Member, Author) commented Feb 7, 2019

@chamikaramj I've addressed comments. LMK if that is reasonable. Overview of changes:

  • Improved documentation
  • Job names are deterministic, so a re-insertion of the same job should fail (a sketch of the naming idea follows below).
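
For illustration, a minimal sketch of the deterministic-naming idea, assuming the name is derived only from stable inputs; this is not the PR's exact _bq_uuid code:

import hashlib

def deterministic_job_name(job_name_prefix, destination):
  # destination is e.g. 'project:dataset.table' (hypothetical format); a
  # retried bundle recomputes the same name, so BigQuery rejects the
  # duplicate job instead of loading the data twice.
  seed = '%s:%s' % (job_name_prefix, destination)
  return 'beam_load_%s' % hashlib.md5(seed.encode('utf-8')).hexdigest()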

job_references = [elm[1] for elm in dest_ids_list]

while True:
  status = self._check_job_states(job_references)

equal_to([job_reference]), label='CheckJobs')


@unittest.skipIf(HttpError is None, 'GCP dependencies are not installed')
Member Author:

@chamikaramj there are three integration tests added here. They test different kinds of functionality. They can be reduced to two, but I'd say two is the minimum.

Member Author:

Removed one IT. Two more are left with a number of unit tests.

@pabloem (Member, Author) commented Feb 8, 2019

Run Python PostCommit

@chamikaramj (Contributor) left a comment:

Thanks!

_MAXIMUM_LOAD_SIZE = 15 * ONE_TERABYTE

# BigQuery only supports up to 10 thousand URIs for a single load job.
_MAXIMUM_SOURCE_URIS = 10 * 1000
Contributor: Where is this used?



def _generate_file_prefix(pipeline_gcs_location):
  # If a gcs location is provided to the pipeline, then we shall use that.
Contributor: nit: GCS

Member Author: Hm? GCS is what it says : )


copy_job_name = '%s_copy_%s_to_%s' % (
    job_name_prefix,
    _bq_uuid('%s:%s.%s' % (copy_from_reference.projectId,
Contributor:

Will we end up creating the same UUID for the same destination table ? Will this result in jobs being rejected when we try to execute multiple jobs against the same destination table ? It'll be interesting to see what Java SDK currently does in this case.

Member Author:

There is now an index for jobs targeting the same table, and the ID is not regenerated (a sketch of the idea follows below).
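
A minimal sketch of that per-destination index, with hypothetical names; the PR itself hashes the table references via _bq_uuid, so this illustrates only the counter:

import collections

class CopyJobNamer(object):
  """Hypothetical helper showing one job-name index per destination table."""

  def __init__(self, job_name_prefix):
    self._prefix = job_name_prefix
    # Two copy jobs into the same table get distinct but still
    # deterministic names; each index is computed once and kept.
    self._indices = collections.defaultdict(int)

  def next_name(self, source_id, destination_id):
    index = self._indices[destination_id]
    self._indices[destination_id] += 1
    return '%s_copy_%s_to_%s_%d' % (
        self._prefix, source_id, destination_id, index)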

    _bq_uuid('%s:%s.%s' % (copy_from_reference.projectId,
                           copy_from_reference.datasetId,
                           copy_from_reference.tableId)),
    _bq_uuid('%s:%s.%s' % (copy_to_reference.projectId,
Contributor:

We should check if the job already exists in BQ and use that if the job exists (it's important to make sure that the BQ ID is unique for this Beam/Dataflow job instance)

Member Author:

The id is generated once, and not recomputed, so it's always the same for a single pipeline execution. We've resolved this in person.

table_reference.tableId = job_name
yield pvalue.TaggedOutput(TriggerLoadJobs.TEMP_TABLES, table_reference)

job_reference = self.bq_wrapper.perform_load_job(
Contributor:

We should also check if the job exists before creating it, to be resilient to workitem/bundle failures (a sketch of this get-or-create pattern follows below).
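
A sketch of that get-or-create pattern, written against the standalone google-cloud-bigquery client rather than Beam's internal BigQueryWrapper; the function and argument names here are illustrative:

from google.api_core.exceptions import NotFound
from google.cloud import bigquery

def get_or_create_load_job(client, job_id, uris, destination, job_config):
  """Reuses a load job that a previous (failed) bundle already started."""
  try:
    return client.get_job(job_id)
  except NotFound:
    return client.load_table_from_uri(
        uris, destination, job_id=job_id, job_config=job_config)

# client = bigquery.Client(project='my-project')  # hypothetical project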


@retry.with_exponential_backoff(
    num_retries=MAX_RETRIES,
    retry_filter=retry.retry_on_server_errors_and_timeout_filter)
Contributor:

We should be careful about timeout filters. Timeout does not give any guarantees regarding whether the job was received by the BQ service or not.
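
For illustration, the same decorator using the filter that excludes timeouts (apache_beam.utils.retry also provides retry_on_server_errors_filter); combined with deterministic job names, an ambiguous duplicate insert then fails loudly instead of loading twice. The function body below is a placeholder:

from apache_beam.utils import retry

MAX_RETRIES = 4  # placeholder; the real constant lives in the module above

@retry.with_exponential_backoff(
    num_retries=MAX_RETRIES,
    retry_filter=retry.retry_on_server_errors_filter)
def trigger_load_job(load_job_request):
  # Non-idempotent call: retried only on definite server-side errors,
  # never on a timeout where the job may in fact have been accepted.
  pass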

@pabloem (Member, Author) commented Feb 14, 2019

Run Python PostCommit

@chamikaramj (Contributor) commented:

Please resolve the conflicts.

@pabloem (Member, Author) commented Feb 16, 2019

Run Python PostCommit

@chamikaramj (Contributor) left a comment:

Thanks Pablo. This looks great. LGTM.

Please squash commits and self merge.

@pabloem merged commit cdea885 into apache:master on Feb 17, 2019
@pabloem deleted the bqloads branch on February 17, 2019 02:59
pabloem added a commit to pabloem/beam that referenced this pull request Feb 19, 2019
pabloem added a commit to pabloem/beam that referenced this pull request Feb 19, 2019
pabloem added a commit that referenced this pull request Feb 22, 2019
[BEAM-6711] [BEAM-6553] A Python SDK sink that supports File Loads into BQ (#7655)
Juta pushed a commit to Juta/beam that referenced this pull request Feb 25, 2019
Juta pushed a commit to Juta/beam that referenced this pull request Feb 25, 2019