
[BEAM-2405] Write to BQ using the streaming API #3288

Closed
sb2nov wants to merge 2 commits into apache:master from sb2nov:BEAM-BQ-SINK

Conversation

@sb2nov (Contributor) commented Jun 2, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify.
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

This should be ready for review so that we can test it with streaming pipelines.

There are a few follow-up items after this PR:

  • Integration with the Dataflow runner to work in both Batch and Streaming
  • Migrate the tornadoes example to use this and deprecate the sink interface

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 70.626% when pulling 4792096 on sb2nov:BEAM-BQ-SINK into 9cdae6c on apache:master.

@sb2nov (Contributor, Author) commented Jun 3, 2017

R: @chamikaramj PTAL

@coveralls

Coverage Status

Coverage remained the same at 70.64% when pulling 4533a80 on sb2nov:BEAM-BQ-SINK into 9cdae6c on apache:master.

@chamikaramj (Contributor) left a comment

Thanks.

request = bigquery.BigqueryTablesInsertRequest(
    projectId=project_id, datasetId=dataset_id, table=table)
response = self.client.tables.Insert(request)
logging.info("Created the table with id %s", table_id)
Contributor

logging.debug ?

"Created a table"

Contributor Author

Done


def __init__(self, table_id, dataset_id, project_id, batch_size, schema,
             create_disposition, write_disposition, client):
  self.table_id = table_id
Contributor

Please add a doc comment.

Contributor Author

Done

self._rows_buffer = []
# Transform the table schema into a bigquery.TableSchema instance.
if isinstance(self.schema, basestring):
  # TODO(silviuc): Should add a regex-based validation of the format.
Contributor

Are we still hoping to do this (TODO) ?

Contributor Author

Removed

if isinstance(self.schema, basestring):
  # TODO(silviuc): Should add a regex-based validation of the format.
  table_schema = bigquery.TableSchema()
  schema_list = [s.strip(' ') for s in self.schema.split(',')]
Contributor

s.strip()

Contributor Author

Done
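For context on the `s.strip()` suggestion above: `strip(' ')` removes only literal space characters, while a bare `strip()` removes any leading and trailing whitespace. A generic illustration (not the PR's actual code):

```python
s = '  value:INTEGER\t'
# strip(' ') removes only spaces, so the trailing tab survives:
assert s.strip(' ') == 'value:INTEGER\t'
# strip() with no argument removes all whitespace, as the reviewer suggests:
assert s.strip() == 'value:INTEGER'
```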

field_schema = bigquery.TableFieldSchema()
field_schema.name = field_name
field_schema.type = field_type
field_schema.mode = 'NULLABLE'
Contributor

Do we support other modes ?

Contributor Author

not in the string schema input

    create_disposition=self.create_disposition,
    write_disposition=self.write_disposition,
    client=self.test_client)
return pcoll | 'Write to BQ' >> ParDo(bigquery_write_fn)
Contributor

BigQuery instead of BQ here ?

Contributor Author

Done

client.tables.Get.return_value = bigquery.Table(
    tableReference=bigquery.TableReference(
        projectId='project_id', datasetId='dataset_id', tableId='table_id'))
client.tabledata.InsertAll.return_value = \
Contributor

I believe we usually use () instead of \ for line breaks.

Contributor Author

Done
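An illustration of the two line-continuation styles under discussion, with generic values rather than the PR's actual code; implicit continuation inside parentheses is the form the reviewer prefers:

```python
# Backslash continuation (discouraged):
result = \
    {'kind': 'bigquery#tableDataInsertAllResponse'}

# Implicit continuation inside parentheses (preferred):
result = (
    {'kind': 'bigquery#tableDataInsertAllResponse'})

assert result == {'kind': 'bigquery#tableDataInsertAllResponse'}
```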

self.assertTrue(client.tables.Get.called)
self.assertTrue(client.tables.Insert.called)

def test_dofn_client_process_flush_not_called(self):
Contributor

A better name might be "test_dofn_client_process_performs_batching".

Contributor Author

Done

fn.finish_bundle()
# InsertRows called in finish bundle
self.assertTrue(client.tabledata.InsertAll.called)

Contributor

Also, add a test that writes zero records.

Contributor Author

Done
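The buffer-and-flush behavior these tests exercise can be sketched in plain Python. `BufferedWriteFn`, its attribute names, and the list standing in for `tabledata.InsertAll` calls are all hypothetical simplifications of the PR's DoFn:

```python
class BufferedWriteFn:
  """Hypothetical, simplified stand-in for the PR's write DoFn: buffers
  rows and flushes a batch once the buffer reaches _max_batch_size, with
  a final flush in finish_bundle()."""

  def __init__(self, batch_size=None):
    self._max_batch_size = batch_size or 500  # 500 mirrors the PR's default
    self._rows_buffer = []
    self.flushed_batches = []  # stands in for tabledata.InsertAll calls

  def process(self, row):
    self._rows_buffer.append(row)
    if len(self._rows_buffer) >= self._max_batch_size:
      self._flush_batch()

  def finish_bundle(self):
    # Flush whatever is left; with zero buffered rows, no call is made.
    if self._rows_buffer:
      self._flush_batch()

  def _flush_batch(self):
    self.flushed_batches.append(list(self._rows_buffer))
    self._rows_buffer = []

fn = BufferedWriteFn(batch_size=2)
for i in range(3):
  fn.process({'a': i})
fn.finish_bundle()
# One full batch flushed during process(), the remainder in finish_bundle().
assert [len(b) for b in fn.flushed_batches] == [2, 1]
```

The zero-record case the reviewer asks for falls out of the `if self._rows_buffer` guard: calling `finish_bundle()` with nothing buffered performs no insert.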

schema_list = [s.strip(' ') for s in self.schema.split(',')]
for field_and_type in schema_list:
  field_name, field_type = field_and_type.split(':')
  field_schema = bigquery.TableFieldSchema()
Contributor

Please add tests for schema handling logic here.

Contributor Author

Done
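The schema-handling logic being tested can be sketched as a standalone helper. `parse_string_schema` is a hypothetical name, and plain tuples stand in for the `bigquery.TableFieldSchema` objects the real code builds:

```python
def parse_string_schema(schema):
  """Turns a 'name:TYPE,name:TYPE' string into (name, type, mode) tuples,
  mirroring the parsing loop in the snippet above."""
  fields = []
  for field_and_type in schema.split(','):
    field_name, field_type = field_and_type.strip().split(':')
    # String schemas only support the NULLABLE mode.
    fields.append((field_name, field_type, 'NULLABLE'))
  return fields

assert parse_string_schema('name:STRING, value:INTEGER') == [
    ('name', 'STRING', 'NULLABLE'), ('value', 'INTEGER', 'NULLABLE')]
```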

@sb2nov (Contributor, Author) left a comment

Thanks for the review.

request = bigquery.BigqueryTablesInsertRequest(
    projectId=project_id, datasetId=dataset_id, table=table)
response = self.client.tables.Insert(request)
logging.info("Created the table with id %s", table_id)
Contributor Author

Done

self.create_disposition = create_disposition
self.write_disposition = write_disposition
self._rows_buffer = []
self._max_batch_size = batch_size or 500
Contributor Author

This is based on what the Java SDK had.

self._rows_buffer = []
# Transform the table schema into a bigquery.TableSchema instance.
if isinstance(self.schema, basestring):
  # TODO(silviuc): Should add a regex-based validation of the format.
Contributor Author

Removed

if isinstance(self.schema, basestring):
  # TODO(silviuc): Should add a regex-based validation of the format.
  table_schema = bigquery.TableSchema()
  schema_list = [s.strip(' ') for s in self.schema.split(',')]
Contributor Author

I would have liked to deprecate this string-schema format and ask everyone to create a table schema object, but that is a larger change.

field_schema = bigquery.TableFieldSchema()
field_schema.name = field_name
field_schema.type = field_type
field_schema.mode = 'NULLABLE'
Contributor Author

not in the string schema input

fn.finish_bundle()
# InsertRows called in finish bundle
self.assertTrue(client.tabledata.InsertAll.called)

Contributor Author

Done


def __init__(self, table_id, dataset_id, project_id, batch_size, schema,
             create_disposition, write_disposition, client):
  self.table_id = table_id
Contributor Author

Done

if isinstance(self.schema, basestring):
  # TODO(silviuc): Should add a regex-based validation of the format.
  table_schema = bigquery.TableSchema()
  schema_list = [s.strip(' ') for s in self.schema.split(',')]
Contributor Author

Done

schema_list = [s.strip(' ') for s in self.schema.split(',')]
for field_and_type in schema_list:
  field_name, field_type = field_and_type.split(':')
  field_schema = bigquery.TableFieldSchema()
Contributor Author

Done

class WriteToBigQuery(PTransform):

  def __init__(self, table, dataset=None, project=None, schema=None,
               create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
Contributor Author

Done

@coveralls

Coverage Status

Coverage decreased (-0.004%) to 70.63% when pulling 85ffcb2 on sb2nov:BEAM-BQ-SINK into a054550 on apache:master.

@chamikaramj (Contributor)

LGTM. Thanks.

@asfgit asfgit closed this in fdfd775 Jun 6, 2017