[BEAM-6769][BEAM-7327] Add IT test for writing and reading with BigQuery IO #8621
Conversation
Run Python PostCommit
Run Python PostCommit
Run Python PostCommit
Force-pushed from 5e22885 to ffa3399 (compare)
Run Python PostCommit
Run Python PostCommit
Force-pushed from a5edbee to 95aa720 (compare)
Run Python PostCommit
Run Python PostCommit
Force-pushed from df83d85 to 99f3cbc (compare)
Run Python PostCommit
Passing postcommits! Now you can just fix the lint issue, and you should be good to go :)
@@ -654,12 +654,17 @@ def apply_WriteToBigQuery(self, transform, pcoll, options):
      return self.apply_PTransform(transform, pcoll, options)
    else:
      from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json
      if transform.schema == beam.io.gcp.bigquery.SCHEMA_AUTODETECT \
          or transform.schema is None:
        schema = transform.schema
Maybe here we want schema to be None always? That way we don't have to special-case further down?
Actually, I've added a comment about this here: https://issues.apache.org/jira/browse/BEAM-7382.
If we have autodetection while using the BigQuerySink, we should error out, as it is not supported.
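To illustrate the point, a minimal, hypothetical sketch of such a check; the constant value, function name, and error message are assumptions, not the PR's actual code:

```python
# Placeholder for beam.io.gcp.bigquery.SCHEMA_AUTODETECT; the value is illustrative.
SCHEMA_AUTODETECT = 'SCHEMA_AUTODETECT'


def validate_bigquery_sink_schema(schema):
  """Reject schema auto-detection when writing through the native BigQuerySink."""
  if schema == SCHEMA_AUTODETECT:
    raise ValueError(
        'Schema auto-detection is not supported when writing via the native '
        'BigQuerySink; please provide an explicit schema.')
  return schema
```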
Force-pushed from a8e0895 to f4725df (compare)
Run Python PostCommit
Run Python PostCommit
Run Python PostCommit
@tvalentyn, @pabloem: R (the postcommits are failing, but I don't think it is related to my PR)
@@ -925,6 +925,12 @@ def __iter__(self):
      if self.schema is None:
        self.schema = schema
      for row in rows:
        # return base64 encoded bytes as byte type on python 3
        # to match behavior DataflowRunner
Let's say "which matches the behavior of Beam Java SDK". Note that this does not depend on Py3.
Thanks, Juta
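For context, a minimal sketch of the read-side conversion being discussed, assuming the BYTES field arrives from the BigQuery REST API as a base64 string (the function name is illustrative):

```python
import base64


def bytes_from_bq_field(value):
  """Convert a base64-encoded BYTES field, as returned by the BigQuery REST
  API, into raw bytes, which matches the behavior of the Beam Java SDK."""
  return base64.b64decode(value)


print(bytes_from_bq_field('YWJj'))  # b'abc'
```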
@@ -998,6 +1004,13 @@ def encode(self, table_row):
    # This code will catch this error to emit an error that explains
    # to the programmer that they have used NAN/INF values.
    try:
      # on python 3 base64 bytes are decoded to strings before being send to bq
      if sys.version[0] == '3':
Let's replace == '3' with > 2.
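A small sketch of the suggested form of the check; the surrounding logic is omitted, and sys.version_info is the integer-tuple alternative to slicing the sys.version string:

```python
import sys

# Compare the major version as an integer instead of slicing sys.version.
IS_PY3 = sys.version_info[0] > 2

if IS_PY3:
  # Python 3-only handling (e.g. decoding base64-encoded bytes) would go here.
  pass
```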
@@ -998,6 +1004,13 @@ def encode(self, table_row):
    # This code will catch this error to emit an error that explains
    # to the programmer that they have used NAN/INF values.
    try:
      # on python 3 base64 bytes are decoded to strings before being send to bq
      if sys.version[0] == '3':
        if type(table_row) == str:
When do we go through this branch? Why do we only go through json.loads() here on Python 3, but not Python 2?
This codepath will decode bytes when they are written to BigQuery (it expects bytes to be base64-encoded but allows them to have type bytes or str). This is to allow the same data that is read from BQ to also be written to BQ.
In https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads_test.py#L54 data is written to BigQuery as a string, which is why we first do a json.loads() before handling the bytes decoding (and this is only necessary on Python 3).
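As an illustration of the behavior described above, a hedged sketch (the helper name is hypothetical and the actual coder in the PR may differ):

```python
import json
import sys


def normalize_row_for_bq(table_row):
  """Accept either a JSON string or a dict, and ensure base64 payloads are
  str (not bytes) before JSON serialization; this only matters on Python 3."""
  if isinstance(table_row, str):
    table_row = json.loads(table_row)
  if sys.version_info[0] > 2:
    table_row = {
        key: value.decode('ascii') if isinstance(value, bytes) else value
        for key, value in table_row.items()}
  return table_row
```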
I still don't understand why/when we may receive a string here and not a dictionary (as per the docstring) during the encode() call. The test you referenced seems to call json.loads() before ingesting data into BQ:
_ELEMENTS = list([json.loads(elm[1]) for elm in _DESTINATION_ELEMENT_PAIRS])
In https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads_test.py#L244 no json.loads() is done. Should I edit the tests to pass dicts instead of strings?
I think that the test needs to be fixed to pass dicts, or the BQ IO file loads method needs to use a different codec, one that does not expect dicts.
I suggest we change the test to pass dicts. @pabloem, @chamikaramj - do you have a different opinion on this?
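A short sketch of that suggestion, with hypothetical test data shaped like the (destination, JSON-row) pairs under discussion:

```python
import json

# Hypothetical elements in the (destination, JSON string) shape discussed above.
_DESTINATION_ELEMENT_PAIRS = [
    ('project:dataset.table', '{"name": "beam", "language": "py"}'),
    ('project:dataset.table', '{"name": "flink", "language": "java"}'),
]

# Passing dicts instead of JSON strings keeps the coder's documented contract:
# encode() receives a dict, never a raw string.
_ELEMENTS = [json.loads(row) for _, row in _DESTINATION_ELEMENT_PAIRS]
```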
@@ -998,6 +1004,13 @@ def encode(self, table_row):
    # This code will catch this error to emit an error that explains
    # to the programmer that they have used NAN/INF values.
    try:
      # on python 3 base64 bytes are decoded to strings before being send to bq
nit: let's change "base64 bytes" to "base64-encoded bytes".
Run Python PreCommit
Run Python PostCommit
@tvalentyn PTAL
Run Python PostCommit
Thanks a lot Juta for making this change and also improving BQ test coverage. The change looks good to me except for one place I still don't understand.
@@ -32,6 +32,8 @@ task postCommitIT(dependsOn: 'installGcpTest') {
        "apache_beam.io.gcp.pubsub_integration_test:PubSubIntegrationTest",
        "apache_beam.io.gcp.big_query_query_to_table_it_test:BigQueryQueryToTableIT",
        "apache_beam.io.gcp.bigquery_io_read_it_test",
        "apache_beam.io.gcp.bigquery_read_it_test",
We should probably rename apache_beam.io.gcp.bigquery_io_read_it_test in the future or combine it with the scenarios added here to avoid naming collision. We can do it in a later change.
        str(int(time.time())),
        random.randint(0, 10000))
    self.bigquery_client.get_or_create_dataset(self.project, self.dataset_id)
    self.output_table = "%s.{}" % (self.dataset_id)
I find this field confusing since at this point the output table is not created yet. We could return a fully-qualified table name from the create_table() method instead of storing the pattern.
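A rough sketch of that alternative, with illustrative names (the real test class and its BigQuery wrapper calls may differ):

```python
import random
import time


class ReadTestsSketch(object):
  """Illustrative only: build the fully-qualified table name inside
  create_table() instead of keeping a '%s.{}' pattern on the instance."""

  def __init__(self, project, bigquery_client):
    self.project = project
    self.bigquery_client = bigquery_client
    self.dataset_id = 'python_read_table_%d%d' % (
        int(time.time()), random.randint(0, 10000))
    self.bigquery_client.get_or_create_dataset(self.project, self.dataset_id)

  def create_table(self, table_name):
    # ... create the table under self.dataset_id via self.bigquery_client ...
    # Return the fully-qualified name so callers never see a half-built pattern.
    return '%s.%s' % (self.dataset_id, table_name)
```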
Run Python PostCommit
Run Python PostCommit
Run Python PostCommit
Thanks, Juta, this looks good to me. Very happy to see an increase in test coverage.
@pabloem, can you please take a look and help with the merge?
LGTM. Just left one question that can be answered after merge.
  def decode(self, encoded_table_row):
    return self.coder.decode(encoded_table_row)
We can merge this now. I am just curious: Why did you add this custom coder?
Please take a look at the #8621 (comment) thread and let us know if you think we should do something else here.
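For readers following along, a self-contained sketch of a delegating coder in the shape of the decode() snippet above; the class names and the wrapped JSON coder are purely illustrative:

```python
import json


class _JsonRowCoder(object):
  """Illustrative inner coder: dict <-> UTF-8 JSON bytes."""

  def encode(self, table_row):
    return json.dumps(table_row).encode('utf-8')

  def decode(self, encoded_table_row):
    return json.loads(encoded_table_row.decode('utf-8'))


class _DelegatingTableRowCoder(object):
  """Wraps another coder and forwards encode()/decode(), mirroring the
  decode() method shown in the diff above."""

  def __init__(self, coder=None):
    self.coder = coder or _JsonRowCoder()

  def encode(self, table_row):
    return self.coder.encode(table_row)

  def decode(self, encoded_table_row):
    return self.coder.decode(encoded_table_row)
```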
This PR adds IT suites for Python 3.6. It is part of a series of PRs with the goal of making Apache Beam Py3 compatible. The proposal with the outlined approach has been documented here: https://s.apache.org/beam-python-3.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.