forked from apache/beam
Fix BigQueryIO.Read to work the same in Direct and Dataflow runners
This is a partial revert of commits f5e3b8e and 18c82ad.

When running a batch Dataflow job on the Cloud Dataflow service, the data are produced by running a BigQuery export job and then reading all the exported files in parallel. When run in the DirectPipelineRunner, BigQuery's JSON API is used directly. These two paths return data in different formats. To compensate, we use BigQueryTableRowIterator to normalize the behavior in the DirectPipelineRunner to the behavior seen when running on the service. (We cannot change this decision without a major breaking change.)

This patch fixes some discrepancies in the way that BigQueryTableRowIterator is implemented. Specifically:

*) In commit 18c82ad (in response to issue apache#20) we updated the format of timestamps so that they are printed as strings. However, we did not correctly match the behavior of BigQuery export. Here is a sample set of times from the export job vs. the JSON API:

      2016-01-06 06:38:00 UTC          1.45206228E9
      2016-01-06 06:38:11 UTC          1.452062291E9
      2016-01-06 06:38:11.1 UTC        1.4520622911E9
      2016-01-06 06:38:11.12 UTC       1.45206229112E9
      2016-01-06 06:38:11.123 UTC      1.452062291123E9
    * 2016-01-06 06:38:11.1234 UTC     1.4520622911234E9
      2016-01-06 06:38:11.12345 UTC    1.45206229112345E9
      2016-01-06 06:38:11.123456 UTC   1.452062291123456E9

Before this patch, only the test marked with * would have passed.

*) In commit f5e3b8e we updated the TableRow iterator to preserve the usual TableRow field `f`, which corresponds to getF() and returns a list of fields in schema order. This was my mistaken attempt to better support users who have prior experience with BigQuery's API and expect to use getF()/getV(). However, there were two issues:

  1. The change did not affect the behavior in the DataflowPipelineRunner.
  2. The change was actually a breaking, backwards-incompatible change, because common downstream DoFns may iterate over the keys of the TableRow, and it added the field "f".
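The export-format timestamps in the table above can be reproduced from the JSON API's fractional epoch seconds. The following is a minimal sketch, not the SDK's actual code; `formatTimestamp` is a hypothetical helper name, and parsing with BigDecimal (rather than double) is this sketch's choice to avoid losing microsecond precision at 16 significant digits:

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampFormat {
    // Hypothetical helper: turn the JSON API's fractional epoch seconds
    // (e.g. "1.4520622911234E9") into the string a BigQuery export job prints.
    static String formatTimestamp(String epochSeconds) {
        // BigDecimal keeps all digits; a double would round microseconds
        // at the 16-significant-digit boundary.
        BigDecimal seconds = new BigDecimal(epochSeconds);
        long whole = seconds.longValue();
        // Fractional part scaled to microseconds (export prints up to 6 digits).
        long micros = seconds.subtract(BigDecimal.valueOf(whole))
                             .movePointRight(6).longValueExact();
        String base = Instant.ofEpochSecond(whole)
                .atZone(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        if (micros == 0) {
            return base + " UTC";
        }
        // Zero-pad to 6 digits, then strip trailing zeros, matching export output.
        String frac = String.format("%06d", micros).replaceAll("0+$", "");
        return base + "." + frac + " UTC";
    }

    public static void main(String[] args) {
        System.out.println(formatTimestamp("1.45206228E9"));       // 2016-01-06 06:38:00 UTC
        System.out.println(formatTimestamp("1.4520622911234E9"));  // 2016-01-06 06:38:11.1234 UTC
    }
}
```

Note how the fraction is printed with the minimum number of digits and omitted entirely for whole seconds, which is what the old code failed to match for most rows in the table.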
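To see why adding `f` is backwards-incompatible, consider a DoFn that iterates over a row's keys. The sketch below models TableRow as a plain map (a stand-in for com.google.api.services.bigquery.model.TableRow, which implements Map); `joinValues` and the field names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldKeyBreakage {
    // A downstream DoFn commonly does something like this: walk every
    // entry in the row and emit its value.
    static String joinValues(Map<String, Object> row) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (sb.length() > 0) sb.append(",");
            sb.append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Row as produced before f5e3b8e: only the named columns are keys.
        Map<String, Object> before = new LinkedHashMap<>();
        before.put("name", "alice");
        before.put("age", 30);
        System.out.println(joinValues(before));  // alice,30

        // After f5e3b8e the iterator also populated the "f" field, so the
        // same DoFn suddenly sees an extra key and emits extra output.
        Map<String, Object> after = new LinkedHashMap<>(before);
        after.put("f", List.of("alice", 30));
        System.out.println(joinValues(after));
    }
}
```

Any such DoFn silently changes its output when the extra key appears, which is why the `f` change is reverted here rather than propagated.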
So we should not propagate the change to the DataflowPipelineRunner; instead we should revert the change to BigQueryTableRowIterator. (Note that this is also a slightly backwards-incompatible change, but it reverts to the old behavior, and users are more likely to be depending on the DataflowPipelineRunner than on the DirectPipelineRunner.)

This patch fixes both issues and adds tests. The result is still ugly for now. The long-term fix is to support a parser that lets users skip TableRow altogether and go straight to POJOs of their choosing (see apache#41). That would also eliminate our performance and typing issues with using TableRow as an inner type in pipelines (see e.g. http://stackoverflow.com/questions/33622227/dataflow-mixing-integer-long-types).

----Release Notes----
[]
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=111746236
Showing 3 changed files with 130 additions and 98 deletions.