
BigQuery: Use Unmarshalling API to handle parsing #2533

Merged: 31 commits, Dec 14, 2020

Conversation

@armanbilge (Contributor) commented Nov 24, 2020

References #2353

My attempt to support arbitrary JSON parsers by using the Akka HTTP unmarshaling API for GoogleBigQuerySource, as discussed in #2353 (cc @ennru, @mrooding).

To implement this, the major changes I made are as follows (a rough sketch of the resulting API shape is included after the list):

  1. Remove the parserFn: JsObject => Try[T] argument from GoogleBigQuerySource, add a type parameter J for the intermediate JSON representation (e.g. JsValue in Spray), and require three implicit unmarshallers:
    • jsonUnmarshaller: FromEntityUnmarshaller[J]
    • responseUnmarshaller: Unmarshaller[J, BigQueryJsonProtocol.Response]
    • unmarshaller: Unmarshaller[J, T]
  2. Expose the previously internal API impl.parser.Parser.ParserJsonProtocol as the public BigQueryJsonProtocol. This is so the user can provide an unmarshaller for BigQueryJsonProtocol.Response which is used internally by the GoogleBigQuerySource implementation.
  3. Provide a default implementation supporting Spray: users can import implicits from scaladsl.SprayJsonSupport._. Similarly, a utility class javadsl.SprayJsonSupport provides Java users with the necessary unmarshallers.
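For reference, here is a rough sketch of how the Scala runQuery entry point is shaped under this scheme; the parameter names and exact placement of the type parameters are illustrative, not the final source:

import akka.NotUsed
import akka.http.scaladsl.unmarshalling.{FromEntityUnmarshaller, Unmarshaller}
import akka.stream.scaladsl.Source

// Sketch only: J is the intermediate JSON representation (e.g. spray-json's JsValue),
// T is the user's row type. The three implicit unmarshallers correspond to the list above.
def runQuery[J, T](query: String, callbacks: BigQueryCallbacks, config: BigQueryConfig)(
    implicit jsonUnmarshaller: FromEntityUnmarshaller[J],
    responseUnmarshaller: Unmarshaller[J, BigQueryJsonProtocol.Response],
    unmarshaller: Unmarshaller[J, T]
): Source[T, NotUsed] = ???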

I've updated the tests/examples to demonstrate my new API. For the Scala API, it reduces boilerplate code by removing the need to manually implement the parserFn and instead implicitly deriving the necessary unmarshaller from the in-scope Spray JsonFormat.

Remaining work

  • The use of the unmarshalling API introduced several more Futures and mapAsync stages to the graph in impl.parser.Parser. The overhead can probably be reduced by judicious use of ExecutionContext.parasitic and FastFuture (a minimal sketch of the idea follows this list).
  • Update documentation.
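A minimal sketch of the parasitic-executor idea, using plain Scala Futures (not the PR's actual code):

import scala.concurrent.{ExecutionContext, Future}

// For cheap, non-blocking transformations, mapping on ExecutionContext.parasitic runs
// the function on the thread that completed the future instead of scheduling it on the
// dispatcher, removing one async hop per element.
def cheapMap[A, B](fut: Future[A])(f: A => B): Future[B] =
  fut.map(f)(ExecutionContext.parasitic)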

Thanks for your consideration and looking forward to your feedback.

@seglo (Member) left a comment

I like the idea. I'm afraid the use of type classes might make the API a little more cumbersome for new users, especially since multiple unmarshallers are required to support the intermediate need to deserialize results to access pageToken and jobIds (as you mention here #2353 (comment)), but if the need for this is explained clearly in the docs then I don't see a problem.

We're planning an Alpakka 3.0 soon, so breaking API changes may be alright. This is also a new connector with limited users. @ennru WDYT of rewriting this API? Do we need to support both for a time?

@armanbilge (Contributor Author)

Thanks for your review and support! Alpakka BigQuery is currently marked as "API may change" so I was under the impression that this would be okay. However, it is straightforward to continue to support the current API that uses the parserFn: JsObject => Try[T] instead of unmarshallers.

I'm afraid the use of type classes might make the API a little more cumbersome for new users

If new users are encouraged to use the default Spray API, then an import from scaladsl.SprayJsonSupport._ provides the necessary type classes. In fact, it reduces the boilerplate of manually writing a parserFn. Knowledge of the "intermediate" unmarshallers is only necessary for users who want to bring their own JSON parser.

I agree that Java users will find this new API much more cumbersome.

@seglo (Member) commented Nov 26, 2020

Alpakka BigQuery is currently marked as "API may change" so I was under the impression that this would be okay.

That's true. There should be no problem, but I'm leaning towards supporting both.

However, it is straightforward to continue to support the current API that uses the parserFn: JsObject => Try[T] instead of unmarshallers.

I agree that Java users will find this new API much more cumbersome.

I think it would be a good idea to support both APIs. We can keep the docs for this connector the same, but add a new section that discusses using the new API for custom serdes.

@armanbilge (Contributor Author)

Makes sense, I restored the old API.

However, I still think the docs should be updated with the new API. The new API is not just for custom serdes, it reduces boilerplate code (at least for Scala users). It is also more in line with the Akka HTTP API which is entirely based on marshaling.

Consider the following example from the docs.

Old API:

case class User(uid: String, name: String)
implicit val userFormatter = jsonFormat2(User)

def parserFn(result: JsObject): Try[User] = Try(result.convertTo[User])
val userStream =
  GoogleBigQuerySource
    .runQuery[User]("SELECT uid, name FROM bigQueryDatasetName.myTable", parserFn, BigQueryCallbacks.ignore, config)

New API:

import akka.stream.alpakka.googlecloud.bigquery.scaladsl.SprayJsonSupport._
case class User(uid: String, name: String)
implicit val userFormatter = jsonFormat2(User)

val userStream =
  GoogleBigQuerySource[User]
    .runQuery("SELECT uid, name FROM bigQueryDatasetName.myTable", BigQueryCallbacks.ignore, config)

Thanks again for your feedback!

@seglo (Member) commented Nov 27, 2020

However, I still think the docs should be updated with the new API. The new API is not just for custom serdes, it reduces boilerplate code (at least for Scala users). It is also more in line with the Akka HTTP API which is entirely based on marshaling.

That's true. Ok, let's use it as the default example for Scala, but use the old API for Java.

I think we should still have a custom serdes section where you can show using something custom, like your Circe use case.

@armanbilge (Contributor Author)

Thanks, I'm working on updating the docs now. However, while doing so I realized that the current docs are wrong/misleading.

Consider this example from the current documentation.

case class User(uid: String, name: String)
implicit val userFormatter = jsonFormat2(User)

def parserFn(result: JsObject): Try[User] = Try(result.convertTo[User])
val userStream =
  GoogleBigQuerySource
    .runQuery[User]("SELECT uid, name FROM bigQueryDatasetName.myTable", parserFn, BigQueryCallbacks.ignore, config)

The documentation implies that by simply providing a parser of type JsObject => User you can get a stream of Users from the GoogleBigQuerySource. However, the actual implementation passes a JsObject of the form

{
  "rows": [
    { "uid": "someUID", "name": "someName"},
    ...
  ],
  ...
}

I.e., the parserFn must retrieve the rows key and parse an array of Users. So the parserFn in the example would fail.

Actually, I prefer the API described in the docs: in most cases, I just want a Source[User] without the hassles of retrieving rows and doing mapConcat. Some power users may want access to a "rawer" result (the API currently implemented). Ideally, both should be available to users. My (breaking) proposal would be for the implementation of GoogleBigQuerySource.runQuery to be changed to match the docs, while power users can use GoogleBigQuerySource.raw to access the full JSON response. Of course, non-breaking fixes would be to add a new method or simply leave as-is and correct the docs.

Not sure whether fixing any of this is in-scope for my PR, especially since it warrants a bigger discussion. I was planning to demonstrate the custom serdes by reusing the User example from above, but it's difficult to imitate an example that is broken. Thanks for your attention to this.

@seglo (Member) commented Nov 30, 2020

the parserFn must retrieve the rows key and parse an array of Users. So the parserFn in the example would fail.

Actually, I prefer the API described in the docs: in most cases, I just want a Source[User] without the hassles of retrieving rows and doing mapConcat. Some power users may want access to a "rawer" result (the API currently implemented). Ideally, both should be available to users.

I agree that parsing rows probably wasn't the intent of this higher level API (hence the docs). I think it's fine to make this breaking change, but I'll leave it up to you to determine if it's something you want to tackle in this PR.

If it's not too much trouble to update the existing docs to match the existing behaviour so that you may add your new section then that's probably the best path forward for now. You can make a follow up PR if you wish.

This problem is a consequence of having docs examples that don't actually run. Ideally the snippet would run as part of the test suite.

@probot-autolabeler bot added the dependency-change (for PRs changing the version of a dependency) and documentation labels on Dec 1, 2020
@armanbilge (Contributor Author)

Thanks as always for your fast responses!

I think it's fine to make this breaking change, but I'll leave it up to you to determine if it's something you want to tackle in this PR.

I ended up making this change as it was closely intertwined with my parsing implementation.

I've now updated the docs with an expanded section on parsing and an example using Circe. For this purpose I added Circe as a test dependency; hope this is okay.

This problem is a consequence of having docs examples that don't actually run. Ideally the snippet would run as part of the test suite.

I agree; it should be possible to set up with the Hoverfly API simulator like the end-to-end spec?

Hopefully, no further major changes/additions to my PR and only minor fixes from here. Thanks again for all of your helpful feedback!

@armanbilge (Contributor Author)

I take it back, there is one more thing that I think should be fixed: the fact that the user-provided parser has to be "bulletproof" or else risk creating a Source that is stuck in an infinite loop makes for a poor API experience that is difficult to debug. Even with parsers 100% automatically derived from case classes there can still be simple user errors if the case class doesn't perfectly match the query.

@tg44, I see that you authored the commit that first mentions this. Do you mind clarifying why this is necessary: is it in fact true that the Google BigQuery API can return a blank response with no error code/message? Is this documented anywhere? Thanks in advance.

@tg44 (Contributor) commented Dec 1, 2020

It was more than 2.5 years ago when I wrote this, so my memories and the underlying infra could have changed since then.

As I remember, BQ returns data in pages. And it is doing some map-reduce like thing in the background. So you start a query, BQ starts the inner data gathering, you ask for a page, and you only get rows back for that page if a full page of rows has been gathered already. This means that if your data is big enough, your query is slow enough, your data is not evenly distributed, and you consume the returned data fast enough, you can get back "empty" responses, which translate to "data is not ready, pls come back later". In most cases, you won't see this with small databases. (I think a suitable test case would be millions of rows where only a really, really small percentage matches the "where" clause. But probably we should reread the docs to be sure.)

I hoped we had tests for the observed behavior, but I only found these tests and, on the usage side, a fully described QueryResponse hierarchy. The catch in the "original" repository was that we hadn't done any serious deserialization, because our use-case was to transform multiple types of databases into a unified API. Also, we needed the metadata from the requests for whatever reasons that I totally forget :D (BTW as far as I know the linked "original" repo has been in production for the last 2 years or so, so at least it works!) As I remember I refactored some code when it moved here, but the two implementations should be mostly identical.

If you have any other questions, I will try to remember harder, and I hope my WOT helped at least a bit.

@armanbilge (Contributor Author)

Thanks so much for the fast response @tg44! This was really, really helpful; I appreciate you thinking back a couple of years. This makes a lot of sense. The main clarification I'm looking for is what exactly "empty" means: literally an empty response with no body, or e.g. a JSON object with no "rows" key? Thanks!

@tg44 (Contributor) commented Dec 1, 2020

Based on the code and my memories, I think the rows are not present in this case. This is somewhat in line with the documentation, where we have a note, namely: "Present only when the query completes successfully." The "completes successfully" part is quite mystical tbh...

@armanbilge (Contributor Author)

Ok, thanks, that was my understanding too. So then it is totally possible to check/guard for this without relying on the user-provided parser to be "bulletproof".
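A minimal sketch of the kind of guard meant here, assuming spray-json (names hypothetical, not the PR's code):

import spray.json._
import scala.util.Try

// Only hand the response to the user-supplied parser when it actually carries a "rows"
// field; a missing "rows" key is treated as "data not ready yet" so the stream can poll
// again instead of relying on the parser to fail gracefully.
def parseIfReady[T](response: JsObject, parse: JsObject => Try[T]): Option[Try[T]] =
  if (response.fields.contains("rows")) Some(parse(response)) else None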

@tg44 (Contributor) commented Dec 5, 2020

I just checked the PR. I think it's mostly good; only one thing bothers me, which was not documented before, and now I understand why it was not easy for others to use :) "What is the json/string that we need to provide a parser for?"

From here on I could be totally wrong, and if so, sorry for that; I still haven't tried the connector's current state against a real endpoint. Also, I have old memories, and I just browsed the code and the BQ documentation to validate my beliefs.

I think the old implementation wanted a parser for the whole ResponseBody. I think your implementation does this too (based on the Java tests), but the Scala example is not really doing this (it uses the response rows, which, as I understand the code, is a rows: Option[Seq[T]] from the response body).

BUT my main concern is:

The REST-based representation of this data leverages a series of JSON f,v objects for indicating fields and values.

Which means none of the provided auto-derived parsers will actually parse any of the data from the real service (or at least I didn't find code that flattens the schema+rows to a case class). I found an example response which shows exactly why we let the user parse the data without implementing a handy parser :D That response is ugly as hell... Also it uses strings everywhere... So to correctly parse this shitstorm, you need to properly parse the schema, then parse the rows one-by-one and, based on the schema, convert the string values to the correct primitives.

At this point you have two options: generate a new JSON where the auto-derived decoder will work (a normal one with actual key-values :D ), or somehow force the user to give you an instance of a new typeclass which can detect the fields, check the types, and wrap them into the requested type... (We chose the second when we wanted a JsObject => Option[T] from the user as a "parser function".)

I think we should add an e2e-like test with real requests and responses to mock the whole flow. Probably the documentation was unusable before, but my spider-sense screams that we will break hard-to-use but working code with this modification.

EDIT: I found a post in my open tabs about nested results, just to push this even further :D

@armanbilge (Contributor Author) commented Dec 5, 2020

Thanks for your review! Appreciate your feedback on this.

I think the old implementation wanted a parser for the whole ResponseBody.

This is true.

I think your implementation does this too

Yes and no. See my comment above and @seglo's response. Basically, what we decided was that GoogleBigQuerySource.runQuery will be a simplified API that only requires a parser for a single row. Users that need access to the whole response body can use GoogleBigQuerySource.raw. This is indeed a breaking change.

or at least I didn't found code that flattens the schema+rows to a case-class

Is this the flattening code you are looking for?

I think we should add an e2e like test with real requests and responses to mock the whole flow.

There is already an end-to-end test present that appears to be based on a real request and response flow.

Hope this all makes sense!

@tg44 (Contributor) commented Dec 5, 2020

Is this the flattening code you are looking for?

Nope:

If you check the example response, the rows have a structure of:

"rows": [
  {
   "f": [
    {
     "v": "3290275"
    },
    {
     "v": "5l"
    },
    {
     "v": "5l"
    },
    {
     "v": "1322575099"
    },
    {
     "v": "1.322575099E9"
    },
    ...
   ]
  },
  ...
]

I tried to find the code which transforms this f/v thing to a json which can be parsed by the auto derived decoder. Also it is possible that the BQ documentation is somewhat outdated and you can get back the results as a

"rows": [
  {"id": 3290275, "by": "5l", "author": "5l", "time": 1322575099, "time_ts": 1.322575099E9, ...},
  ...
]

format with a flag/param, but I haven't found that either.

@armanbilge (Contributor Author)

Aha, thanks for clarifying, totally my bad ... I didn't realize the rows were encoded like this. That is ugly! Good catch, thank you.

As it stands, this doesn't break my PR, but it does mean we can't take advantage of Spray's automatic JSON-parser-from-case-class derivation. Users would have to implement their own parsers. You are right that the examples would not work.

However, I believe this is very fixable. Currently, we use jsonFormatN(SomeCaseClass) to automatically derive the parser. Instead, we can provide our own methods e.g. bigQueryJsonFormatN(SomeCaseClass) that derive a JSON parser specifically for the BigQuery format.
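As an illustration of what such a BigQuery-aware format might do internally (a sketch only, assuming spray-json; not the actual implementation):

import spray.json._

// Hypothetical helper: zip the schema's field names with the "v" cells of a single
// f/v-encoded row to produce an ordinary JSON object that a derived JsonFormat can read.
// Converting the string values to the proper primitive types is omitted here.
def flattenRow(fieldNames: Seq[String], row: JsObject): JsObject = {
  val cells = row.fields("f") match {
    case JsArray(elements) => elements.map(_.asJsObject.fields("v"))
    case other             => deserializationError(s"expected an 'f' array, got $other")
  }
  JsObject(fieldNames.zip(cells).toMap)
}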

Thanks again for catching this! Really appreciate your review.

@armanbilge (Contributor Author)

Just pushed a fix for this: bigQueryJsonFormatN(SomeCaseClass), a drop-in replacement for jsonFormatN(SomeCaseClass) that handles BigQuery's representation of table rows.

I've added an additional test to the e2e spec that demonstrates correct unmarshalling of a real API response into a case class with an automatically derived parser.

All that remains is to update the docs to explain this new API. Thanks again for your help with this @tg44!

String uid = ((JsString) f.elements().apply(0).asJsObject().fields().apply("v")).value();
String name = ((JsString) f.elements().apply(1).asJsObject().fields().apply("v")).value();
return new User(uid, name);
});
@armanbilge (Contributor Author)

I think we should consider providing a different default JSON implementation for the Java API, e.g. Jackson. Spray does not have a Java API and is painful to use as this example illustrates.

@seglo (Member)

I would recommend doing this in a follow up PR, if you're interested.

@tg44 (Contributor) left a comment

I think it's heroic work! Some minor concerns below.

(And I totally agree that this response that Google gives back is ugly...)

When calling the `GoogleBigQuerySource.raw` method, your parser is passed the [entire response body](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults#response-body).
When calling the `GoogleBigQuerySource.runQuery` method, your parser is passed one-by-one each entry in the `rows` key of the response.
Note that these entries are represented as "a series of JSON f,v objects for indicating fields and values" ([reference documentation](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults#body.GetQueryResultsResponse.FIELDS.rows)).
@scala[To correctly generate a Spray `JsonFormat[T]` for your case class `T`, you must `import akka.stream.alpakka.googlecloud.bigquery.scaladsl.BigQueryJsonProtocol._` and use the provided `bigQueryJsonFormatN(T)` methods as a drop-in replacement for the usual `jsonFormatN(T)`. Furthermore, the order and presence of parameters in `T` is required to strictly match the query.]
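For a two-field case class, the documented drop-in replacement would look roughly like this (assuming the two-argument variant is spelled bigQueryJsonFormat2):

import akka.stream.alpakka.googlecloud.bigquery.scaladsl.BigQueryJsonProtocol._

case class User(uid: String, name: String)
implicit val userFormat = bigQueryJsonFormat2(User)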
@tg44 (Contributor)

I'm not sure that BQ can't mess up the param order. I hope they didn't go that far, but it would be good to check...

@armanbilge (Contributor Author)

You raise an important point; I'll look into it.

@armanbilge (Contributor Author) commented Dec 6, 2020

This Stack Overflow answer suggests that it is "pretty safe" to assume this.

@tg44 (Contributor)

I hope so. My only concern is that there is a "schema" field that returns the explicit return order. I hope that they don't mess up things for sport :D

@seglo (Member) commented Dec 7, 2020

Thanks a lot for jumping in @tg44. Do you think there's anything else that needs to be addressed?

@tg44 (Contributor) commented Dec 8, 2020

@seglo I didn't find anything else. I think it is great work overall, and I'm glad that others could build something more useful/user-friendly on top of our "minimal" code :D You can always mention me under BQ PRs or issues, I'm happy to help if I can!

@armanbilge (Contributor Author)

Thanks, I appreciate the kind words! FYI I am planning to follow up with another PR implementing Sinks for streaming/loading data into BigQuery, if you have any experience with that.

@tg44 (Contributor) commented Dec 9, 2020

I think we never needed to load data into BQ. When we needed that feature (for tests) we simply inserted it with SQL, like insert into.

@seglo (Member) commented Dec 14, 2020

You can always mention me under BQ PRs or issues, I'm happy to help if I can!

Great. Thanks for the offer @tg44 :)

@seglo (Member) left a comment

LGTM. Thanks for all the work @armanbilge and thank you again @tg44 for your reviews and guidance.

@seglo seglo changed the title from "BigQuery: Use Unmarshaling API to handle parsing" to "BigQuery: Use Unmarshalling API to handle parsing" on Dec 14, 2020
@seglo seglo merged commit 99b10b5 into akka:master Dec 14, 2020
@armanbilge (Contributor Author)

Awesome! Thanks for merging and to everyone for all of your reviews and feedback.

@armanbilge armanbilge mentioned this pull request Dec 20, 2020
@seglo seglo added this to the 3.0.0-M1 milestone Feb 10, 2021