[BEAM-3773][SQL] Add EnumerableConverter for JDBC support#5173

Merged
kennknowles merged 1 commit into apache:master from apilloud:enumerable
May 3, 2018

Conversation

@apilloud
Member

@apilloud apilloud commented Apr 18, 2018

This adds a converter from BeamRelNode to EnumerableRel. The Calcite JDBC engine can execute any plan whose root node is of the EnumerableRel type.


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@apilloud
Member Author

R: @kennknowles
cc: @akedin

Member

@kennknowles kennknowles left a comment

Can this be tested in isolation at all? Or can it be tested via other test suites?

}

public static Enumerable<Object> toEnumerable(BeamRelNode node) {
PipelineOptions options = PipelineOptionsFactory.create();
Member

Just checking - since Collector only works on the DirectRunner it seems fine to hardcode it here. But are the options specified elsewhere when using SQL so here it would just be validation that the configuration is supported?

Member Author

Pipelines that aren't supported should get a NullPointerException when the Collector tries to process an element. I considered adding a check but decided it could wait because this is the only public path in. Is the user able to change the runner without the ability to set the PipelineOptions?

Member

I'd rather get the options plumbed deliberately here. Assuming decent testing, it won't get left forever, but it does leave the code in a sort of weird stage. Does it add a ton of scope to plumb them?

Member Author

Plumbing the PipelineOptions into BeamEnumerableConverter is not easy, as they would have to take a trip through the Calcite JDBC interface on the way here (possibly on the JDBC connection string). I can add them as an argument to toEnumerable, but that would just move this line up to implement.
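One hypothetical shape for the connection-string route is sketched below; the URL format, key names, and `ConnectionStringOptions` class are all invented for illustration (a real version would presumably hand the parsed pairs to `PipelineOptionsFactory`):

```java
import java.util.*;

// Hypothetical sketch of passing pipeline options on the JDBC connection
// string, e.g. "jdbc:beam:runner=DirectRunner;tempLocation=/tmp". The URL
// shape and key names are invented; values containing ':' or ';' are not
// handled here.
final class ConnectionStringOptions {
  static Map<String, String> parse(String url) {
    Map<String, String> options = new LinkedHashMap<>();
    // Everything after the last ':' is treated as key=value pairs.
    String params = url.substring(url.lastIndexOf(':') + 1);
    for (String pair : params.split(";")) {
      if (pair.isEmpty()) {
        continue;
      }
      int eq = pair.indexOf('=');
      options.put(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return options;
  }
}
```

For example, `parse("jdbc:beam:runner=DirectRunner;tempLocation=/tmp")` yields a map with `runner` and `tempLocation` entries that could then be turned into real PipelineOptions.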

}

private static class Collector extends DoFn<Row, Void> {
// This will only work on the direct runner.
Member

It seems like there are two routes that are interesting here: (1) expose the ability to observe PCollection contents in the DirectRunner. This used to be the case when it only supported bounded collections. It is a different situation now, as the contents are never materialized. But that could change.

(2) The other thing that would potentially make it cross-runner, and what Scio does, is to write it to a sink. It could be TextIO writing to the globally-required tempLocation. We should definitely learn from Scio either way.

Member Author

I agree on writing to a sink. I was considering a version that sinks to BigQuery and reads back from there. There are many options with different tradeoffs; we should consider them as follow-ups.

Member

Yea, I think we probably want to be able to set options to control that if/when we do implement.

Member

Maybe just (w/ JIRA) add checkArgument(options.getRunner().getCanonicalName().equals("org.apache.beam.runners.direct.DirectRunner"))?
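A minimal sketch of that guard, with the class and method names invented here (a real check would compare `options.getRunner().getCanonicalName()` exactly as the comment suggests):

```java
// Invented stand-in for the suggested guard: fail fast when the configured
// runner is anything other than the DirectRunner, since the in-memory
// Collector cannot work across processes.
final class RunnerGuard {
  static final String DIRECT_RUNNER = "org.apache.beam.runners.direct.DirectRunner";

  static void checkDirectRunner(String runnerClassName) {
    if (!DIRECT_RUNNER.equals(runnerClassName)) {
      throw new IllegalArgumentException(
          "Collector requires the DirectRunner, got: " + runnerClassName);
    }
  }
}
```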

@apilloud
Member Author

And this one is green.


Collector.globalValues.put(id, values);
run(options, node, new Collector());
Collector.globalValues.remove(id);
Contributor

Is it necessary to keep a static state? Cannot this be an instance field of the Collector?

Member Author

The Collector is serialized by Beam and a copy is passed to the worker. As a result, each worker ends up with a copy of the Collector, so the original queue won't contain any values added by the worker. The static state circumvents Beam's requirement that there be no global state. This only works in the direct runner because all workers are in the same process.
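The point about serialization can be demonstrated without Beam. Below, a plain Java serialization round-trip stands in for shipping the DoFn to a worker; the names (`CollectorSketch`, `globalValues`) only mirror the PR's pattern and are not the actual code:

```java
import java.io.*;
import java.util.*;
import java.util.concurrent.*;

// Self-contained illustration: the deserialized copy's instance fields are
// copies, but a static map is shared by every instance in the same JVM.
// That sharing is exactly what makes the trick DirectRunner-only.
class CollectorSketch implements Serializable {
  // Shared across all deserialized copies within one process.
  static final Map<Long, Queue<String>> globalValues = new ConcurrentHashMap<>();

  private final long id;

  CollectorSketch(long id) {
    this.id = id;
  }

  void process(String row) {
    // Lands in the queue registered by the original, not in a private copy.
    globalValues.get(id).add(row);
  }

  // Serialize and deserialize, mimicking Beam handing the fn to a worker.
  static CollectorSketch roundTrip(CollectorSketch fn) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(fn);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (CollectorSketch) in.readObject();
    }
  }
}
```

Rows processed by the round-tripped copy show up in the original's static map, which is what the put/run/remove sequence in the quoted code relies on.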

Member

It seems like a pattern of concern that this only works on the direct runner but maybe is positioned in the codebase in a way that makes us think it should work more broadly. It could be solved just by really clear signaling and validation, perhaps, and maybe TODO JIRAs? I'm looking for options to make everyone happy and unblock your other work.

Member Author

I'm explicitly violating the contract here, but being able to collect results in memory seems like a useful testing pattern for the direct runner. Is this something that might be able to live in a util folder inside the direct runner?

/**
* BeamRelNode to replace a {@code Enumerable} node.
*/
public class BeamEnumerableConverter extends ConverterImpl implements EnumerableRel {
Contributor

Can you document the behavior and how this class is used? And maybe an integration test of some kind?

Member Author

An integration test would be good; I'll get one added. I started down the unit test path last week and ended up deleting all the code that was unit testable when I found a library that implemented it.

.apply(node.toPTransform())
.apply(ParDo.of(doFn));
PipelineResult result = pipeline.run();
result.waitUntilFinish();
Contributor

Does this work in the general case? E.g. what happens when the input is unbounded? My understanding is that Enumerables can be unbounded as well, and JDBC supports paginated unbounded results too; is this going to work?

Member Author

This does not work in the general case; it only works on pipelines that eventually finish. I wanted to get something in for my DDL work; we can make this better in a follow-up.
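One way such a follow-up might fail fast is a boundedness guard. The enum below only mirrors Beam's `PCollection.IsBounded`; the guard itself is invented here for illustration and is not part of this PR:

```java
// Hypothetical guard for the bounded-only limitation: refuse unbounded
// inputs up front instead of hanging forever in waitUntilFinish(). The
// enum mirrors Beam's PCollection.IsBounded; the check is invented.
final class BoundednessGuard {
  enum IsBounded { BOUNDED, UNBOUNDED }

  static void requireBounded(IsBounded boundedness) {
    if (boundedness != IsBounded.BOUNDED) {
      throw new UnsupportedOperationException(
          "toEnumerable only supports pipelines that finish; got " + boundedness);
    }
  }
}
```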

Member

@kennknowles kennknowles left a comment

Let's unblock w/ JIRAs as we all agree which piece to come back to.

private static Enumerable<Object> count(PipelineOptions options, BeamRelNode node) {
PipelineResult result = run(options, node, new RowCounter());
MetricQueryResults metrics = result.metrics().queryMetrics(MetricsFilter.builder()
.addNameFilter(MetricNameFilter.named(BeamEnumerableConverter.class, "rows"))
Member

We probably want to spend some time on the best way to get reliable counts that are meaningful in the way that JDBC expects. I think the SQL <-> IO adapter may have to own it, for those Beam connectors that do things like write "mutations" that are not rows at all, etc.

@apilloud
Member Author

Fixed the issues from the comments and added tests at the toEnumerable layer. There are still some missing pieces for a functional calcite JDBC test, so that will have to come once those features are added.

@kennknowles kennknowles merged commit bf94e36 into apache:master May 3, 2018