Add UnboundedReadFromBoundedSource, and use it in Dataflow runner #339
peihe wants to merge 11 commits into apache:master
Conversation
```java
/**
 * Returns a new {@link UnboundedReadFromBoundedSourceTest}.
```

```java
@Override
public boolean requiresDeduping() {
```
This is incomplete: you must provide an implementation of getCurrentRecordId() for requiresDeduping() to function; otherwise calls to getCurrentRecordId() will throw an exception. It is probably right to use (element number, shardId) here. As far as I know, deduping is best-effort (not guaranteed over the life of a pipeline), so this may still produce duplicate elements in some cases.
This also does not solve the problem of progress: the reader may be discarded before it returns false from start() or advance() (as is done in the InProcessPipelineRunner), which may cause the reader to produce the same subset of elements repeatedly and never complete the input. I believe reading the entire contents into an Iterable within start(), then outputting (and flattening) that Iterable, will provide completion and progress guarantees, except in the case where not all elements fit in memory. Potentially the produced Iterable can be implemented as a channel back to the underlying BoundedReader (and thus lazily produce elements as TimestampedValues, which can be written to some channel and cleared out of memory if the runner supports that).
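To illustrate the (element number, shardId) suggestion above, here is a minimal, self-contained sketch (not the PR's code, and not Beam's API): a stable record id is derived from the shard and element index, and a runner-side best-effort dedupe drops records whose id has already been seen. All class and method names here are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: best-effort deduping keyed by (shardId, elementIndex),
// as suggested in the review. A real Beam UnboundedReader would return the
// byte[] id from getCurrentRecordId().
public class RecordIdSketch {
    // Encode (shardId, elementIndex) as a stable byte[] record id.
    static byte[] recordId(int shardId, long elementIndex) {
        return ByteBuffer.allocate(12).putInt(shardId).putLong(elementIndex).array();
    }

    public static void main(String[] args) {
        // Runner-side best-effort dedupe: drop records whose id was seen.
        // ByteBuffer compares by content, so it works as a HashSet key.
        Set<ByteBuffer> seen = new HashSet<>();
        List<String> output = new ArrayList<>();
        int[][] records = {{0, 0}, {0, 1}, {0, 0}, {1, 0}}; // (shard, index); (0,0) repeats
        for (int[] r : records) {
            ByteBuffer id = ByteBuffer.wrap(recordId(r[0], r[1]));
            if (seen.add(id)) {
                output.add("shard" + r[0] + ":" + r[1]);
            }
        }
        System.out.println(output); // duplicate (0,0) is dropped
    }
}
```

Note this dedupe is only as durable as the `seen` set; as the comment above says, Beam's deduping is best-effort, so it narrows but does not eliminate the duplicate-element window.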
Removed requiresDeduping and getCurrentRecordId, since a checkpoint was added.
Force-pushed from c35fdf3 to 14cc999.
PTAL. (Jenkins seems broken for unrelated reasons.)
Force-pushed from cb15794 to d9ec83a.
```java
class Reader extends UnboundedReader<T> {
  private final PipelineOptions options;

  private @Nullable final List<TimestampedValue<T>> residualElements;
```
Avoid null; this should just be empty if there are no residual elements.
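The "avoid null" suggestion can be sketched as follows. This is a hypothetical stand-in class (not the PR's Checkpoint), showing the convention of normalizing an absent residual-element list to an immutable empty list so downstream code never needs a null check.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the reviewer's suggestion: represent "no residual
// elements" as an empty list rather than null. Names are illustrative only.
public class CheckpointSketch<T> {
    private final List<T> residualElements;

    CheckpointSketch(List<T> residualElements) {
        // Normalize: an absent list becomes an immutable empty list.
        this.residualElements =
            residualElements == null ? Collections.<T>emptyList() : residualElements;
    }

    List<T> getResidualElements() {
        return residualElements;
    }

    public static void main(String[] args) {
        CheckpointSketch<String> done = new CheckpointSketch<>(null);
        // No null check needed downstream; iterating an empty list is a no-op.
        System.out.println("residual count: " + done.getResidualElements().size());
    }
}
```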
```java
  public void finalizeCheckpoint() {}
}

private static class CheckpointCoder<T> extends AtomicCoder<Checkpoint<T>> {
```
This should be a StandardCoder; since it is parameterized by elemCoder, it is not atomic.
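The point about the coder being "parameterized by elemCoder" can be illustrated without Beam. The sketch below (all names invented; `ElemCoder` is not a Beam interface) encodes a checkpoint's residual elements as a count followed by each element, delegating the element bytes to a component coder — exactly the structure that makes such a coder non-atomic: its encoding depends on a component coder argument.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch (not Beam's API): a checkpoint coder that delegates to a component
// element coder, showing why it is parameterized rather than atomic.
public class CheckpointCoderSketch {
    interface ElemCoder<T> {
        void encode(T value, DataOutputStream out) throws IOException;
        T decode(DataInputStream in) throws IOException;
    }

    // Count-prefixed encoding; element bytes come from the component coder.
    static <T> byte[] encode(List<T> residual, ElemCoder<T> elemCoder) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(residual.size());
        for (T elem : residual) {
            elemCoder.encode(elem, out);
        }
        return bytes.toByteArray();
    }

    static <T> List<T> decode(byte[] data, ElemCoder<T> elemCoder) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int size = in.readInt();
        List<T> result = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            result.add(elemCoder.decode(in));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        ElemCoder<String> utf8 = new ElemCoder<String>() {
            public void encode(String v, DataOutputStream out) throws IOException { out.writeUTF(v); }
            public String decode(DataInputStream in) throws IOException { return in.readUTF(); }
        };
        List<String> residual = Arrays.asList("a", "b", "c");
        List<String> roundTripped = decode(encode(residual, utf8), utf8);
        System.out.println(roundTripped); // round-trips through the component coder
    }
}
```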
PTAL.
```java
import javax.annotation.Nullable;

/**
 * {@link PTransform} that performs an unbounded read from a {@link BoundedSource}.
```
Never mind on this one; not sure what I was thinking.
Force-pushed from 1017ce9 to 8588f6d.
@dhalperi feedback?
```java
public class UnboundedReadFromBoundedSource<T> extends PTransform<PInput, PCollection<T>> {
  private final BoundedSource<T> source;

  public UnboundedReadFromBoundedSource(BoundedSource<T> source) {
```
Ready to review.
```java
  }
}

<<<<<<< HEAD
```
This was from an outdated diff.
LGTM
Minor fixes; ping me when it's green to merge this and the other PR.
Addressed comments and rebased to resolve conflicts.
```java
import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkState;

import org.apache.beam.runners.core.UnboundedReadFromBoundedSource;
```
@kennknowles @davorbonaci adding this here would make the Dataflow runner depend on runners-core. This violates our prior assumption that only the service half of the runner should depend on runners-core.
Opinions?
Doesn't build: checkstyle failures, but also, more fundamentally, the issue identified above. Let's discuss tomorrow.
```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-core-java</artifactId>
  <scope>runtime</scope>
</dependency>
```
Discussed offline. I believe we have good options, short term and long term, to avoid bringing back this dependency:
Longer term: put the functionality in runners-core but only invoke it in the Dataflow service. If this turns out to be easy, then we should do it right away.
Short term: put the functionality elsewhere, in one of:
- The Dataflow runner module. This is my preference. Right now it is really a matter of how the Dataflow runner works that this adapter is necessary. If another runner decides it wants to go the same route, then we might be in the longer-term scenario anyhow.
- The SDK. Mostly harmless, though we should move it out prior to a stable release.
- Some io-core module that is not quite the grab bag that runners-core is slated to be.

This highlights that there are really two needs for a utility library:
- Helping implement the Beam model on the backend. This will generally affect a service, which can be updated transparently and in an agile manner. It presupposes the service is aware of Beam constructs.
- Helping put together a translation from a Beam pipeline to an underlying backend. This will generally occur in SDK-adjacent client-side code, which cannot be updated easily.

We can conflate the two without much of a downside, as long as we shade on both sides, but it helps me to think of them separately. The main risk is that we end up with too thick a client that is hard to update. Or we could split them pretty easily. I propose we wait and see.
LGTM. Will merge when green.
This PR will make the Dataflow streaming runner work with BoundedSources, such as TextIO and AvroIO.
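The checkpoint-and-resume idea discussed in the review thread can be sketched in plain Java, without Beam. This is an illustrative model, not the PR's implementation: a "reader" over a bounded collection snapshots its unemitted elements as residual elements on checkpoint, and a new reader resumes from that checkpoint, so the input completes even if the original reader is discarded mid-read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch (not the PR's code) of the core idea behind an unbounded adapter
// over a bounded source: checkpoint the residual (not-yet-emitted) elements
// so a fresh reader can finish the input instead of restarting from scratch.
public class UnboundedAdapterSketch {
    static class Checkpoint<T> {
        final List<T> residualElements; // elements still to be emitted
        Checkpoint(List<T> residual) { this.residualElements = residual; }
    }

    static class Reader<T> {
        private final List<T> elements;
        private int position = -1;

        Reader(List<T> elements) { this.elements = elements; }

        boolean advance() { return ++position < elements.size(); }
        T getCurrent() { return elements.get(position); }

        // Snapshot everything not yet emitted so a new reader can resume.
        Checkpoint<T> getCheckpointMark() {
            return new Checkpoint<>(new ArrayList<>(
                elements.subList(position + 1, elements.size())));
        }
    }

    public static void main(String[] args) {
        Reader<String> reader = new Reader<>(Arrays.asList("a", "b", "c", "d"));
        List<String> out = new ArrayList<>();
        // Emit two elements, then pretend the runner discards the reader.
        for (int i = 0; i < 2 && reader.advance(); i++) out.add(reader.getCurrent());
        Checkpoint<String> mark = reader.getCheckpointMark();

        // Resume from the checkpoint's residual elements; the input completes.
        Reader<String> resumed = new Reader<>(mark.residualElements);
        while (resumed.advance()) out.add(resumed.getCurrent());
        System.out.println(out);
    }
}
```

This models why the thread moved from requiresDeduping/getCurrentRecordId to checkpoints: with residual elements carried in the checkpoint, resumption makes progress by construction rather than relying on best-effort deduplication.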