Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-79] merge gearpump-runner into master #3611

Closed
wants to merge 207 commits into from

Conversation

manuzhang
Copy link
Contributor

@manuzhang manuzhang commented Jul 21, 2017

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

manuzhang and others added 30 commits July 20, 2016 08:42
They have apparently deleted the SNAPSHOT jar and now builds are failing.
The only way it's safe to split a compressed file is if the file is not compressed. This can
only happen when the source itself is splittable, and that in turn will result in the inner
source's reader being returned. A CompressedReader will only be created in the event that
the file is NOT splittable. So remove all the logic handling splittable compressed readers,
and instead go with the logic when we know/assume the file is compressed.

* TextIO: test compression with larger files

It is important for correctness that we test with large files
because otherwise the compressed file may be larger than the
uncompressed file, which could mask bugs

* TextIOTest: flesh out more

* TextIOTest: add large uncompressed file
If the test hangs due to bugs, the infrastructure should kill it.
If provided with an Unbounded PCollection, Write will fail due to
restriction of calling finalize only once. This error message fails in a
deep stack trace based on it not being possible to apply a GroupByKey.
Instead, throw immediately on application with a specific error message.
These writes should be forbidden based on the boundedness of the input
PCollection. As Write explicitly forbids the application of the
transform to an Unbounded PCollection, this will be equivalent in most
cases; In cases where the input PCollection is Bounded, due to an
UnboundedReadFromBoundedSource, the write will function as expected and
does not need to be forbidden.
Aggregator is the model level concept. Counter was specific to the
Dataflow Runner, and is now not needed as part of Beam.
This cleans up any state stored within the Transform Evaluator Factory.
This is a more focused interface that interacts with a DoFn before it
is available for use and after it has completed and the reference is
lost. It is required to properly support setup and teardown, as the
fields in a ThreadLocal cannot all be cleaned up without additional
tracking.

Part of BEAM-452.
Methods annotated with these annotations are used to perform expensive
setup work and clean up a DoFn after another method throws an exception
or the DoFn is discarded.
These tests are not yet functional in all runners, and this makes them
easier to ignore.
Until we implement it for Dataflow runner.
manuzhang and others added 11 commits June 19, 2017 16:03
  Remove unused codes
  Fix kryo exception
  Fix PCollectionView translation
  upgrade to gearpump 0.8.4-SNAPSHOT
  Fix side input handling in DoFnFunction
  Respect WindowFn#getOutputTime in gearpump-runner
  Activate Gearpump local-validates-runner-tests in precommit
  Update against master changes
  Update gearpump-runner against master changes
  Update gearpump-runner against master changes
  Update gearpump-runner against master changes.
  [BEAM-972] Add more unit test to Gearpump runner
  [BEAM-972] Add unit tests to Gearpump runner
  [BEAM-79] Fix gearpump-runner merge conflicts and test failure
  enable ParDoTest
  [BEAM-79] Add SideInput support for GearpumpRunner
  [BEAM-79] Support merging windows in GearpumpRunner
  [BEAM-79] Fix PostCommit test confs for Gearpump runner
  note thread is interrupted on InterruptedException
  Remove cache for Gearpump on travis
  reduce timeout to wait for result
  fix ParDo.BoundMulti translation
  return encoded key for GroupByKey translation
  support OutputTimeFn
  update to latest gearpump dsl function interface
  fix group by window
  activate ROS on Gearpump by default
  update ROS configurations
  [BEAM-1180] Implement GearpumpPipelineResult
  [BEAM-79] Upgrade to beam-0.5.0-incubating-SNAPSHOT
  [BEAM-79] Update to latest Gearpump API
  Fix NoOpAggregatorFactory
  Remove print to stdout
  Skip window assignment when windows don't change
  Add Window.Bound translator
  Upgrade Gearpump version
  [BEAM-79] fix gearpump runner build failure
  [BEAM-79] update GearpumpPipelineResult
  [BEAM-79] Port Gearpump runner from OldDoFn to new DoFn
  upgrade gearpump-runner to 0.4.0-incubating-SNAPSHOT
  remove "pipeline" in runner name
  post-merge fix
  [BEAM-79] fix integration-test failure
  fix import order
  !fixup Minor javadoc clean-up
  Added even more javadoc to TextIO#withHeader and TextIO#withFooter (2).
  Added even more javadoc to TextIO#withHeader and TextIO#withFooter.
  Added javadoc to TextIO#withHeader and TextIO#withFooter.
  Reverted header and footer to be of type String.
  Revised according to comments following a code review.
  Add header/footer support to TextIO.Write
  [BEAM-242] Enable and fix checkstyle in Flink runner examples
  Remove timeout in JAXBCoderTest
  Be more accepting in UnboundedReadDeduplicatorTest
  BigQuery: limit max job polling time to 1 minute
  [BEAM-242] Enable checkstyle and fix checkstyle errors in Flink runner
  [BEAM-456] Add MongoDbIO
  FluentBackoff: a replacement for a variety of custom backoff implementations
  Remove the DataflowRunner instructions from examples
  Put classes in runners-core package into runners.core namespace
  Delegate populateDipslayData to wrapped combineFn's
  Fixed Combine display data
  Cloud Datastore naming clean-up
  DatastoreIO SplitQueryFn integration test
  Add Latest CombineFn and PTransforms
  Remove empty unused method in TestStreamEvaluatorFactory
  Test that multiple instances of TestStream are supported
  Correct some accidental renames
  Fix condition in FlinkStreamingPipelineTranslator
  Address comments of Flink Side-Input PR
  [BEAM-569] Define maxNumRecords default value to Long.MAX_VALUE in JmsIO
  Add LeaderBoardTest
  take advantage of setup/teardown for KafkaWriter
  Returned KafkaIO getWatermark log line in debug mode
  [BEAM-572] Remove Spark Reference in WordCount
  Update Dataflow Container Version
  [BEAM-313] Provide a context for SparkRunner
  DataflowRunner: get PBegin from PInput
  [BEAM-592] Fix SparkRunner Dependency Problem in WordCount
  Fix javadoc in Kinesis
  Organize imports in Kinesis
  kinesis: a connector for Amazon Kinesis
  [BEAM-589] Fixing IO.Read transformation
  Query latest timestamp
  travis.yml: disable updating snapshots
  Added support for reporting aggregator values to Spark sinks
  [BEAM-294] Rename dataflow references to beam
  Modified BigtableIO to use DoFn setup/tearDown methods instead of startBundle/finishBundle
  checkstyle: prohibit API client repackaged Guava
  Make WriteTest more resilient to Randomness
  Update DoFn javadocs to remove references to OldDoFn and Dataflow
  [BEAM-545] Promote JobName to PipelineOptions
  Move the samples data to gs://apache-beam-samples/
  Cleanup some javadoc that referring Dataflow
  BigQueryIO.Write: raise size limit to 11 TiB
  Optimize imports
  Update checkstyle.xml to put all imports in one group
  Fix Exception Unwrapping in TestFlinkRunner
  Make ParDoLifecycleTest Serializable to Fix Test with TupleTag
  Use AllPanes as the PaneExtractor in IterableAssert
  ...
…branch

  Don't call .testingPipelineOptions() a second time
  GCP IO ITs now all use --project option
  Select SDK distribution based on the selected SDK name
  [BEAM-2373] Upgrade commons-compress dependency version to 1.14
  Define the projectId in the SpannerIO Read Test (utest, not itest)
  Use SDK harness container for FnAPI jobs when worker_harness_container_image is not specified. Add a separate image tag to use with the SDK harness container.
  Ditch apache commons
  Add PubSub I/O support to Python DirectRunner
  Only use ASCII 'a' through 'z' for temporary Spanner tables
  ReduceFnRunner.onTrigger: add short circuit for empty pane, and move inputWM and pane after the short circuit.
  WindowingStrategy: add OnTimeBehavior to control whether to emit empty ON_TIME pane.
  Removed OnceTriggerStateMachine
  Visit composite nodes when checking for picklability.
  Upgrade beam bigtable client dependency to 0.9.7.1
  Add a Combine Test for Sliding Windows without Context
  [BEAM-2389] moved GcpCoreApiSurfaceTest to corresponding module, adapted exposed packagees
  Add Experimental annotation to AMQP and refine Kind for the Experimental IOs
  [BEAM-2488] Elasticsearch IO should read also in replica shards
  Use PCollectionViews.toAdditionalInputs in Combine
  Use PCollectionViews.toAdditionalInputs in ParDo
  Use PCollectionViews.toAdditionalInputs in ParDoMultiOverrideFactory
  Fix getAdditionalInputs for SplittableParDo transforms
  Add utility to expand list of PCollectionViews
  Read api with naive implementation
  Pre read api refactoring. Extract `SpannerConfig` and `AbstractSpannerFn`
  Bump spanner version
  [BEAM-1187] Improve logging to contain the number of retries done due to IOException and unsuccessful response codes.
  Add WindowFn#assignsToOneWindow
  Use installed distribution name for sdk name
  [BEAM-2522] upgrading jackson to 2.8.9 (mitigating apache#1599)
  Enable grpc controller in fn_api_runner
  Removed uses of proto builder clone method
  [BEAM-2514] Improve error message on missing required value
  [BEAM-1237] Create AmqpIO
  Implement streaming GroupByKey in Python DirectRunner
  Bump Dataflow worker to 0623
  Reintroduces DoFn.ProcessContinuation (Dataflow worker compatibility part)
  Remove old deprecated PubSub code
  Fix a typo in function args
  Avoid pickling the entire pipeline per-transform.
  Fix python fn API data plane remote grpc port access
  [BEAM-2745] Add Jenkins Suite for Python Performance Test
  [BEAM-2489] Use dynamic ES port in HIFIOWithElasticTest
  [BEAM-2497] Fix the reading of concat gzip files
  Allow output from FinishBundle in DoFnTester
  DataflowRunner: Reject merging windowing for stateful ParDo
… to gearpump 0.8.4

  Fix ParDoTest#testPipelineOptionsParameter
  Upgrade to gearpump 0.8.4
  Fix javadoc generation for AmqpIO, CassandraIO and HCatalogIO
  Simplified ByteBuddyOnTimerInvokerFactory
  Fix bad merge
  Made DataflowRunner TransformTranslator public
  Process timer firings for a window together
  Ignore processing time timers in expired windows
  Add timeout to initialization of partition in KafkaIO
  [BEAM-2534] Handle offset gaps in Kafka messages.
  Fix PValue input in _PubSubReadEvaluator
  Update SDK dependencies
  Disallow Combiner Lifting for multi-window WindowFns
  [BEAM-2553] Update Maven exec plugin to 1.6.0 to incorporate messaging improvements
  Website Mergebot Job
  Update Python SDK version
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare branch release-2.1.0
  For GCS operations use an http client with a default timeout value.
  [BEAM-2530] Fix compilation of modules with Java 9 that depend on jdk.tools
  Make modules that depend on Hadoop and Spark use the same version property
  Fix DoFn javadoc: StateSpec does not require a key
  Add support for PipelineOptions parameters
  Properly convert milliseconds whether there's less than 3/more than 9 digits. TimeUtil did not properly convert (and returned null) when the number of digits for fractions of seconds was less than 3 digits or more than 9 digits. The solution is to pad with zeros when there is less than 3 digits and to truncate when there is more than 3.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.007%) to 70.602% when pulling daa7566 on manuzhang:gearpump-runner into 1d9160f on apache:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 70.599% when pulling 49d4ed5 on manuzhang:gearpump-runner into 81c2e90 on apache:master.

@manuzhang
Copy link
Contributor Author

R: @kennknowles

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the poms.

Actually, I suggest you open a pull request from gearpump-runner to master both on the main apache/beam fork. This makes it very clearly exactly what is proposed.

Then, you can separately do little PRs from manuzhaing:whatever to origin:gearpump-runner to address comments and those commits will automatically show up in the main PR.

Some things can wait until after the merge, but getting the slower tests to be postcommit-only must happen before the merge.

<profile>
<id>local-validates-runner-tests</id>
<!-- TODO: once on master branch, move to postcommit only -->
<activation><activeByDefault>true</activeByDefault></activation>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now this should not be active by default. It should run only on postcommit.

@@ -89,6 +89,18 @@
</dependencies>
</profile>

<!-- Include the Apache Gearpump (incubating) runner with -P gearpump-runner -->
<profile>
<id>gearpump-runner</id>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look in this file for the jenkins-precommit bits. You can add an execution for the Gearpump runner.

@manuzhang
Copy link
Contributor Author

@kennknowles split this into #3636 and #3637. I will leave it open for a while since it is referenced on mailing list

@kennknowles
Copy link
Member

Great. Those PRs look just right. Fine to leave this open - but links will keep working even when close.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.08%) to 70.502% when pulling d56606d on manuzhang:gearpump-runner into 81c2e90 on apache:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.007%) to 70.509% when pulling a56c599 on manuzhang:gearpump-runner into 71196ec on apache:master.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling b0ed584 on manuzhang:gearpump-runner into ** on apache:master**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet