Conversation

@bowenli86 bowenli86 commented Nov 1, 2018

What is the purpose of the change

The motivation is

  1. enabling users to set a read start position for files, so they can process only files modified after a given timestamp
  2. exposing more file information to users and providing a more flexible file filter interface for defining their own filtering rules

Brief change log

  • add a FileFilter interface through which users can access all available information about a file and define filtering rules
  • allow users to set a FileFilter on a FileInputFormat
  • add FileModTimeFilter, which lets users set a read start position for files so that Flink only processes files modified after the given timestamp
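To illustrate the shape of the proposed feature, here is a minimal, self-contained sketch. The names, signatures, and `accept` parameters are hypothetical and not the actual classes from this PR; the real interface would likely operate on Flink's `FileStatus`.

```java
import java.io.Serializable;

public class FileFilterSketch {

    /** Hypothetical filter interface: users see all available file metadata. */
    public interface FileFilter extends Serializable {
        boolean accept(String path, long length, long modificationTime);
    }

    /** Hypothetical filter accepting only files modified at or after a start timestamp. */
    public static class FileModTimeFilter implements FileFilter {
        private final long readStartMillis;

        public FileModTimeFilter(long readStartMillis) {
            this.readStartMillis = readStartMillis;
        }

        @Override
        public boolean accept(String path, long length, long modificationTime) {
            // Skip files last modified before the configured read start position.
            return modificationTime >= readStartMillis;
        }
    }

    public static void main(String[] args) {
        FileFilter filter = new FileModTimeFilter(1_000L);
        System.out.println(filter.accept("/data/a.log", 42L, 2_000L)); // true
        System.out.println(filter.accept("/data/b.log", 42L, 500L));   // false
    }
}
```

A format would call `accept` once per candidate file during directory enumeration, before creating input splits.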

Verifying this change

This change added tests and can be verified as follows:

  • extended unit tests for FileInputFormat in FileInputFormatTest
  • added FileModTimeFilterTest

Does this pull request potentially affect one of the following parts:

  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? - Documentation will be added in another PR under a different JIRA ticket

@bowenli86
Member Author

@kl0u @fhueske Can you take a look?

@bowenli86 bowenli86 changed the title [FLINK-10168] Add FileFilter interface and FileModTimeFilter which sets a read start position for files by modification time [FLINK-10168] [DataStream API] Add FileFilter interface and FileModTimeFilter which sets a read start position for files by modification time Nov 5, 2018
@bowenli86 bowenli86 changed the title [FLINK-10168] [DataStream API] Add FileFilter interface and FileModTimeFilter which sets a read start position for files by modification time [FLINK-10168] [Core & DataStream] Add FileFilter to FileInputFormat and FileModTimeFilter which sets a read start position for files by modification time Nov 15, 2018
bowenli86 and others added 27 commits December 4, 2018 10:25
… FileModTimeFilter which sets a read start position for files by modification time
The code examples for Scala and Java are both broken,
and set a bad example in terms of efficiency.

This closes #7232.
This enables submission of multiple jobs in the Jepsen tests. The job
specifications are in an .edn file that must be passed as a command line
argument. The checker verifies that all jobs are running at the end of the
test. The job cancellation function now cancels all jobs at once.

Delete script/run-tests.sh and move code to docker/run-tests.sh.
This enables toggling the setup of test dependencies, such as Hadoop, Mesos,
and ZooKeeper, through the --test-spec edn file. The type of the Flink cluster
can also be specified via :flink-yarn-job, :flink-yarn-session,
:flink-mesos-session, or :flink-standalone-session.

Retryable operations that exhaust all attempts now propagate the exception by
default. This is to fail fast if, during db/setup!, an operation such as
Flink job submission fails to complete.
…itting records

We need to make sure that concurrent access to the RecordWriter is protected by
a lock. It seems that everything but the StreamIterationHead was synchronizing
on the checkpoint lock, so we synchronize here as well.
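The locking pattern described above can be sketched as follows. This is an illustrative, self-contained example; `LockedWriter`, `emit`, and `snapshot` are hypothetical stand-ins for the real RecordWriter and checkpoint machinery.

```java
import java.util.ArrayList;
import java.util.List;

public class LockedWriter {
    // Single shared lock guarding both record emission and checkpointing.
    private final Object checkpointLock = new Object();
    private final List<String> buffer = new ArrayList<>();

    /** Emit a record while holding the checkpoint lock. */
    public void emit(String record) {
        synchronized (checkpointLock) {
            buffer.add(record);
        }
    }

    /** Read a consistent copy under the same lock, mirroring a checkpoint. */
    public List<String> snapshot() {
        synchronized (checkpointLock) {
            return new ArrayList<>(buffer);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockedWriter w = new LockedWriter();
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) w.emit("a"); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) w.emit("b"); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(w.snapshot().size()); // 2000
    }
}
```

Because every writer path takes the same lock, no emission can interleave with a snapshot, which is the invariant the commit restores for StreamIterationHead.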
… tests should fail if exception isn't thrown

This commit strengthens tests in StateBackendMigrationTestBase that depend on a
certain state operation (restoring state, accessing state, etc.) failing in
order to assert correct behaviour. Previously, the tests did not actually fail
when no exception was thrown even though one was expected.

This also caught some bugs in the test itself which had the tests
verifying incorrect behaviour.
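The test-strengthening pattern described above is the classic "fail if no exception" guard. A minimal sketch, with hypothetical names (the real tests would use a JUnit `fail()` call rather than a boolean return):

```java
public class ExpectedFailureSketch {

    /**
     * Returns true iff the action throws. In a real test, falling through the
     * try block without an exception would call fail() instead of returning false.
     */
    public static boolean throwsAsExpected(Runnable action) {
        try {
            action.run();
        } catch (Exception e) {
            return true; // the expected failure path
        }
        // Previously this fall-through silently passed the test; now it must fail it.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(throwsAsExpected(() -> { throw new IllegalStateException("incompatible"); })); // true
        System.out.println(throwsAsExpected(() -> { /* restore unexpectedly succeeds */ }));              // false
    }
}
```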
…ackend should always be compatible

Previously, we only checked whether the new namespace serializer is
incompatible, whereas we should check that it is strictly compatible.

This doesn't affect any user expected behaviour, since the namespace
serializer is never exposed to users.
mxm and others added 27 commits December 14, 2018 14:22
The traversal of the DAG is inefficient in some places, which can lead
to very long plan creation times.

This introduces caching for the traversal to avoid traversing nodes multiple
times. Caching is performed at two places:

- when registering Kryo types
- when determining the maximum parallelism
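The caching idea can be sketched as a memoized DAG traversal: each node's result is computed once and reused, so shared sub-DAGs are not re-traversed for every path that reaches them. All names here are hypothetical, not the actual Flink plan-translation code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DagTraversalCache {
    private final Map<String, List<String>> edges = new HashMap<>();
    // Per-node memo: without it, diamond-shaped DAGs cause exponential revisits.
    private final Map<String, Integer> maxParallelismCache = new HashMap<>();

    public void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Maximum parallelism over a node and its successors, cached per node. */
    public int maxParallelism(String node, Map<String, Integer> parallelism) {
        Integer cached = maxParallelismCache.get(node);
        if (cached != null) {
            return cached; // reuse result instead of re-traversing
        }
        int max = parallelism.getOrDefault(node, 1);
        for (String succ : edges.getOrDefault(node, Collections.emptyList())) {
            max = Math.max(max, maxParallelism(succ, parallelism));
        }
        maxParallelismCache.put(node, max);
        return max;
    }

    public static void main(String[] args) {
        DagTraversalCache dag = new DagTraversalCache();
        dag.addEdge("src", "map");
        dag.addEdge("map", "sink");
        Map<String, Integer> p = Map.of("src", 2, "map", 8, "sink", 4);
        System.out.println(dag.maxParallelism("src", p)); // 8
    }
}
```

The same memo-map approach applies to the Kryo type registration pass: record which nodes have been visited and skip them on subsequent encounters.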
…ckend and add harness tests for CollectAggFunction

This closes #7253
The test completes in 500ms on my machine. On Travis the resources are much
more limited and it fails occasionally.

The following should ensure the test does not fail anymore:

- Increase test timeout to 30 seconds
- Reduce complexity of tests by decreasing branching factor
… FileModTimeFilter which sets a read start position for files by modification time