New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-2314] Make Streaming File Sources Persistent #2020
Conversation
* A default implementation is the {@link DefaultFilter} which excludes files starting with ".", "_", or | ||
* contain the "_COPYING_" in their names. This can be retrieved by {@link DefaultFilter#getInstance()}. | ||
* */ | ||
public interface FilePathFilter extends Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably make this @PublicEvolving, just to be on the save site. I can fix it up when merging.
All in all, very good work! One thing I'd like to change is the order of parameters in the
|
@aljoscha Thanks a lot for the comments! |
Thanks, the changes look good. R: @StephanEwen for taking a look at the API, you would only look at |
CC: @StephanEwen By the way, it might not look like it but the only additional methods this introduces on
The rest are unfortunately public methods and we can't remove them, even though some should probably be removed. |
try { | ||
postSubmit(); | ||
} catch (Exception e1) { | ||
e1.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not forward the exception here? You introduced this block so that postSubmit()
also runs when the SuccessException
was thrown, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aljoscha ! Done.
private transient FileInputSplit currSplit; | ||
|
||
private transient FileInputSplit restoredSplit; | ||
private transient Long restoredOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a primitive type here
e78457e
to
fb9c949
Compare
@@ -82,6 +82,8 @@ | |||
*/ | |||
private static int MAX_SAMPLE_LEN; | |||
|
|||
private boolean restoring = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probable left from a previous change
The changes look good, we just have to figure out what to do about the methods on |
It adds a method failExternally() to the StreamTask, so that custom Operators can make their containing task fail when needed.
This adds a new interface called CheckpointableInputFormat which describes input formats whose state is queryable, i.e. getCurrentChannelState() returns where the reader is in the underlying source, and they can resume reading from a user-specified position. This functionality is not yet leveraged by current readers.
This is meant to replace the different file reading sources in Flink streaming. Now there is one monitoring source with DOP 1 monitoring a directory and assigning input split to downstream readers. In addition, it makes the new features added by FLINK-3717 work together with the aforementioned entities (the monitor and the readers) in order to have fault tolerant file sources and exactly once guarantees. This does not replace the old API calls. This will be done in a future commit.
This commit takes the changes from the previous commits and wires them into the API, both Java and Scala. While doing so, some changes were introduced to the classes actually doing the work, either as bug fixes, or as new design choices.
aef24e8
to
3903562
Compare
The input formats still have a leftover field that stores the split. After that, the only thing that remains is the API methods. Also what was the reason for the new code in |
2ef1593
to
e1dac4b
Compare
a96f4de
to
b25ef87
Compare
Hi @kl0u, this PR is merged, right? |
This PR solves FLINK-2314 and combines a number of sub-tasks. In addition, it solves FLINK-3896 which was introduced as part of this task.
The way File Input sources are now processed is the following:
* One task monitors (parallelism 1) a user-specified path for new files/data
* The above task assigns FileInputSplits to downstream (parallel) readers to actually read the data
The monitoring entity scans the path, splits the files to be processed in splits, and assigns them downstream. For now, two modes are supported. These are the PROCESS_ONCE which just processes the current contents of the path and exits, and the REPROCESS_WITH_APPENDED which periodically monitors the path and reprocesses new files and (the entire contents of) files with new data.
In addition, these sources are checkpointed, i.e. in the case of a task failure the job will resume from where it left off.
Finally, some changes were introduced in the way we are handling FileInputFormats after discussions with @aljoscha .