[FLINK-4133] Reflect streaming file source changes in documentation #2198

Closed
kl0u wants to merge 1 commit

kl0u commented Jul 4, 2016

The title says it all.
It documents the continuous file processing semantics recently introduced in the Streaming API.


1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.

2. If the `watchType` is set to `FileProcessingMode#PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading. This leads to no more checkpoints after that point, thus providing reduced fault-tolerance guarantees.
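The re-processing behavior in item 1 can be illustrated with a small plain-Java simulation. This is only a sketch and not Flink's implementation: `FakeFile`, `ToyMonitor`, and `demo` are hypothetical names, and the monitor is assumed to detect changes purely by comparing modification timestamps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReprocessOnModifySketch {

    /** Hypothetical stand-in for a file: its lines plus a modification timestamp. */
    static final class FakeFile {
        final List<String> lines = new ArrayList<>();
        long modTime;

        void append(String line, long now) {
            lines.add(line);
            modTime = now;
        }
    }

    /** Toy monitor (assumption): forwards a file's entire contents whenever its timestamp is newer than the last one seen. */
    static final class ToyMonitor {
        private final Map<String, Long> lastSeen = new HashMap<>();
        final List<String> emitted = new ArrayList<>();

        void scan(Map<String, FakeFile> directory) {
            for (Map.Entry<String, FakeFile> e : directory.entrySet()) {
                long seen = lastSeen.getOrDefault(e.getKey(), Long.MIN_VALUE);
                if (e.getValue().modTime > seen) {
                    // The whole file is forwarded again, not just the appended part.
                    emitted.addAll(e.getValue().lines);
                    lastSeen.put(e.getKey(), e.getValue().modTime);
                }
            }
        }
    }

    static List<String> demo() {
        Map<String, FakeFile> dir = new HashMap<>();
        FakeFile f = new FakeFile();
        f.append("a", 1L);
        f.append("b", 2L);
        dir.put("data.txt", f);

        ToyMonitor monitor = new ToyMonitor();
        monitor.scan(dir);   // first scan: emits a, b
        monitor.scan(dir);   // file unchanged: emits nothing
        f.append("c", 3L);   // append a single line
        monitor.scan(dir);   // file modified: emits a, b, c again
        return monitor.emitted;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, b, a, b, c]
    }
}
```

In this toy model, appending one line causes `a` and `b` to be emitted a second time, which is the duplication that item 1 warns about.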
Contributor

You are referring to the enum values inconsistently, once using a `.` and once using a `#`.

Contributor

Item 2: this doesn't mean that the files are not read completely, right?
It's just that checkpointing won't work anymore once the input splits have been sent to the readers?

Contributor

Why are we not keeping the file-monitoring task alive until the job is cancelled?

Contributor Author

  1. I will integrate that.
  2. You are right, the files are read completely. The only problem is that upon recovery the job restarts from the last checkpoint, i.e. the last one taken before the source closed.
  3. When we process once, no explicit cancellation is required. We decided to close the source explicitly rather than wait for an external signal. This is also reasonable for resource utilization.

kl0u commented Jul 5, 2016

@rmetzger I updated the PR.
Please let me know if you have any further comments.

rmetzger commented Jul 5, 2016

Thank you. I'll merge it.

kl0u commented Jul 5, 2016

Thanks @rmetzger!
