[SPARK-3916] [Streaming] discover new appended data for fileStream() #2806
QA tests have started for PR 2806 at commit
QA tests have finished for PR 2806 at commit
Test FAILed.
QA tests have started for PR 2806 at commit
QA tests have finished for PR 2806 at commit
Test FAILed.
QA tests have started for PR 2806 at commit
Tests timed out for PR 2806 at commit
QA tests have started for PR 2806 at commit
QA tests have finished for PR 2806 at commit
@tdas Could you help review this? The failing tests pass consistently when run locally; I'm investigating.
QA tests have started for PR 2806 at commit
QA tests have finished for PR 2806 at commit
@davies this is a significant PR. Let's talk about it after the 1.2 rush is over.
There has been significant refactoring done in the FileInputStream. Can you update the PR accordingly?
Also, I took a quick look at the PR. It seems a little complicated to understand just by looking at the code, so could you write a short design doc (or update the PR description) on the high-level technique used to implement this? It does not have to be very detailed, just enough for anyone to understand the logic and then verify it in the code.
Since we are not working on this feature right now, would you mind closing this? We can reopen it when we want to work on it.
When new data is continuously appended to existing files, fileStream() should discover the newly appended data. This patch adds that ability to fileStream().
To build an RDD from only part of a file, the patch adds a private partialHadoopRDD() API.
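To illustrate the general idea for reviewers, here is a minimal sketch in plain Python, without Spark. It is not the patch's implementation: `AppendTracker`, `poll`, and `read_slice` are hypothetical names. It assumes files only grow, and it remembers how many bytes of each file were already handed out so that each poll returns only the newly appended byte range.

```python
import os
from typing import Dict, List, Tuple

# (path, start_offset, length) describing bytes that have not been processed yet
FileSlice = Tuple[str, int, int]


class AppendTracker:
    """Hypothetical helper: tracks how many bytes of each file were already consumed."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.consumed: Dict[str, int] = {}  # path -> bytes already handed out

    def poll(self) -> List[FileSlice]:
        """Return one slice per file that is new or has grown since the last poll."""
        slices: List[FileSlice] = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            start = self.consumed.get(path, 0)
            if size > start:
                # Only the range [start, size) is new in this batch.
                slices.append((path, start, size - start))
                self.consumed[path] = size
        return slices


def read_slice(file_slice: FileSlice) -> bytes:
    """Read only the byte range described by the slice, not the whole file."""
    path, start, length = file_slice
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)
```

In this sketch each polling interval yields (path, offset, length) triples, which is roughly the kind of input an RDD over partial file data would need. It ignores truncated or replaced files and records that are only half written at the end of a batch, both of which a real implementation would have to handle.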
cc @tdas