[Spark-14976][Streaming] make StreamingContext.textFileStream support wildcard #12752
mwws wants to merge 4 commits into apache:master
Make StreamingContext.textFileStream support wildcards, and add a related unit test.
Test build #57236 has finished for PR 12752 at commit
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)
    val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
This looks wrong. Now you're only looking for nested directories, right?
No. fs.globStatus returns all the paths that match the file pattern, and it does not descend into nested folders.
Say the folder structure looks like the following:
/root
..../Jack
......../subDir1
......../JackFile1
..../Rose
......../subDir1
......../subDir2
......../RoseFile1
..../file1
Case 1: /root/Jack/subDir1 monitors new files added directly to "/root/Jack/subDir1". If "JackFile1" (which lives in "/root/Jack") is a newly added file, it is NOT in scope.
Case 2: /root/Jack monitors new files added directly to "/root/Jack". New files added to "/root/Jack/subDir1" are NOT in scope.
Case 3: /root/*/subDir1 monitors new files added to both "/root/Jack/subDir1" and "/root/Rose/subDir1".
Case 4: /root/Rose/* monitors new files added to "/root/Rose/subDir1" and "/root/Rose/subDir2".
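The four cases above can be reproduced with a small sketch against Hadoop's local FileSystem. This is only an illustration of the two-step idea (expand the pattern to directories, then list each directory non-recursively), not the code in this PR: the object and method names are hypothetical, the files a.txt and b.txt are added just for the demo, directories are filtered after globStatus rather than through a PathFilter, and hadoop-common is assumed to be on the classpath.

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper object, for illustration only.
object GlobMonitorSketch {
  // Step 1: expand the (possibly wildcarded) pattern and keep only directories.
  // globStatus can return null for a non-glob path that does not exist, hence the Option.
  def monitoredDirs(fs: FileSystem, pattern: Path): Seq[Path] =
    Option(fs.globStatus(pattern)).map(_.toSeq).getOrElse(Seq.empty)
      .filter(_.isDirectory)
      .map(_.getPath)

  // Step 2: list only the files directly inside each matched directory (no recursion).
  def filesIn(fs: FileSystem, dirs: Seq[Path]): Seq[String] =
    dirs.flatMap(dir => fs.listStatus(dir).toSeq)
      .filter(_.isFile)
      .map(_.getPath.getName)

  // Builds the example tree from the comment above in a temp directory.
  // a.txt and b.txt are extra demo files that are not part of the original tree.
  def buildTree(): java.nio.file.Path = {
    val root = Files.createTempDirectory("root")
    Seq("Jack/subDir1", "Rose/subDir1", "Rose/subDir2")
      .foreach(d => Files.createDirectories(root.resolve(d)))
    Seq("Jack/JackFile1", "Rose/RoseFile1", "file1",
        "Jack/subDir1/a.txt", "Rose/subDir1/b.txt")
      .foreach(f => Files.createFile(root.resolve(f)))
    root
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.getLocal(new Configuration())
    val root = buildTree().toString
    // Print the directories each pattern resolves to and the files found in them.
    for (pattern <- Seq("Jack/subDir1", "Jack", "*/subDir1", "Rose/*")) {
      val dirs = monitoredDirs(fs, new Path(root, pattern))
      println(s"$pattern -> dirs=${dirs.map(_.getName)} files=${filesIn(fs, dirs)}")
    }
  }
}
```

Because the pattern is resolved to directories first, a plain directory path with no wildcard still matches exactly that one directory, which is why the existing behavior is preserved.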
Got it, right, so I can still specify a directory and it will match that directory only in the first globStatus. That makes sense. (Nit in the line below: you don't need the extra braces and will probably have to remove the space before 'dir'.)
Test build #57262 has finished for PR 12752 at commit
Test build #57294 has finished for PR 12752 at commit
The failed test is not related to my change (I think PR #12416 broke the Spark CI).
Jenkins, retest this please
Test build #57301 has finished for PR 12752 at commit
I'd prefer someone like @tdas to sign off as it's kind of a change in what the API supports, but it looks reasonable to me as it supports existing behavior and mimics behavior of another similar API.
    var testDir: File = null
    try {
      val batchDuration = Seconds(2)
      val testDir = Utils.createTempDir()
Oh I just noticed this copy-and-pasted an error from the tests above. Remove "val" on this line.
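To see why the extra "val" matters, here is a minimal sketch of the shadowing problem. It assumes the test cleans up testDir in a finally block, which is not shown in the snippet above, and it uses java.nio's createTempDirectory as a stand-in for Utils.createTempDir(); the object and method names are hypothetical.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical illustration; not the test code from this PR.
object ShadowingSketch {
  // Returns true if the finally block saw the directory and cleaned it up.
  def withInnerVal(): Boolean = {
    var testDir: File = null
    var cleanedUp = false
    try {
      // "val" declares a NEW binding that shadows the outer var inside this block.
      val testDir = Files.createTempDirectory("sketch").toFile
      testDir.deleteOnExit()
    } finally {
      // This refers to the outer var, which is still null, so cleanup is skipped.
      if (testDir != null) { testDir.delete(); cleanedUp = true }
    }
    cleanedUp
  }

  def withAssignment(): Boolean = {
    var testDir: File = null
    var cleanedUp = false
    try {
      // Without "val", this assigns to the outer var.
      testDir = Files.createTempDirectory("sketch").toFile
    } finally {
      if (testDir != null) { testDir.delete(); cleanedUp = true }
    }
    cleanedUp
  }

  def main(args: Array[String]): Unit = {
    println(s"inner val cleaned up: ${withInnerVal()}")
    println(s"assignment cleaned up: ${withAssignment()}")
  }
}
```

The shadowed version compiles without error, which is presumably how the mistake survived being copied between tests.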
Test build #58302 has finished for PR 12752 at commit
Merged to master/2.0
… wildcard
## What changes were proposed in this pull request?
make StreamingContext.textFileStream support wildcard like /home/user/*/file
## How was this patch tested?
I did manual test and added a new unit test case
Author: mwws <wei.mao@intel.com>
Author: unknown <maowei@maowei-MOBL.ccr.corp.intel.com>
Closes #12752 from mwws/SPARK_FileStream.
(cherry picked from commit 3359781)
Signed-off-by: Sean Owen <sowen@cloudera.com>