-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-2373: HDFS Spout should support multiple ignore extensions #1957
Conversation
Can someone take a look at this? |
|
||
/** | ||
* @deprecated use {@link #setIgnoreSuffix(List)} instead. | ||
*/ | ||
public HdfsSpout setIgnoreSuffix(String ignoreSuffix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No needed to deprecate old method and add new one. You just extend the existing method to accept a space separate list of file extensions. That would preserve compatibility and extend it at the same time. conversion to an array can be done internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Suffixes separated by ','
@roshannaik Can you again review it please? |
*/ | ||
public static ArrayList<Path> listFilesByModificationTimeWithIgnoreSuffixes(FileSystem fs, Path directory, | ||
long olderThan, List<String> ignoreSuffixes) throws IOException { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove empty line.
* @return | ||
*/ | ||
private static boolean filterSufix(String name, List<String> ignoreSuffixes) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
@pasalkarsachin1 looks good. Minor nits. Also update the doc here with new config changes https://github.com/apache/storm/tree/master/external/storm-hdfs#hdfs-spout |
@harshach Done. Thanks for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you been able to do any manual testing to validate this ? Would be nice to include a UT.
@@ -447,7 +449,12 @@ public void open(Map conf, TopologyContext context, SpoutOutputCollector collect | |||
|
|||
// -- ignore file names config | |||
if ( conf.containsKey(Configs.IGNORE_SUFFIX) ) { | |||
this.ignoreSuffix = conf.get(Configs.IGNORE_SUFFIX).toString(); | |||
ignoreSuffixes = Arrays.asList(conf.get(Configs.IGNORE_SUFFIX).toString().split(",")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wont handle inputs like: ".ignore1, .ignore2 ,.ignore3" where user may put one or more spaces before/after the comma. We should allow for that. Also would be nice to trim off any spaces at the start and end of the entire string.
external/storm-hdfs/README.md
Outdated
@@ -547,7 +548,7 @@ Only methods mentioned in **bold** are required. | |||
| **.setArchiveDir()** |~~hdfsspout.archive.dir~~ | | After a file is processed completely it will be moved to this HDFS directory. If this directory does not exist it will be created. E.g. /data/done| | |||
| **.setBadFilesDir()** |~~hdfsspout.badfiles.dir~~ | | if there is an error parsing a file's contents, the file is moved to this location. If this directory does not exist it will be created. E.g. /data/badfiles | | |||
| .setLockDir() |~~hdfsspout.lock.dir~~ | '.lock' subdirectory under hdfsspout.source.dir | Dir in which lock files will be created. Concurrent HDFS spout instances synchronize using *lock* files. Before processing a file the spout instance creates a lock file in this directory with same name as input file and deletes this lock file after processing the file. Spouts also periodically makes a note of their progress (wrt reading the input file) in the lock file so that another spout instance can resume progress on the same file if the spout dies for any reason.| | |||
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | File names with this suffix in the in the hdfsspout.source.dir location will not be processed| | |||
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | You can provide comma separated suffixes. File names with this suffixes in the hdfsspout.source.dir location will not be processed| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest rephrasing the description to
"Comma separated list of file name suffixes. File with matching suffixes in the hdfsspout.source.dir location will not be processed"
@roshannaik I have added test case too. We have running this code in our staging environment |
external/storm-hdfs/README.md
Outdated
@@ -547,7 +548,7 @@ Only methods mentioned in **bold** are required. | |||
| **.setArchiveDir()** |~~hdfsspout.archive.dir~~ | | After a file is processed completely it will be moved to this HDFS directory. If this directory does not exist it will be created. E.g. /data/done| | |||
| **.setBadFilesDir()** |~~hdfsspout.badfiles.dir~~ | | if there is an error parsing a file's contents, the file is moved to this location. If this directory does not exist it will be created. E.g. /data/badfiles | | |||
| .setLockDir() |~~hdfsspout.lock.dir~~ | '.lock' subdirectory under hdfsspout.source.dir | Dir in which lock files will be created. Concurrent HDFS spout instances synchronize using *lock* files. Before processing a file the spout instance creates a lock file in this directory with same name as input file and deletes this lock file after processing the file. Spouts also periodically makes a note of their progress (wrt reading the input file) in the lock file so that another spout instance can resume progress on the same file if the spout dies for any reason.| | |||
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | File names with this suffix in the in the hdfsspout.source.dir location will not be processed| | |||
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | Comma separated list of file name suffixes. File with matching suffixes in the hdfsspout.source.dir location will not be processed | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor correction ... it should say "Files with matching suffixes". Looks good otherwise.
@roshannaik Done |
You made the fix in the wrong place. The sentence should read |
@roshannaik My bad. I have corrected it. |
+1 |
Can I squash my changes then? |
sure |
updated as per comment
…into STORM-2373
@roshannaik Can you help me in squash this? I am not able to do it :( |
This changes is just to support multiple ignore extension for below cases Incase there is another distcp process which copies data to source directory & creates extension of _COPYING_ User have different file extensions which are not supported by reader
…into STORM-2373
@roshannaik Thanks for document. Can you merge 8da34ca? Then I will close PR. |
Can someone merge it? |
@pasalkarsachin1 I am going to merge it in but before can you squash your commits into a single commit. |
@pasalkarsachin1 this PR still shows commits as 13. you can do git rebase -i HEAD~13 on this branch and squash commits and do a force push to same branch. I don't want to cherry-pick single commit. It will cleaner if we do this via PR. |
@pasalkarsachin1 its unclear what you have done with the latest (13th) commit. |
@roshannaik @harshach Can you help me in merging it. Unfortunately I am not able to squash it :( |
We are closing stale Pull Requests to make the list more manageable. Please re-open any Pull Request that has been closed in error. Closes apache#608 Closes apache#639 Closes apache#640 Closes apache#648 Closes apache#662 Closes apache#668 Closes apache#692 Closes apache#705 Closes apache#724 Closes apache#728 Closes apache#730 Closes apache#753 Closes apache#803 Closes apache#854 Closes apache#922 Closes apache#986 Closes apache#992 Closes apache#1019 Closes apache#1040 Closes apache#1041 Closes apache#1043 Closes apache#1046 Closes apache#1051 Closes apache#1078 Closes apache#1146 Closes apache#1164 Closes apache#1165 Closes apache#1178 Closes apache#1213 Closes apache#1225 Closes apache#1258 Closes apache#1259 Closes apache#1268 Closes apache#1272 Closes apache#1277 Closes apache#1278 Closes apache#1288 Closes apache#1296 Closes apache#1328 Closes apache#1342 Closes apache#1353 Closes apache#1370 Closes apache#1376 Closes apache#1391 Closes apache#1395 Closes apache#1399 Closes apache#1406 Closes apache#1410 Closes apache#1422 Closes apache#1427 Closes apache#1443 Closes apache#1462 Closes apache#1468 Closes apache#1483 Closes apache#1506 Closes apache#1509 Closes apache#1515 Closes apache#1520 Closes apache#1521 Closes apache#1525 Closes apache#1527 Closes apache#1544 Closes apache#1550 Closes apache#1566 Closes apache#1569 Closes apache#1570 Closes apache#1575 Closes apache#1580 Closes apache#1584 Closes apache#1591 Closes apache#1600 Closes apache#1611 Closes apache#1613 Closes apache#1639 Closes apache#1703 Closes apache#1711 Closes apache#1719 Closes apache#1737 Closes apache#1760 Closes apache#1767 Closes apache#1768 Closes apache#1785 Closes apache#1799 Closes apache#1822 Closes apache#1824 Closes apache#1844 Closes apache#1874 Closes apache#1918 Closes apache#1928 Closes apache#1937 Closes apache#1942 Closes apache#1951 Closes apache#1957 Closes apache#1963 Closes apache#1964 Closes apache#1965 Closes apache#1967 Closes apache#1968 Closes apache#1971 Closes apache#1985 Closes apache#1986 Closes apache#1998 Closes apache#2031 Closes apache#2032 Closes apache#2071 Closes apache#2076 Closes apache#2108 Closes apache#2119 Closes apache#2128 Closes apache#2142 Closes apache#2174 Closes apache#2206 Closes apache#2297 Closes apache#2322 Closes apache#2332 Closes apache#2341 Closes apache#2377 Closes apache#2414 Closes apache#2469
We are closing stale Pull Requests to make the list more manageable. Please re-open any Pull Request that has been closed in error. Closes apache#608 Closes apache#639 Closes apache#640 Closes apache#648 Closes apache#662 Closes apache#668 Closes apache#692 Closes apache#705 Closes apache#724 Closes apache#728 Closes apache#730 Closes apache#753 Closes apache#803 Closes apache#854 Closes apache#922 Closes apache#986 Closes apache#992 Closes apache#1019 Closes apache#1040 Closes apache#1041 Closes apache#1043 Closes apache#1046 Closes apache#1051 Closes apache#1078 Closes apache#1146 Closes apache#1164 Closes apache#1165 Closes apache#1178 Closes apache#1213 Closes apache#1225 Closes apache#1258 Closes apache#1259 Closes apache#1268 Closes apache#1272 Closes apache#1277 Closes apache#1278 Closes apache#1288 Closes apache#1296 Closes apache#1328 Closes apache#1342 Closes apache#1353 Closes apache#1370 Closes apache#1376 Closes apache#1391 Closes apache#1395 Closes apache#1399 Closes apache#1406 Closes apache#1410 Closes apache#1422 Closes apache#1427 Closes apache#1443 Closes apache#1462 Closes apache#1468 Closes apache#1483 Closes apache#1506 Closes apache#1509 Closes apache#1515 Closes apache#1520 Closes apache#1521 Closes apache#1525 Closes apache#1527 Closes apache#1544 Closes apache#1550 Closes apache#1566 Closes apache#1569 Closes apache#1570 Closes apache#1575 Closes apache#1580 Closes apache#1584 Closes apache#1591 Closes apache#1600 Closes apache#1611 Closes apache#1613 Closes apache#1639 Closes apache#1703 Closes apache#1711 Closes apache#1719 Closes apache#1737 Closes apache#1760 Closes apache#1767 Closes apache#1768 Closes apache#1785 Closes apache#1799 Closes apache#1822 Closes apache#1824 Closes apache#1844 Closes apache#1874 Closes apache#1918 Closes apache#1928 Closes apache#1937 Closes apache#1942 Closes apache#1951 Closes apache#1957 Closes apache#1963 Closes apache#1964 Closes apache#1965 Closes apache#1967 Closes apache#1968 Closes apache#1971 Closes apache#1985 Closes apache#1986 Closes apache#1998 Closes apache#2031 Closes apache#2032 Closes apache#2071 Closes apache#2076 Closes apache#2108 Closes apache#2119 Closes apache#2128 Closes apache#2142 Closes apache#2174 Closes apache#2206 Closes apache#2297 Closes apache#2322 Closes apache#2332 Closes apache#2341 Closes apache#2377 Closes apache#2414 Closes apache#2469
This changes is just to support multiple ignore extension for below cases