
STORM-2373: HDFS Spout should support multiple ignore extensions #1957

Closed
wants to merge 13 commits

Conversation

@pasalkarsachin1 (Contributor) commented Feb 22, 2017

This change adds support for multiple ignore extensions, for the cases below:

  1. Another distcp process copies data into the source directory and creates files with the _COPYING_ extension.
  2. The user has files with extensions that are not supported by the reader.

@pasalkarsachin1 (Contributor, Author)

Can someone take a look at this?


/**
* @deprecated use {@link #setIgnoreSuffix(List)} instead.
*/
public HdfsSpout setIgnoreSuffix(String ignoreSuffix) {
Contributor

No need to deprecate the old method and add a new one. You can just extend the existing method to accept a space-separated list of file extensions. That would preserve compatibility and extend it at the same time; the conversion to an array can be done internally.

@pasalkarsachin1 (Contributor, Author)

Done. Suffixes separated by ','
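The approach settled on here, keeping the single-string setter but splitting on ',' internally, can be sketched roughly as follows (the class and field names are hypothetical, not the actual HdfsSpout internals):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a backward-compatible setter: the old
// single-suffix signature is kept, but the value may now be a
// comma-separated list such as ".ignore,_COPYING_".
public class SuffixConfig {
    private List<String> ignoreSuffixes = new ArrayList<>();

    public SuffixConfig setIgnoreSuffix(String suffixes) {
        // Conversion to a list is done internally, as the reviewer suggested.
        this.ignoreSuffixes = Arrays.asList(suffixes.split(","));
        return this; // fluent style, matching the spout's other setters
    }

    public List<String> getIgnoreSuffixes() {
        return ignoreSuffixes;
    }
}
```

A single suffix still works unchanged, so existing callers are unaffected.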

@pasalkarsachin1 (Contributor, Author)

@roshannaik Can you review it again, please?

*/
public static ArrayList<Path> listFilesByModificationTimeWithIgnoreSuffixes(FileSystem fs, Path directory,
long olderThan, List<String> ignoreSuffixes) throws IOException {

Contributor

Remove the empty line.

* @return
*/
private static boolean filterSufix(String name, List<String> ignoreSuffixes) {

Contributor

Same here.
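For context, the predicate these hunks are building — skip a file when its name ends with any configured suffix — can be sketched as below (a hypothetical standalone helper, not the PR's exact filterSufix body):

```java
import java.util.List;

public class SuffixFilter {
    // True when the file name ends with any of the ignore suffixes,
    // e.g. a file still being written by distcp ("data.txt._COPYING_").
    public static boolean hasIgnoredSuffix(String name, List<String> ignoreSuffixes) {
        for (String suffix : ignoreSuffixes) {
            if (name.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```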

@harshach (Contributor)

@pasalkarsachin1 looks good, minor nits. Also update the doc here with the new config changes: https://github.com/apache/storm/tree/master/external/storm-hdfs#hdfs-spout

@pasalkarsachin1 (Contributor, Author)

@harshach Done. Thanks for the review.

@roshannaik (Contributor) left a comment

Have you been able to do any manual testing to validate this? It would be nice to include a UT.

@@ -447,7 +449,12 @@ public void open(Map conf, TopologyContext context, SpoutOutputCollector collect

// -- ignore file names config
if ( conf.containsKey(Configs.IGNORE_SUFFIX) ) {
this.ignoreSuffix = conf.get(Configs.IGNORE_SUFFIX).toString();
ignoreSuffixes = Arrays.asList(conf.get(Configs.IGNORE_SUFFIX).toString().split(","));
Contributor

This won't handle inputs like ".ignore1, .ignore2 ,.ignore3", where the user may put one or more spaces before or after the comma. We should allow for that. It would also be nice to trim off any spaces at the start and end of the entire string.
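A whitespace-tolerant version of the split, along the lines the reviewer asks for, might look like this (a sketch, not the code that was eventually committed):

```java
import java.util.Arrays;
import java.util.List;

public class SuffixParser {
    // Trim the whole string first, then split on a comma with optional
    // whitespace on either side, so ".ignore1, .ignore2 ,.ignore3"
    // yields [".ignore1", ".ignore2", ".ignore3"].
    public static List<String> parseSuffixes(String raw) {
        return Arrays.asList(raw.trim().split("\\s*,\\s*"));
    }
}
```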

@@ -547,7 +548,7 @@ Only methods mentioned in **bold** are required.
| **.setArchiveDir()** |~~hdfsspout.archive.dir~~ | | After a file is processed completely it will be moved to this HDFS directory. If this directory does not exist it will be created. E.g. /data/done|
| **.setBadFilesDir()** |~~hdfsspout.badfiles.dir~~ | | if there is an error parsing a file's contents, the file is moved to this location. If this directory does not exist it will be created. E.g. /data/badfiles |
| .setLockDir() |~~hdfsspout.lock.dir~~ | '.lock' subdirectory under hdfsspout.source.dir | Dir in which lock files will be created. Concurrent HDFS spout instances synchronize using *lock* files. Before processing a file the spout instance creates a lock file in this directory with same name as input file and deletes this lock file after processing the file. Spouts also periodically makes a note of their progress (wrt reading the input file) in the lock file so that another spout instance can resume progress on the same file if the spout dies for any reason.|
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | File names with this suffix in the in the hdfsspout.source.dir location will not be processed|
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | You can provide comma separated suffixes. File names with this suffixes in the hdfsspout.source.dir location will not be processed|
Contributor

Suggest rephrasing the description to
"Comma separated list of file name suffixes. File with matching suffixes in the hdfsspout.source.dir location will not be processed"

@pasalkarsachin1 (Contributor, Author)

@roshannaik I have added a test case too. We have been running this code in our staging environment.

@@ -547,7 +548,7 @@ Only methods mentioned in **bold** are required.
| **.setArchiveDir()** |~~hdfsspout.archive.dir~~ | | After a file is processed completely it will be moved to this HDFS directory. If this directory does not exist it will be created. E.g. /data/done|
| **.setBadFilesDir()** |~~hdfsspout.badfiles.dir~~ | | if there is an error parsing a file's contents, the file is moved to this location. If this directory does not exist it will be created. E.g. /data/badfiles |
| .setLockDir() |~~hdfsspout.lock.dir~~ | '.lock' subdirectory under hdfsspout.source.dir | Dir in which lock files will be created. Concurrent HDFS spout instances synchronize using *lock* files. Before processing a file the spout instance creates a lock file in this directory with same name as input file and deletes this lock file after processing the file. Spouts also periodically makes a note of their progress (wrt reading the input file) in the lock file so that another spout instance can resume progress on the same file if the spout dies for any reason.|
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | File names with this suffix in the in the hdfsspout.source.dir location will not be processed|
| .setIgnoreSuffix() |~~hdfsspout.ignore.suffix~~ | .ignore | Comma separated list of file name suffixes. File with matching suffixes in the hdfsspout.source.dir location will not be processed |
Contributor

A minor correction ... it should say "Files with matching suffixes". Looks good otherwise.

@pasalkarsachin1 (Contributor, Author)

@roshannaik Done

@roshannaik (Contributor)

You made the fix in the wrong place. The sentence should read
"Comma separated list of file name suffixes. Files with matching suffixes in the hdfsspout.source.dir location will not be processed ".

@pasalkarsachin1 (Contributor, Author)

@roshannaik My bad. I have corrected it.

@roshannaik (Contributor)

+1
Thanks for the contribution.

@pasalkarsachin1 (Contributor, Author)

Can I squash my changes then?

@roshannaik (Contributor)

sure

@pasalkarsachin1 (Contributor, Author) commented Mar 1, 2017

@roshannaik Can you help me squash this? I am not able to do it :(

@roshannaik (Contributor)

This changes is just to support multiple ignore extension for below cases
	Incase there is another distcp process which copies data to source directory & creates extension of _COPYING_
	User have different file extensions which are not supported by reader
@pasalkarsachin1 (Contributor, Author)

@roshannaik Thanks for the document. Can you merge 8da34ca? Then I will close the PR.

@pasalkarsachin1 (Contributor, Author)

Can someone merge it?

@harshach (Contributor) commented Mar 7, 2017

@pasalkarsachin1 I am going to merge it in, but first can you squash your commits into a single commit?

@pasalkarsachin1 (Contributor, Author)

@harshach I have already squashed it in 8da34ca.

@harshach (Contributor) commented Mar 7, 2017

@pasalkarsachin1 this PR still shows 13 commits. You can do git rebase -i HEAD~13 on this branch, squash the commits, and force-push to the same branch. I don't want to cherry-pick a single commit; it will be cleaner if we do this via the PR.
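The squash workflow described above can be sketched in a throwaway repository (all names here are hypothetical; `git reset --soft` is used as a non-interactive stand-in for the manual "pick"-to-"squash" edit that `git rebase -i HEAD~13` requires):

```shell
set -e

# Throwaway repo standing in for the PR branch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"

# One base commit, then three commits standing in for the PR's thirteen.
echo base > file.txt; git add file.txt; git commit -qm "base"
for i in 1 2 3; do
    echo "$i" >> file.txt
    git commit -qam "PR commit $i"
done

# Squash the three PR commits into one. Interactively you would run
# "git rebase -i HEAD~3" and mark all but the first as "squash";
# "git reset --soft" achieves the same end state here: it moves HEAD
# back while keeping the combined changes staged.
git reset --soft HEAD~3
git commit -qm "STORM-2373: HDFS Spout should support multiple ignore extensions"

git rev-list --count HEAD   # base + one squashed commit = 2

# After rewriting history, the PR branch must be force-pushed:
#   git push --force origin <branch>
```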

@roshannaik (Contributor)

@pasalkarsachin1 it's unclear what you have done with the latest (13th) commit.
What you need is the first 12 commits squashed into 1, which will leave you with 1 commit only (not 13 like you have now). I think you may need to get rid of the latest 13th commit before you squash the remaining 12 to get this right.

@pasalkarsachin1 (Contributor, Author)

@roshannaik @harshach Can you help me merge it? Unfortunately I am not able to squash it :(

d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
@asfgit closed this in #2880 on Oct 22, 2018