A utility that returns a map from node -> (number of bytes) for a given path.
This is useful for applications that require data locality, that is, applications that choose to schedule a job on the node where the majority of a file is located.
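To make the node -> (number of bytes) idea concrete, here is a minimal sketch of the aggregation in plain Java. It is not the code from this change: the `Block` class is a hypothetical stand-in for the per-block information Hadoop exposes (block length plus the hosts holding a replica), and the host names and class name `BytesPerNodeSketch` are made up for illustration. The sample data assumes a 15-byte file split into 8-byte blocks, with replica placement chosen arbitrarily.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BytesPerNodeSketch {

    // Hypothetical stand-in for a block's metadata: its length in bytes
    // and the hosts that hold a replica of it.
    public static final class Block {
        final long length;
        final List<String> hosts;

        public Block(long length, List<String> hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Sum, for every host, the bytes of all blocks that host holds.
    public static Map<String, Long> bytesPerNode(List<Block> blocks) {
        Map<String, Long> totals = new TreeMap<>(); // sorted keys for stable output
        for (Block block : blocks) {
            for (String host : block.hosts) {
                totals.merge(host, block.length, Long::sum);
            }
        }
        return totals;
    }

    // Pick the host holding the most bytes; returns null for an empty map.
    public static String bestNode(Map<String, Long> totals) {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> entry : totals.entrySet()) {
            if (entry.getValue() > bestBytes) {
                best = entry.getKey();
                bestBytes = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A 15-byte file with 8-byte blocks: one 8-byte and one 7-byte block.
        // Replica placement below is an assumption for the example.
        List<Block> blocks = Arrays.asList(
            new Block(8, Arrays.asList("node1", "node2")),
            new Block(7, Arrays.asList("node2", "node3")));

        Map<String, Long> totals = bytesPerNode(blocks);
        System.out.println(totals);           // {node1=8, node2=15, node3=7}
        System.out.println(bestNode(totals)); // node2
    }
}
```

A scheduler that cares about locality would prefer `node2` here, since it holds a replica of every block of the file.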
Example test output for a 15-byte file with 8-byte blocks:
p.s. I'm definitely open to a name change.
The "extends Configured" bit handles the getConf/setConf stuff for you; you just need to call:
ToolRunner.run(new Configuration(), new HDFSFileFinder(), args);
in the main method.
Might be worth looking at FileSystem.globStatus here; it seems like giving a glob argument on the command line would be pretty common.