getSize() and materialize() throw an exception if the collection is created from an existing but empty file. For example:
PCollection<String> data = pipeline.readTextFile(nonEmptyInputPath);
PCollection<String> emptyPCollection = data.filter(new FalseFilterFn());
pipeline.readTextFile(outputPath).getSize(); // throws an exception when outputPath exists but is empty
PCollectionGetSizeTest.java illustrates the issue.
To solve it, I have modified SourceTargetHelper.getPathSize() to return -1 for non-existing files (it used to return 0). This allows client apps to distinguish empty files (size 0) from non-existing files (size -1). InputCollection.getSizeInternal() uses this to throw an exception only for non-existing files.
Note that FileSystem.listStatus(..) returns an empty (not null) array for a non-existing file (i.e. no exception is thrown).
It would be nice if someone could review these changes to ensure that the approach makes sense.
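For reviewers, the idea can be sketched outside of Hadoop as follows. This is only an illustration under stated assumptions, not the code in this pull request: java.io.File stands in for the Hadoop FileSystem API, and the class and method names (PathSizeSketch, newEmptyTempFile) are hypothetical. Only the -1 sentinel and the "throw only for non-existing paths" rule come from the description above.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PathSizeSketch {

    /**
     * Hypothetical stand-in for SourceTargetHelper.getPathSize():
     * returns -1 when the path does not exist, otherwise its size in bytes.
     */
    public static long getPathSize(File path) {
        return path.exists() ? path.length() : -1L;
    }

    /**
     * Hypothetical stand-in for InputCollection.getSizeInternal():
     * fails only for a missing path; an existing but empty file reports 0.
     */
    public static long getSizeInternal(File path) {
        long size = getPathSize(path);
        if (size < 0) {
            throw new IllegalStateException("Input path does not exist: " + path);
        }
        return size;
    }

    /** Creates an existing but empty temporary file for the demo. */
    public static File newEmptyTempFile() {
        try {
            File file = File.createTempFile("empty", ".txt");
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File empty = newEmptyTempFile();
        File missing = new File(empty.getParentFile(), "no-such-file-" + System.nanoTime());
        System.out.println(getSizeInternal(empty)); // existing but empty file: prints 0
        System.out.println(getPathSize(missing));   // missing path: prints -1
    }
}
```

With the previous behaviour (0 for a missing path) the two cases in main would be indistinguishable to the caller.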
add collections getsize test
fix NPE on InputCollection getSize()
Merge remote-tracking branch 'upstream/master'
fix type inport type
merge with the pull #30
Resolve an exception condition on checking the size of existing but e…
Just a small note on FileSystem.listStatus: unfortunately there is a difference in the way the HDFS FileSystem behaves in comparison to the LocalFileSystem (within CDH3). LocalFileSystem will indeed return an empty array of FileStatus objects, but the HDFS FileSystem implementation will return null. It looks like both of these situations are still handled though.
The behavior of the HDFS FileSystem implementation is in line with what would be expected (i.e. throw FileNotFoundException if a file/directory doesn't exist) in the trunk of the Apache Hadoop svn, so we can assume that this will become available in a CDH release sometime in the future.
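To make the null-versus-empty point concrete, here is a minimal sketch of a caller that tolerates both return styles. It is an assumption-laden illustration, not the Crunch code: a long[] of entry lengths stands in for the FileStatus[] returned by listStatus, and ListStatusSketch and totalLength are hypothetical names.

```java
public class ListStatusSketch {

    /**
     * Sums the entry lengths of a listing. A null listing (the HDFS behaviour
     * described above for CDH3) is handled the same way as an empty one
     * (the LocalFileSystem behaviour): both contribute nothing to the total.
     * The long[] is a hypothetical stand-in for the lengths of a FileStatus[].
     */
    public static long totalLength(long[] entryLengths) {
        if (entryLengths == null) {
            return 0L;
        }
        long total = 0L;
        for (long length : entryLengths) {
            total += length;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalLength(null));                // null listing
        System.out.println(totalLength(new long[0]));         // empty listing
        System.out.println(totalLength(new long[] {10, 32})); // two entries
    }
}
```

Note that this normalisation alone cannot tell a missing path from an empty one, so a separate existence check is still needed for that distinction.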
I agree with Gabriel that throwing FileNotFoundException if a path doesn't exist is logical behavior. In fact, I first implemented throwing FNFE in SourceTargetHelper.getPathSize() instead of returning -1, but it had a broader cascading impact on Crunch's implementation, so I replaced it with -1 as an intermediate step.
The main question about the #31 change is whether we should treat "path doesn't exist" differently from an existing but empty path, or not (as in the current implementation).
Agree that throwing FNFE is the right move when the file doesn't exist, and I'm glad about the path we're taking to get there.
Merge pull request #31 from tzolov/master
Resolve an exception on getSize for existing but empty PCollection
Just a note to acknowledge the technical debt that comes with this approach. Although the current solution (i.e. size = -1 if the path does not exist) does the job, it also makes it easy to introduce changes in the future that could lead to a negative size calculation. Currently the code that prevents this from happening is the sz > 0 clause in the PCollectionImpl.getSize() method implementation.
Throwing FNFE seems to be a cleaner, fail-fast approach. Going in that direction, though, will have a broader impact on the Crunch code.
For example, as a checked exception, FNFE would require modifying all getSize methods. Also, SourceTargetHelper.getPathSize() is used for both Source and Target paths, and FNFE is expected to be thrown only for non-existing Source paths. The size of a non-existing Target path is 0 (see PCollectionImpl.getSize()).
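A rough sketch of what the fail-fast variant might look like, with the Source and Target cases split. This is an assumption for discussion only: java.io.File again stands in for the Hadoop FileSystem API, and FailFastSizeSketch, sourceSize and targetSize are hypothetical names that do not exist in Crunch.

```java
import java.io.File;
import java.io.FileNotFoundException;

public class FailFastSizeSketch {

    /**
     * Source paths: a missing path is an error, so fail fast.
     * FileNotFoundException is checked, so every caller must declare or handle it.
     */
    public static long sourceSize(File path) throws FileNotFoundException {
        if (!path.exists()) {
            throw new FileNotFoundException("Source path does not exist: " + path);
        }
        return path.length();
    }

    /** Target paths: a path that has not been written yet simply has size 0. */
    public static long targetSize(File path) {
        return path.exists() ? path.length() : 0L;
    }

    public static void main(String[] args) {
        File missing = new File(System.getProperty("java.io.tmpdir"),
                "no-such-file-" + System.nanoTime());
        System.out.println(targetSize(missing)); // missing target: prints 0
        try {
            sourceSize(missing);
        } catch (FileNotFoundException e) {
            System.out.println("source lookup failed: " + e.getMessage());
        }
    }
}
```

The throws clause on sourceSize is the cascading cost mentioned above: it would propagate to every getSize method that ends up calling it.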