Resolve an exception on getSize for existing but empty PCollection #31

merged 6 commits into from May 26, 2012


None yet
3 participants

tzolov commented May 25, 2012

The getSize() and materialize() throw exception if the collection is created from Existing but Empty file. For example:

PCollection<String> data = pipeline.readTextFile(nonEmptyInputPath);
PCollection<String> emptyPCollection = data.filter(new FalseFilterFn());
emptyPCollection.write(sequenceFile(outputPath, strings()));
pipeline.readTextFile(outputPath).getSize() -> Exception!!!

The illustrates the issue

To solve it i have modified the SourceTargetHelper.getPathSize() to return (-1) on non-existing files (use to be 0)! This allows client apps to distinct the empty (e.g. size 0) from non-existing files (size =-1) - used by InputCollection.getSizeInternal() to throw an exception only on non-existing files.
Note that FileSystem.listStatus(..) returns an empty (not null) array instead on non-existing file. (e.g. no exception is thrown)
Additional changes:

  • InputCollection.getSizeInternal() returns ISE in case of negative size (e.g. -1)
  • CompositePathIterable.create() - throws exception on non-existing files.
  • FileSourceImpl.getSize() - re-throws any IOE as ISE (use to return 1L instead)

It will be nice if someone can review those changes to ensure that the approach make sense


gabrielreid commented May 26, 2012

Just a small note on FileSystem.listStatus: unfortunately there is a difference in the way the HDFS FileSystem behaves in comparison to the LocalFileSystem (within CDH3). LocalFileSystem will indeed return an empty array of FileStatus objects, but the HDFS FileSystem implementation will return null. It looks like both of these situations are still handled though.

The behavior of the HDFS FileSystem implementation is in line with what would be expected (i.e. throw FileNotFoundException if a file/directory doesn't exist) in the trunk of the Apache Hadoop svn, so we can assume that this will become available in a CDH release sometime in the future.


tzolov commented May 26, 2012

I agree with Gabriel that throwing FileNotFoundException if a path doesn't exist is a logical behavior. In fact I first implemented throwing FNFE in SourceTargetHelper.getPathSize() instead of -1 but it had broader cascading impact on Crunch's implementation. So I replaced with -1 as an intermediate step.

The main question about #31 change is to decide whether we should threat the "path doesn't exist" differently from existing but empty path or not (as the current implementation)


jwills commented May 26, 2012

Agree that throwing FNFE is the right move when the file doesn't exist, and I'm glad about the path we're taking to get there.

jwills added a commit that referenced this pull request May 26, 2012

Merge pull request #31 from tzolov/master
Resolve an exception on getSize for existing but empty PCollection

@jwills jwills merged commit 2b420a0 into cloudera:master May 26, 2012


tzolov commented May 28, 2012

Just a note to acknowledge the technical depth that comes with this approach. Although the current solution (e.g size=-1 if path not-exist) does the job it also makes it easy to introduce changes in the future that could lead to negative size calculation. Currently the code that prevent this from happening is the sz > 0 clause in the the PCollectionImpl.getSize() method implementation.
Throwing FNFE seems to be cleaner and fail-fast approach. Going in that direction though will have broader impact on the Crunch code.
For example as checked exception the FNFE will require modification of all getSize methods. Also the SourceTargetHelper.getPathSize() is used by the Source as well as Target paths and FNFE is expected to be thrown only in case of Source non-existing paths. The size of Target non-existing Path is 0 (see: PCollectionImpl.getSize()).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment