
HIVE-24669: Improve FileSystem usage in Hive::loadPartitionInternal #1893

Merged
Merged 8 commits on Jan 26, 2021

Conversation

pvargacl (Contributor)

What changes were proposed in this pull request?

Improve FileSystem usage in Hive::loadPartitionInternal for better performance on S3.

Why are the changes needed?

Performance improvement

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Current unit tests + local performance measurements.

return true;
}
path = path.getParent();
} while (!path.equals(basePath));
@deniskuzZ (Member), Jan 20, 2021:

Is it possible that path.equals(basePath) is never true, or that path becomes null?

pvargacl (Contributor, Author):

It will fail with a NullPointerException if it is called with an unrelated path; I will add the null check.
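
A minimal sketch of the null guard being discussed, assuming the loop walks from a child path up towards basePath; the class, method name, and the per-level check are placeholders, not the actual patch:

import org.apache.hadoop.fs.Path;

class AncestorWalkSketch {
  // Placeholder for the check performed on each level in the real loop.
  static boolean isTargetLevel(Path p) {
    return false;
  }

  static boolean walkUpTo(Path path, Path basePath) {
    do {
      if (isTargetLevel(path)) {
        return true;
      }
      path = path.getParent();
      // Path.getParent() returns null once the root is reached, so guard
      // against it when the method is called with a path unrelated to basePath.
    } while (path != null && !path.equals(basePath));
    return false;
  }
}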

@deniskuzZ (Member):

ok

@steveloughran (Contributor) left a comment:

This is significantly better on S3, as it will go from an O(dirs) number of LIST calls, at 500ms+ per directory, to one LIST per few hundred files; 200 on a versioned bucket, I think.

s3a and soon abfs will do async prefetch of results. If you can overlap any computation with it, then for large/deep directory trees even that list cost might be swallowed. But as that would be a more significant change, it's something to consider in future rather than do in this PR.

Do log the iterator's toString() at debug though; with the HADOOP-16830 patch, the s3a incremental iterators all print out the performance details of their network IO.
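
Hedged illustration of that suggestion (not the PR's code): consume a recursive listing and then log the iterator's toString() at debug, which on s3a with HADOOP-16830 includes the IO statistics of the underlying LIST calls. The class and method names here are made up for the example.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ListingDebugExample {
  private static final Logger LOG = LoggerFactory.getLogger(ListingDebugExample.class);

  static void listAndLog(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // ... process each file status ...
    }
    if (LOG.isDebugEnabled()) {
      // On s3a this prints the iterator's IO statistics after the listing is consumed.
      LOG.debug("Listing iterator: {}", it);
    }
  }
}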

@@ -117,6 +117,10 @@ public static long createTestFileId(
}
return result;
}
public static List<Path> listPath(final FileSystem fs, final Path path, final PathFilter filter,
@steveloughran (Contributor):

  1. If you could process the results with some computation per record, you would get the full benefit of the async page fetch offered by s3a and (soon) abfs; at 600ms a list for 200 records on S3, that's potentially a 3ms/record saving (see the sketch after this list).
  2. If listLocatedFileStatus() logged the toString() value of the iterator at debug, the S3A FS iterator would print out its IOStats, including the number of S3 list requests and min/mean/max durations.
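
A sketch of the per-record processing idea in point 1, under the assumption that the caller can supply the per-record work as a callback; the helper name and Consumer-based shape are illustrative, not part of the patch.

import java.io.IOException;
import java.util.function.Consumer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

final class PerRecordListing {
  // Process each status as it arrives from the RemoteIterator instead of
  // materializing a List first, so computation overlaps with the store's
  // async page fetch.
  static void forEachFile(FileSystem fs, Path dir, Consumer<LocatedFileStatus> work)
      throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      work.accept(it.next());  // next page may be prefetched while this record is processed
    }
  }
}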

pvargacl (Contributor, Author):

Thanks @steveloughran for taking a look at this. I will try to take advantage of the async prefetch feature in later PRs; it seems promising, but it would need a bigger code change. I will check out the IOStats downstream, I have seen it is already available there.

@@ -117,6 +117,10 @@ public static long createTestFileId(
}
return result;
}
public static List<Path> listPath(final FileSystem fs, final Path path, final PathFilter filter,
final boolean recursive) throws IOException {
return listLocatedFileStatus(fs, path, filter, recursive).stream().map(FileStatus::getPath).collect(Collectors.toList());
Member:

formatting: put map/collect on new lines
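
One possible shape of the suggested formatting, with map and collect each on their own line (a fragment of the same utility method shown in the hunk above, not the final commit; listLocatedFileStatus is the helper from that class):

public static List<Path> listPath(final FileSystem fs, final Path path, final PathFilter filter,
    final boolean recursive) throws IOException {
  return listLocatedFileStatus(fs, path, filter, recursive).stream()
      .map(FileStatus::getPath)
      .collect(Collectors.toList());
}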

@deniskuzZ (Member) left a comment:

LGTM
