Reduce scan time by avoid having redundant file access #850
Conversation
Even though the Files API might be considered the newer variant, on Windows systems certain of its methods are four times slower than their File counterparts.
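The two APIs answer the same questions, which is why the swap is safe. A minimal sketch of the equivalence (the class and method names here are hypothetical, chosen for illustration; the performance claim about Windows is the author's, not verified by this snippet):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileVsFilesCheck {
    // Legacy java.io variant: a single boolean stat-style query.
    public static boolean isFileLegacy(String pathStr) {
        return new File(pathStr).isFile();
    }

    // NIO variant: functionally equivalent, but Files.isRegularFile
    // reads a full attribute set under the hood, which can cost more
    // on some platforms.
    public static boolean isFileNio(String pathStr) {
        return Files.isRegularFile(Paths.get(pathStr));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        try {
            // Both report the same answer for the same path.
            System.out.println(isFileLegacy(tmp.toString()));
            System.out.println(isFileNio(tmp.toString()));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Since both calls agree on the result, choosing between them is purely a performance decision.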
Both FileSlice & PathSlice do the same thing, so there is no need for this additional check.
It was probably only there to set the open state to true, so that close works as intended.
The file attributes are already used when creating a new Resource, so we can easily pass them along to the Resource for further usage and thereby share the already-loaded data for a Path. If a subPath turns out to be a file, we no longer need to check at the end whether it is a directory, so we now remove the files from pathsInDir.
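The idea of sharing already-loaded attributes can be sketched as follows. This is an illustrative stand-in, not ClassGraph's actual Resource class; the names `Resource`, `open`, and `length` are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class AttrReuse {
    // Hypothetical Resource stand-in: it holds the attributes that
    // were already read during the directory scan, so later size or
    // isFile queries need no further filesystem access.
    static final class Resource {
        final Path path;
        final BasicFileAttributes attrs;
        Resource(Path path, BasicFileAttributes attrs) {
            this.path = path;
            this.attrs = attrs;
        }
        long length() { return attrs.size(); }              // no extra stat call
        boolean isFile() { return attrs.isRegularFile(); }  // no extra stat call
    }

    public static Resource open(Path path) throws IOException {
        // Single stat: read the attributes once and hand them to the
        // Resource, instead of re-querying the filesystem per call.
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        return new Resource(path, attrs);
    }
}
```

The design point is that one `readAttributes` call yields size, file/directory status, and timestamps together, so passing the result around eliminates repeated per-question filesystem hits.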
The previous commits introduced some code that would not work on JRE 7; that is fixed here.
Flip the condition check to test the fileName first, since it is cheaper, and also use the built-in filtering. The latter does not really contribute to the performance benefit, though.
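The cheap-check-first reordering relies on `&&`/early-return short-circuiting: the in-memory string test runs first, and the filesystem probe is only reached when the name already matches. A hedged sketch (the method name and `.class` filter are hypothetical, for illustration only):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ShortCircuitOrder {
    // Order conditions so the cheap in-memory string check runs
    // first; the expensive filesystem call only executes when the
    // name already looks acceptable.
    public static boolean isAcceptedClassfile(Path path) {
        String fileName = path.getFileName().toString();
        // Cheap check first: no I/O involved.
        if (!fileName.endsWith(".class")) {
            return false;
        }
        // Expensive check second: hits the filesystem.
        return Files.isRegularFile(path);
    }
}
```

For a directory dominated by non-matching names, most iterations never touch the filesystem at all, which is exactly where the savings come from on stat-heavy platforms.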
ClasspathElementDir already checks whether the resource can be accessed and is a file before creating it. So, to avoid redundant file access, we do not repeat the same check in a PathSlice created via these Resources.
@attilapuskas Looks good to me -- this is an epic PR! Thank you!
Released in 4.8.171. Many thanks!
Thanks for the very quick response! :)
// Determine whether this is a modular jar running under JRE 9+
final boolean isModularJar = VersionFinder.JAVA_MAJOR_VERSION >= 9 && getModuleName() != null;

// Only scan files in directory if directory is not only an ancestor of an accepted path
if (parentMatchStatus != ScanSpecPathMatch.ANCESTOR_OF_ACCEPTED_PATH) {
    // Do preorder traversal (files in dir, then subdirs), to reduce filesystem cache misses
-   for (final Path subPath : pathsInDir) {
+   for (final Path subPath : new ArrayList<>(pathsInDir)) {
I think you could avoid the copy of the list and the subsequent linear scans for pathsInDir.remove by using an Iterator instead of the for-loop, along with Iterator.remove.
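The reviewer's suggestion can be sketched like this. The element type and the "is a file" predicate are hypothetical placeholders; the point is the removal mechanism:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorRemoveDemo {
    // Instead of iterating a defensive copy and calling
    // List.remove(Object) -- a linear scan per removal -- iterate
    // the list itself and use Iterator.remove(), which deletes the
    // current element in place without a ConcurrentModificationException
    // and without rescanning the list.
    public static void removeFiles(List<String> pathsInDir) {
        for (Iterator<String> it = pathsInDir.iterator(); it.hasNext(); ) {
            String subPath = it.next();
            // Hypothetical convention: entries ending in "/" are
            // directories; everything else is a file to be removed.
            if (!subPath.endsWith("/")) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>(Arrays.asList("sub/", "A.class", "B.txt"));
        removeFiles(paths);
        System.out.println(paths); // only the directory entry survives
    }
}
```

For an ArrayList this still shifts trailing elements per removal, but it avoids both the up-front copy and the O(n) `equals`-based search that `List.remove(Object)` performs for each deleted entry.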
Indeed. I just picked one of the smallest and easiest changes to avoid any side effects, and considered that copy rather cheap anyway, at least negligible with respect to the overall performance.
@@ -396,16 +401,19 @@ private void scanPathRecursively(final Path path, final LogNode log) {
        return;
    }
    Collections.sort(pathsInDir);
+   FileUtils.FileAttributesGetter getFileAttributes = FileUtils.createCachedAttributesGetter();
Can you help me understand the purpose of the cached attributes here? As far as I can see, each path is used only once with getFileAttributes. I'm likely missing something.
Nice catch! I believe when I introduced the cache-based lookup it was still needed. Maybe I had a version where 'Recurse into subdirectories' also used the already-cached attributes to check whether a path is a directory, and I missed removing it from the final state.
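To see why the reviewer's observation matters, here is a minimal sketch of what a cached attributes getter amounts to (this is not ClassGraph's actual FileUtils implementation; class and field names are made up for the demonstration). A memoizing cache only pays off when the same path is queried more than once; if each path is looked up exactly once, the cache adds map overhead without saving any filesystem access:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

public class CachedAttrs {
    // Memoize the stat result per path, so a second lookup for the
    // same path is served from memory.
    private final Map<Path, BasicFileAttributes> cache = new HashMap<>();
    int statCalls = 0;  // counter for demonstration only

    BasicFileAttributes get(Path path) {
        return cache.computeIfAbsent(path, p -> {
            statCalls++;  // only incremented on a cache miss
            try {
                return Files.readAttributes(p, BasicFileAttributes.class);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

Two calls for the same path trigger a single filesystem stat; with strictly unique paths the hit rate is zero, which is exactly the situation the review comment identified.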
Feel free to take another pass at this! Thanks!
These changes are meant to reduce file access in general while keeping the scan results the same as before.
Most of the time this means removing duplicated canRead, isFile, and similar calls, which are usually top contributors to scan time, especially on Windows-based systems with any kind of antivirus/firewall enabled.
I have tested these changes on a project with ~500 classpath entries, of which ~380 are jar files. Thanks to the extended ClassGraph configuration it was possible to filter out most of the irrelevant artifacts, but even in that state the improvement was measurable.