Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce scan time by avoid having redundant file access #850

Merged
merged 9 commits into from
Apr 17, 2024

Conversation

attilapuskas
Copy link

These changes are meant to reduce the general file access and keep the same scan results as before.
Most of the time it is about removing some duplicated canRead, isFile and similar calls that are usually top contibutors of the scanning time especially on Windows based systems with any kind of antivirus/firewall enabled.

I have tested these on a project using ~500 classpath entries, from which ~380 are jar files. Thanks to the extended ClassGraph configuration it was possible to filter out most of the artifacts that are not relevant, but even in that state it was a measurable improvement.

before changes included
all 21 s 11 s
filtered 9 s 6 s

Attila Puskas added 9 commits April 17, 2024 14:32
Even though the Files API might be considered as a newer variant,
but for example on Windows systems certain methods are four times
slower than the File counterparts.
Both FileSlice & PathSlice do the same, so no need for this
additional check.
Probably it was there only to set the open state to true, so that
the close works as intended.
The file attributes are already used when creating a new Resource
so we can easily pass it to the Resource for further usage, thus
can share the already loaded data for a Path.

If from a subPath it turns out to be a file, then we do not need
to check at the end whether it's a directory so we now remove the
files from the pathsInDir.
The previous commits introduced some code that would not work on
JRE 7, which are fixed here.
Flip the condition check to first check the fileName since it is
cheaper and also use the built-in filtering. The latter does not
really contribute to the performance benefit though.
ClasspathElementDir already checks whether the resource can be
accessed and is a file before creating it. So, in order to avoid
the redundant file access we do not check the same in PathSlice
created via these Resources.
@lukehutch
Copy link
Member

@attilapuskas Looks good to me -- this is an epic PR! Thank you!

@lukehutch lukehutch merged commit b92a251 into classgraph:latest Apr 17, 2024
1 check passed
@lukehutch
Copy link
Member

Released in 4.8.171. Many thanks!

@attilapuskas attilapuskas deleted the feature/reduce-file-access branch April 18, 2024 05:18
@attilapuskas
Copy link
Author

Thanks for the very quick response! :)


// Determine whether this is a modular jar running under JRE 9+
final boolean isModularJar = VersionFinder.JAVA_MAJOR_VERSION >= 9 && getModuleName() != null;

// Only scan files in directory if directory is not only an ancestor of an accepted path
if (parentMatchStatus != ScanSpecPathMatch.ANCESTOR_OF_ACCEPTED_PATH) {
// Do preorder traversal (files in dir, then subdirs), to reduce filesystem cache misses
for (final Path subPath : pathsInDir) {
for (final Path subPath : new ArrayList<>(pathsInDir)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could avoid the copy of the list and the subsequent linear scans for pathsInDir.remove by using an Iterator instead of the for-loop along with Iterator.remove.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I just selected one of the smallest & easy change to avoid any side effects and considered that one also rather cheap, at least negligible regarding the whole performance.

@@ -396,16 +401,19 @@ private void scanPathRecursively(final Path path, final LogNode log) {
return;
}
Collections.sort(pathsInDir);
FileUtils.FileAttributesGetter getFileAttributes = FileUtils.createCachedAttributesGetter();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the purpose of the cached attributes here? As far as I see it, each path is used once with the getFileAttributes. I'm likely missing smth.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I believe when I introduced the cache based lookup it was still the case that we needed it. Maybe I had a version were 'Recurse into subdirectories' also used the already cached attributes to check if a path is a directory and I missed it from the final state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to take another pass at this! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants