Reduce scan time by avoid having redundant file access #850
Conversation
Even though the Files API might be considered the newer variant, on Windows systems certain of its methods are four times slower than their File counterparts.
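The two APIs answer the same questions, which is why the swap is safe. A minimal sketch of the equivalence (the class and method names here are hypothetical, chosen for illustration; the performance claim about Windows is the author's, not verified by this snippet):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileVsFilesCheck {
    // Legacy java.io variant: a single boolean stat-style query.
    public static boolean isFileLegacy(String pathStr) {
        return new File(pathStr).isFile();
    }

    // NIO variant: functionally equivalent, but Files.isRegularFile
    // reads a full attribute set under the hood, which can cost more
    // on some platforms.
    public static boolean isFileNio(String pathStr) {
        return Files.isRegularFile(Paths.get(pathStr));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        try {
            // Both report the same answer for the same path.
            System.out.println(isFileLegacy(tmp.toString()));
            System.out.println(isFileNio(tmp.toString()));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Since both calls agree on the result, choosing between them is purely a performance decision.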
Both FileSlice & PathSlice do the same thing, so there is no need for this additional check.
It was probably only there to set the open state to true, so that close works as intended.
The file attributes are already used when creating a new Resource, so we can easily pass them along to the Resource for further usage and thereby share the already-loaded data for a Path. If a subPath turns out to be a file, we no longer need to check at the end whether it is a directory, so we now remove the files from pathsInDir.
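The idea of sharing already-loaded attributes can be sketched as follows. This is an illustrative stand-in, not ClassGraph's actual Resource class; the names `Resource`, `open`, and `length` are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class AttrReuse {
    // Hypothetical Resource stand-in: it holds the attributes that
    // were already read during the directory scan, so later size or
    // isFile queries need no further filesystem access.
    static final class Resource {
        final Path path;
        final BasicFileAttributes attrs;
        Resource(Path path, BasicFileAttributes attrs) {
            this.path = path;
            this.attrs = attrs;
        }
        long length() { return attrs.size(); }              // no extra stat call
        boolean isFile() { return attrs.isRegularFile(); }  // no extra stat call
    }

    public static Resource open(Path path) throws IOException {
        // Single stat: read the attributes once and hand them to the
        // Resource, instead of re-querying the filesystem per call.
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        return new Resource(path, attrs);
    }
}
```

The design point is that one `readAttributes` call yields size, file/directory status, and timestamps together, so passing the result around eliminates repeated per-question filesystem hits.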
The previous commits introduced some code that would not work on JRE 7; that is fixed here.
Flip the condition check to test the fileName first, since it is cheaper, and also use the built-in filtering. The latter does not really contribute to the performance benefit, though.
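The cheap-check-first reordering relies on `&&`/early-return short-circuiting: the in-memory string test runs first, and the filesystem probe is only reached when the name already matches. A hedged sketch (the method name and `.class` filter are hypothetical, for illustration only):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ShortCircuitOrder {
    // Order conditions so the cheap in-memory string check runs
    // first; the expensive filesystem call only executes when the
    // name already looks acceptable.
    public static boolean isAcceptedClassfile(Path path) {
        String fileName = path.getFileName().toString();
        // Cheap check first: no I/O involved.
        if (!fileName.endsWith(".class")) {
            return false;
        }
        // Expensive check second: hits the filesystem.
        return Files.isRegularFile(path);
    }
}
```

For a directory dominated by non-matching names, most iterations never touch the filesystem at all, which is exactly where the savings come from on stat-heavy platforms.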
ClasspathElementDir already checks whether the resource can be accessed and is a file before creating it. So, to avoid redundant file access, we do not repeat the same check in a PathSlice created via these Resources.
@attilapuskas Looks good to me -- this is an epic PR! Thank you!
Released in 4.8.171. Many thanks!
Thanks for the very quick response! :)
// Determine whether this is a modular jar running under JRE 9+
final boolean isModularJar = VersionFinder.JAVA_MAJOR_VERSION >= 9 && getModuleName() != null;

// Only scan files in directory if directory is not only an ancestor of an accepted path
if (parentMatchStatus != ScanSpecPathMatch.ANCESTOR_OF_ACCEPTED_PATH) {
    // Do preorder traversal (files in dir, then subdirs), to reduce filesystem cache misses
-   for (final Path subPath : pathsInDir) {
+   for (final Path subPath : new ArrayList<>(pathsInDir)) {
I think you could avoid the copy of the list and the subsequent linear scans for pathsInDir.remove by using an Iterator instead of the for-loop, along with Iterator.remove.
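The reviewer's suggestion can be sketched like this. The element type and the "is a file" predicate are hypothetical placeholders; the point is the removal mechanism:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorRemoveDemo {
    // Instead of iterating a defensive copy and calling
    // List.remove(Object) -- a linear scan per removal -- iterate
    // the list itself and use Iterator.remove(), which deletes the
    // current element in place without a ConcurrentModificationException
    // and without rescanning the list.
    public static void removeFiles(List<String> pathsInDir) {
        for (Iterator<String> it = pathsInDir.iterator(); it.hasNext(); ) {
            String subPath = it.next();
            // Hypothetical convention: entries ending in "/" are
            // directories; everything else is a file to be removed.
            if (!subPath.endsWith("/")) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>(Arrays.asList("sub/", "A.class", "B.txt"));
        removeFiles(paths);
        System.out.println(paths); // only the directory entry survives
    }
}
```

For an ArrayList this still shifts trailing elements per removal, but it avoids both the up-front copy and the O(n) `equals`-based search that `List.remove(Object)` performs for each deleted entry.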
Indeed. I just picked one of the smallest and easiest changes to avoid any side effects, and considered that copy rather cheap anyway, at least negligible with respect to the overall performance.
@@ -396,16 +401,19 @@ private void scanPathRecursively(final Path path, final LogNode log) {
        return;
    }
    Collections.sort(pathsInDir);
+   FileUtils.FileAttributesGetter getFileAttributes = FileUtils.createCachedAttributesGetter();
Can you help me understand the purpose of the cached attributes here? As far as I can see, each path is used only once with getFileAttributes. I'm likely missing something.
Nice catch! I believe when I introduced the cache-based lookup it was still needed. Maybe I had a version where 'Recurse into subdirectories' also used the already-cached attributes to check whether a path is a directory, and I missed removing it from the final state.
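To see why the reviewer's observation matters, here is a minimal sketch of what a cached attributes getter amounts to (this is not ClassGraph's actual FileUtils implementation; class and field names are made up for the demonstration). A memoizing cache only pays off when the same path is queried more than once; if each path is looked up exactly once, the cache adds map overhead without saving any filesystem access:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

public class CachedAttrs {
    // Memoize the stat result per path, so a second lookup for the
    // same path is served from memory.
    private final Map<Path, BasicFileAttributes> cache = new HashMap<>();
    int statCalls = 0;  // counter for demonstration only

    BasicFileAttributes get(Path path) {
        return cache.computeIfAbsent(path, p -> {
            statCalls++;  // only incremented on a cache miss
            try {
                return Files.readAttributes(p, BasicFileAttributes.class);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

Two calls for the same path trigger a single filesystem stat; with strictly unique paths the hit rate is zero, which is exactly the situation the review comment identified.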
Feel free to take another pass at this! Thanks!
These changes are meant to reduce file access in general while keeping the scan results the same as before.
Most of the time this means removing duplicated canRead, isFile, and similar calls, which are usually top contributors to scan time, especially on Windows-based systems with any kind of antivirus/firewall enabled.
I have tested these changes on a project with ~500 classpath entries, of which ~380 are jar files. Thanks to the extended ClassGraph configuration it was possible to filter out most of the irrelevant artifacts, but even in that state the improvement was measurable.