Path scanning does deep scans for whitelisted paths #282

GedMarc · 2018-11-06T20:46:07Z

When enabling path scanning - deep scans are still performed unless explicitly black listed - causing a lot of overhead.

Path White List : META-INF
Path Black List: META-INF/maven, META-INF/resources, META-INF/services (yes I can do whitelist no subdir but its a demo)

Paths deep scanned that shouldn't be - nothing not located under root/META-INF, current scanning the entire class path to the .class level
Attached the list is perty big :)

output.txt

Thanks!

GedMarc · 2018-11-06T20:46:59Z

Adding path exclusions for com/jwebmp and org/hibernate reduces the time quite a lot - also as you can see the scan takes 0.6 seconds to complete, but doing the file scan is taking a second

lukehutch · 2018-11-06T22:54:03Z

What you are probably seeing is this:

https://github.com/classgraph/classgraph/releases/tag/classgraph-4.4.3

What this means is that all paths (directory entries and zipfile directories) in all classpath elements have to be scanned, in order to know where resources are located, so that after reading the contents of whitelisted classfiles, any superclasses, interfaces and annotations that are not whitelisted can be quickly manually located, and the class graph can be traced "upwards" to the last reachable superclass / interface / annotation.

Unfortunately this incurs a small penalty for directory-based classpath elements, because the whole directory tree has to be read. This is unavoidable if this "upwards scanning" is to be implemented, and this needed to be fixed for a long time. (Before, only directories that may contain a whitelisted path are recursed to.)

However, the contents of files are not read unless they are actually whitelisted, so very few additional files are actually ever opened as a result of this change (and recursively enumerating files in directories does take time, but that's really only an issue on HDDs, not SSDs). And for jarfile classpath elements and modules, the code already had to read all paths within the jar / module, since there was no way to scan recursively (you only have a list of all resources within the jar / module) -- so there is little to no performance hit for anything but directories. In your case, everything on your classpath should be a jarfile or module, so you shouldn't see much of a performance hit.

However, I made one other change, either in that release or one around that time, which was to log everything scanned, even if not whitelisted. This makes the logs much bigger, but I figured it would be easier to find bugs that way. So the logfiles are huge now. Logs are only written if .verbose() is called, so if logging is off, there is no performance penalty for this.

Please let me know how much of a performance hit you see from scanning on 4.4.2 vs. 4.4.3 with logging disabled. Thanks.

GedMarc · 2018-11-06T23:43:54Z

Hmm OK, I understand, A few questions around this though
a) Wouldn't the upwards search only ever take effect if class scanning/info is turned on? Path items would never have "super classes' or upwards navigation? Are we assuming that path and class info scans will always include class files?
b) I think there may be a peak to how many jars and packages before it becomes "feelable"

I didn't switch to the older version to check (parent bom's bla will do it on my next push to maven) - but I did do some basic checking

Basic Path Blacklisting

Adding in the path exclusions to the path scanner

So it looks like 1304ms to 909ms under the same conditions with just the basic exclusions, and wouldn't this be running through a larger list/set of path exclusions which would still add extra overhead? 400ms?

The complete path black list was [META-INF/maven, META-INF/resources, META-INF/services, antlr, com/fasterxml, com/google, com/hazelcast, com/jwebmp, com/microsoft, com/sun, io/github, io/undertow, javax/persistence, javax/servlet, javax/transaction, javax/validation, javax/xml, net/bytebuddy, org/hibernate] - [FINE]

GedMarc · 2018-11-07T00:20:38Z

4.4.2 came in at 794ms

[2018/11/07 02:20:10.611]-Loaded Classpath Scanner - Took [794] millis. - [FINE]

lukehutch · 2018-11-07T00:40:08Z

794 vs. 909? That sounds about right

GedMarc · 2018-11-07T00:43:09Z

Naw 794 vs 1304 with the same configuration,

If i exclude a few of the larger paths then its 909 :)

lukehutch · 2018-11-07T00:45:29Z

I think I figured out a way to get back the lost performance... Rather than exhaustively recursively scanning directories, I can linearly search through class path elements for the classfile only in cases of upward extension beyond the whitelist. I'll try and get that done soon.

In your case, how many jars among your transitive dependencies contain nested jars that are being extracted to temporary files? I am going to try writing a custom unzipped for that to avoid the extraction. See the one open PR.

lukehutch · 2018-11-07T00:46:38Z

1304 vs. 794ms? That's surprising. Are you scanning extracted/exploded jars, ie. Not directly from zipfiles?

lukehutch · 2018-11-07T00:47:14Z

Or how many class path elements are extracted/exploded on disk vs. in a jar?

GedMarc · 2018-11-07T00:52:34Z

Mmm, 110 jars/modules, +_65mb
OOOOOO - Ok the IDE producing the numbers is running exploded with a rather large @ args file
Running java -jar with libs dir is 381ms.

Can take it just as debug/ide overhead I guess
[2018/11/07 02:51:00.236]-Loaded Classpath Scanner - Took [381] millis.

lukehutch · 2018-11-07T00:55:42Z

Ah, that's more what I would expect. This should take the same time before/after the change, for jars/modules. Still, I'll need to implement the optimization I discussed -- almost doubling the time for exploded jars is not OK.

Your project is crazy 🤪

GedMarc · 2018-11-07T01:14:08Z

I think I'm safely at that point where I can start questioning my own sanity instead of the stupid scope of a project xD
But it's still fast as ever (actually faster) than a while ago!

If you want to delve deeper into this -
These numbers are from https://github.com/GedMarc/JWebMPHomePage, you can swop profiles between jdk8 and jdk11, eclipse configs should just work.
Getting a build running is a mvn -Pjdk11 package, then removing index.list from the meta-inf folder of the jar created (this is crashing all jpms projects xD)
There's no nested jars or anything funny like that, its a pure java -jar thing.
HomePageStartup has the static void main

lukehutch · 2018-11-07T01:19:22Z

Actually I'll reopen this as a TODO to myself to optimize exploded jar scanning. Thanks.

lukehutch · 2018-11-28T01:46:21Z

@GedMarc Please test git master to see if it fixes the speed regression you were seeing between 4.4.2 and 4.4.3. Thanks!

lukehutch · 2018-11-29T09:05:16Z

Sorry, I forgot to push the changes to master that I needed you to test :-S

You should see that checking out commit db87473 restores the speed back to 4.4.2 era performance.

I am working on a new custom nested zipfile reader that should deliver dramatically higher performance for jarfiles, so I'll need you to test that too once it's done, but for now please let me know if the above commit solved the regression.

lukehutch · 2018-11-30T11:56:16Z

@GedMarc The new nested zipfile reader was committed in c7c431c (along with some subsequent fixes). This should give you big speedups for classpaths consisting of lots of nested jars.

If you can, please test and let me know your total scan time for:

4.4.2
4.4.3
4.4.12
db87473
master

Thanks!

lukehutch · 2018-12-01T09:52:13Z

I committed lots more changes and optimizations to master. It should be relatively stable now, and temp files are almost never created (even for the rare case of a deflated nested jar, if the jar size is <32MB, it is extracted to RAM rather than disk).

lukehutch · 2018-12-02T17:18:33Z

Everything looked good in my own testing, so I released 4.6.0. I'll close this bug report now that I fixed the performance regression, but please let me know once you test the performance. Thanks!

GedMarc · 2018-12-02T18:00:31Z

Hey sorry, been left and right and center and everywhere!
I have a few screens I want to finish off that use the scanner quite a lot ;)

lukehutch · 2018-12-02T18:49:45Z

All good... I just wanted to make sure it was bug-free before releasing, and I guarantee you are the biggest stress-tester of ClassGraph in the world (probably followed by Lars Grefer, for his joinfaces project)...

I found a ton of bugs (some old, some new), and got it to the point where I couldn't find any more, so I just pushed it out with fingers crossed :-)

GedMarc closed this as completed Nov 7, 2018

lukehutch reopened this Nov 7, 2018

lukehutch added a commit that referenced this issue Nov 29, 2018

Truncate recursive dir scanning when outside whitelist (#282)

798c665

lukehutch closed this as completed Dec 2, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Path scanning does deep scans for whitelisted paths #282

Path scanning does deep scans for whitelisted paths #282

GedMarc commented Nov 6, 2018

GedMarc commented Nov 6, 2018

lukehutch commented Nov 6, 2018

GedMarc commented Nov 6, 2018 •

edited

GedMarc commented Nov 7, 2018

lukehutch commented Nov 7, 2018

GedMarc commented Nov 7, 2018

lukehutch commented Nov 7, 2018

lukehutch commented Nov 7, 2018

lukehutch commented Nov 7, 2018

GedMarc commented Nov 7, 2018 •

edited

lukehutch commented Nov 7, 2018 •

edited

GedMarc commented Nov 7, 2018

lukehutch commented Nov 7, 2018

lukehutch commented Nov 28, 2018

lukehutch commented Nov 29, 2018

lukehutch commented Nov 30, 2018 •

edited

lukehutch commented Dec 1, 2018

lukehutch commented Dec 2, 2018

GedMarc commented Dec 2, 2018

lukehutch commented Dec 2, 2018

Path scanning does deep scans for whitelisted paths #282

Path scanning does deep scans for whitelisted paths #282

Comments

GedMarc commented Nov 6, 2018

GedMarc commented Nov 6, 2018

lukehutch commented Nov 6, 2018

GedMarc commented Nov 6, 2018 • edited

GedMarc commented Nov 7, 2018

lukehutch commented Nov 7, 2018

GedMarc commented Nov 7, 2018

lukehutch commented Nov 7, 2018

lukehutch commented Nov 7, 2018

lukehutch commented Nov 7, 2018

GedMarc commented Nov 7, 2018 • edited

lukehutch commented Nov 7, 2018 • edited

GedMarc commented Nov 7, 2018

lukehutch commented Nov 7, 2018

lukehutch commented Nov 28, 2018

lukehutch commented Nov 29, 2018

lukehutch commented Nov 30, 2018 • edited

lukehutch commented Dec 1, 2018

lukehutch commented Dec 2, 2018

GedMarc commented Dec 2, 2018

lukehutch commented Dec 2, 2018

GedMarc commented Nov 6, 2018 •

edited

GedMarc commented Nov 7, 2018 •

edited

lukehutch commented Nov 7, 2018 •

edited

lukehutch commented Nov 30, 2018 •

edited