Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path scanning does deep scans for whitelisted paths #282

Closed
GedMarc opened this issue Nov 6, 2018 · 20 comments
Closed

Path scanning does deep scans for whitelisted paths #282

GedMarc opened this issue Nov 6, 2018 · 20 comments

Comments

@GedMarc
Copy link

GedMarc commented Nov 6, 2018

When enabling path scanning - deep scans are still performed unless explicitly black listed - causing a lot of overhead.

Path White List : META-INF
Path Black List: META-INF/maven, META-INF/resources, META-INF/services (yes I can do whitelist no subdir but its a demo)

Paths deep scanned that shouldn't be - nothing not located under root/META-INF, current scanning the entire class path to the .class level
Attached the list is perty big :)

output.txt

Thanks!

@GedMarc
Copy link
Author

GedMarc commented Nov 6, 2018

Adding path exclusions for com/jwebmp and org/hibernate reduces the time quite a lot - also as you can see the scan takes 0.6 seconds to complete, but doing the file scan is taking a second

@lukehutch
Copy link
Member

What you are probably seeing is this:

https://github.com/classgraph/classgraph/releases/tag/classgraph-4.4.3

What this means is that all paths (directory entries and zipfile directories) in all classpath elements have to be scanned, in order to know where resources are located, so that after reading the contents of whitelisted classfiles, any superclasses, interfaces and annotations that are not whitelisted can be quickly manually located, and the class graph can be traced "upwards" to the last reachable superclass / interface / annotation.

Unfortunately this incurs a small penalty for directory-based classpath elements, because the whole directory tree has to be read. This is unavoidable if this "upwards scanning" is to be implemented, and this needed to be fixed for a long time. (Before, only directories that may contain a whitelisted path are recursed to.)

However, the contents of files are not read unless they are actually whitelisted, so very few additional files are actually ever opened as a result of this change (and recursively enumerating files in directories does take time, but that's really only an issue on HDDs, not SSDs). And for jarfile classpath elements and modules, the code already had to read all paths within the jar / module, since there was no way to scan recursively (you only have a list of all resources within the jar / module) -- so there is little to no performance hit for anything but directories. In your case, everything on your classpath should be a jarfile or module, so you shouldn't see much of a performance hit.

However, I made one other change, either in that release or one around that time, which was to log everything scanned, even if not whitelisted. This makes the logs much bigger, but I figured it would be easier to find bugs that way. So the logfiles are huge now. Logs are only written if .verbose() is called, so if logging is off, there is no performance penalty for this.

Please let me know how much of a performance hit you see from scanning on 4.4.2 vs. 4.4.3 with logging disabled. Thanks.

@GedMarc
Copy link
Author

GedMarc commented Nov 6, 2018

Hmm OK, I understand, A few questions around this though
a) Wouldn't the upwards search only ever take effect if class scanning/info is turned on? Path items would never have "super classes' or upwards navigation? Are we assuming that path and class info scans will always include class files?
b) I think there may be a peak to how many jars and packages before it becomes "feelable"

I didn't switch to the older version to check (parent bom's bla will do it on my next push to maven) - but I did do some basic checking

Basic Path Blacklisting

image

Adding in the path exclusions to the path scanner

image

image

So it looks like 1304ms to 909ms under the same conditions with just the basic exclusions, and wouldn't this be running through a larger list/set of path exclusions which would still add extra overhead? 400ms?

The complete path black list was [META-INF/maven, META-INF/resources, META-INF/services, antlr, com/fasterxml, com/google, com/hazelcast, com/jwebmp, com/microsoft, com/sun, io/github, io/undertow, javax/persistence, javax/servlet, javax/transaction, javax/validation, javax/xml, net/bytebuddy, org/hibernate] - [FINE]

@GedMarc
Copy link
Author

GedMarc commented Nov 7, 2018

4.4.2 came in at 794ms

[2018/11/07 02:20:10.611]-Loaded Classpath Scanner - Took [794] millis. - [FINE]

@lukehutch
Copy link
Member

794 vs. 909? That sounds about right

@GedMarc
Copy link
Author

GedMarc commented Nov 7, 2018

Naw 794 vs 1304 with the same configuration,

If i exclude a few of the larger paths then its 909 :)

@lukehutch
Copy link
Member

I think I figured out a way to get back the lost performance... Rather than exhaustively recursively scanning directories, I can linearly search through class path elements for the classfile only in cases of upward extension beyond the whitelist. I'll try and get that done soon.

In your case, how many jars among your transitive dependencies contain nested jars that are being extracted to temporary files? I am going to try writing a custom unzipped for that to avoid the extraction. See the one open PR.

@lukehutch
Copy link
Member

1304 vs. 794ms? That's surprising. Are you scanning extracted/exploded jars, ie. Not directly from zipfiles?

@lukehutch
Copy link
Member

Or how many class path elements are extracted/exploded on disk vs. in a jar?

@GedMarc
Copy link
Author

GedMarc commented Nov 7, 2018

Mmm, 110 jars/modules, +_65mb
OOOOOO - Ok the IDE producing the numbers is running exploded with a rather large @ args file
Running java -jar with libs dir is 381ms.

Can take it just as debug/ide overhead I guess
[2018/11/07 02:51:00.236]-Loaded Classpath Scanner - Took [381] millis.

@lukehutch
Copy link
Member

lukehutch commented Nov 7, 2018

Ah, that's more what I would expect. This should take the same time before/after the change, for jars/modules. Still, I'll need to implement the optimization I discussed -- almost doubling the time for exploded jars is not OK.

Your project is crazy 🤪

@GedMarc
Copy link
Author

GedMarc commented Nov 7, 2018

I think I'm safely at that point where I can start questioning my own sanity instead of the stupid scope of a project xD
But it's still fast as ever (actually faster) than a while ago!

If you want to delve deeper into this -
These numbers are from https://github.com/GedMarc/JWebMPHomePage, you can swop profiles between jdk8 and jdk11, eclipse configs should just work.
Getting a build running is a mvn -Pjdk11 package, then removing index.list from the meta-inf folder of the jar created (this is crashing all jpms projects xD)
There's no nested jars or anything funny like that, its a pure java -jar thing.
HomePageStartup has the static void main

@GedMarc GedMarc closed this as completed Nov 7, 2018
@lukehutch
Copy link
Member

Actually I'll reopen this as a TODO to myself to optimize exploded jar scanning. Thanks.

@lukehutch lukehutch reopened this Nov 7, 2018
@lukehutch
Copy link
Member

@GedMarc Please test git master to see if it fixes the speed regression you were seeing between 4.4.2 and 4.4.3. Thanks!

@lukehutch
Copy link
Member

Sorry, I forgot to push the changes to master that I needed you to test :-S

You should see that checking out commit db87473 restores the speed back to 4.4.2 era performance.

I am working on a new custom nested zipfile reader that should deliver dramatically higher performance for jarfiles, so I'll need you to test that too once it's done, but for now please let me know if the above commit solved the regression.

@lukehutch
Copy link
Member

lukehutch commented Nov 30, 2018

@GedMarc The new nested zipfile reader was committed in c7c431c (along with some subsequent fixes). This should give you big speedups for classpaths consisting of lots of nested jars.

If you can, please test and let me know your total scan time for:

  • 4.4.2
  • 4.4.3
  • 4.4.12
  • db87473
  • master

Thanks!

@lukehutch
Copy link
Member

I committed lots more changes and optimizations to master. It should be relatively stable now, and temp files are almost never created (even for the rare case of a deflated nested jar, if the jar size is <32MB, it is extracted to RAM rather than disk).

@lukehutch
Copy link
Member

Everything looked good in my own testing, so I released 4.6.0. I'll close this bug report now that I fixed the performance regression, but please let me know once you test the performance. Thanks!

@GedMarc
Copy link
Author

GedMarc commented Dec 2, 2018

Hey sorry, been left and right and center and everywhere!
I have a few screens I want to finish off that use the scanner quite a lot ;)

@lukehutch
Copy link
Member

All good... I just wanted to make sure it was bug-free before releasing, and I guarantee you are the biggest stress-tester of ClassGraph in the world (probably followed by Lars Grefer, for his joinfaces project)...

I found a ton of bugs (some old, some new), and got it to the point where I couldn't find any more, so I just pushed it out with fingers crossed :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants