New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Path scanning does deep scans for whitelisted paths #282
Comments
Adding path exclusions for com/jwebmp and org/hibernate reduces the time quite a lot - also as you can see the scan takes 0.6 seconds to complete, but doing the file scan is taking a second |
What you are probably seeing is this: https://github.com/classgraph/classgraph/releases/tag/classgraph-4.4.3 What this means is that all paths (directory entries and zipfile directories) in all classpath elements have to be scanned, in order to know where resources are located, so that after reading the contents of whitelisted classfiles, any superclasses, interfaces and annotations that are not whitelisted can be quickly manually located, and the class graph can be traced "upwards" to the last reachable superclass / interface / annotation. Unfortunately this incurs a small penalty for directory-based classpath elements, because the whole directory tree has to be read. This is unavoidable if this "upwards scanning" is to be implemented, and this needed to be fixed for a long time. (Before, only directories that may contain a whitelisted path are recursed to.) However, the contents of files are not read unless they are actually whitelisted, so very few additional files are actually ever opened as a result of this change (and recursively enumerating files in directories does take time, but that's really only an issue on HDDs, not SSDs). And for jarfile classpath elements and modules, the code already had to read all paths within the jar / module, since there was no way to scan recursively (you only have a list of all resources within the jar / module) -- so there is little to no performance hit for anything but directories. In your case, everything on your classpath should be a jarfile or module, so you shouldn't see much of a performance hit. However, I made one other change, either in that release or one around that time, which was to log everything scanned, even if not whitelisted. This makes the logs much bigger, but I figured it would be easier to find bugs that way. So the logfiles are huge now. Logs are only written if Please let me know how much of a performance hit you see from scanning on 4.4.2 vs. 4.4.3 with logging disabled. Thanks. |
Hmm OK, I understand, A few questions around this though I didn't switch to the older version to check (parent bom's bla will do it on my next push to maven) - but I did do some basic checking Basic Path Blacklisting Adding in the path exclusions to the path scanner So it looks like 1304ms to 909ms under the same conditions with just the basic exclusions, and wouldn't this be running through a larger list/set of path exclusions which would still add extra overhead? 400ms? The complete path black list was [META-INF/maven, META-INF/resources, META-INF/services, antlr, com/fasterxml, com/google, com/hazelcast, com/jwebmp, com/microsoft, com/sun, io/github, io/undertow, javax/persistence, javax/servlet, javax/transaction, javax/validation, javax/xml, net/bytebuddy, org/hibernate] - [FINE] |
4.4.2 came in at 794ms [2018/11/07 02:20:10.611]-Loaded Classpath Scanner - Took [794] millis. - [FINE] |
794 vs. 909? That sounds about right |
Naw 794 vs 1304 with the same configuration, If i exclude a few of the larger paths then its 909 :) |
I think I figured out a way to get back the lost performance... Rather than exhaustively recursively scanning directories, I can linearly search through class path elements for the classfile only in cases of upward extension beyond the whitelist. I'll try and get that done soon. In your case, how many jars among your transitive dependencies contain nested jars that are being extracted to temporary files? I am going to try writing a custom unzipped for that to avoid the extraction. See the one open PR. |
1304 vs. 794ms? That's surprising. Are you scanning extracted/exploded jars, ie. Not directly from zipfiles? |
Or how many class path elements are extracted/exploded on disk vs. in a jar? |
Mmm, 110 jars/modules, +_65mb Can take it just as debug/ide overhead I guess |
Ah, that's more what I would expect. This should take the same time before/after the change, for jars/modules. Still, I'll need to implement the optimization I discussed -- almost doubling the time for exploded jars is not OK. Your project is crazy 🤪 |
I think I'm safely at that point where I can start questioning my own sanity instead of the stupid scope of a project xD If you want to delve deeper into this - |
Actually I'll reopen this as a TODO to myself to optimize exploded jar scanning. Thanks. |
@GedMarc Please test git master to see if it fixes the speed regression you were seeing between 4.4.2 and 4.4.3. Thanks! |
Sorry, I forgot to push the changes to master that I needed you to test :-S You should see that checking out commit db87473 restores the speed back to 4.4.2 era performance. I am working on a new custom nested zipfile reader that should deliver dramatically higher performance for jarfiles, so I'll need you to test that too once it's done, but for now please let me know if the above commit solved the regression. |
I committed lots more changes and optimizations to master. It should be relatively stable now, and temp files are almost never created (even for the rare case of a deflated nested jar, if the jar size is <32MB, it is extracted to RAM rather than disk). |
Everything looked good in my own testing, so I released 4.6.0. I'll close this bug report now that I fixed the performance regression, but please let me know once you test the performance. Thanks! |
Hey sorry, been left and right and center and everywhere! |
All good... I just wanted to make sure it was bug-free before releasing, and I guarantee you are the biggest stress-tester of ClassGraph in the world (probably followed by Lars Grefer, for his joinfaces project)... I found a ton of bugs (some old, some new), and got it to the point where I couldn't find any more, so I just pushed it out with fingers crossed :-) |
When enabling path scanning - deep scans are still performed unless explicitly black listed - causing a lot of overhead.
Path White List : META-INF
Path Black List: META-INF/maven, META-INF/resources, META-INF/services (yes I can do whitelist no subdir but its a demo)
Paths deep scanned that shouldn't be - nothing not located under root/META-INF, current scanning the entire class path to the .class level
Attached the list is perty big :)
output.txt
Thanks!
The text was updated successfully, but these errors were encountered: