-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ingest] geo-ip performance improvements #33029
Conversation
Re-implement the cache to avoid jackson JSON de-serialization for every IP lookup. The built in maxmind cache caches JsonNode objects. This requires de-serialization for every lookup, even if the object is found in cache. Profiling shows that is very expensive (CPU). The cache will now consist of the fully de-serialized objects. Profiling shows that the new footprint for the CityDB is ~6kb per cache entry. This may result in ~6Mb increase with the 1000 entry default. The performance has been measured up to 40% faster on a modern 4 core/8 thread CPU for an ingest (minimal indexing) workflow. Further, the since prior implementation cached the JsonNode objects, and there is not a 1:1 relationship between an IP lookup / JsonNode object, the default cache size was most likely too small to be very effective. While this change does not change the 1000 default cache size, it will now cache more since there is now a 1:1 relationship between an IP lookup and value in the cache.
Pinging @elastic/es-core-infra |
* cost of deserialization for each lookup (cached or not). This comes at slight expense of higher memory usage, but significant | ||
* reduction of CPU usage. | ||
*/ | ||
class GeoIpCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to an inner class since this needs to be 1 instance per plugin (same cache used across multiple processors).
Jenkins test this please. |
these performance improvements are great! sorry I didn't get a chance to look at this yet. Will allocate some time this week! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me, although I am not that familiar with the previous inner workings.
* cost of deserialization for each lookup (cached or not). This comes at slight expense of higher memory usage, but significant | ||
* reduction of CPU usage. | ||
*/ | ||
class GeoIpCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't this be static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense to allow instances of this class to exist outside of the IngestGeoIpPlugin which it is nested. Hence not static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static is about whether there is state needed from the outer class. This is still package private, so I don't see it as a problem for someone trying to construct this (especially since it is all within the geoid plugin project anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static is about whether there is state needed from the outer class.
Non static inner classes are also an effective for defining the logical scope to which something belongs (if even if the outer state's class is not needed). In this case you need a instance of a plugin to instantiate a cache. This helps to enforce the relationship between these two classes. IMO this helps to enforce this cache is not intended to be used by any other class (especially the by processor class...we don't a cache per processor instance) since it's logical scope is per-plugin. If I could, it would be fully private, but for testing reasons it is not.
This is mostly academic and I don't have strong opposition, just wanted to describe my rationale in more detail.
I'll defer to your opinion and change this (and below) to static.
* provides a means to safely cast the return objects. | ||
* @param <T> The AbstractResponse type used to scope the key and cast the result. | ||
*/ | ||
private class CacheKey<T extends AbstractResponse> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't this be static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense to allow instances of the this class to exist outside the IngestGeoIpPlugin.GeoIpCache class which it is nested. Hence not static.
//package private for testing | ||
GeoIpCache(long maxSize) { | ||
if (maxSize < 0) { | ||
throw new IllegalArgumentException("geoip cache size must be 0 or greater"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache size -> max cache size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The setting which effects this is ingest.geoip.cache_size
hence the wording. I don't have a strong opinion, but do believe it is pretty unlikely anyone would set the value to negative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is actually the max size, I think we should use that wording? We aren't referring to the setting directly in the exception, and documentation + renaming the setting long term can fix the confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. updated to max cache size.
updated branch to re-run CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM.
//package private for testing | ||
GeoIpCache(long maxSize) { | ||
if (maxSize < 0) { | ||
throw new IllegalArgumentException("geoip cache size must be 0 or greater"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is actually the max size, I think we should use that wording? We aren't referring to the setting directly in the exception, and documentation + renaming the setting long term can fix the confusion?
Re-implement the cache to avoid jackson JSON de-serialization for every IP lookup. The built in maxmind cache caches JsonNode objects. This requires de-serialization for every lookup, even if the object is found in cache. Profiling shows that is very expensive (CPU). The cache will now consist of the fully de-serialized objects. Profiling shows that the new footprint for the CityDB is ~6KB per cache entry. This may result in ~6MB increase with the 1000 entry default. The performance has been measured up to 40% faster on a modern 4 core/8 thread CPU for an ingest (minimal indexing) workflow. Further, the since prior implementation cached the JsonNode objects, and there is not a 1:1 relationship between an IP lookup / JsonNode object, the default cache size was most likely too small to be very effective. While this change does not change the 1000 default cache size, it will now cache more since there is now a 1:1 relationship between an IP lookup and value in the cache.
Re-implement the cache to avoid jackson JSON de-serialization for every IP lookup. The built in maxmind cache caches JsonNode objects. This requires de-serialization for every lookup, even if the object is found in cache. Profiling shows that is very expensive (CPU). The cache will now consist of the fully de-serialized objects. Profiling shows that the new footprint for the CityDB is ~6KB per cache entry. This may result in ~6MB increase with the 1000 entry default. The performance has been measured up to 40% faster on a modern 4 core/8 thread CPU for an ingest (minimal indexing) workflow. Further, the since prior implementation cached the JsonNode objects, and there is not a 1:1 relationship between an IP lookup / JsonNode object, the default cache size was most likely too small to be very effective. While this change does not change the 1000 default cache size, it will now cache more since there is now a 1:1 relationship between an IP lookup and value in the cache.
Re-implement the cache to avoid jackson JSON de-serialization for
every IP lookup. The built in maxmind cache caches JsonNode objects.
This requires de-serialization for every lookup, even if the object
is found in cache. Profiling shows that is very expensive (CPU).
The cache will now consist of the fully de-serialized objects.
Profiling shows that the new footprint for the CityDB is ~6KB per cache
entry. This may result in ~6MB increase with the 1000 entry default.
The performance has been measured up to 40% faster on a modern 4 core/8 thread
CPU for an ingest (minimal indexing) workflow.
Further, the since prior implementation cached the JsonNode objects,
and there is not a 1:1 relationship between an IP lookup / JsonNode
object, the default cache size was most likely too small to be very
effective. While this change does not change the 1000 default cache
size, it will now cache more since there is now a 1:1 relationship between
an IP lookup and value in the cache.
This issue was discovered while adding ingest node benchmarks to rally. It is believed that this is the same root cause as maxmind/GeoIP2-java#68.
Profiling before this change:
Via Rally and JFR:
Considerably more expensive then Grok:
Majority of the cost is coming from Jackson deserialization
Similar results as measured locally with YourKit:
The fully populated cache is pretty small , 1000 entries ~= 216KB
Profiling this change:
The Jackson hotspot is now gone:
However our cache size grew to ~6MB for 1000 entries
As mentioned in the commit, there is now a much stronger correlation between 1 cache entry and 1 IP address lookup. So not only are we avoiding the hotspot, we are also effectively caching more with the same 1000 default configuration.
The difference of the cache size is substantial, but imo ~6MB is still with reason especially for the performance gains.
Rally benchmark
Using a track that turns off indexing for all fields (to better isolate ingest performance), I was able to see a ~40% increase in performance on a i7 4 core/8 thread physical node.
The baseline is before this change, and the contender is this change: