Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest geoip processor cache 'no results' from the database #104092

Merged
merged 5 commits into from Jan 10, 2024

Conversation

joegallo
Copy link
Contributor

@joegallo joegallo commented Jan 8, 2024

For my own future reference, here's some history of the geoip cache: it was originally added in #22231, rewritten for better performance in #33029, and tweaked for better performance in #80737 and #92372.

The maxmind DatabaseReader's tryCity/tryCountry/tryAsn methods return an empty Optional if you ask about an ip address for which the database doesn't have a record, and we very quickly convert that into a null value (see DatabaseReaderLazyLoader#getResponse).

Externally, the cache has two way logic: it either returns a response if there was one, or it returns null if there's nothing in the database for that ip address. That was true before this PR, and it remains true after this PR.

Internally, the cache used to have two way logic: it would A.) return a result from the cache if there was one, OR B.) it would hit the underlying database if it had either never seen the ip address before or if the ip address wasn't in the database.

After this pull request that logic changes to three way logic: it will A.) return a result from the cache if there is one, OR B.) return null 'from the cache' if there isn't a record of the ip address in the database, OR C.) hit the underlying database if it hasn't seen the ip address before.

Generally speaking, I don't expect the databases to have records for "private internets" (https://www.ietf.org/rfc/rfc1918.txt) so if you're feeding a lot of those sorts of ip addresses into the geoip processor, then this change will result in significantly better performance for you (since we'll cache 'no-result' rather than hitting the underlying database each time).

In some ad-hoc testing on my machine, a cache hit is about an order of magnitude cheaper than hitting the actual database, so the more we can exercise the cache, the faster we'll be.

rather than only cache actual-result responses. Various internal and
private ips aren't expected to be in the database. If we get no result
for an ip, it's not like that is going to change if we check again.
@joegallo joegallo added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.13.0 labels Jan 8, 2024
@joegallo joegallo requested a review from masseyke January 8, 2024 20:06
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

Copy link
Member

@masseyke masseyke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little nervous about unintended side effects, but it definitely makes sense that we'd want to cache cache misses. And if due to some weird distribution of data a user finds themselves now getting 0 cache hits (and they probably weren't getting many cache hits before), they can can increase their cache size and/or start mapping their invalid IP addresses to something valid in a step before the geoip processor.

@joegallo joegallo merged commit 53db0b3 into elastic:main Jan 10, 2024
15 checks passed
@joegallo joegallo deleted the ingest-geoip-cache-no-result branch January 10, 2024 14:55
jedrazb pushed a commit to jedrazb/elasticsearch that referenced this pull request Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants