-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make GeoIpProcessor backing database instance pluggable. #93285
Conversation
Introduces two new interfaces: GeoIpDatabase and GeoIpDatabaseProvider. GeoIpDatabaseProvider acts as a generic factory interface for GeoIpDatabase instances. This allows for specifying how database instances are obtained to the processor. GeoIpDatabase encompasses the API footprint for performing GeoIp lookups against a maxmind database.
Hi @jbaiera, I've created a changelog YAML for you. |
|
||
package org.elasticsearch.ingest.geoip; | ||
|
||
import com.maxmind.geoip2.model.AsnResponse; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're making an interface, do we want to take this opportunity to remove the direct dependency on maxmind? Or is it pretty certain that that's the only option we'll ever support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good question. I think if we were planning on embedding this more widely we would want to provide a facade for the Maxmind libraries so they don't leak out.
I imagine it wouldn't be too difficult to knock those out for this project. I'm just concerned how much logic is already Maxmind specific in the processor and whether it makes sense to go for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things that are Maxmind specific in the GeoIpProcessor right now:
- The format expected for the database types (e.g.
-City
,-Country
,-ASN
) - Specific logic to remain compliant with the Maxmind database licensing
- The set of default database names for determining if a custom database is specified
- The results we set on ingest documents are based on those available from Maxmind currently. Other database formats would likely need to match capabilities or we'd need to either verify capabilities match the expected fields on the processor or fill the gaps with null values.
- Many error messages mention Maxmind directly
I think without concrete plans for allowing other geo ip database formats, I'd push back on the idea of putting a facade over the Maxmind library unless folks harbor overwhelming support for the idea.
* @return the database type as it is detailed in the database file metadata | ||
* @throws IOException if the database file could not be read or the data could not be accessed | ||
*/ | ||
String getDatabaseType() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really familiar with the type. Is there a way to make it an enum (city, country, etc)? Or is it really open-ended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically the type that is listed in the Maxmind database file metadata. From the looks of things it identifies the original filename in the cases I tested. There is also a database type enum that exists in the Maxmind libraries but I don't know how difficult it is to lazily load that type enum from the database file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A vague memory from digging in the maxmind code ~8 years ago gives me the spidey-sense that the database name encodes all of the types of responses you can get out of it, including what is also available from commercially-licensed variants? It is a very vague memory at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new interfaces look fine, but there is no way to provide them to the GeoIpProcessor
from outside the package -- its constructor that accepts them is package-private and the GeoIpProcessor.Factory
constructor still requires a specific implementation of the new interface.
Additionally, the GeoIpProcessor.Factory
requires a ClusterService
(which pluggable implementations won't have), which is used only to generate a Supplier<Boolean>
validity check for the configured database path.
I have made a proof-of-concept stab that I think will solve both issues over on main...yaauie:elasticsearch:geo-ip-db-interfaces-pluggable -- it is certainly the minimum-viable hack from my end and includes no tests, but I can instantiate it from "the outside" in my other work.
@yaauie Ah yeah I missed the constructor issues on the first pass. These shouldn't be too hard to handle. As for the factory portion of it, my thought was originally that it would be easiest to just instantiate the processor itself, but that means there is no configuration parsing logic being done. Would you rather get an instance of the processor by calling its constructor, or would you prefer to get an instance of the factory and call that for the processor instance? If you'd rather the factory route, I can move the validation logic to the provider interface and then we just need to pass the provider into the factory instead of requiring an additional dependency on the cluster service. |
@yaauie I have updated the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interfaces that I need to interact with and/or implement all look sensible.
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me.
@elasticmachine update branch |
This PR introduces two new interfaces to the ingest-geoip module:
GeoIpDatabase
andGeoIpDatabaseProvider
. These interfaces allow downstream embedded users (like the Logstash project) to substitute the existing GeoIp database management code with custom code that handles provisioning and looking up ip data.The
GeoIpDatabaseProvider
acts as a generic interface for creatingGeoIpDatabase
instances. This allows for specifying how database instances are obtained by the processor. An existing reference implementation of this interface would be theDatabaseNodeService
which handles updates to local database files, and pools database readers to be shared between processor calls.The
GeoIpDatabase
interface encompasses the API footprint for performing GeoIp lookups against a Maxmind database. An existing reference implementation of this interface would be theDatabaseReaderLazyLoader
which executes operations against a lazily loaded database in memory and manages an internal result cache.