Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make GeoIpProcessor backing database instance pluggable. #93285

Merged
merged 8 commits into from
Feb 2, 2023

Conversation

jbaiera
Copy link
Member

@jbaiera jbaiera commented Jan 26, 2023

This PR introduces two new interfaces to the ingest-geoip module: GeoIpDatabase and GeoIpDatabaseProvider. These interfaces allow downstream embedded users (like the Logstash project) to substitute the existing GeoIp database management code with custom code that handles provisioning and looking up ip data.

The GeoIpDatabaseProvider acts as a generic interface for creating GeoIpDatabase instances. This allows for specifying how database instances are obtained by the processor. An existing reference implementation of this interface would be the DatabaseNodeService which handles updates to local database files, and pools database readers to be shared between processor calls.

The GeoIpDatabase interface encompasses the API footprint for performing GeoIp lookups against a Maxmind database. An existing reference implementation of this interface would be the DatabaseReaderLazyLoader which executes operations against a lazily loaded database in memory and manages an internal result cache.

Introduces two new interfaces: GeoIpDatabase and GeoIpDatabaseProvider. GeoIpDatabaseProvider acts as a generic factory interface for GeoIpDatabase instances. This allows for specifying how database instances are obtained to the processor. GeoIpDatabase encompasses the API footprint for performing GeoIp lookups against a maxmind database.
@jbaiera jbaiera added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.7.0 labels Jan 26, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @jbaiera, I've created a changelog YAML for you.

@jbaiera jbaiera requested review from jsvd, yaauie and masseyke and removed request for jsvd January 27, 2023 00:04

package org.elasticsearch.ingest.geoip;

import com.maxmind.geoip2.model.AsnResponse;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're making an interface, do we want to take this opportunity to remove the direct dependency on maxmind? Or is it pretty certain that that's the only option we'll ever support?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good question. I think if we were planning on embedding this more widely we would want to provide a facade for the Maxmind libraries so they don't leak out.

I imagine it wouldn't be too difficult to knock those out for this project. I'm just concerned how much logic is already Maxmind specific in the processor and whether it makes sense to go for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things that are Maxmind specific in the GeoIpProcessor right now:

  • The format expected for the database types (e.g. -City, -Country, -ASN)
  • Specific logic to remain compliant with the Maxmind database licensing
  • The set of default database names for determining if a custom database is specified
  • The results we set on ingest documents are based on those available from Maxmind currently. Other database formats would likely need to match capabilities or we'd need to either verify capabilities match the expected fields on the processor or fill the gaps with null values.
  • Many error messages mention Maxmind directly

I think without concrete plans for allowing other geo ip database formats, I'd push back on the idea of putting a facade over the Maxmind library unless folks harbor overwhelming support for the idea.

* @return the database type as it is detailed in the database file metadata
* @throws IOException if the database file could not be read or the data could not be accessed
*/
String getDatabaseType() throws IOException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really familiar with the type. Is there a way to make it an enum (city, country, etc)? Or is it really open-ended?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifically the type that is listed in the Maxmind database file metadata. From the looks of things it identifies the original filename in the cases I tested. There is also a database type enum that exists in the Maxmind libraries but I don't know how difficult it is to lazily load that type enum from the database file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A vague memory from digging in the maxmind code ~8 years ago gives me the spidey-sense that the database name encodes all of the types of responses you can get out of it, including what is also available from commercially-licensed variants? It is a very vague memory at this point.

Copy link
Member

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new interfaces look fine, but there is no way to provide them to the GeoIpProcessor from outside the package -- its constructor that accepts them is package-private and the GeoIpProcessor.Factory constructor still requires a specific implementation of the new interface.

Additionally, the GeoIpProcessor.Factory requires a ClusterService (which pluggable implementations won't have), which is used only to generate a Supplier<Boolean> validity check for the configured database path.

I have made a proof-of-concept stab that I think will solve both issues over on main...yaauie:elasticsearch:geo-ip-db-interfaces-pluggable -- it is certainly the minimum-viable hack from my end and includes no tests, but I can instantiate it from "the outside" in my other work.

@jbaiera
Copy link
Member Author

jbaiera commented Jan 31, 2023

@yaauie Ah yeah I missed the constructor issues on the first pass. These shouldn't be too hard to handle. As for the factory portion of it, my thought was originally that it would be easiest to just instantiate the processor itself, but that means there is no configuration parsing logic being done.

Would you rather get an instance of the processor by calling its constructor, or would you prefer to get an instance of the factory and call that for the processor instance? If you'd rather the factory route, I can move the validation logic to the provider interface and then we just need to pass the provider into the factory instead of requiring an additional dependency on the cluster service.

@jbaiera
Copy link
Member Author

jbaiera commented Jan 31, 2023

@yaauie I have updated the GeoIpProcessor.Factory constructor to only take a GeoIpDatabaseProvider instance. I've moved the validation logic into the DatabaseNodeService so now there is a isValid(databaseName) method on the interface that will need to be implemented. There should be no more dependencies on the ClusterService in the factory. They should all be moved to underneath the ES specific implementations for the GeoIpDatabase interfaces now.

@jbaiera jbaiera requested a review from yaauie January 31, 2023 22:35
Copy link
Member

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interfaces that I need to interact with and/or implement all look sensible.

@jbaiera jbaiera marked this pull request as ready for review February 2, 2023 16:57
@jbaiera jbaiera requested a review from masseyke February 2, 2023 16:57
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 2, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Member

@masseyke masseyke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

@jbaiera
Copy link
Member Author

jbaiera commented Feb 2, 2023

@elasticmachine update branch

@jbaiera jbaiera merged commit c143caf into elastic:main Feb 2, 2023
@jbaiera jbaiera deleted the geo-ip-db-interfaces branch February 2, 2023 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants