Make the ingest-geoip databases even lazier to load #36679
Conversation
Today we try to load the ingest-geoip databases lazily. Currently they are loaded as soon as any pipeline that uses an ingest-geoip processor is created. This is not lazy enough. For example, we could only load the databases the first time that they are actually used. This would ensure that we load the minimal set of data to support an in-use pipeline (instead of *all* of the data). This can come up in a couple of ways. One is when a subset of the database is used (e.g., the city database versus the country database versus the ASN database). Another is when the plugins are installed on non-ingest nodes (e.g., master-only nodes); we would never use the database in this case yet they are currently being loaded occupying ~60 MB of the heap. This commit makes the ingest-geoip databases as lazy as possible.
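The core idea can be sketched outside Elasticsearch as a tiny double-checked lazy holder. This is illustrative only (the names `Lazy`, `isLoaded`, and the supplier are made up here, not the actual `DatabaseReaderLazyLoader` code):

```java
import java.util.function.Supplier;

// Illustrative sketch of the lazy-loading idea (not the actual
// DatabaseReaderLazyLoader): the expensive resource is created on first
// get(), so a node that never runs the processor never pays the ~60 MB.
public class Lazy<T> {
    private final Supplier<T> factory;
    private volatile T value;

    public Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    // True only once the resource has actually been materialized; an
    // accessor like this lets a test verify that nothing was loaded.
    public boolean isLoaded() {
        return value != null;
    }

    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    value = result = factory.get();
                }
            }
        }
        return result;
    }
}
```

With a holder like this, creating a pipeline only wires up the supplier; the database file is read when the first document actually flows through the processor.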
Pinging @elastic/es-core-features
@elasticmachine run gradle build tests 1
👍 - This is a good improvement!
> Another is when the plugins are installed on non-ingest nodes (e.g., master-only nodes);
Maybe we should also look into not loading ingest pipelines at all on master-only nodes.
There is a substantial downside to this pull request, which is that we no longer eagerly validate configuration. I am exploring options to maintain this.
We go through a validation step on the master on put pipeline requests. 😢
plugins/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpProcessor.java (resolved)
plugins/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpProcessor.java (resolved)
Oops. I forgot about that.
We can try to read the database type from the geoip files; that way we can still validate the geoip properties in the factory.
Something like this should work:

```java
Path path = PathUtils.get(file);
long fileSize = Files.size(path);
final int[] DATABASE_TYPE_MARKER = {'d', 'a', 't', 'a', 'b', 'a', 's', 'e', '_', 't', 'y', 'p', 'e'};
try (InputStream in = Files.newInputStream(path)) {
    // read last 512 bytes, for all 3 databases this is sufficient
    in.skip(fileSize - 512);
    byte[] tail = new byte[512];
    in.read(tail);
    // Find the database_type header:
    int metadataOffset = -1;
    int markerOffset = 0;
    for (int i = 0; i < tail.length; i++) {
        byte b = tail[i];
        if (b == DATABASE_TYPE_MARKER[markerOffset]) {
            markerOffset++;
        } else {
            markerOffset = 0;
        }
        if (markerOffset == DATABASE_TYPE_MARKER.length) {
            metadataOffset = i + 1;
            break;
        }
    }
    // Read the database type
    int offsetByte = tail[metadataOffset] & 0xFF;
    int type = offsetByte >>> 5;
    if (type != 2) {
        throw new RuntimeException("type must be UTF8_STRING");
    }
    int size = offsetByte & 0x1f;
    String dataBaseType = new String(tail, metadataOffset + 1, size, StandardCharsets.UTF_8);
    System.out.println(dataBaseType);
}
```

This looks for the database type header in the last bit of the geoip db file.
@elasticmachine run gradle build tests 2
I left a few comments. It would also be great to have a test that verifies that adding a pipeline does not load the database on the elected master node. Maybe we can add a single-node test that adds a pipeline with a geoip processor, then add a method isLoaded() to DatabaseReaderLazyLoader, and then in this new test assert that this method returns false.
```diff
@@ -98,7 +115,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
     final InetAddress ipAddress = InetAddresses.forString(ip);

     Map<String, Object> geoData;
-    String databaseType = dbReader.getMetadata().getDatabaseType();
+    String databaseType = dbReader.get().getMetadata().getDatabaseType();
```
maybe also use dbReader.getDatabaseType() here?
```diff
@@ -119,7 +136,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
             geoData = Collections.emptyMap();
         }
     } else {
-        throw new ElasticsearchParseException("Unsupported database type [" + dbReader.getMetadata().getDatabaseType()
+        throw new ElasticsearchParseException("Unsupported database type [" + dbReader.get().getMetadata().getDatabaseType()
```
same here?
```diff
@@ -64,14 +65,30 @@

     private final String field;
     private final String targetField;
-    private final DatabaseReader dbReader;
+    private final DatabaseReaderLazyLoader dbReader;
```
maybe rename dbReader to dbLoader?
This does mean the lazy loader class will have to be opened up. The reason is that in some tests we don't read from a file on disk but from an embedded resource. That's different than reading a file, so for that to work I have to add hooks to allow that test to override how to get the size of the stream and how to open a stream to that resource. That's why the class is no longer final. I went down this path and then reverted it all, and forgot to revert the non-final change.
gotcha, thanks for explaining.
This comment should have been a thread on [this review comment]. Continuing here, I pushed this change. Do you want to take a look?
```diff
 /**
  * Facilitates lazy loading of the database reader, so that when the geoip plugin is installed, but not used,
  * no memory is being wasted on the database reader.
  */
-final class DatabaseReaderLazyLoader implements Closeable {
+class DatabaseReaderLazyLoader implements Closeable {
```
I think this class can remain final?
Note the test.
Cool, I think that is sufficient.
```diff
 import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
 import static org.hamcrest.Matchers.equalTo;

+public class GeoIpProcessorNonIngestNodeIT extends ESIntegTestCase {
```
👍
LGTM - Left a small comment.
```diff
     this.databaseFileName = databaseFileName;
     this.loader = loader;
+    // cache the database type so that we do not re-read it on every pipeline execution
+    final SetOnce<String> databaseType;
```
👍 - great idea to cache the database types.
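The caching idea can be sketched with a minimal SetOnce-style holder. The real code uses Lucene's org.apache.lucene.util.SetOnce; everything below, including the `CachedDatabaseType` name and the supplier parameter, is illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative SetOnce-style cache for the database type: the result of the
// header scan is stored on first use and reused by every later call.
public class CachedDatabaseType {
    private final AtomicReference<String> cached = new AtomicReference<>();

    // 'readType' stands in for the expensive tail-of-file header scan.
    public String get(Supplier<String> readType) {
        String type = cached.get();
        if (type == null) {
            // First successful set wins; subsequent calls see the cached value.
            cached.compareAndSet(null, readType.get());
            type = cached.get();
        }
        return type;
    }
}
```

This keeps pipeline execution cheap: the file header is parsed once per database, not once per document.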
```diff
             break;
         }
     }
```
maybe add:

```java
if (metadataOffset == -1) {
    throw new IOException("database type marker not found");
}
```

Otherwise it would fail with an ArrayIndexOutOfBoundsException.
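Putting the earlier snippet and this guard together, here is a standalone sketch of the scan over an in-memory byte array, so it can be exercised without a real MaxMind database file. It is simplified and illustrative (the class name is made up, it assumes the type string fits in the 5-bit short-string length, and it additionally restarts partial matches against the marker's first byte on a mismatch, a small robustness tweak not in the original snippet):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Standalone sketch of the database_type header scan over an in-memory tail;
// the real code would read the last bytes of the database file instead.
public class DatabaseTypeScanner {
    private static final byte[] MARKER = "database_type".getBytes(StandardCharsets.UTF_8);

    public static String readDatabaseType(byte[] tail) throws IOException {
        int metadataOffset = -1;
        int markerOffset = 0;
        for (int i = 0; i < tail.length; i++) {
            if (tail[i] == MARKER[markerOffset]) {
                markerOffset++;
            } else {
                // restart, re-checking the current byte against the marker start
                markerOffset = tail[i] == MARKER[0] ? 1 : 0;
            }
            if (markerOffset == MARKER.length) {
                metadataOffset = i + 1;
                break;
            }
        }
        if (metadataOffset == -1) {
            throw new IOException("database type marker not found");
        }
        int offsetByte = tail[metadataOffset] & 0xFF;
        if ((offsetByte >>> 5) != 2) { // top three control bits: 2 == UTF8_STRING
            throw new IOException("type must be UTF8_STRING");
        }
        int size = offsetByte & 0x1f; // short strings keep their length in the low five bits
        return new String(tail, metadataOffset + 1, size, StandardCharsets.UTF_8);
    }
}
```

Feeding it a synthetic tail containing the marker followed by a UTF8_STRING control byte and the type name recovers that name, while a tail without the marker now fails with a clear IOException instead of an ArrayIndexOutOfBoundsException.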
I pushed 5a0ee8f.
@elasticmachine run the default distro tests
* elastic/master: (31 commits)
  enable bwc tests and switch transport serialization version to 6.6.0 for CAS features
  [DOCs] Adds ml-cpp PRs to alpha release notes (elastic#36790)
  Synchronize WriteReplicaResult callbacks (elastic#36770)
  Add CcrRestoreSourceService to track sessions (elastic#36578)
  [Painless] Add tests for boxed return types (elastic#36747)
  Internal: Remove originalSettings from Node (elastic#36569)
  [ILM][DOCS] Update ILM API authorization docs (elastic#36749)
  Core: Deprecate use of scientific notation in epoch time parsing (elastic#36691)
  [ML] Merge the Jindex master feature branch (elastic#36702)
  Tests: Mute SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization
  Update versions in SearchSortValues transport serialization
  Update version in SearchHits transport serialization
  [Geo] Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default `geo_shape` indexing approach (elastic#36751)
  [Docs] Fix error in Common Grams Token Filter (elastic#36774)
  Fix rollup search statistics (elastic#36674)
  SQL: Fix wrong appliance of StackOverflow limit for IN (elastic#36724)
  [TEST] Added more logging
  Invalidate Token API enhancements - HLRC (elastic#36362)
  Deprecate types in index API (elastic#36575)
  Disable bwc tests until elastic#36555 backport is complete (elastic#36737)
  ...
@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 2
@elasticmachine run gradle build tests 2
Today we try to load the ingest-geoip databases lazily. Currently they are loaded as soon as any pipeline that uses an ingest-geoip processor is created. This is not lazy enough. For example, we could only load the databases the first time that they are actually used. This would ensure that we load the minimal set of data to support an in-use pipeline (instead of *all* of the data). This can come up in a couple of ways. One is when a subset of the database is used (e.g., the city database versus the country database versus the ASN database). Another is when the plugins are installed on non-ingest nodes (e.g., master-only nodes); we would never use the database in this case yet they are currently being loaded occupying ~60 MB of the heap. This commit makes the ingest-geoip databases as lazy as possible. Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Thanks @martijnvg!