
Make the ingest-geoip databases even lazier to load #36679

Merged
merged 27 commits into elastic:master from jasontedor:lazier-ingest-geoip on Dec 19, 2018

Conversation

jasontedor
Member

Today we try to load the ingest-geoip databases lazily, but currently they are loaded as soon as any pipeline that uses an ingest-geoip processor is created. This is not lazy enough: we could load the databases only the first time that they are actually used. This would ensure that we load the minimal set of data to support the in-use pipelines (instead of all of the data). This can come up in a couple of ways. One is when only a subset of the databases is used (e.g., the city database versus the country database versus the ASN database). Another is when the plugin is installed on non-ingest nodes (e.g., master-only nodes); we would never use the databases in this case, yet they are currently being loaded, occupying ~60 MB of the heap. This commit makes the ingest-geoip databases as lazy as possible.

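As a rough illustration of the idea (a hypothetical simplification, not the PR's actual DatabaseReaderLazyLoader), the reader can be memoized so that nothing is parsed from disk until the first call site actually needs it:

import com.maxmind.geoip2.DatabaseReader;

import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch: the MaxMind reader is built on first use only, so a
// node that never executes the processor never pays the heap cost.
final class LazyDatabaseReader implements Closeable {

    private final Path databasePath;
    private DatabaseReader reader; // null until first get()

    LazyDatabaseReader(final Path databasePath) {
        this.databasePath = databasePath;
    }

    synchronized DatabaseReader get() throws IOException {
        if (reader == null) {
            // the expensive step: parse the database file into heap
            reader = new DatabaseReader.Builder(databasePath.toFile()).build();
        }
        return reader;
    }

    @Override
    public synchronized void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}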
@jasontedor jasontedor added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 v6.6.0 labels Dec 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jasontedor
Member Author

@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 2

Member

@martijnvg martijnvg left a comment

👍 - This is a good improvement!

> Another is when the plugin is installed on non-ingest nodes (e.g., master-only nodes);

Maybe we should also look into not loading ingest pipelines at all on master-only nodes.

@jasontedor
Member Author

There is a substantial downside to this pull request: we no longer eagerly validate configuration. I am exploring options to keep that validation.

@jasontedor
Member Author

> Maybe we should also look into not loading ingest pipelines at all on master-only nodes.

We go through a validation step on the master on put pipeline requests. 😢

@jasontedor jasontedor added the WIP label Dec 16, 2018
@jasontedor jasontedor changed the title Make the ingest-geoip databases even lazier to load [WIP] Make the ingest-geoip databases even lazier to load Dec 16, 2018
@martijnvg
Member

> We go through a validation step on the master on put pipeline requests. 😢

Oops. I forgot about that.

> There is a substantial downside to this pull request: we no longer eagerly validate configuration. I am exploring options to keep that validation.

We can try to read the database type from the geoip files in IngestGeoIpPlugin without loading the geoip database itself. The header/metadata is at the end of each geoip file, and we could read it via a normal InputStream and extract the database type:

000003a0  ef 4d 61 78 4d 69 6e 64  2e 63 6f 6d e9 5b 62 69  |.MaxMind.com.[bi|
000003b0  6e 61 72 79 5f 66 6f 72  6d 61 74 5f 6d 61 6a 6f  |nary_format_majo|
000003c0  72 5f 76 65 72 73 69 6f  6e a1 02 5b 62 69 6e 61  |r_version..[bina|
000003d0  72 79 5f 66 6f 72 6d 61  74 5f 6d 69 6e 6f 72 5f  |ry_format_minor_|
000003e0  76 65 72 73 69 6f 6e a0  4b 62 75 69 6c 64 5f 65  |version.Kbuild_e|
000003f0  70 6f 63 68 04 02 5c 11  65 bb 4d 64 61 74 61 62  |poch..\.e.Mdatab|
00000400  61 73 65 5f 74 79 70 65  4d 47 65 6f 4c 69 74 65  |ase_typeMGeoLite|
00000410  32 2d 43 69 74 79 4b 64  65 73 63 72 69 70 74 69  |2-CityKdescripti|
00000420  6f 6e e1 42 65 6e 56 47  65 6f 4c 69 74 65 32 20  |on.BenVGeoLite2 |
00000430  43 69 74 79 20 64 61 74  61 62 61 73 65 4a 69 70  |City databaseJip|
00000440  5f 76 65 72 73 69 6f 6e  a1 06 49 6c 61 6e 67 75  |_version..Ilangu|
00000450  61 67 65 73 08 04 42 64  65 42 65 6e 42 65 73 42  |ages..BdeBenBesB|
00000460  66 72 42 6a 61 45 70 74  2d 42 52 42 72 75 45 7a  |frBjaEpt-BRBruEz|
00000470  68 2d 43 4e 4a 6e 6f 64  65 5f 63 6f 75 6e 74 c3  |h-CNJnode_count.|
00000480  3b 4e 34 4b 72 65 63 6f  72 64 5f 73 69 7a 65 a1  |;N4Krecord_size.|
00000490  1c                                                |.|
00000491

that way we can still validate the geoip properties in the factory.

@martijnvg
Member

Something like this should work:

final Path path = PathUtils.get(file);
final long fileSize = Files.size(path);
final int[] DATABASE_TYPE_MARKER = {'d', 'a', 't', 'a', 'b', 'a', 's', 'e', '_', 't', 'y', 'p', 'e'};
try (InputStream in = Files.newInputStream(path)) {
    // read the last 512 bytes (return values of skip/read ignored in this sketch);
    // for all three databases this is sufficient to contain the metadata
    in.skip(fileSize - 512);
    final byte[] tail = new byte[512];
    in.read(tail);

    // find the database_type marker:
    int metadataOffset = -1;
    int markerOffset = 0;
    for (int i = 0; i < tail.length; i++) {
        byte b = tail[i];
        if (b == DATABASE_TYPE_MARKER[markerOffset]) {
            markerOffset++;
        } else {
            markerOffset = 0;
        }
        if (markerOffset == DATABASE_TYPE_MARKER.length) {
            metadataOffset = i + 1;
            break;
        }
    }

    // read the database type: the control byte encodes the MMDB field type in its
    // upper three bits (2 == UTF-8 string) and the string length in its lower five bits
    int offsetByte = tail[metadataOffset] & 0xFF;
    int type = offsetByte >>> 5;
    if (type != 2) {
        throw new RuntimeException("type must be UTF8_STRING");
    }
    int size = offsetByte & 0x1f;
    String databaseType = new String(tail, metadataOffset + 1, size, StandardCharsets.UTF_8);
    System.out.println(databaseType);
}

This looks for the database type header in the last bit of the geoip db file.

@jasontedor
Member Author

@elasticmachine run gradle build tests 2

Member

@martijnvg martijnvg left a comment

I left a few comments. It would also be great to have a test that verifies that adding a pipeline does not load the database on the elected master node. Maybe we can add a single-node test that adds a pipeline with a geoip processor, add an isLoaded() method to DatabaseReaderLazyLoader, and then assert in that test that isLoaded() returns false.
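A rough sketch of what such a test might look like (hypothetical code: isLoaded() is the method suggested above, and getGeoIpPlugin()/getLoaders() are illustrative names for reaching the loaders, not existing API):

public void testDatabaseNotLoadedOnPipelineCreation() throws Exception {
    // create a pipeline that uses the geoip processor
    final BytesArray pipelineSource = new BytesArray(
            "{\"processors\":[{\"geoip\":{\"field\":\"ip\"}}]}");
    client().admin().cluster().preparePutPipeline("geoip-pipeline", pipelineSource, XContentType.JSON).get();
    // assumes the suggested isLoaded() accessor on DatabaseReaderLazyLoader
    for (final DatabaseReaderLazyLoader loader : getGeoIpPlugin().getLoaders().values()) {
        assertFalse("putting a pipeline must not load the database", loader.isLoaded());
    }
}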

@@ -98,7 +115,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
         final InetAddress ipAddress = InetAddresses.forString(ip);
 
         Map<String, Object> geoData;
-        String databaseType = dbReader.getMetadata().getDatabaseType();
+        String databaseType = dbReader.get().getMetadata().getDatabaseType();
Member

maybe also use dbReader.getDatabaseType() here?

@@ -119,7 +136,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
                 geoData = Collections.emptyMap();
             }
         } else {
-            throw new ElasticsearchParseException("Unsupported database type [" + dbReader.getMetadata().getDatabaseType()
+            throw new ElasticsearchParseException("Unsupported database type [" + dbReader.get().getMetadata().getDatabaseType()
Member

same here?

@@ -64,14 +65,30 @@
 
     private final String field;
     private final String targetField;
-    private final DatabaseReader dbReader;
+    private final DatabaseReaderLazyLoader dbReader;
Member

maybe rename dbReader to dbLoader?

Member Author

This does mean the lazy loader class has to be opened up. The reason is that in some tests we don't read from a file on disk but from an embedded resource. That's different from reading a file, so for that to work I have to add hooks that allow the test to override how to get the size of the stream and how to open a stream to that resource. That's why the class is no longer final. I went down this path and then reverted it all, and forgot to revert the non-final change.

Member

gotcha, thanks for explaining.

Member Author

This comment should have been a thread on [this review comment]. Continuing here, I pushed this change. Do you want to take a look?


 /**
  * Facilitates lazy loading of the database reader, so that when the geoip plugin is installed, but not used,
  * no memory is being wasted on the database reader.
  */
-final class DatabaseReaderLazyLoader implements Closeable {
+class DatabaseReaderLazyLoader implements Closeable {
Member

I think this class can remain final?

@jasontedor
Member Author

> It would also be great to have a test that verifies that adding a pipeline does not load the database on the elected master node.

Note the test GeoIpProcessorFactoryTests#testLazyLoading covers this although perhaps not as directly as you might like. It shows the database isn’t loaded until a pipeline is executed.

@martijnvg
Member

> Note the test GeoIpProcessorFactoryTests#testLazyLoading covers this although perhaps not as directly as you might like. It shows the database isn't loaded until a pipeline is executed.

Cool, I think that is sufficient.

import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.hamcrest.Matchers.equalTo;

public class GeoIpProcessorNonIngestNodeIT extends ESIntegTestCase {
Member

👍

Member

@martijnvg martijnvg left a comment

LGTM - Left a small comment.

this.databaseFileName = databaseFileName;
this.loader = loader;
// cache the database type so that we do not re-read it on every pipeline execution
final SetOnce<String> databaseType;
Member

👍- great idea to cache the database types.
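A hedged sketch of how that cache might be consulted (loadDatabaseType() is an illustrative name for the tail-reading logic shown earlier; the PR's actual code may differ):

String getDatabaseType() throws IOException {
    if (databaseType.get() == null) {
        // first call only: scan the tail of the file for the database_type field;
        // SetOnce permits a single set, which is fine for the single-threaded factory
        databaseType.set(loadDatabaseType());
    }
    return databaseType.get();
}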

break;
}
}

Member

maybe add:

if (metadataOffset == -1) {
    throw new IOException("database type marker not found");
}

Otherwise it would fail with an ArrayIndexOutOfBoundsException.

Member Author

I pushed 5a0ee8f.

@jasontedor
Member Author

@elasticmachine run the default distro tests

Merge remote-tracking branch 'elastic/master' into lazier-ingest-geoip:

* elastic/master: (31 commits)
  enable bwc tests and switch transport serialization version to 6.6.0 for CAS features
  [DOCs] Adds ml-cpp PRs to alpha release notes (elastic#36790)
  Synchronize WriteReplicaResult callbacks (elastic#36770)
  Add CcrRestoreSourceService to track sessions (elastic#36578)
  [Painless] Add tests for boxed return types (elastic#36747)
  Internal: Remove originalSettings from Node (elastic#36569)
  [ILM][DOCS] Update ILM API authorization docs (elastic#36749)
  Core: Deprecate use of scientific notation in epoch time parsing (elastic#36691)
  [ML] Merge the Jindex master feature branch (elastic#36702)
  Tests: Mute SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization
  Update versions in SearchSortValues transport serialization
  Update version in SearchHits transport serialization
  [Geo] Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default `geo_shape` indexing approach (elastic#36751)
  [Docs] Fix error in Common Grams Token Filter (elastic#36774)
  Fix rollup search statistics (elastic#36674)
  SQL: Fix wrong appliance of StackOverflow limit for IN (elastic#36724)
  [TEST] Added more logging
  Invalidate Token API enhancements - HLRC (elastic#36362)
  Deprecate types in index API (elastic#36575)
  Disable bwc tests until elastic#36555 backport is complete (elastic#36737)
  ...
@jasontedor
Member Author

jasontedor commented Dec 18, 2018

@elasticmachine run gradle builds tests 1

@jasontedor
Member Author

@elasticmachine run gradle build tests 1

@jasontedor
Member Author

@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 2

@jasontedor
Member Author

@elasticmachine run gradle build tests 2

@jasontedor
Member Author

@elasticmachine run gradle build tests 2

@jasontedor jasontedor merged commit 273b37a into elastic:master Dec 19, 2018
@jasontedor jasontedor deleted the lazier-ingest-geoip branch December 19, 2018 03:22
jasontedor added a commit that referenced this pull request Dec 19, 2018

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
@jasontedor
Member Author

Thanks @martijnvg!
