
Make the ingest-geoip databases even lazier to load #36679

Merged
merged 27 commits into elastic:master from jasontedor:lazier-ingest-geoip on Dec 19, 2018

Conversation

jasontedor
Member

Today we try to load the ingest-geoip databases lazily, but currently they are loaded as soon as any pipeline that uses an ingest-geoip processor is created. This is not lazy enough: we could load the databases only the first time that they are actually used. This would ensure that we load the minimal set of data to support the in-use pipelines (instead of all of the data). This can come up in a couple of ways. One is when only a subset of the databases is used (e.g., the city database versus the country database versus the ASN database). Another is when the plugin is installed on non-ingest nodes (e.g., master-only nodes); we would never use the databases in this case, yet they are currently being loaded, occupying ~60 MB of the heap. This commit makes the ingest-geoip databases as lazy as possible.

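As a rough illustration of the idea (a hypothetical simplification, not the PR's actual DatabaseReaderLazyLoader), the reader can be memoized so that nothing is parsed from disk until the first call site actually needs it:

import com.maxmind.geoip2.DatabaseReader;

import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch: the MaxMind reader is built on first use only, so a
// node that never executes the processor never pays the heap cost.
final class LazyDatabaseReader implements Closeable {

    private final Path databasePath;
    private DatabaseReader reader; // null until first get()

    LazyDatabaseReader(final Path databasePath) {
        this.databasePath = databasePath;
    }

    synchronized DatabaseReader get() throws IOException {
        if (reader == null) {
            // the expensive step: parse the database file into heap
            reader = new DatabaseReader.Builder(databasePath.toFile()).build();
        }
        return reader;
    }

    @Override
    public synchronized void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}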
@jasontedor jasontedor added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 v6.6.0 labels Dec 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jasontedor
Member Author

@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 2

Member

@martijnvg martijnvg left a comment

👍 - This is a good improvement!

> Another is when the plugin is installed on non-ingest nodes (e.g., master-only nodes);

Maybe we should also look into not loading ingest pipelines at all on master-only nodes.

@jasontedor
Member Author

There is a substantial downside to this pull request: we no longer eagerly validate configuration. I am exploring options to keep that validation.

@jasontedor
Member Author

> Maybe we should also look into not loading ingest pipelines at all on master-only nodes.

We go through a validation step on the master on put pipeline requests. 😢

@jasontedor jasontedor added the WIP label Dec 16, 2018
@jasontedor jasontedor changed the title Make the ingest-geoip databases even lazier to load [WIP] Make the ingest-geoip databases even lazier to load Dec 16, 2018
@martijnvg
Member

> We go through a validation step on the master on put pipeline requests. 😢

Oops. I forgot about that.

> There is a substantial downside to this pull request: we no longer eagerly validate configuration. I am exploring options to keep that validation.

We can try to read the database type from the geoip files in IngestGeoIpPlugin without loading the geoip database itself. The header/metadata is at the end of each geoip file, and we could read it via a normal InputStream and extract the database type:

000003a0  ef 4d 61 78 4d 69 6e 64  2e 63 6f 6d e9 5b 62 69  |.MaxMind.com.[bi|
000003b0  6e 61 72 79 5f 66 6f 72  6d 61 74 5f 6d 61 6a 6f  |nary_format_majo|
000003c0  72 5f 76 65 72 73 69 6f  6e a1 02 5b 62 69 6e 61  |r_version..[bina|
000003d0  72 79 5f 66 6f 72 6d 61  74 5f 6d 69 6e 6f 72 5f  |ry_format_minor_|
000003e0  76 65 72 73 69 6f 6e a0  4b 62 75 69 6c 64 5f 65  |version.Kbuild_e|
000003f0  70 6f 63 68 04 02 5c 11  65 bb 4d 64 61 74 61 62  |poch..\.e.Mdatab|
00000400  61 73 65 5f 74 79 70 65  4d 47 65 6f 4c 69 74 65  |ase_typeMGeoLite|
00000410  32 2d 43 69 74 79 4b 64  65 73 63 72 69 70 74 69  |2-CityKdescripti|
00000420  6f 6e e1 42 65 6e 56 47  65 6f 4c 69 74 65 32 20  |on.BenVGeoLite2 |
00000430  43 69 74 79 20 64 61 74  61 62 61 73 65 4a 69 70  |City databaseJip|
00000440  5f 76 65 72 73 69 6f 6e  a1 06 49 6c 61 6e 67 75  |_version..Ilangu|
00000450  61 67 65 73 08 04 42 64  65 42 65 6e 42 65 73 42  |ages..BdeBenBesB|
00000460  66 72 42 6a 61 45 70 74  2d 42 52 42 72 75 45 7a  |frBjaEpt-BRBruEz|
00000470  68 2d 43 4e 4a 6e 6f 64  65 5f 63 6f 75 6e 74 c3  |h-CNJnode_count.|
00000480  3b 4e 34 4b 72 65 63 6f  72 64 5f 73 69 7a 65 a1  |;N4Krecord_size.|
00000490  1c                                                |.|
00000491

that way we can still validate the geoip properties in the factory.

@martijnvg
Member

Something like this should work:

final Path path = PathUtils.get(file);
final long fileSize = Files.size(path);
final int[] DATABASE_TYPE_MARKER = {'d', 'a', 't', 'a', 'b', 'a', 's', 'e', '_', 't', 'y', 'p', 'e'};
try (InputStream in = Files.newInputStream(path)) {
    // read the last 512 bytes (return values of skip/read ignored in this sketch);
    // for all three databases this is sufficient to contain the metadata
    in.skip(fileSize - 512);
    final byte[] tail = new byte[512];
    in.read(tail);

    // find the database_type marker:
    int metadataOffset = -1;
    int markerOffset = 0;
    for (int i = 0; i < tail.length; i++) {
        byte b = tail[i];
        if (b == DATABASE_TYPE_MARKER[markerOffset]) {
            markerOffset++;
        } else {
            markerOffset = 0;
        }
        if (markerOffset == DATABASE_TYPE_MARKER.length) {
            metadataOffset = i + 1;
            break;
        }
    }

    // read the database type: the control byte encodes the MMDB field type in its
    // upper three bits (2 == UTF-8 string) and the string length in its lower five bits
    int offsetByte = tail[metadataOffset] & 0xFF;
    int type = offsetByte >>> 5;
    if (type != 2) {
        throw new RuntimeException("type must be UTF8_STRING");
    }
    int size = offsetByte & 0x1f;
    String databaseType = new String(tail, metadataOffset + 1, size, StandardCharsets.UTF_8);
    System.out.println(databaseType);
}

This looks for the database type header in the last bit of the geoip db file.

@jasontedor
Member Author

@elasticmachine run gradle build tests 2

Member

@martijnvg martijnvg left a comment

I left a few comments. It would also be great to have a test that verifies that adding a pipeline does not load the database on the elected master node. Maybe we can add a single-node test that adds a pipeline with a geoip processor, add an isLoaded() method to DatabaseReaderLazyLoader, and then assert in that test that isLoaded() returns false.
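A rough sketch of what such a test might look like (hypothetical code: isLoaded() is the method suggested above, and getGeoIpPlugin()/getLoaders() are illustrative names for reaching the loaders, not existing API):

public void testDatabaseNotLoadedOnPipelineCreation() throws Exception {
    // create a pipeline that uses the geoip processor
    final BytesArray pipelineSource = new BytesArray(
            "{\"processors\":[{\"geoip\":{\"field\":\"ip\"}}]}");
    client().admin().cluster().preparePutPipeline("geoip-pipeline", pipelineSource, XContentType.JSON).get();
    // assumes the suggested isLoaded() accessor on DatabaseReaderLazyLoader
    for (final DatabaseReaderLazyLoader loader : getGeoIpPlugin().getLoaders().values()) {
        assertFalse("putting a pipeline must not load the database", loader.isLoaded());
    }
}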

@@ -98,7 +115,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
         final InetAddress ipAddress = InetAddresses.forString(ip);
 
         Map<String, Object> geoData;
-        String databaseType = dbReader.getMetadata().getDatabaseType();
+        String databaseType = dbReader.get().getMetadata().getDatabaseType();
Member

maybe also use dbReader.getDatabaseType() here?

@@ -119,7 +136,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
                 geoData = Collections.emptyMap();
             }
         } else {
-            throw new ElasticsearchParseException("Unsupported database type [" + dbReader.getMetadata().getDatabaseType()
+            throw new ElasticsearchParseException("Unsupported database type [" + dbReader.get().getMetadata().getDatabaseType()
Member

same here?

@@ -64,14 +65,30 @@
 
     private final String field;
     private final String targetField;
-    private final DatabaseReader dbReader;
+    private final DatabaseReaderLazyLoader dbReader;
Member

maybe rename dbReader to dbLoader?

Member Author

This does mean the lazy loader class has to be opened up. The reason is that in some tests we don't read from a file on disk but from an embedded resource. That's different from reading a file, so for that to work I have to add hooks that allow the test to override how to get the size of the stream and how to open a stream to that resource. That's why the class is no longer final. I went down this path and then reverted it all, and forgot to revert the non-final change.

Member

gotcha, thanks for explaining.

Member Author

This comment should have been a thread on [this review comment]. Continuing here, I pushed this change. Do you want to take a look?


 /**
  * Facilitates lazy loading of the database reader, so that when the geoip plugin is installed, but not used,
  * no memory is being wasted on the database reader.
  */
-final class DatabaseReaderLazyLoader implements Closeable {
+class DatabaseReaderLazyLoader implements Closeable {
Member

I think this class can remain final?

@jasontedor
Member Author

> It would also be great to have a test that verifies that adding a pipeline does not load the database on the elected master node.

Note the test GeoIpProcessorFactoryTests#testLazyLoading covers this although perhaps not as directly as you might like. It shows the database isn’t loaded until a pipeline is executed.

@martijnvg
Member

> Note the test GeoIpProcessorFactoryTests#testLazyLoading covers this although perhaps not as directly as you might like. It shows the database isn't loaded until a pipeline is executed.

Cool, I think that is sufficient.

import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.hamcrest.Matchers.equalTo;

public class GeoIpProcessorNonIngestNodeIT extends ESIntegTestCase {
Member

👍

Member

@martijnvg martijnvg left a comment

LGTM - Left a small comment.

this.databaseFileName = databaseFileName;
this.loader = loader;
// cache the database type so that we do not re-read it on every pipeline execution
final SetOnce<String> databaseType;
Member

👍- great idea to cache the database types.
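A hedged sketch of how that cache might be consulted (loadDatabaseType() is an illustrative name for the tail-reading logic shown earlier; the PR's actual code may differ):

String getDatabaseType() throws IOException {
    if (databaseType.get() == null) {
        // first call only: scan the tail of the file for the database_type field;
        // SetOnce permits a single set, which is fine for the single-threaded factory
        databaseType.set(loadDatabaseType());
    }
    return databaseType.get();
}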

break;
}
}

Member

maybe add:

if (metadataOffset == -1) {
    throw new IOException("database type marker not found");
}

Otherwise it would fail with an ArrayIndexOutOfBoundsException.

Member Author

I pushed 5a0ee8f.

@jasontedor
Member Author

@elasticmachine run the default distro tests

Merge remote-tracking branch 'elastic/master' into lazier-ingest-geoip:

* elastic/master: (31 commits)
  enable bwc tests and switch transport serialization version to 6.6.0 for CAS features
  [DOCs] Adds ml-cpp PRs to alpha release notes (elastic#36790)
  Synchronize WriteReplicaResult callbacks (elastic#36770)
  Add CcrRestoreSourceService to track sessions (elastic#36578)
  [Painless] Add tests for boxed return types (elastic#36747)
  Internal: Remove originalSettings from Node (elastic#36569)
  [ILM][DOCS] Update ILM API authorization docs (elastic#36749)
  Core: Deprecate use of scientific notation in epoch time parsing (elastic#36691)
  [ML] Merge the Jindex master feature branch (elastic#36702)
  Tests: Mute SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization
  Update versions in SearchSortValues transport serialization
  Update version in SearchHits transport serialization
  [Geo] Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default `geo_shape` indexing approach (elastic#36751)
  [Docs] Fix error in Common Grams Token Filter (elastic#36774)
  Fix rollup search statistics (elastic#36674)
  SQL: Fix wrong appliance of StackOverflow limit for IN (elastic#36724)
  [TEST] Added more logging
  Invalidate Token API enhancements - HLRC (elastic#36362)
  Deprecate types in index API (elastic#36575)
  Disable bwc tests until elastic#36555 backport is complete (elastic#36737)
  ...
@jasontedor
Member Author

jasontedor commented Dec 18, 2018

@elasticmachine run gradle builds tests 1

@jasontedor
Member Author

@elasticmachine run gradle build tests 1

@jasontedor
Member Author

@elasticmachine run gradle build tests 1
@elasticmachine run gradle build tests 2

@jasontedor
Member Author

@elasticmachine run gradle build tests 2

@jasontedor
Member Author

@elasticmachine run gradle build tests 2

@jasontedor jasontedor merged commit 273b37a into elastic:master Dec 19, 2018
@jasontedor jasontedor deleted the lazier-ingest-geoip branch December 19, 2018 03:22
jasontedor added a commit that referenced this pull request Dec 19, 2018

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
@jasontedor
Member Author

Thanks @martijnvg!
