Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we store geoip files uncompressed? #28782

Closed
danielmitterdorfer opened this issue Feb 22, 2018 · 6 comments
Closed

Should we store geoip files uncompressed? #28782

danielmitterdorfer opened this issue Feb 22, 2018 · 6 comments
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement

Comments

@danielmitterdorfer
Copy link
Member

danielmitterdorfer commented Feb 22, 2018

Current Situation

The ingest-geoip plugin ships with three geoip files:

  • GeoLite2-ASN.mmdb.gz
  • GeoLite2-City.mmdb.gz
  • GeoLite2-Country.mmdb.gz

We load these files lazily to reduce memory usage when this feature is not required (despite the plugin being loaded).

While the Maxmind DB reader allows to load data either on-heap or off-heap, we basically have to load data on-heap because we provide an InputStream that decompresses the gzipped data on the fly.

In order to allow loading the data off-heap, we need to provide a file (see the builder.mode parameter which controls whether to load on- or off-heap).

Discussion Item

Does it make sense to store these files uncompressed instead of gzip-compressed?

Consequences

Positive Consequences

  • We take up less heap memory (it would be roughly 70MB less). This will positively affect users - especially users with small clusters - as well as our own integration tests (we can probably reduce the heap size of our integration test clusters). As a corollary, I expect that we reduce GC pressure a bit as well.
  • We also have the choice to decide whether we load data on-heap or off-heap when we use a file instead of an input stream. This means that we can provide an (expert) setting if we want to if certain users still prefer to have the data on-heap.

Negative Consequences

  • Loading the files is not free: The data are memory-mapped instead and take up native memory.
  • The size of the config directory on disk increases from 34.5 MB to 72.6 MB.
  • The size of the plugin artefact increases from 34.7 MB to 37.6 MB (tested with ingest-geoip-6.2.2.zip).
@danielmitterdorfer danielmitterdorfer added discuss :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Feb 22, 2018
@jasontedor
Copy link
Member

FYI @elastic/es-core-infra.

@martijnvg
Copy link
Member

@danielmitterdorfer Thank you for this analysis. I'm in favor of loading the geoip database as uncompressed files, so that we need less heap memory when the geoip plugin gets used.

Loading the files is not free: The data are memory-mapped instead and take up native memory.

Even if we change how the database files are loaded the increase of native memory only occurs if geoip gets used and this memory is managed by the OS instead of the jvm.

The size of the config directory on disk increases from 34.5 MB to 72.6 MB.

To me this is a minor downside, as diskspace is the resources we should worry least about and it just increases the diskspace used by Elasticsearch if the geoip plugin is installed with 38.1 MB.

The size of the plugin artefact increases from 34.7 MB to 37.6 MB (tested with ingest-geoip-6.2.2.zip).

This is just a small increase and doesn't outweigh the benefits of not changing how the data files get loaded.

@rjernst
Copy link
Member

rjernst commented Feb 22, 2018

An alternative, or possibly in conjunction with this idea, would be to further investigate what we have discussed casually for some time now: re-encoding the data we actually use from the database in our own optimized format. From what I remember, we use a small portion of the database fields. The work for this is doable; the question really is just whether this is allowed via the maxmind db license (assuming we keep adequate notice about the source of the data).

@danielmitterdorfer
Copy link
Member Author

the question really is just whether this is allowed via the maxmind db license

According to the site for the Geolite2 database the license is CC BY-SA 4.0. I also think that it would be wise to double-check but on the Creative Commons site it says:

You are free to:
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially.

We must give appropriate credit (which is IMHO nice behaviour anyway) and must redistribute under the same license.

All this sounds to me (IANAL) as if it would be feasible to re-encode the data.

@clintongormley
Copy link

An alternative, or possibly in conjunction with this idea,

I think we should pursue both, but let's not make the immediate gains depend on the longer term goal of improving encoding.

@danielmitterdorfer
Copy link
Member Author

We discussed in Fix-it Friday and want to pursue this. By default, we will load the files using the mmap approach and will add a system property in case some users explicitly want to load data still on-heap.

@danielmitterdorfer danielmitterdorfer self-assigned this Mar 8, 2018
@danielmitterdorfer danielmitterdorfer removed the help wanted adoptme label Mar 8, 2018
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Mar 9, 2018
With this commit we reduce heap usage of the ingest-geoip plugin by
memory-mapping the database files. Previously, we have stored these
files gzip-compressed but this has resulted that data are loaded on the
heap.

Closes elastic#28782
danielmitterdorfer added a commit that referenced this issue Mar 12, 2018
With this commit we reduce heap usage of the ingest-geoip plugin by
memory-mapping the database files. Previously, we have stored these
files gzip-compressed but this has resulted that data are loaded on the
heap.

Closes #28782
danielmitterdorfer added a commit that referenced this issue Mar 12, 2018
With this commit we reduce heap usage of the ingest-geoip plugin by
memory-mapping the database files. Previously, we have stored these
files gzip-compressed but this has resulted that data are loaded on the
heap.

Closes #28782
danielmitterdorfer added a commit that referenced this issue Mar 12, 2018
With this commit we reduce heap usage of the ingest-geoip plugin by
memory-mapping the database files. Previously, we have stored these
files gzip-compressed but this has resulted that data are loaded on the
heap.

Closes #28782
@danielmitterdorfer danielmitterdorfer removed their assignment Mar 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement
Projects
None yet
Development

No branches or pull requests

5 participants