Add `scaled_float`. #19264

Merged
merged 1 commit into from Jul 18, 2016

3 participants
@jpountz
Contributor

commented Jul 5, 2016

This is an attempt to revive #15939, motivated by elastic/beats#1941.
Half-floats are a pretty bad option for storing percentages. They would likely
require 2 bytes all the time while percentages really don't need more than one
byte.

So this PR exposes a new scaled_float type that requires a scaling_factor
and internally indexes value*scaling_factor in a long field. Compared to the
original PR it exposes a lower-level API so that the trade-offs are clearer and
avoids any reference to fixed precision that might imply that this type is more
accurate (actually it is less accurate).
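
As a rough illustration of the encoding described above, here is a minimal sketch. It is not the mapper code from this PR: the class `ScaledFloatSketch` and its `encode`/`decode` methods are hypothetical names chosen for the example, and it assumes rounding to the closest long.

```java
/** Hypothetical helper for illustration only; not the code from this PR. */
final class ScaledFloatSketch {

    /** Multiply by the scaling factor and round to the closest long. */
    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    /** Recover an approximation of the original value from the stored long. */
    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 100;
        long encoded = encode(0.4567, scalingFactor);
        System.out.println(encoded);                        // prints 46
        System.out.println(decode(encoded, scalingFactor)); // prints 0.46
    }
}
```

The round trip turns 0.4567 into 0.46, which is the loss of accuracy mentioned above: anything finer than `1 / scaling_factor` is discarded.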

In addition to being more space-efficient for some use-cases that Beats is
interested in, this is also faster than half_float unless we can improve the
efficiency of decoding half-float bits (which is currently done in software)
or until Java gets first-class support for half-floats.

@jpountz

Contributor Author

commented Jul 5, 2016

To give more detailed information about why half floats are not enough, here is a table that gives disk usage for storing 10M random floats between 0 and 1 depending on the mapping:

| Mapping | Points disk usage (kB) | Doc values disk usage (kB) | Total (kB) |
|---|---|---|---|
| float | 49728 | 34180 | 83908 |
| half float | 26560 | 19532 | 46092 |
| scaled float (factor=4000) | 25744 | 14652 | 40396 |
| scaled float (factor=100) | 13044 | 9768 | 22812 |

I chose 4000 and 100 as scaling factors for two reasons. A factor of 4000 gives 0.025% accuracy, which is better than what a half float can do for this particular use case (floats between 0 and 1), yet it requires less disk. A factor of 100 gives 1% accuracy, which I suspect would be enough for many metrics like CPU utilization.

Of course this is not a good benchmark since this is fake data, but given how points and doc values work, this simulates the worst case, and real data should see even better disk utilization.
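
The accuracy figures above follow from simple arithmetic, sketched below. This is a back-of-the-envelope calculation under stated assumptions: `ScalingFactorTradeoffs`, `resolution` and `bitsNeeded` are hypothetical names, and the bit counts ignore how Lucene actually encodes and compresses points and doc values.

```java
/** Hypothetical helper for illustration only; it ignores Lucene's real on-disk encoding. */
final class ScalingFactorTradeoffs {

    /** Smallest step between two representable values: 1 / scalingFactor. */
    static double resolution(double scalingFactor) {
        return 1.0 / scalingFactor;
    }

    /** Bits needed to distinguish every encoded value for inputs in [0, maxValue]. */
    static int bitsNeeded(double maxValue, double scalingFactor) {
        long maxEncoded = Math.round(maxValue * scalingFactor);
        // 64 - numberOfLeadingZeros is the bit length of a non-negative long.
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxEncoded));
    }

    public static void main(String[] args) {
        for (double factor : new double[] {100, 4000}) {
            System.out.println("factor=" + factor
                + " resolution=" + resolution(factor)
                + " bits=" + bitsNeeded(1.0, factor));
        }
    }
}
```

For values between 0 and 1, a factor of 100 needs 7 bits per value and a factor of 4000 needs 12, compared with the fixed 16 bits of a half float.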


/** A {@link FieldMapper} for scaled floats. Values are internally multiplied
* by a scaling factor and rounded to the closest long. */
public class ScaledFloatFieldMapper extends FieldMapper implements AllFieldMapper.IncludeInAll {

martijnvg Jul 8, 2016

Member

Just a question: would it be possible to extend from LongFieldMapper? Would be nice to have some code reuse.

jpountz Jul 8, 2016

Author Contributor

I thought about it when working on this PR, but in the end it made things more complicated since this mapper needs to behave partly as a long field and partly as a double field.

martijnvg Jul 8, 2016

Member

Cool, I can see how this can complicate things. I was just hoping that this code reuse would be low-hanging fruit.

@jpountz

Contributor Author

commented Jul 12, 2016

Updated numbers with https://issues.apache.org/jira/browse/LUCENE-7371:

| Mapping | Points disk usage (kB) | Doc values disk usage (kB) | Total (kB) |
|---|---|---|---|
| float | 40312 | 34180 | 74492 |
| half float | 23092 | 19532 | 42624 |
| scaled float (factor=4000) | 22792 | 14652 | 37444 |
| scaled float (factor=100) | 12984 | 9768 | 22752 |


@jpountz jpountz force-pushed the jpountz:feature/scaled_floats branch to 398d70b Jul 18, 2016

@jpountz jpountz merged commit 398d70b into elastic:master Jul 18, 2016

@jpountz jpountz deleted the jpountz:feature/scaled_floats branch Jul 18, 2016

Bargs added a commit to Bargs/kibana that referenced this pull request Jul 22, 2016

Support new half_float and scaled_float field types
Elasticsearch added a couple of new numeric datatypes, which means we
need to update our type casting list to include them. Kibana should
see them as "numbers" so they work properly in searches and aggs.

Fixes elastic#7782
Related elastic/elasticsearch#18887
Related elastic/elasticsearch#19264

tsg added a commit to tsg/beats that referenced this pull request Aug 2, 2016

Use scaled_floats for percentages in ES mapping
Elasticsearch has recently added scaled_float as an option for storing floating
point numbers. The scaled floats are stored internally as longs, which means
they can take advantage of the integer compression in Lucene. See
elastic/elasticsearch#19264 for details.

The PR moves all percentages to scaled floats. In our `fields.yml` we assume a
default scaling factor of 1000, which should work well for our percentages
(values between 0 and 1). This scaling factor can also be set to a different
value in `fields.yml`.

ruflin added a commit to elastic/beats that referenced this pull request Aug 2, 2016

Use scaled_floats for percentages in ES mapping (#2156)
Elasticsearch has recently added scaled_float as an option for storing floating
point numbers. The scaled floats are stored internally as longs, which means
they can take advantage of the integer compression in Lucene. See
elastic/elasticsearch#19264 for details.

The PR moves all percentages to scaled floats. In our `fields.yml` we assume a
default scaling factor of 1000, which should work well for our percentages
(values between 0 and 1). This scaling factor can also be set to a different
value in `fields.yml`.

@acchen97 acchen97 referenced this pull request Aug 9, 2016

Closed

New ES type scaled_float #822

airow pushed a commit to airow/kibana that referenced this pull request Feb 16, 2017

Support new half_float and scaled_float field types
Elasticsearch added a couple of new numeric datatypes, which means we
need to update our type casting list to include them. Kibana should
see them as "numbers" so they work properly in searches and aggs.

Fixes elastic#7782
Related elastic/elasticsearch#18887
Related elastic/elasticsearch#19264


Former-commit-id: 298ee35