float type losing precision when doing terms aggregation #30529

Closed
wenyuan opened this issue May 11, 2018 · 3 comments

wenyuan commented May 11, 2018

Elasticsearch Version

5.3.2

Issue description

Step 1: I have a field named "float_numbers" with the field type "float".
Step 2: Then I inserted the value 0.62.
Step 3: When I run a terms aggregation on this field, I found it loses precision, as you can see below:
   original value: 0.62
   key in buckets: 0.6200000047683716

"aggregations": {
    "float_numbers": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 0.6200000047683716,
          "doc_count": 1
        }
      ]
    }
  }

Clue

I've found some related info in a thread on the Elastic Discuss forum.
It seems that all aggregations convert the values to a double before operating on them.
So I could set the field mapping to "double" to deal with this issue, but I don't think that is an efficient solution, since the double type costs twice the storage of the float type.
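
For illustration, here is a minimal standalone Java sketch (not Elasticsearch code, just the widening in isolation) of what that float-to-double conversion does to the indexed value:

public class FloatToDoubleDemo {
    public static void main(String[] args) {
        float stored = 0.62f;             // value indexed into the "float" field
        double widened = (double) stored; // aggregations operate on doubles

        System.out.println(stored);       // prints 0.62
        System.out.println(widened);      // prints 0.6200000047683716
    }
}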

Expectation

I'm looking forward to any other solutions, or any related information about whether this will change in a future Elasticsearch release.

Test Case

Here are my test cases; copying and pasting them into Kibana will work.
1. index settings and mapping

PUT /term-test
{
  "settings": {
    "refresh_interval": "1s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "demo_type": {
      "properties": {
        "float_field": {
          "type": "float"
        }
      }
    }
  }
}

2. insert values

POST term-test/demo_type
{
  "float_field": 0.62
}

3. terms aggregation

GET term-test/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "float_numbers": {
      "terms": {
        "field": "float_field"
      }
    }
  }
}

Thanks for any suggestions.

@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

colings86 added the :Analytics/Aggregations label May 11, 2018

jpountz (Contributor) commented May 11, 2018

Your number does lose precision, but not in the way you think. This is due to how floating-point numbers work: 0.62 can't be expressed as m * 2^e with integer m and e, so neither doubles nor floats can represent it accurately.

If you print your float, it will seem to work because the system prints the shortest string whose value is precise enough to distinguish it from adjacent float values. You only happen to see more decimals in the terms aggregation output because the value is stored as a double under the hood, so more digits need to be printed for it to be distinguished from adjacent double values.

Because floats and doubles often cannot accurately represent a decimal value, it is generally a bad idea to run terms aggregations on them.
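
For illustration, a minimal standalone Java sketch of the shortest-string behaviour described above (the class name is just for the example):

import java.math.BigDecimal;

public class ShortestStringDemo {
    public static void main(String[] args) {
        float f = 0.62f;

        // Shortest decimal strings that identify each value:
        System.out.println(Float.toString(f));            // "0.62"
        System.out.println(Double.toString((double) f));  // "0.6200000047683716"

        // Exact binary value held by the float -- not the decimal 0.62:
        System.out.println(new BigDecimal((double) f));   // 0.62000000476837158203125
    }
}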

jpountz closed this as completed May 11, 2018

maihde commented Sep 21, 2018

@jpountz I know this ticket is closed but I wanted to add a bit of additional information that might help people who run across this ticket in the future better understand what's going on.

1. Index settings and insert two documents

PUT /term-test
{
  "settings": {
    "refresh_interval": "1s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "demo_type": {
      "properties": {
        "float_field": {
          "type": "float",
          "store": true
        }
      }
    }
  }
}
POST term-test/demo_type
{
  "float_field": 0.62
}
POST term-test/demo_type
{
  "float_field": 0.620000004
}

2. Execute search

GET term-test/_search
{
  "query": {
    "match_all": {}
  }
}
...
        "_source": {
          "float_field": 0.62
        }
...
        "_source": {
          "float_field": 0.620000004
        }
...

The _source field returns the exact values passed in because it is the original JSON, so it's reasonable that it behaves as currently implemented.

3. Execute search with stored field

GET term-test/_search
{
  "stored_fields": [
    "float_field"
  ],
  "query": {
    "match_all": {}
  }
}
...
        "fields": {
          "float_field": [
            0.62
          ]
        }
...
        "fields": {
          "float_field": [
            0.62
          ]
        }
...

This is the confusing part. If you use stored fields, the hits and the bucket keys are not consistent with each other. My personal opinion is that either (a) the stored fields response should return 0.6200000047683716 so that it matches the bucket key, or (b) the bucket key should return 0.62 so that it matches the stored fields. With either of these approaches Elasticsearch would at least be internally consistent when it returns results.

An easy way to achieve the former behavior is to modify the FieldsVisitor to look like this:

    @Override
    public void floatField(FieldInfo fieldInfo, float value) throws IOException {
        // Widen the stored float to a double so the stored-field response
        // carries the same value (0.6200000047683716) as the aggregation key.
        addValue(fieldInfo.name, (double) value);
    }

An easy (but potentially inefficient) way to achieve the latter behavior is to change the conversion of stored float values to look like this:

static final class SingleFloatValues extends NumericDoubleValues {
...
    @Override
    public double doubleValue() throws IOException {
        // Round-trip through the shortest float string so the resulting double
        // equals the decimal value the user originally supplied (e.g. 0.62).
        String floatValue = Float.toString(
                NumericUtils.sortableIntToFloat((int) in.longValue()));
        return Double.parseDouble(floatValue);
    }
...
}

With either of these options the search results and the aggregation results will be internally consistent. I'm inclined toward the former because it accurately represents the actual value stored in the field when you do the query.

The reason I needed to use the terms aggregation instead of the histogram aggregation is that I have a large number of floating-point values for which I need to create buckets at interval: 0.00001 resolution. I want to aggregate to find the top N and then match that against the responses returned from the search (i.e. was the hit one of the top N or not). The histogram aggregation reaches the 10,000 bucket limit, even with min_doc_count set to non-zero. What I really need is the ability to have a size parameter on the histogram aggregation that behaves like the size parameter on the terms aggregation. Do you think such an enhancement would be accepted into the baseline? If so, I will take a stab at implementing it.

The workaround I used is to use the terms aggregation and then call Math.fround() on both the bucket key and the hit values to make them consistent with each other.
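
For readers doing the comparison on the JVM rather than in JavaScript, here is a minimal sketch of the same idea (class and method names are just illustrative): a cast round trip through float plays the role of Math.fround().

public class FroundWorkaround {
    // Round a double to float precision, analogous to JavaScript's Math.fround().
    static double toFloatPrecision(double value) {
        return (double) (float) value;
    }

    public static void main(String[] args) {
        double bucketKey = 0.6200000047683716; // key from the terms aggregation
        double hitValue = 0.62;                // value parsed from _source

        // Both collapse to the same float, so hits and buckets can be matched up.
        System.out.println(toFloatPrecision(bucketKey) == toFloatPrecision(hitValue)); // true
    }
}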
