float type losing precision when doing terms aggregation #30529

Closed
wenyuan opened this issue May 11, 2018 · 3 comments

wenyuan commented May 11, 2018

Elasticsearch Version

5.3.2

Issue description

Step 1: I have a field named "float_numbers" with the field type "float".
Step 2: Then I inserted the value 0.62.
Step 3: When I run a terms aggregation on this field, I found it loses precision, as you can see below:
   original value: 0.62
   key in buckets: 0.6200000047683716

"aggregations": {
    "float_numbers": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 0.6200000047683716,
          "doc_count": 1
        }
      ]
    }
  }

Clue

I've found some related info in a thread on the Elastic Discuss forum.
It seems that all aggregations convert the values to a double before operating on them.
So I could set the field mapping to "double" to deal with this issue, but I don't think that is an efficient solution, since the double type costs twice the storage of the float type.
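
For illustration, here is a minimal standalone Java sketch (not Elasticsearch code, just the widening in isolation) of what that float-to-double conversion does to the indexed value:

public class FloatToDoubleDemo {
    public static void main(String[] args) {
        float stored = 0.62f;             // value indexed into the "float" field
        double widened = (double) stored; // aggregations operate on doubles

        System.out.println(stored);       // prints 0.62
        System.out.println(widened);      // prints 0.6200000047683716
    }
}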

Expectation

I'm looking forward to any other solutions, or any related information about whether this will change in a future Elasticsearch release.

Test Case

Here are my test cases; copying and pasting them into Kibana will work.
1. index settings and mapping

PUT /term-test
{
  "settings": {
    "refresh_interval": "1s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "demo_type": {
      "properties": {
        "float_field": {
          "type": "float"
        }
      }
    }
  }
}

2. insert values

POST term-test/demo_type
{
  "float_field": 0.62
}

3. terms aggregation

GET term-test/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "float_numbers": {
      "terms": {
        "field": "float_field"
      }
    }
  }
}

Thanks for any suggestions.

@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

colings86 added the :Analytics/Aggregations label May 11, 2018

jpountz (Contributor) commented May 11, 2018

Your number does lose precision, but not in the way you think. This is due to how floating-point numbers work: 0.62 can't be expressed as m * 2^e with integer m and e, so neither doubles nor floats can represent it accurately.

If you print your float, it will seem to work because the system prints the shortest string whose value is precise enough to distinguish it from adjacent float values. You only happen to see more decimals in the terms aggregation output because the value is stored as a double under the hood, so more digits need to be printed for it to be distinguished from adjacent double values.

Because floats and doubles often cannot accurately represent a decimal value, it is generally a bad idea to run terms aggregations on them.
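
For illustration, a minimal standalone Java sketch of the shortest-string behaviour described above (the class name is just for the example):

import java.math.BigDecimal;

public class ShortestStringDemo {
    public static void main(String[] args) {
        float f = 0.62f;

        // Shortest decimal strings that identify each value:
        System.out.println(Float.toString(f));            // "0.62"
        System.out.println(Double.toString((double) f));  // "0.6200000047683716"

        // Exact binary value held by the float -- not the decimal 0.62:
        System.out.println(new BigDecimal((double) f));   // 0.62000000476837158203125
    }
}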

jpountz closed this as completed May 11, 2018

maihde commented Sep 21, 2018

@jpountz I know this ticket is closed but I wanted to add a bit of additional information that might help people who run across this ticket in the future better understand what's going on.

1. Index settings and insert two documents

PUT /term-test
{
  "settings": {
    "refresh_interval": "1s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "demo_type": {
      "properties": {
        "float_field": {
          "type": "float",
          "store": true
        }
      }
    }
  }
}
POST term-test/demo_type
{
  "float_field": 0.62
}
POST term-test/demo_type
{
  "float_field": 0.620000004
}

2. Execute search

GET term-test/_search
{
  "query": {
    "match_all": {}
  }
}
...
        "_source": {
          "float_field": 0.62
        }
...
        "_source": {
          "float_field": 0.620000004
        }
...

The _source field returns the exact values passed in because it is the original JSON, so it's reasonable that it behaves as currently implemented.

3. Execute search with stored field

GET term-test/_search
{
  "stored_fields": [
    "float_field"
  ],
  "query": {
    "match_all": {}
  }
}
...
        "fields": {
          "float_field": [
            0.62
          ]
        }
...
        "fields": {
          "float_field": [
            0.62
          ]
        }
...

This is the confusing part. If you use stored fields, the hits and the bucket keys are not consistent with each other. My personal opinion is that either (a) the stored fields response should return 0.6200000047683716 so that it matches the bucket key, or (b) the bucket key should return 0.62 so that it matches the stored fields. With either of these approaches Elasticsearch would at least be internally consistent when it returns results.

An easy way to achieve the former behavior is to modify the FieldsVisitor to look like this:

    @Override
    public void floatField(FieldInfo fieldInfo, float value) throws IOException {
        // Widen the stored float to a double so the stored-field response
        // carries the same value (0.6200000047683716) as the aggregation key.
        addValue(fieldInfo.name, (double) value);
    }

An easy (but potentially inefficient) way to achieve the latter behavior is to change the conversion of stored float values to look like this:

static final class SingleFloatValues extends NumericDoubleValues {
...
    @Override
    public double doubleValue() throws IOException {
        // Round-trip through the shortest float string so the resulting double
        // equals the decimal value the user originally supplied (e.g. 0.62).
        String floatValue = Float.toString(
                NumericUtils.sortableIntToFloat((int) in.longValue()));
        return Double.parseDouble(floatValue);
    }
...
}

With either of these options the search results and the aggregation results will be internally consistent. I'm inclined toward the former because it accurately represents the actual value stored in the field when you do the query.

The reason I needed to use the terms aggregation instead of the histogram aggregation is that I have a large number of floating-point values for which I need to create buckets at interval: 0.00001 resolution. I want to aggregate to find the top N and then match that against the responses returned from the search (i.e. was the hit one of the top N or not). The histogram aggregation reaches the 10,000 bucket limit, even with min_doc_count set to non-zero. What I really need is the ability to have a size parameter on the histogram aggregation that behaves like the size parameter on the terms aggregation. Do you think such an enhancement would be accepted into the baseline? If so, I will take a stab at implementing it.

The workaround I used is to use the terms aggregation and then call Math.fround() on both the bucket key and the hit values to make them consistent with each other.
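
For readers doing the comparison on the JVM rather than in JavaScript, here is a minimal sketch of the same idea (class and method names are just illustrative): a cast round trip through float plays the role of Math.fround().

public class FroundWorkaround {
    // Round a double to float precision, analogous to JavaScript's Math.fround().
    static double toFloatPrecision(double value) {
        return (double) (float) value;
    }

    public static void main(String[] args) {
        double bucketKey = 0.6200000047683716; // key from the terms aggregation
        double hitValue = 0.62;                // value parsed from _source

        // Both collapse to the same float, so hits and buckets can be matched up.
        System.out.println(toFloatPrecision(bucketKey) == toFloatPrecision(hitValue)); // true
    }
}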
