Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the coerce option on numeric fields. #25861

Closed
jpountz opened this issue Jul 24, 2017 · 23 comments
Closed

Remove the coerce option on numeric fields. #25861

jpountz opened this issue Jul 24, 2017 · 23 comments
Labels
>deprecation good first issue low hanging fruit :Search/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team team-discuss

Comments

@jpountz
Copy link
Contributor

jpountz commented Jul 24, 2017

I don't like the fact that the coerce option combines a useful feature: the ability to convert strings to a number (which we are supposed to do in most APIs anyway) with a dangerous feature: the truncation (it does not even round) of floating-point numbers when indexed as a byte/short/integer/long.

I think we should remove this option, accept a string in all cases and require users to use an ingest processor if they want to ignore the decimal part of their floating-point numbers. Note that eg. "1.0" would still be accepted in an integer field since its decimal part is equal to zero.

@jpountz jpountz added :Search/Mapping Index mappings, including merging and defining field types discuss labels Jul 24, 2017
@jpountz jpountz changed the title Remove the coerce option. Remove the coerce option on numeric fields. Jul 24, 2017
@jpountz
Copy link
Contributor Author

jpountz commented Jul 24, 2017

We had a brief discussion about it in the search meeting and there was consensus that enabling truncation by default can be trappy. We discussed the following alternatives:

  • keep the coerce option but make it only about accepting strings
  • add an option that tells the field what to do if there is a decimal part (truncate/floor/ceil/round)

We should also try to gather feedback about use-cases for truncating, if any, and what users should migrate to instead (the convert processor?).

@jimczi
Copy link
Contributor

jimczi commented Jul 28, 2017

Discussed in Fix it Friday and we agreed that we should keep the coerce option and make it only about accepting strings. The automatic truncation is not a good option and should be replaced with an ingest processor that does the right thing (floor/ceil/round).

@jimczi jimczi added help wanted adoptme good first issue low hanging fruit and removed discuss labels Jul 28, 2017
@nablyy
Copy link

nablyy commented Jul 31, 2017

hi I'd like to look into this as my first PR.
is there a good place to start looking in the code to address this?

@jpountz
Copy link
Contributor Author

jpountz commented Jul 31, 2017

@nablyy You should look at NumberFieldMapper, and try to disallow floating-point numbers that have a decimal part on integer types and 7.x indices.

@liketic
Copy link
Contributor

liketic commented Sep 29, 2017

@nablyy If you're not working on this, can I take this?

@colings86 colings86 added this to Search & Aggs in Background tasks Oct 11, 2017
@nablyy
Copy link

nablyy commented Oct 19, 2017

@liketic If you want, take this.

@PnPie
Copy link
Contributor

PnPie commented Nov 1, 2017

Hello everyone,
If no one is working on this anymore, I would like to make a PR #27219 :) thx in advance for all your reviews and comments.

@akhilravipati97
Copy link

@jpountz Hi, I am trying to understand the issue and I think I need some clarification. When you say make coerce all about accepting strings, do you mean that from now on all it should mean to a user is that it will convert strings into the target type as long as there is no need to manipulate the value itself and that invalid strings will now be simply errored instead of truncated etc?

i.e Do you mean, the change should be so that, "5.0" --> Integer [coerce, in this context would actually coerce the value to 5], but "5.1" --> Integer [coerce, in this context means nothing, because "5.1" can't be coerced to integer without manipulation, and so should result in error]

@akotlar
Copy link

akotlar commented Mar 9, 2018

I think this is a frankly terrible idea. It is perfectly understandable what happens when you convert a float to an int. It truncated the float in every sensible language.

Why conflate this wth questions about rounding? In what language does int(float) result in a rounded number?

I find this feature eminantly useful because it allows me to define int(float) in my indexing config. Please leave it in. This is a documentation issue at worst.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 13, 2018

@akotlar Agreed casting may be useful. The thing I dislike is that it happens implicitly.
cc @elastic/es-search-aggs

@akotlar
Copy link

akotlar commented Mar 13, 2018

@jpountz Sorry, maybe I misunderstood. Does it not only happen when coerce: true ? If so, isn't that explicit? I would agree if it did something like round implicitly: when coercing from float to int I expect truncation.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 4, 2018

@akotlar Could you give more details about the kind of fields that you index and want to truncate?

@akotlar
Copy link

akotlar commented Apr 4, 2018

Absolutely. I work with genetic data, indexing all mutations, and a lot of annotated information about those mutations, found in say 1000 people.

One of those annotations comes from dbSNP. As it arrives from UCSC Genome Browser, there is a field, alleleNs, which is an integer count ostensibly, because it is the averaging of number of chromosomes (most people have 2 copies of each chromosome) by number of people. Unfortunately, on rare occasion someone will have 3 or 1 copies of a chromosome. So the average ends up being a float. This happens very rarely, and I mostly don’t care about being off by < 1 once in a great while.

In general, genetic data can be really inconsistent, and so say 1/10000 entries will be badly formatted. These I want to cast to the majority type.

The way I wrote the program that handles all of this stuff (and which uses ES), I have attempted to place as much of my api as possible into YAML configs, so that one program could accommodate many permutations of this messy data. This maps really well to ES, since the search layer can be largely configured from a single Json/YAML. I don’t want to write programming logic or invest functions per annotation feature, since there are on the order of a thousand possible features, and millions of possibly desirable combinations of these (and the feature space grows monthly more or less). That is a good use case for config files.

So I realize that I could use an ingestion pipeline, but for simple transformations like this, I vastly prefer to just use an already existing mechanism (existant in the mapping configs ES supports) that can handle this in a perfectly understandable way.

@jpountz jpountz added discuss and removed help wanted adoptme labels Apr 4, 2018
@jpountz
Copy link
Contributor Author

jpountz commented Apr 4, 2018

OK, I'm adding back the discuss label to make sure that we take your feedback into account.

On my end I would still prefer that we do this change. I understand there could be disagreement. And I also think there might be work to do on the integration of index APIs with pipelines. For instance the mapping option has the benefit that you can never skip it, while it's easy to just forget to pass the name of the ingest pipeline in a call to the index APIs.

@akotlar
Copy link

akotlar commented Apr 4, 2018

Sounds good

@jpountz
Copy link
Contributor Author

jpountz commented Aug 24, 2018

We discussed this in FixitFriday and agreed to still remove support for coerce despite latest arguments. For the record, here is an example how to handle rounding on top of Elasticsearch thanks to an ingest pipeline. Another option would be to do the same on client-side before sending data to Elasticsearch.

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_long": {
          "type": "long",
          "coerce": false
        }
      }
    }
  }
}

PUT _ingest/pipeline/round_my_long
{
  "description": "describe pipeline",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx[params.field] = (long) Math.round(ctx[params.field])",
        "params": {
          "field": "my_long"
        }
      }
    }
  ]
}

PUT test/_doc/1?pipeline=round_my_long
{
  "my_long": 3.6
}

@jpountz jpountz removed the discuss label Aug 24, 2018
@akotlar
Copy link

akotlar commented Aug 24, 2018

My position hasn't changed, it is clear that this is a needless complication.

Thanks for the ingestion pipeline tip.

@mayya-sharipova
Copy link
Contributor

I am wondering if we still want to coerce values during queries. Currently we truncate decimal part on range queries on integer data types. I assume this is useful if we ran a query across different indexes where the same field name can have different field types.

@jpountz
Copy link
Contributor Author

jpountz commented Jul 16, 2020

@mayya-sharipova Indeed we want to keep doing the right thing when querying integer field types with ranges on floating-point numbers. (We're not always truncating, we sometimes round up, sometimes down, so that querying integer fields with floating-point numbers always gives the same answer as if the field had been indexed as a floating-point number too.)

wesyoung added a commit to csirtgadgets/csirtg-smrt-v1 that referenced this issue Sep 24, 2020
Float values (like 7.5 or 8.5) get stored in the Integer-mapped confidence field due to coercion [1], but when querying ES (as in a range query), the value is truncated/rounded [2]. E.g., if an indicator's conf is stored as 7.5, but you later query for indicators at 7.5, that indicator won't be returned b/c ES truncates the field value before querying to comply with the Integer mapping type. 
Fix this problem by remapping as Float.
[1]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/coerce.html
[2]: elastic/elasticsearch#25861 (comment)

Co-authored-by: wes <wesyoung@users.noreply.github.com>
wesyoung added a commit to csirtgadgets/bearded-avenger that referenced this issue Oct 4, 2020
Float values (like 7.5 or 8.5) get stored in the Integer-mapped confidence field due to coercion [1], but when querying ES (as in a range query), the value is truncated/rounded [2]. E.g., if an indicator's conf is stored as 7.5, but you later query for indicators at 7.5, that indicator won't be returned b/c ES truncates the field value before querying to comply with the Integer mapping type. 
[1]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/coerce.html
[2]: elastic/elasticsearch#25861 (comment)

Co-authored-by: wes <wesyoung@users.noreply.github.com>
@hendrikmuhs
Copy link
Contributor

I think this is a good initiative. However I am concerned about the breaking change when a field is mapped as integer type but _source contains a floating point part. Users might rely on this feature, a re-index would suddenly fail.

I think such a change should be done with at least 2 major release cycles: 1 to flip the default, so its strict by default and 2 to remove it completely.

I am not fully sure if accepting .0 is good or bad, unsigned_long which lacks coerce by design errors for .0. If .0 gets accepted, unsigned_long should be relaxed, too. It's probably better to be fully strict which is also more compliant with e.g. JSON.

@mayya-sharipova
Copy link
Contributor

We've discussed this issue today in the Search meeting, and decided for now to postpone implementation of this change. One reason is that we still don't have an appealing way to index numbers with floating point as integers. Ingest pipeline can be complex for some use cases.
Another reason is runtime fields. As we change the behaviour of coerce option, it may be break runtime fields that were relying on the original behaviour.
As runtime fields are shaping up, they may provide better ways to extract values depending on index version and to index documents from source instead of using injest pipelines.

@copernicus365
Copy link

Is it at least possible to have an opt-in setting that allows floating point numbers (doubles, decimals, etc), to not be truncated as integers (long)? That way there's no breaking change, while those of us who are being bit by floats / doubles being truncated have a way ff. Thanks

@javanna
Copy link
Member

javanna commented Aug 18, 2022

We have re-discussed this with the team. We are reluctant to change the behaviour of coerce due to the breaking nature of such change. We have discussed the possibility of adding a new option that would allow users to not truncate floating point numbers, but we are not sold on it given that documents can be adapted before ingestion, or through an ingest pipeline. Also, it is now possible to load the original value as float, or do any needed computation around it if necessary through a runtime field, that can even shadow the original field transparently for search consumers.

With this, I am leaning towards closing this issue. Please feel free to comment below if you think otherwise.

@javanna javanna closed this as completed Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>deprecation good first issue low hanging fruit :Search/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team team-discuss
Projects
Background tasks
Search & Aggs
Development

No branches or pull requests