
Support for BigInteger and BigDecimal #17006

Closed
clintongormley opened this issue Mar 8, 2016 · 55 comments
Labels
discuss, >feature, Meta, :Search Foundations/Mapping (index mappings, including merging and defining field types), Team:Search Foundations (meta label for the Search Foundations team in Elasticsearch)

Comments

@clintongormley

Lucene now has sandbox support for BigInteger (LUCENE-7043), and hopefully BigDecimal will follow soon. We should look at what needs to be done to support them in Elasticsearch.

I propose adding big_integer and big_decimal types which have to be specified explicitly - they shouldn't be types which can be detected by dynamic mapping.

Many languages don't support big int/decimal. JavaScript will convert to floats or throw an exception if a number is out of range. This can be worked around by always rendering these numbers in JSON as strings. We could possibly accept known bigints/bigdecimals as numbers, but there are a few places where this could be a problem:

  • indexing a known big field (do we know ahead of time to parse a floating point as a BigDecimal?)
  • dynamic mapping (a floating point number could have lost precision before the field is defined as big_decimal)
  • ingest pipeline (ingest doesn't know about field mappings)

The above could be worked around by telling Jackson to parse floats and ints as BIG* (USE_BIG_DECIMAL_FOR_FLOATS and USE_BIG_INTEGER_FOR_INTS) but this may well generate a lot of garbage for what is an infrequent use case.

Alternatively, we could just say that Big* should always be passed in as strings if they are to maintain their precision.
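
For reference, a minimal sketch of the Jackson option mentioned above - USE_BIG_DECIMAL_FOR_FLOATS and USE_BIG_INTEGER_FOR_INTS are real DeserializationFeature flags; the surrounding setup is only illustrative:

import java.util.Map;

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BigNumberParsingSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Parse JSON floating-point numbers as BigDecimal and integral numbers as BigInteger
        // instead of double/long, at the cost of extra allocations ("garbage").
        mapper.enable(DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS);
        mapper.enable(DeserializationFeature.USE_BIG_INTEGER_FOR_INTS);

        Map<?, ?> doc = mapper.readValue(
                "{\"price\": 3.141592653589793238462643383279, \"id\": 98765432109876543210}",
                Map.class);
        System.out.println(doc.get("price").getClass()); // class java.math.BigDecimal
        System.out.println(doc.get("id").getClass());    // class java.math.BigInteger
    }
}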

@rmuir
Contributor

rmuir commented Mar 8, 2016

One thing to note here is that our points support is for fixed-width types.

In other words, the BigIntegerPoint in Lucene is a little misleading: it does not in fact support "immutable arbitrary-precision integers".

Instead it's a signed 128-bit integer type, more like a long long. If you try to give it a too-big BigInteger you get an exception! But otherwise BigInteger is a natural API for the user to provide a 128-bit integer.

On the other hand, if someone wanted to add support for a 128-bit floating point type, it's of course possible, but I have my doubts whether BigDecimal is even the right Java API for that (BigDecimal is a very different thing than a quad-precision floating point type).

I already see some confusion (e.g. "lossless storage") referenced in the issue, so I think it's important to disambiguate a little.

Maybe names like BigInteger/BigDecimal should be avoided with these, but that's part of why the thing is in the sandbox; we can change that (e.g. to LongLongPoint).
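
To make the fixed-width point concrete, here is a rough sketch against the Lucene sandbox class mentioned above (class and constructor as of LUCENE-7043; the exact exception type and message are an assumption and may differ between versions):

import java.math.BigInteger;

import org.apache.lucene.document.BigIntegerPoint; // lucene-sandbox
import org.apache.lucene.document.Document;

public class BigIntegerPointSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        // 2^127 - 1 fits in the signed 128-bit encoding and is accepted.
        doc.add(new BigIntegerPoint("usn", BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE)));

        // 2^127 does not fit: the field rejects it rather than widening,
        // which is why this is really a "long long", not arbitrary precision.
        try {
            doc.add(new BigIntegerPoint("usn", BigInteger.ONE.shiftLeft(127)));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}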

@clintongormley
Author

Thanks for the heads-up @rmuir - I was indeed unaware of that.

@jpountz
Contributor

jpountz commented Apr 14, 2016

I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use SORTED_SET doc values, but if the main use-case is to run stats aggregations, this won't work, and the fact that we have a long long type will probably be confusing since users won't be able to run the operations that they expect to work.

@rmuir
Contributor

rmuir commented Apr 17, 2016

I agree: we did some digging the other day.

One cause of confusion is that many databases have a bigint type which is really a 64-bit long! So I'm concerned about people using a too-big type when it's not needed, due to naming confusion.

Also, we have the challenge of how such numbers would behave in e.g. scripting and other places. Personally, I've only used BigInteger for cryptography-like things. You can see from its API that it's really geared at that. So maybe it's not something we should expose?

@ravicious

@jpountz:

I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use SORTED_SET doc values, but if the main use-case is to run stats aggregations, this won't work (…)

Sorry for my newb questions, but why wouldn't this work? Aren't stats aggregations done with floats possibly inaccurate due to floating-point arithmetic?

@jpountz
Contributor

jpountz commented Apr 17, 2016

They can be inaccurate indeed.

The point I was making above is that Lucene provides two ways to encode doc values. On the one hand, we have SORTED_SET, which assigns an ordinal to every value per segment. This way you can efficiently sort and run terms, cardinality or range aggregations, since these operations can work directly on the ordinals. However, the cost of resolving a value given an ordinal is high enough that anything that needs access to the actual values, such as a stats aggregation, would be slow. On the other hand, there is BINARY, which just encodes the raw binary values in a column-stride fashion. This would be slower for sorting and terms/cardinality/range aggregations, but reading the original values would be faster than with SORTED_SET, so we could theoretically run e.g. stats aggregations or use the values in scripts.

So knowing about the use-cases will help figure out which format to use. But then if we want to leverage all 128 bits of the values, we will have to duplicate implementations for everything that needs to add or multiply values such as stats/sum/avg aggregations. This would be an important burden in terms of maintenance so we would certainly not want to go that route without making sure that there are valid/common use-cases for it first.
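
As a very rough illustration of the two encodings being compared (the Lucene doc-values field classes are real; representing a 128-bit value as a BytesRef is just one possible choice, not what an eventual implementation would necessarily do):

import java.math.BigInteger;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

public class DocValuesEncodingSketch {
    public static void main(String[] args) {
        BigInteger value = new BigInteger("123456789012345678901234567890");
        BytesRef bytes = new BytesRef(value.toByteArray());

        Document doc = new Document();
        // SORTED_SET: per-segment ordinals; efficient sorting and terms/cardinality/range aggregations,
        // but resolving ordinals back to values is costly.
        doc.add(new SortedSetDocValuesField("big_field_ordinals", bytes));
        // BINARY: raw column-stride values; slower for ordinal-style operations,
        // but faster when stats aggregations or scripts need the values themselves.
        doc.add(new BinaryDocValuesField("big_field_raw", bytes));
    }
}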

@devgc

devgc commented Jul 29, 2016

This feature would be useful for the Digital Forensics and Incident Response (DFIR) community. There are lots of data structures we look at that have uint64 types. When we index these, if the field is mapped as a long and the value is out of range, information can be lost.

@rmuir
Contributor

rmuir commented Jul 29, 2016

I see a 64-bit unsigned integer type (versus the 64-bit signed type we have) as a separate feature, actually. This can be implemented more efficiently with Lucene (and made easier with Java 8).

Yeah, figuring out how to make a 64-bit unsigned type work efficiently in, say, the scripting API might be a challenge as it stands today. Perhaps it truly must be a Number backed by BigInteger to work best today, which would be slower.

But in general, typical things such as ranges and aggregations would be as fast as with the 64-bit signed type we have today, and perhaps a newer scripting API (with more type information) could make scripting faster too down the road, so it is much more compelling than larger integers (e.g. 128-bit), which will always be slower.
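
For what it's worth, the Java 8 support alluded to above looks roughly like this - these are standard java.lang.Long helpers, so a uint64 can live in the same 64 bits as a signed long, with only comparison and formatting needing to be unsigned-aware:

public class Uint64Sketch {
    public static void main(String[] args) {
        // A uint64 value above Long.MAX_VALUE, stored in the same 64 bits.
        long counter = Long.parseUnsignedLong("18446744073709551615"); // 2^64 - 1

        System.out.println(counter);                            // -1 when interpreted as signed
        System.out.println(Long.toUnsignedString(counter));     // 18446744073709551615
        System.out.println(Long.compareUnsigned(counter, 1L));  // positive: unsigned ordering is preserved
    }
}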

As for use cases where BigInteger is truly needed, to me that situation is less clear. I would like for us to consider the two cases (64-bit unsigned vs larger integers) as separate.

@jeffknupp

@rmuir it's surprising to me that you have to ask for cases where BigDecimal (i.e. a decimal representation with arbitrary precision) would be needed, as much data science/analytics work requires exact representations of the source data without loss of precision. If putting my data into ES means that I am necessarily going to lose precision, that's a non-starter for many uses. Nothing in the JSON spec suggests this. In fact, it expressly mentions that numerics are arbitrary precision and it is up to the various libraries to represent that properly.

@jasontedor
Member

Nothing in the JSON spec suggests this. In fact, it expressly mentions that numerics are arbitrary precision and it is up to the various libraries to represent that properly.

This is not correct; the spec says:

This specification allows implementations to set limits on the range and precision of numbers accepted.

You are correct that numerics in the JSON spec are arbitrary precision, but nothing in the spec suggests that implementations must support this and, in fact, implementations do not have to support this.

The spec further says:

Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.

@jeffknupp

jeffknupp commented Sep 26, 2016

@jasontedor I was referring to ECMA-404, but regardless, my point is that the Elastic documentation specifically says that _source, for example, contains the original JSON message verbatim and is used for search results. I think you'd have to heavily amend statements like that in the documentation to explicitly describe how JSON numbers are handled internally in ES.

You also cut your quoting of the spec short, as the entire paragraph is:

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.

This is exactly what I'm referring to, as "the software that created it" (i.e. a client) has no reason to suspect, based on the documentation, that either of these values would lose precision.
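
As a concrete illustration of the interoperability caveat both sides are quoting, binary64 cannot represent either of the spec's example values - a minimal, purely illustrative check:

import java.math.BigDecimal;

public class JsonNumberPrecision {
    public static void main(String[] args) {
        // The spec's 3.141592653589793238462643383279 parsed as an IEEE 754 binary64 double:
        double pi = Double.parseDouble("3.141592653589793238462643383279");
        System.out.println(new BigDecimal(pi)); // 3.14159265358979311599796... - digits beyond ~17 are gone

        // The spec's 1E400 overflows binary64 entirely:
        System.out.println(Double.parseDouble("1E400")); // Infinity

        // Kept as a string and parsed with BigDecimal, the value survives unchanged:
        System.out.println(new BigDecimal("3.141592653589793238462643383279"));
    }
}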

@jpountz
Contributor

jpountz commented Sep 26, 2016

@jeffknupp

it's surprising to me that you have to ask for cases where BigDecimal would be needed

We are asking for use-cases because depending on the expectations, the feature could be implemented in very different ways.

For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.

If the use-case requires more than 64 bits (e.g. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky.

If arbitrary precision is needed, then there is not much we can do efficiently, at least at the moment.

@jasontedor
Member

I was referring to ECMA-404

The JSON spec only spells out the representation in JSON which is used for interchange; it is completely agnostic to how such information is represented by software consuming such JSON.

I think you'd have to heavily amend statements like that in the documentation to explicitly describe how JSON numbers are handled internally in ES.

The documentation spells out the numeric datatypes that are supported.

@devgc

devgc commented Oct 7, 2016

Here is a good example.

Windows uses the USN Journal to record changes made to the file system. These records are extremely important "logs" for people in the DFIR community.

Version 2 records use a 64-bit unsigned integer to store reference numbers.

Version 3 records use a 128-bit ordinal number for reference numbers.

For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.

I would say that this is important for the DFIR community.

If the use-case requires more than 64 bits (e.g. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky.

I would say this is equally important.

There are many other logs that record these references, thus by maintaining their native types we can correlate logs to determine certain types of activity.

@devgc

devgc commented Oct 7, 2016

As for use cases where BigInteger is truly needed, to me that situation is less clear. I would like for us to consider the two cases (64-bit unsigned vs larger integers) as separate.

Should we go ahead and create a new issue for a 64-bit unsigned type as a feature?

@marcurdy

marcurdy commented Oct 7, 2016

For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.

I'm also in the digital forensics world and see merit in providing a 64-bit unsigned type. If it were 128-bit with a speed impact, it wouldn't affect the way I process data. My use is less real-time and more one-time bulk processing. The biggest factor to me would be what makes the most sense from the developer side with respect to Java and OS integration.

@tezcane

tezcane commented Oct 22, 2016

Spring Data JPA supports BigInteger and BigDecimal, so any code where you try to also use Elasticsearch with them will fail:

/**  Spring Data ElasticSearch repository for the Task entity.  */
public interface TaskSearchRepository extends ElasticsearchRepository<Task,BigInteger> {
     //THIS COMPILES BUT FAILS ON INIT
}

/**  Spring Data JPA repository for the Task entity. */
@SuppressWarnings("unused")
public interface TaskRepository extends JpaRepository<Task,BigInteger> {
    //THIS IS OK
}

I think a hack (that may end up being almost as efficient) is to convert my BigInteger to a string for use with Elasticsearch:

/**  Spring Data ElasticSearch repository for the Task entity.  */
public interface TaskSearchRepository extends ElasticsearchRepository<Task,String> {
     //HACK, convert biginteger to string when saving to elasticsearch...
}

So these data types should be added in my opinion.

@niemyjski
Contributor

We also need something like this; we are currently unable to store C#'s Decimal.MaxValue.

@jordansissel
Contributor

jordansissel commented Apr 3, 2017

On use cases, I see DFIR and USN mentioned. Would either of these use cases use aggregations, or just search and sorting? If you see aggregations necessary, can you state which ones and what the use case for that is?

Apologies if I am oversimplifying, but it seems like:

  • For USN, searching for individual USN and also sorting is desired (for viewing the journal in the correct order).
  • For DFIR, the category is too broad for me to really speculate, but I wonder if search-and-sort is enough?

If search and sort is enough, and no aggregations are needed, I wonder if there is even a need for a 128-bit numeric type -- could strings be enough for these use cases, even if they may have speed differences from a (theoretical) 128-bit type?

@jpountz
Contributor

jpountz commented Mar 13, 2018

Some use-cases described on this issue do not need biginteger/bigdecimal:

  • storing large numbers: Elasticsearch always preserves the _source document; you can map such a field as a disabled object to tell Elasticsearch to store it but not try to index it or add doc values for it (meaning it will be returned but can't be searched or aggregated) - see the mapping sketch after this list
  • indexing id-like fields such as UUID-4: these fields typically do not need ordering, so indexing them as keyword will support exact queries and terms aggregations.
  • precise timestamps are already covered by a different issue, as noted before: Date type has not enough precision for the logging use case. #10005
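
A minimal sketch of the first two suggestions, built with Elasticsearch's XContentBuilder (the field names are made up; the point is only the "enabled": false object and the keyword type):

import org.elasticsearch.common.Strings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

public class MappingSketch {
    public static void main(String[] args) throws Exception {
        XContentBuilder mapping = XContentFactory.jsonBuilder()
            .startObject()
                .startObject("properties")
                    // Kept in _source and returned with hits, but never indexed or doc-valued.
                    .startObject("raw_numbers")
                        .field("type", "object")
                        .field("enabled", false)
                    .endObject()
                    // Exact queries and terms aggregations, but no numeric ordering.
                    .startObject("record_id")
                        .field("type", "keyword")
                    .endObject()
                .endObject()
            .endObject();
        System.out.println(Strings.toString(mapping));
    }
}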

In general it looks like there is more interest in big integers than big decimals. In particular, some use-cases look like they could benefit from unsigned 128-bit integers because they need ordering, which keyword cannot provide. It's still unknown whether range queries would be needed on such a field, however.

There seems to be less traction for big decimals. @jeffknupp Can you clarify what operations you would like to run on these big decimal fields (exact queries? range queries? sorting? aggregations?).

@jpountz
Contributor

jpountz commented Mar 13, 2018

cc @elastic/es-search-aggs

@tyre

tyre commented Apr 17, 2018

For Ethereum contracts, integers default to 256 bits, so this is an issue. Lucene doesn't support integers that large, so it seems out of the question, but 128 bits would cover a far larger set of values for aggregation, analysis, querying, etc.

@jpountz
Contributor

jpountz commented Apr 18, 2018

@tyre What kind of aggregations and querying would you perform on such a field?

@tyre

tyre commented Apr 19, 2018

@jpountz off the top of my head: sum, average, moving average, percentile, percentile rank, filter

@jpountz
Contributor

jpountz commented Apr 19, 2018

sum, average, moving average, percentile, percentile rank

I don't think we will ever support these aggregations on large integers. Numeric aggregations use doubles internally, so either we support big integers but still use doubles internally, in which case having big integers is pointless since they could just be indexed as doubles instead; or we try to make aggregations support wider data types, which would make them slower, which is also something we want to avoid. So I don't see this happening. The only aggregation that we could support on big integers would be the range aggregation.
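
To see why aggregating big integers through doubles gives nothing over indexing doubles in the first place, here is a small, Elasticsearch-agnostic illustration of where the rounding starts:

import java.math.BigInteger;

public class DoublePrecisionSketch {
    public static void main(String[] args) {
        // Doubles have a 53-bit significand, so integers above 2^53 stop being distinguishable.
        long a = 1L << 53;            // 9007199254740992
        long b = a + 1;               // 9007199254740993
        System.out.println((double) a == (double) b); // true: the +1 is rounded away

        // The effect only gets worse for 128-bit values.
        BigInteger big = BigInteger.ONE.shiftLeft(100);
        System.out.println(big.doubleValue() == big.add(BigInteger.ONE).doubleValue()); // true
    }
}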

@jpountz
Contributor

jpountz commented May 28, 2018

Some data: Beats are interested in supporting uint64, which they typically need for OS-level counters, and they would be fine with the accuracy loss due to the fact that these numbers would be converted to doubles for aggregations.

@insukcho
Contributor

Do we have any update on this case? Is BigInteger and BigDecimal support still not planned officially?

@jpountz
Contributor

jpountz commented Jul 27, 2018

@insukcho No, no support for BigInteger and BigDecimal. Note that the naming may be a bit confusing due to the fact that what some datastores call bigint maps to our longs. For instance, both MySQL's BIGINT and PostgreSQL's bigint are 64-bit integers, just like Elasticsearch's long.

@jpountz
Contributor

jpountz commented Jul 27, 2018

We discussed this issue in FixitFriday and agreed to implement 64-bit unsigned integers. I opened #32434. Thanks all for the feedback.

@jpountz jpountz closed this as completed Jul 27, 2018
@fredgalvao

@jpountz Thanks for taking the time to keep this on the radar.

Initially, my interest in this issue was not to have a custom/new datatype per se, but to have support for BigDecimal/BigInteger (the Java objects) in the Elasticsearch API (TransportClient using BulkProcessor, to be specific). I had to implement a generic number normalization to bring everything to its pure, non-scientific-notation representation in order to send data properly to Elasticsearch, because when I tried to simply proxy my ETL input to the Elasticsearch client, I'd get an error because BigDecimal/BigInteger don't have a mapped type in the translating API. To be honest, I first hit that issue on a 2.4.x cluster/API, and I'm on the way to finishing a migration to 6.3.x, and have not tried removing the numeric normalization to see if the limitation still exists (please feel free to point me to any obscure point in the changelogs or commit that would make me happy).

Although I'm sure 64-bit uint will solve most issues for people who wanted a new datatype for really long numbers, this issue of mine doesn't get addressed by proxy with that implementation. Are there plans to support in any way the translation of BigDecimal/BigInteger from the Java client's perspective (even if it means an error/warning when the value would incur precision loss)?

@jpountz
Contributor

jpountz commented Jul 28, 2018

I would expect this issue to be specific to the transport client, which we want to replace with a new REST client - the high-level REST client, as opposed to the low-level REST client, which doesn't try to understand requests and responses and only works with maps of maps. With a REST client, BigIntegers wouldn't be transported any differently from shorts, ints and longs, so I would expect things to work as long as the values that your BigIntegers store are in the acceptable range of the mapping type, e.g. -2^63 to 2^63-1 for long.
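
A hedged sketch of the kind of client-side check this implies - BigInteger.longValueExact is a standard Java 8 method; the rest is purely illustrative:

import java.math.BigInteger;

public class LongRangeCheck {
    public static void main(String[] args) {
        BigInteger value = new BigInteger("9223372036854775807"); // 2^63 - 1, still fits in a long
        try {
            long asLong = value.longValueExact(); // throws ArithmeticException if outside the long range
            System.out.println("safe to send to a long field: " + asLong);
        } catch (ArithmeticException e) {
            System.out.println("would lose information in a long field: " + value);
        }
    }
}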

@fredgalvao

Fair enough. I forgot to take into account the client progression when re-evaluating the issue. Thanks @jpountz !

@insukcho
Contributor

Thanks for handling this case @jpountz .
I will keep watching #32434 for further support of 64-bit unsigned integers.

@fredgalvao

I'm risking being out of scope by stretching this, but I think we still have an issue on 6.3.0 with BigDecimal, at least in one place:

I'm using the LLRC (to talk to the cluster) in conjunction with the server artifact org.elasticsearch:elasticsearch (to build/manipulate queries), and using BigDecimal params on a Script object to be used in a bucketSelector, and it fails with the following:

cannot write xcontent for unknown value of type class java.math.BigDecimal
java.lang.IllegalArgumentException: cannot write xcontent for unknown value of type class java.math.BigDecimal
	at org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:755)
	at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:810)
	at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:792)
	at org.elasticsearch.common.xcontent.XContentBuilder.field(XContentBuilder.java:788)
	at org.elasticsearch.script.Script.toXContent(Script.java:663)
	at org.elasticsearch.common.xcontent.XContentBuilder.value(XContentBuilder.java:779)
	at org.elasticsearch.common.xcontent.XContentBuilder.value(XContentBuilder.java:772)
	at org.elasticsearch.common.xcontent.XContentBuilder.field(XContentBuilder.java:764)
	at org.elasticsearch.search.aggregations.pipeline.bucketselector.BucketSelectorPipelineAggregationBuilder.internalXContent(BucketSelectorPipelineAggregationBuilder.java:120)
	at org.elasticsearch.search.aggregations.pipeline.AbstractPipelineAggregationBuilder.toXContent(AbstractPipelineAggregationBuilder.java:130)
	at org.elasticsearch.common.Strings.toString(Strings.java:778)
	at org.elasticsearch.search.aggregations.PipelineAggregationBuilder.toString(PipelineAggregationBuilder.java:92)

@jpountz Am I out of scope? Should I not expect the builders on the elasticsearch artifact to be as type-agnostic as the rest of the LLRC? Should I open a new issue?
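
Until that is fixed, one possible workaround, assuming the script only needs the numeric value, is to hand XContentBuilder a type it does know about; the param name and script body below are made up:

import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;

public class BucketSelectorParamsWorkaround {
    public static Script buildScript(BigDecimal threshold) {
        Map<String, Object> params = new HashMap<>();
        // XContentBuilder.unknownValue does not handle BigDecimal, so pass a double
        // (or a String, if the exact decimal representation matters) instead.
        params.put("threshold", threshold.doubleValue());
        return new Script(ScriptType.INLINE, "painless", "params.total > params.threshold", params);
    }
}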

@jpountz
Contributor

jpountz commented Aug 14, 2018

@fredgalvao This is a different bug. I would expect us to fix it when addressing #32395.

@tpmccallum

Just pinging this conversation to point to a specific issue which requests that Elasticsearch commence support for Ethereum blockchain data types, namely uint256 (2^256 - 1):
#38242
