
Support for BigInteger and BigDecimal #17006

Closed
clintongormley opened this issue Mar 8, 2016 · 55 comments
Labels
discuss, >feature, Meta, :Search Foundations/Mapping (index mappings, including merging and defining field types), Team:Search Foundations (meta label for the Search Foundations team in Elasticsearch)

Comments

@clintongormley

Lucene now has sandbox support for BigInteger (LUCENE-7043), and hopefully BigDecimal will follow soon. We should look at what needs to be done to support them in Elasticsearch.

I propose adding big_integer and big_decimal types which have to be specified explicitly - they shouldn't be types which can be detected by dynamic mapping.

Many languages don't support big int/decimal. JavaScript will convert to floats or throw an exception if a number is out of range. This can be worked around by always rendering these numbers in JSON as strings. We could possibly accept known bigints/bigdecimals as numbers, but there are a few places where this could be a problem:

  • indexing a known big field (do we know ahead of time to parse a floating point as a BigDecimal?)
  • dynamic mapping (a floating point number could have lost precision before the field is defined as big_decimal)
  • ingest pipeline (ingest doesn't know about field mappings)

The above could be worked around by telling Jackson to parse floats and ints as BIG* (USE_BIG_DECIMAL_FOR_FLOATS and USE_BIG_INTEGER_FOR_INTS) but this may well generate a lot of garbage for what is an infrequent use case.

Alternatively, we could just say that Big* should always be passed in as strings if they are to maintain their precision.
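
For reference, a minimal sketch of the Jackson option mentioned above - USE_BIG_DECIMAL_FOR_FLOATS and USE_BIG_INTEGER_FOR_INTS are real DeserializationFeature flags; the surrounding setup is only illustrative:

import java.util.Map;

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BigNumberParsingSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Parse JSON floating-point numbers as BigDecimal and integral numbers as BigInteger
        // instead of double/long, at the cost of extra allocations ("garbage").
        mapper.enable(DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS);
        mapper.enable(DeserializationFeature.USE_BIG_INTEGER_FOR_INTS);

        Map<?, ?> doc = mapper.readValue(
                "{\"price\": 3.141592653589793238462643383279, \"id\": 98765432109876543210}",
                Map.class);
        System.out.println(doc.get("price").getClass()); // class java.math.BigDecimal
        System.out.println(doc.get("id").getClass());    // class java.math.BigInteger
    }
}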

@rmuir
Contributor

rmuir commented Mar 8, 2016

One thing to note here is that our points support is for fixed-width types.

In other words, the BigIntegerPoint in Lucene is a little misleading: it does not in fact support "immutable arbitrary-precision integers".

Instead it's a signed 128-bit integer type, more like a long long. If you try to give it a too-big BigInteger you get an exception! But otherwise BigInteger is a natural API for the user to provide a 128-bit integer.

On the other hand, if someone wanted to add support for a 128-bit floating point type, it's of course possible, but I have my doubts whether BigDecimal is even the right Java API for that (BigDecimal is a very different thing than a quad-precision floating point type).

I already see some confusion (e.g. "lossless storage") referenced in the issue, so I think it's important to disambiguate a little.

Maybe names like BigInteger/BigDecimal should be avoided with these, but that's part of why the thing is in the sandbox; we can change that (e.g. to LongLongPoint).
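
To make the fixed-width point concrete, here is a rough sketch against the Lucene sandbox class mentioned above (class and constructor as of LUCENE-7043; the exact exception type and message are an assumption and may differ between versions):

import java.math.BigInteger;

import org.apache.lucene.document.BigIntegerPoint; // lucene-sandbox
import org.apache.lucene.document.Document;

public class BigIntegerPointSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        // 2^127 - 1 fits in the signed 128-bit encoding and is accepted.
        doc.add(new BigIntegerPoint("usn", BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE)));

        // 2^127 does not fit: the field rejects it rather than widening,
        // which is why this is really a "long long", not arbitrary precision.
        try {
            doc.add(new BigIntegerPoint("usn", BigInteger.ONE.shiftLeft(127)));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}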

@clintongormley
Author

Thanks for the heads-up @rmuir - I was indeed unaware of that.

@jpountz
Contributor

jpountz commented Apr 14, 2016

I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use SORTED_SET doc values, but if the main use-case is to run stats aggregations, this won't work, and the fact that we have a long long type will probably be confusing since users won't be able to run the operations that they expect to work.

@rmuir
Contributor

rmuir commented Apr 17, 2016

I agree: we did some digging the other day.

One cause of confusion is that many databases have a bigint type which is really a 64-bit long! So I'm concerned about people using a too-big type when it's not needed, due to naming confusion.

Also, we have the challenge of how such numbers would behave in e.g. scripting and other places. Personally, I've only used BigInteger for cryptography-like things. You can see from its API that it's really geared at that. So maybe it's not something we should expose?

@ravicious

@jpountz:

I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use SORTED_SET doc values, but if the main use-case is to run stats aggregations, this won't work (…)

Sorry for my newb questions, but why wouldn't this work? Aren't stats aggregations done with floats possibly inaccurate due to floating-point arithmetic?

@jpountz
Contributor

jpountz commented Apr 17, 2016

They can be inaccurate indeed.

The point I was making above is that Lucene provides two ways to encode doc values. On the one hand, we have SORTED_SET, which assigns an ordinal to every value per segment. This way you can efficiently sort and run terms, cardinality or range aggregations, since these operations can work directly on the ordinals. However, the cost of resolving a value given an ordinal is high enough that anything that needs access to the actual values, such as a stats aggregation, would be slow. On the other hand, there is BINARY, which just encodes the raw binary values in a column-stride fashion. This would be slower for sorting and terms/cardinality/range aggregations, but reading the original values would be faster than with SORTED_SET, so we could theoretically run e.g. stats aggregations or use the values in scripts.

So knowing about the use-cases will help figure out which format to use. But then if we want to leverage all 128 bits of the values, we will have to duplicate implementations for everything that needs to add or multiply values such as stats/sum/avg aggregations. This would be an important burden in terms of maintenance so we would certainly not want to go that route without making sure that there are valid/common use-cases for it first.
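
As a very rough illustration of the two encodings being compared (the Lucene doc-values field classes are real; representing a 128-bit value as a BytesRef is just one possible choice, not what an eventual implementation would necessarily do):

import java.math.BigInteger;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

public class DocValuesEncodingSketch {
    public static void main(String[] args) {
        BigInteger value = new BigInteger("123456789012345678901234567890");
        BytesRef bytes = new BytesRef(value.toByteArray());

        Document doc = new Document();
        // SORTED_SET: per-segment ordinals; efficient sorting and terms/cardinality/range aggregations,
        // but resolving ordinals back to values is costly.
        doc.add(new SortedSetDocValuesField("big_field_ordinals", bytes));
        // BINARY: raw column-stride values; slower for ordinal-style operations,
        // but faster when stats aggregations or scripts need the values themselves.
        doc.add(new BinaryDocValuesField("big_field_raw", bytes));
    }
}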

@devgc

devgc commented Jul 29, 2016

This feature would be useful for the Digital Forensics and Incident Response (DFIR) community. There are lots of data structures we look at that have uint64 types. When we index these, if the field is mapped as a long and the value is out of range, information can be lost.

@rmuir
Contributor

rmuir commented Jul 29, 2016

I see a 64-bit unsigned integer type (versus the 64-bit signed type we have) as a separate feature, actually. This can be implemented more efficiently with Lucene (and made easier with Java 8).

Yeah, figuring out how to make a 64-bit unsigned type work efficiently in, say, the scripting API might be a challenge as it stands today. Perhaps it truly must be a Number backed by BigInteger to work best today, which would be slower.

But in general, typical things such as ranges and aggregations would be as fast as with the 64-bit signed type we have today, and perhaps a newer scripting API (with more type information) could make scripting faster too down the road, so it is much more compelling than larger integers (e.g. 128-bit), which will always be slower.
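
For what it's worth, the Java 8 support alluded to above looks roughly like this - these are standard java.lang.Long helpers, so a uint64 can live in the same 64 bits as a signed long, with only comparison and formatting needing to be unsigned-aware:

public class Uint64Sketch {
    public static void main(String[] args) {
        // A uint64 value above Long.MAX_VALUE, stored in the same 64 bits.
        long counter = Long.parseUnsignedLong("18446744073709551615"); // 2^64 - 1

        System.out.println(counter);                            // -1 when interpreted as signed
        System.out.println(Long.toUnsignedString(counter));     // 18446744073709551615
        System.out.println(Long.compareUnsigned(counter, 1L));  // positive: unsigned ordering is preserved
    }
}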

As for use cases where BigInteger is truly needed, to me that situation is less clear. I would like for us to consider the two cases (64-bit unsigned vs larger integers) as separate.

@jeffknupp

@rmuir it's surprising to me that you have to ask for cases where BigDecimal (i.e. a decimal representation with arbitrary precision) would be needed, as much data science/analytics work requires exact representations of the source data without loss of precision. If putting my data into ES means that I am necessarily going to lose precision, that's a non-starter for many uses. Nothing in the JSON spec suggests this. In fact, it expressly mentions that numerics are arbitrary precision and it is up to the various libraries to represent that properly.

@jasontedor
Member

Nothing in the JSON spec suggests this. In fact, it expressly mentions that numerics are arbitrary precision and it is up to the various libraries to represent that properly.

This is not correct; the spec says:

This specification allows implementations to set limits on the range and precision of numbers accepted.

You are correct that numerics in the JSON spec are arbitrary precision, but nothing in the spec suggests that implementations must support this and, in fact, implementations do not have to support this.

The spec further says:

Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.

@jeffknupp

jeffknupp commented Sep 26, 2016

@jasontedor I was referring to ECMA-404, but regardless, my point is that the Elastic documentation specifically says that _source, for example, contains the original JSON message verbatim and is used for search results. I think you'd have to heavily amend statements like that in the documentation to explicitly describe how JSON numbers are handled internally in ES.

You also cut your quoting of the spec short, as the entire paragraph is:

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.

This is exactly what I'm referring to, as "the software that created it" (i.e. a client) has no reason to suspect, based on the documentation, that either of these values would lose precision.
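
As a concrete illustration of the interoperability caveat both sides are quoting, binary64 cannot represent either of the spec's example values - a minimal, purely illustrative check:

import java.math.BigDecimal;

public class JsonNumberPrecision {
    public static void main(String[] args) {
        // The spec's 3.141592653589793238462643383279 parsed as an IEEE 754 binary64 double:
        double pi = Double.parseDouble("3.141592653589793238462643383279");
        System.out.println(new BigDecimal(pi)); // 3.14159265358979311599796... - digits beyond ~17 are gone

        // The spec's 1E400 overflows binary64 entirely:
        System.out.println(Double.parseDouble("1E400")); // Infinity

        // Kept as a string and parsed with BigDecimal, the value survives unchanged:
        System.out.println(new BigDecimal("3.141592653589793238462643383279"));
    }
}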

@jpountz
Contributor

jpountz commented Sep 26, 2016

@jeffknupp

it's surprising to me that you have to ask for cases where BigDecimal would be needed

We are asking for use-cases because depending on the expectations, the feature could be implemented in very different ways.

For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.

If the use-case requires more than 64 bits (e.g. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky.

If arbitrary precision is needed, then there is not much we can do efficiently, at least at the moment.

@jasontedor
Member

I was referring to ECMA-404

The JSON spec only spells out the representation in JSON which is used for interchange; it is completely agnostic to how such information is represented by software consuming such JSON.

I think you'd have to heavily amend statements like that in the documentation to explicitly describe how JSON numbers are handled internally in ES.

The documentation spells out the numeric datatypes that are supported.

@devgc

devgc commented Oct 7, 2016

Here is a good example.

Windows uses the USN Journal to record changes made to the file system. These records are extremely important "logs" for people in the DFIR community.

Version 2 records use a 64-bit unsigned integer to store reference numbers.

Version 3 records use a 128-bit ordinal number for reference numbers.

For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.

I would say that this is important for the DFIR community.

If the use-case requires more than 64 bits (e.g. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky.

I would say this is equally important.

There are many other logs that record these references, thus by maintaining their native types we can correlate logs to determine certain types of activity.

@devgc

devgc commented Oct 7, 2016

As for use cases where BigInteger is truly needed, to me that situation is less clear. I would like for us to consider the two cases (64-bit unsigned vs larger integers) as separate.

Should we go ahead and create a new issue for a 64-bit unsigned type as a feature?

@marcurdy

marcurdy commented Oct 7, 2016

For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.

I'm also in the digital forensics world and see merit in providing a 64-bit unsigned type. If it were 128-bit with a speed impact, it wouldn't affect the way I process data. My use is less real-time and more one-time bulk processing. The biggest factor to me would be what makes the most sense from the developer side with respect to Java and OS integration.

@tezcane

tezcane commented Oct 22, 2016

Spring Data JPA supports BigInteger and BigDecimal, so any code where you try to also use Elasticsearch with them will fail:

/**  Spring Data ElasticSearch repository for the Task entity.  */
public interface TaskSearchRepository extends ElasticsearchRepository<Task,BigInteger> {
     //THIS COMPILES BUT FAILS ON INIT
}

/**  Spring Data JPA repository for the Task entity. */
@SuppressWarnings("unused")
public interface TaskRepository extends JpaRepository<Task,BigInteger> {
    //THIS IS OK
}

I think a hack (that may end up being almost as efficient) is to convert my BigInteger to a string for use with Elasticsearch:

/**  Spring Data ElasticSearch repository for the Task entity.  */
public interface TaskSearchRepository extends ElasticsearchRepository<Task,String> {
     //HACK, convert biginteger to string when saving to elasticsearch...
}

So these data types should be added in my opinion.

@niemyjski
Contributor

We also need something like this; we are currently unable to store C#'s Decimal.MaxValue.

@jordansissel
Contributor

jordansissel commented Apr 3, 2017

On use cases, I see DFIR and USN mentioned. Would either of these use cases use aggregations, or just search and sorting? If you see aggregations necessary, can you state which ones and what the use case for that is?

Apologies if I am oversimplifying, but it seems like:

  • For USN, searching for individual USN and also sorting is desired (for viewing the journal in the correct order).
  • For DFIR, the category is too broad for me to really speculate, but I wonder if search-and-sort is enough?

If search and sort is enough, and no aggregations are needed, I wonder if there is even a need for a 128-bit numeric type -- could strings be enough for these use cases, even if they may have speed differences from a (theoretical) 128-bit type?

@jpountz
Contributor

jpountz commented Mar 13, 2018

Some use-cases described on this issue do not need biginteger/bigdecimal:

  • storing large numbers: Elasticsearch always preserves the _source document; you can map such a field as a disabled object to tell Elasticsearch to store it but not try to index it or add doc values for it (meaning it will be returned but can't be searched or aggregated) - see the mapping sketch after this list
  • indexing id-like fields such as UUID-4: these fields typically do not need ordering, so indexing them as keyword will support exact queries and terms aggregations.
  • precise timestamps are already covered by a different issue, as noted before: Date type has not enough precision for the logging use case. #10005
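
A minimal sketch of the first two suggestions, built with Elasticsearch's XContentBuilder (the field names are made up; the point is only the "enabled": false object and the keyword type):

import org.elasticsearch.common.Strings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

public class MappingSketch {
    public static void main(String[] args) throws Exception {
        XContentBuilder mapping = XContentFactory.jsonBuilder()
            .startObject()
                .startObject("properties")
                    // Kept in _source and returned with hits, but never indexed or doc-valued.
                    .startObject("raw_numbers")
                        .field("type", "object")
                        .field("enabled", false)
                    .endObject()
                    // Exact queries and terms aggregations, but no numeric ordering.
                    .startObject("record_id")
                        .field("type", "keyword")
                    .endObject()
                .endObject()
            .endObject();
        System.out.println(Strings.toString(mapping));
    }
}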

In general it looks like there is more interest in big integers than big decimals. In particular, some use-cases look like they could benefit from unsigned 128-bit integers because they need ordering, which keyword cannot provide. It's still unknown whether range queries would be needed on such a field, however.

There seems to be less traction for big decimals. @jeffknupp Can you clarify what operations you would like to run on these big decimal fields (exact queries? range queries? sorting? aggregations?).

@jpountz
Contributor

jpountz commented Mar 13, 2018

cc @elastic/es-search-aggs

@tyre

tyre commented Apr 17, 2018

For Ethereum contracts, integers default to 256 bits, so this is an issue. Lucene doesn't support integers that large, so it seems out of the question, but 128 bits would cover a far larger set of values for aggregation, analysis, querying, etc.

@jpountz
Contributor

jpountz commented Apr 18, 2018

@tyre What kind of aggregations and querying would you perform on such a field?

@tyre

tyre commented Apr 19, 2018

@jpountz off the top of my head: sum, average, moving average, percentile, percentile rank, filter

@jpountz
Contributor

jpountz commented Apr 19, 2018

sum, average, moving average, percentile, percentile rank

I don't think we will ever support these aggregations on large integers. Numeric aggregations use doubles internally, so either we support big integers but still use doubles internally, in which case having big integers is pointless since they could just be indexed as doubles instead; or we try to make aggregations support wider data types, which would make them slower, which is also something we want to avoid. So I don't see this happening. The only aggregation that we could support on big integers would be the range aggregation.
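
To see why aggregating big integers through doubles gives nothing over indexing doubles in the first place, here is a small, Elasticsearch-agnostic illustration of where the rounding starts:

import java.math.BigInteger;

public class DoublePrecisionSketch {
    public static void main(String[] args) {
        // Doubles have a 53-bit significand, so integers above 2^53 stop being distinguishable.
        long a = 1L << 53;            // 9007199254740992
        long b = a + 1;               // 9007199254740993
        System.out.println((double) a == (double) b); // true: the +1 is rounded away

        // The effect only gets worse for 128-bit values.
        BigInteger big = BigInteger.ONE.shiftLeft(100);
        System.out.println(big.doubleValue() == big.add(BigInteger.ONE).doubleValue()); // true
    }
}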

@jpountz
Contributor

jpountz commented May 28, 2018

Some data: Beats are interested in supporting uint64, which they typically need for OS-level counters, and they would be fine with the accuracy loss due to the fact that these numbers would be converted to doubles for aggregations.

@insukcho
Contributor

Do we have any update on this case? Is BigInteger and BigDecimal support still not planned officially?

@jpountz
Contributor

jpountz commented Jul 27, 2018

@insukcho No, no support for BigInteger and BigDecimal. Note that the naming may be a bit confusing due to the fact that what some datastores call bigint maps to our longs. For instance, both MySQL's BIGINT and PostgreSQL's bigint are 64-bit integers, just like Elasticsearch's long.

@jpountz
Contributor

jpountz commented Jul 27, 2018

We discussed this issue in FixitFriday and agreed to implement 64-bit unsigned integers. I opened #32434. Thanks all for the feedback.

@jpountz jpountz closed this as completed Jul 27, 2018
@fredgalvao

@jpountz Thanks for taking the time to keep this on the radar.

Initially, my interest in this issue was not to have a custom/new datatype per se, but to have support for BigDecimal/BigInteger (the Java objects) in the Elasticsearch API (TransportClient using BulkProcessor, to be specific). I had to implement a generic number normalization to bring everything to its pure, non-scientific-notation representation in order to send data properly to Elasticsearch, because when I tried to simply proxy my ETL input to the Elasticsearch client, I'd get an error because BigDecimal/BigInteger don't have a mapped type in the translating API. To be honest, I first hit that issue on a 2.4.x cluster/API, and I'm on the way to finishing a migration to 6.3.x, and have not tried removing the numeric normalization to see if the limitation still exists (please feel free to point me to any obscure point in the changelogs or commit that would make me happy).

Although I'm sure 64-bit uint will solve most issues for people who wanted a new datatype for really long numbers, this issue of mine doesn't get addressed by proxy with that implementation. Are there plans to support in any way the translation of BigDecimal/BigInteger from the Java client's perspective (even if it means an error/warning when the value would incur precision loss)?

@jpountz
Contributor

jpountz commented Jul 28, 2018

I would expect this issue to be specific to the transport client, which we want to replace with a new REST client - the high-level REST client, as opposed to the low-level REST client, which doesn't try to understand requests and responses and only works with maps of maps. With a REST client, BigIntegers wouldn't be transported any differently from shorts, ints and longs, so I would expect things to work as long as the values that your BigIntegers store are in the acceptable range of the mapping type, e.g. -2^63 to 2^63-1 for long.
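
A hedged sketch of the kind of client-side check this implies - BigInteger.longValueExact is a standard Java 8 method; the rest is purely illustrative:

import java.math.BigInteger;

public class LongRangeCheck {
    public static void main(String[] args) {
        BigInteger value = new BigInteger("9223372036854775807"); // 2^63 - 1, still fits in a long
        try {
            long asLong = value.longValueExact(); // throws ArithmeticException if outside the long range
            System.out.println("safe to send to a long field: " + asLong);
        } catch (ArithmeticException e) {
            System.out.println("would lose information in a long field: " + value);
        }
    }
}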

@fredgalvao

Fair enough. I forgot to take into account the client progression when re-evaluating the issue. Thanks @jpountz !

@insukcho
Contributor

Thanks for handling this case @jpountz .
I will keep watching #32434 for further support of 64-bit unsigned integers.

@fredgalvao

I'm risking being out of scope by stretching this, but I think we still have an issue on 6.3.0 with BigDecimal, at least in one place:

I'm using the LLRC (to talk to the cluster) in conjunction with the server artifact org.elasticsearch:elasticsearch (to build/manipulate queries), and using BigDecimal params on a Script object to be used in a bucketSelector, and it fails with the following:

cannot write xcontent for unknown value of type class java.math.BigDecimal
java.lang.IllegalArgumentException: cannot write xcontent for unknown value of type class java.math.BigDecimal
	at org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:755)
	at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:810)
	at org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:792)
	at org.elasticsearch.common.xcontent.XContentBuilder.field(XContentBuilder.java:788)
	at org.elasticsearch.script.Script.toXContent(Script.java:663)
	at org.elasticsearch.common.xcontent.XContentBuilder.value(XContentBuilder.java:779)
	at org.elasticsearch.common.xcontent.XContentBuilder.value(XContentBuilder.java:772)
	at org.elasticsearch.common.xcontent.XContentBuilder.field(XContentBuilder.java:764)
	at org.elasticsearch.search.aggregations.pipeline.bucketselector.BucketSelectorPipelineAggregationBuilder.internalXContent(BucketSelectorPipelineAggregationBuilder.java:120)
	at org.elasticsearch.search.aggregations.pipeline.AbstractPipelineAggregationBuilder.toXContent(AbstractPipelineAggregationBuilder.java:130)
	at org.elasticsearch.common.Strings.toString(Strings.java:778)
	at org.elasticsearch.search.aggregations.PipelineAggregationBuilder.toString(PipelineAggregationBuilder.java:92)

@jpountz Am I out of scope? Should I not expect the builders on the elasticsearch artifact to be as type-agnostic as the rest of the LLRC? Should I open a new issue?
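
Until that is fixed, one possible workaround, assuming the script only needs the numeric value, is to hand XContentBuilder a type it does know about; the param name and script body below are made up:

import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;

public class BucketSelectorParamsWorkaround {
    public static Script buildScript(BigDecimal threshold) {
        Map<String, Object> params = new HashMap<>();
        // XContentBuilder.unknownValue does not handle BigDecimal, so pass a double
        // (or a String, if the exact decimal representation matters) instead.
        params.put("threshold", threshold.doubleValue());
        return new Script(ScriptType.INLINE, "painless", "params.total > params.threshold", params);
    }
}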

@jpountz
Contributor

jpountz commented Aug 14, 2018

@fredgalvao This is a different bug. I would expect us to fix it when addressing #32395.

@tpmccallum

Just pinging this conversation to point to a specific issue which requests that Elasticsearch commence support for Ethereum blockchain data types, namely uint256 (2^256 - 1):
#38242
