[ML] Result document IDs are not sufficiently unique #50613

droberts195 · 2020-01-03T15:58:23Z

In order to ensure we don't get duplicate results if a job does the same work multiple times (for example after failing over from one node to another), our results documents have IDs generated from their contents, such that if two identical results are indexed the second will overwrite the first and we won't get a duplicate
Since we cannot literally have the document ID be the whole result, because the document IDs are limited to 512 bytes and the result could contain more data than this, we create an ID using a formula that we thought was likely to avoid ID collisions for different results
- In particular, since by/over/partition field values can be long, the document ID includes a combination of the hashes of these values, plus the total length - the assumption was that it would be unlikely for there to be a hash collision unless the total length was different
Unfortunately, some by/over/partition values do produce hash collisions for identical total lengths of by/over/partition field values. An example is:
- by field value = L018, over field value = null, partition field value = 128
- by field value = L017, over field value = null, partition field value = 228
- Both produce a hash of -2073230751

More variation is needed in the document IDs. However, before making the change an analysis of the possibilities for duplicate documents caused by the change is required. We need to consider all the places where we're assuming that duplicate results will overwrite one another and the likelihood of occurrence.

The problem affects at least model plot, forecast and anomaly record results. Other types of results, for example, influencer results, contain a hash of just one string. We should consider consistency in the way IDs are created across the different types.

elasticmachine · 2020-01-03T15:58:27Z

Pinging @elastic/ml-core (:ml)

benwtrent · 2020-01-03T18:10:20Z

This might be lost to history, but why aren't we using MurmurHash3 that is provided within org.elasticsearch.common.hash? Were we too scared of collisions with it?

droberts195 · 2020-01-03T19:16:13Z

I was just working on a change this afternoon to use MurmurHash3. The PR should be ready on Monday morning. I think we just didn’t consider it necessary originally. It certainly fixes the collision cases we’ve seen.

The biggest question is now what happens if someone upgrades from a version with the old IDs to a version with the new ones. I haven’t yet checked if this makes duplicate results likely in any situation.

Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values. The 32 bit Java hash was not sufficiently unique, and could produce identical numbers for relatively common combinations of by/partition field values such as L018/128 and L017/228. Fixes elastic#50613

Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values. The 32 bit Java hash was not sufficiently unique, and could produce identical numbers for relatively common combinations of by/partition field values such as L018/128 and L017/228. Fixes #50613

Switch from a 32 bit Java hash to a 128 bit Murmur hash for creating document IDs from by/over/partition field values. The 32 bit Java hash was not sufficiently unique, and could produce identical numbers for relatively common combinations of by/partition field values such as L018/128 and L017/228. Fixes elastic#50613

droberts195 added >bug :ml Machine learning labels Jan 3, 2020

droberts195 self-assigned this Jan 3, 2020

droberts195 mentioned this issue Jan 6, 2020

[ML] Improve uniqueness of result document IDs #50644

Merged

droberts195 closed this as completed in #50644 Jan 7, 2020

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Result document IDs are not sufficiently unique #50613

[ML] Result document IDs are not sufficiently unique #50613

droberts195 commented Jan 3, 2020

elasticmachine commented Jan 3, 2020

benwtrent commented Jan 3, 2020 •

edited

droberts195 commented Jan 3, 2020

[ML] Result document IDs are not sufficiently unique #50613

[ML] Result document IDs are not sufficiently unique #50613

Comments

droberts195 commented Jan 3, 2020

elasticmachine commented Jan 3, 2020

benwtrent commented Jan 3, 2020 • edited

droberts195 commented Jan 3, 2020

benwtrent commented Jan 3, 2020 •

edited