Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Result document IDs are not sufficiently unique #50613

Closed
droberts195 opened this issue Jan 3, 2020 · 3 comments · Fixed by #50644
Closed

[ML] Result document IDs are not sufficiently unique #50613

droberts195 opened this issue Jan 3, 2020 · 3 comments · Fixed by #50644
Assignees
Labels
>bug :ml Machine learning

Comments

@droberts195
Copy link
Contributor

  • In order to ensure we don't get duplicate results if a job does the same work multiple times (for example after failing over from one node to another), our results documents have IDs generated from their contents, such that if two identical results are indexed the second will overwrite the first and we won't get a duplicate
  • Since we cannot literally have the document ID be the whole result, because the document IDs are limited to 512 bytes and the result could contain more data than this, we create an ID using a formula that we thought was likely to avoid ID collisions for different results
    • In particular, since by/over/partition field values can be long, the document ID includes a combination of the hashes of these values, plus the total length - the assumption was that it would be unlikely for there to be a hash collision unless the total length was different
  • Unfortunately, some by/over/partition values do produce hash collisions for identical total lengths of by/over/partition field values. An example is:
    • by field value = L018, over field value = null, partition field value = 128
    • by field value = L017, over field value = null, partition field value = 228
    • Both produce a hash of -2073230751

More variation is needed in the document IDs. However, before making the change an analysis of the possibilities for duplicate documents caused by the change is required. We need to consider all the places where we're assuming that duplicate results will overwrite one another and the likelihood of occurrence.

The problem affects at least model plot, forecast and anomaly record results. Other types of results, for example, influencer results, contain a hash of just one string. We should consider consistency in the way IDs are created across the different types.

@droberts195 droberts195 added >bug :ml Machine learning labels Jan 3, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195 droberts195 self-assigned this Jan 3, 2020
@benwtrent
Copy link
Member

benwtrent commented Jan 3, 2020

This might be lost to history, but why aren't we using MurmurHash3 that is provided within org.elasticsearch.common.hash? Were we too scared of collisions with it?

@droberts195
Copy link
Contributor Author

I was just working on a change this afternoon to use MurmurHash3. The PR should be ready on Monday morning. I think we just didn’t consider it necessary originally. It certainly fixes the collision cases we’ve seen.

The biggest question is now what happens if someone upgrades from a version with the old IDs to a version with the new ones. I haven’t yet checked if this makes duplicate results likely in any situation.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 6, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes elastic#50613
droberts195 added a commit that referenced this issue Jan 7, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
droberts195 added a commit that referenced this issue Jan 7, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes elastic#50613
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants