Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ingest] Re-introduce the Hash Processor with consistent keys across all nodes #34085

Open
jakelandis opened this issue Sep 26, 2018 · 10 comments

Comments

@jakelandis
Copy link
Contributor

commented Sep 26, 2018

A Hash processor was introduced (#31087), and subsequently reverted once it was realized that if the keys (local to the node) ever got out of sync that could cause silent functional issues (same values hashed to different values based on the node which processed them).

Since the key is sensitive data, we can not simply store it cluster state. After some internal discussions we have identified two possible strategies to mitigate the out of sync issue.

a) Introduce the concept of a consistent setting. Keep the key stored in the keystore and keep a id of sorts (a hash of the key ?) either in cluster state, or require that id to be included in the configuration for the hash processor. What to do when the setting is inconsistent will require some more discussion.

b) Introduce the concept of encrypted settings stored in the cluster state. More discussion here: #32727

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Sep 26, 2018

@jakelandis

This comment has been minimized.

Copy link
Contributor Author

commented Sep 3, 2019

Unblocked by #40416

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Sep 3, 2019

@danhermann danhermann self-assigned this Sep 4, 2019

@jakelandis

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

Once #46241 is merged, we may want to consider this processor to be async.

@jasontedor

This comment has been minimized.

Copy link
Member

commented Sep 9, 2019

@jakelandis Can you clarify why we would want to go async here? I would expect calculating the hash to be fast, and not worth context switching off the processing thread?

@jakelandis

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

At job-1 we did something similar on high volume ingestion (not to ES) and calculating the hash was by far the most expensive of the transformations applied to the data. Once this is implemented, running this with Rally would be helpful to make that decision.

@jasontedor

This comment has been minimized.

Copy link
Member

commented Sep 9, 2019

It might be the most expensive, but that doesn't mean it's so expensive that it needs to be done after a couple of context switches. Recall, the use-case for the hash processor is to anonymize fields like names. Those are not huge fields.

@jakelandis

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

I'm not sure if context switches or cryptographic hashes (for small data) is more expensive. However, I do think we should measure it and let that guide the decision.

EDIT: for clarity, i am not saying we should choose async if it is marginally faster in the tests... IIRC at job-1 the overhead was substantial 5-10ms per doc on enterprise class bare metal when multiplied by millions adds up. I would just want to confirm that we are not introducing a large performance difference

@jasontedor

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

I am concerned with spending time on details that are intuitively not expected to be a problem, and that maybe we should only consider if it proves to be a problem in the wild.

@jakelandis

This comment has been minimized.

Copy link
Contributor Author

commented Sep 11, 2019

I ran a quick JMH benchmark to see if my concerns were justified. During that process, remembered that that at job-1 we used SHA256withRSA, not HmacSHA256 (the default).

HmacSHA256 is plenty fast enough and we don't need to explore going aysnc here. (SHA256withRSA is indeed slow but a not relevant here since it is not supported)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.