Ability to decouple a document shard allocation with its ID #28206
Comments
It sounds to me that you're trying to use the document id for something quite different from what it's designed for: by design, the id defines the shard to which each document belongs. You can add other identifiers to your documents and use these to find them, and this would allow Elasticsearch to spread the load more evenly across the nodes. |
@DaveCTurner it could be anything. Even if we use another field within the document, such as color, a document with color=red would always live in the same shard, and every call to obtain it would go to the same node. That is the cause of the issue. If we have X documents identified by this field, the calls for them would always go to this node. So there should be another lever that lets us move documents to other nodes for better balancing, unless the product itself can, once such a feature is in, automatically move them to a different shard for better allocation. On a separate note:
I am not sure this is how other software would see it. An id is the id of the document; Elastic chose to tie the shard logic to it, which is unfortunate. As far as the data store is concerned, the client asked for a document to be inserted, the insert asked for an id, and an id was provided; if no id is provided, an automatic identifier is assigned. This is the same logic as for a database table. |
This is reasonably standard, and a quick survey of other sharded datastores will turn up similar design decisions. If you have an external identifier that you don't want to have an impact on the allocation of documents then you should simply store it in a different field. |
Maybe also have a look at the |
@ywelsch, this feature can only say at put time where the document should go. What is needed is the ability to move the document to a different shard after it has been put; today we have to delete the document and reindex it with a different routing value. What is needed is a fundamental decoupling between documents and shard allocation. The system should be intelligent enough to keep track of how many documents are going to each shard, and either adjust the shard logic if most documents are going to the same shard or move documents seamlessly between shards if the get load on a particular node is heavy. This metadata state management should be part of the system, and its complexity should not be passed on to the clients. |
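The delete-and-reindex workaround mentioned above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToyIndex` and `shard_for` are hypothetical names, not Elasticsearch APIs, and `zlib.crc32` stands in for the Murmur3 hash that Elasticsearch applies to the routing value (which defaults to the document id).

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed; matches the five-shard example in this issue


def shard_for(routing: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Roughly hash(routing) % number_of_primary_shards.

    crc32 is a stand-in hash chosen to keep the sketch dependency-free.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


class ToyIndex:
    """Hypothetical model of an index: one dict per shard."""

    def __init__(self, num_shards: int = NUM_PRIMARY_SHARDS) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, doc_id: str, routing=None) -> int:
        # The routing value defaults to the document id.
        key = routing if routing is not None else doc_id
        return shard_for(key, len(self.shards))

    def put(self, doc_id: str, source: dict, routing=None) -> int:
        shard = self._shard(doc_id, routing)
        self.shards[shard][doc_id] = source
        return shard

    def move(self, doc_id: str, old_routing, new_routing) -> int:
        """Delete the document, then reindex it under a new routing value."""
        old_shard = self._shard(doc_id, old_routing)
        source = self.shards[old_shard].pop(doc_id)
        return self.put(doc_id, source, routing=new_routing)
```

In this model the client has to remember the routing value it used, since the same value is needed to locate the document again. That bookkeeping is the complexity the comment above objects to.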
I agree with the sentiment. Shards are an internal detail, and from a user perspective the documents are stored in an index. How a document is assigned to a shard is internal to the system (although we expose some tooling to allow people to optimize some use cases). Elasticsearch's choice of shard is designed as a function of the document id. Once placed, documents are never reassigned to another shard, as that is a very costly operation (unlike in a simple key-value store). Instead of discussing abstract system design, I would love to hear the specific data setup and how you feel it goes wrong when it comes to single-document assignment. The issue you opened and linked to concerns how shards are allocated, which is a valid concern. I'm going to close this issue for now. We can open a discussion on the forums if you want to discuss things further. |
The current logic of allocating a document to a shard on a node by its id should be enhanced to use a separate id for shard allocation, and an API should be provided to allow a document to be allocated directly to a shard.
The reason is that Elasticsearch provides both search and the ability to get a document by id.
Let's assume that we have two documents, which are colors, namely red and blue. There are 5 shards in the index and 5 nodes, and each node holds 1 shard. Based on the ids of these documents, they went to shard no. 2, which is on node no. 2.
We now run both searches and gets by id against Elasticsearch. The gets by id fetch the red and blue documents at large scale, alongside the searches. Node 2 now receives a high volume of requests, because the documents are present only on that node. The same would happen if the searches always fetch the same set of documents.
As the number of documents being fetched by id on this node grows, the node peaks in either memory or CPU and becomes stressed. This in turn slows the searches too, because the node no longer has the resources to serve search requests, and the cluster gradually degrades even though only one node is peaking. If we had the ability to move certain documents to other nodes in the cluster that hold the other shards, we could better utilize the resources of the entire cluster.
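The skew described in this scenario can be illustrated with a small simulation. It is a sketch under assumptions, not a measurement: `shard_for` and `gets_per_shard` are hypothetical helpers, `zlib.crc32` is a stand-in for the hash Elasticsearch actually uses, and replicas are ignored, as in the scenario above.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stand-in routing: a hash of the id modulo the number of shards."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def gets_per_shard(requested_ids, num_shards: int = 5):
    """Count how many get-by-id requests each shard would have to serve."""
    counts = [0] * num_shards
    for doc_id in requested_ids:
        counts[shard_for(doc_id, num_shards)] += 1
    return counts


# Hypothetical workload: the same two documents are fetched over and over.
workload = ["red"] * 1000 + ["blue"] * 1000
for shard, count in enumerate(gets_per_shard(workload)):
    print(f"shard {shard}: {count} gets")
```

Because the shard depends only on the id, every repeated get for the same document lands on the same shard, so two hot documents can load at most two of the five shards however many nodes the cluster has.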
A single node causing cluster imbalance is described in issue #27622.