Ability to decouple a document shard allocation with its ID #28206

Closed
antonpious opened this issue Jan 13, 2018 · 6 comments
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. feedback_needed

Comments

@antonpious

antonpious commented Jan 13, 2018

The current logic of allocating a document to a shard by its id should be enhanced: there should be a separate id for shard allocation, and an API that allows a document to be allocated directly to a shard.

The reason is that Elasticsearch provides both search and the ability to get by id.
Let's assume we have two documents that are colors, namely red and blue. There are 5 shards in the index and 5 nodes, and each node holds 1 shard. Based on the ids of these documents, both went to shard 2, which is on node 2.

We now run both searches and get-by-id requests against Elasticsearch. The get-by-id requests fetch the red and blue documents at large scale, alongside the searches. Node 2 now receives a high number of requests because the documents are present only on that node. The same would happen if the searches always tried to fetch the same set of documents.

As the number of documents fetched by id on this node grows, the node peaks in either memory or CPU and becomes stressed. That in turn slows the searches, because the node no longer has the resources to serve search requests either. The result is a gradual degradation of the whole cluster even if only one node peaks. If we had the ability to move certain documents to other nodes in the cluster that hold the other shards, we could better utilize the resources of the entire cluster.
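The skew described above can be sketched with a toy model. This is only an illustration under stated assumptions: `shard_for` and `requests_per_shard` are hypothetical helpers, `zlib.crc32` is a stand-in hash and not the one Elasticsearch uses, each of the 5 nodes holds exactly one shard with no replicas, and the request counts are made up.

```python
import zlib
from collections import Counter

NUM_SHARDS = 5  # one shard per node, as in the scenario above


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Toy id-to-shard mapping: hash the id, then take it modulo the shard count.

    zlib.crc32 is an illustrative stand-in, not Elasticsearch's hash function.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def requests_per_shard(requested_ids):
    """Count how many get-by-id requests each shard would have to serve."""
    return Counter(shard_for(doc_id) for doc_id in requested_ids)


# Hypothetical workload: two hot documents plus a long tail of other ids.
workload = ["red"] * 900 + ["blue"] * 900 + [f"doc-{i}" for i in range(200)]
load = requests_per_shard(workload)

for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {load[shard]} requests")
```

Because every request for a given id maps to the same shard, whichever shard holds a hot document absorbs all of its traffic, no matter how evenly the remaining ids are spread.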

One node causing the cluster imbalance is described in the issue #27622

@DaveCTurner
Contributor

It sounds to me that you're trying to use the document id for something quite different from what it's designed for: by design, the id defines the shard to which each document belongs. You can add other identifiers to your documents and use these to find them, and this would allow Elasticsearch to spread the load more evenly across the nodes.

@antonpious
Author

antonpious commented Jan 19, 2018

@DaveCTurner it could be anything. Even if we use another field within the document, such as color, the fact that the document with color=red would always be in the same shard, and that the calls to obtain it would always go to the same node, is the cause of the issue. If we have X documents identified by this field, the calls would always go to this node. So there should be another lever that lets us move documents to other nodes for better balancing. Unless, of course, the product itself could incorporate this and automatically move documents to a different shard for better allocation.

On a separate note:

the id defines the shard to which each document belongs

I am not sure if this is how other software would see it, and Id is the id of the document, elastic chose to implement the shard logic which is unfortunate. As far as the data store is concerned, it was asked to insert a document and it asked for an id, so an id was provided; if no id is provided, an automatic identifier is assigned. This is similar to the logic for a data table.

@DaveCTurner
Contributor

I am not sure if this is how other software would see it, and Id is the id of the document, elastic chose to implement the shard logic which is unfortunate.

This is reasonably standard, and a quick survey of other sharded datastores will turn up similar design decisions. If you have an external identifier that you don't want to have an impact on the allocation of documents then you should simply store it in a different field.

@ywelsch
Contributor

ywelsch commented Jan 19, 2018

Maybe also have a look at the _routing field: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
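As a rough sketch of what the linked `_routing` page describes: the shard is chosen from a hash of the routing value modulo the number of primary shards, and the routing value defaults to the document id. The helper below is hypothetical, and `zlib.crc32` is a stand-in for whatever hash Elasticsearch actually uses, so the shard numbers it produces will not match a real cluster.

```python
import zlib
from typing import Optional


def shard_for(doc_id: str, routing: Optional[str] = None, num_shards: int = 5) -> int:
    """Pick a shard from the routing value, falling back to the document id.

    zlib.crc32 is an illustrative stand-in hash, not Elasticsearch's.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Without a routing value, the id alone decides the shard.
print("red  ->", shard_for("red"))

# With a custom routing value, documents with different ids are co-located.
a = shard_for("order-1", routing="customer-42")
b = shard_for("order-2", routing="customer-42")
print("order-1 and order-2 ->", a, b)
```

The routing value gives the client a say in placement at index time, but the mapping from routing value to shard is still fixed by the hash.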

@antonpious
Author

@ywelsch, this feature can only say at put time where the document should go. What is needed is the ability to move the document to a different shard after it has been put; as it stands, we need to delete the document and then reindex it with a different routing value.
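The delete-and-reindex workaround mentioned here can be modeled in memory. This is a toy sketch, not client code: `ToyIndex` and its methods are hypothetical, `zlib.crc32` is a stand-in hash, and the routing value `"pool-b"` is made up for the example.

```python
import zlib
from typing import Optional

NUM_SHARDS = 5


def shard_for(key: str) -> int:
    """Toy routing function; crc32 is a stand-in, not Elasticsearch's hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


class ToyIndex:
    """In-memory model of an index: one dict of documents per shard."""

    def __init__(self) -> None:
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def index(self, doc_id: str, source: dict, routing: Optional[str] = None) -> int:
        """Store the document on the shard chosen by routing (or by id)."""
        shard = shard_for(routing if routing is not None else doc_id)
        self.shards[shard][doc_id] = source
        return shard

    def delete(self, doc_id: str, routing: Optional[str] = None) -> Optional[dict]:
        """Remove the document; the same routing value used to index it is needed."""
        shard = shard_for(routing if routing is not None else doc_id)
        return self.shards[shard].pop(doc_id, None)

    def move(self, doc_id: str, old_routing: Optional[str], new_routing: str) -> int:
        """'Move' a document: delete it, then index it again with a new routing value."""
        source = self.delete(doc_id, routing=old_routing)
        if source is None:
            raise KeyError(doc_id)
        return self.index(doc_id, source, routing=new_routing)


idx = ToyIndex()
old_shard = idx.index("red", {"color": "red"})
new_shard = idx.move("red", old_routing=None, new_routing="pool-b")
print("red moved from shard", old_shard, "to shard", new_shard)
```

Note that the new routing value may hash to the same shard as before, and that every later get or delete has to supply the same routing value, which is part of the complexity being objected to in this comment.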

What is needed is a fundamental decoupling between documents and the shard allocation.
Shard placement is an internal feature of certain stores and should not be exposed to the clients, the clients are working with the data stores and to them they can identify a document by what they want to identify. The data store should take on the job of deciding where a document needs to be placed for better resource management in its cluster.
As @DaveCTurner said, asking the client to pass it could be a first step, but this needs to move to the next generation and be incorporated into the product.

So the system should be intelligent enough to keep track of how many documents are going to each shard, and either adjust the shard logic if most documents are going to the same shard, or move documents seamlessly between shards if the get load on a particular node is heavy. This metadata state management should be part of the system; the complexity should not be passed on to the clients.

@lcawl lcawl added :Search/Search Search-related issues that do not fall into other categories and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Search/Search Search-related issues that do not fall into other categories labels Feb 13, 2018
@bleskes
Contributor

bleskes commented Mar 20, 2018

Shard placement is an internal feature of certain stores and should not be exposed to the clients, the clients are working with the data stores and to them they can identify a document by what they want to identify

I agree with the sentiment. Shards are an internal detail, and from a user perspective the documents are stored in an index. How a document is assigned to a shard is internal to the system (although we expose some tooling to allow people to optimize some use cases). Elasticsearch's choice of shard is designed as a function of the document id. Once placed, documents are never re-assigned to another shard, as that is a very costly operation (unlike in a simple key-value store).

Instead of discussing abstract system design, I would love to hear about the specific data setup and how you feel it goes wrong when it comes to single-document assignment. The issue you opened and linked to concerns how shards are allocated, which is a valid concern. I'm going to close this issue for now. We can open a discussion on the forums if you want to discuss things further.

@bleskes bleskes closed this as completed Mar 20, 2018

7 participants