Data Insertion Not Properly Distributed in Kubernetes Cluster with Custom ShardKey #19566

Open
gianluca-valentini opened this issue Aug 10, 2023 · 3 comments

@gianluca-valentini

My Environment

I am currently running an ArangoDB cluster with the following specifications:

ArangoDB Version: 3.11.2
Deployment Mode: Cluster
Deployment Strategy: Kubernetes
Configuration: rook/ceph PVC provider
Infrastructure: own/kubeadm
Operating System: Ubuntu 20.04
Total RAM in your machine: 32 GB
Disks in use: SSD
Used Package: Docker - official Docker library

I have created a collection with a custom shardKey and expected data to be distributed among the cluster nodes. However, I noticed that all data is being inserted into the same node instead of being evenly distributed.

Reproduction Steps:

Create ArangoDB cluster with the specified settings.
Create a collection with a custom shardKey.
Insert data into the collection.

Expected Behavior:
I expect data to be evenly distributed among the cluster nodes based on the custom shardKey.

Observed Behavior:
All data is being inserted into the same node, despite the use of the custom shardKey.

Note that currently all the elements have the same shardKey value.
When I insert a new element with a different shardKey value, it is inserted on a different leader, which does not seem correct to me.

Additional Information:
I have checked the load balancing configuration and sharding settings, but I have not been able to resolve the issue. I would like to request assistance in understanding why data is not being properly distributed in my cluster environment, or whether there is something I need to set in order to get the correct behavior.

Thank you for your attention and assistance.

Best regards,
Gianluca Valentini

@dothebart
Contributor

Hm, the distribution is determined by the shard key you specified.
If all documents have the same value in that attribute, they will all end up in the same shard, just as you specified?

Or am I misreading something?
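
The behavior described in this reply can be illustrated with a minimal sketch. It assumes only that a shard is chosen by hashing the shard key value and taking the result modulo the number of shards. The `shard_for` and `distribute` helpers, the tenant names, and the use of CRC32 are all hypothetical stand-ins and not ArangoDB's actual hash function or API.

```python
import zlib
from collections import Counter


def shard_for(shard_key_value: str, number_of_shards: int) -> int:
    """Map a shard key value to a shard index (CRC32 is only a stand-in hash)."""
    return zlib.crc32(shard_key_value.encode("utf-8")) % number_of_shards


def distribute(values, number_of_shards):
    """Count how many documents land on each shard."""
    counts = Counter(shard_for(v, number_of_shards) for v in values)
    return [counts.get(i, 0) for i in range(number_of_shards)]


NUMBER_OF_SHARDS = 3  # assumed cluster size, for illustration only

# Every document carries the same shard key value: one shard receives everything.
same_counts = distribute(["tenant-a"] * 1000, NUMBER_OF_SHARDS)

# Every document carries a different shard key value: the documents spread out.
distinct_counts = distribute([f"tenant-{i}" for i in range(1000)], NUMBER_OF_SHARDS)

print("same value:     ", same_counts)
print("distinct values:", distinct_counts)
```

Under this assumption, identical shard key values always hash to the same shard, so the placement reported in the issue is what the sketch would predict.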

@gianluca-valentini
Author

Hi @dothebart
Thank you for your response.
I understand that the distribution is based on the shard key specified. However, I have observed that even with a custom shard key, the data is not evenly distributed among the nodes. For example, if one shard key value groups 1,000,000 documents and another only 100, the data is not distributed equally among the nodes. Is this behavior expected? I would appreciate your clarification on this matter.

Thanks
Gianluca
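
The 1,000,000 versus 100 example above can be worked through with a small sketch. It assumes that placement depends only on the shard key value, so hashing can balance key values but not the number of documents behind each value. The key names, the shard count, and the CRC32 hash are hypothetical and chosen only for illustration.

```python
import zlib


def shard_for(shard_key_value: str, number_of_shards: int) -> int:
    """Stand-in hash placement; not ArangoDB's real hash function."""
    return zlib.crc32(shard_key_value.encode("utf-8")) % number_of_shards


def load_per_shard(docs_per_key: dict, number_of_shards: int) -> list:
    """Sum the document counts of all shard key values placed on each shard."""
    load = [0] * number_of_shards
    for value, count in docs_per_key.items():
        load[shard_for(value, number_of_shards)] += count
    return load


# Two shard key values with very different numbers of documents behind them.
docs_per_key = {"big-customer": 1_000_000, "small-customer": 100}

load = load_per_shard(docs_per_key, 3)
print(load)  # one shard holds at least the 1,000,000 documents of "big-customer"
```

In this model, an even spread requires a shard key with many distinct values and roughly similar document counts per value.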

@gianluca-valentini
Author

Hi @dothebart
If I use _key as the shard key, which is the default, Arango requires it to be present in a unique index. That's fine.
However, if the document also requires another unique field, the database gives me an error: Error: 1470 - shard key '_key' must be present in unique index.
So, if I understand correctly, the shard key must be the only unique field in the document. Is this correct?
In my scenario, if I want to shard accounts using _key, I also need to ensure that the userid is unique, but that does not seem to be possible.
What am I missing?

Gianluca
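
One plausible reading of the error quoted above is that each shard can check uniqueness only against its own documents, so a unique index has to include the shard key for the check to be reliable across the cluster. The sketch below simulates that idea under this assumption. The `Shard` and `Cluster` classes, the key names, and the CRC32 hash are hypothetical, not ArangoDB internals.

```python
import zlib


class Shard:
    """One shard with a purely local unique index on `userid`."""

    def __init__(self):
        self.userids = set()

    def insert(self, doc) -> bool:
        # The shard can only see userids it stores itself.
        if doc["userid"] in self.userids:
            return False  # local unique constraint violated
        self.userids.add(doc["userid"])
        return True


class Cluster:
    """Routes each document to a shard by hashing one attribute."""

    def __init__(self, shard_key: str, number_of_shards: int = 3):
        self.shard_key = shard_key
        self.shards = [Shard() for _ in range(number_of_shards)]

    def shard_index(self, doc) -> int:
        value = str(doc[self.shard_key]).encode("utf-8")
        return zlib.crc32(value) % len(self.shards)

    def insert(self, doc) -> bool:
        return self.shards[self.shard_index(doc)].insert(doc)


def two_keys_on_different_shards(cluster: Cluster):
    """Find two _key values that the cluster routes to different shards."""
    first = "acc-0"
    first_shard = cluster.shard_index({"_key": first})
    for i in range(1, 1000):
        candidate = f"acc-{i}"
        if cluster.shard_index({"_key": candidate}) != first_shard:
            return first, candidate
    raise RuntimeError("no keys on different shards found")


# Sharded by _key: the same userid can slip into two different shards.
by_key = Cluster("_key")
k1, k2 = two_keys_on_different_shards(by_key)
first_by_key = by_key.insert({"_key": k1, "userid": "gianluca"})
second_by_key = by_key.insert({"_key": k2, "userid": "gianluca"})

# Sharded by userid: equal userids always meet on the same shard.
by_userid = Cluster("userid")
first_by_userid = by_userid.insert({"_key": k1, "userid": "gianluca"})
second_by_userid = by_userid.insert({"_key": k2, "userid": "gianluca"})

print(first_by_key, second_by_key)        # duplicate goes undetected
print(first_by_userid, second_by_userid)  # duplicate is rejected
```

In this model, uniqueness of `userid` is only enforceable when `userid` is part of the shard key, or when the collection has a single shard.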
