Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vmselect: introduce search.skipSlowReplicas cmd-line flag #4538

Merged
merged 2 commits into from Jul 7, 2023

Conversation

hagen1778
Copy link
Collaborator

@hagen1778 hagen1778 commented Jun 29, 2023

vmselect has two logical conditions during request processing when
-replicationFactor cmd-line flag is set:

  1. If at least len(storageNodes) - replicationFactor responded, it could skip
    waiting for the rest of nodes to respond. This could lead to problems described
    here is not recommand enable vmselect replicationfactor in cluster with data #1207.
  2. Mark response as partial if less than len(storageNodes) - replicationFactor responded
    without an error.

The P1 showed itself error-prone and became the main reason why
-replicationFactor wasn't recommended to use at vmselect level.
However, this optimization could be still very useful in situations
when there are slow and fast replicas in cluster.

But P2 remains viable and important conditionless.
Hiding P1 behind the feature-flag search.skipSlowReplicas
should make -replicationFactor flag usable again. And let users
choose whether they want P1 to be respected.

Related issues
#1207
#711

Signed-off-by: hagen1778 roman@victoriametrics.com

Copy link
Contributor

@f41gh7 f41gh7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

vmselect has two logical conditions during request processing when
`-replicationFactor` cmd-line flag is set:
1. If at least `len(storageNodes) - replicationFactor` responded, it could skip
waiting for the rest of nodes to respond. This could lead to problems described
here #1207.
2. Mark response as partial if less than `len(storageNodes) - replicationFactor` responded
without an error.

The P1 showed itself error-prone and became the main reason why
`-replicationFactor` wasn't recommended to use at vmselect level.
However, this optimization could be still very useful in situations
when there are slow and fast replicas in cluster.

But P2 remains viable and important conditionless.
Hiding P1 behind the feature-flag `search.skipSlowReplicas`
should make `-replicationFactor` flag usable again. And let users
choose whether they want P1 to be respected.

Related issues
#1207
#711

Signed-off-by: hagen1778 <roman@victoriametrics.com>
@hagen1778 hagen1778 changed the title vmselect: wait for all vmstorage nodes to respond vmselect: introduce search.skipSlowReplicas cmd-line flag Jul 7, 2023
@hagen1778
Copy link
Collaborator Author

@valyala @f41gh7 see this PR updated in cbbb555

Copy link
Contributor

@f41gh7 f41gh7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: hagen1778 <roman@victoriametrics.com>
@hagen1778
Copy link
Collaborator Author

hagen1778 commented Jul 7, 2023

@wf1nder please note, if this PR is accepted it may change the default behavior of vmselects introduced in #711

@f41gh7 f41gh7 merged commit 173ccf4 into cluster Jul 7, 2023
3 of 6 checks passed
@f41gh7 f41gh7 deleted the vmselect-partial-response branch July 7, 2023 09:50
valyala pushed a commit that referenced this pull request Jul 9, 2023
* vmselect: introduce `search.skipSlowReplicas` cmd-line flag

vmselect has two logical conditions during request processing when
`-replicationFactor` cmd-line flag is set:
1. If at least `len(storageNodes) - replicationFactor` responded, it could skip
waiting for the rest of nodes to respond. This could lead to problems described
here #1207.
2. Mark response as partial if less than `len(storageNodes) - replicationFactor` responded
without an error.

The P1 showed itself error-prone and became the main reason why
`-replicationFactor` wasn't recommended to use at vmselect level.
However, this optimization could be still very useful in situations
when there are slow and fast replicas in cluster.

But P2 remains viable and important conditionless.
Hiding P1 behind the feature-flag `search.skipSlowReplicas`
should make `-replicationFactor` flag usable again. And let users
choose whether they want P1 to be respected.

Related issues
#1207
#711

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* docs: update changelog

Signed-off-by: hagen1778 <roman@victoriametrics.com>

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
@valyala
Copy link
Collaborator

valyala commented Aug 28, 2023

FYI, this pull request has been included in VictoriaMetrics v1.91.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants