Conversation

@maedhroz
Contributor

No description provided.

SimpleCondition continuePreviewRepair = new SimpleCondition();
// this pauses the validation request sent from node1 to node2 until the inc repair below has run
cluster.filters()
.outbound()
Contributor Author

@maedhroz maedhroz Oct 13, 2020


Note: If we don't use outbound() explicitly here, we'll deadlock, with the single-threaded anti-entropy stage blocked waiting for the incremental repair propose message, which itself has to be processed on the same thread.
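The hazard described here is the generic single-threaded-stage deadlock: a task occupying a stage's only thread blocks on work that must run on that same stage. A minimal standalone sketch with plain `java.util.concurrent` (not the in-jvm dtest API; class and message names are illustrative) shows why the inner work can never complete:

```java
import java.util.concurrent.*;

public class SingleThreadStageDeadlock {
    // Returns true if the nested submit deadlocks (detected via a timeout).
    static boolean deadlocks() {
        // A single-threaded "stage", standing in for the anti-entropy stage.
        ExecutorService stage = Executors.newSingleThreadExecutor();
        try {
            Future<String> outer = stage.submit(() -> {
                // While holding the stage's only thread, wait for a message
                // that must itself be processed on this same stage.
                Future<String> inner = stage.submit(() -> "inc repair propose");
                return inner.get(); // blocks forever: no thread left to run inner
            });
            outer.get(1, TimeUnit.SECONDS);
            return false;
        } catch (TimeoutException e) {
            return true; // the inner task could never be scheduled
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            stage.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(deadlocks()
                ? "deadlocked: inner task can never run on the occupied stage"
                : "no deadlock");
    }
}
```

The timeout here stands in for the hang; the point is only that any wait taken while holding a single-threaded stage must not depend on work queued behind it.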

adelapena pushed a commit to adelapena/cassandra that referenced this pull request Oct 13, 2023
* Revert "Skip similarity score caching when using brute force (apache#769)"

This reverts commit ca7eea6.

* Revert " Compute similarity to query only once per indexed vector (per replica) (apache#723)"

This reverts commit 692491f.

* Add test that exposed score caching impl flaw

* Do not revert fix to VectorTester

Similarity score caching does not currently work when there are updates to vectors. The added test shows the issue. Conceptually, the problem materializes when we observe an earlier instance of a row with a close score and do not observe the row's later instance with a low score. The result is that some rows are ranked higher than they should be.

The problem stems from updates to vectors. Suppose we are searching for vector `v`, a row in sstable `a` has a vector close to `v`, and an update to that row in sstable `b` has a vector far from `v`. The graph search will only find the version of the row in `a`, so the score cache observes only `a`'s close score, and in `VectorTopKProcessor` we assume that cached score applies to the row we read from storage, even though it is out of date.

As far as I understand, we don't have a way to know which vector we scored against, which means we can't verify that the vector in `VectorTopKProcessor` (the one read from storage) is the same as the one from the index.
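The mis-ranking above can be reproduced in a toy Java sketch (plain maps and cosine similarity; the row names, vectors, and sstable labels are hypothetical, not SAI internals). The "cache" holds the score of the stale indexed vector, while the row read from storage carries a newer, farther vector:

```java
import java.util.*;

public class StaleScoreSketch {
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / Math.sqrt(nx * ny);
    }

    public static void main(String[] args) {
        double[] query = {1, 0};

        // Graph index built from sstable "a": row1's old vector is close to the query.
        Map<String, double[]> indexA = Map.of(
                "row1", new double[]{1, 0.1},
                "row2", new double[]{0.7, 0.7});
        // Latest data after the sstable "b" update: row1's current vector is far away.
        Map<String, double[]> storage = Map.of(
                "row1", new double[]{0, 1},
                "row2", new double[]{0.7, 0.7});

        for (String row : indexA.keySet()) {
            double cached = cosine(query, indexA.get(row));   // score the index observed
            double actual = cosine(query, storage.get(row));  // score of the row read from storage
            System.out.printf("%s cached=%.3f actual=%.3f%n", row, cached, actual);
        }
        // row1's cached score (~0.995) ranks it above row2 (~0.707), but its
        // actual score is 0: trusting the cache ranks the stale row too high.
    }
}
```

Without recording which vector produced each cached score, the consumer cannot detect that `row1`'s cached score and its stored vector disagree, which is exactly the verification gap described above.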
ekaterinadimitrova2 pushed a commit to ekaterinadimitrova2/cassandra that referenced this pull request Jun 3, 2024
michaelsembwever pushed a commit to thelastpickle/cassandra that referenced this pull request Jan 7, 2026