Brought back index regeneration #2648

Merged
merged 6 commits into from Oct 13, 2023

Conversation

Contributor

@istranic istranic commented Oct 10, 2023

🚀 🚀 Pull Request

Impact

  • Bug fix (non-breaking change which fixes expected existing functionality)
  • Enhancement/New feature (adds functionality without impacting existing logic)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

-- Added index regeneration back. It was previously removed by mistake.
-- Minor refactors to how parameters are passed.
-- Increased M and efConstruction to improve index accuracy.

Things to be aware of

Things to worry about

Additional Context

@istranic istranic marked this pull request as draft October 10, 2023 19:13
@codecov

codecov bot commented Oct 10, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (edd6efc) 84.30% compared to head (29efde8) 81.79%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2648      +/-   ##
==========================================
- Coverage   84.30%   81.79%   -2.52%     
==========================================
  Files         229      229              
  Lines       25503    25499       -4     
==========================================
- Hits        21501    20857     -644     
- Misses       4002     4642     +640     
Flag Coverage Δ
unittests 81.79% <100.00%> (-2.52%) ⬇️


Files Coverage Δ
deeplake/core/tensor.py 85.14% <100.00%> (+0.10%) ⬆️
deeplake/core/vectorstore/deeplake_vectorstore.py 96.94% <100.00%> (+0.79%) ⬆️
...lake/core/vectorstore/vector_search/indra/index.py 68.91% <100.00%> (-14.42%) ⬇️
deeplake/core/vectorstore/vector_search/utils.py 96.16% <100.00%> (+0.01%) ⬆️

... and 30 files with indirect coverage changes

@istranic istranic requested a review from sounakr October 11, 2023 00:10
@istranic istranic marked this pull request as ready for review October 11, 2023 00:17
Contributor

@sounakr sounakr left a comment

LGTM

Comment on lines +1491 to +1497
self.delete_vdb_index(vdb_index["id"])
# Recreate it back.
self.create_vdb_index(
    vdb_index["id"],
    vdb_index["distance"],
    additional_params=vdb_index.get("additional_params", None),
)
Contributor

What happens when either of these commands fails? Do we end up with corrupt indexes, or do we just delete the index? If possible, I think we should roll back to the previous index state.

Contributor Author

@adolkhan I don't think we should roll back, because the previous index won't have all the samples, so even though it could technically be used, it's not a correct index. Btw, this stuff is being removed with incremental updates, so this code will likely exist only for a short time.

Contributor

In the future, shouldn't we undo the whole operation? Say we tried to delete some samples and the index part errored out: shouldn't we add back all the deleted elements and use the old index, while throwing an exception saying that the operation was unsuccessful?

Contributor

Otherwise the dataset will be in a weird state: some samples got removed while the index was wiped out completely, or, if the error happened during deletion, we end up with either a corrupted index or the old index. Isn't that the case? @sounakr correct me if I am wrong. But since you said that this code is short-lived, I think it is fine to merge these changes.

Contributor Author

That's a good point, though I don't like the idea of writing rollbacks near the user-facing API. We should write all of this at a low level: the API for adding samples and updating the index, along with all the rollback logic, should be implemented behind a single user-facing API call.

That's how ds.extend({tensor_1.... tensor_2....}) works. In that API, if some samples fail to append, no samples are appended at all, presumably using something similar to rollbacks. Most importantly, the person who calls the ds.extend API doesn't implement or worry about that.
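
A minimal sketch of that structure, for illustration only. `ToyStore`, `IndexBuildError`, and the `fail_rebuild` switch are hypothetical names and are not part of Deep Lake; the index here is a plain position map rather than an HNSW graph. The point is that the single user-facing `delete` call snapshots both the samples and the index and restores both if the rebuild fails, so the caller never writes rollback code.

```python
class IndexBuildError(Exception):
    """Hypothetical error raised when the index rebuild fails."""


class ToyStore:
    """Hypothetical stand-in for a dataset plus its vector index."""

    def __init__(self, samples):
        self.samples = list(samples)
        self.fail_rebuild = False  # switch used only to simulate a failure
        self.index = self._build_index()

    def _build_index(self):
        if self.fail_rebuild:
            raise IndexBuildError("index rebuild failed")
        # A real index would be an HNSW graph; this is just a position map.
        return {sample: pos for pos, sample in enumerate(self.samples)}

    def delete(self, to_remove):
        """Single user-facing call: either everything changes or nothing does."""
        old_samples = list(self.samples)
        old_index = dict(self.index)
        try:
            self.samples = [s for s in self.samples if s not in to_remove]
            self.index = self._build_index()
        except Exception:
            # Restore both parts, then surface the failure to the caller.
            self.samples, self.index = old_samples, old_index
            raise


store = ToyStore(["a", "b", "c"])
store.delete({"b"})          # succeeds: samples and index both updated
store.fail_rebuild = True
try:
    store.delete({"a"})      # fails: samples and index both restored
except IndexBuildError:
    pass
```

Whether restoring the old index is the right behaviour is exactly what this thread debates; the sketch only shows where such logic would live.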

Contributor

@istranic / @adolkhan, when there is an error or corruption, we can do the following.

  1. Retry index creation from the very beginning, i.e. not incrementally but from scratch.
  2. If the retry also fails, mark the index as non_usable. This non_usable flag will temporarily prohibit us from using the index during search operations.
  3. The index then has to be manually dropped and recreated later to clear the flag and make it usable again.

Let me know your thoughts. I will implement this logic in incremental.
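
The steps above could be sketched roughly as follows. Everything here is hypothetical (`ToyIndex`, `rebuild_with_retry`, `drop_and_recreate`, and the `usable` attribute standing in for the non_usable flag); it is not the implementation planned for incremental updates, only an illustration of the control flow.

```python
class ToyIndex:
    """Hypothetical index wrapper carrying a usable/non-usable flag."""

    def __init__(self):
        self.data = None
        self.usable = False


def rebuild_with_retry(index, build_fn, retries=1):
    """Build from scratch; retry on failure; flag the index if every attempt fails."""
    for _ in range(1 + retries):
        try:
            index.data = build_fn()  # step 1: always a full rebuild
            index.usable = True
            return True
        except Exception:
            continue
    # Step 2: all attempts failed, so search must not use this index.
    index.data = None
    index.usable = False
    return False


def drop_and_recreate(index, build_fn):
    """Step 3: manual drop followed by a fresh build, which clears the flag on success."""
    index.data = None
    index.usable = False
    return rebuild_with_retry(index, build_fn, retries=0)
```

A search path would check `index.usable` before consulting the index and use a different search method while the flag is set.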

Contributor

The dataset and the index should be kept separate. The dataset is a superset, and the index can be recreated at any time as long as the dataset is present. So in case of corruption, we have the generic Indra cosine similarity and L2 implementations, which can take over even if the HNSW index is absent.
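
To illustrate the fallback idea, here is a small pure-Python sketch of exhaustive search used when no index is available. It is not Indra's implementation; the function names, the `index` callable, and the `metric` strings are assumptions made for the example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, vectors, k, index=None, metric="cos"):
    """Use the index when one is available, otherwise scan every vector."""
    if index is not None:
        # Hypothetical: the index is a callable returning the top-k row ids.
        return index(query, k)
    ids = range(len(vectors))
    if metric == "cos":
        # Higher similarity is better, so sort on the negated score.
        ranked = sorted(ids, key=lambda i: -cosine_similarity(query, vectors[i]))
    else:
        # Lower distance is better.
        ranked = sorted(ids, key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = search([1.0, 0.0], vectors, k=2)  # no index: exhaustive scan
```

The scan is linear in the number of vectors, which is the cost being traded for not depending on the index.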

Contributor Author

@sounakr My concern is not about the index corruption itself. My concern is about our code structure, and the fact that it's architected in a way that's very bad for a developer or a user.

deeplake/constants.py (resolved conversation)
@sonarcloud

sonarcloud bot commented Oct 13, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

100.0% Coverage
0.0% Duplication

@istranic istranic merged commit 0ec0301 into main Oct 13, 2023
13 of 14 checks passed
@istranic istranic deleted the add_regen_back branch October 13, 2023 11:37

3 participants