Brought back index regeneration #2648

istranic · 2023-10-10T19:13:27Z

🚀 🚀 Pull Request

Impact

Bug fix (non-breaking change which fixes expected existing functionality)
Enhancement/New feature (adds functionality without impacting existing logic)
Breaking change (fix or feature that would cause existing functionality to change)

Description

-- Added index regeneration back. It was removed previously by mistake
-- Minor refactors to how parameters are passed
-- Increased M and efConstruction to improve index accuracy

Things to be aware of

Things to worry about

Additional Context

codecov · 2023-10-10T20:37:03Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (edd6efc) 84.30% compared to head (29efde8) 81.79%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2648      +/-   ##
==========================================
- Coverage   84.30%   81.79%   -2.52%     
==========================================
  Files         229      229              
  Lines       25503    25499       -4     
==========================================
- Hits        21501    20857     -644     
- Misses       4002     4642     +640

| Flag | Coverage Δ | |
| --- | --- | --- |
| unittests | `81.79% <100.00%> (-2.52%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ | |
| --- | --- | --- |
| deeplake/core/tensor.py | `85.14% <100.00%> (+0.10%)` | ⬆️ |
| deeplake/core/vectorstore/deeplake_vectorstore.py | `96.94% <100.00%> (+0.79%)` | ⬆️ |
| ...lake/core/vectorstore/vector_search/indra/index.py | `68.91% <100.00%> (-14.42%)` | ⬇️ |
| deeplake/core/vectorstore/vector_search/utils.py | `96.16% <100.00%> (+0.01%)` | ⬆️ |

... and 30 files with indirect coverage changes

sounakr

LGTM

adolkhan · 2023-10-12T10:51:24Z

deeplake/core/tensor.py

+                self.delete_vdb_index(vdb_index["id"])
+                # Recreate it back.
+                self.create_vdb_index(
+                    vdb_index["id"],
+                    vdb_index["distance"],
+                    additional_params=vdb_index.get("additional_params", None),
+                )


What happens when either of these commands fails? Do we end up with a corrupt index, or do we just delete the index? If possible, I think we should roll back to the previous index state.

@adolkhan I don't think we should roll back, because the previous index won't contain all the samples, so even though we technically could restore it, it would not be a correct index. Btw, this code is being removed with incremental updates, so it will likely exist only for a short time.

In the future, shouldn't we undo the whole operation? Say we tried to delete some samples and the index part errored out: shouldn't we add back all the deleted elements and use the old index, while throwing an exception saying that the operation was unsuccessful?

Otherwise the dataset will be left in a weird state: some samples got removed while the index was wiped out completely, or, if the error happened during deletion, we end up with either a corrupted index or the old index. Isn't that the case? @sounakr correct me if I am wrong. But since you said that this code is short-lived, I think it is fine to merge these changes.
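The failure modes raised in this thread can be made concrete with a toy model. The sketch below is not Deep Lake code: `ToyTensor`, its `fail_on` switch, `regenerate_index`, and `state_after_failed_regen` are hypothetical names used only for illustration. The only thing taken from the diff is the order of operations (delete the index, then create it again).

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor with one vector index (not the Deep Lake API)."""

    def __init__(self, samples):
        self.samples = list(samples)
        # The index remembers which samples it was built from.
        self.index = {"id": "hnsw_1", "built_from": list(samples)}
        self.fail_on = None  # set to "delete" or "create" to simulate an error

    def delete_index(self):
        if self.fail_on == "delete":
            raise RuntimeError("index deletion failed")
        self.index = None

    def create_index(self, index_id):
        if self.fail_on == "create":
            raise RuntimeError("index creation failed")
        self.index = {"id": index_id, "built_from": list(self.samples)}

    def regenerate_index(self):
        # Same order as in the diff: drop the index, then build it again.
        index_id = self.index["id"]
        self.delete_index()
        self.create_index(index_id)


def state_after_failed_regen(fail_on):
    """Remove one sample, regenerate with a simulated failure, report the index state."""
    tensor = ToyTensor(["a", "b", "c"])
    tensor.samples.remove("b")  # the data changes before the index does
    tensor.fail_on = fail_on
    try:
        tensor.regenerate_index()
    except RuntimeError:
        pass  # swallowed here only so the resulting state can be inspected
    if tensor.index is None:
        return "no index"
    stale = tensor.index["built_from"] != tensor.samples
    return "stale index" if stale else "fresh index"
```

In this model, a failure during deletion leaves the old index in place even though it no longer matches the samples, and a failure during creation leaves no index at all. A partially written (corrupted) index is not modeled.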

That's a good point, though I don't like the idea of writing rollbacks near the user-facing API. We should write all of this at the low level, i.e. the API for adding samples and updating the index, and all the rollback logic, should be implemented at a low level behind a single user-facing API call.

That's how ds.extend({tensor_1.... tensor_2....}) works. In that API, if some samples fail to append, no samples are appended at all, presumably using something similar to rollbacks. Most importantly, the person who calls the ds.extend API doesn't implement or worry about any of that.
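The all-or-nothing behavior described above could look roughly like the following. This is a minimal sketch, assuming a hypothetical in-memory `ToyStore`; it does not show how `ds.extend` is actually implemented, and the snapshot-and-restore strategy is an assumption chosen for brevity.

```python
import copy


class ToyStore:
    """Hypothetical in-memory store; not the Deep Lake API."""

    def __init__(self):
        self.samples = []
        self.index = set()  # stands in for a vector index over the samples

    def _rebuild_index(self):
        # Simulated failure: refuse to index a None sample.
        if any(s is None for s in self.samples):
            raise ValueError("cannot index a None sample")
        self.index = set(self.samples)

    def extend(self, new_samples):
        """Single user-facing call: samples and index both update, or neither does."""
        snapshot = (copy.copy(self.samples), copy.copy(self.index))
        try:
            self.samples.extend(new_samples)
            self._rebuild_index()
        except Exception:
            # Roll back inside the low-level layer, then surface the error.
            self.samples, self.index = snapshot
            raise
```

The caller only sees `extend` either succeed or raise; the rollback lives entirely inside the store, which is the separation argued for in the comment above.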

@istranic / @adolkhan, when there is an error or corruption, we can do two things:

1. Retry index creation from the very beginning, i.e. not incrementally but from scratch.

2. If the retry also fails, mark the index as non_usable. This non_usable flag will temporarily prevent us from using the index during search operations.

The index has to be manually dropped and recreated later to clear the flag and make it usable again.

Let me know your thoughts. I will implement this logic in the incremental updates.

The dataset and the index should be kept separate. The dataset is a superset, and the index can be recreated at any time as long as the dataset is present. So in case of corruption, we have Indra's generic cosine similarity and L2 implementations, which can take over even if the HNSW index is absent.
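The proposal above (retry from scratch, flag the index if that fails, fall back to an exhaustive search) can be sketched as follows. Everything here is hypothetical: `ToyVectorIndex`, `build_fn`, `max_attempts`, and the pure-Python `cosine_similarity` are illustrative stand-ins, not the Indra or Deep Lake implementation. The sketch also assumes non-zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


class ToyVectorIndex:
    """Hypothetical index wrapper; the build callable and flag name are assumptions."""

    def __init__(self, build_fn, max_attempts=2):
        self.build_fn = build_fn  # builds an index from scratch; may raise
        self.max_attempts = max_attempts
        self.index = None
        self.non_usable = False

    def rebuild(self, vectors):
        """Try a full rebuild up to max_attempts times; flag the index on failure."""
        for _ in range(self.max_attempts):
            try:
                self.index = self.build_fn(vectors)
                self.non_usable = False
                return True
            except Exception:
                self.index = None
        # Every attempt failed: mark the index so that search skips it.
        self.non_usable = True
        return False

    def search(self, vectors, query, k=1):
        """Use the index when it is usable, otherwise scan all vectors."""
        if self.index is not None and not self.non_usable:
            return self.index(query, k)
        # Fallback: exhaustive cosine-similarity ranking over the raw vectors.
        ranked = sorted(
            range(len(vectors)),
            key=lambda i: cosine_similarity(vectors[i], query),
            reverse=True,
        )
        return ranked[:k]
```

In this sketch a later successful `rebuild` clears the flag, which plays the role of the manual drop-and-recreate step; search keeps working in the meantime because the raw vectors are still available.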

@sounakr My concern is not about the index corruption itself. My concern is about our code structure, and the fact that it's architected in a way that's very bad for a developer or a user.

deeplake/constants.py

adolkhan · 2023-10-12T18:38:03Z

In the future, shouldn't we undo the whole operation? Say we tried to delete some samples and the index part errored out: shouldn't we add back all the deleted elements and use the old index, while throwing an exception saying that the operation was unsuccessful?

adolkhan · 2023-10-12T18:40:48Z

Otherwise the dataset will be left in a weird state: some samples got removed while the index was wiped out completely, or, if the error happened during deletion, we end up with either a corrupted index or the old index. Isn't that the case? @sounakr correct me if I am wrong. But since you said that this code is short-lived, I think it is fine to merge these changes.

sonarcloud · 2023-10-13T11:27:50Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

100.0% Coverage
0.0% Duplication

Brought back index regeneration

8453c73

istranic marked this pull request as draft October 10, 2023 19:13

Fixed linting and tweaked default parameters

886f49d

istranic requested a review from sounakr October 11, 2023 00:10

istranic marked this pull request as ready for review October 11, 2023 00:17

istranic added 2 commits October 11, 2023 10:45

Removed old commented code

4a62842

Removed old commented code

0f23b63

sounakr approved these changes Oct 12, 2023

Added slightly more tests

a591710

adolkhan reviewed Oct 12, 2023

adolkhan approved these changes Oct 12, 2023

Merge branch 'main' of https://github.com/activeloopai/deeplake into …

29efde8

…add_regen_back

istranic merged commit 0ec0301 into main Oct 13, 2023
13 of 14 checks passed

istranic deleted the add_regen_back branch October 13, 2023 11:37


adolkhan Oct 12, 2023

istranic Oct 12, 2023

adolkhan Oct 12, 2023

adolkhan Oct 12, 2023

istranic Oct 12, 2023

sounakr Oct 13, 2023

sounakr Oct 13, 2023

istranic Oct 13, 2023
