Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(elastic): update offset2ids management #416

Merged
merged 10 commits into from
Jun 27, 2022

Conversation

alphinside
Copy link
Contributor

Goals:

Solving inconsistent offset issue on deletion #407.

I've found that after index deletion, update offset persistence has missed on truncating the trailing offset if the len of ids is less than initial length. So in this PR I submit logic to reevaluate offset length compared to in memory offset ids list and bulk delete offset which is more than the offset ids len

Also solving inconsitent offset issue on extend/add #412

Elasticsearch handle document indexing as upsert operation when indexing using same id, however the extend logic didn't using this logic, so I update the offset indexing to not extend id that already exist

  • ...
  • ...
  • check and update documentation, if required. See guide

@alphinside alphinside changed the title fix(elastic): update offset2ids logic fix(elastic): update offset2ids management Jun 23, 2022
@alphinside alphinside changed the title fix(elastic): update offset2ids management draft: fix(elastic): update offset2ids management Jun 24, 2022
@alphinside alphinside changed the title draft: fix(elastic): update offset2ids management fix(elastic): update offset2ids management Jun 24, 2022
@alphinside
Copy link
Contributor Author

please re-review @JoanFM

@codecov
Copy link

codecov bot commented Jun 24, 2022

Codecov Report

Merging #416 (a22aa54) into main (f3534cc) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
+ Coverage   86.53%   86.55%   +0.01%     
==========================================
  Files         134      134              
  Lines        6389     6395       +6     
==========================================
+ Hits         5529     5535       +6     
  Misses        860      860              
Flag Coverage Δ
docarray 86.55% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/array/storage/elastic/backend.py 94.87% <100.00%> (+0.27%) ⬆️
docarray/array/storage/elastic/seqlike.py 91.66% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e514c7...a22aa54. Read the comment docs.

assert len(elastic_doc) == len(elastic_doc[:, 'embedding'])
assert len(elastic_doc) == indexed_offset_count

elastic_doc._client.indices.delete(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not erase here, just create a random name for the index at the beginning so u do not need to care about this. In any case if test fails the index will be polluted and other tests may fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

)['hits']['hits'][0]['_id']
assert actual_offset_index == expected_offset

elastic_doc._client.indices.delete(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, nothing after asserts should be needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

@alphinside alphinside requested a review from JoanFM June 24, 2022 23:39
index=elastic_doc._index_name_offset2id
)['count']

assert len(elastic_doc) == len(elastic_doc[:, 'embedding'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assert that len(elastic_doc) == 7 also for extra security, this test otherwise would pass even with wrong behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay noted

index=elastic_doc._index_name_offset2id
)['count']

assert len(elastic_doc._offset2ids.ids) == indexed_offset_count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, what should be the length here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep will be updated

@alphinside alphinside requested a review from JoanFM June 26, 2022 23:20
'n_dim': 3,
'columns': [('price', 'int')],
'distance': 'l2_norm',
'index_name': 'test_add',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you perhaps randomize the index name or give the index name the exact same name as the test function name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sure will do

@alphinside alphinside requested a review from JoanFM June 27, 2022 05:09
Copy link
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @alphinside for this great contribution!

@JoanFM JoanFM merged commit 87aff37 into docarray:main Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants