Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(index): add epsilla connector #1835

Merged
merged 7 commits into from
Dec 19, 2023

Conversation

tonyyanga
Copy link
Contributor

This PR adds the EpsillaDocumentIndex. It allows docarray users to use Epsilla vector database for queries.

It is the 1st PR for issue #1832. I intend to make a follow-up PR to add documentation for the connector, but want to limit the size of the PR to make it easier to review.

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga tonyyanga force-pushed the tonyyang-add-epsilla-db-connector branch from de419f5 to 91e9351 Compare December 4, 2023 19:06
Copy link

codecov bot commented Dec 4, 2023

Codecov Report

Attention: 30 lines in your changes are missing coverage. Please review.

Comparison is base (522811f) 85.05% compared to head (03c2b9c) 85.07%.
Report is 2 commits behind head on main.

Files Patch % Lines
docarray/index/backends/epsilla.py 87.44% 27 Missing ⚠️
docarray/index/__init__.py 50.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1835      +/-   ##
==========================================
+ Coverage   85.05%   85.07%   +0.02%     
==========================================
  Files         135      136       +1     
  Lines        9031     9254     +223     
==========================================
+ Hits         7681     7873     +192     
- Misses       1350     1381      +31     
Flag Coverage Δ
docarray 85.07% <86.42%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice contribution, I have some comments about the PR.

docarray/index/backends/epsilla.py Outdated Show resolved Hide resolved
docarray/index/backends/epsilla.py Outdated Show resolved Hide resolved
docarray/index/backends/epsilla.py Outdated Show resolved Hide resolved
docarray/index/backends/epsilla.py Outdated Show resolved Hide resolved
docarray/index/backends/epsilla.py Outdated Show resolved Hide resolved
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga
Copy link
Contributor Author

@JoanFM Thanks for the review!

I have updated the input validation logic to report specific parameters missing.

It is unfortunate that Epsilla's python library does not include pythonic handling of error (throw instead of status codes). It directly returns HTTP status codes. For now I replaced it with from http import HTTPStatus.

When there is a conflict on loading the db, Epsilla returns HTTP 500 when it should return HTTP 409. I have filed an issue, but parsing the string error message is the best workaround for now.

What do you think? Please let me know if there are other issues we should address.

or "already exists" in response["message"]
if status_code != HTTPStatus.OK:
# Epsilla returns HTTP 500 when multiple clients connect to the same db
# Bug filed at https://github.com/epsilla-cloud/vectordb/issues/93
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the quick is fix on Epsilla side, would it be possible to wait for a patch on epsille side?

)

def num_docs(self) -> int:
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there a way to know the number of docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no public API for counting the number of docs. One workaround is to run filter query without any condition, but I worry that is quite inefficient. Would you prefer to implement num_docs with that rather than throwing here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no API, okey. But I will double check if some functionality actually depends on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked. num_docs impacts these two methods:

  • __len__
  • _is_index_empty

I pushed a _filter based implementation for _is_index_empty. Confirmed there is nothing that depends on num_docs is limited (nothing in abstract.py depends on this). But there is no easy way to check all the len() locations.

$ grep -rn -l --exclude-dir=./tests "num_docs()"
./docarray/index/abstract.py
./docarray/index/backends/milvus.py
./docarray/index/backends/redis.py
./docarray/index/backends/hnswlib.py
./docs/user_guide/storing/index_hnswlib.md
./docs/user_guide/storing/index_elastic.md
./docs/user_guide/storing/index_in_memory.md
./docs/user_guide/storing/index_qdrant.md
./docs/user_guide/storing/index_redis.md

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and _is_index_empty, I think it was checked in many queries. I am not sure if the result is kind of cached. I have little access to check properly now the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For _is_index_empty, it should work because I have overriden the default implementation. See https://github.com/docarray/docarray/pull/1835/files#diff-2584da09c45f6eeaca7247dcdfd1a1f1cb1da5752f9a00289cb3f9a52e228969R361-R368

@JoanFM
Copy link
Member

JoanFM commented Dec 5, 2023

@JoanFM Thanks for the review!

I have updated the input validation logic to report specific parameters missing.

It is unfortunate that Epsilla's python library does not include pythonic handling of error (throw instead of status codes). It directly returns HTTP status codes. For now I replaced it with from http import HTTPStatus.

When there is a conflict on loading the db, Epsilla returns HTTP 500 when it should return HTTP 409. I have filed an issue, but parsing the string error message is the best workaround for now.

What do you think? Please let me know if there are other issues we should address.

Do you plan to have the fix for that behavior anytime soon on epsilla side?

@tonyyanga
Copy link
Contributor Author

Epsilla is going to merge a quick fix the load_db behavior: epsilla-cloud/vectordb#93 Will wait for them.

fix input validation logic

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga tonyyanga force-pushed the tonyyang-add-epsilla-db-connector branch from 744d7ee to c6deb79 Compare December 5, 2023 20:00
@tonyyanga
Copy link
Contributor Author

Epsilla fixed on their end. I've pushed a change to use HTTP 409 status codes rather than relying on error messages.

Love to know what you think about the num_docs implementation. :-)

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga tonyyanga force-pushed the tonyyang-add-epsilla-db-connector branch from a3923d0 to 1d3ff6d Compare December 6, 2023 03:19
@tonyyanga
Copy link
Contributor Author

@JoanFM Any other issues to address before we can merge this change?

@JoanFM
Copy link
Member

JoanFM commented Dec 9, 2023

I will review the implications of num_docs in detail on Monday.

@JoanFM
Copy link
Member

JoanFM commented Dec 11, 2023

Is it possible for you to have an API that decides whether the table is empty or not? Otherwise, u can try to override _is_index_empty to say always False. And make sure that all the methods work when the table is empty returning empty results properly.

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga
Copy link
Contributor Author

@JoanFM

Unfortunately, there is no API for that. I have pushed a change to override _is_index_empty to return False, and added test cases with the index being empty.

Please let me know if there are other things to address.

@tonyyanga
Copy link
Contributor Author

Hey @JoanFM

Gently bumping this - please let me know if there are other things to address.

@JoanFM JoanFM merged commit ff00b60 into docarray:main Dec 19, 2023
40 checks passed
@JoanFM
Copy link
Member

JoanFM commented Dec 19, 2023

Hello @tonyyanga ,

Thanks for the contribution, we will be waiting for the documentation PR

@tonyyanga
Copy link
Contributor Author

Thank you! will follow up with it, hopefully later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants