-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(index): add epsilla connector #1835
feat(index): add epsilla connector #1835
Conversation
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
de419f5
to
91e9351
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #1835 +/- ##
==========================================
+ Coverage 85.05% 85.07% +0.02%
==========================================
Files 135 136 +1
Lines 9031 9254 +223
==========================================
+ Hits 7681 7873 +192
- Misses 1350 1381 +31
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the nice contribution, I have some comments about the PR.
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@JoanFM Thanks for the review! I have updated the input validation logic to report specific parameters missing. It is unfortunate that Epsilla's python library does not include pythonic handling of error (throw instead of status codes). It directly returns HTTP status codes. For now I replaced it with from http import HTTPStatus. When there is a conflict on loading the db, Epsilla returns HTTP 500 when it should return HTTP 409. I have filed an issue, but parsing the string error message is the best workaround for now. What do you think? Please let me know if there are other issues we should address. |
docarray/index/backends/epsilla.py
Outdated
or "already exists" in response["message"] | ||
if status_code != HTTPStatus.OK: | ||
# Epsilla returns HTTP 500 when multiple clients connect to the same db | ||
# Bug filed at https://github.com/epsilla-cloud/vectordb/issues/93 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the quick is fix on Epsilla side, would it be possible to wait for a patch on epsille side?
) | ||
|
||
def num_docs(self) -> int: | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't there a way to know the number of docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently no public API for counting the number of docs. One workaround is to run filter
query without any condition, but I worry that is quite inefficient. Would you prefer to implement num_docs
with that rather than throwing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no API, okey. But I will double check if some functionality actually depends on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked. num_docs
impacts these two methods:
__len__
_is_index_empty
I pushed a _filter
based implementation for _is_index_empty
. Confirmed there is nothing that depends on num_docs
is limited (nothing in abstract.py
depends on this). But there is no easy way to check all the len()
locations.
$ grep -rn -l --exclude-dir=./tests "num_docs()"
./docarray/index/abstract.py
./docarray/index/backends/milvus.py
./docarray/index/backends/redis.py
./docarray/index/backends/hnswlib.py
./docs/user_guide/storing/index_hnswlib.md
./docs/user_guide/storing/index_elastic.md
./docs/user_guide/storing/index_in_memory.md
./docs/user_guide/storing/index_qdrant.md
./docs/user_guide/storing/index_redis.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and _is_index_empty, I think it was checked in many queries. I am not sure if the result is kind of cached. I have little access to check properly now the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For _is_index_empty
, it should work because I have overriden the default implementation. See https://github.com/docarray/docarray/pull/1835/files#diff-2584da09c45f6eeaca7247dcdfd1a1f1cb1da5752f9a00289cb3f9a52e228969R361-R368
Do you plan to have the fix for that behavior anytime soon on epsilla side? |
Epsilla is going to merge a quick fix the |
fix input validation logic Signed-off-by: Tony Yang <tonyyanga@gmail.com>
744d7ee
to
c6deb79
Compare
Epsilla fixed on their end. I've pushed a change to use HTTP 409 status codes rather than relying on error messages. Love to know what you think about the |
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
a3923d0
to
1d3ff6d
Compare
@JoanFM Any other issues to address before we can merge this change? |
I will review the implications of num_docs in detail on Monday. |
Is it possible for you to have an API that decides whether the table is empty or not? Otherwise, u can try to override |
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Unfortunately, there is no API for that. I have pushed a change to override Please let me know if there are other things to address. |
Hey @JoanFM Gently bumping this - please let me know if there are other things to address. |
Hello @tonyyanga , Thanks for the contribution, we will be waiting for the documentation PR |
Thank you! will follow up with it, hopefully later this week. |
This PR adds the
EpsillaDocumentIndex
. It allows docarray users to use Epsilla vector database for queries.It is the 1st PR for issue #1832. I intend to make a follow-up PR to add documentation for the connector, but want to limit the size of the PR to make it easier to review.