Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Weaviate as another DocumentStore #957 #1064

Merged
merged 26 commits into from
Jun 10, 2021
Merged

Integrate Weaviate as another DocumentStore #957 #1064

merged 26 commits into from
Jun 10, 2021

Conversation

venuraja79
Copy link
Contributor

@venuraja79 venuraja79 commented May 16, 2021

Proposed changes:

  • Weaviate as a document store

Status (please check what you already did):

  • [*] First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

Discussion Points:
update embeddings - My pov is that update embeddings is not required because
weaviate will automatically generate embedding if it isn't passed. Is this OK?

write documents - if there is a failure in a batch, we will log an error message.
User will debug and figure out to format the documents correctly. Is this OK?

To implement filters in query, GraphQL is supported. Can we allow the user to provide the filter as a text
instead of a Dict? An example below -

where_filter = {
"path": ["wordCount"],
"operator": "GreaterThan",
"valueInt": 1000
}
We will need the user to pass 'path', 'operator' and both 'value' & 'valueType'
in case if we want to construct the filter text on the fly. Any suggestions please?

Open items:
code:

  • implement get_document_count()
  • implement filters in query and get_all_douments()
  • other TODO items marked in code

tests

  • update conftest.py with fixtures for weaviate
  • add test_weaviate.py and include all the required tests

ci/cd
-add weaviate dockers in ci.yml

@lalitpagaria
Copy link
Contributor

Cross tagging issue #957

@lalitpagaria
Copy link
Contributor

@venuraja79 Thank you for creating PR
I see this PR, already in great shape.
I will review the code in couple of days and provide early feedback.
Adding @tholor @oryx1729 as well to get their feedback.

@venuraja79
Copy link
Contributor Author

@lalitpagaria @tholor Just trying to get some attention :) - Once we decide on the above design aspects and also an initial review of the source code, we can update the PR with tests & next improved version.

Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @venuraja79,

First of all: Thanks a lot for working on this! I think it will be a great addition to Haystack - especially as Weaviate allows vector search with scalar filters which FAISS and Milvus don't allow for (yet).
Sorry for my late reply here - was a crazy week and wanted to first check weaviate out as I am not familiar with their design yet.

However, here are already a few comments / observations / questions to not stale the process here and especially gain clarity on the major design decisions:

update embeddings - My pov is that update embeddings is not required because
weaviate will automatically generate embedding if it isn't passed. Is this OK?

This is actually the core question I have. I only had a quick look at weaviate but I understand that they actually have an own module tight to the actual "Document Store", where the user can pick a model and weaviate takes care of embedding creation.
This is fundamentally different to all other vector stores in Haystack. Just using their models is not really an option as the optionality seems rather limited there, we want to be able to easily swap different documentstores and combine them with our retrievers in Haystack. I see two directions:

a) We don't use their models and just pass embeddings / update them as with the other document stores.
=> Seems to be the cleaner solution to me as we have full compatibility with Haystack models, the API will be quite similar to the others and we have control over model configuration (GPU, preprocessing, distribution ...). Not sure though if this plays nicely with weaviate and if there are any nice features in weaviate that we wouldn't be able to use in this case.

b) We allow both: just passing embeddings from Haystack or running a weaviate model
=> This might be quite complex to implement in a user-friendly way. We would need to expose the model choice somehow in the documentstore and add a "dummy WeaviateRetriever". This will be fundamentally different to the other documentstores and could add confusion further down the line in pipelines.

write documents - if there is a failure in a batch, we will log an error message.
User will debug and figure out to format the documents correctly. Is this OK?

This behavior will change soon in #1069. If the implementation is ready before this PR get's merged we can update it in here, too. For now, let's not worry about it.

To implement filters in query, GraphQL is supported. Can we allow the user to provide the filter as a text
instead of a Dict? An example below -

where_filter = {
"path": ["wordCount"],
"operator": "GreaterThan",
"valueInt": 1000
}
We will need the user to pass 'path', 'operator' and both 'value' & 'valueType'
in case if we want to construct the filter text on the fly. Any suggestions please?

I think it will be cleaner to have the same syntax as in the other document stores and then convert the dict to weaviate format under the hood. Again, this will simplify compatibility of different pipelines so that users can easily do prototyping in A and deploy it in B.
FYI We plan to improve the syntax and functionality of filtering in the other document stores (e.g. to support date ranges, numerical comparisons etc). I don't think it makes sense to mix it with this PR though. It would just blow up the complexity. So let's implement the current filter style here and refactor it once we have settled on the new filter syntax in Haystack.


def __init__(
self,
weaviate_url: str = "http://localhost:8080",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep naming of args as similar to other document stores as possible.
In this case, I would suggest keeping it similar to elasticsearch and split it into host + port.
(I know we use sql_url in FAISS - reasoning back then was that just host confuses people as they would probably look for a FAISS host in a FAISSDocumentstore)

See https://www.semi.technology/developers/weaviate/current/data-schema/schema-configuration.html
:param module_name : Vectorization module to convert data into vectors. Default is "text2vec-trasnformers"
For more details, See https://www.semi.technology/developers/weaviate/current/modules/
:param update_existing_documents: Whether to update any existing documents with the same ID when adding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you saw in #1069, this arg is going to change soon.

haystack/document_store/weaviate.py Outdated Show resolved Hide resolved
@venuraja79
Copy link
Contributor Author

Thank you @tholor for the feedback.
#1 - Embeddings - To be consistent with Haystack document stores, I understand that we have to find a way to store embeddings passed by the client code into Weaviate. (not the ones generated by it). Also, as Weaviate is both a store & dense retriever, (in Haystack terms), we may want to leverage both its features.

#2. Filter Implementation: Another challenge is that elastic search has a _meta to store any additional custom (dynamic ?) key-value pairs. However, Weaviate doesn't support anything like that by default because of the schema requirement. As a hack, converted all the meta into a string during write and converted back to a dict while reading. But this will not help in filtering by any meta data. Planning to do some experiments with a meta class or raise a question in Weaviate forum.

@venuraja79
Copy link
Contributor Author

@tholor, @lalitpagaria Please see this issue that I created in Weaviate repo for creating dynamic properties. We can make a decision on filter implementation based on the direction here. Thank you!
weaviate/weaviate#1587

@venuraja79
Copy link
Contributor Author

Additional updates:
1 - Weaviate can run without a module (i.e., no vectorizer) and simply can be used as a store. Client can pass a custom embedding to it. please see this link for more details. This part is resolved now, we can go ahead with the implementation.
https://www.semi.technology/developers/weaviate/current/tutorials/how-to-use-weaviate-without-modules.html

  1. Filter Implementation:
    Here is a hack to store dynamic key/value pairs (meta). By creating a new class "Meta" that has key and value as properties and referencing it in Document class, we can accommodate the "dynamic" meta. This will require additional messaging while write and read, but the approach seem to work work for filters.

I think we have a few options -
a. No meta support in Weaviate in the first version and add this after Weaviate supports dynamic schema - simple & clean approach.
b. We can go ahead with the meta approach below - read/write turns complex and also we'll have to worry about backward compatibility once Weaviate releases dynamic schema feature.

your views/suggestions please! Also, waiting for feedback from Weaviate team.

Schema for the meta class:
{
"class": "Meta",
"description": "Store Document's dynamic meta data",
"properties": [
{
"dataType": ["string"],
"description": "key of the meta",
"name": "key"
},
{
"dataType": ["string"],
"description": "value of the meta",
"name": "value"
}
]
}
For example, write has the below steps -

  1. Write document
  2. Write all meta key-value pairs (KVP)
  3. Write references of KVPs to the document

@lalitpagaria
Copy link
Contributor

@venuraja79 Thanks for the update.
In my view this one is good approach -

a. No meta support in Weaviate in the first version and add this after Weaviate supports dynamic schema - simple & clean approach.

This will save us from adding hackish code. But it will add another embedding store (FAISS and Milvus) to Haystack which does not support filtering. @tholor WDYT?

@bobvanluijt
Copy link
Contributor

Hi all –

I just noticed this PR (thanks for mentioning it on the Weaviate Slack channel @venuraja79)

RE: @tholor

[...]especially as Weaviate allows vector search with scalar filters [...]

I would agree that this would add the most value to Haystack because Weaviate is created with a scalable “vector-first” approach (also see the feature overview). We also have some internal benchmarks ready that you are probably going to like 😊

When it comes to horizontal scalability, we will be releasing a detailed architecture roadmap for this very soon on our website (I’ll leave a link here when it's live), but I can already share that the ETA is in Q3 this year.

Long story short, it seems the timing is perfect. Around the time this feature is finished, horizontal scalability is around the corner in Weaviate.

PS:
Also tagging @etiennedi in here because he is working on this part in Weaviate and @laura-ham, who is working on the Weaviate documentation.

PPS:
Awesome work @venuraja79 💪😊👍

@etiennedi
Copy link

Great to see this PR develop and Weaviate being used! As you've already identified absolutely correctly, you can use Weaviate as a "pure vector storage and search" or optionally use modules to include the encoding journey as well. I'm also really happy to read that the combining of scalar and vector search is explicitly highlighted, as that's one of the things that make Weaviate really unique in my opinion. The filtering happens pre-search by the way, so no messing around with very large k values is required on restrictive filters, such as with post-filtering.

The planned auto-schema feature has also been mentioned in this thread quite a lot. It's currenlty not in short-term prioritization yet, but it's often asked for and we'll make sure that it gets prioritized soon. I can let you know once we have an ETA for it.

If you want to get started adding metadata before that point already, you can also simply cache the schema on your side and add a new property ever time you encounter a previously unseen one. (Essentially this is what we're going to do in the feature, too). See this Slack post for a more detail outline of this workaround.

@tholor
Copy link
Member

tholor commented Jun 4, 2021

Thanks for the pointers @etiennedi @bobvanluijt !

@venuraja79 I see two ways forward regarding meta data filtering:

  1. Implementing the auto-schema hack that @etiennedi proposed in the slack thread
  2. Requiring the user to set up the schema at init time of the document store with all meta fields that are planned (and possibly a method add_fields_to_schema to extend it later. If there are docs with unexpected meta fields at runtime, we'll throw an exception.

Both ways are fine for me. 1) is obviously more elegant but also more work. I leave the choice up to you.

FYI: we will soon also work on a better filtering syntax to allow for more data types in all documentstores (#625). Right now we only support string filtering end-to-end.

@venuraja79
Copy link
Contributor Author

Thank you all for the help!
Just a quick update - I tried the caching auto-schema hack that @etiennedi suggested and it works like a charm.

@tholor
Copy link
Member

tholor commented Jun 6, 2021

Awesome! I think then we are on the last mile of this PR :)
@venuraja79 can you please fix the mypy typing issues (let me know if you need help here)?
Once CI is passing we can do a final review round.

@venuraja79
Copy link
Contributor Author

@tholor @lalitpagaria This is mostly code complete except for a couple of outstanding issues. please review when you get a chance.

Few points to note:
Created a separate test file because of the below Weaviate differences
Property names can't have _ or numbers
Mandatory to pass a vector when creating new documents
Only UUID is accepted as a valid document ID in weaviate
Class Names (index) have to be PascalCase.

Limitations

  1. Filters: Only simple string equal to filters are allowed.
  2. Dynamic property can only be a string (for now)

Update Embeddings:
It is possible to update all the existing documents. No notion of docs without embeddings as it's mandatory to pass an embedding while creating document

Known Issue:
Query (with text) is not working because we aren't using a vectorizer. I'll seek a workaround for this from weaviate team. Query using an embedding works though.

@lalitpagaria
Copy link
Contributor

@venuraja79 Tests are failing because of missing weaviate dependency

06/08/2021 06:24:17 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
ImportError while loading conftest '/home/runner/work/haystack/haystack/test/conftest.py'.
conftest.py:12: in <module>
    import weaviate
E   ModuleNotFoundError: No module named 'weaviate'
Error: Process completed with exit code 4.

@etiennedi
Copy link

Hi @venuraja79,

some answers inlined:

Property names can't have _ or numbers

we want to open this up, but unfortunately it hasn't made the cut for v1.4.0, so it will probably have to wait for another release.

Mandatory to pass a vector when creating new documents

This is no longer the case in v1.4.0 (which will be released this week). In this new version you can skip vector indexing a class, then no vector needs to be provided.

Only UUID is accepted as a valid document ID in weaviate

Thanks for raising this, we'll think about what it would take to allow arbitrary url-safe keys. In theory there should be nothing in the way of it, but it would definitely be a larger change, as for now Weaviate relies on UUID keys quite a bit. So, only expect changes in the long-term here.

Class Names (index) have to be PascalCase.

This will also be loosened once we losen up allowed characters in property names.

Limitations

  1. Filters: Only simple string equal to filters are allowed.
  2. Dynamic property can only be a string (for now)

Do I understand it correctly that this are limitations in this PR and not in Weaviate? Because Weaviate's filters should be quite flexible and allow a lot more than this.

Update Embeddings:
It is possible to update all the existing documents. No notion of docs without embeddings as it's mandatory to pass an embedding while creating document

I think I need a bit more context on this one. But as mentioned above, the requirement to pass an embedding will go away in v1.4.0 if you choose not to vector-index a class.

Known Issue:
Query (with text) is not working because we aren't using a vectorizer. I'll seek a workaround for this from weaviate team. Query using an embedding works though.

If you aren't using a vectorizer, Weaviate does not know how text can be embedded as vectors. The idea behind the vectorizer system in Weaviate is that it is used both at import time as well as at search time. This makes sure the same text2vec mechanism is used and the vector spaces are compatible. Therefore - without a vectorizer - there is also no search time vectorizer. How do you currently do this with ES or faiss where the only interface is vectors? Can you use the same process you use to create embeddings for the objects at import time to also create embeddings for the search query?

Hope I could help with the Weaviate-related questions.

@venuraja79
Copy link
Contributor Author

Thank you @etiennedi for your quick response.

Most of the comments here are related to this PR except the last question, please see inline -

Hi @venuraja79,

some answers inlined:

Property names can't have _ or numbers

we want to open this up, but unfortunately it hasn't made the cut for v1.4.0, so it will probably have to wait for another release.

Mandatory to pass a vector when creating new documents

This is no longer the case in v1.4.0 (which will be released this week). In this new version you can skip vector indexing a class, then no vector needs to be provided.

Only UUID is accepted as a valid document ID in weaviate

Thanks for raising this, we'll think about what it would take to allow arbitrary url-safe keys. In theory there should be nothing in the way of it, but it would definitely be a larger change, as for now Weaviate relies on UUID keys quite a bit. So, only expect changes in the long-term here.

Class Names (index) have to be PascalCase.

This will also be loosened once we losen up allowed characters in property names.

Limitations

  1. Filters: Only simple string equal to filters are allowed.
  2. Dynamic property can only be a string (for now)

Do I understand it correctly that this are limitations in this PR and not in Weaviate? Because Weaviate's filters should be quite flexible and allow a lot more than this.

Yes, you are correct! This is a limitation for this PR as we want to keep this filter feature simple to start with.

Update Embeddings:
It is possible to update all the existing documents. No notion of docs without embeddings as it's mandatory to pass an embedding while creating document

I think I need a bit more context on this one. But as mentioned above, the requirement to pass an embedding will go away in v1.4.0 if you choose not to vector-index a class.

This is a PR feature that updates embeddings only for the documents that didn't have a vector associated with them while indexing. Your answer already helps here.

Known Issue:
Query (with text) is not working because we aren't using a vectorizer. I'll seek a workaround for this from weaviate team. Query using an embedding works though.

If you aren't using a vectorizer, Weaviate does not know how text can be embedded as vectors. The idea behind the vectorizer system in Weaviate is that it is used both at import time as well as at search time. This makes sure the same text2vec mechanism is used and the vector spaces are compatible. Therefore - without a vectorizer - there is also no search time vectorizer. How do you currently do this with ES or faiss where the only interface is vectors? Can you use the same process you use to create embeddings for the objects at import time to also create embeddings for the search query?

Hope I could help with the Weaviate-related questions.

Yes, thank you!

This is our understanding as well.
Other stores such as ES, SQL support inverted index based search. My understanding is that faiss and milvus only store the vectors and the actual document is still stored in a SQL store. @lalitpagaria / @tholor please correct this isn't the case.

Because weaviate is primarily vector search, we can allow only to query using an embedding. I hope this is OK for now. Anyway waiting for the team to confirm.

@etiennedi
Copy link

Thanks for the reply.

Other stores such as ES, SQL support inverted index based search.

You can also perform inverted index searches in Weaviate using the where filter. If the property type is text and the words are broken up using spaces it will match documents where each word is contained. There is no td-idf/bm25 scoring on the individual search queries at the moment though. (We might add this scoring later, the architectural design did take this into account, but so far we've seen the largest amount of users use vector search and only use the inverted index for filtering, but not for text-retrieval).

I fully agree that it makes most sense to use Weaviate for vector search, however. :)

My understanding is that faiss and milvus only store the vectors and the actual document is still stored in a SQL store.

This is in my (biased) opinion one of the biggest strengths of Weaviate. It allows for scaling things together that live together, i.e. document store, inverted index and vector index always grow together. And in a multi-node setup cross-node traffic is minimized. Also it allows for efficient pre-filtering when combining vector and scalar search. :)

@bobvanluijt
Copy link
Contributor

bobvanluijt commented Jun 8, 2021

Additionally, links:

Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good!
Left one minor comment about test markers.
We should also adjust the description in the docstring slightly - I can take care of this.
Should be ready to merge very soon 🎉

test/test_weaviate.py Outdated Show resolved Hide resolved
@tholor
Copy link
Member

tholor commented Jun 8, 2021

Other stores such as ES, SQL support inverted index based search. My understanding is that faiss and milvus only store the vectors and the actual document is still stored in a SQL store. @lalitpagaria / @tholor please correct this isn't the case.

Yes, that's completely correct. FAISS and milvus are pure "vector stores" and therefore we use SQL in our implementation to store docs and metadata.

Because weaviate is primarily vector search, we can allow only to query using an embedding. I hope this is OK for now. Anyway waiting for the team to confirm.

The focus of weaviate is clearly on vector search. We can optionally also support query() with the basic keyword matching @etiennedi described or raise a NotImplementedError. Either way is fine with me. My current understanding is that the currently implemented query() method is already doing the keyword matching. Is that correct @venuraja79 ?

@tholor tholor changed the title WIP : Integrate Weaviate as another DocumentStore #957 Integrate Weaviate as another DocumentStore #957 Jun 8, 2021
@venuraja79
Copy link
Contributor Author

venuraja79 commented Jun 8, 2021

Other stores such as ES, SQL support inverted index based search. My understanding is that faiss and milvus only store the vectors and the actual document is still stored in a SQL store. @lalitpagaria / @tholor please correct this isn't the case.

Yes, that's completely correct. FAISS and milvus are pure "vector stores" and therefore we use SQL in our implementation to store docs and metadata.

Because weaviate is primarily vector search, we can allow only to query using an embedding. I hope this is OK for now. Anyway waiting for the team to confirm.

The focus of weaviate is clearly on vector search. We can optionally also support query() with the basic keyword matching @etiennedi described or raise a NotImplementedError. Either way is fine with me. My current understanding is that the currently implemented query() method is already doing the keyword matching. Is that correct @venuraja79 ?

query() method supports keyword matching via filters. For example {"text" : "some text"} will return all documents that contains "some text". However, If the user passes only query text (as a method parameter), we raise a NotImplementedError. This is because the number of properties to search can be many and it's difficult to construct the query (for now). In short, the user has to provide both the property and the keywords as a filter.

In addition, we also allow the user to pass a custom graphQL query which can be very powerful.
@tholor - please review the query() method as I have updated this now and also added the test for this.

@venuraja79 venuraja79 requested a review from tholor June 9, 2021 13:49
@@ -76,6 +76,9 @@ jobs:
- name: Run Milvus
run: docker run -d -p 19530:19530 -p 19121:19121 milvusdb/milvus:1.1.0-cpu-d050721-5e559c

- name: Run Weaviate
run: docker run -d -p 8080:8080 --name haystack_test_weaviate --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.3.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion to set Weaviate to version 1.4.0

@bobvanluijt
Copy link
Contributor

Hi @venuraja79Weaviate 1.4.0 is just released so you might want to use that as mentioned here.

Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me. Let's update to 1.4.0 and merge once CI is passing

@tholor tholor merged commit 49886f8 into deepset-ai:master Jun 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants