fix: discard metadata fields if not set in Weaviate by masci · Pull Request #3578 · deepset-ai/haystack

masci · 2022-11-15T11:12:40Z

Related Issues

fixes https://github.com/deepset-ai/haystack/actions?query=workflow:Tests

Weaviate hardcodes certain metadata fields into its index, in this case name. When name is not present in the original document, and only if other metadata is set, one full write/read roundtrip leaves you with meta["name"] == None in the resulting doc.
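The symptom described above can be reproduced in isolation with a toy model. This is only a sketch under stated assumptions: `HARDCODED_PROPERTIES` and `write_then_read` are hypothetical names, no Weaviate client is involved, and the "only if other metadata is set" condition is modelled with a plain truthiness check.

```python
from typing import Any, Dict

# Hypothetical stand-in for the properties the index schema always defines.
HARDCODED_PROPERTIES = ("name",)


def write_then_read(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Simulate one write/read roundtrip through a store with a fixed schema."""
    stored = dict(meta)  # write: only the values that were actually set
    result = dict(stored)
    if stored:  # assumption: the symptom needs some other metadata to be set
        for prop in HARDCODED_PROPERTIES:
            # read: a schema property without a stored value comes back as None
            result.setdefault(prop, None)
    return result


roundtrip_meta = write_then_read({"author": "someone"})
```

In this model `roundtrip_meta` gains a `name` key whose value is `None`, even though the input never contained one.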

Proposed Changes:

Drop the name field from meta if its value is None
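A minimal sketch of that idea, not the actual diff: `drop_unset_fields` and its `fields` parameter are hypothetical names introduced for illustration, and the real change in `haystack/document_stores/weaviate.py` may check the condition differently.

```python
from typing import Any, Dict, Iterable


def drop_unset_fields(meta: Dict[str, Any], fields: Iterable[str] = ("name",)) -> Dict[str, Any]:
    """Return a copy of meta without the listed fields whose value is None."""
    fields = set(fields)
    return {
        key: value
        for key, value in meta.items()
        # Only the listed fields are dropped, and only when they are None.
        if not (key in fields and value is None)
    }


cleaned = drop_unset_fields({"name": None, "author": "someone"})
```

Fields that hold a real value, and None-valued fields outside the list, pass through unchanged.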

How did you test it?

pytest -m "document_store and integration" test/document_stores/test_weaviate.py -x -vv

Notes for the reviewer

While running the tests, another bug was exposed: the embedding field in Document was set regardless of the return_embeddings init param - this is fixed now.
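The intended behaviour can be sketched as follows. `Doc` and `build_doc` are hypothetical stand-ins rather than Haystack classes, and the flag name simply follows the wording used in this note.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical minimal stand-in for a document object."""
    content: str
    embedding: Optional[List[float]] = None


def build_doc(content: str, vector: List[float], return_embeddings: bool) -> Doc:
    """Attach the stored vector only when the caller asked for embeddings."""
    return Doc(content=content, embedding=vector if return_embeddings else None)


without_vector = build_doc("hello", [0.1, 0.2], return_embeddings=False)
with_vector = build_doc("hello", [0.1, 0.2], return_embeddings=True)
```

With the flag off, the embedding stays `None` even though the store holds a vector for the document.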

I also noticed that the original documents passed to write_documents are modified when they don't contain embeddings, which I think is wrong, but I guess that is for another time.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

bogdankostic

Added one comment regarding how we check whether to include a metadata field.

haystack/document_stores/weaviate.py

bogdankostic · 2022-11-15T15:53:11Z

> I also noticed the original documents passed to write_documents are modified when they don't contain embeddings, which I think it's wrong but I guess that goes for another time.

I think we do this because Weaviate requires Documents to contain embeddings in order to index them. (At least this was the case at the time we added WeaviateDocumentStore to Haystack if I recall correctly.)
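One way the side effect discussed in this thread might be avoided is to fill in the missing embedding on a copy rather than on the caller's object. This is a sketch only: documents are modelled as plain dicts, and `prepare_for_indexing` and `placeholder` are hypothetical names, not part of WeaviateDocumentStore.

```python
import copy
from typing import Any, Dict, List


def prepare_for_indexing(
    documents: List[Dict[str, Any]], placeholder: List[float]
) -> List[Dict[str, Any]]:
    """Return copies ready for indexing, leaving the caller's documents untouched."""
    prepared = []
    for doc in documents:
        doc_copy = copy.deepcopy(doc)  # never mutate the input
        if doc_copy.get("embedding") is None:
            # Assumption: the index needs some vector, so use a placeholder.
            doc_copy["embedding"] = list(placeholder)
        prepared.append(doc_copy)
    return prepared


originals = [{"content": "hello", "embedding": None}]
indexed = prepare_for_indexing(originals, placeholder=[0.0, 0.0])
```

The copies carry the placeholder vector while `originals` keeps `embedding` set to `None`.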

bogdankostic

LGTM, cool that you directly fixed the bug about return_embeddings! :)


fix weaviate bug in returning embeddings and setting empty meta fields

0801b20

masci added the type:bug and topic:weaviate labels Nov 15, 2022

masci requested a review from a team as a code owner November 15, 2022 11:12

masci requested review from bogdankostic and removed request for a team November 15, 2022 11:12

julian-risch mentioned this pull request Nov 15, 2022

test: add test to check id_hash_keys is not ignored #3577

Merged

6 tasks

bogdankostic requested changes Nov 15, 2022


review comment

e280722

masci requested a review from bogdankostic November 15, 2022 17:00

bogdankostic approved these changes Nov 15, 2022


masci mentioned this pull request Nov 15, 2022

fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" #3548

Merged

6 tasks

masci merged commit ba75d39 into main Nov 15, 2022

masci deleted the massi/weaviate-bug branch November 15, 2022 21:02

masci merged 2 commits into main from massi/weaviate-bug
