Handle invalid metadata for SQLDocumentStore #2868

Merged
merged 18 commits into from Jul 25, 2022

Member

@anakin87 anakin87 commented Jul 21, 2022

Related Issue(s): #2792

Proposed changes:

  • Skip writing metadata values that SQLAlchemy cannot automatically cast to string.
    Valid types are str, int, float, bool, bytes, bytearray, NoneType
  • Added a specific test

In #2792, we seem to agree that it would be good to have a more expressive format for metadata values (not simply strings).
We also understand that it would be a complex change that could break several things.

So, to avoid blocking errors, the best solution for now seems to be to skip writing metadata that SQLAlchemy does not handle correctly.
I've started this implementation...
(Another possibility is to write a string representation in the DB even for complex metadata.)
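
The alternative mentioned in parentheses could look roughly like the sketch below. It is not part of this PR: `to_storable` and `SIMPLE_TYPES` are hypothetical names, and JSON is just one assumed choice of string representation.

```python
import json

# Assumption: the types the PR description lists as safely castable to string.
SIMPLE_TYPES = (str, int, float, bool, bytes, bytearray, type(None))


def to_storable(value):
    """Return simple values unchanged and a JSON string for anything else (hypothetical helper)."""
    if isinstance(value, SIMPLE_TYPES):
        return value
    # default=str is a fallback for objects that json cannot encode natively.
    return json.dumps(value, default=str)


meta = {"title": "report", "tags": ["a", "b"], "pages": 3}
storable = {name: to_storable(value) for name, value in meta.items()}
print(storable)
```

The trade-off is that complex values come back from the database as strings, so whoever reads them would have to parse them again.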

I'm making the PR a draft so that all the tests run...

@anakin87 anakin87 marked this pull request as draft July 21, 2022 22:09
Contributor

@ZanSara ZanSara left a comment

Looks promising! That's a nice default behavior for the SQLDocumentStore for now. Left a few comments for improvements

Comment on lines 395 to 401
if value is None or isinstance(value, self.valid_metadata_types):
    valid_meta_orms.append(MetaDocumentORM(name=name, value=value))
else:
    logger.warning(
        f"Metadata '{name}' skipped for document {doc.id}, since it has invalid type: {type(value).__name__}.\n"
        f"SQLDocumentStore accepts only the following types: {', '.join([el.__name__ for el in self.valid_metadata_types])}"
    )
Contributor

For speed reasons I'd use a try-catch here. It also saves you from specifying the allowed types in line 112.

In addition, this is quite a bad error imho, so I'd raise the log level to ERROR or even EXCEPTION. Personally, if this issue requires action from the users in almost all cases, I'd go for EXCEPTION. If you see any situation in which users might want to ignore it for a good reason, then ERROR is better (the only difference between the two is that exception shows the stack trace while error doesn't).
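
To make the ERROR vs EXCEPTION difference concrete, here is a small sketch using only the standard `logging` module. The `capture` helper and the log message are made up for illustration; only `logger.error` and `logger.exception` are the real calls being compared.

```python
import io
import logging


def capture(log_call):
    """Raise an error, run log_call(logger, err) in the except block, return the logged text."""
    stream = io.StringIO()
    logger = logging.getLogger(f"sketch.{id(stream)}")
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        raise TypeError("invalid metadata type: list")
    except TypeError as err:
        log_call(logger, err)
    logger.removeHandler(handler)
    return stream.getvalue()


error_output = capture(lambda log, err: log.error("Skipped metadata: %s", err))
exception_output = capture(lambda log, err: log.exception("Skipped metadata: %s", err))

print("Traceback" in error_output)      # False
print("Traceback" in exception_output)  # True
```

Both calls log at the ERROR level; `exception` additionally appends the traceback of the exception currently being handled.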

Member Author

  • Agree on logging an ERROR. I think there are cases where the user simply wants to ignore this error (see the initial issue, "SQL based Datastores fail when document metadata has a list" #2792).

  • About try-catch: the problem is that the error only shows up during the first write operation on the whole record, e.g. here:

    self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()

    IMHO, we could properly use try-catch if the exception were raised during the creation of MetaDocumentORM.
    What do you think?
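
The timing problem described in the second bullet can be illustrated with an analogy. The sketch below uses the standard-library `sqlite3` module rather than SQLAlchemy, and `MetaRow` is a hypothetical plain container, so it only mirrors the pattern (object creation succeeds, the write fails later) and is not the code path in `SQLDocumentStore`.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class MetaRow:
    """Plain container standing in for an ORM object (hypothetical)."""
    name: str
    value: object


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (name TEXT, value TEXT)")

# Step 1: building the object succeeds, even with a list as the value.
row = MetaRow(name="tags", value=["a", "b"])

# Step 2: the failure only appears when the value reaches the database driver.
try:
    conn.execute("INSERT INTO meta (name, value) VALUES (?, ?)", (row.name, row.value))
    failed_at_write = False
except sqlite3.Error as err:
    failed_at_write = True
    print(f"write failed: {err}")
```

A try-catch around step 1 would never trigger here, which is why catching the problem at object creation needs some explicit validation.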

Contributor

Oh ok, I see the point. There are two ways to go:

  • This one (just fine), or
  • See if SQLAlchemy offers any hook to deal with unsupported types, like json and Pydantic do. Do you mind having a look? If the library offers no help there, we can stick with your original solution.

@@ -453,6 +453,32 @@ def test_write_document_meta(document_store: BaseDocumentStore):
    assert document_store.get_document_by_id("4").meta["meta_field"] == "test4"


@pytest.mark.parametrize("document_store", ["sql", "faiss", "milvus1", "milvus"], indirect=True)
Contributor

I would test all docstores here, so we're sure there is a consistent behavior across all of them.

Member Author

I wanted to test only the docstores whose metadata is written to SQL databases.
For the other docstores, I think we don't have this error...

Contributor

Oh you're right, because the other docstores would be happy to deal with lists in the meta, so they won't show this error.

Then we might even want to test only sql. I got thinking about this because, for example, Pinecone uses SQLDocumentStore as well, and keeping track of all the subclasses would be error-prone. Let's just test sql, it should be enough.

Comment on lines 476 to 479
assert not "invalid_meta_field" in document_store.get_document_by_id("1").meta
assert document_store.get_document_by_id("1").meta["valid_meta_field"] == "test1"
assert not "invalid_meta_field" in document_store.get_document_by_id("2").meta
assert document_store.get_document_by_id("2").meta["valid_meta_field"] == "test2"
Contributor

Just to make it slightly stricter, how about testing the whole dictionary?

Suggested change
- assert not "invalid_meta_field" in document_store.get_document_by_id("1").meta
- assert document_store.get_document_by_id("1").meta["valid_meta_field"] == "test1"
- assert not "invalid_meta_field" in document_store.get_document_by_id("2").meta
- assert document_store.get_document_by_id("2").meta["valid_meta_field"] == "test2"
+ assert document_store.get_document_by_id("1").meta == {"valid_meta_field": "test1"}
+ assert document_store.get_document_by_id("2").meta == {"valid_meta_field": "test2"}
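
A tiny illustration of why comparing the whole dictionary is stricter. The `meta` value below is a made-up result, not something the document store returned.

```python
# Hypothetical result where an unexpected extra key slipped through.
meta = {"valid_meta_field": "test1", "unexpected_field": "oops"}

# Key-by-key checks pass because they only look at the keys they name.
keywise_ok = "invalid_meta_field" not in meta and meta["valid_meta_field"] == "test1"

# Comparing the whole dictionary fails, exposing the extra key.
whole_ok = meta == {"valid_meta_field": "test1"}

print(keywise_ok, whole_ok)  # True False
```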

@anakin87
Member Author

> See if SQLAlchemy offers any hook to deal with unsupported types, like json and Pydantic do.

@ZanSara Generally, SQLAlchemy prefers to leave these checks to the DB, but I found simple validators!
So now the validation is defined in MetaDocumentORM. I find this solution simpler and cleaner 🙂

Maybe the error handling and logging can be improved, although I think it's not so bad as it is now.

ERROR - haystack.document_stores.sql - Document 4 - Discarded metadata 'strange_meta_2', since it has invalid type: tuple.
SQLDocumentStore can accept and cast to string only the following types: str, int, float, bool, bytes, bytearray, NoneType
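
For readers unfamiliar with SQLAlchemy's simple validators, here is a minimal sketch of the approach, assuming SQLAlchemy 1.4 or later. `MetaSketchORM`, its table and column definitions, and `build_meta_orms` are hypothetical stand-ins, not the merged code; only `validates` and `declarative_base` are the real SQLAlchemy names.

```python
import logging

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, validates

logger = logging.getLogger(__name__)
Base = declarative_base()


class MetaSketchORM(Base):
    """Hypothetical stand-in for MetaDocumentORM; table and columns are assumptions."""

    __tablename__ = "meta_sketch"

    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    value = Column(String(1000))

    valid_metadata_types = (str, int, float, bool, bytes, bytearray, type(None))

    @validates("value")
    def validate_value(self, key, value):
        # Called whenever "value" is assigned, including in the constructor.
        if not isinstance(value, self.valid_metadata_types):
            raise TypeError(f"invalid type: {type(value).__name__}")
        return value


def build_meta_orms(doc_id, meta):
    """Build ORM objects for valid entries and log an error for the rest."""
    orms = []
    for name, value in meta.items():
        try:
            orms.append(MetaSketchORM(name=name, value=value))
        except TypeError as err:
            logger.error("Document %s - Discarded metadata '%s': %s", doc_id, name, err)
    return orms


kept = build_meta_orms("4", {"valid_meta": "test4", "strange_meta_2": ("a", "b")})
print([orm.name for orm in kept])
```

Because the validator runs on attribute assignment, the exception is raised when the ORM object is created, which is what makes a try-catch at that point workable.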

I also improved the test...

@anakin87 anakin87 marked this pull request as ready for review July 22, 2022 15:25
Contributor

@ZanSara ZanSara left a comment

Nice work, good to go! The validation hooks for SQLAlchemy are really neat. I agree the database should always have the last word, but a bit of proactive validation won't hurt and spares the users a scary error, so I really like it.

Thank you very much @anakin87 😊

@ZanSara ZanSara merged commit 7dcef68 into deepset-ai:master Jul 25, 2022
@anakin87 anakin87 deleted the handle_nostring_metadata_sqlstore branch July 25, 2022 13:11
Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* modify notebook

* skip invalid metadata

* Update Documentation & Code Style

* fix nonetype

* fix nonetype

* drop nonetype from valid types

* drop nonetype from valid types

* fix

* Update sql.py

* sqlalchemy validation

* removed newlines

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>