
SQL based Datastores fail when document metadata has a list #2792

Closed
sridhar opened this issue Jul 11, 2022 · 11 comments
Labels
Contributions wanted! · good first issue · topic:file_converter · topic:metadata · type:bug

Comments

@sridhar commented Jul 11, 2022

This issue is easily reproducible with FAISSDocumentStore and SQLDocumentStore: when the value of a key in doc.meta is a list, document_store.write_documents fails.
This happens when I use TikaConverter to convert a directory of files, some of which have lists in their metadata.
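
A minimal reproduction sketch, assuming Haystack's v1 API and the default SQLite backend; the meta key and values are illustrative, modeled on the traceback below:

```python
# Hypothetical minimal reproduction; the list-valued meta key mimics what
# TikaConverter extracts from some files.
from haystack import Document
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db")

doc = Document(
    content="some text",
    meta={"xmpMM:History:InstanceID": ["xmp.iid:aaa", "xmp.iid:bbb"]},  # list value
)

# Fails: SQLite cannot bind a Python list as a query parameter
document_store.write_documents([doc])
```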

Error message

File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/faiss.py", line 295, in write_documents
    super(FAISSDocumentStore, self).write_documents(
  File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/sql.py", line 401, in write_documents
    self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3209, in delete
    result = self.session.execute(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1660, in execute
    ) = compile_state_cls.orm_pre_session_exec(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1829, in orm_pre_session_exec
    session._autoflush()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2257, in _autoflush
    util.raise_(e, with_traceback=sys.exc_info()[2])
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2246, in _autoflush
    self.flush()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3383, in flush
    self._flush(objects)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3523, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3483, in _flush
    flush_context.execute()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    _emit_insert_statements(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1238, in _emit_insert_statements
    result = connection._execute_20(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 332, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(sqlite3.InterfaceError) Error binding parameter 2 - probably unsupported type.
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: ('60afe526-9500-420d-8170-0aeff009e205', 'xmpMM:History:InstanceID', ['xmp.iid:f12cf1f4-997f-474e-b7e1-4eb15ac63f4f', 'xmp.iid:d8516db1-2277-c840-b5d8-12ba4225175d'], 'feb937070fb26bb3f3cf9f397a3531a5', 'document')]

(Background on this error at: https://sqlalche.me/e/14/rvf5)

Expected behavior
The workaround is simply to set the particular meta field to be ignored in the TikaConverter.convert function.
There should be a knob to ignore such metadata fields automatically.

FAQ Check
Yes

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Nvidia Titan XP/Intel(R) Xeon(R) CPU E5-2697
  • Haystack version (commit or version number): ba08fc8
  • DocumentStore: FAISS
  • Reader: TransformersReader
  • Retriever: EmbeddingRetriever
@anakin87 (Member)

@tstadel @ZanSara I too have encountered this error in the past.

I see two main alternative ways to address this issue (in the document store implementation):

  • Skip the metadata fields which are lists and alert the user with a warning (a sketch follows below)
  • Translate these specific list fields into some string-based representation and alert the user with a warning

What do you think? Any other ideas or advice?
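
A minimal sketch of the first option, assuming it would live in the SQL document store's write path; the helper name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def _filter_unsupported_meta(meta: dict) -> dict:
    """Keep only meta values that can be bound as SQL parameters."""
    supported = {}
    for key, value in meta.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            supported[key] = value
        else:
            logger.warning(
                "Meta field '%s' has unsupported type %s and will be skipped.",
                key,
                type(value).__name__,
            )
    return supported
```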

@danielbichuetti (Contributor)

After reading the SQL document store implementation (prompted by @anakin87's post), I think it's possible to address this by changing the way document meta is stored:

  1. Store meta as a JSON data type (SQLAlchemy already supports this for Postgres and MySQL, for example)
  2. If the underlying database doesn't support native JSON, there are two options:
    2.1. Raise a warning (fast to implement, but not the best UX, nor a properly designed solution)
    2.2. Store the value as a string (a JSON dump) and, after SQLAlchemy reads it back, convert it to a list for further manipulation.

This could be handled inside the SQL document store implementation without any impact (code or behavior changes) outside of it.
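
A rough sketch of option 1, assuming the per-key meta table is replaced by a single column using SQLAlchemy's generic JSON type (native JSON on Postgres/MySQL, serialized text elsewhere); the class name mirrors the existing ORM, but the schema change itself is hypothetical:

```python
from sqlalchemy import JSON, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DocumentORM(Base):
    __tablename__ = "document"

    id = Column(String(100), primary_key=True)
    content = Column(String, nullable=False)
    # A single JSON column would replace the per-key meta_document rows
    meta = Column(JSON)
```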

@anakin87 (Member)

As we can see here,

class MetaLabelORM(ORMBase):
    __tablename__ = "meta_label"

    name = Column(String(100), index=True)
    value = Column(String(1000), index=True)

currently meta is stored in a table, where both name and value are strings.

As @danielbichuetti suggests, storing the whole meta as JSON would provide more flexibility.
On the other hand, this would radically change the current DB structure, probably also impacting the various filtering mechanisms.

After these reflections, my previous proposals could be a good compromise:

  • Skip the metadata fields which are not strings and alert the user with a warning
  • Translate these specific fields into some string-based representation and alert the user with a warning (a sketch follows below)

@tstadel @ZanSara what do you think? How is it best to deal with this issue?
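
A sketch of the second option, assuming non-string values are serialized with json.dumps on write and restored on read; the helper names are hypothetical, and the round trip is best-effort (a stored string that happens to be valid JSON would also be decoded):

```python
import json
import logging

logger = logging.getLogger(__name__)

def meta_value_to_db(key: str, value):
    """Serialize non-string meta values before they hit the String column."""
    if isinstance(value, (list, dict)):
        logger.warning("Meta field '%s' is not a string; storing its JSON representation.", key)
        return json.dumps(value)
    return value

def meta_value_from_db(value):
    """Best-effort restore of serialized values on read."""
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        return value
```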

@danielbichuetti (Contributor) commented Jul 19, 2022

@anakin87 Indeed, some change to the DB structure will be needed, but the filtering mechanisms would keep working correctly (after some changes): Postgres, MySQL, MariaDB, and SQL Server all allow "filtering" (complex queries) inside the JSON data type. The main issue is compatibility with already existing databases, so this would fall under breaking changes. Usually, when the DB structure changes, a Python migration script can be provided, so users would just let Haystack fix the DB for them. At some point such a script will need to be written anyway; it's almost impossible to avoid improvements forever in the name of a "don't touch the DB structure" culture.
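
For illustration, a sketch of how filtering could keep working on a JSON meta column, assuming the hypothetical DocumentORM model sketched in the earlier comment and SQLAlchemy 1.4's generic JSON comparators (which compile to the native JSON operators on these backends); the connection string and field name are placeholders:

```python
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://user:password@localhost/haystack")

with Session(engine) as session:
    # Filter on a key inside the JSON meta column
    stmt = select(DocumentORM).where(DocumentORM.meta["author"].as_string() == "jane doe")
    docs = session.execute(stmt).scalars().all()
```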

So Haystack wouldn't have to discard the metadata or use a simple string data type, where filtering would be broken without custom code harder than the JSON change. Despite being less time-consuming to implement, I think that kind of behavior would lower users' expectations of the framework. Of course, it would still be an improvement over the current implementation.

Furthermore, for JSON filtering SQLite demands some custom string-handling code. Since the major SQL databases can have their implementation improved while SQLite won't fit, how about splitting this document store in the future?

  1. SQLiteDocumentStore - limited in functionality, e.g. the list issue.
  2. SQLDocumentStore - the improved current class.

@ZanSara added the Contributions wanted! label on Jul 19, 2022
@sridhar (Author) commented Jul 19, 2022

Just my 2c here: some of the metadata generated by these parsers is quite expansive, and people won't be interested in using SQLite for it. In a separate project, I use the likes of BigQuery and Spark for this kind of query.
For now, I think a simple isinstance() check is good enough, and metadata with these list elements can simply be ignored.
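
A minimal sketch of that user-side workaround, dropping list-valued meta fields before writing; docs and document_store are assumed to exist already:

```python
# Drop list-valued meta fields before writing to a SQL-based document store
for doc in docs:
    doc.meta = {k: v for k, v in doc.meta.items() if not isinstance(v, list)}

document_store.write_documents(docs)
```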

@danielbichuetti (Contributor) commented Jul 19, 2022

@sridhar Indeed, my thinking was that people who use SQLite are usually running small tests, where more complex metadata processing wouldn't be needed. Considering its performance problems, SQLite is strongly discouraged for any mainstream NLP scenario (beyond tests or small scale) where the workload exceeds even a modest profile.

> Some of the metadata generated by these parsers is quite expansive

Yes. The main idea for this point (filtering metadata) is already in the code. With JSON, it would still be easy to exclude fields inside the hierarchy that aren't needed and keep the ones of interest for NLP.

#2809 is somewhat related to this subject. Implementing JSON data type support in the document store would allow lots of other information that could be of interest; the user would decide when building their pipeline.

I think that, for now, doing a type check as a safety measure would be a good starting point: the error is simply bypassed, at the cost of losing some data.

Maybe a progressive change, so that current users can keep using this document store in such scenarios while knowing they will lose that metadata. Then, the split of the document stores (SQLite's capabilities are much more limited than those of the current mainstream databases), which would technically be a breaking change, since the document store name would change and users would have to update their code. And finally, improving the class to handle any kind of object in metadata, keeping the current features while adding support for complex objects.

@sjrl (Contributor) commented Jul 26, 2022

Hi @anakin87, thanks again for opening a PR for this issue! Is it safe to close this issue now?

@anakin87 (Member)

Hey @sjrl, thanks!
For the moment, this error is resolved: if the metadata contains types that are not acceptable to the SQLDocumentStore, the specific values are skipped and a non-blocking error message is raised.
In any case, this issue contains interesting considerations about metadata handling, so it's up to you to decide whether to close it or keep it open.

@sjrl (Contributor) commented Jul 26, 2022

Hey @anakin87, I see what you are saying. It sounds like there is still more to be done on this topic in the future. I think the best course of action here would be to open a new issue with the details of what should be done next and then close this one; otherwise, this issue could become stale or confusing to new people reading it. What do you think?

@TuanaCelik (Member)

I want to add to this issue to keep track: a community member ran into this problem while using FAISSDocumentStore with TransformersDocumentClassifier. The problem is that the classification results are added to the metadata as a dictionary within a dictionary, but these SQL-based document stores don't accept that, which means you can't write the classifier's results back to your document store. @ZanSara @tstadel wdyt, is it best to add support for this in TransformersDocumentClassifier or to fix the issue in the DocumentStores?
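
A possible user-side workaround, sketched under the assumption that flattening the nested dictionary into dotted scalar keys is acceptable for the classifier's results; the variable names are illustrative:

```python
def flatten_meta(meta: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into dotted scalar keys, e.g. classification.label."""
    flat = {}
    for key, value in meta.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_meta(value, parent=name, sep=sep))
        else:
            flat[name] = value
    return flat

# Hypothetical usage: flatten classifier output before re-writing the documents
for doc in classified_docs:
    doc.meta = flatten_meta(doc.meta)
document_store.write_documents(classified_docs)
```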

@anakin87 (Member) commented Sep 7, 2022

@sjrl I think this issue can be closed, as the problem has actually been solved. In any case, the interesting reflections we made will remain here.
What @TuanaCelik reported is addressed in another issue.
