
SQL based Datastores fail when document metadata has a list #2792

Closed
sridhar opened this issue Jul 11, 2022 · 11 comments
Labels
Contributions wanted! · good first issue · topic:file_converter · topic:metadata · type:bug

Comments

@sridhar commented Jul 11, 2022

This issue is easily reproducible with FAISSDocumentStore and SQLDocumentStore: when the value of a key in doc.meta is a list, document_store.write_documents fails.
This happens when I use TikaConverter to convert a directory of files, some of which have lists in their metadata.
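
A minimal reproduction sketch, assuming Haystack's v1 API and the default SQLite backend; the meta key and values are illustrative, modeled on the traceback below:

```python
# Hypothetical minimal reproduction; the list-valued meta key mimics what
# TikaConverter extracts from some files.
from haystack import Document
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db")

doc = Document(
    content="some text",
    meta={"xmpMM:History:InstanceID": ["xmp.iid:aaa", "xmp.iid:bbb"]},  # list value
)

# Fails: SQLite cannot bind a Python list as a query parameter
document_store.write_documents([doc])
```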

Error message

File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/faiss.py", line 295, in write_documents
    super(FAISSDocumentStore, self).write_documents(
  File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/sql.py", line 401, in write_documents
    self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3209, in delete
    result = self.session.execute(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1660, in execute
    ) = compile_state_cls.orm_pre_session_exec(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1829, in orm_pre_session_exec
    session._autoflush()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2257, in _autoflush
    util.raise_(e, with_traceback=sys.exc_info()[2])
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2246, in _autoflush
    self.flush()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3383, in flush
    self._flush(objects)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3523, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3483, in _flush
    flush_context.execute()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    _emit_insert_statements(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1238, in _emit_insert_statements
    result = connection._execute_20(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 332, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(sqlite3.InterfaceError) Error binding parameter 2 - probably unsupported type.
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: ('60afe526-9500-420d-8170-0aeff009e205', 'xmpMM:History:InstanceID', ['xmp.iid:f12cf1f4-997f-474e-b7e1-4eb15ac63f4f', 'xmp.iid:d8516db1-2277-c840-b5d8-12ba4225175d'], 'feb937070fb26bb3f3cf9f397a3531a5', 'document')]

(Background on this error at: https://sqlalche.me/e/14/rvf5)

Expected behavior
The workaround is simply to set the particular meta field to be ignored in the TikaConverter.convert function.
There should be a knob to ignore such metadata fields automatically.

FAQ Check
Yes

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Nvidia Titan XP/Intel(R) Xeon(R) CPU E5-2697
  • Haystack version (commit or version number): ba08fc8
  • DocumentStore: FAISS
  • Reader: TransformersReader
  • Retriever: EmbeddingRetriever
@anakin87 (Member)

@tstadel @ZanSara I too have encountered this error in the past.

I see two main alternative ways to address this issue (in the document store implementation):

  • Skip the metadata fields which are lists and alert the user with a warning (a sketch follows below)
  • Translate these specific list fields into some string-based representation and alert the user with a warning

What do you think? Any other ideas or advice?
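
A minimal sketch of the first option, assuming it would live in the SQL document store's write path; the helper name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def _filter_unsupported_meta(meta: dict) -> dict:
    """Keep only meta values that can be bound as SQL parameters."""
    supported = {}
    for key, value in meta.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            supported[key] = value
        else:
            logger.warning(
                "Meta field '%s' has unsupported type %s and will be skipped.",
                key,
                type(value).__name__,
            )
    return supported
```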

@danielbichuetti (Contributor)

After reading the SQL document store implementation (prompted by @anakin87's post), I think it's possible to address this by changing the way document meta is stored:

  1. Store meta as a JSON data type (SQLAlchemy already supports this for Postgres and MySQL, for example)
  2. If the underlying database doesn't support native JSON, there are two options:
    2.1. Raise a warning (fast to implement, but not the best UX, nor a properly designed solution)
    2.2. Store the value as a string (a JSON dump) and, after SQLAlchemy reads it back, convert it to a list for further manipulation.

This could be handled inside the SQL document store implementation without any impact (code or behavior changes) outside of it.
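
A rough sketch of option 1, assuming the per-key meta table is replaced by a single column using SQLAlchemy's generic JSON type (native JSON on Postgres/MySQL, serialized text elsewhere); the class name mirrors the existing ORM, but the schema change itself is hypothetical:

```python
from sqlalchemy import JSON, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DocumentORM(Base):
    __tablename__ = "document"

    id = Column(String(100), primary_key=True)
    content = Column(String, nullable=False)
    # A single JSON column would replace the per-key meta_document rows
    meta = Column(JSON)
```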

@anakin87 (Member)

As we can see here,

class MetaLabelORM(ORMBase):
    __tablename__ = "meta_label"

    name = Column(String(100), index=True)
    value = Column(String(1000), index=True)

currently meta is stored in a table, where both name and value are strings.

As @danielbichuetti suggests, storing the whole meta as JSON would provide more flexibility.
On the other hand, this would radically change the current DB structure, probably also impacting the various filtering mechanisms.

After these reflections, my previous proposals could be a good compromise:

  • Skip the metadata fields which are not strings and alert the user with a warning
  • Translate these specific fields into some string-based representation and alert the user with a warning (a sketch follows below)

@tstadel @ZanSara what do you think? How is it best to deal with this issue?
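
A sketch of the second option, assuming non-string values are serialized with json.dumps on write and restored on read; the helper names are hypothetical, and the round trip is best-effort (a stored string that happens to be valid JSON would also be decoded):

```python
import json
import logging

logger = logging.getLogger(__name__)

def meta_value_to_db(key: str, value):
    """Serialize non-string meta values before they hit the String column."""
    if isinstance(value, (list, dict)):
        logger.warning("Meta field '%s' is not a string; storing its JSON representation.", key)
        return json.dumps(value)
    return value

def meta_value_from_db(value):
    """Best-effort restore of serialized values on read."""
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        return value
```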

@danielbichuetti (Contributor) commented Jul 19, 2022

@anakin87 Indeed, some change to the DB structure will be needed, but the filtering mechanisms would keep working correctly (after some changes): Postgres, MySQL, MariaDB, and SQL Server all allow "filtering" (complex queries) inside the JSON data type. The main issue is compatibility with already existing databases, so this would fall under breaking changes. Usually, when the DB structure changes, a Python migration script can be provided, so users would just let Haystack fix the DB for them. At some point such a script will need to be written anyway; it's almost impossible to avoid improvements forever in the name of a "don't touch the DB structure" culture.
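
For illustration, a sketch of how filtering could keep working on a JSON meta column, assuming the hypothetical DocumentORM model sketched in the earlier comment and SQLAlchemy 1.4's generic JSON comparators (which compile to the native JSON operators on these backends); the connection string and field name are placeholders:

```python
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://user:password@localhost/haystack")

with Session(engine) as session:
    # Filter on a key inside the JSON meta column
    stmt = select(DocumentORM).where(DocumentORM.meta["author"].as_string() == "jane doe")
    docs = session.execute(stmt).scalars().all()
```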

So Haystack wouldn't have to discard the metadata or use a simple string data type, where filtering would be broken without custom code harder than the JSON change. Despite being less time-consuming to implement, I think that kind of behavior would lower users' expectations of the framework. Of course, it would still be an improvement over the current implementation.

Furthermore, for JSON filtering SQLite demands some custom string-handling code. Since the major SQL databases can have their implementation improved while SQLite won't fit, how about splitting this document store in the future?

  1. SQLiteDocumentStore - limited in functionality, e.g. the list issue.
  2. SQLDocumentStore - the improved current class.

@ZanSara added the Contributions wanted! label on Jul 19, 2022
@sridhar (Author) commented Jul 19, 2022

Just my 2c here: some of the metadata generated by these parsers is quite expansive, and people won't be interested in using SQLite for it. In a separate project, I use the likes of BigQuery and Spark for this kind of query.
For now, I think a simple isinstance() check is good enough, and metadata with these list elements can simply be ignored.
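
A minimal sketch of that user-side workaround, dropping list-valued meta fields before writing; docs and document_store are assumed to exist already:

```python
# Drop list-valued meta fields before writing to a SQL-based document store
for doc in docs:
    doc.meta = {k: v for k, v in doc.meta.items() if not isinstance(v, list)}

document_store.write_documents(docs)
```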

@danielbichuetti (Contributor) commented Jul 19, 2022

@sridhar Indeed, my thinking was that people who use SQLite are usually running small tests, where more complex metadata processing wouldn't be needed. Considering its performance problems, SQLite is strongly discouraged for any mainstream NLP scenario (beyond tests or small scale) where the workload exceeds even a modest profile.

> Some of the metadata generated by these parsers is quite expansive

Yes. The main idea for this point (filtering metadata) is already in the code. With JSON, it would still be easy to exclude fields inside the hierarchy that aren't needed and keep the ones of interest for NLP.

#2809 is somewhat related to this subject. Implementing JSON data type support in the document store would allow lots of other information that could be of interest; the user would decide when building their pipeline.

I think that, for now, doing a type check as a safety measure would be a good starting point: the error is simply bypassed, at the cost of losing some data.

Maybe a progressive change, so that current users can keep using this document store in such scenarios while knowing they will lose that metadata. Then, the split of the document stores (SQLite's capabilities are much more limited than those of the current mainstream databases), which would technically be a breaking change, since the document store name would change and users would have to update their code. And finally, improving the class to handle any kind of object in metadata, keeping the current features while adding support for complex objects.

@sjrl (Contributor) commented Jul 26, 2022

Hi @anakin87, thanks again for opening a PR for this issue! Is it safe to close this issue now?

@anakin87 (Member)

Hey @sjrl, thanks!
For the moment, this error is resolved: if the metadata contains types that are not acceptable to the SQLDocumentStore, the specific values are skipped and a non-blocking error message is raised.
In any case, this issue contains interesting considerations about metadata handling, so it's up to you to decide whether to close it or keep it open.

@sjrl (Contributor) commented Jul 26, 2022

Hey @anakin87, I see what you are saying. It sounds like there is still more to be done on this topic in the future. I think the best course of action here would be to open a new issue with the details of what should be done next and then close this one; otherwise, this issue could become stale or confusing to new people reading it. What do you think?

@TuanaCelik (Member)

I want to add to this issue to keep track: a community member ran into this problem while using FAISSDocumentStore with TransformersDocumentClassifier. The problem is that the classification results are added to the metadata as a dictionary within a dictionary, but these SQL-based document stores don't accept that, which means you can't write the classifier's results back to your document store. @ZanSara @tstadel wdyt, is it best to add support for this in TransformersDocumentClassifier or to fix the issue in the DocumentStores?
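
A possible user-side workaround, sketched under the assumption that flattening the nested dictionary into dotted scalar keys is acceptable for the classifier's results; the variable names are illustrative:

```python
def flatten_meta(meta: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into dotted scalar keys, e.g. classification.label."""
    flat = {}
    for key, value in meta.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_meta(value, parent=name, sep=sep))
        else:
            flat[name] = value
    return flat

# Hypothetical usage: flatten classifier output before re-writing the documents
for doc in classified_docs:
    doc.meta = flatten_meta(doc.meta)
document_store.write_documents(classified_docs)
```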

@anakin87 (Member) commented Sep 7, 2022

@sjrl I think this issue can be closed, as the problem has actually been solved. In any case, the interesting reflections we made will remain here.
What @TuanaCelik reported is addressed in another issue.
