SQL based Datastores fail when document metadata has a list #2792
@tstadel @ZanSara I too have encountered this error in the past. I see two main alternative ways to address this issue (in the document store implementation):
What do you think? Any other ideas or advice?
After reading the SQL document store implementation (following @anakin87's post), I think it's possible to address this with a few alterations to the way document meta is stored:
It could be handled entirely inside the SQL document store implementation, with no code or behavior changes outside of it.
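One such alteration could be sketched as follows: before persisting, the store would serialize any list (or dict) meta value to a JSON string so it fits the existing string column, and deserialize it again on read. This is only a minimal sketch, not the actual haystack implementation; the helper names are hypothetical.

```python
import json


def serialize_meta(meta: dict) -> dict:
    """Encode list/dict values as JSON strings so they fit a string column."""
    out = {}
    for key, value in meta.items():
        if isinstance(value, (list, dict)):
            out[key] = json.dumps(value)
        else:
            out[key] = value
    return out


def deserialize_meta(meta: dict) -> dict:
    """Best-effort decode: turn JSON-looking strings back into Python objects."""
    out = {}
    for key, value in meta.items():
        if isinstance(value, str) and value[:1] in ("[", "{"):
            try:
                out[key] = json.loads(value)
            except ValueError:
                out[key] = value  # not actually JSON, keep the raw string
        else:
            out[key] = value
    return out
```

The round trip is lossless for list and dict values, at the cost of a heuristic on read (a user-supplied string that happens to start with `[` or `{` would also be decoded).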
As we can see in haystack/haystack/document_stores/sql.py (lines 93 to 97 at baf5ef8), meta is currently stored in a separate table where both the key and the value are plain strings. As @danielbichuetti suggests, storing the whole meta as JSON would provide more flexibility. After these reflections, my previous alternative proposals could be a good compromise.
@tstadel @ZanSara what do you think? What is the best way to deal with this issue?
@anakin87 Indeed, the DB structure will need to change, but the filtering mechanisms will keep working correctly (after some changes). Postgres, MySQL, MariaDB, and SQL Server all allow complex queries ("filtering") inside the JSON data type.

The main issue is compatibility with already-existing databases, so this would count as a breaking change. Usually, when a DB structure changes, a Python migration script is provided; new users would simply let haystack fix the DB for them. At some point such a script will have to be written anyway — it's almost impossible to avoid improvements forever out of a "don't touch the DB structure" culture. That way, haystack won't have to discard the metadata or fall back to a simple string type, where filtering would be broken without custom code harder to write than the JSON change itself. Despite being less time-consuming to implement, I think that kind of behavior would lower users' expectations of the framework — although, of course, it would still be an improvement over the current implementation.

Furthermore, SQLite requires some custom string-handling code for JSON filtering. Since the major SQL databases could get an improved implementation while SQLite won't fit, how about splitting this datastore in the future?
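To make the "filtering inside the JSON data type" point concrete, here is a small sketch using the stdlib `sqlite3` module as a stand-in for the document store's backing table. It assumes an SQLite build with the JSON1 functions enabled (standard in recent Python distributions); the table layout is invented for illustration and is not haystack's actual schema. Postgres would additionally offer operators such as `@>` containment that SQLite lacks.

```python
import json
import sqlite3

# In-memory DB standing in for the document store's meta storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, meta TEXT)")

docs = [
    ("d1", {"authors": ["alice", "bob"], "year": 2021}),
    ("d2", {"authors": ["carol"], "year": 2022}),
]
conn.executemany(
    "INSERT INTO document VALUES (?, ?)",
    [(doc_id, json.dumps(meta)) for doc_id, meta in docs],
)

# Filter on a scalar field nested inside the JSON blob.
rows = conn.execute(
    "SELECT id FROM document WHERE json_extract(meta, '$.year') = 2022"
).fetchall()
print(rows)
```

So even with meta collapsed into a single JSON column, equality filters on individual fields remain expressible in plain SQL.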
Just my 2c here: some of the metadata generated by these parsers is quite expansive, and people will not be interested in using SQLite for this. In a separate project, I use the likes of BigQuery and Spark for this kind of query.
@sridhar Indeed — my thinking was that people who use SQLite are commonly running small tests, where more complex metadata processing wouldn't be needed. As for performance, SQLite is strongly discouraged for any mainstream NLP scenario (as opposed to tests or small-scale use) where the workload exceeds even a modest profile.
Yes — the main idea (filtering metadata) is already in the code. With JSON, it would still be easy to exclude fields in the hierarchy that aren't needed and keep the ones of interest for NLP. #2809 is somewhat related to this subject. Adding JSON data type support to the data store would allow storing lots of other information that could be of interest; the user would decide when building their pipeline.

I think that for now, a type check as a safety measure would be a good starting point: the error is simply bypassed, at the cost of losing some data. Maybe a progressive change, so current users can still use this document store in such scenarios, but knowing they will lose that metadata. Then the split of the document stores (SQLite's capabilities are much more limited than those of the current mainstream databases), which would technically be a breaking change, since the document store name would change and users would have to update their code. And finally, improving the class to handle any kind of object in metadata, keeping the current features while adding support for complex objects.
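The "type check as a safety measure" starting point could be sketched like this: filter out meta values the SQL store cannot persist and log what is being dropped, so the write succeeds instead of raising. The function name and allowed-type list are assumptions for illustration, not haystack API.

```python
import logging

logger = logging.getLogger(__name__)

# Types that a plain string/number column can hold without serialization.
ALLOWED_TYPES = (str, int, float, bool)


def sanitize_meta(meta: dict) -> dict:
    """Drop meta values the SQL store cannot persist, logging what is lost."""
    clean = {}
    for key, value in meta.items():
        if value is None or isinstance(value, ALLOWED_TYPES):
            clean[key] = value
        else:
            logger.warning(
                "Dropping meta field %r of unsupported type %s",
                key,
                type(value).__name__,
            )
    return clean
```

This matches the trade-off described above: the error is bypassed, but list-valued fields are silently lost (apart from the warning).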
Hi @anakin87, thanks again for opening a PR for this issue! Is it safe to close this issue now?
Hey @sjrl, thanks! |
Hey, @anakin87 I see what you are saying. It sounds like there is still more to be done on this topic in the future. I think the best course of action here would be to open a new issue with the details of what should be done next and then close this issue. Otherwise this issue could get stale or confusing to new people reading it. What do you think? |
I want to add to this issue to keep track. A community member ran into this problem while using |
@sjrl I think this issue can be closed, as the problem has actually been solved. In any case, the interesting reflections we've made will remain here.
This issue is easily reproducible in FAISSDataStore and SQLDataStore. When the value of a key in doc.meta is set to a list, document_store.write fails.
This happens when I'm using TikaConverter to convert a directory of files, and some of them have lists in metadata.
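The failure mode can be reproduced without haystack at all, using the stdlib `sqlite3` module directly: binding a raw Python list to a string column is rejected by the driver, which is what surfaces as the sqlalchemy error referenced below. The table layout here is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (name TEXT, value TEXT)")

meta = {"authors": ["alice", "bob"]}  # list value, as Tika often produces

try:
    # The SQL layer effectively tries to bind the raw Python list to a
    # string column; the driver rejects the unsupported type.
    conn.execute("INSERT INTO meta VALUES (?, ?)", ("authors", meta["authors"]))
    write_failed = False
except sqlite3.Error as exc:
    write_failed = True
    print(f"write failed: {exc}")
```

(The exact exception class varies across Python versions, so the sketch catches the common base `sqlite3.Error`.)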
Error message
(Background on this error at: https://sqlalche.me/e/14/rvf5)
Expected behavior
The workaround is to simply set the particular meta variable to "ignore" in the TikaConverter.convert function.
There should be a knob to ignore these metadata fields automatically.
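Until such a knob exists, one user-side workaround that keeps the information (rather than dropping the field) is to flatten list-valued meta into a single string before writing the documents. A minimal sketch — the function name is ours, not a haystack API:

```python
def flatten_list_meta(meta: dict, sep: str = ", ") -> dict:
    """Join list values into one string so a SQL-based store can persist them."""
    return {
        key: sep.join(map(str, value)) if isinstance(value, list) else value
        for key, value in meta.items()
    }
```

Applied to each `doc.meta` before `write_documents`, this avoids the binding error at the cost of losing the list structure (filtering then sees one concatenated string).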
FAQ Check
Yes
System: