Using column names instead of ORM to get all documents #620
Conversation
@tholor @tanaysoni, please review. @vinchg, can you please test these changes out on your large dataset?
@tholor and @tanaysoni, could you please take a look and see whether this approach is fine?
Hi @lalitpagaria, thank you for the PR! Do you have any benchmarks for how much performance improvement we get with the new changes?
I tried to benchmark locally, but I don't have much data to stress the system. I think @vinchg can try the changes. In theory these changes will reduce the memory footprint and improve performance.
Tried to run the automated benchmarks, but I believe they are currently not working on external forks, as secrets are not passed.
Can you please pull these changes (they are small) into the main repo and run the benchmark?
Sorry for the delay, I'll test today with the changes.
```python
    MetaORM.value
).filter(MetaORM.document_id.in_(documents_map.keys()))

for row in meta_query.all():
```
The first query (line 120) executes quickly as it should if just querying by index (< 1 sec). Memory usage seems to be normal / expected.
The second call to get metadata is much slower and errors eventually:
```
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT meta.document_id AS meta_document_id, meta.name AS meta_name, meta.value AS meta_value
FROM meta
WHERE meta.document_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,.....?)]
```
It seems odd that 2 separate queries would be required to get the same fields in a document.
> The second call to get metadata is much slower and errors eventually:
To fix it, I have added an index on document_id and limited the number of host variable parameters passed to the SQL queries (999 for SQLite < 3.32 and 32K for >= 3.32).
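A minimal sketch of that batching idea, reusing `session`, `MetaORM`, and `documents_map` from the snippet above (the helper and variable names here are illustrative, not the PR's exact code):

```python
from itertools import islice

# SQLite < 3.32 allows at most 999 host parameters per statement;
# newer versions raise the limit to roughly 32K.
BATCH_SIZE = 999

def chunked(ids, size):
    """Yield successive lists of at most `size` elements from `ids`."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk

meta_by_doc = {}
for id_batch in chunked(list(documents_map.keys()), BATCH_SIZE):
    rows = session.query(
        MetaORM.document_id, MetaORM.name, MetaORM.value
    ).filter(MetaORM.document_id.in_(id_batch))
    for document_id, name, value in rows:
        meta_by_doc.setdefault(document_id, {})[name] = value
```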
> It seems odd that 2 separate queries would be required to get the same fields in a document.
It is done to prevent duplicating the very long text field in memory. Each document can have multiple meta entries, and it is better not to keep a duplicate copy of the text for each of them. Hence I have split it into two queries.
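For illustration, here is roughly why a join duplicates data, sketched with the DocumentORM/MetaORM models this PR touches (not the PR's exact code):

```python
# Single joined query: the long `text` value is repeated once per meta row,
# so a document with 10 meta entries carries 10 copies of its text.
joined = session.query(
    DocumentORM.id, DocumentORM.text, MetaORM.name, MetaORM.value
).join(MetaORM, MetaORM.document_id == DocumentORM.id)

# Two queries: each long text is fetched exactly once, and the small
# meta rows are fetched separately and stitched back together in Python.
doc_rows = session.query(DocumentORM.id, DocumentORM.text)
meta_rows = session.query(MetaORM.document_id, MetaORM.name, MetaORM.value)
```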
Can you please test with the latest change, where I have fixed the issue you reported?
It seems you have a good amount of data to benchmark it with.
@vinchg, if you get time by chance, could you please test the latest changes?
@tholor, could you please pull these changes into the main repo branch and run the benchmark?
@tholor Could you please run the benchmark locally to test the SQL and FAISS document stores?
Just ran the benchmark:

Indexing

Querying

In comparison: the latest benchmarks on master (#652). So it seems that this PR does improve the speed 🚀 (although there are always some slight fluctuations between runs).
Slight fluctuations might be due to warmup from a previous run, or a stale connection or stale data. Hopefully it will stay the same for each fresh run.
@lalitpagaria While I think the changes in this PR are helpful, I think they won't resolve the underlying problem discussed in #601. We still pull all documents into memory, then create all embeddings, and finally write (in batches) to SQL/FAISS. Are you still planning to add a batch mode for update_embeddings()?
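As a rough sketch of what such a batch mode could look like (hypothetical helper, not Haystack's actual API; `embed_passages` is assumed to exist on the retriever):

```python
def iter_batches(items, batch_size):
    """Yield successive slices of `items` of length `batch_size`."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def update_embeddings_in_batches(document_store, retriever, batch_size=10_000):
    # Hypothetical: embed and write back one slice at a time so the
    # embeddings never all sit in memory at once. (The documents are
    # still loaded up front here; a full fix would stream them too.)
    docs = document_store.get_all_documents()
    for batch in iter_batches(docs, batch_size):
        embeddings = retriever.embed_passages(batch)  # assumed retriever method
        for doc, emb in zip(batch, embeddings):
            doc.embedding = emb
        document_store.write_documents(batch)
```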
@tholor we can create a separate PR for updating embeddings. I would like to work on it, but unfortunately I will not have time for the next 1-2 weeks.
Ok, sure - just wanted to clarify and avoid double work. We will tackle the batch mode from our side then.
If all is fine, can you please merge this? The longer the PR stays open, the more merge conflicts will come up.
There's still a failing test after your latest commit. Can you please fix this first?
Not sure why it broke after merging with the latest master; checking it. I also ran the tests a couple of times locally and did not observe any failures.
@tanaysoni I had to make the following changes to fix the issue on the latest master (specifically this commit: 8e52b48) -
https://github.com/deepset-ai/haystack/pull/620/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03f
Hi @lalitpagaria @tholor, thank you for all your efforts on this! I did some performance benchmarks on PostgreSQL with this script:

```python
import random
import string
import time

from haystack.document_store.sql import SQLDocumentStore

document_store = SQLDocumentStore(url="postgres://postgres:@localhost/test")

# Write Documents
for i in range(10):
    documents = []
    start = time.time()
    for _ in range(1_000):
        text = ''.join(random.choices(string.ascii_uppercase + string.digits, k=10_000))
        meta_value = f"meta_value_{i}"
        documents.append(
            {"text": text, "meta_field_1": meta_value, "meta_field_2": meta_value, "meta_field_3": meta_value})
    document_store.write_documents(documents)

# Benchmark query performance
times = []
for i in range(10):
    start = time.time()
    documents = document_store.get_all_documents(filters={"meta_field_1": ["meta_value_1", "meta_value_2"]})
    for doc in documents:
        pass
    times.append(time.time() - start)

print(f"Query Time :: {sum(times) / len(times)}")
```

On my local machine, the query takes …
The benchmarks that I was running did not include any filters for metadata, so I think your benchmarks show that the PR creates a bottleneck in that functionality.
The results are similar with or without the metadata filtering. I suspect the Haystack query benchmark may not give a good representation for this case, as only 10 documents get queried as per the …
@tanaysoni Thanks for the benchmark script. I think it would be good to commit such scripts in a dedicated folder. Also, what is your hunch about the slower numbers in this PR? Where do you feel the bottleneck could be?
Hi @lalitpagaria, I suspect the slow performance could be due to the additional SQL query to get the metadata after the documents are retrieved? The database join query on the current master is possibly faster?
- …memory not duplicating document.text
- …fetch meta information
- …with existing files by corrupting DB file
@tanaysoni Thank you for your script. Today I got a chance to work on this PR again.
I used the latest …
It seems the PR branch is improving the query time. Please let me know if we should plan some other benchmark tests.
@tholor and @tanaysoni, could you please check this as well? In my view this PR will improve query time (slightly, by 10-15% if not more). I also performed multiple benchmarks by tweaking Tanay's script, and I got consistently good results with this PR. I made sure to clean old data before each test run.
Hey @lalitpagaria, thanks for benchmarking this PR again. @tanaysoni is currently on vacation, so it might take a bit until he can review here. I am also off, but will try to run some benchmarks again within the next week.
I did some quick benchmarks locally with @tanaysoni's previous script. Results are looking good:

| n_docs | branch | sec_write | sec_query |
|---|---|---|---|
| 10k | PR | 1.44 | 0.134 |
| 10k | Master | 1.45 | 0.334 |
| 100k | PR | 15.53 | 1.75 |
| 100k | Master | 15.86 | 3.74 |
These benchmarks of course cover only a small part of the document_store (basically only get_all_documents()). Not sure if/how the performance of other query types has changed.
From my perspective, we are good to merge once the docstring for batch_size is improved (see comment).
One rather unrelated thing I noticed: when calling document_store.delete_all_documents(), it seems that we are not deleting all records in the DB and therefore impact the performance of subsequent runs. Are we missing a session.commit() here? This behavior is identical for this PR and master, so I am also fine with tackling it in a separate PR - if needed at all.
Yes, you are correct about the delete. But calling commit after the delete will cause a performance hit at that moment, hence it is better to clearly warn the user about the danger. This library also warns users about this as follows -
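For reference, a minimal sketch of what the missing commit could look like, assuming the session and the DocumentORM/MetaORM models discussed above (not the actual Haystack code):

```python
def delete_all_documents(session):
    # Bulk-delete meta rows first so the foreign key on document_id
    # is not violated, then delete the documents themselves.
    session.query(MetaORM).delete(synchronize_session=False)
    session.query(DocumentORM).delete(synchronize_session=False)
    # Without this commit the deletes can stay pending in the open
    # transaction and leave stale rows visible to subsequent runs.
    session.commit()
```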
Looking good! Thanks for this great work @lalitpagaria - looking forward to seeing the speed improvements in the next benchmarks that come with our next release!
To improve #601:
Generally, ORM objects kept in memory cause performance issues.
Hence, querying column names directly improves memory usage and performance.
Refer to StackOverflow.
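A minimal, self-contained sketch of the idea with an illustrative model (SQLAlchemy 1.4+ style, not the PR's exact code); the column-based query on the last line is what this PR switches to:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class DocumentORM(Base):
    __tablename__ = "document"
    id = Column(Integer, primary_key=True)
    text = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# ORM query: materializes a full DocumentORM instance per row, with
# identity-map bookkeeping and change tracking -- extra memory and CPU.
orm_docs = session.query(DocumentORM).all()

# Column query: returns lightweight named tuples containing only the
# requested columns, skipping ORM instance construction entirely.
rows = session.query(DocumentORM.id, DocumentORM.text).all()
```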