
Using column names instead of ORM to get all documents #620

Merged: 14 commits, Jan 6, 2021

Conversation

@lalitpagaria (Contributor)

To improve #601.

Generally, ORM objects kept in memory cause performance issues, so using the column names directly improves memory usage and performance. Refer to StackOverflow.
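For illustration, a minimal self-contained SQLAlchemy sketch of the difference (the model below is a simplified stand-in, not Haystack's actual DocumentORM):

from sqlalchemy import Column, String, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class DocumentORM(Base):  # simplified stand-in for Haystack's model
    __tablename__ = "document"
    id = Column(String(100), primary_key=True)
    text = Column(Text)
    index = Column(String(100))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# ORM query: every row is materialized as a tracked DocumentORM instance
# in the session's identity map, which costs memory and CPU per row.
docs = session.query(DocumentORM).filter_by(index="document").all()

# Column query: every row comes back as a lightweight named tuple; no ORM
# instance is constructed and no unit-of-work bookkeeping happens.
rows = session.query(DocumentORM.id, DocumentORM.text).filter_by(index="document").all()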

@lalitpagaria (Contributor, Author)

@tholor @tanaysoni please review

@vinchg can you please test these changes on your large dataset?

@lalitpagaria (Contributor, Author)

@tholor and @tanaysoni could you please take a look and see whether the approach is fine?

@tanaysoni (Contributor)

Hi @lalitpagaria, thank you for the PR! Do you have any benchmarks for how much performance improvement we get with the new changes?

@lalitpagaria (Contributor, Author)

I tried it locally, but I don't have enough data to stress the system.

I think @vinchg can try the changes.

In theory, these changes will reduce the memory footprint and improve performance.

@tholor (Member) commented Nov 27, 2020

Tried to run the automated benchmarks, but I believe they are currently not working on external forks as secrets are not passed.

@lalitpagaria (Contributor, Author)

Could you please pull these changes (they are small) into the main repo and run the benchmarks?

@vinchg commented Nov 27, 2020

Sorry for the delay, I'll test today with the changes.

On the diff hunk:

MetaORM.value
).filter(MetaORM.document_id.in_(documents_map.keys()))

for row in meta_query.all():

@vinchg commented:
The first query (line 120) executes quickly, as it should when querying just by index (< 1 sec). Memory usage seems normal / as expected.

The second call to get metadata is much slower and errors eventually:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT meta.document_id AS meta_document_id, meta.name AS meta_name, meta.value AS meta_value
FROM meta
WHERE meta.document_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,.....?)

It seems odd that 2 separate queries would be required to get the same fields in a document.

@lalitpagaria (Contributor, Author) replied:

> The second call to get metadata is much slower and errors eventually:

To fix it, I have added an index on document_id and limited the number of host variable parameters passed to the SQL queries (999 for SQLite < 3.32 and 32K for >= 3.32).

> It seems odd that 2 separate queries would be required to get the same fields in a document.

This is done to prevent duplicating the very long text field in memory. Each document can have multiple meta entries, and it is better not to keep a duplicate copy of the text for each meta, hence the split into two queries.
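A minimal sketch of that chunking (illustrative only, not the PR's exact code; session, MetaORM, and documents_map are as in the diff above, 999 stands in for the version-dependent limit, and attaching the rows to documents is shown schematically):

from itertools import islice

def chunked(ids, size=999):
    # Yield successive lists of at most `size` ids so the IN (...) clause
    # never exceeds SQLite's host-variable limit.
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

for id_batch in chunked(documents_map.keys()):
    meta_query = session.query(
        MetaORM.document_id, MetaORM.name, MetaORM.value
    ).filter(MetaORM.document_id.in_(id_batch))
    for row in meta_query.all():
        # attach each meta row to its (already fetched) document
        documents_map[row.document_id].meta[row.name] = row.value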

@lalitpagaria (Contributor, Author) replied:

Could you please test with the latest changes, where I have fixed the issue you reported?
It seems you have a good amount of data to benchmark with.

@lalitpagaria (Contributor, Author) replied:

@vinchg if you get a chance, could you please test the latest changes?

@lalitpagaria (Contributor, Author)

@tholor could you please pull these changes into a branch on the main repo and run the benchmarks?

@lalitpagaria (Contributor, Author)

@tholor Could you please run the benchmarks locally to test the SQL and FAISS document stores?

@tholor (Member) commented Dec 3, 2020

Just ran the benchmark:

Indexing

retriever | doc_store  | n_docs | indexing_time | docs_per_second | date_time
dpr       | faiss_flat | 10000  | 88.934        | 112.443         | 2020-12-03 07:48:54.486046

Querying

retriever | doc_store  | n_docs | n_queries | retrieve_time | queries_per_second | seconds_per_query | recall | map   | top_k | date_time
dpr       | faiss_flat | 10000  | 100       | 3.59963       | 27.7806            | 0.0359963         | 0.96   | 0.905 | 10    | 2020-12-03 07:49:52.590119

In comparison, the latest benchmarks on master (#652):
=> indexing: 105.15 docs/sec
=> querying: 22.47 queries/sec

So it seems that this PR does improve the speed 🚀 (although there are always slight fluctuations between runs).

@lalitpagaria (Contributor, Author)

Slight fluctuations might be due to warmup from a previous run, or to stale connections or data.

Hopefully it will stay the same for each fresh run.

@tholor (Member) commented Dec 3, 2020

@lalitpagaria While I think the changes in this PR are helpful, I think they won't resolve the underlying problem discussed in #601. We still pull all documents into memory, then create all embeddings, and finally write (in batches) to SQL/FAISS.

Are you still planning to add a batch mode for update_embeddings(), so that we do the above steps for e.g. 1000 docs at a time but never have to keep all of them in memory at once? If not, let us know and we'll take over.
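For illustration, such a batch mode could look roughly like this (a hypothetical sketch; the helper names and the embed_passages call are placeholders, not a settled Haystack API):

def update_embeddings_in_batches(document_store, retriever, batch_size=1000):
    # Keep only `batch_size` documents (and their embeddings) in memory
    # at any one time instead of pulling the whole corpus at once.
    batch = []
    for doc in document_store.get_all_documents():
        batch.append(doc)
        if len(batch) == batch_size:
            _embed_and_write(document_store, retriever, batch)
            batch = []
    if batch:  # flush the final partial batch
        _embed_and_write(document_store, retriever, batch)

def _embed_and_write(document_store, retriever, docs):
    embeddings = retriever.embed_passages(docs)  # placeholder embedding call
    for doc, emb in zip(docs, embeddings):
        doc.embedding = emb
    document_store.write_documents(docs)  # persist only this batch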

@lalitpagaria (Contributor, Author)

@tholor we can create a separate PR for the update_embeddings() batching.

I would like to work on it, but unfortunately I will not have time for the next 1-2 weeks.

@tholor (Member) commented Dec 3, 2020

Ok, sure - just wanted to clarify and avoid double work. We will tackle the batch mode then from our side.

@lalitpagaria (Contributor, Author)

If all is fine, can you please merge this? The longer the PR stays open, the more merge conflicts will come up.

@tholor (Member) commented Dec 5, 2020

There's still a failing test after your latest commit. Can you please fix this first?

@lalitpagaria (Contributor, Author) commented Dec 5, 2020

> There's still a failing test after your latest commit. Can you please fix this first?

Not sure why it's broken after merging with the latest master; checking it. The same test is also failing with the same error in #639.

I also ran it a couple of times locally and did not observe any test failure. I compared dependency versions between the last successful build https://github.com/deepset-ai/haystack/runs/1497488718?check_suite_focus=true and the failing build where my changes are not present https://github.com/deepset-ai/haystack/runs/1497488718?check_suite_focus=true

@lalitpagaria (Contributor, Author)

@tanaysoni I had to make the following changes to fix the issue on the latest master (specifically this commit 8e52b48):

  • Not sure, but I think pytest on CI runs tests in parallel, and test_faiss_retrieving interfered with an existing DB file, causing a disk I/O error. So I renamed the DB file.
  • I don't see the pipeline marker attached to any test, and if I run the generator tests together with all the other tests I get an OOM error, hence I selectively reverted your changes in ci.yml.

https://github.com/deepset-ai/haystack/pull/620/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03f

@tanaysoni (Contributor) commented Dec 7, 2020

Hi @lalitpagaria @tholor,

Thank you for all your efforts on this!

I did some performance benchmarks on PostgreSQL with this script:

import random
import string
import time

from haystack.document_store.sql import SQLDocumentStore

document_store = SQLDocumentStore(url="postgres://postgres:@localhost/test")

# Write 10 batches of 1,000 documents with random 10k-character texts
for i in range(10):
    documents = []
    for _ in range(1_000):
        text = ''.join(random.choices(string.ascii_uppercase + string.digits, k=10_000))
        meta_value = f"meta_value_{i}"
        documents.append(
            {"text": text, "meta_field_1": meta_value, "meta_field_2": meta_value, "meta_field_3": meta_value})
    document_store.write_documents(documents)

# Benchmark query performance, iterating the results to force full retrieval
times = []
for i in range(10):
    start = time.time()
    documents = document_store.get_all_documents(filters={"meta_field_1": ["meta_value_1", "meta_value_2"]})
    for doc in documents:
        pass
    times.append(time.time() - start)

print(f"Query Time :: {sum(times) / len(times)}")

On my local machine, the query takes 0.386 seconds on the current master branch, while with this PR it takes 1.923 seconds. It seems the queries are now significantly slower. Can you verify whether the benchmark makes sense and these results are reproducible?

@tholor (Member) commented Dec 7, 2020

The benchmarks that I was running did not include any filters for metadata, so I think your benchmarks show that the PR creates a bottleneck in that functionality.

@tanaysoni (Contributor) commented Dec 7, 2020

The results are similar with or without the metadata filtering. I suspect the Haystack query benchmark may not give a good representation of this case, as only 10 documents get queried per the top_k value.

@lalitpagaria (Contributor, Author)

@tanaysoni Thanks for the benchmark script. I think it would be good to commit such scripts in a separate folder.
The numbers this PR is showing are very concerning. Allow me some time; I will debug the bottleneck with the help of your benchmark script. Until then, let's hold this PR :)

Also, what is your hunch about these slower numbers in this PR? Where do you feel the bottleneck could be?

@tanaysoni (Contributor)

Hi @lalitpagaria, I suspect the slow performance could be due to the additional SQL query to fetch the metadata after the documents are retrieved; the database join query on the current master is possibly faster.
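For reference, the contrast between the two approaches looks roughly like this (an illustrative sketch using the DocumentORM/MetaORM names from the diff, not the exact code of either branch):

# Join-style (master): one round trip returns documents and meta together,
# but the potentially very long text column is repeated once per meta row.
joined = (
    session.query(DocumentORM.id, DocumentORM.text, MetaORM.name, MetaORM.value)
    .join(MetaORM, MetaORM.document_id == DocumentORM.id)
    .all()
)

# Two-query style (this PR): an extra round trip, but each document's text
# is transferred and held in memory only once.
docs = session.query(DocumentORM.id, DocumentORM.text).all()
metas = (
    session.query(MetaORM.document_id, MetaORM.name, MetaORM.value)
    .filter(MetaORM.document_id.in_([d.id for d in docs]))
    .all()
)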

@lalitpagaria (Contributor, Author)

@tanaysoni Thank you for your script. Today I got a chance to work on this PR again.
I rebased it onto the latest master and ran the script on both master and this branch, with the following results:

With PR branch = Query Time :: 0.9146072149276734
With latest master = Query Time :: 2.1301419496536256

I used the latest Postgres Docker image for testing and cleaned the old data before each run:

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres

It seems the PR branch improves the query time. Please let me know if we should plan some other benchmark tests.

@lalitpagaria (Contributor, Author)

@tholor and @tanaysoni could you please check this as well? In my view this PR will improve the query time (by 10-15%, if not more). I also performed multiple benchmarks by tweaking Tanay's script and got consistently good results with this PR. I made sure to clean the old data before each test run.

@tholor (Member) commented Dec 28, 2020

Hey @lalitpagaria, thanks for benchmarking this PR again. @tanaysoni is currently on vacation, so it might take a bit until he can review here. I am also off, but will try to run some benchmarks again within the next week.

@tholor (Member) left a review comment:

I did some quick benchmarks locally with @tanaysoni's previous script. Results are looking good:

n_docs | branch | sec_write | sec_query
10k    | PR     | 1.44      | 0.134
10k    | Master | 1.45      | 0.334
100k   | PR     | 15.53     | 1.75
100k   | Master | 15.86     | 3.74

Of course, these benchmarks cover only a small part of the document_store (basically only get_all_documents()). Not sure if / how the performance of other query types has changed.

From my perspective, we are good to merge once the docstring for batch_size is improved (see comment).

One rather unrelated thing I noticed: when calling document_store.delete_all_documents(), it seems that we are not deleting all records in the DB, which impacts the performance of subsequent runs. Are we missing a session.commit() here? This behavior is identical for this PR and master, so I am also fine with tackling it in a separate PR - if needed at all.

Review comment on haystack/document_store/sql.py (outdated, resolved).
@lalitpagaria (Contributor, Author)

> One rather unrelated thing I noticed: when calling document_store.delete_all_documents(), it seems that we are not deleting all records in the DB, which impacts the performance of subsequent runs. Are we missing a session.commit() here? This behavior is identical for this PR and master, so I am also fine with tackling it in a separate PR - if needed at all.

Yes, you are correct about the delete. But calling commit after the delete would cause some performance cost at that point, hence it is better to clearly warn the user about the danger. SQLAlchemy also warns users about this, as follows:

        .. warning:: The :meth:`_query.Query.delete`
           method is a "bulk" operation,
           which bypasses ORM unit-of-work automation in favor of greater
           performance.  **Please read all caveats and warnings below.**

:param synchronize_session: chooses the strategy for the removal of
            matched objects from the session. Valid values are:

            ``False`` - don't synchronize the session. This option is the most
            efficient and is reliable once the session is expired, which
            typically occurs after a commit(), or explicitly using
            expire_all(). Before the expiration, objects may still remain in
            the session which were in fact deleted which can lead to confusing
            results if they are accessed via get() or already loaded
            collections.

            ``'fetch'`` - performs a select query before the delete to find
            objects that are matched by the delete query and need to be
            removed from the session. Matched objects are removed from the
            session.

            ``'evaluate'`` - Evaluate the query's criteria in Python straight
            on the objects in the session. If evaluation of the criteria isn't
            implemented, an error is raised.

            The expression evaluator currently doesn't account for differing
            string collations between the database and Python.
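To make the trade-off concrete, a minimal sketch of the bulk delete plus explicit commit (illustrative only, reusing the DocumentORM name from this document store, not the PR's code):

# Bulk delete bypasses the ORM unit of work for speed
# (synchronize_session=False, per the warning quoted above)...
session.query(DocumentORM).filter_by(index="document").delete(synchronize_session=False)
# ...so an explicit commit is what actually makes the deletion durable and
# keeps stale rows from slowing down subsequent runs.
session.commit()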

@tholor added this to the #7 milestone on Jan 6, 2021
@tholor self-assigned this on Jan 6, 2021
@tholor (Member) left a review comment:

Looking good! Thanks for this great work @lalitpagaria - looking forward to seeing the speed improvements in the next benchmarks that come with our next release!
