
Using column names instead of ORM to get all documents #620

Merged: 14 commits, Jan 6, 2021

Conversation

@lalitpagaria (Contributor)

To improve #601.

Generally, ORM objects kept in memory cause performance issues, so using the column names directly improves memory usage and performance. Refer to StackOverflow.
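For illustration, a minimal self-contained SQLAlchemy sketch of the difference (the model below is a simplified stand-in, not Haystack's actual DocumentORM):

from sqlalchemy import Column, String, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class DocumentORM(Base):  # simplified stand-in for Haystack's model
    __tablename__ = "document"
    id = Column(String(100), primary_key=True)
    text = Column(Text)
    index = Column(String(100))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# ORM query: every row is materialized as a tracked DocumentORM instance
# in the session's identity map, which costs memory and CPU per row.
docs = session.query(DocumentORM).filter_by(index="document").all()

# Column query: every row comes back as a lightweight named tuple; no ORM
# instance is constructed and no unit-of-work bookkeeping happens.
rows = session.query(DocumentORM.id, DocumentORM.text).filter_by(index="document").all()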

@lalitpagaria (Contributor, Author)

@tholor @tanaysoni please review

@vinchg can you please test these changes on your large dataset?

@lalitpagaria (Contributor, Author)

@tholor and @tanaysoni could you please take a look and see whether the approach is fine?

@tanaysoni (Contributor)

Hi @lalitpagaria, thank you for the PR! Do you have any benchmarks for how much performance improvement we get with the new changes?

@lalitpagaria (Contributor, Author)

I tried it locally, but I don't have enough data to stress the system.

I think @vinchg can try the changes.

In theory, these changes will reduce the memory footprint and improve performance.

@tholor (Member) commented Nov 27, 2020

Tried to run the automated benchmarks, but I believe they are currently not working on external forks as secrets are not passed.

@lalitpagaria (Contributor, Author)

Could you please pull these changes (they are small) into the main repo and run the benchmarks?

@vinchg commented Nov 27, 2020

Sorry for the delay, I'll test today with the changes.

On the diff hunk:

MetaORM.value
).filter(MetaORM.document_id.in_(documents_map.keys()))

for row in meta_query.all():

@vinchg commented:
The first query (line 120) executes quickly, as it should when querying just by index (< 1 sec). Memory usage seems normal / as expected.

The second call to get metadata is much slower and errors eventually:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT meta.document_id AS meta_document_id, meta.name AS meta_name, meta.value AS meta_value
FROM meta
WHERE meta.document_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,.....?)

It seems odd that 2 separate queries would be required to get the same fields in a document.

@lalitpagaria (Contributor, Author) replied:

> The second call to get metadata is much slower and errors eventually:

To fix it, I have added an index on document_id and limited the number of host variable parameters passed to the SQL queries (999 for SQLite < 3.32 and 32K for >= 3.32).

> It seems odd that 2 separate queries would be required to get the same fields in a document.

This is done to prevent duplicating the very long text field in memory. Each document can have multiple meta entries, and it is better not to keep a duplicate copy of the text for each meta, hence the split into two queries.
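A minimal sketch of that chunking (illustrative only, not the PR's exact code; session, MetaORM, and documents_map are as in the diff above, 999 stands in for the version-dependent limit, and attaching the rows to documents is shown schematically):

from itertools import islice

def chunked(ids, size=999):
    # Yield successive lists of at most `size` ids so the IN (...) clause
    # never exceeds SQLite's host-variable limit.
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

for id_batch in chunked(documents_map.keys()):
    meta_query = session.query(
        MetaORM.document_id, MetaORM.name, MetaORM.value
    ).filter(MetaORM.document_id.in_(id_batch))
    for row in meta_query.all():
        # attach each meta row to its (already fetched) document
        documents_map[row.document_id].meta[row.name] = row.value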

@lalitpagaria (Contributor, Author) replied:

Could you please test with the latest changes, where I have fixed the issue you reported?
It seems you have a good amount of data to benchmark with.

@lalitpagaria (Contributor, Author) replied:

@vinchg if you get a chance, could you please test the latest changes?

@lalitpagaria (Contributor, Author)

@tholor could you please pull these changes into a branch on the main repo and run the benchmarks?

@lalitpagaria (Contributor, Author)

@tholor Could you please run the benchmarks locally to test the SQL and FAISS document stores?

@tholor (Member) commented Dec 3, 2020

Just ran the benchmark:

Indexing

retriever | doc_store  | n_docs | indexing_time | docs_per_second | date_time
dpr       | faiss_flat | 10000  | 88.934        | 112.443         | 2020-12-03 07:48:54.486046

Querying

retriever | doc_store  | n_docs | n_queries | retrieve_time | queries_per_second | seconds_per_query | recall | map   | top_k | date_time
dpr       | faiss_flat | 10000  | 100       | 3.59963       | 27.7806            | 0.0359963         | 0.96   | 0.905 | 10    | 2020-12-03 07:49:52.590119

In comparison, the latest benchmarks on master (#652):
=> indexing: 105.15 docs/sec
=> querying: 22.47 queries/sec

So it seems that this PR does improve the speed 🚀 (although there are always slight fluctuations between runs).

@lalitpagaria (Contributor, Author)

Slight fluctuations might be due to warmup from a previous run, or to stale connections or data.

Hopefully it will stay the same for each fresh run.

@tholor (Member) commented Dec 3, 2020

@lalitpagaria While I think the changes in this PR are helpful, I think they won't resolve the underlying problem discussed in #601. We still pull all documents into memory, then create all embeddings, and finally write (in batches) to SQL/FAISS.

Are you still planning to add a batch mode for update_embeddings(), so that we do the above steps for e.g. 1000 docs at a time but never have to keep all of them in memory at once? If not, let us know and we'll take over.
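For illustration, such a batch mode could look roughly like this (a hypothetical sketch; the helper names and the embed_passages call are placeholders, not a settled Haystack API):

def update_embeddings_in_batches(document_store, retriever, batch_size=1000):
    # Keep only `batch_size` documents (and their embeddings) in memory
    # at any one time instead of pulling the whole corpus at once.
    batch = []
    for doc in document_store.get_all_documents():
        batch.append(doc)
        if len(batch) == batch_size:
            _embed_and_write(document_store, retriever, batch)
            batch = []
    if batch:  # flush the final partial batch
        _embed_and_write(document_store, retriever, batch)

def _embed_and_write(document_store, retriever, docs):
    embeddings = retriever.embed_passages(docs)  # placeholder embedding call
    for doc, emb in zip(docs, embeddings):
        doc.embedding = emb
    document_store.write_documents(docs)  # persist only this batch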

@lalitpagaria (Contributor, Author)

@tholor we can create a separate PR for the update_embeddings() batching.

I would like to work on it, but unfortunately I will not have time for the next 1-2 weeks.

@tholor (Member) commented Dec 3, 2020

Ok, sure - just wanted to clarify and avoid double work. We will tackle the batch mode then from our side.

@lalitpagaria (Contributor, Author)

If all is fine, can you please merge this? The longer the PR stays open, the more merge conflicts will come up.

@tholor (Member) commented Dec 5, 2020

There's still a failing test after your latest commit. Can you please fix this first?

@lalitpagaria (Contributor, Author) commented Dec 5, 2020

> There's still a failing test after your latest commit. Can you please fix this first?

Not sure why it's broken after merging with the latest master; checking it. The same test is also failing with the same error in #639.

I also ran it a couple of times locally and did not observe any test failure. I compared dependency versions between the last successful build https://github.com/deepset-ai/haystack/runs/1497488718?check_suite_focus=true and the failing build where my changes are not present https://github.com/deepset-ai/haystack/runs/1497488718?check_suite_focus=true

@lalitpagaria (Contributor, Author)

@tanaysoni I had to make the following changes to fix the issue on the latest master (specifically this commit 8e52b48):

  • Not sure, but I think pytest on CI runs tests in parallel, and test_faiss_retrieving interfered with an existing DB file, causing a disk I/O error. So I renamed the DB file.
  • I don't see the pipeline marker attached to any test, and if I run the generator tests together with all the other tests I get an OOM error, hence I selectively reverted your changes in ci.yml.

https://github.com/deepset-ai/haystack/pull/620/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03f

@tanaysoni (Contributor) commented Dec 7, 2020

Hi @lalitpagaria @tholor,

Thank you for all your efforts on this!

I did some performance benchmarks on PostgreSQL with this script:

import random
import string
import time

from haystack.document_store.sql import SQLDocumentStore

document_store = SQLDocumentStore(url="postgres://postgres:@localhost/test")

# Write 10 batches of 1,000 documents with random 10k-character texts
for i in range(10):
    documents = []
    for _ in range(1_000):
        text = ''.join(random.choices(string.ascii_uppercase + string.digits, k=10_000))
        meta_value = f"meta_value_{i}"
        documents.append(
            {"text": text, "meta_field_1": meta_value, "meta_field_2": meta_value, "meta_field_3": meta_value})
    document_store.write_documents(documents)

# Benchmark query performance, iterating the results to force full retrieval
times = []
for i in range(10):
    start = time.time()
    documents = document_store.get_all_documents(filters={"meta_field_1": ["meta_value_1", "meta_value_2"]})
    for doc in documents:
        pass
    times.append(time.time() - start)

print(f"Query Time :: {sum(times) / len(times)}")

On my local machine, the query takes 0.386 seconds on the current master branch, while with this PR it takes 1.923 seconds. It seems the queries are now significantly slower. Can you verify whether the benchmark makes sense and these results are reproducible?

@tholor (Member) commented Dec 7, 2020

The benchmarks that I was running did not include any filters for metadata, so I think your benchmarks show that the PR creates a bottleneck in that functionality.

@tanaysoni (Contributor) commented Dec 7, 2020

The results are similar with or without the metadata filtering. I suspect the Haystack query benchmark may not give a good representation of this case, as only 10 documents get queried per the top_k value.

@lalitpagaria (Contributor, Author)

@tanaysoni Thanks for the benchmark script. I think it would be good to commit such scripts in a separate folder.
The numbers this PR is showing are very concerning. Allow me some time; I will debug the bottleneck with the help of your benchmark script. Until then, let's hold this PR :)

Also, what is your hunch about these slower numbers in this PR? Where do you feel the bottleneck could be?

@tanaysoni (Contributor)

Hi @lalitpagaria, I suspect the slow performance could be due to the additional SQL query to fetch the metadata after the documents are retrieved; the database join query on the current master is possibly faster.
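For reference, the contrast between the two approaches looks roughly like this (an illustrative sketch using the DocumentORM/MetaORM names from the diff, not the exact code of either branch):

# Join-style (master): one round trip returns documents and meta together,
# but the potentially very long text column is repeated once per meta row.
joined = (
    session.query(DocumentORM.id, DocumentORM.text, MetaORM.name, MetaORM.value)
    .join(MetaORM, MetaORM.document_id == DocumentORM.id)
    .all()
)

# Two-query style (this PR): an extra round trip, but each document's text
# is transferred and held in memory only once.
docs = session.query(DocumentORM.id, DocumentORM.text).all()
metas = (
    session.query(MetaORM.document_id, MetaORM.name, MetaORM.value)
    .filter(MetaORM.document_id.in_([d.id for d in docs]))
    .all()
)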

@lalitpagaria (Contributor, Author)

@tanaysoni Thank you for your script. Today I got a chance to work on this PR again.
I rebased it onto the latest master and ran the script on both master and this branch, with the following results:

With PR branch = Query Time :: 0.9146072149276734
With latest master = Query Time :: 2.1301419496536256

I used the latest Postgres Docker image for testing and cleaned the old data before each run:

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres

It seems the PR branch improves the query time. Please let me know if we should plan some other benchmark tests.

@lalitpagaria (Contributor, Author)

@tholor and @tanaysoni could you please check this as well? In my view this PR will improve the query time (by 10-15%, if not more). I also performed multiple benchmarks by tweaking Tanay's script and got consistently good results with this PR. I made sure to clean the old data before each test run.

@tholor (Member) commented Dec 28, 2020

Hey @lalitpagaria, thanks for benchmarking this PR again. @tanaysoni is currently on vacation, so it might take a bit until he can review here. I am also off, but will try to run some benchmarks again within the next week.

@tholor (Member) left a review comment:

I did some quick benchmarks locally with @tanaysoni's previous script. Results are looking good:

n_docs | branch | sec_write | sec_query
10k    | PR     | 1.44      | 0.134
10k    | Master | 1.45      | 0.334
100k   | PR     | 15.53     | 1.75
100k   | Master | 15.86     | 3.74

Of course, these benchmarks cover only a small part of the document_store (basically only get_all_documents()). Not sure if / how the performance of other query types has changed.

From my perspective, we are good to merge once the docstring for batch_size is improved (see comment).

One rather unrelated thing I noticed: when calling document_store.delete_all_documents(), it seems that we are not deleting all records in the DB, which impacts the performance of subsequent runs. Are we missing a session.commit() here? This behavior is identical for this PR and master, so I am also fine with tackling it in a separate PR - if needed at all.

Review comment on haystack/document_store/sql.py (outdated, resolved).
@lalitpagaria (Contributor, Author)

> One rather unrelated thing I noticed: when calling document_store.delete_all_documents(), it seems that we are not deleting all records in the DB, which impacts the performance of subsequent runs. Are we missing a session.commit() here? This behavior is identical for this PR and master, so I am also fine with tackling it in a separate PR - if needed at all.

Yes, you are correct about the delete. But calling commit after the delete would cause some performance cost at that point, hence it is better to clearly warn the user about the danger. SQLAlchemy also warns users about this, as follows:

        .. warning:: The :meth:`_query.Query.delete`
           method is a "bulk" operation,
           which bypasses ORM unit-of-work automation in favor of greater
           performance.  **Please read all caveats and warnings below.**

:param synchronize_session: chooses the strategy for the removal of
            matched objects from the session. Valid values are:

            ``False`` - don't synchronize the session. This option is the most
            efficient and is reliable once the session is expired, which
            typically occurs after a commit(), or explicitly using
            expire_all(). Before the expiration, objects may still remain in
            the session which were in fact deleted which can lead to confusing
            results if they are accessed via get() or already loaded
            collections.

            ``'fetch'`` - performs a select query before the delete to find
            objects that are matched by the delete query and need to be
            removed from the session. Matched objects are removed from the
            session.

            ``'evaluate'`` - Evaluate the query's criteria in Python straight
            on the objects in the session. If evaluation of the criteria isn't
            implemented, an error is raised.

            The expression evaluator currently doesn't account for differing
            string collations between the database and Python.
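To make the trade-off concrete, a minimal sketch of the bulk delete plus explicit commit (illustrative only, reusing the DocumentORM name from this document store, not the PR's code):

# Bulk delete bypasses the ORM unit of work for speed
# (synchronize_session=False, per the warning quoted above)...
session.query(DocumentORM).filter_by(index="document").delete(synchronize_session=False)
# ...so an explicit commit is what actually makes the deletion durable and
# keeps stale rows from slowing down subsequent runs.
session.commit()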

@tholor added this to the #7 milestone on Jan 6, 2021
@tholor self-assigned this on Jan 6, 2021
@tholor (Member) left a review comment:

Looking good! Thanks for this great work @lalitpagaria - looking forward to seeing the speed improvements in the next benchmarks that come with our next release!
