
Choose correct similarity fns during benchmark runs & re-run benchmarks #773

Merged
merged 19 commits into master from rebenchmark on Feb 3, 2021

Conversation

brandenchan (Contributor) commented Jan 26, 2021

This PR ensures that the right similarity functions are chosen during benchmarking (i.e. dot_product for DPR and cosine for ES retrieval). This solves #653. This PR also includes a rerun of the retriever query benchmarks.
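
As a rough illustration of what this means in the benchmark setup (not the exact diff of this PR), the similarity function is picked based on the retriever; the similarity keyword argument matches what Haystack's document stores expose, but the helper below is just for this sketch:

# Illustrative sketch only: pick the similarity function that matches the retriever.
# DPR embeddings are trained for dot product, while the Elasticsearch embedding
# retrieval in these benchmarks uses cosine.
def choose_similarity(retriever_type: str) -> str:
    return "dot_product" if retriever_type == "dpr" else "cosine"

# e.g. at benchmark setup time (document store classes and their similarity
# argument as in Haystack; the exact wiring in the benchmark scripts may differ):
# doc_store = ElasticsearchDocumentStore(similarity=choose_similarity("elastic"))
# doc_store = FAISSDocumentStore(similarity=choose_similarity("dpr"))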

TODO:

  • 0.84 instead of 84 bug
  • Run benchmarks on 0.7.0 @tholor
  • Specify cosine / dot product on website
  • Update benchmarks page on website
  • Post results to social media

brandenchan self-assigned this on Jan 26, 2021

brandenchan (Contributor, Author) commented:

Reran the retriever query benchmarks and compared them to the numbers reported for Haystack 0.7.0. MAP numbers look a bit better than before across all settings (perhaps because of the filtering of misaligned samples, #774).

DPR + FAISS HNSW speed has improved significantly

# old
n_docs - queries per second
500k - 3.3168707580865915
100k - 12.84692505158515
10k -   31.34417509568776

# new
n_docs - queries per second
500k - 35.5682201186989
100k - 37.19018429167806
10k   - 39.10424804647853

However, ES + BM25 speed has dropped.

# new
n_docs - queries per second
500k - 64.52084335269566
100k - 96.5938217421237
10k   - 121.87842069254714


# old 
n_docs - queries per second
500k - 91.38510941614904
100k - 162.59167924109505
10k   - 248.9647289083211

Note that ES + DPR speed is about the same as before.

# new
n_docs - queries per second
500k - 1.4722693734451493
100k - 6.123145993917852
10k   - 21.642577719925317

# old
n_docs - queries per second
500k - 1.45036114184423
100k - 6.234155953220104
10k   - 24.796429587106445

@tholor Is this expected? Any ideas for what accounts for the drop in speed for BM25?

tholor (Member) commented Jan 27, 2021

DPR + FAISS HNSW speed has improved significantly

Nice, this could be related to @lalitpagaria's recent changes to the SQL queries 🤔

@tholor Is this expected? Any ideas for what accounts for the drop in speed for BM25?

No, nothing obvious comes to mind. @tanaysoni, any idea?

lalitpagaria (Contributor) commented Jan 28, 2021

Wow! A 10x increase at the larger document counts. At this rate it will surpass ES performance once there are more than 1M documents.

I think the changes made by @tanaysoni and me are what improved performance.

How about having benchmarks for the SQL document store as well (e.g. with SQLite, MySQL, and Postgres)?

Regarding ES performance: how are we running ES? As a single node or in cluster mode? How much heap is assigned to it? Is swapping enabled or disabled?
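
For checking this, something like the following could be used (a sketch assuming a local instance on localhost:9200; the endpoints are standard Elasticsearch APIs):

# Sketch: inspect the Elasticsearch setup used for benchmarking.
# Assumes a local instance on localhost:9200.
import requests

# node count, roles, and heap size
print(requests.get("http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,ram.max").text)

# whether memory locking (swap prevention via bootstrap.memory_lock) is active
print(requests.get("http://localhost:9200/_nodes?filter_path=**.mlockall").text)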

Just a suggestion: could we also use Cassandra here? ScyllaDB, a very high-performing open-source Cassandra-compatible store written in C++, could be used.

lalitpagaria mentioned this pull request on Jan 28, 2021
tholor changed the title from "Choose correct similarity fns during benchmark runs" to "Choose correct similarity fns during benchmark runs & re-run benchmarks" on Feb 1, 2021
tholor (Member) commented Feb 1, 2021

Just reran all benchmarks on the EC2 image that we previously used + Elasticsearch 7.9.2.
From what I can see, there is a slight drop in BM25 speed when moving from 7.9.2 to 7.10. The remaining difference from your runs is still unclear to me, @brandenchan. Hypotheses: i) differences in the EC2 image (e.g. Ubuntu 20 vs 18, CUDA version, versions of other dependencies, ...) or ii) warmup / caching in Elasticsearch (it's possible there are effects between subsequent runs that cause different numbers, e.g. if you run 500k in isolation vs. after the 100k runs 🤔).
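
One way to probe hypothesis ii) would be to clear Elasticsearch's caches (or fire a fixed warmup query set) between configurations; a sketch assuming the official elasticsearch Python client and an index named "document" (the index name is illustrative):

# Sketch: start each benchmark run from a cold cache so warmup effects don't
# leak between the 10k / 100k / 500k configurations.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local single-node instance

# drop query, request, and fielddata caches for the benchmark index
es.indices.clear_cache(index="document")  # index name is illustrative

# alternatively: run a fixed set of warmup queries before timing, so every
# configuration is measured in the same warmed-up state.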

tholor (Member) commented Feb 1, 2021

@brandenchan if the numbers are looking good to you, can you please update the markdown files for the benchmarks on the website so that we have the correct numbers for each version in the dropdown?

brandenchan (Contributor, Author) commented:

I had a look through the newly reported numbers and compared them to the benchmark for 0.6.0 (0.7.0 is the same).

Retriever query speed
=====================
DPR/Elastic      about 30% slower
BM25/Elastic     significant negative divergence at 10k, 10% divergence for the rest
DPR/FAISS flat   generally faster
DPR/FAISS HNSW   scales significantly better

Retriever Performance
=====================
DPR/Elastic      index speed 5% slower, query speed 30% slower, MAP 4.5 percentage points better
BM25/Elastic     no significant change
DPR/FAISS flat   index speed 10% slower, query speed 50% faster, MAP about the same
DPR/FAISS HNSW   index speed about 10% slower, query speed 200% faster, MAP about the same

Retriever MAP
=============
DPR/Elastic      improvement across the board
BM25/Elastic     no significant change
DPR/FAISS flat   no significant change
DPR/FAISS HNSW   no significant change

Reader speed
============
All models between 5% and 50% faster
Very slight degradation in F1

The speed of DPR/Elastic seems to have dropped. One factor could be the switch from cosine to dot product similarity; it may be that Elasticsearch's dot product implementation is slower. On the flip side, its retrieval performance (MAP) is noticeably better.
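
For context (a toy illustration, not Haystack code): cosine similarity is just a dot product over L2-normalized vectors, so switching the similarity function changes both the scores and, potentially, the per-query work Elasticsearch has to do:

# Toy example of the two similarity functions discussed above.
import numpy as np

q = np.array([0.2, 1.5, -0.3])   # query embedding (illustrative values)
d = np.array([1.0, 0.5, 0.1])    # document embedding (illustrative values)

dot = float(q @ d)
cosine = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

print(f"dot product: {dot:.3f}, cosine: {cosine:.3f}")
# Rankings can differ whenever embedding norms vary, which is why DPR
# (trained with dot product) should be benchmarked with dot product.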

Reader speed improvements may be due to the implementation of fast tokenizers.

Not totally sure where the ES/BM25 speed instability is coming from. It could be variability between runs, or the building up and tearing down of document stores may be interfering with the speed measurements.

brandenchan (Contributor, Author) commented:

In the future, we should do more to control the environment in which we run benchmarks (see the sketch after this list). We should:

  • always run using pip install -e . --upgrade
  • pin the versions of PyTorch / Transformers
  • pin Elasticsearch to 7.9.2
  • ensure we're using an Ubuntu 18 image on EC2
  • run benchmarks from a Docker container
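
A rough sketch of how the environment could be recorded alongside each run, complementing the version pinning above (library names are just the ones mentioned in this thread; file and field names are illustrative):

# Sketch: store the environment next to each benchmark result so runs stay comparable.
import json
import platform

import torch
import transformers

environment = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,          # None on CPU-only builds
    "transformers": transformers.__version__,
}

with open("benchmark_environment.json", "w") as f:
    json.dump(environment, f, indent=2)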

brandenchan merged commit f3a3b73 into master on Feb 3, 2021
brandenchan deleted the rebenchmark branch on February 3, 2021 10:45