Hashing query performance improvements (1.5-2x faster on benchmarks) #114

Merged
merged 19 commits into from
Jul 24, 2020


alexklibisz
Owner

@alexklibisz alexklibisz commented Jul 23, 2020

Rewrote the custom hashing query (MatchHashesAndScoreQuery) in Java.
This doesn't necessarily make it faster, but it makes it less likely that an expensive Scala abstraction gets introduced by accident.
It's also easier to get help from Lucene users.

Made match counting faster by using an array instead of a map.
This works because each counter only deals with the consecutive doc ids in a single segment.
So instead of a Map from doc id to count, you have an array where the index is the doc id and value is the count.
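
A minimal sketch of the array-based counting idea, not the actual code from this PR. The class name `ArrayMatchCounter`, the method `countAll`, and the `postings` input are all hypothetical, and it assumes doc ids within a segment run from 0 to `maxDoc - 1`:

```java
// Hypothetical sketch; names are illustrative and not taken from this PR.
final class ArrayMatchCounter {
    // counts[docId] = number of query hashes matched by that doc in this segment.
    private final int[] counts;

    // maxDoc: number of doc ids in the segment (assumed to be 0 .. maxDoc - 1).
    ArrayMatchCounter(int maxDoc) {
        this.counts = new int[maxDoc];
    }

    // Constant-time increment: no hashing, no boxing, no resizing.
    void increment(int docId) {
        counts[docId]++;
    }

    int get(int docId) {
        return counts[docId];
    }

    // Assumption: postings[i] holds the segment-local doc ids that contain the i-th query hash.
    static ArrayMatchCounter countAll(int maxDoc, int[][] postings) {
        ArrayMatchCounter counter = new ArrayMatchCounter(maxDoc);
        for (int[] docIds : postings) {
            for (int docId : docIds) {
                counter.increment(docId);
            }
        }
        return counter;
    }
}
```

The trade-off is memory: the array is sized by the number of docs in the segment rather than by the number of docs that actually matched.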

Made candidate identification faster using a similar construct.
Since you know the highest possible count is the number of terms, you can use an array to build a histogram of the counts,
then traverse from the end of the array, accumulating as you go, to find the kth largest count.
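
A rough sketch of the histogram approach, again not the code from this PR. `KthLargestCount` and `kthLargest` are hypothetical names, and `maxCount` stands in for the number of query terms:

```java
// Hypothetical sketch; names are illustrative and not taken from this PR.
final class KthLargestCount {
    // Returns the k-th largest value in counts, assuming every value is in [0, maxCount].
    // Returns 0 if counts has fewer than k entries.
    static int kthLargest(int[] counts, int maxCount, int k) {
        // hist[c] = number of docs whose match count is exactly c.
        int[] hist = new int[maxCount + 1];
        for (int c : counts) {
            hist[c]++;
        }
        // Walk from the largest possible count down, accumulating how many docs
        // have a count at least this large, until k docs are covered.
        int seen = 0;
        for (int c = maxCount; c >= 0; c--) {
            seen += hist[c];
            if (seen >= k) {
                return c;
            }
        }
        return 0;
    }
}
```

Docs whose count is at least the returned value are the candidates. Ties at the threshold can yield more than k docs, so a caller would need to decide how to cut those off. This runs in time linear in the number of docs plus the number of terms, with no sorting or heap.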

Still need to understand how PrefixCodedTerms works and whether there's any possible optimization there.

Specific timing improvements (p90 benchmark times):

Angular LSH: 121ms -> 50ms
L2 LSH: 18ms -> 11ms
Jaccard LSH: 58ms -> 36ms

@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 136.013 | 1.0 | 199.005 | 1.0 | 205.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 41.001 | 0.16005 | 46.005 | 0.18 | 49.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 136.006 | 1.0 | 146.01 | 1.0 | 151.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 9.001 | 0.02 | 10.0 | 0.02 | 11.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 396.034 | 1.0 | 507.015 | 1.0 | 529.009 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 26.004 | 0.29005 | 34.0 | 0.30009 | 36.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@alexklibisz alexklibisz changed the title from "Hashing query performance improvements" to "Hashing query performance improvements (1.5-2x faster on benchmarks)" Jul 23, 2020
@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 130.048 | 1.0 | 187.02 | 1.0 | 197.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 37.001 | 0.16005 | 39.005 | 0.18 | 40.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 113.01 | 1.0 | 131.005 | 1.0 | 136.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 9.0 | 0.02 | 10.0 | 0.02 | 10.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 318.13 | 1.0 | 479.01 | 1.0 | 492.009 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 28.001 | 0.29005 | 30.0 | 0.30009 | 30.009 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 184.01 | 1.0 | 209.02 | 1.0 | 218.063 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 40.005 | 0.16005 | 48.0 | 0.18 | 49.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 113.009 | 1.0 | 136.005 | 1.0 | 140.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 9.0 | 0.02 | 10.0 | 0.02 | 10.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 429.023 | 1.0 | 502.04 | 1.0 | 514.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 29.0 | 0.29005 | 31.005 | 0.30009 | 33.009 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 177.01 | 1.0 | 197.005 | 1.0 | 203.018 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 40.004 | 0.16005 | 47.005 | 0.18 | 49.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 135.006 | 1.0 | 144.0 | 1.0 | 146.027 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 9.001 | 0.02 | 10.0 | 0.02 | 11.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 413.033 | 1.0 | 484.01 | 1.0 | 492.027 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 25.002 | 0.29005 | 32.005 | 0.30009 | 34.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 189.034 | 1.0 | 227.015 | 1.0 | 237.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 43.002 | 0.16005 | 50.0 | 0.18 | 52.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 130.001 | 1.0 | 146.005 | 1.0 | 149.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 10.0 | 0.02 | 11.0 | 0.02 | 12.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 447.024 | 1.0 | 489.015 | 1.0 | 504.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 32.0 | 0.29005 | 34.005 | 0.30009 | 35.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 211.001 | 1.0 | 217.01 | 1.0 | 228.09 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 50.005 | 0.16005 | 60.0 | 0.18 | 61.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 124.005 | 1.0 | 144.025 | 1.0 | 151.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 10.0 | 0.02 | 11.0 | 0.02 | 12.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 396.1 | 1.0 | 504.025 | 1.0 | 527.036 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 35.002 | 0.29005 | 38.005 | 0.30009 | 40.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@github-actions

Benchmark Results

| dataset | similarity | algorithm | k | recallP10 | durationP10 | recallP50 | durationP50 | recallP90 | durationP90 | mapping | query |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random1000d50K1K | Angular | Exact | 100 | 1.0 | 140.027 | 1.0 | 172.015 | 1.0 | 177.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"angular","vec":{},"model":"exact"} |
| Random1000d50K1K | Angular | LSH | 100 | 0.15 | 32.0 | 0.16005 | 35.0 | 0.18 | 36.009 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"angular","dims":1000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"angular","model":"lsh"} |
| Random1000d50K1K | L2 | Exact | 100 | 1.0 | 123.001 | 1.0 | 128.005 | 1.0 | 134.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"dims":1000}} | {"field":"vec","similarity":"l2","vec":{},"model":"exact"} |
| Random1000d50K1K | L2 | LSH | 100 | 1.0E-5 | 9.0 | 0.02 | 9.0 | 0.02 | 10.0 | {"type":"elastiknn_dense_float_vector","elastiknn":{"model":"lsh","similarity":"l2","dims":1000,"L":400,"k":1,"r":3}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"l2","model":"lsh"} |
| Random3000d50K1K | Jaccard | Exact | 100 | 1.0 | 379.008 | 1.0 | 401.015 | 1.0 | 415.054 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"dims":3000}} | {"field":"vec","similarity":"jaccard","vec":{},"model":"exact"} |
| Random3000d50K1K | Jaccard | LSH | 100 | 0.28 | 21.001 | 0.29005 | 24.005 | 0.30009 | 27.0 | {"type":"elastiknn_sparse_bool_vector","elastiknn":{"model":"lsh","similarity":"jaccard","dims":3000,"L":400,"k":1}} | {"field":"vec","candidates":1000,"vec":{},"similarity":"jaccard","model":"lsh"} |

@alexklibisz alexklibisz merged commit c75b23f into master Jul 24, 2020
@alexklibisz alexklibisz deleted the perf-issue-58 branch July 24, 2020 01:29