
significantly improve DB compilation speed #58

Closed
abulimov wants to merge 1 commit into main from export-D54191772

Conversation

abulimov
Contributor

Summary:
Our bottleneck was, unsurprisingly, sorting the entire dataset, which is CPU-bound and single-threaded.

But we don't need to do it that way.

We sorted globally for two reasons: to guarantee that identical keys end up in the same bucket, and because RocksDB requires data to be pre-sorted when we use SST ingestion.

Instead, with this change, we use fast hashing to place identical keys into the same bucket, then sort each bucket separately, in parallel, so it's ready for ingestion.
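
A minimal Go sketch of the bucketing-plus-parallel-sort idea (illustrative only; `bucketFor` and `numBuckets` are hypothetical names, not the actual implementation in this diff):

```go
package main

import (
	"bytes"
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// numBuckets is an illustrative bucket count; in practice it would be
// tuned to shard size and available cores.
const numBuckets = 8

// bucketFor hashes a key so identical keys always land in the same
// bucket, with no global ordering needed.
func bucketFor(key []byte) int {
	h := fnv.New32a()
	h.Write(key)
	return int(h.Sum32() % numBuckets)
}

func main() {
	keys := [][]byte{[]byte("beta"), []byte("alpha"), []byte("beta"), []byte("gamma")}

	// Partition keys into buckets by hash.
	buckets := make([][][]byte, numBuckets)
	for _, k := range keys {
		b := bucketFor(k)
		buckets[b] = append(buckets[b], k)
	}

	// Sort each bucket independently and in parallel; each sorted
	// bucket can then be written out as its own SST file.
	var wg sync.WaitGroup
	for i := range buckets {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sort.Slice(buckets[i], func(a, b int) bool {
				return bytes.Compare(buckets[i][a], buckets[i][b]) < 0
			})
		}(i)
	}
	wg.Wait()
	fmt.Println(buckets)
}
```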

The result is a significant speedup in DB compilation, up to 2x on big shards.

Reviewed By: t3lurid3

Differential Revision: D54191772

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54191772

@codecov-commenter

codecov-commenter commented Feb 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 30.82%. Comparing base (a4ad55d) to head (6c412fa).

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #58       +/-   ##
===========================================
- Coverage   53.57%   30.82%   -22.75%     
===========================================
  Files         113       26       -87     
  Lines       10489     2073     -8416     
===========================================
- Hits         5619      639     -4980     
+ Misses       4519     1393     -3126     
+ Partials      351       41      -310     


@abulimov abulimov changed the title significantly improve compilation speed significantly improve DB compilation speed Feb 26, 2024
@abulimov abulimov force-pushed the export-D54191772 branch 2 times, most recently from 64dbf83 to 6c412fa on February 27, 2024 16:35
Summary:
Our bottleneck was, unsurprisingly, sorting the entire dataset, which is CPU-bound and single-threaded.

But we don't need to do it that way.

We sorted globally for two reasons: to guarantee that identical keys end up in the same bucket, and because RocksDB requires data to be [pre-sorted when we use SST ingestion](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files).

Instead, with this change, we use fast hashing to place identical keys into the same bucket, then sort each bucket separately, in parallel, so it's ready for ingestion.

NOTE: while RocksDB is happy to ingest individually sorted SSTs, for real performance we need to either sort the data globally or run compaction after ingestion. This diff opts for an in-memory merge sort of the pre-sorted buckets, as it brings significant performance wins at a small memory cost (~300 MB for 50M keys).
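
As a sketch of that final merge step (hypothetical code, assuming each bucket was already sorted in parallel; `mergeSorted` is not from this diff):

```go
package main

import "fmt"

// mergeSorted merges already-sorted buckets into one globally sorted
// slice. Each element is copied once, so the extra memory is O(n) for
// the output; the simple linear head scan below is O(n*k) for k
// buckets (a heap would make it O(n log k)).
func mergeSorted(buckets [][]string) []string {
	heads := make([]int, len(buckets))
	total := 0
	for _, b := range buckets {
		total += len(b)
	}
	out := make([]string, 0, total)
	for len(out) < total {
		best := -1
		for i, b := range buckets {
			if heads[i] == len(b) {
				continue // bucket i is exhausted
			}
			if best == -1 || b[heads[i]] < buckets[best][heads[best]] {
				best = i
			}
		}
		out = append(out, buckets[best][heads[best]])
		heads[best]++
	}
	return out
}

func main() {
	// Buckets as they might look after the parallel per-bucket sort.
	buckets := [][]string{{"alpha", "delta"}, {"beta", "gamma"}, {"epsilon"}}
	fmt.Println(mergeSorted(buckets)) // [alpha beta delta epsilon gamma]
}
```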

The result is a significant speedup in DB compilation: around 50s faster on our biggest internal shard, which has 54M records.

Reviewed By: deathowl

Differential Revision: D54191772

@deathowl
Member

deathowl commented Mar 7, 2024

Landed in dd16e29. Closing.

@deathowl deathowl closed this Mar 7, 2024