
significantly improve DB compilation speed #58

Closed
abulimov wants to merge 1 commit into main from export-D54191772

Conversation

abulimov
Contributor

Summary:
Our bottleneck was, unsurprisingly, sorting the entire dataset, which is CPU-bound and single-threaded.

But we don't need to do it that way.

We sorted globally for two reasons: to guarantee that identical keys end up in the same bucket, and because RocksDB requires data to be pre-sorted when we use SST ingestion.

Instead, with this change, we use fast hashing to place identical keys into the same bucket, then sort each bucket separately, in parallel, so it's ready for ingestion.
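
A minimal Go sketch of the bucketing-plus-parallel-sort idea (illustrative only; `bucketFor` and `numBuckets` are hypothetical names, not the actual implementation in this diff):

```go
package main

import (
	"bytes"
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// numBuckets is an illustrative bucket count; in practice it would be
// tuned to shard size and available cores.
const numBuckets = 8

// bucketFor hashes a key so identical keys always land in the same
// bucket, with no global ordering needed.
func bucketFor(key []byte) int {
	h := fnv.New32a()
	h.Write(key)
	return int(h.Sum32() % numBuckets)
}

func main() {
	keys := [][]byte{[]byte("beta"), []byte("alpha"), []byte("beta"), []byte("gamma")}

	// Partition keys into buckets by hash.
	buckets := make([][][]byte, numBuckets)
	for _, k := range keys {
		b := bucketFor(k)
		buckets[b] = append(buckets[b], k)
	}

	// Sort each bucket independently and in parallel; each sorted
	// bucket can then be written out as its own SST file.
	var wg sync.WaitGroup
	for i := range buckets {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sort.Slice(buckets[i], func(a, b int) bool {
				return bytes.Compare(buckets[i][a], buckets[i][b]) < 0
			})
		}(i)
	}
	wg.Wait()
	fmt.Println(buckets)
}
```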

The result is a significant speedup in DB compilation, up to 2x on big shards.

Reviewed By: t3lurid3

Differential Revision: D54191772

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54191772

@codecov-commenter

codecov-commenter commented Feb 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 30.82%. Comparing base (a4ad55d) to head (6c412fa).

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #58       +/-   ##
===========================================
- Coverage   53.57%   30.82%   -22.75%     
===========================================
  Files         113       26       -87     
  Lines       10489     2073     -8416     
===========================================
- Hits         5619      639     -4980     
+ Misses       4519     1393     -3126     
+ Partials      351       41      -310     


@abulimov abulimov changed the title significantly improve compilation speed significantly improve DB compilation speed Feb 26, 2024
@abulimov abulimov force-pushed the export-D54191772 branch 2 times, most recently from 64dbf83 to 6c412fa on February 27, 2024 16:35
Summary:
Our bottleneck was, unsurprisingly, sorting the entire dataset, which is CPU-bound and single-threaded.

But we don't need to do it that way.

We sorted globally for two reasons: to guarantee that identical keys end up in the same bucket, and because RocksDB requires data to be [pre-sorted when we use SST ingestion](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files).

Instead, with this change, we use fast hashing to place identical keys into the same bucket, then sort each bucket separately, in parallel, so it's ready for ingestion.

NOTE: while RocksDB is happy to ingest individually sorted SSTs, for real performance we need to either sort the data globally or run compaction after ingestion. This diff opts for an in-memory merge sort of the pre-sorted buckets, as it brings significant performance wins at a small memory cost (~300 MB for 50M keys).
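
As a sketch of that final merge step (hypothetical code, assuming each bucket was already sorted in parallel; `mergeSorted` is not from this diff):

```go
package main

import "fmt"

// mergeSorted merges already-sorted buckets into one globally sorted
// slice. Each element is copied once, so the extra memory is O(n) for
// the output; the simple linear head scan below is O(n*k) for k
// buckets (a heap would make it O(n log k)).
func mergeSorted(buckets [][]string) []string {
	heads := make([]int, len(buckets))
	total := 0
	for _, b := range buckets {
		total += len(b)
	}
	out := make([]string, 0, total)
	for len(out) < total {
		best := -1
		for i, b := range buckets {
			if heads[i] == len(b) {
				continue // bucket i is exhausted
			}
			if best == -1 || b[heads[i]] < buckets[best][heads[best]] {
				best = i
			}
		}
		out = append(out, buckets[best][heads[best]])
		heads[best]++
	}
	return out
}

func main() {
	// Buckets as they might look after the parallel per-bucket sort.
	buckets := [][]string{{"alpha", "delta"}, {"beta", "gamma"}, {"epsilon"}}
	fmt.Println(mergeSorted(buckets)) // [alpha beta delta epsilon gamma]
}
```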

The result is a significant speedup in DB compilation: around 50s faster on our biggest internal shard, which has 54M records.

Reviewed By: deathowl

Differential Revision: D54191772

@deathowl
Member

deathowl commented Mar 7, 2024

Landed in dd16e29. Closing.

@deathowl deathowl closed this Mar 7, 2024