feat: caching optd stats, 12x speedup on TPC-H SF1 #132

wangpatrick57 · 2024-03-23T20:23:15Z

Summary: This PR caches the stats objects used by OptCostModel, so we only need to load data into DataFusion the first time.
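For readers unfamiliar with the pattern, here is a minimal sketch of the load-or-compute idea, not the code in this PR. `TableStats`, `load_or_compute_stats`, and the `use_cache` flag are hypothetical names, and the plain-text encoding is a placeholder for real serde-based serialization. The `compute` closure stands in for the expensive step of loading data and building stats.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for a stats object; the real PerTableStats is richer.
#[derive(Debug, Clone, PartialEq)]
pub struct TableStats {
    pub row_cnt: usize,
}

/// Returns cached stats when caching is enabled and a readable cache file exists.
/// Otherwise runs `compute` and, if caching is enabled, writes the result to disk.
pub fn load_or_compute_stats<F>(
    cache_path: &Path,
    use_cache: bool,
    compute: F,
) -> io::Result<TableStats>
where
    F: FnOnce() -> TableStats,
{
    if use_cache {
        // A missing or unparsable cache file is treated as a cache miss.
        if let Ok(text) = fs::read_to_string(cache_path) {
            if let Ok(row_cnt) = text.trim().parse::<usize>() {
                return Ok(TableStats { row_cnt });
            }
        }
    }

    // Cache miss or caching disabled: do the expensive work.
    let stats = compute();

    if use_cache {
        // Placeholder encoding; a real implementation would serialize the whole object.
        fs::write(cache_path, stats.row_cnt.to_string())?;
    }
    Ok(stats)
}
```

With `use_cache` set to `false` the function neither reads nor writes the cache file, which mirrors the idea of keeping caching opt-in to avoid stale stats.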

Demo:
Roughly 12x speedup (45.6s / 3.9s ≈ 11.7x) on TPC-H SF1 compared to not caching stats.

Caching everything except optd stats takes 45.6s total.

Caching everything, including optd stats, takes 3.9s total.

Details:

This caching is disabled by default to avoid accidentally using stale stats. I added a CLI arg to enable it.
The main challenge of this PR was making PerTableStats serializable with serde.
This required a significant refactor of how the MostCommonValues and Distribution traits are handled in OptCostModel. Instead of storing Box<dyn ...> values in PerColumnStats, which can hold any object that implements these traits, I made PerColumnStats generic over the concrete types.
The serializability refactor will also help down the line when we want to put statistics in the catalog, since that is fundamentally a serialization problem too. Keeping Box<dyn ...> would have made putting stats in the catalog more difficult.
The one downside of this refactor is that we can no longer have a database which uses different data structures for Distribution (like a t-digest for one column, a histogram for another, etc.). I didn't see this as a big enough reason to not do the refactor because it seems like a rare thing to do. Additionally, if we really needed to do this, we could just make an enum that had both types.
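
To illustrate the trade-off described above, here is a minimal sketch under stated assumptions, not the actual optd code. The `cdf` method, `UniformHistogram`, `SortedSample`, `ColumnStatsSketch`, and `AnyDistribution` are all hypothetical names. Serde derives are left out to keep the sketch dependency-free; the point is that a generic struct has a concrete, nameable field type, and an enum can restore per-column variety if it is ever needed.

```rust
/// Hypothetical trait; the real Distribution trait has a different surface.
pub trait Distribution {
    /// Estimated fraction of values less than or equal to `value`.
    fn cdf(&self, value: f64) -> f64;
}

/// Illustrative distribution: assumes values are uniform on [min, max].
pub struct UniformHistogram {
    pub min: f64,
    pub max: f64,
}

impl Distribution for UniformHistogram {
    fn cdf(&self, value: f64) -> f64 {
        if self.max <= self.min {
            // Degenerate range: all mass sits at a single point.
            return if value >= self.max { 1.0 } else { 0.0 };
        }
        ((value - self.min) / (self.max - self.min)).clamp(0.0, 1.0)
    }
}

/// Illustrative distribution backed by a sorted sample of values.
pub struct SortedSample {
    pub sorted_values: Vec<f64>,
}

impl Distribution for SortedSample {
    fn cdf(&self, value: f64) -> f64 {
        if self.sorted_values.is_empty() {
            return 0.0;
        }
        // Count of sampled values <= `value`, relying on the sorted order.
        let n_le = self.sorted_values.partition_point(|v| *v <= value);
        n_le as f64 / self.sorted_values.len() as f64
    }
}

/// Generic stats: the distribution type is fixed at compile time, so the
/// struct has a concrete field type instead of a Box<dyn Distribution>.
pub struct ColumnStatsSketch<D: Distribution> {
    pub ndistinct: u64,
    pub distr: D,
}

/// The enum escape hatch: one concrete type that can hold either structure.
pub enum AnyDistribution {
    Histogram(UniformHistogram),
    Sample(SortedSample),
}

impl Distribution for AnyDistribution {
    fn cdf(&self, value: f64) -> f64 {
        match self {
            AnyDistribution::Histogram(h) => h.cdf(value),
            AnyDistribution::Sample(s) => s.cdf(value),
        }
    }
}
```

A `Vec<ColumnStatsSketch<AnyDistribution>>` can then mix both structures across columns while every element still has the same concrete type.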

optd-datafusion-repr/src/cost/base_cost.rs

optd-perftest/src/datafusion_dbms.rs

optd-perftest/tests/cardtest_integration.rs

Gun9niR · 2024-03-24T02:21:10Z

optd-perftest/src/benchmark.rs

@@ -37,6 +37,17 @@ impl Benchmark {
        dbname.to_lowercase()
    }

+    /// Use this when you need a file name. The rules for file names are different from the rules
+    ///   for database names, so this is a different function
+    pub fn get_fname(&self) -> String {


More generally, it can be a unique name

I mean it doesn't have to be a file name. The function's name is a bit confusing, like: file for what? But actually it's just a unique name that can be used as a file name

Gun9niR

Rest LGTM

wangpatrick57 added 7 commits March 23, 2024 08:22

logic for reading/writing cache in df dbms

acc0af0

templated stats instead of having boxes

650a6bc

fmt and clippy

95471fb

SerializableOrderedF64

6363a7e

stats cache done

21ad5fb

Merge branch 'main' into phw2/df-stat-cache

4e9aa2e

constraints now work

385ba61

wangpatrick57 changed the title ~~feat: caching optd stats~~ feat: caching optd stats, 12x speedup on TPC-H SF1 Mar 23, 2024

wangpatrick57 requested a review from Gun9niR March 23, 2024 20:59

wangpatrick57 marked this pull request as ready for review March 23, 2024 21:00

wangpatrick57 added 3 commits March 23, 2024 17:10

added option to use stats cache

ff2ae5f

comment

e425225

Merge branch 'main' into phw2/df-stat-cache

d0e347d