-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: caching optd stats, 12x speedup on TPC-H SF1 #132
Conversation
@@ -37,6 +37,17 @@ impl Benchmark { | |||
dbname.to_lowercase() | |||
} | |||
|
|||
/// Use this when you need a file name. The rules for file names are different from the rules | |||
/// for database names, so this is a different function | |||
pub fn get_fname(&self) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally, it can be unique name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, wdym?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it doesn't have to be a file name. The function's name is a bit confusing, like: file for what? But actually it's just a unique name that can be used as a file name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
Summary: Now caching the stat objects used by
OptCostModel
, meaning we don't need to load data into DataFusion after doing it the first time.Demo:
12x speedup on TPC-H SF1 compared to not caching stats.
Caching everything except optd stats takes 45.6s total.
Caching everything, including optd stats, takes 3.9s total.
Details:
PerTableStats
a serializable object forserde
.Box<dyn ...>
would make putting stats in the catalog more difficult.MostCommonValues
andDistribution
traits are handled inOptCostModel
. Instead of havingBox<dyn ...>
values inPerColumnStats
which store any object that implements these traits, I madePerColumnStats
a templated object.Distribution
(like a t-digest for one column, a histogram for another, etc.). I didn't see this as a big enough reason to not do the refactor because it seems like a rare thing to do. Additionally, if we really needed to do this, we could just make an enum that had both types.