Statistics for multiple branches #7558

Merged
max-hoffman merged 45 commits into main from max/stats-branches on Mar 27, 2024

@max-hoffman (Contributor) commented Mar 1, 2024

Forks the interface for how database statistics are stored. Statistics can live in the original source database, in a separate database at .dolt/stats, or in an alternative implementation such as an LSM, which would have easier append-only semantics. The implementation for the noms chunkstore isn't particularly efficient: it does not deduplicate common chunks between branches.

How the new code is organized: statspro has generic interfaces for how a Dolt-specific stats implementation works. statsnoms is an implementation that uses a second database at .dolt/stats to store statistics refs. The stats refs are the same as before, except that they are now named by the branch they reference (e.g. refs/statistics/main). Storage is therefore the concern of the concrete implementation, while the common interface forces every implementation to handle branches. In statsnoms, branch switching is just a switch between named refs.
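To make the split between a generic, branch-aware interface and a concrete store more tangible, here is a minimal sketch. It is not the statspro or statsnoms code: `BranchStats`, `memStats`, and `StatsRef` are hypothetical names, the store is an in-memory map rather than a second database, and a single row count stands in for real statistics. Only the ref naming (`refs/statistics/<branch>`) comes from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// refPrefix follows the naming in the PR description (e.g. refs/statistics/main).
const refPrefix = "refs/statistics/"

// StatsRef builds the ref name under which a branch's statistics are stored.
func StatsRef(branch string) string {
	return refPrefix + branch
}

// BranchStats is a hypothetical, heavily simplified stand-in for a generic
// interface that forces every implementation to be branch-aware.
type BranchStats interface {
	Put(branch, index string, rowCount uint64)
	Get(branch, index string) (uint64, bool)
	Branches() []string
}

// memStats is an assumed in-memory implementation: one map of index
// statistics per named ref, so switching branches is just a ref lookup.
type memStats struct {
	refs map[string]map[string]uint64
}

func NewMemStats() *memStats {
	return &memStats{refs: make(map[string]map[string]uint64)}
}

func (m *memStats) Put(branch, index string, rowCount uint64) {
	ref := StatsRef(branch)
	if m.refs[ref] == nil {
		m.refs[ref] = make(map[string]uint64)
	}
	m.refs[ref][index] = rowCount
}

func (m *memStats) Get(branch, index string) (uint64, bool) {
	// Reading from a missing (nil) inner map is safe and reports ok == false.
	cnt, ok := m.refs[StatsRef(branch)][index]
	return cnt, ok
}

func (m *memStats) Branches() []string {
	out := make([]string, 0, len(m.refs))
	for ref := range m.refs {
		out = append(out, strings.TrimPrefix(ref, refPrefix))
	}
	sort.Strings(out)
	return out
}

func main() {
	var s BranchStats = NewMemStats()
	s.Put("main", "t.primary", 100)
	s.Put("feature", "t.primary", 250)
	cnt, _ := s.Get("feature", "t.primary")
	fmt.Println(StatsRef("feature"), cnt, s.Branches())
}
```

Because each branch gets its own map, nothing is shared between refs, which loosely mirrors the note that common chunks are not deduplicated between branches.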

A high-level view of what happens during certain operations: there are still two operations, load and update. Load now either initializes the stats database at .dolt/stats or loads existing stats. Update is unchanged: it runs on ANALYZE or auto refresh.

Most of the changes just force the logic through a generic interface. Import cycle issues (dtables) and deadlock issues when creating a database (the Dolt provider takes a lock that prevents certain operations on the session in the stats provider) motivated the packaging decisions.

@max-hoffman max-hoffman marked this pull request as ready for review March 7, 2024 00:04
@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
9c1ddba ok 5937457
version total_tests
9c1ddba 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
958f829 ok 5937457
version total_tests
958f829 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
0654c94 ok 5937457
version total_tests
0654c94 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
3227d03 ok 5937457
version total_tests
3227d03 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
0716608 ok 5937457
version total_tests
0716608 5937457
correctness_percentage
100.0

go/libraries/doltcore/sqle/statspro/analyze.go (outdated, resolved)
go/store/prolly/tuple_mutable_map.go (outdated, resolved)
keyBuilder.PutInt64(3, maxBucketFanout+1)
maxKey := keyBuilder.Build(pool)

// there is a limit on the number of buckets for a given index, iter
Contributor

What are the implications of this comment? Is this relevant for this function?

go/libraries/doltcore/sqle/statspro/auto_refresh.go (outdated, resolved)
go/libraries/doltcore/sqle/statspro/auto_refresh.go (outdated, resolved)
go/libraries/doltcore/sqle/statspro/interface.go (outdated, resolved)
go/libraries/doltcore/sqle/statspro/interface.go (outdated, resolved)
}

dSess := dsess.DSessFromSess(loadCtx.Session)
var branches []string
Contributor

It seems like the order of precedence here is:

  • Global variable
  • Current session branch
  • Provider default branch

Is this documented somewhere?
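For reference, the precedence the reviewer describes could be written down as a small helper. This is a sketch of the reviewer's reading, not the helper that was later added: `branchesToLoad` and its parameters are hypothetical, and treating the global variable as a comma-separated list of branch names is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// branchesToLoad (hypothetical) picks the branches whose statistics are loaded.
// Precedence: global variable, then current session branch, then the
// provider's default branch.
func branchesToLoad(globalVar, sessionBranch, defaultBranch string) []string {
	// Assumption: the global variable holds a comma-separated branch list.
	var fromGlobal []string
	for _, b := range strings.Split(globalVar, ",") {
		if b = strings.TrimSpace(b); b != "" {
			fromGlobal = append(fromGlobal, b)
		}
	}
	if len(fromGlobal) > 0 {
		return fromGlobal
	}
	if sessionBranch != "" {
		return []string{sessionBranch}
	}
	if defaultBranch != "" {
		return []string{defaultBranch}
	}
	return nil
}

func main() {
	fmt.Println(branchesToLoad("main, dev", "feature", "main"))
	fmt.Println(branchesToLoad("", "feature", "main"))
	fmt.Println(branchesToLoad("", "", "main"))
}
```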

Contributor Author

Added a helper and docstring.

func (p *Provider) Load(ctx *sql.Context, fs filesys.Filesys, db dsess.SqlDatabase, branches []string) error {
// |statPath| is either file://./stat or mem://stat
statsDb, err := p.sf.Init(ctx, db, p.pro, fs, env.GetCurrentUserHomeDir)
if err != nil {
Contributor

Add a comment explaining why we don't propagate the error here?

Contributor Author

Hmm, maybe I should kill the error return. I was thinking that a bad branch name probably shouldn't prevent the other branches from being loaded.

"github.com/dolthub/dolt/go/store/hash"
)

type DoltStats struct {
Contributor

Can you add a docstring that explains why this exists in addition to sql.Statistic?

Contributor Author

I need to reimplement this for TPC-C perf, and I'm not quite sure what the right interface is yet. The real reason was just ease of implementing the histogram manipulation. The interface will probably need to expose a bunch of interfaces for in-place updates so we don't have to copy this to sql.Statistic.

return err
}
}

Contributor

Is this the same logic I commented on in Configure? Let's move it to a common helper function.

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
131b0ac ok 5937457
version total_tests
131b0ac 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
5803831 ok 5937457
version total_tests
5803831 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
1a02a68 ok 5937457
version total_tests
1a02a68 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
e600e09 ok 5937457
version total_tests
e600e09 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
6f8e321 ok 5937457
version total_tests
6f8e321 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
c4df926 ok 5937457
version total_tests
c4df926 5937457
correctness_percentage
100.0

@max-hoffman max-hoffman merged commit 801a82a into main Mar 27, 2024
20 checks passed
@max-hoffman max-hoffman deleted the max/stats-branches branch March 27, 2024 18:54

@coffeegoddd DOLT

| test_name | detail | row_cnt | sorted | mysql_time | sql_mult | cli_mult |
|---|---|---|---|---|---|---|
| batching | LOAD DATA | 10000 | 1 | 0.05 | 1.6 | |
| batching | batch sql | 10000 | 1 | 0.06 | 2.5 | |
| batching | by line sql | 10000 | 1 | 0.06 | 2.33 | |
| blob | 1 blob | 200000 | 1 | 0.88 | 4.05 | 4.57 |
| blob | 2 blobs | 200000 | 1 | 0.87 | 4.97 | 6.22 |
| blob | no blob | 200000 | 1 | 0.86 | 2.29 | 2.66 |
| col type | datetime | 200000 | 1 | 0.78 | 2.87 | 3.55 |
| col type | varchar | 200000 | 1 | 0.66 | 3.17 | 3.53 |
| config width | 2 cols | 200000 | 1 | 0.74 | 2.41 | 3.12 |
| config width | 32 cols | 200000 | 1 | 1.84 | 1.83 | 2.54 |
| config width | 8 cols | 200000 | 1 | 0.95 | 2.21 | 2.58 |
| pk type | float | 200000 | 1 | 0.87 | 2.09 | 2.49 |
| pk type | int | 200000 | 1 | 0.79 | 2.7 | 2.68 |
| pk type | varchar | 200000 | 1 | 1.46 | 1.58 | 1.83 |
| row count | 1.6mm | 1600000 | 1 | 5.47 | 2.73 | 3.34 |
| row count | 400k | 400000 | 1 | 1.36 | 2.7 | 3.27 |
| row count | 800k | 800000 | 1 | 2.72 | 2.72 | 3.31 |
| secondary index | four index | 200000 | 1 | 3.59 | 1.3 | 1.19 |
| secondary index | no secondary | 200000 | 1 | 0.86 | 2.28 | 2.64 |
| secondary index | one index | 200000 | 1 | 1.07 | 2.34 | 2.39 |
| secondary index | two index | 200000 | 1 | 1.93 | 1.66 | 1.79 |
| sorting | shuffled 1mm | 1000000 | 0 | 5.15 | 2.62 | 2.7 |
| sorting | sorted 1mm | 1000000 | 1 | 5.52 | 2.46 | 2.5 |


@coffeegoddd DOLT

| name | detail | mean_mult |
|---|---|---|
| dolt_blame_basic | system table | 1.36 |
| dolt_blame_commit_filter | system table | 3.6 |
| dolt_commit_ancestors_commit_filter | system table | 0.74 |
| dolt_commits_commit_filter | system table | 0.91 |
| dolt_diff_log_join_from_commit | system table | 1.97 |
| dolt_diff_log_join_to_commit | system table | 2.02 |
| dolt_diff_table_from_commit_filter | system table | 1.1 |
| dolt_diff_table_to_commit_filter | system table | 1.1 |
| dolt_diffs_commit_filter | system table | 0.92 |
| dolt_history_commit_filter | system table | 1.45 |
| dolt_log_commit_filter | system table | 1 |

3 participants