
Index rebuilds with external key sorting #7754

Merged: 26 commits merged into main from max/external-sort-index-build on May 2, 2024

Conversation

@max-hoffman commented Apr 16, 2024:

Index builds now write keys to intermediate files and merge sort them before materializing the prolly tree for the secondary index. This contrasts with the default approach, which rebuilds the prolly tree each time we flush keys from memory. The old approach touches most of the tree with random reads and writes because the flushed keys are unsorted. The new approach structures the work for sequential IO by flushing sorted runs that are incrementally merge sorted. Sequential IO is dramatically faster on disk-based systems.
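
For intuition, here is a minimal sketch of the sorted-run spill step described above, in illustrative Go (the flushRun helper, the length-prefixed on-disk format, and the os.CreateTemp usage are assumptions for the sketch, not the PR's actual API). Each time the in-memory key buffer fills, it is sorted and written as one sequential run; merging the runs afterwards (see the k-way merge sketch further down) yields a single sorted stream from which the secondary index tree can be materialized in one pass.

```go
// Illustrative external-sort run writer: sort the in-memory keys and spill
// them to a temp file as one sequential, length-prefixed run.
// Hypothetical names and format; not the PR's actual API.
package sortsketch

import (
	"bufio"
	"encoding/binary"
	"os"
	"sort"
)

// flushRun writes one sorted run and returns the temp file's path.
func flushRun(keys [][]byte, cmp func(a, b []byte) int) (string, error) {
	sort.Slice(keys, func(i, j int) bool { return cmp(keys[i], keys[j]) < 0 })

	f, err := os.CreateTemp("", "key_file_")
	if err != nil {
		return "", err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	var sz [binary.MaxVarintLen64]byte
	for _, k := range keys {
		n := binary.PutUvarint(sz[:], uint64(len(k)))
		if _, err := w.Write(sz[:n]); err != nil {
			return "", err
		}
		if _, err := w.Write(k); err != nil {
			return "", err
		}
	}
	if err := w.Flush(); err != nil {
		return "", err
	}
	return f.Name(), nil
}
```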

@coffeegoddd: @max-hoffman DOLT correctness results for c3065ec: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for b9f36b6: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 15d6d6a: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @coffeegoddd DOLT correctness results for 76893a7: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 8cd6aa2: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @coffeegoddd DOLT correctness results for e59b5a5: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@max-hoffman changed the title from "starter for external sorting index rebuilds" to "Index rebuilds with external key sorting" on Apr 17, 2024
@max-hoffman (author) commented:

I tested the preliminary speedup by comparing 1 million -> 3 million -> 10 million row build times on an HDD:

old: 19 seconds -> 15 minutes -> 280 minutes
new: 30 seconds -> 2 minutes -> 6 minutes

For comparison, building 3 million rows on an SSD goes from 10 seconds -> 30 seconds.

Tuning has some effect on the scaling constants, but the main difference is disk seeks on the HDD. We now build the tree only once, with sorted edits, by pre-sorting the keys with a merge sort. The previous version would rebuild the whole tree O(log n) times, with each rebuild reading every chunk with random IO. The merge sort shifts that work upfront into sequential IO.
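
As a rough, illustrative IO model (my own framing, not a claim made in the PR): let N be the number of keys, M the number of keys that fit in memory, k the merge fan-in, |T| the number of chunks in the tree, and R the number of tree rebuilds under the old scheme. Then, roughly,

\[
W_{\text{old}} \approx R \cdot |T| \cdot C_{\text{rand}},
\qquad
W_{\text{new}} \approx 2N\left(1 + \left\lceil \log_k \tfrac{N}{M} \right\rceil\right) C_{\text{seq}},
\]

and since a random chunk read costs far more on an HDD than streaming the same bytes sequentially (\(C_{\text{rand}} \gg C_{\text{seq}}\)), the gap widens sharply once the table outgrows memory, which is consistent with the 280-minute vs 6-minute result above.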

@max-hoffman marked this pull request as ready for review on April 17, 2024, 20:58
@coffeegoddd: @coffeegoddd DOLT correctness results for 82025f1: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for d984695: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 0d14698: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@reltuk left a comment:

This looks like a great start. Had a number of initial comments. Happy to follow up on any of these. It's quite close, just needs a bit of cleanup, and it's going to be great :)

(Nine review threads on go/store/prolly/sort/external.go, since resolved.)
Review thread on this hunk in go/store/prolly/sort/external.go:

    }
    }
    })
    for i := 0; i < len(a.files); i += 2 {

If I understand the whole logic here, the asymptotics on this strategy as implemented here seem off to me, at least compared to optimality. AFAIU:

  1. we're scanning the entire data set on every compact, and
  2. every compact is coming at fixed size intervals, for example, every time we have 64 new files of 32MB each, we scan every existing file...

Don't we get N^2 asymptotics?

If you take a look at the file sizes of the files in each slot, you'll also see we're just building one big file at index 0 and everything else stays quite a bit smaller than it...so it's also probably not great for parallelism, because we're going to block on the linear scan of that file each compact...

I think an approach where we only merge the n/2 smallest files each compact (we only get by n/4 slots in that case) gets us back to better asymptotics, but there are probably much better approaches which take advantage of the k-way merge (instead of 2-way...)
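
As a rough illustration of this suggestion (not code from the PR; runFile and pickCompaction are hypothetical names), a compaction could select only the smallest half of the run files, so the one big file at index 0 is not rescanned on every pass:

```go
// Sketch: compact only the smallest half of the run files, leaving the
// large, already-merged runs untouched until a later compaction.
// Illustrative only; runFile and pickCompaction are hypothetical.
package sortsketch

import "sort"

type runFile struct {
	name string
	size int64
}

// pickCompaction returns the smallest half of the runs (at least two).
func pickCompaction(runs []runFile) []runFile {
	if len(runs) < 2 {
		return nil
	}
	sorted := append([]runFile(nil), runs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].size < sorted[j].size })
	n := len(sorted) / 2
	if n < 2 {
		n = 2
	}
	return sorted[:n]
}
```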

@max-hoffman (author) commented Apr 19, 2024:

@reltuk With your changes, index building looks like it runs about 2x as fast as the first version. Compaction strategies seem edge-casey, so I picked what seems like a simple one: leveled tiers that are compacted (with parallelism=1) to the next level after a size threshold.

1 million rows (18s) -> 3 million rows (1 min) -> 10 million rows (4 min)

@coffeegoddd: @max-hoffman DOLT correctness results for 4f2619b: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 9b09fe7: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 1b19a44: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@reltuk left a comment:

LGTM. I like the level change and the heap-based multi-merger.

I think it's worth taking a short pass to get timely resource finalization on the file handles and cleanup of the temp files from disk. Somewhat your call.
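
For reference, the heap-based multi-merge mentioned here can be sketched with Go's container/heap; this is illustrative only (keyIter, mergeItem, and kWayMerge are hypothetical names, not the PR's types):

```go
// Sketch of a heap-based k-way merge over sorted runs: always emit the
// smallest head key, then refill the heap from the run it came from.
// Illustrative only; keyIter is a hypothetical iterator interface.
package sortsketch

import "container/heap"

type keyIter interface {
	Next() (key []byte, ok bool) // next key in sorted order; ok=false when exhausted
}

type mergeItem struct {
	key []byte
	src keyIter
}

type mergeHeap struct {
	items []mergeItem
	cmp   func(a, b []byte) int
}

func (h *mergeHeap) Len() int           { return len(h.items) }
func (h *mergeHeap) Less(i, j int) bool { return h.cmp(h.items[i].key, h.items[j].key) < 0 }
func (h *mergeHeap) Swap(i, j int)      { h.items[i], h.items[j] = h.items[j], h.items[i] }
func (h *mergeHeap) Push(x any)         { h.items = append(h.items, x.(mergeItem)) }
func (h *mergeHeap) Pop() any {
	old := h.items
	it := old[len(old)-1]
	h.items = old[:len(old)-1]
	return it
}

// kWayMerge streams keys from all runs in globally sorted order.
func kWayMerge(runs []keyIter, cmp func(a, b []byte) int, emit func([]byte) error) error {
	h := &mergeHeap{cmp: cmp}
	for _, r := range runs {
		if k, ok := r.Next(); ok {
			h.items = append(h.items, mergeItem{key: k, src: r})
		}
	}
	heap.Init(h)
	for h.Len() > 0 {
		it := heap.Pop(h).(mergeItem)
		if err := emit(it.key); err != nil {
			return err
		}
		if k, ok := it.src.Next(); ok {
			heap.Push(h, mergeItem{key: k, src: it.src})
		}
	}
	return nil
}
```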

Comment on lines 106 to 109:

    defer it.Close()
    if err != nil {
        return nil, err
    }

Suggested change (check the error before deferring the Close, since the iterator may not be valid when err is non-nil):

    if err != nil {
        return nil, err
    }
    defer it.Close()

Review comment:

More Close()es, but I think only a defer sorter.Close() up above, if we add timely resource finalization.

Review thread on this hunk:

    }

    func (a *tupleSorter) newFile() (*os.File, error) {
        f, err := a.tmpProv.NewFile("", "key_file_")

Something seems off about the lifetime of these files. We put them in keyIterables, and those return the Iter, but the iter's Close() method doesn't close the file. And there's no way provided in the API to close and clean them up in the case of an error. And it seems like we would want to os.Remove() these files after we're done with them, since the contract was to use a certain number of files?

Near at hand is something like:

  1. Make it explicit that keyFile.IterAll() can only ever be called once.
  2. On keyFile.IterAll(), set keyFile's f *os.File to nil.
  3. func (k *keyFile) Close() { if k.f != nil { k.f.Close(); os.Remove(k.f.Name()) } }
  4. keyFileReader gets the *os.File as well as its buffered reader.
  5. func (r keyFileReader) Close() { r.f.Close(); os.Remove(r.f.Name()) }
  6. func (a *tupleSorter) Close() { for _, fs := range a.files { for _, f := range fs { f.Close() } } }

Or something like that?
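
A minimal sketch of the ownership pattern outlined above, assuming hypothetical keyFile / keyFileReader shapes (not the PR's exact definitions): IterAll hands the *os.File off to the reader, and whichever side still owns the file closes and removes it.

```go
// Sketch of the temp-file finalization pattern described above: exactly one
// owner of the *os.File at a time, and Close both closes and removes it.
// Hypothetical types; not the PR's exact code.
package sortsketch

import (
	"bufio"
	"os"
)

type keyFile struct {
	f *os.File
}

// IterAll may only be called once; it transfers file ownership to the reader.
func (k *keyFile) IterAll() *keyFileReader {
	r := &keyFileReader{f: k.f, rd: bufio.NewReader(k.f)}
	k.f = nil
	return r
}

// Close releases and removes the file if IterAll was never called.
func (k *keyFile) Close() {
	if k.f != nil {
		k.f.Close()
		os.Remove(k.f.Name())
	}
}

type keyFileReader struct {
	f  *os.File
	rd *bufio.Reader
}

// Close removes the temp file once the reader is done with it.
func (r *keyFileReader) Close() {
	r.f.Close()
	os.Remove(r.f.Name())
}
```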

Review thread on this hunk:

    outF := newKeyFile(newF, a.batchSize)
    m, err := newFileMerger(ctx, a.keyCmp, outF, fileLevel[:a.fileMax]...)
    if err != nil {
        return err

If we add resource finalization, careful not to leak newF or any of fileLevel[:a.fileMax] that wasn't successfully IterAll'd here.

Review thread on this hunk:

        return err
    }
    if err := m.run(ctx); err != nil {
        return err

If we add resource finalization, careful not to leak the file newF here. m.run() should probably guarantee that all iterators it made are closed.

Review thread on this hunk:

    } else {
        reader.iter.Close()
        if err != nil {
            return err

If the contract is that these are all closed when we return, we need to add some code to close all the ones remaining in m.mq here.

(Five more review threads on go/store/prolly/sort/external.go, since resolved.)
Co-authored-by: Aaron Son <aaron@dolthub.com>
@coffeegoddd: @max-hoffman DOLT correctness results for 3194cd4: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 77827df: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for 034fc3b: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for d7bd28b: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @max-hoffman DOLT correctness results for e190f41: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@coffeegoddd: @coffeegoddd DOLT correctness results for ed8bdca: 5937457/5937457 tests ok, correctness_percentage 100.0 (comparing 100.000000 to 100.000000)

@max-hoffman merged commit 9ec3ce2 into main on May 2, 2024
20 checks passed
@max-hoffman deleted the max/external-sort-index-build branch on May 2, 2024, 16:36
github-actions bot commented May 2, 2024:

@max-hoffman DOLT

| test_name | detail | row_cnt | sorted | mysql_time | sql_mult | cli_mult |
|---|---|---|---|---|---|---|
| batching | LOAD DATA | 10000 | 1 | 0.05 | 1.4 | |
| batching | batch sql | 10000 | 1 | 0.07 | 2 | |
| batching | by line sql | 10000 | 1 | 0.07 | 1.86 | |
| blob | 1 blob | 200000 | 1 | 0.9 | 3.84 | 4.4 |
| blob | 2 blobs | 200000 | 1 | 0.91 | 4.68 | 4.99 |
| blob | no blob | 200000 | 1 | 0.9 | 1.99 | 2.08 |
| col type | datetime | 200000 | 1 | 0.81 | 2.6 | 2.83 |
| col type | varchar | 200000 | 1 | 0.69 | 2.78 | 2.8 |
| config width | 2 cols | 200000 | 1 | 0.82 | 1.99 | 2.13 |
| config width | 32 cols | 200000 | 1 | 1.87 | 1.72 | 2.49 |
| config width | 8 cols | 200000 | 1 | 0.96 | 2.03 | 2.3 |
| pk type | float | 200000 | 1 | 1.01 | 1.65 | 1.74 |
| pk type | int | 200000 | 1 | 0.84 | 1.98 | 2.06 |
| pk type | varchar | 200000 | 1 | 1.61 | 1.44 | 1.39 |
| row count | 1.6mm | 1600000 | 1 | 5.63 | 2.42 | 2.58 |
| row count | 400k | 400000 | 1 | 1.49 | 2.23 | 2.39 |
| row count | 800k | 800000 | 1 | 2.79 | 2.44 | 2.59 |
| secondary index | four index | 200000 | 1 | 3.53 | 1.29 | 1.09 |
| secondary index | no secondary | 200000 | 1 | 0.9 | 2.04 | 2.1 |
| secondary index | one index | 200000 | 1 | 1.14 | 2.05 | 2.11 |
| secondary index | two index | 200000 | 1 | 1.95 | 1.6 | 1.51 |
| sorting | shuffled 1mm | 1000000 | 0 | 4.85 | 2.54 | 2.71 |
| sorting | sorted 1mm | 1000000 | 1 | 4.88 | 2.54 | 2.68 |

github-actions bot commented May 2, 2024:

@max-hoffman DOLT

| name | detail | mean_mult |
|---|---|---|
| dolt_blame_basic | system table | 1.23 |
| dolt_blame_commit_filter | system table | 3.3 |
| dolt_commit_ancestors_commit_filter | system table | 0.84 |
| dolt_commits_commit_filter | system table | 0.91 |
| dolt_diff_log_join_from_commit | system table | 2.08 |
| dolt_diff_log_join_to_commit | system table | 2.14 |
| dolt_diff_table_from_commit_filter | system table | 1.12 |
| dolt_diff_table_to_commit_filter | system table | 1.12 |
| dolt_diffs_commit_filter | system table | 1 |
| dolt_history_commit_filter | system table | 1.21 |
| dolt_log_commit_filter | system table | 0.89 |
