-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split MULTI transaction in batches (fix #149) #148
Open
elboulangero
wants to merge
7
commits into
etix:master
Choose a base branch
from
elboulangero:github/split-multi
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Remove the variable count, instead we use this variable only to iterate. Hence in the final log, we replace 'count' with 'len(sourceFiles)'. With this commit, we break two MULTI transactions, so let's look at those in details: The first MULTI transaction is: - DEL FILES_TMP - loop on files: - SADD FILES_TMP <path> I think there's no problem in breaking this in chunks, as we just iterate over a temporary key. It doesn't matter if the program is interrupted and we leave a partially updated key behind. Then comes a lone SDIFF FILES FILES_TMP command, that gives us the list of files to remove. The second MULTI transaction is: - loop on files: - HMSET FILE_<path> - publish FILE_UPDATE <path> - loop on removed files: - DEL FILE_<path> - publish FILE_UPDATE <path> - RENAME FILES_TMP FILES I don't think it's really needed to have all of that in a single MULTI transaction, i *think* it's Ok to break the two loops in chunks. What really matters is that we rename the keys FILES_TMP to FILES in the last step.
Rework how the files are committed to the db. Before: we'd create a MULTI, the scan. The scan function iterates over the scan results and call ScannerAddFile(), which would send commands to Redis. In case of failure, we'd discard the MULTI transaction, remove the temporary key, and bail out. In case of success, we'd finally call ScannerCommit() which was just about calling EXEC to execute the MULTI transaction. With this commit: we now keep an internal slice of filedata. Calling ScannerAddFile() just adds a filedata to the slice. In case of failure, it's easier, we can just return. In case of success, it's now the ScannerCommit() function that does the bulk of the job: sent a MULTI command, then iterate on the files to enqueue all the commands, and finally EXEC. This change of behaviour is needed for what comes next: breaking the MULTI transaction in chunks.
We don't need to maintain a counter, as we keep a slice with the files returned by the scan, so len() does the job.
With this commit we split two big multi transactions in chunks. Let's have a look at those in details. One was about removing the files to remove. I don't think it really matters if it's all done at once, or in several transaction. The other in ScannerCommit() is about committing all the files that were returned by the scan. Once again, I have the impression that it doesn't really matter if it's done all at once, or if the transaction is split in chunks.
elboulangero
changed the title
Split MULTI transaction in batches
Split MULTI transaction in batches (fix #149)
Oct 11, 2023
Looking at the logs when the source is scan: scan.go:494 2023/10/10 06:18:58.633 UTC [source] Scanning the filesystem... scan.go:512 2023/10/10 06:19:22.079 UTC [source] Indexing the files... scan.go:624 2023/10/10 06:19:27.745 UTC [source] Scanned 544001 files And the logs when a mirror is scanned: rsync.go:89 2023/10/10 06:13:23.634 UTC [ftp.jaist.ac.jp] Requesting file list via rsync... trace.go:129 2023/10/10 06:13:23.979 UTC [ftp.jaist.ac.jp] trace last sync: 2023-10-10 00:00:01 +0000 UTC scan.go:221 2023/10/10 06:18:49.781 UTC [ftp.jaist.ac.jp] Indexed 544001 files (544000 known), 0 removed This commit brings two minor improvements: * log the number of files that were removed for the source, similar to how it's done with mirrors. * add a log "Indexing the files" after the mirror scan returns, similar to how it's done for the source.
elboulangero
force-pushed
the
github/split-multi
branch
from
October 19, 2023 07:29
760bdfd
to
7c1e44e
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Unbounded MUTLTI transactions can cause big latency spikes in Redis. Especially when the number of commands in the transaction is a factor of the number of files in the repository, and the repository might have up to a million file.
With this MR, we make sure that the MULTI transactions are executed in batches of 5k files, following the recommendation from https://redis.io/docs/manual/pipelining/: