pkg/cli: add --log-format flag to debug zip for parquet output #162005
Draft
Abhinav1299 wants to merge 1 commit into cockroachdb:master
Previously, `debug zip` collected log files in text format only. For large clusters, node logs can constitute 60-80% of the total debug zip size. While text logs compress well with gzip (~9x compression), the text format cannot be queried directly, and its compression ratio is limited by the row-oriented layout of text logs.

This change introduces an optional `--log-format` flag for the `cockroach debug zip` command that supports two values: "text" (the default, preserving existing behavior) and "parquet". When the parquet format is selected, log entries are written using Apache Parquet columnar storage with ZSTD compression. Parquet achieves ~92% compression on raw log data through columnar storage, dictionary encoding on repetitive fields (severity, channel, file paths), and delta encoding on timestamps. The resulting parquet files can be analyzed using SQL-based tools like DuckDB.

The implementation adds a new `logParquetWriter` that maps `logpb.Entry` fields to a 15-column parquet schema. The `getLogFiles` function in `zip_per_node.go` branches on the format flag, calling either the existing `FormatLegacyEntry` for text output or the new `writeLogEntriesAsParquet` for parquet output.

Part of: CRDB-59104

Epic: none

Release note (cli change): Added the `--log-format` flag to the `cockroach debug zip` command. Valid values are "text" (default) and "parquet". The parquet format uses columnar storage with ZSTD compression, reducing raw log size.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.