Conversation
Compare: 0546bcf to 5c4c298
Compare: 5c4c298 to ab8f1a4
When archiving an existing session to make room for a new session at the same checkpoint, also move transcript chunk files (full.jsonl.001, full.jsonl.002, etc.) to the archive folder.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 0b0195f27fed
khaong left a comment:

note: added handling for chunks in the archive

👍
Pull request overview
This pull request adds support for chunking large transcripts to stay under GitHub's 100MB blob limit. The implementation uses a 50MB chunk size as a safety margin and provides agent-specific chunking strategies (Gemini JSON format vs JSONL format for Claude Code).
Changes:
- Introduces a `TranscriptChunker` interface that agents can implement to provide format-aware chunking
- Implements chunking/reassembly logic for Gemini CLI (JSON message arrays) and Claude Code (JSONL line-based)
- Integrates chunking into checkpoint write operations and reassembly into read operations
- Adds comprehensive test coverage for chunking logic and archive operations
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| cmd/entire/cli/agent/agent.go | Adds TranscriptChunker interface definition for agents supporting chunking |
| cmd/entire/cli/agent/chunking.go | Core chunking implementation with JSONL default fallback and agent detection |
| cmd/entire/cli/agent/chunking_test.go | Comprehensive tests for chunking, reassembly, and utility functions |
| cmd/entire/cli/agent/geminicli/gemini.go | Gemini-specific JSON chunking that splits message arrays while preserving structure |
| cmd/entire/cli/agent/geminicli/gemini_test.go | Extensive tests for Gemini chunking edge cases and round-trip validation |
| cmd/entire/cli/agent/claudecode/claude.go | Claude Code JSONL chunking using shared line-based chunking |
| cmd/entire/cli/checkpoint/committed.go | Integrates chunking into write operations and reassembly with backward compatibility |
| cmd/entire/cli/checkpoint/temporary.go | Adds chunking support for temporary/shadow branch checkpoints |
| cmd/entire/cli/checkpoint/checkpoint_test.go | Tests archive operations with chunked transcripts |
```go
// Marshal message to get its size
msgBytes, err := json.Marshal(msg)
if err != nil {
	logging.Warn(context.Background(), "failed to marshal Gemini message during chunking",
		slog.Int("message_index", i),
		slog.String("error", err.Error()),
	)
	continue
```
Skipping unmarshalable messages with only a warning could result in data loss during chunking. If a message fails to marshal (which should be rare but could happen due to encoding issues or corrupt data), it silently disappears from the chunked output. This means the reassembled transcript won't match the original.
Consider returning an error instead of logging and continuing. If there's a specific reason to skip malformed messages, document that reasoning in a comment and ensure it's intentional data sanitization rather than silent data loss.
Suggested change:

```diff
- // Marshal message to get its size
- msgBytes, err := json.Marshal(msg)
- if err != nil {
- 	logging.Warn(context.Background(), "failed to marshal Gemini message during chunking",
- 		slog.Int("message_index", i),
- 		slog.String("error", err.Error()),
- 	)
- 	continue
+ // Marshal message to get its size. If this fails, we fail the entire
+ // chunking operation rather than silently dropping the message, to
+ // avoid data loss and ensure reassembly can faithfully reconstruct
+ // the original transcript.
+ msgBytes, err := json.Marshal(msg)
+ if err != nil {
+ 	return nil, fmt.Errorf("failed to marshal Gemini message at index %d during chunking: %w", i, err)
```
Already addressed - there's a logging.Warn call here. We intentionally log and continue rather than failing, since marshal failures are persistent (won't resolve on retry) and blocking the user's workflow for a rare edge case isn't worth it.
```go
if transcriptContent, readErr := os.ReadFile(opts.TranscriptPath); readErr == nil {
	if blobHash, blobErr := CreateBlobFromContent(s.repo, transcriptContent); blobErr == nil {
		transcriptPath := sessionMetadataDir + "/" + paths.TranscriptFileName
		entries[transcriptPath] = object.TreeEntry{
			Name: transcriptPath,
			Mode: filemode.Regular,
			Hash: blobHash,

	// Detect agent type from content for proper chunking
	agentType := agent.DetectAgentTypeFromContent(transcriptContent)

	// Chunk if necessary
	chunks, chunkErr := agent.ChunkTranscript(transcriptContent, agentType)
	if chunkErr == nil {
		for i, chunk := range chunks {
			chunkPath := sessionMetadataDir + "/" + agent.ChunkFileName(paths.TranscriptFileName, i)
			if blobHash, blobErr := CreateBlobFromContent(s.repo, chunk); blobErr == nil {
				entries[chunkPath] = object.TreeEntry{
					Name: chunkPath,
					Mode: filemode.Regular,
					Hash: blobHash,
				}
			}
		}
	}
}
```
Chunking errors are silently ignored here. If transcript chunking fails, the transcript is simply omitted from the checkpoint without any indication to the user or logs. This could result in checkpoints being created without their transcripts, which would be confusing for users.
Consider logging an error when chunking fails, or propagating the error up if it's critical. The error handling pattern should match the project's guidelines - either log with context or return the error.
Added warning logging in 25fbbfd. We log the error with session context and continue - the checkpoint is saved without the transcript rather than failing entirely.
```go
// TranscriptChunker interface implementation

// ChunkTranscript splits a JSONL transcript at line boundaries.
// Claude Code uses JSONL format (one JSON object per line), so chunking
// is done at newline boundaries to preserve message integrity.
func (c *ClaudeCodeAgent) ChunkTranscript(content []byte, maxSize int) ([][]byte, error) {
	chunks, err := agent.ChunkJSONL(content, maxSize)
	if err != nil {
		return nil, fmt.Errorf("failed to chunk JSONL transcript: %w", err)
	}
	return chunks, nil
}

// ReassembleTranscript concatenates JSONL chunks with newlines.
//
//nolint:unparam // error return is required by interface, kept for consistency
func (c *ClaudeCodeAgent) ReassembleTranscript(chunks [][]byte) ([]byte, error) {
	return agent.ReassembleJSONL(chunks), nil
}
```
The ChunkTranscript and ReassembleTranscript implementations for ClaudeCodeAgent lack dedicated test coverage. While these methods delegate to shared JSONL functions that are tested in chunking_test.go, agent-specific tests would ensure the implementation correctly satisfies the TranscriptChunker interface and handles ClaudeCode-specific scenarios.
Consider adding tests similar to the comprehensive GeminiCLI tests (TestChunkTranscript_, TestReassembleTranscript_) in a new file like claude_chunking_test.go or in the existing transcript_test.go file.
Fair point, but since Claude Code's implementation just delegates to the shared ChunkJSONL/ReassembleJSONL functions which have comprehensive tests in chunking_test.go, we're comfortable with the current coverage. The interface contract is simple (passthrough to shared code).
```go
var chunks [][]byte
for _, chunkFile := range chunkFiles {
	file, err := tree.File(chunkFile)
	if err != nil {
		continue
	}
	content, err := file.Contents()
	if err != nil {
		continue
	}
	chunks = append(chunks, []byte(content))
}

if len(chunks) > 0 {
	result, err := agent.ReassembleTranscript(chunks, agentType)
	if err != nil {
		return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
	}
	return result, nil
}
```
Errors reading individual chunk files are silently ignored with continue statements. This means if any chunk file is missing or unreadable, the function proceeds with reassembling only the chunks it could successfully read, potentially returning a corrupted or incomplete transcript without any indication of the problem.
Consider collecting and returning errors when chunk files can't be read. If any chunk is missing or fails to read, the reassembly should fail with a clear error message indicating which chunk(s) couldn't be loaded. This is especially important since transcript integrity is critical for checkpoint restoration.
Suggested change:

```diff
- var chunks [][]byte
+ var (
+ 	chunks       [][]byte
+ 	failedChunks []string
+ )
  for _, chunkFile := range chunkFiles {
  	file, err := tree.File(chunkFile)
  	if err != nil {
+ 		failedChunks = append(failedChunks, fmt.Sprintf("%s (tree lookup failed: %v)", chunkFile, err))
  		continue
  	}
  	content, err := file.Contents()
  	if err != nil {
+ 		failedChunks = append(failedChunks, fmt.Sprintf("%s (read failed: %v)", chunkFile, err))
  		continue
  	}
  	chunks = append(chunks, []byte(content))
  }
- if len(chunks) > 0 {
- 	result, err := agent.ReassembleTranscript(chunks, agentType)
- 	if err != nil {
- 		return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
- 	}
- 	return result, nil
- }
+ if len(failedChunks) > 0 {
+ 	return nil, fmt.Errorf("failed to read transcript chunk file(s): %s", strings.Join(failedChunks, ", "))
+ }
+ if len(chunks) == 0 {
+ 	return nil, errors.New("no transcript chunk data read despite chunk files being present")
+ }
+ result, err := agent.ReassembleTranscript(chunks, agentType)
+ if err != nil {
+ 	return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
+ }
+ return result, nil
```
Added warning logging in 25fbbfd. We now log which chunk file failed and why, then continue with partial data. These are persistent failures (missing/corrupted git objects won't self-heal), so failing entirely would just block the user without benefit.
Add logging when:
- Transcript chunking fails during checkpoint write (continues without transcript)
- Chunk file read fails during transcript reassembly (continues with partial data)

These are persistent failures that won't resolve on retry, so we log and continue rather than blocking the user's workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```go
var chunks [][]byte
for _, chunkFile := range chunkFiles {
	file, err := tree.File(chunkFile)
	if err != nil {
		logging.Warn(context.Background(), "failed to read transcript chunk file from tree",
			slog.String("chunk_file", chunkFile),
			slog.String("error", err.Error()),
		)
		continue
	}
	content, err := file.Contents()
	if err != nil {
		logging.Warn(context.Background(), "failed to read transcript chunk contents",
			slog.String("chunk_file", chunkFile),
			slog.String("error", err.Error()),
		)
		continue
	}
	chunks = append(chunks, []byte(content))
}

if len(chunks) > 0 {
	result, err := agent.ReassembleTranscript(chunks, agentType)
	if err != nil {
		return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
	}
	return result, nil
}
```
When chunk files are detected but some fail to read (due to errors at lines 935-942 or 943-950), the function continues with partial chunks and reassembles them. This could result in a corrupted transcript that appears valid but is actually missing data.
Consider tracking whether any chunks failed to read and returning an error if any are missing, rather than silently assembling incomplete data. This is critical for data integrity since partial transcripts could be misleading.
We log warnings for failed chunks (added in 25fbbfd) but continue with partial data. These are persistent failures, and blocking entirely would just frustrate users. The logging provides visibility for debugging.
```go
func ChunkJSONL(content []byte, maxSize int) ([][]byte, error) {
	lines := strings.Split(string(content), "\n")
	var chunks [][]byte
	var currentChunk strings.Builder

	for i, line := range lines {
		// Check if adding this line would exceed the chunk size
		lineWithNewline := line + "\n"

		// Check if a single line exceeds maxSize - this is an error since we can't split JSONL lines
		if len(lineWithNewline) > maxSize {
			return nil, fmt.Errorf("JSONL line %d exceeds maximum chunk size (%d bytes > %d bytes); cannot split a single JSON object", i+1, len(lineWithNewline), maxSize)
		}

		if currentChunk.Len()+len(lineWithNewline) > maxSize && currentChunk.Len() > 0 {
			// Save current chunk and start a new one
			chunks = append(chunks, []byte(strings.TrimSuffix(currentChunk.String(), "\n")))
			currentChunk.Reset()
		}
		currentChunk.WriteString(lineWithNewline)
	}

	// Add the last chunk if it has content
	if currentChunk.Len() > 0 {
		chunks = append(chunks, []byte(strings.TrimSuffix(currentChunk.String(), "\n")))
	}

	return chunks, nil
}
```
The ChunkJSONL function doesn't handle empty content properly. If the input content is empty or contains only newlines, strings.Split will produce a slice with empty strings, and the function will still create a chunk containing those empty strings.
Consider adding a check at the beginning to handle empty content explicitly, returning a single empty chunk or an error depending on the desired behavior for this edge case.
Fixed in d1b15fb - added early return for empty content.
```go
	}
	chunks = append(chunks, chunkData)
}
```
If all messages fail to marshal during chunking (lines 388-396 where errors are logged and skipped), the function could return an empty chunks slice. This would result in losing the entire transcript without a clear error, as the function would successfully return an empty slice rather than failing with an error.
Consider checking after the loop if no chunks were created and returning an error in that case, to ensure the transcript isn't silently lost.
Suggested change:

```go
if len(chunks) == 0 {
	return nil, errors.New("failed to create any chunks: all messages failed to marshal")
}
```
Fixed in d1b15fb - now returns error if no chunks created.
```go
// Sort chunk files by index
chunkFiles = agent.SortChunkFiles(chunkFiles, paths.TranscriptFileName)

// Check if base file should be included as chunk 0
```
In the readTranscriptFromTree function, when the base file exists along with numbered chunk files (.001, .002, etc.), the base file is prepended to the chunk list as chunk 0. However, this logic assumes that chunked transcripts always start with the base file as the first chunk.
This assumption might not hold if future implementations change the chunking strategy, or if chunk 0 wasn't written (in which case only .001, .002, etc. exist). Consider documenting this assumption more clearly in comments, or adding validation to ensure the base file should indeed be treated as chunk 0 when numbered chunks are present.
Suggested change:

```diff
- // Check if base file should be included as chunk 0
+ // Check if base file should be included as chunk 0.
+ // NOTE: This assumes the current chunking convention where the unsuffixed
+ // transcript file (paths.TranscriptFileName) represents chunk 0 when
+ // numbered chunk files (e.g. full.jsonl.001, full.jsonl.002, ...) are present.
+ // If the chunking strategy changes (for example, if the base file is no longer
+ // part of the chunk sequence or chunk 0 is never written), this logic must be
+ // revisited so we do not incorrectly prepend the base file.
```
```go
	Name: chunkPath,
	Mode: filemode.Regular,
	Hash: blobHash,
}
```
When creating a blob from a chunk fails (line 338), the error is silently ignored and the chunk is simply not added to the entries. This means some chunks might be written while others are silently dropped, resulting in incomplete transcript data without any indication.
Consider logging a warning when blob creation fails for a chunk, similar to the pattern used for chunking errors at lines 331-334. This would help diagnose issues where transcripts are partially saved.
Suggested change:

```go
		}
	}
} else {
	logging.Warn(context.Background(), "failed to create blob from transcript chunk, chunk will be skipped",
		slog.String("error", blobErr.Error()),
		slog.String("session_id", opts.SessionID),
		slog.Int("chunk_index", i),
	)
```
Fixed in d1b15fb - now logs warning with chunk index when blob creation fails.
- Handle empty content in ChunkJSONL (return empty slice)
- Error if all Gemini messages fail to marshal (no silent empty result)
- Document chunk 0 assumption in readTranscriptFromTree
- Log blob creation failures for transcript chunks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 8fcc5c40b846
```go
if len(chunks) > 0 {
	result, err := agent.ReassembleTranscript(chunks, agentType)
	if err != nil {
		return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
	}
	return result, nil
```
When reading chunked transcripts from temporary checkpoints, an empty string is passed as the agent type to readTranscriptFromTree, which causes agent.ReassembleTranscript to fall back to JSONL reassembly (simple concatenation with newlines). For Gemini transcripts that were chunked into multiple JSON objects with separate message arrays, this will produce invalid JSON instead of properly merging the message arrays.
To fix this, detect the agent type from the first chunk before reassembling. You can call agent.DetectAgentTypeFromContent on chunks[0] after line 954, then pass the detected agent type to agent.ReassembleTranscript at line 957. This matches the pattern used during chunking at temporary.go line 326.
Regarding the chunked transcript reassembly comment at committed.go:961 - good catch. Fixed by wiring the agentType through from the caller. This is simpler than auto-detection and makes the data flow explicit.
… reassembly

The function was passing empty agentType to readTranscriptFromTree, which caused JSONL reassembly (simple concatenation) to be used for all transcripts. For Gemini transcripts chunked as separate JSON objects, this produced invalid JSON instead of properly merging the message arrays. The caller in rewind.go already had the agent available, so we now pass agent.Name() through to ensure correct format-specific reassembly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 04e5fdc0cadb
```diff
  // Get transcript from shadow branch commit tree
  store := checkpoint.NewGitStore(repo)
- content, err := store.GetTranscriptFromCommit(hash, metadataDir)
+ content, err := store.GetTranscriptFromCommit(hash, metadataDir, agent.Name())
```
The agent.Name() method returns the registry name (e.g., "claude-code", "gemini"), but GetTranscriptFromCommit expects an agent type like "Claude Code" or "Gemini CLI" to pass to agent.GetByAgentType().
This will cause chunked transcript reassembly to fail during rewind operations from shadow branches, because the agent lookup will fail and the function will fall back to JSONL reassembly even for Gemini transcripts (which use JSON format, not JSONL).
The agent type should be extracted from agent.Description() by taking everything before " - ", matching the pattern used elsewhere in the codebase (see hooks_geminicli_handlers.go:119-122).
Suggested change:

```diff
- content, err := store.GetTranscriptFromCommit(hash, metadataDir, agent.Name())
+ agentDesc := agent.Description()
+ agentType := agentDesc
+ if idx := strings.Index(agentDesc, " - "); idx != -1 {
+ 	agentType = agentDesc[:idx]
+ }
+ content, err := store.GetTranscriptFromCommit(hash, metadataDir, agentType)
```
Interim fix to support passing agent.Name() (registry name like "gemini") to functions that ultimately call GetByAgentType, which previously only accepted display names like "Gemini CLI". This will be superseded by a proper Agent.Type() method in a follow-up PR.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: b69e695e7b48
Re: the agent name format comment at rewind.go:782 - fixed by making the affected functions accept the registry name (agent.Name()) as well, per the interim-fix commit above.
Add support for transcript/log chunking.