
add chunking to agent capabilities #90

Merged
khaong merged 11 commits into main from soph/chunking on Jan 23, 2026
Conversation

Soph (Collaborator) commented Jan 22, 2026

Add support for transcript/log chunking.

Soph requested a review from a team as a code owner on January 22, 2026 17:49
Soph force-pushed the soph/do-not-truncate branch from 0546bcf to 5c4c298 on January 22, 2026 19:38
Soph force-pushed the soph/do-not-truncate branch from 5c4c298 to ab8f1a4 on January 22, 2026 19:55
Base automatically changed from soph/do-not-truncate to main January 23, 2026 01:01
Soph added 4 commits January 23, 2026 12:06
Entire-Checkpoint: e012b8e855d7
Entire-Checkpoint: ac3f02a3c337
Entire-Checkpoint: a7f0375c8fb0
When archiving an existing session to make room for a new session at
the same checkpoint, also move transcript chunk files (full.jsonl.001,
full.jsonl.002, etc.) to the archive folder.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 0b0195f27fed
Copilot AI review requested due to automatic review settings January 23, 2026 01:21
khaong (Contributor) left a comment

note: added handling for chunks in the archive

👍

Entire-Checkpoint: 91220378d066
Copilot AI (Contributor) left a comment

Pull request overview

This pull request adds support for chunking large transcripts to stay under GitHub's 100MB blob limit. The implementation uses a 50MB chunk size as a safety margin and provides agent-specific chunking strategies (Gemini JSON format vs JSONL format for Claude Code).

Changes:

  • Introduces a TranscriptChunker interface that agents can implement to provide format-aware chunking
  • Implements chunking/reassembly logic for Gemini CLI (JSON message arrays) and Claude Code (JSONL line-based)
  • Integrates chunking into checkpoint write operations and reassembly into read operations
  • Adds comprehensive test coverage for chunking logic and archive operations
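The TranscriptChunker contract described above might look roughly like the following sketch. The method signatures are taken from the ClaudeCodeAgent implementation quoted later in this thread; the toy line-based implementation here is illustrative only, not the project's shared ChunkJSONL code.

```go
package main

import (
	"bytes"
	"fmt"
)

// TranscriptChunker sketches the interface described in the PR overview.
// Signatures mirror the ClaudeCodeAgent methods shown in the diff.
type TranscriptChunker interface {
	ChunkTranscript(content []byte, maxSize int) ([][]byte, error)
	ReassembleTranscript(chunks [][]byte) ([]byte, error)
}

// jsonlChunker is a toy line-boundary implementation for illustration.
type jsonlChunker struct{}

func (jsonlChunker) ChunkTranscript(content []byte, maxSize int) ([][]byte, error) {
	var chunks [][]byte
	var cur []byte
	for _, line := range bytes.SplitAfter(content, []byte("\n")) {
		// A single line larger than maxSize cannot be split without
		// corrupting the JSON object it contains.
		if len(line) > maxSize {
			return nil, fmt.Errorf("line exceeds max chunk size %d", maxSize)
		}
		if len(cur)+len(line) > maxSize && len(cur) > 0 {
			chunks = append(chunks, cur)
			cur = nil
		}
		cur = append(cur, line...)
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks, nil
}

func (jsonlChunker) ReassembleTranscript(chunks [][]byte) ([]byte, error) {
	// Chunks keep their trailing newlines, so concatenation restores the input.
	return bytes.Join(chunks, nil), nil
}

func main() {
	var c TranscriptChunker = jsonlChunker{}
	in := []byte("{\"a\":1}\n{\"b\":2}\n{\"c\":3}\n")
	chunks, err := c.ChunkTranscript(in, 10)
	if err != nil {
		panic(err)
	}
	out, _ := c.ReassembleTranscript(chunks)
	if !bytes.Equal(in, out) {
		panic("round trip failed")
	}
	fmt.Println(len(chunks), "chunks; round trip OK")
}
```

The key property the interface must guarantee is that chunking followed by reassembly reproduces the original transcript byte-for-byte.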

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cmd/entire/cli/agent/agent.go Adds TranscriptChunker interface definition for agents supporting chunking
cmd/entire/cli/agent/chunking.go Core chunking implementation with JSONL default fallback and agent detection
cmd/entire/cli/agent/chunking_test.go Comprehensive tests for chunking, reassembly, and utility functions
cmd/entire/cli/agent/geminicli/gemini.go Gemini-specific JSON chunking that splits message arrays while preserving structure
cmd/entire/cli/agent/geminicli/gemini_test.go Extensive tests for Gemini chunking edge cases and round-trip validation
cmd/entire/cli/agent/claudecode/claude.go Claude Code JSONL chunking using shared line-based chunking
cmd/entire/cli/checkpoint/committed.go Integrates chunking into write operations and reassembly with backward compatibility
cmd/entire/cli/checkpoint/temporary.go Adds chunking support for temporary/shadow branch checkpoints
cmd/entire/cli/checkpoint/checkpoint_test.go Tests archive operations with chunked transcripts

Comment on lines +388 to +395
	// Marshal message to get its size
	msgBytes, err := json.Marshal(msg)
	if err != nil {
		logging.Warn(context.Background(), "failed to marshal Gemini message during chunking",
			slog.Int("message_index", i),
			slog.String("error", err.Error()),
		)
		continue
Copilot AI commented Jan 23, 2026

Skipping unmarshalable messages with only a warning could result in data loss during chunking. If a message fails to marshal (which should be rare but could happen due to encoding issues or corrupt data), it silently disappears from the chunked output. This means the reassembled transcript won't match the original.

Consider returning an error instead of logging and continuing. If there's a specific reason to skip malformed messages, document that reasoning in a comment and ensure it's intentional data sanitization rather than silent data loss.

Suggested change
-	// Marshal message to get its size
-	msgBytes, err := json.Marshal(msg)
-	if err != nil {
-		logging.Warn(context.Background(), "failed to marshal Gemini message during chunking",
-			slog.Int("message_index", i),
-			slog.String("error", err.Error()),
-		)
-		continue
+	// Marshal message to get its size. If this fails, we fail the entire
+	// chunking operation rather than silently dropping the message, to
+	// avoid data loss and ensure reassembly can faithfully reconstruct
+	// the original transcript.
+	msgBytes, err := json.Marshal(msg)
+	if err != nil {
+		return nil, fmt.Errorf("failed to marshal Gemini message at index %d during chunking: %w", i, err)

Contributor reply:

Already addressed - there's a logging.Warn call here. We intentionally log and continue rather than failing, since marshal failures are persistent (won't resolve on retry) and blocking the user's workflow for a rare edge case isn't worth it.

Comment on lines 322 to 340
 if transcriptContent, readErr := os.ReadFile(opts.TranscriptPath); readErr == nil {
-	if blobHash, blobErr := CreateBlobFromContent(s.repo, transcriptContent); blobErr == nil {
-		transcriptPath := sessionMetadataDir + "/" + paths.TranscriptFileName
-		entries[transcriptPath] = object.TreeEntry{
-			Name: transcriptPath,
-			Mode: filemode.Regular,
-			Hash: blobHash,
+	// Detect agent type from content for proper chunking
+	agentType := agent.DetectAgentTypeFromContent(transcriptContent)
+
+	// Chunk if necessary
+	chunks, chunkErr := agent.ChunkTranscript(transcriptContent, agentType)
+	if chunkErr == nil {
+		for i, chunk := range chunks {
+			chunkPath := sessionMetadataDir + "/" + agent.ChunkFileName(paths.TranscriptFileName, i)
+			if blobHash, blobErr := CreateBlobFromContent(s.repo, chunk); blobErr == nil {
+				entries[chunkPath] = object.TreeEntry{
+					Name: chunkPath,
+					Mode: filemode.Regular,
+					Hash: blobHash,
+				}
+			}
+		}
+	}
 }
Copilot AI commented Jan 23, 2026

Chunking errors are silently ignored here. If transcript chunking fails, the transcript is simply omitted from the checkpoint without any indication to the user or logs. This could result in checkpoints being created without their transcripts, which would be confusing for users.

Consider logging an error when chunking fails, or propagating the error up if it's critical. The error handling pattern should match the project's guidelines - either log with context or return the error.

Contributor reply:

Added warning logging in 25fbbfd. We log the error with session context and continue - the checkpoint is saved without the transcript rather than failing entirely.

Comment on lines +416 to +435

// TranscriptChunker interface implementation

// ChunkTranscript splits a JSONL transcript at line boundaries.
// Claude Code uses JSONL format (one JSON object per line), so chunking
// is done at newline boundaries to preserve message integrity.
func (c *ClaudeCodeAgent) ChunkTranscript(content []byte, maxSize int) ([][]byte, error) {
	chunks, err := agent.ChunkJSONL(content, maxSize)
	if err != nil {
		return nil, fmt.Errorf("failed to chunk JSONL transcript: %w", err)
	}
	return chunks, nil
}

// ReassembleTranscript concatenates JSONL chunks with newlines.
//
//nolint:unparam // error return is required by interface, kept for consistency
func (c *ClaudeCodeAgent) ReassembleTranscript(chunks [][]byte) ([]byte, error) {
	return agent.ReassembleJSONL(chunks), nil
}
Copilot AI commented Jan 23, 2026

The ChunkTranscript and ReassembleTranscript implementations for ClaudeCodeAgent lack dedicated test coverage. While these methods delegate to shared JSONL functions that are tested in chunking_test.go, agent-specific tests would ensure the implementation correctly satisfies the TranscriptChunker interface and handles ClaudeCode-specific scenarios.

Consider adding tests similar to the comprehensive GeminiCLI tests (TestChunkTranscript_, TestReassembleTranscript_) in a new file like claude_chunking_test.go or in the existing transcript_test.go file.

Contributor reply:

Fair point, but since Claude Code's implementation just delegates to the shared ChunkJSONL/ReassembleJSONL functions which have comprehensive tests in chunking_test.go, we're comfortable with the current coverage. The interface contract is simple (passthrough to shared code).

Comment on lines +931 to +950
	var chunks [][]byte
	for _, chunkFile := range chunkFiles {
		file, err := tree.File(chunkFile)
		if err != nil {
			continue
		}
		content, err := file.Contents()
		if err != nil {
			continue
		}
		chunks = append(chunks, []byte(content))
	}

	if len(chunks) > 0 {
		result, err := agent.ReassembleTranscript(chunks, agentType)
		if err != nil {
			return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
		}
		return result, nil
	}
Copilot AI commented Jan 23, 2026

Errors reading individual chunk files are silently ignored with continue statements. This means if any chunk file is missing or unreadable, the function proceeds with reassembling only the chunks it could successfully read, potentially returning a corrupted or incomplete transcript without any indication of the problem.

Consider collecting and returning errors when chunk files can't be read. If any chunk is missing or fails to read, the reassembly should fail with a clear error message indicating which chunk(s) couldn't be loaded. This is especially important since transcript integrity is critical for checkpoint restoration.

Suggested change
-	var chunks [][]byte
-	for _, chunkFile := range chunkFiles {
-		file, err := tree.File(chunkFile)
-		if err != nil {
-			continue
-		}
-		content, err := file.Contents()
-		if err != nil {
-			continue
-		}
-		chunks = append(chunks, []byte(content))
-	}
-	if len(chunks) > 0 {
-		result, err := agent.ReassembleTranscript(chunks, agentType)
-		if err != nil {
-			return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
-		}
-		return result, nil
-	}
+	var (
+		chunks       [][]byte
+		failedChunks []string
+	)
+	for _, chunkFile := range chunkFiles {
+		file, err := tree.File(chunkFile)
+		if err != nil {
+			failedChunks = append(failedChunks, fmt.Sprintf("%s (tree lookup failed: %v)", chunkFile, err))
+			continue
+		}
+		content, err := file.Contents()
+		if err != nil {
+			failedChunks = append(failedChunks, fmt.Sprintf("%s (read failed: %v)", chunkFile, err))
+			continue
+		}
+		chunks = append(chunks, []byte(content))
+	}
+	if len(failedChunks) > 0 {
+		return nil, fmt.Errorf("failed to read transcript chunk file(s): %s", strings.Join(failedChunks, ", "))
+	}
+	if len(chunks) == 0 {
+		return nil, errors.New("no transcript chunk data read despite chunk files being present")
+	}
+	result, err := agent.ReassembleTranscript(chunks, agentType)
+	if err != nil {
+		return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
+	}
+	return result, nil

Contributor reply:

Added warning logging in 25fbbfd. We now log which chunk file failed and why, then continue with partial data. These are persistent failures (missing/corrupted git objects won't self-heal), so failing entirely would just block the user without benefit.

Add logging when:
- Transcript chunking fails during checkpoint write (continues without transcript)
- Chunk file read fails during transcript reassembly (continues with partial data)

These are persistent failures that won't resolve on retry, so we log
and continue rather than blocking the user's workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Comment on lines +933 to +960
	var chunks [][]byte
	for _, chunkFile := range chunkFiles {
		file, err := tree.File(chunkFile)
		if err != nil {
			logging.Warn(context.Background(), "failed to read transcript chunk file from tree",
				slog.String("chunk_file", chunkFile),
				slog.String("error", err.Error()),
			)
			continue
		}
		content, err := file.Contents()
		if err != nil {
			logging.Warn(context.Background(), "failed to read transcript chunk contents",
				slog.String("chunk_file", chunkFile),
				slog.String("error", err.Error()),
			)
			continue
		}
		chunks = append(chunks, []byte(content))
	}

	if len(chunks) > 0 {
		result, err := agent.ReassembleTranscript(chunks, agentType)
		if err != nil {
			return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
		}
		return result, nil
	}
Copilot AI commented Jan 23, 2026

When chunk files are detected but some fail to read (due to errors at lines 935-942 or 943-950), the function continues with partial chunks and reassembles them. This could result in a corrupted transcript that appears valid but is actually missing data.

Consider tracking whether any chunks failed to read and returning an error if any are missing, rather than silently assembling incomplete data. This is critical for data integrity since partial transcripts could be misleading.

Contributor reply:

We log warnings for failed chunks (added in 25fbbfd) but continue with partial data. These are persistent failures, and blocking entirely would just frustrate users. The logging provides visibility for debugging.

Comment on lines +76 to +103
func ChunkJSONL(content []byte, maxSize int) ([][]byte, error) {
	lines := strings.Split(string(content), "\n")
	var chunks [][]byte
	var currentChunk strings.Builder

	for i, line := range lines {
		// Check if adding this line would exceed the chunk size
		lineWithNewline := line + "\n"

		// Check if a single line exceeds maxSize - this is an error since we can't split JSONL lines
		if len(lineWithNewline) > maxSize {
			return nil, fmt.Errorf("JSONL line %d exceeds maximum chunk size (%d bytes > %d bytes); cannot split a single JSON object", i+1, len(lineWithNewline), maxSize)
		}

		if currentChunk.Len()+len(lineWithNewline) > maxSize && currentChunk.Len() > 0 {
			// Save current chunk and start a new one
			chunks = append(chunks, []byte(strings.TrimSuffix(currentChunk.String(), "\n")))
			currentChunk.Reset()
		}
		currentChunk.WriteString(lineWithNewline)
	}

	// Add the last chunk if it has content
	if currentChunk.Len() > 0 {
		chunks = append(chunks, []byte(strings.TrimSuffix(currentChunk.String(), "\n")))
	}

	return chunks, nil
Copilot AI commented Jan 23, 2026

The ChunkJSONL function doesn't handle empty content properly. If the input content is empty or contains only newlines, strings.Split will produce a slice with empty strings, and the function will still create a chunk containing those empty strings.

Consider adding a check at the beginning to handle empty content explicitly, returning a single empty chunk or an error depending on the desired behavior for this edge case.

Contributor reply:

Fixed in d1b15fb - added early return for empty content.
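The fix presumably amounts to a guard before splitting. Below is a runnable sketch of ChunkJSONL with that early return, reconstructed from the implementation quoted above; the exact code in commit d1b15fb may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ChunkJSONL splits JSONL content at line boundaries so no chunk exceeds
// maxSize. It mirrors the implementation quoted in the review, with the
// empty-content guard described in the fix added up front.
func ChunkJSONL(content []byte, maxSize int) ([][]byte, error) {
	// Early return for empty content: no chunks, no error.
	if len(content) == 0 {
		return nil, nil
	}
	lines := strings.Split(string(content), "\n")
	var chunks [][]byte
	var current strings.Builder
	for i, line := range lines {
		withNL := line + "\n"
		// A single oversized line is an error: a JSON object cannot be split.
		if len(withNL) > maxSize {
			return nil, fmt.Errorf("JSONL line %d exceeds maximum chunk size (%d > %d bytes)", i+1, len(withNL), maxSize)
		}
		if current.Len()+len(withNL) > maxSize && current.Len() > 0 {
			// Flush the current chunk (without its trailing newline) and start fresh.
			chunks = append(chunks, []byte(strings.TrimSuffix(current.String(), "\n")))
			current.Reset()
		}
		current.WriteString(withNL)
	}
	if current.Len() > 0 {
		chunks = append(chunks, []byte(strings.TrimSuffix(current.String(), "\n")))
	}
	return chunks, nil
}

func main() {
	chunks, err := ChunkJSONL([]byte("aaa\nbbb\nccc"), 8)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d chunks: %q\n", len(chunks), chunks)

	empty, _ := ChunkJSONL(nil, 8)
	fmt.Println("empty input ->", len(empty), "chunks")
}
```

With a max size of 8 bytes, "aaa\nbbb\nccc" splits into two chunks ("aaa\nbbb" and "ccc"), and empty input now yields zero chunks instead of one chunk containing an empty string.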

}
chunks = append(chunks, chunkData)
}

Copilot AI commented Jan 23, 2026

If all messages fail to marshal during chunking (lines 388-396 where errors are logged and skipped), the function could return an empty chunks slice. This would result in losing the entire transcript without a clear error, as the function would successfully return an empty slice rather than failing with an error.

Consider checking after the loop if no chunks were created and returning an error in that case, to ensure the transcript isn't silently lost.

Suggested change
+	if len(chunks) == 0 {
+		return nil, errors.New("failed to create any chunks: all messages failed to marshal")
+	}

Contributor reply:

Fixed in d1b15fb - now returns error if no chunks created.

// Sort chunk files by index
chunkFiles = agent.SortChunkFiles(chunkFiles, paths.TranscriptFileName)

// Check if base file should be included as chunk 0
Copilot AI commented Jan 23, 2026

In the readTranscriptFromTree function, when the base file exists along with numbered chunk files (.001, .002, etc.), the base file is prepended to the chunk list as chunk 0. However, this logic assumes that chunked transcripts always start with the base file as the first chunk.

This assumption might not hold if future implementations change the chunking strategy, or if chunk 0 wasn't written (in which case only .001, .002, etc. exist). Consider documenting this assumption more clearly in comments, or adding validation to ensure the base file should indeed be treated as chunk 0 when numbered chunks are present.

Suggested change
-	// Check if base file should be included as chunk 0
+	// Check if base file should be included as chunk 0.
+	// NOTE: This assumes the current chunking convention where the unsuffixed
+	// transcript file (paths.TranscriptFileName) represents chunk 0 when
+	// numbered chunk files (e.g. full.jsonl.001, full.jsonl.002, ...) are present.
+	// If the chunking strategy changes (for example, if the base file is no longer
+	// part of the chunk sequence or chunk 0 is never written), this logic must be
+	// revisited so we do not incorrectly prepend the base file.

Contributor reply:

Added documenting comment in d1b15fb.
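SortChunkFiles itself isn't shown in this thread; a plausible sketch of ordering names like full.jsonl.001, full.jsonl.002 by numeric suffix (function and helper names here are hypothetical, not the project's actual code):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// chunkIndex extracts the numeric suffix from names like "full.jsonl.002".
// A name without a numeric suffix (the base file) is treated as chunk 0.
func chunkIndex(name string) int {
	dot := strings.LastIndex(name, ".")
	if dot < 0 {
		return 0
	}
	n, err := strconv.Atoi(name[dot+1:])
	if err != nil {
		return 0
	}
	return n
}

// sortChunkFiles orders chunk files by index so reassembly reads them in sequence.
func sortChunkFiles(files []string) []string {
	sort.Slice(files, func(i, j int) bool {
		return chunkIndex(files[i]) < chunkIndex(files[j])
	})
	return files
}

func main() {
	files := []string{"full.jsonl.010", "full.jsonl.002", "full.jsonl.001"}
	fmt.Println(sortChunkFiles(files))
}
```

Plain lexicographic sorting would also work while suffixes stay zero-padded to three digits; parsing the index keeps ordering correct even past chunk 999.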

Name: chunkPath,
Mode: filemode.Regular,
Hash: blobHash,
}
Copilot AI commented Jan 23, 2026

When creating a blob from a chunk fails (line 338), the error is silently ignored and the chunk is simply not added to the entries. This means some chunks might be written while others are silently dropped, resulting in incomplete transcript data without any indication.

Consider logging a warning when blob creation fails for a chunk, similar to the pattern used for chunking errors at lines 331-334. This would help diagnose issues where transcripts are partially saved.

Suggested change
-			}
-		}
+			}
+		} else {
+			logging.Warn(context.Background(), "failed to create blob from transcript chunk, chunk will be skipped",
+				slog.String("error", blobErr.Error()),
+				slog.String("session_id", opts.SessionID),
+				slog.Int("chunk_index", i),
+			)

Contributor reply:

Fixed in d1b15fb - now logs warning with chunk index when blob creation fails.

- Handle empty content in ChunkJSONL (return empty slice)
- Error if all Gemini messages fail to marshal (no silent empty result)
- Document chunk 0 assumption in readTranscriptFromTree
- Log blob creation failures for transcript chunks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 8fcc5c40b846
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment on lines +956 to +961
if len(chunks) > 0 {
result, err := agent.ReassembleTranscript(chunks, agentType)
if err != nil {
return nil, fmt.Errorf("failed to reassemble transcript: %w", err)
}
return result, nil
Copilot AI commented Jan 23, 2026

When reading chunked transcripts from temporary checkpoints, an empty string is passed as the agent type to readTranscriptFromTree, which causes agent.ReassembleTranscript to fall back to JSONL reassembly (simple concatenation with newlines). For Gemini transcripts that were chunked into multiple JSON objects with separate message arrays, this will produce invalid JSON instead of properly merging the message arrays.

To fix this, detect the agent type from the first chunk before reassembling. You can call agent.DetectAgentTypeFromContent on chunks[0] after line 954, then pass the detected agent type to agent.ReassembleTranscript at line 957. This matches the pattern used during chunking at temporary.go line 326.
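Format-aware reassembly matters here because Gemini chunks are standalone JSON documents whose message arrays must be merged, not byte-concatenated. A minimal sketch of that merge (the `messages` field name is an assumption about the Gemini session format; the real transcript may carry additional top-level fields that reassembly would need to preserve):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// geminiChunk models only the part of a Gemini transcript chunk we merge here.
type geminiChunk struct {
	Messages []json.RawMessage `json:"messages"`
}

// mergeGeminiChunks unmarshals each chunk and appends its messages, producing
// one valid JSON document instead of the invalid concatenation JSONL-style
// reassembly would yield.
func mergeGeminiChunks(chunks [][]byte) ([]byte, error) {
	var merged geminiChunk
	for i, c := range chunks {
		var t geminiChunk
		if err := json.Unmarshal(c, &t); err != nil {
			return nil, fmt.Errorf("chunk %d is not valid JSON: %w", i, err)
		}
		merged.Messages = append(merged.Messages, t.Messages...)
	}
	return json.Marshal(merged)
}

func main() {
	chunks := [][]byte{
		[]byte(`{"messages":[{"role":"user"},{"role":"model"}]}`),
		[]byte(`{"messages":[{"role":"user"}]}`),
	}
	out, err := mergeGeminiChunks(chunks)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Concatenating the two example chunks directly would produce two adjacent JSON objects (invalid as a single document), while the merge yields one object with all three messages.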

khaong (Contributor) commented Jan 23, 2026

Regarding the chunked transcript reassembly comment at committed.go:961 - good catch. Fixed by wiring the agentType through from the caller:

  • GetTranscriptFromCommit now takes an agentType parameter
  • rewind.go passes agent.Name() which it already had available

This is simpler than auto-detection and makes the data flow explicit.

… reassembly

The function was passing empty agentType to readTranscriptFromTree, which
caused JSONL reassembly (simple concatenation) to be used for all transcripts.
For Gemini transcripts chunked as separate JSON objects, this produced invalid
JSON instead of properly merging the message arrays.

The caller in rewind.go already had the agent available, so we now pass
agent.Name() through to ensure correct format-specific reassembly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 04e5fdc0cadb
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

 	// Get transcript from shadow branch commit tree
 	store := checkpoint.NewGitStore(repo)
-	content, err := store.GetTranscriptFromCommit(hash, metadataDir)
+	content, err := store.GetTranscriptFromCommit(hash, metadataDir, agent.Name())
Copilot AI commented Jan 23, 2026

The agent.Name() method returns the registry name (e.g., "claude-code", "gemini"), but GetTranscriptFromCommit expects an agent type like "Claude Code" or "Gemini CLI" to pass to agent.GetByAgentType().

This will cause chunked transcript reassembly to fail during rewind operations from shadow branches, because the agent lookup will fail and the function will fall back to JSONL reassembly even for Gemini transcripts (which use JSON format, not JSONL).

The agent type should be extracted from agent.Description() by taking everything before " - ", matching the pattern used elsewhere in the codebase (see hooks_geminicli_handlers.go:119-122).

Suggested change
-	content, err := store.GetTranscriptFromCommit(hash, metadataDir, agent.Name())
+	agentDesc := agent.Description()
+	agentType := agentDesc
+	if idx := strings.Index(agentDesc, " - "); idx != -1 {
+		agentType = agentDesc[:idx]
+	}
+	content, err := store.GetTranscriptFromCommit(hash, metadataDir, agentType)

Interim fix to support passing agent.Name() (registry name like "gemini")
to functions that ultimately call GetByAgentType, which previously only
accepted display names like "Gemini CLI".

This will be superseded by a proper Agent.Type() method in a follow-up PR.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: b69e695e7b48
khaong (Contributor) commented Jan 23, 2026

Re: the agent name format comment at rewind.go:782 - fixed by making GetByAgentType accept both display names ("Gemini CLI") and registry names ("gemini"). This is an interim fix; we'll add a proper Agent.Type() method in the follow-up PR (#91) to make the mapping explicit at the agent interface level.

@khaong khaong merged commit 0ba3fda into main Jan 23, 2026
3 checks passed
@khaong khaong deleted the soph/chunking branch January 23, 2026 04:02