refactor: update checkpoint data structure #142

Merged
khaong merged 18 commits into main from gtrrz-victor/restructure-entire-session-files
Feb 5, 2026

@gtrrz-victor gtrrz-victor commented Feb 4, 2026

Summary

  • Restructure checkpoint storage to use numbered session subdirectories (0/, 1/, 2/)
  • Add CheckpointSummary as root-level metadata with aggregated statistics and session file paths
  • Store SessionFilePaths as absolute paths from git tree root for direct file access
  • Move session-specific metadata (including initial_attribution) to session subdirectories

Background

Previously, multi-session checkpoints archived older sessions to numbered folders while keeping the latest session's files at the root. This led to inconsistent path handling and made it difficult to track which files belonged to which session.

New Storage Structure

  <checkpoint-id[:2]>/<checkpoint-id[2:]>/
  ├── metadata.json         # CheckpointSummary (aggregated stats + session file paths)
  ├── 0/                    # First session
  │   ├── metadata.json     # CommittedMetadata (session-specific, includes initial_attribution)
  │   ├── full.jsonl
  │   ├── prompt.txt
  │   ├── context.md
  │   └── content_hash.txt
  ├── 1/                    # Second session
  │   └── ...
  └── 2/                    # Third session...
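
The sharded layout above can be illustrated with a small path helper. This is only a sketch: checkpointDir and sessionFile are hypothetical names that do not come from the PR, and the session index is passed in by the caller rather than derived from the store.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// checkpointDir returns the sharded directory for a checkpoint ID: the first
// two characters form the parent directory, the remainder the child.
// Hypothetical helper, not part of the PR.
func checkpointDir(id string) (string, error) {
	if len(id) < 3 {
		return "", fmt.Errorf("checkpoint id %q is too short to shard", id)
	}
	return path.Join(id[:2], id[2:]), nil
}

// sessionFile returns the path of a file inside a numbered session
// subdirectory, relative to the tree root. Hypothetical helper.
func sessionFile(id string, session int, name string) (string, error) {
	if session < 0 {
		return "", fmt.Errorf("session index %d must not be negative", session)
	}
	dir, err := checkpointDir(id)
	if err != nil {
		return "", err
	}
	return path.Join(dir, strconv.Itoa(session), name), nil
}

func main() {
	p, err := sessionFile("a1b2c3d4e5f6", 0, "full.jsonl")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // a1/b2c3d4e5f6/0/full.jsonl
}
```

The paths stored in the root metadata.json carry a leading "/" to mark them as absolute from the tree root; prepending it to the result above would give the same form.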

Root metadata.json (CheckpointSummary)

  {                                                                                                                                                                              
    "checkpoint_id": "a1b2c3d4e5f6",                                                                                                                                             
    "strategy": "manual-commit",                                                                                                                                                 
    "checkpoints_count": 5,                                                                                                                                                      
    "files_touched": ["file1.go", "file2.go"],                                                                                                                                   
    "sessions": [                                                                                                                                                                
      {                                                                                                                                                                          
        "metadata": "/a1/b2c3d4e5f6/0/metadata.json",                                                                                                                            
        "transcript": "/a1/b2c3d4e5f6/0/full.jsonl",                                                                                                                             
        "context": "/a1/b2c3d4e5f6/0/context.md",                                                                                                                                
        "prompt": "/a1/b2c3d4e5f6/0/prompt.txt",                                                                                                                                 
        "content_hash": "/a1/b2c3d4e5f6/0/content_hash.txt"                                                                                                                      
      }                                                                                                                                                                          
    ],                                                                                                                                                                           
    "token_usage": { ... }                                                                                                                                                       
  }          
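
For readers who want to consume this file, the JSON above could be decoded roughly as follows. The type names CheckpointSummary and SessionFilePaths come from the PR, but the Go field names and types here are assumptions inferred from the JSON keys, token_usage is kept as raw JSON because its shape is elided, and parseSummary and latestSession are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SessionFilePaths mirrors one entry of the "sessions" array.
// Field names are assumed from the JSON keys shown in the PR.
type SessionFilePaths struct {
	Metadata    string `json:"metadata"`
	Transcript  string `json:"transcript"`
	Context     string `json:"context"`
	Prompt      string `json:"prompt"`
	ContentHash string `json:"content_hash"`
}

// CheckpointSummary mirrors the root metadata.json (assumed field types).
type CheckpointSummary struct {
	CheckpointID     string             `json:"checkpoint_id"`
	Strategy         string             `json:"strategy"`
	CheckpointsCount int                `json:"checkpoints_count"`
	FilesTouched     []string           `json:"files_touched"`
	Sessions         []SessionFilePaths `json:"sessions"`
	// The shape of token_usage is not shown in the PR, so it stays raw.
	TokenUsage json.RawMessage `json:"token_usage,omitempty"`
}

// parseSummary decodes the root metadata.json (hypothetical helper).
func parseSummary(data []byte) (*CheckpointSummary, error) {
	var s CheckpointSummary
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("parse checkpoint summary: %w", err)
	}
	return &s, nil
}

// latestSession returns the last entry of the sessions array, assuming
// sessions are appended in order (an assumption, not confirmed by the PR).
func latestSession(s *CheckpointSummary) (SessionFilePaths, bool) {
	if len(s.Sessions) == 0 {
		return SessionFilePaths{}, false
	}
	return s.Sessions[len(s.Sessions)-1], true
}

func main() {
	raw := []byte(`{
	  "checkpoint_id": "a1b2c3d4e5f6",
	  "strategy": "manual-commit",
	  "checkpoints_count": 5,
	  "files_touched": ["file1.go", "file2.go"],
	  "sessions": [
	    {
	      "metadata": "/a1/b2c3d4e5f6/0/metadata.json",
	      "transcript": "/a1/b2c3d4e5f6/0/full.jsonl"
	    }
	  ]
	}`)
	s, err := parseSummary(raw)
	if err != nil {
		panic(err)
	}
	if last, ok := latestSession(s); ok {
		fmt.Println(last.Transcript) // /a1/b2c3d4e5f6/0/full.jsonl
	}
}
```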

Session metadata.json (CommittedMetadata)

Contains all session-specific fields including initial_attribution, session_id, agent, etc.


Note

Medium Risk
Touches core persistence and read paths for checkpoints across CLI commands and both strategies; migration/compatibility issues could break log/context retrieval if older checkpoint layouts or branch naming aren’t handled correctly.

Overview
Refactors committed checkpoint storage and APIs to support true multi-session checkpoints: each session’s transcript/prompts/context/metadata now lives under numbered subdirectories (0/, 1/, …), while the checkpoint root metadata.json becomes a new CheckpointSummary that aggregates stats (files touched, checkpoint count, token usage) and stores per-session absolute file paths.

Updates GitStore and its callers so that ReadCommitted returns only the summary, while new methods (ReadSessionContent, ReadLatestSessionContent, ReadSessionContentByID) retrieve the actual session content. Related CLI flows (explain, strategies, cleanup/listing) and tests are adjusted accordingly. The sessions metadata branch is versioned to entire/sessions/v1, and the obsolete multi-session archiving and agent-array logic, along with the session domain types, are removed.

Written by Cursor Bugbot for commit 1c77b2c. This will update automatically on new commits.

@gtrrz-victor gtrrz-victor requested a review from a team as a code owner February 4, 2026 04:36
Copilot AI review requested due to automatic review settings February 4, 2026 04:36
@gtrrz-victor gtrrz-victor force-pushed the gtrrz-victor/restructure-entire-session-files branch from 957fc8e to c3eedce Compare February 4, 2026 04:40

Copilot AI left a comment

Pull request overview

This PR restructures checkpoint storage to use numbered subdirectories for sessions with a root-level summary. The new format creates a clearer separation between aggregated checkpoint statistics and session-specific data.

Changes:

  • Introduces CheckpointSummary and SessionFilePaths data structures to support multi-session checkpoints with 1-based indexing (1/, 2/, 3/, etc.)
  • Refactors checkpoint storage: root metadata.json now contains CheckpointSummary with aggregated stats, while session-specific data (including InitialAttribution) moves to numbered subdirectories
  • Updates all checkpoint read/write operations, tests, and integration tests to work with the new storage format

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| cmd/entire/cli/checkpoint/checkpoint.go | Adds SessionFilePaths and CheckpointSummary types to support new hierarchical storage format |
| cmd/entire/cli/checkpoint/committed.go | Major refactor of write/read operations to use numbered subdirectories and CheckpointSummary at root |
| cmd/entire/cli/checkpoint/checkpoint_test.go | Comprehensive test additions for new format including single/multi-session scenarios |
| cmd/entire/cli/strategy/session.go | Updates checkpoint description reading to navigate new subdirectory structure |
| cmd/entire/cli/strategy/session_test.go | Updates test helpers to create checkpoints in new format with numbered subdirectories |
| cmd/entire/cli/strategy/common.go | Updates ListCheckpoints and ReadCheckpointMetadata to parse new CheckpointSummary format |
| cmd/entire/cli/strategy/manual_commit_logs.go | Updates GetSessionContext to read from sessions array in CheckpointSummary |
| cmd/entire/cli/strategy/manual_commit_hooks.go | Adds TODO about getStagedFiles behavior in post-commit hook |
| cmd/entire/cli/strategy/manual_commit_condensation.go | Refactors attribution calculation and adds TODOs about token usage gaps |
| cmd/entire/cli/strategy/manual_commit_test.go | Updates tests to read session metadata from new subdirectory structure |
| cmd/entire/cli/strategy/auto_commit.go | Updates task checkpoint transcript reading to use sessions array |
| cmd/entire/cli/integration_test/testenv.go | Adds helper functions for session file paths (SessionFilePath, SessionMetadataPath, CheckpointSummaryPath) |
| cmd/entire/cli/integration_test/manual_commit_workflow_test.go | Updates all test assertions to use new subdirectory paths, some changed to t.Logf |
| cmd/entire/cli/integration_test/attribution_test.go | Updates to read InitialAttribution from session subdirectory metadata |
| cmd/entire/cli/integration_test/gemini_concurrent_session_test.go | Updates multi-session test to verify new sessions array structure |
| cmd/entire/cli/integration_test/concurrent_session_warning_test.go | Updates concurrent session tests for new format |
| cmd/entire/cli/integration_test/auto_commit_checkpoint_fix_test.go | Updates to use SessionFilePath helper |
Comments suppressed due to low confidence (3)

cmd/entire/cli/strategy/manual_commit_condensation.go:283

  • This TODO suggests removing paths.TranscriptFileNameLegacy usage, but no timeline or reason is provided. If backward compatibility with legacy transcript files is no longer needed, this fallback should be removed. If it's still needed, the TODO should clarify when it can be safely removed.
	// TODO: remove paths.TranscriptFileNameLegacy usage ?
	var fullTranscript string
	if file, fileErr := tree.File(metadataDir + "/" + paths.TranscriptFileName); fileErr == nil {
		if content, contentErr := file.Contents(); contentErr == nil {
			fullTranscript = content
		}
	} else if file, fileErr := tree.File(metadataDir + "/" + paths.TranscriptFileNameLegacy); fileErr == nil {
		if content, contentErr := file.Contents(); contentErr == nil {
			fullTranscript = content
		}
	}

cmd/entire/cli/strategy/manual_commit_condensation.go:328

  • These TODO comments indicate incomplete token usage calculation functionality. The code currently only calculates token usage for Claude Code transcripts (using claudecode.CalculateTokenUsage) but is missing implementation for Gemini. Additionally, the second TODO suggests token usage should be calculated per checkpoint slice rather than for the full transcript. This could lead to inaccurate token usage reporting for multi-checkpoint sessions.
	// TODO: Missing Gemini token usage
	if len(data.Transcript) > 0 {
		// TODO: Calculate token usage per transcript slice (only checkpoint related)
		transcriptLines, err := claudecode.ParseTranscript(data.Transcript)
		if err == nil && len(transcriptLines) > 0 {
			data.TokenUsage = claudecode.CalculateTokenUsage(transcriptLines)
		}

cmd/entire/cli/checkpoint/committed.go:553

  • The archiveExistingSession function is no longer used in the new implementation - it's only called in one test (TestArchiveExistingSession_ChunkedTranscript). The new approach writes sessions directly to numbered subdirectories without needing to archive existing sessions. Consider removing this function and its test if the archival approach is no longer part of the design, or document why it's being kept.
// archiveExistingSession moves existing session files to a numbered subfolder.
// The subfolder number is based on the current session count (so first archived session goes to "1/").
func (s *GitStore) archiveExistingSession(basePath string, existingMetadata *CommittedMetadata, entries map[string]object.TreeEntry) {
	// Determine archive folder number
	sessionCount := existingMetadata.SessionCount
	if sessionCount == 0 {
		sessionCount = 1 // backwards compat
	}
	archivePath := fmt.Sprintf("%s%d/", basePath, sessionCount)

	// Files to archive (standard checkpoint files at basePath, excluding tasks/ subfolder)
	filesToArchive := []string{
		paths.MetadataFileName,
		paths.TranscriptFileName,
		paths.PromptFileName,
		paths.ContextFileName,
		paths.ContentHashFileName,
	}

	// Also include transcript chunk files (full.jsonl.001, full.jsonl.002, etc.)
	chunkPrefix := basePath + paths.TranscriptFileName + "."
	for srcPath := range entries {
		if strings.HasPrefix(srcPath, chunkPrefix) {
			chunkSuffix := strings.TrimPrefix(srcPath, basePath+paths.TranscriptFileName)
			if idx := agent.ParseChunkIndex(paths.TranscriptFileName+chunkSuffix, paths.TranscriptFileName); idx > 0 {
				filesToArchive = append(filesToArchive, paths.TranscriptFileName+chunkSuffix)
			}
		}
	}

	// Move each file to archive folder
	for _, filename := range filesToArchive {
		srcPath := basePath + filename
		if entry, exists := entries[srcPath]; exists {
			// Add to archive location
			dstPath := archivePath + filename
			entries[dstPath] = object.TreeEntry{
				Name: dstPath,
				Mode: entry.Mode,
				Hash: entry.Hash,
			}
			// Remove from original location (will be overwritten by new session)
			delete(entries, srcPath)
		}
	}
}

@gtrrz-victor gtrrz-victor force-pushed the gtrrz-victor/restructure-entire-session-files branch from c3eedce to 26b7d05 Compare February 4, 2026 05:18
@gtrrz-victor gtrrz-victor force-pushed the gtrrz-victor/restructure-entire-session-files branch from 9958ef2 to a3ea51e Compare February 4, 2026 05:58
@gtrrz-victor gtrrz-victor changed the title from "update checkpoint data structure" to "refactor: update checkpoint data structure" Feb 4, 2026
@gtrrz-victor gtrrz-victor force-pushed the gtrrz-victor/restructure-entire-session-files branch from a3ea51e to 5032f0e Compare February 4, 2026 06:05
@gtrrz-victor gtrrz-victor force-pushed the gtrrz-victor/restructure-entire-session-files branch from 3822c29 to 766731c Compare February 4, 2026 23:19
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

khaong and others added 5 commits February 5, 2026 11:47
staticcheck SA5011 flags "possible nil pointer dereference" after
t.Fatal() checks because it doesn't recognize t.Fatal as terminating.
Adding explicit return statements makes the control flow clear to the
analyzer without needing nolint directives.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add comprehensive usage documentation at top of script
- Add dry-run mode as default (use --apply to execute)
- Add ability to migrate a single checkpoint by passing ID as argument
- Add detailed format descriptions for old and new checkpoint structures
- Add examples and rollback instructions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: bd247e41d7dd
Entire-Checkpoint: be28ae5f33d2
Entire-Checkpoint: eb03327f76c6
- Add checkpoint_exists_on_target() to detect already-migrated checkpoints
- Skip checkpoints that already exist on target branch with valid v1 format
- Handle existing target branch gracefully (use it instead of failing)
- Update dry-run output to show migration status per checkpoint
- Document idempotency behavior in script header

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Entire-Checkpoint: 8cb061b3e545
khaong and others added 5 commits February 5, 2026 14:06
- Add migration_source_commit field to migrated checkpoint metadata
- Change skip check from existence to source commit comparison
- Re-migrate checkpoints when source commit differs (handles updates)
- Preserve original commit author using git commit --author
- Update dry-run output to show up-to-date status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds git_retry() helper that retries git commands up to 5 times with
exponential backoff (0.2s, 0.4s, 0.8s, 1.6s, 3.2s) to handle transient
index.lock file race conditions during migration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes from O(N×M) to O(M) complexity by using git diff-tree instead
of git ls-tree. Now only processes checkpoints actually modified in
each commit rather than scanning all checkpoints in the tree.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@khaong khaong enabled auto-merge February 5, 2026 04:13
@khaong khaong merged commit 2c8c128 into main Feb 5, 2026
4 checks passed
@khaong khaong deleted the gtrrz-victor/restructure-entire-session-files branch February 5, 2026 04:20