Storing AI transcripts on a git branch is architecturally problematic for privacy #340

@tobihagemann

The problem

Entire stores full AI session transcripts (prompts, responses, tool calls, file paths, commands) on the entire/checkpoints/v1 branch within the same git repository, which means transcripts inherit the repo's access model.

For open source projects, this is especially concerning because the repo is public, so the transcripts are too. Anyone who fetches the branch can read the complete AI conversation history, including the developer's reasoning, mistakes, internal context, and potentially sensitive information that slipped past redaction.

Even for private repos, transcripts become visible to every collaborator with read access, and the trust boundary for "who can see the code" is not the same as "who should see the raw AI session history."

Why a flag doesn't solve this

--skip-push-sessions exists, but it's not even the default, so sessions are pushed to the remote on every git push unless you explicitly opt out. Even with the flag, the transcripts are still committed to a local branch that can be inadvertently pushed, forked, or included in mirrors. The fundamental issue is that coupling transcript storage to the git repo means the transcripts will always travel with the code.
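
To see whether transcripts have already left a machine, one can ask the remote whether it advertises the checkpoint branch. The sketch below is only an illustration: it shells out to the standard `git ls-remote --heads` command, and the helper names (`remote_has_branch`, `check_remote`) and the default remote name `origin` are assumptions, not part of Entire.

```python
import subprocess

# Branch name taken from the issue text above.
CHECKPOINT_BRANCH = "entire/checkpoints/v1"


def remote_has_branch(ls_remote_output: str, branch: str = CHECKPOINT_BRANCH) -> bool:
    """Return True if `git ls-remote --heads` output lists the given branch.

    Each output line has the form "<sha>\trefs/heads/<branch>".
    """
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == wanted:
            return True
    return False


def check_remote(remote: str = "origin") -> bool:
    """Hypothetical helper: query the remote and report whether the branch is there."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", remote, CHECKPOINT_BRANCH],
        capture_output=True,
        text=True,
        check=True,
    )
    return remote_has_branch(result.stdout)
```

If `check_remote()` returns True for a public repository, the transcripts on that branch are readable by anyone who can fetch it.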

The sensitivity of AI transcripts

AI coding transcripts are uniquely sensitive because they can contain:

  • Internal reasoning and architectural decision-making
  • Partial secrets or credentials that slip past entropy-based redaction
  • Context about proprietary systems shared in prompts
  • Debugging discussions that reveal security weaknesses
  • File paths and system information

These transcripts capture a complete record of how code was written, including the parts developers would never put in a commit message or PR description.
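
To illustrate why entropy-based redaction can miss credentials, here is a generic sketch of a Shannon-entropy check. It is not Entire's actual redaction code; the threshold value and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter

# Hypothetical cutoff in bits per character; real redactors may tune this differently.
ENTROPY_THRESHOLD = 4.0


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return sum(
        (n / total) * math.log2(total / n) for n in Counter(text).values()
    )


def flagged_by_entropy(token: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Return True if the token looks random enough to be treated as a secret."""
    return shannon_entropy(token) >= threshold


# A random-looking 32-character token is flagged...
print(flagged_by_entropy("q7Zx2LmP9vRt4KbN8wYc3HgJ5sDf6AuE"))
# ...but a human-chosen password is not, although it is just as sensitive.
print(flagged_by_entropy("Summer2024!"))
```

A string of n characters can have at most log2(n) bits of entropy per character, so with this threshold any secret shorter than 16 characters can never be flagged, no matter what it contains.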

Alternative approaches

Projects like AgentLogs decouple transcript storage from the repository entirely, which seems like a sounder architecture for this kind of data because the transcripts don't inherit the repo's access model and can be managed with their own access controls.

Has the team considered an architecture where transcripts are stored outside the git repo (e.g., a local database, a self-hostable server, or an optional remote backend), rather than on a git branch?
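
As a rough sketch of the "local database" option, transcripts could live in a SQLite file outside the working tree and be keyed by commit SHA, so nothing transcript-related is ever a git object. Everything here (the schema, the function names, the storage location) is hypothetical and only meant to make the idea concrete.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def open_store(db_path: Path) -> sqlite3.Connection:
    """Open (or create) a transcript store at a path outside the git repo."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcripts ("
        " commit_sha TEXT PRIMARY KEY,"
        " body TEXT NOT NULL)"
    )
    return conn


def save_transcript(conn: sqlite3.Connection, commit_sha: str, body: str) -> None:
    """Store the transcript for a commit, replacing any earlier version."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO transcripts (commit_sha, body) VALUES (?, ?)",
            (commit_sha, body),
        )


def load_transcript(conn: sqlite3.Connection, commit_sha: str) -> Optional[str]:
    """Return the transcript for a commit, or None if there is none."""
    row = conn.execute(
        "SELECT body FROM transcripts WHERE commit_sha = ?", (commit_sha,)
    ).fetchone()
    return row[0] if row else None
```

Because the link between code and transcript is just the commit SHA, `git push`, forks, and mirrors carry the code but not the session history, and access to the store can be controlled separately.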
