Go-Git: Version Control from Scratch

A Git implementation built from first principles in Go to understand how distributed version control actually works. No libraries for Git operations - just hash-based storage, tree structures, and commit graphs.

Core features: Content-addressable storage, staging area, commit history, tree-based snapshots

Learning-focused, not production-ready

Quick Start

git clone https://github.com/codetesla51/go-git.git
cd go-git
./install.sh

# Or build manually:
go build -buildvcs=false -o go-git
ln -s "$(pwd)/go-git" ~/.local/bin/go-git

Try it:

mkdir my-project
cd my-project
go-git init
go-git config  # Set your name and email

echo "Hello World" > README.md
go-git add README.md
go-git commit -m "Initial commit"
go-git log

Features

Content-Addressable Storage - Files stored by SHA-256 hash (deduplication built-in)
Staging Area (Index) - Git's "three-tree" architecture with staging layer
Tree Objects - Directory structure stored as nested tree objects
Commit History - Full commit chain with parent references
Compression - zlib compression for all objects (blobs, trees, commits)
Config Management - User name/email stored in .git/config

Commands

go-git init                    # Initialize repository
go-git config                  # Set user name and email
go-git add <files...>          # Stage files (supports directories)
go-git commit -m "message"     # Create commit
go-git log                     # View commit history
go-git help                    # Show available commands

Command Examples

Initialize Repository

go-git init

Creates .git/objects/ for content storage, .git/refs/heads/ for branch pointers, and a HEAD file pointing to the main branch.

Configure User Identity

go-git config

Prompts for user name and email, then writes them to .git/config. Every commit includes this information in the author field.

View Commit History

go-git log

Resolves HEAD → refs/heads/main to get the latest commit hash, loads that commit object, and prints its details. It then follows the parent pointer and repeats until it reaches a commit with no parent.

Get Help

go-git help

Displays CLI help menu with all available commands and their descriptions.

How It Works

1. Content-Addressable Storage

Every object (file, directory, commit) is stored by its SHA-256 hash:

.git/objects/ab/c123def456...  ← File content stored here
                ↑↑  ↑↑↑↑↑↑↑
                │   └─ Rest of hash (filename)
                └───── First 2 chars (subdirectory)

Why this matters: Same content = same hash = automatic deduplication. Store an identical README.md 100 times and it uses disk space once.
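A minimal sketch of the idea, not the project's actual code. The names hashHex, objectPath, and put are hypothetical, the store map is an in-memory stand-in for .git/objects, and the sketch hashes raw bytes without the object header or compression described later.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// hashHex returns the SHA-256 digest of data as a 64-character hex string.
func hashHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// objectPath maps a hex hash to its location under .git/objects:
// the first two characters name the subdirectory, the rest the file.
func objectPath(hash string) string {
	return filepath.Join(".git", "objects", hash[:2], hash[2:])
}

// put stores data under its hash and reports whether it was new.
// store is a hypothetical in-memory stand-in for the objects directory.
func put(store map[string][]byte, data []byte) (path string, isNew bool) {
	path = objectPath(hashHex(data))
	if _, exists := store[path]; exists {
		return path, false
	}
	store[path] = data
	return path, true
}

func main() {
	store := map[string][]byte{}
	for i := 0; i < 100; i++ {
		put(store, []byte("Hello World\n"))
	}
	// Identical content always maps to the same path, so only one entry exists.
	fmt.Println("objects stored:", len(store))
}
```

Because the path is derived from the content alone, deduplication needs no extra bookkeeping: the second write simply finds the object already there.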

2. The Three Trees

Git tracks files through three conceptual "trees":

Working Directory  →  Staging Area (Index)  →  Repository (.git/objects)
     (your files)         (.git/index)              (committed history)

go-git add moves files from working directory → staging area
go-git commit snapshots staging area → repository

3. Tree Objects (The Hard Part)

Files are stored as blobs. Directories are stored as trees.

Example structure:

project/
  README.md
  src/
    main.go
    lib/
      helper.go

Git stores this as:

Root Tree
├─ blob: README.md (hash: abc123)
└─ tree: src/ (hash: def456)
      ├─ blob: main.go (hash: ghi789)
      └─ tree: lib/ (hash: jkl012)
            └─ blob: helper.go (hash: mno345)

The trick: Trees must be built bottom-up (deepest first) because parent trees need their children's hashes.

Building order:

Build src/lib/ tree (contains helper.go) → get hash jkl012
Build src/ tree (contains main.go + lib/ tree with hash jkl012) → get hash def456
Build root tree (contains README.md + src/ tree with hash def456) → done

This was the hardest part to implement. Understanding why trees reference other trees (not their contents) took multiple attempts.
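The building order above can be sketched as follows. This is an illustration under assumptions, not the code in trees.go: the entry type and the functions hashObject, hashTree, depth, and buildTrees are hypothetical, entries are sorted by name only to make hashes deterministic, and paths are assumed to be relative and slash-separated.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
	"sort"
	"strings"
)

// entry is one line of a tree: a blob (file) or a subtree (directory).
type entry struct{ mode, name, hash string }

// hashObject hashes "<kind> <size>\0<body>" and returns the hex digest.
func hashObject(kind string, body []byte) string {
	header := fmt.Sprintf("%s %d\x00", kind, len(body))
	sum := sha256.Sum256(append([]byte(header), body...))
	return hex.EncodeToString(sum[:])
}

// hashTree serializes entries as "<mode> <name>\0<raw hash>" and hashes them.
func hashTree(entries []entry) (string, error) {
	sort.Slice(entries, func(i, j int) bool { return entries[i].name < entries[j].name })
	var body bytes.Buffer
	for _, e := range entries {
		raw, err := hex.DecodeString(e.hash)
		if err != nil {
			return "", fmt.Errorf("entry %q: %w", e.name, err)
		}
		fmt.Fprintf(&body, "%s %s\x00", e.mode, e.name)
		body.Write(raw)
	}
	return hashObject("tree", body.Bytes()), nil
}

// depth counts path segments; the root "." has depth 0.
func depth(dir string) int {
	if dir == "." {
		return 0
	}
	return strings.Count(dir, "/") + 1
}

// buildTrees maps relative file paths to blob hashes and returns the
// tree hash of every directory, with "." as the root.
func buildTrees(files map[string]string) (map[string]string, error) {
	children := map[string][]entry{".": nil}
	for p, h := range files {
		dir := path.Dir(p)
		children[dir] = append(children[dir], entry{"100644", path.Base(p), h})
		// Register every ancestor so empty intermediate levels get a tree too.
		for d := dir; d != "." && d != "/"; d = path.Dir(d) {
			if _, ok := children[d]; !ok {
				children[d] = nil
			}
		}
	}
	dirs := make([]string, 0, len(children))
	for d := range children {
		dirs = append(dirs, d)
	}
	// Deepest first: a child tree is always hashed before its parent.
	sort.Slice(dirs, func(i, j int) bool { return depth(dirs[i]) > depth(dirs[j]) })

	trees := make(map[string]string, len(dirs))
	for _, d := range dirs {
		h, err := hashTree(children[d])
		if err != nil {
			return nil, err
		}
		trees[d] = h
		if d != "." {
			parent := path.Dir(d)
			children[parent] = append(children[parent], entry{"040000", path.Base(d), h})
		}
	}
	return trees, nil
}

func main() {
	files := map[string]string{
		"README.md":         hashObject("blob", []byte("readme")),
		"src/main.go":       hashObject("blob", []byte("main")),
		"src/lib/helper.go": hashObject("blob", []byte("helper")),
	}
	trees, err := buildTrees(files)
	if err != nil {
		panic(err)
	}
	for _, d := range []string{"src/lib", "src", "."} {
		fmt.Printf("%-8s %s\n", d, trees[d][:12])
	}
}
```

Changing helper.go changes the hash of src/lib, which changes src, which changes the root. That ripple is why the order cannot be reversed.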

4. Commits

A commit is just a pointer to a tree + metadata:

commit <size>\0
tree abc123def456...           ← Snapshot of entire project
parent 789xyz...               ← Previous commit (forms history chain)
author Uthman <email> timestamp
committer Uthman <email> timestamp

Initial commit

Commits form a directed acyclic graph (DAG):

C3 (HEAD) → C2 → C1 (root)

go-git log traverses this chain backwards from HEAD.
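A rough sketch of that traversal, assuming the commit layout shown above. The commit struct, parseCommit, and walk are hypothetical names, and the objects map stands in for reading and decompressing files from .git/objects, with the "commit <size>\0" header already stripped.

```go
package main

import (
	"fmt"
	"strings"
)

// commit holds the fields the traversal needs.
type commit struct {
	tree, parent, message string
}

// parseCommit reads the header lines up to the first blank line,
// then treats the rest as the commit message.
func parseCommit(body string) commit {
	var c commit
	head, msg, _ := strings.Cut(body, "\n\n")
	c.message = strings.TrimSpace(msg)
	for _, line := range strings.Split(head, "\n") {
		key, value, _ := strings.Cut(line, " ")
		switch key {
		case "tree":
			c.tree = value
		case "parent":
			c.parent = value
		}
	}
	return c
}

// walk follows parent pointers from start until a commit has no parent
// and returns the messages newest first.
func walk(objects map[string]string, start string) ([]string, error) {
	var messages []string
	for hash := start; hash != ""; {
		body, ok := objects[hash]
		if !ok {
			return nil, fmt.Errorf("missing commit %s", hash)
		}
		c := parseCommit(body)
		messages = append(messages, c.message)
		hash = c.parent // empty for the root commit, which ends the loop
	}
	return messages, nil
}

func main() {
	// Short fake hashes keep the example readable.
	objects := map[string]string{
		"c1": "tree t1\nauthor A <a@example.com> 0\n\nInitial commit\n",
		"c2": "tree t2\nparent c1\nauthor A <a@example.com> 0\n\nAdd feature\n",
		"c3": "tree t3\nparent c2\nauthor A <a@example.com> 0\n\nFix bug\n",
	}
	history, err := walk(objects, "c3")
	if err != nil {
		panic(err)
	}
	for _, m := range history {
		fmt.Println(m)
	}
}
```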

Under the Hood

Object Format

Every object (blob, tree, commit) follows this format:

<type> <size>\0<content>

Compressed with zlib, stored at .git/objects/<hash[:2]>/<hash[2:]>

Example - Blob:

blob 13\0Hello, World!

Hash this → a0b1c2d3... → Store at .git/objects/a0/b1c2d3...
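To make the format concrete, here is a small round-trip sketch. The function names encodeObject and decodeObject are hypothetical, and the sketch assumes the hash is taken over the uncompressed header plus content, as the blob example above suggests.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// encodeObject prepends the "<type> <size>\0" header, hashes the result,
// and returns the hex hash together with the zlib-compressed bytes.
func encodeObject(kind string, content []byte) (string, []byte, error) {
	raw := append([]byte(fmt.Sprintf("%s %d\x00", kind, len(content))), content...)
	sum := sha256.Sum256(raw)

	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the compressed stream
		return "", nil, err
	}
	return hex.EncodeToString(sum[:]), buf.Bytes(), nil
}

// decodeObject inflates the data and splits the header from the content
// at the first null byte.
func decodeObject(compressed []byte) (string, []byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	header, content, found := bytes.Cut(raw, []byte{0})
	if !found {
		return "", nil, fmt.Errorf("object has no header terminator")
	}
	kind, _, _ := strings.Cut(string(header), " ")
	return kind, content, nil
}

func main() {
	hash, compressed, err := encodeObject("blob", []byte("Hello, World!"))
	if err != nil {
		panic(err)
	}
	kind, content, err := decodeObject(compressed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s... %s %q\n", hash[:8], kind, content)
}
```

Hashing before compressing means the object's name depends only on its content, not on compression settings.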

Example - Tree:

tree 92\0100644 README.md\0<32-byte-binary-hash>040000 src\0<32-byte-binary-hash>
        ↑       ↑           ↑                    ↑      ↑    ↑
      mode    name      null byte              mode   name  null + hash

Mode meanings:

100644 = regular file
040000 = directory (tree object)

Critical detail: Hashes in tree objects are binary bytes, not hex strings. Took me an hour to debug this.
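The reading side shows why the binary encoding matters. Below is a sketch of writing and parsing tree entries. The treeEntry type and the functions appendEntry and parseTree are hypothetical, and the body is assumed to have its "tree <size>\0" header already removed.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

const hashSize = 32 // raw SHA-256 length in bytes

type treeEntry struct {
	Mode, Name, Hash string // Hash is kept as hex for display
}

// appendEntry writes "<mode> <name>\0" followed by the raw 32-byte hash.
func appendEntry(body []byte, e treeEntry) ([]byte, error) {
	raw, err := hex.DecodeString(e.Hash)
	if err != nil || len(raw) != hashSize {
		return nil, fmt.Errorf("entry %q: bad hash", e.Name)
	}
	body = append(body, e.Mode+" "+e.Name+"\x00"...)
	return append(body, raw...), nil
}

// parseTree walks the body entry by entry. The hash cannot be located by
// searching for a delimiter, because raw bytes may themselves contain
// spaces or nulls; it is always the 32 bytes after the name terminator.
func parseTree(body []byte) ([]treeEntry, error) {
	var entries []treeEntry
	for len(body) > 0 {
		sp := bytes.IndexByte(body, ' ')
		if sp < 0 {
			return nil, fmt.Errorf("missing mode separator")
		}
		nul := bytes.IndexByte(body[sp+1:], 0)
		if nul < 0 {
			return nil, fmt.Errorf("missing name terminator")
		}
		nameEnd := sp + 1 + nul
		hashEnd := nameEnd + 1 + hashSize
		if hashEnd > len(body) {
			return nil, fmt.Errorf("truncated hash")
		}
		entries = append(entries, treeEntry{
			Mode: string(body[:sp]),
			Name: string(body[sp+1 : nameEnd]),
			Hash: hex.EncodeToString(body[nameEnd+1 : hashEnd]),
		})
		body = body[hashEnd:]
	}
	return entries, nil
}

func main() {
	body, err := appendEntry(nil, treeEntry{"100644", "README.md", strings.Repeat("ab", hashSize)})
	if err != nil {
		panic(err)
	}
	// A hash made of null bytes would confuse any delimiter-based parser.
	body, err = appendEntry(body, treeEntry{"040000", "src", strings.Repeat("00", hashSize)})
	if err != nil {
		panic(err)
	}
	entries, err := parseTree(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body), "bytes,", len(entries), "entries")
}
```

A parser written this way expects exactly 32 bytes per hash, so writing 64 hex characters instead shifts every following entry out of position.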

Staging Area (Index)

.git/index is a simple text file:

100644 abc123... README.md
100644 def456... src/main.go
100644 ghi789... src/lib/helper.go

Format: <mode> <hash> <path>

When you go-git add, we:

Hash the file content → get blob hash
Store blob in .git/objects/
Update index with path → hash mapping

Note: This implementation stores the index as plain text for simplicity. Real Git stores the index in a binary format for performance and to support additional metadata (timestamps, file sizes, etc.).
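A sketch of reading and writing that text format. The indexEntry type and the functions parseIndex and formatIndex are hypothetical, and sorting by path on output is an assumption made here to keep the file stable between runs.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

type indexEntry struct{ Mode, Hash, Path string }

// parseIndex reads "<mode> <hash> <path>" lines. Splitting into at most
// three fields keeps paths that contain spaces intact.
func parseIndex(text string) (map[string]indexEntry, error) {
	entries := map[string]indexEntry{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		f := strings.SplitN(line, " ", 3)
		if len(f) != 3 {
			return nil, fmt.Errorf("malformed index line %q", line)
		}
		entries[f[2]] = indexEntry{f[0], f[1], f[2]}
	}
	return entries, sc.Err()
}

// formatIndex writes the entries back out, one per line, sorted by path.
func formatIndex(entries map[string]indexEntry) string {
	paths := make([]string, 0, len(entries))
	for p := range entries {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	var b strings.Builder
	for _, p := range paths {
		e := entries[p]
		fmt.Fprintf(&b, "%s %s %s\n", e.Mode, e.Hash, e.Path)
	}
	return b.String()
}

func main() {
	idx, err := parseIndex("100644 abc123 README.md\n100644 def456 src/main.go\n")
	if err != nil {
		panic(err)
	}
	// Staging a modified file replaces the hash stored for its path.
	idx["README.md"] = indexEntry{"100644", "fff999", "README.md"}
	fmt.Print(formatIndex(idx))
}
```

Keying the map by path means re-adding a file overwrites its old entry instead of creating a duplicate.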

Branch Pointers

.git/refs/heads/main is a text file containing one line: the current commit hash.

abc123def456...

When you commit:

Create tree from index
Create commit pointing to that tree
Update .git/refs/heads/main with new commit hash

.git/HEAD points to the current branch:

ref: refs/heads/main
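Resolving HEAD is two lookups: HEAD names a ref, and the ref file holds the hash. A minimal sketch, with hypothetical names refFromHEAD and currentCommit, a map standing in for files under .git, and the assumption that a missing ref file means the repository has no commits yet:

```go
package main

import (
	"fmt"
	"strings"
)

// refFromHEAD extracts the ref path from the contents of .git/HEAD,
// turning "ref: refs/heads/main\n" into "refs/heads/main".
func refFromHEAD(head string) (string, error) {
	ref, ok := strings.CutPrefix(strings.TrimSpace(head), "ref: ")
	if !ok || ref == "" {
		return "", fmt.Errorf("HEAD is not a symbolic ref: %q", head)
	}
	return ref, nil
}

// currentCommit resolves HEAD to a commit hash. files maps paths relative
// to .git onto their contents; an absent ref yields an empty hash.
func currentCommit(files map[string]string) (string, error) {
	ref, err := refFromHEAD(files["HEAD"])
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(files[ref]), nil
}

func main() {
	files := map[string]string{
		"HEAD":            "ref: refs/heads/main\n",
		"refs/heads/main": "abc123def456\n",
	}
	hash, err := currentCommit(files)
	if err != nil {
		panic(err)
	}
	fmt.Println("HEAD is at", hash)

	// A commit only has to rewrite the ref file; HEAD itself stays the same.
	files["refs/heads/main"] = "789aaa000bbb\n"
	hash, _ = currentCommit(files)
	fmt.Println("HEAD is at", hash)
}
```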

Performance Characteristics

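Deduplication: Identical files stored once (same content = same hash = same object)
Compression: zlib reduces object size by roughly 60-70% for typical text files; already-compressed files shrink far less
Scalability: add/commit time grows linearly with the number of index entries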

Not optimized for speed (built for learning), but functional for small-to-medium repos.

Project Structure

├── cmd/
│   └── cmd.go              # CLI with Cobra (init, add, commit, log, config)
├── internals/
│   ├── init.go             # Repository initialization
│   ├── config.go           # User configuration (name/email)
│   ├── hash.go             # SHA-256 hashing + zlib compression
│   ├── log.go              # Commit history traversal
│   ├── index/
│   │   └── index.go        # Staging area management
│   └── objects/
│       ├── blobs.go        # File content hashing
│       ├── trees.go        # Directory tree building
│       └── commit.go       # Commit object creation
├── install.sh              # Automatic installation script
└── main.go                 # Entry point

The Story: Building This

What Went Smoothly

Blobs (file hashing): Straightforward - read file, hash content, compress, store
Staging area: Simple text file tracking path → hash mappings
Commits: Just string formatting + hashing

What Was Hard

1. Tree Building (5+ hours of debugging)

The problem: How do you build a tree for src/ when it contains a subdirectory lib/?

Initial attempt: Build trees top-down (root first). Failed - you don't have child tree hashes yet.

Solution: Build bottom-up (deepest directories first). But how do you know the order?

Final approach:

// Sort directories by depth (count slashes)
sort.Slice(dirs, func(i, j int) bool {
    return strings.Count(dirs[i], "/") > strings.Count(dirs[j], "/")
})

Then for each directory, check if any already-built trees are its children:

for subdir, hash := range treeHashes {
    if filepath.Dir(subdir) == currentDir {
        // This tree is a direct child - add it as an entry
        entries = append(entries, Entry{
            Filename: filepath.Base(subdir),
            BlobHash: hash,     // Use tree hash, not blob hash
            FileMode: "040000", // Directory mode
        })
    }
}

Why this was hard: Trees reference other trees by hash. You need to build children before parents so you have their hashes to reference. Took multiple attempts to understand this wasn't just "parse directories recursively."

2. Binary vs Hex Hash Encoding

Tree objects store hashes as 32 binary bytes, not 64-character hex strings.

Initial bug:

content += entry.BlobHash  // Wrong! This is hex string (64 chars)

Fix:

hashBytes, err := hex.DecodeString(entry.BlobHash)
if err != nil {
    return fmt.Errorf("bad hash for %s: %w", entry.Filename, err)
}
content = append(content, hashBytes...)  // 32 binary bytes

Symptom: Each hash took 64 bytes instead of 32, so trees were nearly twice as large as they should be. Debugging this took an hour because the error was subtle - the objects could still be read, but tree traversal was broken.

3. Excluding .git/ from staging

When you go-git add ., you don't want to add .git/objects/ to the index.

First attempt: Check if path contains .git. Failed - also excluded my.git.file.

Fix:

if d.Name() == ".git" && d.IsDir() {
    return filepath.SkipDir  // Skip entire directory
}

Using filepath.SkipDir is the correct way to exclude directories during traversal.
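Here is a runnable sketch of the same check. To avoid touching the disk it swaps in fs.WalkDir over an in-memory fstest.MapFS rather than walking the real file system; fs.SkipDir is the same value as filepath.SkipDir. The names stageable and sampleTree are hypothetical.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// stageable lists every file under fsys, skipping any directory named
// exactly ".git". Files such as "my.git.file" are kept.
func stageable(fsys fs.FS) ([]string, error) {
	var files []string
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return fs.SkipDir // do not descend into the repository data
			}
			return nil
		}
		files = append(files, p)
		return nil
	})
	return files, err
}

// sampleTree is an in-memory directory used only for this demonstration.
func sampleTree() fs.FS {
	return fstest.MapFS{
		".git/objects/ab/cdef": {Data: []byte("x")},
		"README.md":            {Data: []byte("x")},
		"my.git.file":          {Data: []byte("x")},
		"src/main.go":          {Data: []byte("x")},
	}
}

func main() {
	files, err := stageable(sampleTree())
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Comparing the entry name rather than searching the whole path is what keeps my.git.file in the result.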

What I Learned

Content-addressable storage is elegant: Hash = address. Same content = same hash = automatic deduplication. This is how Git handles millions of files efficiently.

Trees are graphs, not just nested structures: A tree object doesn't "contain" subtrees - it references them by hash. This indirection is what enables Git's efficiency (multiple commits can share the same tree if a directory didn't change).

Building bottom-up is necessary, not optional: You can't hash a parent tree without knowing its children's hashes. The order matters fundamentally.

Go's filepath package is powerful: filepath.Walk, filepath.Dir, filepath.Base handle cross-platform path logic correctly. Don't reinvent this.

Compression matters: Without zlib, .git/objects/ would be 3-4x larger. Git uses compression everywhere.

Binary formats are tricky: Working with \0 null bytes and binary hash data required careful handling. Text formats would be easier but less efficient.

Testing

Testing is currently manual, through real usage. To try it:

mkdir test-repo
cd test-repo
go-git init
go-git config
echo "test" > file.txt
go-git add file.txt
go-git commit -m "test commit"
go-git log  # Should show your commit

Verify objects:

ls -la .git/objects/  # Should see subdirectories with hashes

Limitations

This is a learning project, not production software:

No branches - Only main branch exists
No merge - Can't combine commit histories
No diff - Can't compare commits or working tree changes
No status - Can't see modified/untracked files
No remote operations - No push/pull/fetch
No .gitignore - Manual exclusion only
No packed objects - Each object is a separate file (Git packs them for efficiency)
No index optimization - Linear scan on every operation
Plain text index - Index stored as plain text instead of binary format (real Git uses binary for performance and metadata)

Why these limitations exist: This project focuses on Git's core - the object model, staging, and commits. Adding branches/merging/remotes would be another 2-3x the code and shift focus from fundamentals to features.

Why Build This?

Most developers use Git daily but don't understand how it works internally. We type git add, git commit, and assume magic happens.

Building Git from scratch reveals:

Why commits are cheap (just pointers to trees)
How deduplication works (content-addressable storage)
Why branching is fast (just moving a pointer)
What "detached HEAD" actually means
How merge conflicts arise (competing tree references)

The best way to understand a tool is to build it yourself.

This project taught me more about Git in a week than years of using it did.

Built With

Go 1.25 - Core language
Cobra - CLI framework
fatih/color - Terminal colors
Standard library for all Git logic - Hashing, compression, and object storage are hand-written

No Git libraries used. Everything from hashing to object storage is custom implementation.

Built by Uthman | Portfolio | GitHub

Learning project focused on understanding version control internals
