
DataFolio


A lightweight, filesystem-based data versioning and experiment tracking library for Python.

DataFolio helps you organize, version, and track your data science experiments by storing datasets, models, and artifacts in a simple, transparent directory structure. Everything is saved as plain files (Parquet, JSON, etc.) that you can inspect, version with git, or back up to any storage system.

Note: DataFolio has been an exercise in how extensively I can use Claude Code. Currently all work has been done via Claude, but now that the library is getting genuinely useful for my workflows, I might transition to more manual curation.

Features

  • Universal Data Management: Single add_data() method automatically handles DataFrames, numpy arrays, dicts, lists, and scalars
  • Model Support: Save and load scikit-learn models with full metadata tracking
  • Data Lineage: Track inputs and dependencies between datasets and models
  • External References: Point to data stored externally (S3, local paths) without copying
  • Multi-Instance Sync: Automatic refresh when multiple notebooks/processes access the same bundle
  • Autocomplete Access: IDE-friendly folio.data.item_name.content syntax with full autocomplete support
  • Smart Metadata Display: Automatic metadata truncation and formatting in describe()
  • Item Management: Delete items with dependency tracking and warnings
  • Git-Friendly: All data stored as standard file formats in a simple directory structure
  • Type-Safe: Full type hints and comprehensive error handling
  • Snapshots: Create immutable checkpoints of your experiments with copy-on-write versioning
  • CLI Tools: Command-line interface for snapshot management and bundle operations

Quick Start

from datafolio import DataFolio
import pandas as pd
import numpy as np

# Create a new folio
folio = DataFolio('experiments/my_experiment')

# Add any type of data with a single method
df = pd.DataFrame({'score': [0.9, 0.8, 0.95]})     # example DataFrame
folio.add_data('results', df)                      # DataFrame
folio.add_data('embeddings', np.array([1, 2, 3]))  # Numpy array
folio.add_data('config', {'lr': 0.01})             # Dict/JSON
folio.add_data('accuracy', 0.95)                   # Scalar

# Retrieve data (automatically returns correct type)
df = folio.get_data('results')           # Returns DataFrame
arr = folio.get_data('embeddings')       # Returns numpy array
config = folio.get_data('config')        # Returns dict

# Or use autocomplete-friendly access
df = folio.data.results.content          # Same as get_data()
arr = folio.data.embeddings.content
config = folio.data.config.content

# View everything (including custom metadata)
folio.describe()

# Remove items you no longer need (e.g. a previously added 'temp_data')
folio.delete('temp_data')

Installation

pip install datafolio

This includes the datafolio command-line tool for snapshot management and bundle operations.

Core Concepts

Generic Data Methods

The add_data() and get_data() methods provide a unified interface for all data types:

# add_data() automatically detects type and uses the appropriate method
folio.add_data('my_data', data)  # Works with DataFrame, array, dict, list, scalar

# get_data() automatically detects stored type and returns correct format
data = folio.get_data('my_data')  # Returns original type

Supported data types:

  • DataFrames (pd.DataFrame) → stored as Parquet
  • Numpy arrays (np.ndarray) → stored as .npy
  • JSON data (dict, list, int, float, str, bool, None) → stored as JSON
  • External references → metadata only, data stays in original location
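The kind of type dispatch behind add_data() can be sketched as follows. This is illustrative only, not DataFolio's actual implementation; the storage_format helper is hypothetical.

```python
import numpy as np
import pandas as pd

def storage_format(data):
    """Map a Python object to the storage format add_data() would use."""
    if isinstance(data, pd.DataFrame):
        return "parquet"
    if isinstance(data, np.ndarray):
        return "npy"
    # note: bool is a subclass of int, so it is covered here too
    if isinstance(data, (dict, list, int, float, str, bool, type(None))):
        return "json"
    raise TypeError(f"unsupported type: {type(data).__name__}")

print(storage_format(pd.DataFrame()))   # parquet
print(storage_format(np.zeros(3)))      # npy
print(storage_format({"lr": 0.01}))     # json
```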

Multi-Instance Access

DataFolio automatically keeps multiple instances synchronized when accessing the same bundle:

# Notebook 1: Create and update bundle
folio1 = DataFolio('experiments/shared')
folio1.add_data('results', df)

# Notebook 2: Open same bundle
folio2 = DataFolio('experiments/shared')

# Notebook 1: Add more data
folio1.add_data('analysis', new_df)

# Notebook 2: Automatically sees new data!
folio2.describe()  # Shows both 'results' and 'analysis'
analysis = folio2.get_data('analysis')  # Works immediately ✅

All read operations (describe(), list_contents(), get_*() methods, and folio.data accessors) automatically refresh from disk when changes are detected, ensuring you always see the latest data without manual intervention.
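One common way to implement this kind of staleness check is to compare the manifest file's stat info between reads. The sketch below shows the general technique, not DataFolio's code; ManifestWatcher and the manifest keys are made up for illustration.

```python
import json
import tempfile
import time
from pathlib import Path

class ManifestWatcher:
    """Reload a JSON manifest only when it changes on disk."""
    def __init__(self, path):
        self.path = Path(path)
        self._stamp = None   # (mtime_ns, size) at last read
        self._cache = None

    def read(self):
        st = self.path.stat()
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:          # file changed -> refresh cache
            self._cache = json.loads(self.path.read_text())
            self._stamp = stamp
        return self._cache

with tempfile.TemporaryDirectory() as d:
    manifest = Path(d) / "items.json"
    manifest.write_text('{"results": {}}')
    watcher = ManifestWatcher(manifest)
    watcher.read()                        # initial load
    time.sleep(0.01)                      # let mtime tick on coarse filesystems
    # another process adds an item:
    manifest.write_text('{"results": {}, "analysis": {}}')
    items = sorted(watcher.read())        # change detected, reloaded
    print(items)                          # ['analysis', 'results']
```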

Data Lineage

Track dependencies between datasets and models:

# Create dependency chain
folio.reference_table('raw', reference='s3://bucket/raw.parquet')
folio.add_table('clean', cleaned_df, inputs=['raw'])
folio.add_table('features', feature_df, inputs=['clean'])
folio.add_model('model', clf, inputs=['features'])

# Lineage is preserved in metadata and shown in describe()
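With inputs recorded per item as above, recovering the full upstream lineage of any item is a simple graph walk. A minimal sketch, using a plain dict as a stand-in for the stored metadata (not DataFolio internals):

```python
# inputs recorded for each item, mirroring the chain above
inputs = {
    "raw": [],
    "clean": ["raw"],
    "features": ["clean"],
    "model": ["features"],
}

def ancestors(name, graph):
    """All upstream dependencies of `name`, transitively."""
    seen, stack = set(), list(graph[name])
    while stack:
        item = stack.pop()
        if item not in seen:
            seen.add(item)
            stack.extend(graph[item])
    return seen

print(sorted(ancestors("model", inputs)))  # ['clean', 'features', 'raw']
```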

Autocomplete-Friendly Access

Access your data with autocomplete support using the folio.data property:

# Attribute-style access (autocomplete-friendly!)
df = folio.data.results.content          # Get DataFrame
desc = folio.data.results.description    # Get description
type_str = folio.data.results.type       # Get item type
inputs = folio.data.results.inputs       # Get lineage inputs

# Works for all data types
arr = folio.data.embeddings.content      # numpy array
cfg = folio.data.config.content          # dict
model = folio.data.classifier.content    # model object

In IPython/Jupyter, folio.data.<TAB> shows all available items with autocomplete!
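Attribute-style access with tab completion can be built on a plain dict using __getattr__ and __dir__; the sketch below shows the general Python pattern, not DataFolio's implementation (the Namespace class and sample items are hypothetical).

```python
class Namespace:
    """Expose dict entries as attributes, with tab completion."""
    def __init__(self, items):
        self._items = items

    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        try:
            return self._items[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # dir() is what IPython consults for <TAB> completion
        return list(self._items) + list(super().__dir__())

data = Namespace({"results": "...", "config": {"lr": 0.01}})
print(data.config)              # {'lr': 0.01}
print("results" in dir(data))   # True -> shows up on <TAB>
```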

Directory Structure

DataFolio creates a transparent directory structure:

experiments/my_experiment/
├── metadata.json              # Folio metadata
├── items.json                 # Unified manifest with versioning
├── snapshots.json             # Snapshot registry (when using snapshots)
├── tables/
│   └── results.parquet       # DataFrame storage
├── models/
│   ├── classifier.joblib     # Sklearn model (v1)
│   └── classifier_v2.joblib  # Version 2 (when snapshot exists)
└── artifacts/
    ├── embeddings.npy        # Numpy arrays
    ├── config.json           # JSON data
    └── plot.png              # Any file type
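Because everything is a plain file, any tool can inspect a bundle without importing DataFolio. The sketch below writes a stand-in for the items.json manifest shown above and reads it back with the stdlib alone (the manifest keys are illustrative, not DataFolio's exact schema):

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    bundle = Path(root) / "experiments" / "my_experiment"
    (bundle / "tables").mkdir(parents=True)

    # stand-in for the manifest DataFolio maintains
    (bundle / "items.json").write_text(json.dumps({
        "results": {"type": "table", "path": "tables/results.parquet"},
        "config": {"type": "json", "path": "artifacts/config.json"},
    }))

    # plain-file inspection: no DataFolio import needed
    manifest = json.loads((bundle / "items.json").read_text())
    item_names = sorted(manifest)
    print(item_names)   # ['config', 'results']
```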

Snapshots: Version Control for Experiments

Snapshots let you create immutable checkpoints of your experiments, making it easy to track different versions, compare results, and return to previous states without duplicating data.

Why Snapshots?

The Problem: You train a model with 89% accuracy, then experiment with improvements. The new version gets 85%—worse! But you've already overwritten your good model. You need to recreate it from git history.

The Solution: Create snapshots before experimenting. Snapshots preserve exact states while sharing unchanged data to save disk space.

Quick Start with Snapshots

from datafolio import DataFolio

# Create your experiment
folio = DataFolio('experiments/classifier')
folio.add_data('train_data', train_df)
folio.add_model('model', baseline_model)
folio.metadata['accuracy'] = 0.89

# Create a snapshot before experimenting
folio.create_snapshot('v1.0-baseline',
    description='Baseline random forest model',
    tags=['baseline', 'production'])

# Experiment freely - the snapshot is preserved
folio.add_model('model', experimental_model, overwrite=True)
folio.metadata['accuracy'] = 0.85  # Worse!

# Load the original version
baseline = DataFolio.load_snapshot('experiments/classifier', 'v1.0-baseline')
model = baseline.get_model('model')  # Original model with 89% accuracy!

CLI for Snapshot Management

DataFolio includes a command-line tool for easy snapshot operations:

# Create a snapshot
datafolio snapshot create v1.0 -d "Baseline model" -t baseline

# List all snapshots
datafolio snapshot list

# Show snapshot details
datafolio snapshot show v1.0

# Compare two snapshots
datafolio snapshot compare v1.0 v2.0

# Delete old snapshots and cleanup
datafolio snapshot delete experimental-v5 --cleanup

# Show reproduction instructions
datafolio snapshot reproduce v1.0

Key Features

  • Immutable: Once created, snapshots never change—guaranteed reproducibility
  • Space-efficient: Uses copy-on-write versioning—only changed items create new files
  • Git integration: Automatically captures commit hash, branch, and dirty status
  • Environment tracking: Records Python version and dependencies for full reproducibility
  • Metadata preservation: Snapshots include complete metadata state at that moment
  • Multiple snapshots: Load different versions simultaneously for comparison
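Copy-on-write versioning can be sketched with dicts: each snapshot is just a manifest mapping item names to file versions, so unchanged items keep pointing at the same file. This illustrates the general technique, not DataFolio's on-disk format:

```python
store = {}        # filename -> bytes (stands in for files on disk)
snapshots = {}    # snapshot name -> {item: filename}
live = {}         # current (mutable) manifest

def write_item(name, payload, version):
    fname = f"{name}_v{version}"
    store[fname] = payload
    live[name] = fname

write_item("train_data", b"rows...", 1)
write_item("model", b"baseline", 1)

# a snapshot is a cheap manifest copy -- no data is duplicated
snapshots["v1.0-baseline"] = dict(live)

# overwriting an item creates a new file; the snapshot keeps the old one
write_item("model", b"experimental", 2)

print(snapshots["v1.0-baseline"]["model"])   # model_v1 (preserved)
print(live["model"])                         # model_v2
# unchanged items are shared between snapshot and live state:
print(snapshots["v1.0-baseline"]["train_data"] == live["train_data"])  # True
```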

Use Cases

Paper Submission: Snapshot your exact code, data, and model state when submitting. Months later, you can reproduce those exact results.

A/B Testing: Create snapshots for baseline and experimental versions, deploy both, and compare performance metrics.

Hyperparameter Tuning: Snapshot each configuration, then compare results to find the best settings.

Production Deployment: Tag production-ready snapshots and deploy specific versions with confidence.

For complete snapshot documentation, see snapshots.md.

Examples

Complete ML Workflow

from datafolio import DataFolio
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Initialize
folio = DataFolio('experiments/classifier_v1')

# Reference external data
folio.add_data('raw', reference='s3://bucket/raw.csv',
    description='Raw training data from database')

# Add processed data
folio.add_data('clean', cleaned_df,
    description='Cleaned and preprocessed data',
    inputs=['raw'])

# Add features
folio.add_data('features', feature_df,
    description='Engineered features',
    inputs=['clean'])

# Train and save model (X_train, y_train prepared from feature_df, not shown)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

folio.add_model('classifier', clf,
    description='Random forest classifier',
    inputs=['features'])

# Save metrics
folio.add_data('metrics', {
    'accuracy': 0.95,
    'f1': 0.92,
    'precision': 0.94
})

# Add custom metadata to the folio itself
folio.metadata['experiment_name'] = 'rf_baseline'
folio.metadata['tags'] = ['classification', 'production']

# View summary (shows data and custom metadata)
folio.describe()

# Access data with autocomplete
metrics = folio.data.metrics.content
trained_model = folio.data.classifier.content

Best Practices

  1. Use descriptive names: add_data('training_features', ...) not add_data('data1', ...)
  2. Track lineage: Always specify inputs to track data dependencies
  3. Add descriptions: Help future you understand what each item contains
  4. Use custom metadata: Store experiment context in folio.metadata for better tracking
  5. Leverage autocomplete: Use folio.data.item_name.content for cleaner, more discoverable code
  6. Clean up regularly: Use delete() to remove temporary or obsolete items
  7. Version control: Commit your folio directories to git (data is stored efficiently)
  8. Use references: For large external datasets, use reference to avoid copying
  9. Check describe(): Regularly review your folio with folio.describe() to see data and metadata
  10. Share across notebooks: Multiple DataFolio instances can safely access the same bundle - changes are automatically detected and synchronized
  11. Snapshot before major changes: Create snapshots before experimenting with new approaches—it's free insurance
  12. Tag snapshots meaningfully: Use tags like baseline, production, paper to organize versions

Development

# Clone the repo
git clone https://github.com/caseysm/datafolio.git
cd datafolio

# Install with dev dependencies
uv sync

# Run tests
poe test

# Preview documentation
poe doc-preview

# Lint
uv run ruff check src/ tests/

# Bump version
poe bump patch  # or minor, major

Documentation

For complete API documentation and detailed guides, see the full documentation.

Requirements

  • Python 3.10+
  • pandas >= 2.0.0
  • pyarrow >= 14.0.0
  • joblib >= 1.3.0
  • orjson >= 3.9.0
  • cloud-files >= 5.8.1
  • click >= 8.1.0 (for CLI)
  • rich >= 13.0.0 (for CLI formatting)

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass (poe test)
  5. Submit a pull request

See CLAUDE.md for development guidelines.


Made with ❤️ for data scientists who need simple, lightweight experiment tracking.
