
DataFolio


A lightweight, filesystem-based data versioning and experiment tracking library for Python.

DataFolio helps you organize, version, and track your data science experiments by storing datasets, models, and artifacts in a simple, transparent directory structure. Everything is saved as plain files (Parquet, JSON, etc.) that you can inspect, version with git, or back up to any storage system.

Note: DataFolio has been an exercise in how extensively I can use Claude Code. Currently all work has been done via Claude, but now that the library is getting genuinely useful for my workflows, I might transition to more manual curation.

Features

  • Universal Data Management: Single add_data() method automatically handles DataFrames, numpy arrays, dicts, lists, and scalars
  • Model Support: Save and load scikit-learn models with full metadata tracking
  • Data Lineage: Track inputs and dependencies between datasets and models
  • External References: Point to data stored externally (S3, local paths) without copying
  • Multi-Instance Sync: Automatic refresh when multiple notebooks/processes access the same bundle
  • Autocomplete Access: IDE-friendly folio.data.item_name.content syntax with full autocomplete support
  • Smart Metadata Display: Automatic metadata truncation and formatting in describe()
  • Item Management: Delete items with dependency tracking and warnings
  • Git-Friendly: All data stored as standard file formats in a simple directory structure
  • Type-Safe: Full type hints and comprehensive error handling
  • Snapshots: Create immutable checkpoints of your experiments with copy-on-write versioning
  • CLI Tools: Command-line interface for snapshot management and bundle operations

Quick Start

from datafolio import DataFolio
import pandas as pd
import numpy as np

# Create a new folio
folio = DataFolio('experiments/my_experiment')

# Add any type of data with a single method
df = pd.DataFrame({'score': [0.9, 0.8, 0.95]})     # example DataFrame
folio.add_data('results', df)                      # DataFrame
folio.add_data('embeddings', np.array([1, 2, 3]))  # Numpy array
folio.add_data('config', {'lr': 0.01})             # Dict/JSON
folio.add_data('accuracy', 0.95)                   # Scalar

# Retrieve data (automatically returns correct type)
df = folio.get_data('results')           # Returns DataFrame
arr = folio.get_data('embeddings')       # Returns numpy array
config = folio.get_data('config')        # Returns dict

# Or use autocomplete-friendly access
df = folio.data.results.content          # Same as get_data()
arr = folio.data.embeddings.content
config = folio.data.config.content

# View everything (including custom metadata)
folio.describe()

# Remove items you no longer need (e.g. a previously added 'temp_data')
folio.delete('temp_data')

Installation

pip install datafolio

This includes the datafolio command-line tool for snapshot management and bundle operations.

Core Concepts

Generic Data Methods

The add_data() and get_data() methods provide a unified interface for all data types:

# add_data() automatically detects type and uses the appropriate method
folio.add_data('my_data', data)  # Works with DataFrame, array, dict, list, scalar

# get_data() automatically detects stored type and returns correct format
data = folio.get_data('my_data')  # Returns original type

Supported data types:

  • DataFrames (pd.DataFrame) → stored as Parquet
  • Numpy arrays (np.ndarray) → stored as .npy
  • JSON data (dict, list, int, float, str, bool, None) → stored as JSON
  • External references → metadata only, data stays in original location
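The kind of type dispatch behind add_data() can be sketched as follows. This is illustrative only, not DataFolio's actual implementation; the storage_format helper is hypothetical.

```python
import numpy as np
import pandas as pd

def storage_format(data):
    """Map a Python object to the storage format add_data() would use."""
    if isinstance(data, pd.DataFrame):
        return "parquet"
    if isinstance(data, np.ndarray):
        return "npy"
    # note: bool is a subclass of int, so it is covered here too
    if isinstance(data, (dict, list, int, float, str, bool, type(None))):
        return "json"
    raise TypeError(f"unsupported type: {type(data).__name__}")

print(storage_format(pd.DataFrame()))   # parquet
print(storage_format(np.zeros(3)))      # npy
print(storage_format({"lr": 0.01}))     # json
```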

Multi-Instance Access

DataFolio automatically keeps multiple instances synchronized when accessing the same bundle:

# Notebook 1: Create and update bundle
folio1 = DataFolio('experiments/shared')
folio1.add_data('results', df)

# Notebook 2: Open same bundle
folio2 = DataFolio('experiments/shared')

# Notebook 1: Add more data
folio1.add_data('analysis', new_df)

# Notebook 2: Automatically sees new data!
folio2.describe()  # Shows both 'results' and 'analysis'
analysis = folio2.get_data('analysis')  # Works immediately ✅

All read operations (describe(), list_contents(), get_*() methods, and folio.data accessors) automatically refresh from disk when changes are detected, ensuring you always see the latest data without manual intervention.
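One common way to implement this kind of staleness check is to compare the manifest file's stat info between reads. The sketch below shows the general technique, not DataFolio's code; ManifestWatcher and the manifest keys are made up for illustration.

```python
import json
import tempfile
import time
from pathlib import Path

class ManifestWatcher:
    """Reload a JSON manifest only when it changes on disk."""
    def __init__(self, path):
        self.path = Path(path)
        self._stamp = None   # (mtime_ns, size) at last read
        self._cache = None

    def read(self):
        st = self.path.stat()
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:          # file changed -> refresh cache
            self._cache = json.loads(self.path.read_text())
            self._stamp = stamp
        return self._cache

with tempfile.TemporaryDirectory() as d:
    manifest = Path(d) / "items.json"
    manifest.write_text('{"results": {}}')
    watcher = ManifestWatcher(manifest)
    watcher.read()                        # initial load
    time.sleep(0.01)                      # let mtime tick on coarse filesystems
    # another process adds an item:
    manifest.write_text('{"results": {}, "analysis": {}}')
    items = sorted(watcher.read())        # change detected, reloaded
    print(items)                          # ['analysis', 'results']
```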

Data Lineage

Track dependencies between datasets and models:

# Create dependency chain
folio.reference_table('raw', reference='s3://bucket/raw.parquet')
folio.add_table('clean', cleaned_df, inputs=['raw'])
folio.add_table('features', feature_df, inputs=['clean'])
folio.add_model('model', clf, inputs=['features'])

# Lineage is preserved in metadata and shown in describe()
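With inputs recorded per item as above, recovering the full upstream lineage of any item is a simple graph walk. A minimal sketch, using a plain dict as a stand-in for the stored metadata (not DataFolio internals):

```python
# inputs recorded for each item, mirroring the chain above
inputs = {
    "raw": [],
    "clean": ["raw"],
    "features": ["clean"],
    "model": ["features"],
}

def ancestors(name, graph):
    """All upstream dependencies of `name`, transitively."""
    seen, stack = set(), list(graph[name])
    while stack:
        item = stack.pop()
        if item not in seen:
            seen.add(item)
            stack.extend(graph[item])
    return seen

print(sorted(ancestors("model", inputs)))  # ['clean', 'features', 'raw']
```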

Autocomplete-Friendly Access

Access your data with autocomplete support using the folio.data property:

# Attribute-style access (autocomplete-friendly!)
df = folio.data.results.content          # Get DataFrame
desc = folio.data.results.description    # Get description
type_str = folio.data.results.type       # Get item type
inputs = folio.data.results.inputs       # Get lineage inputs

# Works for all data types
arr = folio.data.embeddings.content      # numpy array
cfg = folio.data.config.content          # dict
model = folio.data.classifier.content    # model object

In IPython/Jupyter, folio.data.<TAB> shows all available items with autocomplete!
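Attribute-style access with tab completion can be built on a plain dict using __getattr__ and __dir__; the sketch below shows the general Python pattern, not DataFolio's implementation (the Namespace class and sample items are hypothetical).

```python
class Namespace:
    """Expose dict entries as attributes, with tab completion."""
    def __init__(self, items):
        self._items = items

    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        try:
            return self._items[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # dir() is what IPython consults for <TAB> completion
        return list(self._items) + list(super().__dir__())

data = Namespace({"results": "...", "config": {"lr": 0.01}})
print(data.config)              # {'lr': 0.01}
print("results" in dir(data))   # True -> shows up on <TAB>
```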

Directory Structure

DataFolio creates a transparent directory structure:

experiments/my_experiment/
├── metadata.json              # Folio metadata
├── items.json                 # Unified manifest with versioning
├── snapshots.json             # Snapshot registry (when using snapshots)
├── tables/
│   └── results.parquet       # DataFrame storage
├── models/
│   ├── classifier.joblib     # Sklearn model (v1)
│   └── classifier_v2.joblib  # Version 2 (when snapshot exists)
└── artifacts/
    ├── embeddings.npy        # Numpy arrays
    ├── config.json           # JSON data
    └── plot.png              # Any file type
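Because everything is a plain file, any tool can inspect a bundle without importing DataFolio. The sketch below writes a stand-in for the items.json manifest shown above and reads it back with the stdlib alone (the manifest keys are illustrative, not DataFolio's exact schema):

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    bundle = Path(root) / "experiments" / "my_experiment"
    (bundle / "tables").mkdir(parents=True)

    # stand-in for the manifest DataFolio maintains
    (bundle / "items.json").write_text(json.dumps({
        "results": {"type": "table", "path": "tables/results.parquet"},
        "config": {"type": "json", "path": "artifacts/config.json"},
    }))

    # plain-file inspection: no DataFolio import needed
    manifest = json.loads((bundle / "items.json").read_text())
    item_names = sorted(manifest)
    print(item_names)   # ['config', 'results']
```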

Snapshots: Version Control for Experiments

Snapshots let you create immutable checkpoints of your experiments, making it easy to track different versions, compare results, and return to previous states without duplicating data.

Why Snapshots?

The Problem: You train a model with 89% accuracy, then experiment with improvements. The new version gets 85%—worse! But you've already overwritten your good model. You need to recreate it from git history.

The Solution: Create snapshots before experimenting. Snapshots preserve exact states while sharing unchanged data to save disk space.

Quick Start with Snapshots

from datafolio import DataFolio

# Create your experiment
folio = DataFolio('experiments/classifier')
folio.add_data('train_data', train_df)
folio.add_model('model', baseline_model)
folio.metadata['accuracy'] = 0.89

# Create a snapshot before experimenting
folio.create_snapshot('v1.0-baseline',
    description='Baseline random forest model',
    tags=['baseline', 'production'])

# Experiment freely - the snapshot is preserved
folio.add_model('model', experimental_model, overwrite=True)
folio.metadata['accuracy'] = 0.85  # Worse!

# Load the original version
baseline = DataFolio.load_snapshot('experiments/classifier', 'v1.0-baseline')
model = baseline.get_model('model')  # Original model with 89% accuracy!

CLI for Snapshot Management

DataFolio includes a command-line tool for easy snapshot operations:

# Create a snapshot
datafolio snapshot create v1.0 -d "Baseline model" -t baseline

# List all snapshots
datafolio snapshot list

# Show snapshot details
datafolio snapshot show v1.0

# Compare two snapshots
datafolio snapshot compare v1.0 v2.0

# Delete old snapshots and cleanup
datafolio snapshot delete experimental-v5 --cleanup

# Show reproduction instructions
datafolio snapshot reproduce v1.0

Key Features

  • Immutable: Once created, snapshots never change—guaranteed reproducibility
  • Space-efficient: Uses copy-on-write versioning—only changed items create new files
  • Git integration: Automatically captures commit hash, branch, and dirty status
  • Environment tracking: Records Python version and dependencies for full reproducibility
  • Metadata preservation: Snapshots include complete metadata state at that moment
  • Multiple snapshots: Load different versions simultaneously for comparison
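Copy-on-write versioning can be sketched with dicts: each snapshot is just a manifest mapping item names to file versions, so unchanged items keep pointing at the same file. This illustrates the general technique, not DataFolio's on-disk format:

```python
store = {}        # filename -> bytes (stands in for files on disk)
snapshots = {}    # snapshot name -> {item: filename}
live = {}         # current (mutable) manifest

def write_item(name, payload, version):
    fname = f"{name}_v{version}"
    store[fname] = payload
    live[name] = fname

write_item("train_data", b"rows...", 1)
write_item("model", b"baseline", 1)

# a snapshot is a cheap manifest copy -- no data is duplicated
snapshots["v1.0-baseline"] = dict(live)

# overwriting an item creates a new file; the snapshot keeps the old one
write_item("model", b"experimental", 2)

print(snapshots["v1.0-baseline"]["model"])   # model_v1 (preserved)
print(live["model"])                         # model_v2
# unchanged items are shared between snapshot and live state:
print(snapshots["v1.0-baseline"]["train_data"] == live["train_data"])  # True
```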

Use Cases

Paper Submission: Snapshot your exact code, data, and model state when submitting. Months later, you can reproduce those exact results.

A/B Testing: Create snapshots for baseline and experimental versions, deploy both, and compare performance metrics.

Hyperparameter Tuning: Snapshot each configuration, then compare results to find the best settings.

Production Deployment: Tag production-ready snapshots and deploy specific versions with confidence.

For complete snapshot documentation, see snapshots.md.

Examples

Complete ML Workflow

from datafolio import DataFolio
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Initialize
folio = DataFolio('experiments/classifier_v1')

# Reference external data
folio.add_data('raw', reference='s3://bucket/raw.csv',
    description='Raw training data from database')

# Add processed data
folio.add_data('clean', cleaned_df,
    description='Cleaned and preprocessed data',
    inputs=['raw'])

# Add features
folio.add_data('features', feature_df,
    description='Engineered features',
    inputs=['clean'])

# Train and save model (X_train, y_train prepared from feature_df, not shown)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

folio.add_model('classifier', clf,
    description='Random forest classifier',
    inputs=['features'])

# Save metrics
folio.add_data('metrics', {
    'accuracy': 0.95,
    'f1': 0.92,
    'precision': 0.94
})

# Add custom metadata to the folio itself
folio.metadata['experiment_name'] = 'rf_baseline'
folio.metadata['tags'] = ['classification', 'production']

# View summary (shows data and custom metadata)
folio.describe()

# Access data with autocomplete
metrics = folio.data.metrics.content
trained_model = folio.data.classifier.content

Best Practices

  1. Use descriptive names: add_data('training_features', ...) not add_data('data1', ...)
  2. Track lineage: Always specify inputs to track data dependencies
  3. Add descriptions: Help future you understand what each item contains
  4. Use custom metadata: Store experiment context in folio.metadata for better tracking
  5. Leverage autocomplete: Use folio.data.item_name.content for cleaner, more discoverable code
  6. Clean up regularly: Use delete() to remove temporary or obsolete items
  7. Version control: Commit your folio directories to git (data is stored efficiently)
  8. Use references: For large external datasets, use reference to avoid copying
  9. Check describe(): Regularly review your folio with folio.describe() to see data and metadata
  10. Share across notebooks: Multiple DataFolio instances can safely access the same bundle - changes are automatically detected and synchronized
  11. Snapshot before major changes: Create snapshots before experimenting with new approaches—it's free insurance
  12. Tag snapshots meaningfully: Use tags like baseline, production, paper to organize versions

Development

# Clone the repo
git clone https://github.com/caseysm/datafolio.git
cd datafolio

# Install with dev dependencies
uv sync

# Run tests
poe test

# Preview documentation
poe doc-preview

# Lint
uv run ruff check src/ tests/

# Bump version
poe bump patch  # or minor, major

Documentation

For complete API documentation and detailed guides, see the full documentation.

Requirements

  • Python 3.10+
  • pandas >= 2.0.0
  • pyarrow >= 14.0.0
  • joblib >= 1.3.0
  • orjson >= 3.9.0
  • cloud-files >= 5.8.1
  • click >= 8.1.0 (for CLI)
  • rich >= 13.0.0 (for CLI formatting)

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass (poe test)
  5. Submit a pull request

See CLAUDE.md for development guidelines.


Made with ❤️ for data scientists who need simple, lightweight experiment tracking.
