A lightweight, filesystem-based data versioning and experiment tracking library for Python.
DataFolio helps you organize, version, and track your data science experiments by storing datasets, models, and artifacts in a simple, transparent directory structure. Everything is saved as plain files (Parquet, JSON, etc.) that you can inspect, version with git, or back up to any storage system.
Note: DataFolio has been an exercise in how extensively I can use Claude Code. Currently all work has been done via Mr Claude, but now that it's getting very useful for workflows I might transition over to more manual curation.
- Universal Data Management: A single `add_data()` method automatically handles DataFrames, numpy arrays, dicts, lists, and scalars
- Model Support: Save and load scikit-learn models with full metadata tracking
- Data Lineage: Track inputs and dependencies between datasets and models
- External References: Point to data stored externally (S3, local paths) without copying
- Multi-Instance Sync: Automatic refresh when multiple notebooks/processes access the same bundle
- Autocomplete Access: IDE-friendly `folio.data.item_name.content` syntax with full autocomplete support
- Smart Metadata Display: Automatic metadata truncation and formatting in `describe()`
- Item Management: Delete items with dependency tracking and warnings
- Git-Friendly: All data stored as standard file formats in a simple directory structure
- Type-Safe: Full type hints and comprehensive error handling
- Snapshots: Create immutable checkpoints of your experiments with copy-on-write versioning
- CLI Tools: Command-line interface for snapshot management and bundle operations
```python
from datafolio import DataFolio
import pandas as pd
import numpy as np

# Create a new folio
folio = DataFolio('experiments/my_experiment')

# Add any type of data with a single method
folio.add_data('results', df)                      # DataFrame
folio.add_data('embeddings', np.array([1, 2, 3]))  # Numpy array
folio.add_data('config', {'lr': 0.01})             # Dict/JSON
folio.add_data('accuracy', 0.95)                   # Scalar

# Retrieve data (automatically returns correct type)
df = folio.get_data('results')        # Returns DataFrame
arr = folio.get_data('embeddings')    # Returns numpy array
config = folio.get_data('config')     # Returns dict

# Or use autocomplete-friendly access
df = folio.data.results.content       # Same as get_data()
arr = folio.data.embeddings.content
config = folio.data.config.content

# View everything (including custom metadata)
folio.describe()

# Clean up temporary items
folio.delete('temp_data')
```

```bash
pip install datafolio
```

This includes the `datafolio` command-line tool for snapshot management and bundle operations.
The `add_data()` and `get_data()` methods provide a unified interface for all data types:

```python
# add_data() automatically detects type and uses the appropriate method
folio.add_data('my_data', data)   # Works with DataFrame, array, dict, list, scalar

# get_data() automatically detects stored type and returns correct format
data = folio.get_data('my_data')  # Returns original type
```

Supported data types:

- DataFrames (`pd.DataFrame`) → stored as Parquet
- Numpy arrays (`np.ndarray`) → stored as `.npy`
- JSON data (`dict`, `list`, `int`, `float`, `str`, `bool`, `None`) → stored as JSON
- External references → metadata only, data stays in original location
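The dispatch is by runtime type. A minimal stand-alone sketch of this kind of type-based routing (illustrative only — not DataFolio's actual internals):

```python
def infer_storage_format(obj):
    """Route a value to a storage format, mimicking the dispatch that
    add_data() is described as performing (illustrative sketch only)."""
    type_name = type(obj).__name__
    if type_name == "DataFrame":   # pd.DataFrame -> Parquet
        return "parquet"
    if type_name == "ndarray":     # np.ndarray -> .npy
        return "npy"
    if obj is None or isinstance(obj, (dict, list, int, float, str, bool)):
        return "json"              # JSON-serializable scalars and containers
    raise TypeError(f"unsupported type: {type_name}")

print(infer_storage_format({'lr': 0.01}))  # json
print(infer_storage_format(0.95))          # json
```

Checking the type name rather than importing pandas/numpy keeps the sketch dependency-free; the real library can of course use `isinstance` checks directly.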
DataFolio automatically keeps multiple instances synchronized when accessing the same bundle:
```python
# Notebook 1: Create and update bundle
folio1 = DataFolio('experiments/shared')
folio1.add_data('results', df)

# Notebook 2: Open same bundle
folio2 = DataFolio('experiments/shared')

# Notebook 1: Add more data
folio1.add_data('analysis', new_df)

# Notebook 2: Automatically sees new data!
folio2.describe()                       # Shows both 'results' and 'analysis'
analysis = folio2.get_data('analysis')  # Works immediately ✅
```

All read operations (`describe()`, `list_contents()`, `get_*()` methods, and `folio.data` accessors) automatically refresh from disk when changes are detected, so you always see the latest data without manual intervention.
Track dependencies between datasets and models:
```python
# Create dependency chain
folio.reference_table('raw', reference='s3://bucket/raw.parquet')
folio.add_table('clean', cleaned_df, inputs=['raw'])
folio.add_table('features', feature_df, inputs=['clean'])
folio.add_model('model', clf, inputs=['features'])

# Lineage is preserved in metadata and shown in describe()
```

Access your data with autocomplete support using the `folio.data` property:
```python
# Attribute-style access (autocomplete-friendly!)
df = folio.data.results.content        # Get DataFrame
desc = folio.data.results.description  # Get description
type_str = folio.data.results.type     # Get item type
inputs = folio.data.results.inputs     # Get lineage inputs

# Works for all data types
arr = folio.data.embeddings.content    # numpy array
cfg = folio.data.config.content        # dict
model = folio.data.classifier.content  # model object
```

In IPython/Jupyter, `folio.data.<TAB>` shows all available items with autocomplete!
DataFolio creates a transparent directory structure:
```
experiments/my_experiment/
├── metadata.json              # Folio metadata
├── items.json                 # Unified manifest with versioning
├── snapshots.json             # Snapshot registry (when using snapshots)
├── tables/
│   └── results.parquet        # DataFrame storage
├── models/
│   ├── classifier.joblib      # Sklearn model (v1)│
│   └── classifier_v2.joblib   # Version 2 (when snapshot exists)
└── artifacts/
    ├── embeddings.npy         # Numpy arrays
    ├── config.json            # JSON data
    └── plot.png               # Any file type
```
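Because everything is stored as ordinary files, a bundle can be inspected without DataFolio at all. A stdlib-only sketch that simulates the `artifacts/` part of the layout above in a throwaway directory and reads a JSON artifact back (paths are illustrative):

```python
import json
import tempfile
from pathlib import Path

# Simulate the artifacts/ part of the layout in a temp directory
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "experiments" / "my_experiment"
    (bundle / "artifacts").mkdir(parents=True)
    (bundle / "artifacts" / "config.json").write_text(json.dumps({"lr": 0.01}))

    # Any JSON-aware tool can read the artifact -- no DataFolio required
    config = json.loads((bundle / "artifacts" / "config.json").read_text())
    print(config)  # {'lr': 0.01}
```

Tables are plain Parquet, so something like `pd.read_parquet(bundle / 'tables' / 'results.parquet')` works the same way.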
Snapshots let you create immutable checkpoints of your experiments, making it easy to track different versions, compare results, and return to previous states without duplicating data.
The Problem: You train a model with 89% accuracy, then experiment with improvements. The new version gets 85%—worse! But you've already overwritten your good model. You need to recreate it from git history.
The Solution: Create snapshots before experimenting. Snapshots preserve exact states while sharing unchanged data to save disk space.
```python
from datafolio import DataFolio

# Create your experiment
folio = DataFolio('experiments/classifier')
folio.add_data('train_data', train_df)
folio.add_model('model', baseline_model)
folio.metadata['accuracy'] = 0.89

# Create a snapshot before experimenting
folio.create_snapshot('v1.0-baseline',
                      description='Baseline random forest model',
                      tags=['baseline', 'production'])

# Experiment freely - the snapshot is preserved
folio.add_model('model', experimental_model, overwrite=True)
folio.metadata['accuracy'] = 0.85  # Worse!

# Load the original version
baseline = DataFolio.load_snapshot('experiments/classifier', 'v1.0-baseline')
model = baseline.get_model('model')  # Original model with 89% accuracy!
```

DataFolio includes a command-line tool for easy snapshot operations:
```bash
# Create a snapshot
datafolio snapshot create v1.0 -d "Baseline model" -t baseline

# List all snapshots
datafolio snapshot list

# Show snapshot details
datafolio snapshot show v1.0

# Compare two snapshots
datafolio snapshot compare v1.0 v2.0

# Delete old snapshots and cleanup
datafolio snapshot delete experimental-v5 --cleanup

# Show reproduction instructions
datafolio snapshot reproduce v1.0
```

- Immutable: Once created, snapshots never change—guaranteed reproducibility
- Space-efficient: Uses copy-on-write versioning—only changed items create new files
- Git integration: Automatically captures commit hash, branch, and dirty status
- Environment tracking: Records Python version and dependencies for full reproducibility
- Metadata preservation: Snapshots include complete metadata state at that moment
- Multiple snapshots: Load different versions simultaneously for comparison
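Copy-on-write means a snapshot records which file versions it points to, and only items that changed since the snapshot produce new files. A toy content-addressed sketch of the idea (not DataFolio's actual implementation):

```python
import hashlib

def snapshot(items, store):
    """Record a snapshot as {name: content_hash}; store each distinct
    content blob only once, so unchanged items share storage."""
    manifest = {}
    for name, content in items.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        store.setdefault(digest, content)  # no-op if content already stored
        manifest[name] = digest
    return manifest

store = {}
s1 = snapshot({"train": "v1-data", "model": "rf-89"}, store)
s2 = snapshot({"train": "v1-data", "model": "rf-85"}, store)  # only model changed

assert len(store) == 3           # 'train' content stored once, shared by both
assert s1["train"] == s2["train"]
assert s1["model"] != s2["model"]
```

Because manifests only reference immutable, content-addressed blobs, an old snapshot stays valid no matter how the working bundle evolves — which is exactly the "free insurance" property described above.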
Paper Submission: Snapshot your exact code, data, and model state when submitting. Months later, you can reproduce those exact results.
A/B Testing: Create snapshots for baseline and experimental versions, deploy both, and compare performance metrics.
Hyperparameter Tuning: Snapshot each configuration, then compare results to find the best settings.
Production Deployment: Tag production-ready snapshots and deploy specific versions with confidence.
For complete snapshot documentation, see snapshots.md.
```python
from datafolio import DataFolio
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Initialize
folio = DataFolio('experiments/classifier_v1')

# Reference external data
folio.add_data('raw', reference='s3://bucket/raw.csv',
               description='Raw training data from database')

# Add processed data
folio.add_data('clean', cleaned_df,
               description='Cleaned and preprocessed data',
               inputs=['raw'])

# Add features
folio.add_data('features', feature_df,
               description='Engineered features',
               inputs=['clean'])

# Train and save model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
folio.add_model('classifier', clf,
                description='Random forest classifier',
                inputs=['features'])

# Save metrics
folio.add_data('metrics', {
    'accuracy': 0.95,
    'f1': 0.92,
    'precision': 0.94
})

# Add custom metadata to the folio itself
folio.metadata['experiment_name'] = 'rf_baseline'
folio.metadata['tags'] = ['classification', 'production']

# View summary (shows data and custom metadata)
folio.describe()

# Access data with autocomplete
features = folio.data.features.content
metrics = folio.data.metrics.content
trained_model = folio.data.classifier.content
```

- Use descriptive names: `add_data('training_features', ...)`, not `add_data('data1', ...)`
- Track lineage: Always specify `inputs` to track data dependencies
- Add descriptions: Help future you understand what each item contains
- Use custom metadata: Store experiment context in `folio.metadata` for better tracking
- Leverage autocomplete: Use `folio.data.item_name.content` for cleaner, more discoverable code
- Clean up regularly: Use `delete()` to remove temporary or obsolete items
- Version control: Commit your folio directories to git (data is stored efficiently)
- Use references: For large external datasets, use `reference` to avoid copying
- Check `describe()`: Regularly review your folio with `folio.describe()` to see data and metadata
- Share across notebooks: Multiple DataFolio instances can safely access the same bundle—changes are automatically detected and synchronized
- Snapshot before major changes: Create snapshots before experimenting with new approaches—it's free insurance
- Tag snapshots meaningfully: Use tags like `baseline`, `production`, and `paper` to organize versions
```bash
# Clone the repo
git clone https://github.com/caseysm/datafolio.git
cd datafolio

# Install with dev dependencies
uv sync

# Run tests
poe test

# Preview documentation
poe doc-preview

# Lint
uv run ruff check src/ tests/

# Bump version
poe bump patch  # or minor, major
```

For complete API documentation and detailed guides, see the full documentation.
- Python 3.10+
- pandas >= 2.0.0
- pyarrow >= 14.0.0
- joblib >= 1.3.0
- orjson >= 3.9.0
- cloud-files >= 5.8.1
- click >= 8.1.0 (for CLI)
- rich >= 13.0.0 (for CLI formatting)
MIT License - see LICENSE file for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass (`poe test`)
- Submit a pull request
See CLAUDE.md for development guidelines.
Made with ❤️ for data scientists who need simple, lightweight experiment tracking.