UriBer/AIFS

AIFS - AI-Native File System

AI-native File System - A next-generation file system designed from the ground up for AI/ML workloads, featuring content addressing, vector-first metadata, versioned snapshots, and semantic search capabilities.

🎯 Current Status

Version: 0.1.0-alpha
Test Coverage: 92.3% (170+ tests)
Implementation: Core functionality complete
Docker: Production-ready containerization

✅ Implemented Features

  • Content Addressing: BLAKE3-based content addressing
  • Vector Search: Semantic similarity search with FAISS
  • Asset Kinds: Complete implementation of Blob, Tensor (Arrow2), Embed (FlatBuffers), Artifact (ZIP+MANIFEST)
  • Strong Causality: Transaction system ensuring "Asset B SHALL NOT be visible until A is fully committed"
  • Ed25519 Signatures: Complete snapshot root signing and verification with namespace key management
  • Encryption: AES-256-GCM with KMS integration
  • Versioning: Merkle tree-based snapshots with Ed25519 signatures
  • gRPC API: High-performance RPC interface with reflection (dev mode)
  • URI Schemes: Canonical aifs:// and aifs-snap:// identifiers
  • Authorization: Macaroon-based capability tokens
  • Compression: Gzip compression for transport
  • Error Handling: Structured error responses with google.rpc.Status
  • Docker Support: Production-ready containerization with Docker Compose
  • Testing: Comprehensive test suite with 26+ asset kinds tests, 23+ strong causality tests, and 23+ Ed25519 signature tests
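
The content-addressing idea above can be sketched in a few lines: an asset's ID is the hash of its bytes, so identical data automatically deduplicates. AIFS uses BLAKE3; the stdlib `hashlib.blake2b` stands in here so the sketch runs without the third-party `blake3` package, and the `ContentStore` class is illustrative, not the AIFS API.

```python
import hashlib

def asset_id(data: bytes) -> str:
    """Derive a content address from asset bytes.

    AIFS uses BLAKE3; hashlib.blake2b (stdlib) stands in here so the
    sketch has no third-party dependency.
    """
    return hashlib.blake2b(data, digest_size=32).hexdigest()

class ContentStore:
    """Minimal content-addressed store: identical bytes share one entry."""

    def __init__(self):
        self._chunks = {}

    def put(self, data: bytes) -> str:
        cid = asset_id(data)
        self._chunks.setdefault(cid, data)  # dedup: no-op if already stored
        return cid

    def get(self, cid: str) -> bytes:
        return self._chunks[cid]
```

Storing the same bytes twice yields the same ID and occupies storage once, which is the property the real storage backend relies on.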

🚧 In Progress

  • Performance optimization and benchmarking
  • Advanced monitoring and metrics

📋 Planned Features

  • FUSE layer for POSIX compatibility
  • Pre-signed URLs for direct streaming
  • Ingest operators for automatic embedding generation
  • Extending strong-causality guarantees to lineage tracking
  • Advanced performance optimization

🚀 Quick Start

Option 1: Docker Hub (Recommended)

# Pull and run the latest version from Docker Hub
docker pull uriber/aifs:latest
docker run -p 50051:50051 -v aifs-data:/data/aifs uriber/aifs:latest

# Or use a specific version
docker pull uriber/aifs:v0.1.0-alpha
docker run -p 50051:50051 -v aifs-data:/data/aifs uriber/aifs:v0.1.0-alpha

# Test the API
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check

Option 2: Local Docker Build

cd local_implementation

# Build and run with Docker Compose
docker-compose up -d

# Or build and run manually
./docker-build.sh
docker run -p 50051:50051 -v aifs-data:/data/aifs aifs:latest

Option 3: Automated Installation

cd local_implementation
python install.py

Option 4: Manual Installation

cd local_implementation
pip install -r requirements.txt

🔧 FAISS Installation Issues

If you encounter FAISS installation problems, the system will automatically fall back to scikit-learn for vector operations. However, for optimal performance, you should install FAISS.

Quick FAISS Installation

# Try the dedicated FAISS installer
python install_faiss.py

Manual FAISS Installation

Option 1: Conda (Recommended)

# Install Anaconda/Miniconda first, then:
conda install -c conda-forge faiss-cpu

Option 2: System Dependencies + Pip

# macOS (with Homebrew)
brew install swig libomp openblas cmake
pip install faiss-cpu

# Ubuntu/Debian
sudo apt-get install swig libomp-dev libopenblas-dev cmake build-essential
pip install faiss-cpu

# CentOS/RHEL
sudo yum install swig libomp-devel openblas-devel cmake gcc-c++
pip install faiss-cpu

Option 3: Pre-built Wheels

pip install faiss-cpu --only-binary=:all:

Fallback Behavior

  • With FAISS: High-performance vector similarity search
  • Without FAISS: Uses scikit-learn fallback (slower but functional)
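
The fallback pattern looks roughly like this: try to import FAISS, and if it is missing, fall back to a brute-force scan so search keeps working. The brute-force path here is a pure-Python stand-in for the scikit-learn fallback AIFS actually uses, and `knn` is an illustrative helper, not the AIFS API.

```python
# Prefer FAISS when installed, degrade gracefully otherwise.
try:
    import faiss  # optional high-performance backend
    HAVE_FAISS = True
except ImportError:
    HAVE_FAISS = False

def knn(vectors, query, k=1):
    """Return indices of the k nearest vectors by L2 distance."""
    if HAVE_FAISS:
        import numpy as np
        xb = np.asarray(vectors, dtype="float32")
        index = faiss.IndexFlatL2(xb.shape[1])  # exact L2 index
        index.add(xb)
        _, ids = index.search(np.asarray([query], dtype="float32"), k)
        return list(ids[0])
    # Fallback: brute-force scan (AIFS uses scikit-learn here instead).
    dists = [sum((a - b) ** 2 for a, b in zip(v, query)) for v in vectors]
    return sorted(range(len(vectors)), key=dists.__getitem__)[:k]
```

Both paths return the same neighbors for small inputs; FAISS only changes the speed, not the result.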

🏗️ Architecture Overview

The implementation follows the layered architecture specified in the AIFS RFC:

┌─────────────────────────────────────┐
│          Application Layer          │
│         (CLI, FUSE, Client)         │
├─────────────────────────────────────┤
│           gRPC API Layer            │
│         (Built-in Services)         │
├─────────────────────────────────────┤
│         Core AIFS Services          │
│   (Asset, Storage, Vector, Auth)    │
├─────────────────────────────────────┤
│            Storage Layer            │
│   (Encrypted, Content-Addressed)    │
└─────────────────────────────────────┘

✨ Key Features Implemented

🔐 Security & Cryptography

  • AES-256-GCM Encryption: All data chunks are encrypted at rest
  • Ed25519 Signatures: Cryptographic verification of snapshots
  • Macaroon Authorization: Capability-based access control
  • Content Addressing: BLAKE3-based deduplication

🌳 Merkle Trees & Snapshots

  • Proper Merkle Trees: Binary tree structure for efficient verification
  • Merkle Proofs: Cryptographic proofs for asset inclusion
  • Snapshot Signatures: Ed25519-signed snapshot roots
  • Lineage Tracking: DAG-based transformation history

🔍 Vector Search & AI

  • FAISS Integration: High-performance similarity search (when available)
  • scikit-learn Fallback: Functional vector search when FAISS unavailable
  • Embedding Storage: Vector database for AI workloads
  • Semantic Search: k-NN search over embeddings
  • Metadata Indexing: Rich metadata querying

🗄️ Storage & Metadata

  • SQLite Metadata Store: ACID-compliant metadata storage
  • Encrypted Storage Backend: AES-256-GCM encrypted chunks
  • Namespace Management: Multi-tenant isolation
  • Lineage Graph: Parent-child relationship tracking
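
A minimal version of the metadata store can be expressed with stdlib `sqlite3`: one table for assets, one for parent-child lineage edges, with SQLite transactions supplying the ACID guarantees. The schema below is a hypothetical simplification; the real store in aifs/metadata.py is richer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assets (
        asset_id TEXT PRIMARY KEY,
        kind     TEXT NOT NULL,
        created  TEXT DEFAULT CURRENT_TIMESTAMP
    )""")
conn.execute("""
    CREATE TABLE lineage (
        child  TEXT REFERENCES assets(asset_id),
        parent TEXT REFERENCES assets(asset_id)
    )""")

with conn:  # one transaction: all rows land or none do
    conn.execute("INSERT INTO assets (asset_id, kind) VALUES (?, ?)", ("a1", "blob"))
    conn.execute("INSERT INTO assets (asset_id, kind) VALUES (?, ?)", ("a2", "tensor"))
    conn.execute("INSERT INTO lineage (child, parent) VALUES (?, ?)", ("a2", "a1"))

# Walk one lineage edge: which assets was a2 derived from?
parents = conn.execute(
    "SELECT parent FROM lineage WHERE child = ?", ("a2",)).fetchall()
```

The `lineage` table is the adjacency list of the DAG mentioned above; recursive CTEs can walk the full transformation history.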

🚀 Performance & Scalability

  • zstd Compression: Efficient data compression
  • Streaming I/O: Chunked data transfer
  • Content Deduplication: Eliminates redundant storage
  • Sharded Storage: Efficient file system organization
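
Sharded storage typically means fanning chunks out into subdirectories keyed by hash prefix so no single directory grows huge. The two-level layout below is an assumption for illustration, not the documented AIFS on-disk format, and `blake2b` again stands in for BLAKE3.

```python
import hashlib
from pathlib import Path

def chunk_path(root: str, data: bytes) -> Path:
    """Map chunk bytes to a sharded on-disk path.

    Two-level fan-out by hash prefix (e.g. root/ab/cd/abcd...) keeps
    directory sizes bounded regardless of chunk count.
    """
    digest = hashlib.blake2b(data, digest_size=32).hexdigest()
    return Path(root) / digest[:2] / digest[2:4] / digest
```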

📁 File Structure

local_implementation/
├── aifs/                          # Core AIFS implementation
│   ├── __init__.py
│   ├── asset.py                   # Asset management
│   ├── auth.py                    # Authorization system
│   ├── client.py                  # gRPC client
│   ├── compression.py             # Compression service
│   ├── crypto.py                  # Cryptographic operations
│   ├── errors.py                  # Structured error handling
│   ├── fuse.py                    # FUSE layer
│   ├── merkle.py                  # Merkle tree implementation
│   ├── metadata.py                # Metadata store
│   ├── proto/                     # Protocol definitions
│   ├── server.py                  # gRPC server
│   ├── storage.py                 # Storage backend
│   ├── uri.py                     # URI scheme handling
│   └── vector_db.py               # Vector database (FAISS + fallback)
├── tests/                         # Comprehensive test suite
│   ├── test_asset_manager.py      # Asset manager tests
│   ├── test_auth.py               # Authorization tests
│   ├── test_basic.py              # Basic functionality tests
│   ├── test_builtin_services.py   # Built-in services tests
│   ├── test_compression.py        # Compression tests
│   ├── test_crypto.py             # Cryptographic tests
│   ├── test_merkle_tree.py        # Merkle tree tests
│   ├── test_storage.py            # Storage tests
│   ├── test_blake3_uri.py         # BLAKE3 and URI tests
│   ├── test_error_handling.py     # Error handling tests
│   ├── test_encryption_kms.py     # Encryption and KMS tests
│   ├── test_grpc_server.py        # gRPC server tests
│   └── test_merkle_blake3.py      # Merkle tree with BLAKE3 tests
├── examples/                      # Usage examples
├── install.py                     # Automated installer
├── install_faiss.py               # FAISS installation helper
├── run_tests.py                   # Test runner
├── start_server.py                # Server startup script
├── aifs_cli.py                    # Command-line interface
├── requirements.txt               # Dependencies
├── Dockerfile                     # Docker container definition
├── docker-compose.yml             # Docker Compose orchestration
├── docker-build.sh                # Docker build script
├── docker-run.sh                  # Docker run script
├── .dockerignore                  # Docker build exclusions
├── DOCKER.md                      # Docker documentation
├── Makefile                       # Development automation
└── README_IMPLEMENTATION.md       # Detailed implementation guide

🧪 Testing

Run All Tests

python run_tests.py

Run Specific Test Suite

python run_tests.py merkle_tree    # Merkle tree tests
python run_tests.py crypto         # Cryptographic tests
python run_tests.py storage        # Storage tests
python run_tests.py compression    # Compression tests
python run_tests.py asset_manager  # Asset manager tests

Test Coverage

The test suite covers:

  • ✅ All core components
  • ✅ Cryptographic operations
  • ✅ Authorization system
  • ✅ Storage backend
  • ✅ Merkle tree operations
  • ✅ Vector search (FAISS + fallback)
  • ✅ Error handling
  • ✅ Edge cases

🚀 Usage Examples

Docker Deployment

# Start with Docker Compose (recommended)
docker-compose up -d

# Or run manually
./docker-build.sh
docker run -p 50051:50051 -v aifs-data:/data/aifs aifs:latest

# Development mode with gRPC reflection
docker run -p 50051:50051 -v aifs-data:/data/aifs aifs:latest python start_server.py --dev

Local Development

# Start the server
python start_server.py --port 50051 --storage-dir ~/.aifs

# Development mode with gRPC reflection
python start_server.py --dev --port 50051 --storage-dir ~/.aifs

Use the CLI

# Store an asset
python aifs_cli.py put --kind blob ./tests/files/data.txt 

# Search for assets
python aifs_cli.py search --query "test data"

# Create a snapshot
python aifs_cli.py snapshot --namespace test --assets asset1,asset2

# List assets
python aifs_cli.py list

Use the Python Client

from aifs.client import AIFSClient

# Connect to server
client = AIFSClient("localhost:50051")

# Store asset
asset_id = client.put_asset(
    data=b"Hello, AIFS!",
    kind="blob",
    metadata={"description": "Test asset"}
)

# Retrieve asset
asset = client.get_asset(asset_id)

# Vector search
results = client.vector_search(query_embedding, k=10)

Use the FUSE Layer

# Mount AIFS as a filesystem
python -c "
from aifs.fuse import AIFSFuse
from aifs.client import AIFSClient
import fuse

client = AIFSClient('localhost:50051')
fuse_ops = AIFSFuse(client, 'default')
fuse.FUSE(fuse_ops, '/mnt/aifs')
"

🔧 Configuration

Environment Variables

export AIFS_ROOT_DIR=~/.aifs           # Data directory
export AIFS_SERVER_PORT=50051          # Server port
export AIFS_ENCRYPTION_KEY=your_key    # Encryption key (32 bytes)
export AIFS_PRIVATE_KEY=your_priv_key  # Ed25519 private key
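
A server reading these variables might resolve them with defaults roughly as follows. The variable names come from the list above; the parsing details and the `load_config` helper are assumptions, not the actual AIFS startup code.

```python
import os
from pathlib import Path

def load_config(env=os.environ):
    """Resolve AIFS settings from environment variables with defaults."""
    return {
        "root_dir": Path(env.get("AIFS_ROOT_DIR", "~/.aifs")).expanduser(),
        "port": int(env.get("AIFS_SERVER_PORT", "50051")),
        "encryption_key": env.get("AIFS_ENCRYPTION_KEY"),  # None => server generates one
        "private_key": env.get("AIFS_PRIVATE_KEY"),
    }
```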

Server Configuration

# Custom configuration
from aifs.server import serve

serve(
    root_dir="~/.aifs",
    port=50051,
    max_workers=20
)

🔒 Security Features

Encryption

  • AES-256-GCM: Authenticated encryption for all stored data
  • Key Derivation: HKDF-based key derivation
  • Nonce Management: Secure random nonce generation
  • Authenticated Encryption: Integrity and confidentiality
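
HKDF-based key derivation (RFC 5869) is an extract-and-expand construction that can be written with stdlib `hmac` alone. This is a sketch of the algorithm itself, not the code path AIFS uses, which may rely on a crypto library.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: derive `length` key bytes from
    input keying material `ikm`, a `salt`, and context `info`."""
    # Extract: concentrate entropy into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: stretch the PRK into as many output blocks as needed.
    okm, t = b"", b""
    for i in range((length + 31) // 32):
        t = hmac.new(prk, t + info + bytes([i + 1]), hashlib.sha256).digest()
        okm += t
    return okm[:length]
```

Binding `info` to a purpose string (for example a namespace or chunk ID) yields independent keys from one master secret, so compromising one derived key does not expose the others.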

Authentication

  • Ed25519 Signatures: Fast, secure digital signatures
  • Public Key Verification: Cryptographic proof of authenticity
  • Timestamp Validation: Prevents replay attacks
  • Namespace Isolation: Multi-tenant security

Authorization

  • Macaroon Tokens: Capability-based access control
  • Method Restrictions: Fine-grained permission control
  • Namespace Scoping: Resource isolation
  • Expiry Management: Time-limited access tokens
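
Macaroons get these properties from HMAC chaining: each caveat (a method restriction, namespace scope, or expiry) is folded into the signature, so a holder can only attenuate a token, never broaden it. The sketch below shows the chaining idea; the caveat strings are illustrative, not the AIFS caveat schema.

```python
import hashlib
import hmac

def mint(root_key: bytes, identifier: bytes) -> bytes:
    """Create the base signature for a new macaroon."""
    return hmac.new(root_key, identifier, hashlib.sha256).digest()

def add_caveat(sig: bytes, caveat: bytes) -> bytes:
    """Fold a restriction into the signature (attenuation only)."""
    return hmac.new(sig, caveat, hashlib.sha256).digest()

def verify(root_key: bytes, identifier: bytes, caveats, sig: bytes) -> bool:
    """Recompute the chain server-side and compare in constant time."""
    expect = mint(root_key, identifier)
    for c in caveats:
        expect = add_caveat(expect, c)
    return hmac.compare_digest(expect, sig)
```

Only the server knows `root_key`, so clients cannot forge a signature for a caveat list they did not receive.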

📊 Performance Characteristics

Storage Performance

  • Content Deduplication: Eliminates redundant storage
  • Sharded Storage: Efficient file system organization
  • Compression: zstd compression for space efficiency
  • Streaming I/O: Efficient large file handling

Search Performance

  • FAISS Integration: High-performance vector search (when available)
  • scikit-learn Fallback: Functional vector search (slower)
  • Index Optimization: Optimized for similarity queries
  • Caching: Metadata and embedding caching
  • Parallel Processing: Multi-threaded operations

Network Performance

  • gRPC Streaming: Efficient data transfer
  • Compression: gzip compression for transport efficiency
  • Connection Pooling: Reusable connections
  • Load Balancing: Ready for horizontal scaling
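
Streaming transfer boils down to slicing a payload into fixed-size chunks and sending them one at a time. The generator below illustrates the chunking step; the 1 MiB default is an assumption, not a documented AIFS constant.

```python
def iter_chunks(data: bytes, size: int = 1 << 20):
    """Yield fixed-size chunks of `data` for streaming transfer.

    The last chunk may be shorter; an empty payload yields nothing.
    """
    for off in range(0, len(data), size):
        yield data[off:off + size]
```

On the wire each chunk would become one gRPC stream message, keeping memory use flat for arbitrarily large assets.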

🚧 Limitations & Future Work

Current Limitations

  • Hash Algorithm: BLAKE3 content addressing adds a Rust build dependency when installing from source (pre-bundled in the Docker images)
  • Vector Search: Falls back to scikit-learn if FAISS unavailable
  • Performance: Local implementation, not production-optimized
  • Scalability: Single-node implementation
  • Monitoring: Basic metrics only

Planned Improvements

  • BLAKE3 Packaging: Remove the Rust build dependency for source installs
  • FAISS Optimization: Ensure FAISS is always available
  • Performance Optimization: Meet RFC performance targets
  • Distributed Storage: Multi-node deployment
  • Advanced Monitoring: OpenTelemetry integration
  • Load Testing: Performance benchmarking
  • Security Audit: Penetration testing

🐛 Troubleshooting

Common Issues

FAISS Installation Problems

error: command 'swig' failed: No such file or directory

Solutions:

  1. Use the FAISS installer: python install_faiss.py
  2. Install system dependencies:
    • macOS: brew install swig libomp openblas cmake
    • Ubuntu: sudo apt-get install swig libomp-dev libopenblas-dev cmake
  3. Use conda: conda install -c conda-forge faiss-cpu
  4. Accept fallback: System will use scikit-learn automatically

Rust Compilation Error

error: Cargo, the Rust package manager, is not installed

Solution: The Docker images already bundle the Rust dependency that BLAKE3 needs. For a local install, install the Rust toolchain (e.g. via rustup) before running pip install.

Permission Errors

error: Permission denied

Solution: Check file permissions and run with appropriate user

Port Already in Use

error: Address already in use

Solution: Change port or stop existing service

Debug Mode

# Enable debug logging
export AIFS_LOG_LEVEL=DEBUG
python start_server.py

# Run tests with verbose output
python run_tests.py --verbose

Performance Tuning

# Check vector database backend
python -c "from aifs.vector_db import VectorDB; vdb = VectorDB('/tmp'); print(vdb.get_stats())"

# Verify FAISS installation
python -c "import faiss; print('FAISS version:', faiss.__version__)"

📚 API Reference

Core Classes

AssetManager

class AssetManager:
    def put_asset(data, kind, embedding=None, metadata=None, parents=None)
    def get_asset(asset_id)
    def vector_search(query_embedding, k=10)
    def create_snapshot(namespace, asset_ids, metadata=None)
    def verify_snapshot(snapshot_id, public_key)

StorageBackend

class StorageBackend:
    def put(data)
    def get(hash_hex)
    def exists(hash_hex)
    def delete(hash_hex)
    def get_chunk_info(hash_hex)

CryptoManager

class CryptoManager:
    def sign_snapshot(merkle_root, timestamp, namespace)
    def verify_snapshot_signature(signature, merkle_root, timestamp, namespace, public_key)
    def get_public_key()

MerkleTree

class MerkleTree:
    def get_root_hash()
    def get_proof(asset_id)
    def verify_proof(asset_id, proof, root_hash)

VectorDB

class VectorDB:
    def add(asset_id, embedding)
    def search(query_embedding, k=10)
    def delete(asset_id)
    def get_stats()  # Shows backend (FAISS or scikit-learn)

🤝 Contributing

Development Setup

# Clone repository
git clone https://github.com/UriBer/AIFS.git
cd local_implementation

# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black flake8

# Run code formatting
black aifs/ tests/

# Run linting
flake8 aifs/ tests/

# Run tests with coverage
pytest --cov=aifs tests/

Code Style

  • Python: PEP 8 compliance
  • Type Hints: Full type annotation
  • Documentation: Comprehensive docstrings
  • Testing: 90%+ test coverage target

📄 License

This implementation is provided under the same license as the main project. See the LICENSE file for details.

🙏 Acknowledgments

  • Open Source Community: For the excellent libraries used

📖 Documentation

Note: This implementation prioritizes functionality and security over performance optimization. For production deployment, additional performance tuning and security hardening are recommended.

Vector Search Note: The system automatically falls back to scikit-learn if FAISS is unavailable, ensuring functionality while maintaining the option for high-performance vector search when FAISS is properly installed.

About

AI-native File System - Designing an AI-native file system to replace ext4, NTFS, or any other traditional file system means rethinking storage from the ground up: it is built for machine learning, vector search, semantic retrieval, and continuous training and inference pipelines, not just for human-named files and folders.
