# Manifold OS Test Notebook

**Testing the complete pipeline:**
1. Synthetic data generation
2. Token absorption via Kernel
3. HLLSet creation and operations
4. N-token ingestion via ManifoldOS
5. Persistent storage to DuckDB
6. Fractal loop primitives (new!)

**Architecture layers tested:**
- Layer 1: HLLSet (Register Layer)
- Layer 2: BSS Morphisms (Lattice Construction)
- Layer 3: Fractal Loop (Scale Hierarchy)

In [1]:
# Imports
import sys
import time
from pathlib import Path
import tempfile

# Core imports
from core import (
    HLLSet, Kernel, compute_sha1,
    P_BITS, SHARED_SEED, DEFAULT_TAU, DEFAULT_RHO
)

# Fractal imports (new!)
from core import (
    n_tokenize, multi_scale_tokenize, overlap,
    build_token_tower, entanglement_number
)

# ManifoldOS
from core.manifold_os import ManifoldOS, TokenizationConfig

print(f"P_BITS={P_BITS}, τ={DEFAULT_TAU}, ρ={DEFAULT_RHO}")
print("✓ All imports successful")

P_BITS=10, τ=0.7, ρ=0.3
✓ All imports successful


## 1. Generate Synthetic Data

Create test data with known structure for validation.

In [2]:
# Synthetic data generator
import random
random.seed(42)

# Sample vocabularies
NOUNS = ['customer', 'order', 'product', 'invoice', 'payment', 'account', 'user', 'item', 'category', 'supplier']
VERBS = ['creates', 'updates', 'deletes', 'processes', 'validates', 'approves', 'rejects', 'modifies']
ADJECTIVES = ['new', 'pending', 'active', 'completed', 'cancelled', 'urgent', 'standard', 'premium']

def generate_sentence() -> str:
    """Generate a simple sentence."""
    adj = random.choice(ADJECTIVES)
    noun = random.choice(NOUNS)
    verb = random.choice(VERBS)
    noun2 = random.choice(NOUNS)
    return f"{adj} {noun} {verb} {noun2}"

def generate_dataset(n_sentences: int = 100) -> list:
    """Generate dataset of sentences."""
    return [generate_sentence() for _ in range(n_sentences)]

# Generate test data
test_sentences = generate_dataset(50)
print(f"Generated {len(test_sentences)} sentences")
print("\nSample sentences:")
for s in test_sentences[:5]:
    print(f"  • {s}")

Generated 50 sentences

Sample sentences:
  • pending customer validates invoice
  • completed product updates category
  • pending supplier rejects customer
  • new order processes invoice
  • new category processes category
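
The cell above seeds the module-level `random` generator, so later calls such as `generate_dataset(20)` depend on how many random draws happened in between. A minimal sketch of an alternative, assuming a private `random.Random` instance per dataset; `make_sentence_generator` is a hypothetical helper introduced here, and the vocabularies are shortened for brevity:

```python
import random
from typing import Callable, List

# Shortened vocabularies, for illustration only.
NOUNS = ["customer", "order", "product", "invoice"]
VERBS = ["creates", "updates", "validates"]
ADJECTIVES = ["new", "pending", "completed"]


def make_sentence_generator(seed: int) -> Callable[[], str]:
    """Return a sentence generator backed by its own seeded RNG."""
    rng = random.Random(seed)  # private state, unaffected by the global RNG

    def generate_sentence() -> str:
        adj = rng.choice(ADJECTIVES)
        noun = rng.choice(NOUNS)
        verb = rng.choice(VERBS)
        noun2 = rng.choice(NOUNS)
        return f"{adj} {noun} {verb} {noun2}"

    return generate_sentence


def generate_dataset(n_sentences: int, seed: int = 42) -> List[str]:
    """Generate a dataset that depends only on n_sentences and seed."""
    gen = make_sentence_generator(seed)
    return [gen() for _ in range(n_sentences)]


first = generate_dataset(5)
random.random()  # touching the global RNG does not change the result
second = generate_dataset(5)
print(first == second)  # True
```

With this shape, any cell can regenerate the same test data regardless of execution order.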


## 2. Direct Kernel Operations

Test the stateless Kernel for token absorption and HLLSet operations.

In [3]:
# Initialize kernel
kernel = Kernel()

# Absorb tokens from first sentence
sentence1 = test_sentences[0]
tokens1 = sentence1.split()
print(f"Sentence: '{sentence1}'")
print(f"Tokens: {tokens1}")

# Create HLLSet via absorb
hll1 = kernel.absorb(tokens1)
print(f"\nHLLSet created:")
print(f"  Name (SHA1): {hll1.name[:16]}...")
print(f"  Cardinality: {hll1.cardinality():.1f}")
print(f"  P_BITS: {hll1.p_bits}")

Sentence: 'pending customer validates invoice'
Tokens: ['pending', 'customer', 'validates', 'invoice']

HLLSet created:
  Name (SHA1): d005c83ea421f96d...
  Cardinality: 4.0
  P_BITS: 10
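
For intuition about what `absorb` and `cardinality()` compute, here is a minimal HyperLogLog sketch in plain Python. It is an illustration under assumptions, not the `core` implementation: the SHA-1-derived 64-bit hash, the helper names `absorb_tokens` and `estimate_cardinality`, and the linear-counting correction for small sets are choices made for this example. Only the register count (2**P_BITS with P_BITS=10) is taken from the notebook.

```python
import hashlib
import math
from typing import Iterable, List

P_BITS = 10  # matches the value printed by the notebook


def _hash64(token: str) -> int:
    """64-bit hash from the first 8 bytes of SHA-1 (an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")


def absorb_tokens(tokens: Iterable[str], p_bits: int = P_BITS) -> List[int]:
    """Fold tokens into 2**p_bits registers, keeping the maximum rank per register."""
    registers = [0] * (1 << p_bits)
    rest_bits = 64 - p_bits
    for tok in tokens:
        h = _hash64(tok)
        idx = h >> rest_bits                   # top p_bits select the register
        rest = h & ((1 << rest_bits) - 1)      # remaining bits determine the rank
        rank = rest_bits - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    return registers


def estimate_cardinality(registers: List[int]) -> float:
    """Standard HLL estimate with linear counting for small cardinalities."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)         # small-range correction
    return raw


regs = absorb_tokens(["pending", "customer", "validates", "invoice"])
print(f"registers={len(regs)}, estimate={estimate_cardinality(regs):.1f}")
```

Absorbing the same token twice leaves the registers unchanged, which is why the structure behaves like a set rather than a multiset.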


In [5]:
# Test HLLSet operations: union, intersection, Jaccard similarity
sentence2 = test_sentences[1]
tokens2 = sentence2.split()
hll2 = kernel.absorb(tokens2)

print(f"HLL1: '{sentence1}' -> card={hll1.cardinality():.1f}")
print(f"HLL2: '{sentence2}' -> card={hll2.cardinality():.1f}")

# Union
hll_union = kernel.union(hll1, hll2)
print(f"\nUnion: card={hll_union.cardinality():.1f}")

# Intersection
hll_inter = kernel.intersection(hll1, hll2)
print(f"Intersection: card={hll_inter.cardinality():.1f}")

# Similarity (Jaccard)
similarity = kernel.similarity(hll1, hll2)
print(f"Jaccard similarity: {similarity:.3f}")

HLL1: 'pending customer validates invoice' -> card=4.0
HLL2: 'completed product updates category' -> card=5.0

Union: card=9.0
Intersection: card=0.0
Jaccard similarity: 0.000
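
As a cross-check, the same operations can be computed exactly with Python sets. The sketch below is independent of `Kernel` and only shows the identities the estimates should approximately satisfy: |A ∪ B| = |A| + |B| − |A ∩ B| and Jaccard = |A ∩ B| / |A ∪ B|. Note that the HLLSet figures are estimates, so they can differ from exact counts: HLL2 reports 5.0 for four distinct tokens, and the union reports 9.0 where the exact value is 8.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s1 = set("pending customer validates invoice".split())
s2 = set("completed product updates category".split())

union_size = len(s1 | s2)
inter_size = len(s1 & s2)

# Inclusion-exclusion holds exactly for true sets.
assert union_size == len(s1) + len(s2) - inter_size

print(union_size, inter_size, jaccard(s1, s2))  # 8 0 0.0
```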


## 3. Fractal Loop Primitives (New!)

Test the new fractal_core module: n-tokenization, overlap, scale hierarchy.

In [6]:
# Test n-tokenization
all_tokens = ' '.join(test_sentences[:5]).split()
print(f"Sample tokens ({len(all_tokens)}): {all_tokens[:10]}...")

# Generate n-grams
unigrams = n_tokenize(all_tokens, 1)
bigrams = n_tokenize(all_tokens, 2)
trigrams = n_tokenize(all_tokens, 3)

print(f"\n1-grams: {len(unigrams)} tokens")
print(f"2-grams: {len(bigrams)} tokens")
print(f"3-grams: {len(trigrams)} tokens")

print(f"\nSample bigrams: {bigrams[:5]}")
print(f"Sample trigrams: {trigrams[:3]}")

Sample tokens (20): ['pending', 'customer', 'validates', 'invoice', 'completed', 'product', 'updates', 'category', 'pending', 'supplier']...

1-grams: 20 tokens
2-grams: 19 tokens
3-grams: 18 tokens

Sample bigrams: ['pending|customer', 'customer|validates', 'validates|invoice', 'invoice|completed', 'completed|product']
Sample trigrams: ['pending|customer|validates', 'customer|validates|invoice', 'validates|invoice|completed']
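
The output is consistent with a sliding window of width n over the token list, joined with `|`: a list of 20 tokens yields 20 − n + 1 windows. Because the sentences are concatenated first, some n-grams span sentence boundaries (for example `invoice|completed`). A minimal sketch of that behavior, assuming nothing about the real `n_tokenize` beyond what the output shows; `n_tokenize_sketch` is a hypothetical stand-in:

```python
from typing import List, Sequence


def n_tokenize_sketch(tokens: Sequence[str], n: int, sep: str = "|") -> List[str]:
    """Return all contiguous n-token windows, each joined with `sep`."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # range() is empty when n > len(tokens), so short inputs yield [].
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "pending customer validates invoice".split()
print(n_tokenize_sketch(tokens, 2))
# ['pending|customer', 'customer|validates', 'validates|invoice']
```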


In [7]:
# Test multi-scale tokenization
scales = multi_scale_tokenize(all_tokens, scales=(1, 2, 3, 4))
print("Multi-scale tokenization:")
for n, toks in scales.items():
    print(f"  Scale {n}: {len(toks)} tokens")

Multi-scale tokenization:
  Scale 1: 20 tokens
  Scale 2: 19 tokens
  Scale 3: 18 tokens
  Scale 4: 17 tokens


In [8]:
# Build fractal tower
tower = build_token_tower(all_tokens, scales=(1, 2, 3))

print(f"Fractal Tower (depth={tower.depth}):")
for level in tower.levels:
    print(f"  {level}")

print(f"\nEntanglement matrix:")
for (m, n), e in sorted(tower.entanglement.items()):
    if m <= n:
        print(f"  E({m},{n}) = {e:.3f}")

Fractal Tower (depth=3):
  ScaleLevel(L0, 1-gram, card=14, src=tokens)
  ScaleLevel(L1, 2-gram, card=21, src=tokens)
  ScaleLevel(L2, 3-gram, card=19, src=tokens)

Entanglement matrix:
  E(0,0) = 1.000
  E(0,1) = 0.000
  E(0,2) = 0.000
  E(1,1) = 1.000
  E(1,2) = 0.000
  E(2,2) = 1.000
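
The off-diagonal zeros are what one would expect if entanglement is an overlap ratio between the token sets of two levels: a joined n-gram contains exactly n − 1 `|` separators, so strings from different scales can never be equal (provided the raw tokens contain no `|`). The sketch below checks this with exact sets. Treating E(m, n) as Jaccard overlap is an assumption made for this example, and `ngram_set` and `jaccard` are hypothetical helpers, not `core` functions.

```python
from itertools import product
from typing import Dict, Sequence, Set, Tuple


def ngram_set(tokens: Sequence[str], n: int, sep: str = "|") -> Set[str]:
    """Distinct contiguous n-grams, joined with `sep`."""
    return {sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard overlap; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


tokens = "pending customer validates invoice completed product updates category".split()
levels = [ngram_set(tokens, n) for n in (1, 2, 3)]

# Pairwise overlap between levels, indexed like the tower (L0, L1, L2).
entanglement: Dict[Tuple[int, int], float] = {
    (m, n): jaccard(levels[m], levels[n])
    for m, n in product(range(len(levels)), repeat=2)
}

for (m, n), e in sorted(entanglement.items()):
    if m <= n:
        print(f"  E({m},{n}) = {e:.3f}")
```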


In [9]:
# Test overlap measure (used for lattice topology analysis)
hll_scale1 = tower.levels[0].hllset
hll_scale2 = tower.levels[1].hllset if len(tower.levels) > 1 else hll_scale1

print(f"Overlap between scales:")
print(f"  Scale 0 card: {hll_scale1.cardinality():.1f}")
print(f"  Scale 1 card: {hll_scale2.cardinality():.1f}")
print(f"  Overlap: {overlap(hll_scale1, hll_scale2):.3f}")

Overlap between scales:
  Scale 0 card: 14.0
  Scale 1 card: 21.0
  Overlap: 0.000


## 4. ManifoldOS Initialization

Initialize MOS with DuckDB persistent storage.

In [10]:
# Create temp directory for storage
temp_dir = tempfile.mkdtemp(prefix="mos_test_")
storage_path = Path(temp_dir)
db_path = storage_path / "metadata.duckdb"

print(f"Storage path: {storage_path}")
print(f"DuckDB path: {db_path}")

# Initialize ManifoldOS with DuckDB extension
mos = ManifoldOS(
    storage_path=storage_path,
    extensions={
        'storage': {
            'type': 'duckdb',
            'db_path': str(db_path)
        }
    }
)

print(f"\n✓ ManifoldOS initialized")
print(f"  Kernel: {mos.kernel}")
print(f"  Store: {mos.store}")

Storage path: /tmp/mos_test_6s8nkzhk
DuckDB path: /tmp/mos_test_6s8nkzhk/metadata.duckdb
✓ Extension registered: storage v1.4.4
✓ Extension registered: storage v1.4.4

✓ ManifoldOS initialized
  Kernel: <core.kernel.Kernel object at 0x7f229c3fd810>
  Store: <core.manifold_os.PersistentStore object at 0x7f229c502a50>


## 5. Data Ingestion via ManifoldOS

Ingest synthetic data using the n-token algorithm.

In [11]:
# Ingest single sentence
test_sentence = test_sentences[0]
print(f"Ingesting: '{test_sentence}'")

representation = mos.ingest(
    raw_data=test_sentence,
    metadata={'source': 'synthetic', 'test_id': 1}
)

if representation:
    print(f"\n✓ Ingestion successful")
    print(f"  Original tokens: {representation.original_tokens}")
    print(f"  N-token groups: {list(representation.n_token_groups.keys())}")
    print(f"  HLLSets created: {list(representation.hllsets.keys())}")
    
    for n, hll in representation.hllsets.items():
        print(f"    n={n}: card={hll.cardinality():.1f}, hash={hll.name[:16]}...")
else:
    print("✗ Ingestion failed")

Ingesting: 'pending customer validates invoice'
  ✓ LUT committed: n=1, hash=f72ee3615f29dac0..., id=4
  ✓ LUT committed: n=2, hash=4007656e8395f470..., id=3
  ✓ LUT committed: n=3, hash=58b6804622e07ece..., id=2

✓ Ingestion successful
  Original tokens: ['pending', 'customer', 'validates', 'invoice']
  N-token groups: [1, 2, 3]
  HLLSets created: [1, 2, 3]
    n=1: card=4.0, hash=f72ee3615f29dac0...
    n=2: card=5.0, hash=4007656e8395f470...
    n=3: card=4.0, hash=58b6804622e07ece...


In [12]:
# Ingest a batch of sentences (first 10 for demo)
batch = test_sentences[:10]
print(f"Ingesting batch of {len(batch)} sentences...")

batch_results = []
for i, sentence in enumerate(batch):
    rep = mos.ingest(
        raw_data=sentence,
        metadata={'source': 'synthetic', 'batch_idx': i}
    )
    if rep:
        batch_results.append(rep)

print(f"\n✓ Batch ingestion complete: {len(batch_results)} successful")

Ingesting batch of 10 sentences...
  ✓ LUT committed: n=1, hash=f72ee3615f29dac0..., id=4
  ✓ LUT committed: n=2, hash=4007656e8395f470..., id=3
  ✓ LUT committed: n=3, hash=58b6804622e07ece..., id=2
  ✓ LUT committed: n=1, hash=b0110e06f2357894..., id=4
  ✓ LUT committed: n=2, hash=a6ee2563047e734b..., id=3
  ✓ LUT committed: n=3, hash=a7cba9403ce6e239..., id=2
  ✓ LUT committed: n=1, hash=2e326b8af5cad4b9..., id=4
  ✓ LUT committed: n=2, hash=acf70d64b0e98ad5..., id=3
  ✓ LUT committed: n=3, hash=151580b440b2817d..., id=2
  ✓ LUT committed: n=1, hash=60b799e35b6e5cae..., id=4
  ✓ LUT committed: n=2, hash=d3e96dd5c06ce4ab..., id=3
  ✓ LUT committed: n=3, hash=db25275426ce65ce..., id=2
  ✓ LUT committed: n=1, hash=1fc20e542e54eff8..., id=3
  ✓ LUT committed: n=2, hash=1c0bc6c10dd178e7..., id=3
  ✓ LUT committed: n=3, hash=f25a78670b0c1d1c..., id=2
  ✓ LUT committed: n=1, hash=13b504b881dd9529..., id=4
  ✓ LUT committed: n=2, hash=629d5c6d157b935e..., id=3
  ✓ LUT committed: n=3, hash=8

## 6. Query Persistent Storage

Query the DuckDB store for metadata and tokens.

In [13]:
# Check if LUT store is available
if hasattr(mos, 'lut_store') and mos.lut_store:
    print("✓ LUT store available")
    
    # Query stored HLLSets
    if hasattr(mos.lut_store, 'list_hllsets'):
        hllsets = mos.lut_store.list_hllsets()
        print(f"\nStored HLLSets: {len(hllsets)}")
        for h in hllsets[:5]:
            print(f"  • {h}")
else:
    print("LUT store not configured - checking direct DuckDB access")

✓ LUT store available


In [14]:
# Direct DuckDB query to verify data
import duckdb

if db_path.exists():
    # ManifoldOS already holds a connection to this file in this process.
    # Opening it with read_only=True raised "ConnectionException: Can't open a
    # connection to same database file with a different configuration than
    # existing connections", so connect with the default configuration instead
    # (assumption: ManifoldOS opened the file with default settings).
    conn = duckdb.connect(str(db_path))
    
    # List tables
    tables = conn.execute("SHOW TABLES").fetchall()
    print(f"Tables in DuckDB: {[t[0] for t in tables]}")
    
    # Query each table (identifiers are quoted in case of unusual names)
    for (table_name,) in tables:
        count = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
        print(f"\n{table_name}: {count} rows")
        
        # Sample data
        sample = conn.execute(f'SELECT * FROM "{table_name}" LIMIT 3').fetchall()
        if sample:
            columns = [desc[0] for desc in conn.description]
            print(f"  Columns: {columns}")
            for row in sample:
                print(f"  {row}")
    
    conn.close()
else:
    print(f"DuckDB file not found at {db_path}")


## 7. Store Artifacts Directly

Test the artifact storage API.
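
The round trip this section exercises (store bytes, get an ID back, retrieve by ID) can be sketched with an in-memory, content-addressed store. This is a hypothetical stand-in, not the `ManifoldOS` implementation: the class name `InMemoryArtifactStore` is invented here, and deriving the ID from the SHA-1 of the data is an assumption suggested by the content-addressed HLLSet names, not something the notebook confirms for artifacts.

```python
import hashlib
from typing import Any, Dict, Optional


class InMemoryArtifactStore:
    """Toy content-addressed store: the artifact ID is the SHA-1 of its bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._metadata: Dict[str, Dict[str, Any]] = {}

    def store_artifact(self, data: bytes, metadata: Optional[Dict[str, Any]] = None) -> str:
        artifact_id = hashlib.sha1(data).hexdigest()
        self._blobs[artifact_id] = data
        self._metadata[artifact_id] = dict(metadata or {})
        return artifact_id

    def retrieve_artifact(self, artifact_id: str) -> Optional[bytes]:
        """Return the stored bytes, or None if the ID is unknown."""
        return self._blobs.get(artifact_id)


store = InMemoryArtifactStore()
data = b"This is test artifact data for fractal manifold"
aid = store.store_artifact(data, metadata={"type": "test"})
print(store.retrieve_artifact(aid) == data)  # True
```

Under this assumption, storing identical bytes twice yields the same ID, so duplicates collapse to a single entry.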

In [None]:
# Store raw artifact
test_data = b"This is test artifact data for fractal manifold"
artifact_id = mos.store_artifact(
    data=test_data,
    metadata={'type': 'test', 'description': 'Test artifact'}
)

print(f"Artifact stored:")
print(f"  ID: {artifact_id}")
print(f"  Size: {len(test_data)} bytes")

# Retrieve and verify
retrieved = mos.retrieve_artifact(artifact_id)
print(f"\nRetrieved: {retrieved == test_data}")

In [None]:
# Store HLLSet as artifact
hll_test = HLLSet.from_batch(['artifact', 'test', 'hllset', 'storage'])
hll_bytes = hll_test.dump_roaring()

hll_artifact_id = mos.store_artifact(
    data=hll_bytes,
    metadata={
        'type': 'hllset',
        'hll_name': hll_test.name,
        'cardinality': hll_test.cardinality()
    }
)

print(f"HLLSet stored as artifact:")
print(f"  Artifact ID: {hll_artifact_id}")
print(f"  HLLSet name: {hll_test.name[:16]}...")
print(f"  Cardinality: {hll_test.cardinality():.1f}")

## 8. Create Entanglements

Test the entanglement relationship API.
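
One way to picture the relationship being tested is a symmetric, weighted adjacency map keyed by artifact ID. The sketch below is hypothetical: `EntanglementIndex` is invented for illustration, and both the symmetry (suggested by the `↔` in the cell's output string) and the [0, 1] range for `strength` are assumptions rather than documented `ManifoldOS` behavior.

```python
from collections import defaultdict
from typing import DefaultDict, Dict


class EntanglementIndex:
    """Toy symmetric relation: each link is stored in both directions."""

    def __init__(self) -> None:
        self._edges: DefaultDict[str, Dict[str, float]] = defaultdict(dict)

    def create_entanglement(self, source_id: str, target_id: str, strength: float) -> None:
        if not 0.0 <= strength <= 1.0:  # assumed range, see lead-in
            raise ValueError("strength must be in [0, 1]")
        self._edges[source_id][target_id] = strength
        self._edges[target_id][source_id] = strength

    def get_entanglements(self, artifact_id: str) -> Dict[str, float]:
        """Return a copy of {other_id: strength} for the given artifact."""
        return dict(self._edges.get(artifact_id, {}))


index = EntanglementIndex()
index.create_entanglement("artifact-a", "artifact-b", 0.85)
print(index.get_entanglements("artifact-a"))  # {'artifact-b': 0.85}
print(index.get_entanglements("artifact-b"))  # {'artifact-a': 0.85}
```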

In [None]:
# Create entanglement between artifacts
mos.create_entanglement(
    source_id=artifact_id,
    target_id=hll_artifact_id,
    strength=0.85
)

print(f"Created entanglement: {artifact_id[:16]}... ↔ {hll_artifact_id[:16]}...")

# Query entanglements
entanglements = mos.get_entanglements(artifact_id)
print(f"\nEntanglements for {artifact_id[:16]}...:")
for target, strength in entanglements.items():
    print(f"  → {target[:16]}... (strength={strength})")

## 9. Complete Pipeline Test

Run the complete pipeline: synthetic data → kernel → HLLSet → MOS → DuckDB.

In [None]:
# Generate fresh dataset
pipeline_data = generate_dataset(20)
print(f"Pipeline test with {len(pipeline_data)} sentences\n")

# Track metrics
start_time = time.time()
total_tokens = 0
hllsets_created = []

for i, sentence in enumerate(pipeline_data):
    # Ingest
    rep = mos.ingest(sentence, metadata={'pipeline_test': True, 'idx': i})
    
    if rep:
        total_tokens += len(rep.original_tokens)
        for n, hll in rep.hllsets.items():
            hllsets_created.append((n, hll.name, hll.cardinality()))

elapsed = time.time() - start_time

print(f"Pipeline completed in {elapsed:.2f}s")
print(f"  Sentences processed: {len(pipeline_data)}")
print(f"  Total tokens: {total_tokens}")
print(f"  HLLSets created: {len(hllsets_created)}")
# Guard against a zero elapsed time on very fast runs
print(f"  Throughput: {total_tokens / max(elapsed, 1e-9):.1f} tokens/sec")

In [None]:
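# Final DuckDB state
if db_path.exists():
    # Use the default configuration: read_only=True conflicts with the
    # connection ManifoldOS already holds on this file in the same process.
    conn = duckdb.connect(str(db_path))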
    
    print("Final DuckDB state:")
    tables = conn.execute("SHOW TABLES").fetchall()
    
    for table_name, in tables:
        count = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
        print(f"  {table_name}: {count} rows")
    
    conn.close()

## 10. Cleanup

In [None]:
# Cleanup temp directory (optional - uncomment to remove)
# import shutil
# shutil.rmtree(temp_dir)
# print(f"Cleaned up: {temp_dir}")

print(f"\nTest data preserved at: {temp_dir}")
print("Uncomment cleanup cell to remove.")
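
If automatic cleanup is preferred over a manual `shutil.rmtree`, the standard library's `tempfile.TemporaryDirectory` removes the directory when its `with` block exits. A minimal sketch using only the standard library; the empty file stands in for whatever a real run would write, and with a real `ManifoldOS` instance any open database connection should presumably be closed before the block ends:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory(prefix="mos_test_") as tmp:
    storage_path = Path(tmp)
    db_path = storage_path / "metadata.duckdb"
    db_path.write_bytes(b"")  # placeholder for files created during a test run
    existed_inside = db_path.exists()

# The directory and its contents are removed once the block exits.
existed_after = storage_path.exists()
print(existed_inside, existed_after)  # True False
```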

## Summary

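**Tested components:**
- ✅ Kernel: Token absorption, union, intersection, Jaccard similarity
- ✅ HLLSet: Content-addressed naming, cardinality estimation
- ✅ Fractal Core: N-tokenization, overlap measure, scale tower
- ✅ ManifoldOS: Initialization, ingestion
- ✅ DuckDB: Persistent LUT storage (LUT commits logged during ingestion)

**Not yet verified in this run (no successful recorded output):**
- Direct DuckDB metadata queries (section 6)
- Artifact storage (section 7)
- Entanglement: Artifact relationships (section 8)
- Complete pipeline test (section 9)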

**Next steps for Stage 2:**
- Integrate fractal_core with HRT lattice construction
- Implement edge→token loop for full fractal recursion
- Add entanglement number computation across scales