# N-Token Ingestion Algorithm

This notebook demonstrates the n-token ingestion algorithm with a single HLLSet combining all n-token groups.

**Key Architecture**:
- Generate n-token groups (1-tokens, 2-tokens, 3-tokens) via sliding window
- **Single HLLSet** combining all n-tokens together (perfect fingerprint)
- In the AM matrix, the HLLSet is split by (reg, zeros) identifiers for disambiguation
- Build Lookup Tables (LUTs) to map identifiers → original tokens
- Preserve implicit order through sequential n-token generation

**Trade-off**: Putting all n-tokens in one set roughly triples the cardinality, which is acceptable because the BSS metric uses cardinality only as a scale coefficient

## 1. Basic N-Token Generation

This section shows how n-tokens are generated from a sequence of tokens.

In [1]:
from core.manifold_os import ManifoldOS

# Create ManifoldOS instance
os = ManifoldOS()

# Ingest a simple sentence
text = "the quick brown fox jumps"
result = os.ingest(text)

print("Original tokens:", result.original_tokens)
print(f"\nNumber of n-token groups: {len(result.n_token_groups)}")

# Show each n-token group
for n, tokens in sorted(result.n_token_groups.items()):
    print(f"\n{n}-tokens: {len(tokens)} groups")
    print(f"  {tokens[:5]}...")  # Show first 5

Original tokens: ['the', 'quick', 'brown', 'fox', 'jumps']

Number of n-token groups: 3

1-tokens: 5 groups
  [('the',), ('quick',), ('brown',), ('fox',), ('jumps',)]...

2-tokens: 4 groups
  [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]...

3-tokens: 3 groups
  [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]...
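
The sliding window behind the output above can be sketched in plain Python. This is a minimal illustration that assumes whitespace tokenization; `n_token_groups` here is a hypothetical helper, not the `ManifoldOS` implementation.

```python
from typing import Dict, List, Sequence, Tuple


def n_token_groups(
    tokens: Sequence[str], sizes: Sequence[int] = (1, 2, 3)
) -> Dict[int, List[Tuple[str, ...]]]:
    """Slide a window of each size over the token list (hypothetical helper)."""
    groups: Dict[int, List[Tuple[str, ...]]] = {}
    for n in sizes:
        # A window of width n fits len(tokens) - n + 1 times;
        # range() is empty when the text is shorter than the window.
        groups[n] = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return groups


groups = n_token_groups("the quick brown fox jumps".split())
for n, group in groups.items():
    print(f"{n}-tokens: {len(group)} groups, first = {group[0]}")
```

For a text of N tokens, a window of width n yields N - n + 1 groups, which gives the 5, 4, and 3 groups shown above.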


## 2. Single HLLSet with Multiple N-Token Groups

Ingestion creates a **single HLLSet** combining all n-tokens together (1-tokens + 2-tokens + 3-tokens).

**Architecture**:
- Single HLLSet = perfect fingerprint for the document
- Full support for set operations (union, intersection, similarity)
- In AM matrix, this HLLSet is split by (reg, zeros) identifiers for disambiguation
- For languages with a compact token inventory (e.g., Chinese, ~80K characters), the split can use actual tokens instead

**Trade-off**:
- Cardinality is roughly tripled (all n-tokens in one set)
- This is acceptable because:
  - Cardinality is rarely used directly
  - In the BSS metric, it is just a scale coefficient
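
The trade-off can be made concrete with exact Python sets standing in for the HLLSet. This is only a sketch: a real HLLSet estimates these quantities probabilistically, and the helper names `n_tokens` and `jaccard` are assumptions made for illustration.

```python
def n_tokens(tokens, sizes=(1, 2, 3)):
    """Exact set of all n-tokens; a stand-in for the combined HLLSet."""
    return {
        tuple(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    }


def jaccard(a, b):
    """Similarity as |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


doc_a = n_tokens("the quick brown fox jumps".split())
doc_b = n_tokens("the quick brown dog jumps".split())

# 5 tokens give 5 + 4 + 3 = 12 elements, i.e. 3N - 3 for N distinct tokens.
print("combined size:", len(doc_a))
print("shared elements:", len(doc_a & doc_b))
print("similarity:", round(jaccard(doc_a, doc_b), 3))
```

Because the 2-tokens and 3-tokens are in the same set, changing a single word also removes every window that contains it, so the combined set reacts to local context as well as to vocabulary.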

In [2]:
# Check HLLSets created
print(f"Number of HLLSets: {len(result.hllsets)}")
print("Note: These are separate per n-group for convenience,")
print("but conceptually form a single combined HLLSet\n")

for n, hllset in sorted(result.hllsets.items()):
    card = hllset.cardinality()
    print(f"{n}-token HLLSet:")
    print(f"  Cardinality: {card:.2f}")
    print(f"  Name: {hllset.short_name}...")
    print(f"  Backend: {hllset.backend}")
    print()

Number of HLLSets: 3
Note: These are separate per n-group for convenience,
but conceptually form a single combined HLLSet

1-token HLLSet:
  Cardinality: 5.01
  Name: 61e1ce87...
  Backend: C/Cython

2-token HLLSet:
  Cardinality: 4.01
  Name: 9bee32c5...
  Backend: C/Cython

3-token HLLSet:
  Cardinality: 3.00
  Name: 83e18592...
  Backend: C/Cython



## 3. Lookup Tables (LUTs)

LUTs map (reg, zeros) identifiers back to original token sequences.

In [3]:
# Examine LUTs
print(f"Number of LUTs: {len(result.luts)}")

for n, lut in sorted(result.luts.items()):
    print(f"\n{n}-token LUT:")
    print(f"  Entries: {len(lut)}")
    
    # Show a few entries
    for (reg, zeros), record in list(lut.items())[:3]:
        print(f"\n  ({reg}, {zeros}):")
        print(f"    Tokens: {record.tokens}")
        print(f"    Hashes: {record.hashes}")

Number of LUTs: 3

1-token LUT:
  Entries: 5

  (989, 0):
    Tokens: [('the',)]
    Hashes: [14035282076209885149]

  (326, 1):
    Tokens: [('quick',)]
    Hashes: [12712478176260299078]

  (1018, 0):
    Tokens: [('brown',)]
    Hashes: [18241176695549286394]

2-token LUT:
  Entries: 4

  (408, 0):
    Tokens: [('the', 'quick')]
    Hashes: [13181924135014296984]

  (378, 3):
    Tokens: [('quick', 'brown')]
    Hashes: [589506280862491002]

  (555, 0):
    Tokens: [('brown', 'fox')]
    Hashes: [4326298031740651051]

3-token LUT:
  Entries: 3

  (174, 2):
    Tokens: [('the', 'quick', 'brown')]
    Hashes: [15360034458666004654]

  (740, 1):
    Tokens: [('quick', 'brown', 'fox')]
    Hashes: [3896194499370466020]

  (177, 2):
    Tokens: [('brown', 'fox', 'jumps')]
    Hashes: [5489326747953926321]
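
The sample entries above are consistent with a simple split of the 64-bit hash: the low 10 bits select the register, and `zeros` counts the trailing zero bits of what remains. The sketch below is inferred from the printed output only, not from the library source, so treat `P_BITS` and `reg_zeros` as assumptions.

```python
P_BITS = 10  # assumption: 2**10 = 1024 registers, inferred from the sample output


def reg_zeros(hash64: int, p_bits: int = P_BITS) -> tuple:
    """Split a 64-bit hash into a (reg, zeros) identifier (inferred scheme)."""
    reg = hash64 & ((1 << p_bits) - 1)  # low p_bits bits pick the register
    rest = hash64 >> p_bits             # remaining high bits
    # Index of the lowest set bit of `rest` equals its trailing zero count;
    # if `rest` is 0, every remaining bit is zero.
    zeros = (rest & -rest).bit_length() - 1 if rest else 64 - p_bits
    return reg, zeros


# Hashes copied from the LUT output above
for h in (14035282076209885149, 12712478176260299078, 589506280862491002):
    print(h, "->", reg_zeros(h))
```

If the library uses a different precision or counts zeros from the other end, only the two bit operations in `reg_zeros` would need to change.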


## 4. Implicit Order Preservation

The order of n-tokens implicitly preserves document structure:
1. All 1-tokens come first (in order)
2. Then all 2-tokens (in order)
3. Then all 3-tokens (in order)

In [4]:
# Get implicit order
implicit_order = result.get_implicit_order()

print(f"Implicit order ({len(implicit_order)} items):\n")
for item in implicit_order[:15]:  # Show first 15
    print(f"  {item}")

print("\n✓ Order preserved: 1-tokens, then 2-tokens, then 3-tokens")

Implicit order (12 items):

  ('the',)
  ('quick',)
  ('brown',)
  ('fox',)
  ('jumps',)
  ('the', 'quick')
  ('quick', 'brown')
  ('brown', 'fox')
  ('fox', 'jumps')
  ('the', 'quick', 'brown')
  ('quick', 'brown', 'fox')
  ('brown', 'fox', 'jumps')

✓ Order preserved: 1-tokens, then 2-tokens, then 3-tokens
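
The ordering rule above amounts to concatenating the groups in ascending n. A minimal sketch, assuming the groups are stored as a dict keyed by n (the function name `implicit_order` is illustrative and is not the library's `get_implicit_order`):

```python
def implicit_order(groups):
    """Concatenate n-token groups in ascending n, keeping in-group order."""
    ordered = []
    for n in sorted(groups):
        ordered.extend(groups[n])
    return ordered


tokens = "the quick brown fox jumps".split()
# Insert the groups out of order on purpose; sorted() restores 1, 2, 3.
groups = {
    n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    for n in (3, 1, 2)
}

order = implicit_order(groups)
print(len(order), "items")
print(order[:6])
```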


## 5. Disambiguation Example

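When multiple token sequences hash to the same (reg, zeros) identifier, the LUT record keeps all of them so they can be told apart. With a short input such as the one below, collisions are unlikely, so the search may print nothing.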

In [5]:
# Ingest text with potential hash collisions
text2 = "hello world from multiple representations and multiple contexts"
result2 = os.ingest(text2)

print("Original tokens:", result2.original_tokens)
print(f"\n1-token LUT entries: {len(result2.luts[1])}")
print(f"2-token LUT entries: {len(result2.luts[2])}")
print(f"3-token LUT entries: {len(result2.luts[3])}")

# Find entries with multiple token sequences (disambiguated)
print("\nDisambiguated entries (multiple tokens → same identifier):")
for n, lut in result2.luts.items():
    for (reg, zeros), record in lut.items():
        if len(record.tokens) > 1:
            print(f"\n{n}-token ({reg}, {zeros}):")
            print(f"  Tokens: {record.tokens}")
            break  # Show just one example per n-group

Original tokens: ['hello', 'world', 'from', 'multiple', 'representations', 'and', 'multiple', 'contexts']

1-token LUT entries: 7
2-token LUT entries: 7
3-token LUT entries: 6

Disambiguated entries (multiple tokens → same identifier):
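
Since the sample text produces no collisions, the mechanism is easier to see with a deliberately weak hash. The sketch below is hypothetical throughout: `LutRecord`, `identifier`, `build_lut`, and `toy_hash` are illustrative names, and the identifier split is the one inferred from the Section 3 output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

NToken = Tuple[str, ...]


@dataclass
class LutRecord:  # hypothetical stand-in for the library's record type
    tokens: List[NToken] = field(default_factory=list)
    hashes: List[int] = field(default_factory=list)


def identifier(hash64: int, p_bits: int = 10) -> Tuple[int, int]:
    """(reg, zeros) split; an assumption inferred from the sample output."""
    reg = hash64 & ((1 << p_bits) - 1)
    rest = hash64 >> p_bits
    zeros = (rest & -rest).bit_length() - 1 if rest else 64 - p_bits
    return reg, zeros


def build_lut(
    sequences: Iterable[NToken], hash_fn: Callable[[NToken], int]
) -> Dict[Tuple[int, int], LutRecord]:
    lut: Dict[Tuple[int, int], LutRecord] = {}
    for seq in sequences:
        h = hash_fn(seq)
        record = lut.setdefault(identifier(h), LutRecord())
        if seq not in record.tokens:  # repeated n-tokens are stored once
            record.tokens.append(seq)
            record.hashes.append(h)
    return lut


def toy_hash(seq: NToken) -> int:
    # Deliberately weak so collisions are guaranteed: total character count.
    return sum(len(tok) for tok in seq)


words = "hello world from multiple representations and multiple contexts".split()
lut = build_lut([(w,) for w in words], toy_hash)
for ident, record in lut.items():
    if len(record.tokens) > 1:
        print(ident, record.tokens)
```

Each record keeps every sequence that landed on its identifier, so a lookup returns a short candidate list rather than a single token.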


## 6. Batch Processing with N-Tokens

Process multiple documents, each getting its own n-token representation.

In [6]:
# Multiple documents
documents = [
    "first document about data science",
    "second document about machine learning",
    "third document about artificial intelligence"
]

results = os.ingest_batch(documents)

print(f"Processed {len(results)} documents\n")

# Use a distinct loop variable so `result` from the first cell is not overwritten
for i, doc_result in enumerate(results, 1):
    print(f"Document {i}:")
    print(f"  Tokens: {len(doc_result.original_tokens)}")
    print(f"  HLLSets: {len(doc_result.hllsets)}")
    print(f"  1-token cardinality: {doc_result.hllsets[1].cardinality():.2f}")
    print()

Processed 3 documents

Document 1:
  Tokens: 5
  HLLSets: 3
  1-token cardinality: 5.01

Document 2:
  Tokens: 5
  HLLSets: 3
  1-token cardinality: 5.01

Document 3:
  Tokens: 5
  HLLSets: 3
  1-token cardinality: 5.01



## 7. Custom Tokenization Configuration

Configure n-token groups and tokenization behavior.

In [8]:
from core.manifold_os import TokenizationConfig, IngestDriver

# Custom config with different n-token groups
config = TokenizationConfig(
    use_n_tokens=True,
    n_token_groups=[1, 2, 3, 4],  # Include 4-tokens
    maintain_order=True,
    lowercase=True,
    min_token_length=2
)

# Create and register new driver with custom config
custom_driver = IngestDriver("ingest_custom", config=config)
os.register_driver(custom_driver)
custom_driver.wake()  # Activate driver

# Ingest using custom driver
result_custom = os.ingest("testing custom tokenization configuration", driver_id="ingest_custom")

print(f"N-token groups: {list(result_custom.n_token_groups.keys())}")
print(f"HLLSets created: {len(result_custom.hllsets)}")
print(f"\n4-token groups: {len(result_custom.n_token_groups[4])}")
print(f"Examples: {result_custom.n_token_groups[4][:3]}")

N-token groups: [1, 2, 3, 4]
HLLSets created: 4

4-token groups: 1
Examples: [('testing', 'custom', 'tokenization', 'configuration')]
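
How such options might shape the result can be sketched without the library. The class below mirrors only three of the option names used above. Its behavior is an assumption: whitespace splitting, lowercasing before filtering, and dropping tokens shorter than `min_token_length`. The names `SketchConfig`, `tokenize`, and `ingest_sketch` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SketchConfig:  # hypothetical; not the library's TokenizationConfig
    n_token_groups: Tuple[int, ...] = (1, 2, 3)
    lowercase: bool = True
    min_token_length: int = 1


def tokenize(text: str, config: SketchConfig) -> List[str]:
    if config.lowercase:
        text = text.lower()
    # Assumed filter: keep tokens with at least min_token_length characters.
    return [tok for tok in text.split() if len(tok) >= config.min_token_length]


def ingest_sketch(text: str, config: SketchConfig) -> Dict[int, List[Tuple[str, ...]]]:
    tokens = tokenize(text, config)
    return {
        n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in config.n_token_groups
    }


config = SketchConfig(n_token_groups=(1, 2, 3, 4), min_token_length=2)
groups = ingest_sketch("Testing a custom tokenization configuration", config)
print({n: len(g) for n, g in groups.items()})
print(groups[4])
```

Under these assumptions the one-letter token is filtered out before windowing, so the four remaining tokens produce a single 4-token group, as in the output above.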


## Summary

**Key Benefits of N-Token Algorithm**:

1. **Single HLLSet Fingerprint**: All n-tokens combined into one HLLSet
   - Perfect fingerprint for document
   - Full support for set operations (union, intersection, similarity)
   
2. **Multiple Representations**: Each n-group provides different granularity
   - 1-tokens, 2-tokens, 3-tokens, etc.
   - Better context and disambiguation
   
3. **AM Matrix Splitting**: The HLLSet is split by (reg, zeros) identifiers in the AM
   - Compact representation (~100K identifiers vs. millions of tokens)
   - For languages with a compact token inventory, the split can use actual tokens (e.g., Chinese, ~80K characters)
   
4. **Disambiguation via LUTs**: Map identifiers back to original tokens
   - Resolve hash collisions
   - Preserve original token sequences
   
5. **Order Preservation**: Implicit order through sequential generation
   - 1-tokens first, then 2-tokens, then 3-tokens
   
6. **Cardinality Trade-off**: Roughly tripled cardinality is acceptable
   - Cardinality is rarely used directly
   - The BSS metric treats it as a scale coefficient

**Next**: See `03_adjacency_matrix.ipynb` for order reconstruction with Adjacency Matrices.