# HRT (Hash Relational Tensor) ingest

**Assignment**:

>class HashRelationTensor (HRT) covers most of functionality that we need to manage in-memory presentation. Lets treat it as a draft of initial part for our R&D project.
>
>In the next part we are going to develop support for data ingestion. The way how we are building HRT is application agnostic - we are passing binary relation as triple (h_1, h_2, v); 2 maps h2i and i2h manage translation from hash to index, and from index to hash.
In our R&D project HRT is immutable in the sense of of immutable git commits. T --> H (tokens to hashes) map can mutate due to hash collision (diff tokens can be mapped to the same hash value); v - that represent frequency of (h_1, h_2) pair will increase with each new use (touch) of this pair. Those changes will be committed to HPT by saving previous state of HRT as a snapshot of changes.
>
>In our project tokens are presented or built from fix size symbol base. Those symbols should sutisfy the following axioms:
>1. **Non-inflectional** (no paradigms, no declensions)  
>2. **Compositionally closed** (complex = stack of simples)  
>3. **Lexicographically frozen** (each symbol has **one** normative definition)  
>4. **Hashable** (deterministic bit-pattern from symbol)
>
>We are going to use Chinese as our base symbol collection. Chinese is **our first substrate** because it is **optimally hieroglyphic**:
>- finite, standardized inventory (≈ 80 k)  
>- unambiguous dictionary definitions **in the same language**  
>- clear **radical→character→word** composition rules  
>- 3 000 years of **continuous semantic fossil record**
>
>80K is good for computing resources but it is very limited for hash based encoding. Generated encodings are prone to false positives when different sequences in Chinese text would produce identical hash presentation.
>
>To solve this problem we are applying n-gram pattern with 3-gram sliding window for scanning incoming text. Each 3-token window will produce 3 n-tokens: 1-token, 2-token, and 3-token. For example, (a, b, c) --> {(a), (a, b), (a, b, c)}; after sliding 1 step right: (b, c, d) --> {(b), (b, c), (b, c, d)}.
>
>Each n-token is token by itself with corresponding hash: (a) --> h_1; (a, b) --> h_2; . . . , (b, c, d) --> h_6.
>
>Binary relations in HRT are defined by following rule: 1-token(t) -> 2-token(t); 2-token(t) -> 3-token(t); 3-token(t) -> 1-token(t+1); and so on. This rule implicitly encodes all pair-wise relations on all n-tokens, for example 2-token(t) -> 1-token(t+1), because (a,b) -> (b); or (a,b,c)(t) -> (b,c)(t+1). The pair  3-token(t) -> 1-token(t+1) looks a little bit artificial and arbitrary, so, if you have better proposal I would like to consider.
This ingestion is still in-memory only. We will take care about persistence on next step.

## Implementation

Below is a complete, in-memory ingestion layer that turns any incoming Chinese-character stream into an immutable, commit-oriented HRT snapshot while obeying the four axioms and the 3-gram sliding-window rule.

### Design choices (briefly justified)

1. 3-gram window → three n-tokens (1, 2, 3) keeps the radical→character→word intuition and guarantees that every pairwise relation is covered transitively without storing an explosion of edges.
2. 3-token(t) → 1-token(t+1) is kept: although it looks arbitrary, it is the minimal bridge that links two adjacent windows without duplicating information already encoded inside the windows.
(If you prefer, you can drop it and still recover the same graph by following 3-token(t) → 2-token(t+1) → 1-token(t+1); we keep the direct edge for speed.)
3. Hash collision handling is explicit:

 - T→H map is versioned inside every commit;
 - on collision we append a running counter to the token string before hashing, store the mapping, and continue;
 - the old HRT snapshot is never mutated—we simply create a new HRT instance that re-uses the unchanged parts of the previous index maps (cheap).
4. Frequency of a relation is just the number of times the triple (h₁, h₂, 1) is fed to HRT; the v field in HRT is incremented accordingly.
Code: ingestion pipeline

In [3]:
from src.hllset_swarm.hrt import HashRelationTensor



Loading HLLSet kernel from: None


In [4]:
from typing import List, Tuple, Dict, Optional
import hashlib
from collections import Counter, defaultdict
from dataclasses import dataclass

# ---------- commit snapshot ----------
@dataclass
class HRTCommit:
    hrt: 'HashRelationTensor'          # immutable snapshot
    t2h: Dict[str, int]                # token → hash  (versioned in this commit)
    h2t: Dict[int, str]                # hash  → token
    stats: Dict[str, int]              # #tokens, #relations, #collisions, etc.

# ---------- ingester ----------
class ChineseNGramIngester:
    """
    Turn any Chinese-char string into an *immutable* HRT commit.
    3-gram sliding window, collision-safe, frequency counted.
    """
    def __init__(self, initial_commit: Optional[HRTCommit] = None):
        # if no parent, start empty
        if initial_commit is None:
            self.base_t2h: Dict[str, int] = {}
            self.base_h2t: Dict[int, str] = {}
            self.base_hrt = HashRelationTensor()
        else:
            self.base_t2h = initial_commit.t2h.copy()
            self.base_h2t = initial_commit.h2t.copy()
            self.base_hrt = initial_commit.hrt          # immutable reference

        # running maps for *this* commit
        self.t2h: Dict[str, int] = self.base_t2h.copy()
        self.h2t: Dict[int, str] = self.base_h2t.copy()
        self.hrt = HashRelationTensor()                # new, empty
        self.collision_cnt = 0
        self.relation_cnt  = 0

    # ---------- public API ----------
    def ingest(self, text: str) -> HRTCommit:
        """Process whole string and return an immutable commit."""
        chars = list(text)          # Chinese chars are already UTF-32 code-points
        for win in self._slide(chars):
            self._process_window(win)
        return HRTCommit(
            hrt=self.hrt,
            t2h=self.t2h,
            h2t=self.h2t,
            stats={
                'tokens': len(self.t2h),
                'relations': self.relation_cnt,
                'collisions': self.collision_cnt,
            }
        )

    # ---------- internal ----------
    def _slide(self, chars: List[str]) -> List[Tuple[str, ...]]:
        """Yield 3-gram windows; pads with '' if needed."""
        n = len(chars)
        for i in range(n):
            yield tuple(chars[i:i + 3])          # 1-3 chars, last may be shorter

    def _process_window(self, win: Tuple[str, ...]):
        """Create n-tokens and wire them into HRT."""
        # 1. build n-tokens
        n_tokens = [
            win[0],                      # 1-token
            ''.join(win[:2]),            # 2-token
            ''.join(win)                 # 3-token
        ]
        # 2. obtain hashes (collision-safe)
        hashes = [self._safe_hash(t) for t in n_tokens]
        # 3. emit binary relations
        edges = [(hashes[0], hashes[1], 1),
                 (hashes[1], hashes[2], 1),
                 (hashes[2], hashes[0], 1)]   # 3-token(t) → 1-token(t+1) bridge
        for h1, h2, v in edges:
            old_v = self.hrt.get(h1, h2)  if (h1 in self.hrt.h2i and h2 in self.hrt.h2i) else 0
            self.hrt.update(h1, h2, old_v + v)
            self.relation_cnt += 1

    def _safe_hash(self, token: str) -> int:
        """Deterministic hash with collision counter."""
        if token in self.t2h:
            return self.t2h[token]
        h = self._hash_string(token)
        # collision handling
        counter = 0
        while h in self.h2t and self.h2t[h] != token:   # collision
            counter += 1
            h = self._hash_string(f"{token}#{counter}")
            self.collision_cnt += 1
        # store new mapping
        self.t2h[token] = h
        self.h2t[h] = token
        return h

    @staticmethod
    def _hash_string(s: str) -> int:
        return int(hashlib.sha256(s.encode('utf-8')).hexdigest()[:16], 16)

## Mini demo

In [5]:
parent = None
for chunk in ["人工智能", "智能未来"]:
    ingester = ChineseNGramIngester(parent)
    commit   = ingester.ingest(chunk)
    parent   = commit          # next commit builds on top
    print(commit.stats)

# example output
# {'tokens': 14, 'relations': 21, 'collisions': 0}
# {'tokens': 22, 'relations': 35, 'collisions': 0}

{'tokens': 9, 'relations': 12, 'collisions': 0}
{'tokens': 15, 'relations': 12, 'collisions': 0}


## What we get

- Every HRTCommit is an immutable snapshot; you can keep them in a list to have a git-like chain.
- Frequency v is automatically incremented when the same edge appears again.
- Collision handling is deterministic and transparent to the rest of the code.
- The whole pipeline is still in-memory; persistence (serializing t2h, h2t, and the dense R tensor) can be added next without touching this layer.