# 01 — Hash and Anchor (Local Simulation)

This notebook teaches hashing + anchoring without running a blockchain node. It uses a **local JSON registry** to simulate on-chain storage.

## Hashing Policy (v1)
- **Algorithm**: SHA-256
- **Row Columns**: all columns in the dataset (explicitly listed in code)
- **Column Order**: sorted ascending by column name (a specified list is sorted the same way)
- **Separator**: `|` (with delimiter escaping)
- **Null Handling**: `None`/`NaN` → empty string
- **Floats**: normalized to 6-decimal precision (trailing zeros trimmed)
- **Whitespace**: `strip + collapse internal` for strings
- **Dataset Ordering**: by a stable business key if present (e.g., `shipment_id`), else by `row_hash`
- **Manifest**: JSON written next to the dataset with the exact settings and dataset hash

> If any policy setting changes, bump the version string (e.g., `hash-policy-2`) and re-derive all hashes.
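
To make the null, float, and whitespace rules above concrete, here is a minimal worked example. The `normalize` function is a hypothetical stand-in written for illustration; it follows the listed rules but is not necessarily identical to the helper used later in the notebook.

```python
import math

FLOAT_FMT = "{:.6f}"  # 6-decimal precision, as stated in the policy

def normalize(value):
    """Illustrative per-value normalization (hypothetical name)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""                            # nulls become the empty string
    if isinstance(value, str):
        return " ".join(value.split())       # strip + collapse internal whitespace
    if isinstance(value, float):
        text = FLOAT_FMT.format(value)
        return text.rstrip("0").rstrip(".")  # trim trailing zeros, then a bare dot
    return str(value)

examples = [None, float("nan"), "  cold   chain ", 4.10, 2.0, 7]
normalized = [normalize(v) for v in examples]
print(normalized)  # ['', '', 'cold chain', '4.1', '2', '7']
```

Because equivalent inputs collapse to one canonical string, cosmetic differences such as extra spaces or `4.10` versus `4.1` do not change the resulting hash.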


In [1]:
# Robust imports: find Portfolio root and load the local anchor helpers
from pathlib import Path
import sys

def find_portfolio_root(start: Path) -> Path:
    start = start.resolve()
    for p in [start, *start.parents]:
        if (p / "Blockchain/common/utils/hash_anchor.py").exists():
            return p
    raise RuntimeError(f"Couldn't find Portfolio root from: {start}")

PORTFOLIO_ROOT = find_portfolio_root(Path.cwd())
UTILS_DIR = (PORTFOLIO_ROOT / "Blockchain/common/utils").resolve()
if str(UTILS_DIR) not in sys.path:
    sys.path.insert(0, str(UTILS_DIR))

from hash_anchor import sha256_bytes, anchor_hash, verify_hash
print("Loaded hash_anchor from:", UTILS_DIR)

Loaded hash_anchor from: C:\Users\beall\OneDrive\Documents\Portfolio\Blockchain\common\utils


In [2]:
# Create a tiny demo CSV (idempotent) and read it
import pandas as pd
demo_csv = PORTFOLIO_ROOT / 'Blockchain/common/notebooks/demo_shipments.csv'
if not demo_csv.exists():
    demo_csv.write_text('shipment_id,gtin,case_id,temp_c\nS1,000123,CASE001,4.1\nS2,000123,CASE002,9.9\n', encoding='utf-8')
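# Note: read_csv infers gtin as an integer, so '000123' loads as 123 (see the output below).
# Pass dtype={'gtin': str} to preserve leading zeros; the row and dataset hashes would then differ.
df = pd.read_csv(demo_csv)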
df

| | shipment_id | gtin | case_id | temp_c |
|---|---|---|---|---|
| 0 | S1 | 123 | CASE001 | 4.1 |
| 1 | S2 | 123 | CASE002 | 9.9 |


In [3]:
# Hardened row hashing with a documented policy
import hashlib, math

SEP = '|'            # record in manifest
FLOAT_FMT = "{:.6f}" # record in manifest

def norm_val(v):
    if v is None or (isinstance(v, float) and math.isnan(v)):
        return ''
    if isinstance(v, str):
        return ' '.join(v.strip().split())
    if isinstance(v, float):
        s = FLOAT_FMT.format(v)
        return s.rstrip('0').rstrip('.') if '.' in s else s
    return str(v)

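def row_digest(row, columns=None):
    # Use the given columns if provided, else every column in the row; always sorted by name
    cols = sorted(columns if columns is not None else row.index)
    parts = [norm_val(row[c]) for c in cols]
    # Escape the backslash first, then the separator, so the joined payload is unambiguous
    parts = [p.replace('\\', '\\\\').replace(SEP, f"\\{SEP}") for p in parts]
    payload = SEP.join(parts).encode('utf-8')
    return hashlib.sha256(payload).hexdigest()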

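# Choose columns: here we use all data columns; you can pass a subset.
# Excluding 'row_hash' keeps this cell idempotent when it is re-run.
cols = sorted(c for c in df.columns if c != 'row_hash')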
df['row_hash'] = df.apply(lambda r: row_digest(r, columns=cols), axis=1)
df

| | shipment_id | gtin | case_id | temp_c | row_hash |
|---|---|---|---|---|---|
| 0 | S1 | 123 | CASE001 | 4.1 | a8e361e42d77633e02042d7fc9a41f7ee082a9c9264af3... |
| 1 | S2 | 123 | CASE002 | 9.9 | 79cae2578b17dda1f71b3d5c03bdc75d4813768d1b5304... |
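
The delimiter escaping in the policy matters because a plain join cannot tell where one field ends and the next begins. The sketch below shows two different rows that collide under a naive join and stay distinct once the separator is escaped. The function names are illustrative, and escaping the backslash itself is an assumption added here so that the encoding stays unambiguous.

```python
import hashlib

SEP = "|"

def join_naive(parts):
    # No escaping: field boundaries are lost when a value contains the separator
    return SEP.join(parts)

def join_escaped(parts):
    # Escape the backslash first, then the separator
    escaped = [p.replace("\\", "\\\\").replace(SEP, "\\" + SEP) for p in parts]
    return SEP.join(escaped)

def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

row_a = ["a|b", "c"]   # separator inside the first field
row_b = ["a", "b|c"]   # separator inside the second field

naive_collision = digest(join_naive(row_a)) == digest(join_naive(row_b))
escaped_collision = digest(join_escaped(row_a)) == digest(join_escaped(row_b))
print(naive_collision, escaped_collision)  # True False
```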


In [4]:
# Build a deterministic dataset-level hash
if 'shipment_id' in df.columns:
    # Tie-break on row_hash so duplicate keys still produce a deterministic order
    row_hashes = df.sort_values(['shipment_id', 'row_hash'])['row_hash']
else:
    row_hashes = df['row_hash'].sort_values().reset_index(drop=True)

dataset_payload = '\n'.join(row_hashes).encode('utf-8')
dataset_hash = hashlib.sha256(dataset_payload).hexdigest()
print("dataset_hash:", dataset_hash)

dataset_hash: 16a69b607cb86a202c15fd16c1465292076e07dabc2b63f5070ed4e0dbb80fce
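
Two properties make the dataset-level hash useful: it should not depend on the order in which rows arrive, and it should change when any row changes. The following sketch checks both using the fallback ordering (sort by row hash). The row payloads are made up for illustration and `dataset_digest` is a hypothetical name.

```python
import hashlib
import random

def dataset_digest(row_hashes):
    # Sort first so the result does not depend on input row order
    payload = "\n".join(sorted(row_hashes)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical row hashes derived from made-up payloads
rows = [hashlib.sha256(f"row-{i}".encode("utf-8")).hexdigest() for i in range(5)]

shuffled = rows[:]
random.Random(0).shuffle(shuffled)      # same rows, different order

tampered = rows[:]
tampered[2] = hashlib.sha256(b"row-2-edited").hexdigest()  # one row modified

same_after_shuffle = dataset_digest(rows) == dataset_digest(shuffled)
changed_after_edit = dataset_digest(rows) != dataset_digest(tampered)
print(same_after_shuffle, changed_after_edit)  # True True
```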


In [5]:
# Anchor the dataset hash in the local registry (simulated ledger)
ref = f"local://{demo_csv.as_posix()}#v=hash-policy-1"
try:
    anchor_hash(dataset_hash, ref)
    print("Anchored dataset hash")
except ValueError:
    print("Dataset hash already anchored; skipping.")
print("verify:", verify_hash(dataset_hash, ref))

Anchored dataset hash
verify: True
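
The source of `hash_anchor.py` is not shown in this notebook. As a rough mental model only, a local JSON registry with the behavior seen above (reject a duplicate with `ValueError`, return a boolean on verification) could look like the sketch below. The file layout, the explicit `path` parameter, and the function bodies are all assumptions, not the actual helper.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def _load(path):
    # A missing file is treated as an empty registry
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def anchor_hash(digest, ref, path):
    """Record digest -> ref; refuse to overwrite an existing entry."""
    registry = _load(path)
    if digest in registry:
        raise ValueError(f"hash already anchored: {digest}")
    registry[digest] = {"ref": ref}
    path.write_text(json.dumps(registry, indent=2), encoding="utf-8")

def verify_hash(digest, ref, path):
    """True only if the digest is registered under exactly this ref."""
    entry = _load(path).get(digest)
    return entry is not None and entry["ref"] == ref

with tempfile.TemporaryDirectory() as tmp:
    registry_path = Path(tmp) / "registry.json"   # hypothetical location
    digest = sha256_bytes(b"demo payload")
    ref = "local://demo.csv#v=hash-policy-1"

    anchor_hash(digest, ref, registry_path)
    first_check = verify_hash(digest, ref, registry_path)
    wrong_ref_check = verify_hash(digest, "local://other.csv", registry_path)
    try:
        anchor_hash(digest, ref, registry_path)
        duplicate_rejected = False
    except ValueError:
        duplicate_rejected = True

print(first_check, wrong_ref_check, duplicate_rejected)  # True False True
```

Refusing to overwrite an entry mimics the append-only nature of a ledger: once a hash is anchored, the record is never silently replaced.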


In [6]:
# Write a manifest documenting the exact policy and dataset-level hash
import json
manifest = {
    "source": demo_csv.as_posix(),
    "policy_version": "hash-policy-1",  # same version string as in the anchor ref
    "hash_algorithm": "sha256",
    "row_policy": {
        "columns": cols,
        "separator": SEP,
        "float_precision": 6,
        "null_as": "",
        "whitespace_norm": "strip+collapse",
        "delimiter_escape": True
    },
    "ordering": "by 'shipment_id' ascending" if 'shipment_id' in df.columns else "by row_hash ascending",
    "dataset_hash": dataset_hash,
}
manifest_path = PORTFOLIO_ROOT / 'Blockchain/common/notebooks/demo_shipments.manifest.json'
manifest_path.write_text(json.dumps(manifest, indent=2), encoding='utf-8')
print("Wrote manifest →", manifest_path)

Wrote manifest → C:\Users\beall\OneDrive\Documents\Portfolio\Blockchain\common\notebooks\demo_shipments.manifest.json
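
A manifest is only useful if someone later recomputes the hash and compares it. The sketch below shows one way such a check could work: read `dataset_hash` from the manifest and compare it with a freshly computed value. `manifest_matches` is a hypothetical helper, and the row hashes are made-up stand-ins for `df['row_hash']`.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def dataset_digest(row_hashes):
    # Same construction as the notebook: newline-joined row hashes, then SHA-256
    payload = "\n".join(row_hashes).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def manifest_matches(manifest_path, row_hashes):
    """Hypothetical helper: True if the recomputed hash equals the recorded one."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return manifest["dataset_hash"] == dataset_digest(row_hashes)

# Made-up row hashes standing in for df['row_hash']
row_hashes = [hashlib.sha256(name.encode("utf-8")).hexdigest() for name in ("S1", "S2")]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "demo.manifest.json"
    manifest_path.write_text(
        json.dumps({"dataset_hash": dataset_digest(row_hashes)}, indent=2),
        encoding="utf-8",
    )
    intact_ok = manifest_matches(manifest_path, row_hashes)

    # Simulate tampering: one row's hash changes after the manifest was written
    tampered = [row_hashes[0], hashlib.sha256(b"S2-edited").hexdigest()]
    tampered_ok = manifest_matches(manifest_path, tampered)

print(intact_ok, tampered_ok)  # True False
```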


## Next Steps
1. Replace the local registry with `web3.py` + `ProofOfProvenance.sol` on a testnet (e.g., Sepolia).
2. Add a tiny FastAPI endpoint `/verify?hash=...` that consults the registry.
3. Join `row_hash` or `dataset_hash` back into KPIs / optimization models to demonstrate tamper-evidence in your logistics projects.
