Context
Profiling full-model compilation of WRLD3 with callgrind and DHAT shows that SipHash operations on Ident<Canonical> strings account for ~14% of all instructions in project_build, and String::clone on these identifiers adds another ~5%.
Ident<Canonical> wraps an owned String today. Every HashMap lookup hashes the full string content, every clone allocates a new String, and every BTreeSet operation does byte-by-byte comparisons.
Proposal
Change Ident<Canonical> to use #[salsa::interned] so that each unique canonical string is stored once and represented by a lightweight integer ID. This converts:
- Hashing: from SipHash over N bytes to hashing a u32/u64 integer
- Cloning: from
String::clone (allocation + memcpy) to Copy of a small integer
- Equality: from byte-by-byte string comparison to integer
==
- Ordering: would need a custom
Ord impl that dereferences to the underlying string, since integer ordering doesn't match lexicographic ordering
Scope
- Add a
#[salsa::interned] newtype (e.g. InternedIdent) with the canonical string as its field
- Migrate
Ident<Canonical> usages in the compilation pipeline to use the interned type where a &dyn Db is available
Ident<Canonical> may still be needed at the edges (parsing, serialization) where there is no salsa database
- Update
HashMap<Ident<Canonical>, ...> and BTreeSet<Ident<Canonical>> to use the interned type
Acceptance Criteria
Notes
This is a larger refactor. It should be done after the lower-effort canonicalize() cleanup and per-variable unit context caching, which address the most acute regressions from incremental compilation. The interning work is a further optimization on top of those.
Profiling baseline (WRLD3 full pipeline, main branch after incremental compilation):
- SipHash write + finish + c_rounds: ~16.6M instructions (7.8% of total)
- String::clone: ~2.3M instructions
- Total instruction count: 212.4M
Context
Profiling full-model compilation of WRLD3 with
callgrindandDHATshows that SipHash operations onIdent<Canonical>strings account for ~14% of all instructions inproject_build, andString::cloneon these identifiers adds another ~5%.Ident<Canonical>wraps an ownedStringtoday. Every HashMap lookup hashes the full string content, every clone allocates a new String, and every BTreeSet operation does byte-by-byte comparisons.Proposal
Change
Ident<Canonical>to use#[salsa::interned]so that each unique canonical string is stored once and represented by a lightweight integer ID. This converts:String::clone(allocation + memcpy) toCopyof a small integer==Ordimpl that dereferences to the underlying string, since integer ordering doesn't match lexicographic orderingScope
#[salsa::interned]newtype (e.g.InternedIdent) with the canonical string as its fieldIdent<Canonical>usages in the compilation pipeline to use the interned type where a&dyn Dbis availableIdent<Canonical>may still be needed at the edges (parsing, serialization) where there is no salsa databaseHashMap<Ident<Canonical>, ...>andBTreeSet<Ident<Canonical>>to use the interned typeAcceptance Criteria
Ident<Canonical>in the compilation pipeline uses salsa interningcargo bench -p simlin-engine --bench compiler -- wrld3shows measurable improvement inproject_build(target: reduce SipHash + String::clone overhead by >50%)cargo test -p simlin-enginepassesNotes
This is a larger refactor. It should be done after the lower-effort canonicalize() cleanup and per-variable unit context caching, which address the most acute regressions from incremental compilation. The interning work is a further optimization on top of those.
Profiling baseline (WRLD3 full pipeline, main branch after incremental compilation):