Conversation

@CISC CISC commented Jan 20, 2026

Allow all hashable types as object keys, taking care to replicate the special Python/Jinja equivalence between int/float/bool.

Fixed array/object output with string filter.

Fixed object tojson output (did not properly escape key string).

Fixed object item order when replacing an item.
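The int/float/bool equivalence mentioned in the description can be illustrated with a small sketch (an assumption about the approach, not the PR's actual code): in Python, True == 1 == 1.0 and all three hash alike, so one way to replicate that is to route bool/int/float keys through a single canonical hashing path.

```cpp
#include <cmath>
#include <cstdint>
#include <functional>

// Sketch (assumption, not the PR's code) of replicating the Python rule that
// True == 1 == 1.0: hash bool/int/float keys through one canonical path, so
// all three land in the same object-key bucket.
static size_t key_hash(double v) {
    // an exact-integer float hashes as that integer
    if (std::trunc(v) == v && v >= -9.0e18 && v <= 9.0e18) {
        return std::hash<int64_t>{}((int64_t) v);
    }
    return std::hash<double>{}(v);
}
static size_t key_hash(int64_t v) { return std::hash<int64_t>{}(v); }
static size_t key_hash(bool v)    { return std::hash<int64_t>{}(v ? 1 : 0); }
```

Equal hashes alone are not enough, of course; the equality operators must agree with the same equivalence.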

@ngxson ngxson left a comment

I'm rethinking this approach a bit; probably we should drop the unordered map, keep all the data inside an ordered std::vector<std::pair<value, value>>, and always perform a linear search when we access an object.

Of course, that will be slower, but realistically a template in the wild never has more than 50 or even 100 keys inside an object, so it's probably fine.
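The suggestion above can be sketched as follows (an assumption about the shape, not the merged code): insertion order lives in a vector of pairs, and every access is a linear scan.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch of the suggested layout: keep insertion order in a vector of pairs
// and do a linear search on access. O(n) lookups, but n stays small for
// real-world templates.
template <typename K, typename V>
struct ordered_object {
    std::vector<std::pair<K, V>> items; // preserves insertion order

    V * find(const K & key) {
        for (auto & kv : items) {            // linear search
            if (kv.first == key) return &kv.second;
        }
        return nullptr;
    }

    void set(const K & key, V val) {
        if (V * v = find(key)) {
            *v = std::move(val);             // replace in place, order kept
        } else {
            items.emplace_back(key, std::move(val));
        }
    }
};
```

Replacing a value in place also gives the "object item order when replacing an item" behavior from the PR description for free.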

if (std::all_of(vec.begin(), vec.end(), [&](auto ikv) -> bool {
    return is_hashable(std::get<2>(ikv));
})) {
    key_type = "NamedTuple";
Collaborator

I think kwarg is a good alternative to NamedTuple

Collaborator Author

Hmmm, possibly.


ngxson commented Jan 20, 2026

Btw, the value.h header is becoming larger. We should move long function definitions into value.cpp, while keeping one-liner definitions inside the header (for more visibility).
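The suggested split could look like this minimal sketch (the member names here are hypothetical, not the real value API):

```cpp
#include <string>

// --- value.h: one-liners stay inline for visibility ---
struct value {
    bool is_null() const { return kind == 0; } // short: defined in the header
    std::string as_string() const;             // long: only declared here
    int kind = 0;
};

// --- value.cpp: long definitions move out of the header ---
std::string value::as_string() const {
    // a longer body like this no longer bloats every includer of value.h
    return is_null() ? "null" : "value";
}
```

Besides readability, this also cuts recompilation cost for translation units that include the header.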

@github-actions bot added the testing (Everything test related) and jinja parser (Issues related to the jinja parser) labels on Jan 20, 2026

CISC commented Jan 22, 2026

Phew, this led me down quite a few rabbit holes (how to use a hash specialization in a class of the specialized type, the C++ standard's missing-hash_combine drama, virtual equality overload pitfalls, what have you). Now it finally all makes sense and even compiles, but unfortunately does not work; more fun for tomorrow! :)


CISC commented Jan 25, 2026

Massive refactoring done; it now has proper value hashing and equality operators (equivalence plus strict (non-)equality). Added a real value_tuple so that we can allow only immutable types as object keys.

Will be easy to add further operators for sorting in follow-up PR.

Lessons learned:

  • You can't override specialized hash of shared_ptr
  • You can't forward declare hashing functor of unordered_map
  • The rest has been repressed, never to be mentioned again

Hopefully not too many mistakes were made, but for some reason test-chat-template fails, so some mistakes are obviously present. Will investigate further unless someone brighter spots my error first. :)


CISC commented Jan 25, 2026

for some reason test-chat-template fails, so some mistakes are obviously present, will investigate further unless someone brighter spots my error first. :)

Actually, it's the template that's faulty (not sure why this worked before?): we set tools = [], and the Mistral-Large-Instruct-2407 template does this:

{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}
{%- if tools is not none and (message == user_messages[-1]) %}
    {{- "[AVAILABLE_TOOLS] [" }}
    {%- for tool in tools %}
        ...
    {%- endfor %}
    {{- "[/AVAILABLE_TOOLS]" }}
{%- endif %}

@CISC CISC requested a review from ngxson January 25, 2026 12:32
@ngxson ngxson left a comment


Pretty close! Just need to optimize it a little bit by avoiding relying too much on strings.

Btw, your hash_bytes with seed is in some ways an implementation of hash_combine; you can combine the hashes of multiple elements like this:

size_t cur_hash = 0; // or maybe something else
for (const auto & elem : arr) {
    size_t h = elem.unique_hash();
    cur_hash = hash_bytes(/* seed */ cur_hash, &h, sizeof(h));
}
return cur_hash;

Doing str() or as_repr() uses a lot more memory. An example can be:

my_tuple = ("some string",)
my_outer_tuple = (my_tuple, my_tuple, my_tuple, my_tuple, my_tuple)

Calculating the hash of my_outer_tuple will now use 5 times more memory than needed, because as_repr needs to allocate memory for all the strings, even when they point to the same memory.


CISC commented Jan 26, 2026

Btw, your hash_bytes with seed is in some ways an implementation of hash_combine; you can combine the hashes of multiple elements like this:

size_t cur_hash = 0; // or maybe something else
for (const auto & elem : arr) {
    size_t h = elem.unique_hash();
    cur_hash = hash_bytes(/* seed */ cur_hash, &h, sizeof(h));
}
return cur_hash;

Yep, that was the point, but I see I made a footgun: never start with an initial seed (unless it is a previous hash).

Calculating the hash of my_outer_tuple will now use 5 times more memory than needed, because as_repr needs to allocate memory for all the strings, even when they point to the same memory.

Very good point, I'll look into your suggestions and see what can be done.

Comment on lines +494 to +504
virtual string as_string() const override {
    std::ostringstream ss;
    ss << "{";
    for (size_t i = 0; i < val_obj.size(); i++) {
        if (i > 0) ss << ", ";
        auto & [key, val] = val_obj.at(i);
        ss << value_to_string_repr(key) << ": " << value_to_string_repr(val);
    }
    ss << "}";
    return ss.str();
}
@ngxson ngxson Jan 26, 2026

Just note that I initially didn't allow as_repr or to_string to be recursive, because it can go into an infinite loop if the object/array entity points back to itself:

obj = {}
obj["a"] = obj

# or even harder to detect, nested circular
obj["a"] = {"b": obj}

A bit ironically, this is not actually classified as a vulnerability; programs sometimes really are coded this way, and it is indeed a very common practice in high-level languages like JavaScript or Python:

var node = {"parent": null};
node["child"] = {
  "parent": node,
  "child": null,
};
// now, JSON.stringify(node) will throw:
// TypeError: Converting circular structure to JSON

Btw, it may be worth a fix for to_json as it can also get stuck on this case. IIRC JavaScript simply throws an error if it detects a circular structure in an object.

Collaborator Author

Yeah, very aware of this, decided to ignore it for now. :)

Collaborator Author

Nested circular references are usually bypassed by keeping a reference to every encountered object that can nest while processing, and simply skipping/printing ... when coming across one that has already been visited. Will follow up if no-one else does.
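The visited-set idea described above can be sketched like this (hypothetical node type and function names, not the PR's code): remember the containers on the current path and print "..." on a revisit.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical container that can point back to itself via its children.
struct node {
    std::string name;
    std::vector<node *> children;
};

// Stringify with cycle protection: track nodes on the current path and
// emit "..." when one is revisited instead of recursing forever.
static std::string to_string_safe(const node * n, std::unordered_set<const node *> & seen) {
    if (!seen.insert(n).second) {
        return "..."; // already on this path: circular reference
    }
    std::string out = "{" + n->name;
    for (const node * c : n->children) {
        out += ", " + to_string_safe(c, seen);
    }
    out += "}";
    seen.erase(n); // off this path now; repeats elsewhere are legal
    return out;
}
```

Erasing on the way out (rather than keeping every node forever) means shared but non-circular sub-objects still print in full; only true cycles collapse to "...".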

@CISC
Copy link
Collaborator Author

CISC commented Jan 26, 2026

@ngxson Ok, be80c62 should improve things, though I don't see how we can avoid composing strings, as the parts may not be the same while the whole is.

Edit: I mean for equality; for hashing it is fine to process the parts, due to the nature of FNV-1a.

@ngxson
Copy link
Collaborator

ngxson commented Jan 26, 2026

I'll work on this a bit, will push some optimizations directly here

Comment on lines 57 to 74
static size_t hash_bytes(size_t seed, void const * bytes, size_t len, Args&&... args) noexcept
{
    static_assert(sizeof...(args) % 2 == 0);
    static constexpr size_t prime = size_t_digits == 64 ? 0x100000001b3 : 0x01000193;

    unsigned char const * c = static_cast<unsigned char const *>(bytes);
    unsigned char const * const e = c + len;

    for (; c < e; ++c) {
        seed = (seed ^ *c) * prime;
    }

    if constexpr (sizeof...(args) > 0) {
        seed = hash_bytes(seed, std::forward<Args>(args)...);
    }

    return seed;
}
@ngxson ngxson Jan 26, 2026

Just thinking out loud in math terms: if we consider hash_bytes(seed, data, size) as a (surjective) function f(s, x) with s = seed and x = tuple(data, size),

given input data x0, x1, x2:

s = initial_seed
hash = f(f(f(s, x0), x1), x2)

and your "convenient" function hash_bytes(x0, x1, ...) can be defined as g:

g() = s
g(x0) = f(s, x0)
g(x0, x1) = f(f(s, x0), x1)
...

Therefore, g is still surjective and g(x0, x1) != g(x1, x0), which is so far so good.

But now consider f(x0 ~ x1) == g(x0, x1), with ~ the "concatenation" operation: as long as x0 ~ x1 == x2 ~ x3, then g(x0, x1) == g(x2, x3). This is expected when calculating the hash of string parts (where each part is different but the concatenated version is the same). Technically speaking, this makes g no longer a good hash function, since we have a known set of collisions, but in our context this is acceptable; FNV-1a is not that good anyway.

(Note: the property f(x0 ~ x1) == g(x0, x1) can be derived from the implementation and can be considered a postulate in this context.)

In practice, to prevent this, most hash functions use some kind of internal state such that hash(s, x) = output(s, mix(x)), so hashing a concatenation no longer equals chaining the hashes of the parts. We could implement this easily, but it's not really necessary; we are not doing cryptographic hashing anyway.

End of the math part. Now, in terms of speed, what I think could be nice & fun is to make hash_bytes operate on blocks of data instead of one byte at a time. Because each update depends on the prior state, the CPU will have a hard time with out-of-order execution; a simple vectorization should help a lot. Just the x0 ~ x1 case will be a bit complicated. I'll see how to do that.
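The block-at-a-time idea can be sketched as follows (a hypothetical FNV-style variant, not the code that was eventually pushed): consume 8-byte words in the hot loop and fall back to single bytes for the tail.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch: a 64-bit FNV-1a-style step applied to 8-byte words,
// with a byte-wise tail. Note: this yields different digests than
// byte-at-a-time FNV-1a; it is a variant, not the reference algorithm.
static uint64_t hash_blocks(uint64_t seed, const void * bytes, size_t len) {
    constexpr uint64_t prime = 0x100000001b3ULL;
    const unsigned char * c = static_cast<const unsigned char *>(bytes);
    while (len >= sizeof(uint64_t)) {        // full 8-byte blocks
        uint64_t word;
        std::memcpy(&word, c, sizeof(word)); // alignment-safe load
        seed = (seed ^ word) * prime;
        c   += sizeof(word);
        len -= sizeof(word);
    }
    for (; len > 0; ++c, --len) {            // remaining tail bytes
        seed = (seed ^ *c) * prime;
    }
    return seed;
}
```

For chained updates, partial blocks would have to be buffered so that hashing "ab" then "c" sees the same block boundaries as hashing "abc" in one go; that is the x0 ~ x1 complication mentioned above.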

Collaborator Author

FNV-1a certainly has its flaws; it was mainly chosen for its simplicity and chainability, and I think it performs well enough here in terms of quality and speed.

Feel free to improve upon it if you wish though. :)

Collaborator

So I ended up refactoring this into a new struct called hasher; it is inspired by Node.js's crypto.createHash():

// Old notation:
size_t seed = hash_bytes(0, data0, len0);
       seed = hash_bytes(seed, data1, len1);
       // ...
size_t output = hash_bytes(seed, dataN, lenN);

// New notation:
hasher hash = hasher().update(data0, len0).update(data1, len1); // ...
size_t output = hash.digest();

With this kind of notation we could in theory also implement the notion of "internal state", though I won't add it because it's unnecessary (just mentioning it here for completeness).

The new notation has the exact same mathematical properties as the old one (reflected by the new test cases in test-jinja).

The result will be different from the old one, though, because we are now processing blocks of N bytes at once (on a 64-bit system, we process 64/8 = 8 bytes). I expect most compilers will know how to vectorize it. If the data length is not a multiple of the block size, the remainder is buffered.

I'll do another pass tomorrow to see if any other clean-up is needed. In the meantime, feel free to review my last commit and lmk if anything seems off to you.

Comment on lines 444 to 447
for (const auto & val : val_arr) {
    const size_t val_hash = val->unique_hash().digest();
    hash.update(&val_hash, sizeof(size_t));
}
Collaborator Author

We should allow unique_hash to be passed a hasher so that arrays and objects can just update it instead of hashing the hashes.

Collaborator

Forgot to leave a comment here, but that won't work in this case: ["ab", "c"] vs ["a", "bc"].

The case of reusing state is mostly useful for string parts; I don't think it's valid anywhere else.

@CISC CISC Jan 27, 2026

I don't think that's actually a problem as typeid is hashed inbetween those.

Collaborator

Mathematically speaking, hashing the digest of each element vs letting each element update the state are (quasi-)equivalent in our case, because digest() does only one job: add padding to the input data.

My current version is still mathematically equivalent to adding a typeid in between, while being simpler than having to add a version of unique_hash that takes another hasher as input.

Collaborator

Also, this code path is quite rarely used in practice (the case of using a tuple as a key), so I prefer to keep it simple for now. It should still be efficient enough done this way.

@ngxson ngxson left a comment

Feel free to merge when you're ready


CISC commented Jan 27, 2026

Feel free to merge when you're ready

Seems to need a cstring include; also, I still think #18955 (comment) is a good idea (see my last comment).


CISC commented Jan 27, 2026

Ok, all good, will merge when CI is done.

@CISC CISC merged commit 2b4cbd2 into master Jan 27, 2026
75 of 78 checks passed
@CISC CISC deleted the cisc/jinja-mixed-type-object-keys branch January 27, 2026 18:50