jinja : implement mixed type object keys #18955
Conversation
ngxson
left a comment
I'm rethinking this approach a bit; we should probably drop the map::unordered, keep all the data ordered inside a std::vector<std::pair<value, value>>, and always perform a linear search when we access an object.
Of course that will be slower, but realistically a template in the wild never has more than 50 or even 100 keys inside an object, so it's probably fine.
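A minimal sketch of the suggested approach, assuming a stand-in `value` type (the real type in common/jinja/value.h differs): pairs are kept in insertion order and lookup is a linear scan.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// "value" is stood in for by std::string here for illustration only
using value = std::string;

struct ordered_object {
    std::vector<std::pair<value, value>> items;

    // linear search: O(n), but n stays small for real-world templates
    value * find(const value & key) {
        for (auto & kv : items) {
            if (kv.first == key) {
                return &kv.second;
            }
        }
        return nullptr;
    }

    void set(const value & key, const value & val) {
        if (value * existing = find(key)) {
            *existing = val; // replace in place, preserving item order
        } else {
            items.emplace_back(key, val);
        }
    }
};
```

Replacing in place (rather than erase + push_back) is what preserves the original item order when a key is overwritten.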
common/jinja/value.h
Outdated
```cpp
if (std::all_of(vec.begin(), vec.end(), [&](auto ikv) -> bool {
    return is_hashable(std::get<2>(ikv));
})) {
    key_type = "NamedTuple";
```
I think kwarg is a good alternative to NamedTuple
Hmmm, possibly.
Btw, the |
Phew, this led me down quite a few rabbit holes (how to use a hash specialization in a class of the specialized type, the C++ standard's lack-of-hash_combine drama, virtual equality overload pitfalls, what have you). Now it finally all makes sense and even compiles, but unfortunately does not work. More fun for tomorrow! :)
Massive refactoring done, now has proper value hashing and equality operators (equivalence and strict (non-)equality). Will be easy to add further operators for sorting in a follow-up PR.
Hopefully not too many mistakes made.
Actually, it's the template that's faulty (not sure why this worked before?), we set:

```jinja
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}
{%- if tools is not none and (message == user_messages[-1]) %}
    {{- "[AVAILABLE_TOOLS] [" }}
    {%- for tool in tools %}
        ...
    {%- endfor %}
    {{- "[/AVAILABLE_TOOLS]" }}
{%- endif %}
```

verified with transformers
ngxson
left a comment
Pretty close! Just need to optimize a little bit by avoiding relying too much on strings.
Btw, your hash_bytes with seed is in some ways an implementation of hash_combine; you can combine the hashes of multiple elements by:

```cpp
size_t cur_hash = 0; // or maybe something else
for (auto elem : arr) {
    cur_hash = hash_bytes(/* seed */ cur_hash, elem.unique_hash());
}
return cur_hash;
```

Doing str() or as_repr() uses a lot more memory. An example can be:

```python
my_tuple = ("some string",)
my_outer_tuple = (my_tuple, my_tuple, my_tuple, my_tuple, my_tuple)
```

Calculating the hash of my_outer_tuple will now use 5 times more memory than needed, because as_repr needs to allocate memory for all the strings, even when they point to the same memory.
Yep, that was the point, but I see I made a footgun: never start with an initial seed (unless it is a previous hash).
Very good point, I'll look into your suggestions and see what can be done.
```cpp
virtual string as_string() const override {
    std::ostringstream ss;
    ss << "{";
    for (size_t i = 0; i < val_obj.size(); i++) {
        if (i > 0) ss << ", ";
        auto & [key, val] = val_obj.at(i);
        ss << value_to_string_repr(key) << ": " << value_to_string_repr(val);
    }
    ss << "}";
    return ss.str();
}
```
Just note that I initially don't allow as_repr or to_string to be recursive, because it can go into an infinite loop if the object/array entity points back to itself:

```python
obj = {}
obj["a"] = obj
# or even harder to detect, nested circular
obj["a"] = {"b": obj}
```

A bit ironically, this is not actually classified as a vulnerability. Sometimes programs are actually coded this way, and it is indeed a very common practice in high-level languages like javascript or python:

```javascript
var node = {"parent": null};
node["child"] = {
    "parent": node,
    "child": null,
};
// now, JSON.stringify(node) will throw:
// TypeError: Converting circular structure to JSON
```

Btw, this may be worth a fix for to_json as it can also get stuck on this case. IIRC javascript simply throws an error if it detects a circular reference in an object.
Yeah, very aware of this, decided to ignore it for now. :)
Nested circulars are usually handled by keeping a reference to every encountered object that can nest during processing, and simply skipping/printing ... when coming across one that has already been visited. Will follow up if no-one else does.
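The visited-set idea can be sketched as follows; `node` is a hypothetical stand-in for the real value types, and the `...` placeholder for already-visited nodes is an assumption:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// hypothetical container type that can reference other nodes
struct node {
    std::string name;
    std::vector<node *> children;
};

static std::string to_string(const node * n, std::unordered_set<const node *> & seen) {
    // insert() returns false if the node is already being printed,
    // i.e. we have followed a circular reference
    if (!seen.insert(n).second) {
        return "...";
    }
    std::ostringstream ss;
    ss << "{" << n->name;
    for (const node * child : n->children) {
        ss << ", " << to_string(child, seen);
    }
    ss << "}";
    seen.erase(n); // allow the same node on sibling branches (DAGs are fine)
    return ss.str();
}
```

Erasing the node on the way out means only true cycles are cut, while a diamond-shaped (non-circular) sharing of the same object still prints normally.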
I'll work on this a bit, will push some optimizations directly here.
common/jinja/utils.h
Outdated
```cpp
static size_t hash_bytes(size_t seed, void const * bytes, size_t len, Args&&... args) noexcept
{
    static_assert(sizeof...(args) % 2 == 0);
    static constexpr size_t prime = size_t_digits == 64 ? 0x100000001b3 : 0x01000193;

    unsigned char const * c = static_cast<unsigned char const *>(bytes);
    unsigned char const * const e = c + len;

    for (; c < e; ++c) {
        seed = (seed ^ *c) * prime;
    }

    if constexpr (sizeof...(args) > 0) {
        seed = hash_bytes(seed, std::forward<Args>(args)...);
    }

    return seed;
}
```
just thinking out loud in math terms: if we consider hash_bytes(seed, data, size) as a (surjective) function f(s, x) with s = seed and x = tuple(data, size)
given input data x0, x1, x2:
s = initial_seed
hash = f(f(f(s, x0), x1), x2)
and your "convenient" function hash_bytes(x0, x1, ...) can be defined as g:
g() = s
g(x0) = f(s, x0)
g(x0, x1) = f(f(s, x0), x1)
...
therefore, g is still surjective and g(x0, x1) != g(x1, x0), which is so far so good.
but now, consider: f(x0 ~ x1) == g(x0, x1), with ~ the "concatenation" operation: as long as x0 ~ x1 == x2 ~ x3, then g(x0, x1) == g(x2, x3), which is expected when calculating the hash of string parts (where each part is different but the concatenated version is the same). technically speaking, this makes g no longer a good hash function, since we have a known set of collisions. but in our context this is acceptable. FNV-1a is not that good anyway.
(note: the property f(x0 ~ x1) == g(x0, x1) can be derived from the implementation and can be considered a postulate in this context)
in practice, to prevent this, most hash functions use some kind of internal state such that hash(s, x) = output(s, mix(x)), so hash(x0 ~ x1) != hash(x1 ~ x0). we can implement this easily but it's not really necessary. we are not doing a cryptographic hash anyway.
end of the math part. now, in terms of speed, what I think could be nice & fun is to make hash_bytes operate on blocks of data instead of one byte at a time. because each update depends on the prior state, the CPU will have a hard time with out-of-order execution; a simple vectorization should help a lot. just the case of x0 ~ x1 will be a bit complicated. I'll see how to do that.
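The concatenation collision described above can be demonstrated with a minimal stand-alone FNV-1a chain (the standard 64-bit offset basis and prime are used here; this is not the project's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// plain byte-at-a-time FNV-1a, seeded so calls can be chained
static size_t fnv1a(size_t seed, const void * bytes, size_t len) {
    const unsigned char * c = static_cast<const unsigned char *>(bytes);
    for (size_t i = 0; i < len; ++i) {
        seed = (seed ^ c[i]) * 0x100000001b3ULL;
    }
    return seed;
}

// g(x0, x1) = f(f(s, x0), x1), as in the comment above
static size_t g2(const std::string & x0, const std::string & x1) {
    const size_t basis = 0xcbf29ce484222325ULL;
    return fnv1a(fnv1a(basis, x0.data(), x0.size()), x1.data(), x1.size());
}
```

Because chaining simply continues the byte stream, g("ab", "c") walks exactly the same bytes as f("abc"), which is the known collision class.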
FNV-1a certainly has its flaws, it was mainly chosen for its simplicity and chainability, it performs well enough in terms of quality and speed here I think.
Feel free to improve upon it if you wish though. :)
So I ended up refactoring this into a new struct called hasher; it is inspired by nodejs's crypto.createHash():

```cpp
// Old notion:
size_t seed = hash_bytes(data0, len0);
seed = hash_bytes(seed, data1, len1);
// ...
size_t output = hash_bytes(seed, dataN, lenN);

// New notion:
hasher hash = hasher().update(data0, len0).update(data1, len1); // ...
size_t output = hash.digest();
```

With this type of notion, we can in theory implement the notion of "internal state", though I won't add it because it's unnecessary (just mentioning it here for completeness).
The new notion should have the exact same mathematical properties as the old one (reflected by the new test cases in test-jinja).
The result will be different from the old one though, because we are processing blocks of N bytes at once (on a 64-bit system, we process 64/8 = 8 bytes). I expect most compilers will know how to vectorize it. If the data is not a multiple of the block size, it will be buffered.
I'll do another pass tomorrow to see if there are any other clean-ups needed. In the meantime, feel free to review my last commit and lmk if anything seems off to you.
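The buffering scheme described above can be sketched as follows; `block_hasher` and its mixing step are hypothetical placeholders, not the project's actual hasher, but they show how split updates stay equivalent to one concatenated update:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct block_hasher {
    uint64_t state    = 0xcbf29ce484222325ULL; // assumed initial state
    uint8_t  buf[8]   = {};
    size_t   buffered = 0;

    // placeholder mixing step: fold one (possibly short, zero-padded) block
    void mix_block(const uint8_t * block, size_t len) {
        uint64_t word = 0;
        std::memcpy(&word, block, len);
        state = (state ^ word) * 0x100000001b3ULL;
    }

    // accumulate bytes; only mix once a full 8-byte block is available
    block_hasher & update(const void * data, size_t len) {
        const uint8_t * p = static_cast<const uint8_t *>(data);
        while (len > 0) {
            const size_t take = std::min<size_t>(8 - buffered, len);
            std::memcpy(buf + buffered, p, take);
            buffered += take; p += take; len -= take;
            if (buffered == 8) { mix_block(buf, 8); buffered = 0; }
        }
        return *this;
    }

    // flush the partial tail block without disturbing the live state
    uint64_t digest() const {
        block_hasher tmp = *this;
        if (tmp.buffered > 0) { tmp.mix_block(tmp.buf, tmp.buffered); }
        return tmp.state;
    }
};
```

Because bytes are staged in the buffer, where the update boundaries fall has no effect on the digest, which is the property the test cases in test-jinja are said to check.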
```cpp
for (const auto & val : val_arr) {
    const size_t val_hash = val->unique_hash().digest();
    hash.update(&val_hash, sizeof(size_t));
}
```
We should allow unique_hash to be passed a hasher so that arrays and objects can just update it instead of hashing the hashes.
Forgot to leave a comment here, but that won't work in this case: ["ab", "c"] vs ["a", "bc"].
The case of reusing state is mostly used by string parts; I don't think it's valid anywhere else.
I don't think that's actually a problem as typeid is hashed inbetween those.
Mathematically speaking, hashing the digest of each element vs letting each element update the hasher are (quasi-)equivalent in our case, because digest() only does one job: adding padding to the input data.
My current version is still mathematically equivalent to adding a typeid in between, while being simpler than having to add a version of unique_hash that takes another hasher as input.
Also, this code path is quite rarely used in practice (in the case of using a tuple as key), so I prefer to keep it simple for now. It should still be efficient enough this way.
ngxson
left a comment
Feel free to merge when you're ready
Seems to need
Ok, all good, will merge when CI is done.
- Allow all hashable types as object keys, taking care to replicate special python/jinja behavior between int/float/bool.
- Fixed array/object output with string filter.
- Fixed object tojson output (did not properly escape key strings).
- Fixed object item order when replacing an item.