aldehir
Collaborator

@aldehir aldehir commented Oct 12, 2025

The partial JSON parser does not handle incomplete Unicode escape sequences, which causes exceptions when streaming. This PR adds Unicode handling that aligns with the current JSON healing implementation.

Fixes #16465

Specifics

A Unicode escape sequence comes in the form \uXXXX. Most models do not have a dedicated token for each Unicode code point, so they emit the escape one character at a time: ['\\', 'u', 'X', 'X', 'X', 'X']. When the stream is cut partway through that sequence, the JSON parsing and dumping operations used to form a partial arguments string fail.
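
To illustrate the failure mode, here is a minimal sketch (not code from this PR) of how to detect that the text received so far stops partway through a \uXXXX escape. The helper name `ends_in_partial_unicode_escape` is hypothetical, and the sketch assumes the input is raw JSON text that ends inside a string.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns true when `s` ends with "\u" followed by
// zero to three hex digits, i.e. the stream stopped inside a \uXXXX escape.
// It walks forward so that an escaped backslash ("\\") is never mistaken
// for the start of a new escape.
bool ends_in_partial_unicode_escape(const std::string & s) {
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] != '\\') {
            continue;
        }
        if (i + 1 >= s.size()) {
            return false; // lone trailing backslash: a different healing case
        }
        if (s[i + 1] != 'u') {
            ++i; // skip the escaped character (e.g. the second '\' of "\\")
            continue;
        }
        const size_t remaining = s.size() - (i + 2);
        if (remaining < 4) {
            // Fewer than four characters follow "\u": all must be hex digits.
            for (size_t j = i + 2; j < s.size(); ++j) {
                if (!std::isxdigit(static_cast<unsigned char>(s[j]))) {
                    return false;
                }
            }
            return true;
        }
        i += 5; // skip 'u' and the four digits of a complete escape
    }
    return false;
}
```

For the stream `{"q":"\u2013"}`, every prefix that ends in `\u`, `\u2`, `\u20`, or `\u201` is reported as partial, while the prefix ending in the complete `\u2013` is not.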

Here are the rules I apply to fix the generation so we properly apply a healing marker:

  1. If the last token is \, the existing code handles it.
  2. If the last escape sequence is any variation of \u, \uX, \uXX, \uXXX, or \uXXXX, I pad it with zeros until it becomes a complete sequence.
  3. If the last escape sequence matches a high surrogate (U+D800-U+DBFF), either partially or fully, I pad it as above and add a fake low surrogate (U+DC00).
  4. If a second-to-last high surrogate sequence exists adjacent to the last sequence, I pad the last sequence to form a valid low surrogate (U+DC00).
  5. There is a special case where a single backslash \ may follow a high surrogate. To handle this, the default padding is a valid low surrogate.
  6. I add the padding to the marker, so that we know where to split the string and the padding is excluded from the result.
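
The rules above can be sketched as a single function that returns the text to append. This is an illustration under stated assumptions, not the code merged in this PR: the names `unicode_healing_padding`, `parse_hex4`, and `starts_escape` are hypothetical, the padding is lowercase, and the caller is assumed to verify that the healed string actually parses. It requires C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Parses four hex digits at s[pos..pos+3]; nullopt if missing or not hex.
static std::optional<unsigned long> parse_hex4(const std::string & s, size_t pos) {
    if (pos + 4 > s.size()) {
        return std::nullopt;
    }
    for (size_t i = pos; i < pos + 4; ++i) {
        if (!std::isxdigit(static_cast<unsigned char>(s[i]))) {
            return std::nullopt;
        }
    }
    return std::stoul(s.substr(pos, 4), nullptr, 16);
}

// True if the backslash at `pos` is preceded by an even number of
// backslashes, i.e. it starts an escape instead of being escaped itself.
static bool starts_escape(const std::string & s, size_t pos) {
    size_t n = 0;
    while (pos > n && s[pos - 1 - n] == '\\') {
        ++n;
    }
    return n % 2 == 0;
}

// Returns the padding that completes a trailing partial escape, or nullopt
// when none of rules 2-5 applies.
std::optional<std::string> unicode_healing_padding(const std::string & s) {
    const size_t bs = s.rfind('\\');
    if (bs == std::string::npos || !starts_escape(s, bs)) {
        return std::nullopt;
    }
    // Is a complete high surrogate escape directly in front of the last one?
    bool after_high = false;
    if (bs >= 6 && s[bs - 6] == '\\' && s[bs - 5] == 'u' && starts_escape(s, bs - 6)) {
        const auto prev = parse_hex4(s, bs - 4);
        after_high = prev && *prev >= 0xD800 && *prev <= 0xDBFF;
    }
    const std::string tail = s.substr(bs + 1); // "", "u", "uX", ..., "uXXXX"
    if (tail.empty()) {
        // Rule 5: a lone backslash after a high surrogate must become a low surrogate.
        // Otherwise rule 1 applies and the existing healing logic handles it.
        if (after_high) {
            return std::string("udc00");
        }
        return std::nullopt;
    }
    if (tail[0] != 'u' || tail.size() > 5) {
        return std::nullopt;
    }
    const std::string digits = tail.substr(1);
    std::string full = digits;
    if (after_high && full.size() < 2) {
        full += std::string("dc").substr(full.size()); // rule 4: aim for U+DC00
    }
    full.append(4 - full.size(), '0');                 // rule 2: pad with zeros
    const auto value = parse_hex4(full, 0);
    if (!value) {
        return std::nullopt;
    }
    const bool is_high = *value >= 0xD800 && *value <= 0xDBFF;
    const bool is_low  = *value >= 0xDC00 && *value <= 0xDFFF;
    if (is_low != after_high) {
        return std::nullopt; // lone low surrogate, or high surrogate without a low one
    }
    std::string padding = full.substr(digits.size());
    if (is_high) {
        padding += "\\udc00";                          // rule 3: fake low surrogate
    }
    if (padding.empty()) {
        return std::nullopt; // the escape is already complete
    }
    return padding;
}
```

For example, a string ending in `\u20` yields `00`, one ending in `\ud83d` yields `\udc00`, and one ending in `\ud83d\u` yields `dc00`.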

Risks

I tried my best to minimize any impact, since we use this logic throughout the code base. I only apply the Unicode logic when parsing with the existing rules fails.

However, I had to change the call to j.dump() to set ensure_ascii = true, otherwise it won't escape the Unicode sequences. Since Unicode handling is already unstable, I don't think this makes things worse.
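
A short sketch of why the escaped output matters when splitting at the marker. The helper `cut_at_marker` and the seed `$MAGIC$` are hypothetical, and the two input strings are hard-coded stand-ins for what a dump would produce with and without ASCII escaping; no JSON library is called here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: truncates a dumped arguments string at the healing
// marker, mirroring the find/resize step used when extracting arguments.
std::string cut_at_marker(std::string dumped, const std::string & marker) {
    const auto idx = dumped.find(marker);
    if (idx != std::string::npos) {
        dumped.resize(idx);
    }
    return dumped;
}

// Input so far:  {"q":"\u20      (two digits missing)
// Healed input:  {"q":"\u2000$MAGIC$"}
// Marker:        00$MAGIC$       (padding + seed)

// With ASCII escaping the dump keeps the \uXXXX form, so the marker is found
// and the result is exactly the text that was received.
const std::string escaped_dump = "{\"q\":\"\\u2000$MAGIC$\"}";

// Without escaping, U+2000 is emitted as raw UTF-8 bytes (E2 80 80), the
// padding digits no longer appear in the output, and the search fails.
const std::string raw_dump = "{\"q\":\"\xE2\x80\x80$MAGIC$\"}";
```

Calling `cut_at_marker(escaped_dump, "00$MAGIC$")` returns `{"q":"\u20`, while the same call on `raw_dump` returns the string unchanged, with the seed still embedded in it.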

@aldehir aldehir requested a review from ggerganov as a code owner October 12, 2025 05:27
@github-actions github-actions bot added the testing Everything test related label Oct 12, 2025
@ServeurpersoCom
Collaborator

I'll test it as well.

@ServeurpersoCom
Collaborator

I just hit the same issue myself while testing UTF-8 compliance in tool calls. I'll also test your implementation to compare the behavior and confirm whether it resolves my case as well.
Here's the fix I came up with:

diff --git a/common/chat-parser.cpp b/common/chat-parser.cpp
index 7365782e7d6d8670c31ec3417562a986d559b6c9..5d6cf37b943c7182363a36fda2dc25a6bd415a3d 100644
--- a/common/chat-parser.cpp
+++ b/common/chat-parser.cpp
@@ -422,51 +422,51 @@ std::optional<common_chat_msg_parser::consume_json_result> common_chat_msg_parse
     };
 
     if (partial->healing_marker.marker.empty()) {
         if (args_paths.empty()) {
             // No arguments to dump, and JSON was parsed fully.
             return consume_json_result {
                 partial->json,
                 /* .is_partial = */ false,
             };
         }
         if (is_arguments_path({})) {
             // Entire JSON is the arguments and was parsed fully.
             return consume_json_result {
                 partial->json.dump(),
                 /* .is_partial = */ false,
             };
         }
     }
 
     LOG_DBG("Parsed partial JSON: %s (json_healing_marker: %s)\n", partial->json.dump().c_str(), partial->healing_marker.json_dump_marker.c_str());
 
     auto found_healing_marker = false;
     std::vector<std::string> path;
     std::function<json(const json &)> remove_unsupported_healings_and_dump_args = [&](const json & j) -> json {
         if (is_arguments_path(path)) {
-            auto arguments = j.dump();
+            auto arguments = j.dump(-1, ' ', true, json::error_handler_t::strict);
             if (is_partial() && !partial->healing_marker.marker.empty()) {
                 auto idx = arguments.find(partial->healing_marker.json_dump_marker);
                 if (idx != std::string::npos) {
                     arguments.resize(idx);
                     found_healing_marker = true;
                 }
                 if (arguments == "\"") {
                     // This happens because of completing `:"$magic` after `"arguments"`
                     arguments = "";
                 }
             }
             return arguments;
         }
         if (is_content_path(path)) {
             if (!j.is_string()) {
                 throw std::runtime_error("Content path must be a string");
             }
             std::string str = j;
             auto idx = str.find(partial->healing_marker.marker); // not using json_dump_marker as we're inside a string
             if (idx != std::string::npos) {
                 str.resize(idx);
                 found_healing_marker = true;
             }
             return str;
         }
diff --git a/common/json-partial.cpp b/common/json-partial.cpp
index d9d91699899f7ba9870184caa7e3c5ff04280e9b..29517135fccb99bf52ee3cfe569f2735ea2ed597 100644
--- a/common/json-partial.cpp
+++ b/common/json-partial.cpp
@@ -1,35 +1,112 @@
 #include "json-partial.h"
 
 #include "log.h"
 
 #include <nlohmann/json.hpp>
 
+#include <cctype>
+#include <optional>
 #include <string>
 
 using json = nlohmann::ordered_json;
 
+namespace {
+
+std::optional<std::string> common_json_unicode_padding(const std::string & str) {
+    if (str.size() < 2) {
+        return std::nullopt;
+    }
+    const auto escape_pos = str.find_last_of('\\');
+    if (escape_pos == std::string::npos || escape_pos + 1 >= str.size()) {
+        return std::nullopt;
+    }
+    const char escape_type = str[escape_pos + 1];
+    if (escape_type != 'u' && escape_type != 'U') {
+        return std::nullopt;
+    }
+
+    size_t digits_count = 0;
+    for (size_t i = escape_pos + 2; i < str.size(); ++i) {
+        const auto ch = static_cast<unsigned char>(str[i]);
+        if (!std::isxdigit(ch)) {
+            return std::nullopt;
+        }
+        ++digits_count;
+    }
+
+    const auto has_previous_high_surrogate = [&]() {
+        if (escape_pos < 6) {
+            return false;
+        }
+        const auto prev_escape_pos = str.rfind('\\', escape_pos - 1);
+        if (prev_escape_pos == std::string::npos || escape_pos - prev_escape_pos != 6) {
+            return false;
+        }
+        const char prev_type = str[prev_escape_pos + 1];
+        if (prev_type != 'u' && prev_type != 'U') {
+            return false;
+        }
+        const auto prev_digits = str.substr(prev_escape_pos + 2, 4);
+        unsigned long prev_value = 0;
+        try {
+            prev_value = std::stoul(prev_digits, nullptr, 16);
+        } catch (...) {
+            return false;
+        }
+        return prev_value >= 0xD800 && prev_value <= 0xDBFF;
+    };
+
+    if (digits_count < 4) {
+        if (has_previous_high_surrogate()) {
+            static const std::string low_surrogate = "DC00";
+            const auto existing = str.substr(escape_pos + 2, digits_count);
+            if (existing.empty() || low_surrogate.compare(0, digits_count, existing) == 0) {
+                return low_surrogate.substr(digits_count);
+            }
+        }
+        return std::string(4 - digits_count, '0');
+    }
+
+    if (digits_count == 4) {
+        const auto digits = str.substr(escape_pos + 2, 4);
+        unsigned long value = 0;
+        try {
+            value = std::stoul(digits, nullptr, 16);
+        } catch (...) {
+            return std::nullopt;
+        }
+        if (value >= 0xD800 && value <= 0xDBFF) {
+            return std::string("\\uDC00");
+        }
+    }
+
+    return std::nullopt;
+}
+
+} // namespace
+
 enum common_json_stack_element_type {
     COMMON_JSON_STACK_ELEMENT_OBJECT,
     COMMON_JSON_STACK_ELEMENT_KEY,
     COMMON_JSON_STACK_ELEMENT_ARRAY,
 };
 
 struct common_json_stack_element {
     common_json_stack_element_type type;
     std::string key;
 };
 
 bool common_json_parse(
     const std::string & input,
     const std::string & healing_marker,
     common_json & out)
 {
     std::string::const_iterator it = input.begin();
     const auto end = input.end();
     return common_json_parse(it, end, healing_marker, out);
 }
 
 bool common_json_parse(
     std::string::const_iterator & it,
     const std::string::const_iterator & end,
     const std::string & healing_marker,
@@ -161,94 +238,103 @@ bool common_json_parse(
                 auto & el = err_loc.stack[i - 1];
                 if (el.type == COMMON_JSON_STACK_ELEMENT_OBJECT) {
                     closing += "}";
                 } else if (el.type == COMMON_JSON_STACK_ELEMENT_ARRAY) {
                     closing += "]";
                 } else if (el.type != COMMON_JSON_STACK_ELEMENT_KEY) {
                     throw std::runtime_error("Unexpected stack element type");
                 }
             }
 
             const auto & magic_seed = out.healing_marker.marker = healing_marker;//"$llama.cpp.json$";
 
             if (err_loc.stack.back().type == COMMON_JSON_STACK_ELEMENT_KEY) {
                 // We're inside an object value
                 if (last_non_sp_char == ':' && can_parse(str + "1" + closing)) {
                     // Was about to create an object value
                     str += (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\"" + closing;
                 } else if (can_parse(str + ": 1" + closing)) {
                     str += (out.healing_marker.json_dump_marker = ":\"" + magic_seed) + "\"" + closing;
                 } else if (last_non_sp_char == '{' && can_parse(str + closing)) {
                     // Was about to create an object
                     str += (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\": 1" + closing;
                 } else if (can_parse(str + "\"" + closing)) {
                     // Was inside an object value string
                     str += (out.healing_marker.json_dump_marker = magic_seed) + "\"" + closing;
+                } else if (auto unicode_padding = common_json_unicode_padding(str);
+                        unicode_padding && can_parse(str + *unicode_padding + "\"" + closing)) {
+                    str += (out.healing_marker.json_dump_marker = *unicode_padding + magic_seed) + "\"" + closing;
                 } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
                     // Was inside an object value string after an escape
                     str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"" + closing;
                 } else {
                     // find last :
                     auto last_pos = str.find_last_of(':');
                     if (last_pos == std::string::npos) {
                         throw std::runtime_error("Cannot heal a truncated JSON that stopped in an unknown location");
                     }
                     // Cutting back to opening : for object value
                     str = str.substr(0, last_pos + 1) + (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\"" + closing;
                 }
             } else if (err_loc.stack.back().type == COMMON_JSON_STACK_ELEMENT_ARRAY) {
                 if ((last_non_sp_char == ',' || last_non_sp_char == '[') && can_parse(str + "1" + closing)) {
                     // Was about to create an array value
                     str += (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\"" + closing;
                 } else if (can_parse(str + "\"" + closing)) {
                     // Was inside an array value string
                     str += (out.healing_marker.json_dump_marker = magic_seed) + "\"" + closing;
+                } else if (auto unicode_padding = common_json_unicode_padding(str);
+                        unicode_padding && can_parse(str + *unicode_padding + "\"" + closing)) {
+                    str += (out.healing_marker.json_dump_marker = *unicode_padding + magic_seed) + "\"" + closing;
                 } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
                     // Was inside an array value string after an escape
                     str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"" + closing;
                 } else if (!was_maybe_number() && can_parse(str + ", 1" + closing)) {
                     // Had just finished a value
                     str += (out.healing_marker.json_dump_marker = ",\"" + magic_seed) + "\"" + closing;
                 } else {
                     auto last_pos = str.find_last_of("[,");
                     if (last_pos == std::string::npos) {
                         throw std::runtime_error("Cannot heal a truncated JSON array stopped in an unknown location");
                     }
                     // Cutting back to last [ or , for array value
                     str = str.substr(0, last_pos + 1) + (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\"" + closing;
                 }
             } else if (err_loc.stack.back().type == COMMON_JSON_STACK_ELEMENT_OBJECT) {
                 if ((last_non_sp_char == '{' && can_parse(str + closing)) ||
                         (last_non_sp_char == ',' && can_parse(str + "\"\": 1" + closing))) {
                     // Was about to create an object key+value
                     str += (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\": 1" + closing;
                 } else if (!was_maybe_number() && can_parse(str + ",\"\": 1" + closing)) {
                     // Was about to create an object key+value
                     str += (out.healing_marker.json_dump_marker = ",\"" + magic_seed) + "\": 1" + closing;
                 } else if (can_parse(str + "\": 1" + closing)) {
                     // Was inside an object key string
                     str += (out.healing_marker.json_dump_marker = magic_seed) + "\": 1" + closing;
+                } else if (auto unicode_padding = common_json_unicode_padding(str);
+                        unicode_padding && can_parse(str + *unicode_padding + "\": 1" + closing)) {
+                    str += (out.healing_marker.json_dump_marker = *unicode_padding + magic_seed) + "\": 1" + closing;
                 } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\": 1" + closing)) {
                     // Was inside an object key string after an escape
                     str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\": 1" + closing;
                 } else {
                     auto last_pos = str.find_last_of(':');
                     if (last_pos == std::string::npos) {
                         throw std::runtime_error("Cannot heal a truncated JSON object stopped in an unknown location");
                     }
                     // fprintf(stderr, "Cutting back to last : for object key+value\n");
                     str = str.substr(0, last_pos + 1) + (out.healing_marker.json_dump_marker = "\"" + magic_seed) + "\"" + closing;
                 }
             } else {
                 throw std::runtime_error("Cannot heal a truncated JSON object stopped in an unknown location");
             }
             // fprintf(stderr, "HEALED:\nSTRING <<<\n%s\n>>>\n\nmagic_cut: <<<\n%s\n>>>\n\n", str.c_str(), out.healing_marker.json_dump_marker.c_str());
             out.json = json::parse(str);
             it = temptative_end;
             return true;
         }
         // TODO: handle unclosed top-level primitive if the stack was empty but we got an error (e.g. "tru", "\"", etc...)
         // fprintf(stderr, "Closing: TODO\n");
         return false;
     }
     out.json = json::parse(it, end);
     it = end;

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Oct 12, 2025

Just tested your patch, and it works flawlessly in my case too.
It fully fixes the UTF-8 handling problem I was encountering with tool calls. Great work!

I asked the model to search Google for the en dash (\u2013).
Without the fix, an extra double quote appears; with your patch (and with mine), everything works as expected.

Without the fix: [screenshot: before]

With this PR: [screenshot: after]

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Oct 12, 2025

Alternative (my patch)
Handles incomplete \u sequences through a lightweight local utility function that inspects the trailing characters of the string.
It counts hexadecimal digits, checks for a preceding high surrogate, and returns only the minimal padding (zeros, the rest of DC00, or \uDC00).
No <regex> or heavy allocations are involved.

This helper is invoked only when the existing healing heuristics fail, then reuses the same can_parse(...) logic used elsewhere, ensuring Unicode padding is applied only when the resulting string is actually parseable.

Argument serialization is done with a single call to dump(-1, ' ', true, json::error_handler_t::strict), preserving escaped Unicode and surfacing inconsistencies immediately, without spreading multiple dump() variants.


PR approach
Introduces a dependency on <regex> and scans suffixes with std::regex_search, maintaining several hardcoded paddings ("udc00", "dc00", "c00", etc.), which increases parsing cost and conceptual complexity.

It handles the same healing branches by directly injecting unicode_marker_padding without reusing the existing verification logic, which requires manual synchronization of each case with string literals.

It also repeats ensure_ascii activations across multiple dump() calls and adds a long series of dedicated unit tests for partial \u variants, which makes the patch larger but well tested.

@aldehir
Collaborator Author

aldehir commented Oct 12, 2025

@ServeurpersoCom thank you for testing it out. Looks like we came up with pretty similar solutions.

I was originally going your route, but found myself overwhelmed by the boilerplate needed, which is why I used regex instead. I don't think the performance impact is significant, but I don't have any metrics to back that up.

@ServeurpersoCom
Collaborator

> @ServeurpersoCom thank you for testing it out. Looks like we came up with pretty similar solutions.
>
> I was originally going your route, but found myself overwhelmed with the boilerplate needed which is why I used regex instead. I don't think the performance impact is significant, but I don't have any metrics to back that up.

Yes, I prefer your version: it's easier to read and already well integrated with the test suite. Mine was a bit lower-level, but yours is definitely cleaner and more maintainable.

@ggerganov ggerganov merged commit 2c301e9 into ggml-org:master Oct 12, 2025
69 checks passed

Development

Successfully merging this pull request may close these issues.

Eval bug: Granite 4 crashes when certain strings (seems both : and \u) are included in tool arguments

3 participants