-
Notifications
You must be signed in to change notification settings - Fork 13.3k
common : handle unicode during partial json parsing #16526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common : handle unicode during partial json parsing #16526
Conversation
I test it also |
I just hit the same issue myself while testing UTF-8 compliance in tool calls, and here’s the fix I came up with :
|
Just tested your patch, and it works flawlessly in my case too. I asked the model to search for the long dash \u2013 on Google. |
Alternative (my patch) This helper is invoked only when the existing healing heuristics fail, then reuses the same Argument serialization is done with a single call to PR approach It handles the same healing branches by directly injecting It also repeats |
@ServeurpersoCom thank you for testing it out. Looks like we came up with pretty similar solutions. I was originally going your route, but found myself overwhelmed with the boilerplate needed which is why I used regex instead. I don't think the performance impact is significant, but I don't have any metrics to back that up. |
Yes, I prefer your version : it’s easier to read and already well-integrated with the test suite. Mine was a bit lower-level, but yours is definitely cleaner and more maintainable. |
The JSON partial parsing does not handle partial Unicode escape sequences, causing exceptions when streaming. This PR adds logic to handle Unicode that aligns with the current JSON healing implementation.
Fixes #16465
Specifics
A Unicode escape sequence comes in the form
\uXXXX
. Most models do not have unique tokens for each Unicode code point, so they emit them in succession:['\\', 'u', 'X', 'X', 'X', 'X']
. This breaks the JSON parsing and dumping operations used to form a partial arguments string.Here are the rules I apply to fix the generation so we properly apply a healing marker:
\
, the existing code handles it.\u
,\uX
,\uXX
,\uXXX
, or\uXXXX
, I pad it with zeros until it becomes a complete sequence.U+D800-U+DBFF
), either partially or fully, I pad it as above and add a fake low surrogate (U+DC00
).U+DC00
).\
may follow a high surrogate. To handle this, the default padding is a valid low surrogate.Risks
I tried my best to minimize any impact, since we use this logic through the code base. I only apply the Unicode logic when the parsing of the existing rules fails.
However, I had to change the call to
j.dump()
to setensure_ascii = true
, otherwise it won't escape the Unicode sequences. Since Unicode handling is already unstable, I don't think this makes things worse.