lenient utf8 parser #34

jhump · 2022-09-16T19:11:01Z

As of today, protoc accepts source files that contain invalid UTF-8. That means proto sources that are mainly compiled with protoc (such as the googleapis module) could contain badly encoded bytes, so protocompile needs to allow them too, at least for now.

This makes protocompile work the same way as protoparse: badly encoded bytes are silently replaced with the Unicode replacement character (U+FFFD), which is how lenient UTF-8 decoders are expected to work. This does not match the behavior of protoc, but it is an acceptable variance for now.
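For readers unfamiliar with lenient decoding, here is a minimal sketch of the replacement behavior described above. It is not the code from this PR: `decodeLenient` is a hypothetical helper written only for illustration, and it relies solely on the standard library's `utf8.DecodeRune`, which reports an invalid encoding as `(utf8.RuneError, 1)`.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeLenient is a hypothetical helper (an assumption for illustration,
// not protocompile's actual implementation). It decodes src as UTF-8 and
// substitutes U+FFFD for every byte that is not part of a valid encoding.
func decodeLenient(src []byte) string {
	var sb strings.Builder
	sb.Grow(len(src))
	for len(src) > 0 {
		// On invalid input DecodeRune returns (utf8.RuneError, 1), so
		// writing r emits U+FFFD and the loop advances by a single byte.
		r, size := utf8.DecodeRune(src)
		sb.WriteRune(r)
		src = src[size:]
	}
	return sb.String()
}

func main() {
	// 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
	input := []byte("syntax = \"proto3\"; // caf\xe9")
	fmt.Println(decodeLenient(input))
}
```

In this sketch each invalid byte yields its own replacement character. Other lenient decoders may collapse a run of invalid bytes into a single U+FFFD, so the exact output for malformed input can differ between implementations.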

This addresses an old bug filed by @amckinney: jhump/protocompile-old#5

lenient utf8 parser

d33fd6d

jhump requested a review from pkwarren September 16, 2022 19:11

jhump mentioned this pull request Sep 16, 2022

Failed to parse invalid UTF8 jhump/protocompile-old#5

Closed

pkwarren approved these changes Sep 16, 2022


jhump merged commit 03fff2f into main Sep 16, 2022

jhump deleted the jh/allow-bad-utf8-for-now branch September 16, 2022 21:01

jhump mentioned this pull request Sep 21, 2022

Use latest protocompile in buf format, fix things up in the formatter bufbuild/buf#1427

Merged
