wrong character set in parsing request #1

Closed
desoi opened this Issue Sep 30, 2011 · 8 comments


@desoi
desoi commented Sep 30, 2011

An incoming request has

Content-Type: application/json; charset=UTF-8

but the server treats it as latin-1 (the default). It seems the code correctly parses the character set from the content type but ignores it unless the major type is "text". See parse-content-type.
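For illustration, here is a minimal Common Lisp sketch of what extracting the charset parameter from a Content-Type header value involves. This is a hypothetical helper written for this discussion, not Hunchentoot's actual parse-content-type:

```lisp
;; Hypothetical helper: pull the value of the charset parameter out of a
;; Content-Type header value.  Returns NIL when no charset is present.
(defun content-type-charset (content-type)
  "Extract the charset parameter from CONTENT-TYPE, or NIL if absent."
  (let ((start (search "charset=" (string-downcase content-type))))
    (when start
      (let* ((value-start (+ start (length "charset=")))
             (end (or (position #\; content-type :start value-start)
                      (length content-type))))
        ;; Strip surrounding spaces and optional quotes around the value.
        (string-trim " \"" (subseq content-type value-start end))))))
```

For example, `(content-type-charset "application/json; charset=UTF-8")` yields `"UTF-8"`, while a header with no charset parameter yields NIL.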

@hanshuebner
Member

application/json is always encoded as utf-8, at least that is what I understood so far.

@desoi
desoi commented Oct 31, 2011

Then the problem is that the web server needs to either assume UTF-8 or correctly interpret whatever character set is specified. The current version does neither. The only work-around I found was to explicitly specify the external format in the handler, e.g.

(defhandler (rpc-handler :uri "/rpc" :default-request-type :post) ()
  (setf (tbnl:content-type*) "application/json")
  (setf (tbnl:reply-external-format*) +utf-8+)
  (invoke-rpc (tbnl:raw-post-data :external-format +utf-8+)))

The parse-content-type function should handle this for the handler, but currently it does not.

@hanshuebner
Member


This does seem like the right thing to do. At the moment, Hunchentoot
is agnostic about non-text content types, and I'd say rightfully so.
What would be your suggestion?

-Hans

@hanshuebner
Member

One could argue that Hunchentoot should not attempt to convert the request body to a string for non-text content types. If the content type is in fact textual, the caller should have to specify the external format when calling RAW-POST-DATA. I am open to other suggestions (and patches).
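The convention proposed above can be sketched as follows, assuming Hunchentoot's easy-handler framework and flexi-streams are loaded. The handler name, URI, and invoke-rpc are placeholders carried over from the earlier example in this thread:

```lisp
;; Sketch of the proposed convention: take the request body as raw octets
;; and let the handler decide how to decode it.
(hunchentoot:define-easy-handler (rpc-handler :uri "/rpc") ()
  (setf (hunchentoot:content-type*) "application/json")
  (let* (;; :force-binary t returns the body as an octet vector instead
         ;; of a string decoded with the default external format.
         (octets (hunchentoot:raw-post-data :force-binary t))
         ;; Decode as UTF-8 here, per the RFC 4627 reading advanced in
         ;; this thread for application/json.
         (body (flex:octets-to-string octets :external-format :utf-8)))
    (invoke-rpc body)))
```

The decoding step is ordinary flexi-streams usage; the handler stays in control of the external format instead of inheriting the server default.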

@desoi
desoi commented Oct 31, 2011

Does not the fact that a charset is provided tell you that it is textual? There is nothing magic about the MIME type containing "text". Think of all the popular MIME types that are textual: in every case Hunchentoot will give you the wrong character encoding by default (unless it matches by luck). What I did above is not the correct solution either, because I assumed UTF-8, and other character sets are allowed. To handle correctly any text request whose MIME type does not contain "text", the handler will need to parse the charset itself. This repeats work Hunchentoot has already done, but discarded because "text" was not in the MIME type.
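The per-handler workaround described above can be sketched end to end, with the handler re-parsing the charset inline and decoding the raw body with it. This is an illustrative sketch, not a proposed patch; the handler name, URI, and invoke-rpc are placeholders, and the charset parsing is deliberately naive (it assumes charset is the last parameter in the header):

```lisp
;; Sketch: the handler repeats the charset parsing that the server
;; already performed, then decodes the body itself.
(hunchentoot:define-easy-handler (rpc-handler :uri "/rpc") ()
  (let* ((content-type (hunchentoot:header-in* :content-type))
         (start (and content-type
                     (search "charset=" (string-downcase content-type))))
         ;; Fall back to latin-1, Hunchentoot's default, when no
         ;; charset parameter is present.
         (charset (if start
                      (subseq content-type (+ start (length "charset=")))
                      "latin-1"))
         ;; flexi-streams accepts the charset name as a keyword,
         ;; e.g. :utf-8 or :latin-1.
         (format (flex:make-external-format
                  (intern (string-upcase charset) :keyword))))
    (invoke-rpc (hunchentoot:raw-post-data :external-format format))))
```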

This is not a big deal and I can work around it. But I suspect it will come up again in the near future :).

@hanshuebner
Member

Please study http://www.ietf.org/rfc/rfc2045.txt to learn about the
charset attribute in content types, where it applies and where it does
not. application/json is always UTF-8 encoded, and if you are writing
an application that relies on some other encoding, you need to either
use text/json or lose compatibility with clients which follow the RFC.

@desoi
desoi commented Oct 31, 2011

Please see the HTTP 1.1 specification (rfc2616) section 3.7.1. It says:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

It does not say that the charset applies only to text subtypes, nor that it should be ignored when the subtype is not text. All I'm saying is that it would be nice to get a UTF-8 stream rather than an ISO-8859-1 stream when the content type tells you what the character set is.

I'm simply using application/json as an example. If "application/whoknowswhat; charset=UTF-8" is the content type, the web server could provide a UTF-8 stream rather than some default (or incorrect) encoding.

@hanshuebner
Member

My argument is that per rfc2045, the charset parameter is used only
with text content types. You have not convinced me, as the example you
came up with, "application/json", is defined - per rfc4627 - to use
Unicode as its encoding. Maybe you can come up with a more convincing
use case?

As I said, the right thing would be to return an octet vector for all
non-text content types and have the application do the proper
decoding. I agree that falling back to the default external format is
wrong; I don't agree with your proposed fix, but maybe you can come up
with a better example?

-Hans
