New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Akka incorrectly assumes UTF-8 for incoming text/xml media type without charset #1134

Closed
jypma opened this Issue May 16, 2017 · 6 comments

Comments

Projects
None yet
2 participants
@jypma
Member

jypma commented May 16, 2017

According to iana, the text/xml media type's charset property is optional.

When creating a ContentType in akka itself, as the http client, we currently require providing a charset for text/xml, which is a great recommendation.

However, as a server, we can't control what clients send. But an incoming request which just has Content-Type: text/xml is now represented as ContentType(text/xml, charset=UTF-8), e.g.

      path("echo") {
        extractRequest { rq => 
          complete("" + rq.header[`Content-Type`])
        }
      }

and the client:

curl -v -X POST -H 'Content-Type: text/xml' -d '<?xml><hello/>' http://localhost:8889/echo
Some(Content-Type: text/xml; charset=UTF-8)

where I would have hoped to see a ContentType with mediaType text/xml and charsetOption of None.

@jrudolph

This comment has been minimized.

Show comment
Hide comment
@jrudolph

jrudolph May 16, 2017

Member

Thanks, @jypma.

I think the problem is that the Content-Type data model currently does not allow to represent those values. The type hierarchy requires that a MediaType.WithOpenCharset must be paired with a concrete charset on construction.

For outgoing HTTP messages that seems like a sensible default but for incoming messages we decide directly in the parser to use UTF-8 as the default if none else is given:

// if we have an open charset media-type but no charset parameter we default to UTF-8

Member

jrudolph commented May 16, 2017

Thanks, @jypma.

I think the problem is that the Content-Type data model currently does not allow to represent those values. The type hierarchy requires that a MediaType.WithOpenCharset must be paired with a concrete charset on construction.

For outgoing HTTP messages that seems like a sensible default but for incoming messages we decide directly in the parser to use UTF-8 as the default if none else is given:

// if we have an open charset media-type but no charset parameter we default to UTF-8

@jypma

This comment has been minimized.

Show comment
Hide comment
@jypma

jypma May 16, 2017

Member

Right, thanks for the link.

My use case: we're using akka-http in a proxy, and would like to pass on the incoming request as unmodified as possible. Currently it turns text/xml into text/xml; charset=UTF-8.

Since it's a legal (but discouraged) content type, how about adding .toContentTypeWithoutCharset to MediaType.WithOpenCharset?

Member

jypma commented May 16, 2017

Right, thanks for the link.

My use case: we're using akka-http in a proxy, and would like to pass on the incoming request as unmodified as possible. Currently it turns text/xml into text/xml; charset=UTF-8.

Since it's a legal (but discouraged) content type, how about adding .toContentTypeWithoutCharset to MediaType.WithOpenCharset?

@jrudolph

This comment has been minimized.

Show comment
Hide comment
@jrudolph

jrudolph May 16, 2017

Member

Since it's a legal (but discouraged) content type, how about adding .toContentTypeWithoutCharset to MediaType.WithOpenCharset?

You still wouldn't know if the charset originally came with the request or if a default was added later in the parser.

We have a few potential solutions:

  • Add a configuration setting not to apply the default charset when one is missing but to create a custom binary media type in the parser instead. (Unmarshallers should then still work but needs to be tested)
  • Extend the model to add a case class ContentType.WithMissingCharset(MediaType.WithOpenCharset) extends ContentType. That's not strictly binary compatible because a library could have relied on a match over the sealed ContentType hierarchy being exhaustive. I think that's quite unlikely so I'd be fine with breaking binary compatibility in that regard if we fix all internal matches on content type.
Member

jrudolph commented May 16, 2017

Since it's a legal (but discouraged) content type, how about adding .toContentTypeWithoutCharset to MediaType.WithOpenCharset?

You still wouldn't know if the charset originally came with the request or if a default was added later in the parser.

We have a few potential solutions:

  • Add a configuration setting not to apply the default charset when one is missing but to create a custom binary media type in the parser instead. (Unmarshallers should then still work but needs to be tested)
  • Extend the model to add a case class ContentType.WithMissingCharset(MediaType.WithOpenCharset) extends ContentType. That's not strictly binary compatible because a library could have relied on a match over the sealed ContentType hierarchy being exhaustive. I think that's quite unlikely so I'd be fine with breaking binary compatibility in that regard if we fix all internal matches on content type.
@jypma

This comment has been minimized.

Show comment
Hide comment
@jypma

jypma May 16, 2017

Member

I like them both. I think WithMissingCharset is more in line with akka-http's strongly typed HTTP model; it would however make all those already somewhat complex match statements even more complex... OTOH, one could argue that it is real domain complexity, not just complexity we're adding.

I'll give a go at option 2. I hope the compiler will nudge me if I forget a case.

Member

jypma commented May 16, 2017

I like them both. I think WithMissingCharset is more in line with akka-http's strongly typed HTTP model; it would however make all those already somewhat complex match statements even more complex... OTOH, one could argue that it is real domain complexity, not just complexity we're adding.

I'll give a go at option 2. I hope the compiler will nudge me if I forget a case.

@jrudolph

This comment has been minimized.

Show comment
Hide comment
@jrudolph

jrudolph May 16, 2017

Member

👍 Cool, thanks.

Member

jrudolph commented May 16, 2017

👍 Cool, thanks.

@jypma

This comment has been minimized.

Show comment
Hide comment
@jypma

jypma May 17, 2017

Member

@jrudolph what's ContentType.binary actually used for? I'm in doubt whether to set it to true or false for the new WithMissingCharset. Trying false for now.

Member

jypma commented May 17, 2017

@jrudolph what's ContentType.binary actually used for? I'm in doubt whether to set it to true or false for the new WithMissingCharset. Trying false for now.

jypma added a commit to jypma/akka-http that referenced this issue May 17, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 17, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 18, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 22, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 22, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 24, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 29, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

jypma added a commit to jypma/akka-http that referenced this issue May 29, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".

@jrudolph jrudolph closed this in c3b6705 May 31, 2017

jrudolph added a commit that referenced this issue May 31, 2017

Merge pull request #1141 from jypma/contenttype_missing_charset
Fix #1134: Create representation of Content-Type with missing charset

@jrudolph jrudolph added this to the 10.0.8 milestone May 31, 2017

tomrf1 added a commit to tomrf1/akka-http that referenced this issue Aug 13, 2017

Fix #1134: Create representation of Content-Type with missing charset
The text/xml content-type (and possibly others) is best used with a
charset parameter, and akka inforces this on the type level. That's
great practice, but _incoming_ requests can certainly have a
Content-Type header of "text/xml" without a charset. Akka currently
just blindly adds "UTF-8", which is incorrect and a potential problem.

This commit adds a specific type to the ContentType hierarchy to represent
"a media type that should have had a charset but didn't".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment