Add ->decoded_content and a few utility methods. #3

Closed
wants to merge 3 commits into
from

3 participants

@kentfredric
  ->decoded_content(?forced_contenttype) => scalar
  ->content_type()   => scalar or undef
  ->content_type_params => arrayref

decoded_content optionally takes a encoding field that overrides
autodetection logic and ultimately serves as a shorthand that means

  ->decoded_content('utf-8')

Yields

  Encode::decode('utf-8', $result->content, Encode::FB_CROAK );

Removal of this optional parameter will revert to using content_type and
content_type_params to determine the decoding:

  1. If there is no content-type, there is no decoding.
  2. If the content type is not text/* , there is no decoding.
  3. If the content type contains no parameters after a ';' , there is no decoding.
  4. If a parameter in the parameter list matches charset=(.*), $1 is picked as an encoding.
  5. If there are multiple such parameters, the rightmost one takes precedence.
  6. If there is no such parameter, there is no encoding.

The design of the above is also designed to make it easy for people who
are working with content-types that don't qualify as "text" as per
IANA/W3C, and may have non-standard parameters that are also not
supported in the spec.

For instance, myself, I'm dealing with some upstream declaring
'application/json;charset=utf-8', and I could potentially map that
locally now as:

  if ( $result->content_type eq 'application/json' ) {
    my $charset = 'utf-8'; # Spec says to assume utf8 anyway
    for my $param (@{ $result->content_params }) {
      if ( $param =~ /^charset=(.*)$/ ) {
        $charset = $1;
      }
    }
    return $json->decode( $result->decoded_content( $charset ) );
  }

And that would theoretically work if upstream went insane and changed
'=utf-8' to '=iso-8859-15'.

@dagolden

Thanks! FYI, I probably won't get a chance to review it until later this week. If you haven't heard from me by next week, feel free to remind me.

@dagolden

Sorry to take so long.

Overall it seems useful and well designed. I wonder about Encode::FB_CROAK as a default. At the least, it needs to be documented. I wonder whether decoded_content should take a optional hashref of options to control such things

$response->decoded_content({ encoding => 'utf8', fallback => Encode::FB_DEFAULT})

I also wonder whether you want to differ between a forced content type and a default one. Might someone want to say "assume UTF-8 unless specified explicitly"?

@kentfredric

DDDing to see how this looks:

$response->decoded_content(\%opts);

encoding

The decoding type to use if none is specified

Default is none: Returns bytes off the wire.

force_encoding

Use encoding regardless of detected encoding

fallback

How to handle errors in decoding.

Defaults to Encode::FB_CROAK,

Just the question is about sane defaults:

Should fallback be FB_DEFAULT or FB_CROAK by default?

Should specifying an encoding imply force_encoding by default or not?

And it should be clear from reading above that:

$response->decoded_content({ encoding => undef, force_encoding => 1 });

Should function the same as

$response->content();
@dagolden

Bikeshedding, I'd say just force rather than force_encoding since it's more clearly binary and not an alternate encoding.

I agree this:

$response->decoded_content({ encoding => undef, force => 1 });

Would be equivalent to $response->content (bytes on wire).

I tend to think FB_DEFAULT is a better default under the "liberal in what we accept" principle. Otherwise, every ->decoded_content needs to wrapped in eval and that gets ugly quick. Someone who doesn't mind adding all the error handling code won't mind adding a fallback argument, either.

It means this:

$response->decoded_content()

will just DWIM and give a reasonably sane best-efforts answer.

What do you think?

@kentfredric

Seems reasonable.

@kentfredric kentfredric Add ->decoded_content and a few utility methods.
  ->decoded_content(?forced_contenttype) => scalar
  ->content_type()   => scalar or undef
  ->content_type_params => arrayref

decoded_content optionally takes a encoding field that overrides
autodetection logic and ultimately serves as a shorthand that means

  ->decoded_content('utf-8')

Yields

  Encode::decode('utf-8', $result->content, Encode::FB_CROAK );

Removal of this optional parameter will revert to using content_type and
content_type_params to determine the decoding:

  1. If there is no content-type, there is no decoding.
  2. If the content type is not text/* , there is no decoding.
  3. If the content type contains no parameters after a ';' , there is
     no decoding.
  4. If a parameter in the parameter list matches `charset=(.*)`, $1 is
     picked as an encoding.
  5. If there are multiple such parameters, the rightmost one takes
     precedence.
  6. If there is no such parameter, there is no encoding.

The design of the above is also designed to make it easy for people who
are working with content-types that don't qualify as "text" as per
IANA/W3C, and may have non-standard parameters that are also not
supported in the spec.

For instance, myself, I'm dealing with some upstream declaring
'application/json;charset=utf-8', and I could potentially map that
locally now as:

  if ( $result->content_type eq 'application/json' ) {
    my $charset = 'utf-8'; # Spec says to assume utf8 anyway
    for my $param (@{ $result->content_params }) {
      if ( $param =~ /^charset=(.*)$/ ) {
        $charset = $1;
      }
    }
    return $json->decode( $result->decoded_content( $charset ) );
  }

And that would theoretically work if upstream went insane and changed
'=utf-8' to '=iso-8859-15'.
f01b465
@kentfredric kentfredric added a commit to kentfredric/HTTP-Tiny-UA that referenced this pull request Mar 23, 2014
@kentfredric kentfredric Reimplement parts of decoded_content(%opts) as discussed in #3.
opts = {
  encoding => ?
  force    => ?
  fallback => ?
}

`encoding` sets the default encoding in cases where one cannot be
detected.

`force` overrides detection and uses the specified encoding.

`fallback` takes an Encoding.pm fallback value and defaults to
FB_DEFAULT

{ encoding => undef , force => 1 } and { force => 1 } both presently
revert `decoded_content` to work the same as `content`.
7c607f9
@kentfredric

Wow. This bug I had in my code: this is somehow legal:

perl -MEncode -E 'say Encode::decode(q[charset=utf-8], q[hello], Encode::FB_DEFAULT )'
hello

I'm scared:

perl -MEncode -E 'say Encode::find_encoding(q[charset=utf8])->name'
Can't call method "name" on an undefined value at -e line 1.
perl -MEncode -E 'say Encode::find_encoding(q[utf8])->name'
utf8
perl -MEncode -E 'say Encode::find_encoding(q[charset=utf-8])->name'
utf-8-strict
perl -MEncode -E 'say Encode::find_encoding(q[utf-8])->name'
utf-8-strict
kentfredric added some commits Mar 23, 2014
@kentfredric kentfredric Reimplement parts of decoded_content(%opts) as discussed in #3.
opts = {
  encoding => ?
  force    => ?
  fallback => ?
}

`encoding` sets the default encoding in cases where one cannot be
detected.

`force` overrides detection and uses the specified encoding.

`fallback` takes an Encoding.pm fallback value and defaults to
FB_DEFAULT

{ encoding => undef , force => 1 } and { force => 1 } both presently
revert `decoded_content` to work the same as `content`.

Fixup code
36f94aa
@kentfredric kentfredric Update test with new syntax, and cover the 3 kinds of defaulting/forcing 14219b4
@dagolden
@kentfredric

Just a reminder as per requested, does this need anything else done to it?

@kentfredric kentfredric referenced this pull request in kentnl/HTTP-Tiny-Mech Jul 22, 2014
Open

Return decoded_content in _unwrap_response(). #2

@kentfredric

Ugh. Its probably worth mentioning in LWP, the nightmare is that decoded_content() seems to not pertain to the encoding of the content itself in terms of character sets, but the encoding of the content in terms of bit packing format ( ie: gzip ).

In the aforementioned pull req, it seems WWW::Mechanize does something very daft:

  • It sets a custom Accept-Encoding header without the user asking it to
  • It sets it to gzip automatically if the right IO:: stuff exists.
  • But returns ->content undecoded
  • And returns headers indicating encoded content.

Which strikes me as horrible horrible horrible design, and I'd hate to either

  • confuse LWP/WWW::Mech users thinking that method and this method do the same thing.
  • make the same mistake of making users responsible for decoding data they never requested encoded in the first place.

^ If you have strong opinions on what WWW::Mechanize should be doing here such that you agree with me, please comment on that thread, but I'm not sure we can change what WWW::Mech does :(

@xdg
xdg commented Mar 1, 2016

I think this should try to match the capabilities of HTTP::Message, even if the API is different.

Is that something you'd like to add or should I close this PR as incomplete?

@kentfredric

Its a long time ago now and 1.5 years is a bit long for me to reorient myself mentally to what I was doing and why.

I'm not sure which capabilities are specifically missing or needed, and I don't know enough about what I'm doing any more without needing to do a lot of research again in that department. ( esp with analysing HTTP::Message feature set )

Its still a thing that would be useful, and I don't think we need a 99.99% solution off the bat, but I'm not bothered if my commits/implementation of it as such aren't directly used.

The giant time factor has turned it into an "eh, ... "

@xdg
xdg commented Mar 1, 2016

OK. Thanks for clarifying. I'm going to close this and perhaps someday someone will pick it up.

@xdg xdg closed this Mar 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment