Always allow access to the http body and smarter encoding handling #58

@ahenket

Use case: a CSV file from a Dutch national body comes back with a Content-Type header declaring UTF-8, but the actual body is UTF-16 LE (BOM FF FE).

Reproduction (based on eXist-db 6)

declare namespace http = "http://expath.org/ns/http-client";

http:send-request(
    <http:request method="GET" href="https://publicaties.rvig.nl/media/13307/download">
        <http:header name="Accept" value="text/csv"/>
        <http:header name="Cache-Control" value="no-cache"/>
        <http:header name="Max-Forwards" value="1"/>
    </http:request>
)[2]

The response comes back with a Content-Type header declaring UTF-8, but since the actual content is UTF-16 I now get: "Failed to parse server's response: An invalid XML character (Unicode: 0x0) was found in the element content of the document."

I can override the server-provided Content-Type using override-media-type="text/csv; charset=utf-16", but this requires me to know the encoding beforehand. I have reported the mismatched Content-Type to the responsible party, but I doubt whether or when that will have any effect.

I would like to get to a place where I can always access the contents of a send-request() response, so I can work out some fallback scheme myself.

Ideally:

  • Always allow me access to the body, as binary if all else fails, to prevent hard uncatchable errors, e.g. about hex 0
  • Process body based on BOM if present before relying on Content-Type encoding
  • Process body based on Content-Type encoding if no BOM present
  • Process body based on UTF-8 if no BOM or Content-Type encoding present
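
For illustration, the order above could be approximated from user code roughly as follows. This is an untested sketch, not the library's behaviour. It assumes that override-media-type="application/octet-stream" makes send-request() return the body as xs:base64Binary (as the EXPath spec describes for binary media types), that the response still reports the server's own Content-Type header, and that eXist-db's util:binary-to-string() is available. The local:bom-charset() and local:declared-charset() helpers are hypothetical names introduced here.

```xquery
xquery version "3.1";

declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper: charset implied by a leading BOM, or () if there is none. :)
declare function local:bom-charset($body as xs:base64Binary) as xs:string? {
    (: Casting to xs:hexBinary gives uppercase hex digits. This converts the
       whole body, so it is wasteful for large responses. :)
    let $hex := string(xs:hexBinary($body))
    return
        if (starts-with($hex, "EFBBBF")) then "UTF-8"
        (: Java's "UTF-16" charset reads the BOM itself to pick LE or BE. :)
        else if (starts-with($hex, "FFFE") or starts-with($hex, "FEFF")) then "UTF-16"
        else ()
};

(: Hypothetical helper: charset parameter of a Content-Type value, or (). :)
declare function local:declared-charset($content-type as xs:string?) as xs:string? {
    if (matches($content-type, "charset=", "i"))
    then replace($content-type, '^.*charset="?([^;"\s]+).*$', "$1", "i")
    else ()
};

let $response := http:send-request(
    <http:request method="GET"
                  href="https://publicaties.rvig.nl/media/13307/download"
                  override-media-type="application/octet-stream">
        <http:header name="Accept" value="text/csv"/>
    </http:request>
)
(: Assumption: with the override above, item 2 is xs:base64Binary. :)
let $body := $response[2]
let $content-type :=
    ($response[1]/http:header[lower-case(@name) = "content-type"]/@value)[1]
(: BOM first, then the declared charset, then UTF-8 as the last resort. :)
let $charset :=
    (local:bom-charset($body), local:declared-charset($content-type), "UTF-8")[1]
return
    try {
        (: Drop a leading U+FEFF in case the decoder kept the BOM. :)
        replace(util:binary-to-string($body, $charset), "^&#xFEFF;", "")
    } catch * {
        (: If decoding fails, hand back the raw binary instead of raising. :)
        $body
    }
```

The sketch only recognises UTF-8 and UTF-16 BOMs; a UTF-32 LE BOM (FF FE 00 00) would be misread as UTF-16. Having this order built into send-request() would avoid both the second pass over the body and the need to know about the override at all.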
