Allow any bytes (including non-UTF8 ones) in List Objects response XML #1255

Merged
merged 4 commits into from
Oct 6, 2015

@shino shino commented Sep 29, 2015

This PR addresses #974 (RCS-289)

For the following reasons, a List Objects response cannot always be valid XML.

  • The AWS S3 response is XML 1.0, and XML 1.0 allows only #x9, #xA and
    #xD among characters below #x20. So a key uploaded by PUT Object with
    a path like %01 (in URL-encoded form) cannot be included in valid
    XML 1.0. AWS S3 does allow such bytes [1].
  • AWS S3 responds to List Objects by representing %01 as &#x1;, which
    looks like a numeric character reference but is invalid in XML 1.0 [2].
    s3cmd and the AWS CLI both fail to parse a response that includes &#x1;.
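
As an illustration of the parsing failure described above, here is a minimal Python sketch using the standard-library parser (expat-based). The response fragments are hypothetical, shortened stand-ins, not real S3 output, and the clients named above may fail in their own ways.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(body: bytes) -> bool:
    """Return True if the standard-library XML parser accepts the body."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False

# Hypothetical, minimal fragments for illustration only.
with_char_ref = b"<Contents><Key>abc&#x1;def</Key></Contents>"   # S3-style &#x1;
with_raw_byte = b"<Contents><Key>abc\x01def</Key></Contents>"    # raw 0x01 byte
plain = b"<Contents><Key>abcdef</Key></Contents>"

for name, body in [("char ref", with_char_ref),
                   ("raw byte", with_raw_byte),
                   ("plain", plain)]:
    print(name, parses_as_xml(body))
```

Neither the character reference nor the raw byte is accepted by an XML 1.0 parser, so escaping 0x01 as &#x1; does not make the response any more parseable.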

The policy this PR chooses:

  • If all keys are UTF-8 encoded byte sequences and all characters in
    them are valid XML 1.0 characters, then List Objects responds with
    content that is valid XML 1.0.
  • Otherwise, it responds with a byte sequence that is not valid XML
    1.0 but is as reasonable for humans as possible, in order to deliver
    information about the keys in the bucket.
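
The condition in the first bullet can be sketched as a check like the one below. This is a Python illustration only, not code from this PR; the function names are hypothetical, and the character ranges follow the Char production of the XML 1.0 specification.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def key_is_valid_xml10(key: bytes) -> bool:
    """True if the key is valid UTF-8 and every character is an XML 1.0 Char."""
    try:
        text = key.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not a UTF-8 byte sequence at all
    return all(is_xml10_char(ord(ch)) for ch in text)

print(key_is_valid_xml10(b"photos/2015/a.jpg"))  # ordinary key
print(key_is_valid_xml10(b"\x01"))               # key uploaded as %01
print(key_is_valid_xml10(b"\xff"))               # not UTF-8
```

If every key in a listing passes such a check, the response stays valid XML 1.0; a single failing key puts the whole response in the "otherwise" case.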

The actual logic is very simple: return the bytes exactly as they were
uploaded, XML-escaping only <, > and & [3]. (For reviewers: the main
commit in this PR is b28fec7; the others are just refactoring.)
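
A rough Python sketch of that escaping rule follows. It is not the Erlang code from commit b28fec7; `escape_key` and `key_element` are hypothetical names used only for illustration.

```python
def escape_key(key: bytes) -> bytes:
    """Escape only &, < and >; every other byte passes through untouched."""
    # '&' must be handled first, otherwise the '&' of the references
    # inserted for '<' and '>' would be escaped a second time.
    return (
        key.replace(b"&", b"&amp;")
           .replace(b"<", b"&lt;")
           .replace(b">", b"&gt;")
    )

def key_element(key: bytes) -> bytes:
    """Wrap the escaped key bytes in a Key element."""
    return b"<Key>" + escape_key(key) + b"</Key>"

print(key_element(b"\x01"))      # control byte is emitted as is
print(key_element(b"a<b&c>"))    # markup characters are escaped
```

Working on bytes rather than decoded strings is what lets non-UTF-8 keys through without any decoding step.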

For example, if the uploaded key was %01, the list result includes a
binary like <<"<Key>", 16#01, "</Key>">> (in Erlang notation). If an
XML library fails to parse such a response, users can still process it
with grep, sed, or any similar tool [4]. All one needs to do is:

  • Extract the bytes between <Key> and </Key> (unambiguous because
    < is escaped)
  • Unescape the &lt;, &gt; and &amp; references
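
The two steps above can be sketched in Python as follows, assuming the response body is available as raw bytes and that only &lt;, &gt; and &amp; need unescaping. The function names and the sample body are hypothetical.

```python
import re

# Non-greedy match; DOTALL so keys containing newline bytes are still captured.
KEY_RE = re.compile(rb"<Key>(.*?)</Key>", re.DOTALL)

def unescape_key(raw: bytes) -> bytes:
    """Reverse the escaping of <, > and &."""
    # '&amp;' is handled last so that '&amp;lt;' becomes '&lt;', not '<'.
    return (
        raw.replace(b"&lt;", b"<")
           .replace(b"&gt;", b">")
           .replace(b"&amp;", b"&")
    )

def extract_keys(body: bytes) -> list[bytes]:
    """Return the original key bytes found in a List Objects response body."""
    return [unescape_key(m) for m in KEY_RE.findall(body)]

# Hypothetical, shortened response body for illustration.
body = (b"<ListBucketResult>"
        b"<Contents><Key>\x01</Key></Contents>"
        b"<Contents><Key>a&lt;b&amp;amp;</Key></Contents>"
        b"</ListBucketResult>")
print(extract_keys(body))
```

Because < is always escaped inside a key, the first </Key> after a <Key> is the real closing tag, so the non-greedy match is enough.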

[1] Example using s3curl against AWS S3

% s3curl.pl --id shino --put rebar.config -- -s -v \
    http://shino.shun.test-us.s3.amazonaws.com/'%01'
> PUT /%01 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: shino.shun.test-us.s3.amazonaws.com
> Accept: */*
> x-amz-date: Tue, 29 Sep 2015 03:02:00 GMT
> Authorization: AWS AKIAJBO7GX36NI32XDRA:gahZlOqOkkrkbGXL34ZyKgT5ZnQ=
> Content-Length: 2852
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< x-amz-id-2: Ou1uOp+ZJDAwMqGDTnHa82F58sqCO9EWqzffke7/ga2lMEfnpwUQMhHDXCcX5QcQ
< x-amz-request-id: A69F3BE469CC0328
< Date: Tue, 29 Sep 2015 03:02:02 GMT
< ETag: "c90a5c0f80f9b5e2980deee859291373"
< Content-Length: 0
< Server: AmazonS3
<

Sidenote: %00 is NOT allowed.

% s3curl.pl --id shino --put rebar.config -- -s -v \
     http://shino.shun.test-us.s3.amazonaws.com/'%00'
> PUT /%00 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: shino.shun.test-us.s3.amazonaws.com
> Accept: */*
> x-amz-date: Tue, 29 Sep 2015 03:02:06 GMT
> Authorization: AWS AKIAJBO7GX36NI32XDRA:0Oa7g2tTC5VvyRbZEQjnmP9pxbs=
> Content-Length: 2852
> Expect: 100-continue
>
< HTTP/1.1 400 Invalid URI
< Content-Length: 0
< Date: Tue, 29 Sep 2015 03:02:07 GMT
< Connection: close
< Server: AmazonS3
<

It seems that AWS S3 validates PUT requests against the XML 1.1
character range {shrug}.

[2] Extracted from XML response: <Contents><Key>abc&#x1;def</Key>[snip...].

[3] A representation that looks like a numeric character reference but is
not valid, e.g. &#x1;, is not used. There are two reasons: 1. it is still
not valid in XML 1.0 because the character is outside the allowed
character set, and 2. if one treats the response as XML 1.1, then the
byte 0x01 (or <<1>> in Erlang) is valid as is.

[4] s3curl does a nice job for this kind of lower-level manipulation. The
AWS CLI is also nice because it outputs the response body to stderr if it
fails to parse it as XML. s3cmd can produce such output with the -d debug
switch.

@shino shino added this to the 2.0.2 milestone Sep 29, 2015
@shino shino changed the title from "Allow any bytes (including non-UTF8 thingies) in List Objects response XML" to "Allow any bytes (including non-UTF8 ones) in List Objects response XML" Sep 29, 2015
borshop added a commit that referenced this pull request Oct 6, 2015
Allow any bytes (including non-UTF8 ones) in List Objects response XML

Reviewed-by: kuenishi

kuenishi commented Oct 6, 2015

@borshop merge

@borshop borshop merged commit b28fec7 into 2.0 Oct 6, 2015
@kuenishi kuenishi deleted the feature/list-objs-any-bytes branch October 6, 2015 07:44
@kuenishi kuenishi mentioned this pull request Dec 7, 2015

shino commented Dec 9, 2015

Memo: a sample one-liner for the AWS CLI to inspect possibly unprintable characters in Key elements:

.aws s3api list-objects --bucket test  2>&1 | sed -e 's/</\n</g' | grep '<Key>' | sed -e 's/<Key>//g' | od -t x1c

@kuenishi kuenishi modified the milestones: 2.1.1, 2.0.2 Jan 21, 2016