Coding of path components in URIs #44

Closed
hungerburg opened this Issue Sep 15, 2013 · 8 comments

Contributor

hungerburg commented Sep 15, 2013

This is a port of SF ticket http://sourceforge.net/p/exist/bugs/766/ - in short:

In a few places the eXist codebase uses "java.net.URLDecoder" (which is intended for HTTP query strings) instead of "java.net.URI", resulting in wrong encoding of path components in URIs and thereby the creation of resources that cannot be fetched by the name they were uploaded with.

Simplest way to show the (mis)behaviour:

  • From the dashboard start collections browser,
  • create collection "/db/testing",
  • upload a resource, e.g. "A+A.svg",
  • note that it gets represented as "A%2BA.svg" in the list widget,
  • which is the same as displayed in "/exist/rest/db/testing/",
  • and also in "/exist/webdav/db/testing/", while
  • in eXide open dialog it will show up as "A+A.svg".

BUT

  • fetching /exist/rest/db/testing/A%2BA.svg will result in a 404,
  • same as /exist/rest/db/testing/A+A.svg,
  • while fetching /exist/rest/db/testing/A%252BA.svg will work,
  • same as /exist/webdav/db/testing/A%252BA.svg,
  • all the while the file is inaccessible from eXide's "open", but
  • double clicking in coll.browser will fire up eXide to "A%2BA.svg".

That is: there is double encoding happening. The plus sign is encoded to %2B on store, and the newly introduced percent sign has to be encoded to %25 on retrieve. This applies to all the characters that java.net.URLDecoder encodes differently from java.net.URI in path components (most prominently the plus sign, but others as well).
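
To make the round trip concrete, here is a minimal, self-contained sketch (class name and the UTF-8 charset are my own choices) of what URLEncoder/URLDecoder do to a name like "A+A.svg":

import java.net.URLDecoder;
import java.net.URLEncoder;

public class PlusRoundTrip {
    public static void main(String[] args) throws Exception {
        // store side: URLEncoder turns the plus sign into %2B
        System.out.println(URLEncoder.encode("A+A.svg", "UTF-8"));      // A%2BA.svg
        // retrieve side: URLDecoder treats a bare plus as an encoded space
        System.out.println(URLDecoder.decode("A+A.svg", "UTF-8"));      // A A.svg
        // so the percent sign of the stored name has to be escaped once more
        System.out.println(URLDecoder.decode("A%252BA.svg", "UTF-8"));  // A%2BA.svg
    }
}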

Contributor

hungerburg commented Sep 15, 2013

I understand that this is a tricky situation, and it is not advisable to simply swap java.net.URLDecoder for java.net.URI in the code: e.g. it would become impossible to store resources with a hash mark (#) in the name, because URIs must not contain that character as an unencoded literal (it is used to split off the fragment identifier), while several (desktop) applications create temporary files with that character and would therefore break on a webdav mount of eXist-db.
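
For illustration, a small sketch of mine showing why the hash mark is awkward for java.net.URI: the multi-argument constructor quotes it away in the path, while the single-string constructor treats everything after it as the fragment (host and file names below are made up):

import java.net.URI;

public class HashMark {
    public static void main(String[] args) throws Exception {
        // multi-argument constructor: the '#' in the path gets quoted to %23
        URI quoted = new URI("http", "localhost", "/db/testing/tmp#1.txt", null);
        System.out.println(quoted.getRawPath());   // /db/testing/tmp%231.txt
        // single-string constructor: everything after '#' becomes the fragment
        URI parsed = new URI("http://localhost/db/testing/tmp#1.txt");
        System.out.println(parsed.getPath());      // /db/testing/tmp
        System.out.println(parsed.getFragment());  // 1.txt
    }
}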

Further to the above example, I copied a file "W+W.svg" into the webdav share:

  • In the webdav listing it will show up as "W+W.svg",
  • same as in "/exist/rest/db/testing/",
  • and also in the collections browser,
  • while in eXide "open" the listing will show "W W.svg".

YET

  • the URL "/exist/webdav/db/testing/W+W.svg" will work, but
  • the URL "/exist/rest/db/testing/W+W.svg" will fail with 404,
  • here "/exist/rest/db/testing/W%2BW.svg" will work,
  • and eXide "open" of "W W.svg" will fail,
  • same as double clicking in the coll.browser.

SO

  • the resource "A+A.svg" will display by that name in the webdav listing, but the file cannot be opened by this name and is in fact inaccessible in all or most webdav clients.
  • the resource "W+W.svg" will display by that name in the webdav listing and can be opened too, but not from the rest servlet.

Conclusions

  • The collections browser does a POST to upload.xql and somewhere in either xmldb:store or xmldb:encode-uri the plus sign gets encoded to %2B. Webdav does not perform this step.
  • The rest servlet seems to do query-string decoding of the path component (turning plus into space, or %2B into plus). Webdav does not perform this step.

In my view Milton/Webdav does it right, and both the rest servlet and upload.xql do it wrong. Unfortunately, I do not know how to reconcile the different interfaces, as this very likely reaches deep into the core of eXist-db and may have consequences that are difficult to assess without extensive testing.
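
One direction that might help reconcile the interfaces (purely a sketch of mine, not what the codebase does today): decode already percent-encoded path segments with java.net.URI instead of URLDecoder, since URI.getPath() resolves the %XX escapes but leaves a bare plus alone:

import java.net.URI;

public class DecodeSegment {
    // hypothetical helper: decode a raw (percent-encoded) path segment;
    // URI.create() throws IllegalArgumentException on malformed input, e.g. a stray '%'
    static String decode(String raw) {
        return URI.create("/" + raw).getPath().substring(1);
    }

    public static void main(String[] args) {
        System.out.println(decode("A%2BA.svg")); // A+A.svg
        System.out.println(decode("A+A.svg"));   // A+A.svg
        System.out.println(decode("A%20A.svg")); // A A.svg
    }
}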

hungerburg closed this Sep 15, 2013

hungerburg reopened this Sep 15, 2013

Contributor

hungerburg commented Jan 17, 2014

The code below demonstrates the differences between the java.net.URI and java.net.URLEncoder classes by walking over the set of reserved characters and encoding and decoding each of them. The output consists of the input string built around the reserved character, its different encodings, and an attempt at normalizing an encoded path component.

// javac dub_uri.java ; java dub_uri
// http://www.ietf.org/rfc/rfc3986.txt
// demonstrate the differences between URI and URLEncoder

// Different parts of URIs use different encodings!

// If something is once parsed into a URI, it's best to pass around the URI from then on.
// A string can only be properly percent-encoded when its use is clear, that is,
// when it is known which component of a URI it belongs to; likewise it can only be
// normalized when it is known which component it stems from.

// java.net.URI has no method to "normalize" URIs from strings:
// It looks like URLDecoder can safely be used to normalize the RawPath of a URI
// (it cannot be reliably used to normalize the Path part! Beware the %-sign)

// If a string comes into the system from a GET call
// it is already URLEncoded, therefore
// it must not be URLEncoded again.

import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class dub_uri {
    static void log(String a, String b) {
        System.out.println(a + ": " + b);
    }
    public static void main(String args[]) {
        String str; URI uri;
        String reserved = "!*'();:@&=+$,/?#[]";
        reserved += "% ";
        for (int i = 0; i < reserved.length(); i++) {
            str = "__" + String.valueOf(reserved.charAt(i)) + "__";
            try {
                log(":", str);
                // URLEncoder: only meant for form data and query strings, not for path components
                log("q", URLEncoder.encode(str, "UTF-8"));
                // This constructor reliably converts path components:
                // URI(String scheme, String host, String path, String fragment)
                uri = new URI("foo", null, "/" + str, null);
                log("r", uri.getRawPath().substring(1));
                log("p", uri.getPath().substring(1));
                // For most characters URLDecoder on the raw path gives the same result as getPath()
                // (the bare plus sign is the exception: URLDecoder turns it into a space)
                log("d", URLDecoder.decode(uri.getRawPath().substring(1), "UTF-8"));
                //log("d", URLDecoder.decode(uri.getPath().substring(1), "UTF-8"));
            } catch (URISyntaxException e) {
                System.out.println(e.toString());
            } catch (UnsupportedEncodingException e) {
                System.out.println(e.toString());
            }
            System.out.println();
        }
    }
}
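
For reference, when the loop reaches the plus sign the output should look roughly like this (raw path and getPath() keep the plus, while URLDecoder turns it into a space):

: : __+__
q: __%2B__
r: __+__
p: __+__
d: __ __
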
Contributor

hungerburg commented Jan 21, 2014

To continue with black-box testing, I looked at the /rest/ servlet. I found it to be quite compatible with the webdav servlet. The foremost issue here is with the "+" sign, which the /rest/ servlet wrongly interprets as meaning a space in the GET operation, but treats correctly in the PUT operation. This looks like low-hanging fruit to me.
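
To restate that observation as a toy simulation (hypothetical names, not the actual servlet code): a store keyed by the raw PUT path but queried after query-string decoding of the GET path behaves exactly like the 404s above:

import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

public class RestPlusSim {
    public static void main(String[] args) throws Exception {
        Map<String, String> store = new HashMap<>();
        // PUT: the raw path segment becomes the resource name
        store.put("W+W.svg", "<svg/>");
        // GET: the path is run through query-string decoding first
        System.out.println(get(store, "W+W.svg"));   // 404 W W.svg
        System.out.println(get(store, "W%2BW.svg")); // 200 W+W.svg
    }

    static String get(Map<String, String> store, String rawPath) throws Exception {
        String name = URLDecoder.decode(rawPath, "UTF-8");
        return (store.containsKey(name) ? "200 " : "404 ") + name;
    }
}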

The shell script below tries to create resources whose names contain characters from the reserved set (and a few others) via HTTP PUT requests to the /rest/ servlet. Redirect stderr to get a nice view.

#!/bin/sh
set -f # disable globbing
exec 2> /dev/null # redirect stderr

## reserved characters
# http://tools.ietf.org/html/rfc3986#section-2.2
#
# reserved    = gen-delims / sub-delims
# gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
#             / "*" / "+" / "," / ";" / "="
RCHAR=": / ? # [ ] @ ! $ & ' ( ) * + , ; ="

## path-segment
# http://tools.ietf.org/html/rfc3986#section-3.3
#
# pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
# pct-encoded = "%" HEXDIG HEXDIG
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
#             / "*" / "+" / "," / ";" / "="

## So, characters literally allowed in a path-segment are:
VPATH="A-Z a-z 0-9 - . _ ~ ! $ & ' ( ) * + , ; = : @"
# all the rest has to be percent-encoded
# the percent sign itself _must_ start a code

## So, reserved+ chars in need of encoding - in a path-segment - are:
#      /   ?   #   [   ]   %
# %20 %2F %3F %23 %5B %5D %25

## Interoperability /rest/ space:
# most webbrowsers act mostly correct
# curl does _no_ encoding on its own
# all browsers send a bare / as is (user error? will separate path-segments)
# all browsers send a bare ? as is (user error? will start the query-string)
# no browser sends a bare # at all (user error? will start the fragment-identifier)
# chrome and msie send [] verbatim (wrong? apache can accommodate…)
# all browsers send a bare % as is (user error? will start an escape, apache returns Bad Request)

## Interoperability /webdav/ space:
# the GET and PUT methods mirror /rest/ space
# These characters are not allowed in an NTFS filename
INTFS='/ \ : *  ? " < > |'
# of those, macintosh HFS only prohibits the colon
# most other UN*X FSs only prohibit the slash
# Quick test with bash on Linux extfs:
TWDAV="$VPATH $RCHAR $INTFS %"
# set -f; for fn in $TWDAV; do echo T__${fn}__ > /tmp/T__${fn}__; done
# only the slash will error out (twice)
# anything in this set can be thrown at webdav!

## Beware, some chars valid in a path-segment must not be in a filename (mostly NTFS)

HOST=host.dom
PORT=8080
AUTH=user:pass
REST=exist/rest/db/testing
WDAV=exist/webdav/db/testing

# curl does no percent-encoding on its own
# so we do it ourselves here

# do not include a literal slash or hash below!
# or you will create an indelible collection/resource
# webdav seems to cope though, some mac users here?
CHARS="$VPATH %20 %23 %25 %3F %5B %5D ä"

for C in $CHARS; do
    FILE=T__"$C"__.txt;
    TEXT=$(echo $FILE | ascii2uni -aJ)
    # put to /rest/ space
    echo -n "# PUT" $FILE
    curl -g -u $AUTH -X PUT -H "Content-Type: text/plain" \
    -w " # %{http_code} # " -o /dev/null \
    --data-binary "$TEXT" http://$HOST:$PORT/$REST/$FILE
    # print /rest/ result
    curl -s -u $AUTH http://$HOST:$PORT/$REST/$FILE
    echo -n " # "
    # print /webdav/ result
    curl -s -u $AUTH http://$HOST:$PORT/$WDAV/$FILE
    echo; echo
done

Some resources fail. Of the resources actually created (201 code) only the "+" cannot be fetched afterwards from /rest/. The resources created appear with the same name but NOT always identical content in webdav. That is also how they mostly appear in the dashboard's collection browser, although many of them fail to open in eXide. When saving, eXide will double-encode names with percent signs inside…

Contributor

hungerburg commented Jan 25, 2014

The script above can be used in the same way on the webdav servlet: just s/rest/webdav/g.

#!/bin/sh
set -f # disable globbing

# ./dub_dav.sh > dub_dav.log

HOST=my.host
PORT=8080
AUTH=user:pass
COLL=exist/webdav/db/testing

# do not include a literal slash or hash below!
# or you will create an indelible collection/resource
CHARS="0 - . _ ~ ! $ & ' ( ) * + , ; = : @ %23 %25 %3F %5B %5D"

for C in $CHARS; do
    FILE=__"$C"__.txt;
    echo "# PUT" $FILE
    curl -g -u $AUTH -X PUT -H "Content-Type: text/plain" \
    -w "# %{http_code} " -o /dev/null \
    --data-binary "$FILE" http://$HOST:$PORT/$COLL/$FILE
    curl -s -u $AUTH http://$HOST:$PORT/$COLL/$FILE
    echo
    echo
done

curl -s -u $AUTH http://$HOST:$PORT/$COLL/
Contributor

hungerburg commented Jan 25, 2014

For webdav, curl might not be appropriate, but the above should not trouble a webdav server. There is a scriptable command-line client for webdav, cadaver. The script below uses it to create collections with devilish characters in their names.

#!/bin/sh
set -f # disable globbing

# ./dub_cadaver.sh > dub_cadaver.log

HOST=my.host
PORT=8080
AUTH=@see .netrc
COLL=exist/webdav/db/testing

# cadaver does percent-encoding on its own
# no need to do it ourselves here
CHARS="0 - . _ ~ ! $ & ' ( ) * + , ; = : @ # % ? [ ]"

# The collection "__#__" is created as "__"
# This is a bug in cadaver, as can be seen from wireshark

for C in $CHARS; do
    FILE=__"$C"__;
    echo "# COL" $FILE
    echo "MKCOL $FILE" | cadaver http://$HOST:$PORT/$COLL/
    echo
done

echo ls | cadaver http://$HOST:$PORT/$COLL/

Again, most of the collections/resources are created just fine. The colon and the question mark appear problematic in this stress test as well as in the ones above.

Contributor

hungerburg commented Feb 4, 2014

The backup routine also seems to URL-decode paths: in the zip the plus sign of the resource is replaced with a blank, while in contents.xml it appears literally (this may of course be because I run a patched version). Curiously, it uses yet another set of characters that get escaped, namely these three: "& * ?".
Also, a binary (unlike an XML) resource with an encoded blank space %20 in its name will make the backup routine trip up and put lots of seemingly empty stuff into the lost_and_found directory…

Interestingly, eXist-db internally can store all the stuff:
strings webapp/WEB-INF/data/collections.dbx |grep T__
It is all there literally the same as it was PUT, except for the colon, which cannot be PUT.

Collaborator

joewiz commented Feb 5, 2017

Is this not closed by #605?

Contributor

hungerburg commented Feb 5, 2017

Fixed by #605, I can close.

hungerburg closed this Feb 5, 2017
