Bug in decoding of "+" in xmldb URI-related functions #1824

joewiz · 2018-04-14T16:13:57Z

What is the problem

The + character is a reserved character according to RFC3986, and as such should be "protected from normalization." eXist's xmldb module's functions that deal with database URIs incorrectly normalize this character, treating it as if it were a percent-encoded octet, i.e., treating the + character in an xmldb URI as a character that should be decoded and turned into a space character.

This causes a problem in XQuery applications like eXide where the application must be able to reliably display the URI-decoded form of a resource (its "name"), while also reliably addressing its URI (as a "key") for resource operations like delete and rename.

For example, given a resource in the database created via xmldb:store("/db", "A+B.xml", <foo/>), we can successfully address this document via doc("/db/A+B.xml"), but xmldb:decode("A+B.xml") returns A B.xml instead of A+B.xml.

While this issue is about handling of the + character, it is also about a larger issue: users have reported unpredictable behavior involving characters in resource names, and would benefit from a clear and straightforward description of eXist's xmldb URI scheme, explaining how characters in resource names are treated, and placed in the eXist documentation. This statement should cite any relevant specs/standards for URI encoding that eXist adheres to, and where (if at all) eXist diverges from these standards.

Clarity about eXist's xmldb URI scheme would also let us achieve greater consistency in eXist's various interfaces and built-in apps/tools. For example, while using WebDAV clients (e.g., oXygen and Transmit) to create collections like tést or 你好 causes the resulting collection list to display tést and 你好, doing the same in the Java admin client causes the collection list to display t%C3%A9st and %E4%BD%A0%E5%A5%BD. This suggests that the Java admin client is performing the correct encoding of the resource name into a URI when it is stored/created, but it is failing to decode the resource URI back into its name when displaying it.

But for now, let's focus on the treatment of + in eXist's xmldb URI-related functions.

What did you expect

According to the function documentation for xmldb:decode and xmldb:decode-uri (see http://exist-db.org/exist/apps/fundocs/view.html?uri=http://exist-db.org/xquery/xmldb#decode.1), these functions do the following to the supplied input:

any percent encoded octets will be translated to their decoded UTF-8 representation

Since A+B contains no percent-encoded octets, calling these decode functions should return the original value, A+B. Instead, eXist returns A B.

Let's probe the more restrictive function, xmldb:decode-uri(), which takes an xs:anyURI. XQuery's definition of xs:anyURI can be found at https://www.w3.org/TR/xpath-datamodel-31/#namespace-names, which references https://www.w3.org/TR/xmlschema-2/#anyURI and https://www.w3.org/TR/xmlschema11-2/#anyURI; the former cites https://tools.ietf.org/html/rfc3986#section-2.2, which defines + as a reserved character:

Characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI

I take this to mean that + should not be encoded or decoded; it should be protected from normalization.

Describe how to reproduce or add a test

1. Simple test:

The following query should return true() x2, but instead returns false() x2.

let $string := "A+B"
let $uri := xs:anyURI($string)
return
  (
    xmldb:decode($string) eq "A+B"`,
    xmldb:decode-uri($uri) eq "A+B"`
  )

2. Complex test:

Download a recent nightly build of eXist, or create a file on disk with the name eXist-db-4.1.0-SNAPSHOT+201804130403.dmg
Create a collection /db/test and upload the file to eXist via WebDAV
xmldb:get-child-resources("/db/test") on the collection containing the file will return the following result:
"eXist-db-4.1.0-SNAPSHOT+201804130403.dmg"
xmldb:get-child-resources("/db/test") ! xmldb:decode-uri(.) will return:
"eXist-db-4.1.0-SNAPSHOT 201804130403.dmg"

Context information

eXist-db version + Git Revision hash e.g. eXist-db 4.1.0-SNAPSHOT+201804122141 / 5949303
Java version: Oracle JDK 1.8.0_162
Operating system: macOS 10.13.4
32 or 64 bit: 64 bit
Any custom changes in e.g. conf.xml: none

The text was updated successfully, but these errors were encountered:

duncdrum · 2018-04-14T17:35:05Z

I'd say both cases should feature in the docs, as such can you add a label, to make it easier for me to keep track off.

adamretter · 2018-04-14T17:36:01Z

So the bug is that we are using Java's java.net.URLDecoder is used for the entire URL. However that is not safe and is really only suited to the query string or application/x-www-form-urlencoded data.

There is a good explanation of how the Path and Query parts of a URI have different encoding/decoding requirements here: https://stackoverflow.com/questions/1634271/url-encoding-the-space-character-or-20#answer-29948396

Coincidentally I was just fixing a very similar issue in IzPack!

adamretter · 2018-04-14T17:40:32Z

@joewiz If you would like to contribute some XQuery tests for xmldb:decode and xmldb:decode-uri then I can fix the implementation.

joewiz · 2018-04-14T17:43:52Z

@duncdrum Done.

@adamretter Very promising theory! I had been wondering about + vs. %20 as I wrote up the issue.

I actually started work on an xqsuite module for this, but for some reason I was unable to execute the tests (#1827). Here's the test as I left it when I encountered the problems running any xqsuite test:

xquery version "3.1";

module namespace ru="http://exist-db.org/test/resource-uris";

declare namespace test="http://exist-db.org/xquery/xqsuite";

declare 
    %test:args("A+B")
    %test:assertEquals("A+B")
    function ru:decode-plus-character($string as xs:string) {
        xmldb:decode($string)
};

declare 
    %test:args("A+B")
    %test:assertEquals("A+B")
    function ru:decode-uri-plus-character($uri as xs:anyURI) {
        xmldb:decode-uri($uri)
};

joewiz · 2018-04-15T17:01:07Z

@adamretter It looks like @shabanovd applied a similar fix to the Java admin client in #1344. I notice that in places like https://github.com/eXist-db/exist/pull/1344/files#diff-14cc12330c6d6445bb390d1229576bb1L1426 he stripped off URIUtils.urlDecodeUtf8 or changed instances of URIUtils.encodeXmldbUriFor to XmldbURI.xmldbUriFor. While that PR appears to have achieved its goal of fixing + in the Java admin client, the Java admin client still has the problem that I described above: Creating collections like tést or 你好 results in the collection list displaying the encoded names t%C3%A9st and %E4%BD%A0%E5%A5%BD instead of the decoded names.

hungerburg · 2018-04-15T22:05:38Z

@joewiz As eXide 2.4.5 was released recently, I could test it: When I use eXide to create a resource/collection with a + sign in its name, that works, i.e. collection browser shows it too. From eXide I can delete it, but not browse it. Looking again, eXide sends all data as query strings in http get requests. The difference between creating/deleting and browsing ought to be serverside. My guess.

hungerburg · 2018-04-16T15:20:04Z

Decoding query strings is not idempotent : %2b → + → space; Very likely this should leave "+" as is?
https://github.com/eXist-db/eXide/blob/bbb84da166a98a91d3fdc3be6e32096afe457ca0/modules/collections.xql#L146 - Indeed, the source of the "name" is not a query string, so why is it treated a such?

joewiz · 2018-04-16T15:39:00Z

@hungerburg Thank you, but for eXide-specific comments/observations could you please add to eXist-db/eXide#185 (or always feel free to start a new issue if that one doesn't cover it)? This issue is strongest if it stays focused on the problem with the eXist core.

joewiz · 2018-04-17T07:24:56Z

I hope this comment doesn't distract from the original issue's focus on the + character, but I think I have some additional findings that put the original question in a clearer context. I've just been investigating this issue and found a couple of simple tests that illustrate the range of problems this one fits into. I may have additional use cases to share, so I'll just call this "use case number 1", with 4 sub-cases (1a-1d):

1. Imagine that we are building an application that allows users to supply the name of a new resource or collection (say, in a web-based collection browser). To implement this we naturally look to the xmldb module, whose functions typically take a "collection URI" which "can be specified either as a simple collection path or an XMLDB URI". We wonder, does it make a difference whether we pass the name supplied by the user directly to xmldb functions for creating the resource/collection, or should we first pass this input through a URI-encoding function like xmldb:encode-uri? Is there a best practice or fool-proof method for taking a name and creating a resource with this name?

1a. Let's say the user wants to create a collection, "A加B" whose name happens to contain a Chinese character. If you create this collection by passing this "raw input" to the function, via xmldb:create-collection("/db", "A加B"), the function returns an encoded URI result: "/db/A%E5%8A%A0B". (This is the same value as xmldb:encode-URI("/db/A加B"). Once the collection is created, you have to address this collection via its URI-encoded name; in other words, xmldb:collection-available("/db/A加B") will fail with a "Collection not found" error, but xmldb:collection-available(xmldb:encode-URI("/db/A加B")) does return the expected result, true(). So at first, it appears that whether you pre-encode the URI when creating the resource or not, eXist applies encoding when it's needed: eXist does the "right thing" and doesn't double-encode the input.

1b. Let's say the user wants to create a collection, "A+B", "A=B", or "A@B". Unlike case 1a, xmldb:create-collection creates different resources depending on whether you give it raw or encoded strings. Without pre-encoding the names, the function creates collections with the raw, original values. But if you pre-encode the names, the result is a second set of resources - "/db/A%2BB", "/db/A%3DB", or "/db/A%40B", respectively.

1c. Let's say the user wants to create a collection, "My Project 2018" (note the space characters). Unlike case 1a and 1b, xmldb:create-collection raises an error when you pass it a value that contains a space. The only way to create this collection is to pre-encode the name.

1d. Let's say the user wants to create a collection, "A/B" (note the slash). Unlike cases 1a, 1b, and 1c, xmldb:create-collection creates actually creates two collections! It creates collection "A" and a sub-collection "B". Now, slashes in resource names are generally a bad idea, but eXist allows them through without raising an error, whether you pre-encode the collection name or not.

In conclusion, these cases suggest that eXist sometimes silently performs URI-encoding, sometimes doesn't, and may produce identical or different results when you pre-encode a collection name depending on the characters used, and may raise different errors or produce unexpected outcomes if you don't pre-encode. These differences make it difficult to create applications that handle characters consistently, and thus cause users confusion when they encounter inconsistency or errors caused by this design.

Lastly, an observation: of all of eXist's interfaces, I am surprised to say that WebDAV is by far the best. Its handling of user-provided resource and collection names is consistent and predictable. WebDAV clients let users create resources using the characters they can type, and see and interact the resources as they created them. Behind my probing here is my desire to bring this level of consistency to eXide. I'd also like to see this consistency across all these interfaces, so users who upload a document via WebDAV can find it and rename it with the Java admin client and edit it with eXide - with no 404s or mangled characters.

dizzzz · 2018-04-17T08:47:16Z

I am surprised to say that WebDAV is by far the best
:-)

the webdav extension fully relies on the encoding/decoding feature of the 3rd party library

joewiz · 2018-04-18T02:42:58Z

I've just been re-reading @shabanovd's comments about this issue of resource names in past threads. What he says suggest that perhaps we shouldn't be looking one character at a time (like +), but should consider a wholesale review of how XMLDB URIs are handled within eXist. Here are some key phrases from @shabanovd's comments:

Internally eXist keep paths as string (utf-8, java default encoding)... The problem that it's mess at front-end representation right now. We should clean it up, but because that changes will break backward compatibility. (adapted from #1344 (comment))

So if eXist already stores paths as UTF-8 strings, why are we still encoding and decoding just to address resources within eXist? We should just be able to use any UTF-8 character that is compatible with http://tools.ietf.org/html/rfc3986, right?

I think that these encoding/decoding requirements came in because of file system (FS) limitations, which are not a case anymore (or are now just a question of support/implementation for that "old" FS).

I'd say that user-facing front end apps/clients (e.g., xquery apps, functions, java client) must return utf-8 (as default encoding). How eXist deals with it internally is a question of implementation, and should not be an end-user headache.

The java.nio.file.Path interface already gives end-user readable way and deals with all FS implementations.

(adapted from #1344 (comment))

So what does everyone say, shall we fix this, and end the end-user headaches for once and for all?

Similarly, @hungerburg advocates for a wholesale review of this subsystem:

I remember that most of these problems arise from careless use of the java.net.URLDecoder class functions encode() and decode() in org.exist.xmldb.XmldbURI. The name suggests that it would do the right thing, but that is not true: URLDecoder.encode() must only be used on the keys/values of data, that is to be sent by user agents when submitting "x-www-form-urlencoded" HTML forms. Likewise decode() restores such HTTP-POSTed keys/values only after the data has been split into key/value pairs.

Observations on your findings:

all of xmldb:encode, xmldb:encode-uri, encode-for-uri('A=B') should NOT encode the equals sign when used on path segments

whether xmldb:decode or xmldb:decode-uri should decode the equals sign depends on where the input data comes from

XmldbURI class is so much above my head that I did not dare touch it, not even with a long stick; see 2c00a7b how I worked around it in RESTServer.java and EXistServlet.java by using its method createInternal(path), that does no encoding, instead of create(path) and doing the necessary shenanigans upfront of it.

If it's ok, I'll add this as to our agenda for the next community call. It would be great to have some discussion and settle on a path to a solution.

joewiz · 2020-11-25T20:51:41Z

From the notes of the April 23, 2018 eXist-db Community Call:

Changes here would likely need to be 5.0.0.

@dizzzz suggests approaching from end user perspective. Write tests that should be working: XQuery, WebDAV. Work backwards. For WebDAV consult HTTP spec on risky/forbidden characters like ?, =, +. These may need to be written in Java.

djbpitt · 2020-11-25T21:29:05Z

See now also the thread on exist-open at https://exist-open.markmail.org/thread/3wnkr5amss7r456g. Tested with eXist-db Version 5.3.0-SNAPSHOT under MacOS Catalina. Details:

Experiments involve filenames with percent signs (not as part of percent-escaped URIs), square brackets (in situations where they would have to be percent-escaped to be used in URIs), and non-ASCII (Cyrillic) characters.
Package manager rejects filenames with percent signs and square brackets in them. Also problematic, as reported above, but now with different test characters, are the Java admin client and the eXide file management tool. (The exist-open discussion also addresses inconsistent behavior with the REST interface, but that exploration was complicated by user error on my part, so REST will require further examination.) As reported above, WebDAV can deal with these filenames, so developers can work around the limitation, but because distributing apps depends on the expectation that users will be able to install them with Package manager, the fact that Package manager cannot deal with these characters is likely to be a deal breaker. If the issue has to be fixed separately in different contexts, then, fixing Package manager might be regarded as a higher priority than fixing the Java admin client or eXide.
With WebDAV (through <oXygen/>), resources are stored with their original filenames, that is, not in their percent-escaped, URI-friendly alter-egos. I think this is expected behavior: filenames are preserved, but they are transparently encoded as URIs for HTTP transfer, transferred, and then transparently decoded on the receiving end. This makes it possible to use the original filename in an XQuery context (e.g., as part of the argument to a doc() function), which is a natural developer expectation.

adamretter · 2022-03-15T18:44:15Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in decoding of "+" in xmldb URI-related functions #1824

Bug in decoding of "+" in xmldb URI-related functions #1824

joewiz commented Apr 14, 2018 •

edited

duncdrum commented Apr 14, 2018

adamretter commented Apr 14, 2018

adamretter commented Apr 14, 2018

joewiz commented Apr 14, 2018 •

edited

joewiz commented Apr 15, 2018

hungerburg commented Apr 15, 2018

hungerburg commented Apr 16, 2018

joewiz commented Apr 16, 2018

joewiz commented Apr 17, 2018 •

edited

dizzzz commented Apr 17, 2018 •

edited

joewiz commented Apr 18, 2018

joewiz commented Nov 25, 2020

djbpitt commented Nov 25, 2020 •

edited by joewiz

adamretter commented Mar 15, 2022

Bug in decoding of "+" in xmldb URI-related functions #1824

Bug in decoding of "+" in xmldb URI-related functions #1824

Comments

joewiz commented Apr 14, 2018 • edited

What is the problem

What did you expect

Describe how to reproduce or add a test

1. Simple test:

2. Complex test:

Context information

duncdrum commented Apr 14, 2018

adamretter commented Apr 14, 2018

adamretter commented Apr 14, 2018

joewiz commented Apr 14, 2018 • edited

joewiz commented Apr 15, 2018

hungerburg commented Apr 15, 2018

hungerburg commented Apr 16, 2018

joewiz commented Apr 16, 2018

joewiz commented Apr 17, 2018 • edited

dizzzz commented Apr 17, 2018 • edited

joewiz commented Apr 18, 2018

joewiz commented Nov 25, 2020

djbpitt commented Nov 25, 2020 • edited by joewiz

adamretter commented Mar 15, 2022

joewiz commented Apr 14, 2018 •

edited

joewiz commented Apr 14, 2018 •

edited

joewiz commented Apr 17, 2018 •

edited

dizzzz commented Apr 17, 2018 •

edited

djbpitt commented Nov 25, 2020 •

edited by joewiz