Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in decoding of "+" in xmldb URI-related functions #1824

Open
joewiz opened this issue Apr 14, 2018 · 14 comments
Open

Bug in decoding of "+" in xmldb URI-related functions #1824

joewiz opened this issue Apr 14, 2018 · 14 comments
Assignees
Labels
bug issue confirmed as bug needs documentation Signals issues or PRs that will require an update to the documentation repo
Milestone

Comments

@joewiz
Copy link
Member

joewiz commented Apr 14, 2018

What is the problem

The + character is a reserved character according to RFC3986, and as such should be "protected from normalization." eXist's xmldb module's functions that deal with database URIs incorrectly normalize this character, treating it as if it were a percent-encoded octet, i.e., treating the + character in an xmldb URI as a character that should be decoded and turned into a space character.

This causes a problem in XQuery applications like eXide where the application must be able to reliably display the URI-decoded form of a resource (its "name"), while also reliably addressing its URI (as a "key") for resource operations like delete and rename.

For example, given a resource in the database created via xmldb:store("/db", "A+B.xml", <foo/>), we can successfully address this document via doc("/db/A+B.xml"), but xmldb:decode("A+B.xml") returns A B.xml instead of A+B.xml.

While this issue is about handling of the + character, it is also about a larger issue: users have reported unpredictable behavior involving characters in resource names, and would benefit from a clear and straightforward description of eXist's xmldb URI scheme, explaining how characters in resource names are treated, and placed in the eXist documentation. This statement should cite any relevant specs/standards for URI encoding that eXist adheres to, and where (if at all) eXist diverges from these standards.

Clarity about eXist's xmldb URI scheme would also let us achieve greater consistency in eXist's various interfaces and built-in apps/tools. For example, while using WebDAV clients (e.g., oXygen and Transmit) to create collections like tést or 你好 causes the resulting collection list to display tést and 你好, doing the same in the Java admin client causes the collection list to display t%C3%A9st and %E4%BD%A0%E5%A5%BD. This suggests that the Java admin client is performing the correct encoding of the resource name into a URI when it is stored/created, but it is failing to decode the resource URI back into its name when displaying it.

But for now, let's focus on the treatment of + in eXist's xmldb URI-related functions.

What did you expect

According to the function documentation for xmldb:decode and xmldb:decode-uri (see http://exist-db.org/exist/apps/fundocs/view.html?uri=http://exist-db.org/xquery/xmldb#decode.1), these functions do the following to the supplied input:

any percent encoded octets will be translated to their decoded UTF-8 representation

Since A+B contains no percent-encoded octets, calling these decode functions should return the original value, A+B. Instead, eXist returns A B.

Let's probe the more restrictive function, xmldb:decode-uri(), which takes an xs:anyURI. XQuery's definition of xs:anyURI can be found at https://www.w3.org/TR/xpath-datamodel-31/#namespace-names, which references https://www.w3.org/TR/xmlschema-2/#anyURI and https://www.w3.org/TR/xmlschema11-2/#anyURI; the former cites https://tools.ietf.org/html/rfc3986#section-2.2, which defines + as a reserved character:

Characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI

I take this to mean that + should not be encoded or decoded; it should be protected from normalization.

Describe how to reproduce or add a test

1. Simple test:

The following query should return true() x2, but instead returns false() x2.

let $string := "A+B"
let $uri := xs:anyURI($string)
return
  (
    xmldb:decode($string) eq "A+B"`,
    xmldb:decode-uri($uri) eq "A+B"`
  )

2. Complex test:

  1. Download a recent nightly build of eXist, or create a file on disk with the name eXist-db-4.1.0-SNAPSHOT+201804130403.dmg
  2. Create a collection /db/test and upload the file to eXist via WebDAV
  3. xmldb:get-child-resources("/db/test") on the collection containing the file will return the following result:
    "eXist-db-4.1.0-SNAPSHOT+201804130403.dmg"
  4. xmldb:get-child-resources("/db/test") ! xmldb:decode-uri(.) will return:
    "eXist-db-4.1.0-SNAPSHOT 201804130403.dmg"

Context information

  • eXist-db version + Git Revision hash e.g. eXist-db 4.1.0-SNAPSHOT+201804122141 / 5949303
  • Java version: Oracle JDK 1.8.0_162
  • Operating system: macOS 10.13.4
  • 32 or 64 bit: 64 bit
  • Any custom changes in e.g. conf.xml: none
@duncdrum
Copy link
Contributor

I'd say both cases should feature in the docs, as such can you add a label, to make it easier for me to keep track off.

@joewiz joewiz added the needs documentation Signals issues or PRs that will require an update to the documentation repo label Apr 14, 2018
@triage-new-issues triage-new-issues bot removed the triage issue needs to be investigated label Apr 14, 2018
@adamretter
Copy link
Member

So the bug is that we are using Java's java.net.URLDecoder is used for the entire URL. However that is not safe and is really only suited to the query string or application/x-www-form-urlencoded data.

There is a good explanation of how the Path and Query parts of a URI have different encoding/decoding requirements here: https://stackoverflow.com/questions/1634271/url-encoding-the-space-character-or-20#answer-29948396

Coincidentally I was just fixing a very similar issue in IzPack!

@adamretter adamretter self-assigned this Apr 14, 2018
@adamretter adamretter added the bug issue confirmed as bug label Apr 14, 2018
@adamretter adamretter added this to the eXist-4.0.1 milestone Apr 14, 2018
@adamretter
Copy link
Member

@joewiz If you would like to contribute some XQuery tests for xmldb:decode and xmldb:decode-uri then I can fix the implementation.

@joewiz
Copy link
Member Author

joewiz commented Apr 14, 2018

@duncdrum Done.

@adamretter Very promising theory! I had been wondering about + vs. %20 as I wrote up the issue.

I actually started work on an xqsuite module for this, but for some reason I was unable to execute the tests (#1827). Here's the test as I left it when I encountered the problems running any xqsuite test:

xquery version "3.1";

module namespace ru="http://exist-db.org/test/resource-uris";

declare namespace test="http://exist-db.org/xquery/xqsuite";

declare 
    %test:args("A+B")
    %test:assertEquals("A+B")
    function ru:decode-plus-character($string as xs:string) {
        xmldb:decode($string)
};

declare 
    %test:args("A+B")
    %test:assertEquals("A+B")
    function ru:decode-uri-plus-character($uri as xs:anyURI) {
        xmldb:decode-uri($uri)
};

@joewiz
Copy link
Member Author

joewiz commented Apr 15, 2018

@adamretter It looks like @shabanovd applied a similar fix to the Java admin client in #1344. I notice that in places like https://github.com/eXist-db/exist/pull/1344/files#diff-14cc12330c6d6445bb390d1229576bb1L1426 he stripped off URIUtils.urlDecodeUtf8 or changed instances of URIUtils.encodeXmldbUriFor to XmldbURI.xmldbUriFor. While that PR appears to have achieved its goal of fixing + in the Java admin client, the Java admin client still has the problem that I described above: Creating collections like tést or 你好 results in the collection list displaying the encoded names t%C3%A9st and %E4%BD%A0%E5%A5%BD instead of the decoded names.

@hungerburg
Copy link
Contributor

@joewiz As eXide 2.4.5 was released recently, I could test it: When I use eXide to create a resource/collection with a + sign in its name, that works, i.e. collection browser shows it too. From eXide I can delete it, but not browse it. Looking again, eXide sends all data as query strings in http get requests. The difference between creating/deleting and browsing ought to be serverside. My guess.

@hungerburg
Copy link
Contributor

Decoding query strings is not idempotent : %2b → + → space; Very likely this should leave "+" as is?
https://github.com/eXist-db/eXide/blob/bbb84da166a98a91d3fdc3be6e32096afe457ca0/modules/collections.xql#L146 - Indeed, the source of the "name" is not a query string, so why is it treated a such?

@joewiz
Copy link
Member Author

joewiz commented Apr 16, 2018

@hungerburg Thank you, but for eXide-specific comments/observations could you please add to eXist-db/eXide#185 (or always feel free to start a new issue if that one doesn't cover it)? This issue is strongest if it stays focused on the problem with the eXist core.

@joewiz
Copy link
Member Author

joewiz commented Apr 17, 2018

I hope this comment doesn't distract from the original issue's focus on the + character, but I think I have some additional findings that put the original question in a clearer context. I've just been investigating this issue and found a couple of simple tests that illustrate the range of problems this one fits into. I may have additional use cases to share, so I'll just call this "use case number 1", with 4 sub-cases (1a-1d):

1. Imagine that we are building an application that allows users to supply the name of a new resource or collection (say, in a web-based collection browser). To implement this we naturally look to the xmldb module, whose functions typically take a "collection URI" which "can be specified either as a simple collection path or an XMLDB URI". We wonder, does it make a difference whether we pass the name supplied by the user directly to xmldb functions for creating the resource/collection, or should we first pass this input through a URI-encoding function like xmldb:encode-uri? Is there a best practice or fool-proof method for taking a name and creating a resource with this name?

1a. Let's say the user wants to create a collection, "A加B" whose name happens to contain a Chinese character. If you create this collection by passing this "raw input" to the function, via xmldb:create-collection("/db", "A加B"), the function returns an encoded URI result: "/db/A%E5%8A%A0B". (This is the same value as xmldb:encode-URI("/db/A加B"). Once the collection is created, you have to address this collection via its URI-encoded name; in other words, xmldb:collection-available("/db/A加B") will fail with a "Collection not found" error, but xmldb:collection-available(xmldb:encode-URI("/db/A加B")) does return the expected result, true(). So at first, it appears that whether you pre-encode the URI when creating the resource or not, eXist applies encoding when it's needed: eXist does the "right thing" and doesn't double-encode the input.

1b. Let's say the user wants to create a collection, "A+B", "A=B", or "A@B". Unlike case 1a, xmldb:create-collection creates different resources depending on whether you give it raw or encoded strings. Without pre-encoding the names, the function creates collections with the raw, original values. But if you pre-encode the names, the result is a second set of resources - "/db/A%2BB", "/db/A%3DB", or "/db/A%40B", respectively.

1c. Let's say the user wants to create a collection, "My Project 2018" (note the space characters). Unlike case 1a and 1b, xmldb:create-collection raises an error when you pass it a value that contains a space. The only way to create this collection is to pre-encode the name.

1d. Let's say the user wants to create a collection, "A/B" (note the slash). Unlike cases 1a, 1b, and 1c, xmldb:create-collection creates actually creates two collections! It creates collection "A" and a sub-collection "B". Now, slashes in resource names are generally a bad idea, but eXist allows them through without raising an error, whether you pre-encode the collection name or not.

In conclusion, these cases suggest that eXist sometimes silently performs URI-encoding, sometimes doesn't, and may produce identical or different results when you pre-encode a collection name depending on the characters used, and may raise different errors or produce unexpected outcomes if you don't pre-encode. These differences make it difficult to create applications that handle characters consistently, and thus cause users confusion when they encounter inconsistency or errors caused by this design.

Lastly, an observation: of all of eXist's interfaces, I am surprised to say that WebDAV is by far the best. Its handling of user-provided resource and collection names is consistent and predictable. WebDAV clients let users create resources using the characters they can type, and see and interact the resources as they created them. Behind my probing here is my desire to bring this level of consistency to eXide. I'd also like to see this consistency across all these interfaces, so users who upload a document via WebDAV can find it and rename it with the Java admin client and edit it with eXide - with no 404s or mangled characters.

@dizzzz
Copy link
Member

dizzzz commented Apr 17, 2018

I am surprised to say that WebDAV is by far the best
:-)

the webdav extension fully relies on the encoding/decoding feature of the 3rd party library

@joewiz
Copy link
Member Author

joewiz commented Apr 18, 2018

I've just been re-reading @shabanovd's comments about this issue of resource names in past threads. What he says suggest that perhaps we shouldn't be looking one character at a time (like +), but should consider a wholesale review of how XMLDB URIs are handled within eXist. Here are some key phrases from @shabanovd's comments:

Internally eXist keep paths as string (utf-8, java default encoding)... The problem that it's mess at front-end representation right now. We should clean it up, but because that changes will break backward compatibility. (adapted from #1344 (comment))

So if eXist already stores paths as UTF-8 strings, why are we still encoding and decoding just to address resources within eXist? We should just be able to use any UTF-8 character that is compatible with http://tools.ietf.org/html/rfc3986, right?

I think that these encoding/decoding requirements came in because of file system (FS) limitations, which are not a case anymore (or are now just a question of support/implementation for that "old" FS).

I'd say that user-facing front end apps/clients (e.g., xquery apps, functions, java client) must return utf-8 (as default encoding). How eXist deals with it internally is a question of implementation, and should not be an end-user headache.

The java.nio.file.Path interface already gives end-user readable way and deals with all FS implementations.

(adapted from #1344 (comment))

So what does everyone say, shall we fix this, and end the end-user headaches for once and for all?

Similarly, @hungerburg advocates for a wholesale review of this subsystem:

I remember that most of these problems arise from careless use of the java.net.URLDecoder class functions encode() and decode() in org.exist.xmldb.XmldbURI. The name suggests that it would do the right thing, but that is not true: URLDecoder.encode() must only be used on the keys/values of data, that is to be sent by user agents when submitting "x-www-form-urlencoded" HTML forms. Likewise decode() restores such HTTP-POSTed keys/values only after the data has been split into key/value pairs.

Observations on your findings:

  • all of xmldb:encode, xmldb:encode-uri, encode-for-uri('A=B') should NOT encode the equals sign when used on path segments
  • whether xmldb:decode or xmldb:decode-uri should decode the equals sign depends on where the input data comes from

XmldbURI class is so much above my head that I did not dare touch it, not even with a long stick; see 2c00a7b how I worked around it in RESTServer.java and EXistServlet.java by using its method createInternal(path), that does no encoding, instead of create(path) and doing the necessary shenanigans upfront of it.

If it's ok, I'll add this as to our agenda for the next community call. It would be great to have some discussion and settle on a path to a solution.

@joewiz
Copy link
Member Author

joewiz commented Nov 25, 2020

From the notes of the April 23, 2018 eXist-db Community Call:

  • Changes here would likely need to be 5.0.0.
  • @dizzzz suggests approaching from end user perspective. Write tests that should be working: XQuery, WebDAV. Work backwards. For WebDAV consult HTTP spec on risky/forbidden characters like ?, =, +. These may need to be written in Java.

@djbpitt
Copy link

djbpitt commented Nov 25, 2020

See now also the thread on exist-open at https://exist-open.markmail.org/thread/3wnkr5amss7r456g. Tested with eXist-db Version 5.3.0-SNAPSHOT under MacOS Catalina. Details:

  1. Experiments involve filenames with percent signs (not as part of percent-escaped URIs), square brackets (in situations where they would have to be percent-escaped to be used in URIs), and non-ASCII (Cyrillic) characters.
  2. Package manager rejects filenames with percent signs and square brackets in them. Also problematic, as reported above, but now with different test characters, are the Java admin client and the eXide file management tool. (The exist-open discussion also addresses inconsistent behavior with the REST interface, but that exploration was complicated by user error on my part, so REST will require further examination.) As reported above, WebDAV can deal with these filenames, so developers can work around the limitation, but because distributing apps depends on the expectation that users will be able to install them with Package manager, the fact that Package manager cannot deal with these characters is likely to be a deal breaker. If the issue has to be fixed separately in different contexts, then, fixing Package manager might be regarded as a higher priority than fixing the Java admin client or eXide.
  3. With WebDAV (through <oXygen/>), resources are stored with their original filenames, that is, not in their percent-escaped, URI-friendly alter-egos. I think this is expected behavior: filenames are preserved, but they are transparently encoded as URIs for HTTP transfer, transferred, and then transparently decoded on the receiving end. This makes it possible to use the original filename in an XQuery context (e.g., as part of the argument to a doc() function), which is a natural developer expectation.

@adamretter
Copy link
Member

See also #3795

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug issue confirmed as bug needs documentation Signals issues or PRs that will require an update to the documentation repo
Projects
None yet
Development

No branches or pull requests

6 participants