New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large XQuery range can crash eXist-db #1971

Closed
adamretter opened this Issue Jun 25, 2018 · 4 comments

Comments

Projects
None yet
2 participants
@adamretter
Member

adamretter commented Jun 25, 2018

The following very simple query from the XQTS 3.1 will result in a java.lang.OutOfMemoryError.

count(subsequence(1 to 3000000000, -2147483649))

It should also be considered as a security issue. This simple query could be sent to the REST end-point of any eXist-db server (which exposes the service), and it will cause the JVM to shutdown.

The XQTS states that we can either return the value 3000000000, or any error code. So we can decide what we want to do here.

@adamretter adamretter added this to the eXist-4.2.2 milestone Jun 26, 2018

@duncdrum

This comment has been minimized.

Member

duncdrum commented Jun 26, 2018

let's go with 3000000000

@adamretter

This comment has been minimized.

Member

adamretter commented Jun 26, 2018

@duncdrum that would be the ideal solution, however that value has to be computed, and doing that efficiently (i.e. not running our of memory) would need quite some changes to eXist-db as far as I can see.

@duncdrum

This comment has been minimized.

Member

duncdrum commented Jun 26, 2018

@adamretter major work for a major release? I checked saxon and it produces and error, about the maximum sequence length no exceeding 2147483647. I vaguely recall new BigInteger methods coming to java 8 could we maybe leverage that for throwing ArithmenticException?

@adamretter

This comment has been minimized.

Member

adamretter commented Jun 29, 2018

@duncdrum so the problem is as always, who will do the Major work?

I don't think BigInteger can help us here. We could easily go the Saxon route and return an error. If we don't want to do that, then the OutOfMemoryError problem at the moment is caused by the interactions between RangeSequence and FunSubSequence.

(I am CC'ing @wolfgangmm @dizzzz @joewiz @ljo as I think the explanation below will be of interest to them).

From RangeSequence line ~63:

private static class RangeSequenceIterator implements SequenceIterator {
   
    ...

    public Item nextItem() {
        if (current <= end) {
            return new IntegerValue(current++);
        } else {
            return null;
        }
    }

  ...

}

Basically RangeSequence#RangeSequenceIterator backs the XQuery range expression, e.g. 1 to 3000000000. For each call of RangeSequenceIterator#nextItem(), it will create a new IntegerValue Java object in memory that holds a single value of the range.

From FunSubSequence.java line ~149:

public class FunSubSequence extends Function {

    ...

    public Sequence eval(Sequence contextSequence, Item contextItem) throws XPathException {

        ...

        Item item;
        final SequenceIterator iterator = seq.iterate();
        for(int i = 0; i < start; i++) {
            item = iterator.nextItem();
        } 
        int i=0;
        while (iterator.hasNext() && i < length) {
            item = iterator.nextItem();
            tmp.add(item);
            i++;
        }

        ...
    }
}

Above, when the fn:subsequence is eval(uated), on the seq (which is a RangeSequence in this instance):

  1. the first loop (the for loop), moves to the $startingLoc position (which is the 1st argument to fn:subsequence) in the underlying range sequence. Unfortunately it does this by calling RangeSequenceIterator#nextItem(), which creates a new IntegerValue object for each call. This is very inefficient as we don't need those IntegerValue objects, because we are just re-positioning to the location we need to start from. We are wasting memory and CPU here unnecessarily.

  2. the second loop (the while loop), iterates from $startingLoc to the position indicated by $length (which is the 2nd argument to fn:subsequence). If $length is not provided, we just iterate until the end of the RangeSequence. Again each iteration causes RangeSequenceIterator#nextItem() to be called, which creates a new IntegerValue.

So the query fn:subsequence(1 to 3000000000, 1) will cause 3000000000 IntegerValue objects to be allocated.

Unfortunately because of the design above, the similar query: fn:subsequence(1 to 3000000000, 2999999999) will also cause 3000000000 IntegerValue objects to be allocated, even though we ignore the first 2999999999 items :-(

An IntegerValue in eXist-db contains both a int (which indicates the XDM type) and a BigInteger (the actual value), so its memory use is the requirements of those two plus the object framing overhead. We can calculate the rough size of an IntegerValue as follows:

  1. The int itself is 4 bytes.
  2. The reference to the BigInteger is 8 bytes (on a 64 bit machine).
  3. A BigInteger holding a 32 bit number will consume approx 68 bytes. A BigInteger holding a 64 bit number will consume approx 72 bytes.

So the size of an IntegerValue (for this experiment) is approximately between 80 bytes and 84 bytes.
So the memory use for the range of 1 to 3000000000, will be between 223.52 GB and 234.69 GB... Hence the OutOfMemoryError!

We have a few options for fixing this:

  1. Mutable Object reuse. We need to rewrite the QueryOptimizer, so that it can detect the dependencies (bounds) of an expression and/or variable and optimize the query so that when it detects that objects have no more forward references, the objects can be reused or GC'd.
  2. Optimize the query engine in several ways so that subsequence does not need to realize the sequence (or at least not the entire sequence).
  3. Optimize fn:count - fn:count should not need to actually retrieve items from the sequence to get the count of them, unless the number of items in the sequence is not stable/deterministic.
  4. Other things I have not yet thought of...

@adamretter adamretter modified the milestones: eXist-4.2.2, eXist-4.3.1 Jul 7, 2018

@adamretter adamretter modified the milestones: eXist-4.3.1, eXist-4.3.2 Jul 24, 2018

@dizzzz dizzzz closed this in #1977 Sep 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment