Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArangoError: AQL: got invalid vpack value on lookup #3375

Closed
22 tasks
roman-aprias-ipf opened this issue Oct 5, 2017 · 19 comments
Closed
22 tasks

ArangoError: AQL: got invalid vpack value on lookup #3375

roman-aprias-ipf opened this issue Oct 5, 2017 · 19 comments
Labels
1 Bug 2 Fixed Resolution 3 AQL Query language related 3 Core 3 MMFiles Storage engine related Waiting User Reply
Milestone

Comments

@roman-aprias-ipf
Copy link

my environment running ArangoDB

I'm using the latest ArangoDB of the respective release series:

  • 2.8
  • 3.0
  • 3.1
  • [ *] 3.2
  • self-compiled devel branch

Mode:

  • Cluster
  • [ *] Single-Server

Storage-Engine:

  • [ *] mmfiles
  • rocksdb

On this operating system:

  • DCOS on
    • AWS
    • Azure
    • own infrastructure
  • Linux
    • [* ] Debian .deb
    • Ubuntu .deb
    • SUSE .rpm
    • RedHat .rpm
    • Fedora .rpm
    • Gentoo
    • docker - official docker library
    • other:
  • Windows, version:
  • MacOS, version:

this is an AQL-related issue:

[* ] I'm using graph features

I'm issuing AQL via:

  • web interface with this browser: running on this OS:
  • arangosh
  • [* ] this Driver: arangojs

Unfortunately I have only error message and I don't know how to reproduce it:

ArangoError: AQL: got invalid vpack value on lookup (exception location: /var/lib/otherjenkins/workspace/RELEASE__BuildPackages/arangod/MMFiles/MMFilesCollection.cpp:2960). Please report this error to arangodb.com (while executing) (exception location: /var/lib/otherjenkins/workspace/RELEASE__BuildPackages/arangod/RestHandler/RestCursorHandler.cpp:130). Please report this error to arangodb.com

@dothebart dothebart added 1 Bug 3 AQL Query language related 3 Core 3 MMFiles Storage engine related labels Oct 5, 2017
@roman-aprias-ipf
Copy link
Author

roman-aprias-ipf commented Oct 5, 2017

I tried to debug which aql causes this problem and probably it's this one:

LET list = (
       FOR t IN vTask
       FILTER t.status IN ["ready", "run"]
       SORT t.primary desc, t.status desc, rand()
       RETURN t
      )
      LET existsPrimaryTasks = LENGTH(FOR t in vTask FILTER t.primary == true && t.status in ['ready', 'run'] RETURN 1)
      FOR l IN list
        FILTER l.status == 'ready' && (existsPrimaryTasks > 0 ? l.primary == true : 1 == 1)
        //Sort ip by /24
        LET net = SUBSTRING(l.ip, 0, FIND_LAST(l.ip, '.'))
        COLLECT prefix = net into g
        LET ips = g[*].t.ip
        LET data = UNIQUE(FLATTEN(
          FOR idx IN 0..(LENGTH(ips) - 1)
            LET ip = ips[idx]
            RETURN {ip, idx}
        ))
          FOR d IN data
          SORT d.idx
          LIMIT 250 //We want send 250 ip max in 1 round
            FOR t IN vTask
            FILTER t._key == d.ip
            RETURN t

It has been written by my colleague, ternary operator in filter seems suspicious to me

@roman-aprias-ipf
Copy link
Author

So it's crashing also with other commands then the one above. We tried 3.1.27 with no errors.

@dothebart
Copy link
Contributor

when you dump and restore the stage you've got into a new arangodb on another box, can you still reproduce this?

@roman-aprias-ipf
Copy link
Author

well when I run query which failed in our code then manually, it's OK. so there's no stage which I can dump and try on another box. but when I run our app (it uses dozens of queries) on different boxes (with clean db) against different use cases then it fails after some time, and so far every time. So it's quite reproducible, but unfortunately I can't just take dump and have query fails. Any ideas how to troubleshoot it?

@dothebart
Copy link
Contributor

can you try to use some loadtesting suite like JMeter to load your application on another system to reproduce this?

Compiling arangodb with -DUSE_BACKTRACE=1 as an option would probably give more details about where its actually happening at the cost of a little slowdown.. You would have to start the server with --console from the commandline, and type ENABLE_NATIVE_BACKTRACES(true) into it.

@roman-aprias-ipf
Copy link
Author

I don't have any experience with JMeter, but our app contains several processes communicating over rabbitmq and to get vpack error it has to run against our live customer network.

I compiled arango with backtrace option, started with console, and typed that command. But then I'm little bit loss how to continue. I reproduced error, but was not able to collect backtrace with gdb, as it gives me some error after attaching to arangod:

[New LWP 1600]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
225 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such file or directory.

Sorry I don't have much experience with this kind of debugging. Could you please give next steps how to collect backtrace which would be useful for you?

@dothebart
Copy link
Contributor

At first you should see the c++ stacktraces within the javascript stacktraces.

As for GDB you could do:
break MMFilesCollection.cpp:2960
so it would stop at that place where the exception is rethrown.

@roman-aprias-ipf
Copy link
Author

I don't see any c++ stacktrace within javascript stacktrace

2017-11-01T16:44:30.380Z - error: ArangoError: AQL: got invalid vpack value on lookup (exception location: /usr/local/src/arangodb/arangod/MMFiles/MMFilesCollection.cpp:2959). Please report this error to arangodb.com (while executing) (exception location: /usr/local/src/arangodb/arangod/RestHandler/RestCursorHandler.cpp:130). Please report this error to arangodb.com
at new ArangoError (/home/autoboss/discovery/node_modules/arangojs/lib/error.js:30:15)
at /home/autoboss/discovery/node_modules/arangojs/lib/connection.js:203:21
at callback (/home/autoboss/discovery/node_modules/arangojs/lib/util/request.js:51:12)
at IncomingMessage. (/home/autoboss/discovery/node_modules/arangojs/lib/util/request.js:62:11)
at emitNone (events.js:91:20)
at IncomingMessage.emit (events.js:185:7)
at endReadableNT (_stream_readable.js:974:12)
at _combinedTickCallback (internal/process/next_tick.js:80:11)
at process._tickCallback (internal/process/next_tick.js:104:9)

I compiled arango with -DUSE_BACKTRACE=1 and in console I typed ENABLE_NATIVE_BACKTRACES(true)

Problem with GDB break is that app has to to run around 5 minutes until error is thrown and I don't know when exactly this error will occur and that line in code is called all the time.

Any other ideas? We are now stuck to 3.1 and we need to move back to 3.2, but we cant' with this error :(

@jsteemann
Copy link
Contributor

If you are in a position to compile 3.2 from source (as it seems), then the easiest from my perspective is to apply the following patch to the source code and recompile it:

diff --git a/arangod/MMFiles/MMFilesCollection.cpp b/arangod/MMFiles/MMFilesCollection.cpp
index 8902e92433..6e77d1d680 100644
--- a/arangod/MMFiles/MMFilesCollection.cpp
+++ b/arangod/MMFiles/MMFilesCollection.cpp
@@ -2958,6 +2958,7 @@ uint8_t const* MMFilesCollection::lookupRevisionVPack(
     TRI_ASSERT(VPackSlice(vpack).isObject());
     return vpack;
   }
+  TRI_LogBacktrace();
   THROW_ARANGO_EXCEPTION_MESSAGE(TRI_ERROR_INTERNAL,
                                  "got invalid vpack value on lookup");
 }

This will make arangod log a warning every time the problem occurs. The warning will include a backtrace including the call stack.
Note that this will only work when compiling with -DUSE_BACKTRACE=On and -DUSE_MAINTAINER_MODE=On. Only using -DUSE_BACKTRACE without -DUSE_MAINTAINER_MODE will not make it log any backtraces.
Please note that when the maintainer mode is enabled, there will also be several assertions compiled into the binary. These assertions may also help revealing the actual problem. But please note that the assertions will have a (very small) runtime overhead and, whenever triggered, will immediately terminate arangod. So please don't run the resulting binary in production.
If you encounter any such warning in the logs, they will only make sense in conjunction with the local binary produced by your compile run. The backtraces will contain addresses that are only valid in your specific environment.
For example:

Each of the addresses contained (e.g. 0x11f19bb) can then be resolved using the addr2line utility, e.g.

addr2line -e build/bin/arangod 0x11f19bb
./lib/Basics/debugging.cpp:286

and if applied to the relevant frames in the call stack, it will show the call stack for the function that triggers the problem.

If for any reason an assertion should be triggered, the violated assertion should be logged to ArangoDB's regular log. The binary will also terminate with std::abort, which will produce a coredump if coredumps are enabled. If you could enable coredumps in your system that would also be super-useful because they may provide some further valuable information. If you start arangod manually from the command line, then something like ulimit -c unlimited may already help. If you use systemd or any other service, then it may be necessary to enable coredumps system-wide.

@roman-aprias-ipf
Copy link
Author

Hi Jan, thanks for your help! Here is stacktrace:

arangod> ENABLE_NATIVE_BACKTRACES(true)
arangod> 2017-11-02T11:34:15Z [3362] FATAL assertion failed in /usr/local/src/arangodb/arangod/MMFiles/MMFilesCollection.cpp:2953: revisionId != 0
./arangod() [0x1152f68] TRI_GetBacktrace(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&)
./arangod() [0x1154424] TRI_PrintBacktrace()
./arangod() [0xef6c41] arangodb::MMFilesCollection::lookupRevisionVPack(unsigned long) const
./arangod() [0xef6ce4] arangodb::MMFilesCollection::readDocumentWithCallback(arangodb::transaction::Methods*, arangodb::DocumentIdentifierToken const&, std::function<void (arangodb::DocumentIdentifierToken const&, arangodb::velocypack::Slice)> const&)
./arangod() [0xaebb01] arangodb::MMFilesHashIndexIterator::next(std::function<void (arangodb::DocumentIdentifierToken const&)> const&, unsigned long)
./arangod() [0xdabb2c] arangodb::IndexIterator::nextDocument(std::function<void (arangodb::DocumentIdentifierToken const&, arangodb::velocypack::Slice)> const&, unsigned long)
./arangod() [0xa095dc] arangodb::OperationCursor::nextDocument(std::function<void (arangodb::DocumentIdentifierToken const&, arangodb::velocypack::Slice)> const&, unsigned long)
./arangod() [0xca9d36] arangodb::aql::IndexBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xfc7194] arangodb::aql::CalculationBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xfc187a] arangodb::aql::FilterBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xfc2019] arangodb::aql::FilterBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xff35a1] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xfc3a2a] arangodb::aql::ReturnBlock::getSome(unsigned long, unsigned long)
./arangod() [0xd043e3] arangodb::aql::SubqueryBlock::executeSubquery()
./arangod() [0xd04568] arangodb::aql::SubqueryBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xfc7194] arangodb::aql::CalculationBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xcfb1ba] arangodb::aql::SortBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xff35a1] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xfc7194] arangodb::aql::CalculationBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xfddd85] arangodb::aql::SortedCollectBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xff35a1] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xfc7194] arangodb::aql::CalculationBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xd0448a] arangodb::aql::SubqueryBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xff3c3f] arangodb::aql::ExecutionBlock::getOrSkipSome(unsigned long, unsigned long, bool, arangodb::aql::AqlItemBlock*&, unsigned long&)
./arangod() [0xff1671] arangodb::aql::ExecutionBlock::getSomeWithoutRegisterClearout(unsigned long, unsigned long)
./arangod() [0xfc7194] arangodb::aql::CalculationBlock::getSome(unsigned long, unsigned long)
./arangod() [0xff3625] arangodb::aql::ExecutionBlock::getBlock(unsigned long, unsigned long)
./arangod() [0xfefbf0] arangodb::aql::EnumerateListBlock::getSome(unsigned long, unsigned long)

Here are addresses resolution:

/usr/local/src/arangodb/lib/Basics/debugging.cpp:286
/usr/include/x86_64-linux-gnu/bits/stdio2.h:98
/usr/local/src/arangodb/arangod/MMFiles/MMFilesCollection.cpp:2953 (discriminator 1)
/usr/local/src/arangodb/arangod/MMFiles/MMFilesCollection.cpp:1991
/usr/include/c++/5/functional:2267
/usr/local/src/arangodb/arangod/Indexes/IndexIterator.cpp:63
/usr/include/c++/5/functional:1829
/usr/local/src/arangodb/arangod/Aql/IndexBlock.cpp:451
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/CalculationBlock.cpp:193
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/BasicBlocks.cpp:172
/usr/local/src/arangodb/arangod/Aql/BasicBlocks.cpp:227
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:321
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/BasicBlocks.cpp:435
/usr/local/src/arangodb/arangod/Aql/SubqueryBlock.cpp:146
/usr/local/src/arangodb/arangod/Aql/SubqueryBlock.cpp:89
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/CalculationBlock.cpp:193
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/SortBlock.cpp:69
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:321
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/CalculationBlock.cpp:193
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/CollectBlock.cpp:291
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:321
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/CalculationBlock.cpp:193
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/SubqueryBlock.cpp:62
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:406
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:310
/usr/local/src/arangodb/arangod/Aql/CalculationBlock.cpp:193
/usr/local/src/arangodb/arangod/Aql/ExecutionBlock.cpp:279
/usr/local/src/arangodb/arangod/Aql/EnumerateListBlock.cpp:84

I run arango from console, used ulimit but can't find coredump. I will try harder to generate/find it. Please let me know if debug above helped or I can do anything else.

@jsteemann
Copy link
Contributor

It looks like some of the hash index contents are invalid at the point they are queried from AQL.
Did the application that produced/modified the data encounter any errors before?
If not, the next option would be start arangod inside gdb as follows: gdb --args build/bin/arangod <regular arangod startup options go here>, e.g. gdb --args build/bin/arangod --console datadir.
Then after gdb has started, please type run to actually start arangod inside gdb. This will then do nothing further, but when arangod exits, you do have a chance to inspect the call stack further with gdb. If it exits, you can type bt full in gdb and it should show a proper backtrace together with more info.
I don't think we will need coredumps if we have gdb in place.

@roman-aprias-ipf
Copy link
Author

roman-aprias-ipf commented Nov 2, 2017

I did not noticed any errors except this problem. I will try to run it against complete new clean database just to be sure. Here is backtrace, hope it will help
arango_bt_full.txt

@jsteemann
Copy link
Contributor

It fails for an AQL query that begins with LET list = (FOR t IN vTask FILTER t.status IN ["ready", "run"] SORT t.primary desc, t.status desc, rand() RETURN t) LET existsPrimaryTask before the information in the backtrace is cut off. It seems to fail inside a subquery of that query.
I would guess that the collection in question in vTask, and that there is some hash index on some of its attributes. Can you share the index definitions for that collection?

@roman-aprias-ipf
Copy link
Author

yes, there is hash index on two properties

image

@jsteemann
Copy link
Contributor

are there any updates/inserts/removes to that collection in the workload that produces the problem? if so, which approximately?

@roman-aprias-ipf
Copy link
Author

I send you exact commands to hackers@arangodb.com

There are two processes accessing this collection. Both doing inserts and updates pretty intensively. No removes.

@jsteemann
Copy link
Contributor

If I am not wrong, we fixed this issue in 3.2.9.
Can you confirm this issue can be closed now? Thanks!

@sleto-it sleto-it added this to the 3.2.9 milestone Jan 23, 2018
@maxkernbach
Copy link
Contributor

Hi @Roman1720,

could you please check if this error still occurs in recent versions 3.2.9+?

@roman-aprias-ipf
Copy link
Author

Yes it's fixed in 3.2.9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 Bug 2 Fixed Resolution 3 AQL Query language related 3 Core 3 MMFiles Storage engine related Waiting User Reply
Projects
None yet
Development

No branches or pull requests

5 participants