Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] segment fault in 3.4.5 arangod #9266

Closed
daoxian opened this issue Jun 14, 2019 · 9 comments
Closed

[BUG] segment fault in 3.4.5 arangod #9266

daoxian opened this issue Jun 14, 2019 · 9 comments

Comments

@daoxian
Copy link

daoxian commented Jun 14, 2019

My Environment

  • ArangoDB Version: 3.4.5
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: ArangoDB Starter-->
  • Configuration: 1 single node
  • Infrastructure: bear metal box
  • Operating System: CentOS 7.4
  • Total RAM in your machine: 96G
  • Disks in use: HDD
  • Used Package: RedHat .rpm

The arangod process will coredump from time to time, with dmesg log:
[145071.189436] in arangod[400000+2ad9000]
[145160.867151] Scheduler[87052]: segfault at 16c ip 000000000110cb09 sp 00002ab75292f870 error 4 in arangod[400000+2ad9000]
[147624.196025] Scheduler[98860]: segfault at 16c ip 000000000110cb09 sp 00002b1f0dfe0bc0 error 4 in arangod[400000+2ad9000]
[147624.196117] Scheduler[104700]: segfault at 16c ip 000000000110cb09 sp 00002b2815bbfbc0 error 4 in arangod[400000+2ad9000]
[147624.196182] Scheduler[95484]: segfault at 16c ip 000000000110cb09 sp 00002b1f0d3dbbc0 error 4 in arangod[400000+2ad9000]
[151525.445107] Scheduler[3825]: segfault at 16c ip 000000000110cb09 sp 00002b5383055870 error 4 in arangod[400000+2ad9000]
[151525.446050] Scheduler[3294]: segfault at 16c ip 000000000110cb09 sp 00002b5514695870 error 4 in arangod[400000+2ad9000]
[161590.435460] Scheduler[75224]: segfault at 16c ip 000000000110cb09 sp 00002aee8f426870 error 4
[161590.435796] Scheduler[67268]: segfault at 16c ip 000000000110cb09 sp 00002af22b005870 error 4
[161590.435812] in arangod[400000+2ad9000]
[161590.435829] in arangod[400000+2ad9000]
[161590.435889] Scheduler[74910]: segfault at 16c ip 000000000110cb09 sp 00002b0bca4d7870 error 4 in arangod[400000+2ad9000]
[175518.446840] Scheduler[42470]: segfault at 16c ip 000000000110cb09 sp 00002afc8c25dbc0 error 4 in arangod[400000+2ad9000]
[194529.797380] Scheduler[51478]: segfault at 16c ip 000000000110cb09 sp 00002ad72d001a40 error 4
[194529.797656] Scheduler[49609]: segfault at 16c ip 000000000110cb09 sp 00002ad71ef75a40 error 4
[194529.797665] Scheduler[51476]: segfault at 16c ip 000000000110cb09 sp 00002ad726fc5a40 error 4 in arangod[400000+2ad9000]
[194529.797667] in arangod[400000+2ad9000]

[194529.797704] Scheduler[51480]: segfault at 16c ip 000000000110cb09 sp 00002ad99f98bbc0 error 4
[194529.797708] Scheduler[49625]: segfault at 16c ip 000000000110cb09 sp 00002ad722f9da40 error 4 in arangod[400000+2ad9000]
[194529.797711] in arangod[400000+2ad9000]
[194529.797721] Scheduler[46976]: segfault at 16c ip 000000000110cb09 sp 00002ac5cb40aa40 error 4
[194529.797725] in arangod[400000+2ad9000]

[194529.797736] in arangod[400000+2ad9000]

[194529.797776] Scheduler[51086]: segfault at 16c ip 000000000110cb09 sp 00002ac307da1a40 error 4 in arangod[400000+2ad9000]
[197694.158293] Scheduler[65580]: segfault at 16c ip 000000000110cb09 sp 00002aedc6ec3870 error 4 in arangod[400000+2ad9000]
[203652.637396] sd 6:0:0:0: attempting task abort! scmd(ffff880571677b80)
[203652.637404] sd 6:0:0:0: [sda] CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[203652.637408] scsi target6:0:0: handle(0x000a), sas_address(0x500e004aaaaaaa00), phy(0)
[203652.637410] scsi target6:0:0: enclosure_logical_id(0x500e004aaaaaaa00), slot(0)
[203653.000307] sd 6:0:0:0: task abort: SUCCESS scmd(ffff880571677b80)
[203659.672732] Scheduler[102095]: segfault at 16c ip 000000000110cb09 sp 00002ba2f4ebf870 error 4 in arangod[400000+2ad9000]
[211735.793295] Scheduler[39387]: segfault at 16c ip 000000000110cb09 sp 00002b82357e9870 error 4
[211735.793530] Scheduler[39213]: segfault at 16c ip 000000000110cb09 sp 00002b765fd37870 error 4
[211735.793539] in arangod[400000+2ad9000]
[211735.793577] Scheduler[35865]: segfault at 16c ip 000000000110cb09 sp 00002b6549cf4870 error 4 in arangod[400000+2ad9000]
[211735.793623] Scheduler[35866]: segfault at 16c ip 000000000110cb09 sp 00002b6d7bb53870 error 4
[211735.793623] Scheduler[35871]: segfault at 16c ip 000000000110cb09 sp 00002b7659d0fbc0 error 4 in arangod[400000+2ad9000]
[211735.793637] in arangod[400000+2ad9000]
[211735.793641] in arangod[400000+2ad9000]
[211735.793650] Scheduler[39215]: segfault at 16c ip 000000000110cb09 sp 00002b7661541870 error 4
[211735.793652] Scheduler[39396]: segfault at 16c ip 000000000110cb09 sp 00002b8235fee870 error 4
[211735.793664] in arangod[400000+2ad9000]

[211735.793674] in arangod[400000+2ad9000]

Please give me some suggestions.

@ObiWahn
Copy link
Contributor

ObiWahn commented Jun 14, 2019

Provide the arango logs and the core files. If that is not possible you could inspect the core files and provide us with backtraces. https://github.com/arangodb/arangodb/blob/devel/README_maintainers.md#linux-coredumps

@daoxian
Copy link
Author

daoxian commented Jun 16, 2019

@ObiWahn
I've setup a 3-node active failover mode arangodb. The processes are like the following:
$ps afxwww|grep arango
76165 pts/1 S+ 0:00 _ grep --color=auto arango
74883 ? S 0:00 sudo nohup arangodb --starter.mode=activefailover --server.storage-engine=rocksdb --starter.data-dir=/home/admin/arangodb/data_3.4.5_activefailover --dbservers.rocksdb.block-cache-size=5000000000 --dbservers.rocksdb.enforce-block-cache-size-limit=true --dbservers.rocksdb.max-total-wal-size=838860800 --dbservers.rocksdb.max-write-buffer-number=20 --dbservers.rocksdb.min-write-buffer-number-to-merge=10 --dbservers.rocksdb.num-threads-priority-high=10 --dbservers.rocksdb.num-threads-priority-low=5 --dbservers.rocksdb.sync-interval=10000 --dbservers.rocksdb.table-block-size=163840 --dbservers.rocksdb.total-write-buffer-size=5000000000 --dbservers.rocksdb.write-buffer-size=200000000 --dbservers.server.maximal-queue-size 0 --starter.join 11.140.25.69,11.140.129.35,11.140.126.222
74903 ? Sl 0:00 _ arangodb --starter.mode=activefailover --server.storage-engine=rocksdb --starter.data-dir=/home/admin/arangodb/data_3.4.5_activefailover --dbservers.rocksdb.block-cache-size=5000000000 --dbservers.rocksdb.enforce-block-cache-size-limit=true --dbservers.rocksdb.max-total-wal-size=838860800 --dbservers.rocksdb.max-write-buffer-number=20 --dbservers.rocksdb.min-write-buffer-number-to-merge=10 --dbservers.rocksdb.num-threads-priority-high=10 --dbservers.rocksdb.num-threads-priority-low=5 --dbservers.rocksdb.sync-interval=10000 --dbservers.rocksdb.table-block-size=163840 --dbservers.rocksdb.total-write-buffer-size=5000000000 --dbservers.rocksdb.write-buffer-size=200000000 --dbservers.server.maximal-queue-size 0 --starter.join 11.140.25.69,11.140.129.35,11.140.126.222
74925 ? Sl 0:02 _ /usr/sbin/arangod -c /home/admin/arangodb/data_3.4.5_activefailover/agent8531/arangod.conf --database.directory /home/admin/arangodb/data_3.4.5_activefailover/agent8531/data --javascript.startup-directory /usr/share/arangodb3/js --javascript.app-path /home/admin/arangodb/data_3.4.5_activefailover/agent8531/apps --log.file /home/admin arangodb/data_3.4.5_activefailover/agent8531/arangod.log --log.force-direct false --javascript.copy-installation true --agency.activate true --agency.my-address tcp://11.140.25.69:8531 --agency.size 3 --agency.supervision true --foxx.queues false --server.statistics false --agency.endpoint tcp://11.140.126.222:8531 --agency.endpoint tcp://11.140.129.35:8531
75848 ? Sl 0:40 _ /usr/sbin/arangod -c /home/admin/arangodb/data_3.4.5_activefailover/resilientsingle8529/arangod.conf --database.directory /home/admin/arangodb/data_3.4.5_activefailover/resilientsingle8529/data --javascript.startup-directory /usr/share/arangodb3/js --javascript.app-path /home/admin/arangodb/data_3.4.5_activefailover/resilientsingle8529/apps --log.file /home/admin/arangodb/data_3.4.5_activefailover/resilientsingle8529/arangod.log --log.force-direct false --javascript.copy-installation true --foxx.queues true --server.statistics true --replication.automatic-failover true --cluster.my-address tcp://11.140.25.69:8529 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://11.140.126.222:8531 --cluster.agency-endpoint tcp://11.140.25.69:8531 --cluster.agency-endpoint tcp://11.140.129.35:8531 --rocksdb.max-total-wal-size 838860800 --rocksdb.max-write-buffer-number 20 --rocksdb.min-write-buffer-number-to-merge 10 --rocksdb.num-threads-priority-low 5 --rocksdb.sync-interval 10000 --rocksdb.table-block-size 163840 --rocksdb.total-write-buffer-size 5000000000 --rocksdb.enforce-block-cache-size-limit true --server.maximal-queue-size 0 --rocksdb.write-buffer-size 200000000 --rocksdb.num-threads-priority-high 10 --rocksdb.block-cache-size 5000000000

Then after a while's running, coredump emergence. The backtrace:

Core was generated by `/usr/sbin/arangod -c /home/admin/arangodb/data_3.4.5_activefailover/resilientsi'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000110cb09 in arangodb::rest::GeneralCommTask::executeRequest(std::unique_ptr<arangodb::GeneralRequest, std::default_deletearangodb::GeneralRequest >&&, std::unique_ptr<arangodb::GeneralResponse, std::default_deletearangodb::GeneralResponse >&&) ()
Missing separate debuginfos, use: debuginfo-install arangodb3-3.4.5-1.0.x86_64
(gdb) bt
#0 0x000000000110cb09 in arangodb::rest::GeneralCommTask::executeRequest(std::unique_ptr<arangodb::GeneralRequest, std::default_deletearangodb::GeneralRequest >&&, std::unique_ptr<arangodb::GeneralResponse, std::default_deletearangodb::GeneralResponse >&&) ()
#1 0x000000000106710e in arangodb::rest::HttpCommTask::processRequest(std::unique_ptr<arangodb::HttpRequest, std::default_deletearangodb::HttpRequest >) ()
#2 0x000000000106ee05 in arangodb::rest::HttpCommTask::processRead(double) ()
#3 0x00000000010e28bb in arangodb::rest::SocketTask::processAll() ()
#4 0x00000000010e2d85 in arangodb::rest::SocketTask::asyncReadSome() ()
#5 0x00000000008bc09e in asio::detail::completion_handler<arangodb::rest::Scheduler::post(asio::io_context::strand&, std::function<void ()>)::{lambda()#2}>::do_complete(void*, asio::detail::scheduler_operation*, std::error_code const&, unsigned long) ()
#6 0x00000000008c10a5 in asio::detail::strand_service::do_complete(void*, asio::detail::scheduler_operation*, std::error_code const&, unsigned long) ()
#7 0x00000000008c2e5b in arangodb::SchedulerThread::run() ()
#8 0x00000000011f7709 in arangodb::Thread::startThread(void*) ()
#9 0x00000000012b76e9 in ThreadStarter(void*) ()
#10 0x00000000027c5a9c in start ()
#11 0x0000000000000000 in ?? ()

@daoxian daoxian changed the title segment fault in 3.4.5 arangod [BUG] segment fault in 3.4.5 arangod Jun 16, 2019
@mpoeter
Copy link
Member

mpoeter commented Jun 17, 2019

I think I found the issue. Based on the backtrace I knew that it had to be in executeRequest, so I had a look at the arangod executable with IDAFree to check out the code at the instruction pointer 0x000000000110cb09:

.text:000000000110CAF0 2B8                 mov     rax, [r13+0]
.text:000000000110CAF4 2B8                 mov     r8d, 21003      ; TRI_ERROR_QUEUE_FULL
.text:000000000110CAFA 2B8                 mov     esi, 503        ; SERVICE_UNAVAILABLE
.text:000000000110CAFF 2B8                 mov     rdi, r14
.text:000000000110CB02 2B8                 mov     rcx, [rbp+messageId]
.text:000000000110CB09 2B8                 mov     edx, [rax+16Ch] ; CRASH
.text:000000000110CB0F 2B8                 call    addErrorResponse

The according source line is:

addErrorResponse(rest::ResponseCode::SERVICE_UNAVAILABLE,
request->contentTypeResponse(), messageId, TRI_ERROR_QUEUE_FULL);

The problem is that in line 294 the request variable is moved to the handler, so the call to request->contentTypeResponse() crashes:

std::shared_ptr<RestHandler> handler(
GeneralServerFeature::HANDLER_FACTORY->createHandler(std::move(request),
std::move(response)));

Maybe someone from the ArangoDB team can take over from here. 🙂

@graetzer
Copy link
Contributor

graetzer commented Jun 17, 2019

thats fixed in 3.4.x: 8925edc

@mpoeter
Copy link
Member

mpoeter commented Jun 17, 2019

The v3.4.5 tag refers to commit 648fbb8 on which the permalinks in my previous post are based on. So it looks like this fix got lost somewhere...?

@graetzer
Copy link
Contributor

graetzer commented Jun 17, 2019

it’s in >= 3.4.6.1

@dothebart
Copy link
Contributor

as pointed out by @graetzer this bug has been fixed with ArangoDB 3.4.6.1 - Please upgrade your environment.
Closing as fixed.

@mpoeter
Copy link
Member

mpoeter commented Jun 17, 2019

Got it - I should just have looked at the changelog changes of your referenced commit where it clearly says: 3.4.6

@daoxian
Copy link
Author

daoxian commented Jun 18, 2019

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants