Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on table creation: "check failed: 'ysql_catalog_config_.get()' Must be non NULL" #1995

Closed
aphyr opened this issue Aug 7, 2019 · 3 comments

Comments

@aphyr
Copy link

commented Aug 7, 2019

In roughly 1:80 Jepsen runs, YugaByte DB 1.3.1.0 crashes on startup while a client is trying to create a new table. The client creating the table throws an error like Rollback failed handling "An I/O error occurred while sending to the backend., which I've seen before when nodes crash during table creation, e.g. due to blowing the file descriptor limit. Multiple masters seem to have crashed with

F0807 18:39:44.229851 42000 catalog_manager.cc:4396] Check failed: 'ysql_catalog_config_.get()' Must be non NULL 
Fatal failure details written to /home/yugabyte/data/yb-data/master/logs/yb-master.FATAL.details.2019-08-07T18_39_44.pid41968.txt
F20190807 18:39:44 ../../src/yb/master/catalog_manager.cc:4396] Check failed: 'ysql_catalog_config_.get()' Must be non NULL 
    @     0x7f2819cecfb5  yb::LogFatalHandlerSink::send()
    @     0x7f2818efd666  google::LogMessage::SendToLog()
    @     0x7f2818efad5a  google::LogMessage::Flush()
    @     0x7f2818efdc89  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f28228074ec  yb::master::CatalogManager::IncrementYsqlCatalogVersion()
    @     0x7f2822895ebf  yb::master::MasterTabletServiceImpl::Write()
    @     0x7f281e5e7f87  yb::tserver::TabletServerServiceIf::Handle()
    @     0x7f281d108dc1  yb::rpc::ServicePoolImpl::Handle()
    @     0x7f281d0b5ef4  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7f281d114788  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f2819d72849  yb::Thread::SuperviseThread()
    @     0x7f2815991694  start_thread
    @     0x7f28150ce41d  __clone
    @              (nil)  (unknown)

*** Check failure stack trace: ***
    @     0x7f2819ceb6ec  yb::(anonymous namespace)::DumpStackTraceAndExit()
    @     0x7f2818efb20d  google::LogMessage::Fail()
    @     0x7f2818efd26d  google::LogMessage::SendToLog()
    @     0x7f2818efad5a  google::LogMessage::Flush()
    @     0x7f2818efdc89  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f28228074ec  yb::master::CatalogManager::IncrementYsqlCatalogVersion()
    @     0x7f2822895ebf  yb::master::MasterTabletServiceImpl::Write()
    @     0x7f281e5e7f87  yb::tserver::TabletServerServiceIf::Handle()
    @     0x7f281d108dc1  yb::rpc::ServicePoolImpl::Handle()
    @     0x7f281d0b5ef4  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7f281d114788  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f2819d72849  yb::Thread::SuperviseThread()
    @     0x7f2815991694  start_thread
    @     0x7f28150ce41d  __clone
    @              (nil)  (unknown)

And n1's tserver logs what looks like a... WAL corruption issue?

W0807 18:39:39.101908 42597 log_util.cc:211] Could not read footer for segment: /home/yugabyte/data/yb-data/tserver/wals/table-d6d4c0d25d4b4be2953d4deae3581c7f/tablet-1fc5af7344b34397802c8679b868e2c1/wal-000000001: Not found (yb/consensus/log_util.cc:446): Footer not found. Footer magic doesn't match
W0807 18:39:39.101989 42597 log_reader.cc:153] T 1fc5af7344b34397802c8679b868e2c1 P 86987bd3521a47d7b214c7986ec5d5e1: Log segment /home/yugabyte/data/yb-data/tserver/wals/table-d6d4c0d25d4b4be2953d4deae3581c7f/tablet-1fc5af7344b34397802c8679b868e2c1/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W0807 18:39:39.102344 42597 log_util.cc:211] Could not read footer for segment: /home/yugabyte/data/yb-data/tserver/wals/table-d6d4c0d25d4b4be2953d4deae3581c7f/tablet

In this test run (20190807T143846.000-0400.zip):

14:39:31.439 - Client sends CREATE TABLE request
14:39:39.102 - n1 tserver could not read footer for segment
18:39:39.853 - n4 tserver Failed to destroy regular DB... No such file or directory
14:39:44.??? - n1 master crashes with Check failed: 'ysql_catalog_config_.get()' Must be non NULL
14:39:44.??? - n3 master crashes with Check failed: 'ysql_catalog_config_.get()' Must be non NULL
14:40:01.471 - Client request crashes, attempts reconnection
14:40:31,472 - Client times out reconnecting (server likely not accepting connections)

@kmuthukk kmuthukk added the area/docdb label Aug 7, 2019

@kmuthukk kmuthukk added this to To Do in YBase features via automation Aug 7, 2019

@kmuthukk kmuthukk added the kind/bug label Aug 7, 2019

@ttyusupov ttyusupov added this to To do in Jepsen Testing via automation Aug 8, 2019

@bmatican

This comment has been minimized.

Copy link
Contributor

commented Aug 8, 2019

Some updates:

TL;DR:

  • Why are we crashing? A Write request sent to a non-leader master + missing safety code to reject these
  • Why are we in this situation to begin with?? Still unclear...

On n1:

I0807 18:38:55.582062 42002 sys_catalog.cc:355] T 00000000000000000000000000000000 P f2dd799314bf455bbf0a00f860d8895a [sys.catalog]: This master's current role is: FOLLOWER
...
I0807 18:38:55.715646 42002 tablet.cc:2192] T 00000000000000000000000000000000 P f2dd799314bf455bbf0a00f860d8895a: Checkpoint restored from /home/yugabyte/share/initial_sys_catalog_snapshot/rocksdb
...
F0807 18:39:44.229851 42000 catalog_manager.cc:4396] Check failed: 'ysql_catalog_config_.get()' Must be non NULL·

So we sent a Write to a non-leader...

From n4 postgres logs:

I0807 18:39:31.409332 13699 async_initializer.cc:77] Successfully built ybclient
I0807 18:39:31.501487 13697 ybccmds.c:180] Creating Table postgres.public.append0
I0807 18:39:31.501627 13697 pg_ddl.cc:141] PgCreateTable: creating a transactional table: postgres.append0
I0807 18:39:36.696817 13697 table_creator.cc:286] Created table postgres.append0 of type PGSQL_TABLE_TYPE
I0807 18:39:46.946707 13705 tcp_stream.cc:292] { local: 192.168.122.14:34600 remote: 192.168.122.11:7100 }:  Recv failed: Network error (yb/util/net/socket.cc:590): recvmsg error: Connection refused (error 111)
I0807 18:39:46.967732 13710 meta_cache.cc:588] Got resp error 7, code=OK
W0807 18:39:46.968286 13710 meta_cache.cc:602] Leader Master has changed, re-trying...
I0807 18:39:46.969182 13709 meta_cache.cc:588] Got resp error 7, code=OK
I0807 18:39:46.970381 13711 meta_cache.cc:588] Got resp error 7, code=OK
I0807 18:39:46.971410 13710 meta_cache.cc:588] Got resp error 7, code=OK
I0807 18:39:46.972440 13709 meta_cache.cc:588] Got resp error 7, code=OK
I0807 18:39:46.973425 13711 meta_cache.cc:588] Got resp error 7, code=OK
I0807 18:39:46.974287 13710 meta_cache.cc:588] Got resp error 7, code=OK
I0807 18:39:46.975102 13709 meta_cache.cc:588] Got resp error 7, code=OK
...

Root cause of the issue is the following, in src/yb/master/master_tablet_service.cc:

 56 void MasterTabletServiceImpl::Write(const tserver::WriteRequestPB* req,
 57                                     tserver::WriteResponsePB* resp,
 58                                     rpc::RpcContext context) {
 59   for (const auto& pg_req : req->pgsql_write_batch()) {
 60     if (pg_req.is_ysql_catalog_change()) {
 61       const auto &res = master_->catalog_manager()->IncrementYsqlCatalogVersion();
 62       if (!res.ok()) {
 63         context.RespondRpcFailure(rpc::ErrorStatusPB::ERROR_APPLICATION,
 64             STATUS(InternalError, "Failed to increment YSQL catalog version"));
 65       }
 66     }
 67   }
 68
 69   tserver::TabletServiceImpl::Write(req, resp, std::move(context));
 70 }

Unlike in most other paths in the code, where we normally do something like:

CatalogManager::ScopedLeaderSharedLock l(master_->catalog_manager());
...
if (!l.CheckIsInitializedAndIsLeaderOrRespondTServer(resp, context)) {

cc @frozenspider @m-iancu @ndeodhar

@aphyr

This comment has been minimized.

Copy link
Author

commented Aug 13, 2019

I've encountered another crash which looks like it could be similar to one of the issues we found here: a node (n2) returned a connection-not-ready error in response to a create-table DDL command, and then crashed, rejecting all connections from that point on. Node n2 logged:

E0813 13:00:02.528512 35284 tablet_metadata.cc:389] Failed to destroy regular DB at: /home/yugabyte/data/yb-data/tserver/data/rocksdb/table-716e3cf0ba274252a60577b33852b5c5/tablet-369915202320483bb68dcd22f14c20ae: IO error (yb/rocksdb/util/env_posix.cc:325): /home/yugabyte/data/yb-data/tserver/data/rocksdb/table-716e3cf0ba274252a60577b33852b5c5/tablet-369915202320483bb68dcd22f14c20ae/LOCK: No such file or directory

20190813T085924.000-0400.zip

@bmatican bmatican moved this from To do to In progress in Jepsen Testing Aug 14, 2019

bmatican added a commit that referenced this issue Aug 21, 2019
[#1995] Add leader check on master TS Write interface
Summary: Adding the leader check on the Write API as well.

Test Plan: should add a regression test, but for now, let's run our Jenkins suite..

Reviewers: mihnea, hector, rahuldesirazu, mikhail

Reviewed By: mikhail

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7041

@aphyr aphyr changed the title Crash on startup: "check failed: 'ysql_catalog_config_.get()' Must be non NULL" Crash on table creation: "check failed: 'ysql_catalog_config_.get()' Must be non NULL" Aug 21, 2019

@bmatican

This comment has been minimized.

Copy link
Contributor

commented Aug 21, 2019

d22ad16 should have the fix for this. Will try to get a new release up for some final testing.

@bmatican bmatican closed this Aug 21, 2019

YBase features automation moved this from To Do to Done Aug 21, 2019

Jepsen Testing automation moved this from In progress to Done Aug 21, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants
You can’t perform that action at this time.