[YSQL] Creating a table with "IF NOT EXISTS" leads to tables being created over and over again until OOM #2001

Open
aphyr opened this issue Aug 8, 2019 · 8 comments

@aphyr
commented Aug 8, 2019

I had a cluster of 1.3.1 nodes go very sideways in a Jepsen test last night. During table creation (which uses CREATE TABLE ... IF NOT EXISTS), one particular table, append4, repeatedly failed to create, logging:

INFO [2019-08-07 17:41:37,057] jepsen worker 2 - yugabyte.ysql.append Creating table append0
INFO [2019-08-07 17:41:37,076] jepsen worker 2 - yugabyte.ysql.append Creating table append1
INFO [2019-08-07 17:41:37,081] jepsen worker 2 - yugabyte.ysql.append Creating table append2
INFO [2019-08-07 17:41:37,088] jepsen worker 2 - yugabyte.ysql.append Creating table append3
INFO [2019-08-07 17:41:37,092] jepsen worker 2 - yugabyte.ysql.append Creating table append4
WARN [2019-08-07 17:41:43,612] jepsen worker 2 - yugabyte.ysql.append Encountered error with conn "n3"; reopening
INFO [2019-08-07 17:41:43,697] jepsen worker 2 - yugabyte.ysql.append Caught ERROR: type "append4" already exists
  Hint: A relation has an associated type of the same name, so you must use a name that doesn't conflict with any existing type. during DDL setup; retrying.

... over and over. We've seen this error before, during testing, but it's usually transient. This time it went on for hours before the box OOMed and presumably killed a bunch of processes, leaving one tserver (on node n3) spinning at 93-97% CPU on a 48-way (with HT) box.

Tasks: 2132 total,   5 running, 2127 sleeping,   0 stopped,   0 zombie
%Cpu(s): 93.9 us,  4.3 sy,  0.0 ni,  1.2 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem : 13198026+total, 77707432 free, 36253092 used, 18019740 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 89351608 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND      
45420 root      20   0  0.496t 0.014t 359320 S 94.6 11.1  23002:03 yb-tserver
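
For illustration, the setup behavior visible in the log above (issue the DDL, catch the error, retry) could be sketched roughly as follows. This is not the Jepsen client code, which is written in Clojure; `DDLError`, `create_table_with_retry`, the `execute` callable, the column definitions, and the `max_attempts` cap are all assumptions made for the example.

```python
class DDLError(Exception):
    """Hypothetical stand-in for a driver error raised by a failed DDL statement."""


def create_table_with_retry(execute, table, max_attempts=5):
    """Run CREATE TABLE IF NOT EXISTS, retrying a bounded number of times.

    `execute` is any callable that takes a SQL string and raises DDLError
    on failure. Returns the number of attempts that were needed.
    """
    # The column list is made up for the sketch; the real schema is not
    # shown in this issue.
    sql = f"CREATE TABLE IF NOT EXISTS {table} (k INT PRIMARY KEY, v TEXT)"
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            execute(sql)
            return attempt
        except DDLError as err:
            # e.g. ERROR: type "append4" already exists
            last_error = err
    raise RuntimeError(
        f"gave up creating {table} after {max_attempts} attempts"
    ) from last_error
```

The cap is the only design choice of note here: a bounded loop reports a persistent failure to the caller instead of reissuing the same statement for hours.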

Some debugging data:

mem-trackers
threadz?group=all
contention
rpcz.txt

Unfortunately I wasn't able to get a corefile--attaching GDB to the process made it crash immediately.

This happened with Jepsen af7285b96952258f3e3cb22cb18796ddbf37c56f, running

lein run test-all --os debian --version 1.3.1.0 --time-limit 5 -w ysql/append --test-count 50 --concurrency 1n

@ttyusupov ttyusupov added this to To do in Jepsen Testing via automation Aug 8, 2019

@bmatican

Contributor

commented Aug 9, 2019

@aphyr I'm assuming you don't have the logs for this (or they were too massive). Dug a bit into the sampled data you were able to collect and found this:

grep 'tablet-' ~/Downloads/mem-trackers.txt | wc
   19961   39922 1976095

So my assumption is that we generated 19961 tablets' worth of tables until we could not create any more... cc @m-iancu @ndeodhar I'm guessing this is still txn DDL related, despite CREATE .. IF NOT EXISTS: if the YSQL system metadata did not get updated, it would not matter that the master had already created the tables, correct?
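
The same line count can be reproduced without `grep` and `wc`. The sketch below is a minimal equivalent that counts lines containing `tablet-`; the function name and the sample lines are made up for the example and are not taken from the real `mem-trackers.txt` dump.

```python
def count_matching_lines(lines, needle="tablet-"):
    """Count lines containing `needle`, like `grep needle | wc -l`."""
    return sum(1 for line in lines if needle in line)


# Hypothetical sample input; the real dump has one row per memory tracker.
sample = [
    "tablet-aaaa  server  0B  0B",
    "log_cache    server  0B  0B",
    "tablet-bbbb  server  0B  0B",
]
print(count_matching_lines(sample))  # prints 2

# Against the real file (a file object iterates line by line):
# with open("mem-trackers.txt") as f:
#     print(count_matching_lines(f))
```

Like the `grep` pipeline, this counts matching lines rather than distinct tablet IDs, so it is only an upper bound if a tablet appears on more than one line.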

@aphyr

Author

commented Aug 9, 2019

I've been working on logs--I have a 6.3GB tarball of logs from n3, which took all day to make, haha. Would it be helpful for me to throw that up on S3? It's 121GB expanded, which might... be difficult...

@bmatican

Contributor

commented Aug 9, 2019

@aphyr if I'm right, maybe it would be useful to instead get the logs just from the master (leader). Is that significantly smaller? 🙏

@aphyr

Author

commented Aug 9, 2019

They're... all pretty huge. I actually just wiped the other node logs to get started on some new tests--figured it wasn't worth waiting the whole weekend for the other node logs to pack up. :(

@bmatican

Contributor

commented Aug 9, 2019

Haha, definitely not! erm... n3 wouldn't happen to have been the master leader? :)

In any case, yeah, an S3 upload would be sweet; we can copy it into our own S3 and go from there.

@aphyr

Author

commented Aug 9, 2019

OK! Upload starting! See y'all... in the morning, haha :)

@bmatican bmatican self-assigned this Aug 9, 2019

@bmatican bmatican added the area/sql label Aug 9, 2019

@aphyr

Author

commented Aug 9, 2019

@bmatican bmatican changed the title from 'OOM & very high tserver CPU use' to '[YSQL] Creating a table with "IF NOT EXISTS" leads to tables being created over and over again until OOM' Aug 19, 2019

@ndeodhar

Contributor

commented Aug 19, 2019

See related comment in #1991. Duplicate tables are being created due to an existing limitation with the way we maintain caches in master for postgres tables.

@bmatican bmatican assigned ndeodhar and unassigned bmatican Aug 21, 2019
