[YSQL] Creating a table with "IF NOT EXISTS" leads to tables being created over and over again until OOM #2001

Closed
aphyr opened this issue Aug 8, 2019 · 9 comments

@aphyr aphyr commented Aug 8, 2019

I had a cluster of 1.3.1 nodes go very sideways in a Jepsen test last night. During table creation (which uses CREATE TABLE ... IF NOT EXISTS), one particular table, append4, repeatedly failed to create, logging:

INFO [2019-08-07 17:41:37,057] jepsen worker 2 - yugabyte.ysql.append Creating table append0
INFO [2019-08-07 17:41:37,076] jepsen worker 2 - yugabyte.ysql.append Creating table append1
INFO [2019-08-07 17:41:37,081] jepsen worker 2 - yugabyte.ysql.append Creating table append2
INFO [2019-08-07 17:41:37,088] jepsen worker 2 - yugabyte.ysql.append Creating table append3
INFO [2019-08-07 17:41:37,092] jepsen worker 2 - yugabyte.ysql.append Creating table append4
WARN [2019-08-07 17:41:43,612] jepsen worker 2 - yugabyte.ysql.append Encountered error with conn "n3"; reopening
INFO [2019-08-07 17:41:43,697] jepsen worker 2 - yugabyte.ysql.append Caught ERROR: type "append4" already exists
  Hint: A relation has an associated type of the same name, so you must use a name that doesn't conflict with any existing type. during DDL setup; retrying.

... over and over. We've seen this error before, during testing, but it's usually transient. This time it went on for hours before the box OOMed, and presumably killed a bunch of processes, but left one tserver (on node n3) spinning at 93-97% CPU use, on a 48-way (with HT) box.
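
For illustration only, a client-side retry loop like the one described above could be given a fixed budget so that a persistent DDL error surfaces instead of retrying for hours. This is a minimal sketch, not the Jepsen client code: `create_table` is a hypothetical callable that issues the CREATE TABLE ... IF NOT EXISTS statement and raises on failure, and `setup_table_with_retries`, `max_attempts`, `delay_s`, and `DDLSetupError` are made-up names.

```python
import time


class DDLSetupError(Exception):
    """Hypothetical error raised once the retry budget is exhausted."""


def setup_table_with_retries(create_table, table_name, max_attempts=5, delay_s=0.0):
    """Call create_table(table_name), retrying at most max_attempts times.

    create_table is a hypothetical callable that runs
    CREATE TABLE IF NOT EXISTS and raises an exception on failure.
    Returns the number of attempts that were needed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            create_table(table_name)
            return attempt
        except Exception as err:  # a real client would catch its driver's error type
            last_error = err
            time.sleep(delay_s)
    raise DDLSetupError(
        f"gave up creating {table_name} after {max_attempts} attempts"
    ) from last_error
```

A bounded loop does not address the server-side behaviour reported here; it only stops the client from re-issuing the statement indefinitely.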

Tasks: 2132 total,   5 running, 2127 sleeping,   0 stopped,   0 zombie
%Cpu(s): 93.9 us,  4.3 sy,  0.0 ni,  1.2 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem : 13198026+total, 77707432 free, 36253092 used, 18019740 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 89351608 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND      
45420 root      20   0  0.496t 0.014t 359320 S 94.6 11.1  23002:03 yb-tserver

Some debugging data:

mem-trackers
threadz?group=all
contention
rpcz.txt

Unfortunately I wasn't able to get a corefile--attaching GDB to the process made it crash immediately.

This happened with Jepsen af7285b96952258f3e3cb22cb18796ddbf37c56f, running

lein run test-all --os debian --version 1.3.1.0 --time-limit 5 -w ysql/append --test-count 50 --concurrency 1n
@ttyusupov ttyusupov added this to To do in Jepsen Testing via automation Aug 8, 2019
@bmatican bmatican (Contributor) commented Aug 9, 2019

@aphyr I'm assuming you don't have the logs for this (or they were too massive). Dug a bit into the sampled data you were able to collect and found this:

grep 'tablet-' ~/Downloads/mem-trackers.txt | wc
   19961   39922 1976095

So my assumption is that we kept generating tables, 19961 tablets' worth, until we could not create any more... cc @m-iancu @ndeodhar I'm guessing this is still txn DDL related, despite CREATE .. IF NOT EXISTS: if the ysql system metadata did not get updated, it would not matter that the master had created the tables, correct?
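
The `grep | wc` line count above can also be reproduced in Python. This is a sketch, assuming the mem-trackers dump has been saved as a plain-text file; the function name `count_tablet_lines` is illustrative.

```python
def count_tablet_lines(path, needle="tablet-"):
    """Count lines in a text dump that contain needle, like `grep needle | wc -l`."""
    count = 0
    # errors="replace" keeps the scan going if the dump has stray bytes
    with open(path, encoding="utf-8", errors="replace") as dump:
        for line in dump:
            if needle in line:
                count += 1
    return count
```

Note that this counts matching lines, so it equals the number of tablets only if each tablet appears on exactly one matching line of the dump.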

@aphyr aphyr (Author) commented Aug 9, 2019

I've been working on logs--I have a 6.3GB tarball of logs from n3, which took all day to make, haha. Would it be helpful for me to throw that up on S3? It's 121GB expanded, which might... be difficult...

@bmatican bmatican (Contributor) commented Aug 9, 2019

@aphyr if I'm right, maybe it would be useful to instead get the logs just from the master (leader). Is that significantly smaller? 🙏

@aphyr aphyr (Author) commented Aug 9, 2019

They're... all pretty huge. I actually just wiped the other node logs to get started on some new tests--figured it wasn't worth waiting the whole weekend for the other node logs to pack up. :(

@bmatican bmatican (Contributor) commented Aug 9, 2019

Haha, definitely not! Erm... n3 wouldn't happen to have been the master leader? :)

In any case, yeah, some s3 upload would be sweet, we can copy it in our own s3 and go from there.

@aphyr aphyr (Author) commented Aug 9, 2019

OK! Upload starting! See y'all... in the morning, haha :)

@bmatican bmatican self-assigned this Aug 9, 2019
@aphyr aphyr (Author) commented Aug 9, 2019

@bmatican bmatican changed the title from “OOM & very high tserver CPU use” to “[YSQL] Creating a table with "IF NOT EXISTS" leads to tables being created over and over again until OOM” Aug 19, 2019
@ndeodhar ndeodhar (Contributor) commented Aug 19, 2019

See related comment in #1991. Duplicate tables are being created due to an existing limitation with the way we maintain caches in master for postgres tables.

@bmatican bmatican assigned ndeodhar and unassigned bmatican Aug 21, 2019
@m-iancu m-iancu added this to To do in YSQL via automation Mar 3, 2020
@m-iancu m-iancu assigned frozenspider and m-iancu and unassigned ndeodhar Mar 4, 2020
@m-iancu m-iancu added this to To do in DDL improvements via automation Mar 4, 2020
@m-iancu m-iancu added this to the v2.0 milestone Mar 4, 2020
@m-iancu m-iancu removed this from the v2.0 milestone Mar 4, 2020
@m-iancu m-iancu added this to the v2.2 milestone Mar 4, 2020
@frozenspider frozenspider (Contributor) commented Jun 15, 2021

Couldn't reproduce using the given Jepsen version and command line (with much higher test-count though) - RAM usage (free without page cache, as reported by free -h) is sitting at ~2.5 GB without increase.

Jepsen Testing automation moved this from To do to Done Jun 15, 2021
YSQL automation moved this from To do to Done Jun 15, 2021
Projects
YSQL
  
Done
7 participants