Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest/jepsen: failure-logs.tbz archiving is broken #73931

Open
aliher1911 opened this issue Dec 16, 2021 · 6 comments
Open

roachtest/jepsen: failure-logs.tbz archiving is broken #73931

aliher1911 opened this issue Dec 16, 2021 · 6 comments
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-roachtest T-kv KV Team T-testeng TestEng Team

Comments

@aliher1911
Copy link
Contributor

aliher1911 commented Dec 16, 2021

Jepsen test failure is captured by running tar to stdout and reading other side of ssh pipe. We seem to merge stdout with stderr somewhere along the way so resulting archive file contains garbage if tar issues a warning.

Piece of code that captures archive:

if output, err := c.RunWithBuffer(
ctx, t.L(), controller,
// -h causes tar to follow symlinks; needed by the "latest" symlink.
// -f- sends the output to stdout, we read it and save it to a local file.
"tar -chj --ignore-failed-read -C /mnt/data1/jepsen/cockroachdb -f- store/latest invoke.log",
); err != nil {
t.L().Printf("failed to retrieve jepsen artifacts and invoke.log: %s", err)
} else if err := ioutil.WriteFile(filepath.Join(outputDir, "failure-logs.tbz"), output, 0666); err != nil {
t.Fatal(err)
} else {
t.L().Printf("downloaded jepsen logs in failure-logs.tbz")
}

Resulting file - see that the data is prefixed with a warning before going to BZ where actual archive starts.

00000000  74 61 72 3a 20 73 74 6f  72 65 2f 6c 61 74 65 73  |tar: store/lates|
00000010  74 3a 20 57 61 72 6e 69  6e 67 3a 20 43 61 6e 6e  |t: Warning: Cann|
00000020  6f 74 20 73 74 61 74 3a  20 4e 6f 20 73 75 63 68  |ot stat: No such|
00000030  20 66 69 6c 65 20 6f 72  20 64 69 72 65 63 74 6f  | file or directo|
00000040  72 79 0a 42 5a 68 39 31  41 59 26 53 59 de dc 95  |ry.BZh91AY&SY...|
00000050  99 00 10 0e ff 8c d0 30  01 00 5c e7 ff f5 ff f9  |.......0..\.....|
00000060  dd 9a ff ff ff fb 10 00  00 02 00 08 60 0f 1e 3e  |............`..>|
00000070  22 50 f4 d0 28 00 00 20  2a aa 01 40 50 a0 00 95  |"P..(.. *..@P...|
00000080  05 00 4a 1c c2 68 0d 01  a3 46 11 a0 c4 69 89 93  |..J..h...F...i..|
00000090  13 41 84 68 19 00 c9 81  cc 26 80 d0 1a 34 61 1a  |.A.h.....&...4a.|
000000a0  0c 46 98 99 31 34 18 46  81 90 0c 98 11 4f d5 4f  |.F..14.F.....O.O|
000000b0  d4 d4 68 00 00 f4 80 00  00 00 00 00 00 01 cc 26  |..h............&|
000000c0  80 d0 1a 34 61 1a 0c 46  98 99 31 34 18 46 81 90  |...4a..F..14.F..|
000000d0  0c 98 08 a4 84 62 0d 34  8c 4d 34 d1 aa 7a 7a 11  |.....b.4.M4..zz.|

Data could be extracted using
tail -c +68 artifacts/jepsen/bank/majority-ring/run_1/failure-logs.tbz | tar -xjv
in this particular case, but offset may differ depending on errors and get mixed together in worst case.

Jira issue: CRDB-11816

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Dec 16, 2021
@srosenberg srosenberg added this to Triage in Test Engineering via automation Dec 16, 2021
@tbg
Copy link
Member

tbg commented Dec 17, 2021

I think this is on KV honestly (who owns jepsen and I would keep it that way unless we decide that jepsen is now test-eng's problem, which might be a conversation worth having but not as part of this issue) - RunWithOutput captures the entire output, including stderr. Since store/latest doesn't exist, we can expect tar to print a warning.

We can probably fix this like this:

 "tar -chj --ignore-failed-read -C /mnt/data1/jepsen/cockroachdb -f- store/latest invoke.log 2>/dev/null", 

or we can get fancy and do something like this (there might be better options)

"find . -maxdepth 2 '(' -name latest -or -name invoke.log ')' | xargs tar -chj -f-"

@tbg tbg removed this from Triage in Test Engineering Dec 17, 2021
@tbg tbg added this to Incoming in KV via automation Dec 17, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 17, 2021
@tbg
Copy link
Member

tbg commented Dec 17, 2021

The other thing we could do, now that roachprob is starting get used as a library, is to give the library version of roachprod run the capability to collect stderr and stdout separately. This would fall on dev-inf as the owners of roachprod, which some plumbing in roachtest that could be done by dev-inf (to adjust c.RunWithoutput or rather remove it in lieu of passing options to c.Run).

@aliher1911
Copy link
Contributor Author

This is the second time where I had roachtests being hard to fix because of this behavior (combining stdout with stderr).
I think the change should be straightforward but a bit tedious as there are many call sites using those execs.
Fixing jepsen test is a bandaid IMHO which we should steer away from. Lets reduce the mess instead.

@tbg
Copy link
Member

tbg commented Dec 21, 2021

I agree (you are saying that you prefer improving the interface of RunWithOutput, right), but I think we need to hold off on that for a bit until @healthy-pod has their changes in (to avoid merge conflicts).

@healthy-pod
Copy link
Contributor

I already added this to the library. For the new (unmerged) jepsen.go, if you only need result.Stdout, I only need to change line 293 from result.Stdout+result.Stderr to result.Stdout.

@mwang1026 mwang1026 moved this from Incoming to roachtest/unit test backlog in KV Jan 11, 2022
@tbg tbg added this to Triage in Test Engineering via automation Nov 15, 2022
@tbg tbg removed this from roachtest/unit test backlog in KV Nov 15, 2022
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Nov 15, 2022
@blathers-crl
Copy link

blathers-crl bot commented Nov 15, 2022

cc @cockroachdb/test-eng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-roachtest T-kv KV Team T-testeng TestEng Team
Projects
Development

No branches or pull requests

4 participants