Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: fix the "heap" admin cmd printing always to error stream #47650

Merged
merged 1 commit into from Sep 1, 2022

Conversation

rzarzynski
Copy link
Contributor

Before the patch ceph::osd_cmds::heap() was confusing the concepts of stderr and stdout. This was the direct cause of the differences in output between ceph tell and ceph daeamon.

Thanks to Laura Flores who made the extremely useful observation noted in https://tracker.ceph.com/issues/57119#note-3.

Fixes: https://tracker.ceph.com/issues/57119
Signed-off-by: Radoslaw Zarzynski rzarzyns@redhat.com

CC: @ljflores.

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@rzarzynski rzarzynski requested a review from a team as a code owner August 17, 2022 13:24
@github-actions github-actions bot added the core label Aug 17, 2022
Copy link

@pdvian pdvian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Radoslaw! I saw your PR a bit late, actually I was working on similar approach. We need similar fix for MDS as well.

@ljflores
Copy link
Contributor

LGTM. Thanks Radoslaw! I saw your PR a bit late, actually I was working on similar approach. We need similar fix for MDS as well.

Agree with @pdvian, this looks good. Just need to be applied to MDS as well.

@rzarzynski
Copy link
Contributor Author

We need similar fix for MDS as well.

Working on that. However, when experimenting, I encountered a broader difference between tell and daemon on what is being done with the error stream if rval >= 0. It can be affecting more commands . See: https://tracker.ceph.com/issues/57119#note-9.

Before the patch `ceph::osd_cmds::heap()` was confusing
the concepts of _stderr_ and _stdout_. This  was the direct
cause of the differences in output between `ceph tell` and
`ceph daeamon`.

Thanks to Laura Flores who made the extremely useful observation
noted in https://tracker.ceph.com/issues/57119#note-3.

Fixes: https://tracker.ceph.com/issues/57119
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix.

@vshankar vshankar added the wip-rishabh-testing Rishabh's testing label label Aug 18, 2022
@rzarzynski
Copy link
Contributor Author

jenkins test make check

@rzarzynski
Copy link
Contributor Author

The following tests FAILED:
	195 - unittest_osdmap (Subprocess aborted)

@rzarzynski
Copy link
Contributor Author

jenkins retest this please

1 similar comment
@ljflores
Copy link
Contributor

jenkins retest this please

Copy link
Contributor

@ljflores ljflores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially worried about how this PR might change how mgr modules can access heap command output, i.e. Telemetry. Currently, Telemetry accesses heap stats output via the "error" stream, which is a bug as pointed out in this PR. But we still need to maintain this connection between the heap admin command and the mgr module.

After a conversation with @rzarzynski, we found that manager modules will still be able to access heap stats output despite it not being sent in the error stream. Instead, it will be accessible via outbl, as the out stream is appended there.

For example, Telemetry's current implementation for accessing heap stats is:
https://github.com/ceph/ceph/blob/main/src/pybind/mgr/telemetry/module.py#L478-L483

cmd_dict = {
    'prefix': 'heap',
    'heapcmd': 'stats',
    'id': str(osd_id),
}
r, outb, outs = self.osd_command(cmd_dict)

Where r is the ret value, outb contains JSON where applicable, and outs is the error stream which would contain heap stats string output.

With Radek's change, the heap stats output would be accessible via outb instead of outs. I previously thought that outb only held JSON output, but per Radek's PR, the result from the outstream is also appended there. This means that we can still access heap stats that way, simply by referencing outb instead of outs.

It is important to me that this communication between the heap admin command and the mgr is maintained, and now I see that it is.

@jdurgin this is was I was explaining to you earlier today; we have sorted it out! :)

@ljflores ljflores requested a review from jdurgin August 19, 2022 22:12
Copy link
Member

@jdurgin jdurgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Glad all the output is still available for the mgr too!

Copy link
Contributor

@ljflores ljflores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rzarzynski can we leave a note here under send_command to clarify how output is received?

Called by the plugin to send a command to the mon
cluster.

Called by the plugin to send a command to the mon
cluster.
+
+ Result contains three values:
+ 1. status int
+ 2. out std (can be used to load JSON output or receive standard string output)
+ 3. err str  (receives error string output)

Something like this to clarify how output can be processed in the manager module.

@ljflores ljflores added needs-qa needs-quincy-backport backport required for quincy needs-pacific-backport PR needs a pacific backport labels Aug 23, 2022
ljflores added a commit to ljflores/ceph that referenced this pull request Aug 24, 2022
Following the merge of ceph#47650, which
fixes the confusion between std out and std err in admin socket
commands, we will need to reference the out stream (outb) instead
of the error stream (outs) when we parse heap stats.

Signed-off-by: Laura Flores <lflores@redhat.com>
@ljflores
Copy link
Contributor

@rishabh-d-dave
Copy link
Contributor

QA run for CephFS was successful - https://tracker.ceph.com/projects/cephfs/wiki/Main#2022-Aug-26

ljflores added a commit that referenced this pull request Aug 26, 2022
Following the merge of #47650, which
fixes the confusion between std out and std err in admin socket
commands, we will need to reference the out stream (outb) instead
of the error stream (outs) when we parse heap stats.

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 025f810)
@yaarith
Copy link
Contributor

yaarith commented Aug 31, 2022

jenkins test make check arm64

@yaarith
Copy link
Contributor

yaarith commented Aug 31, 2022

jenkins test windows

@NitzanMordhai
Copy link
Contributor

NitzanMordhai commented Sep 1, 2022

Unrelated failure:

  1. 7003409\7003415\7003419 - The CustomResourceDefinition "installations.operator.tigera.io" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
  2. 7003410\7003424\7003414 - ERROR: A cluster with the same fsid '00000000-0000-0000-0000-0000deadbeef' already exists.
  3. 7003412 - 'check osd count' reached maximum tries (90) after waiting for 900 seconds
  4. 7003413\7003422 - free(): invalid pointer

tracked by:

  1. https://tracker.ceph.com/issues/57368
  2. https://tracker.ceph.com/issues/57290
  3. https://tracker.ceph.com/issues/52321
  4. https://tracker.ceph.com/issues/57163

@yuriw
Copy link
Contributor

yuriw commented Sep 1, 2022

jenkins test make check

@yuriw
Copy link
Contributor

yuriw commented Sep 1, 2022

jenkins test make check

@yuriw yuriw merged commit e1f12f7 into ceph:main Sep 1, 2022
Pegonzal pushed a commit that referenced this pull request Sep 21, 2022
Following the merge of #47650, which
fixes the confusion between std out and std err in admin socket
commands, we will need to reference the out stream (outb) instead
of the error stream (outs) when we parse heap stats.

Signed-off-by: Laura Flores <lflores@redhat.com>
Pegonzal pushed a commit that referenced this pull request Oct 13, 2022
Following the merge of #47650, which
fixes the confusion between std out and std err in admin socket
commands, we will need to reference the out stream (outb) instead
of the error stream (outs) when we parse heap stats.

Signed-off-by: Laura Flores <lflores@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
10 participants