
ceph-volume: add inventory command #24859

Merged: 1 commit merged into ceph:master on Nov 9, 2018

Conversation

jan--f
Contributor

@jan--f jan--f commented Oct 31, 2018

The inventory command provides information about a node's disk inventory.
The output can be formatted as plain text or json.

Signed-off-by: Jan Fajerski jfajerski@suse.com
Fixes: http://tracker.ceph.com/issues/24972

@jan--f
Contributor Author

jan--f commented Oct 31, 2018

I kept the per-drive report and the json-formatted reports quite verbose; tighter filtering could easily be added.
Some example output:

machine:~/jan--f-ceph-de5dd0a/src/ceph-volume # ceph-volume inventory

Device Path               Size         rotates valid   Model name
/dev/sdl                  185.75 GB    True    False   PERC H700
/dev/sdm                  185.75 GB    True    True    PERC H700
/dev/sdj                  1.82 TB      True    False   PERC H700
/dev/sdk                  1.82 TB      True    False   PERC H700
/dev/sdh                  1.82 TB      True    False   PERC H700
/dev/sdi                  1.82 TB      True    False   PERC H700
/dev/sdf                  1.82 TB      True    True    PERC H700
/dev/sdg                  1.82 TB      True    False   PERC H700
/dev/sdd                  1.82 TB      True    True    PERC H700
/dev/sde                  1.82 TB      True    True    PERC H700
/dev/sdb                  1.82 TB      True    True    PERC H700
/dev/sdc                  1.82 TB      True    True    PERC H700
/dev/sda                  1.82 TB      True    True    PERC H700
machine:~/jan--f-ceph-de5dd0a/src/ceph-volume # ceph-volume inventory /dev/sdg
Device report /dev/sdg
rejected reasons          ['locked'] 
path                      /dev/sdg 
valid                     False 
scheduler_mode            cfq 
rotational                1 
vendor                    DELL 
human_readable_size       1.82 TB 
sectors                   0 
sas_device_handle          
partitions                {'sdg1': {'start': '2048', 'sectorsize': 512, 'sectors': '204800', 'size': '100.00 MB'}} 
rev                       2.10 
sas_address                
locked                    1 
sectorsize                512 
removable                 0 
path                      /dev/sdg 
support_discard            
model                     PERC H700 
ro                        0 
nr_requests               128 
size                      1.9998441472e+12 
GROUP                     disk 
DISC-MAX                  0B 
FSTYPE                     
MODEL                     PERC H700 
SCHED                     cfq 
MODE                      brw-rw---- 
ROTA                      1 
RM                        0 
RO                        0 
UUID                       
STATE                     running 
MOUNTPOINT                 
LABEL                      
SIZE                      1.8T 
MAJ:MIN                   8:96 
DISC-GRAN                 0B 
NAME                      sdg 
PARTLABEL                  
PKNAME                     
LOG-SEC                   512 
DISC-ALN                  0 
ALIGNMENT                 0 
DISC-ZERO                 0 
OWNER                     root 
KNAME                     sdg 
TYPE                      disk 
PHY-SEC                   512 

@jan--f
Contributor Author

jan--f commented Oct 31, 2018

Related to #24768

@ErwanAliasr1
Contributor

Can you show us the output in json when no device is passed? Reading the code makes me think that all the devices will be reported.

@ErwanAliasr1
Contributor

This patch exposes ceph-volume's internal data structures. That could be acceptable, but it creates a serious dependency for 3rd-party tools like orchestrators. If we merged it in its current state, those exposed data structures could never be renamed or amended (data types etc.).

That might eventually lead to versioning the data structure to track possible changes in the format / types.

That might sound like overkill, but we may need an abstraction here that maps internals to "external" names, like in an API, so the reported structure stays consistent even if the internal data structure changes.

@alfredodeza
Contributor

This looks like a good first pass. I don't think that the key/value from the data structure needs to change or be renamed. No need to version things either.

However, we would need to improve the presentation a bit for the non-json format, so that the highly verbose output can be parsed a bit better. One thing we've done in other commands is adding a separator between items, and indenting the key/values for each section. ceph-volume lvm list does this for example.

Another thing we've done is to be fully verbose with JSON output, but minimally so with the pretty (or non-json) format. That way it helps trim down the repetitive items.

Finally, I think that some filtering could be added to include or exclude attributes from being shown. That might be too much effort for an initial pass, but good to keep in mind!

@jan--f
Contributor Author

jan--f commented Nov 1, 2018

Can you show us the output in json when no device is passed? Reading the code makes me think that all the devices will be reported.

Yes, all devices will be reported; that's what I took away from your comment here. See this paste for json output

However, we would need to improve the presentation a bit for the non-json format

Agreed...this needs some work. I was trying to avoid too many details before people got a chance to comment.

Another thing we've done is to be fully verbose with JSON output, but minimally so with the pretty (or non-json) format. That way it helps trim down the repetitive items.

Same approach here. Only for the per-drive non-json output I wasn't sure what not to print. I'll remove a bunch of the less interesting fields to see what people think.

@ErwanAliasr1
Contributor

I don't think that the key/value from the data structure needs to change or be renamed. No need to version things either.

I'm not saying that we have to change or rename the data structures.
This patch exposes the internal data structure of ceph-volume to external tools, in the sense of an API. That implies that orchestrators will consider the fields we expose as stable in name and format.
It's rarely a good idea to expose internal data structures as-is as an external format, as this prevents any internal change to those items, even more so if it's not versioned.

So two options:

  • we keep exposing the internal structures (at risk): that implies strong testing that NONE of those fields is ever changed/touched in any way, since that would break the 3rd-party tools that consume them
  • we install a translation table between an external format (to be defined and guaranteed to orchestrators) and our internal representation of it; a sketch follows below. In the current state that would be a pretty much 1-to-1 table, but it would leave room to modify our internal structures in the future if needed
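
For illustration, a minimal sketch of the second option in Python (the mapping and helper name are hypothetical, not ceph-volume's actual API):

    # Hypothetical translation table: external (guaranteed) name -> internal attribute.
    # External names stay stable even if internal attributes get renamed.
    EXTERNAL_SCHEMA = {
        'path': 'path',
        'rejected_reasons': 'rejected_reasons',
        'available': 'valid',
        'sys_api': 'sys_api',
    }

    def external_report(device):
        # Build the externally exposed dict from the internal attributes.
        return {external: getattr(device, internal)
                for external, internal in EXTERNAL_SCHEMA.items()}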

@ErwanAliasr1
Contributor

@jan--f The pretty output is unsorted and seems to be presented in an order that matches Python's internal representation. This ordering has strictly no meaning for the admin/user.
It would maybe be nice to have

  • an a-z sort on the device name?
  • a sort on the 'valid' field? (see the sketch below)
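
A minimal sketch of that combined ordering, assuming Device objects with the valid and path attributes shown in the output above:

    # Sort valid devices first (False < True), then a-z by device path.
    devices.sort(key=lambda dev: (not dev.valid, dev.path))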

@ErwanAliasr1
Contributor

I think we are missing reporting which disks are actually used by ceph. We know which block devices are backend storage for our OSDs; in that case, the disk should be reported as in use by ceph (fsid? osds?).
That would make for a complete view of the state of a machine in a single command.
In the current state, a disk could be reported as "in-use", meaning "non-free for creating an OSD", while it's actually already used by ceph. I think it would be better for the admin to understand:

  • what drives are used by ceph
  • what drives are used by something else (rejected reasons)
  • what drives are free for new OSDs

@alfredodeza
Contributor

this PR shouldn't get derailed by a discussion on what @ErwanAliasr1 thinks about "internal" vs. "external". Throughout ceph-volume we keep consistency between system APIs (like output from LVM) and what we present back to the user.

@jan--f I am happy about the current direction, let me know when this is ready for a thorough review!

@ErwanAliasr1
Contributor

@alfredodeza I really wonder why you consider a remark about exposing internal data structures as the default format for external tools (orchestrators in this case) to be derailing this PR. Really.
Does Ceph expose its internal data structures? I don't think so.
That's a general software design consideration.
Is it not allowed to raise such considerations, so that possible internal changes don't break the full chain between ceph-volume and the orchestrators?

@ErwanAliasr1
Contributor

@alfredodeza one more time, I would have appreciated a technical debate rather than a moral judgment. Derailed. Wow.

@jan--f
Contributor Author

jan--f commented Nov 1, 2018

I'd argue that exposing a dict of {str: str} is manageable, even if it could be considered an internal data structure. Should this at some point change to a more complex data structure, further abstractions (like a translation table) could easily be added. At this point I'd consider it unnecessary complexity.

Also, the information consists of physical disk attributes, kernel information, and c-v-specific info that we explicitly want to expose. None of these pieces is likely to change dramatically.
And again, some of this info can certainly be filtered out.

@ErwanAliasr1
Contributor

@jan--f If we go in that direction, we have to add a test that ensures the structure members we expose, like disk_api, path, reject_reasons, sys_api/*, valid, are not renamed and keep the same data types (string/int/array/...). If we don't do this, any commit that touches them could break the orchestrator/manager/UI.
That test will protect our consistency towards 3rd-party tools.
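
A minimal sketch of such a test, assuming a hypothetical device_report fixture and the top-level keys visible in the json-pretty paste later in this thread:

    def test_inventory_report_keys_are_stable(device_report):
        # Pin the exposed top-level keys so a rename breaks CI
        # before it breaks ceph-mgr or an orchestrator.
        expected = {'path', 'rejected_reasons', 'sys_api', 'lvs', 'valid'}
        assert set(device_report.keys()) == expected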

@tserong
Contributor

tserong commented Nov 2, 2018

This looks generally reasonable to me, but I've not yet spent any real time looking at the ceph-volume code.

Just so I understand, @ErwanAliasr1, you're concerned that this may expose internal data structures of ceph-volume, but @alfredodeza is saying that what's exposed is really only what's passed through from the underlying system tools?

@ErwanAliasr1
Contributor

@tserong This feature has two users: humans, and a chain of 3rd-party tools including ceph-manager and the orchestrators. The json output of this command will be consumed by that chain of 3rd-party tools. That implies that once the c-v output is defined, it will be consumed and de facto considered the expected output of c-v for this command, so those 3rd-party tools will implement their own decoding/parsing of it.

To function, c-v reads information on the system from various sources and stores it in data structures for later use, either by other c-v functions or for presenting to the user as output.

What concerns me here is that the output of this command is an automated extraction of those data structures. Once ceph-mgr/orchestrator starts consuming this output, a commit that changes any of those exposed data structures (name or format) breaks the whole chain.

When you make a commit on the internals of the project, it's not obvious that it could immediately break that whole chain. So, in my opinion, we should protect ceph-volume against such changes by having a test that guarantees no one breaks the data structures we expose as the default format for ceph-mgr/orchestrator.

We could also add an indirection layer that allows changes inside the data structures while keeping a stable output. That may be added later, once the test I'm describing reports a breakage in the output. The two approaches are complementary.

@alfredodeza considers that I'm derailing the PR by arguing that the work we are doing here with the manager should be stable and guaranteed over time. This chain is supposed to be one of the features of nautilus, as repeated by @liewegas.
We should work in a way that doesn't break our colleagues' work (ceph-mgr/orchestrator) every morning.

@jan--f
Contributor Author

jan--f commented Nov 2, 2018

I updated the commit with improved formatting, sorted output and a test case to make sure the interface looks as expected. I also dropped the disk_api field from the report since it contained a lot of duplicate info, while the rest didn't seem all that interesting.

Looking forward to comments.

@ErwanAliasr1 I certainly understand your concerns on the conceptual level. I think it's worth keeping in mind though that

  • we aren't talking about a complex data structure whose internals might change...it's a dictionary
  • it only carries disk information from the kernel...this info is unlikely to change
  • even with an indirection layer we can only ensure so much continuity automatically. PRs can still change the indirection layer, and automatic tests (like the one I just added) can be changed. Tools consuming this interface will have to do their own testing regardless.

@jan--f jan--f force-pushed the ceph-volume-inventory branch 2 times, most recently from 909030a to c6c7ce3 Compare November 5, 2018 10:45
@sebastian-philipp
Contributor

With respect to the (imo valid) discussion about exposing internal data structures, what about reviewing the data we want to expose?

Device report /dev/sdg
rejected reasons          ['locked'] 
path                      /dev/sdg 
valid                     False 
scheduler_mode            cfq 
rotational                1 
vendor                    DELL 
human_readable_size       1.82 TB 
sectors                   0 
sas_device_handle          
partitions                {'sdg1': {'start': '2048', 'sectorsize': 512, 'sectors': '204800', 'size': '100.00 MB'}} 
rev                       2.10 
sas_address                
locked                    1 
sectorsize                512 
removable                 0 
path                      /dev/sdg 
support_discard            
model                     PERC H700 
ro                        0 
nr_requests               128 
size                      1.9998441472e+12 
GROUP                     disk 
DISC-MAX                  0B 
FSTYPE                     
MODEL                     PERC H700 
SCHED                     cfq 
MODE                      brw-rw---- 
ROTA                      1 
RM                        0 
RO                        0 
UUID                       
STATE                     running 
MOUNTPOINT                 
LABEL                      
SIZE                      1.8T 
MAJ:MIN                   8:96 
DISC-GRAN                 0B 
NAME                      sdg 
PARTLABEL                  
PKNAME                     
LOG-SEC                   512 
DISC-ALN                  0 
ALIGNMENT                 0 
DISC-ZERO                 0 
OWNER                     root 
KNAME                     sdg 
TYPE                      disk 
PHY-SEC                   512 

Looks sane to me. I have just minor things to consider here:

  1. what about exposing all keys as lower-case?
  2. There are a few duplicates. Can we simplify the output of inventory here? (a sketch of both suggestions follows the list)
  • model, MODEL
  • path, KNAME
  • rotational, ROTA
  • ro, RO
  • scheduler_mode, SCHED
  • sectorsize, PHY-SEC
  • size, SIZE
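
A minimal sketch of both suggestions, lower-casing keys and collapsing the duplicate pairs above onto one canonical name (the mapping is illustrative, not ceph-volume code):

    # Map the upper-case lsblk-style keys onto their sys_api twins.
    DUPLICATES = {
        'MODEL': 'model',
        'KNAME': 'path',
        'ROTA': 'rotational',
        'RO': 'ro',
        'SCHED': 'scheduler_mode',
        'PHY-SEC': 'sectorsize',
        'SIZE': 'size',
    }

    def normalize(report):
        # Lower-case everything and keep the first value seen for a
        # canonical key, dropping the duplicate spellings.
        out = {}
        for key, value in report.items():
            canonical = DUPLICATES.get(key, key.lower())
            out.setdefault(canonical, value)
        return out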

@jan--f jan--f force-pushed the ceph-volume-inventory branch 3 times, most recently from 0b898a7 to e015754 Compare November 9, 2018 08:45
@jan--f
Contributor Author

jan--f commented Nov 9, 2018

Example output with logical volume scanning:

 # /usr/bin/ceph-volume inventory /dev/vdg

====== Device report /dev/vdg ======

     rejected reasons          ['locked']
     path                      /dev/vdg
     valid                     False
     scheduler mode            mq-deadline
     rotational                1
     vendor                    0x1af4
     human readable size       20.00 GB
     sas address               
     removable                 0
     model                     
     ro                        0
    --- Logical Volume ---
     cluster name              ceph
     name                      osd-block-aebd57ce-d26f-4282-8362-c042f95d1397
     osd id                    29
     cluster fsid              f386cd00-2c98-413f-a332-c1cd81327731
     type                      block
     block uuid                C5tXmI-iJE5-zBww-Rmg4-zB22-GjLt-xYZV7h
     osd fsid                  aebd57ce-d26f-4282-8362-c042f95d1397


/usr/bin/ceph-volume inventory --format json-pretty
[{'lvs': [],
  'path': '/dev/vda',
  'rejected_reasons': ['locked'],
  'sys_api': {'human_readable_size': '40.00 GB',
              'locked': 1,
              'model': '',
              'nr_requests': '256',
              'partitions': {'vda1': {'sectors': '83884032',
                                      'sectorsize': 512,
                                      'size': '40.00 GB',
                                      'start': '2048'}},
              'path': '/dev/vda',
              'removable': '0',
              'rev': '',
              'ro': '0',
              'rotational': '1',
              'sas_address': '',
              'sas_device_handle': '',
              'scheduler_mode': 'mq-deadline',
              'sectors': 0,
              'sectorsize': '512',
              'size': 42949672960.0,
              'support_discard': '',
              'vendor': '0x1af4'},
  'valid': False},
 {'lvs': [{'block_uuid': 'fMW1iz-W77O-1Nsn-B0m4-mQfN-oHcp-x5DekY',
           'cluster_fsid': 'f386cd00-2c98-413f-a332-c1cd81327731',
           'cluster_name': 'ceph',
           'name': 'osd-block-78c985cc-89da-4ea4-a3d8-c11be5acd21d',
           'osd_fsid': '78c985cc-89da-4ea4-a3d8-c11be5acd21d',
           'osd_id': '3',
           'type': 'block'}],
  'path': '/dev/vdb',
  'rejected_reasons': ['locked'],
  'sys_api': {'human_readable_size': '20.00 GB',
              'locked': 1,
              'model': '',
              'nr_requests': '256',
              'partitions': {'vdb1': {'sectors': '41940959',
                                      'sectorsize': 512,
                                      'size': '20.00 GB',
                                      'start': '2048'}},
              'path': '/dev/vdb',
              'removable': '0',
              'rev': '',
              'ro': '0',
              'rotational': '1',
              'sas_address': '',
              'sas_device_handle': '',
              'scheduler_mode': 'mq-deadline',
              'sectors': 0,
              'sectorsize': '512',
              'size': 21474836480.0,
              'support_discard': '',
              'vendor': '0x1af4'},
  'valid': False}]

I'll try to add output from a shared device as well, don't have one handy right now.

@ErwanAliasr1
Contributor

Thanks @jan--f. Your PR is very promising for getting this important feature into the product.
Any idea on how to handle ceph-disk enabled devices? It's really important that we recognize them as being used by ceph, and not just rejected because they're already in use.

@ErwanAliasr1
Contributor

@jan--f The rejected_reasons and valid fields were added to indicate whether a device is free for creating a new OSD. So if we know it's already an OSD, there is no need to report that information. That could be confusing.

@jan--f
Contributor Author

jan--f commented Nov 9, 2018

Thanks @jan--f. Your PR is very promising for getting this important feature into the product.
Any idea on how to handle ceph-disk enabled devices? It's really important that we recognize them as being used by ceph, and not just rejected because they're already in use.

The Device class is aware of ceph-disk OSDs, iiuc. I'll add it to the report.

@jan--f The rejected_reasons and valid fields were added to indicate whether a device is free for creating a new OSD. So if we know it's already an OSD, there is no need to report that information. That could be confusing.

I think those fields each serve their purpose. I'd rather not change the report structure based on its content.

@ErwanAliasr1
Contributor

I think those fields each serve their purpose. I'd rather not change the report structure based on its content.

That makes sense.

Maybe I should rename "valid" to something like "free", "empty", or another term meaning the disk could be used to create a new OSD; "valid" is very, very confusing in this context. Any thoughts? @alfredodeza?

@alfredodeza
Contributor

I think we can make valid be usable or available. But I don't think that should be done in this PR.

@@ -57,43 +129,96 @@ def __repr__(self):
prefix = 'Raw Device'
return '<%s: %s>' % (prefix, self.abspath)

def pretty_report(self):
output = ['\n====== Device report {} ======\n'.format(self.path)]
output.extend(

@alfredodeza alfredodeza Nov 9, 2018

the last thing I see that could be improved is special-casing the rejected reasons. Right now it displays a Python list. If the list grows to more than one item, it defeats the purpose of using the pretty option (might as well just parse the JSON).

When a device is not available, it should be pretty clear that it isn't (to avoid having the user scroll around looking for it). Some things that might help would be making the title/header red, and placing the rejected devices always first (or always last?).

When the rejected reasons are displayed, they shouldn't use the list; they should be rendered one per line, something like:

====== Device report /dev/vdg ======

     rejected reasons          
         * locked
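
A minimal sketch of that rendering (the helper name is made up for illustration):

    def pretty_rejected(reasons):
        # One indented line per rejected reason instead of a raw Python list.
        lines = ['     rejected reasons']
        lines.extend('         * {}'.format(reason) for reason in reasons)
        return '\n'.join(lines)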

jan--f (Contributor Author)

Sorting is already implemented. Devices are sorted by the valid flag (valid < invalid) and the device name otherwise.

alfredodeza (Contributor)

Right, but if you are displaying 100 devices and 84 of them are invalid, you still need to browse through quite a bunch to find where the division between valid/invalid starts.

Again, I am not saying this is necessary for this PR; if prioritizing, I think the rejected reasons fix would be preferable.

jan--f (Contributor Author)

I could flip the ordering. Maybe it makes more sense, though, to add the ability to simply not display invalid devices?

ErwanAliasr1 (Contributor)

In my opinion, the default should be listing everything.
We have 3 categories of drives:

  • totally free to be used for new OSDs
  • used by ceph
  • used by anything else

It could be useful to have flags that keep only a single category, to ease reading for humans or to give an orchestrator a way to pre-filter the device types, like "please tell me which devices are free to be used".

jan--f (Contributor Author)

I think this would mostly be interesting for the pretty report. Any software that interacts with the json output can easily filter it, and an orchestrator implementation might actually want that info (how many unavailable disks); see the sketch below.

I do think this would be best addressed in a separate PR.
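
A minimal sketch of such consumer-side filtering, assuming --format json emits the same fields as the json-pretty paste above (path, valid, lvs):

    import json
    import subprocess

    # Ask ceph-volume for the machine-readable inventory of all devices.
    raw = subprocess.check_output(
        ['ceph-volume', 'inventory', '--format', 'json'])
    inventory = json.loads(raw)

    # Devices free for new OSDs vs. devices already carrying ceph LVs.
    free = [dev['path'] for dev in inventory if dev['valid']]
    used_by_ceph = [dev['path'] for dev in inventory if dev['lvs']]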

@alfredodeza alfredodeza left a comment

This looks good to go. My last nit about rejected reasons can be done separately (feel free to address it here though), as well as changing valid to some other name. This PR is about introducing the inventory sub-command; I'd like to avoid feature creep.

@ErwanAliasr1
Copy link
Contributor

Once this PR is merged, I'll update the semantics of "valid".

@jan--f
Contributor Author

jan--f commented Nov 9, 2018

Once this PR is merged, I'll update the semantics of "valid".

@ErwanAliasr1 I have code to address http://tracker.ceph.com/issues/36701. A property rename could easily be part of that. I did use available in an earlier iteration of this.

The inventory command provides information about a node's disk inventory.
Existing logical volumes on a disk or one of its partitions are scanned
and reported.
The output can be formatted as plain text or json.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
@jan--f
Contributor Author

jan--f commented Nov 9, 2018

k, I added formatting for the rejected_reasons field. The remaining comments should be addressed in separate PRs imho, i.e. this is good to go.

@alfredodeza
Contributor

jenkins test ceph-volume tox

@alfredodeza alfredodeza merged commit 974bd43 into ceph:master Nov 9, 2018
@alfredodeza
Contributor

@jan--f would you mind following up with some doc updates? Both for doc/ceph-volume (a new file for the new sub-command) and doc/man/8/ceph-volume.rst for the man page.

@liewegas
Member

liewegas commented Nov 9, 2018

Thanks, everyone!

@@ -117,6 +117,21 @@ def test_not_used_by_ceph(self, device_info, pvolumes, monkeypatch):
disk = device.Device("/dev/sda")
assert not disk.used_by_ceph

disk1 = device.Device("/dev/sda")
Contributor

didn't catch these, but these are causing failures on systems where /dev/sda doesn't exist.
