
ceph-volume: add inventory command #24859

Merged: 1 commit merged into ceph:master on Nov 9, 2018

Conversation

jan--f
Contributor

@jan--f jan--f commented Oct 31, 2018

The inventory command provides information about a node's disk inventory.
The output can be formatted as plain text or json.

Signed-off-by: Jan Fajerski jfajerski@suse.com
Fixes: http://tracker.ceph.com/issues/24972

@jan--f
Contributor Author

jan--f commented Oct 31, 2018

I kept the per-drive report and the json-formatted reports quite verbose; tighter filtering could easily be added.
Some example output:

machine:~/jan--f-ceph-de5dd0a/src/ceph-volume # ceph-volume inventory

Device Path               Size         rotates valid   Model name
/dev/sdl                  185.75 GB    True    False   PERC H700
/dev/sdm                  185.75 GB    True    True    PERC H700
/dev/sdj                  1.82 TB      True    False   PERC H700
/dev/sdk                  1.82 TB      True    False   PERC H700
/dev/sdh                  1.82 TB      True    False   PERC H700
/dev/sdi                  1.82 TB      True    False   PERC H700
/dev/sdf                  1.82 TB      True    True    PERC H700
/dev/sdg                  1.82 TB      True    False   PERC H700
/dev/sdd                  1.82 TB      True    True    PERC H700
/dev/sde                  1.82 TB      True    True    PERC H700
/dev/sdb                  1.82 TB      True    True    PERC H700
/dev/sdc                  1.82 TB      True    True    PERC H700
/dev/sda                  1.82 TB      True    True    PERC H700
machine:~/jan--f-ceph-de5dd0a/src/ceph-volume # ceph-volume inventory /dev/sdg
Device report /dev/sdg
rejected reasons          ['locked'] 
path                      /dev/sdg 
valid                     False 
scheduler_mode            cfq 
rotational                1 
vendor                    DELL 
human_readable_size       1.82 TB 
sectors                   0 
sas_device_handle          
partitions                {'sdg1': {'start': '2048', 'sectorsize': 512, 'sectors': '204800', 'size': '100.00 MB'}} 
rev                       2.10 
sas_address                
locked                    1 
sectorsize                512 
removable                 0 
path                      /dev/sdg 
support_discard            
model                     PERC H700 
ro                        0 
nr_requests               128 
size                      1.9998441472e+12 
GROUP                     disk 
DISC-MAX                  0B 
FSTYPE                     
MODEL                     PERC H700 
SCHED                     cfq 
MODE                      brw-rw---- 
ROTA                      1 
RM                        0 
RO                        0 
UUID                       
STATE                     running 
MOUNTPOINT                 
LABEL                      
SIZE                      1.8T 
MAJ:MIN                   8:96 
DISC-GRAN                 0B 
NAME                      sdg 
PARTLABEL                  
PKNAME                     
LOG-SEC                   512 
DISC-ALN                  0 
ALIGNMENT                 0 
DISC-ZERO                 0 
OWNER                     root 
KNAME                     sdg 
TYPE                      disk 
PHY-SEC                   512 

@jan--f
Contributor Author

jan--f commented Oct 31, 2018

Related to #24768

@ErwanAliasr1
Contributor

Can you show us the output in json when no device is passed? Reading the code makes me think that all the devices will be reported.

@ErwanAliasr1
Contributor

This patch exposes ceph-volume's internal data structures. That could be acceptable, but it creates a serious dependency for 3rd-party tools like orchestrators. If we merged it in its current state, those exposed data structures could never be renamed or amended (data types etc.).

That might eventually lead to versioning the data structure to track possible changes in the format / types.

That might sound like overkill, but we may need an abstraction here that maps internals to "external" names, like in an API, so the reported structure stays consistent even if the internal data structure changes.

@alfredodeza
Contributor

This looks like a good first pass. I don't think that the key/value from the data structure needs to change or be renamed. No need to version things either.

However, we would need to improve the presentation a bit for the non-json format, so that the highly verbose output can be parsed a bit better. One thing we've done in other commands is adding a separator between items, and indenting the key/values for each section. ceph-volume lvm list does this for example.

Another thing we've done is to be fully verbose with JSON output, but minimally so with the pretty (or non-json) format. That way it helps trim down the repetitive items.

Finally, I think that some filtering could be added to include or exclude attributes from being shown. That might be too much effort for an initial pass, but good to keep in mind!

@jan--f
Contributor Author

jan--f commented Nov 1, 2018

Can you show us the output in json when no device is passed? Reading the code makes me think that all the devices will be reported.

Yes, all devices will be reported; that's what I took away from your comment here. See this paste for json output

However, we would need to improve the presentation a bit for the non-json format

Agreed...this needs some work. I was trying to avoid too many details before people got a chance to comment.

Another thing we've done is to be fully verbose with JSON output, but minimally so with the pretty (or non-json) format. That way it helps trim down the repetitive items.

Same approach here. Only for the per-drive non-json output I wasn't sure what not to print. I'll remove a bunch of the less interesting fields to see what people think.

@ErwanAliasr1
Contributor

I don't think that the key/value from the data structure needs to change or be renamed. No need to version things either.

I'm not saying that we have to change or rename the data structures.
This patch exposes the internal data structure of ceph-volume to external tools, in the sense of an API. That implies that orchestrators will consider the fields we expose as stable in name and format.
It's rarely a good idea to expose internal data structures as-is as an external format, as this prevents any internal change to those items, even more so if it's not versioned.

So two options:

  • we keep exposing the internal structures (at risk): that implies strong testing that NONE of those fields is ever changed/touched in any way, since that would break the 3rd-party tools that consume them
  • we install a translation table between an external format (to be defined and guaranteed to orchestrators) and our internal representation of it; a sketch follows below. In the current state that would be a pretty much 1-to-1 table, but it would leave room to modify our internal structures in the future if needed
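
For illustration, a minimal sketch of the second option in Python (the mapping and helper name are hypothetical, not ceph-volume's actual API):

    # Hypothetical translation table: external (guaranteed) name -> internal attribute.
    # External names stay stable even if internal attributes get renamed.
    EXTERNAL_SCHEMA = {
        'path': 'path',
        'rejected_reasons': 'rejected_reasons',
        'available': 'valid',
        'sys_api': 'sys_api',
    }

    def external_report(device):
        # Build the externally exposed dict from the internal attributes.
        return {external: getattr(device, internal)
                for external, internal in EXTERNAL_SCHEMA.items()}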

@ErwanAliasr1
Contributor

@jan--f The pretty output is unsorted and seems to be presented in an order that matches Python's internal representation. This ordering has strictly no meaning for the admin/user.
It would maybe be nice to have

  • an a-z sort on the device name?
  • a sort on the 'valid' field? (see the sketch below)
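
A minimal sketch of that combined ordering, assuming Device objects with the valid and path attributes shown in the output above:

    # Sort valid devices first (False < True), then a-z by device path.
    devices.sort(key=lambda dev: (not dev.valid, dev.path))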

@ErwanAliasr1
Contributor

I think we are missing reporting which disks are actually used by ceph. We know which block devices are backend storage for our OSDs; in that case, the disk should be reported as in use by ceph (fsid? osds?).
That would make for a complete view of the state of a machine in a single command.
In the current state, a disk could be reported as "in-use", meaning "non-free for creating an OSD", while it's actually already used by ceph. I think it would be better for the admin to understand:

  • what drives are used by ceph
  • what drives are used by something else (rejected reasons)
  • what drives are free for new OSDs

@alfredodeza
Contributor

this PR shouldn't get derailed by a discussion on what @ErwanAliasr1 thinks about "internal" vs. "external". Throughout ceph-volume we keep consistency between system APIs (like output from LVM) and what we present back to the user.

@jan--f I am happy about the current direction, let me know when this is ready for a thorough review!

@ErwanAliasr1
Contributor

@alfredodeza I really wonder why you consider a remark about exposing internal data structures as the default format for external tools (orchestrators in this case) to be derailing this PR. Really.
Does Ceph expose its internal data structures? I don't think so.
That's a general software design consideration.
Is it not allowed to raise such considerations, so that possible internal changes don't break the full chain between ceph-volume and the orchestrators?

@ErwanAliasr1
Contributor

@alfredodeza one more time, I would have appreciated a technical debate rather than a moral judgment. Derailed. Wow.

@jan--f
Contributor Author

jan--f commented Nov 1, 2018

I'd argue that exposing a dict of {str: str} is manageable, even if it could be considered an internal data structure. Should this at some point change to a more complex data structure, further abstractions (like a translation table) could easily be added. At this point I'd consider it unnecessary complexity.

Also, the information consists of physical disk attributes, kernel information, and c-v-specific info that we explicitly want to expose. None of these pieces is likely to change dramatically.
And again, some of this info can certainly be filtered out.

@ErwanAliasr1
Contributor

@jan--f If we go in that direction, we have to add a test that ensures the structure members we expose, like disk_api, path, reject_reasons, sys_api/*, valid, are not renamed and keep the same data types (string/int/array/...). If we don't do this, any commit that touches them could break the orchestrator/manager/UI.
That test will protect our consistency towards 3rd-party tools.
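
A minimal sketch of such a test, assuming a hypothetical device_report fixture and the top-level keys visible in the json-pretty paste later in this thread:

    def test_inventory_report_keys_are_stable(device_report):
        # Pin the exposed top-level keys so a rename breaks CI
        # before it breaks ceph-mgr or an orchestrator.
        expected = {'path', 'rejected_reasons', 'sys_api', 'lvs', 'valid'}
        assert set(device_report.keys()) == expected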

@tserong
Contributor

tserong commented Nov 2, 2018

This looks generally reasonable to me, but I've not yet spent any real time looking at the ceph-volume code.

Just so I understand, @ErwanAliasr1, you're concerned that this may expose internal data structures of ceph-volume, but @alfredodeza is saying that what's exposed is really only what's passed through from the underlying system tools?

@ErwanAliasr1
Contributor

@tserong This feature has two users: humans, and a chain of 3rd-party tools including ceph-manager and the orchestrators. The json output of this command will be consumed by that chain of 3rd-party tools. That implies that once the c-v output is defined, it will be consumed and de facto considered the expected output of c-v for this command, so those 3rd-party tools will implement their own decoding/parsing of it.

To function, c-v reads information on the system from various sources and stores it in data structures for later use, either by other c-v functions or for presenting to the user as output.

What concerns me here is that the output of this command is an automated extraction of those data structures. Once ceph-mgr/orchestrator starts consuming this output, a commit that changes any of those exposed data structures (name or format) breaks the whole chain.

When you make a commit on the internals of the project, it's not obvious that it could immediately break that whole chain. So, in my opinion, we should protect ceph-volume against such changes by having a test that guarantees no one breaks the data structures we expose as the default format for ceph-mgr/orchestrator.

We could also add an indirection layer that allows changes inside the data structures while keeping a stable output. That may be added later, once the test I'm describing reports a breakage in the output. The two approaches are complementary.

@alfredodeza considers that I'm derailing the PR by arguing that the work we are doing here with the manager should be stable and guaranteed over time. This chain is supposed to be one of the features of nautilus, as repeated by @liewegas.
We should work in a way that doesn't break our colleagues' work (ceph-mgr/orchestrator) every morning.

@jan--f
Contributor Author

jan--f commented Nov 2, 2018

I updated the commit with improved formatting, sorted output and a test case to make sure the interface looks as expected. I also dropped the disk_api field from the report since it contained a lot of duplicate info, while the rest didn't seem all that interesting.

Looking forward to comments.

@ErwanAliasr1 I certainly understand your concerns on the conceptual level. I think it's worth keeping in mind though that

  • we aren't talking about a complex data structure whose internals might change...it's a dictionary
  • it only carries disk information from the kernel...this info is unlikely to change
  • even with an indirection layer we can only ensure so much continuity automatically. PRs can still change the indirection layer, and automatic tests (like the one I just added) can be changed. Tools consuming this interface will have to do their own testing regardless.

@jan--f jan--f force-pushed the ceph-volume-inventory branch 2 times, most recently from 909030a to c6c7ce3 Compare November 5, 2018 10:45
@sebastian-philipp
Contributor

With respect to the (imo valid) discussion about exposing internal data structures, what about reviewing the data we want to expose?

Device report /dev/sdg
rejected reasons          ['locked'] 
path                      /dev/sdg 
valid                     False 
scheduler_mode            cfq 
rotational                1 
vendor                    DELL 
human_readable_size       1.82 TB 
sectors                   0 
sas_device_handle          
partitions                {'sdg1': {'start': '2048', 'sectorsize': 512, 'sectors': '204800', 'size': '100.00 MB'}} 
rev                       2.10 
sas_address                
locked                    1 
sectorsize                512 
removable                 0 
path                      /dev/sdg 
support_discard            
model                     PERC H700 
ro                        0 
nr_requests               128 
size                      1.9998441472e+12 
GROUP                     disk 
DISC-MAX                  0B 
FSTYPE                     
MODEL                     PERC H700 
SCHED                     cfq 
MODE                      brw-rw---- 
ROTA                      1 
RM                        0 
RO                        0 
UUID                       
STATE                     running 
MOUNTPOINT                 
LABEL                      
SIZE                      1.8T 
MAJ:MIN                   8:96 
DISC-GRAN                 0B 
NAME                      sdg 
PARTLABEL                  
PKNAME                     
LOG-SEC                   512 
DISC-ALN                  0 
ALIGNMENT                 0 
DISC-ZERO                 0 
OWNER                     root 
KNAME                     sdg 
TYPE                      disk 
PHY-SEC                   512 

Looks sane to me. I have just minor things to consider here:

  1. what about exposing all keys as lower-case?
  2. There are a few duplicates. Can we simplify the output of inventory here? (a sketch of both suggestions follows the list)
  • model, MODEL
  • path, KNAME
  • rotational, ROTA
  • ro, RO
  • scheduler_mode, SCHED
  • sectorsize, PHY-SEC
  • size, SIZE
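
A minimal sketch of both suggestions, lower-casing keys and collapsing the duplicate pairs above onto one canonical name (the mapping is illustrative, not ceph-volume code):

    # Map the upper-case lsblk-style keys onto their sys_api twins.
    DUPLICATES = {
        'MODEL': 'model',
        'KNAME': 'path',
        'ROTA': 'rotational',
        'RO': 'ro',
        'SCHED': 'scheduler_mode',
        'PHY-SEC': 'sectorsize',
        'SIZE': 'size',
    }

    def normalize(report):
        # Lower-case everything and keep the first value seen for a
        # canonical key, dropping the duplicate spellings.
        out = {}
        for key, value in report.items():
            canonical = DUPLICATES.get(key, key.lower())
            out.setdefault(canonical, value)
        return out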

@jan--f jan--f force-pushed the ceph-volume-inventory branch 3 times, most recently from 0b898a7 to e015754 Compare November 9, 2018 08:45
@jan--f
Contributor Author

jan--f commented Nov 9, 2018

Example output with logical volume scanning:

 # /usr/bin/ceph-volume inventory /dev/vdg

====== Device report /dev/vdg ======

     rejected reasons          ['locked']
     path                      /dev/vdg
     valid                     False
     scheduler mode            mq-deadline
     rotational                1
     vendor                    0x1af4
     human readable size       20.00 GB
     sas address               
     removable                 0
     model                     
     ro                        0
    --- Logical Volume ---
     cluster name              ceph
     name                      osd-block-aebd57ce-d26f-4282-8362-c042f95d1397
     osd id                    29
     cluster fsid              f386cd00-2c98-413f-a332-c1cd81327731
     type                      block
     block uuid                C5tXmI-iJE5-zBww-Rmg4-zB22-GjLt-xYZV7h
     osd fsid                  aebd57ce-d26f-4282-8362-c042f95d1397


/usr/bin/ceph-volume inventory --format json-pretty
[{'lvs': [],
  'path': '/dev/vda',
  'rejected_reasons': ['locked'],
  'sys_api': {'human_readable_size': '40.00 GB',
              'locked': 1,
              'model': '',
              'nr_requests': '256',
              'partitions': {'vda1': {'sectors': '83884032',
                                      'sectorsize': 512,
                                      'size': '40.00 GB',
                                      'start': '2048'}},
              'path': '/dev/vda',
              'removable': '0',
              'rev': '',
              'ro': '0',
              'rotational': '1',
              'sas_address': '',
              'sas_device_handle': '',
              'scheduler_mode': 'mq-deadline',
              'sectors': 0,
              'sectorsize': '512',
              'size': 42949672960.0,
              'support_discard': '',
              'vendor': '0x1af4'},
  'valid': False},
 {'lvs': [{'block_uuid': 'fMW1iz-W77O-1Nsn-B0m4-mQfN-oHcp-x5DekY',
           'cluster_fsid': 'f386cd00-2c98-413f-a332-c1cd81327731',
           'cluster_name': 'ceph',
           'name': 'osd-block-78c985cc-89da-4ea4-a3d8-c11be5acd21d',
           'osd_fsid': '78c985cc-89da-4ea4-a3d8-c11be5acd21d',
           'osd_id': '3',
           'type': 'block'}],
  'path': '/dev/vdb',
  'rejected_reasons': ['locked'],
  'sys_api': {'human_readable_size': '20.00 GB',
              'locked': 1,
              'model': '',
              'nr_requests': '256',
              'partitions': {'vdb1': {'sectors': '41940959',
                                      'sectorsize': 512,
                                      'size': '20.00 GB',
                                      'start': '2048'}},
              'path': '/dev/vdb',
              'removable': '0',
              'rev': '',
              'ro': '0',
              'rotational': '1',
              'sas_address': '',
              'sas_device_handle': '',
              'scheduler_mode': 'mq-deadline',
              'sectors': 0,
              'sectorsize': '512',
              'size': 21474836480.0,
              'support_discard': '',
              'vendor': '0x1af4'},
  'valid': False}]

I'll try to add output from a shared device as well, don't have one handy right now.

@ErwanAliasr1
Contributor

Thanks @jan--f. Your PR is very promising for getting this important feature into the product.
Any idea on how to handle ceph-disk enabled devices? It's really important that we recognize them as being used by ceph, and not just rejected because they're already in use.

@ErwanAliasr1
Contributor

@jan--f The rejected_reasons and valid fields were added to indicate whether a device is free for creating a new OSD. So if we know it's already an OSD, there is no need to report that information. That could be confusing.

@jan--f
Contributor Author

jan--f commented Nov 9, 2018

Thanks @jan--f. Your PR is very promising for getting this important feature into the product.
Any idea on how to handle ceph-disk enabled devices? It's really important that we recognize them as being used by ceph, and not just rejected because they're already in use.

The Device class is aware of ceph-disk OSDs, iiuc. I'll add it to the report.

@jan--f The rejected_reasons and valid fields were added to indicate whether a device is free for creating a new OSD. So if we know it's already an OSD, there is no need to report that information. That could be confusing.

I think those fields each serve their purpose. I'd rather not change the report structure based on its content.

@ErwanAliasr1
Contributor

I think those fields each serve their purpose. I'd rather not change the report structure based on its content.

That makes sense.

Maybe I should rename "valid" to something like "free", "empty", or another term meaning the disk could be used to create a new OSD; "valid" is very, very confusing in this context. Any thoughts? @alfredodeza?

@alfredodeza
Contributor

I think we can make valid be usable or available. But I don't think that should be done in this PR.

@@ -57,43 +129,96 @@ def __repr__(self):
prefix = 'Raw Device'
return '<%s: %s>' % (prefix, self.abspath)

def pretty_report(self):
output = ['\n====== Device report {} ======\n'.format(self.path)]
output.extend(

@alfredodeza alfredodeza Nov 9, 2018

the last thing I see that could be improved is special-casing the rejected reasons. Right now it displays a Python list. If the list grows to more than one item, it defeats the purpose of using the pretty option (might as well just parse the JSON).

When a device is not available, it should be pretty clear that it isn't (to avoid having the user scroll around looking for it). Some things that might help would be making the title/header red, and placing the rejected devices always first (or always last?).

When the rejected reasons are displayed, they shouldn't use the list; they should be rendered one per line, something like:

====== Device report /dev/vdg ======

     rejected reasons          
         * locked
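
A minimal sketch of that rendering (the helper name is made up for illustration):

    def pretty_rejected(reasons):
        # One indented line per rejected reason instead of a raw Python list.
        lines = ['     rejected reasons']
        lines.extend('         * {}'.format(reason) for reason in reasons)
        return '\n'.join(lines)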

jan--f (Contributor Author)

Sorting is already implemented. Devices are sorted by the valid flag (valid < invalid) and the device name otherwise.

alfredodeza (Contributor)

Right, but if you are displaying 100 devices and 84 of them are invalid, you still need to browse through quite a bunch to find where the division between valid/invalid starts.

Again, I am not saying this is necessary for this PR; if prioritizing, I think the rejected reasons fix would be preferable.

jan--f (Contributor Author)

I could flip the ordering. Maybe it makes more sense, though, to add the ability to simply not display invalid devices?

ErwanAliasr1 (Contributor)

In my opinion, the default should be listing everything.
We have 3 categories of drives:

  • totally free to be used for new OSDs
  • used by ceph
  • used by anything else

It could be useful to have flags that keep only a single category, to ease reading for humans or to give an orchestrator a way to pre-filter the device types, like "please tell me which devices are free to be used".

jan--f (Contributor Author)

I think this would mostly be interesting for the pretty report. Any software that interacts with the json output can easily filter it, and an orchestrator implementation might actually want that info (how many unavailable disks); see the sketch below.

I do think this would be best addressed in a separate PR.
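
A minimal sketch of such consumer-side filtering, assuming --format json emits the same fields as the json-pretty paste above (path, valid, lvs):

    import json
    import subprocess

    # Ask ceph-volume for the machine-readable inventory of all devices.
    raw = subprocess.check_output(
        ['ceph-volume', 'inventory', '--format', 'json'])
    inventory = json.loads(raw)

    # Devices free for new OSDs vs. devices already carrying ceph LVs.
    free = [dev['path'] for dev in inventory if dev['valid']]
    used_by_ceph = [dev['path'] for dev in inventory if dev['lvs']]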

@alfredodeza alfredodeza left a comment

This looks good to go. My last nit about rejected reasons can be done separately (feel free to address it here though), as well as changing valid to some other name. This PR is about introducing the inventory sub-command; I'd like to avoid feature creep.

@ErwanAliasr1
Copy link
Contributor

Once this PR is merged, I'll update the semantics of "valid".

@jan--f
Contributor Author

jan--f commented Nov 9, 2018

Once this PR is merged, I'll update the semantics of "valid".

@ErwanAliasr1 I have code to address http://tracker.ceph.com/issues/36701. A property rename could easily be part of that. I did use available in an earlier iteration of this.

The inventory command provides information about a node's disk inventory.
Existing logical volumes on a disk or one of its partitions are scanned
and reported.
The output can be formatted as plain text or json.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
@jan--f
Contributor Author

jan--f commented Nov 9, 2018

k, I added formatting for the rejected_reasons field. The remaining comments should be addressed in separate PRs imho, i.e. this is good to go.

@alfredodeza
Contributor

jenkins test ceph-volume tox

@alfredodeza alfredodeza merged commit 974bd43 into ceph:master Nov 9, 2018
@alfredodeza
Contributor

@jan--f would you mind following up with some doc updates? Both for doc/ceph-volume (a new file for the new sub-command) and doc/man/8/ceph-volume.rst for the man page.

@liewegas
Member

liewegas commented Nov 9, 2018

Thanks, everyone!

@@ -117,6 +117,21 @@ def test_not_used_by_ceph(self, device_info, pvolumes, monkeypatch):
disk = device.Device("/dev/sda")
assert not disk.used_by_ceph

disk1 = device.Device("/dev/sda")
Contributor

didn't catch these, but these are causing failures on systems where /dev/sda doesn't exist.
