Facts not being reported by the Microkernel to Hanlon on some hardware #29

Open
tjmcs opened this issue Feb 24, 2016 · 11 comments

@tjmcs
Collaborator

tjmcs commented Feb 24, 2016

A user recently reported that when he used the Microkernel to discover HP BL460cG8 blade servers that included 10Gb Flex Fabric NICs, there were no system tags assigned to the nodes after they registered with Hanlon. This seems to be associated with facts not being returned to Hanlon from the Microkernel in the node registration process. This issue needs to be explored further to determine the root cause (whether it's a Hanlon issue, a Hanlon-Microkernel issue, or both).

On the plus side, the node seems to be checking in successfully, so this hints at an issue with discovering (and parsing?) the underlying facts on the Hanlon-Microkernel side. This issue is associated with Issue #422 in the Hanlon project.

@tjmcs tjmcs added the bug label Feb 24, 2016
@hickey

hickey commented Mar 25, 2016

I think I have a related issue. I received a number of Super Micro machines and only partial tags/facts are being reported.

0:0 ᐅ hanlon node 1a -f attributes
Node Attributes:
                Name                      Value
        architecture                x86_64
        bios_release_date           12/18/2015
        bios_vendor                 American Megatrends Inc.
        bios_version                2.0
        blockdevice_sda_model       SMC3108
        blockdevice_sda_size        959656755200
        blockdevice_sda_vendor      AVAGO
        blockdevices                sda
        boardmanufacturer           Supermicro
        boardproductname            X10DRT-P
        boardserialnumber           ZM152S026018
        filesystems                 ext2,ext3,ext4
        fqdn                        mk0CC47A4BF8D0
        gid                         root
        hardwareisa                 unknown
        hardwaremodel               x86_64
        hostname                    mk0CC47A4BF8D0
        ipaddress                   172.18.42.1
        ipaddress_docker0           172.17.0.1
        ipaddress_docker_sys        172.18.42.1
        ipaddress_eth2              10.33.12.25
        ipaddress_lo                127.0.0.1
        is_virtual                  false
        macaddress                  00:00:00:00:00:00
        macaddress_docker0          02:42:35:62:CC:DA
        macaddress_docker_sys       00:00:00:00:00:00
        macaddress_eth0             0C:C4:7A:4B:F8:D0
        macaddress_eth1             0C:C4:7A:4B:F8:D1
        macaddress_eth2             A0:36:9F:6C:F9:F8
        macaddress_eth3             A0:36:9F:6C:F9:FA
        macaddress_none             DE:EA:AA:7A:C6:4F
        manufacturer                Supermicro
        memorysize                  125.88 GB
        memorysize_mb               128896.68
        mk_hw_bus_description       Motherboard
        mk_hw_bus_physical_id       0
        mk_hw_bus_product           X10DRT-P
        mk_hw_bus_serial            ZM152S026018
        mk_hw_bus_vendor            Supermicro
        mk_hw_bus_version           1.10
        mk_hw_fw_capacity           15MiB
        mk_hw_fw_date               12/18/2015
        mk_hw_fw_description        BIOS
        mk_hw_fw_physical_id        0
        mk_hw_fw_size               64KiB
        mk_hw_fw_vendor             American Megatrends Inc.
        mk_hw_fw_version            2.0
        mk_hw_lscpu_Architecture    x86_64
        mk_hw_lscpu_BogoMIPS        4805.22
        mk_hw_lscpu_Byte_Order      Little Endian
        mk_hw_lscpu_CPU_MHz         1200.656
        mk_hw_lscpu_CPU_family      6
        mk_hw_lscpu_CPU_op-modes    32-bit, 64-bit
        mk_hw_lscpu_L1d_cache       32K
        mk_hw_lscpu_L1i_cache       32K
        mk_hw_lscpu_L2_cache        256K
        mk_hw_lscpu_L3_cache        15360K
        mk_hw_lscpu_Model           63
        mk_hw_lscpu_Stepping        2
        mk_hw_lscpu_Vendor_ID       GenuineIntel
        mk_hw_lscpu_Virtualization  VT-x
        mtu_docker0                 1500
        mtu_docker_sys              1500
        mtu_eth0                    1500
        mtu_eth1                    1500
        mtu_eth2                    1500
        mtu_eth3                    1500
        mtu_lo                      65536
        mtu_none                    1500
        netmask                     255.255.0.0
        netmask_docker0             255.255.0.0
        netmask_docker_sys          255.255.0.0
        netmask_eth2                255.255.255.0
        netmask_lo                  255.0.0.0
        network_docker0             172.17.0.0
        network_docker_sys          172.18.0.0
        network_eth2                10.33.12.0
        network_lo                  127.0.0.0
        physicalprocessorcount      2
        processorcount              24
        productname                 SYS-2028TP-HC1R
        serialnumber                E168774X6101021
        type                        Other
        virtual                     physical

Note that some of the microkernel facts are not defined, so the system tags do not appear. Specifically, mk_hw_cpu_count, mk_hw_mem_size and mk_hw_nic_count are missing.

Probably related is the exception that is occurring in hnl_mk_hardware_facter.rb.

E, [2016-03-24T23:16:04.885474 #71] ERROR -- HanlonMicrokernel::HnlMkHardwareFacter#rescue in add_facts_to_map!: /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_hardware_facter.rb:79:in `add_facts_to_map!'
    /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:52:in `register_with_server'
    /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:42:in `register_node_if_changed'
    /usr/local/bin/hnl_mk_control_server.rb:256:in `block in <top (required)>'
    /usr/local/bin/hnl_mk_control_server.rb:141:in `loop'
    /usr/local/bin/hnl_mk_control_server.rb:141:in `<top (required)>'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `load'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `start_load'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:297:in `start'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/controller.rb:56:in `run'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:144:in `block in run'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `call'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `catch_exceptions'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:143:in `run'
    /usr/local/bin/hnl_mk_controller.rb:24:in `<main>'
E, [2016-03-24T23:17:04.883214 #71] ERROR -- HanlonMicrokernel::HnlMkHardwareFacter#rescue in add_facts_to_map!: /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_hardware_facter.rb:79:in `add_facts_to_map!'
    /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:52:in `register_with_server'
    /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:42:in `register_node_if_changed'
    /usr/local/bin/hnl_mk_control_server.rb:256:in `block in <top (required)>'
    /usr/local/bin/hnl_mk_control_server.rb:141:in `loop'
    /usr/local/bin/hnl_mk_control_server.rb:141:in `<top (required)>'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `load'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `start_load'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:297:in `start'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/controller.rb:56:in `run'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:144:in `block in run'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `call'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `catch_exceptions'
    /usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:143:in `run'
    /usr/local/bin/hnl_mk_controller.rb:24:in `<main>'

@jcpowermac
Contributor

@hickey I created this doc a while back; maybe if we increase the logging we can figure out where the problem is.
https://gist.github.com/jcpowermac/3ed70022ba218ad29ce6

@hickey

hickey commented Mar 25, 2016

Yes, I was starting to look at that this morning.... I am also starting to cut a new microkernel image that prints out values throughout the routine to determine what things look like as the routine is executing.

Not sure how the debug level gets transferred to the microkernel (I figure it has to be statically written into the docker image when image add executes), but the lack of controls when starting up the docker containers is getting frustrating.... I have already started to extend the hanlon_docker.sh script to be more 12-factor-ish. I guess I have another setting to add.

@jcpowermac
Contributor

@hickey see https://github.com/csc/Hanlon-Microkernel/blob/master/hnl_mk_web_server.rb#L57

Most of Hanlon's configuration settings don't need to be changed, which is why we did not add those options when starting the container. It's easy enough to enter the container, modify the config temporarily for testing/debugging, and restart puma.

@hickey

hickey commented Mar 25, 2016

I have grabbed a copy of the log for analysis. Here is the gist: https://gist.github.com/hickey/6207183c78ea0903cea1

@tjmcs
Collaborator Author

tjmcs commented Mar 26, 2016

My guess (from looking at line 79 of the hanlon_microkernel/hnl_mk_hardware_facter.rb file in the Hanlon-Microkernel project) is that the command being exec’ed on line 65 of that file (the sudo lshw -c memory command) isn’t returning the sizes of the memory slots on that hardware. If so, the bank_array entry in the hash map constructed from the output of that command is empty, a nil is returned from the hash_map["bank_array"] statement, and the rest of the code on that line is attempting to run a select statement on a nil object.
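
The failure mode described in this guess can be reproduced in isolation. What follows is a minimal sketch under stated assumptions, not the code at line 79 of hnl_mk_hardware_facter.rb: the helper name `populated_banks` and the sample hashes are hypothetical, and the only thing taken from the thread is the idea of calling select on the value stored under "bank_array".

```ruby
# Hypothetical helper (not part of Hanlon-Microkernel): return the bank
# entries that report a size. Array(nil) evaluates to [], so a missing
# "bank_array" key yields an empty result instead of raising
# NoMethodError on nil.
def populated_banks(hash_map)
  Array(hash_map["bank_array"]).select { |bank| bank.key?("size") }
end

# Unguarded access fails in the way described above, because the lookup
# returns nil when the key is absent.
begin
  {}["bank_array"].select { |bank| bank.key?("size") }
rescue NoMethodError => e
  puts "unguarded lookup raised #{e.class}"
end

# Guarded access simply finds no populated banks.
puts populated_banks({}).length

# Illustrative data only: one populated slot and one empty slot.
example = { "bank_array" => [{ "size" => "16GiB" }, { "slot" => "P1-DIMMA2" }] }
puts populated_banks(example).length
```

Whether an empty result should be reported as a missing fact or as a zero is a separate decision; the guard only keeps the lookup from raising.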

This is the first time I’ve seen this sort of error. Can you run a sudo lshw -c memory command on that node from the command line in the Microkernel, just to be sure?

Cheers,

Tom


@hickey

hickey commented Mar 26, 2016

Here is the value of hash_map (prettied up to make it readable) just before the exception:

D, [2016-03-27T03:44:02.369209 #71] DEBUG -- HanlonMicrokernel::HnlMkHardwareFacter#add_facts_to_map!: 
  hash_map = {
    "firmware"=>{
      "description"=>"BIOS",
      "vendor"=>"American Megatrends Inc.",
      "physical_id"=>"0", 
      "version"=>"2.0", 
      "date"=>"12/18/2015", 
      "size"=>"64KiB", 
      "capacity"=>"15MiB", 
      "capabilities"=>"pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi"
    }, 
    "memory_array"=>[
      {
        "description"=>"System Memory", 
        "physical_id"=>"57", 
        "slot"=>"System board or motherboard", 
        "bank_array"=>[
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)", 
            "product"=>"M393A2G40DB0-CPB", 
            "vendor"=>"Samsung", 
            "physical_id"=>"0", 
            "serial"=>"3130B89B", 
            "slot"=>"P1-DIMMA1", 
            "size"=>"16GiB", 
            "width"=>"64 bits", 
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]", 
            "product"=>"NO DIMM", 
            "vendor"=>"NO DIMM", 
            "physical_id"=>"1",
            "serial"=>"NO DIMM",
            "slot"=>"P1-DIMMA2"
          }, 
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"2",
            "serial"=>"3130B8A6",
            "slot"=>"P1-DIMMB1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]",
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"3",
            "serial"=>"NO DIMM",
            "slot"=>"P1-DIMMB2"
          }, 
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"4",
            "serial"=>"3130B89E",
            "slot"=>"P1-DIMMC1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]",
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"5",
            "serial"=>"NO DIMM",
            "slot"=>"P1-DIMMC2"
          }, 
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"6",
            "serial"=>"3130B8AB",
            "slot"=>"P1-DIMMD1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]",
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"7",
            "serial"=>"NO DIMM",
            "slot"=>"P1-DIMMD2"
          }
        ]
      }, 
      {
        "description"=>"System Memory",
        "physical_id"=>"60",
        "slot"=>"System board or motherboard",
        "bank_array"=>[
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"0",
            "serial"=>"3130B899",
            "slot"=>"P2-DIMME1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]",
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"1",
            "serial"=>"NO DIMM",
            "slot"=>"P2-DIMME2"
          }, 
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"2",
            "serial"=>"3130B8A2",
            "slot"=>"P2-DIMMF1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]",
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"3",
            "serial"=>"NO DIMM",
            "slot"=>"P2-DIMMF2"
          }, 
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"4",
            "serial"=>"3130B8A1",
            "slot"=>"P2-DIMMG1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]", 
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"5",
            "serial"=>"NO DIMM",
            "slot"=>"P2-DIMMG2"
          }, 
          {
            "description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
            "product"=>"M393A2G40DB0-CPB",
            "vendor"=>"Samsung",
            "physical_id"=>"6",
            "serial"=>"3130B89D",
            "slot"=>"P2-DIMMH1",
            "size"=>"16GiB",
            "width"=>"64 bits",
            "clock"=>"2133MHz (0.5ns)"
          }, 
          {
            "description"=>"DIMM Synchronous [empty]",
            "product"=>"NO DIMM",
            "vendor"=>"NO DIMM",
            "physical_id"=>"7",
            "serial"=>"NO DIMM",
            "slot"=>"P2-DIMMH2"
          }
        ]
      }, 
      {
        "UNCLAIMED"=>true, "physical_id"=>"1"
      }
    ], 
    "cache_array"=>[
      {
        "description"=>"L1 cache",
        "physical_id"=>"74",
        "slot"=>"CPU Internal L1",
        "size"=>"384KiB",
        "capacity"=>"384KiB",
        "capabilities"=>"internal write-back"
      }, 
      {
        "description"=>"L2 cache",
        "physical_id"=>"75",
        "slot"=>"CPU Internal L2",
        "size"=>"1536KiB",
        "capacity"=>"1536KiB",
        "capabilities"=>"internal write-back unified"
      }, 
      {
        "description"=>"L3 cache",
        "physical_id"=>"76",
        "slot"=>"CPU Internal L3",
        "size"=>"15MiB",
        "capacity"=>"15MiB",
        "capabilities"=>"internal write-back unified"
      }, 
      {
        "description"=>"L1 cache",
        "physical_id"=>"78",
        "slot"=>"CPU Internal L1",
        "size"=>"384KiB",
        "capacity"=>"384KiB",
        "capabilities"=>"internal write-back"
      }, 
      {
        "description"=>"L2 cache",
        "physical_id"=>"79",
        "slot"=>"CPU Internal L2",
        "size"=>"1536KiB",
        "capacity"=>"1536KiB",
        "capabilities"=>"internal write-back unified"
      },
      {
        "description"=>"L3 cache",
        "physical_id"=>"7a",
        "slot"=>"CPU Internal L3",
        "size"=>"15MiB",
        "capacity"=>"15MiB",
        "capabilities"=>"internal write-back unified"
      }
    ]
  }

@hickey

hickey commented Mar 26, 2016

So clearly in my output the top-level key is memory_array rather than bank_array (the bank_array entries are nested inside it). OK, that is an easy fix.
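
One way to cope with both layouts is sketched below. This is an illustration under stated assumptions rather than the patch discussed in this thread: the helper names (`all_banks`, `size_in_bytes`, `total_memory_bytes`) are hypothetical, and the sample data is a trimmed-down version of the hash_map dump above.

```ruby
# Hypothetical helpers (not part of Hanlon-Microkernel).

# Collect bank entries whether they appear at the top level under
# "bank_array" or nested inside the entries of "memory_array".
def all_banks(hash_map)
  top_level = Array(hash_map["bank_array"])
  nested = Array(hash_map["memory_array"]).flat_map do |mem|
    # Entries such as {"UNCLAIMED"=>true, ...} have no "bank_array".
    Array(mem["bank_array"])
  end
  top_level + nested
end

BYTES_PER_UNIT = {
  "KiB" => 1024,
  "MiB" => 1024**2,
  "GiB" => 1024**3,
  "TiB" => 1024**4
}.freeze

# Convert a size string such as "16GiB" to bytes; returns nil when the
# value is missing or not in that form (for example an empty DIMM slot).
def size_in_bytes(size)
  match = /\A(\d+)(KiB|MiB|GiB|TiB)\z/.match(size.to_s)
  match && match[1].to_i * BYTES_PER_UNIT[match[2]]
end

# Sum the sizes of all populated banks, ignoring entries without a size.
def total_memory_bytes(hash_map)
  all_banks(hash_map).map { |bank| size_in_bytes(bank["size"]) }.compact.reduce(0, :+)
end

# Trimmed-down sample in the shape of the dump above.
sample = {
  "memory_array" => [
    { "bank_array" => [
      { "slot" => "P1-DIMMA1", "size" => "16GiB" },
      { "slot" => "P1-DIMMA2", "product" => "NO DIMM" }
    ] },
    { "UNCLAIMED" => true, "physical_id" => "1" }
  ]
}

puts total_memory_bytes(sample) / (1024**3)
```

Treating empty slots and the UNCLAIMED entry as contributing nothing is a design choice made for this sketch; the real facter code may need to report them differently.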

I walked through the rest of the sections and executed the commands to look at the output. Everything else looks like it will parse correctly, at least up to the point of trying to gather BMC/IPMI information. I have already created an issue (#31) for this. It will be fairly critical to gather this information as well, so hopefully this can be overcome.

@hickey

hickey commented Mar 26, 2016

A thought I was just having (more of a question): why was the hardware gathering done this way instead of using the regular Facter interface? If all these gathering processes were written as either Facter modules or even using the facts-dot-d interface, then any one of them blowing up would not disturb the other bits of code gathering information. At least then in my case only the memory information would be missing.
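
The isolation argument can be illustrated without Facter at all. The sketch below is hypothetical and is not how the Microkernel currently gathers facts: `gather_facts`, the collector names, and the fact values are invented for illustration. It only shows that rescuing per collector keeps the facts from the collectors that did succeed.

```ruby
# Hypothetical sketch: run each fact collector independently so that an
# exception in one does not discard the facts gathered by the others.
# Returns a pair: the merged facts and a map of collector name to error.
def gather_facts(collectors)
  facts = {}
  errors = {}
  collectors.each do |name, collector|
    begin
      facts.merge!(collector.call)
    rescue StandardError => e
      errors[name] = e.message # record the failure and keep going
    end
  end
  [facts, errors]
end

# Illustrative collectors; the memory collector fails on purpose.
collectors = {
  "cpu"    => -> { { "mk_hw_cpu_count" => 2 } },
  "memory" => -> { raise "bank_array missing" },
  "nic"    => -> { { "mk_hw_nic_count" => 4 } }
}

facts, errors = gather_facts(collectors)
puts "facts gathered: #{facts.keys.sort.join(', ')}"
puts "collectors that failed: #{errors.keys.join(', ')}"
```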

It would also be easier to add new code to test and generate facts. The other advantage is that running facter on the command line would yield all the regular facts along with the ones being created by the hanlon code.

Would not be that much trouble to break it apart and make it Facter modules. Although the more I think about it, the more I like adding it as facts-dot-d scripts. One of the principal reasons is that it would make it pretty easy to interface to an external PAAS system through a hook to retrieve information and drop a JSON/YAML/text file in the facts-dot-d directory to add PAAS values as facts. (Yes, the microkernel needs to support an easy way for someone to drop scripts in a directory and have them executed prior to and after registration, maybe even at each checkin.) This would allow Hanlon tags to be created from exposed PAAS information.
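
As a rough idea of what such a script might look like: Facter's external facts mechanism can run executable files from its facts.d directory and read key=value lines from their standard output. The sketch below assumes that mechanism; the fact names, the values, and the `lookup_paas_facts` placeholder are invented, and the location of the facts.d directory depends on the Facter version and installation.

```ruby
#!/usr/bin/env ruby
# Hypothetical external-fact script. Nothing here comes from the Hanlon
# code base; it only illustrates the key=value output format.

# Format a hash of facts as key=value lines, skipping nil values so that
# an unavailable value does not become an empty fact.
def external_fact_lines(facts)
  facts.reject { |_key, value| value.nil? }
       .map { |key, value| "#{key}=#{value}" }
end

# Placeholder for a lookup against an external system. The names and
# values are made up for illustration.
def lookup_paas_facts
  { "paas_environment" => "staging", "paas_owner" => nil }
end

# Each line written to standard output becomes one fact.
puts external_fact_lines(lookup_paas_facts)
```

A hook that writes a static JSON, YAML, or text file into the same directory would serve the same purpose without needing an executable.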

@hickey

hickey commented Mar 30, 2016

I can report back that the initial changes I have made to support memory_array and bank_array are working. I did not get any real memory information back, so I will look to solve this before I submit a PR. But I am seeing the facts used to generate the number of CPUs and NICs. So improvements :-)

@hickey

hickey commented Apr 8, 2016

While I still have the patch for this issue in my local repo, I would suggest that the PR I just created for issue #32 be used instead. That code base also solves some (if not all) of the issues on this thread.
