Redfish collector crashes on DELL PowerEdge R740 / PERC H330 Adapter #69

Closed
przemeklal opened this issue Apr 16, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@przemeklal
Member

After deploying hardware-exporter using the hw-observer charm (revision 59), the Redfish collector crashes with KeyError: 'StorageControllers':

Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Getting storage controller data...
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Attempt 1 of /redfish/v1/
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Response Time for GET to /redfish/v1/: 0.05335437599569559 seconds.
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Attempt 1 of /redfish/v1/Systems
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Response Time for GET to /redfish/v1/Systems: 0.023388846078887582 seconds.
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Attempt 1 of /redfish/v1/Systems/System.Embedded.1
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1: 0.2340633668936789 seconds.
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Attempt 1 of /redfish/v1/Systems/System.Embedded.1/Storage/
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Storage/: 0.03526876401156187 seconds.
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Attempt 1 of /redfish/v1/Systems/System.Embedded.1/Storage/RAID.Slot.6-1
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Storage/RAID.Slot.6-1: 0.132918894989416 seconds.
Apr 16 10:58:11 redacted-15 python3[3252062]: 2024-04-16 10:58:11 INFO Attempt 1 of /redfish/v1/Systems/System.Embedded.1/Storage/AHCI.Embedded.1-1
Apr 16 10:58:12 redacted-15 python3[3252062]: 2024-04-16 10:58:12 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Storage/AHCI.Embedded.1-1: 0.09446246200241148 seconds.
Apr 16 10:58:12 redacted-15 python3[3252062]: 2024-04-16 10:58:12 INFO Attempt 1 of /redfish/v1/Systems/System.Embedded.1/Storage/AHCI.Embedded.2-1
Apr 16 10:58:12 redacted-15 python3[3252062]: 2024-04-16 10:58:12 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Storage/AHCI.Embedded.2-1: 0.10998136200942099 seconds.
Apr 16 10:58:12 redacted-15 python3[3252062]: 2024-04-16 10:58:12 INFO Attempt 1 of /redfish/v1/Systems/System.Embedded.1/Storage/PCIeSSD.Slot.2-C
Apr 16 10:58:12 redacted-15 python3[3252062]: 2024-04-16 10:58:12 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Storage/PCIeSSD.Slot.2-C: 0.08474618685431778 seconds.
Apr 16 10:58:12 redacted-15 python3[3252062]: 2024-04-16 10:58:12 ERROR Exception occurred while using redfish object: 'StorageControllers'
Apr 16 10:58:12 redacted-15 python3[3252062]: Traceback (most recent call last):
Apr 16 10:58:12 redacted-15 python3[3252062]:   File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/prometheus_hardware_exporter/collector.py", line 1036, in fetch
Apr 16 10:58:12 redacted-15 python3[3252062]:     ) = redfish_helper.get_storage_controller_data()
Apr 16 10:58:12 redacted-15 python3[3252062]:   File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/prometheus_hardware_exporter/collectors/redfish.py", line 306, in get_storage_controller_data
Apr 16 10:58:12 redacted-15 python3[3252062]:     storage_controllers_list: List[Dict] = self.redfish_obj.get(curr_storage_uri).dict[
Apr 16 10:58:12 redacted-15 python3[3252062]: KeyError: 'StorageControllers'
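
The KeyError above is raised by indexing the Storage resource payload directly with 'StorageControllers'. A minimal sketch of a more tolerant lookup follows, assuming the payload is a plain dict like the one `.dict` returns in the traceback. The helper name `extract_storage_controllers` and the sample payloads are hypothetical and are not the exporter's actual code.

```python
from typing import Any, Dict, List


def extract_storage_controllers(storage_payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the inline controller list, or [] when the key is absent.

    Hypothetical helper: dict.get() is used instead of payload[...] so a
    Storage resource without "StorageControllers" does not raise KeyError.
    """
    controllers = storage_payload.get("StorageControllers")
    if not isinstance(controllers, list):
        # Key missing, null, or not a list: report "no controllers".
        return []
    # Keep only well-formed (dict) entries.
    return [c for c in controllers if isinstance(c, dict)]


# Made-up payloads for illustration only.
with_key = {"Id": "example-with-key", "StorageControllers": [{"MemberId": "0"}]}
without_key = {"Id": "example-without-key", "Drives": []}

print(extract_storage_controllers(with_key))
print(extract_storage_controllers(without_key))
```

Returning an empty list lets the collector skip that storage member and keep scraping the others; whether skipping is the right behaviour for a given vendor is a design choice this sketch does not settle.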
@facundofc

We're seeing the same (or something very similar) on an HPE system, ProLiant DL385 Gen10 Plus v2.

In this case no traceback is logged, only the message "StorageControllers":

Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Getting storage controller data...
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Attempt 1 of /redfish/v1/
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Response Time for GET to /redfish/v1/: 0.030871829017996788 seconds.
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Attempt 1 of /redfish/v1/Systems/
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Response Time for GET to /redfish/v1/Systems/: 0.014283128082752228 seconds.
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Attempt 1 of /redfish/v1/Systems/1
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Response Time for GET to /redfish/v1/Systems/1: 0.11970742791891098 seconds.
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Attempt 1 of /redfish/v1/Systems/1/Storage/
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Response Time for GET to /redfish/v1/Systems/1/Storage/: 0.012170223984867334 seconds.
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Attempt 1 of /redfish/v1/Systems/1/Storage/DE042000
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Response Time for GET to /redfish/v1/Systems/1/Storage/DE042000: 0.019633877091109753 seconds.
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Attempt 1 of /redfish/v1/SessionService/Sessions/canonical-ipmi000000006621387e1369689/
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO Response Time for DELETE to /redfish/v1/SessionService/Sessions/canonical-ipmi000000006621387e1369689/: 0.019039443228393793 seconds.
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 INFO User logged out: {"error":{"code":"iLO.0.10.ExtendedInfo","message":"See @Message.ExtendedInfo for more information.","@Message.ExtendedInfo":[{"MessageId":"Base.1.17.Success"}]}}
Apr 18 15:13:02 infra-p-ma-02 python3[1185432]: 2024-04-18 15:13:02 ERROR 'StorageControllers'

Charm revision 25.
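
The bare 'StorageControllers' message is what Python produces when a KeyError is converted to a string: only the quoted key. The sketch below contrasts logging the exception text with logging the full traceback. The logger name `redfish-demo` and the `lookup` function are hypothetical stand-ins, not code from the exporter.

```python
import io
import logging


def lookup(payload: dict) -> list:
    # Direct indexing raises KeyError when the key is missing.
    return payload["StorageControllers"]


# Capture log output in memory so the two styles can be compared.
stream = io.StringIO()
logger = logging.getLogger("redfish-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

try:
    lookup({})
except KeyError as err:
    terse = str(err)  # just the quoted key
    logger.error(err)  # logs only: ERROR 'StorageControllers'
    logger.exception("Failed to read storage controllers")  # appends the traceback

output = stream.getvalue()
print(output)
```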

@przemeklal
Member Author

After upgrading hardware-observer in the same environment to revision 64 from the edge channel, it now crashes on something else:

Apr 26 14:02:15 redacted-15 python3[72331]: 2024-04-26 14:02:15 INFO Attempt 1 of /redfish/v1/Chassis/Enclosure.Internal.0-1:RAID.Slot.6-1/SmartStorage
Apr 26 14:02:15 redacted-15 python3[72331]: 2024-04-26 14:02:15 INFO Response Time for GET to /redfish/v1/Chassis/Enclosure.Internal.0-1:RAID.Slot.6-1/SmartStorage: 0.1321243392303586 seconds.
Apr 26 14:02:15 redacted-15 python3[72331]: 2024-04-26 14:02:15 INFO Attempt 1 of /redfish/v1/SessionService/Sessions/15670
Apr 26 14:02:15 redacted-15 python3[72331]: 2024-04-26 14:02:15 INFO Response Time for DELETE to /redfish/v1/SessionService/Sessions/15670: 0.1414796831086278 seconds.
Apr 26 14:02:15 redacted-15 python3[72331]: 2024-04-26 14:02:15 INFO User logged out: {"@Message.ExtendedInfo":[{"Message":"The request completed successfully.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"Base.1.12.Success","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"None","Severity":"OK"},{"Message":"The operation successfully completed.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.2.8.SYS413","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"No response action is required.","Severity":"Informational"}]}
Apr 26 14:02:15 redacted-15 python3[72331]: 2024-04-26 14:02:15 WARNING Failed to get smart_storage_health_data via redfish
Apr 26 14:02:15 redacted-15 python3[72331]: Traceback (most recent call last):
Apr 26 14:02:15 redacted-15 python3[72331]:   File "/usr/lib/python3.8/wsgiref/handlers.py", line 137, in run
Apr 26 14:02:15 redacted-15 python3[72331]:     self.result = application(self.environ, self.start_response)
Apr 26 14:02:15 redacted-15 python3[72331]:   File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/prometheus_client/exposition.py", line 128, in prometheus_app
Apr 26 14:02:15 redacted-15 python3[72331]:     status, headers, output = _bake_output(registry, accept_header, accept_encoding_header, params, disable_compression)
Apr 26 14:02:15 redacted-15 python3[72331]:   File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/prometheus_client/exposition.py", line 104, in _bake_output
Apr 26 14:02:15 redacted-15 python3[72331]:     output = encoder(registry)
Apr 26 14:02:15 redacted-15 python3[72331]:   File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/prometheus_client/openmetrics/exposition.py", line 32, in generate_latest
Apr 26 14:02:15 redacted-15 python3[72331]:     ['{}="{}"'.format(
Apr 26 14:02:15 redacted-15 python3[72331]:   File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/prometheus_client/openmetrics/exposition.py", line 33, in <listcomp>
Apr 26 14:02:15 redacted-15 python3[72331]:     k, v.replace('\\', r'\\').replace('\n', r'\n').replace('"', r'\"'))
Apr 26 14:02:15 redacted-15 python3[72331]: AttributeError: ("'NoneType' object has no attribute 'replace'", Metric(redfish_storage_controller, Storage Controller information obtained from redfish., info, , [Sample(name='redfish_storage_controller_info', labels={'system_id': 'System.Embedded.1', 'storage_id': 'AHCI.Embedded.1-1', 'controller_id': '0', 'health': None, 'state': 'Enabled'}, value=1, timestamp=None, exemplar=None)]))
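
The AttributeError above arises because the 'health' label is None and the exposition code calls `.replace` on every label value. The sketch below reproduces the failure using only the standard library and shows one possible way to normalise labels before they are exported. The helper names and the "NA" placeholder are assumptions for illustration; the label values are copied from the sample in the traceback.

```python
from typing import Dict, Optional


def escape_label_value(value: str) -> str:
    # Same escaping expression that appears in the traceback above.
    return value.replace("\\", r"\\").replace("\n", r"\n").replace('"', r"\"")


def normalize_labels(
    labels: Dict[str, Optional[str]], placeholder: str = "NA"
) -> Dict[str, str]:
    """Hypothetical helper: replace None with a placeholder, stringify the rest."""
    return {k: placeholder if v is None else str(v) for k, v in labels.items()}


labels = {
    "system_id": "System.Embedded.1",
    "storage_id": "AHCI.Embedded.1-1",
    "controller_id": "0",
    "health": None,
    "state": "Enabled",
}

try:
    escape_label_value(labels["health"])  # None has no .replace
    failed = False
except AttributeError:
    failed = True

safe = normalize_labels(labels)
rendered = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in safe.items())
print(rendered)
```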

@honghan-wong

The hw-obs charm was stuck on "update-status"; it started working after running juju resolve hardware-observer/<unit>.
After updating to rev 69, I no longer see any errors.

@chanchiwai-ray
Contributor

chanchiwai-ray commented May 9, 2024

Closing this for now since @honghan-wong has tested this on the same environment and it was resolved. @przemeklal, feel free to reopen this or open another issue if you encounter one. (We do expect to see some similar errors because different hardware vendors provide different "keys", and we haven't captured them all. Hopefully we can get more sample outputs from different vendors to fix them all.)

5 participants