Commit 0739d24

Merge branch 'devlink-health'
Eran Ben Elisha says:

====================
Devlink health reporting and recovery system

The health mechanism is targeted for Real Time Alerting, in order to know when
something bad has happened to a PCI device
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
  information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
Device driver creates a "health reporter" per each error/health type.
Error/Health type can be a known/generic (e.g. pci error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by devlink.
Device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial attributes
Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health will do the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    there is no other dump already stored)
  * An auto recovery attempt is made. Depends on:
    - Auto-recovery configuration
    - Grace period vs. time passed since last recover

The user interface:
User can access/change each reporter's attributes and driver specific callbacks
via devlink, e.g. per error type (per health reporter)
 - Configure reporter's generic attributes (like: disable/enable auto recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If a dump is not already stored by devlink
  for this reporter, devlink generates a new dump.
  Dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.

                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+

In this patchset, mlx5e TX reporter is implemented.

Cmdline format:
devlink health show [DEV reporter REPORTER_NAME]
devlink health recover DEV reporter REPORTER_NAME
devlink health diagnose DEV reporter REPORTER_NAME
devlink health dump show DEV reporter REPORTER_NAME
devlink health dump clear DEV reporter REPORTER_NAME
devlink health set DEV reporter REPORTER_NAME NAME VALUE

Cmdline examples:
$devlink health show
pci/0000:00:09.0:
  name tx state healthy #err 1 #recover 0 last_dump_ts N/A
  parameters:
    grace_period 500 auto_recover false

$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
    "SQs": [ {
            "sqn": 138,
            "HW state": 1,
            "stopped": false
        },{
            "sqn": 142,
            "HW state": 1,
            "stopped": false
        } ]
}

$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
  sqn: 138 HW state: 1 stopped: false
  sqn: 142 HW state: 1 stopped: false

$devlink health recover pci/0000:00:09 reporter tx
$devlink health set pci/0000:00:09.0 reporter tx grace_period 3500
$devlink health set pci/0000:00:09.0 reporter tx auto_recover false

Changelog:
v4:
- Rebase on latest net-next
- Remove trace_devlink_health signature exposure in case CONFIG_NET_DEVLINK
  is not defined, as it shall only be used from devlink.
v3:
- Redesign of devlink <-> driver fmsg API
- Various bug fixes
v2:
- Remove FW* reporters to decrease the amount of patches in the patchset
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 parents 8f28980 + db2ab7a commit 0739d24
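
Review note: the "once an error is reported" sequence in the cover letter can be read as a small state update followed by a gated recovery attempt. The following is a minimal userspace sketch of that reading, not the devlink implementation. The struct, its field names, the millisecond timestamps, and the choice to allow recovery exactly when the elapsed time reaches the grace period are all assumptions made for illustration; the trace-event step is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one health reporter (not the kernel's struct). */
struct reporter_model {
	bool auto_recover;         /* user-configurable attribute */
	uint64_t grace_period_ms;  /* minimum spacing between auto recoveries */
	uint64_t last_recover_ms;  /* time of the most recent recovery */
	bool recovered_once;       /* false until the first recovery ran */
	bool dump_stored;          /* at most one dump is kept */
	bool healthy;
	uint64_t error_count;
	uint64_t recover_count;
};

/* Auto recovery depends on configuration and on the grace period. */
static bool may_auto_recover(const struct reporter_model *r, uint64_t now_ms)
{
	if (!r->auto_recover)
		return false;
	if (!r->recovered_once)
		return true;
	/* Assumes now_ms is monotonic, so the subtraction cannot wrap. */
	return now_ms - r->last_recover_ms >= r->grace_period_ms;
}

/* Handle one error report; returns true if a recovery was attempted. */
static bool report_error(struct reporter_model *r, uint64_t now_ms)
{
	assert(r != NULL);

	/* Update health status and statistics. */
	r->error_count++;
	r->healthy = false;

	/* Take a dump only if none is stored yet. */
	if (!r->dump_stored)
		r->dump_stored = true; /* a real reporter would capture data here */

	if (!may_auto_recover(r, now_ms))
		return false;

	/* Stand-in for the driver's recovery callback, assumed to succeed. */
	r->last_recover_ms = now_ms;
	r->recovered_once = true;
	r->recover_count++;
	r->healthy = true;
	return true;
}
```

With `auto_recover` enabled and a 500 ms grace period, a report at t=1000 recovers, a second report at t=1200 is only counted, and a report at t=1500 recovers again.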

11 files changed

+1755
-165
lines changed

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+The health mechanism is targeted for Real Time Alerting, in order to know when
+something bad had happened to a PCI device
+- Provide alert debug information
+- Self healing
+- If problem needs vendor support, provide a way to gather all needed debugging
+  information.
+
+The main idea is to unify and centralize driver health reports in the
+generic devlink instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The devlink health reporter:
+Device driver creates a "health reporter" per each error/health type.
+Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+or unknown (driver specific).
+For each registered health reporter a driver can issue error/health reports
+asynchronously. All health reports handling is done by devlink.
+Device driver can provide specific callbacks for each "health reporter", e.g.
+ - Recovery procedures
+ - Diagnostics and object dump procedures
+ - OOB initial parameters
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Once an error is reported, devlink health will do the following actions:
+  * A log is being send to the kernel trace events buffer
+  * Health status and statistics are being updated for the reporter instance
+  * Object dump is being taken and saved at the reporter instance (as long as
+    there is no other dump which is already stored)
+  * Auto recovery attempt is being done. Depends on:
+    - Auto-recovery configuration
+    - Grace period vs. time passed since last recover
+
+The user interface:
+User can access/change each reporter's parameters and driver specific callbacks
+via devlink, e.g per error type (per health reporter)
+ - Configure reporter's generic parameters (like: disable/enable auto recovery)
+ - Invoke recovery procedure
+ - Run diagnostics
+ - Object dump
+
+The devlink health interface (via netlink):
+DEVLINK_CMD_HEALTH_REPORTER_GET
+  Retrieves status and configuration info per DEV and reporter.
+DEVLINK_CMD_HEALTH_REPORTER_SET
+  Allows reporter-related configuration setting.
+DEVLINK_CMD_HEALTH_REPORTER_RECOVER
+  Triggers a reporter's recovery procedure.
+DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
+  Retrieves diagnostics data from a reporter on a device.
+DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
+  Retrieves the last stored dump. Devlink health
+  saves a single dump. If an dump is not already stored by the devlink
+  for this reporter, devlink generates a new dump.
+  dump output is defined by the reporter.
+DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
+  Clears the last saved dump file for the specified reporter.
+
+
+                                               netlink
+                                      +--------------------------+
+                                      |                          |
+                                      |            +             |
+                                      |            |             |
+                                      +--------------------------+
+                                                   |request for ops
+                                                   |(diagnose,
+ mlx5_core                             devlink     |recover,
+                                                   |dump)
++--------+                            +--------------------------+
+|        |                            |    reporter|             |
+|        |                            |  +---------v----------+  |
+|        |   ops execution            |  |                    |  |
+|     <----------------------------------+                    |  |
+|        |                            |  |                    |  |
+|        |                            |  + ^------------------+  |
+|        |                            |    | request for ops     |
+|        |                            |    | (recover, dump)     |
+|        |                            |    |                     |
+|        |                            |  +-+------------------+  |
+|        |     health report          |  | health handler     |  |
+|        +------------------------------->                    |  |
+|        |                            |  +--------------------+  |
+|        |     health reporter create |                          |
+|        +---------------------------->                          |
++--------+                            +--------------------------+
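
Review note: the DUMP_GET / DUMP_CLEAR wording in the added document describes a single-slot cache. A small userspace sketch of that behaviour follows, under stated assumptions: `dump_slot`, `generate_dump()`, the fixed buffer size, and the generation counter are hypothetical names invented here to make the semantics observable, and the "dump" is just a string rather than reporter-defined output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DUMP_MODEL_MAX 64

/* Hypothetical single-dump storage for one reporter. */
struct dump_slot {
	bool stored;              /* true while a dump is saved */
	unsigned int generation;  /* counts how many dumps were generated */
	char data[DUMP_MODEL_MAX];
};

/* Stand-in for the reporter's dump callback. */
static void generate_dump(struct dump_slot *s)
{
	s->generation++;
	snprintf(s->data, sizeof(s->data), "dump #%u", s->generation);
	s->stored = true;
}

/* DUMP_GET: return the stored dump, generating one only if none exists. */
static const char *dump_get(struct dump_slot *s)
{
	assert(s != NULL);
	if (!s->stored)
		generate_dump(s);
	return s->data;
}

/* DUMP_CLEAR: drop the saved dump so the next get produces a fresh one. */
static void dump_clear(struct dump_slot *s)
{
	assert(s != NULL);
	s->stored = false;
	memset(s->data, 0, sizeof(s->data));
}
```

Two consecutive `dump_get()` calls return the same content; only after `dump_clear()` does the next `dump_get()` generate a new dump.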

drivers/net/ethernet/mellanox/mlx5/core/Makefile

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 #
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
-		en_selftest.o en/port.o en/monitor_stats.o
+		en_selftest.o en/port.o en/monitor_stats.o en/reporter_tx.o
 
 #
 # Netdev extra

drivers/net/ethernet/mellanox/mlx5/core/en.h

Lines changed: 14 additions & 4 deletions
@@ -387,10 +387,7 @@ struct mlx5e_txqsq {
 	struct mlx5e_channel *channel;
 	int txq_ix;
 	u32 rate_limit;
-	struct mlx5e_txqsq_recover {
-		struct work_struct recover_work;
-		u64 last_recover;
-	} recover;
+	struct work_struct recover_work;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_dma_info {
@@ -683,6 +680,13 @@ struct mlx5e_rss_params {
 	u8 hfunc;
 };
 
+struct mlx5e_modify_sq_param {
+	int curr_state;
+	int next_state;
+	int rl_update;
+	int rl_index;
+};
+
 struct mlx5e_priv {
 	/* priv data path fields - start */
 	struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
@@ -738,6 +742,7 @@ struct mlx5e_priv {
 #ifdef CONFIG_MLX5_EN_TLS
 	struct mlx5e_tls *tls;
 #endif
+	struct devlink_health_reporter *tx_reporter;
 };
 
 struct mlx5e_profile {
@@ -868,6 +873,11 @@ void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params);
 void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
 			       struct mlx5e_params *params);
 
+int mlx5e_modify_sq(struct mlx5_core_dev *mdev, u32 sqn,
+		    struct mlx5e_modify_sq_param *p);
+void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq);
+void mlx5e_tx_disable_queue(struct netdev_queue *txq);
+
 static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
 {
 	return (MLX5_CAP_ETH(mdev, tunnel_stateless_gre) &&
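
Review note: this hunk exports `struct mlx5e_modify_sq_param` and `mlx5e_modify_sq()` so that code outside en_main.c can drive send-queue state changes. A plausible use is a recovery path that moves a queue out of an error state in two steps. The sketch below is a userspace mock of that idea only: the struct layout is copied from the hunk, while the state enum, `sq_model`, `sq_model_modify()`, and the assumed ERR -> RST -> RDY sequence are hypothetical and do not call any driver or firmware interface.

```c
#include <assert.h>
#include <stddef.h>

/* Same fields as the struct added to en.h in this commit. */
struct mlx5e_modify_sq_param {
	int curr_state;
	int next_state;
	int rl_update;
	int rl_index;
};

/* Hypothetical state values for the mock (not the MLX5 constants). */
enum sq_state_model { SQ_MODEL_RST, SQ_MODEL_RDY, SQ_MODEL_ERR };

/* Hypothetical stand-in for a hardware send queue. */
struct sq_model {
	int state;
};

/* Mock transition: succeeds only if the queue is in the expected state. */
static int sq_model_modify(struct sq_model *sq,
			   const struct mlx5e_modify_sq_param *p)
{
	assert(sq != NULL && p != NULL);
	if (sq->state != p->curr_state)
		return -1;
	sq->state = p->next_state;
	return 0;
}

/* Assumed recovery sequence: ERR -> RST, then RST -> RDY. */
static int sq_model_recover(struct sq_model *sq)
{
	struct mlx5e_modify_sq_param to_rst = {
		.curr_state = SQ_MODEL_ERR,
		.next_state = SQ_MODEL_RST,
	};
	struct mlx5e_modify_sq_param to_rdy = {
		.curr_state = SQ_MODEL_RST,
		.next_state = SQ_MODEL_RDY,
	};

	if (sq_model_modify(sq, &to_rst))
		return -1;
	return sq_model_modify(sq, &to_rdy);
}
```

Carrying `curr_state` in the parameter block lets the callee reject a transition when the queue is not where the caller believes it is, which is why the mock refuses to "recover" a queue that is already ready.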
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5E_EN_REPORTER_H
+#define __MLX5E_EN_REPORTER_H
+
+#include <linux/mlx5/driver.h>
+#include "en.h"
+
+int mlx5e_tx_reporter_create(struct mlx5e_priv *priv);
+void mlx5e_tx_reporter_destroy(struct mlx5e_priv *priv);
+void mlx5e_tx_reporter_err_cqe(struct mlx5e_txqsq *sq);
+int mlx5e_tx_reporter_timeout(struct mlx5e_txqsq *sq);
+
+#endif
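
Review note: the cover letter says a driver "can provide specific callbacks" per reporter, which implies each callback is optional. One way to picture that is a table of function pointers where a missing entry means the operation is unsupported. The sketch below is a userspace illustration under that assumption; `reporter_ops_model`, `txq_model`, and the helper names are hypothetical and are not the devlink ops structure or the mlx5e TX reporter from this patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-reporter callback table; any entry may be NULL. */
struct reporter_ops_model {
	const char *name;
	int (*recover)(void *priv);
	int (*diagnose)(void *priv);
};

/* Run the recover callback, or report that it is not provided. */
static int reporter_model_recover(const struct reporter_ops_model *ops,
				  void *priv)
{
	assert(ops != NULL);
	if (!ops->recover)
		return -EOPNOTSUPP;
	return ops->recover(priv);
}

/* Run the diagnose callback, or report that it is not provided. */
static int reporter_model_diagnose(const struct reporter_ops_model *ops,
				   void *priv)
{
	assert(ops != NULL);
	if (!ops->diagnose)
		return -EOPNOTSUPP;
	return ops->diagnose(priv);
}

/* Hypothetical TX queue state used by the example callback. */
struct txq_model {
	int stopped;
};

/* Example recover callback: restart a stopped queue. */
static int tx_model_recover(void *priv)
{
	struct txq_model *q = priv;

	q->stopped = 0;
	return 0;
}

/* A "tx" reporter that supports recovery but provides no diagnose. */
static const struct reporter_ops_model tx_ops_model = {
	.name = "tx",
	.recover = tx_model_recover,
};
```

Calling `reporter_model_recover(&tx_ops_model, &q)` restarts the mock queue, while `reporter_model_diagnose()` on the same table returns `-EOPNOTSUPP` because that callback was left unset.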
