Commit 7315692

Merge: EDAC: add initial support for El Capitan
MR: https://gitlab.com/redhat/rhel/src/kernel/rhel-9/-/merge_requests/1389
JIRA: https://issues.redhat.com/browse/RHEL-10022

These patches are the patches that made upstream to add support for El Capitan. While there are more to fully support it, with these it's already functional. We have a big customer depending on these.

v2: dropped patches that were already in RHEL-10032, which got merged

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Nico Pache <npache@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: Lenny Szubowicz <lszubowi@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Merged-by: Scott Weaver <scweaver@redhat.com>
2 parents 66f48ff + 733aab5 commit 7315692

16 files changed, +657 -640 lines changed

Documentation/driver-api/edac.rst

Lines changed: 120 additions & 0 deletions
@@ -106,6 +106,16 @@ will occupy those chip-select rows.
   This term is avoided because it is unclear when needing to distinguish
   between chip-select rows and socket sets.
 
+* High Bandwidth Memory (HBM)
+
+  HBM is a new memory type with low power consumption and ultra-wide
+  communication lanes. It uses vertically stacked memory chips (DRAM dies)
+  interconnected by microscopic wires called "through-silicon vias," or
+  TSVs.
+
+  Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
+  interconnect called the "interposer". Therefore, HBM's characteristics
+  are nearly indistinguishable from on-chip integrated RAM.
 
 Memory Controllers
 ------------------
@@ -176,3 +186,113 @@ nodes::
 	the L1 and L2 directories would be "edac_device_block's"
 
 .. kernel-doc:: drivers/edac/edac_device.h
+
+
+Heterogeneous system support
+----------------------------
+
+An AMD heterogeneous system is built by connecting the data fabrics of
+both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
+GPU nodes can be accessed the same way as the data fabric on CPU nodes.
+
+The MI200 accelerators are data center GPUs. They have 2 data fabrics,
+and each GPU data fabric contains four Unified Memory Controllers (UMC).
+Each UMC contains eight channels. Each UMC channel controls one 128-bit
+HBM2e (2GB) channel (equivalent to 8 X 2GB ranks). This creates a total
+of 4096-bits of DRAM data bus.
+
+While the UMC is interfacing a 16GB (8high X 2GB DRAM) HBM stack, each UMC
+channel is interfacing 2GB of DRAM (represented as rank).
+
+Memory controllers on AMD GPU nodes can be represented in EDAC thusly:
+
+	GPU DF / GPU Node -> EDAC MC
+	GPU UMC -> EDAC CSROW
+	GPU UMC channel -> EDAC CHANNEL
+
+For example: a heterogeneous system with 1 AMD CPU is connected to
+4 MI200 (Aldebaran) GPUs using xGMI.
+
+Some more heterogeneous hardware details:
+
+- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
+  They have chip selects (csrows) and channels. However, the layouts are different
+  for performance, physical layout, or other reasons.
+- CPU UMCs use 1 channel, In this case UMC = EDAC channel. This follows the
+  marketing speak. CPU has X memory channels, etc.
+- CPU UMCs use up to 4 chip selects, So UMC chip select = EDAC CSROW.
+- GPU UMCs use 1 chip select, So UMC = EDAC CSROW.
+- GPU UMCs use 8 channels, So UMC channel = EDAC channel.
+
+The EDAC subsystem provides a mechanism to handle AMD heterogeneous
+systems by calling system specific ops for both CPUs and GPUs.
+
+AMD GPU nodes are enumerated in sequential order based on the PCI
+hierarchy, and the first GPU node is assumed to have a Node ID value
+following those of the CPU nodes after latter are fully populated::
+
+	$ ls /sys/devices/system/edac/mc/
+		mc0 - CPU MC node 0
+		mc1 |
+		mc2 |- GPU card[0] => node 0(mc1), node 1(mc2)
+		mc3 |
+		mc4 |- GPU card[1] => node 0(mc3), node 1(mc4)
+		mc5 |
+		mc6 |- GPU card[2] => node 0(mc5), node 1(mc6)
+		mc7 |
+		mc8 |- GPU card[3] => node 0(mc7), node 1(mc8)
+
+For example, a heterogeneous system with one AMD CPU is connected to
+four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
+via the following sysfs entries::
+
+	/sys/devices/system/edac/mc/..
+
+	CPU # CPU node
+	├── mc 0
+
+	GPU Nodes are enumerated sequentially after CPU nodes have been populated
+	GPU card 1 # Each MI200 GPU has 2 nodes/mcs
+	├── mc 1 # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
+	│   ├── csrow 0 # UMC 0
+	│   │   ├── channel 0 # Each UMC has 8 channels
+	│   │   ├── channel 1 # size of each channel is 2 GB, so each UMC has 16 GB
+	│   │   ├── channel 2
+	│   │   ├── channel 3
+	│   │   ├── channel 4
+	│   │   ├── channel 5
+	│   │   ├── channel 6
+	│   │   ├── channel 7
+	│   ├── csrow 1 # UMC 1
+	│   │   ├── channel 0
+	│   │   ├── ..
+	│   │   ├── channel 7
+	│   ├── .. ..
+	│   ├── csrow 3 # UMC 3
+	│   │   ├── channel 0
+	│   │   ├── ..
+	│   │   ├── channel 7
+	│   ├── rank 0
+	│   ├── .. ..
+	│   ├── rank 31 # total 32 ranks/dimms from 4 UMCs
+
+	├── mc 2 # GPU node 1 == mc2
+	│   ├── .. # each GPU has total 64 GB
+
+	GPU card 2
+	├── mc 3
+	│   ├── ..
+	├── mc 4
+	│   ├── ..
+
+	GPU card 3
+	├── mc 5
+	│   ├── ..
+	├── mc 6
+	│   ├── ..
+
+	GPU card 4
+	├── mc 7
+	│   ├── ..
+	├── mc 8
+	│   ├── ..
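
The node numbering and the capacity figures in the documentation added above can be cross-checked with a small sketch. This is a userspace illustration, not kernel code: `gpu_mc_index()` and the constant names are hypothetical, and the index formula assumes that all CPU nodes are enumerated before the first GPU node, as the text states.

```c
#include <assert.h>

/* Figures quoted from the documentation text; the names are illustrative. */
enum {
	NODES_PER_GPU    = 2,	/* data fabrics (EDAC MCs) per MI200 */
	UMCS_PER_NODE    = 4,	/* EDAC csrows per MC */
	CHANNELS_PER_UMC = 8,	/* EDAC channels per csrow */
	GB_PER_CHANNEL   = 2,	/* HBM2e capacity behind one channel */
	BITS_PER_CHANNEL = 128,	/* data bus width of one channel */
};

/*
 * Hypothetical helper: EDAC mc index of a GPU node, assuming GPU nodes are
 * numbered right after cpu_nodes fully populated CPU nodes.
 */
static unsigned int gpu_mc_index(unsigned int cpu_nodes, unsigned int card,
				 unsigned int node)
{
	assert(node < NODES_PER_GPU);
	return cpu_nodes + card * NODES_PER_GPU + node;
}

/* Channels (shown as ranks in the tree) behind one GPU node. */
static unsigned int channels_per_node(void)
{
	return UMCS_PER_NODE * CHANNELS_PER_UMC;
}

/* Memory capacity in GB behind one GPU node. */
static unsigned int gb_per_node(void)
{
	return channels_per_node() * GB_PER_CHANNEL;
}

/* Total DRAM data bus width in bits behind one GPU node. */
static unsigned int bus_bits_per_node(void)
{
	return channels_per_node() * BITS_PER_CHANNEL;
}
```

With one CPU node, `gpu_mc_index(1, 3, 1)` evaluates to 8, which matches `mc8` for the second node of GPU card[3] in the listing. `channels_per_node()` evaluates to 32, matching the 32 ranks shown in the tree, and `bus_bits_per_node()` to 4096, matching the quoted data bus width.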

arch/x86/include/asm/mce.h

Lines changed: 3 additions & 1 deletion
@@ -307,6 +307,7 @@ enum smca_bank_types {
 	SMCA_PIE, /* Power, Interrupts, etc. */
 	SMCA_UMC, /* Unified Memory Controller */
 	SMCA_UMC_V2,
+	SMCA_MA_LLC, /* Memory Attached Last Level Cache */
 	SMCA_PB, /* Parameter Block */
 	SMCA_PSP, /* Platform Security Processor */
 	SMCA_PSP_V2,
@@ -322,14 +323,15 @@ enum smca_bank_types {
 	SMCA_SHUB, /* System HUB Unit */
 	SMCA_SATA, /* SATA Unit */
 	SMCA_USB, /* USB Unit */
+	SMCA_USR_DP, /* Ultra Short Reach Data Plane Controller */
+	SMCA_USR_CP, /* Ultra Short Reach Control Plane Controller */
 	SMCA_GMI_PCS, /* GMI PCS Unit */
 	SMCA_XGMI_PHY, /* xGMI PHY Unit */
 	SMCA_WAFL_PHY, /* WAFL PHY Unit */
 	SMCA_GMI_PHY, /* GMI PHY Unit */
 	N_SMCA_BANK_TYPES
 };
 
-extern const char *smca_get_long_name(enum smca_bank_types t);
 extern bool amd_mce_is_memory_error(struct mce *m);
 
 extern int mce_threshold_create_device(unsigned int cpu);

arch/x86/kernel/cpu/mce/amd.c

Lines changed: 40 additions & 46 deletions
@@ -87,61 +87,50 @@ struct smca_bank {
 static DEFINE_PER_CPU_READ_MOSTLY(struct smca_bank[MAX_NR_BANKS], smca_banks);
 static DEFINE_PER_CPU_READ_MOSTLY(u8[N_SMCA_BANK_TYPES], smca_bank_counts);
 
-struct smca_bank_name {
-	const char *name; /* Short name for sysfs */
-	const char *long_name; /* Long name for pretty-printing */
-};
-
-static struct smca_bank_name smca_names[] = {
-	[SMCA_LS ... SMCA_LS_V2] = { "load_store", "Load Store Unit" },
-	[SMCA_IF] = { "insn_fetch", "Instruction Fetch Unit" },
-	[SMCA_L2_CACHE] = { "l2_cache", "L2 Cache" },
-	[SMCA_DE] = { "decode_unit", "Decode Unit" },
-	[SMCA_RESERVED] = { "reserved", "Reserved" },
-	[SMCA_EX] = { "execution_unit", "Execution Unit" },
-	[SMCA_FP] = { "floating_point", "Floating Point Unit" },
-	[SMCA_L3_CACHE] = { "l3_cache", "L3 Cache" },
-	[SMCA_CS ... SMCA_CS_V2] = { "coherent_slave", "Coherent Slave" },
-	[SMCA_PIE] = { "pie", "Power, Interrupts, etc." },
+static const char * const smca_names[] = {
+	[SMCA_LS ... SMCA_LS_V2] = "load_store",
+	[SMCA_IF] = "insn_fetch",
+	[SMCA_L2_CACHE] = "l2_cache",
+	[SMCA_DE] = "decode_unit",
+	[SMCA_RESERVED] = "reserved",
+	[SMCA_EX] = "execution_unit",
+	[SMCA_FP] = "floating_point",
+	[SMCA_L3_CACHE] = "l3_cache",
+	[SMCA_CS ... SMCA_CS_V2] = "coherent_slave",
+	[SMCA_PIE] = "pie",
 
 	/* UMC v2 is separate because both of them can exist in a single system. */
-	[SMCA_UMC] = { "umc", "Unified Memory Controller" },
-	[SMCA_UMC_V2] = { "umc_v2", "Unified Memory Controller v2" },
-	[SMCA_PB] = { "param_block", "Parameter Block" },
-	[SMCA_PSP ... SMCA_PSP_V2] = { "psp", "Platform Security Processor" },
-	[SMCA_SMU ... SMCA_SMU_V2] = { "smu", "System Management Unit" },
-	[SMCA_MP5] = { "mp5", "Microprocessor 5 Unit" },
-	[SMCA_MPDMA] = { "mpdma", "MPDMA Unit" },
-	[SMCA_NBIO] = { "nbio", "Northbridge IO Unit" },
-	[SMCA_PCIE ... SMCA_PCIE_V2] = { "pcie", "PCI Express Unit" },
-	[SMCA_XGMI_PCS] = { "xgmi_pcs", "Ext Global Memory Interconnect PCS Unit" },
-	[SMCA_NBIF] = { "nbif", "NBIF Unit" },
-	[SMCA_SHUB] = { "shub", "System Hub Unit" },
-	[SMCA_SATA] = { "sata", "SATA Unit" },
-	[SMCA_USB] = { "usb", "USB Unit" },
-	[SMCA_GMI_PCS] = { "gmi_pcs", "Global Memory Interconnect PCS Unit" },
-	[SMCA_XGMI_PHY] = { "xgmi_phy", "Ext Global Memory Interconnect PHY Unit" },
-	[SMCA_WAFL_PHY] = { "wafl_phy", "WAFL PHY Unit" },
-	[SMCA_GMI_PHY] = { "gmi_phy", "Global Memory Interconnect PHY Unit" },
+	[SMCA_UMC] = "umc",
+	[SMCA_UMC_V2] = "umc_v2",
+	[SMCA_MA_LLC] = "ma_llc",
+	[SMCA_PB] = "param_block",
+	[SMCA_PSP ... SMCA_PSP_V2] = "psp",
+	[SMCA_SMU ... SMCA_SMU_V2] = "smu",
+	[SMCA_MP5] = "mp5",
+	[SMCA_MPDMA] = "mpdma",
+	[SMCA_NBIO] = "nbio",
+	[SMCA_PCIE ... SMCA_PCIE_V2] = "pcie",
+	[SMCA_XGMI_PCS] = "xgmi_pcs",
+	[SMCA_NBIF] = "nbif",
+	[SMCA_SHUB] = "shub",
+	[SMCA_SATA] = "sata",
+	[SMCA_USB] = "usb",
+	[SMCA_USR_DP] = "usr_dp",
+	[SMCA_USR_CP] = "usr_cp",
+	[SMCA_GMI_PCS] = "gmi_pcs",
+	[SMCA_XGMI_PHY] = "xgmi_phy",
+	[SMCA_WAFL_PHY] = "wafl_phy",
+	[SMCA_GMI_PHY] = "gmi_phy",
 };
 
 static const char *smca_get_name(enum smca_bank_types t)
 {
 	if (t >= N_SMCA_BANK_TYPES)
 		return NULL;
 
-	return smca_names[t].name;
+	return smca_names[t];
 }
 
-const char *smca_get_long_name(enum smca_bank_types t)
-{
-	if (t >= N_SMCA_BANK_TYPES)
-		return NULL;
-
-	return smca_names[t].long_name;
-}
-EXPORT_SYMBOL_GPL(smca_get_long_name);
-
 enum smca_bank_types smca_get_bank_type(unsigned int cpu, unsigned int bank)
 {
 	struct smca_bank *b;
@@ -178,6 +167,7 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
 	{ SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
 	{ SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
+	{ SMCA_MA_LLC, HWID_MCATYPE(0x2E, 0x4) },
 
 	/* Unified Memory Controller MCA type */
 	{ SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
@@ -212,6 +202,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_SHUB, HWID_MCATYPE(0x80, 0x0) },
 	{ SMCA_SATA, HWID_MCATYPE(0xA8, 0x0) },
 	{ SMCA_USB, HWID_MCATYPE(0xAA, 0x0) },
+	{ SMCA_USR_DP, HWID_MCATYPE(0x170, 0x0) },
+	{ SMCA_USR_CP, HWID_MCATYPE(0x180, 0x0) },
 	{ SMCA_GMI_PCS, HWID_MCATYPE(0x241, 0x0) },
 	{ SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0) },
 	{ SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0) },
@@ -715,11 +707,13 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 
 bool amd_mce_is_memory_error(struct mce *m)
 {
+	enum smca_bank_types bank_type;
 	/* ErrCodeExt[20:16] */
 	u8 xec = (m->status >> 16) & 0x1f;
 
+	bank_type = smca_get_bank_type(m->extcpu, m->bank);
 	if (mce_flags.smca)
-		return smca_get_bank_type(m->extcpu, m->bank) == SMCA_UMC && xec == 0x0;
+		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
 
 	return m->bank == 4 && xec == 0x8;
 }
@@ -1049,7 +1043,7 @@ static const char *get_name(unsigned int cpu, unsigned int bank, struct threshol
 	if (bank_type >= N_SMCA_BANK_TYPES)
 		return NULL;
 
-	if (b && bank_type == SMCA_UMC) {
+	if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
 		if (b->block < ARRAY_SIZE(smca_umc_block_names))
 			return smca_umc_block_names[b->block];
 		return NULL;
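
The change to `amd_mce_is_memory_error()` in this file can be modelled outside the kernel to see which inputs it now accepts. The sketch below is only an approximation under stated assumptions: `bank_type_sketch`, `mce_sketch`, and `is_memory_error_sketch()` are hypothetical stand-ins for the kernel types, and the SMCA flag and bank type are passed in as arguments instead of being read from `mce_flags` and `smca_get_bank_type()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the kernel definitions. */
enum bank_type_sketch {
	BANK_LS,
	BANK_UMC,
	BANK_UMC_V2,
	BANK_MA_LLC,
	N_BANK_TYPES
};

struct mce_sketch {
	uint64_t status;	/* machine check status value */
	unsigned int bank;	/* bank number that logged the error */
};

/* ErrCodeExt occupies bits [20:16] of the status value. */
static uint8_t extended_error_code(uint64_t status)
{
	return (uint8_t)((status >> 16) & 0x1f);
}

static bool is_memory_error_sketch(bool smca, enum bank_type_sketch type,
				   const struct mce_sketch *m)
{
	uint8_t xec = extended_error_code(m->status);

	assert(type < N_BANK_TYPES);

	/* SMCA systems: both UMC flavours count, with extended code 0x0. */
	if (smca)
		return (type == BANK_UMC || type == BANK_UMC_V2) && xec == 0x0;

	/* Non-SMCA path from the diff: bank 4 with extended code 0x8. */
	return m->bank == 4 && xec == 0x8;
}
```

In this model, a status with ErrCodeExt 0x0 reported from a `BANK_UMC_V2` bank is classified as a memory error on the SMCA path, which is the case the old `== SMCA_UMC` comparison in the diff did not cover.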
